Title: Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues
Authors: Acar, Esra; Hopfgartner, Frank; Albayrak, Sahin
Type: Conference Object
Date issued: 2013
Date deposited: 2018-04-17
ISBN: 978-1-4503-2404-5
Handle: https://depositonce.tu-berlin.de/handle/11303/7608
DOI: http://dx.doi.org/10.14279/depositonce-6798
Language: en
Subject (DDC): 000 Computer science, information, and general works
Keywords: algorithms; performance; experimentation; bag-of-audio-words; mel-frequency cepstral coefficients; motion; decision fusion; support vector machine

Abstract: Detecting violent scenes in movies is an important video content understanding functionality, e.g., for providing automated youth protection services. One key issue in designing algorithms for violence detection is the choice of discriminative features. In this paper, we employ mid-level audio features and compare their discriminative power against low-level audio and visual features. We fuse these mid-level audio cues with low-level visual ones at the decision level in order to further improve the performance of violence detection. We use Mel-Frequency Cepstral Coefficients (MFCC) as audio features and average motion as the visual feature. To learn a violence model, we choose two-class support vector machines (SVMs). Our experimental results on detecting violent video shots in Hollywood movies show that mid-level audio features are more discriminative and provide more precise results than low-level ones. The detection performance is further enhanced by fusing the mid-level audio cues with low-level visual ones using SVM-based decision fusion.
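The abstract describes a two-stage pipeline: a two-class SVM per modality (mid-level audio, low-level visual), whose decision scores are then fused by a second SVM at the decision level. A minimal sketch of that fusion scheme follows, using scikit-learn; the synthetic data, feature dimensions, and helper name `predict_shot` are illustrative assumptions, not the paper's actual features or implementation.

```python
# Sketch of SVM-based decision-level fusion (assumed shapes and synthetic data,
# not the paper's real MFCC bag-of-audio-words / average-motion features).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)  # 1 = violent shot, 0 = non-violent

# Placeholder per-shot features: e.g., a mid-level audio descriptor built over
# MFCCs, and a low-level average-motion statistic.
audio_feats = rng.normal(size=(n, 32)) + labels[:, None] * 0.8
visual_feats = rng.normal(size=(n, 4)) + labels[:, None] * 0.5

# Stage 1: one two-class SVM per modality.
audio_svm = SVC(kernel="rbf").fit(audio_feats, labels)
visual_svm = SVC(kernel="rbf").fit(visual_feats, labels)

# Stage 2: stack the per-modality decision scores and fuse them with
# another SVM, as in decision-level fusion.
scores = np.column_stack([
    audio_svm.decision_function(audio_feats),
    visual_svm.decision_function(visual_feats),
])
fusion_svm = SVC(kernel="linear").fit(scores, labels)

def predict_shot(audio_x, visual_x):
    """Classify one shot from its audio and visual feature vectors."""
    s = np.column_stack([
        audio_svm.decision_function(audio_x.reshape(1, -1)),
        visual_svm.decision_function(visual_x.reshape(1, -1)),
    ])
    return int(fusion_svm.predict(s)[0])
```

In practice the stage-1 scores would be produced on held-out data (e.g., via cross-validation) before training the fusion SVM, to avoid feeding it overfit scores; the sketch trains on one set only for brevity.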