Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues
Detecting violent scenes in movies is an important video content understanding functionality, e.g., for providing automated youth protection services. One key issue in designing algorithms for violence detection is the choice of discriminative features. In this paper, we employ mid-level audio features and compare their discriminative power against low-level audio and visual features. We fuse these mid-level audio cues with low-level visual ones at the decision level in order to further improve the performance of violence detection. We use Mel-Frequency Cepstral Coefficients (MFCC) as audio features and average motion as visual features. In order to learn a violence model, we choose two-class support vector machines (SVMs). Our experimental results on detecting violent video shots in Hollywood movies show that mid-level audio features are more discriminative and provide more precise results than low-level ones. The detection performance is further enhanced by fusing the mid-level audio cues with low-level visual ones using SVM-based decision fusion.
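The MFCC pipeline underlying the audio features can be illustrated with a minimal NumPy-only sketch. This is not the authors' implementation; the frame size, hop length, number of mel bands, and number of cepstral coefficients below are illustrative defaults, not values taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Build triangular filters evenly spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):          # rising slope
            fb[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):         # falling slope
            fb[i - 1, j] = (right - j) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """Compute MFCCs: frame, window, power spectrum, mel energies, DCT-II."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([signal[t * hop:t * hop + n_fft] * win
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    mel_energy = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    # DCT-II decorrelates the log mel energies into cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_mels)
    return mel_energy @ dct.T

# Example: MFCCs of a one-second 440 Hz tone
sr = 16000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
feats = mfcc(sig)
print(feats.shape)  # one 13-dimensional vector per frame
```

In the setting described above, per-shot statistics of such frame-level MFCCs would feed the two-class SVMs, whose per-modality scores are then combined by a second SVM at the decision level.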
Published in: Proceedings of the 21st ACM International Conference on Multimedia (MM ’13), ACM. DOI: 10.1145/2502081.2502187