A Lightweight ROI-Based 3D Convolutional Neural Network with Spatial Attention for Violence Detection in Videos

Maryam Sarfraz; Syed Makhdoom Muhammad Mehdi; Kanwal Yousaf

doi:10.33411/IJIST/1827

Authors

Maryam Sarfraz, University of Engineering and Technology (UET), Taxila
Syed Makhdoom Muhammad Mehdi, University of Engineering and Technology (UET), Taxila
Kanwal Yousaf, University of Engineering and Technology (UET), Taxila

Keywords:

Violence Detection, Video Surveillance, 3D CNN, Spatial Attention, Region of Interest.

Abstract

Violence detection in videos is a crucial component of intelligent surveillance systems, enabling early intervention and enhancing public safety in environments such as streets, stations, and stadiums. This study proposes a lightweight ROI-based 3D Convolutional Neural Network (3D CNN) with a spatial attention mechanism for efficient and accurate violence detection in videos. The proposed framework first applies dense optical flow to generate regions of interest (ROIs), which are then used to construct 16-frame spatio-temporal clips. These clips are processed by a 3D CNN integrated with spatial attention modules to learn discriminative spatial and temporal features while suppressing background noise. Final classification is performed through fully connected layers with a sigmoid activation function. The proposed model is evaluated on three benchmark datasets: Real-Life Violence (RLV), Hockey Fight, and Action Movies. Experimental results demonstrate strong performance across all datasets. On the Action Movies dataset, the model achieves an accuracy, precision, recall, and F1-score of 98.50%, 97.92%, 97.85%, and 97.88%, respectively. For the Hockey Fight dataset, the corresponding values are 96.10%, 95.40%, 95.20%, and 95.30%, while for the RLV dataset, the model attains 94.85% accuracy, 94.10% precision, 93.90% recall, and 94.00% F1-score. Furthermore, the proposed approach exhibits stable performance across multiple runs, with a standard deviation of less than 1.2%, indicating robustness and consistency. Compared with state-of-the-art models such as ResNet-50, YOLOv9, and baseline 3D CNN architectures incorporating attention mechanisms, the proposed method achieves consistent improvements of approximately 2.5%–6.2% in accuracy across all datasets while maintaining lower computational complexity. The results confirm that the proposed method is both accurate and computationally efficient, making it suitable for real-time violence detection in video surveillance systems.
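The ROI and clip-construction stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: absolute frame differencing is used here as a simple stand-in for dense optical flow (in practice a routine such as OpenCV's `cv2.calcOpticalFlowFarneback` could supply the motion magnitude), and the function names and the `rel_threshold` parameter are hypothetical. Only the 16-frame clip length comes from the abstract.

```python
import numpy as np

CLIP_LEN = 16  # clip length stated in the abstract


def motion_magnitude(frames):
    """Per-pixel motion energy of a (T, H, W) grayscale clip, averaged over time.

    Assumption: absolute frame differencing stands in for the magnitude
    of the dense optical flow field used in the paper.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diffs.mean(axis=0)


def motion_roi(magnitude, rel_threshold=0.5):
    """Bounding box (top, bottom, left, right) around high-motion pixels.

    rel_threshold is a hypothetical parameter: the fraction of the peak
    magnitude a pixel must reach to count as moving. When no motion is
    found, the full frame is returned.
    """
    height, width = magnitude.shape
    peak = float(magnitude.max())
    if peak <= 0.0:
        return 0, height, 0, width
    rows, cols = np.nonzero(magnitude >= rel_threshold * peak)
    return int(rows.min()), int(rows.max()) + 1, int(cols.min()), int(cols.max()) + 1


def roi_clips(frames, clip_len=CLIP_LEN):
    """Split a (T, H, W) video into non-overlapping clips cropped to their ROI."""
    clips = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        chunk = frames[start:start + clip_len]
        top, bottom, left, right = motion_roi(motion_magnitude(chunk))
        clips.append(chunk[:, top:bottom, left:right])
    return clips


# Synthetic example: a block that flickers on every other frame.
video = np.zeros((32, 24, 32), dtype=np.uint8)
video[::2, 8:12, 10:16] = 255
clips = roi_clips(video)
print(len(clips), clips[0].shape)  # 2 (16, 4, 6)
```

Each cropped clip would then be resized to a fixed spatial size before being passed to the 3D CNN; that step is omitted here because the abstract does not state the input resolution.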
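As a quick consistency check on the reported metrics, the short sketch below recomputes the F1-score from the precision and recall values given in the abstract. It assumes the F1-score is the harmonic mean of the reported precision and recall, which is the standard binary definition; the paper may instead average per-class scores, in which case small differences would be expected. The names `REPORTED`, `f1_score`, and `f1_gaps` are illustrative.

```python
# Precision, recall, and F1-score (in percent) as reported in the abstract.
REPORTED = {
    "Action Movies": {"precision": 97.92, "recall": 97.85, "f1": 97.88},
    "Hockey Fight": {"precision": 95.40, "recall": 95.20, "f1": 95.30},
    "RLV": {"precision": 94.10, "recall": 93.90, "f1": 94.00},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def f1_gaps(reported=REPORTED):
    """Absolute gap between the recomputed and the reported F1-score per dataset."""
    return {
        name: abs(f1_score(m["precision"], m["recall"]) - m["f1"])
        for name, m in reported.items()
    }


for name, gap in f1_gaps().items():
    print(f"{name}: gap = {gap:.4f} percentage points")
```

Under this assumption, the recomputed values agree with the reported F1-scores to within rounding for all three datasets.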
