Vestnik KRAUNC. Fiz.-Mat. Nauki. 2020. Vol. 33. No. 4. pp. 132-149. ISSN 2079-6641
Research Article
UDC 004.032.26 + 004.93
Neural network model for multimodal recognition of human aggression
M. Yu. Uzdyaev
Federal State Budgetary Institution of Science "St. Petersburg Federal Research Center of the Russian Academy of Sciences" (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Laboratory of Autonomous Robotic Systems, 39, 14th Line, St. Petersburg, 199178, Russia
E-mail: uzdyaev.m@iias.spb.su
The growing number of users of socio-cyberphysical systems, smart spaces, and Internet of Things systems makes the detection of destructive user actions, such as aggression, increasingly relevant. Such actions can manifest in several modalities: body motion, the accompanying facial expression, non-verbal speech behavior, and verbal speech behavior. This article considers a neural network model for multimodal recognition of human aggression that builds an intermediate feature space invariant to the modality being processed. The proposed model recognizes aggression with high accuracy even when information from some modality is missing or insufficient. An experimental study showed 81.8% correct recognitions on the IEMOCAP dataset. Results of aggression recognition experiments on the IEMOCAP dataset are also reported for 15 different combinations of the modalities listed above.
Keywords: aggression recognition, behavior analysis, neural networks, multimodal data processing.
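The abstract describes an intermediate feature space shared by all modalities, so that recognition still works when some modality is absent. The abstract does not specify the architecture, so the following is only a minimal sketch of one common way to realize that idea, not the author's model: each modality is projected into a common space and the embeddings of the modalities that are present are averaged. The feature sizes, the random projection matrices (standing in for trained encoders), and the names `LATENT_DIM`, `INPUT_DIMS`, `make_projection`, `project`, and `fuse` are all assumptions made for illustration.

```python
import random

LATENT_DIM = 4  # size of the shared feature space (hypothetical)

# Hypothetical per-modality input feature sizes; not taken from the paper.
INPUT_DIMS = {
    "body_motion": 6,
    "facial_expression": 5,
    "nonverbal_speech": 3,
    "verbal_speech": 7,
}


def make_projection(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained modality encoder."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, features):
    """Map one modality's features into the shared space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def fuse(projections, inputs):
    """Average the shared-space embeddings of the modalities that are present.

    A modality whose value is None (or that is not listed) is treated as missing.
    """
    embeddings = [
        project(projections[name], feats)
        for name, feats in inputs.items()
        if feats is not None
    ]
    if not embeddings:
        raise ValueError("at least one modality is required")
    return [sum(vals) / len(embeddings) for vals in zip(*embeddings)]


rng = random.Random(0)
projections = {n: make_projection(d, LATENT_DIM, rng) for n, d in INPUT_DIMS.items()}
sample = {n: [rng.random() for _ in range(d)] for n, d in INPUT_DIMS.items()}

full = fuse(projections, sample)
partial = fuse(projections, {**sample, "verbal_speech": None, "body_motion": None})
print(len(full), len(partial))  # both vectors have LATENT_DIM entries
```

Because the fused vector has the same size regardless of which modalities are available, a single downstream classifier could consume it for any combination of inputs. Averaging is only one possible pooling choice.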
DOI: 10.26117/2079-6641-2020-33-4-132-149
Received: 18.11.2020
Final version: 10.12.2020
For citation. Uzdyaev M. Yu. Neural network model for multimodal recognition of human aggression // Vestnik KRAUNC. Fiz.-Mat. Nauki. 2020. Vol. 33. No. 4. pp. 132-149. DOI: 10.26117/2079-6641-2020-33-4-132-149
The content is published under the terms of the Creative Commons Attribution 4.0 International License
(https://creativecommons.org/licenses/by/4.0/deed.ru)
© Uzdyaev M. Yu., 2020
Funding. This work was supported by the Russian Foundation for Basic Research (project No. 18-29-22061_MK).
Competing interests. There are no conflicts of interest regarding authorship and publication.
Author contribution and responsibility. The author participated in writing the article and bears full responsibility for submitting the final version of the article for publication.
Research Article
MSC 62M45
Neural network model for multimodal recognition of human aggression
M. Yu. Uzdyaev
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Laboratory of autonomous robotic systems, 39, 14th Line, 199178, St. Petersburg, Russia.
E-mail: uzdyaev.m@iias.spb.su
The growing user base of socio-cyberphysical systems, smart environments, and Internet of Things (IoT) systems raises the problem of detecting destructive user actions, such as acts of aggression. Such actions can be expressed in different modalities: body motion, the accompanying facial expression, non-verbal speech behavior, and verbal speech behavior. This paper considers a neural network model for multimodal recognition of human aggression, based on constructing an intermediate feature space that is invariant to the modality being processed. The proposed model recognizes aggression with high accuracy even when data from a given modality are scarce or missing. Experiments showed 81.8% correct recognitions on the IEMOCAP dataset. Experimental results on the IEMOCAP dataset are also given for 15 different combinations of the modalities listed above.
Keywords: aggression recognition, behavior analysis, neural networks, multimodal data processing.
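The figure of 15 combinations in the abstract matches the number of non-empty subsets of the four listed modalities, 2^4 - 1 = 15. A short sketch that enumerates them is given below. The identifiers in `MODALITIES` and the helper `modality_subsets` are illustrative names, not taken from the paper.

```python
from itertools import combinations

# Modality names follow the abstract; the identifiers themselves are illustrative.
MODALITIES = (
    "body_motion",
    "facial_expression",
    "nonverbal_speech",
    "verbal_speech",
)


def modality_subsets(modalities):
    """Return every non-empty subset of the modalities, smallest subsets first."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


subsets = modality_subsets(MODALITIES)
print(len(subsets))  # 15, i.e. 2**4 - 1
for subset in subsets:
    print(" + ".join(subset))
```

Each subset corresponds to one experimental condition in which only the listed modalities are supplied to the model.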
DOI: 10.26117/2079-6641-2020-33-4-132-149
Original article submitted: 19.11.2020
Revision submitted: 19.12.2020
For citation. Uzdyaev M.Yu. Neural network model for multimodal recognition of human aggression. Vestnik KRAUNC. Fiz.-mat. nauki. 2020, 33: 4, 132-149. DOI: 10.26117/2079-6641-2020-33-4-132-149
Competing interests. The author declares that there are no conflicts of interest regarding authorship and publication.
Contribution and Responsibility. The author contributed to this article. The author is solely responsible for providing the final version of the article in print. The final version of the manuscript was approved by the author.
The content is published under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/deed.ru)
© Uzdyaev M.Yu., 2020
Funding. This work was supported by the Russian Foundation for Basic Research (project No. 18-29-22061_MK).
Uzdyaev Mikhail Yur’evich – Junior Researcher, Laboratory of Big Data of Sociocyberphysical Systems, St. Petersburg Institute for Informatics and Automation RAS, St. Petersburg, Russia, ORCID 0000-0002-7032-0291.