A Hybrid Acoustic And Deep Learning Approach For Enhanced Speech Emotion Recognition
DOI:
https://doi.org/10.59890/ijaamr.v1i1.291
Keywords:
Emotion, Recognition, Human Computer Interaction, Deep Network, Hybrid Architecture
Abstract
Emotion recognition in speech is a key research topic in human-computer interaction. Understanding emotions in conversations can shed light on a person's well-being. This study introduces a hybrid architecture that combines acoustic and deep features to improve speech emotion recognition. Acoustic features such as RMS energy and Mel-frequency cepstral coefficients (MFCC) are extracted from voice recordings. In addition, spectrogram images of the audio are passed through deep networks such as VGG16 and ResNet to obtain deep features. The two feature sets are merged into a hybrid feature vector, which is then refined with the ReliefF feature selection algorithm. A Support Vector Machine performs the classification. Tests on datasets such as RAVDESS and EMO-DB yielded accuracy rates of up to 90.21%. Our method consistently outperformed existing techniques in accuracy.
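The feature-fusion and selection steps described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`rms_energy`, `fuse`, `relief_weights`, `top_k`), the frame sizes, and the toy data are all assumptions made for illustration. It also uses basic Relief (one nearest hit and one nearest miss per sample) as a simplified stand-in for ReliefF, and it leaves out MFCC extraction, the VGG16/ResNet embeddings, and the SVM classifier, which would normally come from dedicated libraries.

```python
import math


def rms_energy(signal, frame_len=4, hop=2):
    """Frame-wise root-mean-square energy of a 1-D signal (list of floats)."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [
        math.sqrt(sum(x * x for x in signal[s:s + frame_len]) / frame_len)
        for s in starts
    ]


def fuse(acoustic, deep):
    """Build a hybrid feature vector by plain concatenation (an assumption)."""
    return list(acoustic) + list(deep)


def relief_weights(X, y):
    """Basic Relief feature weights; a simplified stand-in for ReliefF.

    X is a list of feature vectors, y the class labels. Every class must
    contain at least two samples so that a nearest hit always exists.
    """
    n, d = len(X), len(X[0])
    # Per-feature ranges normalise differences to [0, 1].
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        # Reward features that separate classes, penalise those that
        # vary within a class.
        for j in range(d):
            weights[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n
    return weights


def top_k(weights, k):
    """Indices of the k highest-weighted features."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return order[:k]


# Toy data: one "acoustic" value and one made-up "deep" value per sample.
acoustic = [[0.0], [0.1], [0.9], [1.0]]
deep = [[0.5], [0.9], [0.6], [0.8]]
X = [fuse(a, d) for a, d in zip(acoustic, deep)]
y = [0, 0, 1, 1]

weights = relief_weights(X, y)
selected = top_k(weights, 1)
print(selected)  # feature 0 separates the two classes, so it ranks first
```

In a full pipeline the selected columns of the hybrid vectors would then be passed to the classifier. ReliefF itself extends this idea by averaging over several nearest neighbours per class.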
License
Copyright (c) 2023 Gargi Bharshankar

This work is licensed under a Creative Commons Attribution 4.0 International License.



