Interpretable Speech Emotion Recognition: A Comparative Study of BiLSTM Temporal Attention and Transformer-Based Multi-Head Self-Attention

Authors

  • Rexcharles Enyinna Donatus, Department of Aerospace Engineering, Air Force Institute of Technology, Kaduna, Nigeria

DOI:

https://doi.org/10.70112/ajes-2025.14.2.4286

Keywords:

Speech Emotion Recognition, Interpretable Deep Learning, BiLSTM, Temporal Attention, Multi-Head Self-Attention, Transformer, MFCC, RAVDESS

Abstract

Speech Emotion Recognition (SER) is an important area of affective computing that enables machines to understand and respond to human emotions. However, many deep learning approaches that achieve high accuracy provide limited insight into how predictions are made, which reduces their practical reliability in sensitive domains such as education and healthcare. This study presents a comparative analysis of two attention-based models for SER using the RAVDESS dataset: a Bidirectional Long Short-Term Memory (BiLSTM) network with temporal attention and a Transformer model with multi-head self-attention. Acoustic features were extracted using 40 Mel-Frequency Cepstral Coefficients (MFCCs) together with their first- and second-order derivatives, forming a 120-dimensional input feature vector. Both models were trained and evaluated on identical data splits using accuracy, precision, recall, and F1-score. The BiLSTM with temporal attention achieved an accuracy of 70.14% and an F1-score of 68.76%, outperforming the Transformer model, which recorded 51.39% and 48.30%, respectively. Attention weight analysis showed that the BiLSTM model concentrated more effectively on emotionally relevant segments of speech, improving both interpretability and performance. The findings suggest that incorporating temporal attention provides a better balance between recognition accuracy and model transparency, supporting the development of reliable and explainable SER systems for real-world human–machine interaction.
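As an illustration of the pipeline summarized above, the sketch below extracts 40 MFCCs with their first- and second-order deltas (120 dimensions per frame) and pools BiLSTM hidden states with a single-layer temporal attention mechanism. This is a minimal sketch, not the paper's implementation: the sampling rate, hidden size, attention formulation, and file path are illustrative assumptions, and the eight output classes correspond to the RAVDESS emotion labels.

```python
# Minimal sketch (illustrative, not the paper's exact code): 40-MFCC + delta +
# delta-delta features (120-dim per frame) and a BiLSTM with temporal attention.
import librosa
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def extract_features(wav_path: str, sr: int = 22050, n_mfcc: int = 40) -> np.ndarray:
    """Return a (T, 120) matrix: MFCCs plus first- and second-order deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (40, T)
    d1 = librosa.feature.delta(mfcc)                         # (40, T)
    d2 = librosa.feature.delta(mfcc, order=2)                # (40, T)
    return np.concatenate([mfcc, d1, d2], axis=0).T          # (T, 120)


class BiLSTMTemporalAttention(nn.Module):
    """BiLSTM encoder followed by temporal-attention pooling and a classifier."""

    def __init__(self, input_dim: int = 120, hidden: int = 128, n_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # one score per time step
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor):
        # x: (batch, T, 120)
        h, _ = self.lstm(x)                            # (batch, T, 2*hidden)
        scores = self.attn(h).squeeze(-1)              # (batch, T)
        weights = F.softmax(scores, dim=1)             # attention over time steps
        context = torch.bmm(weights.unsqueeze(1), h).squeeze(1)  # (batch, 2*hidden)
        return self.classifier(context), weights       # logits + weights to inspect


# Example usage on a single clip (path is hypothetical):
# feats = torch.tensor(extract_features("ravdess/Actor_01/example.wav"),
#                      dtype=torch.float32).unsqueeze(0)
# logits, attn_weights = BiLSTMTemporalAttention()(feats)
```

The returned attention weights can be plotted against the frame axis to inspect which speech segments the model emphasizes, which is the kind of attention-weight analysis the abstract describes.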

Published

20-10-2025

How to Cite

Rexcharles Enyinna Donatus. (2025). Interpretable Speech Emotion Recognition: A Comparative Study of BiLSTM Temporal Attention and Transformer-Based Multi-Head Self-Attention. Asian Journal of Electrical Sciences, 14(2), 21–27. https://doi.org/10.70112/ajes-2025.14.2.4286
