¹MTech (Computer Engineering) (Pursuing), Bharati Vidyapeeth (Deemed to be University) College of Engineering, Pune
²Professor, Department of Computer Engineering, Bharati Vidyapeeth (Deemed to be University) College of Engineering, Pune
³Associate Professor, Department of Computer Engineering, Bharati Vidyapeeth (Deemed to be University) College of Engineering, Pune
Corresponding Author: Moin Azam, Email: moinazam7@gmail.com
Speech Emotion Recognition (SER) addresses the problem of inferring human emotional states from vocal signals, a key element in improving human-computer interaction in areas such as healthcare, customer service, and entertainment. SER still faces many obstacles, including variation in speech patterns, cultural differences, and environmental noise. Machine learning offers a strong approach to these challenges because it can analyze vocal characteristics such as tone, pitch, and intensity. This work introduces a new framework for SER using the Toronto Emotional Speech Set (TESS) dataset, which contains 2,800 high-quality audio recordings expressing seven emotions. The proposed methodology combines traditional machine learning with deep learning to extract and classify emotional cues. The extracted features include Mel-Frequency Cepstral Coefficients (MFCCs), chroma, and Mel-spectrogram features, which capture timbral, harmonic, and temporal properties. Two traditional classifiers, Logistic Regression and Decision Tree, set the baseline, and a hybrid deep learning model combining Convolutional Neural Networks (CNN) and Deep Neural Networks (DNN) is presented. The hybrid model uses Conv1D layers for localized feature extraction and DNN layers for higher-level abstraction, and is trained with sparse categorical cross-entropy loss and the Adam optimizer. Evaluated on the TESS dataset, the hybrid model outperforms the traditional methods, reaching an accuracy of 98.39%. The framework shows promise for practical applications such as emotion-aware AI systems and personalized music recommendation, and points to future research on noise robustness and cross-cultural generalization.
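To make the training objective named in the abstract concrete, the following minimal sketch computes sparse categorical cross-entropy over seven emotion classes in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the label order in `EMOTIONS`, the function names, and the example logits are all hypothetical, and a real model would rely on a deep learning framework rather than hand-written loops.

```python
import math

# Hypothetical label order for the seven classes; the abstract does not
# specify how emotions are encoded as integers.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "surprise", "sad"]


def softmax(logits):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the true class.

    "Sparse" means each label is an integer class index rather than a
    one-hot vector.
    """
    losses = []
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


# An uninformed model (equal scores for every class) pays ln(7) per sample.
uniform_logits = [[0.0] * len(EMOTIONS)]
loss_uniform = sparse_categorical_cross_entropy(uniform_logits, [3])

# A model that strongly favours the true class ("happy", index 3) pays far less.
confident_logits = [[0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0]]
loss_confident = sparse_categorical_cross_entropy(confident_logits, [3])
```

With integer labels the loss reduces to the negative log of the probability assigned to the correct emotion, so no one-hot encoding of the seven classes is needed.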
Keywords: Speech Emotion Recognition, Machine Learning, Deep Learning, Mel-Frequency Cepstral Coefficients, Convolutional Neural Networks, Deep Neural Networks, Toronto Emotional Speech Set, Emotion Classification
How to cite this article: Azam M, Patil SH, Dhotre SS. Unveiling Emotions in Speech: A Novel Machine Learning Framework for Vocal Sentiment Analysis. Int J Drug Deliv Technol. 2026;16(10s): 910-919. DOI: 10.25258/ijddt.16.10s.106
Source of support: Nil.
Conflict of interest: None