Review
Copyright ©The Author(s) 2023.
World J Psychiatry. Jan 19, 2023; 13(1): 1-14
Published online Jan 19, 2023. doi: 10.5498/wjp.v13.i1.1
Table 1 Acoustic differences related to prosody and spectral features of the word (sahar) produced by a Persian female speaker in neutral and anger situations

| Feature | Neutral | Angry |
| --- | --- | --- |
| Prosody features | | |
| Mean fundamental frequency (F0) | 200 Hz | 225 Hz |
| Minimum fundamental frequency | 194 Hz | 223 Hz |
| Maximum fundamental frequency | 213 Hz | 238 Hz |
| Mean intensity | 60 dB | 78 dB |
| Spectral features | | |
| First formant frequency (F1) | 853 Hz | 686 Hz |
| Second formant frequency (F2) | 2055 Hz | 1660 Hz |
| Third formant frequency (F3) | 3148 Hz | 2847 Hz |
| Fourth formant frequency (F4) | 4245 Hz | 3678 Hz |
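The prosody statistics in Table 1 (mean, minimum, and maximum F0, mean intensity) are simple summaries of an utterance's pitch contour. A minimal sketch of how they are computed, assuming an F0 contour has already been extracted by a pitch tracker (the sample values below are hypothetical, not the measurements behind Table 1):

```python
import statistics

# Hypothetical F0 contour (Hz) for one utterance, one value per frame.
# A real contour would come from a pitch tracker (e.g., autocorrelation-based).
f0_contour = [196, 200, 203, 198, 205, 199, 202]

mean_f0 = statistics.mean(f0_contour)  # average pitch over the utterance
min_f0 = min(f0_contour)               # lowest pitch reached
max_f0 = max(f0_contour)               # highest pitch reached

print(f"mean F0: {round(mean_f0)} Hz, min F0: {min_f0} Hz, max F0: {max_f0} Hz")
```

The same pattern applies to intensity: the mean of a frame-by-frame intensity contour (in dB) gives the "mean intensity" row of the table.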
Table 2 Different approaches to recognizing the emotional indicators in speech

| Approach | Emotional indicators |
| --- | --- |
| Psychological | (1) Positive and negative emotion words; (2) Standard function word categories; (3) Content categories; (4) Pronoun usage; and (5) Acoustic variables (such as pitch variety, pause time, speaking rate and emphasis) |
| Linguistic | (1) Phonetic: Spectral analysis, temporal analysis; (2) Semantic and discourse-pragmatic: Words, field, cultural identity, emotional implicatures, illocutionary acts, deixis and indexicality; and (3) Cognitive: Metaphor, metonymy |
| Data science | (1) SER: Analyzing sounds via acoustic and spectral features; and (2) NLP: Analyzing words with specific semantic properties, word embeddings |
Table 3 A brief description of some data science models/methods

| Method/Model | Short description | Ref. |
| --- | --- | --- |
| HMM | An HMM is a statistical model that describes the evolution of observable events that depend on internal factors which are not directly observable. The observed event is called a 'symbol' and the invisible factor underlying the observation is called a 'state'. An HMM consists of two stochastic processes: an invisible process of hidden states and a visible process of observable symbols. The hidden states form a Markov chain, and the probability distribution of the observed symbol depends on the underlying state. Via this model, the observations are modeled in two layers, one visible and the other invisible. It is therefore useful in classification problems where raw observations must be assigned to a number of more meaningful categories (Supplementary Figure 1) | [121,122] |
| Gaussian mixture model | A Gaussian mixture model is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters (Supplementary Figure 2) | [123] |
| KNN | KNN is a supervised learning algorithm used for classification. KNN predicts the class of a test point by calculating the distance between the test point and all training points, then selecting the K training points closest to it. The class membership of these K neighbors determines the prediction: the class holding the highest probability (by majority vote) is selected (Supplementary Figure 3) | [123] |
| SVM | The SVM is an algorithm that finds a hyperplane in an N-dimensional space (N: the number of features) that distinctly classifies the data points with the maximum margin, i.e., the maximum distance between data points of the two classes. Maximizing this margin allows future test points to be classified more accurately. Support vectors are the data points closest to the hyperplane; they determine its position and orientation (Supplementary Figure 4) | [123] |
| Artificial neural network | An artificial neural network is a network of interconnected artificial neurons. An artificial neuron, inspired by the biological neuron, is modeled with inputs that are multiplied by weights and then passed to a mathematical function that determines the neuron's activation. The neurons are grouped into layers of three main types: input, hidden, and output. Depending on the network's architecture, the outputs of some neurons are carried, with certain weights, as inputs to other neurons. By passing an input through these layers, the network outputs a value (discrete or continuous) that can be used for various classification/regression tasks. The network first has to learn the set of weights from the patterns in the so-called training dataset, a sufficiently large set of input data labeled with the corresponding correct (expected) outputs (Supplementary Figure 5) | [124] |
| Bayes classifier | The Bayes classifier, based on Bayes' theorem in probability, models the probabilistic relationships between the feature set and the class variable. From these relationships, it estimates the class membership probability of an unseen example in a way that minimizes the probability of misclassification | [123] |
| Linear discriminant analysis | Linear discriminant analysis is a statistical machine learning method that finds a linear combination of features characterizing or separating two or more classes of objects or events. The resulting linear combination can be used as a linear classifier or as a means of dimensionality reduction prior to the actual classification task | [124] |
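As a concrete illustration of the KNN method summarized in Table 3, a minimal Python sketch follows. The feature vectors (mean F0 in Hz, mean intensity in dB) and emotion labels are hypothetical toy data loosely inspired by Table 1, not values from the cited studies:

```python
from collections import Counter
import math

def knn_classify(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = [(math.dist(query, x), y) for x, y in zip(train, labels)]
    dists.sort(key=lambda pair: pair[0])        # closest first
    k_nearest = [label for _, label in dists[:k]]
    # Majority vote among the k nearest neighbors
    return Counter(k_nearest).most_common(1)[0][0]

# Toy SER-style data: (mean F0 in Hz, mean intensity in dB) per utterance.
X = [[200, 60], [198, 59], [205, 62], [225, 78], [230, 80], [222, 76]]
y = ["neutral", "neutral", "neutral", "angry", "angry", "angry"]

print(knn_classify(X, y, [226, 77], k=3))  # → angry
```

In a real SER system the feature vectors would be richer (e.g., formants, MFCCs) and the features would be normalized, since raw Hz and dB values live on very different scales and would otherwise dominate the distance unevenly.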