Improving Speech Emotion Recognition
with Unsupervised Speaking Style Transfer

Leyuan Qu1, Wei Wang2, Cornelius Weber3, Pengcheng Yue1, Taihao Li1 and Stefan Wermter3

1. Institute of Artificial Intelligence, Zhejiang Lab

2. International Cultural Exchange College, Xinjiang University

3. Department of Informatics, University of Hamburg

Abstract

Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition (SER) tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information, respectively. Additionally, a decoder reconstructs speech signals by conditioning on these two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches the expression of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug also enables us to generate a similar number of samples for each class, tackling the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train an SER model with data augmented by EmoAug and show that the augmented model not only surpasses state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Audio samples can be found on our demo website.
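To make the encoder-decoder structure described above concrete, the following is a minimal PyTorch sketch of the EmoAug idea. The module names, layer choices (GRU encoders, a linear projection) and the L1 reconstruction loss are illustrative assumptions for this sketch, not the authors' exact architecture.

    # Minimal sketch of the EmoAug idea (assumed layers/sizes, not the paper's exact model).
    import torch
    import torch.nn as nn

    class SemanticEncoder(nn.Module):
        """Encodes verbal (linguistic) content from a mel-spectrogram."""
        def __init__(self, n_mels=80, dim=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, dim, batch_first=True, bidirectional=True)

        def forward(self, mel):                 # mel: (B, T, n_mels)
            out, _ = self.rnn(mel)
            return out                          # frame-level content: (B, T, 2*dim)

    class ParalinguisticEncoder(nn.Module):
        """Summarizes non-verbal style (stress, rhythm, intensity) as one vector."""
        def __init__(self, n_mels=80, dim=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, dim, batch_first=True)

        def forward(self, mel):                 # mel: (B, T, n_mels)
            _, h = self.rnn(mel)
            return h.squeeze(0)                 # utterance-level style: (B, dim)

    class Decoder(nn.Module):
        """Reconstructs the mel-spectrogram from content frames plus a style vector."""
        def __init__(self, n_mels=80, dim=256):
            super().__init__()
            self.rnn = nn.GRU(2 * dim + dim, dim, batch_first=True)
            self.proj = nn.Linear(dim, n_mels)

        def forward(self, content, style):      # content: (B, T, 2*dim), style: (B, dim)
            style = style.unsqueeze(1).expand(-1, content.size(1), -1)
            out, _ = self.rnn(torch.cat([content, style], dim=-1))
            return self.proj(out)               # reconstructed mel: (B, T, n_mels)

    # Unsupervised training step: reconstruct an utterance from its own content and style.
    sem_enc, para_enc, dec = SemanticEncoder(), ParalinguisticEncoder(), Decoder()
    mel = torch.randn(4, 120, 80)               # dummy batch of mel-spectrograms
    recon = dec(sem_enc(mel), para_enc(mel))
    loss = nn.functional.l1_loss(recon, mel)    # reconstruction objective (assumed)

    # Augmentation by style transfer: combine one utterance's content with a
    # reference utterance's style to obtain a new emotional sample.
    ref_mel = torch.randn(4, 90, 80)
    converted = dec(sem_enc(mel), para_enc(ref_mel))

After training, iterating the last two lines over reference utterances from different styles yields additional samples per emotion class, which is how the augmented SER training set can be balanced.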

Mel-spectrogram Comparison
Emotion Style Transfer

EmoAug generates new emotional data (second row) by transferring prosodic attributes—such as stress, rhythm, and intensity—from a reference utterance (first row), while retaining the original utterance's semantic content.

(Audio samples and mel-spectrograms: Original Reference | Converted)
More Samples
(Audio samples: Original | Converted)
Acknowledgement

This work was supported in part by the National Science and Technology Major Project of China (2021ZD0114303), in part by the Youth Foundation Project of Zhejiang Lab (K2023KH0AA01), in part by the CML Project funded by the DFG, and in part by the Philosophy and Social Science Training Project of Xinjiang University (23CPY049).

BibTeX

@inproceedings{qu2024improving,
    author    = {Leyuan Qu and Wei Wang and Cornelius Weber and Pengcheng Yue and Taihao Li and Stefan Wermter},
    title     = {Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer},
    booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year      = {2024},
    }