Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Leyuan Qu1, Taihao Li1, Cornelius Weber2, Theresa Pekarek-Rosin2, Fuji Ren3 and Stefan Wermter2

1. Institute of Artificial Intelligence, Zhejiang Lab

2. Department of Informatics, University of Hamburg

3. School of Computer Science and Engineering, University of Electronic Science and Technology of China

Abstract

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging research question, both because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust, large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units representing semantic content, (2) a pretrained speaker verification model that generates speaker identity embeddings, and (3) a trainable prosody encoder that learns prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary to, and beneficial for, widely used speech pretraining models, and that combining Prosody2Vec with HuBERT representations surpasses state-of-the-art methods. Audio samples can be found on our demo website.
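To make the composition described above concrete, the following PyTorch sketch shows how the three streams (discrete content units, a speaker embedding from a pretrained verification model, and a learned prosody embedding) could be combined to reconstruct mel-spectrogram frames. It is a minimal illustration only: the module names, dimensions, GRU-based encoder and decoder, and dummy inputs are assumptions made for clarity, not the authors' implementation.

# Minimal sketch of the Prosody2Vec-style composition described in the abstract.
# All dimensions, module choices, and names are illustrative assumptions.
import torch
import torch.nn as nn


class ProsodyEncoder(nn.Module):
    """Trainable encoder mapping a reference mel-spectrogram to a prosody embedding."""

    def __init__(self, n_mels: int = 80, dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels) -> (batch, dim), pooled over time
        out, _ = self.rnn(mels)
        return out.mean(dim=1)


class Prosody2VecSketch(nn.Module):
    """Reconstruction model: content units + speaker embedding + prosody embedding -> mel frames."""

    def __init__(self, n_units: int = 100, unit_dim: int = 256,
                 spk_dim: int = 256, pros_dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, unit_dim)     # discrete content units
        self.prosody_enc = ProsodyEncoder(n_mels, pros_dim)
        self.decoder = nn.GRU(unit_dim + spk_dim + pros_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, units, spk_emb, ref_mels):
        # units: (batch, T) unit IDs from a unit encoder (e.g. quantized self-supervised features)
        # spk_emb: (batch, spk_dim) from a pretrained speaker verification model
        # ref_mels: (batch, frames, n_mels) reference utterance used to extract prosody
        content = self.unit_emb(units)                       # (batch, T, unit_dim)
        prosody = self.prosody_enc(ref_mels)                 # (batch, pros_dim)
        T = content.size(1)
        cond = torch.cat([content,
                          spk_emb.unsqueeze(1).expand(-1, T, -1),
                          prosody.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        hidden, _ = self.decoder(cond)
        return self.out(hidden)                              # predicted mel frames


if __name__ == "__main__":
    model = Prosody2VecSketch()
    units = torch.randint(0, 100, (2, 50))    # dummy content units
    spk = torch.randn(2, 256)                 # dummy speaker embeddings
    mels = torch.randn(2, 120, 80)            # dummy reference mel-spectrogram
    print(model(units, spk, mels).shape)      # torch.Size([2, 50, 80])

In such a setup, pretraining would reconstruct the input utterance from its own content, speaker, and prosody streams; at conversion time the prosody reference could be swapped for an utterance carrying the desired emotion, which is the behaviour demonstrated in the audio samples below.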

Video Demo
Five Generation Tasks

We present the five generation tasks from our paper below.

1. Audio Demo for Emotional Voice Conversion (EVC)

Neutral to Angry
Samples: Neutral (source), CycleGAN-EVC, StarGAN-EVC, Seq2Seq-EVC, Emovox, Prosody2Vec (Ours), Target Prosody Reference


Neutral to Happy
Samples: Neutral (source), CycleGAN-EVC, StarGAN-EVC, Seq2Seq-EVC, Emovox, Prosody2Vec (Ours), Target Prosody Reference


Neutral to Sad
Samples: Neutral (source), CycleGAN-EVC, StarGAN-EVC, Seq2Seq-EVC, Emovox, Prosody2Vec (Ours), Target Prosody Reference


2. Audio Demo for Emotion Style Transfer
Other to Angry
Samples (Source, Prosody2Vec, Prosody Reference) for source emotions: Happy, Sad


Other to Happy
Samples (Source, Prosody2Vec, Prosody Reference) for source emotions: Angry, Sad


Other to Sad
Samples (Source, Prosody2Vec, Prosody Reference) for source emotions: Angry, Happy
3. Audio Demo for Cross-lingual Emotional Voice Conversion (EVC)
Samples: Original, Prosody2Vec (Angry), German Reference (Angry)
4. Audio Demo for Speaking Style Transfer
Samples: Original Angry, Converted Angry 1, Converted Angry 2, Converted Angry 3
5. Audio Demo for Singing Style Transfer
Samples: Original, Generated, Reference Singing Voice
Acknowledgement: This work was supported in part by the CML, LeCareBot, and MoReSpace projects funded by the DFG, the Major Scientific Project of Zhejiang Lab (Grant No. 2020KB0AC01), and the Youth Foundation Project of Zhejiang Lab (No. 111011-AA2301).