Diffsody

Disentanglement of Prosody Representations via
Diffusion Models and Scheduled Gradient Reversal

Leyuan Qu¹, Cornelius Weber², Yingming Gao³, Taihao Li¹ and Stefan Wermter²

1. Institute of Artificial Intelligence, Zhejiang Lab

2. Department of Informatics, University of Hamburg

3. School of Artificial Intelligence, Beijing University of Posts and Telecommunications

Abstract

Prosody plays a fundamental role in human speech and communication, facilitating intelligibility and conveying emotional and cognitive states. Extracting accurate prosodic information from speech is vital for building assistive technology, such as controllable speech synthesis, speaking style transfer, and speech emotion recognition. However, it is challenging to disentangle speaker-independent prosodic representations, since prosodic attributes such as intonation are heavily entangled with speaker-specific attributes, e.g., pitch. In this paper, we propose a novel model, called Diffsody, to disentangle and refine prosody representations. (1) To disentangle prosodic representations, we leverage the expressive generative ability of a diffusion model by conditioning it on quantified semantic information and pretrained speaker embeddings. Additionally, a prosody encoder automatically learns prosody representations used for spectrogram reconstruction in an unsupervised fashion. (2) To refine and learn speaker-invariant prosody representations, a scheduled Gradient Reversal Layer (sGRL) is proposed and integrated into the prosody encoder of Diffsody. We evaluate Diffsody both qualitatively and quantitatively. t-SNE visualization and speaker verification experiments demonstrate the efficacy of the sGRL method in preventing speaker-specific information leakage. Experimental results on speaker-independent Speech Emotion Recognition (SER) and Automatic Depression Detection (ADD) tasks demonstrate that Diffsody can efficiently factorize speaker-independent prosody representations, resulting in a significant boost in SER and ADD. In addition, Diffsody combines well with the semantic representation model HuBERT, which leads to discernibly better performance, outperforming contemporary methods in both SER and ADD tasks. Furthermore, the Diffsody model exhibits promising potential for various practical applications, such as voice or emotion conversion.

Model Architecture

Denoising Process

The diffusion denoiser reconstructs mel-scale spectrograms from Gaussian noise, conditioned on semantic, speaker, and prosody information.
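As a rough illustration of conditional denoising, the sketch below runs a standard DDPM-style ancestral sampling loop in plain Python. It is not the Diffsody implementation: the linear noise schedule, the number of steps, the function names, and the `toy_denoiser` stand-in (which replaces a trained network) are all assumptions made for the example.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the source)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def sample(denoiser, cond, size, num_steps=50, seed=0):
    """DDPM-style ancestral sampling, starting from pure Gaussian noise.

    `denoiser(x, t, cond)` is a hypothetical callable that returns the
    predicted noise for x at step t. `cond` bundles the semantic, speaker
    and prosody conditions.
    """
    rng = random.Random(seed)
    betas = linear_betas(num_steps)
    alphas = [1.0 - b for b in betas]

    # Cumulative products of the alphas (alpha-bar in DDPM notation).
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Start from Gaussian noise and denoise step by step.
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t, cond)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            # Add fresh noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


def toy_denoiser(x, t, cond):
    """Stand-in for a trained network: treats the sum of the three
    condition vectors as the clean target and returns x minus that target."""
    target = [s + k + p for s, k, p in
              zip(cond["semantic"], cond["speaker"], cond["prosody"])]
    return [xi - ti for xi, ti in zip(x, target)]


cond = {"semantic": [0.1] * 4, "speaker": [0.2] * 4, "prosody": [0.3] * 4}
frame = sample(toy_denoiser, cond, size=4, num_steps=10, seed=1)
```

In a real system `x` would be a full mel-spectrogram tensor and the denoiser a neural network. The point here is only that all three conditions are passed to the denoiser at every step.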

Ablation Study on Refinement Module

The refinement module helps the prosody encoder capture and reconstruct more prosodic variation.

	Audio	Mel-spectrograms
without Refinement Module
with Refinement Module
Ground Truth

Comparison of GRL and sGRL on Voice Cloning

Diffsody can perform voice cloning by directly replacing the speaker encoder input with a different voice. Our proposed sGRL effectively prevents speaker information leakage and ensures that speaker information is learned only from the speaker encoder.
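The idea of a scheduled gradient reversal layer can be sketched in a few lines of plain Python. The layer is the identity in the forward pass and multiplies incoming gradients by a negative factor in the backward pass, so the encoder is pushed to remove whatever an attached speaker classifier can detect. This page does not give the schedule Diffsody uses, so the sigmoid ramp below, borrowed from domain-adversarial training (Ganin and Lempitsky), is an assumption, as are the names `reversal_strength` and `GradientReversal`.

```python
import math


def reversal_strength(progress, gamma=10.0, max_strength=1.0):
    """Ramp the reversal factor from 0 towards max_strength as training
    progress goes from 0 to 1 (assumed DANN-style schedule)."""
    progress = min(max(progress, 0.0), 1.0)
    return max_strength * (2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0)


class GradientReversal:
    """Identity in the forward pass, scaled sign flip in the backward pass."""

    def __init__(self):
        self.strength = 0.0

    def forward(self, features):
        # Features pass through unchanged to the (hypothetical) speaker classifier.
        return list(features)

    def backward(self, grad_output):
        # Gradients flowing back to the prosody encoder are reversed and scaled.
        return [-self.strength * g for g in grad_output]


layer = GradientReversal()
total_steps = 1000
for step in (0, 250, 500, 1000):
    # Update the reversal factor according to training progress.
    layer.strength = reversal_strength(step / total_steps)
    grads = layer.backward([1.0, -2.0])
```

Starting with a factor near zero lets the encoder learn useful features before the adversarial signal becomes strong. A fixed-strength GRL corresponds to setting `layer.strength` to a constant.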

	Original Audio	Reference Voice

Method	Generated Audio	Generated Mel
without GRL/sGRL
with GRL
with sGRL

Speaking Style Transfer

By feeding a style reference to the prosody encoder, we can transfer the reference style to the original speech while keeping the voice unchanged.

Original Audio	Reference Style	Generated Audio

Acknowledgement

This work was supported in part by the National Science and Technology Major Project of China (2021ZD0114303), in part by the Youth Foundation Project of Zhejiang Lab (K2023KH0AA01), and in part by the CML Project funded by the DFG.