Baseline - No text embeddings used

The baseline TTS system generates speech directly from the input text, without any additional conditioning prompt. The only thing that can influence the prosody here is the text to be read itself. This system serves as the reference point for comparison.

Proposed - System conditioned on text embeddings

The proposed TTS system is conditioned on natural language prompts. When the text to be read is itself used as the prompt, the goal is more expressive and contextually appropriate speech output. Alternatively, a different text can be used as the prompt, in order to transfer the prosody expected for that prompt to the text that is to be read.
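The three conditions compared on this page can be sketched as follows. This is only an illustration of how the text to be read and the prompt are paired in each condition; `SynthesisRequest`, `baseline`, `self_prompted`, and `cross_prompted` are hypothetical names introduced here, not the interface of the actual system, and no audio is synthesized.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical container for one synthesis call (not the authors' API)."""
    text: str                      # the text that is to be read
    prompt: Optional[str] = None   # text whose embedding would condition the prosody


def baseline(text: str) -> SynthesisRequest:
    # Baseline: no prompt, so only the text itself can influence the prosody.
    return SynthesisRequest(text=text, prompt=None)


def self_prompted(text: str) -> SynthesisRequest:
    # Proposed, first mode: the text to be read doubles as the prompt.
    return SynthesisRequest(text=text, prompt=text)


def cross_prompted(text: str, prompt: str) -> SynthesisRequest:
    # Proposed, second mode: a different prompt supplies the expected prosody.
    return SynthesisRequest(text=text, prompt=prompt)


sad = "Lily broke up with me last week, in fact, she dumped me."
angry = "You can't be serious, how dare you not tell me you were going to marry her?"

# One request per condition for the same input sentence.
requests = [baseline(sad), self_prompted(sad), cross_prompted(sad, prompt=angry)]
for request in requests:
    print(request)
```

The last request corresponds to the "Anger" row of the second table below: the angry sentence acts as the prompt while the sad sentence is the text that is read.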



Using the Input Text as Prompt

Emotion | Input Sentence | Baseline | Proposed
Anger | You can't be serious, how dare you not tell me you were going to marry her? | [audio] | [audio]
Joy | I really enjoy the beach in the summer. | [audio] | [audio]
Neutral | You can go to the Employment Development Office and pick it up. | [audio] | [audio]
Sadness | Lily broke up with me last week, in fact, she dumped me. | [audio] | [audio]
Surprise | He was astonished when he saw them come alone, and asked what had happened to them. | [audio] | [audio]


Using a Different Prompt

Emotion | Prompt | Input Sentence | Proposed
Anger | You can't be serious, how dare you not tell me you were going to marry her? | Lily broke up with me last week, in fact, she dumped me. | [audio]
Joy | I really enjoy the beach in the summer. | You can go to the Employment Development Office and pick it up. | [audio]
Neutral | You can go to the Employment Development Office and pick it up. | You can't be serious, how dare you not tell me you were going to marry her? | [audio]
Sadness | Lily broke up with me last week, in fact, she dumped me. | He was astonished when he saw them come alone, and asked what had happened to them. | [audio]
Surprise | He was astonished when he saw them come alone, and asked what had happened to them. | I really enjoy the beach in the summer. | [audio]