10:00 Short welcome

10:15 Prosody modeling: Are we modeling production or perception?
Yi Xu, University College London, London, UK

Abstract: Speech is an information system that encodes and decodes messages through articulation and audition. There is therefore little doubt that a common neural representation of the messages is shared between production and perception. What is unclear is the exact nature of this representation. I will argue that speech production and perception are linked only at a highly abstract categorical level, because they follow very different laws. Production is an articulatory process governed by physiological and physical mechanisms, and these mechanisms are what generate surface forms such as F0 trajectories. In contrast, to decode speech signals into contrastive categories, it is not necessary to know how the surface forms are originally generated, as long as the categories are sufficiently separated from one another. I will show with data that the modeling of prosody production is most effective if articulatory mechanisms are directly simulated, but the recognition of tone and intonation can be achieved without knowledge of production mechanisms. Most importantly, recovering articulatory parameters for the sake of perception would be not only too costly, but also ineffective against cross-speaker variability.

11:00 An End-to-End Approach to train a physiologically plausible model of intonation
Bastian Schnell, Idiap Research Institute, Martigny, Switzerland

Abstract: The generalized command response (GCR) model represents intonation as a superposition of muscle responses to spike command signals. While its parameters can be easily extracted from audio by matching pursuit, it is difficult to generate them from text. We have previously shown that the spikes can be predicted by a two-stage system, consisting of a recurrent neural network and a post-processing procedure, but the responses themselves were fixed dictionary atoms. We propose an end-to-end neural architecture that replaces the dictionary atoms with trainable second-order recurrent elements analogous to recursive filters. We demonstrate gradient stability under modest conditions, and show that the system can be trained by imposing temporal sparsity constraints. Subjective listening tests demonstrate that the system can synthesize intonation with high naturalness, comparable to state-of-the-art acoustic models, and retains the physiological plausibility of the GCR model.

11:45 Prosodic latent space mapping using a Variational Recurrent Prosody Model
Branislav Gerazov, GIPSA-lab, Grenoble, France; FEEIT, Skopje, Macedonia

Abstract: How speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signal is still an open issue. The Superposition of Functional Contours (SFC) model has been shown to decompose prosody into elementary multiparametric functional contours through the iterative training of neural network contour generators using analysis-by-synthesis. Based on this paradigm, we recently proposed the Weighted SFC (WSFC) model, which also captures the scale, i.e. prominence, of these elementary contours based on their linguistic context. We now propose a deep model that, in addition to decomposing prosody into its constituent contours, can also capture part of their variance. The Variational Prosody Model (VPM) and the Variational Recurrent Prosody Model (VRPM) comprise a network of variational encoding (recurrent) neural network contour generators, which map the linguistic context of the contours into a prosodic latent space. Prosodic contours can then be generated by sampling this latent space. The V(R)PM is shown to be useful for exploring the hidden prosodic variability in the functional prosodic contours and in prosody itself. It can also be used in a speech synthesis scenario to facilitate generation of dynamic and varied prosody contours that are not severely affected by averaging effects.

12:30 End of Workshop

More information: Branislav Gerazov
