Research

General approach

My approach is to study speech as a multimodal and embodied phenomenon through the following steps:

  • design experimental setups to acquire acoustic, articulatory, visual and physiological signals;
  • develop machine-learning models able to relate these signals to linguistic, motor or perceptual representations;
  • integrate these models into interactive systems, such as assistive spoken-communication technologies or humanoid robots, that enter the sensory-motor loops regulating human communication.

The applications of my work include:

  • the development of voice technologies for people with spoken-communication disorders;
  • the improvement of robots' socio-communicative abilities;
  • the study, through modeling and simulation, of cognitive mechanisms involved in language and speech acquisition.

Computational modeling of speech perception, production and acquisition mechanisms

Objectives

  • Explore how sensory-motor, physical and social interactions shape language and speech learning.
  • Understand how these interactions can be used to improve the efficiency and adaptability of conversational AI.

Main results

  • Development of a neural computational model of vocal imitation enabling self-supervised learning of speech sensory-motor relationships, in particular acoustic-articulatory relationships.
  • Evidence for the role of motor inference in the discovery of phonetic units.
  • Evidence for the role of invariant acoustic representations in articulatory learning.
  • Study of universal phonetic biases for few-shot acoustic unit discovery.

Figure: Self-supervised vocal imitation model from Lavechin and Hueber, which learns relationships between perception, articulatory gestures and speech production.

Publications

Context: DevAI&Speech chair (MIAI Cluster), Bayesian Cognition and Machine Learning for Speech Communication chair (MIAI), ERC Speech Unit(e)s, Marvin Lavechin's postdoctoral work, Marc-Antoine Georges's PhD, Angelo Ortiz's PhD.

Speech representation learning

Objectives

  • Learn rich, interpretable and controllable representations of speech and audio signals in a self-supervised or weakly supervised way.
  • Use these representations for enhancement and restoration of pathological speech.

Main results

  • Theoretical formulation of a new class of self-supervised models: Dynamical Variational Autoencoders (DVAE).
  • Regularization of VAE latent spaces to enable interpretable control of musical timbre, in collaboration with Arturia.
  • Development of methods for restoring missing speech segments through speech inpainting.

Figure: Speech inpainting architecture proposed by Asaad et al., which transfers SSL representations to the reconstruction of masked speech segments.
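For readers unfamiliar with the Dynamical Variational Autoencoder (DVAE) class listed among the main results above, the general form can be sketched as follows. The notation is an assumption chosen for illustration (x_{1:T} for the observed sequence, z_{1:T} for the latent sequence, θ and φ for the generative and inference parameters) and follows the usual way this model family is written; individual DVAE models simplify the conditional dependencies shown here.

```latex
% Generative model: causal factorization over time
p_\theta(x_{1:T}, z_{1:T})
  = \prod_{t=1}^{T}
    p_\theta(x_t \mid x_{1:t-1}, z_{1:t}) \,
    p_\theta(z_t \mid x_{1:t-1}, z_{1:t-1})

% Training objective: evidence lower bound (ELBO)
\mathcal{L}(\theta, \phi; x_{1:T})
  = \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})}
    \left[ \ln p_\theta(x_{1:T}, z_{1:T})
         - \ln q_\phi(z_{1:T} \mid x_{1:T}) \right]
```

Dropping all temporal dependencies (each x_t depends only on z_t, and each z_t has an independent prior) recovers a standard VAE applied frame by frame.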

Publications

Context: Fanny Roche's PhD, collaboration with Arturia, Marc-Antoine Georges's PhD, collaborations with INRIA RobotLearn and LPNC.

Silent speech interface

Objectives

  • Convert articulated but non-vocalized speech into text or an intelligible acoustic signal.
  • Understand and model prosody control in silent speech.

Main results

  • First silent-speech communication interface based on articulatory data acquired by combining tongue ultrasound and lip video.

Figure: Silent speech interface using tongue ultrasound and lip video, from multimodal articulatory acquisition to conversion into audible speech.

Publications

Context: my PhD and Eric Tatulli's postdoctoral work.

Automatic processing of gestural languages

Objectives

  • Automatically recognize French Cued Speech.
  • Synthesize French Cued Speech from text.
  • Analyze the temporal coordination between hand movements and lip movements in cued speech.
  • Automatically translate French Sign Language (LSF) into text.

Main results

  • First complete decoding and synthesis system for French Cued Speech based on a fully neural pipeline.
  • Mediapi-RGB corpus for training automatic LSF translation models.

Figure: Architecture for automatic recognition of French Cued Speech, combining lip, hand and linguistic information.

Publications

Context: H2020 Comm4Child project, Sanjana Sankar's PhD.

Incremental text-to-speech synthesis

Objectives

  • Reduce the latency of text-to-speech systems.
  • Produce natural speech before the full sentence is available.

Main results

  • Quantification of the impact of future context on the prosodic quality of neural TTS.
  • Prosody improvement using predicted future text.
  • Fine-tuning of a GPT model to predict online the presence of contrastive focus on a word during text entry.

Figure: Conventional versus incremental text-to-speech, where the incremental system reduces latency in speech-synthesis-assisted interaction.
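The trade-off between latency and future context described above can be illustrated with a minimal lookahead policy. This is only a sketch under stated assumptions, not the system used in this work: the function name `incremental_chunks` and the `lookahead` parameter are hypothetical, and the actual synthesis step is left out. Each word is released for synthesis as soon as `lookahead` following words are known, so a larger value gives the synthesizer more context for prosody at the cost of a longer delay.

```python
from typing import Iterable, Iterator, List, Tuple


def incremental_chunks(
    words: Iterable[str], lookahead: int = 1
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (word_to_synthesize, known_future_words) as words arrive.

    Hypothetical helper: a word is emitted once `lookahead` later words
    have been received; remaining words are flushed at end of input.
    """
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if len(buffer) > lookahead:
            # Oldest word now has enough future context: release it.
            yield buffer[0], buffer[1:]
            buffer.pop(0)
    # End of input: release what is left, with shrinking context.
    while buffer:
        yield buffer[0], buffer[1:]
        buffer.pop(0)


if __name__ == "__main__":
    for word, future in incremental_chunks("the cat sat".split(), lookahead=1):
        print(word, future)
```

With `lookahead=0` every word is released immediately with no future context, which corresponds to the lowest-latency setting.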

Publications

Articulatory biofeedback for speech therapy

Objectives

  • Make articulatory gestures visible to support speech therapy.
  • Evaluate the contribution of tongue ultrasound and articulatory models in clinical settings.
  • Develop systems able to predict articulatory movements in real time directly from the speech signal.

Main results

  • Post-glossectomy rehabilitation protocols using ultrasound-based visual illustration and feedback.
  • Contribution of visual illustration of the articulators to post-stroke speech rehabilitation.
  • C-GMR algorithm for adapting Gaussian-mixture-regression models to a new speaker.

Figure: Articulatory biofeedback using tongue ultrasound, which makes tongue movements visible to support speech therapy.
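The Gaussian mixture regression (GMR) mentioned in the main results above can be sketched as follows. This shows only standard GMR, that is, the conditional expectation E[y | x] under a joint Gaussian mixture over [x, y]; it does not implement the C-GMR speaker-adaptation step. The function name `gmr_predict` and its arguments are hypothetical, and in an acoustic-to-articulatory setting x would be an acoustic feature vector and y an articulatory one.

```python
import numpy as np


def gmr_predict(x, weights, means, covs, dx):
    """Return E[y | x] under a joint GMM over z = [x, y].

    x: (dx,) input; weights: (K,); means: (K, dx+dy);
    covs: (K, dx+dy, dx+dy). All names are illustrative.
    """
    n_components = len(weights)
    log_resp = np.empty(n_components)
    cond_means = []
    for k in range(n_components):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        s_xx = covs[k, :dx, :dx]
        s_yx = covs[k, dx:, :dx]
        diff = x - mu_x
        sol = np.linalg.solve(s_xx, diff)  # S_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(s_xx)
        # log of w_k * N(x; mu_x, S_xx)
        log_resp[k] = np.log(weights[k]) - 0.5 * (
            diff @ sol + logdet + dx * np.log(2 * np.pi)
        )
        # Conditional mean of component k: mu_y + S_yx S_xx^{-1} (x - mu_x)
        cond_means.append(mu_y + s_yx @ sol)
    # Normalize responsibilities in a numerically stable way.
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    return resp @ np.array(cond_means)


if __name__ == "__main__":
    # One component with unit variances and covariance 0.5:
    # analytically E[y | x = 2] = 0.5 * 2 = 1.
    y = gmr_predict(
        np.array([2.0]),
        np.array([1.0]),
        np.array([[0.0, 0.0]]),
        np.array([[[1.0, 0.5], [0.5, 1.0]]]),
        dx=1,
    )
    print(y)
```

The prediction is a responsibility-weighted sum of per-component linear regressions, which is why adapting the mixture parameters to a new speaker changes the mapping without retraining from scratch.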

Publications

Context: Diandra Fabre's PhD, Marion Girod-Roux's Master's internship, Revison project, Vizart3D project, collaborations with DDL Lyon, Lyon University Hospital, Rocheplane Medical Center, LPNC and INRIA RobotLearn.

Brain-computer interfaces for speech

Objectives

  • Explore the restoration of spoken communication from intracranial brain signals (ECoG) related to speech production.

Main results

  • Identification of key methodological constraints for designing a speech BCI.
  • First demonstration of a real-time, speaker-adaptive acoustic-articulatory conversion system based on electromagnetic articulography.

Publications

Context: ANR BrainSpeak and H2020-FETPROACT BrainCom projects, in collaboration with INSERM, Florent Bocquelet's PhD.


Grenoble Images Parole Signal Automatique laboratory

UMR 5216 CNRS - Grenoble INP - Université Grenoble Alpes