Research projects

    My research activities deal with the automatic processing of speech, with a special interest in capturing and modeling the articulatory gestures and the electrophysiological signals involved in speech production. My goal is to develop automatic speech recognition and synthesis systems that exploit these multimodal signals for people with communication disorders. More precisely, these systems aim either at restoring oral communication when part of the speech production chain is damaged (speech prosthesis), or at facilitating the treatment of speech sound disorders (assisted speech therapy). To build such systems, my approach is to capture multimodal speech-related signals using a variety of experimental devices, to model the statistical relationships between these signals using machine learning, and finally to implement the resulting models in real-time systems that can interact with the low-level sensorimotor loops involved in speech perception and speech motor control.


    Below are some of my current research projects:


Silent Speech Interfaces


A “silent speech interface” (SSI) is a device that allows speech communication without the need to vocalize. An SSI could be used in situations where silence is required (e.g. a silent cell phone) or for communication in very noisy environments. Further applications are possible in the medical field. For example, an SSI could be used by laryngectomized patients as an alternative to the electrolarynx, which produces a very robotic voice; to oesophageal speech, which is difficult to master; or to tracheo-oesophageal speech, which requires additional surgery. The design of an SSI has recently received considerable attention from the speech research community. In the approach developed in my PhD, articulatory movements are captured by a non-invasive multimodal imaging system composed of an ultrasound transducer placed beneath the chin and a video camera in front of the lips.


We are following two lines of research:

- the decoding of silent articulation at word level (i.e. visual-only speech recognition)

- the conversion of silent articulation into an intelligible speech signal in real-time.



These classification and regression problems are addressed using machine learning techniques, mostly Dynamic Bayesian Networks (Hueber et al., 2016) and, more recently, deep learning (Tatulli et al., 2017).
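As a toy illustration of the regression side (silent articulation to audible speech), the sketch below trains a small frame-wise neural regressor from articulatory features to spectral features. It is a minimal sketch, not our actual models: the feature dimensions, the synthetic data, and the use of scikit-learn's MLPRegressor are all assumptions made for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1000 frames of articulatory features
# (e.g. ultrasound + lip descriptors, dimension 30 here) and the
# spectral features to predict (e.g. 25 cepstral coefficients).
X_artic = rng.normal(size=(1000, 30))
W = rng.normal(size=(30, 25))
Y_spec = np.tanh(X_artic @ W) + 0.01 * rng.normal(size=(1000, 25))

# Frame-wise articulatory-to-acoustic regression with a small MLP.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                     random_state=0)
model.fit(X_artic[:800], Y_spec[:800])

# Predict spectral features for unseen "silently articulated" frames;
# a vocoder would then turn these into a waveform.
Y_pred = model.predict(X_artic[800:])
print(Y_pred.shape)  # (200, 25)
```

In the real system the input frames come from the ultrasound/video stream and the mapping must run in real time, which constrains the model size and the feature extraction.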


Related funded projects:

  • "Ultraspeech II" (GIPSA-lab), funded by the Christian Benoît Award
  • "Revoix" (ANR, 2009-2011), in collaboration with SIGMA-lab (ESPCI ParisTech) and LPP Université Sorbonne Nouvelle.
  • "Ouisper", (ANR, 2006-2009, SIGMA-lab ESPCI ParisTech, LTCI Telecom ParisTech, VTVL University of Maryland)
  • "Cassis" (PHC Sakura, 2009-2010, GIPSA-lab, SIGMA-lab, LTCI, NAIST Japan)


Check out our very first real-time prototype (developed in the Ultraspeech II project, funded by the Christian Benoît Award). These are preliminary results, based on a "light" version of our acoustic-articulatory mapping algorithm.



Visual articulatory feedback


Visual articulatory feedback systems aim at providing the speaker with real-time visual information about his/her own articulation. Several studies have shown that this kind of system can be useful for both speech therapy and Computer-Aided Pronunciation Training (CAPT). The system developed at GIPSA-lab is based on a 3D talking head used in an augmented speech scenario, i.e. it displays all speech articulators, including the tongue and the velum.


We are following two lines of research:

- the automatic animation of the articulatory talking head (lips, tongue, jaw) from the audio speech signal of any user, in real-time.

[Figure: Visual biofeedback based on acoustic-articulatory inversion]

- the automatic animation of the tongue model of the articulatory talking head from ultrasound images of the vocal tract of any user (Fabre et al., 2017).



In both cases, the mapping between the acoustic/articulatory data of the user and the control parameters of the talking head is performed using machine learning techniques. To that purpose, we developed a dedicated algorithm called Cascaded Gaussian Mixture Regression (C-GMR) (Hueber et al., 2015; Girin et al., 2017). This algorithm can, within limits, process articulatory movements that users cannot achieve when they start to use the system (source code of the C-GMR algorithm available here). This property is indispensable for the targeted therapeutic applications. The algorithm exploits a probabilistic model built from a large articulatory database acquired from an "expert" speaker capable of pronouncing all of the sounds of one or more languages. This model is automatically adapted to the morphology/voice of each new user during a short system calibration phase, in which the patient pronounces a few phrases. This system, validated in the laboratory with healthy speakers, is now being tested in a simplified version in a clinical trial with patients who have undergone tongue surgery.
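The building block of C-GMR is standard Gaussian Mixture Regression: fit a joint GMM on stacked source/target vectors, then predict the target as the expectation of the conditional distribution given the source. The sketch below shows only this plain GMR step on toy data, not the cascaded expert-to-user adaptation of C-GMR; all dimensions and the data itself are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy joint data: source features x (e.g. user's acoustics, dim 2)
# and target features y (e.g. talking-head control params, dim 2).
x = rng.normal(size=(2000, 2))
y = np.sin(x) + 0.05 * rng.normal(size=(2000, 2))
dx = x.shape[1]

# 1) Fit a joint GMM on z = [x; y].
gmm = GaussianMixture(n_components=8, covariance_type="full",
                      random_state=0).fit(np.hstack([x, y]))

def gmr_predict(x_new):
    """E[y | x] under the joint GMM (standard GMR)."""
    weights = np.empty(gmm.n_components)
    cond_means = np.empty((gmm.n_components, y.shape[1]))
    for k in range(gmm.n_components):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        mu_x, mu_y = mu[:dx], mu[dx:]
        Sxx, Sxy = S[:dx, :dx], S[:dx, dx:]
        # Responsibility of component k given the source x only.
        weights[k] = gmm.weights_[k] * multivariate_normal.pdf(
            x_new, mean=mu_x, cov=Sxx)
        # Conditional mean of y given x for component k.
        cond_means[k] = mu_y + Sxy.T @ np.linalg.solve(Sxx, x_new - mu_x)
    weights /= weights.sum()
    # 2) Weighted sum of per-component conditional means.
    return weights @ cond_means

y_hat = gmr_predict(np.array([0.5, -0.3]))
print(y_hat.shape)  # (2,)
```

C-GMR extends this by training the joint model on the expert speaker's database and re-estimating it, in a cascade, from the few calibration phrases of each new user.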


Some results:

Real-time prototype based on the first approach (animation of the talking head from the user's voice, using the C-GMR technique (Hueber et al., 2015)).



Real-time prototype based on the second approach (animation of the talking head from the ultrasound images of the user's vocal tract, using the C-GMR technique (Fabre et al., 2017)).



Related funded projects:

  • Diandra Fabre's PhD (funded by Région Rhône-Alpes)
  • Project Living Book of Anatomy (Persyval-lab, funding for the post-doctoral position of Eric Tatulli)
  • "Vizart3D" (Pôle CSVB, Université Joseph Fourier, Grenoble, 2012-2013, GIPSA-lab)



Incremental Text-to-Speech Synthesis


This research project aims at developing an incremental Text-to-Speech system (iTTS) in order to improve the user experience of people with communication disorders who use a TTS system in their daily life. Contrary to a conventional TTS system, an iTTS system aims at delivering the synthetic voice while the user is typing (possibly with a delay of one word), and thus before the full sentence is available. By reducing the latency between text input and speech output, iTTS should enhance the interactivity of communication. Moreover, iTTS could be chained with incremental speech recognition systems in order to design highly responsive speech-to-speech conversion systems (with applications in automatic translation, silent speech interfaces, real-time enhancement of pathological voice, etc.).
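The one-word delay can be pictured as a small buffering policy: a word is released to the synthesizer only once the next word has been typed, so minimal right context is available. The generator below is an illustrative scheduling sketch under that assumption; the actual system instead decides the latency adaptively from POS-tagging confidence, which this toy does not model.

```python
from typing import Iterable, Iterator

def incremental_chunks(words: Iterable[str], lookahead: int = 1) -> Iterator[str]:
    """Release each word for synthesis once `lookahead` later words
    have been typed, so that local right context (e.g. for
    POS-dependent prosody) is available when the word is rendered."""
    buffer = []
    for w in words:
        buffer.append(w)
        if len(buffer) > lookahead:
            yield buffer.pop(0)
    # End of sentence: synthesize whatever remains in the buffer.
    yield from buffer

typed = "please open the window".split()
print(list(incremental_chunks(typed)))
# ['please', 'open', 'the', 'window']
```

With `lookahead=1`, "please" is only emitted once "open" arrives, trading a one-word latency for context; a conventional TTS system would wait for the entire sentence instead.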


Check out this video of our first prototype, developed in the context of the SpeakRightNow project and Maël Pouget's PhD. It is based on our "adaptive latency POS tagger" (Pouget et al., 2016) and on an HMM-based voice training procedure adapted to the incremental TTS paradigm (Pouget et al., 2015):



Related funded projects:

  • Project SpeakRightNow (AGIR, funding for the post-doctoral position of Olha Nahorna)
  • Maël Pouget's PhD (National grant)

Grenoble Images Parole Signal Automatique laboratoire

UMR 5216 CNRS - Grenoble INP - Université Joseph Fourier - Université Stendhal