BEN YOUSSEF
Atef
Researcher
Publications

Publications by Atef Ben Youssef [ ben_youssef.bib ]

Nicolas Sabouret, Björn Schuller, Lucas Paletta, Erik Marchi, Hazaël Jones and Atef Ben Youssef. Intelligent User Interfaces in Digital Games for Empowerment and Inclusion. In 12th International Conference on Advances in Computer Entertainment Technology (ACE 2015), Iskandar Malaysia, 2015. [ bib | .pdf ] Gold Paper Award (best paper)

In the context of the development of Serious Games, this position paper sheds light on their possible use to support and enhance social inclusion through the presentation of three research projects in this domain. We first describe the context of our work and research. We then report on several events that brought together the scientific community working on Digital Games for Empowerment and Inclusion. Last, we present some results of our experience in this domain. The presented work particularly highlights the importance of user-centered design in interdisciplinary cooperation with psychologists, sociologists, vulnerable end users and practitioners. We stress the relevance of progress beyond the state of the art in recently focused areas of AI, such as computational theory of mind, affective multimodal computing, learning of socio-emotional skills, and the recovery of human attention processes. Typical examples are given to illustrate the projects' research results.

Atef Ben Youssef, Mathieu Chollet, Hazaël Jones, Nicolas Sabouret, Catherine Pelachaud, and Magalie Ochs. Towards a socially adaptive virtual agent. In 15th International Conference on Intelligent Virtual Agents (IVA 2015), pages 3-16, Delft, the Netherlands. [ bib | .pdf ]

This paper presents a socially adaptive virtual agent that can adapt its behaviour according to social constructs (e.g. attitude, relationship) that are updated depending on the behaviour of its interlocutor. We consider the context of job interviews with the virtual agent playing the role of the recruiter. The evaluation of our approach is based on a comparison of the socially adaptive agent to a simple scripted agent and to an emotionally-reactive one. Videos of these three different agents in situation have been created and evaluated by 83 participants. This subjective evaluation shows that the simulation and expression of social attitude is perceived by the users and impacts the evaluation of the agent's credibility. We also found that while the emotion expression of the virtual agent has an immediate impact on the user's experience, the impact of the virtual agent's attitude expression is stronger after a few speaking turns.

Atef Ben Youssef, Mathieu Chollet, Hazaël Jones, Nicolas Sabouret, Catherine Pelachaud, and Magalie Ochs. An architecture for a socially adaptive virtual recruiter in job interview simulations. In International Workshop on Intelligent Digital Games for Empowerment and Inclusion (IDGEI 2015) at Intelligent User Interfaces (IUI 2015), Atlanta, Georgia, USA, 2015. [ bib | .pdf ]

This paper presents an architecture for an adaptive virtual recruiter in the context of job interview simulation. This architecture allows the virtual agent to adapt its behaviour according to social constructs (e.g. attitude, relationship) that are updated depending on the behaviour of its interlocutor. During the whole interaction, the system analyses the behaviour of the human participant, builds and updates the mental states of the virtual agent, and adapts its social attitude expression. This adaptation mechanism can be applied to a wide spectrum of application domains in Digital Inclusion, where the user needs to train social skills with a virtual peer.
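
As a rough illustration of this perceive-update-express loop, the Python sketch below shows one way such a cycle could be organised. It is a hypothetical example, not the paper's implementation; names such as SocialState and perceive_user are invented for the illustration.

```python
from dataclasses import dataclass

@dataclass
class SocialState:
    """Hypothetical social constructs maintained by the virtual recruiter."""
    attitude: float = 0.0      # e.g. hostile (-1) .. friendly (+1)
    relationship: float = 0.0  # e.g. distant (-1) .. close (+1)

def perceive_user(user_signal: dict) -> float:
    """Toy appraisal: map detected user behaviour to a single valence score."""
    return 0.5 * user_signal.get("smile", 0.0) - 0.5 * user_signal.get("gaze_aversion", 0.0)

def update_state(state: SocialState, valence: float, rate: float = 0.2) -> None:
    """Drift the social constructs toward the perceived valence."""
    state.attitude += rate * (valence - state.attitude)
    state.relationship += 0.5 * rate * valence

def select_expression(state: SocialState) -> str:
    """Choose a coarse attitude expression from the current state."""
    return "warm" if state.attitude > 0 else "dominant"

# One interaction turn
state = SocialState()
update_state(state, perceive_user({"smile": 0.8, "gaze_aversion": 0.1}))
print(select_expression(state), state)
```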

Atef Ben Youssef, Nicolas Sabouret, Sylvain Caillou. Subjective Evaluation of a BDI-based Theory of Mind model. In Proc. Workshop Affect, Compagnon Artificiel, Interaction (WACAI), pages 120-125, Rouen, France, 2014. [ bib | .pdf ]

Theory of Mind (ToM) plays an important role in affective interactions, and several logic-based models of ToM have been proposed in the literature to enhance the credibility of intelligent virtual agents. However, the evaluation of the impact of such a model remains a difficult question. In this paper, we present an evaluation of a Belief-Desire-Intention (BDI)-based ToM model using a subjective study based on human-agent interaction. We first briefly present the main principles of the considered ToM model and the dimensions to evaluate. We then present our protocol and we show that the use of the ToM model improved the believability of the virtual agent, both at the cognitive and at the expressive level.

Hazaël Jones, Nicolas Sabouret, Atef Ben Youssef. Strategic Intentions based on an Affective Model and a simple Theory of Mind. In Proc. Workshop Affect, Compagnon Artificiel, Interaction (WACAI), pages 59-64, Rouen, France, 2014. [ bib | .pdf ]

This paper presents a computational model for reasoning about affects of the interlocutor, using a Theory of Mind (ToM) paradigm: the system manipulates representations of beliefs about the interlocutor's affects, preferences and goals. Our affective model is designed for the context of job interview simulation, but it does not depend on a specific set of affects. It relies on simple rules for selecting topics depending on the virtual agent's personality. We have implemented it using an OCC-based representation of emotions and a PAD model for moods.

Atef Ben Youssef, Hiroshi Shimodaira, and David Braude. Speech driven talking head from estimated articulatory features. In Proc. ICASSP, pages 4606-4610, Florence, Italy, May 2014. [ bib | .pdf ]

In this paper, we present a talking head in which the lips and head motion are controlled using articulatory movements estimated from speech. A phone-size HMM-based inversion mapping is employed and trained in a semi-supervised fashion. The advantage of using articulatory features is that they can drive the lip motions and they have a close link with head movements. Speech inversion normally requires training data recorded with an electromagnetic articulograph (EMA), which restricts the naturalness of head movements. The present study considers a more realistic recording condition where the training data for the target speaker are recorded with a standard motion capture system rather than EMA. Different temporal clustering techniques are investigated for HMM-based mapping, as well as a GMM-based frame-wise mapping as a baseline system. Objective and subjective experiments show that the synthesised motions are more natural using an HMM system than a GMM one, and that estimated EMA features outperform prosodic features.

Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude. Articulatory Features Based Talking Heads Using Speech Inversion. In IEICE technical report. Speech 113(161), pages 63-68, Lyon, France, Jul 2013. [ bib | .pdf ]

We present a talking head in which the lips and head motion are controlled using articulatory movements estimated from speech. The advantage of using articulatory features is that they can drive the lip motions and they have a close link with head movement. In the literature, speech-driven head motion synthesis has mainly been investigated using prosodic features. In the proposed approach, we estimate lip and head motions via intermediate articulatory features predicted from speech. In order to obtain spontaneous speech, head and articulatory movements of 12 people in dialogue, acquired by EMA, were recorded synchronously with audio. Measured articulatory data were compared to those predicted from speech using a phone-size HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) shows that the articulatory features are more correlated with head rotation than prosodic and/or cepstral speech features. Multi-stream head-cluster-size HMMs are trained jointly on the synchronous streams of speech and head motion data. Subjective evaluation was performed to evaluate the synthesised lip and head movements.

Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude. Articulatory features for speech-driven head motion synthesis. In Proc. Interspeech, pages 2758-2762, Lyon, France, August 2013. [ bib | .pdf ]

This study investigates the use of articulatory features for speech-driven head motion synthesis, as opposed to prosodic features such as F0 and energy which have mainly been used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on the synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of speech that is expected to have a close link with head movement. Measured head and articulatory movements acquired by EMA were synchronously recorded with speech. Measured articulatory data were compared to those predicted from speech using an HMM-based inversion mapping system trained in a semi-supervised fashion. Canonical correlation analysis (CCA) on a data set of free speech from 12 people shows that the articulatory features are more correlated with head rotation than prosodic and/or cepstral speech features. It is also shown that head motion synthesised using articulatory features gives higher correlations with the original head motion than when only prosodic features are used.
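
As a rough illustration of the kind of CCA comparison reported here, the snippet below correlates two candidate feature streams with a head-rotation signal and prints the first canonical correlation for each. It uses synthetic data only (the correlated stream is constructed by hand), not the EMA corpus.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins: head rotation (3 angles), an "articulatory" and a "prosodic" stream.
head = rng.standard_normal((n, 3))
artic = head @ rng.standard_normal((3, 12)) + 0.5 * rng.standard_normal((n, 12))  # correlated by construction
prosody = rng.standard_normal((n, 2))                                             # uncorrelated by construction

def first_canonical_corr(x, y):
    """Fit a 1-component CCA and return the correlation of the canonical variates."""
    u, v = CCA(n_components=1).fit_transform(x, y)
    return float(np.corrcoef(u[:, 0], v[:, 0])[0, 1])

print("articulatory vs head:", first_canonical_corr(artic, head))
print("prosodic     vs head:", first_canonical_corr(prosody, head))
```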

David A. Braude, Hiroshi Shimodaira, and Atef Ben Youssef. Template-warping based speech driven head motion synthesis. In Proc. Interspeech, pages 2763-2767, Lyon, France, August 2013. [ bib | .pdf ]

We propose a method for synthesising head motion from speech using a combination of an Input-Output Markov model (IOMM) and Gaussian mixture models trained in a supervised manner. A key difference of this approach compared to others is to model the head motion in each angle as a series of templates of motion rather than trying to recover a frame-wise function. The templates were chosen to reflect natural patterns in the head motion, and states for the IOMM were chosen based on statistics of the templates. This reduces the search space for the trajectories and prevents impossible motions such as discontinuities. For synthesis, our system warps the templates to account for the acoustic features and the other angles' warping parameters. We show our system is capable of recovering the statistics of the motion that were chosen for the states. Our system was then compared to a baseline that used a frame-wise mapping based on previously published work. A subjective preference test that includes multiple speakers showed participants have a preference for the segment-based approach. Both of these systems were trained on storytelling free speech.
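
One ingredient of this approach, warping a stored motion template to a new duration, can be illustrated in a few lines. This is only a sketch of plain linear time-warping; the paper's warping additionally accounts for acoustic features and the other angles' warping parameters, which is not modelled here.

```python
import numpy as np

def warp_template(template: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly resample a 1-D motion template (e.g. one head angle) to target_len frames."""
    src = np.linspace(0.0, 1.0, num=len(template))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst, src, template)

# A 20-frame nod-like template stretched to 35 frames.
nod = np.sin(np.linspace(0, np.pi, 20))
print(warp_template(nod, 35).shape)  # (35,)
```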

Atef Ben Youssef, Hiroshi Shimodaira, and David A. Braude. Head motion analysis and synthesis over different tasks. In Proc. of Intelligent Virtual Agents, pages 285-294. Springer, September 2013. [ bib | .pdf ]

It is known that subjects vary in their head movements. This paper presents an analysis of this variety over different tasks and speakers and of its impact on head motion synthesis. Head and articulatory movements measured by an ElectroMagnetic Articulograph (EMA) and synchronously recorded with audio were used. A data set of speech from 12 people recorded on different tasks confirms that head motion varies across tasks and speakers. Experimental results confirmed that the proposed models were capable of learning and synthesising task-dependent head motions from speech. Subjective evaluation of head motion synthesised with task-specific models shows that models trained on the matched task outperform those trained on a mismatched task, and that free-speech data yield models whose predicted motion is preferred by participants over models trained on read speech.

David A. Braude, Hiroshi Shimodaira, and Atef Ben Youssef. The University of Edinburgh head-motion and audio storytelling (UoE-HaS) dataset. In Proc. of Intelligent Virtual Agents, pages 466-467. Springer, 2013. [ bib | .pdf ]

In this paper we announce the release of a large dataset of storytelling monologue with motion capture for the head and body. Initial tests on the dataset indicate that head motion is more dependent on the speaker than on the style of speech.

Thomas Hueber, Atef Ben Youssef, Gérard Bailly, Pierre Badin, and Frédéric Elisei. Cross-speaker acoustic-to-articulatory inversion using phone-based trajectory HMM for pronunciation training. In Proc. Interspeech, Portland, Oregon, USA, 2012. [ bib | .pdf ]

The article presents a statistical mapping approach for cross-speaker acoustic-to-articulatory inversion. The goal is to estimate the most likely articulatory trajectories for a reference speaker from the speech audio signal of another speaker. This approach is developed in the framework of our visual articulatory feedback system developed for computer-assisted pronunciation training (CAPT) applications. The proposed technique is based on the joint modeling of articulatory and acoustic features, for each phonetic class, using full-covariance trajectory HMM. The acoustic-to-articulatory inversion is achieved in two steps: 1) finding the most likely HMM state sequence from the acoustic observations; 2) inferring the articulatory trajectories from both the decoded state sequence and the acoustic observations. The problem of speaker adaptation is addressed using a voice conversion approach, based on trajectory GMM.
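
Written generically (this is not the paper's exact notation; lambda below simply stands for the trained joint acoustic-articulatory model), the two inversion steps amount to:

```latex
% Step 1: decode the most likely HMM state sequence q from the acoustic observations x
\hat{q} = \operatorname*{arg\,max}_{q} \; p(q \mid x, \lambda)

% Step 2: infer the articulatory trajectories y from the decoded states and the acoustics
\hat{y} = \operatorname*{arg\,max}_{y} \; p(y \mid \hat{q}, x, \lambda)
```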

Gérard Bailly, Pierre Badin, Lionel Revéret, and Atef Ben Youssef. Sensorimotor characteristics of speech production. Cambridge University Press, 2012. [ bib | DOI ]

Atef Ben Youssef. Control of talking heads by acoustic-to-articulatory inversion for language learning and rehabilitation. PhD thesis, Grenoble University, October 2011. [ bib | .pdf ]

This thesis presents a visual articulatory feedback system in which the visible and non-visible articulators of a talking head are controlled by inversion from a speaker's voice. Our approach to this inversion problem is based on statistical models built on acoustic and articulatory data recorded from a French speaker by means of an electromagnetic articulograph. A first system combines acoustic speech recognition and articulatory speech synthesis techniques based on hidden Markov models (HMMs). A second system uses Gaussian mixture models (GMMs) to directly estimate the articulatory trajectories from the speech sound. In order to generalise the single-speaker system to a multi-speaker system, we have implemented a speaker adaptation method based on maximum likelihood linear regression (MLLR), which we have assessed by means of a reference articulatory recognition system. Finally, we present a complete visual articulatory feedback demonstrator.

Keywords: visual articulatory feedback; acoustic-to-articulatory speech inversion mapping; ElectroMagnetic Articulography (EMA); hidden Markov models (HMMs); Gaussian mixture models (GMMs); speaker adaptation; face-to-tongue mapping

Atef Ben Youssef, Thomas Hueber, Pierre Badin, and Gérard Bailly. Toward a multi-speaker visual articulatory feedback system. In Proc. Interspeech, pages 589-592, Florence, Italy, August 2011. [ bib | .pdf ]

In this paper, we present recent developments on the HMM-based acoustic-to-articulatory inversion approach that we develop for a "visual articulatory feedback" system. In this approach, multi-stream phoneme HMMs are trained jointly on synchronous streams of acoustic and articulatory data, acquired by electromagnetic articulography (EMA). Acoustic-to-articulatory inversion is achieved in two steps. Phonetic and state decoding is first performed. Then articulatory trajectories are inferred from the decoded phone and state sequence using the maximum-likelihood parameter generation algorithm (MLPG). We introduce here a new procedure for the re-estimation of the HMM parameters, based on the Minimum Generation Error criterion (MGE). We also investigate the use of model adaptation techniques based on maximum likelihood linear regression (MLLR), as a first step toward a multi-speaker visual articulatory feedback system.

Atef Ben Youssef, Thomas Hueber, Pierre Badin, Gérard Bailly, and Frédéric Elisei. Toward a speaker-independent visual articulatory feedback system. In 9th International Seminar on Speech Production, ISSP9, Montreal, Canada, 2011. [ bib | .pdf ]

Thomas Hueber, Pierre Badin, Gérard Bailly, Atef Ben Youssef, Frédéric Elisei, Bruce Denby, and Gérard Chollet. Statistical mapping between articulatory and acoustic data. Application to silent speech interface and visual articulatory feedback. In Proceedings of the 1st International Workshop on Performative Speech and Singing Synthesis (P3S), Vancouver, Canada, 2011. [ bib | .pdf ]

This paper reviews some theoretical and practical aspects of different statistical mapping techniques used to model the relationships between the articulatory gestures and the resulting speech sound. These techniques are based on the joint modeling of articulatory and acoustic data using Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM). These methods are implemented in two systems: (1) the silent speech interface developed at SIGMA and LTCI laboratories which converts tongue and lip motions, captured during silent articulation by ultrasound and video imaging, into audible speech, and (2) the visual articulatory feedback system, developed at GIPSA-lab, which automatically animates, from the speech sound, a 3D orofacial clone displaying all articulators (including the tongue). These mapping techniques are also discussed in terms of real-time implementation.

Keywords: statistical mapping; silent speech; ultrasound; visual articulatory feedback; talking head; HMM; GMM

Atef Ben Youssef, Pierre Badin, and Gérard Bailly. Can tongue be recovered from face? The answer of data-driven statistical models. In Proc. Interspeech, pages 2002-2005, Makuhari, Japan, September 2010. [ bib | .pdf ]

This study revisits the face-to-tongue articulatory inversion problem in speech. We compare the Multi Linear Regression method (MLR) with two more sophisticated methods based on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), using the same French corpus of articulatory data acquired by ElectroMagnetic Articulography (EMA). GMMs give overall better results than HMMs, while MLR does poorly. GMMs and HMMs maintain the original phonetic class distribution, though with some centralisation effects, which are still much stronger with MLR. A detailed analysis shows that, while the jaw / lips / tongue tip synergy helps recover high front vowels and coronal consonants, velars are not recovered at all. It is therefore not possible to reliably recover the tongue from the face.
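
For illustration, the MLR baseline discussed above is essentially a linear map from face features to tongue coordinates. The following is a minimal sketch on synthetic data only; the dimensions and variable names are arbitrary and do not correspond to the French EMA corpus.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_frames = 5000

# Synthetic stand-ins: 6 face-coil coordinates and 6 tongue-coil coordinates per frame.
face = rng.standard_normal((n_frames, 6))
tongue = face @ rng.standard_normal((6, 6)) + rng.standard_normal((n_frames, 6))  # partly predictable

split = int(0.8 * n_frames)
mlr = LinearRegression().fit(face[:split], tongue[:split])   # multi-output linear regression
pred = mlr.predict(face[split:])

rmse = np.sqrt(np.mean((pred - tongue[split:]) ** 2))
print(f"RMSE on held-out frames: {rmse:.2f} (arbitrary synthetic units)")
```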

Atef Ben Youssef, Pierre Badin, Gérard Bailly, and Viet-Anh Tran. Méthodes basées sur les HMMs et les GMMs pour l'inversion acoustico-articulatoire en parole. In Proc. of JEP, pages 249-252, Mons, Belgium, May 2010. [ bib | .pdf ]

Two speech inversion methods are implemented and compared. In the first, multistream Hidden Markov Models (HMMs) of phonemes are jointly trained from synchronous streams of articulatory data acquired by EMA and speech spectral parameters; an acoustic recognition system uses the acoustic part of the HMMs to deliver a phoneme chain and the state durations; this information is then used by a trajectory formation procedure based on the articulatory part of the HMMs to resynthesise the articulatory data. In the second, Gaussian Mixture Models (GMMs) are trained on these streams to directly associate articulatory frames with acoustic frames in context. Over a corpus of 17 minutes uttered by a French speaker, the RMS error was 1.66 mm with the HMMs and 2.25 mm with the GMMs.

Atef Ben Youssef, Pierre Badin, and Gérard Bailly. Acoustic-to-articulatory inversion in speech based on statistical models. In Proc. of AVSP 2010, pages 160-165, Hakone, Kanagawa, Japon, 2010. [ bib | .pdf ]

Two speech inversion methods are implemented and compared. In the first, multistream Hidden Markov Models (HMMs) of phonemes are jointly trained from synchronous streams of articulatory data acquired by EMA and speech spectral parameters; an acoustic recognition system uses the acoustic part of the HMMs to deliver a phoneme chain and the state durations; this information is then used by a trajectory formation procedure based on the articulatory part of the HMMs to resynthesise the articulatory movements. In the second, Gaussian Mixture Models (GMMs) are trained on these streams to directly associate articulatory frames with acoustic frames in context, using Maximum Likelihood Estimation. Over a corpus of 17 minutes uttered by a French speaker, the RMS error was 1.62 mm with the HMMs and 2.25 mm with the GMMs.
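
The GMM branch of this comparison rests on joint-density modelling: a GMM is fitted on stacked acoustic-articulatory vectors, and each acoustic frame is mapped to the expected articulatory frame under the conditional distribution. The sketch below illustrates that conditional-mean computation on toy data; it is a simplified frame-wise version with no acoustic context window and no trajectory smoothing, unlike the systems described in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n, da, dy = 4000, 8, 4                      # frames, acoustic dim, articulatory dim

acoustic = rng.standard_normal((n, da))
artic = acoustic @ rng.standard_normal((da, dy)) + 0.3 * rng.standard_normal((n, dy))

# Joint model over stacked [acoustic; articulatory] vectors.
gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(np.hstack([acoustic, artic]))

def invert(x):
    """E[y | x] under the joint GMM: responsibility-weighted per-component regressions."""
    weights, cond_means = [], []
    for w, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        mu_x, mu_y = mu[:da], mu[da:]
        S_xx, S_yx = cov[:da, :da], cov[da:, :da]
        diff = x - mu_x
        # Unnormalised responsibility p(k | x); the shared (2*pi)^(-da/2) factor cancels.
        weights.append(w * np.exp(-0.5 * diff @ np.linalg.solve(S_xx, diff))
                       / np.sqrt(np.linalg.det(S_xx)))
        # Conditional mean of the articulatory part for this component.
        cond_means.append(mu_y + S_yx @ np.linalg.solve(S_xx, diff))
    weights = np.array(weights) / np.sum(weights)
    return weights @ np.array(cond_means)

print(invert(acoustic[0]).shape)  # (dy,)
```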

Pierre Badin, Atef Ben Youssef, Gérard Bailly, Frédéric Elisei, and Thomas Hueber. Visual articulatory feedback for phonetic correction in second language learning. In Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, Tokyo, Japan, 2010. [ bib | .pdf ]

Orofacial clones can display speech articulation in an augmented mode, i.e. display all major speech articulators, including those usually hidden such as the tongue or the velum. Besides, a number of studies tend to show that the visual articulatory feedback provided by ElectroPalatoGraphy or ultrasound echography is useful for speech therapy. This paper describes the latest developments in acoustic-to-articulatory inversion, based on statistical models, to drive orofacial clones from speech sound. It suggests that this technology could provide more elaborate feedback than previously available, and that it would be useful in the domain of Computer-Aided Pronunciation Training.

Atef Ben Youssef, Pierre Badin, Gérard Bailly, and Panikos Heracleous. Acoustic-to-articulatory inversion using speech recognition and trajectory formation based on phoneme hidden Markov models. In Proc. Interspeech, pages 2255-2258, Brighton, UK, September 2009. [ bib | .pdf ]

In order to recover the movements of usually hidden articulators such as the tongue or the velum, we have developed a data-based speech inversion method. HMMs are trained, in a multistream framework, from two synchronous streams: articulatory movements measured by EMA, and MFCC + energy from the speech signal. A speech recognition procedure based on the acoustic part of the HMMs delivers the chain of phonemes together with their durations; this information is subsequently used by a trajectory formation procedure based on the articulatory part of the HMMs to synthesise the articulatory movements. The RMS reconstruction error ranged between 1.1 and 2 mm.

Atef Ben Youssef, Viet-Anh Tran, Pierre Badin, and Gérard Bailly. HMMs and GMMs based methods in acoustic-to-articulatory speech inversion. In Proc. of RJCP, pages 186-192, Avignon, France, 2009. [ bib | .pdf ]

In order to recover the movements of articulators such as the lips, the jaw or the tongue from the speech sound, we developed and compared two inversion methods, one based on hidden Markov models (HMMs) and the other on Gaussian mixture models (GMMs). Articulator movements are characterised by the midsagittal coordinates of electromagnetic articulograph (EMA) coils attached to the articulators. In the first method, two-stream HMMs (acoustic and articulatory) are trained from synchronous acoustic and articulatory signals. The acoustic HMM is used to recognise the phones as well as their durations; this information is then used by the articulatory HMM to synthesise the articulatory trajectories. In the second method, a GMM directly associating acoustic and articulatory features is trained on the same corpus under the minimum mean square error (MMSE) criterion, using acoustic frames over a larger or smaller temporal span. On a single-speaker EMA corpus recorded by a French speaker, the RMS reconstruction error on the test set lies between 1.96 and 2.32 mm for the HMM-based method, and between 2.46 and 2.95 mm for the GMM-based method.

Laurent Besacier, Atef Ben Youssef, and Hervé Blanchon. The LIG Arabic/English speech translation system at IWSLT08. In International Workshop on Spoken Language Translation (IWSLT) 2008, pages 58-62, Hawaii, USA, 2008. [ bib | .pdf ]

This paper is a description of the system presented by the LIG laboratory to the IWSLT08 speech translation evaluation. The LIG participated, for the second time this year, in the Arabic to English speech translation task. For translation, we used a conventional statistical phrase-based system developed using the Moses open-source decoder. We describe chronologically the improvements made since last year, starting from the IWSLT 2007 system and continuing with the improvements made for our 2008 submission. Then, we discuss in section 5 some post-evaluation experiments made very recently, as well as some on-going work on Arabic/English speech-to-text translation. This year, the systems were ranked according to the (BLEU+METEOR)/2 score of the primary ASR output run submissions. The LIG was ranked 5th/10 based on this rule.



