Talking Heads by cloning

During my post-doctoral research position in the Synthesis group at ICP, I have been working on the creation, representation and coding of animated, textured 3D face clones, also known as "Talking Heads". We also worked on Cued Speech, where a hand helps hearing-impaired people understand speech.

Summary

We first track a real video of a talking person (seen only from the front). This yields very few articulatory speech parameters: only 6 parameters per coded frame (possibly 6 more for head movements).
Once transmitted and received (possibly in an MPEG-4 compatible way), the parameter values can drive a realistic 3D synthetic rendering (the first side view in the video example).
In the MPEG-4 compatible way, we can even derive the face articulation for another speaker's appearance, as seen in the rightmost video (where one can also make out the jaw bone movements).
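
To illustrate how so few parameters can drive a whole face mesh, here is a minimal sketch. It assumes a linear articulatory model (a neutral shape plus one displacement field per parameter), which this page does not state explicitly. The function name, the two-vertex toy mesh and the parameter labels are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def synthesize_shape(
    neutral: Sequence[float],
    basis: Sequence[Sequence[float]],
    params: Sequence[float],
) -> List[float]:
    """Return neutral + sum_k params[k] * basis[k] (assumed linear model).

    `neutral` and every row of `basis` are flattened vertex coordinates
    (x0, y0, z0, x1, y1, z1, ...).
    """
    if len(params) != len(basis):
        raise ValueError("one weight is needed per basis displacement")
    shape = list(neutral)
    for weight, displacement in zip(params, basis):
        if len(displacement) != len(shape):
            raise ValueError("displacement size must match the mesh size")
        for i, delta in enumerate(displacement):
            shape[i] += weight * delta
    return shape


# Toy data (hypothetical): 2 vertices and 6 articulatory parameters.
NEUTRAL = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
BASIS = [
    [0.0, -1.0, 0.0, 0.0, -1.0, 0.0],  # e.g. a "jaw opening" direction
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],    # e.g. a "lip protrusion" direction
    [-1.0, 0.0, 0.0, 1.0, 0.0, 0.0],   # e.g. a "lip spreading" direction
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, -1.0],
]

# One decoded frame: six received parameter values.
frame_params = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
frame_shape = synthesize_shape(NEUTRAL, BASIS, frame_params)
print(frame_shape)
```

In such a scheme the receiver only needs the neutral shape and the displacement fields once; each new frame then costs six numbers.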

1.2MB .avi
Original front-only video capture, reconstructed side view & transposition

Insights by videos

2MB .avi
Original front-only video capture with reconstructed articulated mesh

The next sequences show some realistic-looking reconstructions (black background) of another video message (original on blue background), either on the original speaker (left) or on another one.

*A synthetic side view, from an original front-only capture, applied to another speaker*
1MB .avi	2MB .avi	1.2MB .avi

The reconstruction is truly 3D, as can be seen from the 3D lip movements in the previous sequences (reconstructed using only front-view information!) or in the next sequence, where the reconstructed talking head rotates.

*A longer message, revealing the 3D structure as well as jaw and teeth*
	Full sequence (36 sec, 20MB, avi) Intro sentence (7 sec, 4MB, avi) Half resolution intro (7 sec, 2MB, avi)

We can also construct a temporal model of the speaker's speech activity (capturing the coarticulation of the visemes) and build a text-to-audiovisual speech synthesizer. The next video shows one resulting sentence (in French).

10MB .avi
Output of the audiovisual speech synthesizer developed at ICP

Using Clones for telecommunications

I was involved in the now finished Tempo Valse RNRT research project, which addresses a special case of audiovisual communication involving only front-view images, as with a videophone. Very good intelligibility and very low bandwidth are reconciled by transmitting only the articulatory parameters (in what we call a labiophone). That way, the synthetic videos offer better quality and transmit faster than the large .avi files on this page.
Details can be browsed (in French, but with lots of pictures and some videos) on these points:

the context of our talking heads,
the creation of the articulatory model,
the textured synthesis of the 3D model,
the analysis by synthesis of real videos,
the MPEG-4 encoding/decoding of audio-visual speech.
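
As a rough, back-of-the-envelope illustration of the bandwidth argument above, the sketch below compares an uncompressed parameter stream with the 36-second, 20MB full sequence listed on this page. The frame rate (25 frames per second), the quantization (8 bits per parameter) and the reading of 1MB as 10^6 bytes are assumptions of this example, not figures from the project. The comparison also ignores the audio track and any entropy coding.

```python
def parameter_bitrate(
    params_per_frame: int,
    frames_per_second: float = 25.0,  # assumed frame rate
    bits_per_param: int = 8,          # assumed quantization
) -> float:
    """Bits per second needed to send raw articulatory parameters."""
    return params_per_frame * bits_per_param * frames_per_second


def file_bitrate(size_megabytes: float, duration_seconds: float) -> float:
    """Average bits per second of a video file (1 MB taken as 10**6 bytes)."""
    return size_megabytes * 1_000_000 * 8 / duration_seconds


speech_only = parameter_bitrate(6)         # 6 articulatory parameters
with_head_moves = parameter_bitrate(12)    # plus 6 head-movement parameters
full_sequence = file_bitrate(20, 36)       # the 36 sec, 20MB .avi above

print(f"speech parameters only: {speech_only:.0f} bit/s")
print(f"with head movements:    {with_head_moves:.0f} bit/s")
print(f"full .avi sequence:     {full_sequence:.0f} bit/s")
print(f"ratio .avi / parameters: {full_sequence / speech_only:.0f}x")
```

Under these assumptions the parameter stream is several thousand times lighter than the video file, which is the order of magnitude that makes the labiophone idea attractive.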
