SONG Guanghan

Effect of sound in videos on gaze direction: a contribution to audio-visual saliency modeling


Thesis supervisor: Denis PELLERIN

Doctoral school: Electronics, Electrotechnics, Automatic Control and Signal Processing (EEATS)

Speciality: Signal, Image, Speech, Telecommunications

Affiliated institution: Université Grenoble Alpes

Institution of origin: Southeast University, China

Funding: scholarship awarded by a foreign government


Thesis start date: 01/02/2010

Defense date: 14/06/2013


Jury:
Patrick Le Callet, Reviewer
Hervé Glotin, Reviewer
Christophe Garcia, Examiner
Didier Coquin, Examiner
Denis Pellerin, Thesis supervisor


Abstract: Humans receive a large quantity of information from their environment through sight and hearing. To help us react quickly and appropriately, mechanisms in the brain bias attention towards particular regions, known as salient regions. This attentional bias is influenced not only by vision but also by audio-visual interaction. According to the existing literature, visual attention can be studied through eye movements; however, the effect of sound on eye movements in videos remains little known. The aim of this thesis is to investigate the influence of sound in videos on eye movements and to propose an audio-visual saliency model that predicts salient regions in videos more accurately.

For this purpose, we designed a first audio-visual eye-tracking experiment. We created a database of short video excerpts selected from various films. These excerpts were viewed by participants either with their original soundtrack (AV condition) or without it (V condition). We analyzed the difference in eye positions between participants in the AV and V conditions. The results show that sound does have an effect on eye movements, and that the effect is greatest for the on-screen speech class.

We then designed a second audio-visual experiment with thirteen classes of sound. By comparing the difference in eye positions between participants in the AV and V conditions, we conclude that the effect of sound depends on the type of sound, and that the classes containing the human voice (i.e. the speech, singer, human noise, and singers classes) have the greatest effect. More precisely, the sound source significantly attracted eye positions only when the sound was a human voice. Moreover, participants in the AV condition had a shorter average fixation duration than those in the V condition.

Finally, we proposed a preliminary audio-visual saliency model based on the findings of these experiments.
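The abstract does not specify how the difference in eye positions between the AV and V conditions was quantified. As a purely illustrative sketch (not necessarily the metric used in the thesis), one simple measure is the dispersion of gaze points around their barycenter in each condition, compared per frame; the function names and the (x, y) coordinate convention below are assumptions:

```python
import numpy as np

def dispersion(positions):
    """Mean Euclidean distance of eye positions to their barycenter.

    positions: (n, 2) array of (x, y) gaze coordinates, one row per
    participant, for a single video frame.
    """
    center = positions.mean(axis=0)
    return float(np.linalg.norm(positions - center, axis=1).mean())

def av_vs_v_difference(positions_av, positions_v):
    """Difference in gaze dispersion between conditions for one frame.

    A positive value means gazes in the AV condition are more tightly
    clustered than in the V condition (e.g. drawn to a sound source).
    """
    return dispersion(positions_v) - dispersion(positions_av)
```

Averaging this difference over the frames of an excerpt would give one scalar per excerpt, which could then be compared across sound classes.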
In this model, two strategies for fusing audio and visual information were described: one for the speech sound class and one for the musical instrument sound class. The audio-visual fusion strategies defined in the model improve its prediction accuracy in the AV condition.
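The fusion equations themselves are not given in this abstract. A common baseline for combining per-frame saliency maps — shown here only as a hypothetical sketch, with an assumed class-dependent weight rather than the thesis's actual strategies — is a convex combination of normalized visual and audio saliency maps:

```python
import numpy as np

def fuse_saliency(visual_map, audio_map, w_audio=0.3):
    """Convex combination of per-frame visual and audio saliency maps.

    Both maps are 2-D arrays of the same shape; each is normalized to
    [0, 1] before mixing, and the fused map is renormalized. The weight
    w_audio could be chosen per sound class (e.g. larger for speech) --
    an assumption for illustration, not the model from the thesis.
    """
    v = visual_map / (visual_map.max() + 1e-8)
    a = audio_map / (audio_map.max() + 1e-8)
    fused = (1.0 - w_audio) * v + w_audio * a
    return fused / (fused.max() + 1e-8)
```

With w_audio = 0 this reduces to the purely visual map, which corresponds to the V condition where no soundtrack is available.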
