TalkingHeads dataset

The dataset is intended for investigating the relationship between audio and video features while people are conversing.

The dataset contains four conversations lasting, on average, 6 minutes each (more sequences are planned, so stay tuned!). The data was recorded in a 3.5 x 2.5 meter outdoor area on a cloudy day in summer 2011. The total number of subjects is 15 (1 female and 14 male), with 4 participants per conversation (only one subject took part in two conversations). The subjects include 4 academics, 5 undergraduate students, 2 MSc students, 3 postdoctoral researchers, and 1 PhD student. Their ages range between 20 and 40 years, and all were unaware of the actual goals of the experiment.

Video was captured at 25 frames per second with a camera positioned 7 meters above the floor and facing downward. The subjects were asked to wear differently colored shirts to make tracking easier. Audio was recorded at 44100 Hz with 4 wireless headset microphones, each transmitting to its own receiver. Each audio recording was segmented into speech and non-speech segments using a robust pitch-based VAD algorithm [1]. The pitch was extracted at regular time steps of 10 ms with Praat [2], a package that includes the pitch extraction technique described in [3].

  1. Khondaker, A., Ghulam, M.: Improved noise reduction with pitch enabled voice activity detection. In: ISIVC2008 (2008)
  2. Boersma, P.: Praat, a system for doing phonetics by computer. Glot International 5(9/10), 341-345 (2001)
  3. Boersma, P.: Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proceedings of the Institute of Phonetic Sciences 17, 97-110 (1993)

To obtain this dataset, send me an email with the subject line [TalkingHeads dataset]. Note that the dataset is available for research purposes only.