Network of Excellence Peer-to-Peer Tagged Media

Rich Speech Retrieval Dataset

Participants receive a set of Internet videos with accompanying metadata, automatic speech recognition transcripts, and shot information with key frames. The task requires participants to locate jump-in points for playback of a range of speech acts (e.g. advice, apology) in these videos, using features derived from speech, audio, visual content, or associated textual or social information.

Use scenario

Audio-visual archivists, media professionals, journalists and researchers spend much time exploring data to find content for various tasks, for example researching a news story or sourcing material for a documentary. For the general public too, searching for a particular point within a personal audio-visual collection can be a frustrating experience. To make these tasks more efficient, audio-visual search tools should not only find relevant items, but also find the point at which playback should begin.

The MediaEval 2011 Rich Speech Retrieval task requires participants to identify jump-in points to commence playback within items relevant to a text search request from an audio-visual collection. The task is called a ‘rich speech’ task as the queries combine both entities (names, places, events) and speech acts.

Example query: Find the optimal jump-in point for the audio-visual fragment where the mayor of Amsterdam thanks the fire department for their courageous behaviour during the disaster on New Year’s Eve.

Ground truth and evaluation

Evaluation will be based on the accuracy of the predicted jump-in point compared to a manually identified ground truth, and on the mean reciprocal rank at which the item containing the relevant jump-in point is retrieved. The jump-in point evaluation metric will be similar to that used for the Czech Speech Retrieval task at CLEF 2006, described in Oard et al. (2007).

Ground truth of the relevant jump-in point for each search topic will be generated by human annotators in a process that approximates the formulation of natural language queries.

The official evaluation metric will be mean Generalized Reciprocal Rank (mGRR), which is the mean over topics of (distance scoring metric / rank at which the relevant item is retrieved). If two or more items in the ranked list refer to the relevant segment, only the first one will be considered for evaluation.

As an additional evaluation we will also report Mean Reciprocal Rank (MRR) and Mean Jump-in Score (MJS) for the same segment, where the former is the mean of (1 / rank at which the relevant item is retrieved) and the latter is the mean of the normalized distances from the manually selected jump-in points for each topic.
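The relationship between the three metrics can be sketched as follows. The per-topic inputs here are hypothetical, and the normalization of the distance score (into [0, 1], higher meaning closer to the true jump-in point) is an assumption for illustration:

```python
# Illustrative sketch of the three metrics described above. Each topic
# result is (score, rank): score is a normalized jump-in distance score
# in [0, 1], and rank is the rank of the first item referring to the
# relevant segment, or None if it was not retrieved.

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def m_grr(results):
    # mean Generalized Reciprocal Rank: mean of (distance score / rank)
    return mean(s / r if r else 0.0 for s, r in results)

def mrr(results):
    # Mean Reciprocal Rank: mean of (1 / rank)
    return mean(1.0 / r if r else 0.0 for s, r in results)

def mjs(results):
    # Mean Jump-in Score: mean of the distance scores alone
    return mean(s for s, r in results)

results = [(1.0, 1), (0.5, 2), (0.0, None)]  # three hypothetical topics
```

With these three hypothetical topics, mGRR ≈ 0.417, MRR = 0.5 and MJS = 0.5, showing how mGRR combines the rank discount of MRR with the jump-in accuracy of MJS.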

Participants can use any available data to complete this task, including user provided data, analysis of the content itself, and external data resources.

Details of the type of each topic will be released with track results to enable participants to analyse the effectiveness of their approach for each topic class.

Release of the evaluation script

There are 4 input parameters for the script:

     1.     window_size: the size of the window (plus/minus) around the ground-truth start of the segment within which the jump-in point retrieved by the system is considered correct, in seconds (10, 30 and 60 seconds)

     2.     granularity: the granularity step used to compute the penalty for the distance from the actual jump-in point within the window, in seconds

     3.     QueryResults: the qrel file with the ground-truth information

     4.     RankedList: the run file
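The scoring logic implied by window_size and granularity might be sketched as follows. The exact penalty function is defined by the official evaluation script; the formula below is an assumption, not the official definition:

```python
import math

def jump_in_score(predicted, ground_truth, window_size=30, granularity=5):
    """Illustrative distance score based on the parameters above:
    1.0 at the ground-truth jump-in point, decreasing in
    granularity-sized steps, and 0.0 outside the window. The exact
    penalty function is defined by the official evaluation script;
    this sketch is an assumption, not the official formula."""
    distance = abs(predicted - ground_truth)           # in seconds
    if distance > window_size:
        return 0.0                                     # outside the window
    steps = math.ceil(distance / granularity)          # penalty steps incurred
    total_steps = math.ceil(window_size / granularity)
    return 1.0 - steps / (total_steps + 1)             # always > 0 in-window
```

A prediction 12 seconds off with a 30-second window and 5-second granularity incurs 3 penalty steps, scoring lower than an exact hit but higher than a prediction near the window edge.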


The data set comprises a set of shows (i.e., channels). It contains ca. 350 hours of data spanning a total of 1974 episodes. The set is predominantly English. Participants will be provided with a video file for each episode along with metadata (e.g., title + description), speech recognition transcripts (1-best transcripts and confusion networks), and shot boundaries from the video stream.


Release of the development dataset

The development dataset contains 247 episodes gathered from a range of shows.

Speech recognition transcripts, 2010

The data is predominantly English, but Dutch, French and Spanish shows are also mixed in. LIMSI and Vocapia Research have supplied transcripts for all of these languages. We believe that this language mixture is more representative of the languages found on the Web. We cannot guarantee that the language detector identified the right language, so it is possible that the language of a speech transcript is the wrong language for its video.


Speech recognition transcripts, 2011

LIMSI and Vocapia Research provided us with speech recognition transcripts obtained from the up-to-date version of their system. These transcripts are "confusion networks", meaning that for a given time point (time code) they may contain more than one recognizer hypothesis.



This file contains lines of the following form:



query ID

google style query

illocutionary act category



Illocutionary act categories correspond to the following illocutionary acts:

apology - expressive

definition - assertive

opinion - expressive

promise - commissive

warning - commissive

The distribution of the query examples across acts in the devset might not coincide with the distribution in the testset.
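For illustration, a query file in the form described above could be parsed as below. The tab delimiter and the dictionary field names are assumptions; the exact layout is defined by the released file:

```python
# Category mapping taken from the act list above.
ACT_CATEGORIES = {
    "apology": "expressive",
    "definition": "assertive",
    "opinion": "expressive",
    "promise": "commissive",
    "warning": "commissive",
}

def parse_query_line(line):
    """Parse one query line: query ID, Google-style query, and
    illocutionary act. Tab-separated fields are an assumption."""
    query_id, query_text, act = line.rstrip("\n").split("\t")
    return {"id": query_id, "query": query_text,
            "act": act, "category": ACT_CATEGORIES.get(act)}

row = parse_query_line("1\tmayor amsterdam thanks fire department\tapology")
# row["category"] is "expressive"
```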



This file contains lines of the following form:

query_ID fileName start_time end_time

1 SPI-FridaySeptember122008587 2.7 2.23

(times are given as minutes.seconds, so 2.7 = 2 minutes 7 seconds and 2.23 = 2 minutes 23 seconds)
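Because the timestamps use a minutes.seconds notation rather than decimal minutes, a small helper avoids misreading them. This assumes the field after the dot is a plain seconds count, as in the "2.7 = 2 minutes 7 seconds" example above:

```python
def to_seconds(timestamp):
    """Convert the minutes.seconds notation used above into plain
    seconds: "2.7" means 2 minutes 7 seconds (127 s), not 2.7 minutes."""
    minutes, seconds = timestamp.split(".")
    return int(minutes) * 60 + int(seconds)

to_seconds("2.7")   # 127
to_seconds("2.23")  # 143
```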


User generated transcript of the relevant segment

This file contains lines of the following form:

query_ID transcript


Shot boundaries and key frames

Shot segmentation was carried out automatically by software from the Technische Universitaet Berlin. Note that, because of the automatic detection procedure, the shot boundary information will not necessarily be perfect. The shot segmentation consists of three items for each video.


  • .xml Contains the time markers for the shot. Each shot is listed as a segment with a start time, end time, and name of the representative keyframe extracted for the shot.
  • .Shot If you would like to make use of information about the type of the shot boundaries, you should look at this file. Each shot boundary is given in terms of its key frame index and is assigned a type code: HardCut = 10, Blank/Fade = 20, Dissolve_Mid = 35
  • a folder containing the key frames (in .jpg format) referenced by the corresponding .xml


The shot detection is described in the paper:

Kelm, P. Schmiedeke, S. and Sikora, T. 2009. Feature-based video key frame extraction for low quality video sequences. Proceedings of the 10th Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '09), pp.25-28, 6-8 May 2009. Note, however, that for each shot a key frame is extracted in the middle of the sequence.



The metadata consists of the metadata that was assigned by the creator to the episode upon upload. In particular, the element containing the episode title and the element containing a description of the episode can be useful for predicting the genre.



This file contains lines of the following form:

1082    buddhism

tag_ID     tag_name

Reference labels for the development dataset in trec_eval format.

This file contains lines of the following form:

1082 0 BG_10523 0

tag_ID iter docno rel


The tag_ID is the tag code from the tag list; iter can be ignored; docno is the archive number identifying the episode; and the relevance, rel, indicates whether the tag was manually assigned to that episode (rel is "1" if it was assigned and "0" if not).
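As a sketch, such a reference-label line can be parsed as follows, with the fields handled as described above:

```python
def parse_qrel_line(line):
    """Parse a trec_eval-style reference label line:
    tag_ID iter docno rel, e.g. "1082 0 BG_10523 0".
    iter is ignored; rel "1" means the tag was manually assigned."""
    tag_id, _iter, docno, rel = line.split()
    return {"tag_id": tag_id, "docno": docno, "assigned": rel == "1"}

label = parse_qrel_line("1082 0 BG_10523 0")
# label["assigned"] is False: the tag was not assigned to this episode
```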







Contact: Martha Larson, m.a.(lastname)