Rich Speech Retrieval Dataset
Participants receive a set of Internet videos from blip.tv together with accompanying metadata, automatic speech recognition transcripts, and shot information with key frames. The task requires participants to locate jump-in points for playback of a range of speech acts (e.g. advice, apology) in the Internet videos, using features derived from the speech, audio, or visual content, or from associated textual or social information.
Use scenario
Audio-visual archivists, media professionals, journalists and researchers spend much time exploring data to find content for various tasks, for example when researching a news story or providing material for a documentary. For the general public as well, searching for a particular point within a personal audio-visual collection can be a frustrating experience. In order to improve the efficiency of these tasks, audio-visual search tools should be able not only to find relevant items, but also to find the point at which playback should begin.
The MediaEval 2011 Rich Speech Retrieval task requires participants to identify jump-in points to commence playback within items relevant to a text search request from an audio-visual collection. The task is called a ‘rich speech’ task as the queries combine both entities (names, places, events) and speech acts.
Example query: Find the optimal jump-in point for the audio-visual fragment where the mayor of Amsterdam thanks the fire department for their courageous behaviour during the disaster on New Year’s Eve.
Ground truth and evaluation
Evaluation will be based on the accuracy of the predicted jump-in point compared to a manually identified ground truth, and also on the mean reciprocal rank at which the relevant item containing the relevant jump-in point is retrieved. The jump-in point evaluation metric will be similar to that used for the Czech Speech Retrieval task at CLEF 2006, described in Oard et al. (2007).
Ground truth of the relevant jump-in point for each search topic will be generated by human annotators in a process that approximates the formulation of natural language queries.
The official evaluation metric will be mean Generalized Reciprocal Rank (mGRR), which is the mean over topics of (distance scoring metric / rank at which the relevant item is retrieved). If two or more items in the ranked list refer to the relevant segment, only the first one will be considered for evaluation.
As an additional evaluation we will also report Mean Reciprocal Rank (MRR) and Mean Jump-in Score (MJS) for the same segment, where the former is the mean of (1 / rank at which the relevant item is retrieved) and the latter is the mean of the normalized distances from the manually selected jump-in points for each topic.
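To make the definitions above concrete, here is a minimal Python sketch of the three reported scores, assuming that for each topic the rank of the first item referring to the relevant segment and its normalized jump-in (distance) score are already known. All names and example numbers are illustrative only, not part of the official evaluation script.

```python
# Illustrative sketch of MRR, mGRR and MJS. A rank of None means that no item
# referring to the relevant segment was retrieved for that topic.

def mean_reciprocal_rank(ranks):
    """MRR: mean of 1/rank over all topics."""
    return sum((1.0 / r) if r else 0.0 for r in ranks) / len(ranks)

def mean_generalized_reciprocal_rank(ranks, jump_in_scores):
    """mGRR: mean of (jump-in distance score / rank) over all topics."""
    total = 0.0
    for rank, score in zip(ranks, jump_in_scores):
        if rank:  # only the first item referring to the relevant segment counts
            total += score / rank
    return total / len(ranks)

def mean_jump_in_score(jump_in_scores):
    """MJS: mean of the normalized distance scores, independent of rank."""
    return sum(jump_in_scores) / len(jump_in_scores)

# Example: three topics; for the third one nothing relevant was retrieved.
ranks = [1, 4, None]
scores = [1.0, 0.6, 0.0]
print(mean_reciprocal_rank(ranks))                       # 0.4166...
print(mean_generalized_reciprocal_rank(ranks, scores))   # (1.0 + 0.15 + 0) / 3 = 0.3833...
print(mean_jump_in_score(scores))                        # 0.5333...
```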
Participants can use any available data to complete this task, including user provided data, analysis of the blip.tv content itself, and external data resources.
Details of the type of each topic will be released with track results to enable participants to analyse the effectiveness of their approach for each topic class.
Release of the evaluation script
There are 4 input parameters for the script:
1. window_size: the size of the window (plus/minus) around the ground-truth start of the segment within which a jump-in point retrieved by the system is considered correct, in seconds (10, 30 and 60 seconds)
2. granularity: the step size used to compute the penalty for the distance from the actual jump-in point within the window, in seconds
3. QueryResults: the qrel file with the ground-truth information
4. RankedList: the run file (ranked result list) produced by the participant's system
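The exact penalty formula is defined by the official script; the following is only a hypothetical sketch of how window_size and granularity could combine into the distance score, assuming full credit at the ground-truth start, a loss of credit for each granularity step away from it, and zero credit outside the window.

```python
# Hypothetical windowed distance score. The official evaluation script may use
# a different penalty formula; this only illustrates the role of the parameters.

def jump_in_score(retrieved_start, groundtruth_start, window_size=30, granularity=10):
    distance = abs(retrieved_start - groundtruth_start)   # seconds
    if distance > window_size:
        return 0.0                                        # outside the tolerance window
    steps = int(distance // granularity)                  # number of penalty steps away
    return max(0.0, 1.0 - steps * granularity / float(window_size))

print(jump_in_score(95.0, 100.0))   # within the first step -> 1.0
print(jump_in_score(115.0, 100.0))  # one step away -> 1.0 - 10/30 = 0.666...
print(jump_in_score(140.0, 100.0))  # outside the 30 s window -> 0.0
```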
Data
The data set comprises a set of blip.tv shows (i.e., channels). It contains ca. 350 hours of data spanning a total of 1974 episodes. The set is predominantly English. Participants will be provided with a video file for each episode along with metadata (e.g., title + description), speech recognition transcripts (1-best transcripts and confusion networks), and shot boundaries from the video stream.
Release of the development dataset
The development dataset contains 247 episodes gathered from a range of blip.tv shows.
Speech recognition transcripts, 2010
The data is predominantly English, but there are also Dutch, French and Spanish shows mixed in. LIMSI and Vocapia Research have supplied us with transcripts for all of these languages. We believe that this language mixture is representative of the languages found on the Web. We cannot guarantee that the language detector identified the correct language, so it is possible that the speech transcript for a video is in the wrong language.
Speech recognition transcripts, 2011
LIMSI and Vocapia Research provided us with the speech recognition transcripts obtained from the up-to-date version of their system. These transcripts are "confusion networks", meaning that for a given time point (time code) they may contain more than one recognizer hypothesis.
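As a purely illustrative sketch (the actual confusion-network file format is not reproduced here), the structure below shows the idea of several competing word hypotheses per time point, from which a 1-best transcript can be read off by keeping the top hypothesis in each slot.

```python
# Hypothetical in-memory representation of a confusion network:
# (start_time_in_seconds, [(word, posterior), ...]) per time slot.
confusion_network = [
    (12.4, [("mayor", 0.82), ("major", 0.15), ("<eps>", 0.03)]),
    (12.9, [("of", 0.97), ("<eps>", 0.03)]),
    (13.1, [("Amsterdam", 0.91), ("Amsterdam's", 0.09)]),
]

# 1-best reading: keep the highest-posterior hypothesis in each slot,
# dropping epsilon (no word) hypotheses.
one_best = [max(hyps, key=lambda h: h[1])[0] for _, hyps in confusion_network]
print(" ".join(w for w in one_best if w != "<eps>"))  # "mayor of Amsterdam"
```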
Queries
This file contains lines of the following form:
query_ID query google_style_query illocutionary_act_category
Illocutionary act categories correspond to the following illocutionary acts from http://en.wikipedia.org/wiki/Illocutionary_act:
apology - expressive
definition - assertive
opinion - expressive
promise - commissive
warning - commissive
The distribution of the query examples across acts in the devset might not coincide with the distribution in the testset.
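A minimal parsing sketch for the query file described above, assuming tab-separated fields and a hypothetical file name; adjust the delimiter and path to match the released file.

```python
# Read the query file, assuming one tab-separated record per line with the
# four fields listed above. File name and delimiter are assumptions.
import csv

with open("dev_queries.tsv", newline="", encoding="utf-8") as f:
    for query_id, query, google_query, act in csv.reader(f, delimiter="\t"):
        print(query_id, act, "->", google_query)
```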
Groundtruth
This file contains lines of the following form:
query_ID fileName start_time end_time
1 SPI-FridaySeptember122008587 2.7 2.23
(times are given as minutes.seconds: 2.7 = 2 minutes 7 seconds, 2.23 = 2 minutes 23 seconds)
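Since the times use minutes.seconds notation rather than decimal minutes, a small conversion helper such as the following sketch can be handy.

```python
# Convert the minutes.seconds notation used in the groundtruth file to seconds
# (e.g. "2.7" means 2 minutes 7 seconds, i.e. 127 seconds).

def to_seconds(time_str):
    minutes, seconds = time_str.split(".")
    return int(minutes) * 60 + int(seconds)

print(to_seconds("2.7"))   # 127
print(to_seconds("2.23"))  # 143
```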
User-generated transcript of the relevant segment
This file contains lines of the following form:
query_ID transcript
Shot boundaries and key frames
Shot segmentation was carried out automatically by software from the Technische Universitaet Berlin. Note that, because of the automatic detection procedure, the shot boundary information will not necessarily be perfect. The shot segmentation consists of three items for each video:
- .xml Contains the time markers for the shots. Each shot is listed as a segment with a start time, an end time, and the name of the representative key frame extracted for the shot.
- .Shot If you would like to make use of information about the type of the shot boundaries, you should look at this file. Each shot boundary is given in terms of its key frame index and is assigned a type code: HardCut = 10, Blank/Fade = 20, Dissolve_Mid = 35
- a folder containing the key frames (in .jpg format) referenced by the corresponding .xml
The shot detection is described in the paper:
Kelm, P., Schmiedeke, S. and Sikora, T. 2009. Feature-based video key frame extraction for low quality video sequences. In Proceedings of the 10th Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '09), pp. 25-28, 6-8 May 2009. Note, however, that for each shot a key frame is extracted in the middle of the sequence.
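For orientation, the following sketch reads the .xml shot file under an assumed schema; the element and attribute names used here (segment, start, end, keyframe) are guesses rather than a specification, so check the released files for the exact names. It also maps the numeric boundary type codes from the .Shot file to readable labels.

```python
# Read shot segments from the per-video .xml file. File name, element tag and
# attribute names below are assumptions about the schema, not a specification.
import xml.etree.ElementTree as ET

shots = []
for seg in ET.parse("episode_shots.xml").getroot().iter("segment"):
    shots.append({
        "start": float(seg.get("start")),
        "end": float(seg.get("end")),
        "keyframe": seg.get("keyframe"),
    })

# Readable labels for the numeric type codes found in the .Shot file.
BOUNDARY_TYPES = {10: "HardCut", 20: "Blank/Fade", 35: "Dissolve_Mid"}
```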
Metadata
The metadata consists of the metadata that was assigned by the creator to the blip.tv episode upon upload. In particular, the title element (containing the episode title) and the description element (containing a description of the episode) can be useful for predicting the genre.
Tags
This file contains lines of the following form:
tag_ID tag_name
1082 buddhism
Reference labels for the development dataset in trec_eval format
This file contains lines of the following form:
1082 0 BG_10523 0
tag_ID iter docno rel
The tag_ID is the tag code from the tag list, iter can be ignored, docno is the archive number identifying the episode, and the relevance, rel, indicates whether the tag was manually assigned to that episode (rel is "1" if it was assigned and "0" if not).
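A small sketch for reading this qrel-style file, assuming whitespace-separated fields and a hypothetical file name; it collects the (tag_ID, docno) pairs that were manually assigned.

```python
# Collect manually assigned (tag_ID, docno) pairs from the trec_eval-format
# reference labels. The file name is an assumption.
assigned = set()
with open("dev_tag_qrels.txt", encoding="utf-8") as f:
    for line in f:
        tag_id, _iter, docno, rel = line.split()
        if rel == "1":
            assigned.add((tag_id, docno))

print(("1082", "BG_10523") in assigned)  # False for the example line above (rel = 0)
```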
Contact
Martha Larson
m.a.(lastname)@tudelft.nl