Network of Excellence Peer-to-Peer Tagged Media

SpokenWeb Search Dataset

The task involves searching FOR audio content WITHIN audio content USING an audio content query. Participants must build a language-independent audio search system that, given an audio query, finds the appropriate audio file(s) and the (approximate) location of the query term within those file(s). As a contrastive condition (i.e. a "general" run), participants can also run systems not based on an audio query, since we will also provide the search term in lexical form. Note that language labels and pronunciation dictionaries will not be provided; the lexical form cannot be used to deduce the language in the audio-only condition.

Use scenario

Imagine you want to build a simple speech recognition system, or at least a spoken term detection (STD) system, for a new dialect for which only very few audio examples are available. Perhaps there is not even a written form for that dialect. Is it possible to do something useful (i.e. identify the topic of a query) using only very limited resources?

Ground truth and evaluation

The ground truth is created manually by native speakers, and provided by the task organizers, following the principles of NIST's Spoken Term Detection (STD) evaluations.

The primary evaluation metric will be ATWV (Average Term Weighted Value). For information on ATWV, please consult the NIST paper below.
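To make the metric concrete, here is a minimal sketch of the Term Weighted Value computation following the NIST STD 2006 definitions. The beta weight of 999.9 and the one-trial-per-second convention for counting non-target trials are assumptions taken from that evaluation plan, not from this task description; consult the released scoring software for the authoritative computation.

```python
BETA = 999.9  # false-alarm weight used in the NIST STD 2006 evaluation plan

def atwv(terms, speech_duration_s):
    """Actual Term Weighted Value (sketch).

    terms: list of dicts, one per query term, with counts
           n_true (true occurrences), n_correct (correct detections),
           n_fa (false alarms).
    speech_duration_s: total seconds of audio searched
           (assumption: one detection trial per second).
    """
    twv_sum = 0.0
    for t in terms:
        p_miss = 1.0 - t["n_correct"] / t["n_true"]
        n_nontarget = speech_duration_s - t["n_true"]
        p_fa = t["n_fa"] / n_nontarget
        twv_sum += 1.0 - (p_miss + BETA * p_fa)
    return twv_sum / len(terms)

# e.g. one term, 4 true hits, 3 found, 1 false alarm in 1004 s of speech:
# atwv([{"n_true": 4, "n_correct": 3, "n_fa": 1}], 1004) is about -0.2499
```

Note how heavily the beta weight punishes false alarms: a single false alarm in roughly 17 minutes of speech already drags the score below zero in the example above.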


Participants are provided with a data set that has been kindly made available by the Spoken Web team at IBM Research, India. The audio content is spontaneous speech that was created over the phone in a live setting by low-literate users. While most of the audio content is related to farming practices, other domains are also represented. The data set comprises audio from four different Indian languages: English, Hindi, Gujarati and Telugu. Each data item is approximately 4-30 seconds in length.

As already mentioned above, participants are allowed to use any additional resources they might have available, as long as their use is documented in the working notes paper.

Release of Development Data

The development set contains 400 utterances (100 per language) and 64 queries (16 per language), all as 8 kHz / 16-bit WAV audio files. For each query, we will also provide the lexical form in UTF-8 encoding. For each utterance, we provide 0-n matching queries (but not the location of the match).
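Since all files share the 8 kHz / 16-bit WAV format, a quick sanity check of downloaded data can be done with Python's standard-library wave module. This is a minimal sketch; the in-memory demo file is synthetic and stands in for a real dataset file.

```python
import io
import struct
import wave

def check_format(path_or_file):
    """Return True if the WAV file matches the stated 8 kHz / 16-bit format."""
    with wave.open(path_or_file, "rb") as w:
        return w.getframerate() == 8000 and w.getsampwidth() == 2

# Self-contained demo on an in-memory WAV (no dataset file assumed):
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(8000)   # 8 kHz
    w.writeframes(struct.pack("<4h", 0, 1000, -1000, 0))
buf.seek(0)
print(check_format(buf))  # True
```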

There are four directories in the Spoken Web data. "Audio" contains the 400 audio files in the four languages. "Transcripts" contains the corresponding word-level transcriptions of the audio files in Roman characters. "QueryAudio" contains the 64 audio query terms. "QueryTranscripts" contains the corresponding word-level Roman transcriptions of the QueryAudio files. The file "Mapping.txt" shows which query is present in which audio file.
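The exact layout of Mapping.txt is not specified above, so the following loader is only a hypothetical sketch: it assumes each line pairs a query identifier with a whitespace-separated list of the audio files containing it, separated by a tab. Adjust the parsing to the actual file once downloaded.

```python
def load_mapping(lines):
    """Parse a hypothetical Mapping.txt layout:
        <query_id>\t<audio_file_1> <audio_file_2> ...
    Returns {query_id: [audio files containing that query]}.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        query_id, _, files = line.partition("\t")
        mapping[query_id] = files.split()
    return mapping

# Demo with made-up identifiers (illustrative only):
demo = ["Q_001\taudio_017.wav audio_042.wav", "Q_002\taudio_003.wav"]
print(load_mapping(demo)["Q_001"])  # ['audio_017.wav', 'audio_042.wav']
```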

Release of Test Data

The test set consists of 200 utterances (50 per language) and 36 queries (9 per language) as audio files, with the same characteristics. We will again provide the lexical form of the query, but not the matching utterances.

The evaluation data consists of two directories. The EvaluationAudio directory contains the 200 utterance audio files, 50 for each of the four languages. The EvaluationQueryAudio directory contains the 36 audio query terms, 9 from each of the four languages.

The written form of the search queries is also provided in the directory EvaluationQueryTranscripts (EvaluationTranscripts may be made available later).

Data will be provided as a "termlist" XML file, in which the "termid" corresponds to the filename of the audio query. This information will be packaged together with the scoring software (see below), for example:
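As an illustration, the snippet below parses a NIST STD-style termlist with Python's standard xml.etree module. The term IDs, attributes and term texts shown here are made up for the example; the exact schema is defined by the file packaged with the scoring software.

```python
import xml.etree.ElementTree as ET

# Illustrative termlist only; attributes and IDs below are assumptions.
# In the real file, "termid" corresponds to the audio query filename.
TERMLIST = """<termlist ecf_filename="dev.ecf.xml" version="1">
  <term termid="sws_dev_001">
    <termtext>example query</termtext>
  </term>
  <term termid="sws_dev_002">
    <termtext>another query</termtext>
  </term>
</termlist>"""

root = ET.fromstring(TERMLIST)
term_ids = [t.get("termid") for t in root.iter("term")]
print(term_ids)  # ['sws_dev_001', 'sws_dev_002']
```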






Contact: Martha Larson, m.a.(lastname)