Network of Excellence Peer-to-Peer Tagged Media

Genre Tagging Dataset

Participants receive a set of Internet videos together with accompanying metadata, automatic speech recognition transcripts, and shot information with key frames. This task requires participants to automatically assign genre tags (e.g., "politics", "sports", "art") to Internet videos using features derived from speech, audio, visual content or associated textual or social information.

Use scenario

Genre information in the form of genre tags can provide valuable support for users searching and browsing the Internet for video. Much video is, however, not tagged or not optimally tagged. This task attempts to automatically generate genre labels of the kind used to organize videos on online video platforms.


The dataset for the MediaEval 2011 Genre Tagging Task was created by PetaMedia, the Network of Excellence that sponsors MediaEval 2011. Dataset creation involved collecting Creative Commons video and also collecting the Twitter social network data associated with it. LIMSI and Vocapia Research supplied the automatic speech recognition (ASR) transcripts for this task (more information on ASR transcripts below).

Please note that although the video is Creative Commons licensed, the accompanying resources are not. For this reason, you were asked to fill out a usage agreement before receiving the dataset for this task.

Release of development data

The development dataset contains 247 episodes gathered from a range of shows.

Genre Tag ground truth

Ground truth (reference labels) for the development set is made available in the trec_eval format. In the terminology of trec_eval, this file is referred to as the "qrels" file.

This file records the membership of all 247 videos of the devset in the 26 genres and contains 247 x 26 lines. The structure of the file is described in the following example that shows two lines of the file:


qid  iter  docno                            rel

1013 0     Gsiemens-IETWeek1218             0

1013 0     Gvtvnews-GVTVNEWS010709NCTV11940 1

The qid is the code for a genre label, iter can be ignored, docno is the identifier for the episode, and the relevance, rel, indicates whether the qid is correctly assigned to the docno. If it is a correct label for that episode, rel is "1"; if it is a wrong label for that episode, rel is "0". Episode 'Gsiemens-IETWeek1218', for example, does not have the genre label with ID 1013 (= 'movies_and_television'). Episode 'Gvtvnews-GVTVNEWS010709NCTV11940', on the other hand, is ground-truth labeled with this genre tag.
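To make the qrels format concrete, here is a minimal sketch (Python, standard library only) of reading qrels lines into a per-video set of genre labels. The function name is our own; it is not part of the released tools.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Parse trec_eval qrels lines into {docno: set of genre qids}.

    Each line has four whitespace-separated fields: qid, iter, docno, rel.
    Only lines with rel == "1" contribute a label; docnos seen only with
    rel == "0" are still recorded, with an empty label set.
    """
    labels = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid, _iter, docno, rel = fields
        if rel == "1":
            labels[docno].add(qid)
        else:
            labels.setdefault(docno, set())
    return dict(labels)

# The two example lines from the qrels excerpt above:
sample = [
    "1013 0 Gsiemens-IETWeek1218 0",
    "1013 0 Gvtvnews-GVTVNEWS010709NCTV11940 1",
]
labels = parse_qrels(sample)
```

In a full qrels file each docno appears once per qid, so after parsing, a video's set contains exactly its ground-truth genre labels.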

The following table summarizes the distribution of the videos in the development set over the genres. Do not assume that this distribution will also hold for the test set.


qid  Genre                          no. of dev videos
1000 art                            4
1001 autos_and_vehicles
1002 business
1003 citizen_journalism             11
1004 comedy
1005 conferences_and_other_events   2
1006 default_category               40
1007 documentary
1008 educational                    19
1009 food_and_drink                 4
1010 gaming                         4
1011 health
1012 literature
1013 movies_and_television
1014 music_and_entertainment
1015 personal_or_auto-biographical
1016 politics                       45
1017 religion                       16
1018 school_and_education
1019 sports
1020 technology                     27
1021 the_environment
1022 the_mainstream_media
1023 travel
1024 videoblogging                  10
1025 web_development_and_sites
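A distribution like the one in the table can be recomputed directly from the qrels file by tallying, per genre qid, the lines with rel "1". A small sketch (the sample lines below are invented for illustration, not taken from the devset):

```python
from collections import Counter

def genre_counts(qrels_lines):
    """Tally, per genre qid, how many videos carry that label (rel == "1")."""
    counts = Counter()
    for line in qrels_lines:
        fields = line.split()
        if len(fields) == 4 and fields[3] == "1":
            counts[fields[0]] += 1
    return counts

# Illustrative lines only -- not real devset entries.
sample = [
    "1016 0 videoA 1",
    "1016 0 videoB 1",
    "1000 0 videoA 0",
    "1000 0 videoC 1",
]
counts = genre_counts(sample)
```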



The metadata is the metadata that was assigned by the creator to the episode upon upload. In particular, the element containing the episode title and the element containing a description of the episode can be useful for predicting the genre.
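As one illustration of how the title and description text could feed a simple tagger, here is a hedged, standard-library-only sketch that scores each genre by keyword hits in the concatenated metadata text. The keyword lists are invented for illustration and are in no way derived from the dataset; a real system would learn such associations from the devset.

```python
# Hypothetical keyword lists per genre qid -- illustrative only.
GENRE_KEYWORDS = {
    "1016": {"election", "senate", "policy", "government"},   # politics
    "1019": {"match", "league", "score", "tournament"},       # sports
    "1000": {"gallery", "painting", "exhibition", "artist"},  # art
}

def tag_by_keywords(title, description):
    """Return the genre qids whose keywords appear in the metadata text."""
    words = set((title + " " + description).lower().split())
    return {qid for qid, kws in GENRE_KEYWORDS.items() if words & kws}

tags = tag_by_keywords("Senate debate highlights",
                       "Coverage of the government policy debate")
```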

Speech recognition transcripts 

The data is predominantly English, but there are also Dutch, French and Spanish shows mixed in. LIMSI and VECSYS Research have supplied us with transcripts for all of these languages. We believe that this language mixture is more representative of the languages on the Web. We cannot guarantee that the language detector picked the right language, so the language of a speech transcript may be the wrong language for its video.


Speech recognition transcripts, 2011

LIMSI and Vocapia Research provided us with the speech recognition transcripts obtained with the most recent version of their system. These transcripts are "confusion networks", meaning that for a given time point (time code) they may contain more than one hypothesis from the recognizer.
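A common way to use a confusion network is to collapse it to its single best path by keeping the highest-confidence hypothesis at each time code. The sketch below assumes a simplified in-memory representation, a list of (time_code, [(word, confidence), ...]) entries; the actual file format of the released transcripts may differ, so check it against the data.

```python
def best_path(confusion_network):
    """Pick the highest-confidence word at each time code.

    `confusion_network` is assumed to be a list of
    (time_code, [(word, confidence), ...]) entries -- a simplified
    stand-in for the released transcript format.
    """
    words = []
    for _time, hypotheses in confusion_network:
        word, _conf = max(hypotheses, key=lambda h: h[1])
        words.append(word)
    return " ".join(words)

# Illustrative confusion network with competing hypotheses.
cn = [
    (0.00, [("the", 0.9), ("a", 0.1)]),
    (0.35, [("whether", 0.4), ("weather", 0.6)]),
    (0.80, [("report", 0.95)]),
]
transcript = best_path(cn)
```

Keeping all hypotheses, rather than only the best path, can help recall when the transcripts are used for genre classification.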

Shot boundaries and key frames

Shot segmentation was carried out automatically by software from the Technische Universitaet Berlin. Note that, because of the automatic detection procedure, shot boundary information will not necessarily be perfect. The shot segmentations consist of three items for each video.


  • .xml Contains the time markers for the shots. Each shot is listed as a segment with a start time, an end time, and the name of the representative keyframe extracted for the shot.
  • .Shot If you would like to make use of information about the type of the shot boundaries, you should look at this file. Each shot boundary is given in terms of its keyframe index and is assigned a type code: HardCut = 10, Blank/Fade = 20, Dissolve_Mid = 35
  • a folder containing the key frames (in .jpg format) referenced by the corresponding .xml
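Reading the per-shot segments out of the .xml file is a small parsing job. The element and attribute names used below ('segment', 'start', 'end', 'keyframe') are assumptions for illustration, since the exact schema is not reproduced here; adjust them to match the released files.

```python
import xml.etree.ElementTree as ET

# Boundary type codes from the .Shot files (as documented above).
BOUNDARY_TYPES = {10: "HardCut", 20: "Blank/Fade", 35: "Dissolve_Mid"}

def parse_shots(xml_text):
    """Extract one (start, end, keyframe) tuple per shot segment.

    Tag and attribute names are assumed, not taken from the dataset schema.
    """
    root = ET.fromstring(xml_text)
    shots = []
    for seg in root.iter("segment"):
        shots.append((float(seg.get("start")),
                      float(seg.get("end")),
                      seg.get("keyframe")))
    return shots

# A made-up example in the assumed schema:
example = """<video>
  <segment start="0.0" end="4.2" keyframe="shot_0001.jpg"/>
  <segment start="4.2" end="9.8" keyframe="shot_0002.jpg"/>
</video>"""
shots = parse_shots(example)
```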

The shot detection is described in the paper:

Kelm, P., Schmiedeke, S., and Sikora, T. 2009. Feature-based video key frame extraction for low quality video sequences. Proceedings of the 10th Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '09), pp. 25-28, 6-8 May 2009. Note, however, that for each shot a key frame is extracted in the middle of the sequence.


Social data 

The social data is gathered from Twitter. The collection was created such that it contains videos (i.e., episodes) that have been referenced (i.e., linked to) in tweets by Twitter users. The video identifiers are necessary for working with the rest of the Twitter social data associated with the collection.

This directory contains the following sub-directories:


Level0:

Information from Level 0 users.

T1, T2, T3:

Information from Level 1 and Level 2 users


Here is an explanation of the levels:


Level 0 users: These are users who have referenced videos in their tweets. (Their profiles are in a file named 'authors'; their posts are in 'blipposts'.)

Level 1 users: Friends and followers of all Level 0 users

Level 2 users: Friends of all Level 1 users. Their profiles are provided. (Followers of Level 1 users were not collected.)


NB: There are many more videos (i.e., episodes) included in the social network than are contained in the devset (or in the devset + testset, for that matter). The users and posts related to these videos may or may not be useful; if you decide to make use of the social data, judging their usefulness is something you must do for yourself.


In User_postsLX.xml users are listed with their tweets as sub-elements, as follows:




          Tue Nov 03 05:53:33 +0000 2009  

          Still feeling lethargic from a dinner of perogies, garlic sausage and homemade cabbage rolls. #burp  






NB: For Level 0, this information is contained in a file in the Level0 directory named blipposts.txt (although not in a nice .xml format).


In User_profilesLX.xml user profiles are listed using the following format:
        JD Rucker
        Orange County, CA
        Social Media. It's what I do. I love to find and share high-quality content.
        Sat Dec 29 05:57:41 +0000 2007
        0084B4                     BDDCAD


NB: For Level 0, links to the profiles of the users are contained in a file in the Level0 directory named 'authors'.

In User_followersLX.xml and User_friendsLX.xml, users are listed with their followers (or friends, respectively) as sub-elements, as follows:




NB: T1, T2, T3 are folders containing three parts of the data. The videos were split over three machines for the purposes of crawling the related social networks. For this reason, the data in these folders can be expected to overlap in terms of Level 1 and Level 2 users.
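Because of this overlap, merging the Level 1 and Level 2 users from the three crawl partitions should deduplicate by user id. A minimal sketch, with invented user ids purely for illustration:

```python
def unique_users(*folders):
    """Union the user ids collected from several crawl partitions.

    Each argument is an iterable of user ids from one folder (T1, T2, T3);
    the same Level 1/2 user may appear in more than one partition, so a
    set union removes the duplicates.
    """
    seen = set()
    for folder in folders:
        seen.update(folder)
    return seen

# Invented ids for illustration; note the overlap between partitions.
t1 = ["u1", "u2", "u3"]
t2 = ["u2", "u4"]
t3 = ["u3", "u5"]
users = unique_users(t1, t2, t3)
```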


Martha Larson




Contact: Martha Larson, m.a.(lastname)