Network of Excellence Peer-to-Peer Tagged Media

MediaEval

  • Overview

    In this section you will find datasets that have been used in research on multimedia; each dataset below is associated with a MediaEval benchmark task.

  • Placing Task Dataset

    The task is part of the MediaEval Benchmark and requires participants to assign geographical coordinates (latitude and longitude) to each provided test video. Participants can make use of metadata, audio and visual features, as well as external resources, depending on the run. Note that at least one run that uses only audio/visual features is required.
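
    Predictions of this kind are commonly evaluated by the great-circle distance between the predicted and ground-truth coordinates. The following is a minimal sketch of that distance computation using the haversine formula; the function name and example coordinates are illustrative only and not part of the dataset:

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometres between two (lat, lon) points."""
        r = 6371.0  # mean Earth radius in km
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    # Example: distance between a predicted and a ground-truth location
    print(haversine_km(48.8566, 2.3522, 51.5074, -0.1278))  # Paris -> London, roughly 344 km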

  • Genre Tagging Dataset

    Participants receive a set of Internet videos from blip.tv and accompanying metadata, automatic speech recognition transcripts and shot information with key frames. This task requires participants to automatically assign genre tags (e.g., "politics", "sports", "art") to Internet videos using features derived from speech, audio, or visual content, or from associated textual or social information.
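
    As an illustration, a text-only baseline could treat the ASR transcripts or metadata as documents and train a standard classifier over them. The sketch below follows that idea; the toy training texts, labels, and the choice of scikit-learn are assumptions for illustration and are not part of the task data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy stand-ins for per-video transcripts and their genre labels
    train_texts = ["the senate voted on the new bill today",
                   "the home team scored in the final minute"]
    train_genres = ["politics", "sports"]

    # Bag-of-words TF-IDF features feeding a linear classifier
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(train_texts, train_genres)

    print(model.predict(["the team scored in the final minute of the match"]))  # expected: ['sports']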

  • Rich Speech Retrieval Dataset

    Participants receive a set of Internet videos from blip.tv and accompanying metadata, automatic speech recognition transcripts and shot information with key frames. This task requires participants to locate jump-in points for playback of a range of speech acts (e.g., advice, apology) in the Internet videos using features derived from speech, audio, or visual content, or from associated textual or social information.
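
    One simple way to approximate jump-in points is to scan time-aligned ASR output for cue words associated with a speech act and return the corresponding segment start times. The sketch below assumes a hypothetical transcript representation (a list of (start_seconds, text) segments); the actual transcripts ship in their own format:

    def find_jump_in_points(segments, cue_words):
        """Return playback start times of segments whose text contains a cue word."""
        hits = []
        for start, text in segments:
            if any(cue in text.lower() for cue in cue_words):
                hits.append(start)
        return hits

    # Toy time-aligned transcript segments (start time in seconds, text)
    segments = [(12.4, "welcome back everyone"),
                (33.0, "I am really sorry about the delay"),
                (71.5, "my advice would be to back up first")]

    print(find_jump_in_points(segments, ["sorry", "apology"]))  # [33.0]
    print(find_jump_in_points(segments, ["advice"]))            # [71.5]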

  • SpokenWeb Search Dataset

    The task involves searching for audio content within audio content using an audio content query. The task requires researchers to build a language-independent audio search system that, given an audio query, finds the appropriate audio file(s) and the (approximate) location of the query term within the audio file(s). As a contrastive condition (i.e., a "general" run), participants can also run systems not based on an audio query, as we will also provide the search term in lexical form. Note that language labels and pronunciation dictionaries will not be provided. The lexical form cannot be used to deduce the language in the audio-only condition.
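
    A common language-independent baseline for query-by-example search is dynamic time warping (DTW) over frame-level acoustic features such as MFCCs. The sketch below computes a length-normalised DTW cost between a query and an utterance; feature extraction is omitted and the random arrays merely stand in for real features (localising a match within a long recording would additionally require a subsequence variant of DTW):

    import numpy as np

    def dtw_distance(query, utterance):
        """DTW alignment cost between two (frames x dims) feature matrices."""
        n, m = len(query), len(utterance)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(query[i - 1] - utterance[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m] / (n + m)  # length-normalised alignment cost

    rng = np.random.default_rng(0)
    query = rng.normal(size=(20, 13))      # e.g. 20 frames of 13-dim MFCC-like features
    utterance = rng.normal(size=(200, 13))
    print(dtw_distance(query, utterance))  # lower cost = better match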

  • Affect Task: Violent Scenes Detection Dataset

    This task requires participants to deploy multimodal features to automatically detect portions of movies containing violent material. Violence is defined as "physical violence or accident resulting in human injury or pain". Participants may use any features automatically extracted from the video, including the subtitles. No additional external data, such as metadata collected from the Internet, may be used in this task; only the content of the DVDs is allowed for feature extraction.
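
    One common way to combine modalities is late fusion: separate audio and visual classifiers score each shot, and the scores are merged and thresholded. The sketch below illustrates this with made-up shot boundaries, scores, and weights; none of the numbers come from the dataset:

    # Hypothetical per-shot violence scores from separate audio and visual classifiers
    shots = [(0.0, 4.2), (4.2, 9.8), (9.8, 15.0)]  # (start, end) in seconds
    audio_scores = [0.10, 0.85, 0.40]
    visual_scores = [0.05, 0.90, 0.70]

    threshold = 0.5
    for (start, end), a, v in zip(shots, audio_scores, visual_scores):
        fused = 0.5 * a + 0.5 * v  # equal-weight late fusion of the two modalities
        if fused >= threshold:
            print(f"violent segment: {start:.1f}s - {end:.1f}s (score {fused:.2f})")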

  • Social Event Detection Task Dataset

    This task requires participants to discover events and detect media items that are related to either a specific social event or an event-class of interest.
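
    A naive starting point for event discovery is to group media items that were captured close together in time and space. The sketch below does this with a hypothetical metadata schema and arbitrary thresholds, purely for illustration:

    from datetime import datetime

    # Toy media items with assumed capture time and location fields
    items = [
        {"id": 1, "taken": datetime(2011, 6, 4, 20, 10), "lat": 52.37, "lon": 4.89},
        {"id": 2, "taken": datetime(2011, 6, 4, 20, 55), "lat": 52.37, "lon": 4.90},
        {"id": 3, "taken": datetime(2011, 7, 1, 12, 0),  "lat": 48.86, "lon": 2.35},
    ]

    def same_event(a, b, hours=3, max_deg=0.05):
        """Heuristic: two items belong to the same candidate event if they are close in time and space."""
        close_in_time = abs((a["taken"] - b["taken"]).total_seconds()) < hours * 3600
        close_in_space = abs(a["lat"] - b["lat"]) < max_deg and abs(a["lon"] - b["lon"]) < max_deg
        return close_in_time and close_in_space

    clusters = []
    for item in items:
        for cluster in clusters:
            if same_event(item, cluster[0]):
                cluster.append(item)
                break
        else:
            clusters.append([item])

    print([[m["id"] for m in cluster] for cluster in clusters])  # [[1, 2], [3]]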