Network of Excellence Peer-to-Peer Tagged Media

Placing Task Dataset

The Placing Task is part of the MediaEval Benchmark and requires participants to assign geographical coordinates (latitude and longitude) to each provided test video. Depending on the run, participants can make use of metadata, audio and visual features, and external resources. Note that at least one run that uses only audio/visual features will be required.


Use scenario

Geotagging on the internet: Most videos online are not assigned a geotag.


Ground truth and evaluation

The geo-coordinates associated with the Flickr videos will be used as the ground truth. Since these do not always pinpoint the location of the video precisely, we evaluate at each of a series of widening circles: 1km, 10km, 100km, 1000km, 10000km. We are also interested in videos uploaded by an uploader who was unseen in the development (i.e., training) data. To examine this issue, we will calculate a second set of scores over the part of the test data that contains unseen uploaders only.
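The circle-based evaluation described above can be illustrated with a minimal sketch. It assumes that distance is measured as great-circle (haversine) distance on a sphere with a mean Earth radius of 6371 km; the task description does not state which distance formula the official scoring uses. The function names and the dictionary layout are hypothetical and chosen only for this example.

```python
import math

# Assumption: mean Earth radius in km; not specified by the task description.
EARTH_RADIUS_KM = 6371.0
# Evaluation radii listed in the task description.
THRESHOLDS_KM = (1, 10, 100, 1000, 10000)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def hits_per_radius(predictions, ground_truth, video_ids=None):
    """Count videos predicted within each radius of their ground truth.

    predictions, ground_truth: dicts mapping video id -> (lat, lon).
    video_ids: optional subset of ids to score (e.g. unseen uploaders).
    Videos without a prediction are counted as misses.
    """
    ids = list(ground_truth) if video_ids is None else list(video_ids)
    counts = {radius: 0 for radius in THRESHOLDS_KM}
    for vid in ids:
        if vid not in predictions:
            continue
        distance = haversine_km(*predictions[vid], *ground_truth[vid])
        for radius in THRESHOLDS_KM:
            if distance <= radius:
                counts[radius] += 1
    return counts


def unseen_uploader_ids(test_uploaders, training_uploaders):
    """Return test video ids whose uploader does not occur in the training data.

    test_uploaders: dict mapping video id -> uploader id.
    training_uploaders: iterable of uploader ids seen in training.
    """
    seen = set(training_uploaders)
    return [vid for vid, uploader in test_uploaders.items() if uploader not in seen]
```

The second set of scores would then be obtained by passing the result of unseen_uploader_ids as the video_ids argument of hits_per_radius.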




The data set is an extension of the 2010 Placing Task data set and contains a set of geotagged Flickr videos as well as the metadata for geotagged Flickr images. A set of basic visual features extracted for all images and for the frames of the videos is provided. Evaluation of runs submitted by participating groups is based on the distances between the predicted and the actual geo-coordinates. Ground truth is supplied by the Flickr users who uploaded the videos and the images. All videos and images are shared by their owners under a Creative Commons license.



Flickr images (useful for development purposes)

For development purposes we provide metadata and low-level visual features extracted from a large set of Flickr images. Note that we do not supply the images themselves, just the low-level visual features and the metadata. If you would like the images, you need to download them yourself using the links in the metadata.


Using geographic bounding boxes of various sizes and the Flickr API, we collected metadata for 3,185,258 CC-licensed Flickr photos uniformly sampled from all parts of the world. Most, but not all, photos have textual tags. All photos have geotags of at least region-level accuracy. Accuracy indicates the zoom level the user used when placing the photo on the map. There are 16 zoom levels and hence 16 accuracy levels (e.g., 6 - region level, 12 - city level, 16 - street level). Note that we include Flickr images since they are potentially helpful for development purposes. The test set, however, will include only videos.


There is a text file with each line representing a photo. Example:


75905404@N00/3089149570 : : GeoData[longitude=150.69044 latitude=-33.751354 accuracy=15] : flowers flower australia nsw newsouthwales agapanthus pemrith  : 1228560221000 : 1228640133000


So, each line has the format:

UserID/PhotoID : HTMLLinkToPhoto : GeoData[longitude latitude accuracy] : tags : DateTaken : DateUploaded


UserIDs and PhotoIDs are Flickr user and photo identifiers. Dates are Unix timestamps (the example values above have 13 digits, i.e. they are given in milliseconds). Links point to medium-sized photos.
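As an illustration of the line format above, the following minimal sketch parses one metadata line into a dictionary. It is not an official reader: the regular expression is inferred from the single example line, the function and key names are hypothetical, and the conversion from milliseconds is an assumption based on the 13-digit values in the example. The link field is allowed to be empty, as it is in the example.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the example line; an assumption, not a published grammar.
LINE_RE = re.compile(
    r"^(?P<user>[^/\s]+)/(?P<photo>\S+)\s*:\s*(?P<link>\S*)\s*:\s*"
    r"GeoData\[longitude=(?P<lon>-?[\d.]+) latitude=(?P<lat>-?[\d.]+) "
    r"accuracy=(?P<acc>\d+)\]"
    r"\s*:\s*(?P<tags>.*?)\s*:\s*(?P<taken>\d+)\s*:\s*(?P<uploaded>\d+)\s*$"
)


def _from_millis(value):
    """Convert a millisecond Unix timestamp (assumed unit) to a UTC datetime."""
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


def parse_photo_line(line):
    """Parse one metadata line; raise ValueError if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised metadata line: %r" % line)
    return {
        "user_id": match.group("user"),
        "photo_id": match.group("photo"),
        "link": match.group("link"),          # may be an empty string
        "longitude": float(match.group("lon")),
        "latitude": float(match.group("lat")),
        "accuracy": int(match.group("acc")),
        "tags": match.group("tags").split(),  # empty list if the photo has no tags
        "date_taken": _from_millis(match.group("taken")),
        "date_uploaded": _from_millis(match.group("uploaded")),
    }
```

Since all photos are stated to have at least region-level accuracy, a development set restricted to, say, city level or better could be obtained by keeping only records whose accuracy value is 12 or higher.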


Pre-extracted low-level visual features (from keyframes and images)

We extracted visual features for video keyframes and training images using the open source library LIRE, with the default parameter settings and the default image size, which is 500 pixels on the long side. The feature vector file has the following format:

image_id \t feature_vector

Each feature vector always starts with the short name of the feature (e.g., cedd, gabor, fcth), as in the following examples:


12179.mp4-023.jpg cedd  144 0 0 4 0 0 5 0 4 1 0 1 0 0 0 0 0 
12179.mp4-023.jpg gabor 60 9.1060

12179.mp4-023.jpg fcth 192 0 0 6 0 
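The feature file layout above can be read with a few lines of code. The sketch below is a hedged example, not the reference loader: it splits on any whitespace (the format line mentions a tab, while the examples show spaces), and it assumes that the number following the feature short name (144, 60, 192 in the examples) is the declared vector length. The function names are hypothetical.

```python
def parse_feature_line(line):
    """Split one feature line into (image_id, feature_name, declared_length, values).

    Assumption: the third token is the declared length of the vector.
    """
    parts = line.split()
    if len(parts) < 3:
        raise ValueError("unrecognised feature line: %r" % line)
    image_id, feature_name = parts[0], parts[1]
    declared_length = int(parts[2])
    values = [float(token) for token in parts[3:]]
    return image_id, feature_name, declared_length, values


def load_features(lines):
    """Group feature vectors by image id, skipping blank lines.

    Returns a dict: image_id -> {feature_name: list of float values}.
    """
    features = {}
    for line in lines:
        if not line.strip():
            continue
        image_id, feature_name, _, values = parse_feature_line(line)
        features.setdefault(image_id, {})[feature_name] = values
    return features
```

With a complete file, comparing len(values) against the declared length would be a cheap sanity check; the example lines shown here are shortened, so the two numbers would not agree for them.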


Contact: Martha Larson, m.a.(lastname)