Network of Excellence Peer-to-Peer Tagged Media

Affect Task: Violent Scenes Detection Dataset

This task requires participants to deploy multimodal features to automatically detect portions of movies containing violent material. Violence is defined as "physical violence or accident resulting in human injury or pain". Any features automatically extracted from the video, including the subtitles, can be used by participants. No external additional data such as metadata collected from the Internet can be used in this task. Only the content of the DVDs is allowed for feature extraction.

Use case scenario

This challenge derives from a use case at Technicolor. The use case involves helping users choose movies that are suitable for children of different ages. The movies should be suitable in terms of their violent content, e.g., for viewing by users' families. Users select or reject movies by previewing parts of the movies (i.e., scenes or segments) that include the most violent moments.


Ground truth and evaluation

The ground truth is created by human assessors and is provided by the task organizers. In addition to segments containing physical violence (as defined above), annotations include the following high-level concepts: presence of blood, fights, presence of fire, presence of guns, presence of cold arms, car chases and gory scenes, for the visual modality. Note that participants are welcome to carry out detection of the high-level concepts. However, concept detection is not the goal of the task and these high-level concept annotations are only provided for training purposes. The test set will be released without any concept annotation.

Automatically generated shot segmentation (shot boundaries + extracted keyframes) will be provided along with the dataset.

Several performance measures will be used for diagnostic purposes (false alarm rate, miss detection rate, AEC, etc.). However, system comparison will be based on a detection cost function weighting false alarms and missed detections, according to

     C = C_{fa} P_{fa} + C_{miss} P_{miss}

where the costs C_{fa} = 1 and C_{miss} = 10 are arbitrarily defined to reflect (a) the prior probability of the situation and (b) the cost to the user of an error. P_{fa} and P_{miss} are the estimated probabilities of false alarm (false positive) and missed detection (false negative), respectively, given the system's output and the ground truth. In the shot classification run, the false alarm and miss probabilities will be calculated on a per-shot basis, while in the segment-level run they will be computed on a per-unit-of-time basis.
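The cost function can be sketched for the per-shot case as follows. This is a minimal illustration, not the organizers' evaluation tool: the function name `detection_cost` is hypothetical, and it assumes that P_{fa} is estimated as false alarms divided by the number of non-violent shots and P_{miss} as misses divided by the number of violent shots.

```python
def detection_cost(ground_truth, predictions, c_fa=1.0, c_miss=10.0):
    """Return (cost, p_fa, p_miss) for per-shot binary labels.

    ground_truth, predictions: equal-length sequences of booleans,
    where True means 'violent'.
    """
    if len(ground_truth) != len(predictions):
        raise ValueError("label sequences must have the same length")
    n_violent = sum(1 for g in ground_truth if g)
    n_nonviolent = len(ground_truth) - n_violent
    false_alarms = sum(1 for g, p in zip(ground_truth, predictions) if p and not g)
    misses = sum(1 for g, p in zip(ground_truth, predictions) if g and not p)
    # Assumption: a probability is taken as 0 when its class is absent.
    p_fa = false_alarms / n_nonviolent if n_nonviolent else 0.0
    p_miss = misses / n_violent if n_violent else 0.0
    # C = C_fa * P_fa + C_miss * P_miss
    return c_fa * p_fa + c_miss * p_miss, p_fa, p_miss


# Toy data: 2 violent shots, 4 non-violent shots, one miss, one false alarm.
truth = [True, True, False, False, False, False]
preds = [True, False, True, False, False, False]
cost, p_fa, p_miss = detection_cost(truth, preds)
```

For the segment-level run, the same formula would apply with shot counts replaced by durations (or frame counts).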

To avoid evaluating systems only at given operating points and to enable full comparison of the pros and cons of each system, we will use detection error trade-off (DET) curves whenever possible, plotting P_{fa} as a function of P_{miss} given a segmentation and a score for each segment, where the higher the score, the more likely the violence. Note that in the segment-level run, DET curves are possible only for systems returning a dense segmentation (a list of segments that spans the entire video): segments not present in the returned list of segments will be considered as non-violent for all thresholds.
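The operating points behind such a curve can be obtained by sweeping a threshold over the segment scores. The sketch below is illustrative only: the name `det_points` is hypothetical, it assumes a segment is declared violent when its score is greater than or equal to the threshold, and it returns raw (P_{fa}, P_{miss}) pairs without the axis warping often used when plotting DET curves.

```python
def det_points(scores, ground_truth):
    """Return a list of (threshold, p_fa, p_miss), one per distinct score.

    scores: one confidence score per segment (higher = more likely violent).
    ground_truth: one boolean per segment, True meaning 'violent'.
    """
    n_violent = sum(1 for g in ground_truth if g)
    n_nonviolent = len(ground_truth) - n_violent
    points = []
    for threshold in sorted(set(scores)):
        false_alarms = sum(
            1 for s, g in zip(scores, ground_truth) if s >= threshold and not g
        )
        misses = sum(
            1 for s, g in zip(scores, ground_truth) if s < threshold and g
        )
        p_fa = false_alarms / n_nonviolent if n_nonviolent else 0.0
        p_miss = misses / n_violent if n_violent else 0.0
        points.append((threshold, p_fa, p_miss))
    return points


# Toy data: four segments with scores and their ground-truth labels.
scores = [0.9, 0.8, 0.4, 0.2]
truth = [True, False, True, False]
points = det_points(scores, truth)
```

Raising the threshold trades false alarms for misses, which is the trade-off the curve makes visible.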


The dataset is a set of ca. 15 Hollywood movies that must be purchased by the participants. The movies are of different genres (from extremely violent movies to movies without violence).


Specifications of the ground truth and the concept annotations.

Annotation process

All annotations were conducted on the mpeg files resulting from the ripping process described above. The mpeg files were opened with the software VirtualDub, version 1.9.11. Note that VirtualDub will not read mpeg files as they are: you will have to add fccHandler's mpeg plugin, which can be found at the following address. The starting and ending times for segments are provided by the VirtualDub interface as frame numbers.


Annotation of violent segments

All annotations will be provided in text format, as files with the extension .txt. File names have been chosen to give a clear indication of their content.


Definition of a violent segment

Defining the term “Violence” is not an easy task, as the notion is subjective and thus varies from person to person. In the context of MediaEval 2011, we therefore adopted the following definition: violence is defined as "physical violence or accident resulting in human injury or pain".

Each violent segment will contain only one action as defined above, whenever possible. Cases where different actions overlap will be given as a single segment. We will try to flag these as much as possible in the annotation files by adding the tag 'multiple_action_scene'.


Exceptions/borderline cases

This section gives some examples of some borderline cases and the decision that was taken when annotating these types of content.

  • Shots in which the results of some physical violence are shown but not the violent act itself (e.g. in movie “Reservoir Dogs”, when one sees a man suffering and covered with blood in a car).

     Since there is no action, these scenes are not included in this year's target.

  • A dead body is shown on screen, possibly with a lot of injuries and blood, and without seeing how it was injured.

     Again, no action is seen, so these scenes are not included.

  • Somebody slaps another person in the face, but this action does not result in bleeding.

     There is an action resulting in pain; it is therefore included in the violent segments.

  • Somebody is shooting somebody else with the clear intent to kill, but the targeted person escapes with no injury.

     For this year, and to stick to our definition of a violent segment, these events corresponding to 'intention to kill' will not be included.

  • Scenes in which one knows that somebody is shooting some people but sees nothing (no gun, no people, no blood), while the audio is clear: one hears the gunshots, possibly with screams afterwards.

     This is violent action according to the definition: the action and its results are present in the audio modality. This was therefore annotated as a violent segment.

  • Scenes in which one sees surgery without anesthetic, resulting of course in pain (e.g. in the movie “Saving Private Ryan”, first scene on the beach).

     This is an action resulting in pain. It was therefore included in the annotations.

  • Segments where one sees the violent act (e.g. shooting), one sees the results (pain or injury) but both shots are separated by one single shot - non violent - of duration ~6 seconds. 

     These segments/shots were kept in the annotation process. The middle shot was either kept or not depending on its duration: if it is very short, it was included.

  • Segments in which a human is shooting a manikin

     This was kept, as in that case, the manikin has a humanoid shape and was considered by the character as a 'human'. 

  • A dragon breathes fire at another dragon, and the dragon on fire falls, but we do not know for sure if there is a rider on it

     We assumed that the dragon could not be fighting alone, therefore we kept this segment. 

  • Segments where one sees somebody shooting a moving tank (we did not see the people inside)

     As above, we assume that there were people inside. Therefore we kept these segments.

  • Segments in which one sees the destruction of a whole city

     Again, the city could not be empty. Segments were kept.

  • Man hitting amputated arm of a skeleton pirate

     This one was kept. 


Annotation format

Annotations for the violent segments are provided in text files, one file per movie. Each violent segment is described by its starting frame number and ending frame number, with a possible additional tag (i.e. multiple_action_scenes).

According to the DVD references provided, all DVDs belong to region 2, meaning that they are encoded with a frame rate of 25fps. The starting and ending frame numbers are therefore given according to this frame rate.

Example of violent segment annotation file:

3487       3576       tags

3678       3689       tags
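Such a file can be read with a few lines of code. The sketch below is only an illustration: the name `parse_annotations` is hypothetical, and it assumes whitespace-separated fields with any trailing fields treated as tags. The conversion to seconds uses the 25fps rate stated above.

```python
FPS = 25  # frame rate stated for the region 2 DVDs


def parse_annotations(text):
    """Parse 'starting_frame ending_frame [tags...]' lines into dicts."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        start, end = int(fields[0]), int(fields[1])
        segments.append({
            "start_frame": start,
            "end_frame": end,
            "start_sec": start / FPS,  # frame index converted to seconds
            "end_sec": end / FPS,
            "tags": fields[2:],        # empty list when no tag is given
        })
    return segments


# Sample content modeled on the example above; the tag is illustrative.
sample = "3487       3576       multiple_action_scene\n\n3678       3689\n"
segments = parse_annotations(sample)
```

The same reader applies to the concept annotation files, which share the 'starting_frame ending_frame tag' layout.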


Annotation of additional features

One text file is provided per annotated feature.


Visual features


Presence of blood

As soon as blood is visually present in the images, it is annotated. This follows the same format as that used for the violent segments:

starting_frame ending_frame percentage

Starting and ending frames are given at 25fps.

The tag ‘percentage’ takes one of the following values: 1%, 5%, 10%, 25% and 50%, with the following meanings:

  • 1%: amount of blood pixels in the image represents between 1 and 5% of the image surface.
  • 5%: surface_of_blood_pixels ∈ [5%, 10%[
  • 10%: surface_of_blood_pixels ∈ [10%, 25%[
  • 25%: surface_of_blood_pixels ∈ [25%, 50%[
  • 50%: surface_of_blood_pixels ∈ [50%, 100%]


Blood example: 89555 89685 25%
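The bins above amount to a simple lookup from a surface fraction to a tag. The following sketch is illustrative only: the name `blood_tag` is hypothetical, and returning `None` for less than 1% is an assumption, since the scale above does not define a tag for that case.

```python
def blood_tag(fraction):
    """Map a blood-pixel surface fraction (0.0 to 1.0) to its percentage tag."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    # Lower bounds of the half-open bins, checked from the largest down.
    bins = ((0.50, "50%"), (0.25, "25%"), (0.10, "10%"), (0.05, "5%"), (0.01, "1%"))
    for lower, tag in bins:
        if fraction >= lower:
            return tag
    return None  # assumption: below 1% has no tag in the listed scale


example = blood_tag(0.30)  # 30% of the image surface falls in [25%, 50%[
```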



Fights

Different types of fights were annotated, resulting in different tags in the file:

  • 1vs1: only two people fighting
  • small: for a small group of people (number of people was not counted, it will roughly correspond to less than 10)
  • large: for a large group of people (> 10)
  • distant attack: no real fight but somebody is shot or attacked at distance (gunshot, arrow, car, etc)

A fight may also be between a human and an animal.

This will follow the same format as what is used for the violent segments:

starting_frame ending_frame tag_if_necessary

Starting and ending frames are given at 25fps.

Fight example: 91314 91396 large 


Presence of fire

As soon as fire is visually present in the images, it is annotated. It could be a big fire as well as fire coming out of a gun while shooting. It could be also a candle or a cigarette lighter, or even a cigarette, or sparks. A space shuttle taking off will also generate fire. This will include explosions. When the fire is not yellow or orange, an additional tag indicates its color. In case too many extra colors are visible, a 'multicolor' tag will be used.

This follows the same format as that used for the violent segments:

starting_frame ending_frame tag_if_necessary

Starting and ending frames are given at 25fps.

Fire examples:

  • 159414 159425 blue 
  • 3890 3987 multicolor 
  • Other example with fire on a single frame: 62 62


Presence of firearms (guns and assimilated)

When any type of gun or similar weapon is shown on screen, it is annotated. Guns with bayonets were annotated as guns whenever any part of the weapon is seen, even if it is only part of the bayonet.

This follows the same format as that used for the violent segments:

starting_frame ending_frame

Starting and ending frames are given at 25fps.

Firearm example: 29783 29792


Presence of cold arms

Same as for firearms but for any kind of cold arms. Guns with bayonets were annotated also as cold arms, only when the bayonet is visible.

Cold arm example: 29783 29792


Car chases

Annotations of car chases indicate segments showing a car chase. This follows the same format as for violent segments:

starting_frame ending_frame

Starting and ending frames are given at 25fps.

Car chase example: 157469 158301


Gory scenes

Annotations of gory scenes will indicate graphic images of bloodletting and/or tissue damage. It will include horror or war representations. As this is also a subjective and difficult notion to define, some additional segments showing really disgusting mutants or creatures were annotated. Additional tags describing the event/scene were added in this case.


A non exhaustive example list is:

1 Death from being eaten
2 Death by bisection or dismemberment (excluding decapitation)
3 Death by crushing
4 Death by shredding
5 Death from blinding
6 Death by chainsaw
7 Death from decapitation
8 Death due to contact with a caustic or otherwise deadly substance
9 Death from a fall into a molten substance
10 Death by fluid extraction
11 Death by gunfire
12 Death by impalement or crucifixion
13 Death due to an improper use of explosives
14 Death by violent organ removal
15 Death from slicing by an object so sharp that it takes some moments for the victim to fall apart
16 Death from decompression or over-pressure
17 Miscellaneous/Other


Annotations will follow the same format: 

starting_frame ending_frame tags_if_necessary

Starting and ending frames are given at 25fps.

Audio features: screams, gunshots and explosions

Some audio annotations previously done on an older version of the corpus are also provided 'as is'. Note that they were not revisited. In particular, some synchronization issues might appear between these annotations and the new versions of the DVDs (especially for movies Leon and SavingPrivateRyan). They can be found in this directory (file audio_annotations.tgz).


Three concepts were annotated: screams, gunshots and explosions, one file per concept, using the same format as for the video concepts:

starting_frame ending_frame.

Note that for these audio concepts only, starting and ending frames are not always given at 25fps (Leon was annotated at a frame rate of 30fps, Saving Private Ryan at a frame rate of 24fps).
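If these audio annotations are to be compared with the 25fps annotations, one option is to rescale frame numbers through elapsed time. The sketch below is a hedged illustration: the names `rescale_frame`, `rescale_segment` and `AUDIO_FPS` are hypothetical, and it assumes both versions of a movie share the same timeline. That assumption may not hold (for instance if one version is a sped-up transfer of the other), and the synchronization issues mentioned above are not corrected by this conversion.

```python
# Frame rates stated above for the audio annotations of these two movies.
AUDIO_FPS = {"Leon": 30, "SavingPrivateRyan": 24}


def rescale_frame(frame, source_fps, target_fps=25):
    """Convert a frame index between frame rates via elapsed time."""
    return round(frame * target_fps / source_fps)


def rescale_segment(start, end, movie):
    """Rescale a (start, end) audio annotation to 25fps frame numbers."""
    fps = AUDIO_FPS.get(movie, 25)  # other movies are assumed to be at 25fps
    return rescale_frame(start, fps), rescale_frame(end, fps)


# 3000 frames at 30fps is 100 seconds, i.e. frame 2500 at 25fps.
leon_segment = rescale_segment(3000, 3600, "Leon")
```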


Provided shot segmentations

Automatically generated shot boundaries with their corresponding keyframes will also be provided with each movie. 

Shot segmentation was carried out by Technicolor's software. Note that, because the detection procedure is automatic, the shot boundary information will not necessarily be perfect. The shot segmentation is provided in text format in a file with the suffix *-shot.txt. It contains one shot per line with the same format as above:

starting_frame ending_frame.
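Shot boundaries and violent segment annotations can be combined to obtain per-shot labels. The sketch below is illustrative only: the name `label_shots` is hypothetical, and the rule used (a shot is violent if it overlaps a violent segment by at least one frame, with inclusive frame ranges) is an assumption, not a rule stated by the organizers.

```python
def label_shots(shots, violent_segments):
    """Return one boolean per shot: True if it overlaps any violent segment.

    shots, violent_segments: lists of (starting_frame, ending_frame) pairs,
    treated here as inclusive ranges.
    """
    labels = []
    for shot_start, shot_end in shots:
        overlaps = any(
            seg_start <= shot_end and shot_start <= seg_end
            for seg_start, seg_end in violent_segments
        )
        labels.append(overlaps)
    return labels


# Toy shot list; the violent segments reuse the example annotation values.
shots = [(0, 3400), (3401, 3500), (3501, 3600), (3601, 3700)]
violent = [(3487, 3576), (3678, 3689)]
labels = label_shots(shots, violent)
```

A stricter rule, such as requiring a minimum overlap ratio, would change which boundary shots count as violent.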

The shot segmentation results will also be provided as an XML file. This file will contain shot boundaries as timecodes, together with their corresponding keyframe with timecode and frame number.

Keyframes will be provided in jpg format. Keyframe files will contain the corresponding keyframe number in their names.






Contact: Martha Larson, m.a.(lastname)