Challenge Details


Dataset Summary

For both tasks, we use the AV-Deepfake1M (arXiv, GitHub) dataset, a large-scale dataset addressing content-driven multimodal deepfakes that contains over 1M videos and 2K speakers in total.

Dataset Stats

In AV-Deepfake1M, each video contains only a few or no fake visual/audio segments. The challenge targets 2 tasks. Participants are expected to develop their models on the train & val sets and submit predictions on the test set. The top-3 winners will be determined by performance on the test set and will be required to submit their code (Docker) for final checking.

Metadata Details
The metadata for each subset (train, val) is a JSON file containing a list of dictionaries. The fields in each dictionary are listed below (see the loading sketch after the list). Please note the frame-level labels are only available for the temporal localization task (Task 2).
  • file: the path to the video file.
  • original: if the current video is fake, the path to the original video; otherwise, the original path in VoxCeleb2.
  • split: the name of the current subset.
  • modify_type: the type of modifications in different modalities, which can be ["real", "visual_modified", "audio_modified", "both_modified"]. We evaluate the deepfake detection (Task 1) performance based on this field.
  • audio_model: the audio generation model used for generating this video.
  • fake_segments: the timestamps of the fake segments. We evaluate the temporal localization (Task 2) performance based on this field.
  • audio_fake_segments: the timestamps of the fake segments in audio modality.
  • visual_fake_segments: the timestamps of the fake segments in visual modality.
  • video_frames: the number of frames in the video.
  • audio_frames: the number of frames in the audio.
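As an illustration, a minimal sketch for loading and inspecting the metadata could look like the code below. The file name train_metadata.json is an assumption; use the actual metadata file shipped with the dataset.

import json

# Hypothetical file name; substitute the actual metadata file from the dataset.
with open("train_metadata.json", "r") as f:
    metadata = json.load(f)  # a list of dictionaries, one per video

# Split the videos into real and fake according to modify_type.
real_videos = [m for m in metadata if m["modify_type"] == "real"]
fake_videos = [m for m in metadata if m["modify_type"] != "real"]

print(len(metadata), "videos in total")
print(metadata[0]["file"], metadata[0]["modify_type"], metadata[0]["fake_segments"])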

Task 1: Deepfake Detection with Limited Label Access

In this task, we aim to detect deepfake videos with access to video-level labels only. Although the dataset contains full annotations of the timestamps of the fake segments (i.e., frame-level labels), participants in this task may only use the video-level labels to train their models.

Real: The video is real, which means there is no fake segment in the video.

Fake: The video is fake, which means there is at least one fake segment in the video.

The metric for this task is the AUC score.

The model output should be a single confidence score that the input video is fake. The expected submission format is shown below.

000001.mp4;0.9128
000002.mp4;0.9142
000003.mp4;0.0174
000004.mp4;0.2021
...
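For reference, a minimal sketch for writing predictions in this format might look like the following; the scores and the output file name prediction.txt are assumptions, not part of the official tooling.

# Hypothetical example scores; in practice these come from the model.
predictions = {
    "000001.mp4": 0.9128,
    "000002.mp4": 0.9142,
    "000003.mp4": 0.0174,
}

# The file name "prediction.txt" is an assumption; follow the submission instructions.
with open("prediction.txt", "w") as f:
    for file_name, score in predictions.items():
        f.write(f"{file_name};{score:.4f}\n")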

Task 2: Deepfake Temporal Localization

In this task, we aim to temporally localize the fake segments in the videos with full label access. Participants in this task can access the frame-level labels, which contain the timestamps of the fake segments.

The metrics for this task are AP (Average Precision) and AR (Average Recall).

Score = \frac{1}{8}\sum_{\text{IoU}\in\{0.5,0.75,0.9,0.95\}}\text{AP@IoU} + \frac{1}{10}\sum_{N\in\{50,30,20,10,5\}}\text{AR@}N
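For intuition, both AP@IoU and AR@N are computed over 1-D temporal segments. Below is a minimal sketch of the temporal IoU between a predicted and a ground-truth segment; it is not the official evaluation code, just an illustration of the matching criterion.

def temporal_iou(pred, gt):
    """IoU between two 1-D segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction matches a ground-truth segment when its IoU exceeds the threshold
# (0.5, 0.75, 0.9 or 0.95); AP@IoU and AR@N aggregate over these matches.
print(temporal_iou((4.0, 4.8), (4.24, 5.52)))  # ~0.37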

The model output should be the temporal localization of the fake segments in the input video. The expected format is a JSON file with the following structure.

{
    "000001.mp4": [
        [0.9541, 4.0, 4.8], // [confidence to be fake, start time, end time]
        [0.2315, 4.24, 5.52],
        [0.0012, 0.48, 2.6],
        [0.0002, 6.56, 8.8],
        ...
    ],
    ...
}
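A minimal sketch for writing predictions in this structure is shown below; the segment values and the output file name prediction.json are assumptions for illustration only.

import json

# Hypothetical example; each entry is [confidence to be fake, start time, end time].
predictions = {
    "000001.mp4": [
        [0.9541, 4.0, 4.8],
        [0.2315, 4.24, 5.52],
    ],
}

# The file name "prediction.json" is an assumption; follow the submission instructions.
with open("prediction.json", "w") as f:
    json.dump(predictions, f)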

Frequently Asked Questions

Q: Can external data be used in the challenge?

A: Only public external data can be used.


Q: How can the segment labels in the metadata be converted to frame-level and video-level labels?

A: See the code below.

import numpy as np

# real: 0, fake: 1
# frame_label is only needed for the temporal localization task (Task 2).
# video_frames and segments (fake_segments) come from the metadata; fps is the video frame rate.
frame_label = np.zeros(video_frames, dtype=int)
for start, end in segments:
    frame_label[int(start * fps):int(end * fps)] = 1
# video_label is used for both the classification and the temporal localization task
video_label = int(len(segments) > 0)