Challenge Details


Dataset Summary

Both tasks use the AV-Deepfake1M++ dataset, a large-scale dataset of content-driven multimodal deepfakes that contains around 2M videos and more speakers in total than the previous AV-Deepfake1M.

Subset       #Videos  #Real  #Fake  #Frames  #Time   #Subjects
Training     1.10M    0.30M  0.80M  264M     2934H   2606*
Validation   0.08M    0.02M  0.06M  18M      205H    1676*
TestA        TBD      TBD    TBD    TBD      TBD     TBD
TestB        TBD      TBD    TBD    TBD      TBD     TBD

*The subjects in the training and validation sets overlap.

In AV-Deepfake1M++, each video contains only a few fake visual/audio segments, or none at all. The challenge targets two tasks. Participants are expected to develop their models on the train and val sets and submit predictions on the testA set. The top-3 winners will be determined by performance on the testA set and are required to submit their training and testing code (as a Docker image) for a final check, which determines the final winners on the testB set.

Metadata Details
The metadata for each subset (train, val) is a JSON file containing a list of dictionaries, one per video. The fields of each dictionary are listed below (a loading sketch is shown after the list). Please note that the frame-level labels are only available for temporal localization (Task 2).
  • file: the path to the video file.
  • original: if the current video is fake, the path to the original video; otherwise, the original path in the source dataset.
  • split: the name of the current subset.
  • modify_type: the type of modifications in different modalities, which can be ["real", "visual_modified", "audio_modified", "both_modified"]. We evaluate the deepfake detection (Task 1) performance based on this field.
  • audio_model: the audio generation model used for generating this video.
  • video_model: the visual generation model used for generating this video.
  • fake_segments: the timestamps of the fake segments. We evaluate the temporal localization (Task 2) performance based on this field.
  • audio_fake_segments: the timestamps of the fake segments in audio modality.
  • visual_fake_segments: the timestamps of the fake segments in visual modality.
  • video_frames: the number of frames in the video.
  • audio_frames: the number of frames in the audio.
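For reference, here is a minimal loading sketch using the standard json module. The path and file name (train_metadata.json) are assumptions; adjust them to the actual repository layout. The field names match the list above.

import json

# Hypothetical path; adjust to where the metadata JSON actually lives.
with open("AV-Deepfake1M-PlusPlus/train_metadata.json") as f:
    metadata = json.load(f)  # a list of dictionaries, one per video

# Video-level labels for Task 1: a video is fake if any modality was modified.
fake_files = [m["file"] for m in metadata if m["modify_type"] != "real"]

# Frame-level supervision for Task 2 comes from the "fake_segments" timestamps.
example = metadata[0]
print(example["file"], example["modify_type"], example["fake_segments"])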

Prepare the Dataset

Please register for the challenge and submit the EULA; you will then be granted access to a Hugging Face repository.

huggingface-cli login
huggingface-cli download ControlNet/AV-Deepfake1M-PlusPlus --repo-type dataset --local-dir ./AV-Deepfake1M-PlusPlus

Then, extract the zip archives. Go to the directory of the subset you want to extract and run

7z x train.zip.001

This will extract all volumes automatically.

The dataloader from the AV-Deepfake1M SDK might be helpful.


Task 1: Deepfake Detection with Limited Label Access

In this task, we aim to detect deepfake videos with access to video-level labels only. Although the dataset contains full annotations of the fake-segment timestamps (i.e., frame-level labels), participants in this task may only use the video-level labels to train their models.

Real: The video is real, which means there is no fake segment in the video.

Fake: The video is fake, which means there is at least one fake segment in the video.

The metric for this task is the AUC score.

The model output should be a single confidence score that the input video is fake. The expected submission format is shown below.

000001.mp4;0.9128
000002.mp4;0.9142
000003.mp4;0.0174
000004.mp4;0.2021
...
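For reference, a minimal sketch for writing a submission in this format, assuming a hypothetical predictions dictionary mapping file names to fake confidences (check the submission system for the exact output file name):

# predictions: hypothetical dict mapping video file name -> confidence of being fake
predictions = {"000001.mp4": 0.9128, "000002.mp4": 0.9142}

with open("prediction.txt", "w") as f:
    for name, score in predictions.items():
        f.write(f"{name};{score:.4f}\n")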

Task 2: Deepfake Temporal Localization

In this task, we aim to temporally localize the fake segments in the videos with full label access. Participants in this task can access the frame-level labels, which contain the timestamps of the fake segments.

The metrics for this task are AP (Average Precision) and AR (Average Recall).

Score = \frac{1}{8}\sum_{IoU\in\{0.5,0.75,0.9,0.95\}} AP@IoU + \frac{1}{10}\sum_{N\in\{50,30,20,10,5\}} AR@N
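In other words, AP is averaged over the four IoU thresholds and AR over the five proposal counts, with each half contributing 0.5 of the maximum score. A sketch of the aggregation, assuming hypothetical dictionaries ap and ar holding already-computed AP@IoU and AR@N values:

# ap: hypothetical dict {IoU threshold: AP@IoU}; ar: hypothetical dict {N: AR@N}
def challenge_score(ap, ar):
    ap_term = sum(ap[iou] for iou in (0.5, 0.75, 0.9, 0.95)) / 8
    ar_term = sum(ar[n] for n in (50, 30, 20, 10, 5)) / 10
    return ap_term + ar_term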

The model output should be the temporal localization of the fake segments in the input video. The expected format is a JSON file with the following structure.

{
    "000001.mp4": [
        [0.9541, 4.0, 4.8], // [confidence to be fake, start time, end time]
        [0.2315, 4.24, 5.52],
        [0.0012, 0.48, 2.6],
        [0.0002, 6.56, 8.8],
        ...
    ],
    ...
}
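A minimal sketch for producing such a file, assuming a hypothetical results dictionary mapping file names to [confidence, start, end] proposals:

import json

# results: hypothetical dict mapping file name -> list of [confidence, start_sec, end_sec]
results = {"000001.mp4": [[0.9541, 4.0, 4.8], [0.2315, 4.24, 5.52]]}

with open("prediction.json", "w") as f:
    json.dump(results, f)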

Frequently Asked Questions

Q: Can external data be used in the challenge?

A: Only publicly available external data can be used.


Q: Can foundation models be used in the challenge?

A: Yes, if it is publicly available.


Q: How can the segment labels in the metadata be converted to frame-level and video-level labels?

A: See the code below.

import numpy as np

# real: 0, fake: 1
# segments and video_frames come from the per-video metadata; fps is the video frame rate
# frame_label is only needed for the temporal localization task (Task 2)
frame_label = np.zeros(video_frames)
for start, end in segments:
    frame_label[int(start * fps):int(end * fps)] = 1
# video_label is used for both the classification and temporal localization tasks
video_label = len(segments) > 0