For both tasks, we use the AV-Deepfake1M++ dataset, a large-scale dataset for content-driven multimodal deepfakes. It contains around 2M videos and more speakers in total than the previous AV-Deepfake1M.
| Subset | #Videos | #Real | #Fake | #Frames | #Time | #Subjects |
|---|---|---|---|---|---|---|
| Training | 1.10M | 0.30M | 0.80M | 264M | 2934H | 2606* |
| Validation | 0.08M | 0.02M | 0.06M | 18M | 205H | 1676* |
| TestA | TBD | TBD | TBD | TBD | TBD | TBD |
| TestB | TBD | TBD | TBD | TBD | TBD | TBD |
*The subjects in the training and validation sets overlap.
In AV-Deepfake1M++, each video contains only a few or no fake visual/audio segments. We host the challenge targeting two tasks. Participants are expected to develop their models on the train & val sets and submit predictions on the testA set. The top-3 winners will be determined by performance on the testA set and are required to submit their training and testing code (as a Docker image) for a final check, which determines the final winners on the testB set.
Please register for the challenge and submit the EULA; you will then be granted access to a Hugging Face repository.
huggingface-cli login
huggingface-cli download ControlNet/AV-Deepfake1M-PlusPlus --repo-type dataset --local-dir ./AV-Deepfake1M-PlusPlus
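If you prefer Python, the same download can be done with the huggingface_hub library (a minimal sketch; it assumes you have already logged in and been granted access to the repository).

from huggingface_hub import snapshot_download

# Download the full dataset repository to a local directory.
snapshot_download(
    repo_id="ControlNet/AV-Deepfake1M-PlusPlus",
    repo_type="dataset",
    local_dir="./AV-Deepfake1M-PlusPlus",
)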
Then, unzip the archives. Go to the directory of the subset you want to unzip and run
7z x train.zip.001
7-Zip will extract all the volumes automatically.
The dataloader from the AV-Deepfake1M SDK might be helpful.
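If you prefer not to use the SDK, the metadata can also be read directly. Below is a minimal sketch; the file name train_metadata.json and the field names "file" and "fake_segments" are assumptions, so please check the schema of the metadata actually shipped with the dataset.

import json

# A sketch of splitting videos into real/fake by their annotated fake segments.
# The metadata path and field names below are assumptions; adjust to the real schema.
with open("./AV-Deepfake1M-PlusPlus/train_metadata.json") as f:
    metadata = json.load(f)

real_videos = [m["file"] for m in metadata if len(m["fake_segments"]) == 0]
fake_videos = [m["file"] for m in metadata if len(m["fake_segments"]) > 0]
print(f"{len(real_videos)} real videos, {len(fake_videos)} fake videos")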
In this task, we aim to detect deepfake videos with access to video-level labels only. Although the dataset contains full annotations of the timestamps of the fake segments (i.e., frame-level labels), participants in this task may only use the video-level labels to train their models.
Real: The video is real, which means there is no fake segment in the video.
Fake: The video is fake, which means there is at least one fake segment in the video.
The metric for this task is the AUC score.
The model output should be a single confidence score that the input video is fake. The expected submission format is shown below.
000001.mp4;0.9128
000002.mp4;0.9142
000003.mp4;0.0174
000004.mp4;0.2021
...
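For example, the prediction file can be written with plain Python, and the AUC can be checked locally on the validation set with scikit-learn. This is a sketch: predictions is a hypothetical dict mapping file names to fake-confidence scores, labels the corresponding 0/1 video-level ground truth, and prediction.txt a placeholder output name.

from sklearn.metrics import roc_auc_score

# predictions: {"000001.mp4": 0.9128, ...}  hypothetical model outputs
# labels:      {"000001.mp4": 1, ...}       video-level ground truth (real: 0, fake: 1)
with open("prediction.txt", "w") as f:
    for name, score in predictions.items():
        f.write(f"{name};{score:.4f}\n")

# Local sanity check of the task metric on the validation set.
files = sorted(labels)
print(roc_auc_score([labels[k] for k in files], [predictions[k] for k in files]))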
In this task, we aim to temporally localize the fake segments in the videos with full label access. Participants in this task can access the frame-level labels, which contain the timestamps of the fake segments.
The metrics for this task are AP (Average Precision) and AR (Average Recall).
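Both metrics are based on matching predicted segments to ground-truth segments at temporal IoU thresholds. Below is a minimal sketch of the 1D IoU used in such matching, for illustration only; please refer to the AV-Deepfake1M SDK for the official evaluation code.

def temporal_iou(pred, gt):
    # IoU between two segments given as (start, end) in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((4.0, 4.8), (4.24, 5.52)))  # ~0.37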
The model output should be the temporal localization of the fake segments in the input video. The expected format is a JSON file with the following structure.
{
"000001.mp4": [
[0.9541, 4.0, 4.8],
// [confidence to be fake, start time, end time]
[0.2315, 4.24, 5.52],
[0.0012, 0.48, 2.6],
[0.0002, 6.56, 8.8],
...
],
...
}
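For example, such a file can be written with the standard json module (a sketch; proposals is a hypothetical dict mapping each file name to the [confidence, start, end] triplets produced by your localization model, and prediction.json a placeholder output name).

import json

# proposals: {"000001.mp4": [[0.9541, 4.0, 4.8], [0.2315, 4.24, 5.52], ...], ...}
with open("prediction.json", "w") as f:
    json.dump(proposals, f)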
Q: Can external data be used in the challenge?
A: Only public external data can be used.
Q: Can foundation models be used in the challenge?
A: Yes, if they are publicly available.
Q: How do I convert the segment labels in the metadata to frame-level and video-level labels?
A: See the code below.
import numpy as np

# real: 0, fake: 1
# `segments` holds the fake segments as [start, end] in seconds from the metadata;
# `video_frames` is the number of frames and `fps` the frame rate of the video.
# frame_label is only needed for the temporal localization task
frame_label = np.zeros(video_frames, dtype=int)
for start, end in segments:
    frame_label[int(start * fps):int(end * fps)] = 1
# video_label is used for both the classification and temporal localization tasks
video_label = int(len(segments) > 0)
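As a quick sanity check, assuming a 10-second clip at 25 fps (video_frames = 250) with segments = [[4.0, 4.8], [6.56, 8.8]], the snippet marks frames 100-119 and 164-219 as fake and sets video_label to 1.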