For both tasks, we use the AV-Deepfake1M (Arxiv, GitHub) dataset. It is a large-scale dataset addressing content-driven multimodal deepfakes, containing over 1M videos from more than 2K speakers in total.
In AV-Deepfake1M, each video contains only a few or no fake visual/audio segments. We host the challenge targeting 2 tasks. Participants are expected to develop their models on the train & val sets and submit predictions on the test set. The top-3 winners will be determined by their performance on the test set, and will be required to submit their code (Docker) for final verification.
In this task, we aim to detect deepfake videos with access to video-level labels only. Although the dataset contains full annotations of the timestamps of the fake segments (i.e., frame-level labels), participants in this task may only use the video-level labels to train their models.
Real: The video is real, which means there is no fake segment in the video.
Fake: The video is fake, which means there is at least one fake segment in the video.
The metric for this task is the AUC score.
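For local validation on the train/val splits (where labels are available), the AUC can be computed with scikit-learn as sketched below. The variable names are illustrative; the official evaluation is performed on the held-out test set.

# Minimal sketch of local AUC evaluation, assuming y_true holds the
# video-level labels (0 = real, 1 = fake) and y_score holds the predicted
# confidences of each video being fake (both lists are illustrative).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.0174, 0.2021, 0.9128, 0.9142]

auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.4f}")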
The model output should be a single confidence score that the input video is fake. The expected submission format is shown below.
000001.mp4;0.9128
000002.mp4;0.9142
000003.mp4;0.0174
000004.mp4;0.2021
...
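A minimal sketch for writing this submission file is shown below, assuming `predictions` is a dictionary mapping video file names to fake confidences (the dictionary contents and the output file name are illustrative).

# Minimal sketch for writing the Task 1 submission file, assuming
# `predictions` maps video file names to fake confidences (illustrative data).
predictions = {"000001.mp4": 0.9128, "000002.mp4": 0.9142}

with open("prediction.txt", "w") as f:
    for name, score in predictions.items():
        # one "file;confidence" pair per line, as in the example above
        f.write(f"{name};{score:.4f}\n")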
In this task, we aim to temporally localize the fake segments in the videos with full label access. Participants in this task can access the frame-level labels, which contain the timestamps of the fake segments.
The metrics for this task are AP (Average Precision) and AR (Average Recall).
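Both metrics rely on matching predicted segments to ground-truth segments by temporal IoU. The exact thresholds and matching rules are defined by the official evaluation code; the snippet below only sketches the temporal IoU between two segments, which is the core overlap measure behind AP/AR.

# Sketch of temporal IoU between a predicted and a ground-truth segment,
# each given as (start_time, end_time) in seconds. The official evaluation
# script defines the full AP/AR protocol; this only shows the overlap measure.
def temporal_iou(pred, gt):
    # intersection of the two intervals (0 if they do not overlap)
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((4.0, 4.8), (4.24, 5.52)))  # ~0.368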
The model output should be the temporal localization of the fake segments in the input video. The expected format is a JSON file with the following structure.
{
    "000001.mp4": [
        [0.9541, 4.0, 4.8],
        // [confidence to be fake, start time, end time]
        [0.2315, 4.24, 5.52],
        [0.0012, 0.48, 2.6],
        [0.0002, 6.56, 8.8],
        ...
    ],
    ...
}
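A minimal sketch for writing this prediction file is shown below, assuming `predictions` maps each video file name to a list of [confidence, start time, end time] triplets (the dictionary contents and the output file name are illustrative).

# Minimal sketch for writing the Task 2 prediction file, assuming
# `predictions` maps video file names to lists of
# [fake confidence, start time, end time] triplets (illustrative data).
import json

predictions = {
    "000001.mp4": [
        [0.9541, 4.0, 4.8],
        [0.2315, 4.24, 5.52],
    ],
}

with open("prediction.json", "w") as f:
    json.dump(predictions, f)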
Q: Can external data be used in the challenge?
A: Only public external data can be used.
Q: How can the segment labels in the metadata be converted to frame-level and video-level labels?
A: See the code below.
import numpy as np

# real: 0, fake: 1
# segments, fps and video_frames come from the metadata of each video
# frame_label is only for the temporal localization task
frame_label = np.zeros(video_frames)
for start, end in segments:
    # mark the frames covered by each fake segment (times are in seconds)
    frame_label[int(start * fps):int(end * fps)] = 1
# video_label is for both the classification and temporal localization tasks
video_label = int(len(segments) > 0)
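For reference, a hypothetical way to apply this conversion over the whole metadata file is sketched below. The metadata keys ("file", "fake_segments", "fps", "video_frames") and the file name are assumptions for illustration; check the released metadata for the actual field names.

# Hypothetical usage of the conversion above, wrapped in a helper function.
# All metadata field names and the metadata file name are assumed here.
import json
import numpy as np

def convert_labels(entry):
    segments = entry["fake_segments"]            # list of [start, end] in seconds (assumed key)
    frame_label = np.zeros(entry["video_frames"])
    for start, end in segments:
        frame_label[int(start * entry["fps"]):int(end * entry["fps"])] = 1
    video_label = int(len(segments) > 0)
    return frame_label, video_label

with open("train_metadata.json") as f:           # illustrative file name
    metadata = json.load(f)

labels = {entry["file"]: convert_labels(entry) for entry in metadata}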