"DeepSpeak is a new, comprehensive dataset designed to aid in the detection of deepfake media, comprising 34 hours of footage that includes 9 hours of real videos from 220 diverse individuals and over 25 hours of synthetic deepfake content generated through advanced face-swap and lip-sync technologies. The dataset aims to support the digital forensics community by providing valuable resources for developing detection techniques, with plans for regular updates to keep pace with advancements in generative AI. Researchers are encouraged to access and utilize the dataset as part of ongoing efforts to understand and mitigate the implications of deepfake technology."
In the rapidly evolving landscape of digital media, the advent of deepfake technology presents both exciting possibilities and significant challenges. To bolster efforts in media forensics, our research team has introduced DeepSpeak, a new dataset designed to provide valuable resources for understanding and detecting deepfake videos. This post summarizes the key aspects and findings of our paper on the DeepSpeak dataset.
DeepSpeak is a large-scale dataset consisting of 34 hours of total footage, capturing both real and deepfake videos of people vocalizing and gesturing in front of their webcams. Specifically, the dataset encompasses:

- roughly 9 hours of real videos recorded by 220 individuals spanning a range of ages, genders, and racial and ethnic backgrounds; and
- over 25 hours of deepfake content generated with advanced face-swap and lip-sync techniques.
The dataset is publicly available for research and non-commercial purposes, with future versions planned to reflect advancements in deepfake technologies.
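For readers who want a sense of what working with the release might look like, here is a minimal sketch that enumerates clips with the Hugging Face `datasets` library. The repository id, split name, and label field below are assumptions made for illustration only; consult the official release page for the actual identifiers.

```python
# Minimal sketch of enumerating DeepSpeak clips via the Hugging Face `datasets` library.
# NOTE: the repository id, split name, and "type" field are illustrative assumptions;
# check the official release page for the actual identifiers.
from collections import Counter

from datasets import load_dataset

REPO_ID = "faridlab/deepspeak_v1"  # hypothetical repository id

ds = load_dataset(REPO_ID, split="train", trust_remote_code=True)

# Tally real vs. deepfake clips, assuming each record carries a "type" label.
label_counts = Counter(record["type"] for record in ds)
print(label_counts)
```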
We employed a rigorous and systematic approach to collecting the real video data. Using the Prolific research platform, we recruited participants selected for diversity in gender and age and for representation across racial and ethnic backgrounds. Each participant was compensated and given clear instructions to record themselves responding to a set of predefined prompts, capturing both scripted and unscripted content.
The final dataset included a structured collection of responses categorized into scripted, unscripted, and action-based prompts.
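To make that prompt structure concrete, the sketch below groups downloaded real recordings by prompt category. The metadata filename and its column names are hypothetical placeholders standing in for whatever metadata accompanies the release.

```python
# Illustrative sketch: group real recordings by prompt category (scripted, unscripted,
# action-based). The metadata file and its column names are hypothetical placeholders.
import csv
from collections import defaultdict
from pathlib import Path

def build_manifest(metadata_csv: str, video_dir: str) -> dict[str, list[Path]]:
    """Map each prompt category to the recordings collected for it."""
    manifest: dict[str, list[Path]] = defaultdict(list)
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            manifest[row["prompt_type"]].append(Path(video_dir) / row["filename"])
    return dict(manifest)

if __name__ == "__main__":
    for category, files in build_manifest("metadata.csv", "real_videos").items():
        print(f"{category}: {len(files)} recordings")
```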
The deepfake component of DeepSpeak was meticulously crafted using two primary techniques: face-swapping and lip-syncing.
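Conceptually, both techniques start from pairing a target recording with material from a different identity: a face for face-swapping, or a driving audio track for lip-syncing. The sketch below enumerates such generation jobs; the pairing logic and file layout are purely illustrative and are not the exact protocol used to build the dataset.

```python
# Hypothetical sketch of enumerating deepfake generation jobs. Each real target video is
# paired with a different identity, whose face drives a face-swap and whose audio drives
# a lip-sync. This illustrates the general idea only, not the dataset's exact recipe.
import itertools
import random
from dataclasses import dataclass

@dataclass
class GenerationJob:
    kind: str          # "face-swap" or "lip-sync"
    target_video: str  # recording whose scene and body are kept
    source_video: str  # recording supplying the swapped face or the driving audio

def make_jobs(videos_by_identity: dict[str, list[str]], seed: int = 0) -> list[GenerationJob]:
    rng = random.Random(seed)
    jobs: list[GenerationJob] = []
    # Pair every identity with every other identity, once in each direction.
    for target_id, source_id in itertools.permutations(videos_by_identity, 2):
        target = rng.choice(videos_by_identity[target_id])
        source = rng.choice(videos_by_identity[source_id])
        jobs.append(GenerationJob("face-swap", target, source))
        jobs.append(GenerationJob("lip-sync", target, source))
    return jobs
```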
We utilized three unique configurations for the face-swap process, leveraging:
For the lip-sync aspect, two prominent techniques were employed:
The primary objective of the DeepSpeak dataset is to assist the digital forensics community in developing and refining detection techniques for deepfake content. Given the rapid advancement of generative AI technologies, we anticipate updating DeepSpeak with fresh data biannually, ensuring that the dataset remains relevant and useful for research, educational, and forensic applications.
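As one concrete, deliberately simple example of the kind of detector such a dataset can support, the sketch below repurposes a pretrained image backbone as a frame-level real/fake classifier. This is a common baseline pattern, not a method from our paper, and the class count and input sizes are illustrative.

```python
# A minimal, illustrative baseline for DeepSpeak-style data: a frame-level real/fake
# classifier built on a pretrained ResNet. A sketch of one common approach, not a
# method from the paper.
import torch
import torch.nn as nn
from torchvision import models

def build_detector(num_classes: int = 2) -> nn.Module:
    """ResNet-18 backbone with its final layer replaced for real/fake classification."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

if __name__ == "__main__":
    detector = build_detector()
    frames = torch.randn(4, 3, 224, 224)  # a batch of video frames (placeholder data)
    logits = detector(frames)             # per-frame real/fake scores
    print(logits.shape)                   # torch.Size([4, 2])
```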
DeepSpeak serves as an essential resource for researchers, educators, and practitioners interested in the detection of deepfake media. By offering a rich collection of diverse audio-visual data, we hope to empower the community to better understand and mitigate the implications of deepfakes in today's digital landscape.
If you're interested in leveraging the DeepSpeak dataset for your research or projects, feel free to access it here.
We welcome feedback and collaborative efforts to advance this critical area of study!