"DeepSpeak is a new, comprehensive dataset designed to aid in the detection of deepfake media, comprising 34 hours of footage that includes 9 hours of real videos from 220 diverse individuals and over 25 hours of synthetic deepfake content generated through advanced face-swap and lip-sync technologies. The dataset aims to support the digital forensics community by providing valuable resources for developing detection techniques, with plans for regular updates to keep pace with advancements in generative AI. Researchers are encouraged to access and utilize the dataset as part of ongoing efforts to understand and mitigate the implications of deepfake technology."
In the rapidly evolving landscape of digital media, the advent of deepfake technology presents both exciting possibilities and significant challenges. To bolster efforts in media forensics, our research team has introduced DeepSpeak, a new dataset designed to provide valuable resources for understanding and detecting deepfake videos. This post summarizes the key aspects and findings of our paper on the DeepSpeak dataset.
DeepSpeak is a large-scale dataset consisting of 34 hours of total footage, capturing both real and deepfake videos of people vocalizing and gesturing in front of their webcams. Specifically, the dataset encompasses:

- roughly 9 hours of real videos recorded by 220 individuals spanning a range of ages, genders, and racial and ethnic backgrounds; and
- over 25 hours of deepfake content generated with advanced face-swap and lip-sync techniques.
The dataset is publicly available for research and non-commercial purposes, with future versions planned to reflect advancements in deepfake technologies.
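For readers who want a sense of what working with the release might look like, here is a minimal sketch that enumerates clips with the Hugging Face `datasets` library. The repository id, split name, and label field below are assumptions made for illustration only; consult the official release page for the actual identifiers.

```python
# Minimal sketch of enumerating DeepSpeak clips via the Hugging Face `datasets` library.
# NOTE: the repository id, split name, and "type" field are illustrative assumptions;
# check the official release page for the actual identifiers.
from collections import Counter

from datasets import load_dataset

REPO_ID = "faridlab/deepspeak_v1"  # hypothetical repository id

ds = load_dataset(REPO_ID, split="train", trust_remote_code=True)

# Tally real vs. deepfake clips, assuming each record carries a "type" label.
label_counts = Counter(record["type"] for record in ds)
print(label_counts)
```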
We employed a rigorous and systematic approach to collecting the real video data. Using the Prolific research platform, we recruited participants selected for diversity in gender and age and for representation across racial and ethnic backgrounds. Each participant was compensated and given clear instructions to record themselves responding to a set of predefined prompts, capturing both scripted and unscripted content.
The final dataset included a structured collection of responses categorized into scripted, unscripted, and action-based prompts.
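To make that prompt structure concrete, the sketch below groups downloaded real recordings by prompt category. The metadata filename and its column names are hypothetical placeholders standing in for whatever metadata accompanies the release.

```python
# Illustrative sketch: group real recordings by prompt category (scripted, unscripted,
# action-based). The metadata file and its column names are hypothetical placeholders.
import csv
from collections import defaultdict
from pathlib import Path

def build_manifest(metadata_csv: str, video_dir: str) -> dict[str, list[Path]]:
    """Map each prompt category to the recordings collected for it."""
    manifest: dict[str, list[Path]] = defaultdict(list)
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            manifest[row["prompt_type"]].append(Path(video_dir) / row["filename"])
    return dict(manifest)

if __name__ == "__main__":
    for category, files in build_manifest("metadata.csv", "real_videos").items():
        print(f"{category}: {len(files)} recordings")
```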
The deepfake component of DeepSpeak was meticulously crafted using two primary techniques: face-swapping and lip-syncing.
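Conceptually, both techniques start from pairing a target recording with material from a different identity: a face for face-swapping, or a driving audio track for lip-syncing. The sketch below enumerates such generation jobs; the pairing logic and file layout are purely illustrative and are not the exact protocol used to build the dataset.

```python
# Hypothetical sketch of enumerating deepfake generation jobs. Each real target video is
# paired with a different identity, whose face drives a face-swap and whose audio drives
# a lip-sync. This illustrates the general idea only, not the dataset's exact recipe.
import itertools
import random
from dataclasses import dataclass

@dataclass
class GenerationJob:
    kind: str          # "face-swap" or "lip-sync"
    target_video: str  # recording whose scene and body are kept
    source_video: str  # recording supplying the swapped face or the driving audio

def make_jobs(videos_by_identity: dict[str, list[str]], seed: int = 0) -> list[GenerationJob]:
    rng = random.Random(seed)
    jobs: list[GenerationJob] = []
    # Pair every identity with every other identity, once in each direction.
    for target_id, source_id in itertools.permutations(videos_by_identity, 2):
        target = rng.choice(videos_by_identity[target_id])
        source = rng.choice(videos_by_identity[source_id])
        jobs.append(GenerationJob("face-swap", target, source))
        jobs.append(GenerationJob("lip-sync", target, source))
    return jobs
```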
We utilized three unique configurations for the face-swap process, leveraging:
For the lip-sync aspect, two prominent techniques were employed:
The primary objective of the DeepSpeak dataset is to assist the digital forensics community in developing and refining detection techniques for deepfake content. Given the rapid advancement of generative AI technologies, we anticipate updating DeepSpeak with fresh data biannually, ensuring that the dataset remains relevant and useful for research, educational, and forensic applications.
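As one concrete, deliberately simple example of the kind of detector such a dataset can support, the sketch below repurposes a pretrained image backbone as a frame-level real/fake classifier. This is a common baseline pattern, not a method from our paper, and the class count and input sizes are illustrative.

```python
# A minimal, illustrative baseline for DeepSpeak-style data: a frame-level real/fake
# classifier built on a pretrained ResNet. A sketch of one common approach, not a
# method from the paper.
import torch
import torch.nn as nn
from torchvision import models

def build_detector(num_classes: int = 2) -> nn.Module:
    """ResNet-18 backbone with its final layer replaced for real/fake classification."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

if __name__ == "__main__":
    detector = build_detector()
    frames = torch.randn(4, 3, 224, 224)  # a batch of video frames (placeholder data)
    logits = detector(frames)             # per-frame real/fake scores
    print(logits.shape)                   # torch.Size([4, 2])
```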
DeepSpeak serves as an essential resource for researchers, educators, and practitioners interested in the detection of deepfake media. By offering a rich collection of diverse audio-visual data, we hope to empower the community to better understand and mitigate the implications of deepfakes in today's digital landscape.
If you're interested in leveraging the DeepSpeak dataset for your research or projects, feel free to access it here.
We welcome feedback and collaborative efforts to advance this critical area of study!