SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection

April 27, 2026

Introduction

SynSFX (Synthetic Sound Effects) is a large-scale benchmark for non-speech audio deepfake detection. While speech anti-spoofing has advanced rapidly, detectors trained on speech, such as AASIST and RawNet2, often collapse to near-random performance on synthetic sound effects.

SynSFX addresses this gap with a transparent, reproducible corpus of 43,374 audio clips totaling approximately 180 hours, spanning both authentic environmental recordings and outputs from seven state-of-the-art text-to-audio (TTA) models.

Key design features:

Unprecedented scale for isolated sound-effect forensics
Seven diverse generators (diffusion, latent diffusion, transformer, multimodal, and flow-matching architectures)
Shared Prompt Subset: 1,890 identical prompts across all generators, enabling controlled cross-model comparison
Predefined train / validation / test splits for standardized benchmarking

📥 Download link: SynSFX dataset on Hugging Face
📄 Paper: SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection and Evaluation (IEEE submission)
License: Academic research only.

Data Sources

SynSFX comprises two primary partitions:

Authentic Audio Subset (16,922 clips · ~120 h)

Curated from five established open-source repositories:

Source	Clips	Duration	Role
AudioCaps	4,000	11.0 h	Environmental recordings with captions
Clotho	3,839	24.0 h	Natural environmental audio
ESC-50	2,000	2.8 h	50 everyday sound categories
TACoS	5,000	31.1 h	Activity-related audio events
WavCaps	2,083	50.9 h	Large-scale web-sourced ambient sounds
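
The subset totals in the heading can be cross-checked against the per-source rows. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative and not part of the release, and the figures are copied from the table above.

```python
# Clip counts and durations (hours) copied from the source table above.
AUTHENTIC_SOURCES = {
    "AudioCaps": (4000, 11.0),
    "Clotho": (3839, 24.0),
    "ESC-50": (2000, 2.8),
    "TACoS": (5000, 31.1),
    "WavCaps": (2083, 50.9),
}


def subset_totals(sources):
    """Return (total clips, total hours rounded to one decimal place)."""
    clips = sum(n for n, _ in sources.values())
    hours = round(sum(h for _, h in sources.values()), 1)
    return clips, hours


total_clips, total_hours = subset_totals(AUTHENTIC_SOURCES)
print(total_clips, total_hours)  # 16922 119.8
```

The per-source rows add up to 16,922 clips and 119.8 h, which agrees with the "16,922 clips · ~120 h" figure in the subset heading.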

Synthetic Audio Subset (26,460 clips · ~58 h)

Generated using seven TTA architectures, anonymized as A1–A7 in the released corpus:

Model	Architecture family	Sample rate
A1	Diffusion (AudioLDM v1)	16 kHz
A2	Diffusion (AudioLDM v2)	16 kHz
A3	Transformer (AudioCraft / AudioGen)	16 kHz
A4	Latent diffusion (Stable Audio)	44.1 kHz
A5	Diffusion (Make-An-Audio)	16 kHz
A6	Multimodal (MMAudio)	44.1 kHz
A7	Flow matching (TangoFlux)	44.1 kHz

Prompts were expanded from concise baselines (e.g. "footsteps on gravel") into rich scene descriptions using LLMs (ChatGPT, Gemini), then filtered by human review.

Corpus Architecture

The corpus uses 28,350 unique textual prompts, structured for both diversity and controlled comparison:

Subset	Prompts	Description
Shared Prompt Subset	1,890	Identical prompts sent to all seven generators, isolating generator-specific artifacts
Exclusive Prompt Subsets	~1,890 per model	Unique prompts per architecture, ensuring statistical balance
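
One way to use the Shared Prompt Subset for controlled comparison is to group clips by prompt ID, so that each group holds one clip per generator for the same text. The sketch below assumes records shaped like the fields listed in the Metadata section (audio_path, generator, prompt_id, split_tag); the helper names and the toy paths and IDs are invented for illustration and are not part of the release.

```python
from collections import defaultdict

GENERATORS = ("A1", "A2", "A3", "A4", "A5", "A6", "A7")


def group_shared_prompts(records):
    """Map prompt_id -> {generator: audio_path} for shared-prompt clips."""
    groups = defaultdict(dict)
    for rec in records:
        if rec["split_tag"] == "shared_prompt":
            groups[rec["prompt_id"]][rec["generator"]] = rec["audio_path"]
    return dict(groups)


def complete_groups(groups, generators=GENERATORS):
    """Keep only prompts that were rendered by every listed generator."""
    return {
        pid: clips
        for pid, clips in groups.items()
        if all(g in clips for g in generators)
    }


# Toy records; paths and prompt IDs are invented for illustration.
records = [
    {"audio_path": f"clips/{g}/000001.wav", "generator": g,
     "prompt_id": "SP-0001", "split_tag": "shared_prompt"}
    for g in GENERATORS
] + [
    {"audio_path": "clips/A1/000002.wav", "generator": "A1",
     "prompt_id": "SP-0002", "split_tag": "shared_prompt"},
]

groups = group_shared_prompts(records)
full = complete_groups(groups)
print(sorted(full))  # ['SP-0001']
```

If each generator renders the 1,890 shared prompts plus roughly 1,890 exclusive ones, that gives about 3,780 clips per generator, and 7 × 3,780 = 26,460 matches the size of the Synthetic Audio Subset reported above.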

All clips are stored as uncompressed WAV at each model's native sample rate to preserve generation artifacts.
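
Because clips keep their native sample rate, a loader may want to confirm that each file's header matches the rate listed for its generator before any resampling. The following is a minimal sketch using Python's standard `wave` module; it assumes the files are PCM-encoded WAV (the `wave` module may reject other encodings such as floating point), and the function names are hypothetical.

```python
import wave

# Sample rates per generator, copied from the model table above.
EXPECTED_RATE_HZ = {
    "A1": 16000, "A2": 16000, "A3": 16000, "A4": 44100,
    "A5": 16000, "A6": 44100, "A7": 44100,
}


def wav_sample_rate(path):
    """Read the sample rate from a WAV header without decoding the audio."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate()


def rate_matches(path, generator):
    """True if the file's sample rate equals the generator's listed rate."""
    return wav_sample_rate(path) == EXPECTED_RATE_HZ[generator]
```

A call such as `rate_matches("clips/A3/000142.wav", "A3")` would then flag files that were resampled or mislabeled, assuming the directory layout shown in the metadata examples.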

Dataset Split

SynSFX is released with predefined splits for reproducible evaluation:

Partition	Clips (approx.)	Purpose
Train	~30,400	Model fine-tuning
Validation	~4,300	Hyperparameter tuning
Test (in-domain)	~4,300	Evaluation on seen generators (A1–A7)
Test (out-of-domain)	1,113	Zero-shot evaluation on an unseen commercial generator plus real audio from UrbanSound8K

Metadata

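Each split ships with a metadata file: synsfx_train.txt (train), synsfx_dev.txt (validation), and synsfx_test.txt (test). Each line contains the following five fields; two example entries follow the field names: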

audio_path	generator	prompt_id	class_label	split_tag
clips/A3/000142.wav	A3	SP-0042	synthetic	shared_prompt
clips/real/clotho_0183.wav	-	-	authentic	exclusive
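
The metadata lines above can be read with a few lines of Python. This is a sketch under stated assumptions: it assumes the fields are tab-separated, that a "-" marks a field that does not apply (as in the authentic example), and that the file may or may not begin with the field-name row. `ClipRecord` and `parse_metadata` are hypothetical names, not part of the release.

```python
import csv
import io
from typing import Iterator, NamedTuple, Optional


class ClipRecord(NamedTuple):
    audio_path: str
    generator: Optional[str]   # None for authentic clips
    prompt_id: Optional[str]   # None for authentic clips
    class_label: str
    split_tag: str


HEADER = ["audio_path", "generator", "prompt_id", "class_label", "split_tag"]


def parse_metadata(lines) -> Iterator[ClipRecord]:
    """Yield one ClipRecord per data row; '-' fields become None."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row == HEADER:
            continue  # skip blank lines and an optional field-name row
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} fields, got {len(row)}: {row}")
        path, gen, pid, label, tag = row
        yield ClipRecord(
            path,
            None if gen == "-" else gen,
            None if pid == "-" else pid,
            label,
            tag,
        )


# The two example entries from above, with an assumed tab delimiter.
sample = io.StringIO(
    "audio_path\tgenerator\tprompt_id\tclass_label\tsplit_tag\n"
    "clips/A3/000142.wav\tA3\tSP-0042\tsynthetic\tshared_prompt\n"
    "clips/real/clotho_0183.wav\t-\t-\tauthentic\texclusive\n"
)
records = list(parse_metadata(sample))
print(records[0].generator, records[1].generator)  # A3 None
```

To read a real split, the same function can be applied to an open file, for example `parse_metadata(open("synsfx_train.txt", newline="", encoding="utf-8"))`, assuming the file sits in the working directory.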