SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection
April 27, 2026

Introduction
SynSFX (Synthetic Sound Effects) is a large-scale benchmark for non-speech audio deepfake detection. While speech anti-spoofing has advanced rapidly, detectors trained on speech—such as AASIST and RawNet2—often collapse to near-random performance on synthetic sound effects.
SynSFX addresses this gap with a transparent, reproducible corpus of 43,374 audio clips totaling approximately 180 hours, spanning both authentic environmental recordings and outputs from seven state-of-the-art text-to-audio (TTA) models.
Key design features:
- Unprecedented scale for isolated sound-effect forensics
- Seven diverse generators (diffusion, transformer, and latent architectures)
- Shared Prompt Subset: 1,890 identical prompts across all generators, enabling controlled cross-model comparison
- Predefined train / validation / test splits for standardized benchmarking
- 📥 Download link: SynSFX dataset on Hugging Face
- 📄 Paper: SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection and Evaluation (IEEE submission)
- License: Academic research only.
Data Sources
SynSFX comprises two primary partitions:
Authentic Audio Subset (16,922 clips · ~120 h)
Curated from five established open-source repositories:
| Source | Clips | Duration | Role |
|---|---|---|---|
| AudioCaps | 4,000 | 11.0 h | Environmental recordings with captions |
| Clotho | 3,839 | 24.0 h | Natural environmental audio |
| ESC-50 | 2,000 | 2.8 h | 50 everyday sound categories |
| TACoS | 5,000 | 31.1 h | Activity-related audio events |
| WavCaps | 2,083 | 50.9 h | Large-scale web-sourced ambient sounds |
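
The subset totals in the heading can be re-derived from the per-source rows. The snippet below is a small sanity check, not part of the released tooling; the dictionary name `AUTHENTIC_SOURCES` is made up for illustration, and the figures are copied from the table above.

```python
# Per-source figures copied from the table above: (clips, duration in hours).
# AUTHENTIC_SOURCES is an illustrative name, not something shipped with SynSFX.
AUTHENTIC_SOURCES = {
    "AudioCaps": (4000, 11.0),
    "Clotho": (3839, 24.0),
    "ESC-50": (2000, 2.8),
    "TACoS": (5000, 31.1),
    "WavCaps": (2083, 50.9),
}

total_clips = sum(clips for clips, _ in AUTHENTIC_SOURCES.values())
total_hours = sum(hours for _, hours in AUTHENTIC_SOURCES.values())
print(f"{total_clips:,} clips, {total_hours:.1f} h")  # 16,922 clips, 119.8 h

# Mean clip length per source, in seconds, derived from the same two columns.
mean_seconds = {
    name: hours * 3600 / clips
    for name, (clips, hours) in AUTHENTIC_SOURCES.items()
}
for name, seconds in mean_seconds.items():
    print(f"{name}: {seconds:.1f} s per clip on average")
```

The mean clip length differs considerably between sources, which may matter if a detector is sensitive to clip duration.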
Synthetic Audio Subset (26,460 clips · ~58 h)
Generated using seven TTA architectures, anonymized as A1–A7 in the released corpus:
| Model | Architecture family | Sample rate |
|---|---|---|
| A1 | Diffusion (AudioLDM v1) | 16 kHz |
| A2 | Diffusion (AudioLDM v2) | 16 kHz |
| A3 | Transformer (AudioCraft / AudioGen) | 16 kHz |
| A4 | Latent diffusion (Stable Audio) | 44.1 kHz |
| A5 | Diffusion (Make-An-Audio) | 16 kHz |
| A6 | Multimodal (MMAudio) | 44.1 kHz |
| A7 | Flow matching (TangoFlux) | 44.1 kHz |
Prompts were expanded from concise baselines (e.g. "footsteps on gravel") into rich scene descriptions using LLMs (ChatGPT, Gemini), then filtered by human review.
Corpus Architecture
The corpus uses 28,350 unique textual prompts, structured for both diversity and controlled comparison:
| Subset | Prompts | Description |
|---|---|---|
| Shared Prompt Subset | 1,890 | Identical prompts sent to all seven generators — isolates generator-specific artifacts |
| Exclusive Prompt Subsets | ~1,890 per model | Unique prompts per architecture, ensuring statistical balance |
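
As a worked example of how the prompt structure relates to the synthetic clip count, the arithmetic below assumes that every prompt yields exactly one clip per generator and treats the approximate "~1,890 per model" figure as exact. Both are assumptions made for illustration; the constant names are hypothetical.

```python
NUM_GENERATORS = 7           # A1 through A7
SHARED_PROMPTS = 1890        # identical prompts sent to every generator
EXCLUSIVE_PER_MODEL = 1890   # "~1,890 per model", treated as exact here

# Assumption: one clip is rendered per (prompt, generator) pair.
clips_per_generator = SHARED_PROMPTS + EXCLUSIVE_PER_MODEL
synthetic_clips = NUM_GENERATORS * clips_per_generator
shared_clips = NUM_GENERATORS * SHARED_PROMPTS

print(clips_per_generator)  # 3780
print(synthetic_clips)      # 26460
print(shared_clips)         # 13230
```

Under these assumptions the result agrees with the 26,460 clips reported for the Synthetic Audio Subset, and half of those clips belong to the Shared Prompt Subset.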
All clips are stored as uncompressed WAV at each model's native sample rate to preserve generation artifacts.
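Because sample rates differ between generators, it can be useful to read each clip's header before batching. The sketch below uses Python's standard `wave` module and assumes the clips are integer PCM WAV files, which the module can read; if the release uses floating-point WAV, a third-party reader would be needed instead. The helper name `wav_info` and the `EXPECTED_RATE_HZ` mapping are illustrative, with the rates taken from the generator table above.

```python
import os
import tempfile
import wave

# Native sample rates per generator, copied from the generator table.
EXPECTED_RATE_HZ = {
    "A1": 16000, "A2": 16000, "A3": 16000, "A5": 16000,
    "A4": 44100, "A6": 44100, "A7": 44100,
}

def wav_info(path):
    """Return (sample_rate_hz, channels, duration_s) read from a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate

# Demo on a generated one-second silent mono clip, so no dataset files are needed.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(44100)
        wf.writeframes(b"\x00\x00" * 44100)
    info = wav_info(demo_path)

print(info)                              # (44100, 1, 1.0)
print(info[0] == EXPECTED_RATE_HZ["A4"])  # True
```

A detector that resamples everything to a single rate may discard some of the artifacts that native-rate storage is meant to preserve, so any resampling step is worth reporting alongside results.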
Dataset Split
SynSFX is released with predefined splits for reproducible evaluation:
| Partition | Clips (approx.) | Purpose |
|---|---|---|
| Train | ~30,400 | Model fine-tuning |
| Validation | ~4,300 | Hyperparameter tuning |
| Test (in-domain) | ~4,300 | Evaluation on seen generators (A1–A7) |
| Test (out-of-domain) | 1,113 | Zero-shot evaluation on an unseen commercial generator plus real audio from UrbanSound8K |
Metadata
Each split ships with a metadata file (synsfx_train.txt, synsfx_dev.txt, synsfx_test.txt). Each line contains:
| audio_path | generator | prompt_id | class_label | split_tag |
|---|---|---|---|---|
| clips/A3/000142.wav | A3 | SP-0042 | synthetic | shared_prompt |
| clips/real/clotho_0183.wav | - | - | authentic | exclusive |
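
A minimal sketch of loading these records follows. The on-disk delimiter is not stated here, so the parser assumes whitespace-separated fields by default and accepts a different separator through `sep`; the function names `parse_metadata` and `shared_prompt_groups` are hypothetical. The demo uses only the two example rows shown above.

```python
from collections import defaultdict

FIELDS = ("audio_path", "generator", "prompt_id", "class_label", "split_tag")

def parse_metadata(lines, sep=None):
    """Parse metadata lines into dicts keyed by FIELDS.

    sep=None splits on any whitespace (an assumption about the file format);
    pass e.g. sep="\t" if the files turn out to be tab-delimited.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        values = [value.strip() for value in line.split(sep)]
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
        rows.append(dict(zip(FIELDS, values)))
    return rows

def shared_prompt_groups(rows):
    """Map prompt_id -> {generator: audio_path} for shared-prompt synthetic clips."""
    groups = defaultdict(dict)
    for row in rows:
        if row["class_label"] == "synthetic" and row["split_tag"] == "shared_prompt":
            groups[row["prompt_id"]][row["generator"]] = row["audio_path"]
    return dict(groups)

# The two example rows from the table above, written as whitespace-separated lines.
sample = """\
clips/A3/000142.wav A3 SP-0042 synthetic shared_prompt
clips/real/clotho_0183.wav - - authentic exclusive
"""

rows = parse_metadata(sample.splitlines())
groups = shared_prompt_groups(rows)
print(groups)  # {'SP-0042': {'A3': 'clips/A3/000142.wav'}}
```

Grouping by `prompt_id` in this way lines up the outputs of different generators for the same prompt, which is the controlled comparison the Shared Prompt Subset is designed for.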
