Singing Voice Deepfake Detection

Yongyi Zang*, You Zhang* (Equal contribution), Mojtaba Heydari, Zhiyao Duan

Audio Information Research Lab, University of Rochester

TL;DR We propose the novel task of singing voice deepfake detection (SVDD) and present our collected dataset SingFake.

Submitted to ICASSP 2024

Due to download constraints, only a portion of SingFake is used in the ICASSP 2024 paper. Download SingFake (as used in paper).


The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these systems lag significantly behind their performance on speech test data. When trained on SingFake, either using separated vocal tracks or song mixtures, these systems show substantial improvement. However, our evaluations also identify challenges associated with unseen singers, communication codecs, languages, and musical contexts, calling for dedicated research into singing voice deepfake detection.

Dataset Design

We source deepfake singing samples from publicly available popular user-generated content websites where users upload both bonafide and deepfake samples of singing. We provide metadata annotations for these URLs.

Set | Bonafide Or Spoof | Language | Singer | Model
Training | spoof | Mandarin | Bella_Yao | DiffSinger
Training | bonafide | Mandarin | Bella_Yao | N/A
... | ... | ... | ... | ...
Validation | spoof | Cantonese | G.E.M. | Link
Validation | bonafide | Spanish | G.E.M. | N/A
... | ... | ... | ... | ...
T01 | spoof | Mandarin | Stefanie_Sun | unknown
T01 | bonafide | Mandarin | Stefanie_Sun | N/A
... | ... | ... | ... | ...
T02 | spoof | Mandarin | Angela_Chang | Sovits4.0
T02 | bonafide | Mandarin | Angela_Chang | N/A
... | ... | ... | ... | ...
T04 | Spoof | Persian | Dariush | unknown
T04 | Bonafide | Persian | Dariush | unknown
... | ... | ... | ... | ...

Dataframe header along with several data samples of SingFake URL Annotations.
URL and Title fields are omitted to save space.
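
As a minimal sketch of how such annotations might be consumed, the snippet below parses a few rows copied from the excerpt above and counts labels per set. The CSV layout and the helper names `load_rows` and `label_counts` are assumptions for illustration; the distributed annotation file may be organized differently. Because label casing varies in the excerpt ("spoof" vs. "Spoof"), the sketch lowercases labels before counting.

```python
import csv
import io
from collections import Counter

# Rows copied from the excerpt above. The CSV layout is an assumption of this
# sketch; the real annotations also carry URL and Title fields.
SAMPLE = """Set,Bonafide Or Spoof,Language,Singer,Model
Training,spoof,Mandarin,Bella_Yao,DiffSinger
Training,bonafide,Mandarin,Bella_Yao,N/A
T01,spoof,Mandarin,Stefanie_Sun,unknown
T01,bonafide,Mandarin,Stefanie_Sun,N/A
T02,spoof,Mandarin,Angela_Chang,Sovits4.0
T02,bonafide,Mandarin,Angela_Chang,N/A
T04,Spoof,Persian,Dariush,unknown
T04,Bonafide,Persian,Dariush,unknown
"""


def load_rows(text):
    """Parse annotation rows and normalize the label casing."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["Bonafide Or Spoof"] = row["Bonafide Or Spoof"].strip().lower()
    return rows


def label_counts(rows):
    """Count rows per (set, label) pair."""
    return Counter((row["Set"], row["Bonafide Or Spoof"]) for row in rows)


rows = load_rows(SAMPLE)
counts = label_counts(rows)
for (subset, label), n in sorted(counts.items()):
    print(subset, label, n)
```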

We also provide a dataset split for extensive and controlled evaluation. The dataset was split into train, validation, and test sets to ensure singers were distinct in each split. Test set T01 contains the seen-in-training singer Stefanie Sun to evaluate performance on a familiar singer; test set T02 has 6 unseen singers to evaluate generalization. Test set T03 simulates lossy codecs (MP3 128 kbps, AAC 64 kbps, OPUS 64 kbps, and Vorbis 64 kbps) by compressing the T02 audio. Test set T04 contains Persian singers to evaluate the effects of language and musical style differences.
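
The codec simulation for T03 could be scripted roughly as follows. This is only a sketch: the codecs and bitrates come from the description above, but the use of ffmpeg, the encoder names (`libmp3lame`, `aac`, `libopus`, `libvorbis`), the container extensions, and the helper `encode_command` are assumptions, not the tooling confirmed for SingFake. The function only builds the command; running it requires an ffmpeg build with these encoders.

```python
import os

# Bitrates follow the T03 description; encoder names and file extensions are
# assumptions of this sketch.
CODECS = {
    "mp3_128k": ("libmp3lame", "128k", "mp3"),
    "aac_64k": ("aac", "64k", "m4a"),
    "opus_64k": ("libopus", "64k", "opus"),
    "vorbis_64k": ("libvorbis", "64k", "ogg"),
}


def encode_command(src, codec_key):
    """Return an ffmpeg argument list that compresses src with one codec."""
    encoder, bitrate, ext = CODECS[codec_key]
    stem, _ = os.path.splitext(src)
    dst = f"{stem}_{codec_key}.{ext}"
    return ["ffmpeg", "-y", "-i", src, "-c:a", encoder, "-b:a", bitrate, dst]


for key in CODECS:
    print(" ".join(encode_command("clip.wav", key)))
```

Each T02 clip would yield four compressed versions, which matches T03 containing the T02 clips repeated once per codec.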

SingFake dataset partition. Each color represents a subset, and each slice denotes an AI singer.
T03 is excluded here since it contains the same song clips as T02 but is repeated 4 times through 4 different codecs.

Samples from the SingFake Dataset

We select several samples from SingFake for demonstration.

T01, Spoof, Stefanie Sun

T02, Spoof, Eason Chen

Training, Spoof, Jay Chou

Separation Pipeline

We used Demucs, a state-of-the-art music source separation model, to extract the vocals from each song. The extracted vocals were processed through PyAnnote's Voice Activity Detection pipeline to identify active singing regions and segment the vocals and original mixes into clips. All clips were resampled to 16 kHz.
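
The segmentation step can be sketched independently of the separation and VAD models. The snippet below assumes the active singing regions are already available as `(start_sec, end_sec)` pairs and that the separated vocals and the original mixture are time-aligned sample sequences at 16 kHz; the function name `slice_clips` and this input format are hypothetical, not part of the Demucs or PyAnnote APIs.

```python
SAMPLE_RATE = 16_000  # all clips are resampled to 16 kHz


def slice_clips(vocals, mixture, regions, sample_rate=SAMPLE_RATE):
    """Cut matching clips from both tracks for each (start_sec, end_sec) region.

    The regions are assumed to come from a VAD run on the separated vocals,
    so the same time span is cut from the vocals and from the original mix.
    """
    if len(vocals) != len(mixture):
        raise ValueError("vocals and mixture must be time-aligned and equal length")
    clips = []
    for start_sec, end_sec in regions:
        start = max(0, int(round(start_sec * sample_rate)))
        end = min(len(vocals), int(round(end_sec * sample_rate)))
        if end > start:  # skip empty or out-of-range regions
            clips.append((vocals[start:end], mixture[start:end]))
    return clips


# Three seconds of placeholder samples standing in for real audio.
vocals = list(range(3 * SAMPLE_RATE))
mixture = [0] * (3 * SAMPLE_RATE)
clips = slice_clips(vocals, mixture, [(0.5, 1.0), (2.5, 4.0)])
print([len(v) for v, _ in clips])
```

Regions that run past the end of the track are clipped to the track length rather than padded.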


After Separation

A vocal separation example. The mixture and separated vocals are visualized as 128-bin mel spectrograms at a 16 kHz sampling rate.

Speech CMs degrade heavily on the SVDD task

We train speech countermeasure (CM) systems on ASVspoof 2019 for 100 epochs and select the best checkpoint on the validation set. All systems perform well on the ASVspoof 2019 evaluation data. When tested on SingFake T02 singing data, however, performance degrades sharply, with EERs around 50% on song mixtures (45.12% to 58.12%). On separated vocals, the EERs of AASIST and Spectrogram+ResNet improve to about 38%, suggesting that vocals are more speech-like without accompaniment. The LFCC+ResNet and Wav2Vec2+AASIST systems still have EERs above 50%, indicating that they overfit to speech and do not generalize to singing.

Method | ASVspoof2019 LA Eval | SingFake-T02 Mixture | SingFake-T02 Vocals
AASIST | 0.83 | 58.12 | 37.91
Spectrogram+ResNet | 4.57 | 51.87 | 37.65
LFCC+ResNet | 2.41 | 45.12 | 54.88
Wav2Vec2+AASIST | 7.03 | 56.75 | 57.26

Test results (EER, %) on speech and singing voice for CM systems trained on speech utterances from ASVspoof2019 LA.
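
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate (spoof accepted as bonafide) equals the false rejection rate (bonafide rejected as spoof); an EER near 50% corresponds to chance-level detection. The following is a minimal threshold-sweep sketch, assuming higher scores mean "more likely bonafide". It is not the evaluation code used for the table above, and the function name `compute_eer` and the toy scores are illustrative.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumes a higher score means the system considers the clip more likely
    to be bonafide. Returns a fraction in [0, 1].
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bonafide clips scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoof clips scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one bonafide clip scores low and one spoof clip scores high.
print(compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6]))  # 0.25
```

With finite score sets the two error rates may never be exactly equal, so the sketch reports their average at the threshold where they are closest.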

Training on singing voices improves SVDD performance

We trained models on the SingFake dataset to see whether doing so improves performance. Models were trained on either full song mixtures or separated vocals. Performance generally declined from the training set to T01 (seen singers, unseen songs), T02 (unseen singers, unseen songs), T03 (T02 with unseen codecs), and T04 (unseen language and musical context), showing increasing difficulty. All systems reached low EERs on the training set (1.57% to 10.55%), suggesting that SingFake helps them learn SVDD. Systems trained on separated vocals generally outperformed those trained on mixtures, except Wav2Vec2+AASIST. This indicates that separated vocals expose deepfake artifacts more clearly, while mixtures contain more interference. Wav2Vec2+AASIST performed best overall, excelling at learning from mixtures and showing robustness.

Method | Setting | Train | T01 | T02 | T03 | T04
AASIST | Mixture | 4.10 | 7.29 | 11.54 | 17.29 | 38.54
AASIST | Vocals | 3.39 | 8.37 | 10.65 | 13.07 | 43.94
Spectrogram+ResNet | Mixture | 4.97 | 14.88 | 22.59 | 24.15 | 48.76
Spectrogram+ResNet | Vocals | 5.31 | 11.86 | 19.69 | 21.54 | 43.94
LFCC+ResNet | Mixture | 10.55 | 21.35 | 32.40 | 31.85 | 50.07
LFCC+ResNet | Vocals | 2.90 | 15.88 | 22.56 | 23.62 | 39.27
Wav2Vec2+AASIST (Joint-finetune) | Mixture | 1.57 | 4.62 | 8.23 | 13.62 | 42.77
Wav2Vec2+AASIST (Joint-finetune) | Vocals | 1.70 | 5.39 | 9.10 | 10.03 | 42.19

Evaluation results for SVDD systems on all testing conditions in our SingFake dataset (EER (%)).
Best setting for each set is shown in bold.
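
To make the mixture-versus-vocals comparison concrete, the short tally below lists, for each method, the test sets on which training on separated vocals gives a lower EER than training on mixtures. The numbers are copied from the table above; the dictionary layout and the helper name `vocals_wins` are illustrative only.

```python
TEST_SETS = ("T01", "T02", "T03", "T04")

# EER (%) on T01..T04, copied from the table above.
RESULTS = {
    "AASIST": {
        "Mixture": (7.29, 11.54, 17.29, 38.54),
        "Vocals": (8.37, 10.65, 13.07, 43.94),
    },
    "Spectrogram+ResNet": {
        "Mixture": (14.88, 22.59, 24.15, 48.76),
        "Vocals": (11.86, 19.69, 21.54, 43.94),
    },
    "LFCC+ResNet": {
        "Mixture": (21.35, 32.40, 31.85, 50.07),
        "Vocals": (15.88, 22.56, 23.62, 39.27),
    },
    "Wav2Vec2+AASIST": {
        "Mixture": (4.62, 8.23, 13.62, 42.77),
        "Vocals": (5.39, 9.10, 10.03, 42.19),
    },
}


def vocals_wins(results):
    """For each method, list the test sets where the vocals setting has lower EER."""
    wins = {}
    for method, settings in results.items():
        pairs = zip(TEST_SETS, settings["Mixture"], settings["Vocals"])
        wins[method] = [name for name, mix, voc in pairs if voc < mix]
    return wins


for method, sets in vocals_wins(RESULTS).items():
    print(f"{method}: vocals better on {', '.join(sets)}")
```

In this tally the vocals setting is better on every test set for the two ResNet systems, and on two of the four test sets for AASIST and Wav2Vec2+AASIST.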

Call for participation

Detecting deepfakes in singing voices poses unique challenges stemming from diverse instrumental accompaniments and music genres, as discussed in the paper.

As AI-generated content causes distrust in artistic domains, transparency about the origin of content is crucial for rebuilding that trust. SVDD research could empower the general public to make informed decisions.
We invite the community to participate in advancing SVDD research. If you are interested in contributing, please reach out. Robust SVDD will arise from collaborative efforts.

We eagerly anticipate the research community driving progress in this important area. Together, we can meet the challenges of this complex task.