SingFake: Singing Voice Deepfake Detection

Yongyi Zang*, You Zhang* (Equal contribution), Mojtaba Heydari, Zhiyao Duan

yongyi.zang@rochester.edu, you.zhang@rochester.edu, mheydari@ur.rochester.edu, zhiyao.duan@rochester.edu

Audio Information Research Lab, University of Rochester

TL;DR We propose the novel task of singing voice deepfake detection (SVDD) and present our collected dataset SingFake.

Accepted to ICASSP 2024

Due to download constraints, only a portion of SingFake is used in the ICASSP 2024 paper. Download SingFake (as used in paper).

Abstract

The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these systems lag significantly behind their performance on speech test data. When trained on SingFake, either using separated vocal tracks or song mixtures, these systems show substantial improvement. However, our evaluations also identify challenges associated with unseen singers, communication codecs, languages, and musical contexts, calling for dedicated research into singing voice deepfake detection.

Dataset Design

We source deepfake singing samples from publicly available popular user-generated content websites where users upload both bonafide and deepfake samples of singing. We provide metadata annotations for these URLs.

Set	Bonafide Or Spoof	Language	Singer	Model
Training	spoof	Mandarin	Bella_Yao	DiffSinger
Training	bonafide	Mandarin	Bella_Yao	N/A
...	...	...	...	...
Validation	spoof	Cantonese	G.E.M.	Link
Validation	bonafide	Spanish	G.E.M.	N/A
...	...	...	...	...
T01	spoof	Mandarin	Stefanie_Sun	unknown
T01	bonafide	Mandarin	Stefanie_Sun	N/A
...	...	...	...	...
T02	spoof	Mandarin	Angela_Chang	Sovits4.0
T02	bonafide	Mandarin	Angela_Chang	N/A
...	...	...	...	...
T04	spoof	Persian	Dariush	unknown
T04	bonafide	Persian	Dariush	unknown
...	...	...	...	...

Dataframe header along with several data samples of SingFake URL Annotations.
URL and Title fields are omitted to save space.
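
As a rough illustration of how annotations with this header could be filtered by subset and label, here is a minimal sketch. It assumes the annotations are available as tab-separated text; the actual file name and delimiter are not stated on this page, and the helper names `load_rows` and `select` are hypothetical. The sample rows are copied from the table above.

```python
import csv
import io

# Sample rows copied from the table above. The tab delimiter is an assumption,
# and the real annotations also carry URL and Title fields (omitted here).
SAMPLE_TSV = """Set\tBonafide Or Spoof\tLanguage\tSinger\tModel
Training\tspoof\tMandarin\tBella_Yao\tDiffSinger
Training\tbonafide\tMandarin\tBella_Yao\tN/A
T01\tspoof\tMandarin\tStefanie_Sun\tunknown
T02\tspoof\tMandarin\tAngela_Chang\tSovits4.0
T04\tSpoof\tPersian\tDariush\tunknown
"""


def load_rows(text):
    """Parse tab-separated annotation text into a list of dicts (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def select(rows, subset=None, label=None):
    """Keep rows matching a subset name and/or a label, ignoring label case."""
    selected = []
    for row in rows:
        if subset is not None and row["Set"] != subset:
            continue
        if label is not None and row["Bonafide Or Spoof"].lower() != label.lower():
            continue
        selected.append(row)
    return selected


rows = load_rows(SAMPLE_TSV)
spoof_rows = select(rows, label="spoof")
training_rows = select(rows, subset="Training")
print(len(rows), len(spoof_rows), len(training_rows))
```

The label comparison is case-insensitive so that variations in label capitalization do not silently drop rows.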

We also provide a dataset split for extensive and controlled evaluation. The dataset was split into train, validation, and test sets to ensure singers were distinct in each split. Test set T01 contains seen-in-training singer Stefanie Sun to evaluate performance on a familiar singer; Test set T02 has 6 unseen singers to evaluate generalization. Test set T03 simulates lossy codecs (MP3 128 Kbps, AAC 64 Kbps, OPUS 64 Kbps, and Vorbis 64 Kbps) by compressing T02 audio. Test set T04 contains Persian singers to evaluate effects of language and musical style differences.
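
The codec simulation for T03 could be reproduced along the following lines. This is a sketch under assumptions: the page does not say which tool was used, so the choice of `ffmpeg`, the encoder names (`libmp3lame`, `aac`, `libopus`, `libvorbis`), the container extensions, and the output directory layout are all illustrative. Only the codec names and bitrates come from the description above. The function only builds the command lines; it does not run them.

```python
from pathlib import Path

# Codec name -> (ffmpeg encoder, bitrate, file extension).
# Bitrates follow the T03 description; encoders and extensions are assumptions.
CODECS = {
    "mp3": ("libmp3lame", "128k", ".mp3"),
    "aac": ("aac", "64k", ".m4a"),
    "opus": ("libopus", "64k", ".opus"),
    "vorbis": ("libvorbis", "64k", ".ogg"),
}


def codec_commands(src, out_dir):
    """Build one ffmpeg command per codec for a single input clip."""
    src = Path(src)
    out_dir = Path(out_dir)
    commands = {}
    for name, (encoder, bitrate, ext) in CODECS.items():
        dst = out_dir / name / (src.stem + ext)
        commands[name] = [
            "ffmpeg", "-y",      # overwrite existing output files
            "-i", str(src),      # input clip (e.g. a T02 clip)
            "-c:a", encoder,     # audio encoder
            "-b:a", bitrate,     # target audio bitrate
            str(dst),
        ]
    return commands


commands = codec_commands("clip.wav", "T03")
for name, cmd in commands.items():
    print(name, " ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` once the output directories exist. Decoding the compressed files back to 16 kHz waveforms before evaluation would be a separate step.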

SingFake dataset partition. Each color represents a subset, and each slice denotes an AI singer.
T03 is excluded here since it contains the same song clips as T02 but is repeated 4 times through 4 different codecs.

Samples from the SingFake Dataset

We select several samples from SingFake for demonstration.

T01, Spoof, Stefanie Sun

T02, Spoof, Eason Chen

Training, Spoof, Jay Chou

Separation Pipeline

We used Demucs, a state-of-the-art music source separation model, to extract the vocals from each song. The extracted vocals were processed through PyAnnote's Voice Activity Detection pipeline to identify active singing regions and segment the vocals and original mixes into clips. All clips were resampled to 16 kHz.
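
The segmentation step can be sketched as follows. This is not the authors' code: it assumes the voice activity detector has already produced a list of `(start, end)` regions in seconds, that the separated vocals and the original mix are time-aligned sample sequences at the same sampling rate, and it uses a hypothetical helper name `slice_clips`. The same regions are applied to both signals so that every vocal clip has a matching mixture clip.

```python
def slice_clips(vocals, mixture, regions, sample_rate=16000):
    """Cut aligned vocal and mixture signals into clips at the given regions.

    vocals, mixture: equal-length sample sequences (lists or arrays).
    regions: iterable of (start_seconds, end_seconds) pairs from a VAD step.
    Returns a list of (vocal_clip, mixture_clip) pairs.
    """
    if len(vocals) != len(mixture):
        raise ValueError("vocals and mixture must be time-aligned and equal in length")
    clips = []
    for start_s, end_s in regions:
        # Convert seconds to sample indices and clamp to the signal bounds.
        start = max(0, int(round(start_s * sample_rate)))
        end = min(len(vocals), int(round(end_s * sample_rate)))
        if end > start:
            clips.append((vocals[start:end], mixture[start:end]))
    return clips


# Toy example at 10 samples per second so the indices are easy to follow.
toy_vocals = list(range(100))
toy_mixture = [v + 1000 for v in toy_vocals]
toy_clips = slice_clips(toy_vocals, toy_mixture, [(1.0, 2.5), (9.0, 12.0)], sample_rate=10)
print([len(v) for v, _ in toy_clips])
```

Regions that extend past the end of the signal are truncated, and empty regions are dropped.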

Mixture

After Separation

A vocal separation example. Mixture and separated vocals are visualized as 128-bin mel spectrograms at a 16 kHz sampling rate.

Speech countermeasure (CM) systems degrade heavily on the SVDD task

We trained speech CM systems on ASVspoof 2019 for 100 epochs and selected the best checkpoint on the validation set. All systems performed well on the ASVspoof 2019 evaluation data. However, when tested on SingFake T02 singing data, performance degraded sharply, with EERs around 50% on song mixtures (45.12% to 58.12%). On separated vocals, the EERs of AASIST and Spectrogram+ResNet improved to about 38%, suggesting that vocals without accompaniment are more speech-like. The LFCC+ResNet and Wav2Vec2+AASIST systems, however, remained above 50% EER on vocals, indicating that they overfit to speech and do not generalize to singing.

Method	ASVspoof2019 LA - Eval	SingFake-T02 Mixture	SingFake-T02 Vocals
AASIST	0.83	58.12	37.91
Spectrogram+ResNet	4.57	51.87	37.65
LFCC+ResNet	2.41	45.12	54.88
Wav2Vec2+AASIST	7.03	56.75	57.26

Test results on speech and singing voice with CM systems trained on speech utterances from ASVspoof 2019 LA (EER (%)).
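
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate (spoof accepted as bonafide) equals the false rejection rate (bonafide rejected as spoof). The sketch below is a simple threshold sweep, assuming that higher scores mean "more likely bonafide". It is not the evaluation code used for the tables; standard toolkits typically interpolate between thresholds, whereas this version returns the average of the two error rates at the threshold where they are closest.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate by sweeping a decision threshold.

    Assumes higher scores indicate bonafide. Returns a fraction in [0, 1].
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # extra threshold that rejects everything
    best_gap, eer = None, None
    for t in thresholds:
        # False acceptance: spoof clips scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bonafide clips scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one bonafide clip scores low and one spoof clip scores high.
toy_bonafide = [0.9, 0.8, 0.3, 0.6]
toy_spoof = [0.1, 0.2, 0.7, 0.4]
toy_eer = compute_eer(toy_bonafide, toy_spoof)
print(f"EER = {100 * toy_eer:.2f}%")
```

An EER near 50% on a balanced test set means the detector is doing no better than chance, which is how the mixture results above should be read.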

Training on singing voices improves SVDD performance

We trained the same systems on the SingFake dataset to see whether this improves performance. Models were trained on either full song mixtures or separated vocals. Performance declined from the training set to T01 (seen singer, unseen songs), T02 (unseen singers, unseen songs), T03 (T02 with unseen codecs), and T04 (unseen language and musical context), showing increasing difficulty. All systems performed well on the training set, suggesting that SingFake helps systems learn SVDD. Systems trained on separated vocals generally outperformed those trained on mixtures, except Wav2Vec2+AASIST. This indicates that separated vocals highlight deepfake artifacts, while mixtures contain more interference. Wav2Vec2+AASIST performed best overall, excelling at learning from mixtures and showing robustness.

Method	Setting	Train	T01	T02	T03	T04
AASIST	Mixture	4.10	7.29	11.54	17.29	38.54
AASIST	Vocals	3.39	8.37	10.65	13.07	43.94
Spectrogram+ResNet	Mixture	4.97	14.88	22.59	24.15	48.76
Spectrogram+ResNet	Vocals	5.31	11.86	19.69	21.54	43.94
LFCC+ResNet	Mixture	10.55	21.35	32.40	31.85	50.07
LFCC+ResNet	Vocals	2.90	15.88	22.56	23.62	39.27
Wav2Vec2+AASIST (Joint-finetune)	Mixture	1.57	4.62	8.23	13.62	42.77
Wav2Vec2+AASIST (Joint-finetune)	Vocals	1.70	5.39	9.10	10.03	42.19

Evaluation results for SVDD systems on all testing conditions in our SingFake dataset (EER (%)).
Best setting for each set is shown in bold.

Call for participation

Detecting deepfakes in singing voices poses unique challenges stemming from the diverse instrumental accompaniments and music genres, as we discussed in this paper:

Artifacts from instrumental accompaniment. Anything less than ideal source separation may mask deepfake detection cues, confusing SVDD systems.
Robustness towards unseen musical contexts. Singing voice deepfake detection systems face challenges in generalizing across diverse musical genres, requiring further research to disentangle genre effects from deepfake cues to enable more genre-agnostic detection.

As AI-generated content causes distrust in artistic domains, transparency around content's origin is crucial for rebuilding that trust. SVDD research could empower the general public to make informed decisions.
We invite participation in advancing SVDD research from the community. If you are interested in contributing, please reach out. Robust SVDD will arise from collaborative efforts.

We eagerly anticipate the research community driving progress in this important area. Together, we can meet the challenges of this complex task.