ADD-GP (Audio Deepfake Detection with Gaussian Processes)

ADD-GP (Audio Deepfake Detection with Gaussian Processes) is a novel, few-shot adaptive framework specifically designed for Audio Deepfake Detection (ADD). It addresses the critical need for detection systems to efficiently adapt to new and evolving deepfake generation methods with minimal data.

The sections below cover its foundations, key capabilities, and evaluation results:

Foundation in Gaussian Processes (GP) and Deep Kernel Learning (DKL):
- ADD-GP is based on a Gaussian Process (GP) classifier. GPs define a distribution over functions and are well-suited for few-shot detection due to their non-parametric nature.
- It uses a Deep Kernel Learning (DKL) framework. DKL combines classical kernels with a neural network transformation, such as applying an RBF kernel on features computed by a deep network. This allows the model to learn highly expressive feature representations while maintaining the flexibility and robustness of GPs.
- The framework utilizes XLS-R features. XLS-R is a Wav2Vec2-based self-supervised model that serves as the front-end feature extractor, providing meaningful audio representations that improve the generalization capability of detection models.
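
The deep kernel idea described above (an RBF kernel applied to features computed by a network) can be sketched in a few lines. This is a minimal illustration, not the ADD-GP implementation: the two-layer `feature_net`, its random weights, the dimensions, and the length scale are all assumptions made for the example, and random vectors stand in for XLS-R features.

```python
import numpy as np

def feature_net(x, w1, w2):
    """Hypothetical two-layer network standing in for the learned transformation."""
    return np.tanh(np.tanh(x @ w1) @ w2)

def rbf_kernel(a, b, lengthscale=1.0):
    """RBF kernel between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def deep_kernel(x1, x2, w1, w2, lengthscale=1.0):
    """Deep kernel: a classical kernel evaluated on network outputs."""
    return rbf_kernel(feature_net(x1, w1, w2), feature_net(x2, w1, w2), lengthscale)

rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 8))   # assumed layer sizes, chosen only for illustration
w2 = rng.normal(size=(8, 4))
x = rng.normal(size=(5, 16))    # random stand-in for pooled audio embeddings
K = deep_kernel(x, x, w1, w2)   # 5 x 5 Gram matrix a GP could use as its covariance
```

In deep kernel learning the network weights and kernel hyperparameters are typically trained jointly; they are fixed at random values here only to show how the two pieces compose.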

Key Capabilities and Benefits:
- Effective Few-Shot Adaptation for New TTS Models: A primary contribution of ADD-GP is its ability to adapt better to new, unseen Text-to-Speech (TTS) models with relatively few samples. This is crucial because current ADD algorithms often generalize poorly to deepfakes from unseen generation models, which is a significant security concern as new models are rapidly introduced. For instance, it was shown to adapt effectively to 11Labs deepfakes, where other methods struggled.
- Personalized Deepfake Detection: ADD-GP can be tailored to detect deepfakes of a specific speaker, which offers greater robustness and adaptability. It supports one-shot adaptation (e.g., using a single generated example) as well as adaptation from a very small number of samples (e.g., 10 real and 10 fake samples from a single individual).
- Robustness to Catastrophic Forgetting: Unlike many other methods that show a decrease in performance on original, known TTS models when adapted to new ones, ADD-GP demonstrates robustness against “catastrophic forgetting”, maintaining its performance on previously seen TTSs.
- Well-Calibrated Uncertainty Estimates: An important benefit of its Bayesian GP approach is that it provides accurate measures of uncertainty, offering users a reliable measure of confidence in its predictions rather than just a binary decision. Calibration curves show that ADD-GP returns well-calibrated results [25, 22 Figure 1].
- Mitigation of Data Collection Challenges: Its few-shot adaptation capability is particularly valuable because many leading commercial TTS models, like 11Labs, are only accessible via paid APIs, limiting the feasibility of large-scale data collection for retraining detection models.
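
The calibration curves mentioned above compare a detector's predicted probability with how often that prediction turns out to be correct. The sketch below shows one common way to compute such a curve and a summary number (expected calibration error). It is a generic illustration, not the paper's evaluation code: the function names, the 10-bin default, and the toy inputs are assumptions.

```python
import numpy as np

def calibration_curve(probs, labels, n_bins=10):
    """Per non-empty bin: mean predicted P(fake), observed fake rate, sample count."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges yields bin indices in 0 .. n_bins - 1.
    idx = np.digitize(probs, edges[1:-1])
    conf, freq, count = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            conf.append(probs[mask].mean())
            freq.append(labels[mask].mean())
            count.append(int(mask.sum()))
    return np.array(conf), np.array(freq), np.array(count)

def expected_calibration_error(probs, labels, n_bins=10):
    """Sample-weighted average gap between confidence and observed frequency."""
    conf, freq, count = calibration_curve(probs, labels, n_bins)
    return float(np.sum(count * np.abs(conf - freq)) / count.sum())

# Toy example: labels are 1 = fake, 0 = real.
toy_probs = [0.15, 0.15, 0.85, 0.85]
toy_labels = [0, 0, 1, 1]
ece = expected_calibration_error(toy_probs, toy_labels)
```

A well-calibrated model has points close to the diagonal (confidence equals observed frequency) and therefore a small expected calibration error.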

Evaluation and Performance:
- ADD-GP’s effectiveness is demonstrated through evaluation on LibriFake, a benchmark dataset specifically constructed to evaluate few-shot and one-shot deepfake audio detection approaches on novel TTS models. LibriFake includes synthetic samples generated by various state-of-the-art voice cloning models, including yourTTS, Whisper-Speech, Vall-e-x, F5-TTS, and 11Labs, with 11Labs chosen as the “unseen TTS” due to poor initial detection performance.
- Experiments show that ADD-GP achieves a very low Equal Error Rate (EER) of 0.1% in in-distribution (ID) TTS scenarios and significantly outperforms baselines in few-shot adaptation to out-of-distribution (OOD) TTS models like 11Labs. For example, with 100-shot adaptation, ADD-GP achieves an EER of 2.78% without MixPro and 0.54% with MixPro, whereas the EERs of the other methods remain much higher.
- In personalized detection, 1-shot adaptation alone (0.61% EER on LibriFake) performs comparably to 100-shot adaptation in the non-personalized experiments [24, 22 Table 2].
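
The results above are reported as Equal Error Rate, the operating point at which the rate of real audio flagged as fake equals the rate of fake audio that is missed. The sketch below shows one simple way to estimate it from detector scores. It is an illustration under stated assumptions (higher score means "more likely fake", labels are 1 = fake and 0 = real, thresholds are swept over the observed scores); the source does not describe the exact procedure used in the experiments.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping thresholds over the observed scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    real = scores[labels == 0]
    fake = scores[labels == 1]
    # Candidate thresholds: every distinct score, plus +inf (nothing flagged).
    thresholds = np.concatenate([np.unique(scores), [np.inf]])
    # A sample is flagged as fake when its score is >= the threshold.
    false_alarm = np.array([(real >= t).mean() for t in thresholds])
    miss = np.array([(fake < t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest and average them.
    i = np.argmin(np.abs(false_alarm - miss))
    return float((false_alarm[i] + miss[i]) / 2.0)

# Toy examples with two real and two fake samples each.
eer_separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
eer_overlap = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```

In the first toy example the classes are perfectly separated, so the estimate is 0; in the second, one real sample outscores one fake sample, so the estimate is 0.5.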

In summary, ADD-GP offers an adaptable, robust, and generalizable approach to detecting synthetic speech as generation methods continue to evolve rapidly.
