ConStat: Performance-Based Contamination Detection in Large Language Models

What constitutes contamination?

We argue that in the context of LLMs, the traditional definition of contamination is not sufficient. We propose a new performance-based definition of contamination, which defines contamination based on its outcome rather than its cause.

In traditional machine learning, contamination refers to any information flow between the performance benchmark and model training. For LLMs, this usually means including test set samples or their equivalents in the training data.
This narrow perspective has several flaws. First, it doesn't address the main issue: whether test set performance accurately predicts real-world performance. Second, in the era of zero-shot learning, we aim to measure performance on "unseen" tasks, yet we train on internet-scale data that likely contains samples of almost any task, making contamination thresholds unclear. Third, focusing only on test sample inclusion ignores other sources of contamination like model and hyperparameter selection based on benchmark performance.
To overcome these limitations, we define contamination based on its outcome, rather than its cause. Informally, we define contamination as artificially inflated performance on a benchmark that does not generalize to real-world performance on the corresponding task. This definition aligns better with the practical implications of contamination and avoids the mentioned issues.

How to test for performance-based contamination?

ConStat is a statistical method that tests for performance-based contamination in LLMs. Given a reference benchmark and reference models, it computes the likelihood of the observed performance of the model under the null hypothesis that the model is not contaminated.

We can detect performance-based contamination by comparing a model's performance on a benchmark to its performance on a reference benchmark. If the model performs markedly better on the benchmark than on the reference, this indicates contamination. Different reference benchmarks test different types of contamination:
- Rephrased samples test syntax-specific contamination.
- Samples from the same distribution test sample-specific contamination.
- Different benchmarks for the same task test benchmark-specific contamination.
Unfortunately, directly comparing performance on the two benchmarks is not sufficient, as the benchmark and its reference benchmark may differ in difficulty. To account for this, we introduce a hardness correction function, which maps performance on the reference benchmark to the expected uncontaminated performance on the actual benchmark. We use several reference models to estimate this function.
Combining these observations, we propose ConStat, which uses bootstrapping together with the hardness correction function to build a confidence interval for the benchmark performance expected without contamination. Comparing this expected performance to the actual performance yields a p-value for contamination and an estimate \( \hat{\delta} \) of the contamination's influence, obtained by subtracting the predicted uncontaminated performance from the actual performance.
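
The procedure described here can be illustrated with a minimal sketch. It is not the ConStat implementation. It assumes per-sample 0/1 correctness arrays, resamples only benchmark samples (not reference models), and uses a straight-line least-squares fit as a stand-in for the hardness correction function. The function name `constat_sketch`, its parameters, the way the p-value is computed (the share of bootstrap estimates at or below zero), and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def constat_sketch(ref_bench, ref_refbench, model_bench, model_refbench,
                   n_boot=1000, seed=0):
    """Bootstrap sketch of a performance-based contamination test.

    ref_bench:      (n_models, n_bench) 0/1 correctness, reference models, benchmark
    ref_refbench:   (n_models, n_ref)   0/1 correctness, reference models, reference benchmark
    model_bench:    (n_bench,) 0/1 correctness, tested model, benchmark
    model_refbench: (n_ref,)   0/1 correctness, tested model, reference benchmark
    """
    rng = np.random.default_rng(seed)
    n_bench = ref_bench.shape[1]
    n_ref = ref_refbench.shape[1]
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        # Resample benchmark and reference-benchmark samples with replacement.
        i = rng.integers(0, n_bench, size=n_bench)
        j = rng.integers(0, n_ref, size=n_ref)
        # Accuracies of the reference models on the resampled data.
        x = ref_refbench[:, j].mean(axis=1)
        y = ref_bench[:, i].mean(axis=1)
        # Assumed hardness correction: a straight line fitted by least squares.
        slope, intercept = np.polyfit(x, y, deg=1)
        expected = slope * model_refbench[j].mean() + intercept
        # Actual performance minus predicted uncontaminated performance.
        deltas[b] = model_bench[i].mean() - expected
    return {
        "p_value": float(np.mean(deltas <= 0.0)),
        "delta_hat": float(deltas.mean()),
        "delta_lower_95": float(np.quantile(deltas, 0.05)),
    }


# Synthetic demo data, made up purely for illustration.
rng = np.random.default_rng(1)
skills = np.linspace(0.3, 0.8, 8)  # eight hypothetical reference models
ref_bench = (rng.random((8, 500)) < skills[:, None]).astype(float)
ref_refbench = (rng.random((8, 500)) < (skills[:, None] - 0.05)).astype(float)

# A model that scores far higher on the benchmark than its reference score suggests.
contaminated = constat_sketch(
    ref_bench, ref_refbench,
    (rng.random(500) < 0.90).astype(float),
    (rng.random(500) < 0.55).astype(float),
)
# A model whose benchmark score matches what the reference models predict.
clean = constat_sketch(
    ref_bench, ref_refbench,
    (rng.random(500) < 0.60).astype(float),
    (rng.random(500) < 0.55).astype(float),
)
print(contaminated)
print(clean)
```

In this toy setup the first model should receive a small p-value and a clearly positive estimate, while the estimate for the second model should stay close to zero. A straight line is the simplest choice for the correction; any monotone fit could be substituted.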

How well does ConStat perform?

In a controlled setting, we show that ConStat is much better at detecting and quantifying contamination than existing methods.

We finetuned 61 models with various hyperparameters on the ARC-Challenge, GSM8k, Hellaswag, and MMLU benchmarks. We then evaluated ConStat's ability to detect and quantify contamination in these models.
ConStat significantly outperforms prior methods in detecting contamination. It detects contamination in 60 out of 61 models under sample-specific contamination.
Previous methods cannot quantify the influence of contamination on performance. ConStat's estimates (\( \hat{\delta} \)) are well-calibrated and provide a reliable measure of contamination's impact on performance (\(r^2 = 0.94\)).
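
As a rough illustration of how such a calibration score could be computed, the sketch below evaluates a coefficient of determination between true and estimated contamination effects. It assumes that the true effect is known for each model in the controlled setting and that \(r^2\) means the coefficient of determination (it could also denote a squared correlation). The helper `r_squared` and the example values are hypothetical, not results from the paper.

```python
import numpy as np


def r_squared(true_effect, estimated_effect):
    """Coefficient of determination, treating the estimates as predictions of the true effect."""
    true_effect = np.asarray(true_effect, dtype=float)
    estimated_effect = np.asarray(estimated_effect, dtype=float)
    ss_res = np.sum((true_effect - estimated_effect) ** 2)
    ss_tot = np.sum((true_effect - true_effect.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
perfect = r_squared([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
uninformative = r_squared([0.1, 0.2, 0.3], [0.2, 0.2, 0.2])
print(perfect, uninformative)
```

A value near 1 means the estimates track the true effect closely; a value near 0 means they do no better than always predicting the average effect.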

Which popular models are contaminated?

We use ConStat to detect and quantify contamination in several widely-used models. We find contamination in Mistral-7b, Llama-3-70B and Yi-34B.

Can we use ConStat to detect and quantify contamination in popular models? Yes! We evaluated over 40 models across four benchmarks. Our results show each model's performance, the p-value for contamination, the estimated influence of contamination (\( \hat{\delta} \)), and a 95% confidence lower bound of this estimate (\( \hat{\delta}_{0.95} \)).
Several models, including Mistral-7b, Llama-3-70B, and Yi-34B, show significant contamination.
We also find significant contamination in top models from the Open LLM Leaderboard. Leaderboard scores should therefore be treated with caution when choosing a model, since selecting models by benchmark performance can itself inflate those scores.

Citation

@article{dekoninck2024constat,
        title={ConStat: Performance-Based Contamination Detection in Large Language Models}, 
        author={Jasper Dekoninck and Mark Niklas Müller and Martin Vechev},
        year={2024},
        archivePrefix={arXiv},
        primaryClass={cs.LG}
}