Use Suitable Baselines, Benchmarks, and Metrics (G7)

tl;dr: Researchers MUST justify all benchmark and metric choices in the PAPER and SHOULD summarize benchmark structure, task types, and limitations. Where possible, traditional (non-LLM) baselines SHOULD be used for comparison. Researchers MUST explain why the selected metrics are suitable for the specific study. They SHOULD report established metrics to make study results comparable, but can report additional metrics that they consider appropriate. Due to the inherent non-determinism of LLMs, experiments SHOULD be repeated; the result distribution SHOULD then be reported using descriptive statistics.

Rationale

Meaningful evaluation requires well-understood, valid measurement instruments. Without justified benchmarks and metrics, claims about LLM performance lack the rigor needed for scientific comparison and cumulative progress. Benchmarks, baselines, and metrics play an important role in assessing the effectiveness of LLMs and LLM-based tools. A benchmark is a “standard tool for the competitive evaluation and comparison of competing systems or components according to specific characteristics, such as performance, dependability, or security” (Kistowski et al. 2015). Meanwhile, “a metric is a method, algorithm, or procedure for assigning one or more numbers to a phenomenon” (Ralph et al. 2024). A baseline is a reference point, enabling comparison of LLMs against traditional algorithms with lower computational costs.

Recommendations

When selecting benchmarks and metrics, it is important to fully understand the benchmark tasks, what exactly is being measured, and how it relates to the (often latent) variables researchers actually care about. If one or more metrics or benchmarks are used, researchers:

  • MUST briefly justify (in the PAPER) why selected benchmarks and metrics are suitable for the given task or study;
  • MUST discuss the reliability and validity (especially construct validity) of selected metrics and benchmarks;
  • SHOULD summarize the structure and tasks of the selected benchmark(s), including the programming language(s) and descriptive statistics such as the number of contained tasks and test cases;
  • SHOULD discuss the limitations of the selected benchmark(s) (e.g., widely used benchmarks such as HumanEval (M. Chen et al. 2021) and MBPP (Austin et al. 2021), which consist of isolated Python functions, assess a very specific part of software development that is not representative of the full breadth of SE work (Chandra 2025));
  • SHOULD include an example of a task and the corresponding test case(s) to illustrate the structure of the benchmark.

If multiple benchmarks exist for the same task, researchers SHOULD compare performance between benchmarks. When selecting only a subset of all available benchmarks, researchers SHOULD use the most specific benchmarks given the context.

Furthermore, researchers SHOULD check whether a less resource-intensive approach (e.g., for static analysis tasks or program repair) can serve as a baseline. If so, the LLM or LLM-based tool SHOULD be compared with such baselines using suitable metrics. Even if LLM-based tools outperform baselines, researchers SHOULD discuss whether the resources consumed justify the (often marginal) improvements (Menzies 2025).

To compare traditional and LLM-based approaches or different LLM-based tools, researchers SHOULD report established metrics whenever possible, as this allows secondary research. They can, of course, report additional metrics that they consider appropriate. We briefly discuss common metrics in the Example(s) subsection below. As mentioned, researchers MUST motivate why they chose a certain metric or variant thereof for their particular study.

Due to LLM non-determinism, researchers SHOULD repeat experiments and report descriptive statistics of model or tool performance (e.g., arithmetic mean, median, confidence intervals, standard deviations) (Agarwal et al. 2021; Bjarnason, Silva, and Monperrus 2026). When comparing models or tools, researchers SHOULD use appropriate inferential statistics (e.g., hypothesis tests such as the Mann-Whitney U test or bootstrap-based comparisons, along with effect sizes) rather than relying solely on differences in means or other summary statistics.
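As an illustration, comparing the result distributions of two models or tools over repeated runs can be sketched with a percentile bootstrap on the difference in mean scores. This is a minimal stdlib sketch; the score lists, resampling count, and seed are illustrative assumptions, and a hypothesis test (e.g., Mann-Whitney U) with an effect size would complement it:

```python
import random
import statistics

def bootstrap_mean_diff(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the difference in
    mean scores (a - b) between two sets of repeated runs."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each run set with replacement and record the mean difference
        resample_a = [rng.choice(scores_a) for _ in scores_a]
        resample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    observed = statistics.mean(scores_a) - statistics.mean(scores_b)
    return observed, (lower, upper)
```

If the interval excludes zero, the observed difference is unlikely to be a resampling artifact; with the small run counts typical of LLM experiments, reporting the full interval is more informative than a bare mean difference.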

The number of required repetitions depends on factors such as the study type, the variability of the task, and the desired precision of estimates. As with sample sizes for human validation (Human Validation), researchers SHOULD justify their chosen number of repetitions, for example, through a power analysis (Cohen 1992; Bjarnason, Silva, and Monperrus 2026) or by monitoring the convergence of descriptive statistics across incremental runs (Blackwell, Barry, and Cohn 2024). A pilot study can help estimate the expected variability and inform this decision.
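Monitoring the convergence of descriptive statistics across incremental runs can be approximated with a simple stopping check on the running mean. The window size and tolerance below are illustrative assumptions, not recommended values:

```python
import statistics

def mean_converged(scores, window=5, tol=0.005):
    """Heuristic stopping check: the last `window` running means
    all lie within `tol` of each other."""
    if len(scores) < window + 1:
        return False
    running = [statistics.mean(scores[: i + 1]) for i in range(len(scores))]
    tail = running[-window:]
    return max(tail) - min(tail) <= tol
```

A researcher would run additional repetitions until this check passes (or a budget is exhausted) and report the final number of runs alongside the justification.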

From a measurement perspective, researchers SHOULD reflect on the theories, values, and measurement models on which the benchmarks and metrics they have selected for their study are based. For example, the phenomena labeled “bugs” in a large open dataset of software bugs are labeled according to a certain theory of what constitutes a bug, and reflect the values and perspectives of the people who labeled the dataset. Reflecting on the context in which these labels were assigned and discussing whether and how the labels generalize to a new study context is crucial.

Example(s)

Common Metrics:

Two common metrics used for generation tasks are BLEU-N and pass@k. BLEU-N (Papineni et al. 2002) was originally developed for evaluating machine translation quality by measuring n-gram precision between a candidate and reference text, ranging from 0 (dissimilar) to 1 (similar). It has been widely adopted in SE for code generation tasks, though its validity in this context is debatable: n-gram overlap does not capture functional correctness, and syntactically different code can be semantically equivalent (see Challenges below). Code-specific variations such as CodeBLEU (Ren et al. 2020) and CrystalBLEU (Eghbali and Pradel 2022) attempt to address these limitations by introducing additional heuristics such as AST matching.
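To make the n-gram precision idea concrete, the following is a minimal single-reference, sentence-level BLEU-N sketch with clipped precision and brevity penalty. It is illustrative only; for real evaluations an established implementation (e.g., sacreBLEU) should be used, since tokenization and smoothing choices materially affect scores:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU-N: clipped n-gram precision with
    uniform weights and a brevity penalty (single reference)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Note how a candidate that is token-identical to the reference scores 1.0, while a semantically equivalent but syntactically different solution can score 0, which is exactly the construct validity concern discussed under Challenges.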

The metric pass@k reports the likelihood that a model correctly completes a code snippet at least once within k attempts. Originally introduced as success rate at B by Kulal et al. (2019) and popularized by M. Chen et al. (2021), it ranges from 0 (no correct solution in k tries) to 1 (at least one correct solution), with correctness typically defined by passing test cases.

The pass@k metric is defined as:

\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}, \quad \text{where}

  • n is the total number of generated samples per prompt,
  • c is the number of correct samples among n, and
  • k is the number of samples drawn.

The choice of k depends on the downstream task: pass@1 is critical for single-suggestion scenarios like code completion, while higher k values (e.g., 2, 5, 10) assess multi-attempt capability (Rozière et al. 2023; Guo et al. 2024; Hui et al. 2024; Li et al. 2023). However, pass@k is not a universal metric suitable for all generation tasks. It requires a binary notion of correctness, making the metric appropriate for code synthesis evaluated via unit tests, but unsuitable for open-ended generation tasks such as comment generation, where multiple valid outputs exist.
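The formula above can be computed directly per prompt and then averaged across prompts; a short sketch using exact binomial coefficients:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k) for a single prompt."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: success guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over (n, c) pairs, one pair per prompt."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Sampling n > k completions per prompt and applying this estimator, as done by M. Chen et al. (2021), gives an unbiased estimate with lower variance than drawing exactly k samples.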

If a study evaluates an LLM-based tool for supporting humans, a relevant metric is the acceptance rate, meaning the ratio of all accepted artifacts (e.g., test cases, code snippets) in relation to all artifacts that were generated and presented to the user. Another way of evaluating LLM-based tools is calculating inter-model agreement. This allows researchers to assess how dependent a tool’s performance is on specific models and versions. Metrics used to measure inter-model agreement include general agreement (percentage), Cohen’s kappa, and Krippendorff’s α (see Human Validation for recommended thresholds and best practices for measuring agreement).
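For instance, inter-model agreement on nominal labels can be quantified with Cohen's kappa. A minimal two-rater sketch (the label lists in the test are hypothetical):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters (e.g., two model versions)
    labeling the same items with nominal categories."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement from the raters' marginal label frequencies
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)
```

A kappa of 0 indicates chance-level agreement and 1 perfect agreement; see Human Validation for interpretation thresholds.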

Common problem types for LLM-based studies are classification, recommendation, and generation, each requiring different metrics (Hou et al. 2024). Hu et al. (2025) categorized 191 LLM benchmarks by SE task, providing a valuable reference. Common metrics include BLEU, pass@k, Accuracy@k, and Exact Match for generation; Mean Reciprocal Rank for recommendation; and Precision, Recall, F1-score, and Accuracy for classification.
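For classification tasks, the standard metrics reduce to simple counts over the confusion matrix; a minimal binary-case sketch (the choice of positive label is a parameter assumed here for illustration):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For imbalanced datasets (common in, e.g., vulnerability detection), reporting precision and recall separately is more informative than accuracy alone.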

Benchmark Examples:

Benchmarks used for code generation include HumanEval (available on GitHub) (M. Chen et al. 2021), MBPP (available on Hugging Face) (Austin et al. 2021), ClassEval (available on GitHub) (Du et al. 2023), LiveCodeBench (available on GitHub) (Jain et al. 2024), and SWE-bench (available on GitHub) (Jimenez et al. 2024). An example of a code translation benchmark is TransCoder (Lachaux et al. 2020) (available on GitHub).

Most benchmarks focus on generation tasks, but benchmarks for classification and recommendation also exist. For classification, DiverseVul (available on GitHub) (Y. Chen et al. 2023) provides vulnerable and non-vulnerable functions evaluated using standard classification metrics. For recommendation, CodeSearchNet (available on GitHub) (Husain et al. 2019) contains code-documentation pairs evaluated using Mean Reciprocal Rank.

Benefits

Benchmarks and metrics improve reproducibility and comparability across studies, leading to faster improvements in the cumulative body of knowledge. It is simply easier to assess a new system against one or a few benchmarks than hundreds of competing systems, and when most systems are assessed against the same benchmarks and metrics, we can see which approaches work best (at least, on those benchmarks). This comparison enables progress tracking, for example when researchers iteratively improve an LLM-based tool and re-run benchmarks after significant changes. For practitioners, leaderboards (published benchmark results of models) support selecting models for downstream tasks. Meanwhile, baselines help us understand just how much LLMs improve performance over non-LLM-enabled alternatives.

Challenges

Computer scientists have a long history of using oversimplified, unvalidated metrics (Ralph and Tempero 2018) (e.g., lines of code, CPU time) as proxies for complex, multidimensional latent variables (e.g., system size, environmental sustainability). In fields such as psychology, where measurement theory has a longer tradition (Borsboom 2005), metrics are backed by foundational theories or extensive empirical validation of their psychometric properties. The fundamental challenge with AI metrics and benchmarks is that many have neither firmly understood theoretical underpinnings nor extensive empirical validation of their construct, measurement, and ecological validity.

For example, pass@k is intended to measure a model’s functional correctness, that is, its ability to generate code that produces the expected output for a given specification. However, functional correctness is only one dimension of code quality. pass@k does not capture maintainability, readability, security, or efficiency of the generated code, all of which are critical for downstream use. Furthermore, “correctness” is defined entirely by test suites, whose coverage is itself unvalidated: a solution that passes all provided tests may still be incorrect for untested inputs. Whether a test suite adequately operationalizes correctness for a given task is a subjective judgment that is rarely examined.

Similarly, HumanEval is intended to measure a model’s ability to synthesize short Python functions from docstring specifications. This operationalizes a narrow slice of “code generation ability”: it covers neither multi-file tasks, nor debugging, nor the use of existing codebases—tasks that dominate real-world software engineering (Chandra 2025). Notably, neither pass@k nor HumanEval has undergone rigorous empirical validation of its construct validity; their widespread adoption rests on face validity and convenience rather than on evidence that they reliably measure what researchers intend them to measure (Cao et al. 2025).

Benchmarks and metrics also have generalizability problems. Prominent LLM benchmarks such as HumanEval and MBPP use Python, so researchers can optimize for Python’s idiosyncrasies, creating illusory improvements that do not generalize to other languages. Similarly, the metric BLEU-N is a syntactic metric, so code can score highly without being executable. The metric Exact Match, meanwhile, does not account for functional equivalence of syntactically different code. Both BLEU-N and Exact Match are influenced by code formatting, which confounds their intended use.
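A tiny example illustrates the formatting sensitivity of Exact Match: two semantically identical snippets differing only in whitespace do not match. The whitespace normalization shown is a crude hypothetical helper, not a standard fix; it masks surface differences but remains blind to semantic equivalence:

```python
def exact_match(candidate: str, reference: str) -> bool:
    """Exact Match: strict string comparison."""
    return candidate == reference

def normalized_match(candidate: str, reference: str) -> bool:
    """Hypothetical helper: compare after stripping all whitespace.
    Still cannot recognize syntactically different but equivalent code."""
    return "".join(candidate.split()) == "".join(reference.split())
```

Such confounds are one reason execution-based metrics are often preferred for code generation despite their heavier setup requirements.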

Execution-based metrics such as pass@k directly evaluate correctness by running test cases, but they require a setup with an execution environment. When researchers observe unexpected values for certain metrics, the specific results should be investigated in more detail to uncover potential problems.

Finally, benchmark data contamination—where the benchmark itself is part of the training data—may lead to artificially high performance if the model remembers the solution from the training data rather than actually solving the task based on the input data (Xu et al. 2024). Therefore, for proprietary LLMs that do not release their training data, researchers SHOULD consider using human validation, curating new data, or refactoring existing data (Xu et al. 2024).

To mitigate contamination, researchers can create new benchmark datasets by collecting data after a specified cutoff date; researchers MUST disclose data sources and collection dates for each release. However, this temporal approach requires continuous updates as new models may include the benchmark in future training data. Alternatively, keeping the benchmark private prevents inclusion in training sets, but requires trust in the benchmark creator and a system to execute it without leaking data. Researchers can also evaluate how contaminated their benchmark is (Choi et al. 2025).

Study Types

This guideline MUST be followed for all study types that automatically evaluate the performance of LLMs or LLM-based tools. The design of a benchmark and the selection of appropriate metrics are highly dependent on the specific study type and research goal. Recommending specific metrics for specific study types is beyond the scope of these guidelines, but Hu et al. (2025) provide a good overview of existing metrics for evaluating LLMs.

For Benchmarking LLMs, this guideline is of primary importance: researchers MUST use established benchmarks or rigorously justify the creation of new ones, and MUST report standard metrics to enable cross-study comparison.

For LLMs as Annotators, the research goal might be to assess which model comes closest to a ground truth dataset created by human annotators. Especially for open annotation tasks, selecting suitable metrics to compare LLM-generated and human-generated labels is important. Annotation tasks can vary significantly: Are multiple labels allowed for the same sequence? Are the available labels predefined, or should the LLM generate a set of labels independently? Due to this task dependence, researchers MUST justify their metric choice, explaining which aspects of the task it captures together with known limitations.

For LLMs for Tools, researchers SHOULD benchmark the tool against suitable baselines using appropriate metrics that capture the tool’s intended contribution. For LLMs as Judges, researchers SHOULD report inter-rater agreement metrics and validity measures to demonstrate the reliability and quality of LLM judgments. For LLMs for Synthesis, researchers SHOULD specify metrics for comparing synthesized outputs (e.g., coverage, faithfulness) and justify their appropriateness for the synthesis task. For Studying LLM Usage, researchers SHOULD justify the measurement instruments and metrics used for studying LLM usage patterns, including any survey scales or behavioral measures. For LLMs as Subjects, researchers SHOULD compare simulated and real human responses using appropriate metrics to assess simulation fidelity.

If researchers assess a well-established task such as code generation, they SHOULD report standard metrics such as pass@k and compare performance between models. If non-standard metrics are used, researchers MUST state their reasoning.

Advice for Reviewers

Reviewers should expect manuscripts to:

  1. clearly identify the constructs or variables the study aims to measure (e.g., LLM performance, quality of generated code), including independent, dependent, and control variables;
  2. present their measurement model, i.e., which metrics, benchmarks, or baselines are used and how they relate to the target constructs;
  3. justify the selection of any metrics, benchmarks, and baselines used;
  4. discuss in detail the assumptions, reliability, and validity—especially construct validity—of each benchmark and metric;
  5. articulate any limitations regarding construct and measurement validity.

As with other guidelines, missing information about baselines or metrics is typically a revision request. However, vague descriptions that conflate broad concerns (e.g., effectiveness, quality) with specific counting methods should be questioned. Ubiquity of a benchmark does not imply validity or appropriateness for a given context. Manuscripts should convey a solid understanding of construct and measurement validity by explaining and justifying their measurement models.

See Also

References

Agarwal, Rishabh, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, and Marc G. Bellemare. 2021. “Deep Reinforcement Learning at the Edge of the Statistical Precipice.” In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, edited by Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, 29304–20. https://proceedings.neurips.cc/paper/2021/hash/f514cec81cb148559cf475e7426eed5e-Abstract.html.

Austin, Jacob, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, et al. 2021. “Program Synthesis with Large Language Models.” CoRR abs/2108.07732. https://arxiv.org/abs/2108.07732.

Bjarnason, Bjarni Haukur, André Silva, and Martin Monperrus. 2026. “On Randomness in Agentic Evals.” https://arxiv.org/abs/2602.07150.

Blackwell, Robert E., Jon Barry, and Anthony G. Cohn. 2024. “Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores.” https://arxiv.org/abs/2410.03492.

Borsboom, Denny. 2005. Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge University Press. https://doi.org/10.1017/CBO9780511490026.

Cao, Jialun, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Ruixi Qiao, et al. 2025. “How Should We Build a Benchmark? Revisiting 274 Code-Related Benchmarks for LLMs.” CoRR abs/2501.10711. https://arxiv.org/abs/2501.10711.

Chandra, Satish. 2025. “Benchmarks for AI in Software Engineering (BLOG@CACM).” https://cacm.acm.org/blogcacm/benchmarks-for-ai-in-software-engineering/.

Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. 2021. “Evaluating Large Language Models Trained on Code.” CoRR abs/2107.03374. https://arxiv.org/abs/2107.03374.

Chen, Yizheng, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David A. Wagner. 2023. “DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection.” In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, RAID 2023, Hong Kong, China, October 16-18, 2023, 654–68. ACM. https://doi.org/10.1145/3607199.3607242.

Choi, Hyeong Kyu, Maxim Khanov, Hongxin Wei, and Yixuan Li. 2025. “How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence.” CoRR abs/2502.00678. https://doi.org/10.48550/ARXIV.2502.00678.

Cohen, Jacob. 1992. “A Power Primer.” Psychological Bulletin 112 (1): 155–59. https://doi.org/10.1037/0033-2909.112.1.155.

Du, Xueying, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. “ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-Level Code Generation.” CoRR abs/2308.01861. https://doi.org/10.48550/ARXIV.2308.01861.

Eghbali, Aryaz, and Michael Pradel. 2022. “CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code.” In 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022, 28:1–12. ACM. https://doi.org/10.1145/3551349.3556903.

Guo, Daya, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, et al. 2024. “DeepSeek-Coder: When the Large Language Model Meets Programming - the Rise of Code Intelligence.” CoRR abs/2401.14196. https://doi.org/10.48550/ARXIV.2401.14196.

Hou, Xinyi, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. “Large Language Models for Software Engineering: A Systematic Literature Review.” ACM Trans. Softw. Eng. Methodol. 33 (8): 1–79. https://doi.org/10.1145/3695988.

Hu, Xing, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, and David Lo. 2025. “Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks.” CoRR abs/2505.08903. https://doi.org/10.48550/ARXIV.2505.08903.

Hui, Binyuan, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, et al. 2024. “Qwen2.5-Coder Technical Report.” CoRR abs/2409.12186. https://doi.org/10.48550/ARXIV.2409.12186.

Husain, Hamel, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. “CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.” CoRR abs/1909.09436. http://arxiv.org/abs/1909.09436.

Jain, Naman, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.” CoRR abs/2403.07974. https://doi.org/10.48550/ARXIV.2403.07974.

Jimenez, Carlos E., John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. “SWE-Bench: Can Language Models Resolve Real-World Github Issues?” In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=VTF8yNQM66.

Kistowski, Jóakim v., Jeremy A. Arnold, Karl Huppler, Klaus-Dieter Lange, John L. Henning, and Paul Cao. 2015. “How to Build a Benchmark.” In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, 333–36. ICPE ’15. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2668930.2688819.

Kulal, Sumith, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy Liang. 2019. “SPoC: Search-Based Pseudocode to Code.” CoRR abs/1906.04908. http://arxiv.org/abs/1906.04908.

Lachaux, Marie-Anne, Baptiste Rozière, Lowik Chanussot, and Guillaume Lample. 2020. “Unsupervised Translation of Programming Languages.” CoRR abs/2006.03511. https://arxiv.org/abs/2006.03511.

Li, Raymond, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, et al. 2023. “StarCoder: May the Source Be with You!” CoRR abs/2305.06161. https://doi.org/10.48550/ARXIV.2305.06161.

Menzies, Tim. 2025. “The Case for Compact AI.” Commun. ACM 68 (9): 6–7. https://doi.org/10.1145/3746057.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “Bleu: A Method for Automatic Evaluation of Machine Translation.” In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–18. ACL. https://doi.org/10.3115/1073083.1073135.

Ralph, Paul, Miikka Kuutila, Hera Arif, and Bimpe Ayoola. 2024. “Teaching Software Metrology: The Science of Measurement for Software Engineering.” In Handbook on Teaching Empirical Software Engineering, edited by Daniel Méndez, Paris Avgeriou, Marcos Kalinowski, and Nauman Bin Ali, 101–54. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-71769-7_5.

Ralph, Paul, and Ewan D. Tempero. 2018. “Construct Validity in Software Engineering Research and Software Metrics.” In Proceedings of the 22nd International Conference on Evaluation and Assessment in Software Engineering, EASE2018, edited by Austen Rainer, Stephen G. MacDonell, and Jacky W. Keung, 13–23. ACM. https://doi.org/10.1145/3210459.3210461.

Ren, Shuo, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. “CodeBLEU: A Method for Automatic Evaluation of Code Synthesis.” CoRR abs/2009.10297. https://arxiv.org/abs/2009.10297.

Rozière, Baptiste, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, et al. 2023. “Code Llama: Open Foundation Models for Code.” CoRR abs/2308.12950. https://doi.org/10.48550/ARXIV.2308.12950.

Xu, Cheng, Shuhao Guan, Derek Greene, and M. Tahar Kechadi. 2024. “Benchmark Data Contamination of Large Language Models: A Survey.” CoRR abs/2406.04244. https://doi.org/10.48550/ARXIV.2406.04244.