Guidelines
This set of guidelines is currently a DRAFT and based on a discussion session with researchers at the 2024 International Software Engineering Research Network (ISERN) meeting and at the 2nd Copenhagen Symposium on Human-Centered Software Engineering AI. This draft is meant as a starting point for further discussions in the community with the aim of developing a common understanding of how we should conduct and report empirical studies involving large language models (LLMs). See also the pages on study types and scope.
The wording of the recommendations follows RFC 2119 and RFC 8174.
Overview
- Declare LLM Usage and Role
- Report Model Version and Configuration
- Report Tool Architecture and Supplemental Data
- Report Prompts and their Development
- Report Interaction Logs
- Use Human Validation for LLM Outputs
- Use an Open LLM as a Baseline
- Report Suitable Baselines, Benchmarks, and Metrics
- Report Limitations and Mitigations
Declare LLM Usage and Role
Recommendations
When conducting any kind of empirical study involving LLMs, researchers MUST clearly declare that an LLM was used. This is, for example, required by the ACM Policy on Authorship: “The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work” [1]. Researchers SHOULD report the exact purpose of using an LLM in a study, the tasks it was used to automate, and the expected outcomes.
Example(s)
The ACM Policy on Authorship suggests disclosing the usage of Generative AI tools in the acknowledgements section of papers, for example: “ChatGPT was utilized to generate sections of this Work, including text, tables, graphs, code, data, citations” [1]. Similarly, the acknowledgements section could also be used to disclose Generative AI usage for other aspects of the research, if not explicitly described in other parts of the paper. The ACM policy further suggests: “If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work” [1].
Advantages
Transparency about the usage of LLMs helps in understanding the context and scope of a study, facilitating better interpretation and comparison of results. Beyond this declaration, we recommend that authors be explicit about the LLM version they used (see Section Report Model Version and Configuration) and the LLM’s exact role (see Section Report Tool Architecture and Supplemental Data).
Challenges
We do not expect any challenges for researchers following this guideline.
Study Types
This guideline MUST be followed for all study types.
References
[1] Association for Computing Machinery, “ACM Policy on Authorship.” https://www.acm.org/publications/policies/new-acm-policy-on-authorship, 2023.
Report Model Version and Configuration
Recommendations
LLMs (especially when used as a service) are frequently updated, and different versions may produce varying results. Moreover, the model configuration and parameters influence the output generation of the models. It is therefore crucial to document the specific version of the LLM used in the study, the date when the experiments were conducted, and the exact configuration used. Detailed documentation of the configuration and parameters is necessary for reproducibility. Our recommendation is to report:
- Model name
- Model version
- The maximum token length for prompts
- The configured temperature that controls randomness, and all other relevant parameters that affect output generation
- Whether historical context was considered when generating responses
Additionally, a thorough description of the hosting environment of the LLM or LLM-based tool should be provided, especially in studies focusing on performance or any time-sensitive measurement.
Example(s)
For an OpenAI model, researchers might report: “A gpt-4 model was integrated via the Azure OpenAI Service and configured with a temperature of 0.7, top_p set to 0.8, and a fixed maximum token length. We used version 0125-Preview with system fingerprint fp_6b68a8204b and seed value 23487, and ran our experiment on 10 January 2025” [1], [2]. However, there are also related challenges (see Section Challenges).
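For API-based studies, much of this information can be captured programmatically at the time of the experiment. The following is a minimal sketch using the OpenAI Python client; the model identifier, parameter values, and output file name are illustrative assumptions, not recommendations.

```python
import json
from datetime import datetime, timezone
from openai import OpenAI  # assumes the official OpenAI Python client (v1+)

client = OpenAI()

# Configuration to be reported in the paper (illustrative values).
request_config = {
    "model": "gpt-4-0125-preview",
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024,
    "seed": 23487,  # best-effort reproducibility only
}

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize the following bug report: ..."}],
    **request_config,
)

# Persist everything needed to report the model version and configuration later.
provenance = {
    "request_config": request_config,
    "reported_model": response.model,                    # exact model version returned by the API
    "system_fingerprint": response.system_fingerprint,   # backend configuration identifier
    "run_date": datetime.now(timezone.utc).isoformat(),
}
with open("llm_run_provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```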
TODO: Talk about local or self-hosted models as well.
Advantages
By providing this information, researchers enable others to reproduce the study under the same or similar conditions.
Challenges
Different model providers and modes of operating the models allow for varying degrees of information. For example, OpenAI provides a model version and a system fingerprint describing the backend configuration that can also influence the output. However, the fingerprint is only intended to detect changes to the model or its configuration; as a user, one cannot go back to a certain fingerprint. As a beta feature, OpenAI also lets users set a seed parameter to receive “(mostly) consistent output” [3]. However, the seed value does not allow for full reproducibility, and the fingerprint changes frequently.
TODO: Also “open” models come with challenges in terms of reproducibility (https://github.com/ollama/ollama/issues/5321).
Study Types
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [4]).
TODO: I guess when studying the usage of a tool such as ChatGPT or Copilot, it might not be possible to report all of this?
References
[1] OpenAI, “OpenAI API Introduction.” https://platform.openai.com/docs/api-reference/chat/streaming, 2025.
[2] Microsoft, “Azure OpenAI Service models.” https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models, 2025.
[3] OpenAI, “How to make your completions outputs consistent with the new seed parameter.” https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter, 2023.
[4] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
Report Tool Architecture and Supplemental Data
Recommendations
Oftentimes, there is a layer around the LLM that preprocesses data, prepares prompts or filters user requests. One example is ChatGPT, which, at the time of writing these guidelines, primarily uses the GPT-4o model. GitHub Copilot also relies on the same model, and researchers can build their own tools utilizing GPT-4o directly (e.g., via the OpenAI API). The infrastructure around the bare model can significantly contribute to the performance of a model in a given task. Therefore, it is important that researchers clearly describe the architecture and what the LLM contributes to the tool or method presented in a research paper.
If the LLM is used as a standalone system (e.g., ChatGPT-4o API without additional architecture layers), researchers SHOULD provide a brief explanation of how it was used rather than detailing a full system architecture. However, if the LLM is integrated into a more complex system with preprocessing, retrieval mechanisms, fine-tuning, or autonomous agents, researchers MUST clearly document the tool architecture, including how the LLM interacts with other components such as databases, retrieval mechanisms, external APIs, and reasoning frameworks. A high-level architectural diagram SHOULD be provided in these cases to improve transparency. To enhance clarity, researchers SHOULD explain design decisions, particularly regarding model access (e.g., API-based, fine-tuned, self-hosted) and retrieval mechanisms (e.g., keyword search, semantic similarity matching, rule-based extraction). Researchers MUST NOT omit critical architectural details that could impact reproducibility, such as hidden dependencies or proprietary tools that influence model behavior.
If the LLM is part of an agent-based system that autonomously plans, reasons, or executes tasks, researchers MUST describe its architecture, including the agent’s role (e.g., planner, executor, coordinator), whether it is a single-agent or multi-agent system, how it interacts with external tools and users, and the reasoning framework used (e.g., chain-of-thought, self-reflection, multi-turn dialogue, tool usage). Researchers MUST NOT present an agent-based system without detailing how it makes decisions and executes tasks.
If a retrieval or augmentation method is used (e.g., retrieval-augmented generation (RAG), rule-based retrieval, structured query generation, or hybrid approaches), researchers MUST describe how external data is retrieved, stored, and integrated into the LLM’s responses. This includes specifying the type of storage or database used (e.g., vector databases, relational databases, knowledge graphs) and how the retrieved information is selected and used. Stored data used for context augmentation MUST be reported, including details on data preprocessing, versioning, and update frequency. If this data is not confidential, an anonymized snapshot of the data used for context augmentation SHOULD be made available.
Similarly, if the LLM is fine-tuned, researchers MUST describe the fine-tuning goal (e.g., domain adaptation, task specialization), procedure (e.g., full fine-tuning, parameter-efficient fine-tuning), and dataset (source, size, preprocessing, availability). They should include training details (e.g., compute resources, hyperparameters, loss function) and performance metrics (benchmarks, baseline comparison). If the data used for fine-tuning is not confidential, an anonymized snapshot of the data used for fine-tuning the model SHOULD be made available.
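To illustrate the level of detail this guideline asks for, the following minimal sketch shows a keyword-based (TF-IDF) retrieval-augmented pipeline in which retrieved documents are concatenated into the prompt. The documents, ranking, and prompt layout are hypothetical; they exemplify exactly the kinds of design decisions (retrieval mechanism, data snapshot, prompt assembly) that should be reported.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Versioned snapshot of the context-augmentation data (to be reported and archived).
DOCUMENTS = [
    "ADR-012: We chose PostgreSQL over MongoDB because of transactional guarantees.",
    "ADR-031: The API gateway terminates TLS and forwards plain HTTP internally.",
    "ADR-045: Background jobs are scheduled with cron, not with a message queue.",
]

vectorizer = TfidfVectorizer().fit(DOCUMENTS)
doc_vectors = vectorizer.transform(DOCUMENTS)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Keyword-based retrieval (TF-IDF + cosine similarity), returning the top-k documents."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    ranked = sorted(range(len(DOCUMENTS)), key=lambda i: scores[i], reverse=True)
    return [DOCUMENTS[i] for i in ranked[:k]]

def build_prompt(query: str) -> str:
    """Prompt layout: retrieved context first, then the task instruction."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Context:\n{context}\n\nTask: {query}\nAnswer based only on the context above."

if __name__ == "__main__":
    prompt = build_prompt("Why was PostgreSQL selected as the database?")
    print(prompt)  # this prompt would then be sent to the LLM (generation step omitted)
```

Even for such a small pipeline, the choice of retrieval method, the number of retrieved documents, and the prompt layout can substantially influence results, which is why they need to be documented.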
Example(s)
Some empirical studies in software engineering involving LLMs have documented the architecture and supplemental data aligning with the recommended guidelines. Hereafter, we provide two examples.
Schäfer et al. conducted an empirical evaluation of using LLMs for automated unit test generation [1]. The authors provide a comprehensive description of the system architecture, detailing how the LLM is integrated into the software development workflow to analyze codebases and produce corresponding unit tests. The architecture includes components for code parsing, prompt formulation, interaction with the LLM, and integration of the generated tests into existing test suites. The paper also elaborates on the datasets utilized for training and evaluating the LLM’s performance in unit test generation. It specifies the sources of code samples, the selection criteria, and the preprocessing steps undertaken to prepare the data.
Dhar et al. conducted an exploratory empirical study to assess whether LLMs can generate architectural design decisions [2]. The authors detail the system architecture, including the decision-making framework, the role of the LLM in generating design decisions, and the interaction between the LLM and other components of the system. The study provides information on the fine-tuning approach and datasets used for evaluation, including the source of the architectural decision records, preprocessing methods, and the criteria for data selection.
Advantages
Documenting the architecture and supplemental data of LLM-based systems enhances reproducibility, transparency, and trust [3]. In empirical software engineering studies, this is essential for experiment replication, result validation, and benchmarking. Clear documentation of RAG, fine-tuning, and data storage enables comparison, optimizes efficiency, and upholds scientific rigor and accountability, fostering reliable and reusable research.
Challenges
Researchers face challenges in documenting LLM-based architectures, including proprietary APIs and dependencies that restrict disclosure, managing large-scale retrieval databases, and ensuring efficient query execution. They must also balance transparency with data privacy concerns, adapt to the evolving nature of LLM integrations, and, depending on the context, handle the complexity of multi-agent interactions and decision-making logic, all of which can impact reproducibility and system clarity.
Study Types
This guideline MUST be followed for all empirical study types involving LLMs, especially those using fine-tuned or self-hosted models, retrieval-augmented generation (RAG) or alternative retrieval methods, API-based model access, and agent-based systems where LLMs handle autonomous planning and execution.
References
[1] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Trans. Software Eng., vol. 50, no. 1, pp. 85–105, 2024, doi: 10.1109/TSE.2023.3334955.
[2] R. Dhar, K. Vaidhyanathan, and V. Varma, “Can LLMs generate architectural design decisions? - an exploratory empirical study,” in 21st IEEE international conference on software architecture, ICSA 2024, hyderabad, india, june 4-8, 2024, IEEE, 2024, pp. 79–89. doi: 10.1109/ICSA59870.2024.00016.
[3] Q. Lu, L. Zhu, X. Xu, Z. Xing, and J. Whittle, “Toward responsible AI in the era of generative AI: A reference architecture for designing foundation model-based systems,” IEEE Softw., vol. 41, no. 6, pp. 91–100, 2024, doi: 10.1109/MS.2024.3406333.
Report Prompts and their Development
Recommendations
Prompts are critical in empirical software engineering studies involving LLMs. Depending on the task, prompts may include various types of content, such as source code, execution traces, error messages, natural language descriptions, or even screenshots and other multi-modal inputs. These elements significantly influence the model’s output, and understanding how exactly they were formatted and integrated is essential for transparency and reproducibility.
Researchers MUST report the full text of prompts used, along with any surrounding instructions, metadata, or contextual information. The exact structure of the prompt should be described, including the order and format of each element. For example, when using code snippets, researchers should specify whether they were enclosed in markdown-style code blocks (e.g., triple backticks), if line breaks and whitespace were preserved, and whether additional annotations (e.g., comments) were included. Similarly, for other artifacts such as error messages, stack traces, or non-text elements like screenshots, researchers should explain how these were presented. If rich media was involved, such as in multi-modal models, details on how these inputs were encoded or referenced in the prompt are crucial.
When dealing with extensive or complex prompts, such as those involving large codebases or multiple error logs, researchers MUST describe strategies they used for handling input length constraints. Approaches might include truncating, summarizing, or splitting prompts into multiple parts. Token optimization measures, such as simplifying code formatting or removing unnecessary comments, should also be documented if applied.
In terms of strategy, prompts can vary widely based on the task design. Researchers MUST specify whether zero-shot, one-shot, or few-shot prompting was used. For few-shot prompts, the examples provided to the model should be clearly outlined, along with the rationale for selecting them. If multiple versions of a prompt were tested, researchers should describe how these variations were evaluated and how the final design was chosen.
In cases where prompts are generated dynamically—such as through preprocessing, template structures, or retrieval-augmented generation (RAG)—the process MUST be thoroughly documented. This includes explaining any automated algorithms or rules that influenced prompt generation. For studies involving human participants, where users might create or modify prompts themselves, researchers MUST describe how these prompts were collected and analyzed. If full disclosure is not feasible due to privacy concerns, summaries and representative examples should be provided.
To ensure full reproducibility, researchers MUST make all prompts and prompt variations publicly available in an online appendix, replication package, or repository. If the full set of prompts is too extensive to include in the paper itself, researchers SHOULD still provide representative examples and describe variations in the main body of the paper. A recent paper by Anandayuvaraj et al. [1] is a good example of making prompts available online: the authors analyze software failures reported in news articles and use prompting to automate tasks such as filtering relevant articles, merging reports, and extracting detailed failure information. Their online appendix contains all prompts used in the study, providing valuable transparency and supporting reproducibility.
Prompt development is often iterative, involving collaboration between human researchers and AI tools. Researchers SHOULD report any instances where LLMs were used to suggest prompt refinements, as well as how those suggestions were incorporated. Furthermore, prompts may need to be revised in response to failure cases where the model produced incorrect or incomplete outputs. Iterative changes based on human feedback and pilot testing results should also be included in the documentation.
Finally, pilot testing and prompt evaluation are vital for ensuring that prompts yield reliable results. If such testing was conducted, researchers SHOULD summarize key insights, including how different prompt variations affected output quality and which criteria were used to finalize the prompt design.
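One lightweight way to make prompt structure and its evolution reportable is to keep prompts as versioned templates and to assemble few-shot examples explicitly. The following sketch illustrates this idea; the template wording, version tag, and example are hypothetical.

```python
# Hypothetical, versioned prompt template; the version tag is incremented after each
# pilot-testing iteration and reported together with the study results.
PROMPT_VERSION = "v3"

TEMPLATE = (
    "You are a coding assistant.\n"
    "{examples}"
    "Below is a Python function and a failing test. Analyze the code and suggest a fix.\n\n"
    "Code:\n{code}\n\n"
    "Failing test output:\n{error}\n"
)

# Few-shot examples; the rationale for selecting them should be documented.
FEW_SHOT_EXAMPLES = [
    {
        "code": "def add(a, b):\n    return a - b",
        "fix": "The function subtracts instead of adding; replace '-' with '+'.",
    },
]

def build_prompt(code: str, error: str, shots: int = 1) -> str:
    """Assemble the final prompt; shots=0 yields a zero-shot prompt."""
    examples = "".join(
        f"Example code:\n{ex['code']}\nExample fix: {ex['fix']}\n\n"
        for ex in FEW_SHOT_EXAMPLES[:shots]
    )
    return TEMPLATE.format(examples=examples, code=code, error=error)
```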
Example(s)
A debugging study may use a prompt structured like this:
You are a coding assistant. Below is a Python script that fails with an error. Analyze the code and suggest a fix.

Code:
```
def divide(a, b):
    return a / b

print(divide(10, 0))
```

Error message: ZeroDivisionError: division by zero
The study should document that the code was enclosed in triple backticks, specify whether additional context (e.g., stack traces or annotations) was included, and explain how variations of the prompt were tested.
A good example of comprehensive prompt reporting is provided by Liang et al. [2]. The authors make the exact prompts available in their online appendix on Figshare, including details such as code blocks being enclosed in triple backticks. While this level of detail would not fit within the paper itself, the paper thoroughly explains the rationale behind the prompt design and data output format. It also includes one overview figure and two concrete examples, ensuring transparency and reproducibility while keeping the main text concise.
Advantages
Providing detailed documentation of prompts enhances reproducibility and comparability. It allows other researchers to replicate the study under similar conditions, refine prompts based on documented improvements, and evaluate how different types of content (e.g., source code vs. execution traces) influence LLM behavior. This transparency also enables a better understanding of how formatting, prompt length, and structure impact results across various studies.
Challenges
One challenge is the complexity of prompts that combine multiple components, such as code, error messages, and explanatory text. Formatting differences—such as whether markdown or plain text was used—can affect how LLMs interpret inputs. Additionally, prompt length constraints may require careful management, particularly for tasks involving extensive artifacts like large codebases.
For multi-modal studies, handling non-text artifacts such as screenshots introduces additional complexity. Researchers must decide how to represent such inputs, whether by textual descriptions, image encoding, or data references. Lastly, proprietary LLMs (e.g., Copilot) may obscure certain details about internal prompt processing, limiting full transparency.
Privacy and confidentiality concerns can also hinder prompt sharing, especially when sensitive data is involved. In these cases, researchers should provide anonymized examples and summaries wherever possible.
Study Types
Reporting requirements may vary depending on the study type. For tool evaluation studies, researchers MUST explain how prompts were generated and structured within the tool. Controlled experiments MUST provide exact prompts for all conditions, while observational studies SHOULD summarize common prompt patterns and provide representative examples if full prompts cannot be shared.
References
[1] D. Anandayuvaraj, M. Campbell, A. Tewari, and J. C. Davis, “FAIL: Analyzing software failures from the news using LLMs,” in Proceedings of the 39th IEEE/ACM international conference on automated software engineering, 2024, pp. 506–518.
[2] J. T. Liang et al., “Can gpt-4 replicate empirical software engineering research?” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1330–1353, 2024.
Report Interaction Logs
Recommendations
The previous guidelines aim to address the reproducibility problem in this context, but even reporting the full version and parameters might not always be sufficient. LLMs can behave non-deterministically even if all parameters are fixed [1]: while decoding strategies and parameters can be fixed by defining seeds, setting the temperature to 0, and using deterministic decoding strategies, non-determinism can still arise from batching, input preprocessing, and floating-point arithmetic on GPUs.
This is why, when reproducibility and transparency are important, researchers SHOULD report full interaction logs, that is, all prompts and responses generated by the LLM or LLM-based tool in the context of the presented study. This is especially important for studies targeting commercial SaaS solutions based on LLMs (e.g., ChatGPT) or novel tools that integrate LLMs via cloud APIs, because there is even less guarantee that a reader who wants to replicate the study can reproduce the state of the LLM-powered system at a later point in time.
In a sense, this is no different from qualitative studies that include interviews with human participants: there, researchers also report the full interaction between the interviewers and the participants. Intuitively, the situation is similar because both a human participant and ChatGPT might change their answers if asked the same question two months apart; thus, the transcript of the actual conversation is important to preserve.
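In practice, a thin logging wrapper around the model call is often sufficient to obtain complete interaction logs. The following sketch appends every prompt and response to a JSONL file; the query_llm function is a placeholder for whatever API or tool a study actually uses.

```python
import json
import time
from pathlib import Path

LOG_FILE = Path("interaction_log.jsonl")

def query_llm(prompt: str) -> str:
    """Placeholder for the actual model call (API, local model, or tool)."""
    raise NotImplementedError

def logged_query(prompt: str, **metadata) -> str:
    """Send a prompt and append the full exchange to an append-only JSONL log."""
    response = query_llm(prompt)
    record = {
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
        **metadata,  # e.g., model name, temperature, task id
    }
    with LOG_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```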
Example(s)
In their paper “Investigating ChatGPT’s Potential to Assist in Requirements Elicitation Processes” [2], Ronanki et al. report the full answers of ChatGPT and upload them to a Zenodo record.
TODO: QUESTION: Does anybody have a better example? I could not find one.
Advantages
The advantage of following this guideline is the transparency and reproducibility of the resulting research.
Moreover, the guideline is easy to follow. Transcripts are easy to obtain (if we continue with the mental model of an LLM as an interviewee, this is especially evident in contrast to obtaining transcripts of sessions with human participants). Even for systems where the interaction is based on voice, the interaction is first translated to text using speech-to-text methods, so it can also be easily obtained. In this sense, there is no excuse for researchers not to report full transcripts.
Another advantage is that, while conversations with human participants often cannot be reported due to confidentiality, LLM conversations can (e.g., as of early 2025, the for-profit company OpenAI allows the sharing of chat transcripts: https://openai.com/policies/sharing-publication-policy/).
Challenges
Given that chat transcripts are easy to generate, a study might end up with a very large appendix. Consequently, online storage might be needed. Services such as Zenodo, or other long term storage for research artifacts, will likely have to be used in such situations.
Not all systems allow the reporting of interaction logs with the same ease. Interactions with chatbot systems are easy to report; in contrast, interactions with auto-complete systems such as GitHub Copilot are much harder to capture. Indeed, the fact that Copilot provided a recommendation during a coding session cannot be replicated unless one re-creates the whole state of the codebase at that given point in time (and that of GitHub Copilot, too). One way to report such interactions would be to share a screencast of the coding session, but some researchers might find this too troublesome.
Study Types
This guideline SHOULD be followed for all study types.
References
[1] S. Chann, “Non-determinism in GPT-4 is caused by Sparse MoE.” https://152334h.github.io/blog/non-determinism-in-gpt-4/, 2023.
[2] K. Ronanki, C. Berger, and J. Horkoff, “Investigating ChatGPT’s potential to assist in requirements elicitation processes,” in 2023 49th euromicro conference on software engineering and advanced applications (SEAA), IEEE, 2023, pp. 354–361.
Use Human Validation for LLM Outputs
Recommendations
While LLMs can automate many tasks, it is important to validate their outputs with human judgment. For natural language processing tasks, a large-scale study has shown that LLMs have significant variation in their results, which limits their reliability as a direct substitute for human raters [1]. Human validation helps ensure the accuracy and reliability of the results, as LLMs may sometimes produce incorrect or biased outputs. Especially in studies where LLMs are used to support researchers, human validation is generally recommended to ensure validity. For studies using LLMs as annotators, the proposed process by Ahmed et al. [2], which includes an initial few-shot learning and, given good results, the replacement of one human annotator by an LLM, might be a way forward.
Researchers may employ human validation to complement existing measures of software-related constructs. For example, proxies for software quality, such as code complexity or the number of code smells, may be complemented by human ratings of maintainability, readability, or understandability. In the case of more abstract variables or psychometric measurements, human validation may be the only way of measuring a specific construct. For example, measuring human factors such as trust, cognitive load, and comprehension levels may inherently require human evaluation.
When conducting empirical measurements, researchers should clearly define the construct that they are measuring and specify the methods used for measurement. Further, they should use established measurement methods and instruments that are empirically validated [3], [4]. Measuring a construct may require aggregating input from multiple subjects. For example, a study may assess inter-rater agreement using measures such as Cohen’s Kappa or Krippendorff’s Alpha before aggregating ratings. In some cases, researchers may also combine multiple measures into single composite measures. As an example, they may evaluate both the time taken and accuracy when completing a task and aggregate them into a composite measure for the participants’ overall performance. In these cases, researchers should clearly describe their method of aggregation and document their reasoning for doing so.
When employing human validation, additional confounding factors should be controlled for, such as the level of expertise or experience with LLM-based applications or their general attitude towards AI-based tools. Researchers should control for these factors through methods such as stratified sampling or by categorizing participants based on experience levels.
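As an illustration, inter-rater agreement between an LLM and a human annotator (or between two human raters) on a labeling task can be computed before any ratings are aggregated. The following minimal sketch uses scikit-learn; the labels are purely illustrative.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned to the same ten artifacts by a human rater and an LLM.
human_labels = ["bug", "feature", "bug", "question", "bug", "feature", "bug", "bug", "question", "feature"]
llm_labels   = ["bug", "feature", "bug", "bug",      "bug", "feature", "bug", "bug", "question", "bug"]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # interpret using established thresholds before aggregating ratings
```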
Example(s)
As an example, Khojah et al. [5] augmented the results of their study using human measurement. Specifically, they asked participants to provide ratings regarding their experience, trust, perceived effectiveness and efficiency, and scenarios and lessons learned in their experience with ChatGPT.
Choudhuri et al. [6] evaluated the perceptions of students of their experience with ChatGPT in a controlled experiment. They added this data to extend their results from the task performance in a series of software engineering tasks.
Xue et al. [7] conducted a controlled experiment in which they evaluated the impact of ChatGPT on the performance and perceptions of students in an introductory programming course. They employed multiple measures to judge the impact of the LLM from the perspective of humans. In their study, they recorded the students’ screens, evaluated the answers they provided in tasks, and distributed a post-study survey to get direct opinions from the students.
Advantages
Incorporating human judgment in the evaluation process adds a layer of quality control and increases the trustworthiness of the study’s findings, especially when explicitly reporting inter-rater reliability metrics. For instance: “A subset of 20% of the LLM-generated annotations was reviewed and validated by experienced software engineers to ensure accuracy. Using Cohen’s Kappa, an inter-rater reliability of [value] was reached.”
Incorporating feedback from individuals from the target population strengthens external validity by grounding study findings in real-world usage scenarios and may positively impact the transfer of study results to practice. Researchers may uncover additional opportunities to further improve the LLM or LLM-based tool based on the reported experiences.
Challenges
Measuring variables through human validation can be challenging. Ensuring that the operationalization of a desired construct and the method of measuring it are appropriate requires a good understanding of the studied concept and construct validity in general, and a systematic design approach for the measurement instruments.
Human judgment is often very subjective and may lead to large variability between different subjects due to differences in expertise, interpretation, and biases among evaluators. Controlling for this subjectivity will require additional rigor when conducting the statistical analysis of the study results.
Recruiting participants as human validators will always incur additional resources compared to machine-generated measures. Researchers must weigh the cost and time investment incurred by the recruitment process against the potential benefits for the validity of their study results.
Study Types
These guidelines apply to any study type that incorporates human validation.
MUST
- Clearly define the construct measured through human validation.
- Describe how the construct is operationalized in the study, specifying the method of measurement.
- Employ established and widely accepted measurement methods and instruments.
SHOULD
- Use empirically validated measures.
- Complement automated or machine-generated measures with human validation where possible.
MAY
- Use multiple different measures (e.g., expert ratings, surveys, task performance) for human validation.
References
[1] A. Bavaresco et al., “LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks,” CoRR, vol. abs/2406.18403, 2024, doi: 10.48550/ARXIV.2406.18403.
[2] T. Ahmed, P. T. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” CoRR, vol. abs/2408.05534, 2024, doi: 10.48550/ARXIV.2408.05534.
[3] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman, “Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance,” Frontiers Comput. Sci., vol. 5, 2023, doi: 10.3389/FCOMP.2023.1096257.
[4] S. A. C. Perrig, N. Scharowski, and F. Brühlmann, “Trust issues with trust scales: Examining the psychometric quality of trust measures in the context of AI,” in Extended abstracts of the 2023 CHI conference on human factors in computing systems, CHI EA 2023, hamburg, germany, april 23-28, 2023, A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, and A. Peters, Eds., ACM, 2023, pp. 297:1–297:7. doi: 10.1145/3544549.3585808.
[5] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond code generation: An observational study of ChatGPT usage in software engineering practice,” Proc. ACM Softw. Eng., vol. 1, no. FSE, pp. 1819–1840, 2024, doi: 10.1145/3660788.
[6] R. Choudhuri, D. Liu, I. Steinmacher, M. A. Gerosa, and A. Sarma, “How far are we? The triumphs and trials of generative AI in learning software engineering,” in Proceedings of the 46th IEEE/ACM international conference on software engineering, ICSE 2024, lisbon, portugal, april 14-20, 2024, ACM, 2024, pp. 184:1–184:13. doi: 10.1145/3597503.3639201.
[7] Y. Xue, H. Chen, G. R. Bai, R. Tairas, and Y. Huang, “Does ChatGPT help with introductory programming? An experiment of students using ChatGPT in CS1,” in Proceedings of the 46th international conference on software engineering: Software engineering education and training, SEET@ICSE 2024, lisbon, portugal, april 14-20, 2024, ACM, 2024, pp. 331–341. doi: 10.1145/3639474.3640076.
Use an Open LLM as a Baseline
Recommendations
To ensure that empirical studies using Large Language Models (LLMs) in software engineering are reproducible and comparable, we recommend incorporating an open LLM as a baseline, regardless of whether LLMs are being explored or evaluated on specific software engineering tasks. Sometimes, including an open LLM baseline might not be feasible; even then, evaluating an open LLM on an early version of the studied tool or method is advisable. Open LLMs are available on platforms such as Hugging Face and can be run locally using tools such as Ollama or LM Studio. We recommend providing a replication package with clear, step-by-step instructions on how to reproduce the reported results using the open LLM. This makes the research more reliable, since it allows others to confirm the findings. For example, researchers could report: “We compared our results to Meta’s Code Llama, which is available on Hugging Face,” and SHOULD include a link to the replication package.
The term open, when applied to an LLM, can have various meanings. Widder et al. [1] discuss three types of openness (transparency, reusability, and extensibility) and what openness in AI can and cannot provide. Moreover, the Open Source Initiative (OSI) [2] provides a definition of open-source AI that serves as a useful framework for evaluating the openness of AI models. In simple terms, according to OSI, open-source AI means having access to everything needed to use, understand, modify, share, retrain, and recreate the AI. Thus, researchers must be clear about what “open” means in their context.
Finally, we recommend reporting and analysing inter-model agreement metrics. These metrics quantify the consistency between your model’s outputs and the baseline, thus they support the identification of potential biases or disagreement areas. Moreover, we recommend reporting the model confidence scores to analyse the model uncertainty. The analysis of inter-model agreement and model confidence provides valuable insights into the reliability and robustness of LLM performance, allowing a deeper understanding of their capabilities and limitations.
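In many cases, an open baseline can be added with little effort by running an open model locally. The following sketch uses the Hugging Face transformers library; the model identifier and prompt are illustrative assumptions, and larger open models may require substantial hardware.

```python
from transformers import pipeline

# Illustrative open code model from Hugging Face; any documented open LLM can serve as baseline.
generator = pipeline("text-generation", model="bigcode/starcoderbase-1b")

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
baseline_output = generator(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]

print(baseline_output)  # compare against the proprietary model's output using the same metrics
```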
Example(s)
- Benchmarking a Proprietary LLM: Researchers who want to assess how well their proprietary LLM writes code might compare it against an open model such as StarCoderBase. We recommend reporting the exact versions of each model, the hardware details, and the precise prompts, and comparing metrics such as pass@k scores and response latency.
- Evaluating an LLM-Powered Tool: A team developing an AI-driven code review tool might want to assess the quality of suggestions generated by both a proprietary LLM and an open alternative. Human evaluators could then independently rate the relevance and correctness of the suggestions, providing an objective measure of the tool’s effectiveness.
- Ensuring Reproducibility with a Replication Package: A study on bug localization that uses a closed-source LLM could support reproducibility by including a replication package. This package might contain a script that automatically reruns the same experiments using an open-source LLM, such as Llama 3, and generates a comparative report.
Advantages
- Improved Reproducibility: Researchers can independently replicate experiments.
- More Objective Comparisons: Using a standardized baseline allows for more unbiased evaluations.
- Greater Transparency: Open models enable the analysis of how data is processed, which supports researchers in identifying potential biases and limitations.
- Long-Term Accessibility: Unlike proprietary models, which may become unavailable, open LLMs remain available for future studies.
- Lower Costs: Open-source models usually have fewer licensing restrictions, which makes them more accessible to researchers with limited funding.
Challenges
- Performance Differences: Open models may not always match the latest proprietary LLMs in accuracy or efficiency, making it harder to demonstrate improvements.
- Computational Demands: Running large open models requires hardware resources, including high-performance GPUs and significant memory.
- Defining “Openness”: The term open is evolving; many so-called open models provide access to weights but do not disclose training data or methodologies. We are aware that the definition of an “open” model is actively being discussed, and many open models are essentially only “open weight” [3]. We consider the Open Source AI Definition proposed by the Open Source Initiative (OSI) [2] to be a first step towards defining true open-source models.
- Implementation Complexity: Unlike cloud-based APIs from proprietary providers, setting up and fine-tuning open models can be technically demanding, also due to potentially limited documentation.
Study Types
- Tool Evaluation: An open LLM baseline MUST be included if technically feasible. If integration is too complex, researchers SHOULD at least report initial benchmarking results using open models.
- Benchmarking Studies and Controlled Experiments: An open LLM MUST be one of the models evaluated.
- Observational Studies: If using an open LLM is not possible, researchers SHOULD acknowledge its absence and discuss potential impacts on their findings.
- Qualitative Studies: If the LLM is used for exploratory data analysis or to compare alternative interpretations of results, an open LLM baseline MAY be reported.
References
[1] D. G. Widder, M. Whittaker, and S. M. West, “Why ‘open’ AI systems are actually closed, and why this matters,” Nature, vol. 635, no. 8040, pp. 827–833, 2024.
[2] Open Source Initiative (OSI), “Open Source AI Definition 1.0.” https://opensource.org/ai/open-source-ai-definition.
[3] E. Gibney, “Not all ‘open source’ AI models are actually open,” Nature News, 2024, doi: 10.1038/d41586-024-02012-5.
Report Suitable Baselines, Benchmarks, and Metrics
Recommendations
Benchmarks, baselines, and metrics play a significant role in assessing the effectiveness of LLMs or LLM-based tools for specific tasks. When selecting benchmarks, it is important to understand the contained data, tasks, and solutions, because they determine what exactly the benchmark assesses. We recommend that researchers briefly summarize the selected benchmark and explain why they consider it suitable for their study. If multiple benchmarks exist for the same task, the goal should be to compare performance across benchmarks. For example, many benchmarks focus heavily on Python, and often on isolated functions. This assesses a very specific part of software development, which is certainly not representative of the full breadth of software engineering.

For many tasks that LLMs are being evaluated for, traditional non-LLM-based approaches exist (e.g., for program repair) that can serve as a baseline. Even if LLM-based tools perform better, the question is whether the additional resources consumed justify the potentially marginal improvements. We recommend that researchers always check whether such traditional baselines exist and, if they do, compare them with the LLM or LLM-based tool using suitable metrics.

To make such comparisons between traditional and LLM-based approaches, or comparisons between LLM-based tools based on a benchmark, researchers need to carefully select suitable metrics. In general, we recommend using established metrics whenever possible (see summary below), as this enables secondary research. We further recommend that researchers carefully argue why the selected metrics are suitable for the given task or study. If an LLM-based tool that is supposed to support humans is evaluated, a relevant metric might be the acceptance rate, that is, the ratio of accepted artifacts (e.g., test cases, code snippets) to all artifacts that were generated and presented to the user. Another way of evaluating LLM-based tools is calculating inter-model agreement (see also Section Use an Open LLM as a Baseline). This also allows researchers to assess how dependent a tool’s performance is on specific models.
LLM-based generation is non-deterministic by design. This non-determinism requires repeating experiments to statistically assess the performance of a model or tool, e.g., using the arithmetic mean, confidence intervals, or standard deviations. We acknowledge that hyperparameters such as temperature, top-p, and top-k can reduce the non-determinism, but for most real-world use cases, the non-determinism is actually desired. TODO: Can we add a reference for that? Like: Setting temperature to 0 reduces effectiveness for solving certain tasks?
Example(s)
Benchmarks are model- and tool-independent standardized tests used to assess the performance of LLMs on specific tasks such as code summarization or generation. They consist of predefined, standardized problems along with their expected solutions. Metrics are used to quantify the performance for the benchmark tasks, enabling a comparison. Since LLMs require substantial hardware resources, baselines serve as a comparison to assess their performance against traditional algorithms with lower computational costs. Thus, a baseline represents a reference point for the measured LLM.
Two prevalent benchmarks used for code generation are HumanEval (GitHub) [1] and MBPP (GitHub) [2]. Both benchmarks consist of code snippets written in Python sourced from publicly available repositories. Each snippet consists of four parts: a prompt consisting of a function definition and a corresponding description of what the function should accomplish, a canonical solution, an entry point for execution, and tests. The input of the LLM is the entire prompt. The output of the LLM is evaluated either against the canonical solution using metrics or against a test suite. Other benchmarks for code generation include ClassEval (GitHub) [3], LiveCodeBench (GitHub) [4], and SWE-bench (GitHub) [5]. An example of a code translation benchmark is TransCoder [6] (GitHub).
According to Hou et al., the main problem types for LLMs are classification, recommendation, and generation problems [7]. Each of these problem types requires a different set of metrics. Hou et al. provide a comprehensive overview of benchmarks categorized by software engineering tasks [7]. Common metrics for assessing generation tasks are BLEU, pass@k, Accuracy/Accuracy@k, and Exact Match. The most common metric for recommendation tasks is Mean Reciprocal Rank. For classification tasks, classical machine learning metrics such as Precision, Recall, F1-score, and Accuracy are often reported.
We now briefly discuss two common metrics used for generation tasks. BLEU-N [1] is a similarity score based on n-gram precision between two strings, ranging from 0 to 1. Values close to 0 indicate dissimilar content, while values closer to 1 represent similar content; a value closer to 1 thus indicates that the model is more capable of generating the expected output for code generation. BLEU has multiple variations. TODO: Is BLEU a synonym for BLEU-N? CodeBLEU [8] and CrystalBLEU [9] are the most notable variations tailored to code, introducing additional heuristics such as AST matching. As mentioned above, researchers should motivate why they chose a certain metric or variant thereof for their particular study.
The metric pass@k reports the likelihood of a model correctly completing a code snippet at least once within k tries. TODO: Who came up with pass@k? Reference? What counts as correct depends on the given task. For code generation, correctness is often defined based on test cases: a passing test suite then means that the solution is correct. The resulting pass rate ranges from 0 to 1. A pass rate of 0 indicates that the model was not able to generate a single correct solution within k tries; a pass rate of 1 indicates that the model generated at least one correct solution within k tries. The metric is defined as:

$$\text{pass@}k = \mathop{\mathbb{E}}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where $n$ is the total number of generated samples per prompt, $c$ is the number of correct samples among the $n$ samples, and $k$ is the number of samples considered.
Choosing an appropriate value for k depends on the downstream task of the model and how end-users interact with the model. A high pass@1 rate is desirable in tasks where the system presents only one solution or where a single solution requires high computational effort. For example, code completion depends on a single prediction since the end user typically sees only a single suggestion. Pass rates for higher values of k indicate whether the model can solve the given task within multiple attempts. Scenarios where a larger k is reasonable include human-in-the-loop workflows or reranking approaches. For example, an end user might choose between multiple suggestions for generated test cases.
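The following minimal sketch shows the commonly used unbiased estimator of pass@k, computed per problem from n generated samples of which c are correct, and then averaged over all problems; the example numbers are illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any draw of k samples contains a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, three problems with 5, 0, and 12 correct samples, k = 10.
per_problem = [pass_at_k(20, c, 10) for c in (5, 0, 12)]
print(f"pass@10 = {np.mean(per_problem):.2f}")
```

The product form avoids computing large binomial coefficients directly and is numerically stable even for large n.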
TODO: Add SE papers that use some of the benchmarks and metrics discussed above
Advantages
Challenges
A general challenge with benchmarks for LLMs is that the most prominent ones, such as HumanEval and MBPP, use Python, introducing a bias towards this specific programming language and its idiosyncrasies. Since model performance is measured against these benchmarks, researchers often optimize for them. As a result, performance may degrade if programming languages other than Python are used.
Many closed-source models, such as those released by OpenAI, achieve exceptional performance on certain tasks but lack transparency and reproducibility. Benchmark leaderboards, particularly for code generation, are led by closed-source models. While researchers should compare performance against these models, they must consider that providers might discontinue them or apply undisclosed pre- or postprocessing beyond the researchers’ control (see also Section Use an Open LLM as a Baseline).
Challenges with individual metrics include that, for example, BLEU is a syntactic metric and hence does not measure semantic correctness or structural correctness. Thus, a high BLEU score does not directly indicate that the generated code is executable. TODO: Mention that alternatives exist? Do they come with other downsides? When researchers observe unexpected values for certain metrics, the specific results should be investigated in more detail to uncover potential problems. Such problems can, for example, be related to formatting, since code formatting highly influences metrics such as BLEU or Exact Match.
Another challenge to consider is that metrics usually capture one specific aspect of a task or solution. For instance, metrics such as pass@k do not reflect qualitative aspects of code such as maintainability, cognitive load, or readability. These aspects are critical for the downstream task and influence the overall usability. Moreover, benchmarks are isolated test sets and may not fully represent real-world applications. For example, benchmarks such as HumanEval synthesize code based on written specifications. However, such explicit descriptions are rare in real-world applications. Thus, evaluating the model performance with benchmarks might not reflect real-world tasks and end-user usability.
Finally, benchmark data contamination [10] continues to be a major challenge as well. In many cases, the training data set for an LLM is not released in conjunction with the model. The benchmark itself could be part of the model’s training dataset. Such benchmark contamination may lead to the model remembering the actual solution from the training data rather than solving the new task based on the seen data. This leads to artificially high performance on the benchmark. For unforeseen scenarios, however, the model might perform much worse.
Study Types
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [11]).
References
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: A method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the association for computational linguistics, july 6-12, 2002, philadelphia, PA, USA, ACL, 2002, pp. 311–318. doi: 10.3115/1073083.1073135.
[2] J. Austin et al., “Program synthesis with large language models,” CoRR, vol. abs/2108.07732, 2021, Available: https://arxiv.org/abs/2108.07732
[3] X. Du et al., “ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation,” CoRR, vol. abs/2308.01861, 2023, doi: 10.48550/ARXIV.2308.01861.
[4] N. Jain et al., “LiveCodeBench: Holistic and contamination free evaluation of large language models for code,” CoRR, vol. abs/2403.07974, 2024, doi: 10.48550/ARXIV.2403.07974.
[5] C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world github issues?” in The twelfth international conference on learning representations, ICLR 2024, vienna, austria, may 7-11, 2024, OpenReview.net, 2024. Available: https://openreview.net/forum?id=VTF8yNQM66
[6] M.-A. Lachaux, B. Rozière, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” CoRR, vol. abs/2006.03511, 2020, Available: https://arxiv.org/abs/2006.03511
[7] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.
[8] S. Ren et al., “CodeBLEU: A method for automatic evaluation of code synthesis,” CoRR, vol. abs/2009.10297, 2020, Available: https://arxiv.org/abs/2009.10297
[9] A. Eghbali and M. Pradel, “CrystalBLEU: Precisely and efficiently measuring the similarity of code,” in 37th IEEE/ACM international conference on automated software engineering, ASE 2022, rochester, MI, USA, october 10-14, 2022, ACM, 2022, pp. 28:1–28:12. doi: 10.1145/3551349.3556903.
[10] C. Xu, S. Guan, D. Greene, and M. T. Kechadi, “Benchmark data contamination of large language models: A survey,” CoRR, vol. abs/2406.04244, 2024, doi: 10.48550/ARXIV.2406.04244.
[11] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
Report Limitations and Mitigations
Recommendations
When using large language models (LLMs) in empirical studies in software engineering, researchers face unique challenges and potential limitations that can influence the validity, reliability, and reproducibility [1] of their findings, including:
Reproducibility: A cornerstone of open science is the ability to reproduce research results. Even though the inherent non-deterministic nature of LLMs is a strength in many use cases, its impact on reproducibility is a challenge. To enable reproducibility, researchers
- MUST disclose a replication package for their study and
- SHOULD perform multiple evaluation iterations of their experiments (see Report Suitable Baselines, Benchmarks, and Metrics) to account for non-deterministic outputs of LLMs (a minimal sketch of such repeated runs follows after this list) or
- MAY disable non-determinism by setting the temperature to 0 and stating a seed value if the research allows for it.
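As referenced in the list above, the following is a minimal sketch of repeating an evaluation and reporting the spread of results; run_experiment is a placeholder for one full evaluation run of the study, and the metric name is illustrative.

```python
import statistics

def run_experiment(seed: int) -> float:
    """Placeholder: one full evaluation run returning, e.g., the pass@1 score."""
    raise NotImplementedError

# Repeat the evaluation several times and report the spread, not just a single run.
scores = [run_experiment(seed) for seed in range(5)]
mean = statistics.mean(scores)
stdev = statistics.stdev(scores)
print(f"pass@1: {mean:.3f} ± {stdev:.3f} (n={len(scores)})")
```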
Besides non-determinism, the behavior of an LLM depends on many external factors, such as model version, API updates, or prompt variations. To ensure reproducibility, researchers
- SHOULD follow the best practices described in Report Model Version and Configuration and
- SHOULD follow the best practices described in Report Tool Architecture and Supplemental Data and
- SHOULD follow the best practices described in Report Prompts and their Development and
- SHOULD follow the best practices described in Report Interaction Logs and
- SHOULD follow the best practices described in Report Suitable Baselines, Benchmarks, and Metrics and
- SHOULD address generalization challenges (see below) to ensure reproducibility over time.
Generalization: Even though the topic of generalizability is not new, it has gained new relevance with the increasing interest in LLMs. In LLM studies, generalizability boils down to two main concerns: First, are the results specific to an LLM, or can they be achieved with other LLMs too?
- If generalizability to other LLMs is not in the scope of the research, this MUST be clearly explained or
- If generalizability is in scope, researchers MUST compare their results or subsets of the results (if not possible, e.g., due to computational cost) with other LLMs to assess the generalizability of their findings.
Second, will these results still be valid in the future? Multiple studies [2], [3] found that the performance of proprietary LLMs (like GPT) within the same version (e.g., GPT-4) decreased over time. Reporting the model version and configuration is not sufficient in such cases. To date, the only way of mitigating this limitation is the use of an open LLM with transparently communicated versioning and archiving (see Use an Open LLM as a Baseline).
- Hence, researchers SHOULD employ open LLMs to set a reproducible baseline (see Use an Open LLM as a Baseline).
- If this is not possible, researchers SHOULD test and report their results over an extended period of time as a proxy of the results’ relevance over time.
Data Leakage: Data leakage/contamination/overfitting occurs when information from outside the training dataset influences the model, leading to overly optimistic performance estimates. With the growing reliance on big datasets, the risk of inter-dataset duplication increases (e.g., [4], [5]). In the context of LLMs for software engineering, this can, for example, manifest as samples from the pre-training dataset appearing in the fine-tuning or evaluation dataset, potentially compromising the validity of evaluation results [6].
- Hence, to ensure the validity of the evaluation results, researchers SHOULD carefully curate the fine-tuning and evaluation datasets to prevent inter-dataset duplication (a minimal duplicate-check sketch follows after this list).
- If information about the pre-training dataset of the employed LLM is available, researchers SHOULD assess the inter-dataset duplication and MUST discuss the potential data leakage.
- In addition, researchers MUST NOT leak their fine-tuning or evaluation datasets into the improvement process of the LLM (e.g., via OpenAI’s “Improve the model for everyone” functionality), as this potentially impacts further evaluation rounds (e.g., consider pass@k) and exacerbates the issue of reproducibility.
- If training an LLM from scratch, researchers MAY consider using open datasets (such as Together AI’s RedPajama [7]) that already incorporate deduplication (with the positive side effect of potentially improving performance [8]).
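As referenced in the list above, a simple exact-duplicate check between the fine-tuning and evaluation data already catches the most obvious cases of inter-dataset duplication; near-duplicate detection (e.g., via MinHash) is more thorough but omitted here. The samples and the coarse normalization are illustrative.

```python
import hashlib

def normalize(code: str) -> str:
    """Very coarse normalization: collapse whitespace differences before hashing."""
    return " ".join(code.split())

def fingerprints(samples: list[str]) -> set[str]:
    return {hashlib.sha256(normalize(s).encode()).hexdigest() for s in samples}

fine_tune_set = ["def add(a, b):\n    return a + b"]  # illustrative samples
evaluation_set = ["def add(a, b): return a + b", "def sub(a, b): return a - b"]

overlap = fingerprints(fine_tune_set) & fingerprints(evaluation_set)
print(f"{len(overlap)} exact duplicate(s) between fine-tuning and evaluation data")
```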
Scalability and Cost: Conducting studies with LLMs is a resource-demanding endeavor. For self-hosted LLMs, the respective hardware needs to be provided; for managed LLMs, the service costs have to be considered. The challenge becomes more pronounced as LLMs grow larger, research architectures get more complex, and experiments become more computationally expensive, e.g., due to multiple repetitions to assess performance in the face of non-determinism (see Report Suitable Baselines, Benchmarks, and Metrics). Consequently, resource-intensive research remains predominantly the domain of well-funded researchers, hindering researchers with limited resources from replicating or extending study results.
- Hence, for transparency reasons, researchers SHOULD report the cost associated with executing the study. If the study employed self-hosted LLMs, researchers SHOULD report the hardware used. If the study employed managed LLMs, the service cost SHOULD be reported.
- To ensure research result validity and replicability, researchers MUST provide the LLM outputs as evidence for validation at different granularities (e.g., when employing multi-agent systems).
- Additionally, researchers SHOULD include a sample, selected using an accepted sampling strategy, to allow partial replication of results.
Misleading Performance Metrics: While metrics such as BLEU or ROUGE are commonly used to evaluate the performance of LLMs, they may not capture other relevant, software engineering-specific aspects such as functional correctness or the runtime performance of automatically generated code [9].
- Researchers SHOULD clarify and state all relevant requirements, and employ and report the metrics used to measure their satisfaction (e.g., test-case success rate); see the sketch after this list.
- Researchers SHOULD follow the best practices described in Report Suitable Baselines, Benchmarks, and Metrics.
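As an illustration, the following sketch complements text-similarity metrics with a functional-correctness measure, using the commonly used unbiased pass@k estimator on illustrative per-task test results; the numbers are made up:

```python
# Minimal sketch: reporting functional correctness via the unbiased pass@k
# estimator, given per-task counts of generated samples (n) and samples that
# pass all of the task's unit tests (c). Numbers are illustrative.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task (n samples, c of them correct)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# (n_samples, n_correct) per benchmark task -- illustrative values
test_results = [(10, 3), (10, 0), (10, 10), (10, 1)]

for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in test_results) / len(test_results)
    print(f"pass@{k} = {score:.2f}")
```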
Ethical Concerns with Sensitive Data: Sensitive data can range from personal to proprietary data, each with its own set of ethical concerns. A major threat when using proprietary LLMs with sensitive data is that the data may be used for model improvements (see "Data Leakage"), which can lead to privacy and IP violations. Another threat is the implicit bias of LLMs, potentially leading to discrimination or unfair treatment of individuals or groups. To mitigate these concerns, researchers:
- SHOULD minimize the sensitive data used in their studies (see the redaction sketch after this list),
- MUST follow applicable regulations (e.g., GDPR) and individual data processing agreements,
- SHOULD create a data management plan outlining how the data is handled and protected against leakage and discrimination, and
- SHOULD apply for approval from the ethics committee of their organization (if required).
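As a minimal illustration of data minimization, the following sketch redacts obvious personal identifiers before data is sent to an LLM; the regular expressions and placeholders are simplistic assumptions, and thorough anonymization will usually require more than this:

```python
# Minimal sketch (assumption: simple regex-based redaction is sufficient for the
# study at hand): removing obvious personal identifiers from free-text data
# before it is sent to an LLM. Real studies may need more thorough anonymization
# and MUST still comply with applicable regulations.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
USER_HANDLE = re.compile(r"@\w+")

def redact(text: str) -> str:
    """Replace e-mail addresses and @-handles with neutral placeholders."""
    text = EMAIL.sub("<EMAIL>", text)       # redact e-mails first, then remaining handles
    text = USER_HANDLE.sub("<USER>", text)
    return text

print(redact("Reported by jane.doe@example.com, assigned to @jdoe"))
# -> "Reported by <EMAIL>, assigned to <USER>"
```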
Performance and Resource Consumption: “The field of AI is currently primarily driven by research that seeks to maximize model accuracy — progress is often used synonymously with improved prediction quality. This endless pursuit of higher accuracy over the decade of AI research has significant implications for computational resource requirements and environmental footprint. To develop AI technologies responsibly, we must achieve competitive model accuracy at a fixed or even reduced computational and environmental cost.” [10]
The performance of an LLM is usually measured in terms of traditional metrics such as accuracy, precision, and recall, or more contemporary metrics such as pass@k or BLEU-N (see Report Suitable Baselines, Benchmarks, and Metrics). However, given how resource-hungry LLMs are, resource consumption has to become a key performance indicator for assessing research progress responsibly. While research has predominantly focused on the energy consumption of the early phases of LLMs (e.g., data center manufacturing, data acquisition, training), inference, i.e., the use of the LLM, often becomes similarly or even more resource-intensive [11], [10], [12], [13], [14]. Hence, researchers:
- SHOULD aim for lower resource consumption on the model side. This can be achieved by selecting smaller (e.g., GPT-4o mini instead of GPT-4o) or newer models as a base model for the study, or by employing techniques such as model pruning, quantization, or knowledge distillation [14] (see the quantization sketch after this list).
- SHOULD reduce resource consumption when using the LLMs, e.g., by restricting the number of queries, input tokens, or output tokens [14], by using different prompt engineering techniques (e.g., on average, zero-shot prompts seem to emit less CO2 than chain-of-thought prompts), or by carefully sampling smaller datasets for fine-tuning and evaluation instead of using large datasets in their entirety.
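As one option for reducing resource consumption on the model side, the following sketch loads an open model with 4-bit quantization via transformers and bitsandbytes; the model identifier is a placeholder, and quantization is only one of the techniques mentioned above:

```python
# Minimal sketch: reducing inference-time resource consumption by loading an
# open model with 4-bit quantization (bitsandbytes via transformers). The model
# identifier is a hypothetical placeholder; pruning, distillation, or simply
# choosing a smaller base model are alternative options.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # compute in half precision
)

model = AutoModelForCausalLM.from_pretrained(
    "example-org/example-open-llm",        # hypothetical model identifier
    quantization_config=quant_config,
    device_map="auto",
)
```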
To report the environmental impact of a study, researchers:
- SHOULD use software such as CodeCarbon or Experiment Impact Tracker to track and quantify the carbon footprint of the study (see the tracking sketch after this list), or
- SHOULD report an estimation of the carbon footprint through tools like MLCO2 Impact, or
- SHOULD detail the LLM version and configuration as described in Report Model Version and Configuration, state the hardware or model host, and report the total number of requests, accumulated input tokens, and output tokens.
- MUST justify why an LLM was chosen over existing approaches and set the achieved results in relation to the higher resource consumption of LLMs (also see Report Suitable Baselines, Benchmarks, and Metrics).
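The following sketch illustrates how the carbon footprint of a self-hosted experiment could be tracked with CodeCarbon and reported together with request and token totals; run_experiment() and its numbers are placeholders for the actual study code:

```python
# Minimal sketch: tracking the carbon footprint of a self-hosted experiment
# with CodeCarbon and reporting it together with request/token totals.
# run_experiment() is a placeholder for the actual study code.
from codecarbon import EmissionsTracker

def run_experiment() -> dict:
    # placeholder: query the LLM, collect outputs, count tokens, ...
    return {"requests": 500, "input_tokens": 750_000, "output_tokens": 120_000}

tracker = EmissionsTracker(project_name="llm-study")  # writes emissions.csv by default
tracker.start()
try:
    usage = run_experiment()
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2-equivalent

print(f"Requests: {usage['requests']}, input tokens: {usage['input_tokens']}, "
      f"output tokens: {usage['output_tokens']}, emissions: {emissions_kg:.3f} kg CO2eq")
```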
Example(s)
Reproducibility: An example highlighting the need for caution around replicability is the study by Staudinger et al. [15], who attempted to replicate an LLM study that did not provide a replication package. They were not able to reproduce the exact results, even though they observed trends similar to those of the original study. They consider their results not reliable enough for a systematic review.
Generalization: To analyze whether the results of proprietary LLMs transfer to open LLMs, Staudinger et al. [15] benchmarked previous results obtained with GPT-3.5 and GPT-4 against Mistral and Zephyr. They found that the employed open-source models could not deliver the same performance as the proprietary models, i.e., the observed effect was restricted to certain proprietary models. Individual studies have already started to highlight the uncertainty about the future generalizability of their results. In [16], Jesse et al. acknowledge that LLMs evolve over time and that this evolution might impact their study's results.
Data Leakage: Since much research in software engineering revolves around code, inter-dataset code duplication has been extensively researched and addressed over the years to curate deduplicated datasets (e.g., by Lopes et al. in 2017 [4], Allamanis in 2019 [5], Karmakar et al. in 2023 [17], or López et al. in 2025 [6]). The issue of inter-dataset duplication has also attracted interest in other disciplines with a growing demand for data mining. For example, in biology, Lakiotaki et al. [18] acknowledge and address the overlap between multiple common disease datasets. In the domain of code generation, Coignion et al. [19] evaluated the performance of LLMs in producing solutions to LeetCode problems. To mitigate the issue of inter-dataset duplication, they only used LeetCode problems published after January 1, 2023, reducing the likelihood of LLMs having seen those problems before. Further, they discuss the performance differences of LLMs on different datasets in light of potential inter-dataset duplication. In [20], Zhou et al. performed an empirical evaluation of data leakage in 83 software engineering benchmarks. While most benchmarks suffer only from minimal leakage, a few exhibited leakage of up to 100%, and the authors found a high impact of data leakage on the performance evaluation. A starting point for studies that aim to assess and mitigate inter-dataset duplication are the Falcon LLMs: the Technology Innovation Institute publicly provides access to parts of the training data of its Falcon LLMs [21] via Hugging Face. Through this dataset, it is possible to reduce the overlap between the pre-training and evaluation data, improving the validity of the evaluation results. A starting point to prevent actively leaking data into an LLM improvement process is to ensure that the data is not used to train the model (e.g., via OpenAI's data control functionality, or by using the OpenAI API instead of the web interface) [22].
Scalability and Cost: An example of stating the cost of a study can be found in Staudinger et al. [15], who specified the costs for their study as “120 USD in API calls for GPT 3.5 and GPT 4, and 30 USD in API calls for Mistral AI. Thus, the total LLM cost of our reproducibility study was 150 USD”.
Ethical Concerns with Sensitive Data: Bias can occur in datasets as well as in LLMs that have been trained on them and results in various types of discrimination. Gallegos et al. propose metrics to quantify biases in various tasks (e.g., text generation, classification, question answering) [23].
Performance and Resource Consumption: In [24] Tinnes et al. balanced the dataset size between the need for manual semantic analysis and computational resource consumption.
Advantages
Generalization: Mitigating threats to generalizability through the integration of open LLMs as a baseline or the reporting of results over an extended period of time can increase the validity, reliability, and replicability of a study’s results.
Data Leakage: Assessing and mitigating the effects of inter-dataset duplication strengthens a study’s validity and reliability, as it prevents overly optimistic performance estimates that do not apply to previously unknown samples.
Scalability and Cost: Reporting the cost associated with executing a study not only increases transparency but also supports secondary literature in setting primary research into perspective. Providing replication packages that contain direct LLM output evidence as well as samples for partial replicability is a paramount step towards open and inclusive research in light of resource inequality among researchers.
Performance and Resource Consumption: Mindfully deciding and justifying the usage of LLMs over other approaches can lead to more efficient and sustainable approaches. Reporting the environmental impact of the usage of LLMs also sets the stage for more sustainable research practices in the field of AI.
Challenges
Generalization: With commercial LLMs evolving over time, the generalizability of results to future versions of the model is uncertain. Employing open LLMs as a baseline can mitigate this limitation, but may not always be feasible due to computational cost.
Data Leakage: Most LLM providers do not publicly offer information about the datasets employed for pre-training, impeding the assessment of inter-dataset duplication effects.
Scalability and Cost: Consistently keeping track of and reporting the cost involved in a research endeavor is challenging. Building a coherent replication package that includes LLM outputs and samples for partial replicability requires additional effort and resources.
Misleading Performance Metrics: Defining all requirements beforehand to ensure the usage of suitable metrics can be challenging, especially in exploratory research. In this growing field of research, finding the right metrics to evaluate the performance of LLMs in software engineering for specific tasks is challenging. Report Suitable Baselines, Benchmarks, and Metrics can serve as a starting point.
Ethical Concerns with Sensitive Data: Ensuring compliance across jurisdictions is difficult with different regions having different regulations and requirements (e.g., GDPR in the EU, CCPA in California). Selecting datasets and models with less bias is challenging, as the bias in LLMs is often not transparently reported.
Performance and Resource Consumption: Measuring or estimating the environmental impact of a study is challenging and might not always be feasible. Especially in exploratory research, the impact is hard to estimate beforehand, making it difficult to justify the usage of LLMs over other approaches.
Study Types
The limitations and mitigations SHOULD be addressed for all study types in a sensible manner, i.e., depending on their applicability to the individual study.
References
[1] J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: The threats of using llms in software engineering,” in Proceedings of the 2024 ACM/IEEE 44th international conference on software engineering: New ideas and emerging results, 2024, pp. 102–106.
[2] L. Chen, M. Zaharia, and J. Zou, “How is ChatGPT’s behavior changing over time?” CoRR, vol. abs/2307.09009, 2023, doi: 10.48550/ARXIV.2307.09009.
[3] D. Li, K. Gupta, M. Bhaduri, P. Sathiadoss, S. Bhatnagar, and J. Chong, “Comparing GPT-3.5 and GPT-4 accuracy and drift in radiology diagnosis please cases,” Radiology, vol. 310, no. 1, p. e232411, 2024, doi: 10.1148/radiol.232411.
[4] C. V. Lopes et al., “Déjàvu: A map of code duplicates on GitHub,” Proc. ACM Program. Lang., vol. 1, no. OOPSLA, pp. 84:1–84:28, 2017, doi: 10.1145/3133908.
[5] M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in Proceedings of the 2019 ACM SIGPLAN international symposium on new ideas, new paradigms, and reflections on programming and software, onward! 2019, athens, greece, october 23-24, 2019, H. Masuhara and T. Petricek, Eds., ACM, 2019, pp. 143–153. doi: 10.1145/3359591.3359735.
[6] J. A. H. López, B. Chen, M. Saad, T. Sharma, and D. Varró, “On inter-dataset code duplication and data leakage in large language models,” IEEE Trans. Software Eng., vol. 51, no. 1, pp. 192–205, 2025, doi: 10.1109/TSE.2024.3504286.
[7] Together Computer, RedPajama: An open dataset for training large language models. (2023). Available: https://github.com/togethercomputer/RedPajama-Data
[8] K. Lee et al., “Deduplicating training data makes language models better,” in Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), ACL 2022, dublin, ireland, may 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., Association for Computational Linguistics, 2022, pp. 8424–8445. doi: 10.18653/V1/2022.ACL-LONG.577.
[9] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation,” in Advances in neural information processing systems 36: Annual conference on neural information processing systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., 2023. Available: http://papers.nips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html
[10] C.-J. Wu et al., “Sustainable AI: Environmental implications, challenges and opportunities,” in Proceedings of the fifth conference on machine learning and systems, MLSys 2022, Santa Clara, CA, USA, August 29 - September 1, 2022, D. Marculescu, Y. Chi, and C.-J. Wu, Eds., mlsys.org, 2022. Available: https://proceedings.mlsys.org/paper_files/paper/2022/hash/462211f67c7d858f663355eff93b745e-Abstract.html
[11] A. de Vries, “The growing energy footprint of artificial intelligence,” Joule, vol. 7, no. 10, pp. 2191–2194, 2023.
[12] Z. Fu, F. Chen, S. Zhou, H. Li, and L. Jiang, “LLMCO2: Advancing accurate carbon footprint prediction for LLM inferences,” CoRR, vol. abs/2410.02950, 2024, doi: 10.48550/ARXIV.2410.02950.
[13] P. Jiang, C. Sonne, W. Li, F. You, and S. You, “Preventing the immense increase in the life-cycle energy and carbon footprints of LLM-powered intelligent chatbots,” Engineering, vol. 40, pp. 202–210, 2024, doi: 10.1016/j.eng.2024.04.002.
[14] N. E. Mitu and G. T. Mitu, “The hidden cost of AI: Carbon footprint and mitigation strategies,” Available at SSRN 5036344, 2024.
[15] M. Staudinger, W. Kusa, F. Piroi, A. Lipani, and A. Hanbury, “A reproducibility and generalizability study of large language models for query generation,” in Proceedings of the 2024 annual international ACM SIGIR conference on research and development in information retrieval in the asia pacific region, SIGIR-AP 2024, tokyo, japan, december 9-12, 2024, T. Sakai, E. Ishita, H. Ohshima, F. Hasibi, J. Mao, and J. M. Jose, Eds., ACM, 2024, pp. 186–196. doi: 10.1145/3673791.3698432.
[16] K. Jesse, T. Ahmed, P. T. Devanbu, and E. Morgan, “Large language models and simple, stupid bugs,” in 20th IEEE/ACM international conference on mining software repositories, MSR 2023, melbourne, australia, may 15-16, 2023, IEEE, 2023, pp. 563–575. doi: 10.1109/MSR59073.2023.00082.
[17] A. Karmakar, M. Allamanis, and R. Robbes, “JEMMA: An extensible java dataset for ML4Code applications,” Empir. Softw. Eng., vol. 28, no. 2, p. 54, 2023, doi: 10.1007/S10664-022-10275-7.
[18] K. Lakiotaki, N. Vorniotakis, M. Tsagris, G. Georgakopoulos, and I. Tsamardinos, “BioDataome: A collection of uniformly preprocessed and automatically annotated datasets for data-driven biology,” Database J. Biol. Databases Curation, vol. 2018, p. bay011, 2018, doi: 10.1093/DATABASE/BAY011.
[19] T. Coignion, C. Quinton, and R. Rouvoy, “A performance study of LLM-generated code on leetcode,” in Proceedings of the 28th international conference on evaluation and assessment in software engineering, EASE 2024, salerno, italy, june 18-21, 2024, ACM, 2024, pp. 79–89. doi: 10.1145/3661167.3661221.
[20] X. Zhou et al., “LessLeak-bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks.” 2025. Available: https://arxiv.org/abs/2502.06215
[21] Technology Innovation Institute, “Falcon-refinedweb (revision 184df75).” Hugging Face, 2023. doi: 10.57967/hf/0737.
[22] S. Balloccu, P. Schmidtová, M. Lango, and O. Dusek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs,” in Proceedings of the 18th conference of the european chapter of the association for computational linguistics, EACL 2024 - volume 1: Long papers, st. Julian’s, malta, march 17-22, 2024, Y. Graham and M. Purver, Eds., Association for Computational Linguistics, 2024, pp. 67–93. Available: https://aclanthology.org/2024.eacl-long.5
[23] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” CoRR, vol. abs/2309.00770, 2023, doi: 10.48550/ARXIV.2309.00770.
[24] C. Tinnes, A. Welter, and S. Apel, “Software model evolution with large language models: Experiments on simulated, public, and industrial datasets.”