Guidelines

This set of guidelines is currently a DRAFT and based on a discussion session with researchers at the 2024 International Software Engineering Research Network (ISERN) meeting and at the 2nd Copenhagen Symposium on Human-Centered Software Engineering AI. This draft is meant as a starting point for further discussions in the community with the aim of developing a common understanding of how we should conduct and report empirical studies involving large language models (LLMs). See also the pages on study types and scope.

The wording of the recommendations follows RFC 2119 and 8174.

Overview

  1. Declare LLM Usage and Role
  2. Report Model Version and Configuration
  3. Report Tool Architecture and Supplemental Data
  4. Report Prompts and their Development
  5. Report Interaction Logs
  6. Use Human Validation for LLM Outputs
  7. Use an Open LLM as a Baseline
  8. Report Suitable Baselines, Benchmarks, and Metrics
  9. Report Limitations and Mitigations

Declare LLM Usage and Role

Recommendations

When conducting any kind of empirical study involving LLMs, researchers MUST clearly declare that an LLM was used in a suitable section of the paper (e.g., in the introduction or research methods section). For authoring scientific articles, such transparency is, for example, required by the ACM Policy on Authorship: “The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work” [1]. Beyond generic authorship declarations and statements that LLMs were used as part of the research process, researchers SHOULD report in the paper the exact purpose of using an LLM in a study, the tasks it was used to automate, and the expected outcomes.

Example(s)

The ACM Policy on Authorship suggests disclosing the usage of Generative AI tools in the acknowledgements section of papers, for example: “ChatGPT was utilized to generate sections of this Work, including text, tables, graphs, code, data, citations” [1]. Similarly, the acknowledgements section could also be used to disclose Generative AI usage for other aspects of the research, if not explicitly described in other parts of the paper. The ACM policy further suggests: “If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work” [1].

An example of an LLM disclosure can be found in a recent paper by Lubos et al. [2], in which they write in the methodology section:

“We conducted an LLM-based evaluation of requirements utilizing the Llama 2 language model with 70 billion parameters, fine-tuned to complete chat responses…”

Advantages

Transparency in the usage of LLMs helps in understanding the context and scope of the study, facilitating better interpretation and comparison of results. Beyond this declaration, we recommend that authors be explicit about the LLM version they used (see Section Report Model Version and Configuration) and the LLM’s exact role (see Section Report Tool Architecture and Supplemental Data).

Challenges

We do not expect any challenges for researchers following this guideline.

Study Types

This guideline MUST be followed for all study types.

References

[1] Association for Computing Machinery, “ACM Policy on Authorship.” https://www.acm.org/publications/policies/new-acm-policy-on-authorship, 2023.

[2] S. Lubos et al., “Leveraging LLMs for the quality assurance of software requirements,” in 32nd IEEE International Requirements Engineering Conference, RE 2024, Reykjavik, Iceland, June 24-28, 2024, G. Liebel, I. Hadar, and P. Spoletini, Eds., IEEE, 2024, pp. 389–397. doi: 10.1109/RE59067.2024.00046.

Report Model Version and Configuration

Recommendations

LLMs or LLM-based tools, especially those offered as-a-service, are frequently updated; different versions may produce varying results for the same input. Moreover, configuration parameters such as the temperature affect content generation. Therefore, researchers MUST document the specific model or tool version used in a study, the date when the experiments were conducted, and the exact configuration used. Since default values might change over time, researchers SHOULD always report all configuration values, even if they used the defaults. Researchers MUST also motivate why certain models, versions, and configurations were selected, for example due to monetary or technical reasons, or because previous work used the same models. This also applies to the model size: smaller models might have been selected due to hardware constraints, which is perfectly fine but needs to be reported. Depending on the specific study context, additional information regarding the architecture of the tool or experiment SHOULD be reported (see Section Report Tool Architecture and Supplemental Data). Our recommendation is to report:

  • Model/tool name.

  • Model/tool version (including a checksum if available).

  • The configured temperature that controls randomness, and all other relevant parameters that affect output generation (e.g., seed values).

  • The context window (number of tokens).

  • Whether historical context was considered when generating responses.

Example(s)

For an OpenAI model, researchers might report that “A gpt-4 model was integrated via the Azure OpenAI Service, and configured with a temperature of 0.7, top_p set to 0.8, and a maximum token length of 1. We used version 0125-Preview, system fingerprint fp_6b68a8204b, seed value 23487, and ran our experiment on 10th January 2025” [1], [2]. Similar statements can be made for self-hosted models, for which supplementary material can report specific instructions for reproducing results. For example, for models provisioned using ollama, one can report the specific tag and checksum of the model being used, e.g., ‘llama3.3, tag 70b-instruct-q8_0, checksum d5b5e1b84868’. Given suitable hardware, running the corresponding model in its default configuration is then as easy as executing ollama run llama3.3:70b-instruct-q8_0 (see Section Use an Open LLM as a Baseline).

Kang et al. provide a similar statement in their paper on exploring LLM-based general bug reproduction [3]:

“We access OpenAI Codex via its closed beta API, using the code-davinci-002 model. For Codex, we set the temperature to 0.7, and the maximum number of tokens to 256.”

Our guidelines additionally suggest reporting a checksum and exact dates, but otherwise this example is close to our recommendations.
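
To make such reporting systematic rather than an afterthought, the relevant metadata can be captured programmatically at the time the experiment is run. The following minimal sketch assumes the v1-style OpenAI Python SDK and a valid API key; the helper name record_run_metadata, the request values, and the output file are illustrative, not prescribed by this guideline.

```
import json
from datetime import datetime, timezone

from openai import OpenAI  # assumes the v1-style OpenAI Python SDK and an API key in the environment

client = OpenAI()

def record_run_metadata(response, request_params, path="run_metadata.json"):
    """Store the configuration details recommended above alongside the run."""
    metadata = {
        "model_requested": request_params["model"],
        "model_reported": response.model,                    # exact model version resolved by the API
        "system_fingerprint": response.system_fingerprint,   # backend configuration indicator
        "temperature": request_params["temperature"],
        "top_p": request_params["top_p"],
        "seed": request_params["seed"],
        "max_tokens": request_params["max_tokens"],
        "run_date": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

request_params = {
    "model": "gpt-4-0125-preview",
    "temperature": 0.7,
    "top_p": 0.8,
    "seed": 23487,
    "max_tokens": 512,
}
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize the following bug report: ..."}],
    **request_params,
)
record_run_metadata(response, request_params)
```

Publishing such a metadata file in the replication package complements the textual description of the configuration in the paper.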

Advantages

The recommended information is a prerequisite for enabling reproducibility of LLM-based studies under the same or similar conditions. Please note that this information alone is generally not sufficient. Therefore, depending on the specific study setup, researchers SHOULD provide additional information about architecture and data (Report Tool Architecture and Supplemental Data), prompts (Report Prompts and their Development), interaction logs (Report Interaction Logs), and specific limitations and mitigations (Report Limitations and Mitigations).

Challenges

Different model providers and modes of operating the models allow for varying degrees of information. For example, OpenAI provides a model version and a system fingerprint describing the backend configuration that can also influence the output. However, the fingerprint is only intended to detect changes to the model or its configuration; as a user, one cannot revert to a certain fingerprint. As a beta feature, OpenAI also lets users set a seed parameter to receive “(mostly) consistent output” [4]. However, the seed value does not allow for full reproducibility, and the fingerprint changes frequently. While, as motivated above, open models significantly simplify re-running experiments, they also come with challenges in terms of reproducibility, as generated outputs can be inconsistent despite setting the temperature to 0 and using a fixed seed value (see the corresponding GitHub issue for Llama3).

Study Types

This guideline MUST be followed for all study types for which the researcher has access to (parts of) the model’s configuration. They MUST always report the configuration that is visible to them, acknowledging the reproducibility challenges of commercial tools and models offered as-a-service. When Studying LLM Usage in Software Engineering, for example the usage of commercial tools such as ChatGPT or GitHub Copilot, researchers MUST be as specific as possible in describing their study setup. The model name and date MUST always be reported. In those cases, reporting other aspects such as prompts (Report Prompts and their Development) and interaction logs (Report Interaction Logs) is essential.

References

[1] OpenAI, “OpenAI API Introduction.” https://platform.openai.com/docs/api-reference/chat/streaming, 2025.

[2] Microsoft, “Azure OpenAI Service models.” https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models, 2025.

[3] S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring LLM-based general bug reproduction,” in 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, IEEE, 2023, pp. 2312–2323. doi: 10.1109/ICSE48619.2023.00194.

[4] OpenAI, “How to make your completions outputs consistent with the new seed parameter.” https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter, 2023.

Report Tool Architecture and Supplemental Data

Recommendations

Oftentimes, there is a layer around the LLM that preprocesses data, prepares prompts or filters user requests. One example is ChatGPT, which, at the time of writing these guidelines, primarily uses the GPT-4o model. GitHub Copilot also relies on the same model, and researchers can build their own tools utilizing GPT-4o directly (e.g., via the OpenAI API). The infrastructure around the bare model can significantly contribute to the performance of a model in a given task. Therefore, it is important that researchers clearly describe the architecture and what the LLM contributes to the tool or method presented in a research paper.

If the LLM is used as a standalone system (e.g., ChatGPT-4o API without additional architecture layers), researchers SHOULD provide a brief explanation of how it was used rather than detailing a full system architecture. However, if the LLM is integrated into a more complex system with preprocessing, retrieval mechanisms, fine-tuning, or autonomous agents, researchers MUST clearly document the tool architecture, including how the LLM interacts with other components such as databases, retrieval mechanisms, external APIs, and reasoning frameworks. A high-level architectural diagram SHOULD be provided in these cases to improve transparency. To enhance clarity, researchers SHOULD explain design decisions, particularly regarding model access (e.g., API-based, fine-tuned, self-hosted) and retrieval mechanisms (e.g., keyword search, semantic similarity matching, rule-based extraction). Researchers MUST NOT omit critical architectural details that could impact reproducibility, such as hidden dependencies or proprietary tools that influence model behavior.

Additionally, when performance or time-sensitive measurements are relevant, researchers SHOULD explicitly describe the hosting environment of the LLM or LLM-based tool, as this can significantly impact results. This description SHOULD specify not only where the model runs (e.g., local infrastructure, cloud-based services, or dedicated hardware) but also relevant details about the environment, such as hardware specifications, resource allocation, and latency considerations.

If the LLM is part of an agent-based system that autonomously plans, reasons, or executes tasks, researchers MUST describe its architecture, including the agent’s role (e.g., planner, executor, coordinator), whether it is a single-agent or multi-agent system, how it interacts with external tools and users, and the reasoning framework used (e.g., chain-of-thought, self-reflection, multi-turn dialogue, tool usage). Researchers MUST NOT present an agent-based system without detailing how it makes decisions and executes tasks.
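
The level of detail expected here is illustrated by the toy, single-agent sketch below: even in a minimal setup, the agent's role, the available tools, the reasoning loop, and the stopping criterion all need to be documented. The call_llm stub and the tool names are hypothetical placeholders, not part of any specific framework.

```
def call_llm(prompt: str) -> str:
    # Stand-in for an actual model call; the model and configuration used here
    # must be reported (see Section Report Model Version and Configuration).
    return "FINISH placeholder answer"

# Tool registry available to the agent (hypothetical examples).
TOOLS = {
    "run_tests": lambda args: f"ran tests on {args}",
    "search_docs": lambda args: f"searched documentation for {args}",
}

def run_agent(task: str, max_steps: int = 5) -> list:
    """Single agent acting as planner and executor; returns the full trace for reporting."""
    trace, observation = [], task
    for step in range(max_steps):
        decision = call_llm(
            f"Task: {task}\nObservation: {observation}\n"
            f"Available tools: {list(TOOLS)}\n"
            "Reply with '<tool> <arguments>' or 'FINISH <answer>'."
        )
        trace.append({"step": step, "decision": decision})
        if decision.startswith("FINISH"):  # stopping criterion to document
            break
        tool_name, _, args = decision.partition(" ")
        observation = TOOLS[tool_name](args)
        trace[-1]["observation"] = observation
    return trace

print(run_agent("Reproduce the reported crash in the parser module"))
```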

If a retrieval or augmentation method is used (e.g., retrieval-augmented generation (RAG), rule-based retrieval, structured query generation, or hybrid approaches), researchers MUST describe how external data is retrieved, stored, and integrated into the LLM’s responses. This includes specifying the type of storage or database used (e.g., vector databases, relational databases, knowledge graphs) and how the retrieved information is selected and used. Stored data used for context augmentation MUST be reported, including details on data preprocessing, versioning, and update frequency. If this data is not confidential, an anonymized snapshot of the data used for context augmentation SHOULD be made available.
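
As an illustration of the retrieval details that need to be reported, the following dependency-free sketch retrieves context snippets by naive keyword overlap and inserts them into the prompt. The document store, scoring function, and prompt template are hypothetical stand-ins for whatever vector database, embedding model, and template a real study would use and would have to document.

```
# Hypothetical snapshot of the context store; a real study would report its
# source, size, preprocessing, version, and update frequency.
DOCUMENT_STORE = {
    "doc-001": "The payment service retries failed transactions three times.",
    "doc-002": "Authentication tokens expire after 30 minutes of inactivity.",
    "doc-003": "The build pipeline runs unit tests before integration tests.",
}

def retrieve(query: str, k: int = 2) -> list:
    """Rank documents by keyword overlap (stand-in for semantic similarity search)."""
    query_terms = set(query.lower().split())
    scored = sorted(
        DOCUMENT_STORE.items(),
        key=lambda item: len(query_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str) -> str:
    """Augment the user query with the retrieved context (the RAG step to document)."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("Why did the transaction fail after the token expired?"))
```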

Similarly, if the LLM is fine-tuned, researchers MUST describe the fine-tuning goal (e.g., domain adaptation, task specialization), procedure (e.g., full fine-tuning, parameter-efficient fine-tuning), and dataset (source, size, preprocessing, availability). They should include training details (e.g., compute resources, hyperparameters, loss function) and performance metrics (benchmarks, baseline comparison). If the data used for fine-tuning is not confidential, an anonymized snapshot of the data used for fine-tuning the model SHOULD be made available.
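
One lightweight way to report these fine-tuning details is a small, machine-readable “fine-tuning card” shipped with the replication package, as sketched below; all field values are invented examples.

```
import json

# Invented example values; report the actual goal, procedure, dataset, and training setup.
fine_tuning_card = {
    "goal": "domain adaptation to requirements engineering documents",
    "procedure": "parameter-efficient fine-tuning (LoRA)",
    "base_model": "llama3.3:70b-instruct-q8_0",
    "dataset": {
        "source": "anonymized industrial requirements (snapshot 2025-01-10)",
        "size": 12430,
        "preprocessing": "deduplication, removal of confidential identifiers",
        "availability": "anonymized snapshot in the replication package",
    },
    "training": {
        "epochs": 3,
        "learning_rate": 2e-4,
        "loss_function": "cross-entropy",
        "hardware": "1x NVIDIA A100 80 GB",
    },
    "evaluation": {"benchmark": "held-out 10% split", "metric": "F1-score"},
}

with open("fine_tuning_card.json", "w") as f:
    json.dump(fine_tuning_card, f, indent=2)
```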

Example(s)

Some empirical studies in software engineering involving LLMs have documented the architecture and supplemental data aligning with the recommended guidelines. Hereafter, we provide two examples.

Schäfer et al. conducted an empirical evaluation of using LLMs for automated unit test generation [1]. The authors provide a comprehensive description of the system architecture, detailing how the LLM is integrated into the software development workflow to analyze codebases and produce corresponding unit tests. The architecture includes components for code parsing, prompt formulation, interaction with the LLM, and integration of the generated tests into existing test suites. The paper also elaborates on the datasets utilized for training and evaluating the LLM’s performance in unit test generation. It specifies the sources of code samples, the selection criteria, and the preprocessing steps undertaken to prepare the data.

Dhar et al. conducted an exploratory empirical study to assess whether LLMs can generate architectural design decisions [2]. The authors detail the system architecture, including the decision-making framework, the role of the LLM in generating design decisions, and the interaction between the LLM and other components of the system. The study provides information on the fine-tuning approach and datasets used for evaluation, including the source of the architectural decision records, preprocessing methods, and the criteria for data selection.

Advantages

Documenting the architecture and supplemental data of LLM-based systems enhances reproducibility, transparency, and trust [3]. In empirical software engineering studies, this is essential for experiment replication, result validation, and benchmarking. Clear documentation of RAG, fine-tuning, and data storage enables comparison, optimizes efficiency, and upholds scientific rigor and accountability, fostering reliable and reusable research.

Challenges

Researchers face challenges in documenting LLM-based architectures, including proprietary APIs and dependencies that restrict disclosure, managing large-scale retrieval databases, and ensuring efficient query execution. They must also balance transparency with data privacy concerns, adapt to the evolving nature of LLM integrations, and, depending on the context, handle the complexity of multi-agent interactions and decision-making logic, all of which can impact reproducibility and system clarity.

Study Types

This guideline MUST be followed for all empirical study types involving LLMs, especially those using fine-tuned or self-hosted models, retrieval-augmented generation (RAG) or alternative retrieval methods, API-based model access, and agent-based systems where LLMs handle autonomous planning and execution.

References

[1] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Trans. Software Eng., vol. 50, no. 1, pp. 85–105, 2024, doi: 10.1109/TSE.2023.3334955.

[2] R. Dhar, K. Vaidhyanathan, and V. Varma, “Can LLMs generate architectural design decisions? An exploratory empirical study,” in 21st IEEE International Conference on Software Architecture, ICSA 2024, Hyderabad, India, June 4-8, 2024, IEEE, 2024, pp. 79–89. doi: 10.1109/ICSA59870.2024.00016.

[3] Q. Lu, L. Zhu, X. Xu, Z. Xing, and J. Whittle, “Toward responsible AI in the era of generative AI: A reference architecture for designing foundation model-based systems,” IEEE Softw., vol. 41, no. 6, pp. 91–100, 2024, doi: 10.1109/MS.2024.3406333.

Report Prompts and their Development

Recommendations

Prompts are critical in empirical software engineering studies involving LLMs. Depending on the task, prompts may include various types of content, such as source code, execution traces, error messages, natural language descriptions, or even screenshots and other multi-modal inputs. These elements significantly influence the model’s output, and understanding how exactly they were formatted and integrated is essential for transparency and reproducibility. We indicate absolute requirements for reproducibility using the keyword MUST and strongly recommended practices that could be omitted if justified using the keyword SHOULD.

Researchers MUST report the full text of prompts used, along with any surrounding instructions, metadata, or contextual information. The exact structure of the prompt should be described, including the order and format of each element. For example, when using code snippets, researchers should specify whether they were enclosed in markdown-style code blocks (e.g., triple backticks), if line breaks and whitespace were preserved, and whether additional annotations (e.g., comments) were included. Similarly, for other artifacts such as error messages, stack traces, or non-text elements like screenshots, researchers should explain how these were presented. If rich media was involved, such as in multi-modal models, details on how these inputs were encoded or referenced in the prompt are crucial.

When dealing with extensive or complex prompts, such as those involving large codebases or multiple error logs, researchers MUST describe strategies they used for handling input length constraints. Approaches might include truncating, summarizing, or splitting prompts into multiple parts. Token optimization measures, such as simplifying code formatting or removing unnecessary comments, should also be documented if applied.

In terms of strategy, prompts can vary widely based on the task design. Researchers MUST specify whether zero-shot, one-shot, or few-shot prompting was used. For few-shot prompts, the examples provided to the model should be clearly outlined, along with the rationale for selecting them. If multiple versions of a prompt were tested, researchers should describe how these variations were evaluated and how the final design was chosen.

In cases where prompts are generated dynamically—such as through preprocessing, template structures, or retrieval-augmented generation (RAG)—the process MUST be thoroughly documented. This includes explaining any automated algorithms or rules that influenced prompt generation. For studies involving human participants, where users might create or modify prompts themselves, researchers MUST describe how these prompts were collected and analyzed. If full disclosure is not feasible due to privacy concerns, summaries and representative examples should be provided.
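
For template-based prompt generation, the template itself is part of the study design and can be reported verbatim. The sketch below uses Python's string.Template for a hypothetical bug-fixing task and prints the concrete prompt so that it can be logged; the template text and placeholders are illustrative.

```
from string import Template

# Template reported verbatim in the paper or replication package.
# How code is delimited (e.g., markdown-style triple backticks) should also be stated.
PROMPT_TEMPLATE = Template(
    "You are a coding assistant. Below is a $language script that fails with an error. "
    "Analyze the code and suggest a fix.\n"
    "Code:\n$code\n"
    "Error message:\n$error"
)

def build_prompt(language: str, code: str, error: str) -> str:
    return PROMPT_TEMPLATE.substitute(language=language, code=code, error=error)

prompt = build_prompt(
    language="Python",
    code="def divide(a, b):\n    return a / b\n\nprint(divide(10, 0))",
    error="ZeroDivisionError: division by zero",
)
print(prompt)  # log every concrete prompt that was sent (see Report Interaction Logs)
```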

To ensure full reproducibility, researchers MUST make all prompts and prompt variations publicly available in an online appendix, replication package, or repository. If the full set of prompts is too extensive to include in the paper itself, researchers SHOULD still provide representative examples and describe variations in the main body of the paper. For example, a recent paper by Anandayuvaraj et al. [1] is a good example of making prompts available online. In the paper, the authors analyze software failures reported in news articles and use prompting to automate tasks such as filtering relevant articles, merging reports, and extracting detailed failure information. Their online appendix contains all the prompts used in the study, providing valuable transparency and supporting reproducibility.

When reporting prompts, researchers MUST also reference the model version used (see Section Report Model Version and Configuration), as prompt effectiveness varies across model versions. For example: “This prompt performed differently with GPT-4 (effective) versus Llama 2 (less effective) despite identical parameters.”

Prompt development is often iterative, involving collaboration between human researchers and AI tools. Researchers SHOULD report any instances where LLMs were used to suggest prompt refinements, as well as how those suggestions were incorporated. Furthermore, prompts may need to be revised in response to failure cases where the model produced incorrect or incomplete outputs. Iterative changes based on human feedback and pilot testing results should also be included in the documentation. A prompt changelog can help track and report the evolution of prompts throughout a research project, including key revisions, reasons for changes, and versioning (e.g., v1.0: initial prompt; v1.2: added output formatting; v2.0: incorporated examples of ideal responses).
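
Such a changelog can be kept as a simple version-keyed structure under version control and published with the replication package; the entries below are illustrative only.

```
# Illustrative prompt changelog; one entry per prompt revision used in the study.
PROMPT_CHANGELOG = {
    "v1.0": "Initial zero-shot prompt containing only the task description.",
    "v1.2": "Added explicit output formatting requirements after malformed responses.",
    "v2.0": "Incorporated two examples of ideal responses based on pilot testing feedback.",
}
```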

Finally, pilot testing and prompt evaluation are vital for ensuring that prompts yield reliable results. If such testing was conducted, researchers SHOULD summarize key insights, including how different prompt variations affected output quality and which criteria were used to finalize the prompt design.

Example(s)

A debugging study may use a prompt structured like this:

You are a coding assistant. Below is a Python script that fails with an error. Analyze the code and suggest a fix.
Code:
```
def divide(a, b):
    return a / b

print(divide(10, 0))
```
Error message:
ZeroDivisionError: division by zero

The study should document that the code was enclosed in triple backticks, specify whether additional context (e.g., stack traces or annotations) was included, and explain how variations of the prompt were tested.

A good example of comprehensive prompt reporting is provided by Liang et al. [2]. The authors make the exact prompts available in their online appendix on Figshare, including details such as code blocks being enclosed in triple backticks. While this level of detail would not fit within the paper itself, the paper thoroughly explains the rationale behind the prompt design and data output format. It also includes one overview figure and two concrete examples, ensuring transparency and reproducibility while keeping the main text concise.

Advantages

Providing detailed documentation of prompts enhances reproducibility and comparability. It allows other researchers to replicate the study under similar conditions, refine prompts based on documented improvements, and evaluate how different types of content (e.g., source code vs. execution traces) influence LLM behavior. This transparency also enables a better understanding of how formatting, prompt length, and structure impact results across various studies.

Challenges

One challenge is the complexity of prompts that combine multiple components, such as code, error messages, and explanatory text. Formatting differences—such as whether markdown or plain text was used—can affect how LLMs interpret inputs. Additionally, prompt length constraints may require careful management, particularly for tasks involving extensive artifacts like large codebases.

For multi-modal studies, handling non-text artifacts such as screenshots introduces additional complexity. Researchers must decide how to represent such inputs, whether by textual descriptions, image encoding, or data references. Lastly, proprietary LLMs (e.g., Copilot) may obscure certain details about internal prompt processing, limiting full transparency.

Privacy and confidentiality concerns can also hinder prompt sharing, especially when sensitive data is involved. In these cases, researchers should provide anonymized examples and summaries wherever possible. For prompts containing sensitive information, researchers MUST: (i) Anonymize personal identifiers. (ii) Replace proprietary code with functionally equivalent examples. (iii) Clearly mark modified sections.

Study Types

Reporting requirements may vary depending on the study type. For tool evaluation studies, researchers MUST explain how prompts were generated and structured within the tool. Controlled experiments MUST provide exact prompts for all conditions, while observational studies SHOULD summarize common prompt patterns and provide representative examples if full prompts cannot be shared.

References

[1] D. Anandayuvaraj, M. Campbell, A. Tewari, and J. C. Davis, “FAIL: Analyzing software failures from the news using LLMs,” in Proceedings of the 39th IEEE/ACM international conference on automated software engineering, 2024, pp. 506–518.

[2] J. T. Liang et al., “Can GPT-4 replicate empirical software engineering research?” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1330–1353, 2024.

Report Interaction Logs

Recommendations

The previous guidelines aim to address the reproducibility of a study by calling for reporting the full LLM configuration (Report Model Version and Configuration) and prompts (Report Prompts and their Development), but even following both might not always be sufficient. LLMs can still behave non-deterministically even if decoding strategies and parameters are fixed, because non-determinism can arise from batching, input preprocessing, and floating point arithmetic on GPUs [1]. Thus, in order to establish a fixed point from which a given study can be reproduced, a study SHOULD report the full interaction logs with an LLM if possible.

Reporting interaction logs is especially important for studies targeting commercial SaaS solutions based on LLMs (e.g., ChatGPT) or novel tools that integrate LLMs via cloud APIs, because readers who want to replicate such a study have even less guarantee of being able to reproduce the state of the LLM-powered system at a later point.

The rationale for this guideline is similar to the rationale for reporting interview transcripts in qualitative research. In both cases, it is important to document the entire interaction between the interviewer and the participant. Just as a human participant might give different answers to the same question asked two months apart, the responses of OpenAI’s ChatGPT can also vary over time. Therefore, keeping a record of the actual conversation is crucial for accuracy and context, and it demonstrates the depth of engagement, increasing transparency.
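
A lightweight way to follow this guideline is to append every prompt–response pair, together with the configuration in effect, to a structured log that is later published as supplementary material. The sketch below writes JSON Lines; the example values are invented, and the response is assumed to have been obtained from whatever model or tool is under study.

```
import json
from datetime import datetime, timezone

LOG_FILE = "interaction_log.jsonl"  # published later, e.g., on Zenodo or Figshare

def log_interaction(prompt: str, response: str, model: str, config: dict) -> None:
    """Append one complete prompt/response exchange to the interaction log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "config": config,
        "prompt": prompt,
        "response": response,
    }
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")

log_interaction(
    prompt="Below is a Python function that is buggy. Analyze the code and suggest a fix. ...",
    response="The function returns the wrong variable; it should return unique_list. ...",
    model="gpt-4-0125-preview",
    config={"temperature": 0.7, "top_p": 0.8, "seed": 23487},
)
```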

Example(s)

To intuitively explain why this can be important, consider a study in which researchers evaluate the correctness of the bug-fixing capabilities of LLM-based tools, and assume that the researchers only provided the prompts without the LLM answers. One of the prompts might include the buggy function below, in which the wrong variable is returned.

Below is a Python function that is buggy. 
Analyze the code and suggest a fix.

def remove_duplicates_buggy(input_list):
    unique_list = []
    for item in input_list:
        if item not in unique_list:
            unique_list.append(item)
    return input_list 

If a paper doesn’t include the output and simply states that the fix was correct, it will lack crucial information needed for reproducibility. Without this information, fellow researchers can’t assess whether the solution was efficient or if it was written in a non-idiomatic way (e.g., the function could have been implemented more elegantly using Python list comprehensions). There could be other potential issues that researchers won’t be able to verify if the detailed response is not provided.

In their paper “Investigating ChatGPT’s Potential to Assist in Requirements Elicitation Processes” [2], Ronanki et al. report the full answers of ChatGPT and upload them to a Zenodo record.

Advantages

The advantage of following this guideline is the transparency and increased reproducibility of the resulting research.

The guideline is straightforward to follow. Obtaining transcripts is simple when the “interviewee” is an LLM, compared to obtaining transcripts from human participants. Even in systems where interactions are voice-based, these interactions are first converted to text using speech-to-text methods, making transcripts easily accessible. Therefore, there is no valid reason for researchers not to report full transcripts.

Another advantage is that, while conversations with human participants often cannot be reported due to confidentiality, LLM conversations can (e.g., as of the beginning of 2025, OpenAI allows sharing of chat transcripts).

Detailed logs enable future replication studies to compare results using the same prompts. This could be valuable for tracking changes in LLM responses over time or across different versions of the model. A body of knowledge would also be collected that would allow researchers to analyze how consistent the LLM’s responses are and identify any variations or improvements in its performance.

Challenges

Not all systems allow the reporting of interaction logs with the same ease. At one end of the spectrum, chatbots can be easily documented because the conversations are typically text-based and can be logged directly. At the other end, auto-complete systems (e.g., GitHub Copilot) make it harder to report full interactions.

While some tools, such as Continue (https://blog.continue.dev/its-time-to-collect-data-on-how-you-build-software/), facilitate logging interactions within the IDE, understanding the value of a Copilot suggestion during a coding session might require recreating the exact state of the codebase at the time the suggestion was made – a challenging context to report. One solution is to use version control to capture the state of the codebase when a recommendation occurred, allowing researchers to track changes and analyze the context behind the suggestion.

Given that chat transcripts are easy to generate, a study might end up with a very large appendix. Consequently, online storage might be needed. Services such as Zenodo, Figshare, or similar long-term storage for research artifacts SHOULD be used in such situations.

Study Types

This guideline SHOULD be followed for all study types.

References

[1] S. Chann, “Non-determinism in GPT-4 is caused by Sparse MoE.” https://152334h.github.io/blog/non-determinism-in-gpt-4/, 2023.

[2] K. Ronanki, C. Berger, and J. Horkoff, “Investigating ChatGPT’s potential to assist in requirements elicitation processes,” in 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), IEEE, 2023, pp. 354–361.

Use Human Validation for LLM Outputs

Recommendations

While LLMs can automate many tasks, it is important to validate their outputs with human judgment. For natural language processing tasks, a large-scale study has shown that LLMs exhibit significant variation in their results, which limits their reliability as a direct substitute for human raters [1]. Human validation helps ensure the accuracy and reliability of the results, as LLMs may sometimes produce incorrect or biased outputs. Especially in studies where LLMs are used to support researchers, human validation is generally recommended to ensure validity [2]. We recommend that researchers plan for human validation from the outset and develop their methodology with this validation in mind. Study reference models for comparing humans with LLMs [3] can help provide a template to ease the design process. In some cases, a hybrid approach between human and machine-generated annotations can improve annotation efficiency. However, researchers should use systematic approaches to decide which human annotations can be replaced with LLM-generated ones and how, such as the methods proposed by Ahmed et al. [4].

When evaluating the capability of LLMs to generate SE-related artifacts, researchers may employ human validation to complement machine-generated measures. For example, proxies for software quality, such as code complexity or the number of code smells, may be complemented by human ratings of maintainability, readability, or understandability. In the case of more abstract variables or psychometric measurements, human validation may be the only way of measuring a specific construct. For example, measuring human factors such as trust, cognitive load, and comprehension levels may inherently require human evaluation.

When conducting empirical measurements, researchers should clearly define the construct that they are measuring and specify the methods used for measurement. Further, they should use established measurement methods and instruments that are empirically validated [5], [6]. Measuring a construct may require aggregating input from multiple subjects. For example, a study may assess inter-rater agreement using measures such as Cohen’s Kappa or Krippendorff’s Alpha before aggregating ratings. In some cases, researchers may also combine multiple measures into single composite measures. As an example, they may evaluate both the time taken and accuracy when completing a task and aggregate them into a composite measure for the participants’ overall performance. In these cases, researchers should clearly describe their method of aggregation and document their reasoning for doing so.

When employing human validation, additional confounding factors should be controlled for, such as the level of expertise or experience with LLM-based applications or their general attitude towards AI-based tools. Researchers should control for these factors through methods such as stratified sampling or by categorizing participants based on experience levels. Where applicable, researchers should conduct a power analysis to estimate the required sample size and ensure sufficient statistical power in their experiment design. When multiple humans are annotating the same artifact, researchers should validate the objectivity of annotations through measures of inter-rater reliability. For instance, “A subset of 20% of the LLM-generated annotations was reviewed and validated by experienced software engineers to ensure accuracy. Using Cohen’s Kappa, an inter-rater reliability of \kappa = 0.90 was reached.”
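
For two raters assigning nominal labels to the same items (e.g., a human validator checking LLM-generated annotations), Cohen’s Kappa can be computed directly from the two label lists. The following plain-Python sketch implements the standard formula; the example labels are invented.

```
from collections import Counter

def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's Kappa for two raters labeling the same items with nominal categories."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Invented example: human validation of ten LLM-generated annotations.
human = ["correct", "correct", "wrong", "correct", "wrong",
         "correct", "correct", "correct", "wrong", "correct"]
llm = ["correct", "correct", "wrong", "wrong", "wrong",
       "correct", "correct", "correct", "wrong", "correct"]
print(f"kappa = {cohens_kappa(human, llm):.2f}")  # 0.78 for this example
```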

Reporting inter-model and human-model agreement in a similar way further supports the interpretation of results, for instance: “Model-model agreement is high, for all criteria, especially for the three large models (GPT-4, Gemini, and Claude). Table I indicates that the mean Krippendorff’s \alpha is 0.68-0.76. Second, we see that human-model and human-human agreements are in similar ranges, 0.24-0.40 and 0.21-0.48 for the first three categories.” [4].

Example(s)

As an example, Khojah et al. [7] augmented the results of their study using human measurement. Specifically, they asked participants to provide ratings regarding their experience, trust, perceived effectiveness and efficiency, and scenarios and lessons learned in their experience with ChatGPT.

Choudhuri et al. [8] evaluated students’ perceptions of their experience with ChatGPT in a controlled experiment. They added this data to extend their results on task performance in a series of software engineering tasks. This way, they were able to employ questionnaires to measure cognitive load, document any perceived faults in the system, and inquire about the participants’ intention to continue using the tool.

Xue et al. [9] conducted a controlled experiment in which they evaluated the impact of ChatGPT on the performance and perceptions of students in an introductory programming course. They employed multiple measures to judge the impact of the LLM from the perspective of humans. In their study, they recorded the students’ screens, evaluated the answers they provided in tasks, and distributed a post-study survey to get direct opinions from the students.

Hymel et al. [10] evaluated the capability of ChatGPT-4.0 to generate requirements documents. Specifically, they evaluated two requirements documents based on the same business use case, one document generated with the LLM and one document created by a human expert. The documents were then reviewed by experts and judged in terms of alignment with the original business use case, requirements quality and whether they believed it was created by a human or an LLM. Finally, they analyzed the influence of the participants’ familiarity with AI tools on the study results.

Advantages

Incorporating human judgment in the evaluation process adds a layer of quality control and increases the trustworthiness of the study’s findings, especially when explicitly reporting inter-rater reliability metrics [11].

Incorporating feedback from individuals from the target population strengthens external validity by grounding study findings in real-world usage scenarios and may positively impact the transfer of study results to practice. Researchers may uncover additional opportunities to further improve the LLM or LLM-based tool based on the reported experiences.

Challenges

Measurement through human validation can be challenging. Ensuring that the operationalization of a desired construct and the method of measuring it are appropriate requires a good understanding of the studied concept and construct validity in general, and a systematic design approach for the measurement instruments [12].

Human judgment is often very subjective and may lead to large variability between different subjects due to differences in expertise, interpretation, and biases among evaluators [13]. Controlling for this subjectivity will require additional rigor when analyzing the study results.

Recruiting participants as human validators will always incur additional resources compared to machine-generated measures. Researchers must weigh the cost and time investment incurred by the recruitment process against the potential benefits for the validity of their study results.

Study Types

These guidelines apply to all study types:

When conducting their studies, researchers

MUST

  • Clearly define the construct measured through human validation.

  • Describe how the construct is operationalized in the study, specifying the method of measurement.

  • Employ established and widely accepted measurement methods and instruments.

  • Assess the characteristics of the human validators to control for factors influencing the validation results (e.g., years of experience, familiarity with the task, etc.)

SHOULD

  • Use empirically validated measures.

  • Complement automated or machine-generated measures with human validation where possible.

  • Ensure consistency among human validators by establishing a shared understanding in training sessions or pilot studies and by assessing inter-rater agreement.

MAY

  • Use multiple different measures (e.g., expert ratings, surveys, task performance) for human validation.

References

[1] A. Bavaresco et al., “LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks,” CoRR, vol. abs/2406.18403, 2024, doi: 10.48550/ARXIV.2406.18403.

[2] X. Wang, H. Kim, S. Rahman, K. Mitra, and Z. Miao, “Human-LLM collaborative annotation through effective verification of LLM labels,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024, F. ’Floyd’ Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 303:1–303:21. doi: 10.1145/3613904.3641960.

[3] K. Schneider, F. Fotrousi, and R. Wohlrab, “A reference model for empirically comparing LLMs with humans,” in Proceedings of the 47th international conference on software engineering: Software engineering in society (ICSE-SEIS2025), IEEE, 2025.

[4] T. Ahmed, P. T. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” CoRR, vol. abs/2408.05534, 2024, doi: 10.48550/ARXIV.2408.05534.

[5] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman, “Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance,” Frontiers Comput. Sci., vol. 5, 2023, doi: 10.3389/FCOMP.2023.1096257.

[6] S. A. C. Perrig, N. Scharowski, and F. Brühlmann, “Trust issues with trust scales: Examining the psychometric quality of trust measures in the context of AI,” in Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, CHI EA 2023, Hamburg, Germany, April 23-28, 2023, A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, and A. Peters, Eds., ACM, 2023, pp. 297:1–297:7. doi: 10.1145/3544549.3585808.

[7] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond code generation: An observational study of ChatGPT usage in software engineering practice,” Proc. ACM Softw. Eng., vol. 1, no. FSE, pp. 1819–1840, 2024, doi: 10.1145/3660788.

[8] R. Choudhuri, D. Liu, I. Steinmacher, M. A. Gerosa, and A. Sarma, “How far are we? The triumphs and trials of generative AI in learning software engineering,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024, ACM, 2024, pp. 184:1–184:13. doi: 10.1145/3597503.3639201.

[9] Y. Xue, H. Chen, G. R. Bai, R. Tairas, and Y. Huang, “Does ChatGPT help with introductory programming? An experiment of students using ChatGPT in CS1,” in Proceedings of the 46th International Conference on Software Engineering: Software Engineering Education and Training, SEET@ICSE 2024, Lisbon, Portugal, April 14-20, 2024, ACM, 2024, pp. 331–341. doi: 10.1145/3639474.3640076.

[10] C. Hymel and H. Johnson, “Analysis of LLMs vs human experts in requirements engineering.” 2025. Available: https://arxiv.org/abs/2501.19297

[11] Q. Khraisha, S. Put, J. Kappenberg, A. Warraitch, and K. Hadfield, “Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages,” Research Synthesis Methods, vol. 15, no. 4, pp. 616–626, 2024, doi: 10.1002/jrsm.1715.

[12] D. I. K. Sjøberg and G. R. Bergersen, “Construct validity in software engineering,” IEEE Trans. Software Eng., vol. 49, no. 3, pp. 1374–1396, 2023, doi: 10.1109/TSE.2022.3176725.

[13] N. McDonald, S. Schoenebeck, and A. Forte, “Reliability and inter-rater reliability in qualitative research: Norms and guidelines for CSCW and HCI practice,” Proc. ACM Hum. Comput. Interact., vol. 3, no. CSCW, pp. 72:1–72:23, 2019, doi: 10.1145/3359174.

Use an Open LLM as a Baseline

Recommendations

To ensure that empirical studies using Large Language Models (LLMs) in software engineering are reproducible and comparable, we recommend incorporating an open LLM as a baseline for analysis. This applies both to studies that use LLMs to explore a phenomenon and to studies that evaluate LLMs on specific software engineering tasks. Sometimes, including an open LLM baseline might be impossible. In such cases, researchers should at least evaluate an open LLM on an early version of their approach or tool. Open LLMs are available for inspection and download on platforms such as Hugging Face. Depending on the model size and the available computing power, open LLMs can be hosted on a local computer or server using LLM management systems such as Ollama or LM Studio. Alternatively, such models can be run on cloud-based services such as Together AI. We recommend providing a replication package with clear, step-by-step instructions on how to reproduce the reported results using the open LLM. This makes the research more reliable, since it allows others to confirm the findings. For example, researchers could report: “We compared our results to Meta’s Code Llama, which is available on Hugging Face,” and include a link to the replication package.
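
To illustrate how low the technical barrier can be, the sketch below sends a single prompt to a locally hosted open model through Ollama's HTTP API. It assumes an Ollama server running on the default port with the model already pulled; endpoint and option names follow Ollama's documented REST API but should be checked against the version actually used.

```
import json
import urllib.request

def query_ollama(prompt: str, model: str = "llama3.3:70b-instruct-q8_0") -> str:
    """Send one prompt to a locally running Ollama server and return the response text."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0, "seed": 23487},  # report these values as well
    }
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

print(query_ollama("Suggest a unit test for a function that parses ISO 8601 dates."))
```

A script like this, shipped in the replication package, can double as the step-by-step instructions recommended above.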

The term open, when applied to an LLM, can have various meanings. Widder et al. [1] discuss three types of openness (transparency, reusability, and extensibility) and what openness in AI can and cannot provide. Moreover, the Open Source Initiative (OSI) [2] provides a definition of open-source AI that serves as a useful framework for evaluating the openness of AI models. In simple terms, according to OSI, open-source AI means that one has access to everything needed to use, understand, modify, share, retrain, and recreate the AI. Thus, researchers must be clear about what “open” means in their context.

Finally, we recommend reporting and analyzing inter-model agreement metrics. These metrics quantify the consistency between the evaluated model’s outputs and the baseline, and thus support the identification of potential biases or areas of disagreement. Moreover, we recommend reporting model confidence scores to analyze model uncertainty. The analysis of inter-model agreement and model confidence provides valuable insights into the reliability and robustness of LLM performance, allowing a deeper understanding of their capabilities and limitations.

Example(s)

  • Benchmarking a Proprietary LLM: Researchers want to know how good their own LLM is at writing code. They might compare it against an open LLM such as StarCoderBase.

  • Evaluating an LLM-Powered Tool: A team developing an AI-driven code review tool might want to assess the quality of suggestions generated by both a proprietary LLM and an open alternative. Human evaluators could then independently rate the relevance and correctness of the suggestions, providing an objective measure of the tool’s effectiveness.

  • Ensuring Reproducibility with a Replication Package: A study on bug localization that uses a closed-source LLM could support reproducibility by including a replication package. This package might contain a script that automatically reruns the same experiments using an open-source LLM—such as Llama 3—and generates a comparative report.

Advantages

  • Improved Reproducibility: Researchers can independently replicate experiments.

  • More Objective Comparisons: Using a standardized baseline allows for more unbiased evaluations.

  • Greater Transparency: Open models enable the analysis of how data is processed, which supports researchers in identifying potential biases and limitations.

  • Long-Term Accessibility: Unlike proprietary models, which may become unavailable, open LLMs remain available for future studies.

  • Lower Costs: Open-source models usually have fewer licensing restrictions, which makes them more accessible to researchers with limited funding.

Challenges

  • Performance Differences: Open models may not always match the latest proprietary LLMs in accuracy or efficiency, making it harder to demonstrate improvements.

  • Computational Demands: Running large open models requires hardware resources, including high-performance GPUs and significant memory.

  • Defining “Openness”: The term open is evolving—many so-called open models provide access to weights but do not disclose training data or methodologies. We are aware that the definition of an “open” model is actively being discussed, and many open models are essentially only “open weight” [3]. We consider the Open Source AI Definition proposed by the Open Source Initiative (OSI) [2] to be a first step towards defining true open-source models.

  • Implementation Complexity: Unlike cloud-based APIs from proprietary providers, setting up and fine-tuning open models can be technically demanding, due to possibly limited documentation.

Study Types

  • Tool Evaluation: An open LLM baseline MUST be included if technically feasible. If integration is too complex, researchers SHOULD at least report initial benchmarking results using open models.

  • Benchmarking Studies and Controlled Experiments: An open LLM MUST be one of the models evaluated.

  • Observational Studies: If using an open LLM is impossible, researchers SHOULD acknowledge its absence and discuss potential impacts on their findings.

  • Qualitative Studies: If the LLM is used for exploratory data analysis or to compare alternative interpretations of results, then an LLM baseline MAY be reported.

References

[1] D. G. Widder, M. Whittaker, and S. M. West, “Why ‘open’ AI systems are actually closed, and why this matters,” Nature, vol. 635, no. 8040, pp. 827–833, 2024.

[2] Open Source Initiative (OSI), “Open Source AI Definition 1.0.” https://opensource.org/ai/open-source-ai-definition.

[3] E. Gibney, “Not all ‘open source’ AI models are actually open,” Nature News, 2024, doi: 10.1038/d41586-024-02012-5.

Report Suitable Baselines, Benchmarks, and Metrics

Recommendations

Empirical software engineering research has shown the importance of validating the tools that support software engineering [1], [2]. Thus, it is of pivotal importance to empirically assess the effectiveness of LLMs or LLM-based tools. Benchmarks are model- and tool-independent standardized tests used to assess the performance of LLMs on specific tasks such as code summarization or code generation. A benchmark consists of multiple standardized test cases, each with at least a task and an expected result. Metrics are used to quantify the performance on the benchmark tasks, enabling comparison. Since LLMs require substantial hardware resources, baselines serve as a comparison to assess their performance against traditional algorithms with lower computational costs. When selecting benchmarks, it is important to understand both the contained tasks and the expected results, because these determine what the benchmark assesses. We recommend that researchers briefly summarize the selected benchmark and why it is suitable for their study. They should report why the given tasks and corresponding expected benchmark results reflect the problem the researchers want to solve. In addition, reporting the total number of unique benchmark test cases and illustrating a single test case allows other researchers to assess what the model is tested on. If multiple benchmarks exist for the same task, the goal should be to compare performance across benchmarks. We recommend using the most specific benchmarks given the context.

Furthermore, the representativeness of a benchmark is important to report. For example, many benchmarks focus heavily on Python, and often on isolated functions. This assesses a very specific part of software development, which is certainly not representative of the full breadth of software engineering.

The use of LLMs might not always be justifiable if traditional approaches achieve similar performance. For many tasks for which LLMs are being evaluated, traditional non-LLM-based approaches exist (e.g., for program repair) that can serve as a baseline. Even if LLM-based tools perform better, the question is whether the resources consumed justify the potentially marginal improvements. We recommend that researchers always check whether such traditional baselines exist and, if they do, compare them with the LLM or LLM-based tool using suitable metrics. Such comparisons between traditional and LLM-based approaches, or between different LLM-based tools, should be based on a benchmark that is suitable for the given context. In general, we recommend using established metrics whenever possible (see the summary below), as this enables secondary research. We further recommend that researchers carefully argue why the selected metrics are suitable for the given task or study. If an LLM-based tool that is supposed to support humans is evaluated, a relevant metric might be the acceptance rate, that is, the ratio of accepted artifacts (e.g., test cases, code snippets) to all artifacts that were generated and presented to the user. Another way of evaluating LLM-based tools is calculating inter-model agreement (see also Section Use an Open LLM as a Baseline). This also allows researchers to assess how dependent a tool’s performance is on specific models.

LLM-based generation is non-deterministic by design. This non-determinism requires the repetition of experiments to statistically assess the performance of a model or tool, for example using the arithmetic mean, confidence intervals, and standard deviations.

Example(s)

Two benchmarks used for code generation are HumanEval (GitHub) [3] and MBPP (GitHub) [4]. Both benchmarks consist of code snippets written in Python sourced from publicly available repositories. Each snippet consists of four parts: a prompt based on a function definition and a corresponding description of what the function should accomplish, a canonical solution, an entry point for execution, and tests. The input to the LLM is the entire prompt. The output of the LLM is evaluated either against the canonical solution using metrics or against a test suite. Other benchmarks for code generation include ClassEval (GitHub) [5], LiveCodeBench (GitHub) [6], and SWE-bench (GitHub) [7]. An example of a code translation benchmark is TransCoder [8] (GitHub).

According to [9], the main problem types addressed with LLMs are classification, recommendation, and generation problems. Each of these problem types requires a different set of metrics. The authors provide a comprehensive overview of benchmarks categorized by software engineering tasks. Common metrics for assessing generation tasks are BLEU, pass@k, Accuracy/Accuracy@k, and Exact Match [9]. The most common metric for recommendation tasks is Mean Reciprocal Rank [9]. For classification tasks, classical machine learning metrics such as Precision, Recall, F1-score, and Accuracy are often reported [9].

We now briefly discuss two common metrics used for generation tasks. BLEU-N [3] is a similarity score based on n-gram precision between two strings, ranging from 0 to 1. Values close to 0 indicate dissimilar content, while values closer to 1 indicate similar content; for code generation, a value closer to 1 suggests that the model is more capable of generating the expected output. BLEU-N has multiple variations. CodeBLEU [10] and CrystalBLEU [11] are the most notable variations tailored to code, introducing additional heuristics such as AST matching. As mentioned above, researchers should motivate why they chose a certain metric or variant thereof for their particular study.
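
To illustrate, the following minimal sketch computes a plain sentence-level BLEU score on tokenized code using NLTK; it deliberately does not include the code-specific components of CodeBLEU or CrystalBLEU, which require their respective reference implementations:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Token-level BLEU between a generated snippet and the canonical solution.
# This is plain BLEU-N (default weights use up to 4-grams); CodeBLEU and
# CrystalBLEU add code-specific components such as AST matching.
reference = "def add ( a , b ) : return a + b".split()
candidate = "def add ( x , y ) : return x + y".split()

# Smoothing avoids a zero score when higher-order n-grams do not overlap.
smoothing = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoothing)
print(f"BLEU = {score:.3f}")
```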

The metric pass@k reports the likelihood of a model correctly completing a code snippet at least once within k tries. To the best of our knowledge, the basic concept of pass@k was first used in [12] for evaluating code synthesis under the name success rate at B, where B denotes the budget of trials. The term pass@k was later popularized by [13] as a metric for code generation correctness. The exact definition of correctness varies depending on the task. For code generation, correctness is often defined based on test cases: a passing test suite then means that the solution is correct. The resulting pass rate ranges from 0 to 1. A pass rate of 0 indicates that the model was not able to generate a single correct solution within k tries; a pass rate of 1 indicates that the model generated at least one correct solution within k tries. The metric is defined as:

\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}

where n is the total number of generated samples per prompt, c is the number of correct samples among the n samples, and k is the number of tries considered.

Choosing an appropriate value for k depends on the downstream task of the model and how end-users interact with the model. A high pass rate for pass@1 is highly desirable in tasks where the system only presents one solution or if a single solution requires high computational effort. For example, code completion depends on a single prediction since the end user typically sees only a single suggestion. Pass rates for higher k values (e.g., 2, 5, 10) indicate whether the model can solve the given task within multiple attempts. For downstream tasks that permit multiple solutions or user interaction, strong performance at k > 1 can be justified. For instance, a user selecting the correct test case from multiple suggestions allows for some model errors.
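
The following minimal sketch implements the unbiased pass@k estimator given above in a numerically stable form (following the formulation popularized in [13]); n, c, and k are used as defined above:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single task:
    n generated samples, of which c are correct."""
    if n - c < k:
        return 1.0  # every k-subset of the samples contains a correct one
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per task, 3 of which pass the tests
print(pass_at_k(n=20, c=3, k=1))   # 0.15
print(pass_at_k(n=20, c=3, k=10))  # ~0.89, since more tries are allowed
```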

Common examples of papers using pass@k are papers introducing new models for code generation, such as [14], [15], [16], [17]. TODO: Find SE papers that use metrics other than pass@k and papers that do not introduce a new model.

Advantages

Challenges

A general challenge with benchmarks for LLMs is that the most prominent ones, such as HumanEval and MBPP, use Python, introducing a bias towards this specific programming language and its idiosyncrasies. Since model performance is measured against these benchmarks, researchers often optimize for them. As a result, performance may degrade if programming languages other than Python are used.

Many closed-source models, such as those released by OpenAI, achieve exceptional performance on certain tasks but lack transparency and reproducibility [5], [18], [19]. Benchmark leaderboards, particularly for code generation, are led by closed-source models [5], [19]. While researchers should compare performance against these models, they must consider that providers might discontinue them or apply undisclosed pre- or post-processing beyond the researcher’s control (see also Section Use an Open LLM as a Baseline).

Challenges with individual metrics include, for example, that BLEU-N is a syntactic metric and hence does not measure semantic or structural correctness. Thus, a high BLEU-N score does not directly indicate that the generated code is executable. While alternatives exist, they often come with their own limitations. For instance, Exact Match is a strict measure that does not account for code that is functionally equivalent but syntactically different. Execution-based metrics (e.g., pass@k) directly evaluate correctness by running test cases, but they require a setup with an execution environment. When researchers observe unexpected values for certain metrics, the specific results should be investigated in more detail to uncover potential problems. Such problems can, for example, be related to formatting, since code formatting highly influences metrics such as BLEU-N or Exact Match.

Another challenge to consider is that metrics usually capture one specific aspect of a task or solution. For instance, metrics such as pass@k do not reflect qualitative aspects of code such as maintainability, cognitive load, or readability. These aspects are critical for the downstream task and influence the overall usability. Moreover, benchmarks are isolated test sets and may not fully represent real-world applications. For example, benchmarks such as HumanEval synthesize code based on written specifications. However, such explicit descriptions are rare in real-world applications. Thus, evaluating the model performance with benchmarks might not reflect real-world tasks and end-user usability.

Finally, benchmark data contamination [20] continues to be a major challenge as well. In many cases, the training dataset of an LLM is not released together with the model, and the benchmark itself could be part of that training dataset. Such benchmark contamination may lead to the model remembering the actual solution from the training data rather than solving the new task based on the seen data, which results in artificially high performance on the benchmark. On unseen scenarios, however, the model might perform much worse.

Study Types

TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [21]).

This guideline MUST be followed for all study types that evaluate the performance of plain LLMs on a given task.

For example, if you use an LLM as an annotator and the research goal is to assess which model annotates best based on a test dataset, you MUST report an appropriate metric that reflects the nature of your task and you SHOULD disclose the dataset used for evaluation. Annotation tasks can vary significantly: Are multiple labels allowed for the same sequence? Are the available labels predefined, or should the LLM generate a set of labels independently? Due to this task dependence, you SHOULD justify your choice of metric, explaining what aspects of the task it captures and what its limitations are.
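
As an illustration of such an annotation setup, the following minimal sketch compares hypothetical LLM-produced labels against human gold labels for a single-label task with a predefined label set, using scikit-learn to compute per-class precision, recall, F1-score, accuracy, and Cohen's kappa:

```python
from sklearn.metrics import classification_report, cohen_kappa_score

# Hypothetical gold labels from human annotators and labels produced by an LLM
# for a single-label annotation task with a predefined label set.
gold = ["bug", "feature", "bug", "question", "feature", "bug"]
llm = ["bug", "feature", "bug", "feature", "feature", "question"]

# Per-class precision, recall, F1-score, plus overall accuracy
print(classification_report(gold, llm, zero_division=0))

# Chance-corrected agreement between the LLM and the human annotators
print("Cohen's kappa:", cohen_kappa_score(gold, llm))
```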

If you are conducting a well-established task, such as code generation, you SHOULD report standard metrics like pass@k and compare to other models.

TODO: Describe more study types. Copy common outline for this section.

References

[1] D. E. Perry, A. A. Porter, and L. G. Votta, “Empirical studies of software engineering: A roadmap,” in 22nd international conference on on software engineering, future of software engineering track, ICSE 2000, limerick ireland, june 4-11, 2000, A. Finkelstein, Ed., ACM, 2000, pp. 345–355. doi: 10.1145/336512.336586.

[2] W. Hasselbring, “Benchmarking as empirical standard in software engineering research,” in EASE 2021: Evaluation and assessment in software engineering, trondheim, norway, june 21-24, 2021, R. Chitchyan, J. Li, B. Weber, and T. Yue, Eds., ACM, 2021, pp. 365–372. doi: 10.1145/3463274.3463361.

[3] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: A method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the association for computational linguistics, july 6-12, 2002, philadelphia, PA, USA, ACL, 2002, pp. 311–318. doi: 10.3115/1073083.1073135.

[4] J. Austin et al., “Program synthesis with large language models,” CoRR, vol. abs/2108.07732, 2021, Available: https://arxiv.org/abs/2108.07732

[5] X. Du et al., “ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation,” CoRR, vol. abs/2308.01861, 2023, doi: 10.48550/ARXIV.2308.01861.

[6] N. Jain et al., “LiveCodeBench: Holistic and contamination free evaluation of large language models for code,” CoRR, vol. abs/2403.07974, 2024, doi: 10.48550/ARXIV.2403.07974.

[7] C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world github issues?” in The twelfth international conference on learning representations, ICLR 2024, vienna, austria, may 7-11, 2024, OpenReview.net, 2024. Available: https://openreview.net/forum?id=VTF8yNQM66

[8] M.-A. Lachaux, B. Rozière, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” CoRR, vol. abs/2006.03511, 2020, Available: https://arxiv.org/abs/2006.03511

[9] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.

[10] S. Ren et al., “CodeBLEU: A method for automatic evaluation of code synthesis,” CoRR, vol. abs/2009.10297, 2020, Available: https://arxiv.org/abs/2009.10297

[11] A. Eghbali and M. Pradel, “CrystalBLEU: Precisely and efficiently measuring the similarity of code,” in 37th IEEE/ACM international conference on automated software engineering, ASE 2022, rochester, MI, USA, october 10-14, 2022, ACM, 2022, pp. 28:1–28:12. doi: 10.1145/3551349.3556903.

[12] S. Kulal et al., “SPoC: Search-based pseudocode to code,” CoRR, vol. abs/1906.04908, 2019, Available: http://arxiv.org/abs/1906.04908

[13] M. Chen et al., “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021, Available: https://arxiv.org/abs/2107.03374

[14] B. Rozière et al., “Code llama: Open foundation models for code,” CoRR, vol. abs/2308.12950, 2023, doi: 10.48550/ARXIV.2308.12950.

[15] D. Guo et al., “DeepSeek-coder: When the large language model meets programming - the rise of code intelligence,” CoRR, vol. abs/2401.14196, 2024, doi: 10.48550/ARXIV.2401.14196.

[16] B. Hui et al., “Qwen2.5-coder technical report,” CoRR, vol. abs/2409.12186, 2024, doi: 10.48550/ARXIV.2409.12186.

[17] R. Li et al., “StarCoder: May the source be with you!” CoRR, vol. abs/2305.06161, 2023, doi: 10.48550/ARXIV.2305.06161.

[18] J. Li et al., “EvoCodeBench: An evolving code generation benchmark with domain-specific evaluations,” in Advances in neural information processing systems 38: Annual conference on neural information processing systems 2024, NeurIPS 2024, vancouver, BC, canada, december 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, Eds., 2024. Available: http://papers.nips.cc/paper_files/paper/2024/hash/6a059625a6027aca18302803743abaa2-Abstract-Datasets_and_Benchmarks_Track.html

[19] T. Y. Zhuo et al., “BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions,” CoRR, vol. abs/2406.15877, 2024, doi: 10.48550/ARXIV.2406.15877.

[20] C. Xu, S. Guan, D. Greene, and M. T. Kechadi, “Benchmark data contamination of large language models: A survey,” CoRR, vol. abs/2406.04244, 2024, doi: 10.48550/ARXIV.2406.04244.

[21] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.

Report Limitations and Mitigations

Recommendations

When using large language models (LLMs) in empirical studies in software engineering, researchers face unique challenges and potential limitations that can influence the validity, reliability, and reproducibility [1] of their findings, including:

Reproducibility: A cornerstone of open science is the ability to reproduce research results. Even though the inherent non-deterministic nature of LLMs is a strength in many use cases, its impact on reproducibility is a challenge. To enable reproducibility, researchers

  • MUST provide a replication package for their study and

  • SHOULD perform multiple evaluation iterations of their experiments (see Report Suitable Baselines, Benchmarks, and Metrics) to account for non-deterministic outputs of LLMs or

  • MAY reduce output variability by setting the temperature to 0 and reporting a seed value and system fingerprint, if the research goal and the LLM provider allow for it (see the sketch after this list).
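
The following minimal sketch illustrates this recommendation, assuming the OpenAI Python client and a hypothetical prompt; parameter support (e.g., seed, system fingerprint) differs between providers and models, and even with these settings outputs are not guaranteed to be fully deterministic:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# temperature=0 and a fixed seed reduce (but do not eliminate) output variability;
# the returned system fingerprint should be reported alongside the results,
# because backend changes can still alter outputs.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; use the model under study
    messages=[{"role": "user", "content": "Summarize the following bug report: ..."}],
    temperature=0,
    seed=42,
)

print(response.choices[0].message.content)
print("system_fingerprint:", response.system_fingerprint)  # may be None for some models
```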

Besides non-determinism, the behavior of an LLM depends on many external factors, such as the model version, API updates, or prompt variations. To ensure reproducibility, researchers SHOULD additionally follow the guidelines Report Model Version and Configuration and Report Prompts and their Development.

Generalization: Even though the topic of generalizability is not new, it has gained new relevance with the increasing interest in LLMs. In LLM studies, generalizability boils down to two main concerns: First, are the results specific to the LLM used, or can they also be achieved with other LLMs?

  • If generalizability to other LLMs is not in the scope of the research, this MUST be clearly explained or

  • if generalizability is in scope, researchers MUST compare their results, or subsets thereof (if a full comparison is not possible, e.g., due to computational cost), with other LLMs that are similar (e.g., in size or general performance) to assess the generalizability of their findings.

Second, will these results still be valid in the future? Multiple studies ([2], [3]) found that the performance of proprietary LLMs (such as GPT) within the same version (e.g., GPT-4) decreased over time. Reporting the model version and configuration is not sufficient in such cases. To date, the only way of mitigating this limitation is the use of an open LLM with transparently communicated versioning and archiving (see Use an Open LLM as a Baseline). Hence, researchers

  • SHOULD employ open LLMs to set a reproducible baseline (see Use an Open LLM as a Baseline) and

  • if employing an open LLM is not possible, researchers SHOULD test and report their results over an extended period of time as a proxy for the results’ validity over time.

Data Leakage: Data leakage, contamination, or overfitting occurs when information from outside the training dataset influences the model, leading to overly optimistic performance estimates. With the growing reliance on large datasets, the risk of inter-dataset duplication increases (e.g., [4], [5]). In the context of LLMs for software engineering, this can, for example, manifest as samples from the pre-training dataset appearing in the fine-tuning or evaluation dataset, potentially compromising the validity of evaluation results [6]. For example, ChatGPT’s “Improve the model for everyone“ setting can result in unintentional data leakage. Hence, to ensure the validity of the evaluation results, researchers

  • SHOULD carefully curate the fine-tuning and evaluation datasets to prevent inter-dataset duplication (a minimal overlap check is sketched after this list) and

  • MUST NOT leak their fine-tuning or evaluation datasets into the improvement process of the LLM.

  • If information about the pre-training dataset of the employed LLM is available, researchers SHOULD assess the inter-dataset duplication and MUST discuss the potential data leakage.

  • If training an LLM from scratch, researchers MAY consider using open datasets (such as Together AI’s RedPajama [7]) that already incorporate deduplication (with the positive side effect of potentially improving performance [8]).
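
The following minimal sketch illustrates a coarse overlap check between a fine-tuning and an evaluation dataset based on normalized exact matches; the datasets shown are hypothetical, and real deduplication pipelines typically also use token-based or near-duplicate detection:

```python
import hashlib

def normalize(code: str) -> str:
    """Coarse normalization: remove leading/trailing and per-line whitespace differences."""
    return "\n".join(line.strip() for line in code.strip().splitlines())

def fingerprint(code: str) -> str:
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()

# Hypothetical datasets of code samples
fine_tuning_set = ["def add(a, b):\n    return a + b", "def sub(a, b):\n    return a - b"]
evaluation_set = ["def add(a, b):\n  return a + b", "def mul(a, b):\n    return a * b"]

train_fingerprints = {fingerprint(sample) for sample in fine_tuning_set}
duplicates = [sample for sample in evaluation_set if fingerprint(sample) in train_fingerprints]

print(f"{len(duplicates)} of {len(evaluation_set)} evaluation samples also occur in the fine-tuning set")
```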

Scalability and Cost: Conducting studies with LLMs is a resource-demanding endeavor. For self-hosted LLMs, the respective hardware needs to be provided; for managed LLMs, the service costs have to be considered. The challenge becomes more pronounced as LLMs grow larger, research architectures get more complex, and experiments become more computationally expensive, e.g., due to multiple repetitions to assess performance in the face of non-determinism (see Report Suitable Baselines, Benchmarks, and Metrics). Consequently, resource-intensive research remains predominantly the domain of well-funded researchers, hindering researchers with limited resources from replicating or extending the study results. Hence, for transparency reasons, researchers

  • SHOULD report the cost associated with executing the study.

  • If the study employed self-hosted LLMs, researchers SHOULD report the hardware used.

  • If the study employed managed LLMs, the service cost SHOULD be reported.

  • To ensure research result validity and replicability, researchers MUST provide the LLM outputs as evidence for validation at different granularities (e.g., outputs of individual LLMs when employing multi-agent systems).

  • Additionally, researchers SHOULD include a subset of the employed validation dataset, selected using an accepted sampling strategy, to allow partial replication of the results.

Misleading Performance Metrics: While metrics such as BLEU or ROUGE are commonly used to evaluate the performance of LLMs, they may not capture other relevant, software engineering-specific aspects such as functional correctness or the runtime performance of automatically generated code [9]. Researchers

  • SHOULD clarify and state all relevant requirements and employ and report metrics that measure how well these requirements are satisfied (e.g., test-case success rate) and

  • SHOULD follow the best practices described in Report Suitable Baselines, Benchmarks, and Metrics.

Ethical Concerns with Sensitive Data: Sensitive data can range from personal to proprietary data, each with its own set of ethical concerns. A major threat when using proprietary LLMs with sensitive data is that the data may be used for model improvements (see “Data Leakage“). Hence, using sensitive data can lead to privacy and IP violations. Another threat is the implicit bias of LLMs, potentially leading to discrimination or unfair treatment of individuals or groups. To mitigate these concerns, researchers:

  • SHOULD minimize the sensitive data used in their studies and

  • MUST follow applicable regulations (e.g., GDPR) and individual processing agreements and

  • SHOULD create a data management plan outlining how the data is handled and protected against leakage and discrimination and

  • SHOULD apply for approval from the ethics committee of their organization (if required).

Performance and Resource Consumption: “The field of AI is currently primarily driven by research that seeks to maximize model accuracy — progress is often used synonymously with improved prediction quality. This endless pursuit of higher accuracy over the decade of AI research has significant implications for computational resource requirements and environmental footprint. To develop AI technologies responsibly, we must achieve competitive model accuracy at a fixed or even reduced computational and environmental cost.” [10]

The performance of an LLM is usually measured in terms of traditional metrics such as accuracy, precision, and recall, or more contemporary metrics such as pass@k or BLEU-N (see Report Suitable Baselines, Benchmarks, and Metrics). However, given how resource-hungry LLMs are, resource consumption has to become a key performance indicator for assessing research progress responsibly. While research has predominantly focused on the energy consumption of the early phases of LLMs (e.g., data center manufacturing, data acquisition, training), inference, i.e., the use of the LLM, often becomes similarly or even more resource-intensive ([11], [10], [12], [13], [14]). Hence, researchers:

  • SHOULD aim for lower resource consumption on the model side. This can be achieved by selecting smaller (e.g., GPT 4o mini instead of GPT 4o) or newer models as a base model for the study or by employing techniques such as model pruning, quantization, knowledge distillation, etc. [14].

  • SHOULD reduce resource consumption when using LLMs, e.g., by restricting the number of queries, input tokens, or output tokens [14], by using different prompt engineering techniques (e.g., on average, zero-shot prompts seem to emit less CO2 than chain-of-thought prompts), or by carefully sampling smaller datasets for fine-tuning and evaluation instead of using large datasets in their entirety (a measurement sketch follows after this list).
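
One way to quantify the resource consumption of a study is to track energy use and estimated emissions while the experiments run. The following minimal sketch assumes the third-party codecarbon package and a placeholder experiment function; other measurement tools and hardware-level meters are equally valid options:

```python
from codecarbon import EmissionsTracker  # third-party package; one of several measurement options

def run_all_experiments() -> None:
    # Placeholder for the actual study code (e.g., LLM inference loops).
    for _ in range(1_000_000):
        pass

tracker = EmissionsTracker(project_name="llm-study")
tracker.start()
run_all_experiments()
emissions_kg = tracker.stop()  # estimated kg CO2-equivalent for the tracked block

# By default, results are also persisted to a local CSV file for later reporting.
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```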

To report the environmental impact of a study, researchers

Example(s)

Reproducibility: An example highlighting the need for caution regarding replicability is the study by Staudinger et al. [15], who attempted to replicate the results of a previous LLM study that did not provide a replication package. They were not able to reproduce the exact results, even though they observed trends similar to the original study, and consider their results not reliable enough for a systematic review.

Generalization: To analyze whether the results of proprietary LLMs transfer to open LLMs, Staudinger et al. [15] benchmarked previous results obtained with GPT-3.5 and GPT-4 against Mistral and Zephyr. They found that the employed open-source models could not deliver the same performance as the proprietary models, i.e., the effect was restricted to certain proprietary models. Individual studies have already started to highlight the uncertainty about the future generalizability of their results. In [16], Jesse et al. acknowledge that LLMs evolve over time and that this evolution might impact the study’s results.

Data Leakage: Since much research in software engineering revolves around code, inter-dataset code duplication has been extensively researched and addressed over the years to curate deduplicated datasets (e.g., by Lopes et al. in 2017 [4], Allamanis in 2019 [5], Karmakar et al. in 2023 [17], or López et al. in 2025 [6]). The issue of inter-dataset duplication has also attracted interest in other disciplines with growing demands for data mining. For example, in biology, Lakiotaki et al. [18] acknowledge and address the overlap between multiple common disease datasets. In the domain of code generation, Coignion et al. [19] evaluated the performance of LLMs in producing LeetCode solutions. To mitigate the issue of inter-dataset duplication, they only used LeetCode problems published after 2023-01-01, reducing the likelihood that LLMs had seen those problems before. Further, they discuss the performance differences of LLMs on different datasets in light of potential inter-dataset duplication. In [20], Zhou et al. performed an empirical evaluation of data leakage in 83 software engineering benchmarks. While most benchmarks suffer from minimal leakage, a few suffered from leakage of up to 100%, and they found a high impact of data leakage on the performance evaluation. A starting point for studies that aim to assess and mitigate inter-dataset duplication are the Falcon LLMs. The Technology Innovation Institute publicly provides access to parts of the training data of the Falcon LLMs [21] via Hugging Face. Through this dataset, it is possible to reduce the overlap between pre-training and evaluation data, improving the validity of the evaluation results. A starting point to prevent actively leaking data into an LLM improvement process is to ensure that the data is not used to train the model (e.g., via OpenAI’s data control functionality, or by using the OpenAI API instead of the web interface) [22].

Scalability and Cost: An example of stating the cost of a study can be found in a study by Staudinger et al. [15] who specified the costs for the study as “120 USD in API calls for GPT 3.5 and GPT 4, and 30 USD in API calls for Mistral AI. Thus, the total LLM cost of our reproducibility study was 150 USD”.

Ethical Concerns with Sensitive Data: Bias can occur in datasets as well as in LLMs that have been trained on them and may result in various types of discrimination. Gallegos et al. propose metrics to quantify biases in various tasks (e.g., text generation, classification, question answering) [23].

Performance and Resource Consumption: In their study, Tinnes et al. [24] balanced the dataset size between the need for manual semantic analysis and computational resource consumption.

Advantages

Reproducibility: Ensuring reproducibility allows for the independent replication and verification of study results. Reproducing study results under similar conditions by different parties greatly increases their validity and promotes transparency in research. Independent verification is of particular importance in studies involving LLMs, due to the randomness of their outputs and the potential for biases in their predictions and in training, fine-tuning, and evaluation datasets.

Generalization: Mitigating threats to generalizability through the integration of open LLMs as a baseline or the reporting of results over an extended period of time can increase the validity, reliability, and replicability of a study’s results.

Data Leakage: Assessing and mitigating the effects of inter-dataset duplication strengthens a study’s validity and reliability, as it prevents overly optimistic performance estimates that do not apply to previously unknown samples.

Scalability and Cost: Reporting the cost associated with executing a study not only increases transparency but also supports secondary literature in setting primary research into perspective. Providing replication packages entailing direct LLM output evidence as well as samples for partial replicability are paramount steps towards open and inclusive research in the light of resource inequality among researchers.

Performance and Resource Consumption: Mindfully deciding and justifying the usage of LLMs over other approaches can lead to more efficient and sustainable approaches. Reporting the environmental impact of the usage of LLMs also sets the stage for more sustainable research practices in the field of AI.

Challenges

Generalization: With commercial LLMs evolving over time, the generalizability of results to future versions of the model is uncertain. Employing open LLMs as a baseline can mitigate this limitation, but may not always be feasible due to computational cost.

Data Leakage: Most LLM providers do not publicly offer information about the datasets employed for pre-training, impeding the assessment of inter-dataset duplication effects.

Scalability and Cost: Consistently keeping track of and reporting the cost involved in a research endeavor is challenging. Building a coherent replication package that includes LLM outputs and samples for partial replicability requires additional effort and resources.

Misleading Performance Metrics: Defining all requirements beforehand to ensure the usage of suitable metrics can be challenging, especially in exploratory research. In this growing field of research, finding the right metrics to evaluate the performance of LLMs on specific software engineering tasks is challenging. The guideline Report Suitable Baselines, Benchmarks, and Metrics can serve as a starting point.

Ethical Concerns with Sensitive Data: Ensuring compliance across jurisdictions is difficult with different regions having different regulations and requirements (e.g., GDPR and the AI Act in the EU, CCPA in California). Selecting datasets and models with less bias is challenging, as the bias in LLMs is often not transparently reported.

Performance and Resource Consumption: Measuring or estimating the environmental impact of a study is challenging and might not always be feasible. Especially in exploratory research, the impact is hard to estimate beforehand, making it difficult to justify the usage of LLMs over other approaches.

Study Types

The limitations and mitigations SHOULD be followed for all study types in a sensible manner, i.e., depending on their applicability to the individual study.

References

[1] J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: The threats of using llms in software engineering,” in Proceedings of the 2024 ACM/IEEE 44th international conference on software engineering: New ideas and emerging results, 2024, pp. 102–106.

[2] L. Chen, M. Zaharia, and J. Zou, “How is ChatGPT’s behavior changing over time?” CoRR, vol. abs/2307.09009, 2023, doi: 10.48550/ARXIV.2307.09009.

[3] D. Li, K. Gupta, M. Bhaduri, P. Sathiadoss, S. Bhatnagar, and J. Chong, “Comparing GPT-3.5 and GPT-4 accuracy and drift in radiology diagnosis please cases,” Radiology, vol. 310, no. 1, p. e232411, 2024, doi: 10.1148/radiol.232411.

[4] C. V. Lopes et al., “Déjàvu: A map of code duplicates on GitHub,” Proc. ACM Program. Lang., vol. 1, no. OOPSLA, pp. 84:1–84:28, 2017, doi: 10.1145/3133908.

[5] M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in Proceedings of the 2019 ACM SIGPLAN international symposium on new ideas, new paradigms, and reflections on programming and software, onward! 2019, athens, greece, october 23-24, 2019, H. Masuhara and T. Petricek, Eds., ACM, 2019, pp. 143–153. doi: 10.1145/3359591.3359735.

[6] J. A. H. López, B. Chen, M. Saad, T. Sharma, and D. Varró, “On inter-dataset code duplication and data leakage in large language models,” IEEE Trans. Software Eng., vol. 51, no. 1, pp. 192–205, 2025, doi: 10.1109/TSE.2024.3504286.

[7] Together Computer, RedPajama: An open dataset for training large language models. (2023). Available: https://github.com/togethercomputer/RedPajama-Data

[8] K. Lee et al., “Deduplicating training data makes language models better,” in Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), ACL 2022, dublin, ireland, may 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., Association for Computational Linguistics, 2022, pp. 8424–8445. doi: 10.18653/V1/2022.ACL-LONG.577.

[9] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation,” in Advances in neural information processing systems 36: Annual conference on neural information processing systems 2023, NeurIPS 2023, new orleans, LA, USA, december 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., 2023. Available: http://papers.nips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html

[10] C.-J. Wu et al., “Sustainable AI: Environmental implications, challenges and opportunities,” in Proceedings of the fifth conference on machine learning and systems, MLSys 2022, santa clara, CA, USA, august 29 - september 1, 2022, D. Marculescu, Y. Chi, and C.-J. Wu, Eds., mlsys.org, 2022. Available: https://proceedings.mlsys.org/paper_files/paper/2022/hash/462211f67c7d858f663355eff93b745e-Abstract.html

[11] A. de Vries, “The growing energy footprint of artificial intelligence,” Joule, vol. 7, no. 10, pp. 2191–2194, 2023.

[12] Z. Fu, F. Chen, S. Zhou, H. Li, and L. Jiang, “LLMCO2: Advancing accurate carbon footprint prediction for LLM inferences,” CoRR, vol. abs/2410.02950, 2024, doi: 10.48550/ARXIV.2410.02950.

[13] P. Jiang, C. Sonne, W. Li, F. You, and S. You, “Preventing the immense increase in the life-cycle energy and carbon footprints of LLM-powered intelligent chatbots,” Engineering, vol. 40, pp. 202–210, 2024, doi: 10.1016/j.eng.2024.04.002.

[14] N. E. Mitu and G. T. Mitu, “The hidden cost of AI: Carbon footprint and mitigation strategies,” Available at SSRN 5036344, 2024.

[15] M. Staudinger, W. Kusa, F. Piroi, A. Lipani, and A. Hanbury, “A reproducibility and generalizability study of large language models for query generation,” in Proceedings of the 2024 annual international ACM SIGIR conference on research and development in information retrieval in the asia pacific region, SIGIR-AP 2024, tokyo, japan, december 9-12, 2024, T. Sakai, E. Ishita, H. Ohshima, F. Hasibi, J. Mao, and J. M. Jose, Eds., ACM, 2024, pp. 186–196. doi: 10.1145/3673791.3698432.

[16] K. Jesse, T. Ahmed, P. T. Devanbu, and E. Morgan, “Large language models and simple, stupid bugs,” in 20th IEEE/ACM international conference on mining software repositories, MSR 2023, melbourne, australia, may 15-16, 2023, IEEE, 2023, pp. 563–575. doi: 10.1109/MSR59073.2023.00082.

[17] A. Karmakar, M. Allamanis, and R. Robbes, “JEMMA: An extensible java dataset for ML4Code applications,” Empir. Softw. Eng., vol. 28, no. 2, p. 54, 2023, doi: 10.1007/S10664-022-10275-7.

[18] K. Lakiotaki, N. Vorniotakis, M. Tsagris, G. Georgakopoulos, and I. Tsamardinos, “BioDataome: A collection of uniformly preprocessed and automatically annotated datasets for data-driven biology,” Database J. Biol. Databases Curation, vol. 2018, p. bay011, 2018, doi: 10.1093/DATABASE/BAY011.

[19] T. Coignion, C. Quinton, and R. Rouvoy, “A performance study of LLM-generated code on leetcode,” in Proceedings of the 28th international conference on evaluation and assessment in software engineering, EASE 2024, salerno, italy, june 18-21, 2024, ACM, 2024, pp. 79–89. doi: 10.1145/3661167.3661221.

[20] X. Zhou et al., “LessLeak-bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks.” 2025. Available: https://arxiv.org/abs/2502.06215

[21] Technology Innovation Institute, “Falcon-refinedweb (revision 184df75).” Hugging Face, 2023. doi: 10.57967/hf/0737.

[22] S. Balloccu, P. Schmidtová, M. Lango, and O. Dusek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs,” in Proceedings of the 18th conference of the european chapter of the association for computational linguistics, EACL 2024 - volume 1: Long papers, st. Julian’s, malta, march 17-22, 2024, Y. Graham and M. Purver, Eds., Association for Computational Linguistics, 2024, pp. 67–93. Available: https://aclanthology.org/2024.eacl-long.5

[23] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” CoRR, vol. abs/2309.00770, 2023, doi: 10.48550/ARXIV.2309.00770.

[24] C. Tinnes, A. Welter, and S. Apel, “Software model evolution with large language models: Experiments on simulated, public, and industrial datasets.”