Guidelines

This set of guidelines is currently a DRAFT and based on a discussion session with researchers at the 2024 International Software Engineering Research Network (ISERN) meeting and at the 2nd Copenhagen Symposium on Human-Centered Software Engineering AI. This draft is meant as a starting point for further discussions in the community with the aim of developing a common understanding of how we should conduct and report empirical studies involving large language models (LLMs). See also the pages on study types and scope.

The wording of the recommendations follows RFC 2119 and 8174.

Overview

  1. Declare LLM Usage and Role
  2. Report Model Version and Configuration
  3. Report Tool Architecture and Supplemental Data
  4. Report Prompts and their Development
  5. Report Interaction Logs
  6. Use Human Validation for LLM Outputs
  7. Use an Open LLM as a Baseline
  8. Report Suitable Baselines, Benchmarks, and Metrics
  9. Report Limitations and Mitigations

Declare LLM Usage and Role

Recommendations

When conducting any kind of empirical study involving LLMs, researchers MUST clearly declare that an LLM was used in a suitable section of the paper (e.g., in the introduction or research methods section). For authoring scientific articles, such transparency is, for example, required by the ACM Policy on Authorship: “The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work” [1]. Beyond generic authorship declarations and statements that LLMs were used as part of the research process, researchers SHOULD report in the paper the exact purpose of using an LLM in a study, the tasks it was used to automate, and the expected outcomes.

Example(s)

The ACM Policy on Authorship suggests disclosing the usage of Generative AI tools in the acknowledgements section of papers, for example: “ChatGPT was utilized to generate sections of this Work, including text, tables, graphs, code, data, citations” [1]. Similarly, the acknowledgements section could also be used to disclose Generative AI usage for other aspects of the research, if not explicitly described in other parts of the paper. The ACM policy further suggests: “If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work” [1].

An example of an LLM disclosure can be found in a recent paper by Lubos et al. [2], in which they write in the methodology section:

“We conducted an LLM-based evaluation of requirements utilizing the Llama 2 language model with 70 billion parameters, fine-tuned to complete chat responses…”

Advantages

Transparency in the usage of LLMs helps in understanding the context and scope of the study, facilitating better interpretation and comparison of results. Beyond this declaration, we recommend that authors be explicit about the LLM version they used (see Section Report Model Version and Configuration) and the LLM’s exact role (see Section Report Tool Architecture and Supplemental Data).

Challenges

We do not expect any challenges for researchers following this guideline.

Study Types

This guideline MUST be followed for all study types.

References

[1] Association for Computing Machinery, “ACM Policy on Authorship.” https://www.acm.org/publications/policies/new-acm-policy-on-authorship, 2023.

[2] S. Lubos et al., “Leveraging LLMs for the quality assurance of software requirements,” in 32nd IEEE International Requirements Engineering Conference, RE 2024, Reykjavik, Iceland, June 24-28, 2024, G. Liebel, I. Hadar, and P. Spoletini, Eds., IEEE, 2024, pp. 389–397. doi: 10.1109/RE59067.2024.00046.

Report Model Version and Configuration

Recommendations

LLMs or LLM-based tools, especially those offered as-a-service, are frequently updated; different versions may produce varying results for the same input. Moreover, configuration parameters such as the temperature affect content generation. Therefore, researchers MUST document the specific model or tool version used in a study, the date when the experiments were conducted, and the exact configuration used. Since default values might change over time, researchers SHOULD always report all configuration values, even if they used the defaults. Researchers MUST also motivate why certain models, versions, and configurations were selected, for example due to monetary or technical reasons, or because previous work used the same models. This also applies to the model size: smaller models might have been selected due to hardware constraints, which is perfectly fine but needs to be reported. Depending on the specific study context, additional information regarding the architecture of the tool or experiment SHOULD be reported (see Section Report Tool Architecture and Supplemental Data). Our recommendation is to report:

  • Model/tool name.

  • Model/tool version (including a checksum if available).

  • The configured temperature that controls randomness, and all other relevant parameters that affect output generation (e.g., seed values).

  • The context window (number of tokens).

  • Whether historical context was considered when generating responses.

Example(s)

For an OpenAI model, researchers might report that “A gpt-4 model was integrated via the Azure OpenAI Service, and configured with a temperature of 0.7, top_p set to 0.8, and a maximum token length of 1. We used version 0125-Preview, system fingerprint fp_6b68a8204b, seed value 23487, and ran our experiment on 10th January 2025” [1], [2]. Similar statements can be made for self-hosted models, for which supplementary material can report specific instructions for reproducing results. For example, for models provisioned using ollama, one can report the specific tag and checksum of the model being used, e.g., ‘llama3.3, tag 70b-instruct-q8_0, checksum d5b5e1b84868’. Given suitable hardware, running the corresponding model in its default configuration is then as easy as executing ollama run llama3.3:70b-instruct-q8_0 (see Section Use an Open LLM as a Baseline).

Kang et al. provide a similar statement in their paper on exploring LLM-based general bug reproduction [3]:

“We access OpenAI Codex via its closed beta API, using the code-davinci-002 model. For Codex, we set the temperature to 0.7, and the maximum number of tokens to 256.”

Our guidelines additionally suggest reporting a checksum and exact dates, but otherwise this example is close to our recommendations.
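
To make such reporting systematic rather than an afterthought, the relevant metadata can be captured programmatically at the time the experiment is run. The following minimal sketch assumes the v1-style OpenAI Python SDK and a valid API key; the helper name record_run_metadata, the request values, and the output file are illustrative, not prescribed by this guideline.

```
import json
from datetime import datetime, timezone

from openai import OpenAI  # assumes the v1-style OpenAI Python SDK and an API key in the environment

client = OpenAI()

def record_run_metadata(response, request_params, path="run_metadata.json"):
    """Store the configuration details recommended above alongside the run."""
    metadata = {
        "model_requested": request_params["model"],
        "model_reported": response.model,                    # exact model version resolved by the API
        "system_fingerprint": response.system_fingerprint,   # backend configuration indicator
        "temperature": request_params["temperature"],
        "top_p": request_params["top_p"],
        "seed": request_params["seed"],
        "max_tokens": request_params["max_tokens"],
        "run_date": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

request_params = {
    "model": "gpt-4-0125-preview",
    "temperature": 0.7,
    "top_p": 0.8,
    "seed": 23487,
    "max_tokens": 512,
}
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize the following bug report: ..."}],
    **request_params,
)
record_run_metadata(response, request_params)
```

Publishing such a metadata file in the replication package complements the textual description of the configuration in the paper.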

Advantages

The recommended information is a prerequisite for enabling reproducibility of LLM-based studies under the same or similar conditions. Please note that this information alone is generally not sufficient. Therefore, depending on the specific study setup, researchers SHOULD provide additional information about architecture and data (Report Tool Architecture and Supplemental Data), prompts (Report Prompts and their Development), interaction logs (Report Interaction Logs), and specific limitations and mitigations (Report Limitations and Mitigations).

Challenges

Different model providers and modes of operating the models allow for varying degrees of information. For example, OpenAI provides a model version and a system fingerprint describing the backend configuration that can also influence the output. However, the fingerprint is only intended to detect changes to the model or its configuration; as a user, one cannot revert to a certain fingerprint. As a beta feature, OpenAI also lets users set a seed parameter to receive “(mostly) consistent output” [4]. However, the seed value does not allow for full reproducibility, and the fingerprint changes frequently. While, as motivated above, open models significantly simplify re-running experiments, they also come with challenges in terms of reproducibility, as generated outputs can be inconsistent despite setting the temperature to 0 and using a fixed seed value (see the corresponding GitHub issue for Llama3).

Study Types

This guideline MUST be followed for all study types for which the researcher has access to (parts of) the model’s configuration. They MUST always report the configuration that is visible to them, acknowledging the reproducibility challenges of commercial tools and models offered as-a-service. When Studying LLM Usage in Software Engineering, for example the usage of commercial tools such as ChatGPT or GitHub Copilot, researchers MUST be as specific as possible in describing their study setup. The model name and date MUST always be reported. In those cases, reporting other aspects such as prompts (Report Prompts and their Development) and interaction logs (Report Interaction Logs) is essential.

References

[1] OpenAI, “OpenAI API Introduction.” https://platform.openai.com/docs/api-reference/chat/streaming, 2025.

[2] Microsoft, “Azure OpenAI Service models.” https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models, 2025.

[3] S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring LLM-based general bug reproduction,” in 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, IEEE, 2023, pp. 2312–2323. doi: 10.1109/ICSE48619.2023.00194.

[4] OpenAI, “How to make your completions outputs consistent with the new seed parameter.” https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter, 2023.

Report Tool Architecture and Supplemental Data

Recommendations

Oftentimes, there is a layer around the LLM that preprocesses data, prepares prompts or filters user requests. One example is ChatGPT, which, at the time of writing these guidelines, primarily uses the GPT-4o model. GitHub Copilot also relies on the same model, and researchers can build their own tools utilizing GPT-4o directly (e.g., via the OpenAI API). The infrastructure around the bare model can significantly contribute to the performance of a model in a given task. Therefore, it is important that researchers clearly describe the architecture and what the LLM contributes to the tool or method presented in a research paper.

If the LLM is used as a standalone system (e.g., ChatGPT-4o API without additional architecture layers), researchers SHOULD provide a brief explanation of how it was used rather than detailing a full system architecture. However, if the LLM is integrated into a more complex system with preprocessing, retrieval mechanisms, fine-tuning, or autonomous agents, researchers MUST clearly document the tool architecture, including how the LLM interacts with other components such as databases, retrieval mechanisms, external APIs, and reasoning frameworks. A high-level architectural diagram SHOULD be provided in these cases to improve transparency. To enhance clarity, researchers SHOULD explain design decisions, particularly regarding model access (e.g., API-based, fine-tuned, self-hosted) and retrieval mechanisms (e.g., keyword search, semantic similarity matching, rule-based extraction). Researchers MUST NOT omit critical architectural details that could impact reproducibility, such as hidden dependencies or proprietary tools that influence model behavior.

Additionally, when performance or time-sensitive measurements are relevant, researchers SHOULD explicitly describe the hosting environment of the LLM or LLM-based tool, as this can significantly impact results. This description SHOULD specify not only where the model runs (e.g., local infrastructure, cloud-based services, or dedicated hardware) but also relevant details about the environment, such as hardware specifications, resource allocation, and latency considerations.

If the LLM is part of an agent-based system that autonomously plans, reasons, or executes tasks, researchers MUST describe its architecture, including the agent’s role (e.g., planner, executor, coordinator), whether it is a single-agent or multi-agent system, how it interacts with external tools and users, and the reasoning framework used (e.g., chain-of-thought, self-reflection, multi-turn dialogue, tool usage). Researchers MUST NOT present an agent-based system without detailing how it makes decisions and executes tasks.
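
The level of detail expected here is illustrated by the toy, single-agent sketch below: even in a minimal setup, the agent's role, the available tools, the reasoning loop, and the stopping criterion all need to be documented. The call_llm stub and the tool names are hypothetical placeholders, not part of any specific framework.

```
def call_llm(prompt: str) -> str:
    # Stand-in for an actual model call; the model and configuration used here
    # must be reported (see Section Report Model Version and Configuration).
    return "FINISH placeholder answer"

# Tool registry available to the agent (hypothetical examples).
TOOLS = {
    "run_tests": lambda args: f"ran tests on {args}",
    "search_docs": lambda args: f"searched documentation for {args}",
}

def run_agent(task: str, max_steps: int = 5) -> list:
    """Single agent acting as planner and executor; returns the full trace for reporting."""
    trace, observation = [], task
    for step in range(max_steps):
        decision = call_llm(
            f"Task: {task}\nObservation: {observation}\n"
            f"Available tools: {list(TOOLS)}\n"
            "Reply with '<tool> <arguments>' or 'FINISH <answer>'."
        )
        trace.append({"step": step, "decision": decision})
        if decision.startswith("FINISH"):  # stopping criterion to document
            break
        tool_name, _, args = decision.partition(" ")
        observation = TOOLS[tool_name](args)
        trace[-1]["observation"] = observation
    return trace

print(run_agent("Reproduce the reported crash in the parser module"))
```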

If a retrieval or augmentation method is used (e.g., retrieval-augmented generation (RAG), rule-based retrieval, structured query generation, or hybrid approaches), researchers MUST describe how external data is retrieved, stored, and integrated into the LLM’s responses. This includes specifying the type of storage or database used (e.g., vector databases, relational databases, knowledge graphs) and how the retrieved information is selected and used. Stored data used for context augmentation MUST be reported, including details on data preprocessing, versioning, and update frequency. If this data is not confidential, an anonymized snapshot of the data used for context augmentation SHOULD be made available.
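
As an illustration of the retrieval details that need to be reported, the following dependency-free sketch retrieves context snippets by naive keyword overlap and inserts them into the prompt. The document store, scoring function, and prompt template are hypothetical stand-ins for whatever vector database, embedding model, and template a real study would use and would have to document.

```
# Hypothetical snapshot of the context store; a real study would report its
# source, size, preprocessing, version, and update frequency.
DOCUMENT_STORE = {
    "doc-001": "The payment service retries failed transactions three times.",
    "doc-002": "Authentication tokens expire after 30 minutes of inactivity.",
    "doc-003": "The build pipeline runs unit tests before integration tests.",
}

def retrieve(query: str, k: int = 2) -> list:
    """Rank documents by keyword overlap (stand-in for semantic similarity search)."""
    query_terms = set(query.lower().split())
    scored = sorted(
        DOCUMENT_STORE.items(),
        key=lambda item: len(query_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str) -> str:
    """Augment the user query with the retrieved context (the RAG step to document)."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("Why did the transaction fail after the token expired?"))
```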

Similarly, if the LLM is fine-tuned, researchers MUST describe the fine-tuning goal (e.g., domain adaptation, task specialization), procedure (e.g., full fine-tuning, parameter-efficient fine-tuning), and dataset (source, size, preprocessing, availability). They should include training details (e.g., compute resources, hyperparameters, loss function) and performance metrics (benchmarks, baseline comparison). If the data used for fine-tuning is not confidential, an anonymized snapshot of the data used for fine-tuning the model SHOULD be made available.
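
One lightweight way to report these fine-tuning details is a small, machine-readable “fine-tuning card” shipped with the replication package, as sketched below; all field values are invented examples.

```
import json

# Invented example values; report the actual goal, procedure, dataset, and training setup.
fine_tuning_card = {
    "goal": "domain adaptation to requirements engineering documents",
    "procedure": "parameter-efficient fine-tuning (LoRA)",
    "base_model": "llama3.3:70b-instruct-q8_0",
    "dataset": {
        "source": "anonymized industrial requirements (snapshot 2025-01-10)",
        "size": 12430,
        "preprocessing": "deduplication, removal of confidential identifiers",
        "availability": "anonymized snapshot in the replication package",
    },
    "training": {
        "epochs": 3,
        "learning_rate": 2e-4,
        "loss_function": "cross-entropy",
        "hardware": "1x NVIDIA A100 80 GB",
    },
    "evaluation": {"benchmark": "held-out 10% split", "metric": "F1-score"},
}

with open("fine_tuning_card.json", "w") as f:
    json.dump(fine_tuning_card, f, indent=2)
```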

Example(s)

Some empirical studies in software engineering involving LLMs have documented the architecture and supplemental data aligning with the recommended guidelines. Hereafter, we provide two examples.

Schäfer et al. conducted an empirical evaluation of using LLMs for automated unit test generation [1]. The authors provide a comprehensive description of the system architecture, detailing how the LLM is integrated into the software development workflow to analyze codebases and produce corresponding unit tests. The architecture includes components for code parsing, prompt formulation, interaction with the LLM, and integration of the generated tests into existing test suites. The paper also elaborates on the datasets utilized for training and evaluating the LLM’s performance in unit test generation. It specifies the sources of code samples, the selection criteria, and the preprocessing steps undertaken to prepare the data.

Dhar et al. conducted an exploratory empirical study to assess whether LLMs can generate architectural design decisions [2]. The authors detail the system architecture, including the decision-making framework, the role of the LLM in generating design decisions, and the interaction between the LLM and other components of the system. The study provides information on the fine-tuning approach and datasets used for evaluation, including the source of the architectural decision records, preprocessing methods, and the criteria for data selection.

Advantages

Documenting the architecture and supplemental data of LLM-based systems enhances reproducibility, transparency, and trust [3]. In empirical software engineering studies, this is essential for experiment replication, result validation, and benchmarking. Clear documentation of RAG, fine-tuning, and data storage enables comparison, optimizes efficiency, and upholds scientific rigor and accountability, fostering reliable and reusable research.

Challenges

Researchers face challenges in documenting LLM-based architectures, including proprietary APIs and dependencies that restrict disclosure, managing large-scale retrieval databases, and ensuring efficient query execution. They must also balance transparency with data privacy concerns, adapt to the evolving nature of LLM integrations, and, depending on the context, handle the complexity of multi-agent interactions and decision-making logic, all of which can impact reproducibility and system clarity.

Study Types

This guideline MUST be followed for all empirical study types involving LLMs, especially those using fine-tuned or self-hosted models, retrieval-augmented generation (RAG) or alternative retrieval methods, API-based model access, and agent-based systems where LLMs handle autonomous planning and execution.

References

[1] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Trans. Software Eng., vol. 50, no. 1, pp. 85–105, 2024, doi: 10.1109/TSE.2023.3334955.

[2] R. Dhar, K. Vaidhyanathan, and V. Varma, “Can LLMs generate architectural design decisions? An exploratory empirical study,” in 21st IEEE International Conference on Software Architecture, ICSA 2024, Hyderabad, India, June 4-8, 2024, IEEE, 2024, pp. 79–89. doi: 10.1109/ICSA59870.2024.00016.

[3] Q. Lu, L. Zhu, X. Xu, Z. Xing, and J. Whittle, “Toward responsible AI in the era of generative AI: A reference architecture for designing foundation model-based systems,” IEEE Softw., vol. 41, no. 6, pp. 91–100, 2024, doi: 10.1109/MS.2024.3406333.

Report Prompts and their Development

Recommendations

Prompts are critical in empirical software engineering studies involving LLMs. Depending on the task, prompts may include various types of content, such as source code, execution traces, error messages, natural language descriptions, or even screenshots and other multi-modal inputs. These elements significantly influence the model’s output, and understanding how exactly they were formatted and integrated is essential for transparency and reproducibility. We indicate absolute requirements for reproducibility using the keyword MUST and strongly recommended practices that could be omitted if justified using the keyword SHOULD.

Researchers MUST report the full text of prompts used, along with any surrounding instructions, metadata, or contextual information. The exact structure of the prompt should be described, including the order and format of each element. For example, when using code snippets, researchers should specify whether they were enclosed in markdown-style code blocks (e.g., triple backticks), if line breaks and whitespace were preserved, and whether additional annotations (e.g., comments) were included. Similarly, for other artifacts such as error messages, stack traces, or non-text elements like screenshots, researchers should explain how these were presented. If rich media was involved, such as in multi-modal models, details on how these inputs were encoded or referenced in the prompt are crucial.

When dealing with extensive or complex prompts, such as those involving large codebases or multiple error logs, researchers MUST describe strategies they used for handling input length constraints. Approaches might include truncating, summarizing, or splitting prompts into multiple parts. Token optimization measures, such as simplifying code formatting or removing unnecessary comments, should also be documented if applied.

In terms of strategy, prompts can vary widely based on the task design. Researchers MUST specify whether zero-shot, one-shot, or few-shot prompting was used. For few-shot prompts, the examples provided to the model should be clearly outlined, along with the rationale for selecting them. If multiple versions of a prompt were tested, researchers should describe how these variations were evaluated and how the final design was chosen.

In cases where prompts are generated dynamically—such as through preprocessing, template structures, or retrieval-augmented generation (RAG)—the process MUST be thoroughly documented. This includes explaining any automated algorithms or rules that influenced prompt generation. For studies involving human participants, where users might create or modify prompts themselves, researchers MUST describe how these prompts were collected and analyzed. If full disclosure is not feasible due to privacy concerns, summaries and representative examples should be provided.
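
For template-based prompt generation, the template itself is part of the study design and can be reported verbatim. The sketch below uses Python's string.Template for a hypothetical bug-fixing task and prints the concrete prompt so that it can be logged; the template text and placeholders are illustrative.

```
from string import Template

# Template reported verbatim in the paper or replication package.
# How code is delimited (e.g., markdown-style triple backticks) should also be stated.
PROMPT_TEMPLATE = Template(
    "You are a coding assistant. Below is a $language script that fails with an error. "
    "Analyze the code and suggest a fix.\n"
    "Code:\n$code\n"
    "Error message:\n$error"
)

def build_prompt(language: str, code: str, error: str) -> str:
    return PROMPT_TEMPLATE.substitute(language=language, code=code, error=error)

prompt = build_prompt(
    language="Python",
    code="def divide(a, b):\n    return a / b\n\nprint(divide(10, 0))",
    error="ZeroDivisionError: division by zero",
)
print(prompt)  # log every concrete prompt that was sent (see Report Interaction Logs)
```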

To ensure full reproducibility, researchers MUST make all prompts and prompt variations publicly available in an online appendix, replication package, or repository. If the full set of prompts is too extensive to include in the paper itself, researchers SHOULD still provide representative examples and describe variations in the main body of the paper. For example, a recent paper by Anandayuvaraj et al. [1] is a good example of making prompts available online. In the paper, the authors analyze software failures reported in news articles and use prompting to automate tasks such as filtering relevant articles, merging reports, and extracting detailed failure information. Their online appendix contains all the prompts used in the study, providing valuable transparency and supporting reproducibility.

When reporting prompts, researchers MUST also reference the model version used (see Section Report Model Version and Configuration), as prompt effectiveness varies across model versions. For example: “This prompt performed differently with GPT-4 (effective) versus Llama 2 (less effective) despite identical parameters.”

Prompt development is often iterative, involving collaboration between human researchers and AI tools. Researchers SHOULD report any instances where LLMs were used to suggest prompt refinements, as well as how those suggestions were incorporated. Furthermore, prompts may need to be revised in response to failure cases where the model produced incorrect or incomplete outputs. Iterative changes based on human feedback and pilot testing results should also be included in the documentation. A prompt changelog can help track and report the evolution of prompts throughout a research project, including key revisions, reasons for changes, and versioning (e.g., v1.0: initial prompt; v1.2: added output formatting; v2.0: incorporated examples of ideal responses).
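
Such a changelog can be kept as a simple version-keyed structure under version control and published with the replication package; the entries below are illustrative only.

```
# Illustrative prompt changelog; one entry per prompt revision used in the study.
PROMPT_CHANGELOG = {
    "v1.0": "Initial zero-shot prompt containing only the task description.",
    "v1.2": "Added explicit output formatting requirements after malformed responses.",
    "v2.0": "Incorporated two examples of ideal responses based on pilot testing feedback.",
}
```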

Finally, pilot testing and prompt evaluation are vital for ensuring that prompts yield reliable results. If such testing was conducted, researchers SHOULD summarize key insights, including how different prompt variations affected output quality and which criteria were used to finalize the prompt design.

Example(s)

A debugging study may use a prompt structured like this:

You are a coding assistant. Below is a Python script that fails with an error. Analyze the code and suggest a fix.
Code:
```
def divide(a, b):
    return a / b

print(divide(10, 0))
```
Error message:
ZeroDivisionError: division by zero

The study should document that the code was enclosed in triple backticks, specify whether additional context (e.g., stack traces or annotations) was included, and explain how variations of the prompt were tested.

A good example of comprehensive prompt reporting is provided by Liang et al. [2]. The authors make the exact prompts available in their online appendix on Figshare, including details such as code blocks being enclosed in triple backticks. While this level of detail would not fit within the paper itself, the paper thoroughly explains the rationale behind the prompt design and data output format. It also includes one overview figure and two concrete examples, ensuring transparency and reproducibility while keeping the main text concise.

Advantages

Providing detailed documentation of prompts enhances reproducibility and comparability. It allows other researchers to replicate the study under similar conditions, refine prompts based on documented improvements, and evaluate how different types of content (e.g., source code vs. execution traces) influence LLM behavior. This transparency also enables a better understanding of how formatting, prompt length, and structure impact results across various studies.

Challenges

One challenge is the complexity of prompts that combine multiple components, such as code, error messages, and explanatory text. Formatting differences—such as whether markdown or plain text was used—can affect how LLMs interpret inputs. Additionally, prompt length constraints may require careful management, particularly for tasks involving extensive artifacts like large codebases.

For multi-modal studies, handling non-text artifacts such as screenshots introduces additional complexity. Researchers must decide how to represent such inputs, whether by textual descriptions, image encoding, or data references. Lastly, proprietary LLMs (e.g., Copilot) may obscure certain details about internal prompt processing, limiting full transparency.

Privacy and confidentiality concerns can also hinder prompt sharing, especially when sensitive data is involved. In these cases, researchers should provide anonymized examples and summaries wherever possible. For prompts containing sensitive information, researchers MUST: (i) Anonymize personal identifiers. (ii) Replace proprietary code with functionally equivalent examples. (iii) Clearly mark modified sections.

Study Types

Reporting requirements may vary depending on the study type. For tool evaluation studies, researchers MUST explain how prompts were generated and structured within the tool. Controlled experiments MUST provide exact prompts for all conditions, while observational studies SHOULD summarize common prompt patterns and provide representative examples if full prompts cannot be shared.

References

[1] D. Anandayuvaraj, M. Campbell, A. Tewari, and J. C. Davis, “FAIL: Analyzing software failures from the news using LLMs,” in Proceedings of the 39th IEEE/ACM international conference on automated software engineering, 2024, pp. 506–518.

[2] J. T. Liang et al., “Can GPT-4 replicate empirical software engineering research?” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1330–1353, 2024.

Report Interaction Logs

Recommendations

The previous guidelines aim to address the reproducibility of a study by calling for reporting the full LLM configuration (Report Model Version and Configuration) and prompts (Report Prompts and their Development), but even following both might not always be sufficient. LLMs can still behave non-deterministically even if decoding strategies and parameters are fixed, because non-determinism can arise from batching, input preprocessing, and floating point arithmetic on GPUs [1]. Thus, in order to establish a fixed point from which a given study can be reproduced, a study SHOULD report the full interaction logs with an LLM if possible.

Reporting interaction logs is especially important for studies targeting commercial SaaS solutions based on LLMs (e.g., ChatGPT) or novel tools that integrate LLMs via cloud APIs, because readers who want to replicate such a study have even less guarantee of being able to reproduce the state of the LLM-powered system at a later point.

The rationale for this guideline is similar to the rationale for reporting interview transcripts in qualitative research. In both cases, it is important to document the entire interaction between the interviewer and the participant. Just as a human participant might give different answers to the same question asked two months apart, the responses of OpenAI’s ChatGPT can also vary over time. Therefore, keeping a record of the actual conversation is crucial for accuracy and context, and it demonstrates the depth of engagement, increasing transparency.
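
A lightweight way to follow this guideline is to append every prompt–response pair, together with the configuration in effect, to a structured log that is later published as supplementary material. The sketch below writes JSON Lines; the example values are invented, and the response is assumed to have been obtained from whatever model or tool is under study.

```
import json
from datetime import datetime, timezone

LOG_FILE = "interaction_log.jsonl"  # published later, e.g., on Zenodo or Figshare

def log_interaction(prompt: str, response: str, model: str, config: dict) -> None:
    """Append one complete prompt/response exchange to the interaction log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "config": config,
        "prompt": prompt,
        "response": response,
    }
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")

log_interaction(
    prompt="Below is a Python function that is buggy. Analyze the code and suggest a fix. ...",
    response="The function returns the wrong variable; it should return unique_list. ...",
    model="gpt-4-0125-preview",
    config={"temperature": 0.7, "top_p": 0.8, "seed": 23487},
)
```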

Example(s)

To intuitively explain why this can be important, consider a study in which researchers evaluate the correctness of the bug-fixing capabilities of LLM-based tools, and assume that the researchers only provided the prompts without the LLM answers. One of the prompts might include the buggy function below, in which the wrong variable is returned.

Below is a Python function that is buggy. 
Analyze the code and suggest a fix.

def remove_duplicates_buggy(input_list):
    unique_list = []
    for item in input_list:
        if item not in unique_list:
            unique_list.append(item)
    return input_list 

If a paper doesn’t include the output and simply states that the fix was correct, it will lack crucial information needed for reproducibility. Without this information, fellow researchers can’t assess whether the solution was efficient or if it was written in a non-idiomatic way (e.g., the function could have been implemented more elegantly using Python list comprehensions). There could be other potential issues that researchers won’t be able to verify if the detailed response is not provided.

In their paper “Investigating ChatGPT’s Potential to Assist in Requirements Elicitation Processes” [2], Ronanki et al. report the full answers of ChatGPT and upload them to a Zenodo record.

Advantages

The advantage of following this guideline is the transparency and increased reproducibility of the resulting research.

The guideline is straightforward to follow. Obtaining transcripts is simple when the “interviewee” is an LLM, compared to obtaining transcripts from human participants. Even in systems where interactions are voice-based, these interactions are first converted to text using speech-to-text methods, making transcripts easily accessible. Therefore, there is no valid reason for researchers not to report full transcripts.

Another advantage is that, while conversations with human participants often cannot be reported due to confidentiality, LLM conversations can (e.g., as of the beginning of 2025, OpenAI allows sharing of chat transcripts).

Detailed logs enable future replication studies to compare results using the same prompts. This could be valuable for tracking changes in LLM responses over time or across different versions of the model. A body of knowledge would also be collected that would allow researchers to analyze how consistent the LLM’s responses are and identify any variations or improvements in its performance.

Challenges

Not all systems allow the reporting of interaction logs with the same ease. At one end of the spectrum, chatbots can be easily documented because the conversations are typically text-based and can be logged directly. At the other end, auto-complete systems (e.g., GitHub Copilot) make it harder to report full interactions.

While some tools, such as Continue (https://blog.continue.dev/its-time-to-collect-data-on-how-you-build-software/), facilitate logging interactions within the IDE, understanding the value of a Copilot suggestion during a coding session might require recreating the exact state of the codebase at the time the suggestion was made – a challenging context to report. One solution is to use version control to capture the state of the codebase when a recommendation occurred, allowing researchers to track changes and analyze the context behind the suggestion.

Given that chat transcripts are easy to generate, a study might end up with a very large appendix. Consequently, online storage might be needed. Services such as Zenodo, Figshare, or similar long-term storage for research artifacts SHOULD be used in such situations.

Study Types

This guideline SHOULD be followed for all study types.

References

[1] S. Chann, “Non-determinism in GPT-4 is caused by Sparse MoE.” https://152334h.github.io/blog/non-determinism-in-gpt-4/, 2023.

[2] K. Ronanki, C. Berger, and J. Horkoff, “Investigating ChatGPT’s potential to assist in requirements elicitation processes,” in 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), IEEE, 2023, pp. 354–361.

Use Human Validation for LLM Outputs

Recommendations

While LLMs can automate many tasks, it is important to validate their outputs with human judgment. For natural language processing tasks, a large-scale study has shown that LLMs exhibit significant variation in their results, which limits their reliability as a direct substitute for human raters [1]. Human validation helps ensure the accuracy and reliability of the results, as LLMs may sometimes produce incorrect or biased outputs. Especially in studies where LLMs are used to support researchers, human validation is generally recommended to ensure validity [2]. We recommend that researchers plan for human validation from the outset and develop their methodology with this validation in mind. Study reference models for comparing humans with LLMs [3] can help provide a template to ease the design process. In some cases, a hybrid approach between human and machine-generated annotations can improve annotation efficiency. However, researchers should use systematic approaches to decide which human annotations can be replaced with LLM-generated ones and how, such as the methods proposed by Ahmed et al. [4].

When evaluating the capability of LLMs to generate SE-related artifacts, researchers may employ human validation to complement machine-generated measures. For example, proxies for software quality, such as code complexity or the number of code smells, may be complemented by human ratings of maintainability, readability, or understandability. In the case of more abstract variables or psychometric measurements, human validation may be the only way of measuring a specific construct. For example, measuring human factors such as trust, cognitive load, and comprehension levels may inherently require human evaluation.

When conducting empirical measurements, researchers should clearly define the construct that they are measuring and specify the methods used for measurement. Further, they should use established measurement methods and instruments that are empirically validated [5], [6]. Measuring a construct may require aggregating input from multiple subjects. For example, a study may assess inter-rater agreement using measures such as Cohen’s Kappa or Krippendorff’s Alpha before aggregating ratings. In some cases, researchers may also combine multiple measures into single composite measures. As an example, they may evaluate both the time taken and accuracy when completing a task and aggregate them into a composite measure for the participants’ overall performance. In these cases, researchers should clearly describe their method of aggregation and document their reasoning for doing so.

When employing human validation, additional confounding factors should be controlled for, such as the level of expertise or experience with LLM-based applications or their general attitude towards AI-based tools. Researchers should control for these factors through methods such as stratified sampling or by categorizing participants based on experience levels. Where applicable, researchers should conduct a power analysis to estimate the required sample size and ensure sufficient statistical power in their experiment design. When multiple humans are annotating the same artifact, researchers should validate the objectivity of annotations through measures of inter-rater reliability. For instance, “A subset of 20% of the LLM-generated annotations was reviewed and validated by experienced software engineers to ensure accuracy. Using Cohen’s Kappa, an inter-rater reliability of \kappa = 0.90 was reached.”
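
For two raters assigning nominal labels to the same items (e.g., a human validator checking LLM-generated annotations), Cohen’s Kappa can be computed directly from the two label lists. The following plain-Python sketch implements the standard formula; the example labels are invented.

```
from collections import Counter

def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's Kappa for two raters labeling the same items with nominal categories."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Invented example: human validation of ten LLM-generated annotations.
human = ["correct", "correct", "wrong", "correct", "wrong",
         "correct", "correct", "correct", "wrong", "correct"]
llm = ["correct", "correct", "wrong", "wrong", "wrong",
       "correct", "correct", "correct", "wrong", "correct"]
print(f"kappa = {cohens_kappa(human, llm):.2f}")  # 0.78 for this example
```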

Reporting inter-model and human-model agreement in a similar way further supports the interpretation of results, for instance: “Model-model agreement is high, for all criteria, especially for the three large models (GPT-4, Gemini, and Claude). Table I indicates that the mean Krippendorff’s \alpha is 0.68-0.76. Second, we see that human-model and human-human agreements are in similar ranges, 0.24-0.40 and 0.21-0.48 for the first three categories.” [4].

Example(s)

As an example, Khojah et al. [7] augmented the results of their study using human measurement. Specifically, they asked participants to provide ratings regarding their experience, trust, perceived effectiveness and efficiency, and scenarios and lessons learned in their experience with ChatGPT.

Choudhuri et al. [8] evaluated students’ perceptions of their experience with ChatGPT in a controlled experiment. They added this data to extend their results on task performance in a series of software engineering tasks. This way, they were able to employ questionnaires to measure cognitive load, document any perceived faults in the system, and inquire about the participants’ intention to continue using the tool.

Xue et al. [9] conducted a controlled experiment in which they evaluated the impact of ChatGPT on the performance and perceptions of students in an introductory programming course. They employed multiple measures to judge the impact of the LLM from the perspective of humans. In their study, they recorded the students’ screens, evaluated the answers they provided in tasks, and distributed a post-study survey to get direct opinions from the students.

Hymel et al. [10] evaluated the capability of ChatGPT-4.0 to generate requirements documents. Specifically, they evaluated two requirements documents based on the same business use case, one document generated with the LLM and one document created by a human expert. The documents were then reviewed by experts and judged in terms of alignment with the original business use case, requirements quality and whether they believed it was created by a human or an LLM. Finally, they analyzed the influence of the participants’ familiarity with AI tools on the study results.

Advantages

Incorporating human judgment in the evaluation process adds a layer of quality control and increases the trustworthiness of the study’s findings, especially when explicitly reporting inter-rater reliability metrics [11].

Incorporating feedback from individuals from the target population strengthens external validity by grounding study findings in real-world usage scenarios and may positively impact the transfer of study results to practice. Researchers may uncover additional opportunities to further improve the LLM or LLM-based tool based on the reported experiences.

Challenges

Measurement through human validation can be challenging. Ensuring that the operationalization of a desired construct and the method of measuring it are appropriate requires a good understanding of the studied concept and construct validity in general, and a systematic design approach for the measurement instruments [12].

Human judgment is often very subjective and may lead to large variability between different subjects due to differences in expertise, interpretation, and biases among evaluators [13]. Controlling for this subjectivity will require additional rigor when analyzing the study results.

Recruiting participants as human validators will always incur additional resources compared to machine-generated measures. Researchers must weigh the cost and time investment incurred by the recruitment process against the potential benefits for the validity of their study results.

Study Types

These guidelines apply to all study types:

When conducting their studies, researchers

MUST

  • Clearly define the construct measured through human validation.

  • Describe how the construct is operationalized in the study, specifying the method of measurement.

  • Employ established and widely accepted measurement methods and instruments.

  • Assess the characteristics of the human validators to control for factors influencing the validation results (e.g., years of experience, familiarity with the task, etc.)

SHOULD

  • Use empirically validated measures.

  • Complement automated or machine-generated measures with human validation where possible.

  • Ensure consistency among human validators by establishing a shared understanding in training sessions or pilot studies and by assessing inter-rater agreement.

MAY

  • Use multiple different measures (e.g., expert ratings, surveys, task performance) for human validation.

References

[1] A. Bavaresco et al., “LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks,” CoRR, vol. abs/2406.18403, 2024, doi: 10.48550/ARXIV.2406.18403.

[2] X. Wang, H. Kim, S. Rahman, K. Mitra, and Z. Miao, “Human-LLM collaborative annotation through effective verification of LLM labels,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024, F. ’Floyd’ Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 303:1–303:21. doi: 10.1145/3613904.3641960.

[3] K. Schneider, F. Fotrousi, and R. Wohlrab, “A reference model for empirically comparing LLMs with humans,” in Proceedings of the 47th international conference on software engineering: Software engineering in society (ICSE-SEIS2025), IEEE, 2025.

[4] T. Ahmed, P. T. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” CoRR, vol. abs/2408.05534, 2024, doi: 10.48550/ARXIV.2408.05534.

[5] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman, “Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance,” Frontiers Comput. Sci., vol. 5, 2023, doi: 10.3389/FCOMP.2023.1096257.

[6] S. A. C. Perrig, N. Scharowski, and F. Brühlmann, “Trust issues with trust scales: Examining the psychometric quality of trust measures in the context of AI,” in Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, CHI EA 2023, Hamburg, Germany, April 23-28, 2023, A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, and A. Peters, Eds., ACM, 2023, pp. 297:1–297:7. doi: 10.1145/3544549.3585808.

[7] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond code generation: An observational study of ChatGPT usage in software engineering practice,” Proc. ACM Softw. Eng., vol. 1, no. FSE, pp. 1819–1840, 2024, doi: 10.1145/3660788.

[8] R. Choudhuri, D. Liu, I. Steinmacher, M. A. Gerosa, and A. Sarma, “How far are we? The triumphs and trials of generative AI in learning software engineering,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024, ACM, 2024, pp. 184:1–184:13. doi: 10.1145/3597503.3639201.

[9] Y. Xue, H. Chen, G. R. Bai, R. Tairas, and Y. Huang, “Does ChatGPT help with introductory programming? An experiment of students using ChatGPT in CS1,” in Proceedings of the 46th International Conference on Software Engineering: Software Engineering Education and Training, SEET@ICSE 2024, Lisbon, Portugal, April 14-20, 2024, ACM, 2024, pp. 331–341. doi: 10.1145/3639474.3640076.

[10] C. Hymel and H. Johnson, “Analysis of LLMs vs human experts in requirements engineering.” 2025. Available: https://arxiv.org/abs/2501.19297

[11] Q. Khraisha, S. Put, J. Kappenberg, A. Warraitch, and K. Hadfield, “Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages,” Research Synthesis Methods, vol. 15, no. 4, pp. 616–626, 2024, doi: 10.1002/jrsm.1715.

[12] D. I. K. Sjøberg and G. R. Bergersen, “Construct validity in software engineering,” IEEE Trans. Software Eng., vol. 49, no. 3, pp. 1374–1396, 2023, doi: 10.1109/TSE.2022.3176725.

[13] N. McDonald, S. Schoenebeck, and A. Forte, “Reliability and inter-rater reliability in qualitative research: Norms and guidelines for CSCW and HCI practice,” Proc. ACM Hum. Comput. Interact., vol. 3, no. CSCW, pp. 72:1–72:23, 2019, doi: 10.1145/3359174.

Use an Open LLM as a Baseline

Recommendations

To ensure that empirical studies using Large Language Models (LLMs) in software engineering are reproducible and comparable, we recommend incorporating an open LLM as a baseline for analysis. This applies both to studies that use LLMs to explore a phenomenon and to studies that evaluate LLMs on specific software engineering tasks. Sometimes, including an open LLM baseline might be impossible. In such cases, researchers should at least evaluate an open LLM on an early version of their approach or tool. Open LLMs are available for inspection and download on platforms such as Hugging Face. Depending on the model size and the available computing power, open LLMs can be hosted on a local computer or server using LLM management systems such as Ollama or LM Studio. Alternatively, such models can be run on cloud-based services such as Together AI. We recommend providing a replication package with clear, step-by-step instructions on how to reproduce the reported results using the open LLM. This makes the research more reliable, since it allows others to confirm the findings. For example, researchers could report: “We compared our results to Meta’s Code Llama, which is available on Hugging Face,” and include a link to the replication package.
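
To illustrate how low the technical barrier can be, the sketch below sends a single prompt to a locally hosted open model through Ollama's HTTP API. It assumes an Ollama server running on the default port with the model already pulled; endpoint and option names follow Ollama's documented REST API but should be checked against the version actually used.

```
import json
import urllib.request

def query_ollama(prompt: str, model: str = "llama3.3:70b-instruct-q8_0") -> str:
    """Send one prompt to a locally running Ollama server and return the response text."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0, "seed": 23487},  # report these values as well
    }
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

print(query_ollama("Suggest a unit test for a function that parses ISO 8601 dates."))
```

A script like this, shipped in the replication package, can double as the step-by-step instructions recommended above.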

The term open, when applied to an LLM, can have various meanings. Widder et al. [1] discuss three types of openness (transparency, reusability, and extensibility) and what openness in AI can and cannot provide. Moreover, the Open Source Initiative (OSI) [2] provides a definition of open-source AI that serves as a useful framework for evaluating the openness of AI models. In simple terms, according to OSI, open-source AI means that one has access to everything needed to use, understand, modify, share, retrain, and recreate the AI. Thus, researchers must be clear about what “open” means in their context.

Finally, we recommend reporting and analyzing inter-model agreement metrics. These metrics quantify the consistency between the evaluated model’s outputs and the baseline, and thus support the identification of potential biases or areas of disagreement. Moreover, we recommend reporting model confidence scores to analyze model uncertainty. The analysis of inter-model agreement and model confidence provides valuable insights into the reliability and robustness of LLM performance, allowing a deeper understanding of their capabilities and limitations.

Example(s)

  • Benchmarking a Proprietary LLM: Researchers want to know how good their own LLM is at writing code. They might compare it against an open LLM such as StarCoderBase.

  • Evaluating an LLM-Powered Tool: A team developing an AI-driven code review tool might want to assess the quality of suggestions generated by both a proprietary LLM and an open alternative. Human evaluators could then independently rate the relevance and correctness of the suggestions, providing an objective measure of the tool’s effectiveness.

  • Ensuring Reproducibility with a Replication Package: A study on bug localization that uses a closed-source LLM could support reproducibility by including a replication package. This package might contain a script that automatically reruns the same experiments using an open-source LLM—such as Llama 3—and generates a comparative report.

Advantages

  • Improved Reproducibility: Researchers can independently replicate experiments.

  • More Objective Comparisons: Using a standardized baseline allows for more unbiased evaluations.

  • Greater Transparency: Open models enable the analysis of how data is processed, which supports researchers in identifying potential biases and limitations.

  • Long-Term Accessibility: Unlike proprietary models, which may become unavailable, open LLMs remain available for future studies.

  • Lower Costs: Open-source models usually have fewer licensing restrictions, which makes them more accessible to researchers with limited funding.

Challenges

  • Performance Differences: Open models may not always match the latest proprietary LLMs in accuracy or efficiency, making it harder to demonstrate improvements.

  • Computational Demands: Running large open models requires hardware resources, including high-performance GPUs and significant memory.

  • Defining “Openness”: The term open is evolving—many so-called open models provide access to weights but do not disclose training data or methodologies. We are aware that the definition of an “open” model is actively being discussed, and many open models are essentially only “open weight” [3]. We consider the Open Source AI Definition proposed by the Open Source Initiative (OSI) [2] to be a first step towards defining true open-source models.

  • Implementation Complexity: Unlike cloud-based APIs from proprietary providers, setting up and fine-tuning open models can be technically demanding, due to possibly limited documentation.

Study Types

  • Tool Evaluation: An open LLM baseline MUST be included if technically feasible. If integration is too complex, researchers SHOULD at least report initial benchmarking results using open models.

  • Benchmarking Studies and Controlled Experiments: An open LLM MUST be one of the models evaluated.

  • Observational Studies: If using an open LLM is impossible, researchers SHOULD acknowledge its absence and discuss potential impacts on their findings.

  • Qualitative Studies: If the LLM is used for exploratory data analysis or to compare alternative interpretations of results, then an LLM baseline MAY be reported.

References

[1] D. G. Widder, M. Whittaker, and S. M. West, “Why ‘open’ AI systems are actually closed, and why this matters,” Nature, vol. 635, no. 8040, pp. 827–833, 2024.

[2] Open Source Initiative (OSI), “Open Source AI Definition 1.0.” https://opensource.org/ai/open-source-ai-definition.

[3] E. Gibney, “Not all ‘open source’ AI models are actually open,” Nature News, 2024, doi: 10.1038/d41586-024-02012-5.

Report Suitable Baselines, Benchmarks, and Metrics

Recommendations

Empirical software engineering research has shown the importance of validating the tools that support software engineering [1], [2]. Thus, it is of pivotal importance to empirically assess the effectiveness of LLMs or LLM-based tools. Benchmarks are model- and tool-independent standardized tests used to assess the performance of LLMs on specific tasks such as code summarization or code generation. A benchmark consists of multiple standardized test cases, each with at least a task and an expected result. Metrics are used to quantify the performance on the benchmark tasks, enabling comparison. Since LLMs require substantial hardware resources, baselines serve as a comparison to assess their performance against traditional algorithms with lower computational costs. When selecting benchmarks, it is important to understand both the contained tasks and the expected results, because these determine what the benchmark assesses. We recommend that researchers briefly summarize the selected benchmark and why it is suitable for their study. They should report why the given tasks and corresponding expected benchmark results reflect the problem the researchers want to solve. In addition, reporting the total number of unique benchmark test cases and illustrating a single test case allows other researchers to assess what the model is tested on. If multiple benchmarks exist for the same task, the goal should be to compare performance across benchmarks. We recommend using the most specific benchmarks given the context.

Furthermore, the representativeness of a benchmark is important to report. For example, many benchmarks focus heavily on Python, and often on isolated functions. This assesses a very specific part of software development, which is certainly not representative of the full breadth of software engineering.

The use of LLMs might not always be justifiable if traditional approaches achieve similar performance. For many tasks for which LLMs are being evaluated, traditional non-LLM-based approaches exist (e.g., for program repair) that can serve as a baseline. Even if LLM-based tools perform better, the question is whether the resources consumed justify the potentially marginal improvements. We recommend that researchers always check whether such traditional baselines exist and, if they do, compare them with the LLM or LLM-based tool using suitable metrics. Such comparisons between traditional and LLM-based approaches, or between different LLM-based tools, should be based on a benchmark that is suitable for the given context. In general, we recommend using established metrics whenever possible (see the summary below), as this enables secondary research. We further recommend that researchers carefully argue why the selected metrics are suitable for the given task or study. If an LLM-based tool that is supposed to support humans is evaluated, a relevant metric might be the acceptance rate, that is, the ratio of accepted artifacts (e.g., test cases, code snippets) to all artifacts that were generated and presented to the user. Another way of evaluating LLM-based tools is calculating inter-model agreement (see also Section Use an Open LLM as a Baseline). This also allows researchers to assess how dependent a tool’s performance is on specific models.

LLM-based generation is non-deterministic by design. This non-determinism requires the repetition of experiments to statistically assess the performance of a model or tool, for example using the arithmetic mean, confidence intervals, and standard deviations.

Example(s)

Two benchmarks used for code generation are HumanEval (GitHub) [3] and MBPP (GitHub) [4]. Both benchmarks consist of code snippets written in Python sourced from publicly available repositories. Each snippet consists of four parts: a prompt based on a function definition and a corresponding description of what the function should accomplish, a canonical solution, an entry point for execution, and tests. The input to the LLM is the entire prompt. The output of the LLM is evaluated either against the canonical solution using metrics or against a test suite. Other benchmarks for code generation include ClassEval (GitHub) [5], LiveCodeBench (GitHub) [6], and SWE-bench (GitHub) [7]. An example of a code translation benchmark is TransCoder [8] (GitHub).

According to [9], the main problem types addressed with LLMs are classification, recommendation, and generation problems. Each of these problem types requires a different set of metrics. The authors provide a comprehensive overview of benchmarks categorized by software engineering tasks. Common metrics for assessing generation tasks are BLEU, pass@k, Accuracy/Accuracy@k, and Exact Match [9]. The most common metric for recommendation tasks is Mean Reciprocal Rank [9]. For classification tasks, classical machine learning metrics such as Precision, Recall, F1-score, and Accuracy are often reported [9].

We now briefly discuss two common metrics used for generation tasks. BLEU-N [3] is a similarity score based on n-gram precision between two strings, ranging from 0 to 1. Values close to 0 indicate dissimilar content, while values closer to 1 indicate similar content; for code generation, a value closer to 1 suggests that the model is more capable of generating the expected output. BLEU-N has multiple variations. CodeBLEU [10] and CrystalBLEU [11] are the most notable variations tailored to code, introducing additional heuristics such as AST matching. As mentioned above, researchers should motivate why they chose a certain metric or variant thereof for their particular study.
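
To illustrate, the following minimal sketch computes a plain sentence-level BLEU score on tokenized code using NLTK; it deliberately does not include the code-specific components of CodeBLEU or CrystalBLEU, which require their respective reference implementations:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Token-level BLEU between a generated snippet and the canonical solution.
# This is plain BLEU-N (default weights use up to 4-grams); CodeBLEU and
# CrystalBLEU add code-specific components such as AST matching.
reference = "def add ( a , b ) : return a + b".split()
candidate = "def add ( x , y ) : return x + y".split()

# Smoothing avoids a zero score when higher-order n-grams do not overlap.
smoothing = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoothing)
print(f"BLEU = {score:.3f}")
```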

The metric pass@k reports the likelihood of a model correctly completing a code snippet at least once within k tries. To the best of our knowledge, the basic concept of pass@k was first used in [12] for evaluating code synthesis under the name success rate at B, where B denotes the budget of trials. The term pass@k was later popularized by [13] as a metric for code generation correctness. The exact definition of correctness varies depending on the task. For code generation, correctness is often defined based on test cases: a passing test suite then means that the solution is correct. The resulting pass rate ranges from 0 to 1. A pass rate of 0 indicates that the model was not able to generate a single correct solution within k tries; a pass rate of 1 indicates that the model generated at least one correct solution within k tries. The metric is defined as:

\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}

where n is the total number of generated samples per prompt, c is the number of correct samples among the n samples, and k is the number of tries considered.

Choosing an appropriate value for k depends on the downstream task of the model and how end-users interact with the model. A high pass rate for pass@1 is highly desirable in tasks where the system only presents one solution or if a single solution requires high computational effort. For example, code completion depends on a single prediction since the end user typically sees only a single suggestion. Pass rates for higher k values (e.g., 2, 5, 10) indicate whether the model can solve the given task within multiple attempts. For downstream tasks that permit multiple solutions or user interaction, strong performance at k > 1 can be justified. For instance, a user selecting the correct test case from multiple suggestions allows for some model errors.
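
The following minimal sketch implements the unbiased pass@k estimator given above in a numerically stable form (following the formulation popularized in [13]); n, c, and k are used as defined above:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single task:
    n generated samples, of which c are correct."""
    if n - c < k:
        return 1.0  # every k-subset of the samples contains a correct one
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per task, 3 of which pass the tests
print(pass_at_k(n=20, c=3, k=1))   # 0.15
print(pass_at_k(n=20, c=3, k=10))  # ~0.89, since more tries are allowed
```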

Common examples of papers using pass@k are papers introducing new models for code generation, such as [14], [15], [16], [17]. TODO: Find SE papers that use metrics other than pass@k and papers that do not introduce a new model.

Advantages

Challenges

A general challenge with benchmarks for LLMs is that the most prominent ones, such as HumanEval and MBPP, use Python, introducing a bias towards this specific programming language and its idiosyncrasies. Since model performance is measured against these benchmarks, researchers often optimize for them. As a result, performance may degrade if programming languages other than Python are used.

Many closed-source models, such as those released by OpenAI, achieve exceptional performance on certain tasks but lack transparency and reproducibility [5], [18], [19]. Benchmark leaderboards, particularly for code generation, are led by closed-source models [5], [19]. While researchers should compare performance against these models, they must consider that providers might discontinue them or apply undisclosed pre- or post-processing beyond the researcher’s control (see also Section Use an Open LLM as a Baseline).

Challenges with individual metrics include, for example, that BLEU-N is a syntactic metric and hence does not measure semantic or structural correctness. Thus, a high BLEU-N score does not directly indicate that the generated code is executable. While alternatives exist, they often come with their own limitations. For instance, Exact Match is a strict measure that does not account for code that is functionally equivalent but syntactically different. Execution-based metrics (e.g., pass@k) directly evaluate correctness by running test cases, but they require a setup with an execution environment. When researchers observe unexpected values for certain metrics, the specific results should be investigated in more detail to uncover potential problems. Such problems can, for example, be related to formatting, since code formatting highly influences metrics such as BLEU-N or Exact Match.

Another challenge to consider is that metrics usually capture one specific aspect of a task or solution. For instance, metrics such as pass@k do not reflect qualitative aspects of code such as maintainability, cognitive load, or readability. These aspects are critical for the downstream task and influence the overall usability. Moreover, benchmarks are isolated test sets and may not fully represent real-world applications. For example, benchmarks such as HumanEval synthesize code based on written specifications. However, such explicit descriptions are rare in real-world applications. Thus, evaluating the model performance with benchmarks might not reflect real-world tasks and end-user usability.

Finally, benchmark data contamination [20] continues to be a major challenge as well. In many cases, the training dataset of an LLM is not released together with the model, and the benchmark itself could be part of that training dataset. Such benchmark contamination may lead to the model remembering the actual solution from the training data rather than solving the new task based on the seen data, which results in artificially high performance on the benchmark. On unseen scenarios, however, the model might perform much worse.

Study Types

TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [21]).

This guideline MUST be followed for all study types that evaluate the performance of plain LLMs on a given task.

For example, if you use an LLM as an annotator and the research goal is to assess which model annotates best based on a test dataset, you MUST report an appropriate metric that reflects the nature of your task and you SHOULD disclose the dataset used for evaluation. Annotation tasks can vary significantly: Are multiple labels allowed for the same sequence? Are the available labels predefined, or should the LLM generate a set of labels independently? Due to this task dependence, you SHOULD justify your choice of metric, explaining what aspects of the task it captures and what its limitations are.
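
As an illustration of such an annotation setup, the following minimal sketch compares hypothetical LLM-produced labels against human gold labels for a single-label task with a predefined label set, using scikit-learn to compute per-class precision, recall, F1-score, accuracy, and Cohen's kappa:

```python
from sklearn.metrics import classification_report, cohen_kappa_score

# Hypothetical gold labels from human annotators and labels produced by an LLM
# for a single-label annotation task with a predefined label set.
gold = ["bug", "feature", "bug", "question", "feature", "bug"]
llm = ["bug", "feature", "bug", "feature", "feature", "question"]

# Per-class precision, recall, F1-score, plus overall accuracy
print(classification_report(gold, llm, zero_division=0))

# Chance-corrected agreement between the LLM and the human annotators
print("Cohen's kappa:", cohen_kappa_score(gold, llm))
```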

If you are conducting a well-established task, such as code generation, you SHOULD report standard metrics like pass@k and compare to other models.

TODO: Describe more study types. Copy common outline for this section.

References

[1] D. E. Perry, A. A. Porter, and L. G. Votta, “Empirical studies of software engineering: A roadmap,” in 22nd international conference on on software engineering, future of software engineering track, ICSE 2000, limerick ireland, june 4-11, 2000, A. Finkelstein, Ed., ACM, 2000, pp. 345–355. doi: 10.1145/336512.336586.

[2] W. Hasselbring, “Benchmarking as empirical standard in software engineering research,” in EASE 2021: Evaluation and assessment in software engineering, trondheim, norway, june 21-24, 2021, R. Chitchyan, J. Li, B. Weber, and T. Yue, Eds., ACM, 2021, pp. 365–372. doi: 10.1145/3463274.3463361.

[3] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: A method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the association for computational linguistics, july 6-12, 2002, philadelphia, PA, USA, ACL, 2002, pp. 311–318. doi: 10.3115/1073083.1073135.

[4] J. Austin et al., “Program synthesis with large language models,” CoRR, vol. abs/2108.07732, 2021, Available: https://arxiv.org/abs/2108.07732

[5] X. Du et al., “ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation,” CoRR, vol. abs/2308.01861, 2023, doi: 10.48550/ARXIV.2308.01861.

[6] N. Jain et al., “LiveCodeBench: Holistic and contamination free evaluation of large language models for code,” CoRR, vol. abs/2403.07974, 2024, doi: 10.48550/ARXIV.2403.07974.

[7] C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world github issues?” in The twelfth international conference on learning representations, ICLR 2024, vienna, austria, may 7-11, 2024, OpenReview.net, 2024. Available: https://openreview.net/forum?id=VTF8yNQM66

[8] M.-A. Lachaux, B. Rozière, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” CoRR, vol. abs/2006.03511, 2020, Available: https://arxiv.org/abs/2006.03511

[9] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.

[10] S. Ren et al., “CodeBLEU: A method for automatic evaluation of code synthesis,” CoRR, vol. abs/2009.10297, 2020, Available: https://arxiv.org/abs/2009.10297

[11] A. Eghbali and M. Pradel, “CrystalBLEU: Precisely and efficiently measuring the similarity of code,” in 37th IEEE/ACM international conference on automated software engineering, ASE 2022, rochester, MI, USA, october 10-14, 2022, ACM, 2022, pp. 28:1–28:12. doi: 10.1145/3551349.3556903.

[12] S. Kulal et al., “SPoC: Search-based pseudocode to code,” CoRR, vol. abs/1906.04908, 2019, Available: http://arxiv.org/abs/1906.04908

[13] M. Chen et al., “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021, Available: https://arxiv.org/abs/2107.03374

[14] B. Rozière et al., “Code llama: Open foundation models for code,” CoRR, vol. abs/2308.12950, 2023, doi: 10.48550/ARXIV.2308.12950.

[15] D. Guo et al., “DeepSeek-coder: When the large language model meets programming - the rise of code intelligence,” CoRR, vol. abs/2401.14196, 2024, doi: 10.48550/ARXIV.2401.14196.

[16] B. Hui et al., “Qwen2.5-coder technical report,” CoRR, vol. abs/2409.12186, 2024, doi: 10.48550/ARXIV.2409.12186.

[17] R. Li et al., “StarCoder: May the source be with you!” CoRR, vol. abs/2305.06161, 2023, doi: 10.48550/ARXIV.2305.06161.

[18] J. Li et al., “EvoCodeBench: An evolving code generation benchmark with domain-specific evaluations,” in Advances in neural information processing systems 38: Annual conference on neural information processing systems 2024, NeurIPS 2024, vancouver, BC, canada, december 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, Eds., 2024. Available: http://papers.nips.cc/paper_files/paper/2024/hash/6a059625a6027aca18302803743abaa2-Abstract-Datasets_and_Benchmarks_Track.html

[19] T. Y. Zhuo et al., “BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions,” CoRR, vol. abs/2406.15877, 2024, doi: 10.48550/ARXIV.2406.15877.

[20] C. Xu, S. Guan, D. Greene, and M. T. Kechadi, “Benchmark data contamination of large language models: A survey,” CoRR, vol. abs/2406.04244, 2024, doi: 10.48550/ARXIV.2406.04244.

[21] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.

Report Limitations and Mitigations

Recommendations

When using large language models (LLMs) in empirical studies in software engineering, researchers face unique challenges and potential limitations that can influence the validity, reliability, and reproducibility [1] of their findings, including:

Reproducibility: A cornerstone of open science is the ability to reproduce research results. Even though the inherent non-deterministic nature of LLMs is a strength in many use cases, its impact on reproducibility is a challenge. To enable reproducibility, researchers

  • MUST provide a replication package for their study and

  • SHOULD perform multiple evaluation iterations of their experiments (see Report Suitable Baselines, Benchmarks, and Metrics) to account for non-deterministic outputs of LLMs or

  • MAY reduce output variability by setting the temperature to 0 and reporting a seed value and system fingerprint, if the research goal and the LLM provider allow for it (see the sketch after this list).
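
The following minimal sketch illustrates this recommendation, assuming the OpenAI Python client and a hypothetical prompt; parameter support (e.g., seed, system fingerprint) differs between providers and models, and even with these settings outputs are not guaranteed to be fully deterministic:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# temperature=0 and a fixed seed reduce (but do not eliminate) output variability;
# the returned system fingerprint should be reported alongside the results,
# because backend changes can still alter outputs.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; use the model under study
    messages=[{"role": "user", "content": "Summarize the following bug report: ..."}],
    temperature=0,
    seed=42,
)

print(response.choices[0].message.content)
print("system_fingerprint:", response.system_fingerprint)  # may be None for some models
```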

Besides non-determinism, the behavior of an LLM depends on many external factors, such as the model version, API updates, or prompt variations. To ensure reproducibility, researchers SHOULD additionally follow the guidelines Report Model Version and Configuration and Report Prompts and their Development.

Generalization: Even though the topic of generalizability is not new, it has gained new relevance with the increasing interest in LLMs. In LLM studies, generalizability boils down to two main concerns: First, are the results specific to the LLM used, or can they also be achieved with other LLMs?

  • If generalizability to other LLMs is not in the scope of the research, this MUST be clearly explained or

  • if generalizability is in scope, researchers MUST compare their results, or subsets thereof (if a full comparison is not possible, e.g., due to computational cost), with other LLMs that are similar (e.g., in size or general performance) to assess the generalizability of their findings.

Second, will these results still be valid in the future? Multiple studies ([2], [3]) found that the performance of proprietary LLMs (such as GPT) within the same version (e.g., GPT-4) decreased over time. Reporting the model version and configuration is not sufficient in such cases. To date, the only way of mitigating this limitation is the use of an open LLM with transparently communicated versioning and archiving (see Use an Open LLM as a Baseline). Hence, researchers

  • SHOULD employ open LLMs to set a reproducible baseline (see Use an Open LLM as a Baseline) and

  • if employing an open LLM is not possible, researchers SHOULD test and report their results over an extended period of time as a proxy for the results’ validity over time.

Data Leakage: Data leakage, contamination, or overfitting occurs when information from outside the training dataset influences the model, leading to overly optimistic performance estimates. With the growing reliance on large datasets, the risk of inter-dataset duplication increases (e.g., [4], [5]). In the context of LLMs for software engineering, this can, for example, manifest as samples from the pre-training dataset appearing in the fine-tuning or evaluation dataset, potentially compromising the validity of evaluation results [6]. For example, ChatGPT’s “Improve the model for everyone“ setting can result in unintentional data leakage. Hence, to ensure the validity of the evaluation results, researchers

  • SHOULD carefully curate the fine-tuning and evaluation datasets to prevent inter-dataset duplication (a minimal overlap check is sketched after this list) and

  • MUST NOT leak their fine-tuning or evaluation datasets into the improvement process of the LLM.

  • If information about the pre-training dataset of the employed LLM is available, researchers SHOULD assess the inter-dataset duplication and MUST discuss the potential data leakage.

  • If training an LLM from scratch, researchers MAY consider using open datasets (such as Together AI’s RedPajama [7]) that already incorporate deduplication (with the positive side effect of potentially improving performance [8]).
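
The following minimal sketch illustrates a coarse overlap check between a fine-tuning and an evaluation dataset based on normalized exact matches; the datasets shown are hypothetical, and real deduplication pipelines typically also use token-based or near-duplicate detection:

```python
import hashlib

def normalize(code: str) -> str:
    """Coarse normalization: remove leading/trailing and per-line whitespace differences."""
    return "\n".join(line.strip() for line in code.strip().splitlines())

def fingerprint(code: str) -> str:
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()

# Hypothetical datasets of code samples
fine_tuning_set = ["def add(a, b):\n    return a + b", "def sub(a, b):\n    return a - b"]
evaluation_set = ["def add(a, b):\n  return a + b", "def mul(a, b):\n    return a * b"]

train_fingerprints = {fingerprint(sample) for sample in fine_tuning_set}
duplicates = [sample for sample in evaluation_set if fingerprint(sample) in train_fingerprints]

print(f"{len(duplicates)} of {len(evaluation_set)} evaluation samples also occur in the fine-tuning set")
```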

Scalability and Cost: Conducting studies with LLMs is a resource-demanding endeavor. For self-hosted LLMs, the respective hardware needs to be provided; for managed LLMs, the service costs have to be considered. The challenge becomes more pronounced as LLMs grow larger, research architectures get more complex, and experiments become more computationally expensive, e.g., due to multiple repetitions to assess performance in the face of non-determinism (see Report Suitable Baselines, Benchmarks, and Metrics). Consequently, resource-intensive research remains predominantly the domain of well-funded researchers, hindering researchers with limited resources from replicating or extending the study results. Hence, for transparency reasons, researchers

  • SHOULD report the cost associated with executing the study.

  • If the study employed self-hosted LLMs, researchers SHOULD report the hardware used.

  • If the study employed managed LLMs, the service cost SHOULD be reported.

  • To ensure research result validity and replicability, researchers MUST provide the LLM outputs as evidence for validation at different granularities (e.g., outputs of individual LLMs when employing multi-agent systems).

  • Additionally, researchers SHOULD include a subset of the employed validation dataset, selected using an accepted sampling strategy, to allow partial replication of the results.

Misleading Performance Metrics: While metrics such as BLEU or ROUGE are commonly used to evaluate the performance of LLMs, they may not capture other relevant, software engineering-specific aspects such as functional correctness or the runtime performance of automatically generated code [9]. Researchers

  • SHOULD clarify and state all relevant requirements and employ and report metrics that measure how well these requirements are satisfied (e.g., test-case success rate) and

  • SHOULD follow the best practices described in Report Suitable Baselines, Benchmarks, and Metrics.

Ethical Concerns with Sensitive Data: Sensitive data can range from personal to proprietary data, each with its own set of ethical concerns. A major threat when using proprietary LLMs with sensitive data is that the data may be used for model improvements (see “Data Leakage“). Hence, using sensitive data can lead to privacy and IP violations. Another threat is the implicit bias of LLMs, potentially leading to discrimination or unfair treatment of individuals or groups. To mitigate these concerns, researchers:

  • SHOULD minimize the sensitive data used in their studies and

  • MUST follow applicable regulations (e.g., GDPR) and individual processing agreements and

  • SHOULD create a data management plan outlining how the data is handled and protected against leakage and discrimination and

  • SHOULD apply for approval from the ethics committee of their organization (if required).

Performance and Resource Consumption: “The field of AI is currently primarily driven by research that seeks to maximize model accuracy — progress is often used synonymously with improved prediction quality. This endless pursuit of higher accuracy over the decade of AI research has significant implications for computational resource requirements and environmental footprint. To develop AI technologies responsibly, we must achieve competitive model accuracy at a fixed or even reduced computational and environmental cost.” [10]

The performance of an LLM is usually measured in terms of traditional metrics such as accuracy, precision, and recall, or more contemporary metrics such as pass@k or BLEU-N (see Report Suitable Baselines, Benchmarks, and Metrics). However, given how resource-hungry LLMs are, resource consumption has to become a key performance indicator for assessing research progress responsibly. While research has predominantly focused on the energy consumption of the early phases of LLMs (e.g., data center manufacturing, data acquisition, training), inference, i.e., the use of the LLM, often becomes similarly or even more resource-intensive ([11], [10], [12], [13], [14]). Hence, researchers:

  • SHOULD aim for lower resource consumption on the model side. This can be achieved by selecting smaller (e.g., GPT 4o mini instead of GPT 4o) or newer models as a base model for the study or by employing techniques such as model pruning, quantization, knowledge distillation, etc. [14].

  • SHOULD reduce resource consumption when using LLMs, e.g., by restricting the number of queries, input tokens, or output tokens [14], by using different prompt engineering techniques (e.g., on average, zero-shot prompts seem to emit less CO2 than chain-of-thought prompts), or by carefully sampling smaller datasets for fine-tuning and evaluation instead of using large datasets in their entirety (a measurement sketch follows after this list).
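
One way to quantify the resource consumption of a study is to track energy use and estimated emissions while the experiments run. The following minimal sketch assumes the third-party codecarbon package and a placeholder experiment function; other measurement tools and hardware-level meters are equally valid options:

```python
from codecarbon import EmissionsTracker  # third-party package; one of several measurement options

def run_all_experiments() -> None:
    # Placeholder for the actual study code (e.g., LLM inference loops).
    for _ in range(1_000_000):
        pass

tracker = EmissionsTracker(project_name="llm-study")
tracker.start()
run_all_experiments()
emissions_kg = tracker.stop()  # estimated kg CO2-equivalent for the tracked block

# By default, results are also persisted to a local CSV file for later reporting.
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```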

To report the environmental impact of a study, researchers

Example(s)

Reproducibility: An example highlighting the need for caution regarding replicability is the study by Staudinger et al. [15], who attempted to replicate the results of a previous LLM study that did not provide a replication package. They were not able to reproduce the exact results, even though they observed trends similar to the original study, and consider their results not reliable enough for a systematic review.

Generalization: To analyze whether the results of proprietary LLMs transfer to open LLMs, Staudinger et al. [15] benchmarked previous results obtained with GPT-3.5 and GPT-4 against Mistral and Zephyr. They found that the employed open-source models could not deliver the same performance as the proprietary models, i.e., the effect was restricted to certain proprietary models. Individual studies have already started to highlight the uncertainty about the future generalizability of their results. In [16], Jesse et al. acknowledge that LLMs evolve over time and that this evolution might impact the study’s results.

Data Leakage: Since much research in software engineering revolves around code, inter-dataset code duplication has been extensively researched and addressed over the years to curate deduplicated datasets (e.g., by Lopes et al. in 2017 [4], Allamanis in 2019 [5], Karmakar et al. in 2023 [17], or López et al. in 2025 [6]). The issue of inter-dataset duplication has also attracted interest in other disciplines with growing demands for data mining. For example, in biology, Lakiotaki et al. [18] acknowledge and address the overlap between multiple common disease datasets. In the domain of code generation, Coignion et al. [19] evaluated the performance of LLMs in producing LeetCode solutions. To mitigate the issue of inter-dataset duplication, they only used LeetCode problems published after 2023-01-01, reducing the likelihood that LLMs had seen those problems before. Further, they discuss the performance differences of LLMs on different datasets in light of potential inter-dataset duplication. In [20], Zhou et al. performed an empirical evaluation of data leakage in 83 software engineering benchmarks. While most benchmarks suffer from minimal leakage, a few suffered from leakage of up to 100%, and they found a high impact of data leakage on the performance evaluation. A starting point for studies that aim to assess and mitigate inter-dataset duplication are the Falcon LLMs. The Technology Innovation Institute publicly provides access to parts of the training data of the Falcon LLMs [21] via Hugging Face. Through this dataset, it is possible to reduce the overlap between pre-training and evaluation data, improving the validity of the evaluation results. A starting point to prevent actively leaking data into an LLM improvement process is to ensure that the data is not used to train the model (e.g., via OpenAI’s data control functionality, or by using the OpenAI API instead of the web interface) [22].

Scalability and Cost: An example of stating the cost of a study can be found in a study by Staudinger et al. [15] who specified the costs for the study as “120 USD in API calls for GPT 3.5 and GPT 4, and 30 USD in API calls for Mistral AI. Thus, the total LLM cost of our reproducibility study was 150 USD”.

Ethical Concerns with Sensitive Data: Bias can occur in datasets as well as in LLMs that have been trained on them and may result in various types of discrimination. Gallegos et al. propose metrics to quantify biases in various tasks (e.g., text generation, classification, question answering) [23].

Performance and Resource Consumption: In their study, Tinnes et al. [24] balanced the dataset size between the need for manual semantic analysis and computational resource consumption.

Advantages

Reproducibility: Ensuring reproducibility allows for the independent replication and verification of study results. Reproducing study results under similar conditions by different parties greatly increases their validity and promotes transparency in research. Independent verification is of particular importance in studies involving LLMs, due to the randomness of their outputs and the potential for biases in their predictions and in training, fine-tuning, and evaluation datasets.

Generalization: Mitigating threats to generalizability through the integration of open LLMs as a baseline or the reporting of results over an extended period of time can increase the validity, reliability, and replicability of a study’s results.

Data Leakage: Assessing and mitigating the effects of inter-dataset duplication strengthens a study’s validity and reliability, as it prevents overly optimistic performance estimates that do not apply to previously unknown samples.

Scalability and Cost: Reporting the cost associated with executing a study not only increases transparency but also supports secondary literature in setting primary research into perspective. Providing replication packages entailing direct LLM output evidence as well as samples for partial replicability are paramount steps towards open and inclusive research in the light of resource inequality among researchers.

Performance and Resource Consumption: Mindfully deciding and justifying the usage of LLMs over other approaches can lead to more efficient and sustainable approaches. Reporting the environmental impact of the usage of LLMs also sets the stage for more sustainable research practices in the field of AI.

Challenges

Generalization: With commercial LLMs evolving over time, the generalizability of results to future versions of the model is uncertain. Employing open LLMs as a baseline can mitigate this limitation, but may not always be feasible due to computational cost.

Data Leakage: Most LLM providers do not publicly offer information about the datasets employed for pre-training, impeding the assessment of inter-dataset duplication effects.

Scalability and Cost: Consistently keeping track of and reporting the cost involved in a research endeavor is challenging. Building a coherent replication package that includes LLM outputs and samples for partial replicability requires additional effort and resources.

Misleading Performance Metrics: Defining all requirements beforehand to ensure the usage of suitable metrics can be challenging, especially in exploratory research. In this growing field of research, finding the right metrics to evaluate the performance of LLMs on specific software engineering tasks is challenging. The guideline Report Suitable Baselines, Benchmarks, and Metrics can serve as a starting point.

Ethical Concerns with Sensitive Data: Ensuring compliance across jurisdictions is difficult with different regions having different regulations and requirements (e.g., GDPR and the AI Act in the EU, CCPA in California). Selecting datasets and models with less bias is challenging, as the bias in LLMs is often not transparently reported.

Performance and Resource Consumption: Measuring or estimating the environmental impact of a study is challenging and might not always be feasible. Especially in exploratory research, the impact is hard to estimate beforehand, making it difficult to justify the usage of LLMs over other approaches.

Study Types

The limitations and mitigations SHOULD be followed for all study types in a sensible manner, i.e., depending on their applicability to the individual study.

References

[1] J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: The threats of using llms in software engineering,” in Proceedings of the 2024 ACM/IEEE 44th international conference on software engineering: New ideas and emerging results, 2024, pp. 102–106.

[2] L. Chen, M. Zaharia, and J. Zou, “How is ChatGPT’s behavior changing over time?” CoRR, vol. abs/2307.09009, 2023, doi: 10.48550/ARXIV.2307.09009.

[3] D. Li, K. Gupta, M. Bhaduri, P. Sathiadoss, S. Bhatnagar, and J. Chong, “Comparing GPT-3.5 and GPT-4 accuracy and drift in radiology diagnosis please cases,” Radiology, vol. 310, no. 1, p. e232411, 2024, doi: 10.1148/radiol.232411.

[4] C. V. Lopes et al., “Déjàvu: A map of code duplicates on GitHub,” Proc. ACM Program. Lang., vol. 1, no. OOPSLA, pp. 84:1–84:28, 2017, doi: 10.1145/3133908.

[5] M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in Proceedings of the 2019 ACM SIGPLAN international symposium on new ideas, new paradigms, and reflections on programming and software, onward! 2019, athens, greece, october 23-24, 2019, H. Masuhara and T. Petricek, Eds., ACM, 2019, pp. 143–153. doi: 10.1145/3359591.3359735.

[6] J. A. H. López, B. Chen, M. Saad, T. Sharma, and D. Varró, “On inter-dataset code duplication and data leakage in large language models,” IEEE Trans. Software Eng., vol. 51, no. 1, pp. 192–205, 2025, doi: 10.1109/TSE.2024.3504286.

[7] Together Computer, RedPajama: An open dataset for training large language models. (2023). Available: https://github.com/togethercomputer/RedPajama-Data

[8] K. Lee et al., “Deduplicating training data makes language models better,” in Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), ACL 2022, dublin, ireland, may 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., Association for Computational Linguistics, 2022, pp. 8424–8445. doi: 10.18653/V1/2022.ACL-LONG.577.

[9] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation,” in Advances in neural information processing systems 36: Annual conference on neural information processing systems 2023, NeurIPS 2023, new orleans, LA, USA, december 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., 2023. Available: http://papers.nips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html

[10] C.-J. Wu et al., “Sustainable AI: Environmental implications, challenges and opportunities,” in Proceedings of the fifth conference on machine learning and systems, MLSys 2022, santa clara, CA, USA, august 29 - september 1, 2022, D. Marculescu, Y. Chi, and C.-J. Wu, Eds., mlsys.org, 2022. Available: https://proceedings.mlsys.org/paper_files/paper/2022/hash/462211f67c7d858f663355eff93b745e-Abstract.html

[11] A. de Vries, “The growing energy footprint of artificial intelligence,” Joule, vol. 7, no. 10, pp. 2191–2194, 2023.

[12] Z. Fu, F. Chen, S. Zhou, H. Li, and L. Jiang, “LLMCO2: Advancing accurate carbon footprint prediction for LLM inferences,” CoRR, vol. abs/2410.02950, 2024, doi: 10.48550/ARXIV.2410.02950.

[13] P. Jiang, C. Sonne, W. Li, F. You, and S. You, “Preventing the immense increase in the life-cycle energy and carbon footprints of LLM-powered intelligent chatbots,” Engineering, vol. 40, pp. 202–210, 2024, doi: 10.1016/j.eng.2024.04.002.

[14] N. E. Mitu and G. T. Mitu, “The hidden cost of AI: Carbon footprint and mitigation strategies,” Available at SSRN 5036344, 2024.

[15] M. Staudinger, W. Kusa, F. Piroi, A. Lipani, and A. Hanbury, “A reproducibility and generalizability study of large language models for query generation,” in Proceedings of the 2024 annual international ACM SIGIR conference on research and development in information retrieval in the asia pacific region, SIGIR-AP 2024, tokyo, japan, december 9-12, 2024, T. Sakai, E. Ishita, H. Ohshima, F. Hasibi, J. Mao, and J. M. Jose, Eds., ACM, 2024, pp. 186–196. doi: 10.1145/3673791.3698432.

[16] K. Jesse, T. Ahmed, P. T. Devanbu, and E. Morgan, “Large language models and simple, stupid bugs,” in 20th IEEE/ACM international conference on mining software repositories, MSR 2023, melbourne, australia, may 15-16, 2023, IEEE, 2023, pp. 563–575. doi: 10.1109/MSR59073.2023.00082.

[17] A. Karmakar, M. Allamanis, and R. Robbes, “JEMMA: An extensible java dataset for ML4Code applications,” Empir. Softw. Eng., vol. 28, no. 2, p. 54, 2023, doi: 10.1007/S10664-022-10275-7.

[18] K. Lakiotaki, N. Vorniotakis, M. Tsagris, G. Georgakopoulos, and I. Tsamardinos, “BioDataome: A collection of uniformly preprocessed and automatically annotated datasets for data-driven biology,” Database J. Biol. Databases Curation, vol. 2018, p. bay011, 2018, doi: 10.1093/DATABASE/BAY011.

[19] T. Coignion, C. Quinton, and R. Rouvoy, “A performance study of LLM-generated code on leetcode,” in Proceedings of the 28th international conference on evaluation and assessment in software engineering, EASE 2024, salerno, italy, june 18-21, 2024, ACM, 2024, pp. 79–89. doi: 10.1145/3661167.3661221.

[20] X. Zhou et al., “LessLeak-bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks.” 2025. Available: https://arxiv.org/abs/2502.06215

[21] Technology Innovation Institute, “Falcon-refinedweb (revision 184df75).” Hugging Face, 2023. doi: 10.57967/hf/0737.

[22] S. Balloccu, P. Schmidtová, M. Lango, and O. Dusek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs,” in Proceedings of the 18th conference of the european chapter of the association for computational linguistics, EACL 2024 - volume 1: Long papers, st. Julian’s, malta, march 17-22, 2024, Y. Graham and M. Purver, Eds., Association for Computational Linguistics, 2024, pp. 67–93. Available: https://aclanthology.org/2024.eacl-long.5

[23] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” CoRR, vol. abs/2309.00770, 2023, doi: 10.48550/ARXIV.2309.00770.

[24] C. Tinnes, A. Welter, and S. Apel, “Software model evolution with large language models: Experiments on simulated, public, and industrial datasets.”