Guidelines

This set of guidelines is currently a DRAFT and based on a discussion session with researchers at the 2024 International Software Engineering Research Network (ISERN) meeting and at the 2nd Copenhagen Symposium on Human-Centered Software Engineering AI. This draft is meant as a starting point for further discussions in the community with the aim of developing a common understanding of how we should conduct and report empirical studies involving large language models (LLMs). See also the pages on study types and scope.

The wording of the recommendations (MUST, SHOULD, MAY) follows RFCs 2119 and 8174. Throughout the guidelines, we mention which information we expect to be reported in the PAPER or in the SUPPLEMENTARY MATERIAL. We are aware that different venues have different page limits and that not everything can be reported in the paper. It is, of course, better to report essential information that we expect in the paper in the supplementary material than not at all. Unless explicitly mentioned otherwise, we expect information to be reported in the PAPER and/or the SUPPLEMENTARY MATERIAL. If information MUST be reported in the PAPER, we explicitly mention this in the specific guidelines.

Overview

  1. Declare LLM Usage and Role
  2. Report Model Version, Configuration, and Customizations
  3. Report Tool Architecture beyond Models
  4. Report Prompts, their Development, and Interaction Logs
  5. Use Human Validation for LLM Outputs
  6. Use an Open LLM as a Baseline
  7. Report Suitable Baselines, Benchmarks, and Metrics
  8. Report Limitations and Mitigations

Declare LLM Usage and Role

Recommendations

When conducting any kind of empirical study involving LLMs, researchers MUST clearly declare that an LLM was used in a suitable section of the PAPER (e.g., in the introduction or research methods section). For authoring scientific articles, such transparency is, for example, required by the ACM Policy on Authorship: “The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work” [1]. Beyond generic authorship declarations and declarations that LLMs were used as part of the research process, researchers SHOULD report in the PAPER the exact purpose of using an LLM in a study, the tasks it was used to automate, and the expected outcomes.

Example(s)

The ACM Policy on Authorship suggests disclosing the usage of Generative AI tools in the acknowledgements section of papers, for example: “ChatGPT was utilized to generate sections of this Work, including text, tables, graphs, code, data, citations” [1]. Similarly, the acknowledgements section could also be used to disclose Generative AI usage for other aspects of the research, if not explicitly described in other parts of the paper. The ACM policy further suggests: “If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work” [1].

An example of an LLM disclosure can be found in a recent paper by Lubos et al. [2], in which the authors write in the methodology section:

“We conducted an LLM-based evaluation of requirements utilizing the Llama 2 language model with 70 billion parameters, fine-tuned to complete chat responses…”

Advantages

Transparency about the usage of LLMs helps readers understand the context and scope of a study, facilitating better interpretation and comparison of results. Beyond this declaration, we recommend that authors be explicit about the LLM version they used (see Section Report Model Version, Configuration, and Customizations) and the LLM’s exact role (see Section Report Tool Architecture beyond Models).

Challenges

We do not expect any challenges for researchers following this guideline.

Study Types

This guideline MUST be followed for all study types.

References

[1] Association for Computing Machinery, “ACM Policy on Authorship.” https://www.acm.org/publications/policies/new-acm-policy-on-authorship, 2023.

[2] S. Lubos et al., “Leveraging LLMs for the quality assurance of software requirements,” in 32nd IEEE international requirements engineering conference, RE 2024, reykjavik, iceland, june 24-28, 2024, G. Liebel, I. Hadar, and P. Spoletini, Eds., IEEE, 2024, pp. 389–397. doi: 10.1109/RE59067.2024.00046.

Report Model Version, Configuration, and Customizations

Recommendations

LLMs or LLM-based tools, especially those offered as-a-service, are frequently updated; different versions may produce different results for the same input. Moreover, configuration parameters such as the temperature affect content generation. Therefore, researchers MUST document the specific model or tool version used in a study, along with the date when the experiments were conducted, and the exact configuration that was used, in the PAPER. Since default values might change over time, researchers SHOULD always report all configuration values, even if they used the defaults. Depending on the study context, other properties such as the context window size (number of tokens) MAY be reported. Researchers SHOULD motivate in the PAPER why they selected certain models, versions, and configurations. Potential reasons can be monetary (e.g., no funding to integrate large commercial models), technical (e.g., existing hardware only supports smaller models), or methodological (e.g., planned comparison to previous work). Depending on the specific study context, additional information regarding tool architecture or the experiment setup SHOULD be reported (see Section Report Tool Architecture beyond Models).

A common customization approach for existing LLMs is fine-tuning. If a model was fine-tuned, researchers MUST describe the fine-tuning goal (e.g., improving the performance for a specific task), the fine-tuning procedure (e.g., full fine-tuning vs. Low-Rank Adaptation (LoRA), selected hyperparameters, loss function, learning rate, batch size, etc.), and the fine-tuning dataset (e.g., data sources, the preprocessing pipeline, dataset size) in the PAPER. Researchers SHOULD either share the fine-tuning dataset as part of the SUPPLEMENTARY MATERIAL or explain in the PAPER why the data cannot be shared (e.g., because it contains confidential or personal data that could not be anonymized). The same applies to the fine-tuned model weights. Suitable benchmarks and metrics SHOULD be used to compare the base model to the fine-tuned model (see Section Report Suitable Baselines, Benchmarks, and Metrics).

In summary, our recommendation is to report:

  • Model/tool name and version (including a checksum/fingerprint if available) (MUST, PAPER).

  • All relevant configured parameters that affect output generation (e.g., temperature, seed values, consideration of historical context) (MUST, PAPER).

  • Default values of all available parameters (SHOULD).

  • Additional properties (e.g., context window size) (MAY).

For fine-tuned models, additional recommendations apply (see the sketch after this list):

  • Fine-tuning goal (e.g., improving the performance for a specific task) (MUST, PAPER).

  • Fine-tuning dataset creation and characterization (e.g., data collection, preprocessing, descriptive statistics) (MUST, PAPER).

  • Fine-tuning parameters and procedure (e.g., learning rate, epochs, batch size, next token prediction, instruction tuning) (MUST, PAPER).

  • Fine-tuning dataset and fine-tuned model weights (SHOULD).

  • Validation metrics and benchmarks (SHOULD).
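
The fine-tuning details listed above can additionally be captured in a machine-readable record that accompanies the SUPPLEMENTARY MATERIAL. The following is a hedged sketch in Python; all field names and values are illustrative, not prescriptive:

finetuning_record = {
    "goal": "improve requirements-classification accuracy",
    "base_model": "llama-3.1-8b-instruct",  # hypothetical base model
    "method": "LoRA",
    "hyperparameters": {"rank": 8, "learning_rate": 2e-4, "epochs": 3, "batch_size": 16},
    "dataset": {
        "sources": ["industrial requirements (anonymized)"],
        "size": 4200,
        "preprocessing": "deduplication, removal of confidential identifiers",
    },
    "validation": {"benchmark": "held-out test split", "metric": "F1-score"},
}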

Commercial models (e.g., GPT-4o) or LLM-based tools (e.g., ChatGPT) might not give researchers access to all required information. Our suggestion is to report what is available and to openly acknowledge limitations that hinder reproducibility (see also Sections Report Prompts, their Development, and Interaction Logs and Report Limitations and Mitigations).

Example(s)

Based on the documentation that OpenAI and Azure provide [1], [2], researchers might, for example, report:

“We integrated a gpt-4 model in version 0125-Preview via the Azure OpenAI Service, and configured it with a temperature of 0.7, top_p set to 0.8, a maximum token length of 512, and the seed value 23487. We ran our experiment on 10th January 2025 (system fingerprint fp_6b68a8204b).”

Kang et al. provide a similar statement in their paper on exploring LLM-based general bug reproduction [3]:

“We access OpenAI Codex via its closed beta API, using the code-davinci-002 model. For Codex, we set the temperature to 0.7, and the maximum number of tokens to 256.”

Our guidelines additionally suggest reporting a checksum/fingerprint and exact dates, but otherwise this example is close to our recommendations.
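
A minimal sketch of how such a configuration can be pinned and logged programmatically is shown below, using the OpenAI Python SDK (v1.x assumed) rather than the Azure service; the model name, prompt, and parameter values are illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-0125-preview",  # pinned model version
    messages=[{"role": "user", "content": "Summarize this bug report: ..."}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    seed=23487,  # best-effort reproducibility (beta feature)
)

# Record the metadata needed for the paper and supplementary material.
print(response.model, response.system_fingerprint)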

Similar statements can be made for self-hosted models. However, when self-hosting models, the SUPPLEMENTARY MATERIAL can become a true replication package, providing specific instructions for reproducing study results. For example, for models provisioned using ollama, one can report the specific tag and checksum of the model being used, e.g., “llama3.3, tag 70b-instruct-q8_0, checksum d5b5e1b84868.” Given suitable hardware, running the corresponding model in its default configuration is then as easy as executing one command in the command line (see also Section Use an Open LLM as a Baseline):

ollama run llama3.3:70b-instruct-q8_0

An example of a study involving fine-tuning is Dhar et al.’s work [4]. They conducted an exploratory empirical study to assess whether LLMs can generate architectural design decisions. The authors detail the system architecture, including the decision-making framework, the role of the LLM in generating design decisions, and the interaction between the LLM and other components of the system. The study provides information on the fine-tuning approach and datasets used for evaluation, including the source of the architectural decision records, preprocessing methods, and the criteria for data selection.

Advantages

Reporting the information that we have outlined in our recommendations is a prerequisite for the verification, reproduction, and replication of LLM-based studies under the same or similar conditions. LLMs are inherently non-deterministic. However, this should not be an excuse to dismiss the verifiability and reproducibility of empirical studies involving LLMs. While exact reproducibility is hard to achieve, researchers can do their best to come as close as possible to that gold standard. Part of that effort is reporting the information outlined in this guideline. However, that information alone is generally not sufficient.

Challenges

Different model providers and modes of operating the models allow for varying degrees of information. For example, OpenAI provides a model version and a system fingerprint describing the backend configuration that can also influence the output. However, the fingerprint is only intended to detect changes to the model or its configuration; as a user, one cannot go back to a certain fingerprint. As a beta feature, OpenAI lets users set a seed parameter to receive “(mostly) consistent output” [5]. However, the seed value does not allow for full reproducibility, and the fingerprint changes frequently. While, as motivated above, open models significantly simplify re-running experiments, they also come with challenges in terms of reproducibility, as generated outputs can be inconsistent despite setting the temperature to 0 and using a seed value (see the GitHub issue for Llama3).

Study Types

This guideline MUST be followed for all study types for which the researcher has access to (parts of) the model’s configuration. They MUST always report the configuration that is visible to them, acknowledging the reproducibility challenges of commercial tools and models offered as-a-service.

Depending on the specific study type, researchers SHOULD provide additional information regarding the architecture of a tool they built (see Section Report Tool Architecture beyond Models), prompts and interaction logs (see Section Report Prompts, their Development, and Interaction Logs), and specific limitations and mitigations (see Section Report Limitations and Mitigations).

For example, when Studying LLM Usage in Software Engineering by focusing on commercial tools such as ChatGPT or GitHub Copilot, researchers MUST be as specific as possible in describing their study setup. The configured model name, version, and the date when the experiment was conducted MUST always be reported. In those cases, reporting other aspects such as prompts and interaction logs is essential.

References

[1] OpenAI, “OpenAI API Introduction.” https://platform.openai.com/docs/api-reference/chat/streaming, 2025.

[2] Microsoft, “Azure OpenAI Service models.” https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models, 2025.

[3] S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring LLM-based general bug reproduction,” in 45th IEEE/ACM international conference on software engineering, ICSE 2023, melbourne, australia, may 14-20, 2023, IEEE, 2023, pp. 2312–2323. doi: 10.1109/ICSE48619.2023.00194.

[4] R. Dhar, K. Vaidhyanathan, and V. Varma, “Can LLMs generate architectural design decisions? - an exploratory empirical study,” in 21st IEEE international conference on software architecture, ICSA 2024, hyderabad, india, june 4-8, 2024, IEEE, 2024, pp. 79–89. doi: 10.1109/ICSA59870.2024.00016.

[5] OpenAI, “How to make your completions outputs consistent with the new seed parameter.” https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter, 2023.

Report Tool Architecture Beyond Models

Recommendations

Oftentimes, LLM-based tools have complex software layers around the model (or models) that pre-process data, prepare prompts, filter user requests, or post-process responses. One example is ChatGPT, which allows users to select from different GPT models, including GPT-4o. GitHub Copilot allows users to select the same model, but answers may differ significantly between the two tools, as GitHub Copilot automatically adds context from the software project it is used in. Finally, researchers can build their own tools utilizing GPT-4o directly (e.g., via the OpenAI API). The infrastructure and business logic around the bare model can significantly contribute to the performance of a tool for a given task. Therefore, researchers MUST clearly describe the tool architecture and what exactly the LLM (or LLMs) contribute to the tool or method presented in a research paper.

If the LLM is used as a standalone system, for example, by sending prompts directly to a GPT-4o model via the OpenAI API without pre-processing the prompts or post-processing the responses, a brief explanation about this is usually sufficient. However, if LLMs are integrated into a more complex system with pre-processing, retrieval mechanisms, or autonomous agents, researchers MUST provide a more detailed description of the system architecture in the PAPER. Aspects to consider are how the LLM interacts with other components such as databases, external APIs, and frameworks. If the LLM is part of an agent-based system that autonomously plans, reasons, or executes tasks, researchers MUST describe its exact architecture, including the agents’ roles (e.g., planner, executor, coordinator), whether it is a single-agent or multi-agent system, how it interacts with external tools and users, and the reasoning framework used (e.g., chain-of-thought, self-reflection, multi-turn dialogue, tool usage).

Researchers SHOULD provide a high-level architectural diagram to improve transparency. To enhance clarity, researchers SHOULD explain design decisions, particularly regarding how the models were hosted and accessed (e.g., API-based, self-hosted) and which retrieval mechanisms were implemented (e.g., keyword search, semantic similarity matching, rule-based extraction). Researchers MUST NOT omit critical architectural details that could impact reproducibility, such as dependencies on proprietary tools that influence the tool behavior. Especially for time-sensitive measurements, the aforementioned description of the hosting environment is central, as it can significantly impact results. Researchers MUST clarify whether local infrastructure or cloud services were used, including detailed (virtual) hardware specifications and latency considerations.

If retrieval or augmentation methods were used (e.g., retrieval-augmented generation (RAG), rule-based retrieval, structured query generation, or hybrid approaches), researchers MUST describe how external data is retrieved, stored, and integrated into the LLM’s responses. This includes specifying the type of storage or database used (e.g., vector databases, relational databases, knowledge graphs) and how the retrieved information is selected and used. Stored data used for context augmentation MUST be reported, including details on data preprocessing, versioning, and update frequency. If this data is not confidential, an anonymized snapshot of the data used for context augmentation SHOULD be made available.
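
To illustrate the level of detail that helps readers, the following minimal sketch shows a retrieval step based on cosine similarity over precomputed snippet embeddings; the embedding model, the stored snippets, and the prompt format are assumptions that a real study would document as described above:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb, snippet_embs, snippets, k=1):
    # Rank stored snippets by similarity to the query embedding.
    scores = [cosine(query_emb, e) for e in snippet_embs]
    ranked = np.argsort(scores)[::-1][:k]
    return [snippets[i] for i in ranked]

# Dummy data standing in for embeddings produced by the study's embedding model.
snippets = ["def add(a, b): return a + b", "def sub(a, b): return a - b"]
snippet_embs = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
query_emb = np.array([0.8, 0.2])

context = "\n".join(retrieve(query_emb, snippet_embs, snippets, k=1))
prompt = f"Context:\n{context}\n\nTask: explain what the retrieved function does."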

For ensemble models, besides following the Report Model Version, Configuration, and Customizations guideline for each model, researchers MUST describe the architecture that connects the models. The PAPER MUST at least contain a high-level description, and details can be reported in the SUPPLEMENTARY MATERIAL. Aspects to consider are documenting the logic that determines which model handles which input, the interaction between models, and the architecture for combining outputs (e.g., majority voting, weighted averaging, sequential processing).
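
For instance, a simple output-combination strategy such as majority voting could be documented with a short sketch like the following (hypothetical labels; the routing and aggregation logic of a real ensemble would be described analogously):

from collections import Counter

def majority_vote(labels):
    # Combine per-model classifications; ties are resolved by first occurrence.
    return Counter(labels).most_common(1)[0][0]

# e.g., three models classifying a requirement as ambiguous or not
print(majority_vote(["ambiguous", "clear", "ambiguous"]))  # -> ambiguous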

Example(s)

Some empirical studies in software engineering involving LLMs have documented the architecture and supplemental data aligning with the recommended guidelines. Hereafter, we provide two examples.

Schäfer et al. conducted an empirical evaluation of using LLMs for automated unit test generation [1]. The authors provide a comprehensive description of the system architecture, detailing how the LLM is integrated into the software development workflow to analyze codebases and produce corresponding unit tests. The architecture includes components for code parsing, prompt formulation, interaction with the LLM, and integration of the generated tests into existing test suites. The paper also elaborates on the datasets utilized for training and evaluating the LLM’s performance in unit test generation. It specifies the sources of code samples, the selection criteria, and the preprocessing steps undertaken to prepare the data.

TODO: Sebastian: Where’s the second example?

Advantages

Usually, researchers implement software layers around the bare LLMs, using different architectural patterns. These implementations significantly impact the performance of LLM-based tools and hence need to be documented in detail. Documenting the architecture and supplemental data of LLM-based systems enhances reproducibility, transparency, and trust [2]. In empirical software engineering studies, this is essential for experiment replication, result validation, and benchmarking. A clear documentation of the architecture and supplemental data being used enables comparison and upholds scientific rigor and accountability, fostering reliable and reusable research.

Challenges

Researchers face challenges in documenting LLM-based architectures, including proprietary APIs and dependencies that restrict disclosure, managing large-scale retrieval databases, and ensuring efficient query execution. They must also balance transparency with data privacy concerns, adapt to the evolving nature of LLM integrations, and, depending on the context, handle the complexity of multi-agent interactions and decision-making logic, all of which can impact reproducibility and system clarity.

Study Types

This guideline MUST be followed for all studies that develop tools beyond bare LLMs, ranging from thin layers that pre-process user input and post-process LLM responses, through tools that use retrieval-augmented generation (RAG), to complex agent-based systems. In particular, it applies to the study types LLMs for New Software Engineering Tools and Benchmarking LLMs for Software Engineering Tasks.

TODO: Sebastian: Is there something we can say about study types from the LLMs for researchers section? Any papers that, e.g., use agent-based systems or RAG for research purposes?

References

[1] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Trans. Software Eng., vol. 50, no. 1, pp. 85–105, 2024, doi: 10.1109/TSE.2023.3334955.

[2] Q. Lu, L. Zhu, X. Xu, Z. Xing, and J. Whittle, “Toward responsible AI in the era of generative AI: A reference architecture for designing foundation model-based systems,” IEEE Softw., vol. 41, no. 6, pp. 91–100, 2024, doi: 10.1109/MS.2024.3406333.

Report Prompts, their Development, and Interaction Logs

Recommendations

Prompts are critical for any study involving LLMs. Depending on the task, prompts may include various types of content, including instructions, context, and input data. For software engineering tasks, the context can include source code, execution traces, error messages, and other forms of natural language content. Prompts significantly influence a model’s output, and understanding how exactly they were formatted and integrated in an LLM-based study is essential for transparency, verifiability, and reproducibility.

Researchers MUST report the complete prompts that were used, including all instructions, context, and input data. This can be done with a structured template that contains placeholders for dynamically added content (see the sketch below). Specifying the exact format is crucial. For example, when using code snippets, researchers should specify whether they were enclosed in markdown-style code blocks (indented by four space characters, enclosed in triple backticks, etc.), whether line breaks and whitespace were preserved, and whether additional annotations (e.g., comments) were included. Similarly, for other artifacts such as error messages or stack traces, researchers should explain how these were presented.
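
The following sketch illustrates such a structured template in Python; the wording, placeholders, and formatting conventions are examples rather than a recommended prompt:

PROMPT_TEMPLATE = """You are an experienced software engineer.

Task: {instruction}

Code under analysis (Python, whitespace preserved):
```python
{code_snippet}
```

Error message:
{error_message}

Answer with a numbered list of suggested fixes."""

prompt = PROMPT_TEMPLATE.format(
    instruction="Explain why the following function can raise an exception.",
    code_snippet="def div(a, b):\n    return a / b",
    error_message="ZeroDivisionError: division by zero",
)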

Researchers MUST also explain in the PAPER how they developed the prompts and why they decided to follow certain prompting strategies. They MUST specify whether zero-shot, one-shot, or few-shot prompting was used. For few-shot prompts, the examples provided to the model should be clearly outlined, along with the rationale for selecting them. If multiple versions of a prompt were tested, researchers should describe how these variations were evaluated and how the final design was chosen.

When dealing with extensive or complex prompts, such as those involving large codebases or multiple error logs, researchers MUST describe the strategies they used for handling input length constraints. Approaches might include truncating, summarizing, or splitting prompts into multiple parts. Token optimization measures, such as simplifying code formatting or removing unnecessary comments, MUST also be documented if applied.

In cases where prompts are generated dynamically—such as through preprocessing, template structures, or retrieval-augmented generation (RAG)—the process MUST be thoroughly documented. This includes explaining any automated algorithms or rules that influenced prompt generation. For studies involving human participants, where users might create or modify prompts themselves, researchers MUST describe how these prompts were collected and analyzed. If full disclosure is not feasible due to privacy or confidentiality concerns, summaries and representative examples SHOULD be provided.

To ensure full reproducibility, researchers MUST make all prompts and prompt variations publicly available as part of their SUPPLEMENTARY MATERIAL. If the full set of prompts is too extensive to include in the paper itself, researchers SHOULD still provide representative examples and describe variations in the main body of the paper. Since prompt effectiveness varies across models and model versions, researchers MUST make clear which prompts were used for which models in which versions and with which configuration (see Section Report Model Version, Configuration, and Customizations).

Prompt development is often iterative, involving collaboration between human researchers and AI tools. Researchers SHOULD report any instances where LLMs were used to suggest prompt refinements, as well as how those suggestions were incorporated. Furthermore, prompts may need to be revised in response to failure cases where the model produced incorrect or incomplete outputs. Iterative changes based on human feedback and pilot testing results SHOULD also be included in the documentation. A prompt changelog can help track and report the evolution of prompts throughout a research project, including key revisions, reasons for changes, and versioning (e.g., v1.0: initial prompt; v1.2: added output formatting; v2.0: incorporated examples of ideal responses). Pilot testing and prompt evaluation are vital for ensuring that prompts yield reliable results. If such testing was conducted, researchers SHOULD summarize key insights, including how different prompt variations affected output quality and which criteria were used to finalize the prompt design.

When trying to verify results, even with the exact same prompts, decoding strategies, and parameters, LLMs can still behave non-deterministically. Non-determinism can arise from batching, input preprocessing, and floating point arithmetic on GPUs [1]. Thus, to enable other researchers to verify the conclusions drawn from LLM interactions, researchers SHOULD report the full interaction logs with an LLM if possible, that is, not only the prompts but also the LLM’s responses. This is especially important for studies targeting commercial SaaS solutions based on LLMs (e.g., ChatGPT) or novel tools that integrate LLMs via cloud APIs, where there is even less guarantee that a reader of the study who wants to verify, reproduce, or replicate it can recreate the state of the LLM-powered system at a later point. The rationale for this is similar to the rationale for reporting interview transcripts in qualitative research. In both cases, it is important to document the entire interaction between the interviewer and the participant. Just as a human participant might give different answers to the same question asked two months apart, the responses of ChatGPT can also vary over time. Therefore, keeping a record of the actual conversation is crucial for accuracy and context, and it transparently demonstrates the depth of engagement.
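
A lightweight way to capture such logs is to persist every prompt/response pair together with model metadata, for example as JSON Lines; the following sketch assumes the metadata fields shown are the ones relevant to the study:

import json, datetime

def log_interaction(path, model, config, prompt, response):
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "config": config,  # e.g., {"temperature": 0.7, "seed": 23487}
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_interaction("interactions.jsonl", "gpt-4-0125-preview",
                {"temperature": 0.7, "seed": 23487},
                "Summarize this stack trace: ...", "The trace indicates ...")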

Example(s)

A recent paper by Anandayuvaraj et al. [2] is a good example of making prompts available online. In the paper, the authors analyze software failures reported in news articles and use prompting to automate tasks such as filtering relevant articles, merging reports, and extracting detailed failure information. Their online appendix contains all the prompts used in the study, providing valuable transparency and supporting reproducibility.

A good example of comprehensive prompt reporting is provided by Liang et al. [3]. The authors make the exact prompts available in their online appendix on Figshare, including details such as code blocks being enclosed in triple backticks. While this level of detail would not fit within the paper itself, the paper thoroughly explains the rationale behind the prompt design and data output format. It also includes one overview figure and two concrete examples, ensuring transparency and reproducibility while keeping the main text concise.

An example of reporting full interaction logs is the study by Ronanki et al. [4], for which the authors reported the full answers of ChatGPT and uploaded them to Zenodo.

Advantages

As motivated above, providing detailed documentation of prompts enhances verifiability, reproducibility, and comparability of LLM-based studies. It allows other researchers to replicate the study under similar conditions, refine prompts based on documented improvements, and evaluate how different types of content (e.g., source code vs. execution traces) influence LLM behavior. This transparency also enables a better understanding of how formatting, prompt length, and structure impact results across various studies.

An advantage of reporting full interaction logs is that, while conversations with human participants often cannot be reported due to confidentiality, LLM conversations can (e.g., as of early 2025, OpenAI allows sharing chat transcripts). Detailed logs enable future reproduction or replication studies to compare results using the same prompts. This could be valuable for tracking changes in LLM responses over time or across different versions of the model. It would also enable secondary research to study how consistent an LLM’s responses are over model versions and to identify any variations or improvements in its performance for specific software engineering tasks.

Challenges

One challenge is the complexity of prompts that combine multiple components, such as code, error messages, and explanatory text. Formatting differences, such as whether markdown or plain text was used, can affect how LLMs interpret inputs. Additionally, prompt length constraints may require careful management, particularly for tasks involving extensive artifacts such as large codebases. Privacy and confidentiality concerns can also hinder prompt sharing, especially when sensitive data is involved. In these cases, researchers SHOULD provide anonymized examples and summaries wherever possible. For prompts containing sensitive information, researchers MUST (i) anonymize personal identifiers, (ii) replace proprietary code with functionally equivalent examples or placeholders, and (iii) clearly mark modified sections.
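
A simple starting point for such anonymization is rule-based redaction before sharing, as in the following sketch; the patterns and the identifier list are hypothetical and would need to be tailored to the study's data:

import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b(?:Alice|Bob|AcmeCorp)\b"), "<NAME>"),  # hypothetical identifier list
]

def redact(text):
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("Bob from AcmeCorp reported the bug via bob@acme.com"))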

Not all systems allow the reporting of full (system) prompts and interaction logs with the same ease. This hinders transparency and verifiability. Besides our recommendation to Use an Open LLM as a Baseline, researchers SHOULD report whatever they can about commercial tools, acknowledging limitations due to unavailable data. They might also consider using open-source tools such as Continue[1], which enable researchers to collect interaction logs and system prompts locally. Understanding suggestions of commercial tools such as GitHub Copilot might require recreating the exact state of the codebase at the time the suggestion was made, a challenging context to report. One solution could be to use version control to capture the exact state of the codebase when a recommendation occurred, and keep track of the files that were automatically added as context.

Study Types

Reporting requirements may vary depending on the study type. For the study type LLMs for New Software Engineering Tools, researchers MUST explain how prompts were generated and structured within the tool. When Studying LLM Usage in Software Engineering, especially in controlled experiments, exact prompts MUST be reported for all conditions. For observational studies, especially those targeting commercial tools, researchers MUST report the full interaction logs/conversations. If the full interaction logs cannot be shared, e.g., because they contain confidential information, the prompts and responses SHOULD at least be summarized and described in the PAPER.

TODO: Sebastian: What about the study types targeting LLMs for SE research?

References

[1] S. Chann, “Non-determinism in GPT-4 is caused by Sparse MoE.” https://152334h.github.io/blog/non-determinism-in-gpt-4/, 2023.

[2] D. Anandayuvaraj, M. Campbell, A. Tewari, and J. C. Davis, “FAIL: Analyzing software failures from the news using LLMs,” in Proceedings of the 39th IEEE/ACM international conference on automated software engineering, 2024, pp. 506–518.

[3] J. T. Liang et al., “Can gpt-4 replicate empirical software engineering research?” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1330–1353, 2024.

[4] K. Ronanki, C. Berger, and J. Horkoff, “Investigating ChatGPT’s potential to assist in requirements elicitation processes,” in 2023 49th euromicro conference on software engineering and advanced applications (SEAA), IEEE, 2023, pp. 354–361.

[1] https://blog.continue.dev/its-time-to-collect-data-on-how-you-build-software/

Use Human Validation for LLM Outputs

Recommendations

TODO: Sebastian: Revise the recommendations to make clear what we recommend using MUST, SHOULD, etc.. Also indicate which information we expect to be reported in the PAPER or in the SUPPLEMENTARY MATERIAL. Note the general comments in the introduction of the guidelines: https://llm-guidelines.org/guidelines#guidelines Some content in this subsection reads more like examples, advantages, or challenges. Please revise this as well.

While LLMs can automate many tasks in research and software development that have traditionally been performed by humans, it remains important to validate their outputs with human judgment. For natural language processing tasks, a large-scale study has shown that LLMs exhibit significant variation in their results, which limits their reliability as a direct substitute for human raters [1]. Human validation helps ensure the accuracy and reliability of the results, as LLMs may sometimes produce incorrect or biased outputs. Especially in studies where LLMs are used to support researchers, human validation is generally recommended to ensure validity [2]. We recommend that researchers plan for human validation from the outset and develop their methodology with this validation in mind. Study reference models for comparing humans with LLMs [3] can provide a template to ease the design process. In some cases, a hybrid approach combining human and machine-generated annotations can improve annotation efficiency. However, researchers should use systematic approaches to decide which human annotations can be replaced with LLM-generated ones and how, such as the methods proposed by Ahmed et al. [4].

When evaluating the capability of LLMs to generate SE-related artifacts, researchers may employ human validation to complement machine-generated measures. For example, proxies for software quality, such as code complexity or the number of code smells, may be complemented by human ratings of maintainability, readability, or understandability. In the case of more abstract variables or psychometric measurements, human validation may be the only way of measuring a specific construct. For example, measuring human factors such as trust, cognitive load, and comprehension levels may inherently require human evaluation.

When conducting empirical measurements, researchers should clearly define the construct that they are measuring and specify the methods used for measurement. Further, they should use established measurement methods and instruments that are empirically validated [5], [6]. Measuring a construct may require aggregating input from multiple subjects. For example, a study may assess inter-rater agreement using measures such as Cohen’s Kappa or Krippendorff’s Alpha before aggregating ratings. In some cases, researchers may also combine multiple measures into single composite measures. As an example, they may evaluate both the time taken and accuracy when completing a task and aggregate them into a composite measure for the participants’ overall performance. In these cases, researchers should clearly describe their method of aggregation and document their reasoning for doing so.
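
For example, a composite performance score could be computed by standardizing each measure before averaging, as in the following sketch (toy numbers; the sign of the time measure is inverted so that higher values indicate better performance):

import numpy as np

times = np.array([310, 250, 420, 390])     # seconds per participant
accuracy = np.array([0.8, 0.9, 0.6, 0.7])  # fraction of correct subtasks

def z(x):
    return (x - x.mean()) / x.std(ddof=1)

composite = (-z(times) + z(accuracy)) / 2
print(composite.round(2))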

When employing human validation, additional confounding factors should be controlled for, such as the level of expertise or experience with LLM-based applications or their general attitude towards AI-based tools. Researchers should control for these factors through methods such as stratified sampling or by categorizing participants based on experience levels. Where applicable, researchers should conduct a power analysis to estimate the required sample size and ensure sufficient statistical power in their experiment design. When multiple humans are annotating the same artifact, researchers should validate the objectivity of annotations through measures of inter-rater reliability. For instance, “A subset of 20% of the LLM-generated annotations was reviewed and validated by experienced software engineers to ensure accuracy. Using Cohen’s Kappa, an inter-rater reliability of \kappa = 0.90 was reached.”

For instance, “Model-model agreement is high, for all criteria, especially for the three large models (GPT-4, Gemini, and Claude). Table I indicates that the mean Krippendorff’s \alpha is 0.68-0.76. Second, we see that human-model and human-human agreements are in similar ranges, 0.24-0.40 and 0.21-0.48 for the first three categories.” [4].
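
Such agreement values can be computed with standard libraries; the following sketch uses scikit-learn's Cohen's Kappa on toy labels assigned by a human rater and an LLM to the same artifacts:

from sklearn.metrics import cohen_kappa_score

human = ["correct", "correct", "incorrect", "correct", "incorrect"]
llm   = ["correct", "incorrect", "incorrect", "correct", "incorrect"]

print(cohen_kappa_score(human, llm))  # approx. 0.62 for this toy data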

Example(s)

As an example, Khojah et al. [7] augmented the results of their study using human measurement. Specifically, they asked participants to provide ratings regarding their experience, trust, perceived effectiveness and efficiency, and scenarios and lessons learned in their experience with ChatGPT.

Choudhuri et al. [8] evaluated students’ perceptions of their experience with ChatGPT in a controlled experiment. They added this data to extend their results on task performance in a series of software engineering tasks. This way, they were able to employ questionnaires to measure cognitive load, document any perceived faults in the system, and inquire about the participants’ intention to continue using the tool.

Xue et al. [9] conducted a controlled experiment in which they evaluated the impact of ChatGPT on the performance and perceptions of students in an introductory programming course. They employed multiple measures to judge the impact of the LLM from the perspective of humans. In their study, they recorded the students’ screens, evaluated the answers they provided in tasks, and distributed a post-study survey to get direct opinions from the students.

Hymel et al. [10] evaluated the capability of ChatGPT-4.0 to generate requirements documents. Specifically, they evaluated two requirements documents based on the same business use case, one document generated with the LLM and one document created by a human expert. The documents were then reviewed by experts and judged in terms of alignment with the original business use case, requirements quality and whether they believed it was created by a human or an LLM. Finally, they analyzed the influence of the participants’ familiarity with AI tools on the study results.

Advantages

Incorporating human judgment in the evaluation process adds a layer of quality control and increases the trustworthiness of the study’s findings, especially when explicitly reporting inter-rater reliability metrics [11].

Incorporating feedback from individuals from the target population strengthens external validity by grounding study findings in real-world usage scenarios and may positively impact the transfer of study results to practice. Researchers may uncover additional opportunities to further improve the LLM or LLM-based tool based on the reported experiences.

Challenges

Measurement through human validation can be challenging. Ensuring that the operationalization of a desired construct and the method of measuring it are appropriate requires a good understanding of the studied concept and construct validity in general, and a systematic design approach for the measurement instruments [12].

Human judgment is often very subjective and may lead to large variability between different subjects due to differences in expertise, interpretation, and biases among evaluators [13]. Controlling for this subjectivity will require additional rigor when analyzing the study results.

Recruiting participants as human validators will always incur additional resources compared to machine-generated measures. Researchers must weigh the cost and time investment incurred by the recruitment process against the potential benefits for the validity of their study results.

Study Types

TODO: Sebastian: This section rather reads like the actual recommendations. The intention of this section was rather to focus on the study types where the guidelines is most important. For example, when evaluating an LLM-based tool for program repair, with existing benchmarks and non-LLM-baselines, why do I need human validation?

These guidelines apply to all study types:

When conducting their studies, researchers

MUST

  • Clearly define the construct measured through human validation.

  • Describe how the construct is operationalized in the study, specifying the method of measurement.

  • Employ established and widely accepted measurement methods and instruments.

SHOULD

  • Use empirically validated measures.

  • Complement automated or machine-generated measures with human validation where possible.

  • Ensure consistency among human validators by establishing a shared understanding in training sessions or pilot studies and by assessing inter-rater agreement.

  • Assess the characteristics of the human validators to control for factors influencing the validation results (e.g., years of experience, familiarity with the task, etc.).

MAY

  • Use multiple different measures (e.g., expert ratings, surveys, task performance) for human validation.

References

[1] A. Bavaresco et al., “LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks,” CoRR, vol. abs/2406.18403, 2024, doi: 10.48550/ARXIV.2406.18403.

[2] X. Wang, H. Kim, S. Rahman, K. Mitra, and Z. Miao, “Human-LLM collaborative annotation through effective verification of LLM labels,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, honolulu, HI, USA, may 11-16, 2024, F. ’Floyd’Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 303:1–303:21. doi: 10.1145/3613904.3641960.

[3] K. Schneider, F. Fotrousi, and R. Wohlrab, “A reference model for empirically comparing LLMs with humans,” in Proceedings of the 47th international conference on software engineering: Software engineering in society (ICSE-SEIS2025), IEEE, 2025.

[4] T. Ahmed, P. T. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” CoRR, vol. abs/2408.05534, 2024, doi: 10.48550/ARXIV.2408.05534.

[5] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman, “Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance,” Frontiers Comput. Sci., vol. 5, 2023, doi: 10.3389/FCOMP.2023.1096257.

[6] S. A. C. Perrig, N. Scharowski, and F. Brühlmann, “Trust issues with trust scales: Examining the psychometric quality of trust measures in the context of AI,” in Extended abstracts of the 2023 CHI conference on human factors in computing systems, CHI EA 2023, hamburg, germany, april 23-28, 2023, A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, and A. Peters, Eds., ACM, 2023, pp. 297:1–297:7. doi: 10.1145/3544549.3585808.

[7] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond code generation: An observational study of ChatGPT usage in software engineering practice,” Proc. ACM Softw. Eng., vol. 1, no. FSE, pp. 1819–1840, 2024, doi: 10.1145/3660788.

[8] R. Choudhuri, D. Liu, I. Steinmacher, M. A. Gerosa, and A. Sarma, “How far are we? The triumphs and trials of generative AI in learning software engineering,” in Proceedings of the 46th IEEE/ACM international conference on software engineering, ICSE 2024, lisbon, portugal, april 14-20, 2024, ACM, 2024, pp. 184:1–184:13. doi: 10.1145/3597503.3639201.

[9] Y. Xue, H. Chen, G. R. Bai, R. Tairas, and Y. Huang, “Does ChatGPT help with introductory programming?an experiment of students using ChatGPT in CS1,” in Proceedings of the 46th international conference on software engineering: Software engineering education and training, SEET@ICSE 2024, lisbon, portugal, april 14-20, 2024, ACM, 2024, pp. 331–341. doi: 10.1145/3639474.3640076.

[10] C. Hymel and H. Johnson, “Analysis of LLMs vs human experts in requirements engineering.” 2025. Available: https://arxiv.org/abs/2501.19297

[11] Q. Khraisha, S. Put, J. Kappenberg, A. Warraitch, and K. Hadfield, “Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages,” Research Synthesis Methods, vol. 15, no. 4, pp. 616–626, 2024, doi: https://doi.org/10.1002/jrsm.1715.

[12] D. I. K. Sjøberg and G. R. Bergersen, “Construct validity in software engineering,” IEEE Trans. Software Eng., vol. 49, no. 3, pp. 1374–1396, 2023, doi: 10.1109/TSE.2022.3176725.

[13] N. McDonald, S. Schoenebeck, and A. Forte, “Reliability and inter-rater reliability in qualitative research: Norms and guidelines for CSCW and HCI practice,” Proc. ACM Hum. Comput. Interact., vol. 3, no. CSCW, pp. 72:1–72:23, 2019, doi: 10.1145/3359174.

Use an Open LLM as a Baseline

Recommendations

Empirical studies using LLMs in software engineering, especially those targeting commercial tools or models, SHOULD incorporate an open LLM as a baseline and report established metrics for inter-model agreement (see Section Report Suitable Baselines, Benchmarks, and Metrics). TODO: Sebastian: Please check that the cited section contains the required information. If not, add it there. We acknowledge that including an open LLM baseline might not always be possible, for example if the study involves human participants and letting them work on the tasks using two different models is not feasible. However, open models allow other researchers to verify research results and build upon them, even without access to commercial models. Comparing commercial and open models also allows researchers to contextualize model performance. Researchers SHOULD provide a full replication package as part of their SUPPLEMENTARY MATERIAL, including clear, step-by-step instructions on how to verify and reproduce the results reported in the paper.

Open LLMs are available on platforms such as Hugging Face. Depending on the model size and the available computing power, open LLMs can be hosted on a local computer or server using frameworks such as Ollama or LM Studio. They can also be run on cloud-based services such as Together AI or on large hyperscalers such as AWS, Azure, Alibaba Cloud, and Google Cloud.
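
As a sketch of how such a locally hosted baseline might be queried programmatically, the following uses Ollama's HTTP API; it assumes that the Ollama server is running locally and that the model has been pulled, and the prompt and options are illustrative:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3:70b-instruct-q8_0",
        "prompt": "Write a unit test for a function that reverses a string.",
        "stream": False,
        "options": {"temperature": 0, "seed": 42},
    },
    timeout=300,
)
print(resp.json()["response"])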

The term “open” can have different meanings in the context of LLMs. Widder et al. discuss three types of openness: transparency, reusability, and extensibility [1]. They also discuss what openness in AI can and cannot provide. Moreover, the Open Source Initiative (OSI) [2] provides a definition of open-source AI that serves as a useful framework for evaluating the openness of AI models. In simple terms, according to the OSI, open-source AI means having access to everything needed to use the AI, including understanding, modifying, sharing, retraining, and recreating it.

Example(s)

TODO: Sebastian: Artificial examples should only be used in case we can’t find any SE or non-SE papers that use open LLMs as a baseline. Please add suitable examples here.

  • Benchmarking a Proprietary LLM: Researchers want to know how good their own LLM is at writing code. They might compare it against an open LLM such as StarCoderBase.

  • Evaluating an LLM-Powered Tool: A team developing an AI-driven code review tool might want to assess the quality of suggestions generated by both a proprietary LLM and an open alternative. Human evaluators could then independently rate the relevance and correctness of the suggestions, providing an objective measure of the tool’s effectiveness.

  • Ensuring Reproducibility with a Replication Package: A study on bug localization that uses a closed-source LLM could support reproducibility by including a replication package. This package might contain a script that automatically reruns the same experiments using an open-source LLM, such as Llama 3, and generates a comparative report.

Advantages

TODO: Sebastian: Please formulate a text based on the bullet points (see earlier guidelines for inspiration).

  • Improved Reproducibility: Researchers can independently replicate experiments.

  • More Objective Comparisons: Using a standardized baseline allows for more unbiased evaluations.

  • Greater Transparency: Open models enable the analysis of how data is processed, which supports researchers in identifying potential biases and limitations.

  • Long-Term Accessibility: Unlike proprietary models, which may become unavailable, open LLMs remain available for future studies.

  • Lower Costs: Open-source models usually have fewer licensing restrictions, which makes them more accessible to researchers with limited funding.

Challenges

TODO: Sebastian: Please formulate a text based on the bullet points (see earlier guidelines for inspiration).

  • Performance Differences: Open models may not always match the latest proprietary LLMs in accuracy or efficiency, making it harder to demonstrate improvements.

  • Computational Demands: Running large open models requires hardware resources, including high-performance GPUs and significant memory.

  • Defining “Openness”: The term open is evolving; many so-called open models provide access to weights but do not disclose training data or methodologies. We are aware that the definition of an “open” model is actively being discussed, and many open models are essentially only “open weight” [3]. We consider the Open Source AI Definition proposed by the Open Source Initiative (OSI) [2] to be a first step towards defining true open-source models.

  • Implementation Complexity: Unlike cloud-based APIs from proprietary providers, setting up and fine-tuning open models can be technically demanding due to the possible limited documentation.

Study Types

TODO: Sebastian: Please formulate a text based on the bullet points (see earlier guidelines for inspiration) and use the command from header.tex to refer to the study types.

  • Tool Evaluation: An open LLM baseline MUST be included if technically feasible. If integration is too complex, researchers SHOULD at least report initial benchmarking results using open models.

  • Benchmarking Studies and Controlled Experiments: An open LLM MUST be one of the models evaluated.

  • Observational Studies: If including an open LLM is impossible, researchers SHOULD acknowledge its absence and discuss potential impacts on their findings.

  • Qualitative Studies: If the LLM is used for exploratory data analysis or to compare alternative interpretations of results, an open LLM baseline MAY be reported.

References

[1] D. G. Widder, M. Whittaker, and S. M. West, “Why ‘open’AI systems are actually closed, and why this matters,” Nature, vol. 635, no. 8040, pp. 827–833, 2024.

[2] Open Source Initiative (OSI), “Open Source AI Definition 1.0.” https://opensource.org/ai/open-source-ai-definition.

[3] E. Gibney, “Not all ‘open source’ AI models are actually open,” Nature News, 2024, doi: 10.1038/d41586-024-02012-5.

Report Suitable Baselines, Benchmarks, and Metrics

Recommendations

Benchmarks, baselines, and metrics play a significant role in assessing the effectiveness of an LLM or LLM-based tool. Benchmarks are model- and tool-independent standardized tests used to assess the performance of LLMs on specific tasks such as code summarization or code generation. A benchmark consists of multiple standardized test cases, each with at least a task and an expected result. Metrics are used to quantify the performance for the benchmark tasks, enabling a comparison. A baseline represents a reference point for the measured LLM. Since LLMs require substantial hardware resources, baselines serve as a comparison to assess their performance against traditional algorithms with lower computational costs.

When selecting benchmarks, it is important to fully understand the benchmark tasks and the expected results, because they determine what the benchmark actually assesses. Researchers MUST briefly summarize in the PAPER why they selected certain benchmarks. They SHOULD also summarize the structure and tasks of the selected benchmark(s), including the programming language(s) and descriptive statistics such as the number of contained tasks and test cases. Researchers SHOULD also discuss the limitations of the selected benchmark(s). For example, many benchmarks focus heavily on Python, and often on isolated functions. This assesses a very specific part of software development, which is certainly not representative of the full breadth of software engineering.

Researchers MAY include an example of a task and corresponding test case(s) to illustrate the structure of the benchmark. If multiple benchmarks exist for the same task, researchers SHOULD compare performance between benchmarks. If selecting only a subset of benchmarks, researchers SHOULD use the most specific benchmarks given the context.

The use of LLMs might not always be reasonable if traditional approaches achieve similar performance. For many tasks LLMs are being evaluated for, there exist traditional non-LLM-based approaches (e.g., for program repair) that can serve as a baseline. Even if LLM-based tools perform better, the question is whether the resources consumed justify the potentially marginal improvements. Researchers SHOULD always check whether such traditional baselines exist and if they do, compare them with the LLM or LLM-based tool using suitable metrics.

For comparing traditional and LLM-based approaches or different LLM-based tools, researchers SHOULD use established metrics whenever possible (see examples below), as this enables secondary research. Researchers MUST argue why the selected metrics are suitable for the given task or study. If a study evaluates an LLM-based tool that is supposed to support humans, a relevant metric might be the acceptance rate, meaning the ratio of all accepted artifacts (e.g., test cases, code snippets) in relation to all artifacts that were generated and presented to the user. Another way of evaluating LLM-based tools is calculating inter-model agreement (see also Section Use an Open LLM as a Baseline). This also allows researchers to assess how dependent a tool’s performance is on specific models. TODO: What are suitable metrics for comparing inter-model agreement?

LLM-based generation is non-deterministic by design. Due to this non-determinism, researchers SHOULD repeat experiments to statistically assess the performance of a model or tool, for example using the arithmetic mean, confidence intervals, and standard deviations.
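
For example, a per-run metric could be aggregated as follows (a sketch with toy numbers, assuming five repetitions of a pass@1 measurement):

import numpy as np
from scipy import stats

scores = np.array([0.71, 0.68, 0.74, 0.70, 0.69])  # e.g., pass@1 over 5 runs
mean, sd = scores.mean(), scores.std(ddof=1)
ci = stats.t.interval(0.95, len(scores) - 1, loc=mean, scale=stats.sem(scores))
print(f"mean={mean:.3f}, sd={sd:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")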

Example(s)

TODO: Sebastian: Examples of SE papers using certain benchmarks, baseline, or metrics are still missing. See also the TODO at the end of this section, which was added during the previous review round.

Two benchmarks used for code generation are HumanEval (GitHub) [1] and MBPP (GitHub) [2]. Both benchmarks consist of code snippets written in Python sourced from publicly available repositories. Each snippet consists of four parts: a prompt based on a function definition and a corresponding description of what the function should accomplish, a canonical solution, an entry point for execution, and tests. The input of the LLM is the entire prompt. The output of the LLM is evaluated either against the canonical solution using metrics or against a test suite. Other benchmarks for code generation include ClassEval (GitHub) [3], LiveCodeBench (GitHub) [4], and SWE-bench (GitHub) [5]. An example of a code translation benchmark is TransCoder [6] (GitHub).

According to [7], the main problem types for LLMs are classification, recommendation, and generation problems. Each of these problem types requires a different set of metrics. The authors provide a comprehensive overview of benchmarks categorized by software engineering tasks. Common metrics for assessing generation tasks are BLEU, pass@k, Accuracy/Accuracy@k, and Exact Match [7]. The most common metric for recommendation tasks is the Mean Reciprocal Rank [7]. For classification tasks, classical machine learning metrics such as Precision, Recall, F1-score, and Accuracy are often reported [7].

We now briefly discuss two common metrics used for generation tasks. BLEU-N [1] is a similarity score based on n-gram precision between two strings, ranging from 0 to 1. Values close to 0 indicate dissimilar content, while values closer to 1 indicate similar content. For code generation, a value closer to 1 suggests that the model is more capable of generating the expected output. BLEU-N has multiple variations. CodeBLEU [8] and CrystalBLEU [9] are the most notable variations tailored to code, introducing additional heuristics such as AST matching. As mentioned above, researchers should motivate why they chose a certain metric or variant thereof for their particular study.
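
As an illustration, sentence-level BLEU can be computed with NLTK as follows (token-level comparison with smoothing; code-specific variants such as CodeBLEU require dedicated tooling and are not shown):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["def add ( a , b ) : return a + b".split()]
candidate = "def add ( a , b ) : return a + b".split()

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)  # 1.0 for an exact match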

The metric pass@k reports the likelihood of a model correctly completing a code snippet at least once within k tries. To the best of our knowledge, the basic concept of pass@k was first used in [10] for evaluating code synthesis under the name success rate at B, where B denotes the budget of trials. The term pass@k was later popularized by [11] as a metric for code generation correctness. The exact definition of correctness varies depending on the task. For code generation, correctness is often defined based on test cases: a solution is considered correct if it passes the tests. The resulting pass rate ranges from 0 to 1. A pass rate of 0 indicates that the model was not able to generate a single correct solution within k tries; a pass rate of 1 indicates that the model generated at least one correct solution within k tries. The metric is defined as:

\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}

where n is the total number of generated samples per prompt, c is the number of correct samples among these n samples, and k is the number of samples drawn (i.e., the number of tries). The reported benchmark score is the mean of this estimate over all prompts.
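
A numerically stable implementation of this per-prompt estimator, following the formulation popularized by [11], could look as follows:

```python
# Numerically stable pass@k estimator for a single prompt, following the
# formula above: n samples generated, c of them correct, k tries considered.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one prompt."""
    if n - c < k:
        return 1.0  # every possible draw of k samples contains a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per prompt, 3 of them correct
print(f"pass@1  = {pass_at_k(20, 3, 1):.3f}")   # 0.150
print(f"pass@10 = {pass_at_k(20, 3, 10):.3f}")
```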

Choosing an appropriate value for k depends on the downstream task of the model and how end users interact with the model. A high pass@1 rate is desirable in tasks where the system presents only one solution or where generating a single solution requires high computational effort. For example, code completion depends on a single prediction since the end user typically sees only one suggestion. Pass rates for higher k values (e.g., 2, 5, 10) indicate whether the model can solve the given task within multiple attempts. For downstream tasks that permit multiple solutions or user interaction, strong performance at k > 1 can be justified. For instance, a user selecting the correct test case from multiple suggestions can tolerate some model errors. Papers introducing new models for code generation, such as [12], [13], [14], [15], commonly report pass@k.

TODO: Find SE papers that use benchmarks/metrics other than pass@k and other papers that do not introduce a new model.

Exact Match is another metric; it measures whether the model reproduces the target sequence exactly. If the generated output is identical to the target sequence, a score of 1 is awarded, otherwise 0; aggregated over a dataset, it reports the percentage of exactly reproduced targets. Compared to BLEU-N, Exact Match is a stricter measure.
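
The following minimal sketch illustrates Exact Match at the corpus level and shows how a pure formatting difference already breaks the match, which is why normalization choices should be reported alongside the metric:

```python
# Minimal sketch: corpus-level Exact Match. The second prediction differs from
# its reference only in whitespace, yet it counts as a miss.
references = ["return a + b", "return a - b"]
predictions = ["return a + b", "return a -  b"]

exact_match = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"Exact Match: {exact_match:.2f}")   # 0.50
```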

Advantages

Challenges

A general challenge with benchmarks for LLMs is that the most prominent ones, such as HumanEval and MBPP, use Python, introducing a bias towards this specific programming language and its idiosyncrasies. Since model performance is measured against these benchmarks, researchers often optimize for them. As a result, performance may degrade if programming languages other than Python are used.

Many closed-source models, such as those released by OpenAI, achieve exceptional performance on certain tasks but lack transparency and reproducibility [3], [16], [17]. Benchmark leaderboards, particularly for code generation, are led by closed-source models [3], [17]. While researchers should compare performance against these models, they must consider that providers might discontinue them or apply undisclosed pre- or post-processing beyond the researcher's control (see also Section Use an Open LLM as a Baseline).

Challenges with individual metrics include, for example, that BLEU-N is a syntactic metric and hence does not measure semantic or structural correctness. Thus, a high BLEU-N score does not directly indicate that the generated code is executable. While alternatives exist, they often come with their own limitations. For instance, Exact Match is a strict measure that does not account for code that is functionally equivalent but syntactically different. Execution-based metrics (e.g., pass@k) directly evaluate correctness by running test cases, but they require a setup with an execution environment. When researchers observe unexpected values for certain metrics, the specific results should be investigated in more detail to uncover potential problems. Such problems can, for example, be related to formatting, since code formatting highly influences metrics such as BLEU-N or Exact Match.

Another challenge to consider is that metrics usually capture one specific aspect of a task or solution. For instance, metrics such as pass@k do not reflect qualitative aspects of code such as maintainability, cognitive load, or readability. These aspects are critical for downstream tasks and influence overall usability. Moreover, benchmarks are isolated test sets and may not fully represent real-world applications. For example, benchmarks such as HumanEval evaluate code synthesis from written specifications. However, such explicit descriptions are rare in real-world settings. Thus, evaluating model performance with benchmarks might not reflect real-world tasks and end-user usability.

Finally, benchmark data contamination [18] continues to be a major challenge. In many cases, the training dataset of an LLM is not released together with the model, and the benchmark itself could be part of that training dataset. Such contamination may lead to the model remembering the actual solution from its training data rather than genuinely solving the new task, which results in artificially high performance on the benchmark. On unseen problems, however, the model might perform much worse.

Study Types

TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [19]).

This guideline MUST be followed for all study types that evaluate the performance of LLMs or LLM-based tools.

For example, for LLMs as Annotators, the research goal might be to assess which model comes closest to a ground truth dataset created by human annotators. Especially for open annotation tasks, it is important to select suitable metrics to compare LLM-generated and human-generated labels. In general, annotation tasks can vary significantly: Are multiple labels allowed for the same sequence? Are the available labels predefined, or should the LLM generate a set of labels independently? Due to this task dependence, it is important that researchers justify their metric choice, explaining which aspects of the task it captures and what its limitations are.
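
For illustration, the following sketch (with hypothetical labels, assuming scikit-learn) compares LLM-generated multi-label annotations against a human ground truth using macro-averaged F1; for single-label tasks, chance-corrected agreement measures such as Cohen's kappa may be more appropriate:

```python
# Minimal sketch (hypothetical labels): comparing LLM-generated multi-label
# annotations against human ground truth with macro-averaged F1 (scikit-learn).
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

human = [{"bug", "ui"}, {"performance"}, {"bug"}]
llm = [{"bug"}, {"performance", "ui"}, {"bug"}]

mlb = MultiLabelBinarizer()
y_true = mlb.fit_transform(human)   # binary indicator matrix per label
y_pred = mlb.transform(llm)

print(f"Macro F1: {f1_score(y_true, y_pred, average='macro'):.2f}")
```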

TODO: Sebastian: Extend this and look at the comment below that is still unresolved. If researchers assess a well-established task, such as code generation, they SHOULD report standard metrics like pass@k and compare to other models.

TODO: Describe more study types. Copy common outline for this section.

References

[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: A method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the association for computational linguistics, july 6-12, 2002, philadelphia, PA, USA, ACL, 2002, pp. 311–318. doi: 10.3115/1073083.1073135.

[2] J. Austin et al., “Program synthesis with large language models,” CoRR, vol. abs/2108.07732, 2021, Available: https://arxiv.org/abs/2108.07732

[3] X. Du et al., “ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation,” CoRR, vol. abs/2308.01861, 2023, doi: 10.48550/ARXIV.2308.01861.

[4] N. Jain et al., “LiveCodeBench: Holistic and contamination free evaluation of large language models for code,” CoRR, vol. abs/2403.07974, 2024, doi: 10.48550/ARXIV.2403.07974.

[5] C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world github issues?” in The twelfth international conference on learning representations, ICLR 2024, vienna, austria, may 7-11, 2024, OpenReview.net, 2024. Available: https://openreview.net/forum?id=VTF8yNQM66

[6] M.-A. Lachaux, B. Rozière, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” CoRR, vol. abs/2006.03511, 2020, Available: https://arxiv.org/abs/2006.03511

[7] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.

[8] S. Ren et al., “CodeBLEU: A method for automatic evaluation of code synthesis,” CoRR, vol. abs/2009.10297, 2020, Available: https://arxiv.org/abs/2009.10297

[9] A. Eghbali and M. Pradel, “CrystalBLEU: Precisely and efficiently measuring the similarity of code,” in 37th IEEE/ACM international conference on automated software engineering, ASE 2022, rochester, MI, USA, october 10-14, 2022, ACM, 2022, pp. 28:1–28:12. doi: 10.1145/3551349.3556903.

[10] S. Kulal et al., “SPoC: Search-based pseudocode to code,” CoRR, vol. abs/1906.04908, 2019, Available: http://arxiv.org/abs/1906.04908

[11] M. Chen et al., “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021, Available: https://arxiv.org/abs/2107.03374

[12] B. Rozière et al., “Code llama: Open foundation models for code,” CoRR, vol. abs/2308.12950, 2023, doi: 10.48550/ARXIV.2308.12950.

[13] D. Guo et al., “DeepSeek-coder: When the large language model meets programming - the rise of code intelligence,” CoRR, vol. abs/2401.14196, 2024, doi: 10.48550/ARXIV.2401.14196.

[14] B. Hui et al., “Qwen2.5-coder technical report,” CoRR, vol. abs/2409.12186, 2024, doi: 10.48550/ARXIV.2409.12186.

[15] R. Li et al., “StarCoder: May the source be with you!” CoRR, vol. abs/2305.06161, 2023, doi: 10.48550/ARXIV.2305.06161.

[16] J. Li et al., “EvoCodeBench: An evolving code generation benchmark with domain-specific evaluations,” in Advances in neural information processing systems 38: Annual conference on neural information processing systems 2024, NeurIPS 2024, vancouver, BC, canada, december 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, Eds., 2024. Available: http://papers.nips.cc/paper_files/paper/2024/hash/6a059625a6027aca18302803743abaa2-Abstract-Datasets_and_Benchmarks_Track.html

[17] T. Y. Zhuo et al., “BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions,” CoRR, vol. abs/2406.15877, 2024, doi: 10.48550/ARXIV.2406.15877.

[18] C. Xu, S. Guan, D. Greene, and M. T. Kechadi, “Benchmark data contamination of large language models: A survey,” CoRR, vol. abs/2406.04244, 2024, doi: 10.48550/ARXIV.2406.04244.

[19] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.

Report Limitations and Mitigations

Recommendations

When using LLMs for empirical studies in software engineering, researchers face unique challenges and potential limitations that can influence the validity, reliability, and reproducibility of their findings [1], including the following:

A cornerstone of open science is the ability to reproduce research results. Even though the inherent non-deterministic nature of LLMs is a strength in many use cases, its impact on reproducibility is a challenge. To enable reproducibility, researchers MUST provide a replication package for their study. TODO: Sebastian: This is not possible for all study types. If we keep it that generic, it’s rather a SHOULD. Or we contextualize by study type. They SHOULD perform multiple repetitions of their experiments (see Report Suitable Baselines, Benchmarks, and Metrics) to account for non-deterministic outputs of LLMs. In some cases, they MAY reduce output variability by setting the temperature to a value close to 0 and configuring a seed value for more deterministic results. However, configuring a lower temperature can impair task performance, and not all models allow configuring seed values. Besides non-determinism, the behavior of an LLM depends on many external factors, such as model version, API updates, or prompt variations. To ensure reproducibility, researchers SHOULD follow all suggestions in our guidelines and further address the generalization challenges outlined below.
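
For API-based models, a minimal sketch for reducing output variability could look as follows (assuming the openai Python client; the seed parameter is best-effort and not supported by all models or providers):

```python
# Minimal sketch: pinning an exact model snapshot, lowering the temperature,
# and requesting a fixed seed (best-effort determinism). Assumes the `openai`
# Python client and an API key in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # pin an exact snapshot, not a moving alias
    messages=[{"role": "user", "content": "Summarize this bug report: ..."}],
    temperature=0,              # lower temperature -> less output variability
    seed=42,                    # best-effort deterministic sampling
)

# Log the system fingerprint with the output; it changes when the backend
# configuration changes and helps explain non-reproducible results.
print(response.system_fingerprint)
print(response.choices[0].message.content)
```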

Even though the topic of generalizability is not new, it has gained new relevance with the increasing interest in LLMs. In LLM-based studies, generalizability boils down to two main concerns: First, are the results specific to one LLM, or can they be achieved with other LLMs too? If generalizability to other LLMs is not in the scope of the research, this MUST be clearly explained in the PAPER. If generalizability is in scope, researchers MUST compare their results, or subsets of the results (e.g., if a full comparison is not possible due to computational cost), with other LLMs that are similar (e.g., in size) to assess the generalizability of their findings (see also Section Use an Open LLM as a Baseline). Second, will these results still be valid in the future? Multiple studies (e.g., [2], [3]) found that the performance of proprietary LLMs (e.g., GPT) within the same version (e.g., GPT-4o) decreased over time for certain tasks. Reporting the model version and configuration is not sufficient in such cases. To date, the only way of mitigating this limitation is to use an open LLM with a transparently communicated versioning scheme and archived model weights (see Use an Open LLM as a Baseline). Hence, researchers SHOULD employ open LLMs to set a reproducible baseline and, if employing an open LLM is not possible, researchers SHOULD test and report their results over an extended period of time as a proxy for the results’ stability over time.

Data leakage/contamination/overfitting occurs when information from outside the training dataset influences the model, leading to overly optimistic performance estimates. With the growing reliance on large datasets, the risk of inter-dataset duplication increases (see, e.g., [4], [5]). In the context of LLMs for software engineering, this can, for example, manifest as samples from the pre-training dataset appearing in the fine-tuning or evaluation dataset, potentially compromising the validity of evaluation results [6]. For example, ChatGPT’s functionality to “improve the model for everyone” can result in unintentional data leakage. Hence, to ensure the validity of the evaluation results, researchers SHOULD carefully curate the fine-tuning and evaluation datasets to prevent inter-dataset duplication and MUST NOT leak their fine-tuning or evaluation datasets into the improvement process of the LLM. However, when publishing the results, researchers SHOULD of course still follow open science practices and publish the datasets as part of their SUPPLEMENTARY MATERIAL. If information about the pre-training dataset of the employed LLM is available, researchers SHOULD assess the inter-dataset duplication and MUST discuss the potential data leakage in the PAPER. When training an LLM from scratch, researchers MAY consider using open datasets such as Together AI’s RedPajama [7] that already incorporate deduplication (with the positive side effect of potentially improving performance [8]).
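
A simple way to check for exact overlap between fine-tuning and evaluation data is to compare normalized content hashes, as in the following sketch (file and field names are hypothetical; near-duplicate detection, e.g., via MinHash, requires additional tooling):

```python
# Minimal sketch (hypothetical file and field names): detecting exact overlap
# between a fine-tuning set and an evaluation set via normalized content hashes.
import hashlib
import json

def normalize(code: str) -> str:
    # Crude normalization so trivial whitespace differences do not hide duplicates.
    return " ".join(code.split())

def content_hashes(path: str, field: str = "code") -> set[str]:
    hashes = set()
    with open(path) as f:
        for line in f:  # assumes one JSON object per line (JSONL)
            sample = json.loads(line)
            hashes.add(hashlib.sha256(normalize(sample[field]).encode()).hexdigest())
    return hashes

train = content_hashes("finetune.jsonl")
evaluation = content_hashes("eval.jsonl")
print(f"{len(train & evaluation)} exact duplicates between fine-tuning and evaluation data")
```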

Conducting studies with LLMs is a resource-intensive endeavor. For self-hosted LLMs, the respective hardware needs to be provided; for managed LLMs, the service costs have to be considered. The challenge becomes more pronounced as LLMs grow larger, research architectures get more complex, and experiments become more computationally expensive. For example, multiple repetitions to assess performance in the face of non-determinism (see Report Suitable Baselines, Benchmarks, and Metrics) multiply the cost and impact scalability. Consequently, resource-intensive research remains predominantly the domain of private companies or well-funded research institutions, hindering researchers with limited resources in reproducing, replicating, or extending study results. Hence, for transparency reasons, researchers SHOULD report the cost associated with executing an LLM-based study. If the study employed self-hosted LLMs, researchers SHOULD report the specific hardware used. If the study employed managed LLMs, the service cost SHOULD be reported. To ensure research result validity and reproducibility, researchers MUST provide the LLM outputs as evidence for validation (see Section Report Prompts, their Development, and Interaction Logs). Depending on the architecture (see Section Report Tool Architecture beyond Models), these outputs need to be reported on different architectural levels (e.g., outputs of individual LLMs in multi-agent systems). Additionally, researchers SHOULD include a subset of the employed validation dataset, selected using an accepted sampling strategy, to allow partial replication of the results.
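
Cost reporting can be as simple as accumulating token counts across all requests and multiplying them with the provider's prices, as in the following sketch (the prices are placeholders and must be replaced with the provider's current pricing for the exact model used):

```python
# Minimal sketch: estimating API cost from logged token counts.
# Prices are hypothetical placeholders (USD per 1M tokens).
PRICE_PER_1M_INPUT = 2.50    # placeholder, not an actual provider price
PRICE_PER_1M_OUTPUT = 10.00  # placeholder, not an actual provider price

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * PRICE_PER_1M_INPUT \
         + (output_tokens / 1_000_000) * PRICE_PER_1M_OUTPUT

# Example: accumulated token counts over all requests of a study
total_input, total_output = 12_400_000, 3_100_000
print(f"Estimated API cost: {estimate_cost(total_input, total_output):.2f} USD")
```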

While metrics such as BLEU or ROUGE are commonly used to evaluate the performance of LLMs (see Section Report Suitable Baselines, Benchmarks, and Metrics), they may not capture other relevant, software engineering-specific aspects such as functional correctness or the runtime performance of automatically generated code [9].

Sensitive data can range from personal to proprietary data, each with its own set of ethical concerns. A major threat when using proprietary LLMs with sensitive data is that the data may be used for model improvements (see the discussion of data leakage above). Hence, using sensitive data can lead to privacy and intellectual property (IP) violations. Another threat is the implicit bias of LLMs, which can lead to discrimination or unfair treatment of individuals or groups. To mitigate these concerns, researchers SHOULD minimize the sensitive data used in their studies, MUST follow applicable regulations (e.g., GDPR) and individual data processing agreements, SHOULD create a data management plan outlining how the data is handled and protected against leakage and discrimination, and MUST apply for approval from the ethics committee of their organization (if required).

The performance of an LLM is usually measured in terms of traditional metrics such as accuracy, precision, and recall, or more contemporary metrics such as pass@k or BLEU-N (see Section Report Suitable Baselines, Benchmarks, and Metrics). However, given the high resource demand of LLMs, resource consumption has to become a key performance indicator for assessing research progress responsibly. While research has predominantly focused on the energy consumption of the early phases of LLMs (e.g., data center manufacturing, data acquisition, training), inference, i.e., the use of the LLM, often becomes similarly or even more resource-intensive [10], [11], [12], [13], [14]. Hence, researchers SHOULD aim for lower resource consumption on the model side. This can be achieved by selecting smaller (e.g., GPT-4o mini instead of GPT-4o) or newer models as the base model for the study, or by employing techniques such as model pruning, quantization, or knowledge distillation [14]. They SHOULD further reduce resource consumption when using the LLMs, e.g., by restricting the number of queries, input tokens, or output tokens [14], by using different prompt engineering techniques (e.g., on average, zero-shot prompts seem to emit less CO2 than chain-of-thought prompts), or by carefully sampling smaller datasets for fine-tuning and evaluation instead of using large datasets in their entirety. TODO: Sebastian: This of course goes against our recommendations to repeat studies. How do we balance this? To report the environmental impact of a study, researchers SHOULD use software such as CodeCarbon or Experiment Impact Tracker to track and quantify the carbon footprint of the study, or report an estimation of the carbon footprint through tools such as MLCO2 Impact. They SHOULD detail the LLM version and configuration as described in Report Model Version, Configuration, and Customizations, state the hardware or hosting provider of the model as described in Report Tool Architecture beyond Models, and report the total number of requests, accumulated input tokens, and output tokens. Researchers MUST justify why an LLM was chosen over existing approaches and set the achieved results in relation to the higher resource consumption of LLMs (see also Section Report Suitable Baselines, Benchmarks, and Metrics).
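
For example, a minimal sketch using the CodeCarbon package to track the estimated carbon footprint of a study's evaluation loop could look as follows (run_evaluation() is a placeholder for the actual experiment code):

```python
# Minimal sketch (assuming the codecarbon package): tracking the estimated
# carbon footprint of an experiment run.
from codecarbon import EmissionsTracker

def run_evaluation() -> None:
    ...  # placeholder: query the LLM, compute metrics, etc.

tracker = EmissionsTracker(project_name="llm-eval-study")
tracker.start()
try:
    run_evaluation()
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2-equivalent

print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```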

Example(s)

An example highlighting the need for caution around replicability is the study by Staudinger et al. [15], who attempted to replicate a previous LLM study that did not provide a replication package. They were not able to reproduce the exact results, even though they observed trends similar to the original study, and they consider their results not reliable enough for a systematic review. To analyze whether results obtained with proprietary LLMs transfer to open LLMs, Staudinger et al. [15] benchmarked the previous results obtained with GPT-3.5 and GPT-4 against Mistral and Zephyr. They found that the employed open models could not deliver the same performance as the proprietary models, limiting the original findings to certain proprietary models. This paper is also an example of a study reporting costs: “120 USD in API calls for GPT 3.5 and GPT 4, and 30 USD in API calls for Mistral AI. Thus, the total LLM cost of our reproducibility study was 150 USD”.

Individual studies have already started to highlight the uncertainty about the generalizability of their results in the future. In [16], Jesse et al. acknowledge that LLMs evolve over time and that this evolution might impact the study’s results.

Since a lot of research in software engineering revolves around code, inter-dataset code duplication has been extensively researched and addressed over the years to curate deduplicated datasets (see, e.g., [4], [5], [6], [17]). The issue of inter-dataset duplication has also attracted interest in other disciplines with growing demands for data mining. For example, in the field of biology, Lakiotaki et al. [18] acknowledge and address the overlap between multiple common disease datasets. In the domain of code generation, Coignion et al. [19] evaluated the performance of LLMs on LeetCode problems. To mitigate the issue of inter-dataset duplication, they only used problems published after 2023-01-01, reducing the likelihood that the LLMs had seen those problems before. Further, they discuss the performance differences of LLMs on different datasets in light of potential inter-dataset duplication. Zhou et al. [20] performed an empirical evaluation of data leakage in 83 software engineering benchmarks. While most benchmarks suffer from minimal leakage, a few exhibited leakage of up to 100%, and the authors found a high impact of data leakage on performance evaluation.

A starting point for studies that aim to assess and mitigate inter-dataset duplication are the Falcon LLMs: the Technology Innovation Institute publicly provides access to parts of the training data of the Falcon LLMs via Hugging Face [21]. Through this dataset, it is possible to reduce the overlap between pre-training and evaluation data, improving the validity of the evaluation results. A starting point to avoid actively leaking data into an LLM improvement process is to ensure that the data is not used to train the model (e.g., via OpenAI’s data control functionality, or by using the OpenAI API instead of the web interface) [22].

Bias can occur in datasets as well as in the LLMs trained on them and may result in various types of discrimination. Gallegos et al. propose metrics to quantify biases in various tasks (e.g., text generation, classification, question answering) [23]. Tinnes et al. [24] balanced their dataset size against the need for manual semantic analysis and the available computational resources.

Advantages

Reproducing study results under similar conditions by different parties greatly increases their validity and promotes transparency in research. Independent verification is of particular importance in studies involving LLMs, due to the randomness of their outputs and the potential for biases in their predictions and in training, fine-tuning, and evaluation datasets. Mitigating threats to the generalizability of a study by integrating open LLMs as a baseline or by reporting results over an extended period of time can increase the validity, reliability, and replicability of a study’s results. Assessing and mitigating the effects of inter-dataset duplication strengthens a study’s validity and reliability, as it prevents overly optimistic performance estimates that do not apply to previously unknown samples. Reporting the costs associated with executing a study not only increases transparency but also supports secondary literature in setting primary research into perspective. Providing replication packages that contain LLM outputs as evidence, as well as samples for partial replication, is a paramount step towards open and inclusive research in light of resource inequality among researchers. Mindfully deciding on and justifying the usage of LLMs over other approaches can lead to more efficient and sustainable solutions. Reporting the environmental impact of LLM usage also sets the stage for more sustainable research practices in the field of AI.

Challenges

With commercial LLMs evolving over time, the generalizability of results to future versions of the model is uncertain. Employing open LLMs as a baseline can mitigate this limitation, but may not always be feasible due to computational cost. Most LLM providers do not publicly offer information about the datasets used for pre-training, impeding the assessment of inter-dataset duplication effects. Consistently keeping track of and reporting the costs involved in a research endeavor is challenging. Building a coherent replication package that includes LLM outputs and samples for partial replication requires additional effort and resources. Defining all requirements for metrics beforehand to ensure the usage of suitable metrics can be challenging, especially in exploratory research. In this growing field of research, finding the right metrics to evaluate the performance of LLMs in software engineering for specific tasks is challenging; our Section Report Suitable Baselines, Benchmarks, and Metrics and its references can serve as a starting point. Ensuring compliance across jurisdictions is difficult, with different regions having different regulations and requirements (e.g., GDPR and the AI Act in the EU, CCPA in California). Selecting datasets and models with fewer biases is challenging, as the bias in LLMs is often not transparently reported. Measuring or estimating the environmental impact of a study is challenging and might not always be feasible; especially in exploratory research, the impact is hard to estimate beforehand, making it difficult to justify the usage of LLMs over other approaches.

Study Types

The limitations and mitigations described above SHOULD be addressed for all study types in a sensible manner, i.e., depending on their applicability to the individual study.

References

[1] J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: The threats of using llms in software engineering,” in Proceedings of the 2024 ACM/IEEE 44th international conference on software engineering: New ideas and emerging results, 2024, pp. 102–106.

[2] L. Chen, M. Zaharia, and J. Zou, “How is ChatGPT’s behavior changing over time?” CoRR, vol. abs/2307.09009, 2023, doi: 10.48550/ARXIV.2307.09009.

[3] D. Li, K. Gupta, M. Bhaduri, P. Sathiadoss, S. Bhatnagar, and J. Chong, “Comparing GPT-3.5 and GPT-4 accuracy and drift in radiology diagnosis please cases,” Radiology, vol. 310, no. 1, p. e232411, 2024, doi: 10.1148/radiol.232411.

[4] C. V. Lopes et al., “Déjàvu: A map of code duplicates on GitHub,” Proc. ACM Program. Lang., vol. 1, no. OOPSLA, pp. 84:1–84:28, 2017, doi: 10.1145/3133908.

[5] M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in Proceedings of the 2019 ACM SIGPLAN international symposium on new ideas, new paradigms, and reflections on programming and software, onward! 2019, athens, greece, october 23-24, 2019, H. Masuhara and T. Petricek, Eds., ACM, 2019, pp. 143–153. doi: 10.1145/3359591.3359735.

[6] J. A. H. López, B. Chen, M. Saad, T. Sharma, and D. Varró, “On inter-dataset code duplication and data leakage in large language models,” IEEE Trans. Software Eng., vol. 51, no. 1, pp. 192–205, 2025, doi: 10.1109/TSE.2024.3504286.

[7] T. Computer, RedPajama: An open dataset for training large language models. (2023). Available: https://github.com/togethercomputer/RedPajama-Data

[8] K. Lee et al., “Deduplicating training data makes language models better,” in Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), ACL 2022, dublin, ireland, may 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., Association for Computational Linguistics, 2022, pp. 8424–8445. doi: 10.18653/V1/2022.ACL-LONG.577.

[9] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation,” in Advances in neural information processing systems 36: Annual conference on neural information processing systems 2023, NeurIPS 2023, new orleans, LA, USA, december 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., 2023. Available: http://papers.nips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html

[10] A. de Vries, “The growing energy footprint of artificial intelligence,” Joule, vol. 7, no. 10, pp. 2191–2194, 2023.

[11] C.-J. Wu et al., “Sustainable AI: Environmental implications, challenges and opportunities,” in Proceedings of the fifth conference on machine learning and systems, MLSys 2022, santa clara, CA, USA, august 29 - september 1, 2022, D. Marculescu, Y. Chi, and C.-J. Wu, Eds., mlsys.org, 2022. Available: https://proceedings.mlsys.org/paper_files/paper/2022/hash/462211f67c7d858f663355eff93b745e-Abstract.html

[12] Z. Fu, F. Chen, S. Zhou, H. Li, and L. Jiang, “LLMCO2: Advancing accurate carbon footprint prediction for LLM inferences,” CoRR, vol. abs/2410.02950, 2024, doi: 10.48550/ARXIV.2410.02950.

[13] P. Jiang, C. Sonne, W. Li, F. You, and S. You, “Preventing the immense increase in the life-cycle energy and carbon footprints of LLM-powered intelligent chatbots,” Engineering, vol. 40, pp. 202–210, 2024, doi: https://doi.org/10.1016/j.eng.2024.04.002.

[14] N. E. Mitu and G. T. Mitu, “The hidden cost of AI: Carbon footprint and mitigation strategies,” Available at SSRN 5036344, 2024.

[15] M. Staudinger, W. Kusa, F. Piroi, A. Lipani, and A. Hanbury, “A reproducibility and generalizability study of large language models for query generation,” in Proceedings of the 2024 annual international ACM SIGIR conference on research and development in information retrieval in the asia pacific region, SIGIR-AP 2024, tokyo, japan, december 9-12, 2024, T. Sakai, E. Ishita, H. Ohshima, F. Hasibi, J. Mao, and J. M. Jose, Eds., ACM, 2024, pp. 186–196. doi: 10.1145/3673791.3698432.

[16] K. Jesse, T. Ahmed, P. T. Devanbu, and E. Morgan, “Large language models and simple, stupid bugs,” in 20th IEEE/ACM international conference on mining software repositories, MSR 2023, melbourne, australia, may 15-16, 2023, IEEE, 2023, pp. 563–575. doi: 10.1109/MSR59073.2023.00082.

[17] A. Karmakar, M. Allamanis, and R. Robbes, “JEMMA: An extensible java dataset for ML4Code applications,” Empir. Softw. Eng., vol. 28, no. 2, p. 54, 2023, doi: 10.1007/S10664-022-10275-7.

[18] K. Lakiotaki, N. Vorniotakis, M. Tsagris, G. Georgakopoulos, and I. Tsamardinos, “BioDataome: A collection of uniformly preprocessed and automatically annotated datasets for data-driven biology,” Database J. Biol. Databases Curation, vol. 2018, p. bay011, 2018, doi: 10.1093/DATABASE/BAY011.

[19] T. Coignion, C. Quinton, and R. Rouvoy, “A performance study of LLM-generated code on leetcode,” in Proceedings of the 28th international conference on evaluation and assessment in software engineering, EASE 2024, salerno, italy, june 18-21, 2024, ACM, 2024, pp. 79–89. doi: 10.1145/3661167.3661221.

[20] X. Zhou et al., “LessLeak-bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks.” 2025. Available: https://arxiv.org/abs/2502.06215

[21] Technology Innovation Institute, “Falcon-refinedweb (revision 184df75).” Hugging Face, 2023. doi: 10.57967/hf/0737.

[22] S. Balloccu, P. Schmidtová, M. Lango, and O. Dusek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs,” in Proceedings of the 18th conference of the european chapter of the association for computational linguistics, EACL 2024 - volume 1: Long papers, st. Julian’s, malta, march 17-22, 2024, Y. Graham and M. Purver, Eds., Association for Computational Linguistics, 2024, pp. 67–93. Available: https://aclanthology.org/2024.eacl-long.5

[23] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” CoRR, vol. abs/2309.00770, 2023, doi: 10.48550/ARXIV.2309.00770.

[24] C. Tinnes, A. Welter, and S. Apel, “Software model evolution with large language models: Experiments on simulated, public, and industrial datasets.”