Motivation and Scope

Motivation

In the short period since the release of ChatGPT in November 2022, large language models (LLMs) have changed the software engineering (SE) research landscape. Although there are numerous opportunities to use LLMs to support SE research and development tasks, solid science needs rigorous empirical evaluations that explore the effectiveness, performance, and robustness of LLM-based automation. LLMs can, for example, be used to support literature reviews, reduce development time, and generate software documentation. However, it is often unclear how valid, reproducible, and replicable empirical studies involving LLMs are. This uncertainty poses significant challenges for researchers and practitioners who seek to draw reliable conclusions from such studies.

The importance of open science practices and of documenting the study setup cannot be overstated (Gundersen et al. 2024). A primary cause of irreproducible and irreplicable results in studies involving LLMs is the variability in model performance, which stems from the models’ inherent non-determinism as well as from differences in configuration, training data, model architecture, and evaluation metrics. Slight changes can lead to significantly different results. The lack of standardized benchmarks and evaluation protocols further hinders the reproducibility and replicability of study results. Without detailed information on the exact study setup, benchmarks, and metrics used, it is challenging for other researchers to replicate results using different models and tools. These issues highlight the need for clear guidelines and best practices for designing and reporting studies involving LLMs. The SE research community has developed guidelines for conducting and reporting specific types of empirical studies such as controlled experiments (e.g., Experimentation in Software Engineering (Wohlin et al. 2024), Guide to Advanced Empirical Software Engineering (Shull, Singer, and Sjøberg 2008)) and their replications (e.g., A Procedure and Guidelines for Analyzing Groups of Software Engineering Replications (Santos et al. 2021)). However, we believe that LLMs have intrinsic characteristics that require dedicated guidelines for researchers to achieve an acceptable level of reproducibility and replicability (see also our previous position paper (Wagner et al. 2025)). For example, even if we knew the specific version of a commercial LLM used for an empirical study, the reported task performance could still change over time, since commercial models are known to evolve beyond version identifiers (Chen, Zaharia, and Zou 2023). Moreover, commercial providers do not guarantee the indefinite availability of old model or tool versions. Beyond version differences, LLM performance varies widely depending on configured parameters such as temperature; not reporting these parameter settings therefore severely impairs reproducibility. Even for “open” models such as Mistral, DeepSeek, or Llama, we do not know how they were fine-tuned for specific tasks or what the exact training data was (Gibney 2024). A general problem when evaluating LLM performance is that we do not know whether the solution to a given problem was part of the training data.
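
To illustrate the configuration issue, the following minimal sketch (assuming the OpenAI Python SDK; the model identifier, prompt, and file name are placeholders) shows how the parameters that influence reproducibility, such as the pinned model snapshot, temperature, and seed, can be recorded alongside each response so that the exact study setup can later be reported and shared.

```python
import json
from openai import OpenAI  # assumption: the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Fix and document all parameters that influence the output.
config = {
    "model": "gpt-4o-2024-08-06",  # placeholder: pin a dated model snapshot
    "temperature": 0.0,
    "seed": 42,  # best-effort determinism; outputs may still vary
}

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize this bug report: ..."}],
    **config,
)

# Persist configuration and provenance information with each result.
record = {
    **config,
    "model_returned": response.model,                   # model actually served
    "system_fingerprint": response.system_fingerprint,  # backend configuration, if reported
    "output": response.choices[0].message.content,
}
with open("llm_run_log.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```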

So far, there are no holistic guidelines for conducting and reporting studies involving LLMs in SE research. With this community effort, we aim to fill this gap. After outlining our Motivation and Scope, we introduce a taxonomy of Study Types before presenting eight Guidelines that the authors of this article co-developed. For each study type and guideline, we identify relevant research exemplars, both within the SE literature and, where relevant, beyond the discipline; this is particularly useful given the rapid proliferation of research on this nascent topic. The most recent version of the guidelines is always available online.[1] Other researchers can suggest changes via a public GitHub repository.[2]

Scope

In this section, we outline the scope of the study types and guidelines we present.

SE as our Target Discipline

We target software engineering (SE) research conducted in academic or industry contexts. While other disciplines have started developing community guidelines for reporting empirical studies involving LLMs, SE has not yet seen a holistic effort in that direction. In healthcare, for example, Gallifant et al. (2025) presented the TRIPOD-LLM guidelines, which are centered around a checklist for “good reporting of studies that are developing, tuning, prompt engineering or evaluating an LLM.” Although the checklist overlaps with our Guidelines (e.g., reporting the LLM name and version, comparing performance between LLMs, humans, and other benchmarks), it is healthcare-specific (e.g., it requires authors to “explain the healthcare context” and “therapeutic and clinical workflow”). Instead of a checklist, we distinguish MUST and SHOULD criteria and provide brief tl;dr summaries highlighting our most important recommendations. SE research utilizes a broader range of empirical methods than many other disciplines. Therefore, we are convinced that SE-specific guidelines need to be developed in combination with a Study Types taxonomy such as ours. In SE, Sallou, Durieux, and Panichella (2024) have presented a vision paper on mitigating validity threats in LLM-based SE research. Although their guidelines partially overlap with ours (e.g., repeating experiments to handle output variability, considering data leakage), our guidelines cover a broader range of study types and provide more detailed advice. For example, we contextualize our guidelines by study type, distinguish LLMs and LLM-based tools, discuss models, baselines, benchmarks, and metrics, and explicitly address human validation. Moreover, our guidelines are the result of an extensive coordination process among a large number of experts in empirical SE, and hence a first step towards true community guidelines.

Focus on Natural Language Use Cases

We want to clarify that our focus is on LLMs, that is, on natural language use cases. Multi-modal foundation models are beyond the scope of our study types and guidelines. We are aware that these foundation models have great potential to support SE research and practice. However, due to the diversity of artifacts that can be generated or used as input (e.g., images, audio, and video) and the more demanding hardware requirements, we deliberately focus on LLMs only. Nevertheless, our guidelines could be extended in the future to cover foundation models beyond natural language text.

Focus on Direct Tool or Research Support

Given the exponential growth in LLM usage across all research domains, we also want to define the research contexts in which our guidelines apply. LLMs are already widely used to support several aspects of the overall research process, from fairly simple tasks such as proofreading, spell-checking, and text translation to more complex activities such as data coding and the synthesis of literature reviews. Regarding tools, we focus on use cases in the area of AI for software engineering (AI4SE), that is, studying the support and automation of SE tasks with the help of artificial intelligence (AI), more specifically LLMs (see Section LLMs as Tools for Software Engineers). For research support, we focus on empirical SE research supported by LLMs (see Section LLMs as Tools for Software Engineering Researchers). By research support, we mean the active involvement of LLMs in data collection, processing, or analysis. We consider LLMs supporting the study design or the writing process to be out of scope.

Researchers as our Target Audience

Finally, our guidelines mainly target SE researchers planning, designing, conducting, and reporting empirical studies involving LLMs. Reviewers of scientific articles can also use our guidelines, for example, to check whether the authors adhere to the essential MUST criteria we present, but they are not our primary target audience.

References

Chen, Lingjiao, Matei Zaharia, and James Zou. 2023. “How Is ChatGPT’s Behavior Changing over Time?” CoRR abs/2307.09009. https://doi.org/10.48550/ARXIV.2307.09009.

Gallifant, Jack, Majid Afshar, Saleem Ameen, Yindalon Aphinyanaphongs, Shan Chen, Giovanni Cacciamani, Dina Demner-Fushman, et al. 2025. “The TRIPOD-LLM Reporting Guideline for Studies Using Large Language Models.” Nature Medicine 31 (1): 60–69. https://doi.org/10.1038/s41591-024-03425-5.

Gibney, Elizabeth. 2024. “Not all ‘open source’ AI models are actually open.” Nature News. https://doi.org/10.1038/d41586-024-02012-5.

Gundersen, Odd Erik, Odd Cappelen, Martin Mølnå, and Nicklas Grimstad Nilsen. 2024. “The Unreasonable Effectiveness of Open Science in AI: A Replication Study.” CoRR abs/2412.17859. https://doi.org/10.48550/ARXIV.2412.17859.

Sallou, June, Thomas Durieux, and Annibale Panichella. 2024. “Breaking the Silence: The Threats of Using LLMs in Software Engineering.” In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, 102–6.

Santos, Adrian, Sira Vegas, Markku Oivo, and Natalia Juristo. 2021. “A Procedure and Guidelines for Analyzing Groups of Software Engineering Replications.” IEEE Trans. Software Eng. 47 (9): 1742–63. https://doi.org/10.1109/TSE.2019.2935720.

Shull, Forrest, Janice Singer, and Dag I. K. Sjøberg, eds. 2008. Guide to Advanced Empirical Software Engineering. Springer. https://doi.org/10.1007/978-1-84800-044-5.

Wagner, Stefan, Marvin Muñoz Barón, Davide Falessi, and Sebastian Baltes. 2025. “Towards Evaluation Guidelines for Empirical Studies Involving LLMs.” In IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering, WSESE@ICSE 2025, May 3, 2025, 24–27. IEEE. https://doi.org/10.1109/WSESE66602.2025.00011.

Wohlin, Claes, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2024. Experimentation in Software Engineering, Second Edition. Springer. https://doi.org/10.1007/978-3-662-69306-3.

[1] https://llm-guidelines.org

[2] https://github.com/se-uhd/llm-guidelines-website