Motivation and Scope
Motivation
In the short period since the release of ChatGPT in November 2022, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. Such evaluations might explore the effectiveness, performance, and robustness of LLMs in different contexts, such as improving code quality, reducing development time, or supporting software documentation. However, it is often unclear how empirical studies involving LLMs can achieve valid and reproducible results, or what effect the use of LLMs has on the validity of empirical results. This uncertainty poses significant challenges for researchers aiming to draw reliable conclusions from empirical studies.
One of the primary threats to reproducibility is the variability in LLM performance caused by differences in training data, model architecture, evaluation metrics, and the inherent non-determinism of these models. For example, slight changes in the training dataset or in hyperparameters can lead to significantly different outcomes, making it difficult to replicate studies. Additionally, the lack of standardized benchmarks and evaluation protocols further complicates the reproducibility of results. These issues highlight the need for clear guidelines and best practices to ensure that empirical studies involving LLMs yield valid and reproducible results. Moreover, without detailed information on the exact study setup, benchmarks, and metrics used, it is challenging for other researchers to replicate results using different models and tools. The importance of open science practices and of documenting the study setup should therefore not be underestimated [1].
There has been extensive work on developing guidelines for conducting and reporting specific types of empirical studies such as controlled experiments (e.g., Experimentation in Software Engineering [2] or the Guide to Advanced Empirical Software Engineering [3]) or their replications (e.g., A Procedure and Guidelines for Analyzing Groups of Software Engineering Replications [4]). We believe that LLMs have intrinsic characteristics that call for dedicated guidelines if researchers are to achieve an acceptable level of reproducibility. However, so far, there are no such guidelines for conducting and assessing studies involving LLMs in software engineering research.
For example, even if we know the specific version of an LLM used for an empirical study, the reported performance on the studied tasks can change over time, especially for commercial models that evolve beyond version identifiers [5]. Moreover, commercial providers do not guarantee the availability of old versions indefinitely. Beyond versions, LLM performance varies widely depending on configured parameters such as the temperature. Not reporting these parameter settings therefore impairs the reproducibility of the research. Even for *open* models such as Llama, we do not know how they were fine-tuned for specific tasks or what the exact training data was [6]. For example, when evaluating the performance of LLMs on certain programming tasks, it would be relevant to know whether the solution to a given problem was part of the training data.
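To illustrate the kind of information whose omission harms reproducibility, the following sketch shows how a study could pin a date-stamped model snapshot and persist the request parameters actually used alongside the study artifacts. It is only an illustration, not part of any cited guideline: it assumes the OpenAI Python client, and the model name, temperature, seed, and output file name are placeholders; the seed provides best-effort rather than guaranteed determinism.

```python
# Minimal sketch: pin an LLM snapshot and record the request configuration
# so that a study can report exactly which model and parameters were used.
# Assumes the OpenAI Python client; all concrete values are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

config = {
    "model": "gpt-4o-2024-08-06",  # date-stamped snapshot instead of a moving alias
    "temperature": 0.0,            # reduces (but does not eliminate) output variability
    "seed": 42,                    # best-effort determinism; not guaranteed by the provider
}

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize the bug report below ..."}],
    **config,
)

# Persist what was actually used, including the resolved model identifier.
record = {
    **config,
    "resolved_model": response.model,
    "system_fingerprint": response.system_fingerprint,
}
with open("llm_run_config.json", "w") as f:
    json.dump(record, f, indent=2)
```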
Scope
First, we want to clarify that our focus is on LLMs, that is, on natural language use cases. Multi-modal foundation models are beyond the scope of our guidelines. We are aware that these foundation models have huge potential for supporting software engineering research and practice. However, due to the diversity of artifacts they can generate or take as input (e.g., images, audio, and video) and their more demanding hardware requirements, we deliberately focus on LLMs alone. Nevertheless, our guidelines could be extended in the future to cover foundation models beyond text.
Second, given the exponential growth in LLM usage across all research domains, we also want to define the study types to which our guidelines apply. LLMs are already widely used to support several aspects of the overall research process, ranging from fairly simple tasks such as proof-reading, spell-checking, and text translation to more substantial activities such as data coding and the synthesis of literature reviews. The Study Types and Guidelines we describe are tailored to software engineering research, more specifically to AI for software engineering (AI4SE), but we expect many of these study types to generalize beyond that domain.
References
[1] O. E. Gundersen, O. Cappelen, M. Mølnå, and N. G. Nilsen, “The unreasonable effectiveness of open science in AI: A replication study,” CoRR, vol. abs/2412.17859, 2024, doi: 10.48550/ARXIV.2412.17859.
[2] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in software engineering, second edition. Springer, 2024. doi: 10.1007/978-3-662-69306-3.
[3] F. Shull, J. Singer, and D. I. K. Sjøberg, Eds., Guide to advanced empirical software engineering. Springer, 2008. doi: 10.1007/978-1-84800-044-5.
[4] A. Santos, S. Vegas, M. Oivo, and N. Juristo, “A procedure and guidelines for analyzing groups of software engineering replications,” IEEE Trans. Software Eng., vol. 47, no. 9, pp. 1742–1763, 2021, doi: 10.1109/TSE.2019.2935720.
[5] L. Chen, M. Zaharia, and J. Zou, “How is ChatGPT’s behavior changing over time?” CoRR, vol. abs/2307.09009, 2023, doi: 10.48550/ARXIV.2307.09009.
[6] E. Gibney, “Not all ‘open source’ AI models are actually open,” Nature News, 2024, doi: 10.1038/d41586-024-02012-5.