Guidelines
This set of guidelines is currently a DRAFT and based on a discussion session with researchers at the 2024 International Software Engineering Research Network (ISERN) meeting and at the 2nd Copenhagen Symposium on Human-Centered Software Engineering AI. This draft is meant as a starting point for further discussions in the community with the aim of developing a common understanding of how we should conduct and report empirical studies involving large language models (LLMs). See also the pages on study types and scope.
The wording of the recommendations follows RFC 2119 and RFC 8174.
Overview
- Declare LLM Usage and Role
- Report Model Version and Configuration
- Report Tool Architecture and Supplemental Data
- Report Prompts and their Development
- Report Interaction Logs
- Use Human Validation for LLM Outputs
- Use an Open LLM as a Baseline
- Report Suitable Baselines, Benchmarks, and Metrics
- Report Limitations and Mitigations
Declare LLM Usage and Role
Context
When conducting any kind of empirical study involving LLMs, it is essential to clearly declare that an LLM was used. This is, for example, required by the ACM Policy on Authorship [1]: “The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work.”
Recommendations
We recommend reporting the exact purpose of using an LLM in a study, the tasks it was used to automate, and the expected outcomes.
Example
TODO: Maybe reuse example from ACM Policy?
Benefits
Transparency in the usage of LLMs helps in understanding the context and scope of the study, facilitating better interpretation and comparison of results. Beyond this declaration, we recommend that authors be explicit about the LLM version they used (see version and date guideline) and the LLM’s exact role (see architecture guideline).
Challenges
We do not expect any challenges for researchers following this guideline.
Study Types
This guideline MUST be followed for all study types.
References
[1] Association for Computing Machinery, “ACM Policy on Authorship.” https://www.acm.org/publications/policies/new-acm-policy-on-authorship, 2023.
Report Model Version and Configuration
Context
LLMs are frequently updated, and different versions may produce varying results. Moreover, the model configuration and parameters influence the output generation of the models.
Recommendations
It is crucial to document the specific version of the LLM used in the study, the date when the experiments were conducted, and the exact configuration and parameters that were used, because this information is necessary for reproducibility. We recommend reporting:
- Model name
- Model version
- The maximum token length for prompts
- The configured temperature that controls randomness, and all other relevant parameters that affect output generation
- Whether historical context was considered when generating responses
Additionally, a thorough description of the hosting environment of the LLM or LLM-based tool should be provided, especially in studies focusing on performance or any time-sensitive measurement.
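One way to keep these items together is a small machine-readable record stored in the replication package. The following is a minimal sketch, not a prescribed format: the class `ModelConfigReport`, its field names, and all example values are hypothetical placeholders chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelConfigReport:
    """Hypothetical record of the reported items; field names are illustrative."""
    model_name: str
    model_version: str
    experiment_date: str              # ISO 8601 date of the experiment runs
    max_prompt_tokens: int
    temperature: float
    uses_conversation_history: bool
    other_parameters: dict = field(default_factory=dict)  # e.g., top_p, seed
    hosting_environment: str = "unspecified"


def to_json(report: ModelConfigReport) -> str:
    """Serialize with a stable key order so that diffs between runs stay readable."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)


# Placeholder values only; replace them with the values of the actual study.
report = ModelConfigReport(
    model_name="example-model",
    model_version="example-version",
    experiment_date="2025-01-10",
    max_prompt_tokens=1024,
    temperature=0.0,
    uses_conversation_history=False,
    other_parameters={"top_p": 1.0, "seed": 1234},
    hosting_environment="cloud API (example)",
)
print(to_json(report))
```

Such a file can be versioned next to the experiment scripts so that every reported result can be traced back to one configuration.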
Example
For an OpenAI model, researchers might report that “A gpt-4 model was integrated via the Azure OpenAI Service, and configured with a temperature of 0.7, top_p set to 0.8, and a maximum token length of …. We used version 0125-Preview, system fingerprint fp_6b68a8204b, seed value 23487, and ran our experiment on 10th January 2025” [1], [2]. However, there are also related challenges (see section below).
TODO: Talk about local or self-hosted models as well.
Benefits
By providing this information, researchers enable others to reproduce the study under the same or similar conditions.
Challenges
Different model providers and modes of operating the models expose varying degrees of information. For example, OpenAI provides a model version and a system fingerprint describing the backend configuration, which can also influence the output. However, the fingerprint is only intended to detect changes to the model or its configuration; as a user, one cannot go back to a certain fingerprint. As a beta feature, OpenAI also lets users set a seed parameter to receive “(mostly) consistent output” [3]. However, the seed value does not allow for full reproducibility, and the fingerprint changes frequently.
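Because a fingerprint can only be observed and not pinned, one option is to record it for every response and check afterwards whether it changed during the experiment. The sketch below assumes that each stored response is a dictionary with a `system_fingerprint` key; the helper names and the fingerprint values are hypothetical, and no API call is made.

```python
from collections import Counter


def summarize_fingerprints(responses):
    """Count how often each backend fingerprint occurs in the recorded responses."""
    return Counter(r.get("system_fingerprint") or "missing" for r in responses)


def backend_changed(responses):
    """Return True if more than one distinct fingerprint was observed."""
    return len(summarize_fingerprints(responses)) > 1


# Hypothetical stored responses; only the fingerprint field matters here.
recorded = [
    {"id": "run-1", "system_fingerprint": "fp_aaa"},
    {"id": "run-2", "system_fingerprint": "fp_aaa"},
    {"id": "run-3", "system_fingerprint": "fp_bbb"},
]

print(summarize_fingerprints(recorded))
print(backend_changed(recorded))
```

Reporting the resulting counts makes it visible to readers that the runs were not all served by the same backend configuration.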
TODO: Also “open” models come with challenges in terms of reproducibility (https://github.com/ollama/ollama/issues/5321).
Study Types
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [4]).
TODO: I guess when studying the usage of a tool such as ChatGPT or Copilot, it might not be possible to report all of this?
References
[1] OpenAI, “OpenAI API Introduction.” https://platform.openai.com/docs/api-reference/chat/streaming, 2025.
[2] Microsoft, “Azure OpenAI Service models.” https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models, 2025.
[3] OpenAI, “How to make your completions outputs consistent with the new seed parameter.” https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter, 2023.
[4] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
Report Tool Architecture and Supplemental Data
Context
Oftentimes, there is a complex layer around the LLM that preprocesses data, prepares prompts, or filters user requests. One example is ChatGPT, which can, among others, use the GPT-4o model. GitHub Copilot uses the same model as well, and researchers can build their own tools utilizing GPT-4o directly (e.g., via the OpenAI API). The infrastructure around the bare model can significantly contribute to the performance of a model in a certain task. Therefore, it is crucial that researchers clearly describe what the LLM contributes to the tool or method presented in a research paper.
TODO: Architecture (e.g., usage of RAG, agent-based architecture, etc.)
TODO: data dump of vector database if used
TODO: finetuning? if yes, how? also: publish data used for finetuning (if not confidential)
Recommendations
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [1]).
Example
TODO
Benefits
TODO
Challenges
TODO
References
[1] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
Report Prompts and their Development
Context
Prompts can significantly influence the output of LLMs [1], and sharing them allows other researchers to understand and reproduce the conditions of the study.
Recommendations
Reporting the exact prompts used in the study is essential for transparency and reproducibility. For example, including the specific questions or tasks given to the LLM helps in assessing the validity of the results and comparing them with other studies. This is an example where different types of studies require different information. When studying LLM usage, the researchers ideally collect and publish the prompts written by the users (if confidentiality allows). Otherwise, summaries and examples can be provided. Prompts also need to be reported when LLMs are integrated into new tools, especially if study participants were able to formulate (parts of) the prompts. For all other types of studies, researchers should discuss how they arrived at their final set of prompts. If a systematic approach was used, this process should be described in detail.
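To report both the general prompt design and the exact text sent to the model, it can help to keep the template and the filled-in values separate. The following is a minimal sketch under that assumption; the template text, the function `prompt_record`, and the example values are hypothetical.

```python
import hashlib
from string import Template

# Hypothetical prompt template; $-placeholders are filled in per task.
PROMPT_TEMPLATE = Template(
    "You are reviewing $language code.\nTask: $task\nCode:\n$code"
)


def prompt_record(template, **values):
    """Return the template, the inserted values, and the exact rendered prompt."""
    prompt = template.substitute(**values)
    return {
        "template": template.template,
        "values": values,
        "prompt": prompt,
        # The hash lets readers verify that a published prompt is unmodified.
        "sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }


record = prompt_record(
    PROMPT_TEMPLATE,
    language="Python",
    task="Find potential bugs.",
    code="def add(a, b):\n    return a - b",
)
print(record["prompt"])
```

Publishing the template once and the per-task values alongside it keeps the replication package compact while still allowing the exact prompts to be reconstructed.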
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [2]).
Example
TODO
Benefits
TODO
Challenges
TODO
References
[1] Y. Liu et al., “Refining ChatGPT-generated code: Characterizing and mitigating code quality issues,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 5, pp. 116:1–116:26, 2024, doi: 10.1145/3643674.
[2] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
Report Interaction Logs
Context
Given that commercial LLMs and LLM-based tools are evolving systems (minor upgrades of a major version are likely to be deployed frequently [1]) and that they behave non-deterministically [2] even with a temperature of 0, reporting the prompts and model versions alone will not be enough to enable reproducibility.
Even for models such as Llama [3] that researchers can host and configure themselves, reaching complete determinism is challenging. While decoding strategies and parameters can be fixed (e.g., by defining seeds, setting temperature to 0, using deterministic decoding strategies, etc.), non-determinism can also arise from batching, input preprocessing, and floating point arithmetic on GPUs.
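The floating point aspect can be illustrated without a GPU: floating point addition is not associative, so the order in which partial results are combined (which may differ between parallel executions) can change the outcome. A minimal sketch:

```python
# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

In an LLM, such tiny differences can accumulate across many operations and occasionally change which token is selected, after which the remaining output may diverge.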
Recommendations
For complete transparency, researchers should report full interaction logs, that is, all prompts and responses generated by the LLM or LLM-based tool in the context of the presented study. This is especially important when reporting a study targeting commercial SaaS solutions based on LLMs (e.g., ChatGPT) or novel tools that integrate LLMs via cloud APIs.
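Interaction logs can be collected with very little tooling. The sketch below appends one JSON object per interaction to a JSON Lines file; the function names, the file layout, and the metadata fields are assumptions made for illustration, not a required format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_interaction(log_path, prompt, response, model, **metadata):
    """Append one prompt/response pair with a UTC timestamp to a JSONL file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "metadata": metadata,  # e.g., temperature, seed, system fingerprint
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def read_log(log_path):
    """Load all logged interactions, one dictionary per line."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A call such as `log_interaction("interactions.jsonl", prompt, response, model="example-model", temperature=0)` after every request yields a file that can be published as part of the replication package, provided it contains no confidential data.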
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [4]).
Example
TODO
Benefits
In this sense, the way an LLM has to be treated is similar to the way a human participant in an interview would be treated, because both the human and the LLM might provide different answers if presented with the same questions at different times. The difference is that, while for human participants conversations often cannot be reported due to confidentiality, LLM conversations can.
Challenges
TODO
References
[1] L. Chen, M. Zaharia, and J. Zou, “How is ChatGPT’s behavior changing over time?” CoRR, vol. abs/2307.09009, 2023, doi: 10.48550/ARXIV.2307.09009.
[2] S. Chann, “Non-determinism in GPT-4 is caused by Sparse MoE.” https://152334h.github.io/blog/non-determinism-in-gpt-4/, 2023.
[3] Meta, “Llama.” https://www.llama.com/.
[4] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
Use Human Validation for LLM Outputs
Context
While LLMs can automate many tasks, it is important to validate their outputs with human annotations, at least partially. For natural language processing tasks, a large-scale study has shown that LLMs have too large a variation in their results to be reliably used as a substitution for human raters [1]. Human validation helps ensure the accuracy and reliability of the results, as LLMs may sometimes produce incorrect or biased outputs.
Recommendations
Especially in studies where LLMs are used to support researchers, human validation should always be employed. For studies using LLMs as annotators, the proposed process by Ahmed et al. [2], which includes an initial few-shot learning and, given good results, the replacement of one human annotator by an LLM, might be a way forward.
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [3]).
Example
TODO
Benefits
Incorporating human judgment in the evaluation process adds a layer of quality control and increases the trustworthiness of the study’s findings, especially when explicitly reporting inter-rater reliability metrics. For instance, “A subset of 20% of the LLM-generated annotations were reviewed and validated by experienced software engineers to ensure accuracy. An inter-rater reliability of 90% was reached.”
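Raw percent agreement does not correct for agreement that would occur by chance, so a chance-corrected measure such as Cohen's kappa is often reported alongside it. The following is a minimal sketch for two raters (e.g., one human and one LLM) with nominal labels; the label data is invented for illustration.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which both raters assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Invented example: ten items labeled by a human and by an LLM.
human = ["bug", "bug", "feature", "bug", "feature",
         "feature", "bug", "feature", "bug", "bug"]
llm = human[:-1] + ["feature"]  # the LLM disagrees on the last item only

print(percent_agreement(human, llm))       # 0.9
print(round(cohens_kappa(human, llm), 2))  # 0.8
```

In this invented example, 90% raw agreement corresponds to a kappa of 0.8, which shows why reports should state which reliability measure was used.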
Challenges
TODO
References
[1] A. Bavaresco et al., “LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks,” CoRR, vol. abs/2406.18403, 2024, doi: 10.48550/ARXIV.2406.18403.
[2] T. Ahmed, P. T. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” CoRR, vol. abs/2408.05534, 2024, doi: 10.48550/ARXIV.2408.05534.
[3] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
Use an Open LLM as a Baseline
Context
TODO
Recommendations
To ensure the reproducibility of results, we recommend that findings be reported with an open LLM as a baseline. This applies both when using LLMs as tools for supporting researchers in empirical studies and when benchmarking LLMs for SE tasks. In case LLMs are integrated into new tools, this is also preferable if the architecture of the tool allows it. If the effort of changing models is too high, researchers should at least report an initial benchmarking with open models, which enables more objective comparisons. Open LLMs can either be hosted via cloud platforms such as Hugging Face or used locally via tools such as ollama or LM Studio. A replication package for papers using LLMs should include clear instructions that allow other researchers to reproduce the findings using open models. This practice enhances the credibility of the study and allows for independent verification of the results. Researchers could, e.g., mention that “results were compared with those obtained using Meta’s Code Llama, available on the Hugging Face platform” and point to a replication package.
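As an illustration of what such instructions might contain, the following sketch prepares a request for a locally hosted open model. It assumes an Ollama server running at its default address (`http://localhost:11434`) with a model named `codellama` already pulled; the helper names, the seed, and the chosen options are illustrative assumptions, and a replication package should document the values that were actually used.

```python
import json
import urllib.request


def build_ollama_request(prompt, model="codellama",
                         host="http://localhost:11434", seed=42):
    """Build a non-streaming generate request with fixed decoding options."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "seed": seed},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request; requires a running local server (assumption)."""
    with urllib.request.urlopen(build_ollama_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the request construction in one function makes it straightforward to rerun the same prompts against several open models by changing only the `model` argument.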
TODO: Inter-model agreement, model confidence
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [1]).
Example
TODO
Benefits
TODO
Challenges
We are aware that the definition of an “open” model is actively being discussed, and many open models are essentially only “open weight” [2]. We consider the Open Source AI Definition proposed by the Open Source Initiative (OSI) [3] to be a first step towards defining true open-source models.
References
[1] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
[2] E. Gibney, “Not all ‘open source’ AI models are actually open,” Nature News, 2024, doi: 10.1038/d41586-024-02012-5.
[3] Open Source Initiative (OSI), “Open Source AI Definition 1.0.” https://opensource.org/ai/open-source-ai-definition.
Report Suitable Baselines, Benchmarks, and Metrics
Context
TODO
Recommendations
TODO: What are suitable metrics and benchmarks for evaluating LLMs? A good starting point could be this paper [1].
- pass@k (TODO: What are common values for k? Who came up with that metric?), but also others such as CodeBLEU, etc.
- If a tool is analyzed, the acceptance rate of generated artifacts could be interesting (how many artifacts were accepted/rejected by the user)
- Inter-model agreement (related to section on open LLM as baseline): Ask different LLMs or differently configured LLMs and determine their agreement
- …
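For pass@k, a commonly used formulation generates n samples per problem, counts the c samples that pass all tests, and estimates the probability that at least one of k randomly drawn samples is correct as 1 - C(n-c, k) / C(n, k). The sketch below assumes this estimator, which was popularized alongside the HumanEval benchmark; the function name and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k), i.e., one minus the probability
    that k samples drawn without replacement are all incorrect.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("Require 0 <= c <= n and 1 <= k <= n.")
    if n - c < k:
        return 1.0  # every draw of k samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples per problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The per-problem estimates are typically averaged over all problems of a benchmark, and studies should state both n and k because the values are not comparable across different settings.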
TODO: Maybe something along the lines of using different benchmarks? Being aware of their biases (e.g., focus on a particular programming language such as Python)?
- HumanEval https://github.com/openai/human-eval
- CoderEval https://github.com/CoderEval/CoderEval
- …
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [2]).
TODO: In some cases, there might be tools/methods using “traditional” approaches (like static analysis) that the LLM-based approach needs to be compared with. This is what is meant by “baselines” in the title.
Example
TODO
Benefits
TODO
Challenges
TODO
References
[1] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.
[2] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.
Report Limitations and Mitigations
Context
TODO
Recommendations
TODO: Number of repetitions, how were repetitions aggregated?, discuss limitations and mitigations
TODO: Discuss what makes the results of a presented study generalizable and why they are not model-dependent. Argue why the results will likely hold for a different (future) model or the next release of the LLM-based tool that was studied (e.g., ChatGPT)?
TODO: Discuss aspects such as increased performance (see benchmarking guidelines) vs. increased resource consumption and non-determinism.
TODO: Connect guideline to study types and for each type have bullet point lists with information that MUST, SHOULD, or MAY be reported (usage of those terms according to RFC 2119 [1]).
Example
TODO
Benefits
TODO
Challenges
TODO
References
[1] Network Working Group, “RFC 2119.” https://www.rfc-editor.org/rfc/rfc2119, 1997.