Guidelines

A primary goal of our guidelines is to enable reproducibility and replicability of empirical SE studies involving LLMs. Repeating LLM-focused research to verify results lies somewhere between the ACM’s definitions of reproducibility (different team, same research artifacts) and replicability (different team, different research artifacts) (ACM Publications Board 2020), because models may change and research artifacts are often imperfect. We therefore follow Angermeir et al. (2025) and use the two terms interchangeably. While previous guidelines regarding open science and empirical studies still apply, LLM-specific characteristics (e.g., inherent non-determinism (Song et al. 2025; Yuan et al. 2025) and opaque, proprietary models) present additional replicability challenges, which in turn demand new guidance.

Each guideline below begins with a brief tl;dr summary, followed by its rationale, recommendations, examples, benefits, challenges, and links to the relevant study types. The rationale articulates the underlying principle, i.e., why the guideline matters, while the recommendations provide concrete, actionable practices.

The guidelines further contain advice for reviewers. Broadly, reviewers should use our guidelines to interrogate the extent to which a manuscript’s authors have done what was practically possible to improve reproducibility. We must neither accept research that lacks reasonable efforts to improve reproducibility, nor reject research for failing to attain an impossible goal. We borrow this principle of “reasonable efforts” from the SIGSOFT Empirical Standards (Ralph et al. 2021), where it applies to methodological rigor more generally. Of course, papers should acknowledge their limitations, but to determine whether these limitations are reasonable, reviewers should ask “have the authors done what they could to minimize limitations?” Our guidelines attempt to capture what “reasonable efforts” practically means for LLM-based studies.

To distinguish essential criteria from recommendations, our guidelines use two tiers. A MUST criterion is a requirement. Studies that intend to follow these guidelines are expected to meet all MUST criteria. A SHOULD criterion represents a desired practice that strengthens a study’s rigor or transparency. However, there may be valid reasons to deviate in particular circumstances (e.g., resource constraints, inapplicability to a specific study context or type). Nonetheless, studies that deviate from a SHOULD criterion should briefly justify the deviation and discuss its potential impact (e.g., on validity or reproducibility).
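The two-tier rule above can be read as a simple check: every unmet MUST criterion is a violation, while an unmet SHOULD criterion is acceptable only if the deviation is justified. The following is a minimal sketch of that reading, not part of the guidelines themselves. The names `Tier`, `Criterion`, and `check_compliance`, as well as the example criteria, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Tier(Enum):
    MUST = "MUST"      # requirement
    SHOULD = "SHOULD"  # desired practice; deviations need a justification


@dataclass
class Criterion:
    name: str                            # hypothetical label, not an official criterion
    tier: Tier
    met: bool
    justification: Optional[str] = None  # expected when a SHOULD criterion is not met


def check_compliance(criteria: List[Criterion]) -> List[str]:
    """Return a list of problems; an empty list means the two-tier rule is satisfied."""
    problems = []
    for c in criteria:
        if c.met:
            continue
        if c.tier is Tier.MUST:
            problems.append(f"MUST criterion not met: {c.name}")
        elif not (c.justification and c.justification.strip()):
            problems.append(f"SHOULD criterion not met and not justified: {c.name}")
    return problems


# Hypothetical self-assessment of a study (criteria names are placeholders).
example = [
    Criterion("criterion A", Tier.MUST, met=True),
    Criterion("criterion B", Tier.SHOULD, met=False,
              justification="Resource constraints; impact discussed under threats to validity."),
    Criterion("criterion C", Tier.SHOULD, met=False),
    Criterion("criterion D", Tier.MUST, met=False),
]

for problem in check_compliance(example):
    print(problem)
```

In this sketch, a justified SHOULD deviation (criterion B) raises no problem, whereas an unjustified SHOULD deviation (criterion C) and an unmet MUST criterion (criterion D) are both reported. Whether a given justification is adequate remains a judgment call for authors and reviewers that such a check cannot capture.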

The following sections indicate which information we expect researchers to report, and whether it should be in the PAPER or SUPPLEMENTARY MATERIAL. Where a publication venue’s page limits hinder reporting all expected elements in the PAPER, it is better to report essential information in the SUPPLEMENTARY MATERIAL than not at all. The SUPPLEMENTARY MATERIAL should be published according to the ACM SIGSOFT Open Science Policies (Graziotin 2024).

Guidelines by Study Type

Note: Each guideline’s study-type-specific guidance is detailed in the corresponding subsection.

The guideline’s core recommendations:
● = MUST be followed for this study type.
= SHOULD be followed for this study type.
– = are not directly applicable for this study type.

Study types (table columns):
S1: Annotators
S2: Judges
S3: Synthesis
S4: Subjects
S5: LLM Usage
S6: Tools
S7: Benchmarking

Guidelines (table rows):
G1: Declare Usage
G2: Model Version
G3: Architecture
G4: Prompts
G5: Human Validation
G6: Open LLM
G7: Benchmarks
G8: Limitations
