Rationale and Recommendations

Each guideline is summarized below by its rationale and core recommendations. ● = must, ○ = should. See the guidelines for the full statements, the applicability matrix for per-study-type severities, and the reporting checklist for an item-by-item breakdown.

Guideline	Rationale	Core Recommendations
Declare Usage	Transparency enables informed assessment of scope and limitations.	● Declare which LLM, how it was used, and where in the research process.
Model Version	Reproducibility requires precise identification of the system used in a study.	● Report exact version, date, configuration, and fine-tuning details.
Design	Static artifacts determine what the model sees on every call and must be documented in full.	● Describe architecture, infrastructure, prompts and templates (with development strategy), context files, tool catalog, agent architecture, and retrieval mechanisms. ○ For studies of LLM usage, describe the tool architecture to the extent it is accessible.
Traces	Runtime traces make LLM and agent behavior verifiable despite non-determinism and tool opacity.	● For studies of LLM usage, share full interaction logs subject to privacy constraints. ○ Otherwise, share interaction logs, runtime traces, and plans where feasible.
Benchmarks	Meaningful evaluation requires valid, well-reasoned measurement.	● Justify metric and benchmark choices; discuss their validity. ○ Define the phenomenon and sampling strategy; isolate confounders; analyze errors; repeat experiments and report distributions.
Open LLM	Reproducibility depends on access to the model under study.	● For benchmarking studies, include an open LLM. ○ Otherwise, include an open LLM as a baseline; provide a replication package.
Human Validation	Automated metrics alone cannot ensure validity of subjective constructs.	● Define the measured construct and share any custom measurement instruments. ○ Validate against human judgment with inter-rater reliability; describe rater demographics for value-laden constructs.
Limitations	Honest acknowledgment of threats strengthens a study.	● Discuss threats to internal validity (data leakage), reliability (non-determinism), and construct and external validity. ○ Employ mitigations where possible.
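
The Benchmarks recommendation to repeat experiments and report distributions can be illustrated with a minimal Python sketch. It is one possible way to summarize repeated runs, not a format the guidelines prescribe. The function name `summarize_runs`, the choice of summary statistics, and the example scores are all hypothetical.

```python
import statistics


def summarize_runs(scores):
    """Summarize one metric across repeated runs of the same experiment.

    Hypothetical helper: the guidelines ask for distributions to be
    reported but do not prescribe these particular statistics.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to report a distribution")
    ordered = sorted(scores)
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }


# Made-up scores from five repeated runs with an identical configuration.
runs = [0.62, 0.58, 0.65, 0.60, 0.61]
summary = summarize_runs(runs)
print(summary)
```

Reporting the spread alongside the mean lets readers judge whether a difference between two configurations exceeds the run-to-run variation caused by non-determinism.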