Summary
Guideline Rationale and Core Recommendations
Each guideline is summarized below by its rationale and core recommendations. ● = must, ● = should. See the guidelines for the full statements, the applicability matrix for per-study-type severities, and the reporting checklist for an item-by-item breakdown.
| Guideline | Rationale | Core Recommendations |
|---|---|---|
| Declare Usage | Transparency enables informed assessment of scope and limitations. | ● Declare which LLM, how it was used, and where in the research process. |
| Model Version | Reproducibility requires precise identification of the system used in a study. | ● Report exact version, date, configuration, and fine-tuning details. ● Report defaults, checksums, and quantization; motivate model choice; acknowledge commercial-model reproducibility limits. |
| Design | Static artifacts determine the model’s input on every call and must be documented in full. | ● Describe system and agent architecture, infrastructure, prompts, agent configuration, tool catalog, and retrieval mechanisms. ● For studies of LLM usage, describe the tool architecture to the extent accessible. |
| Traces | Runtime traces make LLM and agent behavior verifiable despite non-determinism and tool opacity. | ● For studies of LLM usage, share full interaction logs subject to privacy constraints. ● Otherwise, share interaction logs, runtime traces, and plans where feasible. |
| Benchmarks & Metrics | Meaningful evaluation requires reasoned, valid measurement. | ● Justify metric and benchmark choices; discuss their validity. ● Define the phenomenon and sampling strategy; isolate confounders; analyze errors; repeat experiments and report distributions. |
| Open LLM | Reproducibility depends on access to the model under study. | ● Include an open LLM as a baseline; ensure it is independently reproducible from supplementary material; for benchmarks, design the harness for use with open models. |
| Human Validation | Automated metrics alone cannot ensure validity of subjective constructs. | ● Define the measured construct and share custom measurement instruments. ● When LLMs replace humans in research tasks, explain whether and how the replacement is justified. ● When LLMs replace humans in research tasks, ground the replacement with inter-model and model-to-human agreement. ● Validate against human judgment with inter-rater reliability; describe rater demographics for value-laden constructs. |
| Limitations | Honest acknowledgment of threats strengthens a study. | ● Discuss threats to internal validity (data leakage), reliability (non-determinism), and construct and external validity. ● Employ mitigations where possible. |
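
As an illustration of the Model Version and Benchmarks & Metrics rows, the following is a minimal sketch of how a study might record the model identification details and summarize repeated runs as a distribution. It is not part of the guidelines: the `RunReport` class, its field names, the `sha256_of` helper, and all example values (including the model name) are hypothetical and chosen only for this example.

```python
import hashlib
import json
import statistics
from dataclasses import asdict, dataclass, field
from typing import List, Optional


def sha256_of(text: str) -> str:
    """Checksum of an exact input artifact (e.g. the prompt text)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class RunReport:
    """Hypothetical record of details a study could report (assumed fields)."""
    model_name: str            # placeholder, not a real model identifier
    model_version: str         # exact version or snapshot label
    accessed_on: str           # date of use, ISO 8601
    temperature: float         # sampling configuration actually used
    seed: Optional[int]        # None if the provider exposes no seed
    prompt_sha256: str         # checksum of the exact prompt
    scores: List[float] = field(default_factory=list)  # one score per repeated run

    def summary(self) -> dict:
        """Describe the score distribution instead of a single number."""
        return {
            "runs": len(self.scores),
            "mean": statistics.mean(self.scores),
            # Sample standard deviation needs at least two runs.
            "stdev": statistics.stdev(self.scores) if len(self.scores) > 1 else 0.0,
            "min": min(self.scores),
            "max": max(self.scores),
        }


def example_report() -> RunReport:
    """Build a report from made-up values, purely for illustration."""
    prompt = "Summarize the following bug report in one sentence."
    return RunReport(
        model_name="example-model",
        model_version="2024-01-01-snapshot",
        accessed_on="2024-01-15",
        temperature=0.0,
        seed=42,
        prompt_sha256=sha256_of(prompt),
        scores=[0.5, 0.75, 1.0],
    )


if __name__ == "__main__":
    report = example_report()
    # Serialize both the configuration and the distribution for the supplement.
    print(json.dumps({"config": asdict(report), "summary": report.summary()}, indent=2))
```

Storing a checksum rather than only a description of the prompt lets readers confirm they are rerunning the same input, and reporting the spread across runs makes the effect of non-determinism visible.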