Summary

Guideline Rationale and Core Recommendations

Each guideline is summarized below by its rationale and core recommendations. ● = must, ○ = should. See the guidelines for the full statements, the applicability matrix for per-study-type severities, and the reporting checklist for an item-by-item breakdown.

| Guideline | Rationale | Core Recommendations |
| --- | --- | --- |
| Declare Usage | Transparency enables informed assessment of scope and limitations. | ● Declare which LLM was used, how it was used, and where in the research process. |
| Model Version | Reproducibility requires precise identification of the system used in a study. | ● Report exact version, date, configuration, and fine-tuning details. |
| | | ○ Report defaults, checksums, and quantization; motivate model choice; acknowledge commercial-model reproducibility limits. |
| Design | Static artifacts determine the model’s input on every call and must be documented in full. | ● Describe system and agent architecture, infrastructure, prompts, agent configuration, tool catalog, and retrieval mechanisms. |
| | | ○ For LLM usage, describe the tool architecture to the extent accessible. |
| Traces | Runtime traces make LLM and agent behavior verifiable despite non-determinism and tool opacity. | ● For studies of LLM usage, share full interaction logs subject to privacy constraints. |
| | | ○ Otherwise, share interaction logs, runtime traces, and plans where feasible. |
| Benchmarks & Metrics | Meaningful evaluation requires reasoned, valid measurement. | ● Justify metric and benchmark choices; discuss their validity. |
| | | ○ Define the phenomenon and sampling strategy; isolate confounders; analyze errors; repeat experiments and report distributions. |
| Open LLM | Reproducibility depends on access to the model under study. | ○ Include an open LLM as a baseline; ensure it is independently reproducible from supplementary material; for benchmarks, design the harness for use with open models. |
| Human Validation | Automated metrics alone cannot ensure validity of subjective constructs. | ● Define the measured construct and share custom measurement instruments. |
| | | ● When LLMs replace humans in research tasks, explain whether and how the replacement is justified. |
| | | ○ When LLMs replace humans in research tasks, ground the replacement with inter-model and model-to-human agreement. |
| | | ○ Validate against human judgment with inter-rater reliability; describe rater demographics for value-laden constructs. |
| Limitations | Honest acknowledgment of threats strengthens a study. | ● Discuss threats to internal validity (data leakage), reliability (non-determinism), and construct and external validity. |
| | | ○ Employ mitigations where possible. |
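
The Human Validation recommendations mention model-to-human agreement and inter-rater reliability without naming a statistic. As one possible illustration, the sketch below computes Cohen's kappa between a human rater and an LLM that labeled the same items. The choice of Cohen's kappa, the function name `cohen_kappa`, and the example labels are assumptions made for illustration only; the guidelines do not prescribe a particular measure, and other statistics (for example Krippendorff's alpha) may fit a given study better.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items (illustrative helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each label, the product of the two raters' label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined here,
        # and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels, invented for this example (not data from any study).
human = ["bug", "bug", "feature", "bug", "feature", "feature", "bug", "feature"]
model = ["bug", "feature", "feature", "bug", "feature", "bug", "bug", "feature"]

kappa = cohen_kappa(human, model)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.50"
```

Here the raters agree on 6 of 8 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. Reporting the raw agreement alongside kappa, and repeating the computation per label category, would make it easier for readers to judge whether the replacement is justified.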