Rationale and Recommendations
Each guideline is summarized below by its rationale and core recommendations. ● = must, ○ = should. See the guidelines for the full statements, the applicability matrix for per-study-type severities, and the reporting checklist for an item-by-item breakdown.
| Guideline | Rationale | Core Recommendations |
|---|---|---|
| Declare Usage | Transparency enables informed assessment of scope and limitations. | ● Declare which LLM, how it was used, and where in the research process. |
| Model Version | Reproducibility requires precise identification of the system used in a study. | ● Report exact version, date, configuration, and fine-tuning details. |
| Design | Static artifacts determine what the model sees on every call and must be documented in full. | ● Describe architecture, infrastructure, prompts and templates (with development strategy), context files, tool catalog, agent architecture, and retrieval mechanisms. ○ For studies of LLM usage, describe the tool architecture to the extent accessible. |
| Traces | Runtime traces make LLM and agent behavior verifiable despite non-determinism and tool opacity. | ● For studies of LLM usage, share full interaction logs subject to privacy constraints. ○ Otherwise, share interaction logs, runtime traces, and plans where feasible. |
| Benchmarks | Meaningful evaluation requires well-reasoned, valid measurement. | ● Justify metric and benchmark choices; discuss their validity. ○ Define the phenomenon and sampling strategy; isolate confounders; analyze errors; repeat experiments and report distributions. |
| Open LLM | Reproducibility depends on access to the model under study. | ● For benchmarking studies, include an open LLM. ○ Otherwise, include an open LLM as a baseline; provide a replication package. |
| Human Validation | Automated metrics alone cannot ensure validity of subjective constructs. | ● Define the measured construct and share any custom measurement instruments. ○ Validate against human judgment with inter-rater reliability; describe rater demographics for value-laden constructs. |
| Limitations | Honest acknowledgment of threats strengthens a study. | ● Discuss threats to internal validity (e.g., data leakage), reliability (e.g., non-determinism), construct validity, and external validity. ○ Employ mitigations where possible. |
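
To make the Human Validation recommendation concrete, the following is a minimal sketch of one possible inter-rater reliability check. It assumes two raters (for example, a human annotator and an LLM-based judge) who assign categorical labels to the same items, and it uses Cohen's kappa as the agreement statistic. The guidelines summarized above do not prescribe a particular statistic; the function name `cohens_kappa` and the example labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items.

    Illustrative helper (not part of any guideline or library).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six items: human annotator vs. LLM-based judge.
human = ["pass", "pass", "fail", "fail", "pass", "fail"]
llm = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human, llm)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this made-up example the raters agree on four of six items (observed agreement 0.667) while chance agreement is 0.5, so kappa comes out at about 0.333. With more than two raters or with ordinal scales, a different statistic such as Krippendorff's alpha may be more appropriate.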