Reporting Checklist

The following checklist, inspired by CONSORT (Schulz, Altman, and Moher 2010), summarizes actionable items from the guidelines based on their tl;dr sections. The checklist is organized by typical paper sections. Items marked ● are requirements (MUST); items marked ○ are recommendations (SHOULD). Each item references its source guideline (G1–G8).

Introduction

  • ● Disclose any use of LLMs, specifying which LLM, how, and where it was used (G1).
  • ○ Report the purpose of using LLMs, automated tasks, and expected benefits (G1).

Research Design and Methods

Model Selection and Configuration

  • ● Report the exact LLM model or tool version, configuration, and experiment date in the PAPER (G2); a minimal metadata sketch follows this list.
  • ● For fine-tuned models, describe the fine-tuning goal, dataset, and procedure (G2).
  • ○ Report default parameters and explain model and version choices (G2).
  • ○ For quantized models, report the quantization level and method (G2).
  • ○ Compare base and fine-tuned models using suitable metrics and benchmarks; share fine-tuning data and weights (or justify why they cannot be shared) (G2).
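
A minimal sketch of how the model and configuration details above could be captured for each experimental run is shown below; the field names and example values (such as the model snapshot identifier) are illustrative assumptions, not prescribed by G2.

```python
import json
from datetime import datetime, timezone

# Illustrative metadata record for one experimental run (field names and
# values are assumptions; adapt them to the study's setup).
experiment_metadata = {
    "model": "gpt-4o-2024-08-06",     # exact model/tool version (G2)
    "provider": "OpenAI",
    "parameters": {                   # include defaults the study relied on
        "temperature": 0.0,
        "top_p": 1.0,
        "max_tokens": 1024,
        "seed": 42,
    },
    "quantization": None,             # e.g., "4-bit GPTQ" for local models
    "fine_tuned_from": None,          # base model, if a fine-tune is used
    "experiment_date": datetime.now(timezone.utc).date().isoformat(),
}

# Store alongside the results so the PAPER and supplementary material agree.
with open("experiment_metadata.json", "w", encoding="utf-8") as f:
    json.dump(experiment_metadata, f, indent=2)
```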

Architecture

  • ● Describe the full architecture of LLM-based tools in the PAPER, including the role of the LLM, interactions with other components, and overall system behavior (G3).
  • ● If autonomous agents are used, specify agent roles, reasoning frameworks, and communication flows (G3).
  • ● Report hosting, hardware setup, and latency implications (G3).
  • ● For tools using retrieval or augmentation methods, describe data sources, integration mechanisms, and update and versioning strategies (G3).
  • ● For ensemble architectures, explain the coordination logic between models (G3).
  • ○ Include architectural diagrams and justify design decisions (G3).

Prompts and Interactions

  • ● Publish all prompts, including their structure, content, formatting, and dynamic components (G4).
  • ● Describe prompt development strategies (e.g., zero-shot, few-shot), rationale, and selection process (G4).
  • ● Document input handling and token optimization strategies when prompts are long or complex (G4).
  • ● Report generation and collection processes for dynamically generated or user-authored prompts (G4).
  • ● Specify prompt reuse across models and configurations (G4).
  • ● Report all context files (e.g., CLAUDE.md, .cursorrules) used to configure AI coding agents as SUPPLEMENTARY MATERIAL (G4).
  • ○ If full prompt disclosure is not feasible, provide summaries or examples (G4).
  • ○ Report prompt revisions and pilot testing insights (G4).
  • ○ For agentic systems, report the plans that agents develop and extend interaction logs to cover human-agent interaction traces (G4).
  • ○ Include full interaction logs (prompts and responses) if privacy and confidentiality can be ensured (G4); a minimal logging sketch follows this list.
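
The following sketch shows one way interaction logs could be collected as JSON Lines for the supplementary material; the schema (id, timestamp, model, prompt, response) and the example strings are assumptions, not a prescribed format.

```python
import json
import uuid
from datetime import datetime, timezone

def log_interaction(prompt: str, response: str, model: str,
                    log_path: str = "interaction_log.jsonl") -> None:
    """Append one prompt-response pair to a JSON Lines log (illustrative schema)."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,       # full prompt text, including few-shot examples
        "response": response,   # raw model output, before post-processing
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example usage with placeholder strings:
log_interaction("Summarize the following bug report: ...",
                "The report describes ...",
                "gpt-4o-2024-08-06")
```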

Human Validation

  • ● If using human validation, define the measured construct (e.g., usability, maintainability) and describe the measurement instrument in the PAPER (G5).
  • ○ Consider human validation early in the study design, build on established reference models for human-LLM comparison, and share instruments as SUPPLEMENTARY MATERIAL (G5).
  • ○ When aggregating LLM judgments, report the aggregation method and rationale, and assess inter-rater agreement (G5); a minimal agreement sketch follows this list.
  • ○ Control for confounding factors and conduct a power analysis to ensure statistical robustness (G5).
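
As one way to assess inter-rater agreement between, for example, an LLM judge and a human rater, a minimal sketch of Cohen's kappa for two raters is shown below; for more raters, ordinal scales, or weighted variants, established statistical tooling would be the better choice.

```python
from collections import Counter

def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items (minimal sketch)."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(rater_a) | set(rater_b))
    if expected == 1.0:  # both raters always assign the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Example: hypothetical LLM judgments vs. human judgments on five snippets
llm_labels = ["pass", "fail", "pass", "pass", "fail"]
human_labels = ["pass", "fail", "fail", "pass", "fail"]
print(f"Cohen's kappa: {cohen_kappa(llm_labels, human_labels):.2f}")  # ~0.62
```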

Benchmarks and Metrics

  • ● Justify all benchmark and metric choices in the PAPER (G7).
  • ● Explain why the selected metrics are suitable for the specific study (G7).
  • ○ Include an open LLM as a baseline when using commercial models and report inter-model agreement (G6).
  • ○ Summarize benchmark structure, task types, and limitations (G7).

Reproducibility, Ethics, and Resources

  • ● Provide model outputs and discuss sensitive data handling (G8).
  • ● Justify LLM usage in light of its resource demands (G8).
  • ○ Provide a full replication package with step-by-step instructions as part of the SUPPLEMENTARY MATERIAL (G6).
  • ○ Include a subset of the validation data for partial replication (G8).

Results

  • ○ Report established metrics to make study results comparable; additional metrics may be reported as appropriate (G7).
  • ○ Repeat experiments to account for the inherent non-determinism of LLMs and report the result distribution using descriptive statistics (G7); a minimal summary sketch follows this list.
  • ○ Use traditional (non-LLM) baselines for comparison where possible (G7).
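
A minimal sketch of how results from repeated runs could be summarized with descriptive statistics is shown below; the metric values and the number of repetitions are placeholders, and the choice of metric is study-specific.

```python
import statistics

def summarize_repeated_runs(scores: list[float]) -> dict:
    """Descriptive statistics over repeated runs of the same LLM experiment."""
    return {
        "n_runs": len(scores),
        "mean": round(statistics.mean(scores), 3),
        "stdev": round(statistics.stdev(scores), 3) if len(scores) > 1 else 0.0,
        "median": round(statistics.median(scores), 3),
        "min": min(scores),
        "max": max(scores),
    }

# Example: one accuracy value per repetition (placeholder values from ten
# re-runs with identical prompts and configuration)
print(summarize_repeated_runs([0.71, 0.68, 0.74, 0.70, 0.69,
                               0.73, 0.72, 0.70, 0.68, 0.71]))
```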

Limitations and Threats to Validity

  • ● Acknowledge non-disclosed confidential or proprietary components as reproducibility limitations (G3).
  • ● Transparently report study limitations, including the impact of non-determinism and generalizability constraints (G8).
  • ● Specify whether generalization across LLMs or across time was assessed, and discuss model and version differences (G8).
  • ● Describe measurement constructs and methods; disclose any data leakage risks and avoid leaking evaluation data into LLM improvement pipelines (G8).
  • ○ Employ and report mitigation strategies where applicable, such as replication packages, human validation, longitudinal re-runs, triangulation, and sensitivity analysis (G8).

References

Schulz, Kenneth F., Douglas G. Altman, and David Moher. 2010. “CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials.” BMJ 340: c332. https://doi.org/10.1136/bmj.c332.