Reporting Checklist
The following checklist, inspired by CONSORT (Schulz, Altman, and Moher 2010), summarizes actionable items from the guidelines, based on their tl;dr sections. The checklist is organized along typical paper sections. Items marked ● are requirements (MUST); items marked ○ are recommendations (SHOULD). Each item references its source guideline (G1–G8). Items annotated with PAPER or SUPPLEMENTARY MATERIAL indicate where the information should be reported; unmarked items may be reported in either place.
Introduction
- ● Disclose any use of LLMs in the empirical study, specifying which LLM, how, and where it was used (G1).
- ● Report the purpose of using LLMs, automated tasks, and expected benefits in the PAPER (G1).
Research Design and Methods
Model Selection and Configuration
- ● Report the exact LLM model or tool version, configuration, and experiment date in the PAPER (G2).
- ● For fine-tuned models, describe the fine-tuning goal, dataset, and procedure in the PAPER (G2).
- ● Report default parameters and explain model and version choices (G2).
- ● For quantized models, report the quantization level (e.g., 4-bit, 8-bit) and method (e.g., GPTQ or AWQ) (G2).
- ● Compare base and fine-tuned models using suitable metrics and benchmarks; share fine-tuning data and weights as SUPPLEMENTARY MATERIAL (or justify in the PAPER why they cannot be shared) (G2).
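The configuration items above are easiest to report when they are captured programmatically at experiment time. The following sketch shows one way to do this; all concrete values (model name, parameters, date) are hypothetical placeholders, not prescribed by the guidelines.

```python
# Hypothetical example of capturing an LLM experiment's configuration
# (exact model version, sampling parameters, quantization, run date)
# so it can be reported verbatim in the paper (G2) and included in a
# replication package (G8). All values are illustrative.
import json

model_config = {
    "model": "gpt-4o-2024-08-06",    # exact model/tool version
    "temperature": 0.0,              # sampling parameters, incl. defaults
    "top_p": 1.0,
    "max_tokens": 1024,
    "quantization": None,            # e.g., "4-bit GPTQ" if applicable
    "experiment_date": "2025-01-15", # date the experiment was run
}

# Persisting the configuration alongside the results keeps the reported
# setup and the actual setup from drifting apart.
print(json.dumps(model_config, indent=2))
```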
Architecture
- ● Describe the full architecture of LLM-based tools in the PAPER, including the role of the LLM, interactions with other components, and overall system behavior (G3).
- ● For ensemble architectures, explain the coordination logic between models in the PAPER (G3).
- ● If autonomous agents are used, specify agent roles, reasoning frameworks, and communication flows (G3).
- ● Report hosting, hardware setup, and latency implications (G3).
- ● For tools using retrieval or augmentation methods, describe data sources, integration mechanisms, and update and versioning strategies (G3).
- ● Include architectural diagrams and justify design decisions (G3).
Prompts and Interactions
- ● Describe prompt development strategies (e.g., zero-shot, few-shot), rationale, and selection process in the PAPER (G4).
- ● Publish all prompts or, when using templates, prompt templates with representative instances, including their structure, content, formatting, and dynamic components, as SUPPLEMENTARY MATERIAL (G4).
- ● Document input handling and token optimization strategies when prompts are long or complex (G4).
- ● Report generation and collection processes for dynamically generated or user-authored prompts (G4).
- ● Specify prompt reuse across models and configurations (G4).
- ● If full prompt disclosure is not feasible, provide summaries or examples (G4).
- ● Report prompt revisions and pilot testing insights (G4).
- ● Include full interaction logs (prompts and responses) as SUPPLEMENTARY MATERIAL if privacy and confidentiality can be ensured (G4).
- ● For agentic systems, report all context files (e.g., AGENTS.md) used to configure AI agents as SUPPLEMENTARY MATERIAL (G4).
- ● For agentic systems, report developed plans and generalize interaction logs to include human-agent interaction traces as SUPPLEMENTARY MATERIAL (G4).
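To illustrate what publishing "prompt templates with representative instances" can look like, the sketch below pairs a template (with its dynamic component marked) with one concrete instantiation. The template text and placeholder name are invented for demonstration, not drawn from any specific study.

```python
# A hedged sketch of documenting a prompt template and its dynamic
# components (G4), together with a representative instance suitable
# for supplementary material. The task and wording are illustrative.
PROMPT_TEMPLATE = (
    "You are a software engineering expert.\n"
    "Classify the following commit message as BUGFIX, FEATURE, or OTHER.\n"
    "Commit message: {commit_message}\n"
    "Answer with a single label."
)

def instantiate(template: str, **dynamic_parts: str) -> str:
    """Fill the template's dynamic components to obtain a concrete prompt."""
    return template.format(**dynamic_parts)

# Representative instance of the template:
example_prompt = instantiate(
    PROMPT_TEMPLATE,
    commit_message="Fix off-by-one error in pagination",
)
print(example_prompt)
```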
Human Validation
- ● If using human validation, define the measured construct (e.g., usability, maintainability) and describe the measurement instrument in the PAPER (G5).
- ● When developing or adapting measurement instruments, share them (G5).
- ● Consider human validation early in the study design and build on established reference models for human-LLM comparison (G5).
- ● Validate LLM judgments against human judgment, report aggregation methods, and assess human-LLM agreement (G5).
- ● Control for confounding factors and conduct power analysis to ensure statistical robustness (G5).
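Assessing human-LLM agreement typically relies on a chance-corrected agreement measure. The following is a minimal sketch using Cohen's kappa (one common choice among several); the labels are made-up examples, and real studies would of course use their validated annotations.

```python
# Illustrative human-LLM agreement computation via Cohen's kappa (G5).
# Cohen's kappa corrects raw agreement for the agreement expected by
# chance given each rater's label distribution.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected)

human = ["pass", "fail", "pass", "pass", "fail", "pass"]
llm   = ["pass", "fail", "pass", "fail", "fail", "pass"]
print(round(cohens_kappa(human, llm), 3))  # → 0.667
```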
Benchmarks and Metrics
- ● Justify all benchmark and metric choices in the PAPER (G7).
- ● Explain in the PAPER why the selected metrics are suitable for the specific study (G7).
- ● Include an open LLM as a baseline when using commercial models and report inter-model agreement (G6).
- ● Summarize benchmark structure, task types, and limitations (G7).
- ● Justify the number of experiment repetitions, for example through a power analysis or by monitoring convergence of descriptive statistics (G7).
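One of the two justification strategies named above, monitoring convergence of descriptive statistics, can be sketched as follows. The stopping rule (running mean stable within a tolerance for a few consecutive repetitions), the threshold values, and the example scores are all illustrative assumptions.

```python
# A minimal sketch of justifying the number of repetitions (G7) by
# monitoring convergence of a descriptive statistic: here, the running
# mean of a score across repeated runs. Tolerance and window are
# illustrative choices, not prescribed values.
import statistics

def repetitions_until_stable(scores, tolerance=0.01, window=3):
    """Return the repetition count after which the running mean has
    changed by less than `tolerance` for `window` consecutive runs,
    or None if it never stabilizes within the available runs."""
    stable = 0
    prev_mean = None
    for i in range(1, len(scores) + 1):
        mean = statistics.fmean(scores[:i])
        if prev_mean is not None and abs(mean - prev_mean) < tolerance:
            stable += 1
            if stable >= window:
                return i
        else:
            stable = 0
        prev_mean = mean
    return None

scores = [0.80, 0.84, 0.82, 0.83, 0.825, 0.828, 0.826, 0.827]
print(repetitions_until_stable(scores))  # → 5
```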
Reproducibility, Ethics, and Resources
- ● Provide raw LLM outputs for reproducibility and discuss sensitive data handling (G8).
- ● Justify LLM usage in light of its resource demands (G8).
- ● If full data sharing is not possible, include a subset of the validation data for partial replication (G8).
- ● Provide a full replication package with step-by-step instructions as SUPPLEMENTARY MATERIAL (G6).
Results
- ● If comparing models or tools, use appropriate inferential statistics (e.g., hypothesis tests, effect sizes) rather than relying solely on summary statistics (G7).
- ● Repeat experiments due to the inherent non-determinism of LLMs and report the result distribution using descriptive statistics (G7).
- ● Use traditional (non-LLM) baselines for comparison where possible (G7).
- ● Report established metrics to make study results comparable; additional metrics may be reported where appropriate (G7).
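As a small illustration of reporting an effect size alongside summary statistics when comparing models, the sketch below computes Cliff's delta, a nonparametric effect size suited to score distributions from repeated runs. The two score lists are fabricated for demonstration only.

```python
# Illustrative effect-size computation (Cliff's delta) for comparing
# the score distributions of two models across repeated runs (G7).
# Delta ranges from -1 to 1; 0 means the distributions overlap fully.
def cliffs_delta(xs, ys):
    """Fraction of (x, y) pairs with x > y minus fraction with x < y."""
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))

model_a = [0.71, 0.74, 0.73, 0.75, 0.72]  # made-up repeated-run scores
model_b = [0.69, 0.70, 0.72, 0.68, 0.71]
print(round(cliffs_delta(model_a, model_b), 2))  # → 0.84
```

In practice, an effect size like this would accompany, not replace, the descriptive statistics of each run's results and an appropriate hypothesis test.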
Limitations and Threats to Validity
- ● Describe measurement constructs and methods; disclose any data leakage risks and avoid leaking evaluation data into LLM improvement pipelines (G8).
- ● Acknowledge non-disclosed confidential or proprietary components as reproducibility limitations (G3).
- ● Transparently report study limitations, including the impact of non-determinism and generalizability constraints (G8).
- ● Specify whether generalization across LLMs or across time was assessed, and discuss model and version differences (G8).
- ● Employ and report strategies to mitigate identified validity and reproducibility threats, such as replication packages, human validation, longitudinal re-runs, triangulation, and sensitivity analysis (G8).
References
Schulz, Kenneth F., Douglas G. Altman, and David Moher. 2010. “CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials.” BMJ 340: c332. https://doi.org/10.1136/bmj.c332.