Reporting Checklist

The following checklist, inspired by CONSORT (Schulz, Altman, and Moher 2010), summarizes actionable items from the guidelines based on their tl;dr sections. The checklist is organized by typical paper sections. Items marked ● are requirements (MUST); items marked ○ are recommendations (SHOULD). Each item references its source guideline (G1–G8).

Introduction

  • ● Disclose any use of LLMs, specifying which LLM, how, and where it was used (G1).
  • ○ Report the purpose of using LLMs, automated tasks, and expected benefits (G1).

Research Design and Methods

Model Selection and Configuration

  • ● Report the exact LLM model or tool version, configuration, and experiment date in the PAPER (G2); a minimal metadata sketch follows this list.
  • ● For fine-tuned models, describe the fine-tuning goal, dataset, and procedure (G2).
  • ○ Report default parameters and explain model and version choices (G2).
  • ○ For quantized models, report the quantization level and method (G2).
  • ○ Compare base and fine-tuned models using suitable metrics and benchmarks; share fine-tuning data and weights (or justify why they cannot be shared) (G2).
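
A minimal sketch of how the model and configuration details above could be captured for each experimental run is shown below; the field names and example values (such as the model snapshot identifier) are illustrative assumptions, not prescribed by G2.

```python
import json
from datetime import datetime, timezone

# Illustrative metadata record for one experimental run (field names and
# values are assumptions; adapt them to the study's setup).
experiment_metadata = {
    "model": "gpt-4o-2024-08-06",     # exact model/tool version (G2)
    "provider": "OpenAI",
    "parameters": {                   # include defaults the study relied on
        "temperature": 0.0,
        "top_p": 1.0,
        "max_tokens": 1024,
        "seed": 42,
    },
    "quantization": None,             # e.g., "4-bit GPTQ" for local models
    "fine_tuned_from": None,          # base model, if a fine-tune is used
    "experiment_date": datetime.now(timezone.utc).date().isoformat(),
}

# Store alongside the results so the PAPER and supplementary material agree.
with open("experiment_metadata.json", "w", encoding="utf-8") as f:
    json.dump(experiment_metadata, f, indent=2)
```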

Architecture

  • ● Describe the full architecture of LLM-based tools in the PAPER, including the role of the LLM, interactions with other components, and overall system behavior (G3).
  • ● If autonomous agents are used, specify agent roles, reasoning frameworks, and communication flows (G3).
  • ● Report hosting, hardware setup, and latency implications (G3).
  • ● For tools using retrieval or augmentation methods, describe data sources, integration mechanisms, and update and versioning strategies (G3).
  • ● For ensemble architectures, explain the coordination logic between models (G3).
  • ○ Include architectural diagrams and justify design decisions (G3).

Prompts and Interactions

  • ● Publish all prompts, including their structure, content, formatting, and dynamic components (G4).
  • ● Describe prompt development strategies (e.g., zero-shot, few-shot), rationale, and selection process (G4).
  • ● Document input handling and token optimization strategies when prompts are long or complex (G4).
  • ● Report generation and collection processes for dynamically generated or user-authored prompts (G4).
  • ● Specify prompt reuse across models and configurations (G4).
  • ● Report all context files (e.g., CLAUDE.md, .cursorrules) used to configure AI coding agents as SUPPLEMENTARY MATERIAL (G4).
  • ○ If full prompt disclosure is not feasible, provide summaries or examples (G4).
  • ○ Report prompt revisions and pilot testing insights (G4).
  • ○ For agentic systems, report the plans that agents develop and extend interaction logs to cover human-agent interaction traces (G4).
  • ○ Include full interaction logs (prompts and responses) if privacy and confidentiality can be ensured (G4); a minimal logging sketch follows this list.
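
The following sketch shows one way interaction logs could be collected as JSON Lines for the supplementary material; the schema (id, timestamp, model, prompt, response) and the example strings are assumptions, not a prescribed format.

```python
import json
import uuid
from datetime import datetime, timezone

def log_interaction(prompt: str, response: str, model: str,
                    log_path: str = "interaction_log.jsonl") -> None:
    """Append one prompt-response pair to a JSON Lines log (illustrative schema)."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,       # full prompt text, including few-shot examples
        "response": response,   # raw model output, before post-processing
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example usage with placeholder strings:
log_interaction("Summarize the following bug report: ...",
                "The report describes ...",
                "gpt-4o-2024-08-06")
```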

Human Validation

  • ● If using human validation, define the measured construct (e.g., usability, maintainability) and describe the measurement instrument in the PAPER (G5).
  • ○ Consider human validation early in the study design, build on established reference models for human-LLM comparison, and share instruments as SUPPLEMENTARY MATERIAL (G5).
  • ○ When aggregating LLM judgments, report the aggregation method and rationale, and assess inter-rater agreement (G5); a minimal agreement sketch follows this list.
  • ○ Control for confounding factors and conduct a power analysis to ensure statistical robustness (G5).
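
As one way to assess inter-rater agreement between, for example, an LLM judge and a human rater, a minimal sketch of Cohen's kappa for two raters is shown below; for more raters, ordinal scales, or weighted variants, established statistical tooling would be the better choice.

```python
from collections import Counter

def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items (minimal sketch)."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(rater_a) | set(rater_b))
    if expected == 1.0:  # both raters always assign the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Example: hypothetical LLM judgments vs. human judgments on five snippets
llm_labels = ["pass", "fail", "pass", "pass", "fail"]
human_labels = ["pass", "fail", "fail", "pass", "fail"]
print(f"Cohen's kappa: {cohen_kappa(llm_labels, human_labels):.2f}")  # ~0.62
```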

Benchmarks and Metrics

  • ● Justify all benchmark and metric choices in the PAPER (G7).
  • ● Explain why the selected metrics are suitable for the specific study (G7).
  • ○ Include an open LLM as a baseline when using commercial models and report inter-model agreement (G6).
  • ○ Summarize benchmark structure, task types, and limitations (G7).

Reproducibility, Ethics, and Resources

  • ● Provide model outputs and discuss sensitive data handling (G8).
  • ● Justify LLM usage in light of its resource demands (G8).
  • ○ Provide a full replication package with step-by-step instructions as part of the SUPPLEMENTARY MATERIAL (G6).
  • ○ Include a subset of the validation data for partial replication (G8).

Results

  • ○ Report established metrics to make study results comparable; additional metrics may be reported as appropriate (G7).
  • ○ Repeat experiments to account for the inherent non-determinism of LLMs and report the result distribution using descriptive statistics (G7); a minimal summary sketch follows this list.
  • ○ Use traditional (non-LLM) baselines for comparison where possible (G7).
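
A minimal sketch of how results from repeated runs could be summarized with descriptive statistics is shown below; the metric values and the number of repetitions are placeholders, and the choice of metric is study-specific.

```python
import statistics

def summarize_repeated_runs(scores: list[float]) -> dict:
    """Descriptive statistics over repeated runs of the same LLM experiment."""
    return {
        "n_runs": len(scores),
        "mean": round(statistics.mean(scores), 3),
        "stdev": round(statistics.stdev(scores), 3) if len(scores) > 1 else 0.0,
        "median": round(statistics.median(scores), 3),
        "min": min(scores),
        "max": max(scores),
    }

# Example: one accuracy value per repetition (placeholder values from ten
# re-runs with identical prompts and configuration)
print(summarize_repeated_runs([0.71, 0.68, 0.74, 0.70, 0.69,
                               0.73, 0.72, 0.70, 0.68, 0.71]))
```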

Limitations and Threats to Validity

  • ● Acknowledge non-disclosed confidential or proprietary components as reproducibility limitations (G3).
  • ● Transparently report study limitations, including the impact of non-determinism and generalizability constraints (G8).
  • ● Specify whether generalization across LLMs or across time was assessed, and discuss model and version differences (G8).
  • ● Describe measurement constructs and methods; disclose any data leakage risks and avoid leaking evaluation data into LLM improvement pipelines (G8).
  • ○ Employ and report mitigation strategies where applicable, such as replication packages, human validation, longitudinal re-runs, triangulation, and sensitivity analysis (G8).

References

Schulz, Kenneth F., Douglas G. Altman, and David Moher. 2010. “CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials.” BMJ 340: c332. https://doi.org/10.1136/bmj.c332.