Report System and Prompt Design
Summary: Researchers must describe in the paper the full architecture of LLM-based tools, from standalone uses to agentic systems, including the LLM’s role and its interactions with other components. Hosting, infrastructure, and latency must be reported. Researchers must publish all prompts as supplementary material, including prompt templates with representative instances and the dynamic generation process where applicable; sensitive content must be anonymized. If full prompt disclosure is not feasible, summaries or examples should be provided. The prompting strategy and prompt reuse across models and configurations must be specified, and how the prompts were developed should be described. Researchers must describe context-file mechanisms used and summarize in the paper which tools and skills were exposed; full file contents and the complete tool catalog (schemas, definitions, Model Context Protocol (MCP) servers) should be published as supplementary material. For agentic systems, researchers must specify the agents’ roles, reasoning frameworks, communication flows, and agent behavior traceability across LLM output, tool calls, and user interactions. For retrieval-augmented generation (RAG) and similar methods, researchers must describe how external data was retrieved, stored, and integrated. Where legally possible, the implementation should be open-sourced; non-disclosed proprietary components must be acknowledged as reproducibility limitations.
Rationale
LLM-based studies rest on artifacts that researchers design, author, or configure before any model is invoked: software layers that pre-process data, prepare prompts, filter user requests, or post-process responses, and the context those layers feed into each call (Galster et al. 2026). Context includes prompt templates, context files, tool and skill schemas, and retrieval mechanisms that bring external data into the call. For example, ChatGPT and GitHub Copilot use the same underlying models, but their outputs differ substantially because Copilot automatically adds project context. Researchers can also build tools using models directly via APIs.
Prompts are central to any LLM-based study (Sclar et al. 2024). A prompt is a concrete input to an LLM that guides its output (Schulhoff et al. 2024). Depending on the task, a prompt may include instructions (e.g., “classify the following bug report”), background information, input data, and output format specifications (e.g., “respond as JSON with fields ‘category’ and ‘justification’”), with outputs ranging from unstructured text to structured formats such as JSON (DAIR.AI 2024). A prompt template is a parameterized structure containing static elements (e.g., instructions, output format specifications) and placeholders for variable content (e.g., source code under analysis) that are filled in at runtime to construct concrete prompts (Schulhoff et al. 2024). In automated studies, researchers typically design prompt templates from which individual prompts are then instantiated. Prompts substantially influence a model’s output, so how they are formatted and integrated into an LLM-based study is essential for transparency, verifiability, and reproducibility. Sclar et al. (2024) question the methodological validity of comparing models with “an arbitrarily chosen, fixed prompt format”, because their research has shown that the performance of different prompt formats only weakly correlates between different models.
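The distinction between a prompt template and a concrete prompt can be sketched in a few lines. This is a minimal illustration only: the template text, the placeholder names `title` and `body`, and the helper `instantiate` are hypothetical and not taken from any cited study.

```python
from string import Template

# Hypothetical template: static instructions plus two placeholders
# ($title, $body) that are filled in at runtime.
BUG_REPORT_TEMPLATE = Template(
    "Classify the following bug report.\n"
    "Respond as JSON with fields 'category' and 'justification'.\n"
    "Title: $title\n"
    "Report:\n"
    "$body\n"
)


def instantiate(template: Template, **values: str) -> str:
    """Fill every placeholder; raises KeyError if a value is missing."""
    return template.substitute(**values)


# One concrete prompt instantiated from the template.
prompt = instantiate(
    BUG_REPORT_TEMPLATE,
    title="Crash on startup",
    body="App exits with code 1 when the config file is empty.",
)
print(prompt)
```

Using `substitute` rather than `safe_substitute` means an unfilled placeholder fails loudly, so a template cannot silently leak a literal `$body` into a prompt that is later reported as a representative instance.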
This guideline addresses everything researchers put in place before runs begin, complementing Version and Configuration (model-specific details) and Session Traces (runtime behavior). It does not apply when LLMs are used solely for language polishing, paraphrasing, translation, tone or style adaptation, or layout edits (see Scope).
Recommendations
Researchers must clearly describe the tool architecture and what exactly the LLM (or ensemble of LLMs) contributes to the tool or method presented in a research paper, including any dependencies on proprietary tools that affect reproducibility. Researchers should explain design decisions and which retrieval mechanisms were implemented (e.g., keyword search, semantic similarity matching, rule-based extraction). Researchers should describe how the models were hosted and accessed; for time-sensitive measurements, the hosting choice (e.g., self-hosting on local hardware, an aggregator such as OpenRouter, or a vendor API such as the OpenAI API) can substantially impact results, so researchers must clarify whether local infrastructure or cloud services were used, including detailed infrastructure specifications and latency considerations. Where legally possible (e.g., not restricted by industry-partner agreements), researchers should release the source code of their implementation under an open-source license.
Standalone LLMs require limited architectural documentation, while study setups that integrate LLMs with other components, configure agents, or depend on particular context files or tool schemas require more detailed reporting. For each of the topics covered in the paragraphs that follow, the paper must contain a high-level description of every component that applies to the study, with full details provided as supplementary material.
Prompt Reporting:
Researchers must report all prompts used in an empirical study, including instructions, background information, input data, and output indicators. The complete set must be made publicly available as supplementary material; when the paper cannot include all prompts in full, it should provide summaries and representative examples. The only exception to this disclosure requirement is when privacy, anonymity, or confidentiality concerns (e.g., when working with industry partners) prevent complete disclosure; in such cases, researchers should provide summaries and representative examples instead. For prompts containing sensitive information, researchers must anonymize personal identifiers, replace proprietary code with placeholders, and clearly highlight modified sections. When prompt templates are used, researchers must report them alongside representative instances, specifying which parts are static and which are dynamically filled. When prompts are generated dynamically (e.g., through preprocessing or retrieval-augmented generation (RAG)), the generation process must be thoroughly documented, including any automated algorithms or rules. Researchers should specify the exact formatting of prompts, including how code snippets were enclosed (e.g., triple backticks), whether whitespace was preserved, and how other artifacts such as error messages and stack traces were presented. For studies involving human participants who create or modify prompts, researchers should describe how these prompts were collected and analyzed.
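One way to make anonymization and snippet formatting explicit is to implement both as small, documented functions. The sketch below is an assumption-laden example: it only redacts e-mail addresses, and the marker `[REDACTED_EMAIL]` and the function names are hypothetical. Real studies will usually need broader rules (names, tokens, proprietary identifiers).

```python
import re

# Triple backticks, built from a single backtick so this listing stays readable.
FENCE = "`" * 3

# Simple e-mail pattern; an illustrative rule, not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize(text: str) -> tuple[str, int]:
    """Replace e-mail addresses with a visible marker; return text and count."""
    return EMAIL.subn("[REDACTED_EMAIL]", text)


def fence_code(code: str, language: str = "") -> str:
    """Enclose a snippet in triple backticks without altering its whitespace."""
    return f"{FENCE}{language}\n{code}\n{FENCE}"


snippet = "# contact: jane.doe@example.com\ndef f():\n    return 1"
redacted, replacements = anonymize(snippet)
prompt_part = fence_code(redacted, "python")
print(prompt_part)
print(f"modified sections: {replacements}")
```

Returning the replacement count alongside the text makes it easy to report how many sections were modified, and the visible marker highlights where the original content was changed.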
Prompt Development:
Researchers should explain in the paper how they developed the prompts and why they decided to follow certain prompting strategies. If prompts from the early phases of a research project are unavailable, researchers should at least summarize the prompt evolution.
Prompt development is often iterative, involving collaboration between human researchers and AI tools. Researchers should report any instances in which LLMs were used to suggest prompt refinements and how these suggestions were incorporated. A prompt changelog can track prompt evolution, including revisions and reasons for changes (e.g., v1.0: initial prompt; v2.0: incorporated examples of ideal responses). Because prompt effectiveness varies between models and model versions, researchers must make clear which prompts were used for which models in which versions and with which configuration.
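A prompt changelog can be kept as structured data so that it can be published with the supplementary material and queried per model. The following is a minimal sketch; the field names, the model identifiers (`model-a-2025-01`, `model-b-2025-03`), and the stated reasons are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptRevision:
    version: str
    change: str
    reason: str
    models: tuple[str, ...]  # model identifiers this revision was run against


# Hypothetical changelog entries for illustration only.
changelog = [
    PromptRevision("v1.0", "initial prompt", "baseline", ("model-a-2025-01",)),
    PromptRevision(
        "v2.0",
        "incorporated examples of ideal responses",
        "outputs ignored the requested format",
        ("model-a-2025-01", "model-b-2025-03"),
    ),
]


def versions_for_model(log: list[PromptRevision], model: str) -> list[str]:
    """List the prompt versions that were used with a given model."""
    return [rev.version for rev in log if model in rev.models]


# Serialize the changelog so it can be archived as supplementary material.
print(json.dumps([asdict(rev) for rev in changelog], indent=2))
print(versions_for_model(changelog, "model-b-2025-03"))
```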
Prompting Strategy and Input Handling:
Researchers must specify whether zero-shot, one-shot, or few-shot prompting was used. For few-shot prompts, the examples provided to the model should be clearly outlined, along with the rationale for selecting them. If multiple versions of a prompt were tested, researchers should describe how these variations were evaluated and how the final design was chosen.
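The difference between zero-shot, one-shot, and few-shot prompting can be made concrete with a single prompt builder whose behavior depends only on the number of examples supplied. This is an illustrative sketch; the `Input:`/`Label:` layout and the function names are assumptions, not a format prescribed by this guideline.

```python
def build_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Zero-shot if `examples` is empty, one-shot for one example, few-shot otherwise."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nLabel: {label}")
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)


def strategy_name(examples: list[tuple[str, str]]) -> str:
    """Name the prompting strategy implied by the number of examples."""
    return {0: "zero-shot", 1: "one-shot"}.get(len(examples), "few-shot")


# Hypothetical few-shot examples for a bug-report classification task.
examples = [
    ("App crashes when saving", "bug"),
    ("Please add dark mode", "feature request"),
]
print(strategy_name(examples))
print(build_prompt("Classify the following report.", examples, "Login button does nothing"))
```

Because the examples are an explicit argument, the exact set given to the model can be archived alongside the rationale for selecting it.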
When dealing with extensive or complex prompt context, researchers should describe the strategies they used to handle input length constraints (e.g., truncating, summarizing, or splitting prompts into multiple parts). Token optimization measures, such as simplifying code formatting or removing unnecessary comments, should also be documented if applied.
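Truncation and splitting can be documented precisely when they are implemented as explicit functions with a stated budget. The sketch below is a simplification: it counts whitespace-separated words instead of model tokens, and joining on single spaces normalizes whitespace, so it would not be suitable where whitespace must be preserved. A real study would substitute the tokenizer of the model under study.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def truncate(text: str, budget: int) -> str:
    """Keep only the first `budget` words."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return " ".join(text.split()[:budget])


def split_into_chunks(text: str, budget: int) -> list[str]:
    """Split the text into consecutive chunks of at most `budget` words."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    words = text.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]


log = "error at line 1 error at line 2 error at line 3"
print(count_tokens(log))
print(truncate(log, 4))
print(split_into_chunks(log, 4))
```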
Context Files and Agent Configuration:
Agentic coding tools expose a range of configuration and extension mechanisms including context files, skills, subagents, hooks, settings, rules, and Model Context Protocol (MCP) servers (Galster et al. 2026). Context files such as AGENTS.md or CLAUDE.md are version-controlled Markdown files containing persistent, project-specific instructions automatically included in the context on each call (Mohsenimofidi et al. 2026). Researchers must describe which context-file mechanisms were used and should publish the full contents as supplementary material, because they steer agent behavior similarly to system prompts. Because context files are version-controlled, their evolution across a study is recoverable from the project’s git history and from the runtime traces reported under Session Traces.
Tool Catalog and Skill Definitions:
Agentic systems expose a set of callable tools, skills, and sub-agents to the model on each call, along with their schemas: parameter names, types, and natural-language descriptions as the model sees them. Tools vary in granularity, from one operation per tool to generic execution interfaces. Claude Code, for example, does not register individual command-line utilities but defines a single Bash tool (PowerShell on Windows) through which the agent runs arbitrary shell commands such as grep or ls (Anthropic 2026). The catalog, the wording of descriptions, and the order in which tools are presented all influence which tool the model selects, so the exact serialized form matters for reproducibility. Researchers must summarize in the paper which tools and skills were exposed to the model, and should include the complete list (names with one-line purposes), tool schemas, skill definitions, sub-agent definitions, and the identities of any connected Model Context Protocol (MCP) servers as supplementary material. Where sub-agents are used, researchers should also describe the delegation pattern: which sub-agent handles which kind of task, and how control returns to the orchestrator.
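Because the order and wording of tool descriptions matter, one option is to archive the exact serialized catalog together with a hash of it. The sketch below assumes a catalog represented as a list of dictionaries; the tool names `run_shell` and `read_file` and their schemas are hypothetical and do not reproduce any vendor's actual tool definitions.

```python
import hashlib
import json

# Hypothetical tool catalog, in the order presented to the model.
tool_catalog = [
    {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {"command": {"type": "string", "description": "Command to run"}},
    },
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "parameters": {"path": {"type": "string", "description": "Relative file path"}},
    },
]


def serialize_catalog(catalog: list[dict]) -> str:
    """Serialize without sorting keys, preserving tool and field order."""
    return json.dumps(catalog, indent=2, ensure_ascii=False)


def fingerprint(serialized: str) -> str:
    """SHA-256 of the exact serialized form, for reporting alongside results."""
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def summarize(catalog: list[dict]) -> list[str]:
    """One-line purposes, suitable for the summary in the paper body."""
    return [f"{tool['name']}: {tool['description']}" for tool in catalog]


print("\n".join(summarize(tool_catalog)))
print(fingerprint(serialize_catalog(tool_catalog)))
```

Since the serialization preserves order, presenting the same tools in a different order yields a different fingerprint, which makes such differences between runs detectable.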
Pipelines and Agentic Systems:
If an LLM is used as a standalone system, for example by sending prompts directly to a GPT-5 model via the OpenAI API without pre-processing the prompts or post-processing the responses, a brief explanation is usually sufficient. For complex systems with pre-processing, retrieval mechanisms, or autonomous agents, aspects to consider include how the LLM interacts with other components such as databases, external APIs, and frameworks. If the LLM is part of an agent-based system that autonomously plans or executes tasks, researchers must describe its exact architecture, including the agents’ roles (e.g., planner, executor, coordinator), whether it is a single-agent or multi-agent system, how it interacts with external tools and users, and the reasoning framework used (e.g., chain-of-thought, self-reflection, multi-turn dialogue). For agent-based systems that use external tools (e.g., Claude Code and its subagents), researchers must describe agent behavior traceability by delineating three categories of events: (1) LLM input/output: the internal deliberation, planning, and interpretation performed by the model; (2) external tool calls: the agent’s explicit invocations of APIs, databases, file systems, or other external services (e.g., via the Model Context Protocol (MCP)); (3) user or system interactions: human-in-the-loop feedback, environment responses, or multi-agent communication. This separation lets readers attribute task success to the LLM, the external tools, or their interaction; runtime trace reporting for these components is covered in Session Traces.
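The three-way separation of agent behavior can be represented by tagging every recorded event with its source. The following is a minimal sketch under assumed names: the `Source` enum, the `Event` fields, and the example events are hypothetical and only illustrate the categorization, not any particular tool's trace format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    LLM_IO = "llm_input_output"
    TOOL_CALL = "external_tool_call"
    INTERACTION = "user_or_system_interaction"


@dataclass(frozen=True)
class Event:
    step: int
    source: Source
    summary: str


# Hypothetical event sequence for one agent run.
events = [
    Event(1, Source.INTERACTION, "user asks to fix a failing test"),
    Event(2, Source.LLM_IO, "model plans to inspect the test file"),
    Event(3, Source.TOOL_CALL, "read file tests/test_parser.py"),
    Event(4, Source.LLM_IO, "model proposes a patch"),
]


def count_by_source(evts: list[Event]) -> Counter:
    """Count how many events fall into each of the three categories."""
    return Counter(event.source for event in evts)


for source, count in count_by_source(events).items():
    print(f"{source.value}: {count}")
```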
RAG and Ensembles:
If retrieval-augmented generation (RAG) or related methods were used (e.g., rule-based retrieval, structured query generation, or hybrid approaches), researchers must describe how external data was retrieved, stored, and integrated into the LLM’s responses, including the type of storage (e.g., vector databases, relational databases, knowledge graphs) and how retrieved information was selected. Stored data used for context augmentation should be reported, including data preprocessing, versioning, and update frequency; if not confidential, an anonymized snapshot should be made available as supplementary material. For ensemble models, in addition to following the Version and Configuration guideline for each model, researchers should describe the architecture connecting them: the routing logic that determines which model handles which input, model interactions, and the output combination strategy (e.g., majority voting, weighted averaging, sequential processing).
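To illustrate what a description of retrieval, integration, and output combination might cover, the sketch below implements keyword-overlap retrieval, context augmentation, and majority voting. It is deliberately simplistic: the in-memory `documents` store, the scoring rule, and all function names are assumptions for illustration, standing in for the vector database or routing logic a real system would use.

```python
from collections import Counter

# Hypothetical document store; a real system might use a vector database.
documents = {
    "doc-1": "The parser raises ValueError on empty input",
    "doc-2": "Logging is configured in settings.py",
    "doc-3": "Empty input files are rejected by the parser",
}


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Keyword search: rank documents by number of shared lowercase words."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.lower().split())), doc_id)
        for doc_id, text in docs.items()
    ]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [doc_id for score, doc_id in ranked[:k] if score > 0]


def augment(query: str, docs: dict[str, str], doc_ids: list[str]) -> str:
    """Integrate the retrieved documents into the prompt as labeled context."""
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}"


def majority_vote(answers: list[str]) -> str:
    """Output combination for an ensemble: ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


hits = retrieve("parser empty input", documents)
print(augment("parser empty input", documents, hits))
print(majority_vote(["bug", "feature", "bug"]))
```

Even for a pipeline this small, a reader would need the scoring rule, the tie-breaking order, the value of `k`, and the context layout to reproduce the prompts that reach the model.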
Examples
Schäfer et al. (2024) evaluated LLMs for automated unit test generation, providing a detailed description of the system architecture including code parsing, prompt formulation, LLM interaction, and test suite integration. They also detail the datasets used, including sources, selection criteria, and preprocessing steps.
A second example is the IVIE tool by Yan et al. (2024), which integrates LLMs into the VS Code interface. The authors document the tool architecture, detailing the IDE integration, context extraction from code editors, and the formatting pipeline for LLM-generated explanations.
The paper by Liang et al. (2024) is a good example of comprehensive prompt reporting. The authors make the exact prompts available in their supplementary material on Figshare, including details such as code blocks enclosed in triple backticks. The paper explains the rationale behind the prompt design and the data output format, and it includes an overview figure and two concrete examples, keeping the main text concise while remaining reproducible.
Benefits
Documenting the architecture and supplemental data of LLM-based systems strengthens reproducibility and transparency (Lu et al. 2024), enabling experiment replication, result validation, and cross-study comparison. Detailed prompt documentation strengthens verifiability, reproducibility, and comparability of LLM-based studies, letting other researchers replicate studies, refine prompts, and evaluate how different content types and formatting choices influence LLM behavior. Reporting the tool catalog and context files enables other researchers to reconstruct the exact context the model saw, which is often the dominant factor in agent behavior.
Challenges
Documenting LLM-based architectures involves challenges such as proprietary APIs and dependencies that restrict disclosure, managing large-scale retrieval databases, and ensuring efficient query execution. Researchers must also balance transparency with data privacy concerns, adapt to the evolving nature of LLM integrations, and handle the complexity of multi-agent interactions and decision-making logic, all of which can impact reproducibility and system clarity. Prompts themselves are challenging to document because they often combine multiple components such as code, error messages, and explanatory text, and privacy or confidentiality concerns can hinder sharing.
Not all systems allow reporting of complete (system) prompts, context files, and tool schemas. For commercial tools, researchers must report all available information and acknowledge unknown aspects as limitations. Disclosure practices vary across vendors: some publish their system prompts, others keep them proprietary. Where prompts are published, they also change between releases (Willison 2026), so researchers should record the tool version and date of use even when the prompt content itself is unavailable. Understanding the suggestions of commercial tools such as GitHub Copilot might require recreating the exact state of the codebase at the time the suggestion was made, and this context is difficult to report. One solution is to use version control to capture the exact state of the codebase when a recommendation was made, keeping track of the files that were automatically added as context. We also recommend exploring open-source tools such as OpenCode (OpenCode Contributors 2025), which expose more of the configuration that controls agent behavior.
Study Types
This guideline must be followed for all studies that involve tools with system-level components beyond bare LLMs, from lightweight wrappers that pre-process user input or post-process model outputs, to systems employing retrieval-augmented methods or complex agent-based architectures. It also must be followed by all studies that use concrete prompts or prompt templates.
For LLMs for Tools, this guideline is of primary importance: researchers must describe the tool’s full architecture, explain how prompts were generated and structured within the tool, report the context files, tool catalog, and skill definitions used, and document how the role of each LLM fits into the overall system behavior. For Benchmarking LLMs, researchers should describe the evaluation harness and infrastructure if it goes beyond bare model API calls (e.g., custom sandboxing, orchestration layers, or post-processing pipelines); see Anthropic (2025) for a practitioner account of evaluation harness design for agent-based systems. Researchers using pre-defined prompts (e.g., HumanEval, SWE-Bench) must specify the benchmark version and any modifications made to the prompts or evaluation setup. If prompt tuning, retrieval-augmented generation (RAG), or other methods were used to adapt prompts, researchers must disclose and justify those changes, and should make the relevant code publicly available. For Studying LLM Usage, researchers should describe the tool architecture of the studied tool to the extent that it is accessible, as architectural details may influence observed usage patterns. For controlled experiments under Studying LLM Usage, exact prompts must be reported for all conditions. For LLMs as Annotators, researchers must document any predefined coding guides or instructions included in prompts, as these influence how the model labels artifacts. For LLMs as Judges, researchers must report the evaluation criteria, scales, and examples embedded in the prompt to ensure consistent interpretation. For LLMs for Synthesis tasks (e.g., summarization, aggregation), researchers must document the sequence of prompts used to generate and refine outputs, including follow-ups and clarification queries. For LLMs as Subjects (e.g., simulating human participants), researchers must report any role-playing instructions, constraints, or personas used to guide LLM behavior.
For LLMs as Annotators, LLMs as Judges, LLMs for Synthesis, and LLMs as Subjects, if the research setup involves a custom pipeline (e.g., retrieval-augmented generation for annotation or chained prompts for synthesis), the architecture should also be reported.
Advice for Reviewers
As with other guidelines, missing architectural or prompt information is typically a minor revision request unless so much is missing that methodological rigor cannot be assessed. Reviewers may ask authors to move key details from supplements into the paper body to ensure the main text is self-contained.
Regarding proprietary or confidential components, three principles apply: (1) empirically evaluating an opaque artifact is a valid scientific contribution; (2) the less available and inspectable the artifact, the weaker the contribution; (3) scientific instruments including tools, metrics, scales, and experimental materials must be fully disclosed, even when the object under study cannot be.
A challenging aspect of prompt reporting is the description of prompt development, which is inherently iterative and creative. Reviewers should not expect justification of each word choice or post-hoc rationalized accounts of prompt generation. Instead, reviewers should focus on whether (1) a new research team could use exactly the same prompts in exactly the same way, without having to reproduce how they were developed, and (2) potential biases or validity issues are transparent.
See Also
- Report Model Version, Configuration, and Customizations: Every architecture runs on at least one specific model; name its version.
- Report Session Traces: Architecture and prompts are static; session traces show how they behave at runtime.
- Use Human Validation for LLM Outputs: Human evaluation often complements automated metrics for tool outputs.
- Use an Open LLM as a Baseline: Open models let other researchers run the reported prompts and schemas on the exact same model weights.
- Report Limitations and Mitigations: When a tool hides internal prompts or schemas, authors must report the gap as a limitation.
References
Anthropic. 2025. “Demystifying Evals for AI Agents.” https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents.
———. 2026. “Claude Code Tools Reference.” https://code.claude.com/docs/en/tools-reference.
DAIR.AI. 2024. “Elements of a Prompt.” https://www.promptingguide.ai/introduction/elements.
Galster, Matthias, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes. 2026. “Configuring Agentic AI Coding Tools: An Exploratory Study.” In Proceedings of the 3rd ACM International Conference on AI-Powered Software (AIware 2026).
Liang, Jenny T., Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, and Thomas Zimmermann. 2024. “Can GPT-4 Replicate Empirical Software Engineering Research?” Proc. ACM Softw. Eng. 1 (FSE): 1330–53. https://doi.org/10.1145/3660767.
Lu, Qinghua, Liming Zhu, Xiwei Xu, Zhenchang Xing, and Jon Whittle. 2024. “Toward Responsible AI in the Era of Generative AI: A Reference Architecture for Designing Foundation Model-Based Systems.” IEEE Softw. 41 (6): 91–100. https://doi.org/10.1109/MS.2024.3406333.
Mohsenimofidi, Seyedmoein, Matthias Galster, Christoph Treude, and Sebastian Baltes. 2026. “Context Engineering for AI Agents in Open-Source Software.” In Proceedings of the 23rd IEEE/ACM International Conference on Mining Software Repositories (MSR 2026). https://arxiv.org/abs/2510.21413.
OpenCode Contributors. 2025. “OpenCode: The Open Source AI Coding Agent.” https://opencode.ai/.
Schäfer, Max, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.” IEEE Trans. Software Eng. 50 (1): 85–105. https://doi.org/10.1109/TSE.2023.3334955.
Schulhoff, Sander, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, et al. 2024. “The Prompt Report: A Systematic Survey of Prompting Techniques.” CoRR abs/2406.06608. https://doi.org/10.48550/ARXIV.2406.06608.
Sclar, Melanie, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting.” In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=RIu5lyNXjT.
Willison, Simon. 2026. “Opus system prompt.” https://simonwillison.net/2026/Apr/18/opus-system-prompt/.
Yan, Litao, Alyssa Hwang, Zhiyuan Wu, and Andrew Head. 2024. “Ivie: Lightweight Anchored Explanations of Just-Generated Code.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, edited by Florian ’Floyd’ Mueller, Penny Kyburz, Julie R. Williamson, Corina Sas, Max L. Wilson, Phoebe O. Toups Dugas, and Irina Shklovski, 140:1–15. ACM. https://doi.org/10.1145/3613904.3642239.