Report System and Prompt Design
Summary: Researchers must describe in the paper the full architecture of LLM-based tools, from standalone uses to agentic systems, including the LLM’s role and its interactions with other components. Hosting and infrastructure must be reported. Researchers must publish all prompts as supplementary material, including prompt templates with representative instances and the dynamic generation process where applicable; sensitive content must be anonymized. If full prompt disclosure is not feasible, summaries or examples should be provided. The prompting strategy and prompt reuse across models and configurations must be specified, and how the prompts were developed should be described. Researchers must describe context-file mechanisms used and summarize in the paper which tools and skills were exposed; full file contents and the complete tool catalog (schemas, definitions, Model Context Protocol (MCP) servers) should be published as supplementary material. For agentic systems, researchers must specify the agents’ roles, reasoning frameworks, and communication flows; where external tools are used, the model’s reasoning, tool calls, and interactions with users or the environment must be reported separately. For retrieval-augmented generation (RAG) and similar methods, researchers must describe how external data was retrieved, stored, and integrated. Where legally possible, the implementation should be open-sourced; non-disclosed proprietary components must be acknowledged as reproducibility limitations.
Rationale
LLM-based studies rest on artifacts that researchers design, author, or configure before any model is invoked: software layers that pre-process data, prepare prompts, filter user requests, or post-process responses, and the context those layers feed into each call (Galster et al. 2026). Context includes prompt templates, context files, tool and skill schemas, and retrieval mechanisms that bring external data into model invocations. For example, ChatGPT and GitHub Copilot can run on the same underlying models, yet their outputs differ substantially because Copilot automatically adds project context. Researchers can also build tools using models directly via APIs.
Prompts are central to any LLM-based study (Sclar et al. 2024). A prompt is a concrete input to an LLM that guides its output (Schulhoff et al. 2024). Depending on the task, a prompt may include instructions (e.g., “classify the following bug report”), task context, input data, and output format specifications (e.g., “respond as JSON with fields ‘category’ and ‘justification’”), with outputs ranging from unstructured text to structured formats such as JSON (DAIR.AI 2024). A prompt template is a parameterized structure containing static elements (e.g., instructions, output format specifications) and placeholders for variable content (e.g., source code under analysis) that are filled in at runtime to construct concrete prompts (Schulhoff et al. 2024). In automated studies, researchers typically design prompt templates from which individual prompts are then instantiated. Prompts substantially influence a model’s output, so how they are formatted and integrated into an LLM-based study is essential for transparency, verifiability, and reproducibility. Sclar et al. (2024) question the methodological validity of comparing models with “an arbitrarily chosen, fixed prompt format”, because their research has shown that the performance of different prompt formats only weakly correlates between different models.
This guideline covers the architecture, prompts, and configuration used in an LLM-based study, complementing Version and Configuration (model-specific details) and Session Traces (runtime behavior). It does not apply when LLMs are used solely for language polishing, paraphrasing, translation, tone or style adaptation, or layout edits (see Scope).
Recommendations
Researchers must clearly describe the tool architecture and what exactly the LLM (or ensemble of LLMs) contributes to the tool or method presented in a research paper, including any dependencies on proprietary tools that affect reproducibility. Researchers should justify substantive architectural choices where alternatives existed (e.g., why a particular agentic framework or tool catalog was selected). Researchers should describe how the models were hosted and accessed. For time-sensitive measurements, the hosting choice (e.g., self-hosting on local hardware, an aggregator such as OpenRouter, or a vendor API such as the OpenAI API) can substantially affect results, so researchers must clarify whether local infrastructure or cloud services were used, including the specific hardware for local hosting (e.g., GPU model and VRAM) or the service tier for cloud APIs (latency reporting is covered under Benchmarks and Metrics). Where legally possible (e.g., not restricted by industry-partner agreements), researchers should release the source code of their implementation under an open-source license.
Reporting requirements scale with system complexity: minimal for standalone LLMs, detailed for pipelines, agentic systems, and any context files or tool schemas they depend on. In the topic-specific paragraphs that follow, the paper must contain a high-level description of any reported component, with full details provided as supplementary material.
Prompt Reporting.
Researchers must report all prompts used in an empirical study, including instructions, task context, input data, and output indicators. The complete set must be made publicly available as supplementary material, with representative examples in the paper itself. When confidentiality (e.g., industry-partner agreements) prevents full publication, researchers should publish summaries and representative examples instead. For prompts that can be partially shared, researchers must anonymize personal identifiers, replace proprietary code with placeholders, and clearly highlight modified sections. When prompt templates are used, researchers must report them alongside representative instances, specifying which parts are static and which are dynamically filled. When prompts are generated dynamically, researchers must document the code or rules that assemble each prompt from runtime inputs. Researchers should specify the exact formatting of prompts, including how code snippets were enclosed (e.g., triple backticks), whether whitespace was preserved, and how other artifacts such as error messages and stack traces were formatted. For studies involving human participants who create or modify prompts, researchers should describe how these prompts were collected and analyzed.
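As an illustration of reporting a template alongside a representative instance, the following minimal sketch uses Python's standard `string.Template`. The template wording, the placeholder names (`language`, `code`), and the helpers `instantiate` and `dynamic_fields` are hypothetical and are not taken from any specific study.

```python
from string import Template

FENCE = "`" * 3  # triple backticks used to enclose code snippets

# Static parts: the instruction and the output format specification.
# Dynamic parts: ${language} and ${code}, filled in at runtime.
PROMPT_TEMPLATE = Template(
    "Classify the defect in the following ${language} code.\n"
    "Respond as JSON with fields 'category' and 'justification'.\n"
    + FENCE + "${language}\n${code}\n" + FENCE
)


def instantiate(language: str, code: str) -> str:
    """Build one concrete prompt; whitespace in `code` is preserved verbatim."""
    return PROMPT_TEMPLATE.substitute(language=language, code=code)


def dynamic_fields(template: Template) -> list[str]:
    """List the placeholder names in order of first appearance."""
    names: list[str] = []
    for match in template.pattern.finditer(template.template):
        name = match.group("named") or match.group("braced")
        if name and name not in names:
            names.append(name)
    return names


# A representative instance that could be shown next to the template.
prompt = instantiate("python", "def f(x):\n    return x / 0")
print(dynamic_fields(PROMPT_TEMPLATE))
print(prompt)
```

Publishing both the template and a few printed instances lets readers see which parts are static, which are filled dynamically, and how code was enclosed.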
Prompt Development.
Researchers should explain in the paper how they developed the prompts and why they decided to follow certain prompting strategies. If prompts from the early phases of a research project are unavailable, researchers should at least summarize the prompt evolution. Prompt development is often iterative, involving collaboration between human researchers and AI tools. Researchers should report any instances in which LLMs were used to suggest prompt refinements and how these suggestions were incorporated. A prompt changelog can track prompt evolution, including revisions and reasons for changes (e.g., v1.0: initial prompt; v2.0: added few-shot examples). Because prompt effectiveness varies between models and model versions, researchers must make clear which prompts were used for which models in which versions and with which configuration.
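A prompt changelog that also records which revision was used with which model could be kept as structured data. The sketch below is one possible layout, not a prescribed format; the class `PromptRevision`, its fields, the model identifiers, and the stated reasons are all hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptRevision:
    version: str             # e.g., "v1.0"
    change: str              # what was revised
    reason: str              # why it was revised
    models: tuple[str, ...]  # model identifiers this revision was run with


# Hypothetical entries; the model names are placeholders, not real identifiers.
CHANGELOG = [
    PromptRevision("v1.0", "initial prompt", "baseline", ("model-a-2025-01",)),
    PromptRevision(
        "v2.0",
        "added few-shot examples",
        "outputs did not follow the requested format",
        ("model-a-2025-01", "model-b-2025-03"),
    ),
]


def revisions_for(model: str) -> list[str]:
    """Return the prompt versions that were used with the given model."""
    return [r.version for r in CHANGELOG if model in r.models]


def to_json(changelog: list[PromptRevision]) -> str:
    """Serialize the changelog so it can be shipped as supplementary material."""
    return json.dumps([asdict(r) for r in changelog], indent=2)


print(revisions_for("model-b-2025-03"))
print(to_json(CHANGELOG))
```

In practice, each entry would also reference the model version and configuration reported under Version and Configuration.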
Prompting Strategy and Input Handling.
Researchers must specify whether zero-shot, one-shot, or few-shot prompting was used. For few-shot prompts, researchers must explain in the paper how the examples were selected and should include the concrete examples in the supplementary material. If multiple versions of a prompt were tested, researchers should describe how these variations were evaluated and how the final design was chosen. When dealing with extensive or complex prompt context, researchers should describe the strategies they used to handle input length constraints (e.g., truncating, summarizing, or splitting prompts into multiple parts). Token optimization measures, such as simplifying code formatting or removing unnecessary comments, should also be documented if applied.
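To make a splitting strategy reportable, the rule that assigns input to chunks can be stated as code. The sketch below splits an input greedily by lines under a token budget. It assumes a crude whitespace-based token estimate (`approx_tokens`), which is a stand-in for the tokenizer of the model actually used; both function names are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words, at least 1 per line.

    This is an assumption for illustration; real tokenizers count differently.
    """
    return max(1, len(text.split()))


def split_into_chunks(lines: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily pack consecutive lines into chunks of at most `max_tokens`.

    A single line that exceeds the budget is kept whole in its own chunk,
    so no content is silently dropped.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for line in lines:
        cost = approx_tokens(line)
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append(current)
    return chunks


lines = ["a b c", "d e", "f g h i", "j"]
chunks = split_into_chunks(lines, max_tokens=5)
print(chunks)
```

Reporting the budget, the token estimate, and the handling of oversized lines tells readers exactly which parts of the input each prompt contained.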
Pipelines and Complex Systems.
If the LLM is used in a standalone setup, with prompts sent directly to a model via an API and no pre-processing of inputs or post-processing of outputs, researchers must state this explicitly. If the LLM is part of a complex system (e.g., with pre-processing or post-processing stages), researchers must describe each component’s role and the data flow between them. For systems using retrieval-augmented generation (RAG) or related methods (e.g., rule-based retrieval, structured query generation, or hybrid approaches), researchers must additionally describe how external data was retrieved, stored (e.g., in vector databases, knowledge graphs), and selected for inclusion in the model’s context. The data used for retrieval should also be reported, including its preprocessing, versioning, and update frequency. If not confidential, an anonymized snapshot should be made available as supplementary material. For ensemble models, in addition to following the Version and Configuration guideline for each model, researchers should describe the architecture connecting them: the routing logic that determines which model handles which input, model interactions, and the output combination strategy (e.g., majority voting, weighted averaging, sequential processing).
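For ensembles, the output combination strategy can be documented precisely by publishing its implementation. The following sketch shows one simple variant of majority voting over labels returned by several models. The function name `majority_vote` and the tie-breaking rule (first label encountered wins) are assumptions made for this example.

```python
from collections import Counter


def majority_vote(labels: list[str]) -> tuple[str, float]:
    """Return the most frequent label and its share of the votes.

    Ties are resolved in favor of the label that appears first in `labels`,
    which follows from the ordering guarantee of Counter.most_common.
    """
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)


# Labels produced by three hypothetical models for the same input.
winner, share = majority_vote(["bug", "feature", "bug"])
print(winner, round(share, 2))
```

Whatever strategy is used, the tie-breaking behavior is worth stating explicitly, since it can change results when models disagree evenly.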
Agentic Systems.
If the LLM is part of an agentic system that autonomously plans or executes tasks, researchers must additionally describe the agents’ roles (e.g., planner, executor, coordinator), whether the system is single-agent or multi-agent, how the agents interact with external tools and users, and the reasoning framework used (e.g., chain-of-thought, self-reflection, multi-turn dialogue). For agentic systems that use external tools (e.g., Claude Code and its subagents), researchers must distinguish three kinds of activity: (1) the model’s reasoning, planning, and outputs; (2) tool calls (e.g., to APIs, databases, file systems, or Model Context Protocol (MCP) servers); (3) interactions with users, the environment, or other agents. Reporting these separately lets readers understand whether a result came from the model, a tool, or their interaction. How to record the runtime traces of each is covered in Session Traces.
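The three kinds of activity can be kept apart by tagging every recorded event with its kind. The sketch below is one way to do this. The names `Activity`, `Event`, and `count_by_activity`, as well as the example events, are hypothetical and do not correspond to the trace format of any particular tool.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Activity(Enum):
    MODEL = "model reasoning, planning, and outputs"
    TOOL_CALL = "tool call"
    INTERACTION = "interaction with users, the environment, or other agents"


@dataclass(frozen=True)
class Event:
    step: int
    activity: Activity
    summary: str


def count_by_activity(events: list[Event]) -> Counter:
    """Count events per activity kind, e.g., for a summary table in the paper."""
    return Counter(event.activity for event in events)


# Hypothetical excerpt of one agent run.
events = [
    Event(1, Activity.INTERACTION, "user asks to fix a failing test"),
    Event(2, Activity.MODEL, "plan: locate the test, inspect the assertion"),
    Event(3, Activity.TOOL_CALL, "read_file on the test module"),
    Event(4, Activity.MODEL, "proposes a patch"),
]
counts = count_by_activity(events)
for activity, n in counts.items():
    print(f"{activity.name}: {n}")
```

A per-kind summary of this sort helps readers judge whether a result is attributable to the model, to a tool, or to their interaction.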
Context Files and Agent Configuration.
Researchers can tailor agentic tools through configuration mechanisms such as context files, skills, subagents, hooks, settings, and rules (Galster et al. 2026). A configuration artifact is a concrete instance of such a mechanism: a single file (e.g., a context file or a subagent file) or a directory bundling several files (e.g., a skill folder containing SKILL.md alongside scripts, references, and assets). Since configuration mechanisms steer agent behavior in the same way as system prompts, researchers must describe which configuration mechanisms were used and should publish all configuration artifacts as supplementary material. Configuration mechanisms and their artifacts must be reported with the same level of detail as general prompts, including their development process and any iterations. When context files are version-controlled, their evolution across a study can be recovered from the project’s Git history and from the runtime traces reported under Session Traces. Where subagents are used, researchers should also describe the delegation pattern, including which subagent handles which task and how control is returned to the orchestrator.
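One way to document exactly which configuration artifacts were in effect is a manifest that maps each file to a content hash. The sketch below assumes all artifacts live under a single directory. The function `build_manifest`, the file name `context.md`, the directory layout, and the file contents are hypothetical; only `SKILL.md` is taken from the description above.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under `root` (relative POSIX path) to its SHA-256 digest."""
    manifest = {}
    for path in root.rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return dict(sorted(manifest.items()))  # stable order for publication


# Usage with a throwaway directory standing in for a project's configuration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "context.md").write_bytes(b"Run the tests before committing.\n")
    skill_dir = root / "skills" / "review"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_bytes(b"Review diffs for style issues.\n")
    manifest = build_manifest(root)

print(json.dumps(manifest, indent=2))
```

A manifest like this can accompany the published artifacts so that readers can verify they are looking at the same file contents that were used in the study.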
Tool Catalog and MCP Servers.
On each call, agentic systems expose a set of tools to the model along with their schemas (e.g., parameter names, types, and natural-language descriptions). Tools differ in granularity. An editor agent might expose one tool per editing operation (e.g., read_file, apply_edit), while Claude Code defines a single Bash tool (PowerShell on Windows) through which the agent runs arbitrary shell commands such as grep or ls (Anthropic 2026). The catalog, the wording of descriptions, and the order in which tools are presented all influence which tool the model selects, so the exact serialized form matters for reproducibility. The same applies to the list of MCP servers made available to the model. Researchers must summarize in the paper which tools were exposed to the model, and should include a complete list with names and purposes, tool schemas, and the names of any connected MCP servers as supplementary material.
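Because wording and order both matter, a tool catalog can be published in the exact order in which it was presented, together with a fingerprint that changes whenever any description or the ordering changes. The sketch below uses the tool names `read_file` and `apply_edit` from the example above. The schema layout, the descriptions, and the function names are hypothetical and do not follow any specific vendor or MCP format.

```python
import hashlib
import json

# Hypothetical catalog in the order in which it is presented to the model.
TOOL_CATALOG = [
    {
        "name": "read_file",
        "description": "Return the contents of a file.",
        "parameters": {"path": {"type": "string"}},
    },
    {
        "name": "apply_edit",
        "description": "Replace a span of text in a file.",
        "parameters": {
            "path": {"type": "string"},
            "old_text": {"type": "string"},
            "new_text": {"type": "string"},
        },
    },
]


def serialize_catalog(tools: list[dict]) -> str:
    """Serialize without reordering tools or keys, so the order is preserved."""
    return json.dumps(tools, ensure_ascii=False, indent=2)


def catalog_fingerprint(tools: list[dict]) -> str:
    """SHA-256 over the serialized catalog; sensitive to wording and order."""
    return hashlib.sha256(serialize_catalog(tools).encode("utf-8")).hexdigest()


original = catalog_fingerprint(TOOL_CATALOG)
reordered = catalog_fingerprint(list(reversed(TOOL_CATALOG)))
print(original != reordered)
```

The serialized catalog would go into the supplementary material, while the paper itself summarizes which tools were exposed.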
Examples
Schäfer et al. (2024) evaluated LLMs for automated unit test generation, providing a detailed description of the system architecture including code parsing, prompt formulation, LLM interaction, and test suite integration. They also detail the datasets used, including sources, selection criteria, and preprocessing steps.
A second example is the Ivie tool by Yan et al. (2024), which integrates LLMs into the VS Code interface. The authors document the tool architecture, detailing the IDE integration, context extraction from code editors, and the formatting pipeline for LLM-generated explanations.
The paper by Liang et al. (2024) is a good example of comprehensive prompt reporting. The authors make the exact prompts available in their supplementary material on Figshare, including details such as code blocks enclosed in triple backticks. The paper explains the rationale behind the prompt design and the data output format, and it includes an overview figure and two concrete examples, keeping the main text concise while remaining reproducible.
Benefits
Documenting the tool architecture and hosting infrastructure of LLM-based systems strengthens reproducibility and transparency, enabling experiment replication, result validation, and cross-study comparison. Prompt documentation provides similar benefits, letting other researchers replicate studies, refine prompts, and evaluate how content and formatting choices influence LLM behavior. Reporting configuration mechanisms, the tool catalog, and MCP servers additionally lets other researchers reconstruct the configured context, often the dominant factor in agent behavior.
Challenges
Documenting LLM-based architectures involves several challenges: proprietary APIs and dependencies restrict disclosure, large-scale retrieval databases are difficult to manage, and efficient query execution must be ensured. Researchers must also balance transparency with data privacy concerns, adapt to the evolving nature of LLM integrations, and handle the complexity of multi-agent interactions and decision-making logic, all of which can impact reproducibility and system clarity. Prompts themselves are challenging to document because they often combine multiple components such as code, error messages, and explanatory text, and privacy or confidentiality concerns can hinder sharing.
Not all systems allow reporting of complete (system) prompts, context files, and tool schemas. For commercial tools, researchers must report all available information and acknowledge unknown aspects as limitations. Disclosure practices vary across vendors: some publish their system prompts, others keep them proprietary. Where prompts are published, they also change between releases (Willison 2026), so researchers should record the tool version and date of use even when the prompt content itself is unavailable. Understanding suggestions of commercial tools such as GitHub Copilot might require recreating the exact state of the codebase at the time the suggestion was made, which is a challenging context to report. One solution is to use version control to capture the exact state of the codebase when a recommendation was made, keeping track of the files that were automatically added as context. We also recommend exploring open-source tools such as OpenCode (OpenCode Contributors 2025), which expose more of the configuration that controls agent behavior.
Study Types
This guideline must be followed for all studies that involve tools with system-level components beyond bare LLMs, from lightweight wrappers that pre-process user input or post-process model outputs, to systems employing retrieval-augmented methods or complex agentic architectures. It must also be followed by all studies that use concrete prompts or prompt templates.
For LLMs for Tools, this guideline is of primary importance: researchers must describe the tool’s full architecture, explain how prompts were generated and structured within the tool, report the context files, tool catalog, and skill definitions used, and document how the role of each LLM fits into the overall system behavior. For Benchmarking LLMs, researchers must describe the evaluation harness and infrastructure when it goes beyond bare model API calls (e.g., custom sandboxing, orchestration layers, or post-processing pipelines), and the harness should be usable with open models; see Anthropic (2025) for a practitioner account of evaluation harness design for agentic systems. Researchers using pre-defined prompts (e.g., HumanEval, SWE-Bench) must specify the benchmark version and any modifications made to the prompts or evaluation setup. If prompt tuning, RAG, or other methods were used to adapt prompts, researchers must disclose and justify those changes, and should make the relevant code publicly available. For Studying LLM Usage, researchers should describe the tool architecture of the studied tool to the extent that it is accessible, as architectural details may influence observed usage patterns. For controlled experiments under Studying LLM Usage, exact prompts must be reported for all conditions. For LLMs as Annotators, researchers must document any predefined coding guides or instructions included in prompts, as these influence how the model labels artifacts. For LLMs as Judges, researchers must report the evaluation criteria, scales, and examples embedded in the prompt to ensure consistent interpretation. For LLMs for Synthesis tasks (e.g., summarization, aggregation), researchers must document the sequence of prompts used to generate and refine outputs, including follow-ups and clarification queries.
For LLMs as Subjects (e.g., simulating human participants), researchers must report any role-playing instructions, constraints, or personas used to guide LLM behavior. For LLMs as Annotators, LLMs as Judges, LLMs for Synthesis, and LLMs as Subjects, if the research setup involves a custom pipeline (e.g., RAG for annotation or chained prompts for synthesis), the architecture should also be reported.
Advice for Reviewers
As with other guidelines, missing architectural or prompt information is typically a minor revision request unless so much is missing that methodological rigor cannot be assessed. Reviewers may ask authors to move key details from supplements into the paper body to ensure the main text is self-contained.
Regarding proprietary or confidential components, three principles apply: (1) empirically evaluating an opaque artifact is a valid scientific contribution; (2) the less available and inspectable the artifact, the weaker the contribution; (3) scientific instruments, including tools, metrics, scales, and experimental materials, must be fully disclosed, even when the object under study cannot be.
A challenging aspect of prompt reporting is the description of prompt development, which is inherently iterative and creative. Reviewers should not expect justification of each word choice or post-hoc rationalized accounts of prompt generation. Instead, reviewers should focus on whether (1) a new research team could use (not reproduce) exactly the same prompts in exactly the same way, and (2) potential biases or validity issues are transparent.
See Also
- Report Model Version, Configuration, and Customizations: Every architecture runs on at least one specific model; name its version.
- Report Session Traces: Architecture and prompts are static; session traces show how they behave at runtime.
- Use Human Validation for LLM Outputs: Human validation often complements automated metrics for tool outputs.
- Use an Open LLM as a Baseline: Open models let other researchers run the reported prompts and schemas on the exact same model weights.
- Report Limitations and Mitigations: When a tool hides internal prompts or schemas, authors must report the gap as a limitation.
References
Anthropic. 2025. “Demystifying Evals for AI Agents.” https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents.
———. 2026. “Claude Code Tools Reference.” https://code.claude.com/docs/en/tools-reference.
DAIR.AI. 2024. “Elements of a Prompt.” https://www.promptingguide.ai/introduction/elements.
Galster, Matthias, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes. 2026. “Configuring Agentic AI Coding Tools: An Exploratory Study.” In Proceedings of the 3rd ACM International Conference on AI-Powered Software (AIware 2026).
Liang, Jenny T., Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, and Thomas Zimmermann. 2024. “Can GPT-4 Replicate Empirical Software Engineering Research?” Proc. ACM Softw. Eng. 1 (FSE): 1330–53. https://doi.org/10.1145/3660767.
OpenCode Contributors. 2025. “OpenCode: The Open Source AI Coding Agent.” https://opencode.ai/.
Schäfer, Max, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.” IEEE Trans. Software Eng. 50 (1): 85–105. https://doi.org/10.1109/TSE.2023.3334955.
Schulhoff, Sander, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, et al. 2024. “The Prompt Report: A Systematic Survey of Prompting Techniques.” CoRR abs/2406.06608. https://doi.org/10.48550/ARXIV.2406.06608.
Sclar, Melanie, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting.” In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=RIu5lyNXjT.
Willison, Simon. 2026. “Opus system prompt.” https://simonwillison.net/2026/Apr/18/opus-system-prompt/.
Yan, Litao, Alyssa Hwang, Zhiyuan Wu, and Andrew Head. 2024. “Ivie: Lightweight Anchored Explanations of Just-Generated Code.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, edited by Florian ’Floyd’ Mueller, Penny Kyburz, Julie R. Williamson, Corina Sas, Max L. Wilson, Phoebe O. Toups Dugas, and Irina Shklovski, 140:1–15. ACM. https://doi.org/10.1145/3613904.3642239.