Report System and Prompt Design (G3)

tl;dr: Researchers MUST describe the full architecture of LLM-based tools that they develop in their PAPER, including the role of the LLM, interactions with other components, and the overall system behavior. If autonomous agents are used, researchers MUST specify agent roles, reasoning frameworks, and communication flows. Hosting, hardware setup, and latency implications MUST be reported. For tools using retrieval or augmentation methods, researchers MUST describe how external data is retrieved, stored, and integrated; data preprocessing, versioning, and update frequency SHOULD also be reported. For ensemble architectures, the coordination logic between models SHOULD be explained in the PAPER. Researchers SHOULD justify design decisions. Where legally possible, researchers SHOULD release the source code of their implementation under an open-source license. Non-disclosed confidential or proprietary components MUST be acknowledged as reproducibility limitations. Researchers MUST also describe any context-file mechanisms used (e.g., AGENTS.md, CLAUDE.md) and SHOULD publish the file contents as SUPPLEMENTARY MATERIAL. Researchers MUST summarize in the PAPER which tools and skills were exposed to the model, and SHOULD include the complete list (names with one-line purposes), tool schemas, skill definitions, sub-agent definitions, and connected MCP servers as SUPPLEMENTARY MATERIAL. Researchers MUST publish all prompts or, when using templates, prompt templates with representative instances, including their structure, content, formatting, and dynamic components, as SUPPLEMENTARY MATERIAL. If full prompt disclosure is not feasible, for example because of privacy or confidentiality concerns, summaries or examples SHOULD be provided. The prompting strategy used (e.g., zero-shot, one-shot, few-shot) MUST be specified, and the development rationale and selection process SHOULD be described. When prompts are long or complex, input handling and token optimization strategies SHOULD be documented. For dynamically generated prompts, the generation process MUST be documented; for user-authored prompts, researchers SHOULD describe how they were collected and analyzed. Prompt reuse across models and configurations MUST be specified. Researchers SHOULD report prompt revisions and pilot testing insights.

Rationale

LLM-based studies rest on artifacts that researchers design, author, or configure before any model is invoked: the software layers around the model(s) and the context those layers feed into each call. We use context to mean the input to a single model call (Galster et al. 2026). Software layers pre-process data, prepare prompts, filter user requests, or post-process responses. Context includes prompt templates, context files, tool and skill schemas, and the retrieval mechanisms that bring external data into the call. For example, ChatGPT and GitHub Copilot use the same underlying models, but their outputs differ substantially because Copilot automatically adds project context. Researchers can also build tools using models directly via APIs. Infrastructure and business logic around the bare model can contribute substantially to the performance of a tool for a given task. RAG pipelines, output parsers, and retry logic all shape the final output independently of the model itself.

Prompts are central to any LLM-based study (Sclar et al. 2024). A prompt is a concrete input to an LLM that guides its output (Schulhoff et al. 2024). Depending on the task, a prompt may include instructions (e.g., “classify the following bug report”), background information (e.g., a taxonomy of bug categories, related source code), input data (e.g., the bug report text), and output format specifications (e.g., “respond as JSON with fields ‘category’ and ‘justification’”), with outputs ranging from unstructured text to structured formats such as JSON (DAIR.AI 2024). A prompt template is a parameterized structure containing static elements (e.g., instructions, output format specifications) and placeholders for dynamic content (e.g., source code under analysis) that are filled in at runtime to construct concrete prompts (Schulhoff et al. 2024). In automated studies, researchers typically design prompt templates from which individual prompts are then instantiated. Prompts substantially influence a model’s output, so documenting how they are formatted and integrated into an LLM-based study is essential for transparency, verifiability, and reproducibility. Sclar et al. (2024) question the methodological validity of comparing models with “an arbitrarily chosen, fixed prompt format”, because their research has shown that the performance of different prompt formats correlates only weakly across models.

This section addresses everything researchers put in place before runs begin, complementing Version and Configuration, which focuses on model-specific details, and Session Traces, which covers what happens at runtime. Standalone LLMs require limited architectural documentation, while study setups that integrate LLMs with other components, configure agents, or depend on particular context files or tool schemas require more detailed reporting.

These guidelines do not apply when LLMs are used solely for language polishing, paraphrasing, translation, tone or style adaptation, or layout edits (see Scope). Researchers are not expected to disclose prompts used only for such tasks.

Recommendations

Researchers MUST clearly describe the tool architecture and what exactly the LLM (or ensemble of LLMs) contributes to the tool or method presented in a research paper. Researchers SHOULD explain design decisions, particularly how the models were hosted and accessed (API-based, self-hosted, etc.) and which retrieval mechanisms were implemented (keyword search, semantic similarity matching, rule-based extraction, etc.). Where legally possible (e.g., not restricted by industry-partner agreements), researchers SHOULD release the source code of their implementation under an open-source license. Researchers MUST NOT omit critical architectural details that could affect reproducibility, such as dependencies on proprietary tools that influence tool behavior. For time-sensitive measurements, the description of the hosting environment is central, as different hosting choices (e.g., self-hosting on local hardware, an aggregator such as OpenRouter, or a vendor API such as the OpenAI API) can substantially impact the results. Researchers MUST clarify whether local infrastructure or cloud services were used, including detailed infrastructure specifications and latency considerations.

Agent-Based and Complex Systems:

If an LLM is used as a standalone system, for example by sending prompts directly to a GPT-4o model via the OpenAI API without pre-processing the prompts or post-processing the responses, a brief explanation of this approach is usually sufficient. However, if LLMs are integrated into more complex systems with pre-processing, retrieval mechanisms, or autonomous agents, the PAPER MUST contain a high-level description of the system architecture, with details reported as SUPPLEMENTARY MATERIAL. Aspects to consider are how the LLM interacts with other components such as databases, external APIs, and frameworks. If the LLM is part of an agent-based system that autonomously plans or executes tasks, researchers MUST describe its exact architecture, including the agents’ roles (e.g., planner, executor, coordinator), whether it is a single-agent or multi-agent system, how it interacts with external tools and users, and the reasoning framework used (e.g., chain-of-thought, self-reflection, multi-turn dialogue). For agent-based systems that use external tools (e.g., Claude Code and its subagents), researchers MUST describe agent behavior traceability by delineating the distinct components that contribute to behavior: (1) LLM Input/Output, the internal deliberation, planning, and interpretation performed by the model; (2) External tool calls, specific invocations of APIs, databases, file systems, or other external services that the agent explicitly triggers (e.g., via the Model Context Protocol (MCP)); (3) User or system interactions, human-in-the-loop feedback, environment responses, or multi-agent communication. Separating these components in the architectural description lets readers later assess whether task success is attributable to the LLM’s output, the external tools’ functionality, or their interaction patterns. Runtime trace-reporting requirements for these components are covered in Session Traces.
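For illustration, the following minimal sketch shows an agent loop that keeps these three components separable in its trace. It is not tied to any particular framework; the call_llm() and run_tool() helpers are hypothetical stand-ins for the model API and tool implementations used in a concrete study.

```python
# Minimal sketch of an agent loop that records the three behavior components
# separately. call_llm() and run_tool() are hypothetical stand-ins for the
# actual model API and tool implementations.
import json
from datetime import datetime, timezone

def now():
    return datetime.now(timezone.utc).isoformat()

def run_agent(task, call_llm, run_tool, max_steps=10):
    trace = []
    messages = [{"role": "user", "content": task}]           # (3) user/system interaction
    trace.append({"time": now(), "component": "interaction", "event": messages[-1]})
    for _ in range(max_steps):
        response = call_llm(messages)                         # (1) LLM input/output
        trace.append({"time": now(), "component": "llm",
                      "input": list(messages), "output": response})
        messages.append({"role": "assistant", "content": json.dumps(response)})
        if "tool_call" in response:                           # (2) external tool call
            result = run_tool(response["tool_call"])
            trace.append({"time": now(), "component": "tool",
                          "call": response["tool_call"], "result": result})
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return response.get("content"), trace
    return None, trace
```

Structuring the implementation so that these components are logged separately makes it easier to later attribute task success to the model, the tools, or their interaction, as discussed in Session Traces.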

Retrieval, Augmentation, and Ensembles:

If retrieval or augmentation methods were used (e.g., retrieval-augmented generation (RAG), rule-based retrieval, structured query generation, or hybrid approaches), researchers MUST describe how external data is retrieved, stored, and integrated into the LLM’s responses. This includes specifying the type of storage or database used (e.g., vector databases, relational databases, knowledge graphs) and how the retrieved information is selected and used. Stored data used for context augmentation SHOULD be reported, including details on data preprocessing, versioning, and update frequency. If this data is not confidential, an anonymized snapshot of the data used for context augmentation SHOULD be made available as SUPPLEMENTARY MATERIAL. For ensemble models, in addition to following the Version and Configuration guideline for each model, the researchers SHOULD describe the architecture that connects the models. The PAPER MUST at least contain a high-level description, and details can be reported in the SUPPLEMENTARY MATERIAL. Aspects to consider include documenting the logic that determines which model handles which input, the interaction between models, and the architecture for combining outputs (e.g., majority voting, weighted averaging, sequential processing).
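As an illustration of the level of detail a retrieval description should allow readers to reconstruct, the sketch below shows a minimal retrieval-augmented generation step. The embed() function and the vector store’s query() method are assumed placeholders, not a specific library’s API.

```python
# Minimal sketch of a retrieval-augmented generation step, assuming an
# embedding function embed() and a vector store with a query() method;
# both are hypothetical placeholders rather than a concrete library API.
def build_rag_prompt(question, vector_store, embed, top_k=5):
    # Retrieve the top-k passages most similar to the question.
    hits = vector_store.query(embedding=embed(question), k=top_k)
    context = "\n\n".join(f"[{i + 1}] {doc.text}" for i, doc in enumerate(hits))
    # Integrate retrieved context and question into a single prompt.
    return (
        "Answer the question using only the numbered context passages below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

When reporting such a pipeline, the storage type, the embedding model, the value of k, and the snapshot or version of the stored data are the details that matter most for reproducibility.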

Context Files and Agent Configuration:

Agentic coding tools expose a range of configuration and extension mechanisms including context files, skills, subagents, hooks, settings, rules, and Model Context Protocol (MCP) servers (Galster et al. 2026). Context files such as AGENTS.md or CLAUDE.md are version-controlled Markdown files containing persistent, project-specific instructions (Mohsenimofidi et al. 2026). They are automatically included in the context on each call and instruct the agent on coding conventions, architectural constraints, tool preferences, or task-specific requirements. Researchers MUST describe which context-file mechanisms were used in their setup and SHOULD publish the full contents of those files as SUPPLEMENTARY MATERIAL, because they shape agent behavior similarly to system prompts. Because context files are version-controlled, their evolution across a study is recoverable from the project’s git history and from the runtime traces reported under Session Traces.
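To illustrate why context files shape behavior much like system prompts, the following sketch shows how an agent might prepend them to every call. The file names follow the conventions mentioned above; the assembly logic is an assumption for illustration, not the exact mechanism of any specific tool.

```python
# Illustrative sketch of how a context file ends up in the model call:
# persistent, project-specific instructions are injected into the system
# context on every invocation. The assembly logic is an assumption.
from pathlib import Path

def build_system_message(project_root, base_instructions):
    parts = [base_instructions]
    for name in ("AGENTS.md", "CLAUDE.md"):
        context_file = Path(project_root) / name
        if context_file.exists():
            # The file contents are included verbatim, so publishing them
            # is comparable to publishing a system prompt.
            parts.append(context_file.read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```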

Tool Catalog and Skill Definitions:

Agentic systems expose a set of callable tools, skills, and sub-agents to the model on each call, along with their schemas: parameter names, types, and natural-language descriptions as the model sees them. Tools vary in granularity: some expose one operation per tool, while others provide generic execution interfaces. Claude Code, for example, does not register individual command-line utilities; it instead defines a single Bash tool (PowerShell on Windows) through which the agent runs arbitrary shell commands such as grep or ls (Anthropic 2026). The catalog, the wording of descriptions, and the order in which tools are presented all influence which tool the model selects, so the exact serialized form matters for reproducibility. Researchers MUST summarize in the PAPER which tools and skills were exposed to the model. Researchers SHOULD include the complete list (names with one-line purposes), tool schemas, skill definitions, sub-agent definitions, and the identities of any connected Model Context Protocol (MCP) servers as SUPPLEMENTARY MATERIAL. Where sub-agents are used, researchers SHOULD also describe the delegation pattern: which sub-agent handles which kind of task, and how control returns to the orchestrator.
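The sketch below shows the kind of serialized tool schema that can be included as SUPPLEMENTARY MATERIAL. The field names follow common JSON-Schema-style conventions, but the concrete format and the tool itself are illustrative, not any particular vendor’s.

```python
# Example of a serialized tool schema as an agentic system might expose it to
# the model; names, descriptions, and defaults are invented for illustration.
search_tool_schema = {
    "name": "search_code",
    "description": "Search the repository for files matching a query string.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Plain-text or regex search query.",
            },
            "max_results": {
                "type": "integer",
                "description": "Maximum number of files to return.",
                "default": 10,
            },
        },
        "required": ["query"],
    },
}
# Reporting this exact serialized form (names, descriptions, defaults, and the
# order in which tools are listed) lets readers see what influenced the
# model's tool selection.
```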

Prompt Reporting:

Researchers MUST report all prompts that were used for an empirical study, including all instructions, background information, input data, and output indicators. The only exception is when prompts might identify anonymous participants or reveal personal or confidential information, for example when working with industry partners. When prompt templates are used, researchers MUST report them alongside representative prompt instances, specifying which parts are static and which are dynamically filled. Researchers SHOULD specify the exact formatting of prompts, including how code snippets were enclosed (e.g., Markdown-style code blocks, triple backticks), whether whitespace was preserved, and how other artifacts such as error messages and stack traces were presented.
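The following illustrative template shows the distinction between static and dynamic parts, together with a representative instance. The task, categories, and wording are invented for this example and not drawn from any study.

```python
# Illustrative prompt template: static instructions and output format
# specification, plus placeholders for dynamic content filled at runtime.
TEMPLATE = (
    "You are classifying bug reports.\n\n"
    "Classify the bug report below into exactly one of these categories:\n"
    "{categories}\n\n"
    'Respond as JSON with the fields "category" and "justification".\n\n'
    "Bug report (verbatim, whitespace preserved):\n{bug_report}\n"
)

def instantiate(categories, bug_report):
    # Dynamic parts: the category list and the bug report under analysis.
    return TEMPLATE.format(
        categories="\n".join(f"- {c}" for c in categories),
        bug_report=bug_report,
    )

# Representative instance, as it would be reported alongside the template.
example_prompt = instantiate(
    ["crash", "performance", "ui"],
    "App crashes with NullPointerException when saving an empty file.",
)
```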

Prompt Development:

Researchers SHOULD explain in the PAPER how they developed the prompts and why they decided to follow certain prompting strategies. If prompts from the early phases of a research project are unavailable, researchers SHOULD at least summarize the prompt evolution. Because prompts can be stored as plain text, established SE techniques such as version control can collect the required provenance information.

Prompt development is often iterative, involving collaboration between human researchers and AI tools. Researchers SHOULD report any instances in which LLMs were used to suggest prompt refinements and how these suggestions were incorporated. Prompts may need revision in response to failure cases, and pilot testing is vital for reliable results. If such testing was conducted, researchers SHOULD summarize key insights, including how different prompt variations affected output quality and which criteria were used to finalize the prompt design. A prompt changelog can track prompt evolution, including key revisions and reasons for changes (e.g., v1.0: initial prompt; v2.0: incorporated examples of ideal responses). Because prompt effectiveness varies between models and model versions, researchers MUST make clear which prompts were used for which models in which versions and with which configuration.

Prompting Strategy and Input Handling:

Researchers MUST specify whether zero-shot, one-shot, or few-shot prompting was used. For few-shot prompts, the examples provided to the model SHOULD be clearly outlined, along with the rationale for selecting them. If multiple versions of a prompt were tested, researchers SHOULD describe how these variations were evaluated and how the final design was chosen.
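For illustration, the sketch below contrasts a zero-shot prompt with a few-shot variant built from labeled examples. The task, wording, and examples are hypothetical.

```python
# Sketch contrasting zero-shot and few-shot prompt construction; the task and
# examples are illustrative only.
INSTRUCTION = "Label the following commit message as 'bugfix' or 'feature'."

def zero_shot(commit_message):
    return f"{INSTRUCTION}\n\nCommit message: {commit_message}\nLabel:"

def few_shot(commit_message, examples):
    # `examples` is a list of (commit_message, label) pairs; the guideline asks
    # researchers to report which examples were chosen and why.
    shots = "\n\n".join(
        f"Commit message: {msg}\nLabel: {label}" for msg, label in examples
    )
    return f"{INSTRUCTION}\n\n{shots}\n\nCommit message: {commit_message}\nLabel:"
```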

When dealing with extensive or complex prompt context, researchers SHOULD describe the strategies they used to handle input length constraints. Approaches might include truncating, summarizing, or splitting prompts into multiple parts. Token optimization measures, such as simplifying code formatting or removing unnecessary comments, SHOULD also be documented if applied.
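A minimal sketch of one such strategy, truncating the lowest-priority context chunks to fit a token budget, is shown below. The count_tokens() helper is a placeholder for a tokenizer matching the target model, and the prioritization scheme is an assumption for illustration.

```python
# Sketch of a simple length-handling strategy: keep instructions and question,
# then add context chunks in priority order until a token budget is reached.
# count_tokens() stands in for a model-specific tokenizer.
def fit_to_budget(instructions, context_chunks, question, count_tokens, budget):
    used = count_tokens(instructions) + count_tokens(question)
    kept = []
    for chunk in context_chunks:  # chunks ordered by importance, most important first
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # drop the remaining, lower-priority context
        kept.append(chunk)
        used += cost
    return instructions + "\n\n" + "\n\n".join(kept) + "\n\n" + question
```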

Dynamic and User-Authored Prompts:

When prompts are generated dynamically, for example through preprocessing, template structures, or retrieval-augmented generation (RAG), the process MUST be thoroughly documented. This includes any automated algorithms or rules that shaped prompt generation. For studies involving human participants who create or modify prompts, researchers SHOULD describe how these prompts were collected and analyzed.

Supplementary Material and Anonymization:

For complete reproducibility, researchers MUST make all prompts and prompt variations publicly available as part of their SUPPLEMENTARY MATERIAL. If the complete set of prompts cannot be included in the PAPER, researchers SHOULD provide summaries and representative examples. This also applies when complete disclosure is not possible because of privacy or confidentiality concerns. For prompts containing sensitive information, researchers MUST anonymize personal identifiers, replace proprietary code with placeholders, and clearly highlight modified sections.
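The sketch below illustrates one simple way to anonymize prompts before publication. The replacement patterns and placeholder labels are examples only; modified sections SHOULD remain clearly marked in the published material.

```python
# Sketch of prompt anonymization before publication: replace personal
# identifiers and proprietary code with labeled placeholders so that modified
# sections stay visible. Patterns and labels are illustrative only.
import re

def anonymize(prompt, person_names, proprietary_snippets):
    redacted = prompt
    for name in person_names:
        redacted = redacted.replace(name, "[PERSON]")
    for snippet in proprietary_snippets:
        redacted = redacted.replace(snippet, "[PROPRIETARY CODE REMOVED]")
    # Mask email addresses as an example of a generic identifier pattern.
    redacted = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", redacted)
    return redacted
```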

Example(s)

Schäfer et al. (2024) evaluated LLMs for automated unit test generation, providing a detailed description of the system architecture including code parsing, prompt formulation, LLM interaction, and test suite integration. They also detail the datasets used, including sources, selection criteria, and preprocessing steps.

A second example is the IVIE tool by Yan et al. (2024), which integrates LLMs into the VS Code interface. The authors document the tool architecture, detailing the IDE integration, context extraction from code editors, and the formatting pipeline for LLM-generated explanations. This documentation illustrates how architectural components beyond the core LLM affect the overall tool performance and user experience.

A paper by Anandayuvaraj et al. (2024) is a good example of making prompts available online. The authors analyze software failures reported in news articles and use prompting to automate tasks such as filtering relevant articles, merging reports, and extracting detailed failure information. Their online appendix contains all the prompts used in the study, supporting transparency and reproducibility.

The paper by Liang et al. (2024) is another good example of comprehensive prompt reporting. The authors make the exact prompts available in their SUPPLEMENTARY MATERIAL on Figshare, including details such as code blocks enclosed in triple backticks. The PAPER explains the rationale behind the prompt design and the data output format, and it includes an overview figure and two concrete examples, keeping the main text concise while remaining reproducible.

Benefits

The software layers around bare LLMs substantially impact tool performance and hence need detailed documentation. Documenting the architecture and supplemental data of LLM-based systems strengthens reproducibility and transparency (Lu et al. 2024), enabling experiment replication, result validation, and cross-study comparison.

Detailed prompt documentation strengthens verifiability, reproducibility, and comparability of LLM-based studies, letting other researchers replicate studies, refine prompts, and evaluate how different content types and formatting choices influence LLM behavior. Reporting the tool catalog and context files enables other researchers to reconstruct the exact context the model saw, which is often the dominant factor in agent behavior.

Challenges

Documenting LLM-based architectures involves challenges such as proprietary APIs and dependencies that restrict disclosure, managing large-scale retrieval databases, and ensuring efficient query execution. Researchers must also balance transparency with data privacy concerns, adapt to the evolving nature of LLM integrations, and handle the complexity of multi-agent interactions and decision-making logic, all of which can impact reproducibility and system clarity.

Prompts themselves are challenging to document because they often combine multiple components, such as code, error messages, and explanatory text. Formatting differences (e.g., Markdown versus plain text) can affect how LLMs interpret input. Prompt length constraints may require careful context management, particularly for tasks involving extensive artifacts such as large codebases. Privacy and confidentiality concerns can hinder prompt sharing, especially when sensitive data is involved.

Not all systems allow reporting of complete (system) prompts, context files, and tool schemas. For commercial tools, researchers MUST report all available information and acknowledge unknown aspects as limitations. Disclosure practices vary across vendors: some publish their system prompts, others keep them proprietary. Even where system prompts are published, they change between releases (Willison 2026), so researchers SHOULD record the tool version and date of use; the same applies when the prompt content itself is unavailable. Understanding suggestions of commercial tools such as GitHub Copilot might require recreating the exact state of the codebase at the time the suggestion was made, which is a challenging context to report. One solution is to use version control to capture the exact state of the codebase when a recommendation was made, keeping track of the files that were automatically added as context. We also recommend exploring open-source tools such as OpenCode (OpenCode Contributors 2025), which expose more of the configuration that shapes agent behavior.

Study Types

This guideline MUST be followed for all studies that involve tools with system-level components beyond bare LLMs, from lightweight wrappers that pre-process user input or post-process model outputs, to systems employing retrieval-augmented methods or complex agent-based architectures. It also MUST be followed by all studies that use concrete prompts or prompt templates.

For LLMs for Tools, this guideline is of primary importance: researchers MUST describe the tool’s full architecture, explain how prompts were generated and structured within the tool, report the context files, tool catalog, and skill definitions used, and document how the role of each LLM fits into the overall system behavior. For Benchmarking LLMs, researchers SHOULD describe the evaluation harness and infrastructure if it goes beyond bare model API calls (e.g., custom sandboxing, orchestration layers, or post-processing pipelines); see (Anthropic 2025) for a practitioner account of evaluation harness design for agent-based systems. Researchers using pre-defined prompts (e.g., HumanEval, SWE-Bench) MUST specify the benchmark version and any modifications made to the prompts or evaluation setup. If prompt tuning, retrieval-augmented generation (RAG), or other methods were used to adapt prompts, researchers MUST disclose and justify those changes, and SHOULD make the relevant code publicly available. For Studying LLM Usage, researchers SHOULD describe the tool architecture of the studied tool to the extent that it is accessible, as architectural details may influence observed usage patterns. For controlled experiments under Studying LLM Usage, exact prompts MUST be reported for all conditions. For LLMs as Annotators, researchers MUST document any predefined coding guides or instructions included in prompts, as these influence how the model labels artifacts. For LLMs as Judges, researchers MUST report the evaluation criteria, scales, and examples embedded in the prompt to ensure consistent interpretation. For LLMs for Synthesis tasks (e.g., summarization, aggregation), researchers MUST document the sequence of prompts used to generate and refine outputs, including follow-ups and clarification queries. For LLMs as Subjects (e.g., simulating human participants), researchers MUST report any role-playing instructions, constraints, or personas used to guide LLM behavior. For LLMs as Annotators, LLMs as Judges, LLMs for Synthesis, and LLMs as Subjects, if the research setup involves a custom pipeline (e.g., retrieval-augmented generation for annotation or chained prompts for synthesis), the architecture SHOULD also be reported.

Advice for Reviewers

As with other guidelines, missing architectural or prompt information is typically a minor revision request unless so much is missing that methodological rigor cannot be assessed. Reviewers may ask authors to move key details from supplements into the paper body to ensure the main text is self-contained.

Regarding proprietary or confidential components, three principles apply: (1) empirically evaluating an opaque artifact is a valid scientific contribution; (2) the less available and inspectable the artifact, the weaker the contribution; (3) scientific instruments including tools, metrics, scales, and experimental materials must be fully disclosed, even when the object under study cannot be.

A challenging aspect of prompt reporting is the description of prompt development, which is inherently iterative and creative. Reviewers should not expect justification of each word choice or post-hoc rationalized accounts of prompt generation. Instead, reviewers should focus on whether (1) a new research team could use (not reproduce) exactly the same prompts in exactly the same way, and (2) potential biases or validity issues are transparent.

See Also

  • Section Version and Configuration: model versions and configuration for prompt-model mapping.
  • Section Session Traces: runtime reporting of interaction logs, tool-call traces, and usage traces that correspond to the static artifacts reported here.
  • Section Human Validation: evaluation of agentic tool outputs.
  • Section Open LLMs: open-source tools for transparent configuration and prompt logging.
  • Section Limitations and Mitigations: reporting limitations of commercial tool transparency.

References

Anandayuvaraj, Dharun, Matthew Campbell, Arav Tewari, and James C. Davis. 2024. “FAIL: Analyzing Software Failures from the News Using LLMs.” In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2024, edited by Vladimir Filkov, Baishakhi Ray, and Minghui Zhou, 506–18. ACM. https://doi.org/10.1145/3691620.3695022.

Anthropic. 2025. “Demystifying Evals for AI Agents.” https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents.

———. 2026. “Claude Code Tools Reference.” https://code.claude.com/docs/en/tools-reference.

DAIR.AI. 2024. “Elements of a Prompt.” https://www.promptingguide.ai/introduction/elements.

Galster, Matthias, Seyedmoein Mohsenimofidi, Jai Lal Lulla, Muhammad Auwal Abubakar, Christoph Treude, and Sebastian Baltes. 2026. “Configuring Agentic AI Coding Tools: An Exploratory Study.” In Proceedings of the 3rd ACM International Conference on AI-Powered Software (AIware 2026).

Liang, Jenny T., Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, and Thomas Zimmermann. 2024. “Can GPT-4 Replicate Empirical Software Engineering Research?” Proceedings of the ACM on Software Engineering 1 (FSE): 1330–53.

Lu, Qinghua, Liming Zhu, Xiwei Xu, Zhenchang Xing, and Jon Whittle. 2024. “Toward Responsible AI in the Era of Generative AI: A Reference Architecture for Designing Foundation Model-Based Systems.” IEEE Softw. 41 (6): 91–100. https://doi.org/10.1109/MS.2024.3406333.

Mohsenimofidi, Seyedmoein, Matthias Galster, Christoph Treude, and Sebastian Baltes. 2026. “Context Engineering for AI Agents in Open-Source Software.” In Proceedings of the 23rd IEEE/ACM International Conference on Mining Software Repositories (MSR 2026). https://arxiv.org/abs/2510.21413.

OpenCode Contributors. 2025. “OpenCode: The Open Source AI Coding Agent.” https://opencode.ai/.

Schäfer, Max, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.” IEEE Trans. Software Eng. 50 (1): 85–105. https://doi.org/10.1109/TSE.2023.3334955.

Schulhoff, Sander, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, et al. 2024. “The Prompt Report: A Systematic Survey of Prompting Techniques.” CoRR abs/2406.06608. https://doi.org/10.48550/ARXIV.2406.06608.

Sclar, Melanie, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting.” In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=RIu5lyNXjT.

Willison, Simon. 2026. “Opus System Prompt.” https://simonwillison.net/2026/Apr/18/opus-system-prompt/.

Yan, Litao, Alyssa Hwang, Zhiyuan Wu, and Andrew Head. 2024. “Ivie: Lightweight Anchored Explanations of Just-Generated Code.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, edited by Florian ’Floyd’ Mueller, Penny Kyburz, Julie R. Williamson, Corina Sas, Max L. Wilson, Phoebe O. Toups Dugas, and Irina Shklovski, 140:1–15. ACM. https://doi.org/10.1145/3613904.3642239.