Report Tool Architecture Beyond Models (G3)
tl;dr: Researchers MUST describe the full architecture of LLM-based tools that they develop in their PAPER. This includes the role of the LLM, interactions with other components, and the overall system behavior. If autonomous agents are used, researchers MUST specify agent roles, reasoning frameworks, and communication flows. Hosting, hardware setup, and latency implications MUST be reported. For tools using retrieval or augmentation methods, data sources, integration mechanisms, and update and versioning strategies MUST be described. For ensemble architectures, the coordination logic between models MUST be explained. Researchers SHOULD include architectural diagrams and justify design decisions. Confidential or proprietary components that cannot be disclosed MUST be acknowledged as reproducibility limitations.
Rationale
LLM-based tools often have complex software layers around the model(s) that pre-process data, prepare prompts, filter user requests, or post-process responses. For example, ChatGPT and GitHub Copilot use the same underlying models, but their outputs differ significantly because Copilot automatically adds project context. Researchers can also build tools that access models directly via APIs. The infrastructure and business logic around the bare model can contribute significantly to the performance of a tool for a given task: for instance, RAG pipelines, output parsers, and retry logic all shape the final output independently of the model itself.
This section addresses the system-level aspects of LLM-based tools that researchers develop, complementing Version and Configuration, which focuses on model-specific details. While standalone LLMs require limited architectural documentation, a more detailed description is required for study setups and tool architectures that integrate LLMs with other components to create more complex systems. This section provides guidelines for documenting these broader architectures.
Recommendations
Researchers MUST clearly describe the tool architecture and what exactly the LLM (or ensemble of LLMs) contributes to the tool or method presented in a research paper. Researchers SHOULD provide a high-level architectural diagram to improve transparency. To improve clarity, researchers SHOULD explain design decisions, particularly regarding how the models were hosted and accessed (API-based, self-hosted, etc.) and which retrieval mechanisms were implemented (keyword search, semantic similarity matching, rule-based extraction, etc.). Researchers MUST NOT omit critical architectural details that could affect reproducibility, such as dependencies on proprietary tools that influence tool behavior. Especially for time-sensitive measurements, the description of the hosting environment is central, as it can significantly impact the results. Researchers MUST clarify whether local infrastructure or cloud services were used, including detailed infrastructure specifications and latency considerations.
Agent-Based and Complex Systems:
If an LLM is used as a standalone system, for example, by sending prompts directly to a GPT-4o model via the OpenAI API without pre-processing the prompts or post-processing the responses, a brief explanation of this approach is usually sufficient. However, if LLMs are integrated into more complex systems with pre-processing, retrieval mechanisms, or autonomous agents, researchers MUST provide a detailed description of the system architecture in the PAPER. Aspects to consider are how the LLM interacts with other components such as databases, external APIs, and frameworks. If the LLM is part of an agent-based system that autonomously plans or executes tasks, researchers MUST describe its exact architecture, including the agents’ roles (e.g., planner, executor, coordinator), whether it is a single-agent or multi-agent system, how it interacts with external tools and users, and the reasoning framework used (e.g., chain-of-thought, self-reflection, multi-turn dialogue). Any agent configuration via context files (e.g., AGENTS.md) SHOULD also be documented as part of the system architecture (see Prompts and Logs for detailed reporting requirements). For agent-based systems that use external tools (e.g., Claude Code and its subagents), researchers MUST discuss agent behavior traceability by clearly delineating three distinct components: (1) LLM input/output: internal deliberation, planning, and interpretation performed by the model. (2) External tool calls: specific invocations of APIs, databases, file systems, or other external services that the agent explicitly triggers (e.g., via the Model Context Protocol (MCP)). (3) User or system interactions: human-in-the-loop feedback, environment responses, or multi-agent communication. Researchers SHOULD report complete execution traces that show the sequence and causality between these components, including inputs and outputs of all external tool calls.
This traceability is essential for determining whether task success is attributable to the LLM’s output and tool-calling capabilities, the external tools’ functionality, or their interaction patterns. When full tool call logging is not feasible due to privacy or proprietary constraints, researchers SHOULD provide representative examples or anonymized traces that demonstrate the agent’s decision-making process.
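One way to keep the three components separable in reporting is to log every step of an agent run as a typed trace event. This is a minimal sketch under our own assumptions; the event kinds and field names are illustrative, not an established logging format:

```python
from dataclasses import dataclass
from typing import Literal

# The three components named above, as machine-readable event kinds
Kind = Literal["llm_io", "tool_call", "interaction"]

@dataclass
class TraceEvent:
    step: int
    kind: Kind        # which of the three components produced this event
    summary: str      # deliberation text, tool name + arguments, or user message
    output: str = ""  # model response, tool result, or environment feedback

def component_counts(trace: list[TraceEvent]) -> dict[str, int]:
    """Aggregate a trace so readers can see how much work each component did."""
    counts = {"llm_io": 0, "tool_call": 0, "interaction": 0}
    for event in trace:
        counts[event.kind] += 1
    return counts

# Hypothetical four-step run of a bug-locating agent
trace = [
    TraceEvent(1, "llm_io", "plan: locate failing test", "will search test dir"),
    TraceEvent(2, "tool_call", "grep('FAILED', 'tests/')", "tests/test_io.py:42"),
    TraceEvent(3, "llm_io", "interpret search result", "failure is in test_io.py"),
    TraceEvent(4, "interaction", "ask user to confirm fix scope", "user: yes"),
]
print(component_counts(trace))  # {'llm_io': 2, 'tool_call': 1, 'interaction': 1}
```

Because each event carries both inputs and outputs, the ordered trace preserves the sequence and causality that reviewers need to attribute task success to the model, the tools, or their interaction.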
Retrieval, Augmentation, and Ensembles:
If retrieval or augmentation methods were used (e.g., retrieval-augmented generation (RAG), rule-based retrieval, structured query generation, or hybrid approaches), researchers MUST describe how external data is retrieved, stored, and integrated into the LLM’s responses. This includes specifying the type of storage or database used (e.g., vector databases, relational databases, knowledge graphs) and how the retrieved information is selected and used. Stored data used for context augmentation MUST be reported, including details on data preprocessing, versioning, and update frequency. If this data is not confidential, an anonymized snapshot of the data used for context augmentation SHOULD be made available. For ensemble models, in addition to following the Version and Configuration guideline for each model, the researchers MUST describe the architecture that connects the models. The PAPER MUST at least contain a high-level description, and details can be reported in the SUPPLEMENTARY MATERIAL. Aspects to consider include documenting the logic that determines which model handles which input, the interaction between models, and the architecture for combining outputs (e.g., majority voting, weighted averaging, sequential processing).
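For ensembles, the coordination logic is itself an architectural component worth specifying precisely, because it changes results without any individual model changing. As a hedged sketch of one common combination pattern, majority voting over independent model outputs (the outputs below are hypothetical):

```python
from collections import Counter

def majority_vote(outputs: list[str]) -> str:
    """Combine ensemble outputs; ties resolve to the earliest model's answer.

    The tie-breaking rule is exactly the kind of detail the PAPER should
    state explicitly, since a different rule yields different results.
    """
    if not outputs:
        raise ValueError("empty ensemble")
    counts = Counter(outputs)
    best = max(counts.values())
    for out in outputs:  # first output reaching the maximum count wins ties
        if counts[out] == best:
            return out

# Hypothetical outputs from three models classifying the same issue report
print(majority_vote(["bug", "not-bug", "bug"]))  # bug
print(majority_vote(["bug", "not-bug"]))         # bug (tie -> first model)
```

Analogous precision is needed for the other coordination patterns mentioned above, such as weighted averaging or sequential processing, where weights and ordering play the same result-shaping role as the tie-breaking rule here.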
Example(s)
Schäfer et al. (2024) evaluated LLMs for automated unit test generation, providing a comprehensive description of the system architecture including code parsing, prompt formulation, LLM interaction, and test suite integration. They also detail the datasets used, including sources, selection criteria, and preprocessing steps.
A second example is the IVIE tool by Yan et al. (2024), which integrates LLMs into the VS Code interface. The authors document the tool architecture, detailing the IDE integration, context extraction from code editors, and the formatting pipeline for LLM-generated explanations. This documentation illustrates how architectural components beyond the core LLM affect the overall tool performance and user experience.
Benefits
The software layers around bare LLMs significantly impact tool performance and hence need detailed documentation. Documenting the architecture and supplemental data of LLM-based systems enhances reproducibility and transparency (Lu et al. 2024), enabling experiment replication, result validation, and cross-study comparison.
Challenges
Researchers face challenges in documenting LLM-based architectures, including proprietary APIs and dependencies that restrict disclosure, managing large-scale retrieval databases, and ensuring efficient query execution. They must also balance transparency with data privacy concerns, adapt to the evolving nature of LLM integrations, and, depending on the context, handle the complexity of multi-agent interactions and decision-making logic, all of which can impact reproducibility and system clarity.
Study Types
This guideline MUST be followed for all studies that involve tools with system-level components beyond bare LLMs, from lightweight wrappers that pre-process user input or post-process model outputs, to systems employing retrieval-augmented methods or complex agent-based architectures.
For LLMs for Tools, this guideline is of primary importance: researchers MUST describe the tool’s full architecture, including the role of each LLM, interactions with other components, and the overall system behavior. For Benchmarking LLMs, researchers SHOULD describe the evaluation harness and infrastructure if it goes beyond bare model API calls (e.g., custom sandboxing, orchestration layers, or post-processing pipelines). For Studying LLM Usage, researchers SHOULD describe the tool architecture of the studied tool to the extent that it is accessible, as architectural details may influence observed usage patterns. For LLMs as Annotators, LLMs as Judges, LLMs for Synthesis, and LLMs as Subjects, if the research setup involves a custom pipeline (e.g., retrieval-augmented generation for annotation or chained prompts for synthesis), the architecture SHOULD also be reported.
Advice for Reviewers
As with other guidelines, missing architectural information is typically a minor revision request unless so much is missing that methodological rigor cannot be assessed. Reviewers may ask authors to move key details from supplements into the paper body to ensure the main text is self-contained.
Regarding proprietary or confidential components, three principles apply: (1) empirically evaluating an opaque artifact is a valid scientific contribution; (2) the less available and inspectable the artifact, the weaker the contribution; (3) scientific instruments, including tools, metrics, scales, and experimental materials, must be fully disclosed, even when the object under study cannot be.
References
Lu, Qinghua, Liming Zhu, Xiwei Xu, Zhenchang Xing, and Jon Whittle. 2024. “Toward Responsible AI in the Era of Generative AI: A Reference Architecture for Designing Foundation Model-Based Systems.” IEEE Softw. 41 (6): 91–100. https://doi.org/10.1109/MS.2024.3406333.
Schäfer, Max, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.” IEEE Trans. Software Eng. 50 (1): 85–105. https://doi.org/10.1109/TSE.2023.3334955.
Yan, Litao, Alyssa Hwang, Zhiyuan Wu, and Andrew Head. 2024. “Ivie: Lightweight Anchored Explanations of Just-Generated Code.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, edited by Florian ‘Floyd’ Mueller, Penny Kyburz, Julie R. Williamson, Corina Sas, Max L. Wilson, Phoebe O. Toups Dugas, and Irina Shklovski, 140:1–15. ACM. https://doi.org/10.1145/3613904.3642239.