Report Prompts, their Development, and Interaction Logs (G4)

tl;dr: Researchers MUST publish all prompts, including their structure, content, formatting, and dynamic components. If full prompt disclosure is not feasible, for example, due to privacy or confidentiality concerns, summaries or examples SHOULD be provided. Prompt development strategies (e.g., zero-shot, few-shot), rationale, and selection process MUST be described. When prompts are long or complex, input handling and token optimization strategies MUST be documented. For dynamically generated or user-authored prompts, generation and collection processes MUST be reported. Prompt reuse across models and configurations MUST be specified. Researchers SHOULD report prompt revisions and pilot testing insights. For agentic systems, developed plans SHOULD be reported and interaction logs SHOULD be generalized to include human-agent interaction traces. For studies using AI coding agents, all context files used to configure agent behavior MUST be reported as part of the SUPPLEMENTARY MATERIAL. To address model non-determinism and ensure reproducibility, especially when targeting SaaS-based commercial tools, full interaction logs (prompts and responses) SHOULD be included if privacy and confidentiality can be ensured.

Rationale

Prompts are critical for any study involving LLMs (Sclar et al. 2024). Depending on the task, they may include instructions, context (e.g., source code, execution traces, error messages), input data, and output indicators, with outputs ranging from unstructured text to structured formats such as JSON. Prompts significantly influence the quality of a model's output, and understanding exactly how they were formatted and integrated into an LLM-based study is essential to ensure transparency, verifiability, and reproducibility. Sclar et al. (2024) question the methodological validity of comparing models with “an arbitrarily chosen, fixed prompt format”, because their research has shown that the performance of different prompt formats only weakly correlates across models. We remind the reader that these guidelines do not apply when LLMs are used solely for language polishing, paraphrasing, translation, tone or style adaptation, or layout edits (see Scope). Accordingly, our guidelines do not recommend that researchers disclose the prompts they have used for such tasks.

Recommendations

Prompt Reporting:

Researchers MUST report all prompts that were used for an empirical study, including all instructions, context, input data, and output indicators. The only exception to this is if transcripts might identify anonymous participants or reveal personal or confidential information, for example, when working with industry partners. Prompts can be reported using a structured template that contains placeholders for dynamically added content. Moreover, researchers SHOULD specify the exact formatting of prompts, including how code snippets were enclosed (e.g., markdown-style code blocks, triple backticks), whether whitespace was preserved, and how other artifacts such as error messages and stack traces were presented.
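A structured template with placeholders, as mentioned above, could be as simple as the following sketch. The task, wording, and placeholder names are invented for illustration:

```python
# Illustrative prompt template with named placeholders for dynamically
# added content; the task, wording, and field names are assumptions.
BUG_REPORT_TEMPLATE = """\
You are an expert software engineer. Classify the following bug report.

Bug report title: {title}

Description:
{description}

Respond with a JSON object with the keys "category" and "severity".
"""

def build_prompt(title: str, description: str) -> str:
    """Fill the template with study-specific content."""
    return BUG_REPORT_TEMPLATE.format(title=title, description=description)
```

Reporting the template together with a description of how the placeholders were filled allows readers to reconstruct every concrete prompt that was sent to the model.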

Prompt Development:

Researchers MUST also explain in the PAPER how they developed the prompts and why they decided to follow certain prompting strategies. If prompts from the early phases of a research project are unavailable, they MUST at least summarize the prompt evolution. However, since prompts can be stored as plain text, established SE practices such as version control make it easy to collect the required provenance information throughout a project.

Prompt development is often iterative, involving collaboration between human researchers and AI tools. Researchers SHOULD report any instances in which LLMs were used to suggest prompt refinements and how these suggestions were incorporated. Prompts may need revision in response to failure cases, and pilot testing is vital to ensure reliable results. If such testing was conducted, researchers SHOULD summarize key insights, including how different prompt variations affected output quality and which criteria were used to finalize the prompt design. A prompt changelog can track prompt evolution, including key revisions and reasons for changes (e.g., v1.0: initial prompt; v2.0: incorporated examples of ideal responses). Since prompt effectiveness varies between models and model versions, researchers MUST make clear which prompts were used for which models in which versions and with which configuration.
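As an illustration, such a changelog can be kept as a simple version-controlled text file; the entries below are hypothetical:

```
v1.0 (2025-01-10): Initial zero-shot prompt for bug report classification.
v1.1 (2025-01-17): Added explicit JSON output schema after pilot outputs
                   varied in format.
v2.0 (2025-02-02): Switched to few-shot prompting with three curated
                   examples of ideal responses; v1.x underperformed on
                   ambiguous reports in pilot testing.
```

Storing the changelog alongside the prompts themselves in the same repository keeps the provenance information complete and easy to share.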

Prompting Strategy and Input Handling:

Researchers MUST specify whether zero-shot, one-shot, or few-shot prompting was used. For few-shot prompts, the examples provided to the model SHOULD be clearly outlined, along with the rationale for selecting them. If multiple versions of a prompt were tested, researchers SHOULD describe how these variations were evaluated and how the final design was chosen.

When dealing with extensive or complex prompt context, researchers MUST describe the strategies they used to handle input length constraints. Approaches might include truncating, summarizing, or splitting prompts into multiple parts. Token optimization measures, such as simplifying code formatting or removing unnecessary comments, MUST also be documented if applied.
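A truncation strategy, for example, could be documented precisely by sharing the code that implements it. The sketch below assumes a rough character-based token heuristic and an invented budget; real studies should report the exact tokenizer and limits of the model used:

```python
# Illustrative input-length handling; the token heuristic and budget
# are assumptions, not properties of any specific model.

MAX_TOKENS = 4000  # assumed context budget for the prompt

def approximate_token_count(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def truncate_context(context: str, budget: int = MAX_TOKENS) -> str:
    """Keep the beginning of the context and mark the truncation point."""
    if approximate_token_count(context) <= budget:
        return context
    cut = budget * 4  # convert the token budget back to characters
    return context[:cut] + "\n[... truncated for length ...]"
```

Documenting whether the beginning, the end, or selected parts of an artifact were kept is essential, since the choice can change what the model sees entirely.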

Dynamic and User-Authored Prompts:

In cases where prompts are generated dynamically, such as through preprocessing, template structures, or retrieval-augmented generation (RAG), the process MUST be thoroughly documented. This includes explaining any automated algorithms or rules that influenced prompt generation. For studies involving human participants who create or modify prompts, researchers MUST describe how these prompts were collected and analyzed.
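For dynamically assembled prompts, the assembly logic itself is part of the prompt documentation. The following sketch shows a minimal retrieval-augmented assembly step; the template wording and the idea of numbered snippets are assumptions standing in for a real RAG pipeline:

```python
# Illustrative dynamic prompt assembly: retrieved snippets are numbered
# and injected into a fixed template. Retrieval itself is out of scope here.

def assemble_prompt(question: str, retrieved_snippets: list[str]) -> str:
    context = "\n\n".join(
        f"Snippet {i + 1}:\n{s}" for i, s in enumerate(retrieved_snippets)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Sharing such assembly code (or an equivalent precise description) lets readers see exactly how retrieved or preprocessed content entered the prompt.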

Supplementary Material and Anonymization:

To ensure complete reproducibility, researchers MUST make all prompts and prompt variations publicly available as part of their SUPPLEMENTARY MATERIAL. If the complete set of prompts cannot be included in the PAPER, researchers SHOULD provide summaries and representative examples. This also applies if complete disclosure is not possible due to privacy or confidentiality concerns. For prompts containing sensitive information, researchers MUST: (1) anonymize personal identifiers, (2) replace proprietary code with placeholders, and (3) clearly highlight modified sections.
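An anonymization pass can itself be scripted and shared. The patterns below are deliberately simplistic assumptions; real studies need careful, domain-specific review of what counts as identifying information:

```python
import re

# Illustrative anonymization of prompts before publication; the regular
# expressions are simplistic examples, not an exhaustive solution.

def anonymize_prompt(prompt: str) -> str:
    # Replace e-mail addresses with a placeholder.
    prompt = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", prompt)
    # Replace URLs that might reveal internal systems.
    prompt = re.sub(r"https?://\S+", "[URL]", prompt)
    return prompt
```

Whatever approach is used, the applied replacements should be described so that readers know which parts of the published prompts differ from the originals.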

Interaction Logs:

Even with the exact same prompts, decoding strategies, and parameters, LLMs can behave non-deterministically. Non-determinism can arise from batching, input preprocessing, and floating-point arithmetic on GPUs (Chann 2023). Thus, to enable other researchers to verify the conclusions drawn from LLM interactions, researchers SHOULD report the full interaction logs (prompts and responses) as part of their SUPPLEMENTARY MATERIAL. Reporting this is especially important for studies targeting commercial software-as-a-service (SaaS) solutions such as ChatGPT. The rationale is similar to that for reporting interview transcripts in qualitative research: just as a human participant might give different answers to the same question asked two months apart, the responses of tools such as ChatGPT can also vary over time.
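Interaction logs can be captured in a simple machine-readable form, for example as JSON Lines. The field names below are assumptions, not a standard schema:

```python
import datetime
import json

# Minimal sketch of an interaction log in JSON Lines format (one JSON
# object per line); the schema is illustrative, not prescribed.

def log_interaction(path: str, model: str, prompt: str, response: str) -> None:
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Appending one record per interaction produces a complete, timestamped trace that can be archived as supplementary material, for example on Zenodo.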

Agentic Systems:

For agentic systems that autonomously plan and execute tasks, the developed plans SHOULD be reported if available, as they document the sequence of actions the system chose to pursue and the intermediate goals it set. Interaction logs SHOULD be generalized to include full interaction traces between humans and agentic systems, capturing human-in-the-loop feedback, approval or rejection decisions, and iterative refinements. These traces complement agent behavior traceability requirements and support the evaluation of agentic tool outputs.

Agentic coding tools can be configured using version-controlled Markdown files such as CLAUDE.md or AGENTS.md, which contain persistent, project-specific instructions (Mohsenimofidi et al. 2026). These context files are automatically injected into agent prompts and can significantly influence agent behavior by specifying coding conventions, architectural constraints, tool preferences, and task-specific instructions. Researchers MUST report all context files used to configure AI agent behavior as part of their SUPPLEMENTARY MATERIAL, as these files are a form of prompt artifact that shapes agent behavior similarly to system prompts.
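To illustrate, a context file might contain entries like the following; the contents are invented for illustration, and actual files vary widely by project:

```markdown
# CLAUDE.md (illustrative example)

## Coding conventions
- Use Python 3.12 with type hints throughout.
- Follow PEP 8; format all code with black.

## Architectural constraints
- Do not introduce new third-party dependencies without approval.

## Task instructions
- Write unit tests for every new function.
```

Because such instructions are injected into every agent prompt, omitting them from a replication package would make the reported agent behavior impossible to reproduce.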

Example(s)

The paper by Anandayuvaraj et al. (2024) is a good example of making prompts available online. The authors analyze software failures reported in news articles and use prompting to automate tasks such as filtering relevant articles, merging reports, and extracting detailed failure information. Their online appendix contains all prompts used in the study, providing transparency and supporting reproducibility.

Liang et al. (2024)’s paper is another good example of comprehensive prompt reporting. The authors make the exact prompts available in their SUPPLEMENTARY MATERIAL on Figshare, including details such as code blocks being enclosed in triple backticks. The PAPER thoroughly explains the rationale behind the prompt design and the data output format. It also includes an overview figure and two concrete examples, enabling transparency and reproducibility while keeping the main text concise.

An example of reporting full interaction logs is the study by Ronanki, Berger, and Horkoff (2023), for which the authors reported the full ChatGPT answers and uploaded them to Zenodo.

Benefits

Detailed prompt documentation improves verifiability, reproducibility, and comparability of LLM-based studies, allowing other researchers to replicate studies, refine prompts, and evaluate how different content types and formatting choices influence LLM behavior.

Unlike human participant conversations, which often cannot be reported due to confidentiality, LLM interaction logs can be shared. This enables reproduction studies, tracking of response changes over time or across model versions, and secondary research on LLM consistency for specific SE tasks.

Challenges

One challenge is the complexity of prompts that combine multiple components, such as code, error messages, and explanatory text. Formatting differences (e.g., Markdown vs. plain text) can affect how LLMs interpret input. Additionally, prompt length constraints may require careful context management, particularly for tasks involving extensive artifacts such as large codebases. Privacy and confidentiality concerns can hinder prompt sharing, especially when sensitive data is involved.

Not all systems allow complete (system) prompts and interaction logs to be reported with ease, which hinders transparency and verifiability. We therefore recommend exploring open-source tools such as OpenCode (OpenCode Contributors 2025). For commercial tools, researchers MUST report all available information and acknowledge unknown aspects as limitations. Understanding suggestions of commercial tools such as GitHub Copilot might require recreating the exact state of the codebase at the time the suggestion was made, which is a challenging context to report. One solution is to use version control to capture the exact state of the codebase when a recommendation was made, keeping track of the files that were automatically added as context.

Study Types

Researchers MUST report the complete prompts that were used, including all instructions, context, and input data. This applies across all study types, but reporting requirements and specific focus areas vary.

For LLMs for Tools, researchers MUST explain how prompts were generated and structured within the tool. For Studying LLM Usage, especially for controlled experiments, exact prompts MUST be reported for all conditions. For observational studies, especially the ones targeting commercial tools, researchers MUST report the full interaction logs except for when transcripts might identify anonymous participants or reveal personal or confidential information. As mentioned before, if the complete interaction logs cannot be shared, e.g., because they contain confidential information, the prompts and responses MUST at least be summarized and described in the PAPER. For LLMs as Annotators, researchers MUST document any predefined coding guides or instructions included in prompts, as these influence how the model labels artifacts. For LLMs as Judges, researchers MUST report the evaluation criteria, scales, and examples embedded in the prompt to ensure a consistent interpretation. For LLMs for Synthesis tasks (e.g., summarization, aggregation), researchers MUST document the sequence of prompts used to generate and refine outputs, including follow-ups and clarification queries. For LLMs as Subjects (e.g., simulating human participants), researchers MUST report any role-playing instructions, constraints, or personas used to guide LLM behavior. For Benchmarking LLMs studies using pre-defined prompts (e.g., HumanEval, SWE-Bench), researchers MUST specify the benchmark version and any modifications made to the prompts or evaluation setup. If prompt tuning, retrieval-augmented generation (RAG), or other methods were used to adapt prompts, researchers MUST disclose and justify those changes, and SHOULD make the relevant code publicly available.

Advice for Reviewers

As with other guidelines, missing prompt information is typically a minor revision request unless so much is missing that methodological rigor cannot be assessed. The most challenging aspect is the description of prompt development, which is inherently iterative and creative. Reviewers should not expect justification of each word choice or post-hoc rationalized accounts of prompt generation. Instead, reviewers should focus on whether (1) a new research team could use (not reproduce) exactly the same prompts in exactly the same way; and (2) potential biases or validity issues are transparent.

See Also

References

Anandayuvaraj, Dharun, Matthew Campbell, Arav Tewari, and James C. Davis. 2024. “FAIL: Analyzing Software Failures from the News Using LLMs.” In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2024, edited by Vladimir Filkov, Baishakhi Ray, and Minghui Zhou, 506–18. ACM. https://doi.org/10.1145/3691620.3695022.

Chann, Sherman. 2023. “Non-determinism in GPT-4 is caused by Sparse MoE.” https://152334h.github.io/blog/non-determinism-in-gpt-4/.

Liang, Jenny T, Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, and Thomas Zimmermann. 2024. “Can GPT-4 Replicate Empirical Software Engineering Research?” Proceedings of the ACM on Software Engineering 1 (FSE): 1330–53.

Mohsenimofidi, Seyedmoein, Matthias Galster, Christoph Treude, and Sebastian Baltes. 2026. “Context Engineering for AI Agents in Open-Source Software.” In Proceedings of the 23rd IEEE/ACM International Conference on Mining Software Repositories (MSR 2026). https://arxiv.org/abs/2510.21413.

OpenCode Contributors. 2025. “OpenCode: The Open Source AI Coding Agent.” https://opencode.ai/.

Ronanki, Krishna, Christian Berger, and Jennifer Horkoff. 2023. “Investigating ChatGPT’s Potential to Assist in Requirements Elicitation Processes.” In 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 354–61. IEEE.

Sclar, Melanie, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting.” In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=RIu5lyNXjT.