Report Session Traces (G4)
tl;dr: To address model non-determinism and ensure reproducibility, especially when targeting SaaS-based commercial tools, researchers SHOULD include full interaction logs (prompts and responses) as SUPPLEMENTARY MATERIAL if privacy and confidentiality can be ensured. For agentic systems, interaction logs SHOULD be extended to cover exchanges between humans and the agent and between external tools and the agent, including human-in-the-loop feedback and approval decisions. Researchers SHOULD report tool-call traces (tool name, arguments, result, and ordering) as SUPPLEMENTARY MATERIAL so that readers can attribute task outcomes to the model, the tools, or their interaction pattern. Researchers SHOULD report usage traces showing which skills, context files, sub-agents, and tools were actually picked up during a run, which complements the static catalog reported under System and Prompt Design. Developed plans SHOULD be reported as SUPPLEMENTARY MATERIAL if available. When full trace disclosure is not feasible, representative or anonymized examples SHOULD be provided, and unobservable aspects of commercial-tool runs MUST be acknowledged as reproducibility limitations.
Rationale
A session is any bounded stretch of activity during which an LLM is invoked. Sessions cover one-shot prompts, batch runs, multi-turn conversations, and agentic runs that span many tool calls. A session trace records everything that crosses the LLM boundary or is produced around the model during that stretch: prompts received, responses returned, tool calls the model made together with their arguments and results, plans the model developed, and which of the statically configured artifacts (skills, context files, sub-agents, tools) were actually picked up at runtime.
Two kinds of session trace are important to capture. An interaction log captures what humans or other software tools sent to the LLM and what the LLM returned. A tool-call trace captures what the LLM called out to: external APIs, file systems, databases, MCP servers, sub-agents, or other agents. Reporting both kinds is needed for an agentic run, because neither tells the full story on its own.
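The distinction between the two kinds of trace can be made concrete with a minimal sketch. This is an illustrative record layout, not a prescribed schema; all field and tool names here are assumptions chosen for the example.

```python
# Illustrative sketch (not a prescribed schema) of the two kinds of
# session-trace record; field names and tool names are assumptions.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class InteractionEntry:
    """One exchange crossing the LLM boundary (interaction log)."""
    seq: int      # global ordering within the session
    role: str     # "human", "agent", or "tool"
    content: str  # prompt sent or response returned


@dataclass
class ToolCallEntry:
    """One outbound call the LLM made to an external tool (tool-call trace)."""
    seq: int      # shares the ordering space with interaction entries
    tool: str     # hypothetical tool name, e.g. "file_system.read"
    arguments: dict[str, Any] = field(default_factory=dict)
    result: str = ""


# A session trace interleaves both kinds in one ordered list, so readers
# can reconstruct what happened between any two exchanges.
session_trace = [
    InteractionEntry(seq=1, role="human", content="Summarize config.yaml"),
    ToolCallEntry(seq=2, tool="file_system.read",
                  arguments={"path": "config.yaml"}, result="timeout: 30"),
    InteractionEntry(seq=3, role="agent",
                     content="The config sets a 30-second timeout."),
]
```

Interleaving both record kinds in a single ordered sequence is one way to preserve the ordering information that neither trace carries on its own.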
Even with the exact same prompts, decoding strategies, and parameters, LLMs can behave non-deterministically. Non-determinism can arise from probabilistic sampling and, even with greedy decoding (temperature = 0), from batching, input preprocessing, and floating-point arithmetic on GPUs (Chann 2023). For other researchers to verify the conclusions drawn from LLM interactions, the traces of what happened at runtime are often as important as the design of the system itself. This matters particularly for studies targeting commercial software-as-a-service (SaaS) solutions such as ChatGPT, and for agentic runs where behavior depends on how the model chose among many possible tool calls.
The rationale is similar to reporting interview transcripts in qualitative research. Just as a human participant might give different answers to the same question asked two months apart, the responses from tools such as ChatGPT can also vary over time, and the trace of an agent’s decisions on any given run is often not reproducible at a later date.
This section addresses runtime reporting and complements System and Prompt Design, which covers the static artifacts that shape what the model sees.
Recommendations
Interaction Logs:
Researchers SHOULD report the full interaction logs (prompts sent to the LLM and responses returned) as part of their SUPPLEMENTARY MATERIAL. Reporting interaction logs is especially important for studies targeting commercial SaaS solutions. For agentic systems, interaction logs generalize to cover exchanges between humans and the agent as well as between external tools and the agent, including human-in-the-loop feedback, approval or rejection decisions, and iterative refinements. These SHOULD also be reported as SUPPLEMENTARY MATERIAL so that readers can reconstruct the sequence of exchanges and assess human oversight decisions. When full logging is not feasible because of privacy or proprietary constraints, researchers SHOULD provide representative examples or anonymized logs that demonstrate the relevant patterns.
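One lightweight way to produce such a log is to wrap the model call so that every prompt-response pair is appended to a JSON Lines file. The sketch below assumes a generic `call_llm` function (hypothetical); adapt it to whatever client library you actually use.

```python
# Minimal sketch of capturing an interaction log as JSON Lines.
# `call_llm` is a hypothetical stand-in for your actual LLM client.
import datetime
import json


def logged_call(call_llm, prompt, log_path="interaction_log.jsonl"):
    """Send a prompt, return the response, and append both to a JSONL log."""
    response = call_llm(prompt)
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return response
```

JSON Lines keeps each exchange on its own line, which makes logs easy to diff across runs and to truncate or anonymize entry by entry before sharing.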
Tool-Call Traces:
When an LLM calls out to external tools (e.g., sub-agents, MCP servers, file systems, APIs, databases), the sequence of outbound calls forms a tool-call trace distinct from the interaction log. Researchers SHOULD report the complete tool-call trace as SUPPLEMENTARY MATERIAL, including for each call the tool name, arguments, result, and its ordering relative to surrounding interaction-log entries. This traceability is essential for determining whether task success is attributable to the LLM’s output and tool-calling capabilities, the external tools’ functionality, or their interaction patterns; the architectural categorization of these components is covered in System and Prompt Design. When full tool-call logging is not feasible because of privacy or proprietary constraints, researchers SHOULD provide representative examples or anonymized traces that demonstrate the agent’s decision-making process.
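When tool implementations are under the researcher's control, one simple way to capture such a trace is to wrap each tool function so that every call records its name, arguments, result, and order. This is a hedged sketch, and the `database.query` tool below is a hypothetical placeholder.

```python
# Sketch: a decorator that records each tool call's name, arguments,
# result, and ordering into a shared trace. Tool names are hypothetical.
import functools

tool_call_trace = []


def traced(tool_name):
    """Wrap a tool function so every invocation is appended to the trace."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            tool_call_trace.append({
                "seq": len(tool_call_trace) + 1,  # ordering of outbound calls
                "tool": tool_name,
                "arguments": {"args": args, "kwargs": kwargs},
                "result": result,
            })
            return result
        return wrapper
    return decorator


@traced("database.query")
def run_query(sql):
    # Stand-in for a real database call.
    return f"3 rows for: {sql}"
```

For SaaS agents whose internals are not observable, no such wrapper is possible, which is exactly the reproducibility limitation the guideline asks authors to acknowledge.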
Usage Traces:
Static artifacts reported under System and Prompt Design (skills, context files, sub-agents, tool schemas) describe what was available to the model, not what the model actually did with them. Researchers SHOULD report usage traces as SUPPLEMENTARY MATERIAL showing which skills were invoked, which context files were read or injected, which sub-agents were delegated to, and which tools were called during each run. For agents that selectively load context (e.g., on-demand sub-agent spawning), usage traces let readers distinguish between artifacts that were configured and artifacts that actually influenced a given run.
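A usage trace can be summarized by comparing the static catalog of configured artifacts against the set a run actually touched. The sketch below uses invented artifact names purely for illustration.

```python
# Sketch: summarize which configured artifacts a run actually used.
# All artifact names here are invented for illustration.
configured = {
    "skills": {"summarize", "refactor", "test-gen"},
    "sub_agents": {"reviewer", "researcher"},
    "tools": {"file_system.read", "web.search"},
}


def usage_summary(configured, used):
    """Report, per artifact kind, what was used versus merely available."""
    return {
        kind: {
            "used": sorted(available & used.get(kind, set())),
            "unused": sorted(available - used.get(kind, set())),
        }
        for kind, available in configured.items()
    }


# Hypothetical usage observed during one run.
run_used = {"skills": {"summarize"}, "tools": {"file_system.read"}}
summary = usage_summary(configured, run_used)
```

Reporting the "unused" side is as informative as the "used" side: it tells readers which parts of the configuration had no effect on the run being analyzed.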
Agentic Plans:
For agentic systems that autonomously plan and execute tasks, the developed plans SHOULD be reported as SUPPLEMENTARY MATERIAL if available. Plans document the sequence of actions the system chose to pursue and the intermediate goals it set, which is often more informative than the final output alone. Storage conventions vary across tools: Claude Code stores plans as Markdown files that users can open and edit in their default editor during a session, whereas other frameworks such as LangGraph keep plans inside the agent’s internal execution state.
Supplementary Material and Anonymization:
For complete reproducibility, researchers MUST make the supplementary traces publicly available, subject to privacy and confidentiality constraints. If complete traces cannot be included, researchers SHOULD provide summaries and representative examples. For traces containing sensitive information, researchers MUST anonymize personal identifiers, replace proprietary code with placeholders, and clearly highlight modified sections.
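As a minimal illustration of such anonymization, the sketch below masks e-mail addresses and caller-supplied proprietary terms with clearly marked placeholders. The patterns are deliberately simple examples, not a complete anonymization pipeline; real traces need review beyond pattern matching.

```python
# Illustrative sketch of trace anonymization: mask e-mail addresses and
# listed proprietary terms with marked placeholders. Minimal example only;
# a real pipeline needs broader patterns plus manual review.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def anonymize(text, proprietary_terms=()):
    """Mask e-mail addresses and proprietary terms, marking each edit."""
    text = EMAIL.sub("[REDACTED: email]", text)
    for term in proprietary_terms:
        text = text.replace(term, "[REDACTED: proprietary]")
    return text
```

The bracketed placeholders double as the "clearly highlighted modified sections" the guideline requires, so readers can see exactly where a trace was altered.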
Example(s)
An example of reporting full interaction logs is the study by Ronanki, Berger, and Horkoff (2023), in which the authors reported ChatGPT’s full answers and uploaded them to Zenodo.
Benefits
Unlike human participant conversations, which often cannot be reported because of confidentiality, LLM interaction logs can be shared. This enables reproduction studies, tracking of response changes over time or across model versions, and secondary research on LLM consistency for specific SE tasks.
For agentic systems, reporting tool-call traces alongside interaction logs lets readers attribute task outcomes to the right component: the model, the tool, or the orchestration pattern. Usage traces complement the static configuration reported under System and Prompt Design by showing which of the configured artifacts actually influenced a given run.
Challenges
Not all systems make complete interaction logs easy to export, which hinders transparency and verifiability. For commercial tools, researchers MUST report all available information and acknowledge unknown aspects as limitations. Tool-call traces for commercial SaaS agents are often opaque: the user sees the final response but not the sequence of internal tool calls. When this is the case, authors SHOULD document what was and was not observable. For agents running locally or via open-source tools, these traces are usually more accessible and SHOULD be reported whenever available. Agent frameworks differ in whether and how they log agent-to-agent communication, so reporting practices vary across studies.
Study Types
Runtime trace reporting requirements depend heavily on the study type and on the accessibility of the underlying system.
- For Studying LLM Usage, especially observational studies targeting commercial tools, researchers MUST report the full interaction logs except when transcripts might identify anonymous participants or reveal personal or confidential information. If complete interaction logs cannot be shared (e.g., because they contain confidential information), the prompts and responses MUST at least be summarized and described in the PAPER.
- For LLMs for Tools that use agentic execution, researchers SHOULD report tool-call traces and usage traces for the runs used to evaluate the tool.
- For Benchmarking LLMs that use agent-based harnesses, researchers SHOULD report tool-call traces for representative runs, letting readers understand how task success depends on the agent’s decision sequence rather than the model’s raw output alone.
- For LLMs as Annotators, LLMs as Judges, LLMs for Synthesis, and LLMs as Subjects, when the research setup involves multi-turn interaction or agentic orchestration, researchers SHOULD report the corresponding interaction logs and, where applicable, tool-call traces.
Advice for Reviewers
As with other guidelines, missing trace information is typically a minor revision request unless so much is missing that methodological rigor cannot be assessed. Reviewers should recognize that complete trace reporting is easier for local and open-source setups than for commercial SaaS agents. When reviewing commercial-tool studies, reviewers should focus on whether authors have reported everything the system exposed and have been explicit about what remains unobservable.
See Also
- Section System and Prompt Design: static artifacts (prompts, templates, context files, tool and skill definitions) whose runtime use is traced here.
- Section Version and Configuration: model versions and configuration, needed for interpreting non-deterministic behavior across runs.
- Section Human Validation: evaluation of agentic tool outputs.
- Section Open LLMs: open-source tools that expose traces more fully than commercial SaaS.
- Section Limitations and Mitigations: reporting limitations of commercial tool transparency.
References
Chann, Sherman. 2023. “Non-determinism in GPT-4 is caused by Sparse MoE.” https://152334h.github.io/blog/non-determinism-in-gpt-4/.
Ronanki, Krishna, Christian Berger, and Jennifer Horkoff. 2023. “Investigating ChatGPT’s Potential to Assist in Requirements Elicitation Processes.” In 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 354–61. IEEE.