Report Model Version, Configuration, and Customizations (G2)

tl;dr: Researchers MUST report the exact LLM model or tool version, configuration, and experiment date in the PAPER. When using quantized models, researchers SHOULD report the quantization level and method. For fine-tuned models, they MUST describe the fine-tuning goal, dataset, and procedure. Researchers SHOULD include default parameters, explain model choices, compare base and fine-tuned models using suitable metrics and benchmarks, and share fine-tuning data and weights (or alternatively justify why they cannot share them).

Rationale

LLMs and LLM-based tools are frequently updated, and configuration parameters such as temperature or seed values affect content generation. This guideline focuses on documenting the model-specific aspects of empirical studies involving LLMs, concentrating on the models themselves, their version, configuration parameters, and customizations (e.g., fine-tuning). While the Architecture section addresses system-level integration, the information outlined here is always essential for reproducibility whenever an LLM is involved.

Recommendations

Researchers MUST document in the PAPER which model or tool version they used in their study, along with the date when the experiments were carried out and the configured parameters that affect output generation. Since default values might change over time, researchers SHOULD always report all configuration values, even if they used the defaults. Checksums and fingerprints SHOULD be reported since they identify specific versions and configurations. Depending on the study context, other properties such as the context window size (number of tokens) SHOULD be reported. When using quantized models, researchers SHOULD report the quantization level (e.g., 4-bit, 8-bit) and method (e.g., GPTQ or AWQ), as different quantization approaches produce different outputs, affecting both output quality and reproducibility.

Researchers SHOULD also motivate in the PAPER why they selected certain models, versions, and configurations. Reasons may be monetary, technical, or methodological (e.g., a planned comparison to previous work). Depending on the specific study context, additional information regarding the experiment or tool architecture SHOULD be reported.

A common customization approach for existing LLMs is fine-tuning. If a model was fine-tuned, researchers MUST describe the fine-tuning goal (e.g., improving the performance for a specific task), the fine-tuning procedure (e.g., full fine-tuning vs. Low-Rank Adaptation (LoRA), selected hyperparameters, loss function, learning rate, batch size, etc.), and the fine-tuning dataset (e.g., data sources, the preprocessing pipeline, dataset size) in the PAPER. Researchers SHOULD either share the fine-tuning dataset as part of the SUPPLEMENTARY MATERIAL or explain in the PAPER why the data cannot be shared (e.g., because it contains confidential or personal data that could not be anonymized). The same applies to the fine-tuned model weights. Suitable benchmarks and metrics SHOULD be used to compare the base model with the fine-tuned model.

In summary, our recommendation is to report:

  1. Model/tool name and version (MUST in PAPER);
  2. All relevant configured parameters that affect output generation (MUST in PAPER);
  3. Default values of all available parameters (SHOULD);
  4. Checksum/fingerprint of used model version and configuration (SHOULD);
  5. Additional properties such as context window size (SHOULD);
  6. Model quantization level and method, if applicable (SHOULD).
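The items above can be captured in a machine-readable record that accompanies the paper. The following sketch illustrates one possible approach (the model name and all parameter values are placeholders, not recommendations): it serializes a run configuration and derives a short checksum that readers of a replication can recompute and compare.

```python
import hashlib
import json

# Hypothetical run record; model, version, and parameters are placeholders.
run_config = {
    "model": "gpt-4",
    "version": "0125-Preview",
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 512,
    "seed": 23487,
    "experiment_date": "2025-01-10",
}

# Canonical serialization (sorted keys) makes the checksum stable, so a
# replication can verify it uses exactly the reported configuration.
canonical = json.dumps(run_config, sort_keys=True).encode("utf-8")
config_checksum = hashlib.sha256(canonical).hexdigest()[:12]

print(config_checksum)
```

Such a record can be stored in the SUPPLEMENTARY MATERIAL verbatim, with only the short checksum quoted in the PAPER.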

For fine-tuned models, additional recommendations apply:

  1. Fine-tuning goal (MUST in PAPER);
  2. Fine-tuning dataset creation and characterization (MUST in PAPER);
  3. Fine-tuning parameters and procedure (MUST in PAPER);
  4. Fine-tuning dataset and fine-tuned model weights (SHOULD);
  5. Validation metrics and benchmarks (SHOULD).
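Recommendation 5 can be as simple as evaluating both models on the same held-out set with a task-appropriate metric. A minimal sketch using exact-match accuracy follows; the prediction lists are toy placeholders, and a real study would use an established benchmark and possibly richer metrics.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Toy held-out examples; in a real study these come from a benchmark.
references = ["fix off-by-one", "add null check", "rename variable"]
base_preds = ["fix off-by-one", "remove loop", "rename variable"]
tuned_preds = ["fix off-by-one", "add null check", "rename variable"]

base_acc = exact_match_accuracy(base_preds, references)    # 2/3
tuned_acc = exact_match_accuracy(tuned_preds, references)  # 3/3
```

Reporting both scores side by side makes the effect of fine-tuning verifiable rather than asserted.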

Commercial models (e.g., GPT-5) or LLM-based tools (e.g., ChatGPT) might not give researchers access to all required information. Our suggestion is to report what is available and to openly acknowledge the limitations that hinder reproducibility.

Example(s)

Based on the documentation that OpenAI and Azure provide (OpenAI 2025; Microsoft 2025), researchers might, for example, report:

“We integrated a gpt-4 model in version 0125-Preview via the Azure OpenAI Service, and configured it with a temperature of 0.7, top_p set to 0.8, a maximum token length of 512, and the seed value 23487. We ran our experiment on 10th January 2025. The system fingerprint was fp_6b68a8204b.”

Kang, Yoon, and Yoo (2023) provide a similar statement in their paper on exploring LLM-based bug reproduction:

“We access OpenAI Codex via its closed beta API, using the code-davinci-002 model. For Codex, we set the temperature to 0.7, and the maximum number of tokens to 256.”

Our guidelines additionally suggest reporting a checksum/fingerprint and exact dates, but otherwise this example is close to our recommendations.

Dhar, Vaidhyanathan, and Varma (2024) assessed whether LLMs can generate architectural design decisions, detailing the system architecture and the LLM’s role within it. They provide information on the fine-tuning approach and datasets, including the source of architectural decision records, preprocessing methods, and data selection criteria.

For self-hosted models, the SUPPLEMENTARY MATERIAL can become a true replication package. For example, for models provisioned using ollama, one can report the specific tag and checksum, e.g., “llama3.3, tag 70b-instruct-q8_0, checksum d5b5e1b84868.” Given suitable hardware, running the model is then as easy as executing the following command: ollama run llama3.3:70b-instruct-q8_0
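For self-hosted weights, a reported checksum can be recomputed locally before re-running an experiment. The streamed hashing sketch below is generic; the temporary file in the demo merely stands in for a model blob, and real model files would be hashed the same way.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, streaming to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Self-contained demo: checksum a small temporary file standing in for weights.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"model weights placeholder")
    tmp_path = tmp.name

checksum = file_sha256(tmp_path)
os.remove(tmp_path)
```

Matching the recomputed digest against the one reported in the SUPPLEMENTARY MATERIAL confirms that the replication uses the identical artifact.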

Benefits

Reporting this information is a prerequisite for the verification, reproduction, and replication of LLM-based studies. While LLMs are inherently non-deterministic, this does not excuse dismissing reproducibility altogether. Exact reproducibility is hard to achieve, but reporting the information outlined in this guideline helps researchers come as close as possible to that standard.

Challenges

Different model providers, and different ways of operating the models, expose varying degrees of information. For example, OpenAI provides a model version and a system fingerprint describing the backend configuration, which can also influence the output. The fingerprint, however, is intended only to detect changes in the model or its configuration; one cannot revert to a certain fingerprint. As a beta feature, OpenAI lets users set a seed parameter to receive “(mostly) consistent output” (OpenAI 2023). However, the seed value does not allow for full reproducibility, and the fingerprint changes frequently. Although, as motivated above, open models significantly simplify re-running experiments, they also come with reproducibility challenges, as generated outputs can be inconsistent despite setting the temperature to 0 and using a fixed seed value (see the corresponding GitHub issue for Llama3). Setting the temperature to 0 configures greedy decoding (always selecting the most probable next token), which minimizes output variability but can degrade quality by producing repetitive text and missing higher-quality responses (Holtzman et al. 2020).

Even with a temperature of 0, full determinism is rarely guaranteed: floating-point arithmetic on GPUs causes slight numerical differences that cascade into divergent token selections (Yuan et al. 2025), Sparse Mixture-of-Experts routing amplifies this effect (Chann 2023), silent backend changes in commercial APIs produce different outputs over time (Chen, Zaharia, and Zou 2024), and even self-hosted open models with identical settings do not always yield consistent outputs (Angermeir et al. 2025; Song et al. 2025). Researchers should therefore not treat a temperature of 0 as a guarantee of reproducibility, but as one measure among several, including fixed seed values (OpenAI 2023), system fingerprints, and archiving of raw outputs. When a temperature of 0 is chosen primarily for reproducibility, this motivation SHOULD be stated explicitly, along with an acknowledgment of its potential impact on output quality.
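One such practical measure is to re-run identical requests and fingerprint the outputs before claiming consistency. A minimal sketch follows, where `generate` is a stand-in for the actual model call (an assumption, not a real API); the deterministic lambda in the demo would be replaced by real LLM requests.

```python
import hashlib
from collections import Counter

def output_fingerprints(generate, prompt, runs=5):
    """Call a generation function repeatedly and fingerprint each output.

    `generate` is a placeholder for the model call; identical fingerprints
    across runs indicate stable output for this prompt and configuration.
    """
    fingerprints = []
    for _ in range(runs):
        text = generate(prompt)
        fingerprints.append(hashlib.sha256(text.encode("utf-8")).hexdigest()[:12])
    return Counter(fingerprints)

# Toy deterministic stand-in; a real study would issue actual LLM requests.
counts = output_fingerprints(lambda p: p.upper(), "fix the bug", runs=3)
distinct_outputs = len(counts)  # 1 distinct output: fully consistent here
```

Reporting the number of distinct outputs per prompt, alongside archived raw outputs, documents the actual degree of non-determinism instead of assuming it away.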

Study Types

This guideline MUST be followed for all study types for which the researcher has access to (parts of) the model’s configuration. Researchers MUST always report the configuration that is visible to them, acknowledging the reproducibility challenges of commercial tools and models that are offered as-a-service. Depending on the specific study type, researchers SHOULD provide additional information on the architecture of a tool they built (see Architecture), prompts and interaction logs (see Prompts and Logs), and specific limitations and mitigations (see Limitations and Mitigations).

For example, when Studying LLM Usage by focusing on commercial tools such as ChatGPT or GitHub Copilot, researchers MUST be as specific as possible in describing their study setup. The configured model name, version, and the date when the experiment was conducted MUST always be reported. In those cases, reporting other aspects, such as prompts and interaction logs, is essential.

For LLMs as Annotators, LLMs as Judges, and LLMs for Synthesis, researchers MUST report the model configuration used for the respective annotation, judging, or synthesis tasks, including temperature and other sampling parameters that affect output variability. For LLMs as Subjects, researchers MUST report any persona-related configuration settings and parameters that shape the simulated behavior. For LLMs for Tools, researchers MUST report the configuration for each model integrated in the tool’s architecture, including any model-specific parameter choices. For Benchmarking LLMs, researchers MUST report the configuration for all benchmarked models to enable fair cross-model comparisons.

Advice for Reviewers

Missing version, configuration, or parameter information is typically a minor revision request. Before concluding that information is absent, reviewers should check appendices and supplementary materials, as details are sometimes reported there rather than in the main text. Rejection over missing details is rarely warranted unless the omissions obscure deeper methodological problems.

See Also

References

Angermeir, Florian, Maximilian Amougou, Mark Kreitz, Andreas Bauer, Matthias Linhuber, Davide Fucci, Fabiola Moyón Constante, Daniel Méndez, and Tony Gorschek. 2025. “Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies.” CoRR abs/2510.25506. https://doi.org/10.48550/ARXIV.2510.25506.

Chann, Sherman. 2023. “Non-determinism in GPT-4 is caused by Sparse MoE.” https://152334h.github.io/blog/non-determinism-in-gpt-4/.

Chen, Lingjiao, Matei Zaharia, and James Zou. 2024. “How Is ChatGPT’s Behavior Changing over Time?” Harvard Data Science Review 6 (2). https://doi.org/10.1162/99608f92.5317da47.

Dhar, Rudra, Karthik Vaidhyanathan, and Vasudeva Varma. 2024. “Can LLMs Generate Architectural Design Decisions? - an Exploratory Empirical Study.” In 21st IEEE International Conference on Software Architecture, ICSA 2024, Hyderabad, India, June 4-8, 2024, 79–89. IEEE. https://doi.org/10.1109/ICSA59870.2024.00016.

Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. “The Curious Case of Neural Text Degeneration.” In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rygGQyrFvH.

Kang, Sungmin, Juyeon Yoon, and Shin Yoo. 2023. “Large Language Models Are Few-Shot Testers: Exploring LLM-Based General Bug Reproduction.” In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, 2312–23. IEEE. https://doi.org/10.1109/ICSE48619.2023.00194.

Microsoft. 2025. “Azure OpenAI Service models.” https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models.

OpenAI. 2023. “How to make your completions outputs consistent with the new seed parameter.” https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter.

———. 2025. “OpenAI API Introduction.” https://platform.openai.com/docs/api-reference/chat/streaming.

Song, Yifan, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. 2025. “The Good, the Bad, and the Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism.” In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, edited by Luis Chiruzzo, Alan Ritter, and Lu Wang, 4195–4206. Association for Computational Linguistics. https://doi.org/10.18653/V1/2025.NAACL-LONG.211.

Yuan, Jiayi, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. 2025. “Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference.” In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, NeurIPS 2025. https://openreview.net/forum?id=Q3qAsZAEZw.