Use Human Validation for LLM Outputs (G5)

tl;dr: If assessing the quality of generated artifacts is important and no reference datasets or suitable comparison metrics exist, researchers SHOULD use human validation for LLM outputs. If they do, they MUST define the measured construct (e.g., usability, maintainability) and describe the measurement instrument in the PAPER. Researchers SHOULD consider human validation early in the study design (not as an afterthought), build on established reference models for human-LLM comparison, and share their instruments as SUPPLEMENTARY MATERIAL. When aggregating LLM judgments, methods and rationale SHOULD be reported and inter-rater agreement SHOULD be assessed. Confounding factors SHOULD be controlled for, and power analysis SHOULD be conducted to ensure statistical robustness.

Rationale

Automated metrics alone cannot fully capture the validity of subjective or complex constructs. When LLMs are used to automate tasks that involve human judgment, their outputs must be validated against human assessments to ensure that the results are not only reliable but also valid.

Recommendations

When LLMs are used to automate research or software development tasks previously performed by humans, it is essential to assess the LLM’s performance. When existing reference datasets or comparison metrics cannot fully capture the target qualities relevant in the study context, researchers SHOULD rely on human judgment to validate the LLM outputs.

Study Design Considerations:

Integrating human participants in the study design will require additional considerations, including a recruitment strategy, annotation guidelines, training sessions, or ethical approvals. Therefore, researchers SHOULD consider human validation early in the study design, not as an afterthought. Authors MUST clearly define the constructs that the human and LLM annotators evaluate (Ralph and Tempero 2018). When designing custom instruments to assess LLM output (e.g., questionnaires, scales), researchers MUST share their instruments, for example in the SUPPLEMENTARY MATERIAL.

Subjective Judgment and Agreement:

When the judgment is subjective (i.e., depends on the judge’s values or theories), the LLM output SHOULD be evaluated against judgments aggregated from multiple human judges, and researchers SHOULD clearly describe their aggregation method and reasoning. The preferred method is to randomly order the objects requiring judgment and put them in groups small enough to judge in one to three hours. Two or three human experts review the first group together, simultaneously judging, discussing, and creating a set of decision rules documenting their reasoning. Then, the experts iterate through rounds of:

  1. independently rating one group of objects,
  2. calculating inter-rater agreement (IRA) or reliability (IRR),
  3. meeting to discuss disagreements and reach consensus,
  4. updating the decision rules so the disagreement will not recur.

The goal is to reach a target IRA or IRR (e.g., Krippendorff’s α > 0.8), indicating that the decision rules are sufficient. Once this target is reached and sustained for two to three groups, it is permissible to continue with a single rater. Researchers SHOULD report measures of IRA or IRR, preferably broken down by round.
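The round-based procedure above hinges on recomputing agreement after each round. As an illustrative sketch (not a substitute for a vetted statistics library), Krippendorff’s α for nominal labels without missing data can be computed as follows; the function name and the rating data are hypothetical:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data without missing values.

    `units` is a list of rating lists, one inner list per rated object
    (each inner list holds the labels assigned by the raters in one round).
    """
    coincidences = Counter()            # o_ck: label-pair coincidence counts
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue                    # unpaired units carry no information
        for c, k in permutations(ratings, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    n = sum(coincidences.values())      # total number of pairable values
    marginals = Counter()
    for (c, _k), count in coincidences.items():
        marginals[c] += count
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    expected = sum(marginals[c] * marginals[k]
                   for c in marginals for k in marginals if c != k) / (n * (n - 1))
    return 1 - observed / expected

# Perfect agreement across two raters yields alpha = 1.0:
round_1 = [["a", "a"], ["b", "b"], ["a", "a"], ["b", "b"]]
print(krippendorff_alpha_nominal(round_1))  # → 1.0
```

Running this on each round’s ratings makes it easy to report α per round, as recommended above.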

Confounding factors SHOULD be discussed and, where feasible, controlled for (e.g., by categorizing participants according to their level of experience or expertise). Where applicable, researchers may perform a power analysis (Cohen 1992; Dybå, Kampenes, and Sjøberg 2006) to estimate the required sample size, ensuring sufficient statistical power in their experimental design. However, we note that, to our knowledge, there is limited established guidance on determining sample sizes for LLM-human comparisons. In other fields, 100 comparisons are often made without further explanation (Tam et al. 2024).
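As a sketch of such a power analysis, the standard normal-approximation formula for a two-sided, two-sample comparison yields the per-group sample size from the expected effect size (Cohen’s d); exact t-based calculations, as tabulated by Cohen (1992), give slightly larger values:

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_groups(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison, using the normal approximation
    n = 2 * ((z_{1-alpha/2} + z_{1-power}) / d)^2."""
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2
    return ceil(n)

# Medium effect (d = 0.5) at alpha = 0.05 and 80% power:
print(sample_size_two_groups(0.5))  # → 63 (Cohen's exact tables give 64)
```

For a human-vs-LLM comparison, the per-group size applies to each condition (human-judged and LLM-judged samples).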

Researchers SHOULD use established reference models to compare humans with LLMs. For example, Schneider, Fotrousi, and Wohlrab (2025) outline design considerations for studies comparing LLMs with humans. If studies aim to automate the annotation of software artifacts using an LLM, researchers SHOULD follow systematic approaches to decide whether and how human annotators can be replaced (e.g., showing that a jury of three LLMs exhibits model-to-model agreement comparable to trained human annotators, such as Krippendorff’s α > 0.8). Model-to-model agreement should not be held to a lower standard than human-to-human agreement, as low thresholds may simply reflect shared model biases rather than annotation quality.
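For instance, the model-to-model agreement of a hypothetical jury of three LLM annotators could be compared against the agreement of human annotators on the same items. The sketch below uses mean pairwise Cohen’s κ as a simple illustration (Krippendorff’s α would serve the same purpose); all names and data are illustrative:

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    fa, fb = Counter(a), Counter(b)
    pe = sum(fa[c] * fb[c] for c in fa) / n ** 2        # chance agreement
    if pe == 1:
        return 1.0  # degenerate case: both used a single identical label
    return (po - pe) / (1 - pe)

def jury_agreement(annotations):
    """Mean pairwise kappa across all annotator pairs in a jury."""
    pairs = list(combinations(annotations, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical labels from three LLM annotators on the same four artifacts:
llm_jury = [["bug", "doc", "bug", "doc"],
            ["bug", "doc", "bug", "doc"],
            ["bug", "doc", "bug", "bug"]]
print(jury_agreement(llm_jury))
```

The same function applied to human annotations gives the human-to-human baseline the jury must match, per the guideline above.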

Agentic Tools

Besides generating and modifying content, agentic software development tools such as Claude Code can autonomously call command-line tools or pull in additional information from MCP servers. For such tools, it makes sense to separate textual output (e.g., file changes) from output related to tool execution. When evaluating agentic tools, researchers SHOULD assess the feedback that users provided for proposed changes, reporting statistics on how frequently users accepted the generated content and how they modified it. This agentic human-in-the-loop interaction is essentially a built-in human validation, even though the degrees of freedom are larger than in more traditional experiments that use LLMs directly.
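Such acceptance statistics can be derived directly from logged user feedback. The sketch below assumes a hypothetical log in which each proposed change was labeled "accepted", "modified", or "rejected"; the labels are illustrative, not a standard taxonomy:

```python
from collections import Counter

def acceptance_stats(feedback):
    """Proportion of each feedback label across all proposed changes."""
    counts = Counter(feedback)
    total = len(feedback)
    return {label: count / total for label, count in counts.items()}

# Hypothetical feedback log for four proposed changes:
log = ["accepted", "accepted", "modified", "rejected"]
print(acceptance_stats(log))  # → {'accepted': 0.5, 'modified': 0.25, 'rejected': 0.25}
```

Reporting these proportions (and, for "modified" changes, how the content was edited) operationalizes the built-in human validation described above.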

Example(s)

Ahmed et al. (2025) proposed a systematic method for deciding whether LLMs can replace human annotators on a given task, using model-to-model agreement as an initial screening criterion (with a threshold of α > 0.5) and model confidence for sample-level decisions. However, their threshold is well below the levels generally considered acceptable for inter-rater agreement. Krippendorff (2018) recommends discarding data with α < 0.667, considers 0.667 ≤ α < 0.8 sufficient only for tentative conclusions, and requires α ≥ 0.8 for reliable data. Researchers adopting similar screening approaches should apply thresholds consistent with these established standards. Notably, while three LLMs exhibited higher inter-model agreement than human annotators on some tasks, human-model agreement remained low on others. This illustrates that high model-model reliability does not guarantee alignment with human judgment, reinforcing the need for human validation before assuming that LLM annotations are valid. Moreover, high model-model agreement could merely indicate that the models share systematic biases and hence reliably agree on the wrong answer.

Hymel and Johnson (2025) evaluated ChatGPT’s capability to generate requirements documents by comparing an LLM-generated and a human-generated document based on the same business use case. Domain experts reviewed both documents and attempted to distinguish their origin.

Benefits

Validating LLMs against human judgments builds confidence in the LLM’s accuracy and validity, increasing the trustworthiness of findings, especially when IRA or IRR metrics are reported (Khraisha et al. 2024). Furthermore, incorporating feedback from human judges may help to improve the LLM or LLM-based tool based on the reported experiences.

Challenges

Assessing a construct like LLM performance is challenging because ensuring that a construct is defined well and operationalized using an appropriate measurement model requires a deep understanding of (1) the construct, (2) construct validity in general, and (3) instrumentation (Sjøberg and Bergersen 2023; Ralph et al. 2024). Comparing an LLM to human judges is typically slower and more expensive than machine-generated measures. More fundamentally, neither human judgment nor machine-generated measures provides an objective ground truth against which LLM accuracy can be firmly determined.

Human judgments exhibit variability due to differences in experience, expertise, interpretation, and personal biases (McDonald, Schoenebeck, and Forte 2019). When diverse humans rate items reliably given clear decision rules, it is tempting to assume that reliability implies validity, but it does not. Measuring reliability is much easier than measuring validity, and often the best researchers can do is argue conceptually for why their judges, decision rules, and constructs should produce valid ratings.

Researchers must weigh the cost and time needed for human validation against the expected improvement in validity over machine-generated measures.

Study Types

This guideline applies to all study types, although the need for human validation varies. For LLMs as Annotators, researchers SHOULD validate LLM-generated annotations against human annotators to assess labeling quality and identify systematic biases. When using LLMs as Judges, researchers SHOULD co-create initial rating criteria with humans and validate a sample of LLM judgments against human expert assessments. For LLMs for Synthesis, researchers SHOULD employ human oversight to verify that qualitative interpretations and synthesized outputs faithfully represent the underlying data. For LLMs as Subjects, researchers SHOULD validate simulated responses against real human data to assess the fidelity of the simulation. For Studying LLM Usage, researchers SHOULD carefully reflect on the validity of their evaluation criteria and validate subjective assessments with human experts. In developing LLMs for Tools, human validation of the LLM output is critical when the output is supposed to match human expectations. For Benchmarking LLMs, there is less need for human validation when using extensively validated and widely-used benchmarks, but researchers SHOULD employ human validation when creating or adapting new benchmarks.

Advice for Reviewers

Human validation may be the most challenging of our guidelines to assess because it often requires evaluating conceptual arguments. If LLM output is validated only by comparison with other LLMs, reviewers should look for quantitative empirical evidence that such comparison is reliable and valid. High inter-model agreement alone is insufficient, as reliability does not imply validity. Similarly, reviewers should expect evidence that any employed benchmarks are reliable and valid. Absent such evidence, human validation is warranted. A single human judge is appropriate only when judgments depend on widely accepted theories and involve limited value conflict (e.g., tagging method names containing abbreviations). For multiple judges, reviewers should expect IRA/IRR improvement techniques as described in the recommendations above (experienced raters, organized rounds, consensus meetings, updated decision rules). Low IRA or IRR (e.g., Krippendorff’s α < 0.8) without these techniques is a concern. Conversely, if authors have followed best practices and still obtained mediocre results (e.g., 0.667 ≤ α < 0.8), this should be noted as a limitation. Beyond reliability, reviewers should expect authors to explain conceptually why their human judgments should be valid, considering construct definitions, decision rules, and judge expertise. As with other guidelines, missing information is typically a revision request. Missing judge instructions, instruments, decision rules, or construct definitions may prevent assessment of rigor and validity, while missing recruitment details are less critical. Clarification requests about construct definitions are routine and should not alone warrant rejection.

References

Ahmed, Toufique, Premkumar T. Devanbu, Christoph Treude, and Michael Pradel. 2025. “Can LLMs Replace Manual Annotation of Software Engineering Artifacts?” In 22nd IEEE/ACM International Conference on Mining Software Repositories, MSR@ICSE 2025, Ottawa, ON, Canada, April 28-29, 2025, 526–38. IEEE. https://doi.org/10.1109/MSR66628.2025.00086.

Cohen, Jacob. 1992. “A Power Primer.” Psychological Bulletin 112 (1): 155–59. https://doi.org/10.1037/0033-2909.112.1.155.

Dybå, Tore, Vigdis By Kampenes, and Dag I. K. Sjøberg. 2006. “A Systematic Review of Statistical Power in Software Engineering Experiments.” Inf. Softw. Technol. 48 (8): 745–55. https://doi.org/10.1016/J.INFSOF.2005.08.009.

Hymel, Cory, and Hiroe Johnson. 2025. “Analysis of LLMs Vs Human Experts in Requirements Engineering.” CoRR abs/2501.19297. https://doi.org/10.48550/ARXIV.2501.19297.

Khraisha, Qusai, Sophie Put, Johanna Kappenberg, Azza Warraitch, and Kristin Hadfield. 2024. “Can Large Language Models Replace Humans in Systematic Reviews? Evaluating GPT-4’s Efficacy in Screening and Extracting Data from Peer-Reviewed and Grey Literature in Multiple Languages.” Research Synthesis Methods 15 (4): 616–26. https://doi.org/10.1002/jrsm.1715.

Krippendorff, Klaus. 2018. Content Analysis: An Introduction to Its Methodology. 4th ed. SAGE Publications.

McDonald, Nora, Sarita Schoenebeck, and Andrea Forte. 2019. “Reliability and Inter-Rater Reliability in Qualitative Research: Norms and Guidelines for CSCW and HCI Practice.” Proc. ACM Hum. Comput. Interact. 3 (CSCW): 72:1–23. https://doi.org/10.1145/3359174.

Ralph, Paul, Miikka Kuutila, Hera Arif, and Bimpe Ayoola. 2024. “Teaching Software Metrology: The Science of Measurement for Software Engineering.” In Handbook on Teaching Empirical Software Engineering, edited by Daniel Méndez, Paris Avgeriou, Marcos Kalinowski, and Nauman Bin Ali, 101–54. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-71769-7_5.

Ralph, Paul, and Ewan D. Tempero. 2018. “Construct Validity in Software Engineering Research and Software Metrics.” In Proceedings of the 22nd International Conference on Evaluation and Assessment in Software Engineering, EASE2018, edited by Austen Rainer, Stephen G. MacDonell, and Jacky W. Keung, 13–23. ACM. https://doi.org/10.1145/3210459.3210461.

Schneider, Kurt, Farnaz Fotrousi, and Rebekka Wohlrab. 2025. “A Reference Model for Empirically Comparing LLMs with Humans.” In 47th IEEE/ACM International Conference on Software Engineering: Software Engineering in Society, ICSE-SEIS 2025, Ottawa, ON, Canada, April 27 - May 3, 2025, 130–34. IEEE. https://doi.org/10.1109/ICSE-SEIS66351.2025.00018.

Sjøberg, Dag I. K., and Gunnar Rye Bergersen. 2023. “Construct Validity in Software Engineering.” IEEE Trans. Software Eng. 49 (3): 1374–96. https://doi.org/10.1109/TSE.2022.3176725.

Tam, Thomas Yu Chow, Sonish Sivarajkumar, Sumit Kapoor, Alisa V Stolyar, Katelyn Polanska, Karleigh R McCarthy, Hunter Osterhoudt, et al. 2024. “A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review.” NPJ Digital Medicine 7 (1): 258.