Study Types
The development of empirical guidelines for software engineering (SE) studies involving LLMs is crucial to ensuring the validity and reproducibility of results. However, such guidelines must be tailored to different study types that each pose unique challenges. Therefore, we developed a taxonomy of study types that we then use to contextualize the recommendations we provide. Note that our Guidelines refer to the Study Types but not the other way around.
Each study type section starts with a description, followed by examples from the SE research community and beyond, as well as the advantages and challenges of using LLMs for the respective study type. The remainder of this section describes each study type in turn.
Introduction: LLMs as Tools for Software Engineering Researchers
LLMs can serve as powerful tools to help researchers conduct empirical studies. They can automate various tasks such as data collection, pre-processing, and analysis. For example, LLMs can apply predefined coding guides to qualitative datasets (LLMs as Annotators), assess the quality of software artifacts (LLMs as Judges), generate summaries of research papers (LLMs for Synthesis), or simulate human behavior in empirical studies (LLMs as Subjects). This can significantly reduce the time and effort required to conduct a study. However, besides their many advantages, these applications also come with challenges, such as potential threats to validity and implications for the reproducibility of study results.
LLMs as Annotators
Description
Just like human annotators, LLMs can label artifacts based on a pre-defined coding guide. However, they can label data much faster than any human could. In qualitative data analysis, manually annotating (“coding”) natural language text, e.g., in software artifacts, open-ended survey responses, or interview transcripts, is a time-consuming process [1]. LLMs can be used to augment or replace human annotations, provide suggestions for new codes (see Section LLMs for Synthesis), or even automate the entire qualitative data analysis process.
Example(s)
Recent work in software engineering has begun to explore the use of LLMs for annotation tasks. Huang et al. [2] proposed an approach that utilizes multiple LLMs for joint annotation of mobile application reviews. They used three models of comparable size with an absolute majority voting rule (i.e., a label is only accepted if it receives more than half of the total votes from the models). Accordingly, the annotations fell into three categories: exact matches (where all models agreed), partial matches (where a majority agreed), and non-matches (where no majority was reached). Ahmed et al. [3] examined LLMs as annotators in software engineering research across five datasets, six LLMs, and ten annotation tasks. They found that model-model agreement strongly correlates with human-model agreement, suggesting situations in which LLMs could effectively replace human annotators. Their research showed that for tasks where humans themselves disagreed significantly, models also performed poorly. Conversely, if multiple LLMs reach similar solutions independently, then LLMs are likely suitable for the annotation task. They proposed using model confidence scores to identify specific samples that could be safely delegated to LLMs, potentially reducing human annotation effort without compromising inter-rater agreement.
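To illustrate the kind of absolute majority voting rule used in such multi-LLM annotation setups, the following sketch shows one possible implementation; the labels and the three-model setup are hypothetical, and the actual approach in [2] may differ in its details:

```python
from collections import Counter

def majority_vote(labels: list[str]) -> tuple[str | None, str]:
    """Apply an absolute majority rule to the labels produced by multiple models.

    Returns the accepted label (or None) and the agreement category:
    'exact' (all models agree), 'partial' (absolute majority), or 'none'.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    if votes == len(labels):
        return label, "exact"
    if votes > len(labels) / 2:
        return label, "partial"
    return None, "none"

# Hypothetical annotations of one app review produced by three different LLMs.
annotations = ["bug report", "bug report", "feature request"]
print(majority_vote(annotations))  # ('bug report', 'partial')
```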
Advantages
Recent research demonstrates several advantages of using LLMs as annotators, including their cost effectiveness and accuracy. LLM-based annotation can dramatically reduce costs compared to human labeling, with studies showing cost reductions of 50-96% on various natural language tasks [4]. For example, He et al. [5] found that GPT-4 annotation cost only $122.08, compared to $4,508 for a comparable MTurk pipeline. Moreover, the LLM-based approach resulted in a completion time of just 2 days versus several weeks for the crowd-sourced approach. LLMs consistently demonstrate strong performance, with ChatGPT’s accuracy exceeding that of crowd workers by approximately 25% on average [6] and achieving impressive results in specific tasks such as sentiment analysis (65% accuracy) [7]. LLMs also show remarkably high inter-rater agreement, higher than crowd workers and trained annotators [6].
Challenges
Challenges of using LLMs as annotators include reliability issues, human-LLM interaction challenges, biases, errors, and resource considerations. Studies suggest that while LLMs show promise as annotation tools in SE research, their optimal use may be in augmenting rather than replacing human annotators [4], [5]. LLMs can negatively affect human judgment when labels are incorrect [8], and their overconfidence requires careful verification [9]. Moreover, in previous studies, LLMs have shown significant variability in annotation quality depending on the dataset and the annotation task [10]. Studies have shown that LLMs are especially unreliable for high-stakes labeling tasks [11] and that LLMs can have notable performance disparities between label categories [7]. Recent empirical evidence indicates that LLM consistency in text annotation often falls below scientific reliability thresholds, with outputs being sensitive to minor prompt variations [12]. Context-dependent annotations pose a specific challenge, as LLMs show difficulty in correctly interpreting text segments that require broader contextual understanding [5]. Although pooling of multiple outputs can improve reliability, this approach requires additional computational resources and still requires validation against human-annotated data. While generally cost-effective, LLM annotation requires careful management of per-token charges, particularly for longer texts [4]. Furthermore, achieving reliable annotations may require multiple runs of the same input to enable majority voting [12], although the exact cost comparison between LLM-based and human annotation is controversial [5]. Finally, research has identified consistent biases in label assignment, including tendencies to overestimate certain labels and misclassify neutral content [7].
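One practical consequence is that researchers should quantify run-to-run consistency before trusting LLM labels. A minimal sketch, assuming repeated runs of the same prompt over the same items and using the share of runs that agree with the modal label as a deliberately simple consistency measure (more principled statistics such as Krippendorff's alpha can be substituted):

```python
from collections import Counter

def per_item_consistency(runs: list[list[str]]) -> list[float]:
    """For each item, the share of runs that agree with the item's modal label.

    `runs` holds one list of labels per repeated LLM run, aligned by item index.
    """
    consistency = []
    for item_labels in zip(*runs):
        counts = Counter(item_labels)
        consistency.append(counts.most_common(1)[0][1] / len(item_labels))
    return consistency

# Hypothetical: three runs of the same annotation prompt over four items.
runs = [
    ["positive", "neutral", "negative", "positive"],
    ["positive", "positive", "negative", "positive"],
    ["positive", "neutral", "negative", "neutral"],
]
print(per_item_consistency(runs))  # [1.0, 0.67, 1.0, 0.67] (rounded)
```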
LLMs as Judges
Description
LLMs can act as judges or raters to evaluate properties of software artifacts. For example, LLMs can be used to assess code readability, adherence to coding standards, or the quality of code comments. Judgment is distinct from the more qualitative task of assigning a code or label to typically unstructured text (see Section LLMs as Annotators). It is also different from using LLMs for general software engineering tasks, as discussed in Section LLMs as Tools for Software Engineers.
Example(s)
Lubos et al. [13] leveraged Llama-2 to evaluate the quality of software requirements statements. They prompted the LLM with the text below, where the words in curly brackets reflect the study parameters:
Your task is to evaluate the quality of a software requirement.
Evaluate whether the following requirement is {quality_characteristic}.
{quality_characteristic} means: {quality_characteristic_explanation}
The evaluation result must be: ‘yes’ or ‘no’.
Request: Based on the following description of the project: {project_description}
Evaluate the quality of the following requirement: {requirement}.
Explain your decision and suggest an improved version.
Lubos et al. [13] evaluated the LLM’s output against human judges to assess how well the LLM matched experts. Agreement can be measured in many ways; this study used Cohen’s kappa [14] and found moderate agreement for simple requirements and poor agreement for more complex requirements. However, human evaluation of machine judgments may not be the same as evaluation of human judgments and is an open area of research [15]. A crucial decision in studies focusing on LLMs as judges is the number of examples provided to the LLM. This might involve providing no examples (zero-shot), a few examples (few-shot), or many examples (closer to traditional training data). In the example above, Lubos et al. [13] chose a zero-shot approach, providing no specific guidance besides the project context; they did not show the LLM what a ‘yes’ or ‘no’ answer might look like. In contrast, Wang et al. [16] provided a rubric when using an LLM as a judge to produce interpretable scores (0 to 4) for the overall quality of acceptance criteria generated from user stories. They combined this with a lower-level reward stage to refine acceptance criteria, which significantly improved correctness, clarity, and alignment with the user stories.
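As a concrete illustration of measuring such agreement, the following sketch computes Cohen's kappa [14] between human and LLM judgments for binary 'yes'/'no' labels; the labels are invented for illustration, and scikit-learn is assumed to be available:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary quality judgments for ten requirements.
human_labels = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
llm_labels   = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # chance-corrected agreement, ranges from -1 to 1
```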
Advantages
Depending on the model configuration, LLMs can provide relatively consistent evaluations. They can further help mitigate human biases and the general variability that human judges might introduce. This may lead to more reliable and reproducible results in empirical studies, to the extent that these models can be reproduced or checkpointed. LLMs can be much more efficient and scale more easily than the equivalent human approach. With LLM automation, entire datasets can be assessed, as opposed to subsets, and the assessment can be produced at the selected level of granularity, e.g., binary outputs (‘yes’ or ‘no’) or defined levels (e.g., 0 to 4). However, one key constraint, which varies by model and budget, is the input context size, i.e., the number of tokens that can be passed to a model. For example, the upper bound of the context that can be passed to OpenAI’s o1-mini model is 32k tokens.
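Because of this constraint, it can be useful to check whether an artifact (plus the judgment prompt) fits the model's context window before sending it. The sketch below uses the tiktoken tokenizer; the 32k limit and the encoding name are assumptions for an OpenAI-style model:

```python
import tiktoken

CONTEXT_LIMIT = 32_000  # assumed context window of the target model

def fits_context(prompt: str, artifact: str, reserve_for_output: int = 1_000) -> bool:
    """Check whether prompt + artifact (+ reserved output tokens) fit the context window."""
    enc = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models
    n_tokens = len(enc.encode(prompt)) + len(enc.encode(artifact))
    return n_tokens + reserve_for_output <= CONTEXT_LIMIT

code = "def add(a, b):\n    return a + b"
print(fits_context("Rate the readability of the following code from 0 to 4:", code))
```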
Challenges
When relying on the judgment of LLMs, researchers must build a reliable process for generating judgment labels that considers the non-deterministic nature of LLMs and report the intricacies of that process transparently [17]. For example, the order of options has been shown to affect LLM outputs in multiple-choice settings [18]. In addition to reliability, other quality attributes to consider include the accuracy of the labels; a reliable LLM might be reliably inaccurate. Evaluating and judging large numbers of items, for example, to perform fault localization on the thousands of bugs that large open-source projects have to deal with, comes with costs in wall-clock time, compute, and environmental impact. Evidence shows that LLMs can behave differently when reviewing their own outputs [19]. In more human-oriented datasets (such as discussions of pull requests), LLMs may suffer from well-documented biases and issues with fairness [20]. For tasks in which human judges disagree significantly, it is not clear whether an LLM judge should reflect the majority opinion or act as an independent judge. The underlying statistical framework of an LLM usually pushes outputs towards the most likely (majority) answer. There is ongoing research on how suitable LLMs are as independent judges; questions about bias, accuracy, and trust remain [21]. There is reason for concern about LLMs judging student assignments or doing peer review of scientific papers [22]. Even beyond questions of technical capacity, ethical questions remain, particularly if there is some implicit expectation that a human is judging the output. Involving a human in the judgment loop, for example, to contextualize the scoring, is one approach [23]. However, the lack of large-scale ground truth datasets for benchmarking LLM performance in judgment studies hinders progress in this area.
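One simple mitigation for the option-order sensitivity mentioned above is to repeat each judgment with the options presented in different orders and aggregate the answers. A sketch of this idea is shown below; `ask_llm` is a hypothetical placeholder for the actual model call:

```python
import random
from collections import Counter
from typing import Callable

def judge_with_shuffled_options(question: str, options: list[str],
                                ask_llm: Callable[[str], str], runs: int = 5) -> str:
    """Query the LLM several times with shuffled option orders and return the modal answer."""
    answers = []
    for _ in range(runs):
        shuffled = random.sample(options, k=len(options))  # random order, original list untouched
        prompt = question + "\nOptions: " + ", ".join(shuffled)
        answers.append(ask_llm(prompt))
    return Counter(answers).most_common(1)[0][0]

# Illustration with a dummy model that always answers "readable".
print(judge_with_shuffled_options("Is this function readable?",
                                  ["readable", "not readable"], lambda p: "readable"))
```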
LLMs for Synthesis
Description
LLMs can support synthesis tasks in software engineering research by processing and distilling information from qualitative data sources. In this context, synthesis refers to the process of integrating and interpreting information from multiple sources to generate higher-level insights, identify patterns across datasets, and develop conceptual frameworks or theories. Unlike annotation (see Section LLMs as Annotators), which focuses on categorizing or labeling individual data points, synthesis involves connecting and interpreting these annotations to develop a cohesive understanding of the phenomenon being studied. Synthesis tasks can also mean generating synthetic datasets (e.g., source code, bug-fix pairs, requirements, etc.) that are then used in downstream tasks to train, fine-tune, or evaluate existing models or tools. In this case, the synthesis is done primarily using the LLM and its training data; the input is limited to basic instructions and examples.
Example(s)
Published examples of applying LLMs for synthesis in the software engineering domain are still scarce. However, recent work has explored the use of LLMs for qualitative synthesis in other domains, allowing reflection on how LLMs can be applied for this purpose in software engineering [1]. Barros et al. [24] conducted a systematic mapping study on the use of LLMs for qualitative research. They identified examples in domains such as healthcare and social sciences (see, e.g., [25], [26]) in which LLMs were used to support qualitative analysis, including grounded theory and thematic analysis. Overall, the findings highlight the successful generation of preliminary coding schemes from interview transcripts, later refined by human researchers, along with support for pattern identification. This approach was reported not only to expedite the initial coding process but also to allow researchers to focus more on higher-level analysis and interpretation. However, Barros et al. [24] emphasize that effective use of LLMs requires structured prompts and careful human oversight. Similarly, de Morais Leça et al. [27] conducted a systematic mapping study to investigate how LLMs are used in qualitative analysis and how they can be applied in software engineering research. Consistent with Barros et al. [24], they identified that LLMs are applied primarily in tasks such as coding, thematic analysis, and data categorization, reducing the time, cognitive demands, and resources required for these processes. Finally, the work of El-Hajjami and Salinesi [28] is an example of using LLMs to create synthetic datasets. They present an approach to generate synthetic requirements, showing that they “can match or surpass human-authored requirements for specific classification tasks” [28].
Advantages
LLMs offer promising support for synthesis in SE research by helping researchers process artifacts such as interview transcripts and survey responses, or by assisting with literature reviews. Qualitative research in SE traditionally faces challenges such as limited scalability, inconsistencies in coding, difficulties in generalizing findings from small or context-specific samples, and the influence of the researchers’ subjectivity on data interpretation [1]. The use of LLMs for synthesis can offer advantages in addressing these challenges [1], [24], [27]. LLMs can reduce manual effort, improve consistency and generalizability, and assist researchers in deriving codes and developing coding guides during the early stages of qualitative data analysis [1], [29]. They can enable researchers to analyze larger datasets, identifying patterns across broader contexts than traditional qualitative methods typically allow, and they can help mitigate the effects of human subjectivity. However, while LLMs streamline many aspects of qualitative synthesis, careful oversight remains essential.
Challenges
Although LLMs have the potential to automate synthesis, concerns about overreliance remain, especially due to discrepancies between AI- and human-generated insights, particularly in capturing contextual nuances [30]. Bano et al. [30] found that while LLMs can provide structured summaries and qualitative coding frameworks, they may misinterpret nuanced qualitative data due to a lack of contextual understanding. Other studies have echoed similar concerns [1], [24], [27]. In particular, LLMs cannot independently assess the validity of arguments; critical thinking remains a human responsibility in qualitative synthesis. Peters and Chin-Yee [31] have shown that popular LLMs and LLM-based tools tend to overgeneralize results when summarizing scientific articles, that is, they “produce broader generalizations of scientific results than those in the original.” Compared to human-authored summaries, “LLM summaries were nearly five times more likely to contain broad generalizations” [31]. In addition, LLMs can produce biased results, reinforcing existing prejudices or omitting essential perspectives, making human oversight crucial to ensure accurate interpretation, mitigate biases, and maintain quality control. Moreover, the proprietary nature of many LLMs limits transparency; in particular, it is unknown how the training data might affect the synthesis process. Furthermore, reproducibility issues persist due to the influence of model versions and prompt variations on the synthesis result.
LLMs as Subjects
Description
In empirical studies, data is collected from participants through methods such as surveys, interviews, or controlled experiments. LLMs can serve as subjects in empirical studies by simulating human behavior and interactions (we use the term “subject” since “participant” implies being human [32]). In this capacity, LLMs generate responses that approximate those of human participants, which makes them particularly valuable for research involving user interactions, collaborative coding environments, and software usability assessments. This approach enables data collection that closely reflects human reactions while avoiding the need for direct human involvement. To achieve this, prompt engineering techniques are widely employed, with a common approach being the use of the Personas Pattern [33], which involves tailoring LLM responses to align with predefined profiles or roles that emulate specific user archetypes. Zhao et al. [34] outline in detail the opportunities and challenges of using LLMs as research subjects. Furthermore, recent sociological studies have emphasized that, to be effectively utilized in this capacity, LLMs (including their agentic versions) should meet four criteria of algorithmic fidelity [35]. Generated responses should be: (1) indistinguishable from human-produced texts (e.g., LLM-generated code reviews should be comparable to those from real developers); (2) consistent with the attitudes and sociodemographic information of the conditioning context (e.g., LLMs simulating junior developers should exhibit different confidence levels, vocabulary, and concerns compared to senior engineers); (3) naturally aligned with the form, tone, and content of the provided context (e.g., responses in an agile stand-up meeting simulation should be concise, task-focused, and aligned with sprint objectives rather than long, formal explanations); and (4) reflective of patterns in relationships between ideas, demographics, and behavior observed in comparable human data (e.g., discussions on software architecture decisions should capture trade-offs typically debated by human developers, such as maintainability versus performance, rather than abstract theoretical arguments).
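As a minimal sketch of the Personas Pattern in this setting, the following example conditions an OpenAI-style chat model on a developer profile; the model name, persona text, and question are illustrative assumptions rather than a validated study setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

persona = (
    "You are a junior backend developer with two years of Python experience. "
    "You are cautious, ask clarifying questions, and avoid absolute statements."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": persona},  # persona conditioning
        {"role": "user", "content": "How confident are you in reviewing this Dockerfile change?"},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```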
Example(s)
LLMs can be used as subjects in various types of empirical studies, enabling researchers to simulate human participants. The broader applicability of LLM-based studies beyond software engineering has been compiled by Xu et al. [36], who examined various uses in social science research. Given the socio-technical nature of software development, some of these approaches are transferable to empirical software engineering research. For example, LLMs can be applied in survey and interview studies to impersonate developers responding to survey questionnaires or interviews, allowing researchers to test the clarity and effectiveness of survey items or to simulate responses under varying conditions, such as different levels of expertise or cultural contexts. Gerosa et al. [37], for instance, explored persona-based interviews and multi-persona focus groups, demonstrating how LLMs can emulate human responses and behaviors while addressing ethical concerns, biases, and methodological challenges. Usability studies are another example: LLMs can simulate end-user feedback, providing insights into potential usability issues and offering suggestions for improvement based on predefined user personas. This aligns with the work of Bano et al. [38], who investigated biases in LLM-generated candidate profiles in SE recruitment processes. Their study, which analyzed both textual and visual inputs, revealed biases favoring male, Caucasian candidates, lighter skin tones, and slim physiques, particularly for senior roles.
Advantages
Using LLMs as subjects can reduce the effort of recruiting human participants, a process that is often time-consuming and costly [39], by augmenting existing datasets or, in some cases, completely replacing human participants. In interviews and survey studies, LLMs can simulate diverse respondent profiles, enabling access to underrepresented populations, which can strengthen the generalizability of findings. In data mining studies, LLMs can generate synthetic data (see LLMs for Synthesis) to fill gaps where real-world data is unavailable or underrepresented.
Challenges
It is important that researchers are aware of the inherent biases [40] and limitations [41], [42] when using LLMs as study subjects. Schröder et al. [43], who studied discrepancies between LLM and human responses, even conclude that LLMs “do not simulate human psychology” and recommend that psychological researchers “treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application.” One critical concern is construct validity. LLMs have been shown to misrepresent demographic group perspectives, failing to capture the diversity of opinions and experiences within a group. The use of identity-based prompts can reduce identities to fixed and innate characteristics, amplifying perceived differences between groups. These biases introduce the risk that studies which rely on LLM-generated responses may inadvertently reinforce stereotypes or misrepresent real-world social dynamics. Alternatives to demographic prompts can be employed when the goal is to broaden response coverage [42]. Beyond construct validity, internal validity must also be considered, particularly with regard to causal conclusions based on studies relying on LLM-simulated responses. Finally, external validity remains a challenge, as findings based on LLMs may not generalize to humans.
Introduction: LLMs as Tools for Software Engineers
LLM-based assistants have become an essential tool for software engineers, supporting them in various tasks such as code generation and debugging. Researchers have studied how software engineers use LLMs (Studying LLM Usage in Software Engineering), developed new tools that integrate LLMs (LLMs for New Software Engineering Tools), and benchmarked LLMs for software engineering tasks (Benchmarking LLMs for Software Engineering Tasks).
Studying LLM Usage in Software Engineering
Description
Studying how software engineers use LLMs is crucial to understand the current state of practice in SE. Researchers can observe software engineers’ usage of LLM-based tools in the field, or study if and how they adopt such tools, their usage patterns, as well as perceived benefits and challenges. Surveys, interviews, observational studies, or analysis of usage logs can provide insights into how LLMs are integrated into development processes, how they influence decision making, and what factors affect their acceptance and effectiveness. Such studies can inform improvements for existing LLM-based tools, motivate the design of novel tools, or derive best practices for LLM-assisted software engineering. They can also uncover risks or deficiencies of existing tools.
Example(s)
Khojah et al. [44] investigated the use of ChatGPT (GPT-3.5) by professional software engineers in a week-long observational study. They found that most developers do not use the code generated by ChatGPT directly but instead use the output as a guide to implement their own solutions. Azanza et al. [45] conducted a case study evaluating the impact of introducing an LLM (GPT-3) on the onboarding process of new software developers. Their study identified potential in the use of LLMs for onboarding, as they can allow newcomers to seek information on their own, without the need to “bother” senior colleagues. Surveys can help researchers quickly obtain an overview of current perceptions of LLM usage. For example, Jahic and Sami [46] surveyed participants from 15 software companies regarding their practices around LLMs in software engineering. They found that the majority of study participants had already adopted AI for software engineering tasks; most of them used ChatGPT. Multiple participants cited copyright and privacy issues, as well as inconsistent or low-quality outputs, as barriers to adoption. Retrospective studies that analyze data generated while developers use LLMs can provide additional insights into human-LLM interactions. For example, researchers can employ data mining methods to build large-scale conversation datasets, such as the DevGPT dataset introduced by Xiao et al. [47]. Conversations can then be analyzed using quantitative [48] and qualitative [49] analysis methods.
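For retrospective analyses of such conversation datasets, simple descriptive statistics are often a useful first step. The sketch below assumes a hypothetical JSONL export in which each line holds one developer-ChatGPT conversation with a list of prompt/answer turns; the field names are illustrative and do not reflect the actual DevGPT schema:

```python
import json
from statistics import mean

CODE_FENCE = "`" * 3  # marker for code blocks in ChatGPT answers

def load_conversations(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def summarize(conversations: list[dict]) -> dict:
    turns_per_conv = [len(c["turns"]) for c in conversations]
    with_code = sum(any(CODE_FENCE in t["answer"] for t in c["turns"]) for c in conversations)
    return {
        "conversations": len(conversations),
        "mean_turns": mean(turns_per_conv),
        "share_with_code": with_code / len(conversations),
    }

# print(summarize(load_conversations("devgpt_conversations.jsonl")))  # hypothetical file
```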
Advantages
Studying the real-world usage of LLM-based tools allows researchers to understand the state of practice and guide future research directions. In field studies, researchers can uncover usage patterns, adoption rates, and the influence of contextual factors on usage behavior. Outside of a controlled study environment, researchers can uncover contextual information about LLM-assisted SE workflows beyond the specific LLMs being evaluated. This may, for example, help researchers generate hypotheses about how LLMs impact developer productivity, collaboration, and decision-making processes. In controlled laboratory studies, researchers can study specific phenomena related to LLM usage under carefully regulated, but potentially artificial conditions. Specifically, they can isolate individual tasks in the software engineering workflow and investigate how LLM-based tools may support task completion. Furthermore, controlled experiments allow for direct comparisons between different LLM-based tools. The results can then be used to validate general hypotheses about human-LLM interactions in SE.
Challenges
When conducting field studies in real-world environments, researchers have to ensure that their study results are “dependable” [50] beyond the traditional validity criteria such as internal or construct validity. The usage environment in a real-world context can be extremely diverse: the integration of LLMs can range from specific LLMs mandated by company policy to the unregulated use of any available LLM. Both extremes may influence the adoption of LLMs by software engineers and hence need to be addressed in the study methodology. In addition, in longitudinal case studies, the timing of the study may have a significant impact on its results, as LLMs and LLM-based tools are rapidly evolving. Moreover, developers are still learning how to best make use of the new technology, and best practices are still being developed and established. Furthermore, the predominance of proprietary commercial LLM-based tools in the market poses a significant barrier to research. Limited access to telemetry data or other usage metrics restricts the ability of researchers to conduct comprehensive analyses of real-world tool usage. To make study results reliable, researchers must establish that their results are rigorous, for example, by using triangulation to understand and minimize potential biases [50]. When it comes to studying LLM usage in controlled laboratory studies, researchers may struggle with the inherent variability of LLM outputs. Since reproducibility and precision are essential for hypothesis testing, the stochastic nature of LLM responses, where identical prompts may yield different outputs across participants, can complicate the interpretation of experimental results and affect the reliability of study findings.
LLMs for New Software Engineering Tools
Description
LLMs are being integrated into new tools that support software engineers in their daily tasks, e.g., to assist in code comprehension [51] and test case generation [52]. One way of integrating LLM-based tools into software engineers’ workflows is through GenAI agents. Unlike traditional LLM-based tools, these agents are capable of acting autonomously and proactively, are often tailored to meet specific user needs, and can interact with external environments [53], [54]. From an architectural perspective, GenAI agents can be implemented in various ways [54]. However, they generally share three key components: (1) a reasoning mechanism that guides the LLM (often enabled by advanced prompt engineering), (2) a set of tools to interact with external systems (e.g., APIs or databases), and (3) a user communication interface that extends beyond traditional chat-based interactions [55], [56], [57]. Researchers can also test and compare different tool architectures to increase artifact quality and developer satisfaction.
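To make these three components more concrete, the following deliberately simplified sketch outlines a minimal agent loop: the reasoning step is delegated to a hypothetical `llm` function that returns either a tool invocation or a final answer, the tools are plain Python callables, and the user interface is reduced to a returned string:

```python
from typing import Callable

# (2) Tools the agent can use to interact with external systems (illustrative stubs).
TOOLS: dict[str, Callable[[str], str]] = {
    "search_issues": lambda query: f"3 open issues match '{query}'",
    "run_tests": lambda target: f"All tests in {target} passed",
}

def run_agent(task: str, llm: Callable[[str], dict], max_steps: int = 5) -> str:
    """Minimal agent loop: the LLM either requests a tool call or returns a final answer."""
    context = f"Task: {task}"
    for _ in range(max_steps):
        # (1) Reasoning step: the LLM decides what to do next based on the context so far.
        decision = llm(context)  # e.g. {"tool": "run_tests", "input": "module_x"} or {"answer": "..."}
        if "answer" in decision:
            return decision["answer"]  # (3) User communication, reduced to a return value.
        observation = TOOLS[decision["tool"]](decision["input"])
        context += f"\nObservation: {observation}"
    return "Step limit reached without a final answer."

# Dummy LLM that calls one tool and then answers, for illustration only.
def dummy_llm(context: str) -> dict:
    if "Observation" not in context:
        return {"tool": "run_tests", "input": "module_x"}
    return {"answer": "The change looks safe: " + context.splitlines()[-1]}

print(run_agent("Check whether the refactoring broke anything.", dummy_llm))
```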
Example(s)
Yan et al. [51] proposed IVIE, a tool integrated into the VS Code graphical interface that generates and explains code using LLMs. The authors focused primarily on the presentation, providing a user-friendly interface for interacting with the LLM. Schäfer et al. [52] presented a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation. They presented TestPilot, a tool that implements an approach in which the LLM is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from the documentation. Richards and Wessel [55] introduced a preliminary GenAI agent designed to assist developers in understanding source code by incorporating a reasoning component grounded in the theory of mind.
Advantages
From an engineering perspective, developing LLM-based tools is easier than implementing many traditional SE approaches such as static analysis or symbolic execution. Depending on the capabilities of the underlying model, it is also easier to build tools that are independent of a specific programming language. This enables researchers to build tools for a more diverse set of tasks. In addition, it allows them to test their tools in a wider range of contexts.
Challenges
Traditional approaches, such as static analysis, are deterministic; LLMs are not. Although the non-determinism of LLMs can be mitigated using configuration parameters and prompting strategies, it poses a major challenge: evaluating the effectiveness of a tool is difficult when minor changes in the input can lead to major differences in performance. Since the exact training data is often not published by model vendors, a reliable assessment of tool performance on unseen data is difficult. From an engineering perspective, while open models are available, the most capable ones require substantial hardware resources. Using cloud-based APIs or relying on third-party providers for hosting, while seemingly a potential solution, introduces new concerns related to data privacy and security.
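In practice, this variability can at least be reduced and documented by pinning the model snapshot and sampling parameters and logging them with every output. A sketch against an OpenAI-style API is shown below; the chosen model snapshot is an illustrative assumption, and the `seed` parameter is only best-effort:

```python
from openai import OpenAI

client = OpenAI()

GENERATION_CONFIG = {
    "model": "gpt-4o-2024-08-06",  # pin an exact model snapshot, not a moving alias
    "temperature": 0.0,            # reduce sampling randomness
    "seed": 42,                    # best-effort determinism; not guaranteed by providers
}

def generate(prompt: str) -> str:
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **GENERATION_CONFIG,
    )
    # Log configuration and system fingerprint alongside each output for reproducibility.
    print(GENERATION_CONFIG, getattr(response, "system_fingerprint", None))
    return response.choices[0].message.content
```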
Benchmarking LLMs for Software Engineering Tasks
Description
Benchmarking is the process of evaluating an LLM’s performance using standardized tasks and metrics, which requires high-quality reference datasets. LLM output is compared to a ground truth from the benchmark dataset using general metrics for text generation, such as ROUGE, BLEU, or METEOR [58], or task-specific metrics, such as CodeBLEU for code generation. For example, HumanEval [59] is often used to assess code generation and has become a de facto standard.
Example(s)
In SE, benchmarking may include the evaluation of an LLM’s ability to produce accurate and reliable outputs for a given input, usually a task description, which may be accompanied by data obtained from curated real-world projects or from synthetic SE-specific datasets. Typical tasks include code generation, code summarization, code completion, and code repair, but also natural language processing tasks, such as anaphora resolution (i.e., the task of identifying the referring expression of a word or phrase occurring earlier in the text). RepairBench [60], for example, contains 574 buggy Java methods and their corresponding fixed versions, which can be used to evaluate the performance of LLMs in code repair tasks. This benchmark uses the Plausible@1 metric (i.e., the probability that the first generated patch passes all test cases) and the AST Match@1 metric (i.e., the probability that the abstract syntax tree of the first generated patch matches the ground truth patch). SWE-Bench [61] is a more generic benchmark that contains 2,294 SE Python tasks extracted from GitHub pull requests. To score the LLM’s performance on the tasks, the benchmark validates whether the generated patch is applicable (i.e., successfully compiles) and calculates the percentage of passed test cases.
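Given per-task results, both metrics reduce to simple proportions over the benchmark. The sketch below illustrates this with invented result records; the task names and outcomes are hypothetical:

```python
# Hypothetical per-task results: whether the first generated patch passed all
# tests (plausible) and whether its AST matched the ground-truth patch.
results = [
    {"task": "Bug-1", "first_patch_passes": True,  "ast_matches": False},
    {"task": "Bug-2", "first_patch_passes": False, "ast_matches": False},
    {"task": "Bug-3", "first_patch_passes": True,  "ast_matches": True},
]

plausible_at_1 = sum(r["first_patch_passes"] for r in results) / len(results)
ast_match_at_1 = sum(r["ast_matches"] for r in results) / len(results)

print(f"Plausible@1: {plausible_at_1:.2f}, AST Match@1: {ast_match_at_1:.2f}")
```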
Advantages
Properly built benchmarks provide objective, reproducible evaluation across different tasks, enabling a comparison between different models (and versions). In addition, benchmarks built for specific SE tasks can help identify LLM weaknesses and support their optimization and fine-tuning for such tasks. They can foster open science practices by providing a common ground for sharing data (e.g., as part of the benchmark itself) and results (e.g., of models run against a benchmark). Benchmarks built using real-world data can help legitimize research results for practitioners, supporting industry-academia collaboration. However, benchmarks are subject to several challenges.
Challenges
Benchmark contamination, that is, the inclusion of the benchmark in the LLM training data [62], has recently been identified as a problem. The careful selection of samples and the creation of the corresponding input prompts is particularly important, as correlations between prompts may bias the benchmark results [63]. Although LLMs might perform well on a specific benchmark such as HumanEval, they do not necessarily perform well on other benchmarks. Moreover, benchmark metrics such as perplexity or BLEU-N do not necessarily reflect human judgment. Recently, Cao et al. [64] have proposed guidelines for the creation of LLM benchmarks related to coding tasks, grounded in a systematic survey of existing benchmarks. In this process, they highlight current shortcomings such as limited reliability and transparency, irreproducibility, low data quality, and inadequate validation measures. For more details on benchmarks, see Section Use Suitable Baselines, Benchmarks, and Metrics.
References
[1] M. Bano, R. Hoda, D. Zowghi, and C. Treude, “Large language models for qualitative research in software engineering: Exploring opportunities and challenges,” Autom. Softw. Eng., vol. 31, no. 1, p. 8, 2024, doi: 10.1007/S10515-023-00407-8.
[2] J. Huang et al., “Enhancing review classification via LLM-based data annotation and multi-perspective feature representation learning,” SSRN Electronic Journal, pp. 1–15, 2024, doi: 10.2139/ssrn.5002351.
[3] T. Ahmed, P. T. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” in 22nd IEEE/ACM international conference on mining software repositories, MSR@ICSE 2025, Ottawa, ON, Canada, April 28-29, 2025, IEEE, 2025, pp. 526–538. doi: 10.1109/MSR66628.2025.00086.
[4] S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng, “Want to reduce labeling cost? GPT-3 can help,” in Findings of the ACL: EMNLP 2021, M.-F. Moens, X. Huang, L. Specia, and S. W. Yih, Eds., Association for Computational Linguistics, 2021, pp. 4195–4205. doi: 10.18653/V1/2021.FINDINGS-EMNLP.354.
[5] Z. He, C.-Y. Huang, C.-K. C. Ding, S. Rohatgi, and T.-H. K. Huang, “If in a crowdsourced data annotation pipeline, a GPT-4,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, F. ’Floyd’ Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 1040:1–1040:25. doi: 10.1145/3613904.3642834.
[6] F. Gilardi, M. Alizadeh, and M. Kubli, “ChatGPT outperforms crowd-workers for text-annotation tasks,” CoRR, vol. abs/2303.15056, 2023, doi: 10.48550/ARXIV.2303.15056.
[7] Y. Zhu, P. Zhang, E. ul Haq, P. Hui, and G. Tyson, “Can ChatGPT reproduce human-generated labels? A study of social computing tasks,” CoRR, vol. abs/2304.10145, 2023, doi: 10.48550/ARXIV.2304.10145.
[8] F. Huang, H. Kwak, and J. An, “Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech,” in Companion proceedings of the ACM web conference 2023, WWW 2023, Y. Ding, J. Tang, J. F. Sequeda, L. Aroyo, C. Castillo, and G.-J. Houben, Eds., ACM, 2023, pp. 294–297. doi: 10.1145/3543873.3587368.
[9] M. Wan et al., “TnT-LLM: Text mining at scale with large language models,” in Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, KDD 2024, R. Baeza-Yates and F. Bonchi, Eds., ACM, 2024, pp. 5836–5847. doi: 10.1145/3637528.3671647.
[10] N. Pangakis, S. Wolken, and N. Fasching, “Automated annotation with generative AI requires validation,” CoRR, vol. abs/2306.00176, 2023, doi: 10.48550/ARXIV.2306.00176.
[11] X. Wang, H. Kim, S. Rahman, K. Mitra, and Z. Miao, “Human-LLM collaborative annotation through effective verification of LLM labels,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, F. ’Floyd’ Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 303:1–303:21. doi: 10.1145/3613904.3641960.
[12] M. V. Reiss, “Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark,” CoRR, vol. abs/2304.11085, 2023, doi: 10.48550/ARXIV.2304.11085.
[13] S. Lubos et al., “Leveraging LLMs for the quality assurance of software requirements,” in 32nd IEEE international requirements engineering conference, RE 2024, Reykjavik, Iceland, June 24-28, 2024, G. Liebel, I. Hadar, and P. Spoletini, Eds., IEEE, 2024, pp. 389–397. doi: 10.1109/RE59067.2024.00046.
[14] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, Apr. 1960, doi: 10.1177/001316446002000104.
[15] A. Elangovan et al., “Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge,” CoRR, vol. abs/2410.03775, 2024, doi: 10.48550/ARXIV.2410.03775.
[16] F. Wang et al., “Multi-modal requirements data-based acceptance criteria generation using LLMs,” 2025, Available: https://arxiv.org/abs/2508.06888
[17] K. Schroeder and Z. Wood-Doughty, “Can you trust LLM judgments? Reliability of LLM-as-a-judge,” CoRR, vol. abs/2412.12509, 2024, doi: 10.48550/ARXIV.2412.12509.
[18] P. Pezeshkpour and E. Hruschka, “Large language models sensitivity to the order of options in multiple-choice questions,” in Findings of the association for computational linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, K. Duh, H. Gómez-Adorno, and S. Bethard, Eds., Association for Computational Linguistics, 2024, pp. 2006–2017. doi: 10.18653/V1/2024.FINDINGS-NAACL.130.
[19] A. Panickssery, S. Bowman, and S. Feng, “LLM evaluators recognize and favor their own generations,” in Advances in neural information processing systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., Curran Associates, Inc., 2024, pp. 68772–68802. Available: https://proceedings.neurips.cc/paper_files/paper/2024/file/7f1f0218e45f5414c79c0679633e47bc-Paper-Conference.pdf
[20] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” Computational Linguistics, vol. 50, pp. 1097–1179, 2024, doi: 10.1162/coli_a_00524.
[21] A. Bavaresco et al., “LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks,” CoRR, vol. abs/2406.18403, 2024, doi: 10.48550/ARXIV.2406.18403.
[22] R. Zhou, L. Chen, and K. Yu, “Is LLM a reliable reviewer? A comprehensive evaluation of LLM on automatic paper reviewing tasks,” in Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation, LREC/COLING 2024, 20-25 May 2024, Torino, Italy, N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds., ELRA; ICCL, 2024, pp. 9340–9351. Available: https://aclanthology.org/2024.lrec-main.816
[23] Q. Pan et al., “Human-Centered Design Recommendations for LLM-as-a-Judge,” in ACL 2024 Workshop HuCLLM, arXiv, Jul. 2024. doi: 10.48550/arXiv.2407.03479.
[24] C. F. Barros et al., “Large language model for qualitative research – a systematic mapping study,” 2024, Available: https://arxiv.org/abs/2411.14473
[25] S. De Paoli, “Performing an inductive thematic analysis of semi-structured interviews with a large language model: An exploration and provocation on the limits of the approach,” Social Science Computer Review, vol. 42, no. 4, pp. 997–1019, 2024.
[26] W. S. Mathis, S. Zhao, N. Pratt, J. Weleff, and S. De Paoli, “Inductive thematic analysis of healthcare qualitative interviews using open-source large language models: How does it compare to traditional methods?” Computer Methods and Programs in Biomedicine, vol. 255, p. 108356, 2024.
[27] M. de Morais Leça, L. Valença, R. Santos, and R. de Souza Santos, “Applications and implications of large language models in qualitative analysis: A new frontier for empirical software engineering,” 2024, Available: https://arxiv.org/abs/2412.06564
[28] A. El-Hajjami and C. Salinesi, “How good are synthetic requirements? Evaluating LLM-generated datasets for AI4RE,” CoRR, vol. abs/2506.21138, 2025, doi: 10.48550/ARXIV.2506.21138.
[29] C. Byun, P. Vasicek, and K. D. Seppi, “Dispensing with humans in human-computer interaction research,” in Extended abstracts of the 2023 CHI conference on human factors in computing systems, CHI EA 2023, A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, and A. Peters, Eds., ACM, 2023, pp. 413:1–413:26. doi: 10.1145/3544549.3582749.
[30] M. Bano, D. Zowghi, and J. Whittle, “Exploring qualitative research using LLMs,” 2023, Available: https://arxiv.org/abs/2306.13298
[31] U. Peters and B. Chin-Yee, “Generalization bias in large language model summarization of scientific research,” R. Soc. Open Sci., vol. 12, no. 12241776, 2025, doi: 10.1098/rsos.241776.
[32] American Psychological Association, “APA dictionary of psychology: Subject.” https://dictionary.apa.org/subject, 2018.
[33] A. Kong et al., “Better zero-shot reasoning with role-play prompting,” CoRR, vol. abs/2308.07702, 2023, doi: 10.48550/ARXIV.2308.07702.
[34] C. Zhao, M. Habule, and W. Zhang, “Large language models (LLMs) as research subjects: Status, opportunities and challenges,” New Ideas in Psychology, vol. 79, p. 101167, 2025, doi: 10.1016/j.newideapsych.2025.101167.
[35] L. P. Argyle, E. C. Busby, N. Fulda, J. Gubler, C. M. Rytting, and D. Wingate, “Out of one, many: Using language models to simulate human samples,” CoRR, vol. abs/2209.06899, 2022, doi: 10.48550/ARXIV.2209.06899.
[36] R. Xu et al., “AI for social science and social science of AI: A survey,” Inf. Process. Manag., vol. 61, no. 2, p. 103665, 2024, doi: 10.1016/J.IPM.2024.103665.
[37] M. A. Gerosa, B. Trinkenreich, I. Steinmacher, and A. Sarma, “Can AI serve as a substitute for human subjects in software engineering research?” Autom. Softw. Eng., vol. 31, no. 1, p. 13, 2024, doi: 10.1007/S10515-023-00409-6.
[38] M. Bano, H. Gunatilake, and R. Hoda, “What does a software engineer look like? Exploring societal stereotypes in LLMs,” 2025, Available: https://arxiv.org/abs/2501.03569
[39] K. Madampe, J. Grundy, R. Hoda, and H. O. Obie, “The struggle is real! The agony of recruiting participants for empirical software engineering studies,” in 2024 IEEE symposium on visual languages and human-centric computing (VL/HCC), Liverpool, UK, September 2-6, 2024, IEEE, 2024, pp. 417–422. doi: 10.1109/VL/HCC60511.2024.00065.
[40] R. Crowell, “Why AI’s diversity crisis matters, and how to tackle it,” Nature Career Feature, 2023, doi: 10.1038/d41586-023-01689-4.
[41] J. Harding, W. D’Alessandro, N. G. Laskowski, and R. Long, “AI language models cannot replace human research participants,” AI Soc., vol. 39, no. 5, pp. 2603–2605, 2024, doi: 10.1007/S00146-023-01725-X.
[42] A. Wang, J. Morgenstern, and J. P. Dickerson, “Large language models cannot replace human participants because they cannot portray identity groups,” CoRR, vol. abs/2402.01908, 2024, doi: 10.48550/ARXIV.2402.01908.
[43] S. Schröder, T. Morgenroth, U. Kuhl, V. Vaquet, and B. Paaßen, “Large language models do not simulate human psychology,” 2025, Available: https://arxiv.org/abs/2508.06950
[44] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond code generation: An observational study of ChatGPT usage in software engineering practice,” Proc. ACM Softw. Eng., vol. 1, no. FSE, pp. 1819–1840, 2024, doi: 10.1145/3660788.
[45] M. Azanza, J. Pereira, A. Irastorza, and A. Galdos, “Can LLMs facilitate onboarding software developers? An ongoing industrial case study,” in 36th international conference on software engineering education and training, CSEE&T 2024, IEEE, 2024, pp. 1–6. doi: 10.1109/CSEET62301.2024.10662989.
[46] J. Jahic and A. Sami, “State of practice: LLMs in software engineering and software architecture,” in 21st IEEE international conference on software architecture, ICSA 2024 - companion, Hyderabad, India, June 4-8, 2024, IEEE, 2024, pp. 311–318. doi: 10.1109/ICSA-C63560.2024.00059.
[47] T. Xiao, C. Treude, H. Hata, and K. Matsumoto, “DevGPT: Studying developer-ChatGPT conversations,” in 21st IEEE/ACM international conference on mining software repositories, MSR 2024, Lisbon, Portugal, April 15-16, 2024, D. Spinellis, A. Bacchelli, and E. Constantinou, Eds., ACM, 2024, pp. 227–230. doi: 10.1145/3643991.3648400.
[48] Md. F. Rabbi, A. I. Champa, M. F. Zibran, and Md. R. Islam, “AI writes, we analyze: The ChatGPT python code saga,” in 21st IEEE/ACM international conference on mining software repositories, MSR 2024, Lisbon, Portugal, April 15-16, 2024, D. Spinellis, A. Bacchelli, and E. Constantinou, Eds., ACM, 2024, pp. 177–181. doi: 10.1145/3643991.3645076.
[49] S. Mohamed, A. Parvin, and E. Parra, “Chatting with AI: Deciphering developer conversations with ChatGPT,” in 21st IEEE/ACM international conference on mining software repositories, MSR 2024, Lisbon, Portugal, April 15-16, 2024, D. Spinellis, A. Bacchelli, and E. Constantinou, Eds., ACM, 2024, pp. 187–191. doi: 10.1145/3643991.3645078.
[50] G. M. Sullivan and J. Sargeant, “Qualities of qualitative research: Part I,” J Grad Med Educ, vol. 3, no. 4, pp. 449–452, Dec. 2011.
[51] L. Yan, A. Hwang, Z. Wu, and A. Head, “Ivie: Lightweight anchored explanations of just-generated code,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, F. ’Floyd’ Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 140:1–140:15. doi: 10.1145/3613904.3642239.
[52] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Trans. Software Eng., vol. 50, no. 1, pp. 85–105, 2024, doi: 10.1109/TSE.2023.3334955.
[53] W. Takerngsaksiri et al., “Human-in-the-loop software development agents,” arXiv preprint arXiv:2411.12924, 2024.
[54] J. Wiesinger, P. Marlow, and V. Vuskovic, “Agents,” Feb. 2025, Available: https://gemini.google.com
[55] J. Richards and M. Wessel, “What you need is what you get: Theory of mind for an LLM-based code understanding assistant,” in IEEE international conference on software maintenance and evolution, ICSME 2024, IEEE, 2024, pp. 666–671. doi: 10.1109/ICSME58944.2024.00070.
[56] T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths, “Cognitive architectures for language agents,” Trans. Mach. Learn. Res., vol. 2024, 2024, Available: https://openreview.net/forum?id=1i6ZCvflQJ
[57] W. Zhou et al., “Agents: An open-source framework for autonomous language agents,” CoRR, vol. abs/2309.07870, 2023, doi: 10.48550/ARXIV.2309.07870.
[58] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.
[59] M. Chen et al., “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021, Available: https://arxiv.org/abs/2107.03374
[60] A. Silva and M. Monperrus, “RepairBench: Leaderboard of frontier models for program repair,” arXiv preprint arXiv:2409.18952, 2024.
[61] C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world GitHub issues?” in The twelfth international conference on learning representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024. Available: https://openreview.net/forum?id=VTF8yNQM66
[62] S. Ahuja, V. Gumma, and S. Sitaram, “Contamination report for multilingual benchmarks,” CoRR, vol. abs/2410.16186, 2024, doi: 10.48550/ARXIV.2410.16186.
[63] C. Siska, K. Marazopoulou, M. Ailem, and J. Bono, “Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks,” in Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds., Association for Computational Linguistics, 2024, pp. 10406–10421. doi: 10.18653/V1/2024.ACL-LONG.560.
[64] J. Cao et al., “How should I build a benchmark?” arXiv preprint arXiv:2501.10711, 2025.