Study Types

This list of study types is currently a DRAFT and based on discussion sessions with researchers at the 2024 International Software Engineering Research Network (ISERN) meeting and at the 2nd Copenhagen Symposium on Human-Centered Software Engineering AI.


  1. LLMs as Tools for Software Engineering Researchers
    1. LLMs as Annotators
    2. LLMs as Judges
    3. LLMs for Synthesis
    4. LLMs as Subjects
    5. References
  2. LLMs as Tools for Software Engineers
    1. Studying LLM Usage in Software Engineering
    2. LLMs for New Software Engineering Tools
    3. Benchmarking LLMs for Software Engineering Tasks
    4. References

Overview

The development of empirical guidelines for software engineering (SE) studies involving large language models (LLMs) is crucial for ensuring the validity and reproducibility of results. However, these guidelines must be tailored to different study types, which pose unique challenges. Therefore, understanding the classification of studies involving LLMs is essential for developing appropriate guidelines. We envision that a mature set of guidelines provides specific guidance for each of these study types, addressing their individual methodological idiosyncrasies. Note that we currently focus on large language models, that is, on natural language processing. In the future, we might extend and generalize our focus to include multimodal foundation models.

Our guidelines use the following study types to contextualize the recommendations we provide. We present the study types independent of specific guidelines, that is, the guidelines refer to the study types but not the other way around.

Introduction: LLMs as Tools for Software Engineering Researchers

LLMs can be leveraged as powerful tools to assist researchers conducting empirical studies. They can automate various tasks such as data collection, preprocessing, and analysis. For example, LLMs can apply pre-defined coding guides to large qualitative datasets (see LLMs as Annotators), assess the quality of software artifacts (see LLMs as Judges), generate summaries of research papers (see LLMs for Synthesis), and even simulate human behavior in empirical studies (see LLMs as Subjects). This can significantly reduce the time and effort required by researchers, allowing them to focus on other aspects of their studies. However, besides their many advantages, these applications also come with challenges, such as potential threats to validity and implications for the reproducibility of study results.

LLMs as Annotators

Description

Just like human annotators, LLMs can label artifacts based on a pre-defined coding guide. However, they can label data much faster than any human could. In qualitative data analysis, manually annotating (“coding”) natural language text, e.g., in software artifacts, open-ended survey responses, or interview transcripts, is a time-consuming process [1]. LLMs can be used to augment or even replace human annotation, provide suggestions for new codes (see Section LLMs for Synthesis), or even automate the entire process.

Example(s)

Recent work in software engineering has begun exploring the use of LLMs for annotation tasks. Huang et al. [2] proposed an approach leveraging multiple LLMs for joint annotation of mobile application reviews. They used three models of comparable size with an absolute majority voting rule (i.e., a label is only accepted if it receives more than half of the total votes from the models). Accordingly, the annotations fell into three categories: exact matches (where all models agreed), partial matches (where a majority agreed), and non-matches (where no majority was reached).
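To make this voting rule concrete, here is a minimal Python sketch that aggregates per-item labels from several models and assigns one of the three match categories described above; the model names and labels are invented for illustration.

    from collections import Counter

    def aggregate_annotations(labels_per_model):
        """Aggregate one item's labels from several LLMs using absolute majority voting."""
        votes = Counter(labels_per_model.values())
        label, count = votes.most_common(1)[0]
        if count == len(labels_per_model):
            return label, "exact"    # all models agreed
        if count > len(labels_per_model) / 2:
            return label, "partial"  # an absolute majority agreed
        return None, "none"          # no absolute majority reached

    # Invented labels from three comparable models for one app review
    print(aggregate_annotations({"model_a": "bug report",
                                 "model_b": "bug report",
                                 "model_c": "feature request"}))
    # -> ('bug report', 'partial')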

The study by Ahmed et al. [3] examined LLMs-as-annotators in software engineering research across five datasets and ten annotation tasks. Using six state-of-the-art LLMs to perform tasks previously done by humans, they found that model-model agreement strongly correlates with human-model agreement, suggesting situations in which LLMs could effectively replace human annotators. Their research showed that for tasks where humans themselves disagree significantly (such as evaluating code conciseness), models also perform poorly. Conversely, if multiple LLMs reach similar solutions independently, then LLMs are likely suitable for the annotation task. They proposed using model confidence scores to identify specific samples that could be safely delegated to LLMs, potentially reducing human annotation effort without compromising inter-rater agreement.

Advantages

Recent research demonstrates several key advantages of using LLMs as annotators, in particular their cost-effectiveness as well as efficiency and accuracy benefits. LLM-based annotation dramatically reduces costs compared to human labeling, with studies showing cost reductions of 50-96% across various natural language tasks [4]. For example, He et al. found in their study that GPT-4 annotation cost only $122.08 compared to $4,508 for a comparable MTurk pipeline, and the LLM-based approach completed in just two days versus several weeks for the crowdsourced approach [5]. Moreover, LLMs consistently demonstrate strong performance, with ChatGPT’s accuracy exceeding crowd workers by approximately 25% on average [6] and achieving impressive results in specific tasks such as sentiment analysis (64.9% accuracy) and counterspeech detection (0.791 precision) [7]. They also show remarkably high intercoder agreement, surpassing both crowd workers and trained annotators [6].

Challenges

Challenges of using LLMs as annotators include reliability issues, human-LLM interaction challenges, biases, errors, and resource considerations. Studies suggest that while LLMs show promise as annotation tools in SE research, their optimal use may be in augmenting rather than replacing human annotators [4], [5]. LLMs can negatively impact human judgment when their labels are incorrect [8], and their overconfidence requires careful verification [9]. Moreover, LLMs show significant variability in annotation quality across different tasks [8], [10], [11], with particular challenges in complex tasks and post-training events. They are especially unreliable for high-stakes labeling tasks [10], demonstrating notable performance disparities across different label categories [7]. Recent empirical evidence indicates that LLM consistency in text annotation often falls below scientific reliability thresholds, with outputs being sensitive to minor prompt variations [12]. Context-dependent annotations pose a specific challenge, as LLMs show difficulty in correctly interpreting text segments that require broader contextual understanding [5]. While pooling multiple outputs can improve reliability, this approach necessitates additional computational resources and still requires validation against human-annotated data. While generally cost-effective, LLM annotation requires careful management of per-token charges, particularly for longer texts [4]. Furthermore, achieving reliable annotations may require multiple runs of the same input to enable majority voting [12], although exact cost comparisons between LLM-based and human annotation are controversial [5]. Finally, research has identified consistent biases in label assignment, including tendencies to overestimate certain labels and misclassify neutral content, particularly in stance detection tasks [7].

LLMs as Judges

Description

LLMs can act as judges or raters to evaluate properties of software artifacts. For instance, LLMs can be used to assess code readability, adherence to coding standards, or the quality of code comments. Judgment is distinct from the more qualitative task of assigning a code or label to (typically unstructured) text (see Section LLMs as Annotators). It is also distinct from using LLMs for general software engineering tasks, as we discuss in Section LLMs as Tools for Software Engineers.

Example(s)

Lubos et al. [13] leveraged Llama-2 to evaluate the quality of software requirements statements. They prompted the LLM with the text below, where the terms in curly braces are parameters of the study:

Your task is to evaluate the quality of a software requirement.
Evaluate whether the following requirement is {quality_characteristic}.
{quality_characteristic} means: {quality_characteristic_explanation}
The evaluation result must be: ‘yes’ or ‘no’.
Request: Based on the following description of the project: {project_description}
Evaluate the quality of the following requirement: {requirement}.
Explain your decision and suggest an improved version.

The authors then evaluated the LLM’s output against human raters to assess how well the LLM matched the experts. Agreement can be measured in many ways; this study used Cohen’s kappa measure [14] and found moderate agreement for simple requirements, but poor agreement for more complex requirements. However, evaluating human-machine agreement may not be the same as evaluating human-human agreement, and this remains an open area of research [15].
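As a minimal illustration of such an agreement analysis, Cohen’s kappa between an LLM judge and a human rater can be computed with scikit-learn; the yes/no judgments below are invented.

    from sklearn.metrics import cohen_kappa_score

    # Invented yes/no quality judgments for ten requirements
    human_rater = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
    llm_judge   = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no", "yes", "no"]

    kappa = cohen_kappa_score(human_rater, llm_judge)
    print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level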

One crucial decision in such studies is the number of examples provided to the LLM. This might involve providing no examples (zero-shot), a few examples (few-shot), or many examples (closer to traditional training data). In the example above, Lubos et al. chose a zero-shot approach, providing no specific guidance besides the project’s context, i.e., they did not show the LLM what a yes or no answer might look like.
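The following Python sketch illustrates the difference, loosely adapting the prompt above; the few-shot examples and their labels are invented, not taken from the study.

    ZERO_SHOT_TASK = (
        "Your task is to evaluate the quality of a software requirement.\n"
        "Evaluate whether the following requirement is unambiguous.\n"
        "The evaluation result must be: 'yes' or 'no'.\n"
    )

    # Invented labeled examples used only in the few-shot variant
    FEW_SHOT_EXAMPLES = (
        "Requirement: The system shall respond to search queries within 2 seconds. -> yes\n"
        "Requirement: The user interface should be fast and nice. -> no\n"
    )

    def build_prompt(requirement, few_shot=False):
        """Zero-shot: task description only; few-shot: prepend labeled examples."""
        examples = FEW_SHOT_EXAMPLES if few_shot else ""
        return ZERO_SHOT_TASK + examples + f"Requirement: {requirement} ->"

    print(build_prompt("The system shall log all failed login attempts.", few_shot=True))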

Advantages

By providing consistent (depending on the model configuration) and relatively “objective” evaluations, LLMs can help mitigate certain biases and part of the variability that human judges might introduce. This may lead to more reliable and reproducible results in empirical studies, to the extent these models can be reproduced or checkpointed.

LLMs can be much more efficient, and scale far more easily, than the equivalent human approach. With LLM automation, entire datasets could be assessed, as opposed to subsets. The main constraint, which varies by model and budget, is the input context size, i.e., the number of tokens one can pass into a model. For example, the upper bound on context that can be passed into OpenAI’s o1-mini model is 32k tokens.

Challenges

When relying on the judgment of LLMs, researchers must build a reliable process for generating judgment labels that considers the non-deterministic nature of LLMs and report the intricacies of that process transparently [16]. For example, the order of options has been shown to affect LLM outputs in multiple-choice settings [17]. In addition to reliability, other quality attributes include the accuracy of the labels and the speed and scalability of the LLM tool. A reliable LLM might be reliably inaccurate. Evaluating and judging large numbers of items—for example, to perform fault localization on the thousands of bugs big open-source projects deal with—comes with costs in clock time, compute time, and environmental impact.
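One way to probe the reported order sensitivity [17] is to repeat each judgment with shuffled answer options and check whether the verdict is stable. In the Python sketch below, query_llm is a placeholder for whichever model API a study uses.

    import random
    from collections import Counter

    def query_llm(prompt):
        """Placeholder for a call to the judging model; returns one of the options."""
        raise NotImplementedError

    def stable_judgment(question, options, runs=5, seed=0):
        """Ask the same multiple-choice question several times with shuffled options
        and report the majority verdict plus its agreement rate across permutations."""
        rng = random.Random(seed)
        verdicts = []
        for _ in range(runs):
            shuffled = options[:]
            rng.shuffle(shuffled)
            verdicts.append(query_llm(question + "\nOptions: " + ", ".join(shuffled)))
        label, count = Counter(verdicts).most_common(1)[0]
        return label, count / runs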

Evidence shows LLMs can behave differently, for instance, when reviewing their own outputs [18]. In more human-oriented datasets (such as discussions of pull requests) LLMs may suffer from well documented biases and issues with fairness [19]. For tasks where human judges themselves disagree significantly, it is not clear if an LLM judge should reflect the majority opinion or act as an independent judge. The underlying statistical framework of an LLM usually pushes outputs towards the most likely (majority) answer.

Research is ongoing as to how suitable LLMs are as standalone judges. Questions around bias, accuracy, and trust remain [20]. There is reason for concern about LLMs judging student assignments or doing peer review of scientific papers [21]. Even beyond questions of technical capacity, ethical questions remain, particularly if there is an implicit expectation that a human is judging the output, such as in a quiz. Involving a human in the judgment loop—for example, to contextualize the scoring—is one approach [22]. A lack of large-scale ground-truth datasets for benchmarking LLM performance on judgment tasks is hindering research progress in this area.

LLMs for Synthesis

Description

LLMs can support synthesis tasks in software engineering research by processing and distilling information from qualitative data sources. In this context, synthesis refers to the process of integrating and interpreting information from multiple sources to generate higher-level insights, identify patterns across datasets, and develop conceptual frameworks or theories. Unlike annotation (see LLMs as Annotators), which focuses on categorizing or labeling individual data points, synthesis involves connecting and interpreting these annotations to develop a cohesive understanding of the phenomenon being studied.

Example(s)

While published examples of applying LLMs for synthesis in the software engineering domain are still scarce, recent work has explored the use of LLMs for qualitative synthesis in other domains, enabling a reflection on how LLMs can be applied for this purpose in software engineering [1]. Barros et al. [23] conducted a systematic mapping study on the use of LLMs for qualitative research. They identified examples in domains such as healthcare and social sciences (see, e.g., [24], [25]) in which LLMs were used to support different methodologies for qualitative analysis, including grounded theory and thematic analysis. Overall, the findings highlight the successful generation of preliminary coding schemes from interview transcripts, later refined by human researchers, along with support for pattern identification. This approach was reported not only to expedite the initial coding process but also to allow researchers to focus more on higher-level analysis and interpretation. However, the authors emphasize that effective use of LLMs requires structured prompts and careful human oversight. They suggest using LLMs to support tasks such as initial coding and theme identification while conservatively reserving interpretative or creative processes for human analysts. Similarly, Leça et al. [26] conducted a systematic mapping study to investigate how LLMs are used in qualitative analysis and how they can be applied in software engineering research. Consistent with the study by Barros et al. [23], they identified that LLMs are applied primarily in tasks such as coding, thematic analysis, and data categorization, providing efficiency gains by reducing the time, cognitive demands, and resources required for these processes.

Advantages

LLMs offer promising support for synthesis in SE research by helping to process artifacts such as interview transcripts, survey responses, and literature reviews. Qualitative research in software engineering traditionally faces several challenges, including limited scalability due to the manual nature of the analysis, inconsistencies in coding across researchers, difficulties in generalizing findings from small or context-specific samples, and the influence of researcher subjectivity on data interpretation [1]. The use of LLMs for synthesis can offer advantages in addressing these classical challenges [1], [23], [26]. LLMs can reduce manual effort and subjectivity, improve consistency and generalizability, and assist researchers in deriving codes and developing coding guides during the early stages of qualitative data analysis [1], [27]. The codes generated by LLMs can then be used to annotate additional data, with the models identifying emerging themes and generating candidate insights, potentially automating the entire synthesis process across large qualitative datasets. LLMs can enable researchers to analyze larger datasets, identifying patterns across broader contexts than traditional qualitative methods typically allow. Additionally, they can help mitigate the effects of human subjectivity. Nevertheless, while LLMs streamline many aspects of qualitative synthesis, careful oversight remains essential to ensure nuanced interpretation and contextual accuracy.
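A minimal Python sketch of the two-stage workflow described above: the model first proposes candidate codes from a sample of responses, researchers review and refine them, and the approved codebook is then applied to the remaining data. The prompts and the complete helper are illustrative assumptions, not a prescribed method.

    def complete(prompt):
        """Placeholder for a call to an LLM completion endpoint."""
        raise NotImplementedError

    def propose_codes(sample_responses):
        """Stage 1: ask the model for candidate codes; researchers review and refine them."""
        prompt = (
            "You are assisting with qualitative analysis of survey answers about "
            "developer experiences with code review.\n"
            "Propose at most 10 short codes (2-4 words each) that capture recurring "
            "themes in the following answers, one code per line:\n\n"
            + "\n".join(f"- {r}" for r in sample_responses)
        )
        return [line.strip("- ").strip() for line in complete(prompt).splitlines() if line.strip()]

    def apply_codes(response, codebook):
        """Stage 2: annotate one response using the human-approved codebook."""
        prompt = (
            "Assign one or more of the following codes to the answer below. "
            "Reply with a comma-separated list of codes only.\n"
            f"Codes: {', '.join(codebook)}\nAnswer: {response}"
        )
        return [code.strip() for code in complete(prompt).split(",")]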

Challenges

While LLMs have the potential to automate synthesis, concerns about overreliance remain, especially due to discrepancies between AI- and human-generated insights, particularly in capturing contextual nuances [28]. Bano et al. [28] found that while LLMs can provide structured summaries and qualitative coding frameworks, they may misinterpret nuanced qualitative data due to their lack of contextual understanding. Other studies have echoed similar concerns [1], [23], [26]. In particular, LLMs cannot independently assess argument validity, and critical thinking remains a human responsibility in qualitative synthesis. Moreover, LLMs may produce biased results, reinforcing existing prejudices or omitting key perspectives, making human oversight essential to ensure accurate interpretation, mitigate biases, and maintain quality control. Ethical and privacy concerns also arise from the proprietary nature of many LLMs, limiting transparency and control over training data. Furthermore, reproducibility issues persist due to inconsistencies across inferences, model versions, and prompt variations.

LLMs as Subjects

Description

In empirical research, subjects are the study participants from whom data is collected through methods such as surveys, interviews, or controlled experiments. LLMs can serve as subjects in empirical studies by simulating human behavior and interactions. In this capacity, LLMs generate responses that approximate those of human participants, making them particularly valuable for research involving user interactions, collaborative coding environments, and software usability assessments. This approach enables data collection that closely reflects human reactions while avoiding the need for direct human involvement. To achieve this, prompt engineering techniques are widely employed, a common approach being the Personas Pattern [29], which tailors LLM responses to align with predefined profiles or roles that emulate specific user archetypes. Furthermore, recent sociological studies have emphasized that, to be effectively utilized in this capacity, LLMs—including their agentic versions tailored through prompt engineering—should meet four criteria of algorithmic fidelity [30]. Generated responses should be: (1) indistinguishable from human-produced texts (e.g., LLM-generated code reviews should be comparable to those from real developers); (2) consistent with the attitudes and sociodemographic information of the conditioning context (e.g., LLMs simulating junior developers should exhibit different confidence levels, vocabulary, and concerns compared to senior engineers); (3) naturally aligned with the form, tone, and content of the provided context (e.g., responses in an agile stand-up meeting simulation should be concise, task-focused, and aligned with sprint objectives rather than long, formal explanations); and (4) reflective of patterns in relationships between ideas, demographics, and behavior observed in comparable human data (e.g., discussions on software architecture decisions should capture trade-offs typically debated by human developers, such as maintainability versus performance, rather than abstract theoretical arguments).
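The Personas Pattern mentioned above can be implemented by conditioning the model on a predefined profile before posing study questions. The Python sketch below uses the common chat-completion convention of role/content messages; the persona descriptions are illustrative assumptions.

    PERSONAS = {
        "junior_dev": (
            "You are a junior backend developer with one year of experience. "
            "You are unsure about architectural trade-offs, ask clarifying questions, "
            "and use informal language."
        ),
        "senior_dev": (
            "You are a senior software architect with 15 years of experience. "
            "You answer concisely, weigh maintainability against performance, "
            "and refer to past projects."
        ),
    }

    def build_messages(persona_key, interview_question):
        """Build a chat-style message list that conditions the model on a persona."""
        return [
            {"role": "system", "content": PERSONAS[persona_key]},
            {"role": "user", "content": interview_question},
        ]

    messages = build_messages("junior_dev",
                              "How do you decide when a pull request is ready to merge?")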

Example(s)

LLMs can be used as subjects in various types of empirical studies, enabling researchers to simulate human participants in controlled, repeatable scenarios. The broader applicability of LLM-based studies beyond software engineering has been compiled by Xu et al. [31], who examined various uses in social science research. Given the socio-technical nature of software development, many of these approaches are highly transferable to this domain, demonstrating the growing role of LLMs in empirical software engineering research.

For example, LLMs can be applied in survey and interview studies to impersonate developers responding to survey questionnaires or interviews, allowing researchers to test the clarity and effectiveness of survey items or to simulate responses under varying conditions, such as different expertise levels or cultural contexts. For instance, Gerosa et al. [32] explored persona-based interviews and multi-persona focus groups, demonstrating how LLMs can emulate human responses and behaviors while addressing ethical concerns, biases, and methodological challenges. Usability studies are another example: LLMs can simulate end-user feedback, providing insights into potential usability issues and offering suggestions for improvement based on predefined user personas. This aligns with the work of Bano et al. [33], who investigated biases in LLM-generated candidate profiles in software engineering recruitment processes. Their study, which analyzed both textual and visual outputs, revealed biases favoring male, Caucasian candidates, lighter skin tones, and slim physiques, particularly for senior roles.

Advantages

Using LLMs as subjects offers valuable insights while significantly reducing the need to recruit human participants, a process that is often time-consuming and costly [34].

TODO: Sebastian: Can we be a bit more specific here on the “valuable insights” and advantages that using LLMs as subjects provides?

Furthermore, employing LLMs as subjects enables researchers to conduct empirical research under consistent and repeatable conditions, enhancing the reliability and scalability of the studies.

Challenges

However, it is important that researchers are aware of LLMs’ inherent biases [35] and limitations [36], [37] when using them as study subjects. One critical concern is construct validity. LLMs have been shown to misrepresent demographic group perspectives, failing to capture the diversity of opinions and experiences within a group. The use of identity-based prompts can reduce identities to fixed and innate characteristics, amplifying perceived differences between groups. These biases introduce a risk that studies relying on LLM-generated responses may inadvertently reinforce stereotypes or misrepresent real-world social dynamics. To mitigate these issues, encoded identity names can be used instead of explicit labels, the temperature setting can be increased to enhance response diversity, and alternatives to demographic prompts can be employed when the goal is to broaden response coverage [36], [37]. Beyond construct validity, internal validity must also be considered, particularly regarding causal conclusions based on studies relying on LLM-simulated responses. External validity also remains a challenge, as findings based on LLMs may not generalize to humans.
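Some of these mitigations can be operationalized directly when sampling simulated responses. The Python sketch below draws several higher-temperature samples per persona and refers to personas by neutral identifiers rather than explicit demographic labels; the chat helper is a placeholder for any chat-completion API.

    def chat(messages, temperature):
        """Placeholder for a chat-completion call returning the model's reply."""
        raise NotImplementedError

    def sample_responses(persona_prompt, question, n_samples=5, temperature=1.0):
        """Draw several higher-temperature samples per persona to broaden response coverage.
        Personas are referenced by neutral identifiers (e.g., 'participant_07') rather
        than explicit demographic labels."""
        messages = [
            {"role": "system", "content": persona_prompt},
            {"role": "user", "content": question},
        ]
        return [chat(messages, temperature=temperature) for _ in range(n_samples)]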

Introduction: LLMs as Tools for Software Engineers

LLM-based assistants have become an essential tool for software engineers, supporting them in various tasks such as code generation, summarization, and repair. Researchers have studied how software engineers use LLMs (see Studying LLM Usage in Software Engineering), developed new tools that integrate LLMs (see LLMs for New Software Engineering Tools), and benchmarked LLMs for software engineering tasks (see Benchmarking LLMs for Software Engineering Tasks). As in the previous section, we outline the advantages that studying LLMs in these contexts brings, but also point to specific challenges.

Studying LLM Usage in Software Engineering

Description

Studying how software engineers use LLMs in their workflows is crucial for understanding the current state of practice in software engineering. Researchers can observe software engineers’ usage of LLM-based tools in the field, or study if and how they adopt such tools, their usage patterns, as well as perceived benefits and challenges. Surveys, interviews, observational studies, and the analysis of usage logs can provide insights into how LLMs are integrated into development processes, how they influence decision-making, and what factors affect their acceptance and effectiveness. Such studies can inform improvements of existing LLM-based tools, motivate the design of novel tools, or derive best practices for LLM-assisted software engineering. However, they can also uncover risks or deficiencies of the tools.

Example(s)

Khojah et al. investigated the use of ChatGPT by professional software engineers in a week-long observational study [38]. TODO: Sebastian: What did they find? Azanza et al. conducted a case study evaluating the impact of introducing LLMs to the onboarding process of new software developers [39]. TODO: Sebastian: What did they find? Surveys can help researchers quickly provide a wider overview of the current perceptions of LLM use. For example, Jahic and Sami surveyed 15 software companies regarding their practices on LLMs in software engineering [40]. TODO: Sebastian: What did they find? Retrospective studies analyzing the data generated during the use of LLMs by software engineers can provide additional insights into human-LLM interactions. For example, researchers can employ data mining methods to build large-scale conversation datasets, such as the DevGPT dataset introduced by Xiao et al. [41]. Conversations can then be analyzed using quantitative [42] and qualitative [43] analysis methods.
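Retrospective analyses of this kind typically start from a dump of developer-LLM conversations. The Python sketch below computes simple descriptive statistics, assuming a JSON file containing a list of objects with prompt and answer fields; this schema is hypothetical, and the actual DevGPT snapshots use a richer structure that should be consulted directly.

    import json
    from statistics import mean

    def load_conversations(path):
        """Load a JSON list of {'prompt': ..., 'answer': ...} objects (hypothetical schema)."""
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    def describe(conversations):
        """Compute simple descriptive statistics as a starting point for deeper analysis."""
        prompt_lengths = [len(c["prompt"].split()) for c in conversations]
        with_code = sum("```" in c["answer"] for c in conversations)
        return {
            "n_conversations": len(conversations),
            "mean_prompt_words": mean(prompt_lengths),
            "share_answers_with_code_blocks": with_code / len(conversations),
        }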

Advantages

Studying real-world usage of LLM-based tools allows researchers to understand the state of practice and guide future research. Both field studies and controlled experiments can be used to study LLM usage in SE. Research in less controlled environments may also uncover more nuanced information about LLM-assisted SE workflows in general, independent of the specific LLM being evaluated. TODO: Sebastian: We need to list more advantages here.

Challenges

While this study type is not limited to field studies, studies that take place in real-world environments rather than controlled settings face many additional potential confounding factors, which may threaten the internal validity of the study. The usage environment and choice of LLMs in a real-world context can be extremely diverse. In practice, the process integration of LLMs can range from the use of specific LLMs mandated by company policy to the unregulated use of any available LLM. Both extremes may influence the use of LLMs by software engineers in different ways and should therefore be addressed differently by the study methodology. Further, in longitudinal case studies, the timing of the study may have a significant impact on its results, as LLMs are quickly being developed and replaced by newer versions. This difficulty is exacerbated by the relative novelty of LLMs in the SE process: developers are still learning how to best make use of the new technology, and best practices are still being established. Finally, commercial tools currently dominate the market, and researchers often have no easy way to get access to their usage data.

LLMs for New Software Engineering Tools

Description

LLMs are being integrated into new tools supporting software engineers in their daily tasks. Such integration is important to tailor the tools to the specific needs of a development team, to enhance their capabilities, and to align their behavior with company policies. TODO: Sebastian: Why mention company policies here? This section is about researchers developing new tools based on LLMs, potentially supporting tasks that were hard to automate in the past, such as generating test cases. In this regard, the advent of GenAI agents has enabled the development of a standardized architecture in which the LLM is guided by a reasoning component (related to prompt engineering), tools (understood as interfaces to external systems via APIs or databases), and a user communication interface that is not necessarily limited to a chat [44], [45], [46]. Besides proposing such tools, researchers can also evaluate their effectiveness in improving the developer experience (for example, in terms of productivity, artifact quality, and developer satisfaction) and experiment with different implementations of the above-mentioned architecture to improve the tools.
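A minimal Python sketch of this architecture: a reasoning loop in which the LLM either invokes a tool or returns an answer to the user. The llm placeholder and the single hypothetical code-search tool are assumptions for illustration, not a reference implementation of the cited frameworks.

    def llm(prompt):
        """Placeholder for the underlying language model call."""
        raise NotImplementedError

    TOOLS = {
        # Tools interface with external systems; here, a single hypothetical code-search API.
        "search_code": lambda query: f"(results of searching the repository for '{query}')",
    }

    def run_agent(user_request, max_steps=5):
        """Reason-act loop: at each step the LLM either calls a tool or answers the user."""
        context = f"User request: {user_request}\nAvailable tools: {', '.join(TOOLS)}\n"
        for _ in range(max_steps):
            decision = llm(context + "Reply with 'TOOL <name> <args>' or 'ANSWER <text>'.")
            if decision.startswith("TOOL"):
                _, name, args = decision.split(" ", 2)
                context += f"Tool {name} returned: {TOOLS[name](args)}\n"
            else:
                return decision.removeprefix("ANSWER ").strip()
        return "No answer produced within the step budget."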

Example(s)

TODO: Sebastian: Mention at least one non-agentic example here. For example, Richards and Wessel introduced a preliminary GenAI agent designed to assist developers in understanding source code by incorporating a reasoning component grounded in the theory of mind [44]. Their work specifically emphasized the reasoning aspect of the architecture, leveraging the theory of mind to develop tailored agents while also demonstrating their effectiveness. Similarly, Yan et al. proposed IVIE, a tool integrated into the VS Code graphical interface that generates and explains code using LLMs [47]. In this case, the authors focused more on the presentation part of the architecture, providing a user-friendly interface to interact with the LLM.

TODO: Sebastian: This is a study based on an existing commercial tool and hence an example for studying LLM usage in SE (see above) Empirical studies can evaluate the effectiveness of these tools in improving productivity, artefact quality, and developer satisfaction. For example, Choudhuri et al. conducted an experiment with students in which they measured the impact of ChatGPT on the correctness and time taken to solve programming tasks [48].

Advantages

By assessing the impact of LLM-powered tools, researchers can identify best practices and areas for further improvement. Moreover, by developing tools based on well-defined agent architectures, researchers can facilitate technology transfer, bridging the gap between industry and academia more effectively. In addition, the increased accessibility of open models and their advanced capabilities have made the development of such tools more feasible and democratized than in the past. TODO: Sebastian: Again, this study type is not only about agents but about novel tools based on LLMs in general. It’s also not about assessing the impact, that is a different study type (which is of course related).

Challenges

In this context, the non-determinism of generative AI models—although potentially mitigated through prompt engineering—poses a significant challenge for both the development and evaluation of tools integrating GenAI. Additionally, while open models are accessible, the most performant ones require substantial hardware resources that are not yet widely available. Resorting to cloud-based APIs using non-open models or relying on third-party providers for hosting, while seemingly a solution, introduces new concerns related to data privacy and security. Another challenge is the existence of non-LLM-based baselines, i.e., traditional deterministic tools, which might perform slightly worse but do not suffer from many of the issues that LLMs have.

Benchmarking LLMs for Software Engineering Tasks

Description

Benchmarking is the process of evaluating LLM output for standardized tasks using standardized metrics. Reference datasets that are widely adopted by the community, such as HumanEval [49] for the task of code generation, are necessary to enable comparable evaluation across studies. LLM output is compared against a ground truth in the benchmark dataset using general metrics for text generation, such as ROUGE, BLEU, or METEOR [50], or metrics commonly used for code-related tasks, such as pass@k, which estimates the probability that at least one of k generated solutions passes all tests.
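For reference, the unbiased pass@k estimator introduced together with HumanEval [49] can be computed as follows, where n is the number of samples generated per task and c the number of correct samples.

    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased estimator of pass@k: probability that at least one of k samples
        drawn from n generations (c of which are correct) passes."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 10 generated samples per task, 3 of which pass the tests
    print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
    print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917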

Example(s)

In software engineering, benchmarking may include the evaluation of an LLM’s ability to produce accurate and reliable outputs for input data obtained from curated real-world projects or from synthetic SE-specific datasets. Typical tasks include code generation, code summarization, code completion, and code repair, but also natural language processing tasks that are of interest for subfields such as requirements engineering, e.g., anaphora resolution, i.e., determining which previously mentioned entity a pronoun or other referring expression refers to. RepairBench [51], for example, contains 574 buggy Java methods and their corresponding fixed versions, which can be used to evaluate the performance of LLMs in code repair tasks. The metrics used are Plausible@1 (i.e., the probability that the first generated patch passes all test cases) and AST Match@1 (i.e., the probability that the abstract syntax tree of the first generated patch matches that of the ground-truth patch). SWE-bench [52] is a more generic benchmark that contains 2,294 SE Python tasks extracted from GitHub pull requests. For scoring the performance of the LLMs on the tasks, the authors report whether the generated patch is applicable (i.e., whether it can be applied to the codebase without errors) and, for applicable patches, the percentage of test cases passed.
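To illustrate the idea behind an AST-based match, the Python sketch below compares abstract syntax trees instead of raw text, so that formatting differences are ignored; RepairBench itself targets Java, so this is only an analogy using Python’s built-in ast module.

    import ast

    def ast_match(candidate_src, ground_truth_src):
        """True if two snippets have identical abstract syntax trees, i.e., they differ
        at most in formatting and comments."""
        return ast.dump(ast.parse(candidate_src)) == ast.dump(ast.parse(ground_truth_src))

    print(ast_match("x = 1 + 2", "x = 1+2"))    # True: same tree, different spacing
    print(ast_match("x = 1 + 2", "x = 2 + 1"))  # False: different trees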

Advantages

Properly built benchmarks provide objective, reproducible evaluation across different tasks, enabling fair comparison of different models (and versions). Moreover, benchmarks built for specific SE tasks can help identify LLM weaknesses and support their optimization or fine-tuning for such tasks. Benchmarks built using real-world data can also help legitimize research results for practitioners, supporting industry-academia collaboration. Finally, benchmarks can foster open science practices by providing a common ground for sharing data (e.g., as part of the benchmark itself) and results (e.g., of models run against a benchmark).

Challenges

Benchmark contamination, i.e., the inclusion of the benchmark in the training dataset of the LLM [53], has recently been identified as an issue. The careful selection of samples and construction of the corresponding input prompts is particularly important, as correlations between prompts may bias benchmark results [54]. Moreover, an LLM that performs well on a specific benchmark, such as HumanEval, does not necessarily perform well on other benchmarks. Benchmark metrics such as perplexity or BLEU-N do not always reflect human judgment. Recently, Cao et al. [55] proposed guidelines for building benchmarks for LLMs related to coding tasks, grounded in a systematic survey of existing benchmarks. In this process, they highlight current shortcomings related to reliability, transparency, irreproducibility, low data quality, and inadequate validation measures. Comment: Previous comment not addressed yet: I like the last reference to guidelines for building a benchmark. Should we include some challenges from the paper, too?
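A rough screen for the contamination issue mentioned above is to measure verbatim n-gram overlap between benchmark samples and documents suspected to be part of a model’s training data. The Python sketch below is a simplistic heuristic, not a substitute for the dedicated detection methods discussed in [53].

    def ngrams(text, n=8):
        tokens = text.split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap_ratio(benchmark_sample, corpus_document, n=8):
        """Share of the sample's n-grams that also appear verbatim in the document."""
        sample_grams = ngrams(benchmark_sample, n)
        if not sample_grams:
            return 0.0
        return len(sample_grams & ngrams(corpus_document, n)) / len(sample_grams)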

References

[1] M. Bano, R. Hoda, D. Zowghi, and C. Treude, “Large language models for qualitative research in software engineering: Exploring opportunities and challenges,” Autom. Softw. Eng., vol. 31, no. 1, p. 8, 2024, doi: 10.1007/S10515-023-00407-8.

[2] J. Huang et al., “Enhancing review classification via LLM-based data annotation and multi-perspective feature representation learning,” SSRN Electronic Journal, pp. 1–15, 2024, doi: 10.2139/ssrn.5002351.

[3] T. Ahmed, P. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” arXiv, Feb. 04, 2025. doi: 10.48550/arXiv.2408.05534.

[4] S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng, “Want to reduce labeling cost? GPT-3 can help,” in Findings of the association for computational linguistics: EMNLP 2021, virtual event / punta cana, dominican republic, 16-20 november, 2021, M.-F. Moens, X. Huang, L. Specia, and S. W. Yih, Eds., Association for Computational Linguistics, 2021, pp. 4195–4205. doi: 10.18653/V1/2021.FINDINGS-EMNLP.354.

[5] Z. He, C.-Y. Huang, C.-K. C. Ding, S. Rohatgi, and T.-H. K. Huang, “If in a crowdsourced data annotation pipeline, a GPT-4,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, honolulu, HI, USA, may 11-16, 2024, F. ’Floyd’Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 1040:1–1040:25. doi: 10.1145/3613904.3642834.

[6] F. Gilardi, M. Alizadeh, and M. Kubli, “ChatGPT outperforms crowd-workers for text-annotation tasks,” CoRR, vol. abs/2303.15056, 2023, doi: 10.48550/ARXIV.2303.15056.

[7] Y. Zhu, P. Zhang, E. ul Haq, P. Hui, and G. Tyson, “Can ChatGPT reproduce human-generated labels? A study of social computing tasks,” CoRR, vol. abs/2304.10145, 2023, doi: 10.48550/ARXIV.2304.10145.

[8] F. Huang, H. Kwak, and J. An, “Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech,” in Companion proceedings of the ACM web conference 2023, WWW 2023, austin, TX, USA, 30 april 2023 - 4 may 2023, Y. Ding, J. Tang, J. F. Sequeda, L. Aroyo, C. Castillo, and G.-J. Houben, Eds., ACM, 2023, pp. 294–297. doi: 10.1145/3543873.3587368.

[9] M. Wan et al., “TnT-LLM: Text mining at scale with large language models,” in Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, KDD 2024, barcelona, spain, august 25-29, 2024, R. Baeza-Yates and F. Bonchi, Eds., ACM, 2024, pp. 5836–5847. doi: 10.1145/3637528.3671647.

[10] X. Wang, H. Kim, S. Rahman, K. Mitra, and Z. Miao, “Human-LLM collaborative annotation through effective verification of LLM labels,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, honolulu, HI, USA, may 11-16, 2024, F. ’Floyd’Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 303:1–303:21. doi: 10.1145/3613904.3641960.

[11] N. Pangakis, S. Wolken, and N. Fasching, “Automated annotation with generative AI requires validation,” CoRR, vol. abs/2306.00176, 2023, doi: 10.48550/ARXIV.2306.00176.

[12] M. V. Reiss, “Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark,” CoRR, vol. abs/2304.11085, 2023, doi: 10.48550/ARXIV.2304.11085.

[13] S. Lubos et al., “Leveraging LLMs for the quality assurance of software requirements,” in 32nd IEEE international requirements engineering conference, RE 2024, reykjavik, iceland, june 24-28, 2024, G. Liebel, I. Hadar, and P. Spoletini, Eds., IEEE, 2024, pp. 389–397. doi: 10.1109/RE59067.2024.00046.

[14] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, Apr. 1960, doi: 10.1177/001316446002000104.

[15] A. Elangovan et al., “Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge,” CoRR, vol. abs/2410.03775, 2024, doi: 10.48550/ARXIV.2410.03775.

[16] K. Schroeder and Z. Wood-Doughty, “Can you trust LLM judgments? Reliability of LLM-as-a-judge,” CoRR, vol. abs/2412.12509, 2024, doi: 10.48550/ARXIV.2412.12509.

[17] P. Pezeshkpour and E. Hruschka, “Large language models sensitivity to the order of options in multiple-choice questions,” in Findings of the association for computational linguistics: NAACL 2024, mexico city, mexico, june 16-21, 2024, K. Duh, H. Gómez-Adorno, and S. Bethard, Eds., Association for Computational Linguistics, 2024, pp. 2006–2017. doi: 10.18653/V1/2024.FINDINGS-NAACL.130.

[18] A. Panickssery, S. Bowman, and S. Feng, “LLM evaluators recognize and favor their own generations,” in Advances in neural information processing systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., Curran Associates, Inc., 2024, pp. 68772–68802. Available: https://proceedings.neurips.cc/paper_files/paper/2024/file/7f1f0218e45f5414c79c0679633e47bc-Paper-Conference.pdf

[19] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” Computational Linguistics, vol. 50, pp. 1097–1179, 2024, doi: 10.1162/coli_a_00524.

[20] A. Bavaresco et al., “LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks,” CoRR, vol. abs/2406.18403, 2024, doi: 10.48550/ARXIV.2406.18403.

[21] R. Zhou, L. Chen, and K. Yu, “Is LLM a reliable reviewer? A comprehensive evaluation of LLM on automatic paper reviewing tasks,” in Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation, LREC/COLING 2024, 20-25 may, 2024, torino, italy, N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds., ELRA; ICCL, 2024, pp. 9340–9351. Available: https://aclanthology.org/2024.lrec-main.816

[22] Q. Pan et al., “Human-Centered Design Recommendations for LLM-as-a-Judge,” in ACL 2024 Workshop HuCLLM, arXiv, Jul. 2024. doi: 10.48550/arXiv.2407.03479.

[23] C. F. Barros et al., “Large language model for qualitative research – a systematic mapping study.” 2024. Available: https://arxiv.org/abs/2411.14473

[24] S. De Paoli, “Performing an inductive thematic analysis of semi-structured interviews with a large language model: An exploration and provocation on the limits of the approach,” Social Science Computer Review, vol. 42, no. 4, pp. 997–1019, 2024.

[25] W. S. Mathis, S. Zhao, N. Pratt, J. Weleff, and S. De Paoli, “Inductive thematic analysis of healthcare qualitative interviews using open-source large language models: How does it compare to traditional methods?” Computer Methods and Programs in Biomedicine, vol. 255, p. 108356, 2024.

[26] M. de Morais Leça, L. Valença, R. Santos, and R. de Souza Santos, “Applications and implications of large language models in qualitative analysis: A new frontier for empirical software engineering.” 2024. Available: https://arxiv.org/abs/2412.06564

[27] C. Byun, P. Vasicek, and K. D. Seppi, “Dispensing with humans in human-computer interaction research,” in Extended abstracts of the 2023 CHI conference on human factors in computing systems, CHI EA 2023, hamburg, germany, april 23-28, 2023, A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, and A. Peters, Eds., ACM, 2023, pp. 413:1–413:26. doi: 10.1145/3544549.3582749.

[28] M. Bano, D. Zowghi, and J. Whittle, “Exploring qualitative research using LLMs.” 2023. Available: https://arxiv.org/abs/2306.13298

[29] A. Kong et al., “Better zero-shot reasoning with role-play prompting,” CoRR, vol. abs/2308.07702, 2023, doi: 10.48550/ARXIV.2308.07702.

[30] L. P. Argyle, E. C. Busby, N. Fulda, J. Gubler, C. M. Rytting, and D. Wingate, “Out of one, many: Using language models to simulate human samples,” CoRR, vol. abs/2209.06899, 2022, doi: 10.48550/ARXIV.2209.06899.

[31] R. Xu et al., “AI for social science and social science of AI: A survey,” Inf. Process. Manag., vol. 61, no. 2, p. 103665, 2024, doi: 10.1016/J.IPM.2024.103665.

[32] M. A. Gerosa, B. Trinkenreich, I. Steinmacher, and A. Sarma, “Can AI serve as a substitute for human subjects in software engineering research?” Autom. Softw. Eng., vol. 31, no. 1, p. 13, 2024, doi: 10.1007/S10515-023-00409-6.

[33] M. Bano, H. Gunatilake, and R. Hoda, “What does a software engineer look like? Exploring societal stereotypes in LLMs.” 2025. Available: https://arxiv.org/abs/2501.03569

[34] K. Madampe, J. Grundy, R. Hoda, and H. O. Obie, “The struggle is real! The agony of recruiting participants for empirical software engineering studies,” in 2024 IEEE symposium on visual languages and human-centric computing (VL/HCC), liverpool, UK, september 2-6, 2024, IEEE, 2024, pp. 417–422. doi: 10.1109/VL/HCC60511.2024.00065.

[35] R. Crowell, “Why AI’s diversity crisis matters, and how to tackle it,” Nature Career Feature, 2023, doi: 10.1038/d41586-023-01689-4.

[36] J. Harding, W. D’Alessandro, N. G. Laskowski, and R. Long, “AI language models cannot replace human research participants,” AI Soc., vol. 39, no. 5, pp. 2603–2605, 2024, doi: 10.1007/S00146-023-01725-X.

[37] A. Wang, J. Morgenstern, and J. P. Dickerson, “Large language models cannot replace human participants because they cannot portray identity groups,” CoRR, vol. abs/2402.01908, 2024, doi: 10.48550/ARXIV.2402.01908.

[38] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond code generation: An observational study of ChatGPT usage in software engineering practice,” Proc. ACM Softw. Eng., vol. 1, no. FSE, pp. 1819–1840, 2024, doi: 10.1145/3660788.

[39] M. Azanza, J. Pereira, A. Irastorza, and A. Galdos, “Can LLMs facilitate onboarding software developers? An ongoing industrial case study,” in 36th international conference on software engineering education and training, CSEE&t 2024, würzburg, germany, july 29 - aug. 1, 2024, IEEE, 2024, pp. 1–6. doi: 10.1109/CSEET62301.2024.10662989.

[40] J. Jahic and A. Sami, “State of practice: LLMs in software engineering and software architecture,” in 21st IEEE international conference on software architecture, ICSA 2024 - companion, hyderabad, india, june 4-8, 2024, IEEE, 2024, pp. 311–318. doi: 10.1109/ICSA-C63560.2024.00059.

[41] T. Xiao, C. Treude, H. Hata, and K. Matsumoto, “DevGPT: Studying developer-ChatGPT conversations,” in 21st IEEE/ACM international conference on mining software repositories, MSR 2024, lisbon, portugal, april 15-16, 2024, D. Spinellis, A. Bacchelli, and E. Constantinou, Eds., ACM, 2024, pp. 227–230. doi: 10.1145/3643991.3648400.

[42] Md. F. Rabbi, A. I. Champa, M. F. Zibran, and Md. R. Islam, “AI writes, we analyze: The ChatGPT python code saga,” in 21st IEEE/ACM international conference on mining software repositories, MSR 2024, lisbon, portugal, april 15-16, 2024, D. Spinellis, A. Bacchelli, and E. Constantinou, Eds., ACM, 2024, pp. 177–181. doi: 10.1145/3643991.3645076.

[43] S. Mohamed, A. Parvin, and E. Parra, “Chatting with AI: Deciphering developer conversations with ChatGPT,” in 21st IEEE/ACM international conference on mining software repositories, MSR 2024, lisbon, portugal, april 15-16, 2024, D. Spinellis, A. Bacchelli, and E. Constantinou, Eds., ACM, 2024, pp. 187–191. doi: 10.1145/3643991.3645078.

[44] J. Richards and M. Wessel, “What you need is what you get: Theory of mind for an LLM-based code understanding assistant,” in IEEE international conference on software maintenance and evolution, ICSME 2024, flagstaff, AZ, USA, october 6-11, 2024, IEEE, 2024, pp. 666–671. doi: 10.1109/ICSME58944.2024.00070.

[45] T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths, “Cognitive architectures for language agents,” Trans. Mach. Learn. Res., vol. 2024, 2024, Available: https://openreview.net/forum?id=1i6ZCvflQJ

[46] W. Zhou et al., “Agents: An open-source framework for autonomous language agents,” CoRR, vol. abs/2309.07870, 2023, doi: 10.48550/ARXIV.2309.07870.

[47] L. Yan, A. Hwang, Z. Wu, and A. Head, “Ivie: Lightweight anchored explanations of just-generated code,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, honolulu, HI, USA, may 11-16, 2024, F. ’Floyd’Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 140:1–140:15. doi: 10.1145/3613904.3642239.

[48] R. Choudhuri, D. Liu, I. Steinmacher, M. A. Gerosa, and A. Sarma, “How far are we? The triumphs and trials of generative AI in learning software engineering,” in Proceedings of the 46th IEEE/ACM international conference on software engineering, ICSE 2024, lisbon, portugal, april 14-20, 2024, ACM, 2024, pp. 184:1–184:13. doi: 10.1145/3597503.3639201.

[49] M. Chen et al., “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021, Available: https://arxiv.org/abs/2107.03374

[50] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.

[51] A. Silva and M. Monperrus, “RepairBench: Leaderboard of frontier models for program repair,” arXiv preprint arXiv:2409.18952, 2024.

[52] C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world github issues?” in The twelfth international conference on learning representations, ICLR 2024, vienna, austria, may 7-11, 2024, OpenReview.net, 2024. Available: https://openreview.net/forum?id=VTF8yNQM66

[53] S. Ahuja, V. Gumma, and S. Sitaram, “Contamination report for multilingual benchmarks,” CoRR, vol. abs/2410.16186, 2024, doi: 10.48550/ARXIV.2410.16186.

[54] C. Siska, K. Marazopoulou, M. Ailem, and J. Bono, “Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks,” in Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), ACL 2024, bangkok, thailand, august 11-16, 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds., Association for Computational Linguistics, 2024, pp. 10406–10421. doi: 10.18653/V1/2024.ACL-LONG.560.

[55] J. Cao et al., “How should I build a benchmark?” arXiv preprint arXiv:2501.10711, 2025.