Study Types
This list of study types is currently a DRAFT and based on discussion sessions with researchers at the 2024 International Software Engineering Research Network (ISERN) meeting and at the 2nd Copenhagen Symposium on Human-Centered Software Engineering AI.
The development of empirical guidelines for studies involving large language models (LLMs) in software engineering is crucial for ensuring the validity and reproducibility of results. However, these guidelines must be tailored to different study types, as each poses unique methodological challenges. Therefore, understanding the classification of these studies is essential for developing appropriate guidelines. We envision that a mature set of guidelines provides specific guidance for each of these study types, addressing their individual methodological idiosyncrasies. Moreover, we currently focus on large language models, that is, on natural language processing. In the future, we might extend our focus to multimodal foundation models.
Introduction: LLMs as Tools for Software Engineering Researchers
LLMs can be leveraged as powerful tools to assist researchers in conducting empirical studies. They can automate various tasks such as data collection, preprocessing, and analysis. For example, LLMs can apply pre-defined coding guides to large qualitative datasets (LLMs as Annotators), assess the quality of software artifacts (LLMs as Raters), generate summaries of research papers (LLMs for Synthesis), and even simulate human behavior in empirical studies (LLMs as Subjects). This can significantly reduce the time and effort required by researchers, allowing them to focus on more complex aspects of their studies. However, all these applications also come with limitations, potential threats to validity, and implications for the reproducibility of study results. In our guidelines, the following study types are used to contextualize the recommendations we provide.
LLMs as Annotators
Description
LLMs can serve as annotators by automatically labeling artifacts with corresponding categories for data analysis based on a pre-defined coding guide. In qualitative data analysis, manually annotating or coding text passages, e.g., in software artifacts, open-ended survey responses, or interview transcripts, is often time-consuming [1]. LLMs can be used to augment or replace human annotations, provide suggestions for new codes (see Section LLMs for Synthesis), or even automate the entire process.
Example(s)
Recent work in software engineering has begun exploring the use of LLMs for annotation tasks. Huang et al. [2] proposed an approach leveraging multiple LLMs for joint annotation of mobile application reviews. They used three models (Llama3, Gemma, and Mistral) of comparable size with an absolute majority voting rule (i.e., a label is only accepted if it receives more than half of the total votes from the models). Accordingly, the annotations fell into three categories: exact matches (where all models agreed), partial matches (where a majority agreed), and non-matches (where no majority was reached). Training on the resulting multi-model annotated dataset yielded improved classifier performance compared to single-model annotations. Both BERT and RoBERTa models showed meaningful improvements in F1 score and accuracy when trained on the multi-model annotated dataset versus individual model annotations, highlighting the benefits of their approach.
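To illustrate the aggregation step, the following minimal sketch applies an absolute majority rule to labels produced by three models; the `annotate` callable is a hypothetical stand-in for an actual model invocation and not part of the approach by Huang et al. [2].

```python
# Minimal sketch of absolute majority voting across three annotator models.
# `annotate(model, text)` is a hypothetical wrapper around an LLM call.
from collections import Counter

MODELS = ["llama3", "gemma", "mistral"]

def aggregate_label(text, annotate):
    """Return (label, match_type) under an absolute majority rule."""
    votes = [annotate(model, text) for model in MODELS]
    label, count = Counter(votes).most_common(1)[0]
    if count == len(MODELS):
        return label, "exact match"    # all models agree
    if count > len(MODELS) / 2:
        return label, "partial match"  # more than half agree
    return None, "non-match"           # no absolute majority

# Example with a stub annotator that always returns the same label:
print(aggregate_label("App crashes on startup", lambda model, text: "bug report"))
```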
Advantages
Recent research demonstrates several key advantages of using LLMs as annotators, in particular cost-effectiveness, efficiency, and accuracy. LLM-based annotation dramatically reduces costs compared to human labeling, with studies showing cost reductions of 50-96% across various natural language tasks [3]. For example, He et al. found in their study that GPT-4 annotation cost only $122.08 compared to $4,508 for a comparable MTurk pipeline, while also completing the task in just 2 days versus several weeks for the crowdsourced approach [4]. Moreover, LLMs consistently demonstrate strong performance, with ChatGPT’s accuracy exceeding that of crowd workers by approximately 25% on average [5] and achieving impressive results in specific tasks such as sentiment analysis (64.9% accuracy) and counterspeech detection (0.791 precision) [6]. They also show remarkably high intercoder agreement, surpassing both crowd workers and trained annotators [5].
Challenges
Several important challenges and limitations must also be considered, including reliability issues, human-LLM interaction challenges, biases, errors, and resource considerations. Studies suggest that while LLMs show promise as annotation tools in SE, their optimal use may be in augmenting rather than replacing human annotators entirely [3], [4], with careful consideration given to verification mechanisms and confidence thresholds. LLMs can negatively impact human judgment when their labels are incorrect [7], and their overconfidence requires careful verification [8]. Moreover, LLMs show significant variability in annotation quality across different tasks [7], [9], with particular challenges in complex tasks and post-training events. They are especially unreliable for high-stakes labeling tasks [9], demonstrating notable performance disparities across different label categories [6]. Recent empirical evidence indicates that LLM consistency in text annotation often falls below scientific reliability thresholds, with outputs being sensitive to minor prompt variations [10] and performance varying between tasks [11]. Specific challenges have been identified with context-dependent annotations, where LLMs show particular difficulty in correctly interpreting text segments that require broader contextual understanding [4]. While pooling multiple outputs can improve reliability, this approach necessitates additional computational resources and still requires validation against human-annotated data. While generally cost-effective, LLM annotation requires careful management of per-token charges, particularly for longer texts [3]. Furthermore, achieving reliable annotations may require multiple runs of the same input to enable majority voting [10], although exact cost comparisons between LLM-based and human annotation remain contested [4]. Finally, research has identified consistent biases in label assignment, including tendencies to overestimate certain labels and misclassify neutral content, particularly in stance detection tasks [6].
LLMs as Raters
Description
In empirical studies, LLMs can act as raters (also known as judges) to evaluate properties of software artifacts. For instance, LLMs can be used to assess code readability, adherence to coding standards, or the quality of code comments. LLM rating could also be used to assess research artifact quality, such as documentation generated by a research tool. Rating is distinct from the more qualitative task of assigning a code or label (see Section LLMs as Annotators) to (typically) unstructured text. It is also distinct from using LLMs to perform software engineering tasks (see Section LLMs as Tools for Software Engineers), e.g., to understand whether LLMs can find ambiguity in requirements [12], where the output is compared to a human-created gold standard.
Example(s)
Lubos et al. [13] leveraged Llama-2 to evaluate the quality of software requirements statements.
They prompted the LLM with the text below, where the words in brackets reflect parameters for the study:
Your task is to evaluate the quality of a software requirement.
Evaluate whether the following requirement is {quality_characteristic}.
{quality_characteristic} means: {quality_characteristic_explanation}
The evaluation result must be: ‘yes’ or ‘no’.
Request: Based on the following description of the project: {project_description}
Evaluate the quality of the following requirement: {requirement}.
Explain your decision and suggest an improved version.
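As an illustration, the template above can be instantiated programmatically; the parameter values in the following minimal sketch are placeholders and not the ones used by Lubos et al. [13].

```python
# Minimal sketch: filling the prompt template above with study parameters.
# The parameter values are illustrative placeholders.
PROMPT_TEMPLATE = (
    "Your task is to evaluate the quality of a software requirement.\n"
    "Evaluate whether the following requirement is {quality_characteristic}.\n"
    "{quality_characteristic} means: {quality_characteristic_explanation}\n"
    "The evaluation result must be: 'yes' or 'no'.\n"
    "Request: Based on the following description of the project: {project_description}\n"
    "Evaluate the quality of the following requirement: {requirement}.\n"
    "Explain your decision and suggest an improved version."
)

prompt = PROMPT_TEMPLATE.format(
    quality_characteristic="unambiguous",
    quality_characteristic_explanation="the requirement allows only one interpretation",
    project_description="A web shop for selling books.",
    requirement="The system should respond quickly.",
)
print(prompt)
```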
Lubos et al. then evaluated the LLM’s output against human raters to assess how well the LLM matched experts. They did this in two phases: in the first, the human raters were unaware of the LLM’s decision; in the second, they saw it beforehand. The second phase measured the extent to which the LLM might convince a human rater to change their mind.
Agreement can be measured in many ways; this study used Cohen’s kappa [14]. The study found moderate agreement on the simple requirements dataset and poor agreement on the more complex requirements. However, evaluating human-machine agreement may not be the same as evaluating human-human agreement, and this remains an open area of research [15].
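For reference, Cohen’s kappa can be computed directly from paired rating lists, e.g., with scikit-learn; the ratings in the sketch below are illustrative and not taken from the study.

```python
# Minimal sketch: agreement between an LLM rater and a human rater via Cohen's kappa [14].
from sklearn.metrics import cohen_kappa_score

human_ratings = ["yes", "no", "yes", "yes", "no", "no"]
llm_ratings   = ["yes", "no", "no",  "yes", "no", "yes"]

kappa = cohen_kappa_score(human_ratings, llm_ratings)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```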
The study by Ahmed et al. [16] looked at LLMs as raters across five datasets and ten annotation tasks. They used a variety of LLMs to compare model-to-model performance. For tasks where humans do not agree with one another (in this case, evaluating the conciseness of code summaries), models also do poorly; but “if multiple LLMs reach similar solutions independently, then LLMs are likely suitable for the annotation task” [16, p. 6].
Advantages
By providing—depending on the model configuration—consistent and relatively “objective” evaluations, LLMs can help mitigate certain biases and part of the variability that human raters might introduce. This may lead to more reliable and reproducible results in empirical studies, to the extent these models can be reproduced or checkpointed (see “Report Model Version and Configuration” and “Report Prompts and their Development” for more).
LLMs can be much more efficient, and scale far more easily, than the equivalent human approach. With LLM automation, entire datasets could be labeled, as opposed to subsets. The main constraint, which varies by model and budget, is the input context size, i.e., the number of tokens one can pass into a model. For example, the upper bound on context that can be passed into OpenAI’s o1-mini model is 32k tokens.
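A simple pre-check, sketched below, can verify that an artifact fits a model’s context window before submission; the tokenizer choice is an assumption and must match the model actually used.

```python
# Minimal sketch: checking whether an artifact fits into a model's context window.
# The encoding name is an assumption; pick the tokenizer that matches your model.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def fits_in_context(text: str, max_tokens: int = 32_000) -> bool:
    return len(enc.encode(text)) <= max_tokens

print(fits_in_context("def add(a, b):\n    return a + b"))
```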
Challenges
When relying on the judgment of LLMs, researchers must build a reliable process for generating ratings that considers the non-deterministic nature of LLMs and report the intricacies of that process transparently [17]. For example, the order of options has been shown to affect LLM outputs in multiple-choice settings [18].
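One way to probe such effects, sketched below under the assumption of a hypothetical `rate` wrapper around the model, is to shuffle the option order across repeated queries and inspect whether the chosen answer stays stable.

```python
# Minimal sketch: probing sensitivity to option order by shuffling the options
# across repeated queries. `rate(prompt)` is a hypothetical LLM wrapper that
# returns the selected option.
import random
from collections import Counter

OPTIONS = ["very readable", "somewhat readable", "not readable"]

def order_sensitivity(rate, snippet, runs=10):
    answers = []
    for _ in range(runs):
        shuffled = random.sample(OPTIONS, k=len(OPTIONS))
        prompt = (f"Rate the readability of this code:\n{snippet}\n"
                  f"Options: {', '.join(shuffled)}")
        answers.append(rate(prompt))
    return Counter(answers)  # a spread-out distribution hints at order sensitivity
```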
Avoid the temptation to assign objectivity to an LLM; evidence shows LLMs can behave differently, for instance, when reviewing their own outputs [19]. In more human-oriented datasets (such as discussions of pull requests), LLMs may suffer from well-documented biases and issues with fairness [20]. For tasks where human raters themselves disagree significantly, it is not clear whether an LLM rater should reflect the majority opinion or act as an independent rater. The underlying statistical framework of an LLM usually pushes outputs towards the most likely (majority) answer.
Rating large numbers of items—for example, to perform fault localization on the thousands of bugs big open-source projects deal with—comes with costs in clock time, compute time, and environmental sustainability.
One researcher choice in such studies is the number of examples provided to the LLM in the prompt: none (zero-shot), a few (few-shot), or many, as in more conventional training. In the example above, Lubos et al. chose a zero-shot approach, providing no specific guidance besides the project’s context, i.e., they did not show the LLM what a “yes” or “no” answer might look like.
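The difference between these choices is visible in the prompt itself; the minimal sketch below contrasts a zero-shot and a few-shot variant with illustrative requirements that are not taken from the study.

```python
# Minimal sketch: zero-shot vs. few-shot prompt construction for a rating task.
# The example requirements and labels are illustrative placeholders.
ZERO_SHOT = (
    "Evaluate whether the following requirement is unambiguous. Answer 'yes' or 'no'.\n"
    "Requirement: {requirement}"
)

FEW_SHOT = (
    "Evaluate whether the following requirement is unambiguous. Answer 'yes' or 'no'.\n"
    "Requirement: The system shall log every failed login attempt. -> yes\n"
    "Requirement: The system should be user-friendly. -> no\n"
    "Requirement: {requirement}"
)

requirement = "The system should respond quickly."
print(ZERO_SHOT.format(requirement=requirement))
print(FEW_SHOT.format(requirement=requirement))
```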
Research is ongoing as to how suitable LLMs are as standalone raters. Questions around bias, accuracy, and trust remain [21]. There is reason for concern about LLMs judging student assignments or doing peer review of scientific papers [22]. Even beyond questions of technical capacity, ethical questions remain, particularly if there is some implicit expectation that a human is rating the output, such as a quiz. Involving a human in the rating loop—for example, to contextualize the scoring—is one approach [23].
LLMs for Synthesis
Description
Large Language Models (LLMs) can streamline qualitative data analysis and synthesis by summarizing text, identifying key themes, and providing detailed analyses [24]. In software engineering research, these models can assist in processing interview transcripts, survey responses, and literature reviews, with the potential to address longstanding challenges of qualitative software engineering research, such as reducing manual effort and subjectivity and improving generalizability and consistency [1]. LLMs can support researchers in deriving codes and developing coding guides during the initial phase of qualitative data analysis. These codes can then be used to annotate additional data (see Section “LLMs as Annotators”), with LLMs identifying key themes that emerge from these codes and providing candidate insights, potentially automating the entire process and enabling the synthesis of large amounts of qualitative data. However, while LLMs show promise, there remain concerns about overreliance, as discrepancies between AI-generated and human-generated insights persist, particularly related to capturing contextual nuances [25].
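As a simple illustration of the initial coding phase mentioned above, the hypothetical prompt in the sketch below asks an LLM to propose candidate codes for an interview excerpt; the instructions and the excerpt are assumptions, not a validated protocol.

```python
# Minimal sketch: prompting an LLM to propose initial codes for an interview excerpt.
# The instructions and the excerpt are illustrative placeholders.
CODING_PROMPT = (
    "You support qualitative data analysis in a software engineering study.\n"
    "Read the interview excerpt below and propose up to 5 candidate codes.\n"
    "For each code, give a short name, a one-sentence definition, and a supporting quote.\n\n"
    "Excerpt:\n{excerpt}"
)

excerpt = "We stopped writing unit tests because the deadline pressure was too high."
print(CODING_PROMPT.format(excerpt=excerpt))
# The resulting candidate codes would feed into a coding guide that human
# researchers refine and then reuse for annotation (see Section "LLMs as Annotators").
```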
Example(s)
While published examples of applying LLMs for synthesis in the software engineering domain are still rather scarce, recent work has explored the use of LLMs for qualitative synthesis in other domains and provided reflections on how they can be applied for this purpose in software engineering [1]. Barros et al. [26] conducted a systematic mapping on the use of LLMs for qualitative research. They identified examples in domains like healthcare and social sciences (e.g., [27], [28]) in which LLMs were used to support different methodologies for qualitative analysis, including grounded theory and thematic analysis. Overall, the findings highlight the successful generation of preliminary coding schemes from interview transcripts, later refined by human researchers, along with support for pattern identification. This approach was reported not only to expedite the initial coding process but also to allow researchers to focus more on higher-level analysis and interpretation. However, they emphasize that effective use of LLMs requires structured prompts and careful human oversight. This particular paper suggests using LLMs to support tasks like initial coding and theme identification while conservatively reserving interpretative or creative processes for human analysts.
Similarly, Leça et al. [29] conducted a systematic mapping to investigate how LLMs are used in qualitative analysis and how they can be applied in software engineering research. Consistent with the study by Barros et al. [26], they identified that LLMs are applied primarily in tasks like coding, thematic analysis, and data categorization, providing efficiency by reducing the time, cognitive demands, and resources often required for these processes. Their findings also recognize the need for human oversight to ensure reliability and suggest that, while LLMs can assist in qualitative analysis in software engineering, human expertise remains essential for in-depth interpretation.
Advantages
The use of LLMs for synthesis offers several advantages, addressing classical challenges related to qualitative empirical software engineering research [1], [26], [29]. By enhancing efficiency and scalability, LLMs can significantly reduce the time required for synthesis, alleviating the burden of time-intensive manual work. They can also improve consistency in coding and summarization, minimizing variability in human interpretations by generating structured summaries and coding suggestions with uniform phrasing. In terms of generalizability, LLMs can enable researchers to analyze larger datasets, identifying patterns across broader contexts than traditional qualitative methods typically allow. Additionally, they can help to reduce the influence of human subjectivity. Nevertheless, while LLMs streamline many aspects of qualitative synthesis, careful oversight remains essential to ensure nuanced interpretation and contextual accuracy.
Challenges
Bano et al. [25] found that while LLMs can provide structured summaries and qualitative coding frameworks, they may misinterpret nuanced qualitative data due to their lack of contextual understanding. Additionally, several other challenges have been reported [1], [26], [29]. In particular, LLMs cannot independently assess argument validity, and critical thinking remains a human responsibility in qualitative synthesis. Moreover, LLMs may produce biased results, reinforcing existing prejudices or omitting key perspectives, making human oversight essential to ensure accurate interpretation, mitigate biases, and maintain quality control. Ethical and privacy concerns also arise from the proprietary nature of many LLMs, limiting transparency and control over training data. Finally, reproducibility issues persist due to inconsistencies across inferences, model versions, and prompt variations.
LLMs as Subjects
Description
LLMs can be used as subjects in empirical studies to simulate human behavior and interactions. In this role, LLMs generate responses that mimic human participants, making them particularly useful for studies involving user interactions, collaborative coding environments, or software usability assessments. To achieve this, prompt engineering techniques are widely employed, with a common approach being the use of the Personas Pattern [30], which involves tailoring LLM responses to align with predefined profiles or roles that emulate specific user archetypes. Furthermore, recent sociological studies have emphasized that, to be effectively utilized in this capacity, LLMs—including their agentic versions tailored through prompt engineering—should meet four criteria of algorithmic fidelity [31]; generated responses should be: indistinguishable from human-produced texts; consistent with the attitudes and sociodemographic information of the conditioning context; naturally aligned with the form, tone, and content of the provided context; and reflective of patterns in relationships between ideas, demographics, and behavior observed in comparable human data.
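A minimal sketch of the Personas Pattern is shown below; it assumes the `openai` Python package, an OpenAI-compatible chat API with an API key in the environment, and an illustrative persona description and model name.

```python
# Minimal sketch: conditioning an LLM on a predefined persona (Personas Pattern [30]).
# Model name, temperature, and persona text are illustrative choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

persona = (
    "You are a backend developer with 12 years of experience in large Java codebases. "
    "You are skeptical of new tools and value maintainability over speed."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.7,
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": "How do you feel about AI-assisted code review?"},
    ],
)
print(response.choices[0].message.content)
```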
Example(s)
LLMs can be used as subjects in various types of empirical studies, enabling researchers to simulate human participants in controlled, repeatable scenarios. The broader applicability of LLM-based studies beyond software engineering has been compiled by Xu et al. [32], who examined various uses in social science research. Given the socio-technical nature of software development, many of these approaches are highly transferable to this domain, demonstrating the growing role of LLMs in empirical software engineering research.
LLMs can be applied in survey and interview studies to impersonate developers responding to survey questionnaires or interviews, allowing researchers to test the clarity and effectiveness of survey items or to simulate responses under varying conditions, such as different expertise levels or cultural contexts. For instance, Gerosa et al. [33] explored persona-based interviews and multi-persona focus groups, demonstrating how LLMs can emulate human responses and behaviors while addressing ethical concerns, biases, and methodological challenges. Their study highlighted the potential of LLMs in enhancing research scalability while advocating for a hybrid approach that integrates AI-generated and human-generated data to preserve the validity of findings.
In usability studies, LLMs can simulate end-user feedback, providing insights into potential usability issues and offering suggestions for improvement based on predefined user personas. This aligns with the work of Bano et al. [34], who investigated biases in LLM-generated candidate profiles in software engineering recruitment processes. Their study, which analyzed both textual and visual outputs, revealed biases favoring male, Caucasian candidates, lighter skin tones, and slim physiques, particularly for senior roles. This demonstrates how LLMs can be used to assess usability concerns, such as fairness and inclusivity, in AI-driven decision-making tools within software engineering.
In experimental studies, LLMs can participate in experiments testing collaborative coding practices, such as pair programming or code review scenarios, by simulating developers with distinct coding styles, expertise levels, or attitudes. Similarly, in simulated decision-making studies, LLMs can emulate team members in exercises such as sprint planning or software requirement prioritization, enabling researchers to analyze team dynamics under different configurations.
Advantages
Using LLMs as subjects offers valuable insights while significantly reducing the need to recruit human participants, a process that is often time-consuming and costly [35]. Furthermore, employing LLMs as subjects enables researchers to conduct empirical research under consistent and repeatable conditions, enhancing the reliability and scalability of the studies.
An emerging challenge is the replicability and consistency of studies involving LLMs, as the non-determinism of these models introduces potential issues in this regard. To mitigate these challenges, adjustments can be made to the model’s temperature settings, or the number of personas simulated by LLMs can be increased. The latter strategy, in particular, could be complemented by inter-group analyses to verify that non-determinism does not affect the results, ensuring the consistency of the model’s outputs.
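One possible realization of such an inter-group analysis, assuming a hypothetical `ask` wrapper that returns categorical answers, is sketched below: two independently sampled groups of persona responses are compared, and a similar answer distribution suggests that non-determinism does not drive the results.

```python
# Minimal sketch: comparing the answer distributions of two independently sampled
# groups of persona responses. `ask(persona, question)` is a hypothetical LLM
# wrapper that returns a categorical answer.
from collections import Counter
from scipy.stats import chi2_contingency

def group_answers(ask, personas, question):
    return Counter(ask(p, question) for p in personas)

def compare_groups(counts_a, counts_b):
    labels = sorted(set(counts_a) | set(counts_b))
    table = [[counts_a.get(l, 0) for l in labels],
             [counts_b.get(l, 0) for l in labels]]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value  # a large p-value gives no evidence that the groups differ
```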
Challenges
However, it is important that researchers are aware of LLMs’ inherent biases [36] and limitations [37], [38] when using them as study subjects. One issue is that LLMs tend to misrepresent demographic group perspectives, reflecting more the opinions of out-group members than those of in-group members. Further, LLMs often oversimplify demographic identities, failing to capture the diversity of opinions and experiences within a group. The use of identity-based prompts can reduce identities to fixed and innate characteristics, amplifying perceived differences between groups. To mitigate these issues, encoded identity names can be used instead of explicit labels, the temperature setting can be increased to enhance response diversity, and alternatives to demographic prompts can be employed when the goal is to broaden response coverage [37], [38].
Introduction: LLMs as Tools for Software Engineers
Besides supporting research tasks, LLM-based assistants have become an essential tool for software engineers, supporting them in various tasks such as code generation, code summarization, code completion, and code repair. Researchers have studied how software engineers use LLMs in their workflows, developed new tools that integrate LLMs, and benchmarked LLMs for software engineering tasks. For all these study types, different guidelines may apply. Therefore, it is important to clearly separate and describe them.
Studying LLM Usage in Software Engineering
Description
Empirical studies can also focus on understanding how software engineers use LLMs in their workflows. In these types of studies, researchers attempt to gain a detailed view of the current state of practice in software engineering. Observations and data gathering occur in a real-world context based on information from practitioners. This involves investigating the adoption, usage patterns, and perceived benefits and challenges of LLM-based tools. Surveys, interviews, and observational studies can provide insights into how LLMs are integrated into development processes, how they influence decision-making, and what factors affect their acceptance and effectiveness. Such studies can inform the design of more user-friendly and effective LLM-based tools and uncover new best practices for the LLM-assisted software engineering process.
Example(s)
Observing the use of LLMs in case studies allows researchers to gather direct information in a real-world context. For example, Khojah et al. investigated the use of ChatGPT by professional software engineers in a week-long observational study [39]. Azanza et al. conducted a case study evaluating the impact of introducing LLMs into the onboarding process of new software developers [40].
Surveys can help researchers quickly gain a broader overview of current perceptions of LLM use. For example, Jahic and Sami surveyed 15 software companies regarding their practices on LLMs in software engineering [41].
Advantages
Studying the usage of LLMs in their application context allows researchers to conduct studies with high external validity. As the observations happen directly in a real-world context, researchers may find it easier to generalize the study results to other cases or the whole target population. Researching in a less controlled study environment may also uncover more nuanced information about LLM-assisted SE workflows in general, independent of the specific LLM being evaluated.
Challenges
Due to the studies taking place in real-world environments as opposed to more controlled settings, many additional potential confounding factors are present, which may threaten the internal validity of the study. The usage environment and choice of LLMs in a real-world context can often be extremely diverse. In practice, the process integration of LLMs can range from the use of specific LLMs based on company policy to the unregulated use of any available LLM. Both extremes may influence the use of LLMs by software engineers in different ways, and as such should be addressed differently by the study methodology. Further, in longitudinal case studies, the timing of the study may have a significant impact on its result, as LLMs are quickly being developed and replaced by newer versions.
LLMs for New Software Engineering Tools
Description
LLMs are being integrated into new tools designed to support software engineers in their daily tasks. Such integration is important to tailor the tools to the specific needs of a development team and to enhance their capabilities, as well as to influence their behavior in accordance with company policies. In this regard, the advent of GenAI Agents has enabled the development of a standardized architecture, where the LLM is guided by a reasoning component (related to prompt engineering), tools (understood as interfacing with external systems via APIs or databases), and a user communication interface that is not necessarily limited to a chat.
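To make this architecture concrete, the sketch below wires the three components together under stated assumptions: `call_llm` is a hypothetical wrapper around any chat-completion API, and the single tool is a stubbed issue tracker.

```python
# Minimal sketch of the agent architecture described above: a reasoning prompt,
# a registry of tools, and a simple (non-chat) invocation interface.
# `call_llm(system_prompt, user_message)` is a hypothetical LLM wrapper.

def lookup_issue(issue_id):
    """Example tool: interface to an external system (here, a stubbed issue tracker)."""
    return f"Issue {issue_id}: flaky integration test in the payment module."

TOOLS = {"lookup_issue": lookup_issue}

REASONING_PROMPT = (
    "You are a code understanding assistant. Think step by step. "
    "If you need project data, answer exactly: CALL <tool_name> <argument>."
)

def run_agent(call_llm, user_request):
    answer = call_llm(REASONING_PROMPT, user_request)
    if answer.startswith("CALL "):
        _, tool_name, argument = answer.split(maxsplit=2)
        tool_output = TOOLS[tool_name](argument)
        answer = call_llm(REASONING_PROMPT, f"{user_request}\nTool result: {tool_output}")
    return answer
```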
Example(s)
Researchers can propose tools to facilitate technology transfer from the research environment to industry. For instance, the work of Richards and Wessel introduced a preliminary GenAI Agent designed to assist developers in understanding source code by leveraging a conceptual theory (the reasoning component) based on the theory of mind [42]. Similarly, Yan et al. proposed IVIE, a tool integrated into the VS Code graphical interface that generates and explains code using LLMs [43].
Empirical studies can evaluate the effectiveness of these tools in improving productivity, code quality, and developer satisfaction. For example, Choudhuri et al. conducted an experiment with students in which they measured the impact of ChatGPT on the correctness and time taken to solve programming tasks [44].
Advantages
By assessing the impact of LLM-powered tools, researchers can identify best practices and areas for further improvement. Moreover, by developing tools based on well-defined agent architectures, researchers can facilitate technology transfer, bridging the gap between industry and academia more effectively. In addition, compared to the past, the accessibility of open-source models and their advanced capabilities make building such tools a potentially easier task, especially considering the ease of guiding them through prompt engineering techniques.
Challenges
In this context, the non-determinism of generative AI models—although potentially mitigated through prompt engineering—poses a significant challenge in both the development and evaluation of tools integrating GenAI. Additionally, while open-source models are accessible, the most performant ones require substantial hardware resources that are not yet widely available. Resorting to cloud-based APIs using non-open-source models or relying on third-party providers for hosting, while seemingly a solution, introduces new concerns related to data privacy and security.
Benchmarking LLMs for Software Engineering Tasks
Description
Benchmarking is the process of evaluating LLM output obtained from standardized datasets using a set of standardized metrics. High-quality reference datasets, such as HumanEval [45] for the task of code generation, are necessary to perform evaluation across studies. LLM output is compared against a ground truth from the dataset in the benchmark using general metrics for text generation, such as ROUGE, BLEU, or METEOR [46], as well as task-specific metrics, such as Pass@k for code generation.
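As an example of a task-specific metric, the unbiased Pass@k estimator introduced with HumanEval [45] can be computed from the number of samples n and the number of correct samples c per problem; the sketch below follows that formulation with illustrative counts.

```python
# Minimal sketch: the unbiased Pass@k estimator used with HumanEval [45].
# For each problem, n samples are generated and c of them pass all tests.
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples (out of n, with c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average over problems; the (n, c) pairs below are illustrative.
results = [(10, 3), (10, 0), (10, 7)]
print(sum(pass_at_k(n, c, k=1) for n, c in results) / len(results))
```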
Example(s)
In software engineering, benchmarking may include the evaluation of LLMs’ ability to produce accurate and robust outputs for input data obtained from curated real-world projects or from synthetic SE-specific datasets. Typical tasks include code generation, code summarization, code completion, and code repair, but also natural language processing tasks, e.g., anaphora resolution, that are relevant for subfields such as Requirements Engineering. RepairBench [47], for example, contains 574 buggy Java methods and their corresponding fixed versions, which can be used to evaluate the performance of LLMs in code repair tasks. The metrics used are Plausible@1 (i.e., the probability that the first generated patch passes all test cases) and AST Match@1 (i.e., the probability that the Abstract Syntax Tree of the first generated patch matches the one of the ground truth patch). SWE-Bench [48] is a more generic benchmark that contains 2,294 SE Python tasks extracted from GitHub pull requests. For scoring the performance of the LLMs on the tasks, the authors report whether the generated patch is applicable or not (i.e., whether it fails compilation) and, for successful patches, the percentage of test cases passed.
Advantages
Properly built benchmarks provide objective evaluation across different tasks, enabling fair comparison of different models (and versions). Moreover, benchmarks built for specific SE tasks can help identify LLM weaknesses and support their optimization/fine-tuning for such tasks. Benchmarks built using real-world data can also help legitimize research results for practitioners, supporting industry-academia collaboration. Finally, benchmarks can foster open science practices by providing a common ground for sharing data (e.g., as part of the benchmark itself) and results (e.g., of models run against a benchmark).
Challenges
Benchmark contamination [49] has recently been identified as an issue. The careful selection of samples and building of corresponding input prompts is particularly important, as correlations between prompts may bias benchmark results [50]. Recently, Cao et al. [51] have proposed guidelines for building benchmarks for LLMs related to coding tasks, grounded in a systematic survey of existing benchmarks. In this process, they highlight current shortcomings related to reliability, transparency, reproducibility, data quality, and validation measures.
References
[1] M. Bano, R. Hoda, D. Zowghi, and C. Treude, “Large language models for qualitative research in software engineering: Exploring opportunities and challenges,” Autom. Softw. Eng., vol. 31, no. 1, p. 8, 2024, doi: 10.1007/S10515-023-00407-8.
[2] J. Huang et al., “Enhancing review classification via LLM-based data annotation and multi-perspective feature representation learning,” SSRN Electronic Journal, pp. 1–15, 2024, doi: 10.2139/ssrn.5002351.
[3] S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng, “Want to reduce labeling cost? GPT-3 can help,” in Findings of the association for computational linguistics: EMNLP 2021, virtual event / punta cana, dominican republic, 16-20 november, 2021, M.-F. Moens, X. Huang, L. Specia, and S. W. Yih, Eds., Association for Computational Linguistics, 2021, pp. 4195–4205. doi: 10.18653/V1/2021.FINDINGS-EMNLP.354.
[4] Z. He, C.-Y. Huang, C.-K. C. Ding, S. Rohatgi, and T.-H. K. Huang, “If in a crowdsourced data annotation pipeline, a GPT-4,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, honolulu, HI, USA, may 11-16, 2024, F. ’Floyd’Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 1040:1–1040:25. doi: 10.1145/3613904.3642834.
[5] F. Gilardi, M. Alizadeh, and M. Kubli, “ChatGPT outperforms crowd-workers for text-annotation tasks,” CoRR, vol. abs/2303.15056, 2023, doi: 10.48550/ARXIV.2303.15056.
[6] Y. Zhu, P. Zhang, E. ul Haq, P. Hui, and G. Tyson, “Can ChatGPT reproduce human-generated labels? A study of social computing tasks,” CoRR, vol. abs/2304.10145, 2023, doi: 10.48550/ARXIV.2304.10145.
[7] F. Huang, H. Kwak, and J. An, “Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech,” in Companion proceedings of the ACM web conference 2023, WWW 2023, austin, TX, USA, 30 april 2023 - 4 may 2023, Y. Ding, J. Tang, J. F. Sequeda, L. Aroyo, C. Castillo, and G.-J. Houben, Eds., ACM, 2023, pp. 294–297. doi: 10.1145/3543873.3587368.
[8] M. Wan et al., “TnT-LLM: Text mining at scale with large language models,” in Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, KDD 2024, barcelona, spain, august 25-29, 2024, R. Baeza-Yates and F. Bonchi, Eds., ACM, 2024, pp. 5836–5847. doi: 10.1145/3637528.3671647.
[9] X. Wang, H. Kim, S. Rahman, K. Mitra, and Z. Miao, “Human-LLM collaborative annotation through effective verification of LLM labels,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, honolulu, HI, USA, may 11-16, 2024, F. ’Floyd’Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 303:1–303:21. doi: 10.1145/3613904.3641960.
[10] M. V. Reiss, “Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark,” CoRR, vol. abs/2304.11085, 2023, doi: 10.48550/ARXIV.2304.11085.
[11] N. Pangakis, S. Wolken, and N. Fasching, “Automated annotation with generative AI requires validation,” CoRR, vol. abs/2306.00176, 2023, doi: 10.48550/ARXIV.2306.00176.
[12] S. Ezzini, S. Abualhaija, C. Arora, and M. Sabetzadeh, “Automated handling of anaphoric ambiguity in requirements: A multi-solution study,” in 44th IEEE/ACM 44th international conference on software engineering, ICSE 2022, pittsburgh, PA, USA, may 25-27, 2022, ACM, 2022, pp. 187–199. doi: 10.1145/3510003.3510157.
[13] S. Lubos et al., “Leveraging LLMs for the quality assurance of software requirements,” in 32nd IEEE international requirements engineering conference, RE 2024, reykjavik, iceland, june 24-28, 2024, G. Liebel, I. Hadar, and P. Spoletini, Eds., IEEE, 2024, pp. 389–397. doi: 10.1109/RE59067.2024.00046.
[14] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, Apr. 1960, doi: 10.1177/001316446002000104.
[15] A. Elangovan et al., “Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge,” CoRR, vol. abs/2410.03775, 2024, doi: 10.48550/ARXIV.2410.03775.
[16] T. Ahmed, P. T. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” CoRR, vol. abs/2408.05534, 2024, doi: 10.48550/ARXIV.2408.05534.
[17] K. Schroeder and Z. Wood-Doughty, “Can you trust LLM judgments? Reliability of LLM-as-a-judge,” CoRR, vol. abs/2412.12509, 2024, doi: 10.48550/ARXIV.2412.12509.
[18] P. Pezeshkpour and E. Hruschka, “Large language models sensitivity to the order of options in multiple-choice questions,” in Findings of the association for computational linguistics: NAACL 2024, mexico city, mexico, june 16-21, 2024, K. Duh, H. Gómez-Adorno, and S. Bethard, Eds., Association for Computational Linguistics, 2024, pp. 2006–2017. doi: 10.18653/V1/2024.FINDINGS-NAACL.130.
[19] A. Panickssery, S. Bowman, and S. Feng, “LLM evaluators recognize and favor their own generations,” in Advances in neural information processing systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., Curran Associates, Inc., 2024, pp. 68772–68802. Available: https://proceedings.neurips.cc/paper_files/paper/2024/file/7f1f0218e45f5414c79c0679633e47bc-Paper-Conference.pdf
[20] I. O. Gallegos et al., “Bias and fairness in large language models: A survey,” Computational Linguistics, vol. 50, pp. 1097–1179, 2024, doi: 10.1162/coli_a_00524.
[21] A. Bavaresco et al., “LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks,” CoRR, vol. abs/2406.18403, 2024, doi: 10.48550/ARXIV.2406.18403.
[22] R. Zhou, L. Chen, and K. Yu, “Is LLM a reliable reviewer? A comprehensive evaluation of LLM on automatic paper reviewing tasks,” in Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation, LREC/COLING 2024, 20-25 may, 2024, torino, italy, N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds., ELRA; ICCL, 2024, pp. 9340–9351. Available: https://aclanthology.org/2024.lrec-main.816
[23] Q. Pan et al., “Human-Centered Design Recommendations for LLM-as-a-Judge,” in ACL 2024 Workshop HuCLLM, arXiv, Jul. 2024. doi: 10.48550/arXiv.2407.03479.
[24] C. Byun, P. Vasicek, and K. D. Seppi, “Dispensing with humans in human-computer interaction research,” in Extended abstracts of the 2023 CHI conference on human factors in computing systems, CHI EA 2023, hamburg, germany, april 23-28, 2023, A. Schmidt, K. Väänänen, T. Goyal, P. O. Kristensson, and A. Peters, Eds., ACM, 2023, pp. 413:1–413:26. doi: 10.1145/3544549.3582749.
[25] M. Bano, D. Zowghi, and J. Whittle, “Exploring qualitative research using LLMs.” 2023. Available: https://arxiv.org/abs/2306.13298
[26] C. F. Barros et al., “Large language model for qualitative research – a systematic mapping study.” 2024. Available: https://arxiv.org/abs/2411.14473
[27] S. De Paoli, “Performing an inductive thematic analysis of semi-structured interviews with a large language model: An exploration and provocation on the limits of the approach,” Social Science Computer Review, vol. 42, no. 4, pp. 997–1019, 2024.
[28] W. S. Mathis, S. Zhao, N. Pratt, J. Weleff, and S. De Paoli, “Inductive thematic analysis of healthcare qualitative interviews using open-source large language models: How does it compare to traditional methods?” Computer Methods and Programs in Biomedicine, vol. 255, p. 108356, 2024.
[29] M. de Morais Leça, L. Valença, R. Santos, and R. de Souza Santos, “Applications and implications of large language models in qualitative analysis: A new frontier for empirical software engineering.” 2024. Available: https://arxiv.org/abs/2412.06564
[30] A. Kong et al., “Better zero-shot reasoning with role-play prompting,” CoRR, vol. abs/2308.07702, 2023, doi: 10.48550/ARXIV.2308.07702.
[31] L. P. Argyle, E. C. Busby, N. Fulda, J. Gubler, C. M. Rytting, and D. Wingate, “Out of one, many: Using language models to simulate human samples,” CoRR, vol. abs/2209.06899, 2022, doi: 10.48550/ARXIV.2209.06899.
[32] R. Xu et al., “AI for social science and social science of AI: A survey,” Inf. Process. Manag., vol. 61, no. 2, p. 103665, 2024, doi: 10.1016/J.IPM.2024.103665.
[33] M. A. Gerosa, B. Trinkenreich, I. Steinmacher, and A. Sarma, “Can AI serve as a substitute for human subjects in software engineering research?” Autom. Softw. Eng., vol. 31, no. 1, p. 13, 2024, doi: 10.1007/S10515-023-00409-6.
[34] M. Bano, H. Gunatilake, and R. Hoda, “What does a software engineer look like? Exploring societal stereotypes in LLMs.” 2025. Available: https://arxiv.org/abs/2501.03569
[35] K. Madampe, J. Grundy, R. Hoda, and H. O. Obie, “The struggle is real! The agony of recruiting participants for empirical software engineering studies,” in 2024 IEEE symposium on visual languages and human-centric computing (VL/HCC), liverpool, UK, september 2-6, 2024, IEEE, 2024, pp. 417–422. doi: 10.1109/VL/HCC60511.2024.00065.
[36] R. Crowell, “Why AI’s diversity crisis matters, and how to tackle it,” Nature Career Feature, 2023, doi: 10.1038/d41586-023-01689-4.
[37] J. Harding, W. D’Alessandro, N. G. Laskowski, and R. Long, “AI language models cannot replace human research participants,” AI Soc., vol. 39, no. 5, pp. 2603–2605, 2024, doi: 10.1007/S00146-023-01725-X.
[38] A. Wang, J. Morgenstern, and J. P. Dickerson, “Large language models cannot replace human participants because they cannot portray identity groups,” CoRR, vol. abs/2402.01908, 2024, doi: 10.48550/ARXIV.2402.01908.
[39] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond code generation: An observational study of ChatGPT usage in software engineering practice,” Proc. ACM Softw. Eng., vol. 1, no. FSE, pp. 1819–1840, 2024, doi: 10.1145/3660788.
[40] M. Azanza, J. Pereira, A. Irastorza, and A. Galdos, “Can LLMs facilitate onboarding software developers? An ongoing industrial case study,” in 36th international conference on software engineering education and training, CSEE&t 2024, würzburg, germany, july 29 - aug. 1, 2024, IEEE, 2024, pp. 1–6. doi: 10.1109/CSEET62301.2024.10662989.
[41] J. Jahic and A. Sami, “State of practice: LLMs in software engineering and software architecture,” in 21st IEEE international conference on software architecture, ICSA 2024 - companion, hyderabad, india, june 4-8, 2024, IEEE, 2024, pp. 311–318. doi: 10.1109/ICSA-C63560.2024.00059.
[42] J. Richards and M. Wessel, “What you need is what you get: Theory of mind for an LLM-based code understanding assistant,” in IEEE international conference on software maintenance and evolution, ICSME 2024, flagstaff, AZ, USA, october 6-11, 2024, IEEE, 2024, pp. 666–671. doi: 10.1109/ICSME58944.2024.00070.
[43] L. Yan, A. Hwang, Z. Wu, and A. Head, “Ivie: Lightweight anchored explanations of just-generated code,” in Proceedings of the CHI conference on human factors in computing systems, CHI 2024, honolulu, HI, USA, may 11-16, 2024, F. ’Floyd’Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 140:1–140:15. doi: 10.1145/3613904.3642239.
[44] R. Choudhuri, D. Liu, I. Steinmacher, M. A. Gerosa, and A. Sarma, “How far are we? The triumphs and trials of generative AI in learning software engineering,” in Proceedings of the 46th IEEE/ACM international conference on software engineering, ICSE 2024, lisbon, portugal, april 14-20, 2024, ACM, 2024, pp. 184:1–184:13. doi: 10.1145/3597503.3639201.
[45] M. Chen et al., “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021, Available: https://arxiv.org/abs/2107.03374
[46] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.
[47] A. Silva and M. Monperrus, “RepairBench: Leaderboard of frontier models for program repair,” arXiv preprint arXiv:2409.18952, 2024.
[48] C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world github issues?” in The twelfth international conference on learning representations, ICLR 2024, vienna, austria, may 7-11, 2024, OpenReview.net, 2024. Available: https://openreview.net/forum?id=VTF8yNQM66
[49] S. Ahuja, V. Gumma, and S. Sitaram, “Contamination report for multilingual benchmarks,” CoRR, vol. abs/2410.16186, 2024, doi: 10.48550/ARXIV.2410.16186.
[50] C. Siska, K. Marazopoulou, M. Ailem, and J. Bono, “Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks,” in Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), ACL 2024, bangkok, thailand, august 11-16, 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds., Association for Computational Linguistics, 2024, pp. 10406–10421. doi: 10.18653/V1/2024.ACL-LONG.560.
[51] J. Cao et al., “How should I build a benchmark?” arXiv preprint arXiv:2501.10711, 2025.