Study Types

The development of empirical guidelines for software engineering (SE) studies involving LLMs is crucial to ensuring the validity and reproducibility of results. However, such guidelines must be tailored to different study types that each pose unique challenges. Therefore, we developed a taxonomy of study types that we then use to contextualize the recommendations we provide.

Each study type section starts with a description, followed by examples from the SE research community and beyond, as well as the advantages and challenges of using LLMs for the respective study type. The remainder of this section is structured as follows:

  1. LLMs as Tools for Software Engineering Researchers
    1. LLMs as Annotators
    2. LLMs as Judges
    3. LLMs for Synthesis
    4. LLMs as Subjects
  2. LLMs as Tools for Software Engineers
    1. Studying LLM Usage in Software Engineering
    2. LLMs for New Software Engineering Tools
    3. Benchmarking LLMs for Software Engineering Tasks
  3. References

Introduction: LLMs as Tools for Software Engineering Researchers

LLMs can serve as powerful tools to help researchers conduct empirical studies. They can automate various tasks such as data collection, pre-processing, and analysis. For example, LLMs can apply predefined coding guides to qualitative datasets (LLMs as Annotators), assess the quality of software artifacts (LLMs as Judges), generate summaries of research papers (LLMs for Synthesis), or simulate human behavior in empirical studies (LLMs as Subjects). This can significantly reduce the time and effort required to conduct a study. However, besides their many advantages, these applications also come with challenges such as potential threats to validity and implications for the reproducibility of study results.

LLMs as Annotators

Description

Just like human annotators, LLMs can label artifacts based on a pre-defined coding guide. However, they can label data much faster than any human could. In qualitative data analysis, manually annotating (“coding”) natural language text, e.g., in software artifacts, open-ended survey responses, or interview transcripts, is a time-consuming process (Bano et al. 2024). LLMs can be used to augment or replace human annotations, provide suggestions for new codes (see Section LLMs for Synthesis), or even automate the entire qualitative data analysis process.

Example(s)

Recent work in software engineering has begun to explore the use of LLMs for annotation tasks. J. Huang et al. (2024) proposed an approach that utilizes multiple LLMs for joint annotation of mobile application reviews. They used three models of comparable size with an absolute majority voting rule (i.e., a label is only accepted if it receives more than half of the total votes from the models). Accordingly, the annotations fell into three categories: exact matches (where all models agreed), partial matches (where a majority agreed), and non-matches (where no majority was reached). The study by Ahmed et al. (2025) examined LLMs as annotators in software engineering research across five datasets, six LLMs, and ten annotation tasks. They found that model-model agreement strongly correlates with human-model agreement, suggesting situations in which LLMs could effectively replace human annotators. Their research showed that for tasks where humans themselves disagreed significantly, models also performed poorly. Conversely, if multiple LLMs reach similar solutions independently, then LLMs are likely suitable for the annotation task. They proposed using model confidence scores to identify specific samples that could be safely delegated to LLMs, potentially reducing human annotation effort without compromising inter-rater agreement.
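
To make the absolute majority voting rule concrete, the following minimal sketch (with hypothetical model names and labels) accepts a label only if more than half of the annotating models agree on it and records whether the match was exact, partial, or absent:

```python
from collections import Counter

def aggregate_majority(labels_per_model: dict) -> tuple:
    """Apply an absolute-majority rule to labels from several LLM annotators.

    Returns the accepted label (or None) and the match category:
    'exact' (all models agree), 'partial' (an absolute majority agrees),
    or 'none' (no label receives more than half of the votes).
    """
    votes = Counter(labels_per_model.values())
    label, count = votes.most_common(1)[0]
    n_models = len(labels_per_model)
    if count == n_models:
        return label, "exact"
    if count > n_models / 2:
        return label, "partial"
    return None, "none"

# Hypothetical labels produced by three models for one app review
labels = {"model_a": "bug report", "model_b": "bug report", "model_c": "feature request"}
print(aggregate_majority(labels))  # ('bug report', 'partial')
```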

Advantages

Recent research demonstrates several advantages of using LLMs as annotators, including their cost effectiveness and accuracy. LLM-based annotation can dramatically reduce costs compared to human labeling, with studies showing cost reductions of 50-96% on various natural language tasks (S. Wang et al. 2021). For example, He et al. (2024) found that GPT-4 annotation costs only $122.08 compared to $4,508 for a comparable MTurk pipeline (He et al. 2024). Moreover, the LLM-based approach resulted in a completion time of just 2 days versus several weeks for the crowd-sourced approach. LLMs consistently demonstrate strong performance, with ChatGPT’s accuracy exceeding that of crowd workers by approximately 25% on average (Gilardi, Alizadeh, and Kubli 2023), and achieving impressive results in specific tasks such as sentiment analysis (65% accuracy) (Zhu et al. 2023). LLMs also show remarkably high inter-rater agreement, higher than crowd workers and trained annotators (Gilardi, Alizadeh, and Kubli 2023).

Challenges

Challenges of using LLMs as annotators include reliability issues, human-LLM interaction challenges, biases, errors, and resource considerations. Studies suggest that while LLMs show promise as annotation tools in SE research, their optimal use may be in augmenting rather than replacing human annotators (S. Wang et al. 2021; He et al. 2024). LLMs can negatively affect human judgment when labels are incorrect (F. Huang, Kwak, and An 2023), and their overconfidence requires careful verification (Wan et al. 2024). Moreover, in previous studies, LLMs have shown significant variability in annotation quality depending on the dataset and the annotation task (Pangakis, Wolken, and Fasching 2023). Studies have shown that LLMs are especially unreliable for high-stakes labeling tasks (X. Wang et al. 2024) and that LLMs can have notable performance disparities between label categories (Zhu et al. 2023). Recent empirical evidence indicates that LLM consistency in text annotation often falls below scientific reliability thresholds, with outputs being sensitive to minor prompt variations (Reiss 2023). Context-dependent annotations pose a specific challenge, as LLMs show difficulty in correctly interpreting text segments that require broader contextual understanding (He et al. 2024). Although pooling multiple outputs can improve reliability, this approach consumes additional computational resources and still requires validation against human-annotated data. While generally cost-effective, LLM annotation requires careful management of per-token charges, particularly for longer texts (S. Wang et al. 2021). Furthermore, achieving reliable annotations may require multiple runs of the same input to enable majority voting (Reiss 2023), although the exact cost comparison between LLM-based and human annotation is controversial (He et al. 2024). Finally, research has identified consistent biases in label assignment, including tendencies to overestimate certain labels and misclassify neutral content (Zhu et al. 2023).

LLMs as Judges

Description

LLMs can act as judges or raters to evaluate properties of software artifacts. For example, LLMs can be used to assess code readability, adherence to coding standards, or the quality of code comments. Judgment is distinct from the more qualitative task of assigning a code or label to typically unstructured text (see Section LLMs as Annotators). It is also different from using LLMs for general software engineering tasks, as discussed in Section LLMs as Tools for Software Engineers.

Example(s)

Lubos et al. (2024) leveraged Llama-2 to evaluate the quality of software requirements statements. They prompted the LLM with the text below, where the words in curly brackets reflect the study parameters:

Your task is to evaluate the quality of a software requirement.
Evaluate whether the following requirement is {quality_characteristic}.
{quality_characteristic} means: {quality_characteristic_explanation}
The evaluation result must be: ‘yes’ or ‘no’.
Request: Based on the following description of the project: {project_description}
Evaluate the quality of the following requirement: {requirement}.
Explain your decision and suggest an improved version.

Lubos et al. (2024) evaluated the LLM’s output against human judges to assess how well the LLM matched experts. Agreement can be measured in many ways; this study used Cohen’s kappa (Cohen 1960) and found moderate agreement for simple requirements and poor agreement for more complex requirements. However, human evaluation of machine judgments may not be the same as evaluation of human judgments and remains an open area of research (Elangovan et al. 2024). A crucial decision in studies focusing on LLMs as judges is the number of examples provided to the LLM. This might involve no examples (zero-shot), several examples (few-shot), or many examples (closer to traditional training data). In the example above, Lubos et al. (2024) chose a zero-shot approach, providing no specific guidance besides the provided context; they did not show the LLM what a ‘yes’ or ‘no’ answer might look like. In contrast, F. Wang et al. (2025) provided a rubric to the LLM when using it as a judge, producing scores on an interpretable scale (0 to 4) for the overall quality of acceptance criteria generated from user stories. They combined this with a lower-level reward stage to refine acceptance criteria, which significantly improved correctness, clarity, and alignment with the user stories.
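
As a minimal sketch of how such agreement could be quantified, the following snippet computes Cohen's kappa between hypothetical human and LLM judgments using scikit-learn; the labels and their values are illustrative only:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary quality judgments ('yes'/'no') for ten requirements
human_judgments = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
llm_judgments   = ["yes", "no", "yes", "no",  "no", "yes", "yes", "no", "yes", "yes"]

# Cohen's kappa corrects raw agreement for agreement expected by chance
kappa = cohen_kappa_score(human_judgments, llm_judgments)
print(f"Cohen's kappa: {kappa:.2f}")
```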

Advantages

Depending on the model configuration, LLMs can provide relatively consistent evaluations. They can further help mitigate human biases and the general variability that human judges might introduce. This may lead to more reliable and reproducible results in empirical studies, to the extent that these models can be reproduced or checkpointed. LLMs can be much more efficient and scale more easily than the equivalent human approach. With LLM automation, entire datasets can be assessed, as opposed to subsets, and the assessment can be produced at the selected level of granularity, e.g., binary outputs (‘yes’ or ‘no’) or defined levels (e.g., 0 to 4). However, the main constraint, which varies by model and budget, is the input context size, i.e., the number of tokens one can pass into a model. For example, the upper bound of the context that can be passed to OpenAI’s o1-mini model is 32k tokens.
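
The following sketch illustrates how researchers might check whether a prompt plus the artifact to be judged fits within a model's context window before submitting it. It assumes the tiktoken library is available; the encoding name and the token limit are illustrative and must be adapted to the model actually used:

```python
import tiktoken  # assumption: tokenizer library installed; encodings vary by model

MAX_CONTEXT_TOKENS = 32_000  # illustrative limit; check the documented limit of your model

def fits_in_context(prompt: str, artifact: str) -> bool:
    """Check whether prompt and artifact together stay within the context window."""
    enc = tiktoken.get_encoding("cl100k_base")  # encoding choice depends on the model
    n_tokens = len(enc.encode(prompt)) + len(enc.encode(artifact))
    return n_tokens <= MAX_CONTEXT_TOKENS

print(fits_in_context("Rate the readability of the following code from 0 to 4:",
                      "def add(a, b):\n    return a + b"))
```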

Challenges

When relying on the judgment of LLMs, researchers must build a reliable process for generating judgment labels that considers the non-deterministic nature of LLMs and report the intricacies of that process transparently (Schroeder and Wood-Doughty 2024). For example, the order of options has been shown to affect LLM outputs in multiple-choice settings (Pezeshkpour and Hruschka 2024). In addition to reliability, other quality attributes to consider include the accuracy of the labels. For example, a reliable LLM might be reliably inaccurate and wrong. Evaluating and judging large numbers of items, for example, to perform fault localization on the thousands of bugs that large open-source projects have to deal with, comes with costs in terms of wall-clock time, compute, and environmental impact. Evidence shows that LLMs can behave differently when reviewing their own outputs (Panickssery, Bowman, and Feng 2024). In more human-oriented datasets (such as discussions of pull requests), LLMs may suffer from well-documented biases and issues with fairness (Gallegos et al. 2024). For tasks in which human judges disagree significantly, it is not clear if an LLM judge should reflect the majority opinion or act as an independent judge. The underlying statistical framework of an LLM usually pushes outputs towards the most likely (majority) answer. There is ongoing research on how suitable LLMs are as independent judges. Questions about bias, accuracy, and trust remain (Bavaresco et al. 2024). There is reason for concern about LLMs judging student assignments or doing peer review of scientific papers (R. Zhou, Chen, and Yu 2024). Even beyond questions of technical capacity, ethical questions remain, particularly if there is some implicit expectation that a human is judging the output. Involving a human in the judgment loop, for example, to contextualize the scoring, is one approach (Pan et al. 2024). However, the lack of large-scale ground truth datasets for benchmarking LLM performance in judgment studies hinders progress in this area.
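
One way to probe the order sensitivity mentioned above is to repeat the same judgment with the answer options shuffled and inspect how stable the chosen option is. The sketch below assumes a hypothetical query_llm function that wraps whatever model or API a study uses:

```python
import random
from collections import Counter

def judge_with_shuffled_options(query_llm, question: str, options: list, n_runs: int = 5) -> Counter:
    """Probe order sensitivity: ask the same multiple-choice question several times
    with the options presented in random order and count which option is chosen.

    `query_llm(prompt) -> str` is a placeholder for the study's actual API call.
    """
    choices = Counter()
    for _ in range(n_runs):
        shuffled = random.sample(options, k=len(options))
        prompt = question + "\n" + "\n".join(f"- {o}" for o in shuffled)
        choices[query_llm(prompt)] += 1
    return choices  # a wide spread across runs indicates order sensitivity
```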

LLMs for Synthesis

Description

LLMs can support synthesis tasks in software engineering research by processing and distilling information from qualitative data sources. In this context, synthesis refers to the process of integrating and interpreting information from multiple sources to generate higher-level insights, identify patterns across datasets, and develop conceptual frameworks or theories. Unlike annotation (see Section LLMs as Annotators), which focuses on categorizing or labeling individual data points, synthesis involves connecting and interpreting these annotations to develop a cohesive understanding of the phenomenon being studied. Synthesis tasks can also mean generating synthetic datasets (e.g., source code, bug-fix pairs, requirements, etc.) that are then used in downstream tasks to train, fine-tune, or evaluate existing models or tools. In this case, the synthesis is done primarily using the LLM and its training data; the input is limited to basic instructions and examples.

Example(s)

Published examples of applying LLMs for synthesis in the software engineering domain are still scarce. However, recent work has explored the use of LLMs for qualitative synthesis in other domains, allowing reflection on how LLMs can be applied for this purpose in software engineering (Bano et al. 2024). Barros et al. (2024) conducted a systematic mapping study on the use of LLMs for qualitative research. They identified examples in domains such as healthcare and social sciences (see, e.g., (De Paoli 2024; Mathis et al. 2024)) in which LLMs were used to support qualitative analysis, including grounded theory and thematic analysis. Overall, the findings highlight the successful generation of preliminary coding schemes from interview transcripts, later refined by human researchers, along with support for pattern identification. This approach was reported not only to expedite the initial coding process but also to allow researchers to focus more on higher-level analysis and interpretation. However, Barros et al. (2024) emphasize that effective use of LLMs requires structured prompts and careful human oversight. Similarly, Morais Leça et al. (2024) conducted a systematic mapping study to investigate how LLMs are used in qualitative analysis and how they can be applied in software engineering research. Consistent with the study by Barros et al. (2024), Morais Leça et al. (2024) identified that LLMs are applied primarily in tasks such as coding, thematic analysis, and data categorization, reducing the time, cognitive demands, and resources required for these processes. Finally, the work by El-Hajjami and Salinesi (2025) is an example of using LLMs to create synthetic datasets. They present an approach to generate synthetic requirements, showing that they “can match or surpass human-authored requirements for specific classification tasks” (El-Hajjami and Salinesi 2025).

Advantages

LLMs offer promising support for synthesis in SE research by helping researchers process artifacts such as interview transcripts and survey responses, or by assisting literature reviews. Qualitative research in SE traditionally faces challenges such as limited scalability, inconsistencies in coding, difficulties in generalizing findings from small or context-specific samples, and the influence of the researchers’ subjectivity on data interpretation (Bano et al. 2024). The use of LLMs for synthesis can offer advantages in addressing these challenges (Bano et al. 2024; Barros et al. 2024; Morais Leça et al. 2024). LLMs can reduce manual effort and subjectivity, improve consistency and generalizability, and assist researchers in deriving codes and developing coding guides during the early stages of qualitative data analysis (Byun, Vasicek, and Seppi 2023; Bano et al. 2024). LLMs can enable researchers to analyze larger datasets, identifying patterns across broader contexts than traditional qualitative methods typically allow. In addition, they can help mitigate the effects of human subjectivity. However, while LLMs streamline many aspects of qualitative synthesis, careful oversight remains essential.

Challenges

Although LLMs have the potential to automate synthesis, concerns about overreliance remain, especially due to discrepancies between AI- and human-generated insights in capturing contextual nuances (Bano, Zowghi, and Whittle 2023). Bano, Zowghi, and Whittle (2023) found that while LLMs can provide structured summaries and qualitative coding frameworks, they may misinterpret nuanced qualitative data due to a lack of contextual understanding. Other studies have echoed similar concerns (Bano et al. 2024; Barros et al. 2024; Morais Leça et al. 2024). In particular, LLMs cannot independently assess the validity of arguments. Critical thinking remains a human responsibility in qualitative synthesis. Peters and Chin-Yee (2025) have shown that popular LLMs and LLM-based tools tend to overgeneralize results when summarizing scientific articles, that is, they “produce broader generalizations of scientific results than those in the original.” Compared to human-authored summaries, “LLM summaries were nearly five times more likely to contain broad generalizations” (Peters and Chin-Yee 2025). In addition, LLMs can produce biased results, reinforcing existing prejudices or omitting essential perspectives, making human oversight crucial to ensure accurate interpretation, mitigate biases, and maintain quality control. Moreover, the proprietary nature of many LLMs limits transparency. In particular, it is unknown how the training data might affect the synthesis process. Furthermore, reproducibility issues persist due to the influence of model versions and prompt variations on the synthesis result.

LLMs as Subjects

Description

In empirical studies, data is collected from participants through methods such as surveys, interviews, or controlled experiments. LLMs can serve as subjects in empirical studies by simulating human behavior and interactions (we use the term “subject” since “participant” implies being human (Association 2018)). In this capacity, LLMs generate responses that approximate those of human participants, which makes them particularly valuable for research involving user interactions, collaborative coding environments, and software usability assessments. This approach enables data collection that closely reflects human reactions while avoiding the need for direct human involvement. To achieve this, prompt engineering techniques are widely employed, with a common approach being the use of the Personas Pattern (Kong et al. 2023), which involves tailoring LLM responses to align with predefined profiles or roles that emulate specific user archetypes. Zhao, Habule, and Zhang (2025) outline opportunities and challenges of using LLMs as research subjects in detail (Zhao, Habule, and Zhang 2025). Furthermore, recent sociological studies have emphasized that, to be effectively utilized in this capacity, LLMs—including their agentic versions—should meet four criteria of algorithmic fidelity (Argyle et al. 2022). Generated responses should be: (1) indistinguishable from human-produced texts (e.g., LLM-generated code reviews should be comparable to those from real developers); (2) consistent with the attitudes and sociodemographic information of the conditioning context (e.g., LLMs simulating junior developers should exhibit different confidence levels, vocabulary, and concerns compared to senior engineers); (3) naturally aligned with the form, tone, and content of the provided context (e.g., responses in an agile stand-up meeting simulation should be concise, task-focused, and aligned with sprint objectives rather than long, formal explanations); and (4) reflective of patterns in relationships between ideas, demographics, and behavior observed in comparable human data (e.g., discussions on software architecture decisions should capture trade-offs typically debated by human developers, such as maintainability versus performance, rather than abstract theoretical arguments).
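
The Personas Pattern mentioned above can be implemented by conditioning the model on a predefined profile via the system message before posing study questions. The sketch below is illustrative only: the persona fields, the prompt wording, and the (omitted) call to a chat API are assumptions rather than a prescribed implementation:

```python
def build_persona_messages(persona: dict, question: str) -> list:
    """Construct chat messages that condition an LLM on a predefined persona."""
    system_prompt = (
        f"You are a {persona['role']} with {persona['experience']} of experience, "
        f"working in {persona['context']}. Answer interview questions in the first "
        f"person, consistent with this background."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

# Hypothetical persona for a simulated interview participant
junior_dev = {"role": "junior backend developer", "experience": "1 year",
              "context": "a small start-up"}
messages = build_persona_messages(junior_dev,
                                  "How do you decide when a pull request is ready to merge?")
# Send `messages` to the chat API of the chosen model and record the simulated response.
```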

Example(s)

LLMs can be used as subjects in various types of empirical studies, enabling researchers to simulate human participants. The broader applicability of LLM-based studies beyond software engineering has been surveyed by Xu et al. (2024), who examined various uses in social science research. Given the socio-technical nature of software development, some of these approaches are transferable to empirical software engineering research. For example, LLMs can be applied in survey and interview studies to impersonate developers responding to survey questionnaires or interviews, allowing researchers to test the clarity and effectiveness of survey items or to simulate responses under varying conditions, such as different levels of expertise or cultural contexts. Gerosa et al. (2024), for instance, explored persona-based interviews and multi-persona focus groups, demonstrating how LLMs can emulate human responses and behaviors while addressing ethical concerns, biases, and methodological challenges. Another example is usability studies, in which LLMs can simulate end-user feedback, providing insights into potential usability issues and offering suggestions for improvement based on predefined user personas. This aligns with the work of Bano, Gunatilake, and Hoda (2025), who investigated biases in LLM-generated candidate profiles in SE recruitment processes. Their study, which analyzed both textual and visual inputs, revealed biases favoring male Caucasian candidates, lighter skin tones, and slim physiques, particularly for senior roles.

Advantages

Using LLMs as subjects can reduce the effort of recruiting human participants, a process that is often time-consuming and costly (Madampe et al. 2024), by augmenting existing datasets or, in some cases, completely replacing human participants. In interviews and survey studies, LLMs can simulate diverse respondent profiles, enabling access to underrepresented populations, which can strengthen the generalizability of findings. In data mining studies, LLMs can generate synthetic data (see LLMs for Synthesis) to fill gaps where real-world data is unavailable or underrepresented.

Challenges

It is important that researchers are aware of the inherent biases (Crowell 2023) and limitations (Harding et al. 2024; A. Wang, Morgenstern, and Dickerson 2024) when using LLMs as study subjects. Schröder et al. (2025), who have studied discrepancies between LLM and human responses, even conclude that “LLMs do not simulate human psychology and recommend that psychological researchers should treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application.” One critical concern is construct validity. LLMs have been shown to misrepresent demographic group perspectives, failing to capture the diversity of opinions and experiences within a group. The use of identity-based prompts can reduce identities to fixed and innate characteristics, amplifying perceived differences between groups. These biases introduce the risk that studies which rely on LLM-generated responses may inadvertently reinforce stereotypes or misrepresent real-world social dynamics. Alternatives to demographic prompts can be employed when the goal is to broaden response coverage (A. Wang, Morgenstern, and Dickerson 2024). Beyond construct validity, internal validity must also be considered, particularly with regard to causal conclusions based on studies relying on LLM-simulated responses. Finally, external validity remains a challenge, as findings based on LLMs may not generalize to humans.

Introduction: LLMs as Tools for Software Engineers

LLM-based assistants have become an essential tool for software engineers, supporting them in various tasks such as code generation and debugging. Researchers have studied how software engineers use LLMs (Studying LLM Usage in Software Engineering), developed new tools that integrate LLMs (LLMs for New Software Engineering Tools), and benchmarked LLMs for software engineering tasks (Benchmarking LLMs for Software Engineering Tasks).

Studying LLM Usage in Software Engineering

Description

Studying how software engineers use LLMs and LLM-based tools is crucial to understand the current state of practice in SE. Researchers can observe software engineers’ usage of LLM-based tools in the field, or study if and how they adopt such tools, their usage patterns, as well as perceived benefits and challenges. Surveys, interviews, observational studies, or analysis of usage logs can provide insights into how LLMs are integrated into development processes, how they influence decision making, and what factors affect their acceptance and effectiveness. Such studies can inform improvements for existing LLM-based tools, motivate the design of novel tools, or derive best practices for LLM-assisted software engineering. They can also uncover risks or deficiencies of existing tools.

Example(s)

Based on a convergent mixed-methods study, Russo (2024) has found that early adoption of generative AI by software engineers is primarily driven by compatibility with existing workflows (Russo 2024). Khojah et al. (2024) investigated the use of ChatGPT (GPT-3.5) by professional software engineers in a week-long observational study (Khojah et al. 2024). They found that most developers do not use the code generated by ChatGPT directly but instead use the output as a guide to implement their own solutions. Azanza et al. (2024) conducted a case study that evaluated the impact of introducing an LLM (GPT-3) on the onboarding process of new software developers (Azanza et al. 2024). Their study identified potential in the use of LLMs for onboarding, as it can allow newcomers to seek information on their own, without the need to “bother” senior colleagues. Jahic and Sami (2024) surveyed participants from 15 software companies regarding their practices with LLMs in software engineering (Jahic and Sami 2024). They found that the majority of study participants had already adopted AI for software engineering tasks; most of them used ChatGPT. Multiple participants cited copyright and privacy issues, as well as inconsistent or low-quality outputs, as barriers to adoption. Retrospective studies that analyze data generated while developers use LLMs can provide additional insights into human-LLM interactions. For example, researchers can employ data mining methods to build large-scale conversation datasets, such as the DevGPT dataset introduced by Xiao et al. (2024). Conversations can then be analyzed using quantitative (Rabbi et al. 2024) and qualitative (Mohamed, Parvin, and Parra 2024) analysis methods.

Advantages

Studying the real-world usage of LLM-based tools allows researchers to understand the state of practice and guide future research directions. In field studies, researchers can uncover usage patterns, adoption rates, and the influence of contextual factors on usage behavior. Outside of a controlled study environment, researchers can uncover contextual information about LLM-assisted SE workflows beyond the specific LLMs being evaluated. This may, for example, help researchers generate hypotheses about how LLMs impact developer productivity, collaboration, and decision-making processes. In controlled laboratory studies, researchers can study specific phenomena related to LLM usage under carefully regulated, but potentially artificial conditions. Specifically, they can isolate individual tasks in the software engineering workflow and investigate how LLM-based tools may support task completion. Furthermore, controlled experiments allow for direct comparisons between different LLM-based tools. The results can then be used to validate general hypotheses about human-LLM interactions in SE.

Challenges

When conducting field studies in real-world environments, researchers have to ensure that their study results are “dependable” (Sullivan and Sargeant 2011) beyond the traditional validity criteria such as internal or construct validity. The usage environment in a real-world context can often be extremely diverse. The integration of LLMs can range from specific LLMs based on company policies to the unregulated use of any available LLM. Both extremes may influence the adoption of LLMs by software engineers, and hence need to be addressed in the study methodology. In addition, in longitudinal case studies, the timing of the study may have a significant impact on its results, as LLMs and LLM-based tools are rapidly evolving. Moreover, developers are still learning how to best make use of the new technology, and best practices are still being developed and established. Furthermore, the predominance of proprietary commercial LLM-based tools in the market poses a significant barrier to research. Limited access to telemetry data or other usage metrics restricts the ability of researchers to conduct comprehensive analyses of real-world tool usage. To make study results reliable, researchers must establish that their results are rigorous, for example, by using triangulation to understand and minimize potential biases (Sullivan and Sargeant 2011). When it comes to studying LLM usage in controlled laboratory studies, researchers may struggle with the inherent variability of LLM outputs. Since reproducibility and precision are essential for hypothesis testing, the stochastic nature of LLM responses—where identical prompts may yield different outputs across participants—can complicate the interpretation of experimental results and affect the reliability of study findings. Finally, agentic software development tools such as Claude Code have configurable degrees of human involvement and can run almost any terminal command. This autonomy, together with the aforementioned non-determinism, makes it hard to reproducibly study agentic tools in the field or in laboratory environments.
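
To make this output variability visible in a laboratory study, researchers can issue the same prompt repeatedly and report a simple instability measure alongside their results. The sketch below assumes a hypothetical query_llm wrapper around the tool or API under study; in practice, responses may first need to be normalized (e.g., lowercased or parsed) before comparison:

```python
from collections import Counter

def response_variability(query_llm, prompt: str, n_runs: int = 10) -> float:
    """Send the same prompt repeatedly and report the share of runs that deviate
    from the most frequent response. `query_llm(prompt) -> str` is a placeholder
    for the tool or API being studied.
    """
    responses = [query_llm(prompt) for _ in range(n_runs)]
    _, most_common_count = Counter(responses).most_common(1)[0]
    return 1 - most_common_count / n_runs  # 0.0 means perfectly stable output
```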

LLMs for New Software Engineering Tools

Description

LLMs are being integrated into new tools that support software engineers in their daily tasks, e.g., to assist in code comprehension (Yan et al. 2024) and test case generation (Schäfer et al. 2024). One way of integrating LLM-based tools into software engineers’ workflows is through GenAI agents. Unlike traditional LLM-based tools, these agents are capable of acting autonomously and proactively, are often tailored to meet specific user needs, and can interact with external environments (Takerngsaksiri et al. 2024; Wiesinger, Marlow, and Vuskovic 2025). From an architectural perspective, GenAI agents can be implemented in various ways (Wiesinger, Marlow, and Vuskovic 2025). However, they generally share three key components: (1) a reasoning mechanism that guides the LLM (often enabled by advanced prompt engineering), (2) a set of tools to interact with external systems (e.g., APIs or databases), and (3) a user communication interface that extends beyond traditional chat-based interactions (Richards and Wessel 2024; Sumers et al. 2024; W. Zhou et al. 2023). Researchers can also test and compare different tool architectures to increase artifact quality and developer satisfaction.
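
A minimal sketch of the agent loop implied by these three components is shown below; llm_decide, the tool registry, and the action format are placeholders rather than the API of any specific agent framework:

```python
def run_agent(llm_decide, tools: dict, task: str, max_steps: int = 5) -> str:
    """Minimal agent loop: reasoning step, tool use, and feedback of observations."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Reasoning mechanism: the LLM chooses the next action based on the history,
        # e.g. {"tool": "run_tests", "args": {...}} or {"final_answer": "..."}.
        action = llm_decide(history)
        if action.get("final_answer"):
            return action["final_answer"]      # user-facing result
        tool = tools[action["tool"]]           # tool use: interact with the environment
        observation = tool(**action.get("args", {}))
        history.append(f"Observation from {action['tool']}: {observation}")
    return "Stopped without a final answer."
```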

Example(s)

Yan et al. (2024) proposed IVIE, a tool integrated into the VS Code graphical interface that generates and explains code using LLMs (Yan et al. 2024). The authors focused primarily on presentation, providing a user-friendly interface for interacting with the LLM. Schäfer et al. (2024) presented a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation. They presented TestPilot, a tool that implements an approach in which the LLM is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from the documentation. Richards and Wessel (2024) introduced a preliminary GenAI agent designed to assist developers in understanding source code by incorporating a reasoning component grounded in the theory of mind (Richards and Wessel 2024).
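
As an illustration of this prompting approach, the following sketch assembles a prompt from a function's signature, implementation, and documented usage examples; the wording is hypothetical and not TestPilot's actual template:

```python
def build_test_generation_prompt(signature: str, implementation: str, doc_examples: list) -> str:
    """Assemble a test-generation prompt from the artifacts described above.

    The prompt structure mirrors the general idea (signature + implementation +
    documentation examples); the exact wording is illustrative only.
    """
    examples = "\n".join(doc_examples) if doc_examples else "(no documented examples)"
    return (
        "Write unit tests for the following function.\n\n"
        f"Signature:\n{signature}\n\n"
        f"Implementation:\n{implementation}\n\n"
        f"Usage examples from the documentation:\n{examples}\n"
    )
```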

Advantages

From an engineering perspective, developing LLM-based tools is easier than implementing many traditional SE approaches such as static analysis or symbolic execution. Depending on the capabilities of the underlying model, it is also easier to build tools that are independent of a specific programming language. This enables researchers to build tools for a more diverse set of tasks. In addition, it allows them to test their tools in a wider range of contexts.

Challenges

Traditional approaches, such as static analysis, are deterministic. LLMs are not. Although the non-determinism of LLMs can be mitigated using configuration parameters and prompting strategies, it remains a major challenge: minor changes in the input can lead to major differences in performance, which makes it hard for researchers to evaluate the effectiveness of a tool. Since the exact training data is often not published by model vendors, a reliable assessment of tool performance on unknown data is difficult. From an engineering perspective, while open models are available, the most capable ones require substantial hardware resources. Using cloud-based APIs or relying on third-party providers for hosting, while seemingly a potential solution, introduces new concerns related to data privacy and security.
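
A minimal sketch of such configuration parameters is shown below. The parameter names follow common chat-completion APIs and may differ between providers and model versions, and even these settings reduce rather than eliminate variability:

```python
# Illustrative generation settings used when evaluating an LLM-based tool.
generation_config = {
    "temperature": 0.0,   # greedy-like decoding instead of sampling
    "top_p": 1.0,         # no nucleus-sampling truncation
    "seed": 42,           # best-effort reproducibility where the provider supports it
}

# Even with these settings, results should be reported over repeated runs
# (e.g., median of N runs), since provider-side changes can still alter outputs.
```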

Benchmarking LLMs for Software Engineering Tasks

Description

Benchmarking is the process of evaluating an LLM’s performance using standardized tasks and metrics, which requires high-quality reference datasets. LLM output is compared to a ground truth from the benchmark dataset using general metrics for text generation, such as ROUGE, BLEU, or METEOR (Hou et al. 2024), or task-specific metrics, such as CodeBLEU for code generation. For example, HumanEval (Chen et al. 2021) is so widely used to assess code generation that it has become a de facto standard.
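
As a minimal illustration of comparing LLM output against a ground truth, the following sketch computes a sentence-level BLEU score with NLTK for a hypothetical reference and candidate; real benchmark evaluations typically use corpus-level scores and task-specific metrics such as CodeBLEU:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical ground-truth summary from a benchmark and an LLM-generated candidate
reference = "returns the sum of two integers".split()
candidate = "return the sum of two numbers".split()

# Smoothing avoids zero scores when higher-order n-grams do not overlap
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")
```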

Example(s)

In SE, benchmarking may include the evaluation of an LLM’s ability to produce accurate and reliable outputs for a given input, usually a task description, which may be accompanied by data obtained from curated real-world projects or from synthetic SE-specific datasets. Typical tasks include code generation, code summarization, code completion, and code repair, but also natural language processing tasks, such as anaphora resolution (i.e., the task of identifying the referring expression of a word or phrase occurring earlier in the text). RepairBench (Silva and Monperrus 2024), for example, contains 574 buggy Java methods and their corresponding fixed versions, which can be used to evaluate the performance of LLMs in code repair tasks. This benchmark uses the Plausible@1 metric (i.e., the probability that the first generated patch passes all test cases) and the AST Match@1 metric (i.e., the probability that the abstract syntax tree of the first generated patch matches the ground truth patch). SWE-Bench (Jimenez et al. 2024) is a more generic benchmark that contains 2,294 SE Python tasks extracted from GitHub pull requests. To score the LLM’s performance on the tasks, the benchmark validates whether the generated patch is applicable (i.e., successfully compiles) and calculates the percentage of passed test cases.
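
Metrics such as Plausible@1 follow the same logic as the pass@k estimator introduced with HumanEval (Chen et al. 2021): given n sampled outputs of which c pass, estimate the probability that at least one of k outputs passes. A minimal sketch with hypothetical numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of the probability that at least one of k sampled outputs
    passes, given n samples of which c passed (Chen et al. 2021). Metrics such as
    Plausible@1 correspond to k = 1 under this scheme."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 patches sampled for a bug, 3 of them pass all test cases
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```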

Advantages

Properly built benchmarks provide objective, reproducible evaluation across different tasks, enabling a comparison between different models (and versions). In addition, benchmarks built for specific SE tasks can help identify LLM weaknesses and support their optimization and fine-tuning for such tasks. They can foster open science practices by providing a common ground for sharing data (e.g., as part of the benchmark itself) and results (e.g., of models run against a benchmark). Benchmarks built using real-world data can help legitimize research results for practitioners, supporting industry-academia collaboration.

Challenges

Benchmark contamination, that is, the inclusion of the benchmark in the LLM training data (Ahuja, Gumma, and Sitaram 2024), has recently been identified as a problem. The careful selection of samples and the creation of the corresponding input prompts is particularly important, as correlations between prompts may bias the benchmark results (Siska et al. 2024). Although LLMs might perform well on a specific benchmark such as HumanEval or SWE-bench, they do not necessarily perform well on other benchmarks. Moreover, benchmarks usually do not capture the full complexity and diversity of software engineering work (Chandra 2025). Another issue is that metrics used for benchmarks, such as perplexity or BLEU-N, do not necessarily reflect human judgment. Recently, Cao et al. (2025) have proposed guidelines for the creation of LLM benchmarks related to coding tasks, grounded in a systematic survey of existing benchmarks. In this process, they highlight shortcomings of current benchmarks related to reliability, transparency, reproducibility, data quality, and validation. For more details on benchmarks, see Section Use Suitable Baselines, Benchmarks, and Metrics.

References

Ahmed, Toufique, Premkumar T. Devanbu, Christoph Treude, and Michael Pradel. 2025. “Can LLMs Replace Manual Annotation of Software Engineering Artifacts?” In 22nd IEEE/ACM International Conference on Mining Software Repositories, MSR@ICSE 2025, Ottawa, ON, Canada, April 28-29, 2025, 526–38. IEEE. https://doi.org/10.1109/MSR66628.2025.00086.

Ahuja, Sanchit, Varun Gumma, and Sunayana Sitaram. 2024. “Contamination Report for Multilingual Benchmarks.” CoRR abs/2410.16186. https://doi.org/10.48550/ARXIV.2410.16186.

Argyle, Lisa P., Ethan C. Busby, Nancy Fulda, Joshua Gubler, Christopher Michael Rytting, and David Wingate. 2022. “Out of One, Many: Using Language Models to Simulate Human Samples.” CoRR abs/2209.06899. https://doi.org/10.48550/ARXIV.2209.06899.

Association, American Psychological. 2018. “APA Dictionary of Psychology: Subject.” https://dictionary.apa.org/subject.

Azanza, Maider, Juanan Pereira, Arantza Irastorza, and Aritz Galdos. 2024. “Can LLMs Facilitate Onboarding Software Developers? An Ongoing Industrial Case Study.” In 36th International Conference on Software Engineering Education and Training, CSEE&t 2024, 1–6. IEEE. https://doi.org/10.1109/CSEET62301.2024.10662989.

Bano, Muneera, Hashini Gunatilake, and Rashina Hoda. 2025. “What Does a Software Engineer Look Like? Exploring Societal Stereotypes in LLMs.” https://arxiv.org/abs/2501.03569.

Bano, Muneera, Rashina Hoda, Didar Zowghi, and Christoph Treude. 2024. “Large Language Models for Qualitative Research in Software Engineering: Exploring Opportunities and Challenges.” Autom. Softw. Eng. 31 (1): 8. https://doi.org/10.1007/S10515-023-00407-8.

Bano, Muneera, Didar Zowghi, and Jon Whittle. 2023. “Exploring Qualitative Research Using LLMs.” https://arxiv.org/abs/2306.13298.

Barros, Cauã Ferreira, Bruna Borges Azevedo, Valdemar Vicente Graciano Neto, Mohamad Kassab, Marcos Kalinowski, Hugo Alexandre D. do Nascimento, and Michelle C. G. S. P. Bandeira. 2024. “Large Language Model for Qualitative Research – a Systematic Mapping Study.” https://arxiv.org/abs/2411.14473.

Bavaresco, Anna, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, et al. 2024. “LLMs Instead of Human Judges? A Large Scale Empirical Study Across 20 NLP Evaluation Tasks.” CoRR abs/2406.18403. https://doi.org/10.48550/ARXIV.2406.18403.

Byun, Courtni, Piper Vasicek, and Kevin D. Seppi. 2023. “Dispensing with Humans in Human-Computer Interaction Research.” In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, CHI EA 2023, edited by Albrecht Schmidt, Kaisa Väänänen, Tesh Goyal, Per Ola Kristensson, and Anicia Peters, 413:1–26. ACM. https://doi.org/10.1145/3544549.3582749.

Cao, Jialun, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Chaozheng Wang, et al. 2025. “How Should I Build a Benchmark?” arXiv Preprint arXiv:2501.10711.

Chandra, Satish. 2025. “Benchmarks for AI in Software Engineering (BLOG@CACM).” https://cacm.acm.org/blogcacm/benchmarks-for-ai-in-software-engineering/.

Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. 2021. “Evaluating Large Language Models Trained on Code.” CoRR abs/2107.03374. https://arxiv.org/abs/2107.03374.

Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20 (1): 37–46. https://doi.org/10.1177/001316446002000104.

Crowell, Rachel. 2023. “Why AI’s diversity crisis matters, and how to tackle it.” Nature Career Feature. https://doi.org/10.1038/d41586-023-01689-4.

De Paoli, Stefano. 2024. “Performing an Inductive Thematic Analysis of Semi-Structured Interviews with a Large Language Model: An Exploration and Provocation on the Limits of the Approach.” Social Science Computer Review 42 (4): 997–1019.

Elangovan, Aparna, Jongwoo Ko, Lei Xu, Mahsa Elyasi, Ling Liu, Sravan Bodapati, and Dan Roth. 2024. “Beyond Correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge.” CoRR abs/2410.03775. https://doi.org/10.48550/ARXIV.2410.03775.

El-Hajjami, Abdelkarim, and Camille Salinesi. 2025. “How Good Are Synthetic Requirements? Evaluating LLM-Generated Datasets for AI4RE.” CoRR abs/2506.21138. https://doi.org/10.48550/ARXIV.2506.21138.

Gallegos, Isabel O., Ryan A. Rossi, Joe Barrow, Md. Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen Ahmed. 2024. “Bias and Fairness in Large Language Models: A Survey.” Computational Linguistics 50: 1097–1179. https://doi.org/10.1162/coli_a_00524.

Gerosa, Marco Aurélio, Bianca Trinkenreich, Igor Steinmacher, and Anita Sarma. 2024. “Can AI Serve as a Substitute for Human Subjects in Software Engineering Research?” Autom. Softw. Eng. 31 (1): 13. https://doi.org/10.1007/S10515-023-00409-6.

Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. “ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks.” CoRR abs/2303.15056. https://doi.org/10.48550/ARXIV.2303.15056.

Harding, Jacqueline, William D’Alessandro, N. G. Laskowski, and Robert Long. 2024. “AI Language Models Cannot Replace Human Research Participants.” AI Soc. 39 (5): 2603–5. https://doi.org/10.1007/S00146-023-01725-X.

He, Zeyu, Chieh-Yang Huang, Chien-Kuang Cornelia Ding, Shaurya Rohatgi, and Ting-Hao Kenneth Huang. 2024. “If in a Crowdsourced Data Annotation Pipeline, a GPT-4.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, edited by Florian ’Floyd’ Mueller, Penny Kyburz, Julie R. Williamson, Corina Sas, Max L. Wilson, Phoebe O. Toups Dugas, and Irina Shklovski, 1040:1–25. ACM. https://doi.org/10.1145/3613904.3642834.

Hou, Xinyi, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. “Large Language Models for Software Engineering: A Systematic Literature Review.” ACM Trans. Softw. Eng. Methodol. 33 (8). https://doi.org/10.1145/3695988.

Huang, Fan, Haewoon Kwak, and Jisun An. 2023. “Is ChatGPT Better Than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech.” In Companion Proceedings of the ACM Web Conference 2023, WWW 2023, edited by Ying Ding, Jie Tang, Juan F. Sequeda, Lora Aroyo, Carlos Castillo, and Geert-Jan Houben, 294–97. ACM. https://doi.org/10.1145/3543873.3587368.

Huang, Jiangping, Bochen Yi, Weisong Sun, Bangrui Wan, Yang Xu, Yebo Feng, Wenguang Ye, and Qinjun Qin. 2024. “Enhancing Review Classification via LLM-Based Data Annotation and Multi-Perspective Feature Representation Learning.” SSRN Electronic Journal, 1–15. https://doi.org/10.2139/ssrn.5002351.

Jahic, Jasmin, and Ashkan Sami. 2024. “State of Practice: LLMs in Software Engineering and Software Architecture.” In 21st IEEE International Conference on Software Architecture, ICSA 2024 - Companion, Hyderabad, India, June 4-8, 2024, 311–18. IEEE. https://doi.org/10.1109/ICSA-C63560.2024.00059.

Jimenez, Carlos E., John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. “SWE-Bench: Can Language Models Resolve Real-World Github Issues?” In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=VTF8yNQM66.

Khojah, Ranim, Mazen Mohamad, Philipp Leitner, and Francisco Gomes de Oliveira Neto. 2024. “Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice.” Proc. ACM Softw. Eng. 1 (FSE): 1819–40. https://doi.org/10.1145/3660788.

Kong, Aobo, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, and Xin Zhou. 2023. “Better Zero-Shot Reasoning with Role-Play Prompting.” CoRR abs/2308.07702. https://doi.org/10.48550/ARXIV.2308.07702.

Lubos, Sebastian, Alexander Felfernig, Thi Ngoc Trang Tran, Damian Garber, Merfat El Mansi, Seda Polat Erdeniz, and Viet-Man Le. 2024. “Leveraging LLMs for the Quality Assurance of Software Requirements.” In 32nd IEEE International Requirements Engineering Conference, RE 2024, Reykjavik, Iceland, June 24-28, 2024, edited by Grischa Liebel, Irit Hadar, and Paola Spoletini, 389–97. IEEE. https://doi.org/10.1109/RE59067.2024.00046.

Madampe, Kashumi, John Grundy, Rashina Hoda, and Humphrey O. Obie. 2024. “The Struggle Is Real! The Agony of Recruiting Participants for Empirical Software Engineering Studies.” In 2024 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), Liverpool, UK, September 2-6, 2024, 417–22. IEEE. https://doi.org/10.1109/VL/HCC60511.2024.00065.

Mathis, Walter S, Sophia Zhao, Nicholas Pratt, Jeremy Weleff, and Stefano De Paoli. 2024. “Inductive Thematic Analysis of Healthcare Qualitative Interviews Using Open-Source Large Language Models: How Does It Compare to Traditional Methods?” Computer Methods and Programs in Biomedicine 255: 108356.

Mohamed, Suad, Abdullah Parvin, and Esteban Parra. 2024. “Chatting with AI: Deciphering Developer Conversations with ChatGPT.” In 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024, Lisbon, Portugal, April 15-16, 2024, edited by Diomidis Spinellis, Alberto Bacchelli, and Eleni Constantinou, 187–91. ACM. https://doi.org/10.1145/3643991.3645078.

Morais Leça, Matheus de, Lucas Valença, Reydne Santos, and Ronnie de Souza Santos. 2024. “Applications and Implications of Large Language Models in Qualitative Analysis: A New Frontier for Empirical Software Engineering.” https://arxiv.org/abs/2412.06564.

Pan, Qian, Zahra Ashktorab, Michael Desmond, Martin Santillan Cooper, James Johnson, Rahul Nair, Elizabeth Daly, and Werner Geyer. 2024. “Human-Centered Design Recommendations for LLM-as-a-Judge.” In ACL 2024 Workshop HuCLLM. arXiv. https://doi.org/10.48550/arXiv.2407.03479.

Pangakis, Nicholas, Samuel Wolken, and Neil Fasching. 2023. “Automated Annotation with Generative AI Requires Validation.” CoRR abs/2306.00176. https://doi.org/10.48550/ARXIV.2306.00176.

Panickssery, Arjun, Samuel Bowman, and Shi Feng. 2024. “LLM Evaluators Recognize and Favor Their Own Generations.” In Advances in Neural Information Processing Systems, edited by A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, 37:68772–802. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2024/file/7f1f0218e45f5414c79c0679633e47bc-Paper-Conference.pdf.

Peters, Uwe, and Benjamin Chin-Yee. 2025. “Generalization Bias in Large Language Model Summarization of Scientific Research.” R. Soc. Open Sci. 12: 241776. https://doi.org/10.1098/rsos.241776.

Pezeshkpour, Pouya, and Estevam Hruschka. 2024. “Large Language Models Sensitivity to the Order of Options in Multiple-Choice Questions.” In Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, edited by Kevin Duh, Helena Gómez-Adorno, and Steven Bethard, 2006–17. Association for Computational Linguistics. https://doi.org/10.18653/V1/2024.FINDINGS-NAACL.130.

Rabbi, Md. Fazle, Arifa I. Champa, Minhaz Fahim Zibran, and Md. Rakibul Islam. 2024. “AI Writes, We Analyze: The ChatGPT Python Code Saga.” In 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024, Lisbon, Portugal, April 15-16, 2024, edited by Diomidis Spinellis, Alberto Bacchelli, and Eleni Constantinou, 177–81. ACM. https://doi.org/10.1145/3643991.3645076.

Reiss, Michael V. 2023. “Testing the Reliability of ChatGPT for Text Annotation and Classification: A Cautionary Remark.” CoRR abs/2304.11085. https://doi.org/10.48550/ARXIV.2304.11085.

Richards, Jonan, and Mairieli Wessel. 2024. “What You Need Is What You Get: Theory of Mind for an LLM-Based Code Understanding Assistant.” In IEEE International Conference on Software Maintenance and Evolution, ICSME 2024, 666–71. IEEE. https://doi.org/10.1109/ICSME58944.2024.00070.

Russo, Daniel. 2024. “Navigating the Complexity of Generative AI Adoption in Software Engineering.” ACM Transactions on Software Engineering and Methodology 33 (5): 1–50.

Schäfer, Max, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.” IEEE Trans. Software Eng. 50 (1): 85–105. https://doi.org/10.1109/TSE.2023.3334955.

Schröder, Sarah, Thekla Morgenroth, Ulrike Kuhl, Valerie Vaquet, and Benjamin Paaßen. 2025. “Large Language Models Do Not Simulate Human Psychology.” https://arxiv.org/abs/2508.06950.

Schroeder, Kayla, and Zach Wood-Doughty. 2024. “Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge.” CoRR abs/2412.12509. https://doi.org/10.48550/ARXIV.2412.12509.

Silva, André, and Martin Monperrus. 2024. “RepairBench: Leaderboard of Frontier Models for Program Repair.” arXiv Preprint arXiv:2409.18952.

Siska, Charlotte, Katerina Marazopoulou, Melissa Ailem, and James Bono. 2024. “Examining the Robustness of LLM Evaluation to the Distributional Assumptions of Benchmarks.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 10406–21. Association for Computational Linguistics. https://doi.org/10.18653/V1/2024.ACL-LONG.560.

Sullivan, Gail M, and Joan Sargeant. 2011. “Qualities of Qualitative Research: Part I.” J Grad Med Educ 3 (4): 449–52.

Sumers, Theodore R., Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. 2024. “Cognitive Architectures for Language Agents.” Trans. Mach. Learn. Res. 2024. https://openreview.net/forum?id=1i6ZCvflQJ.

Takerngsaksiri, Wannita, Jirat Pasuksmit, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, and Ming Wu. 2024. “Human-in-the-Loop Software Development Agents.” arXiv Preprint arXiv:2411.12924.

Wan, Mengting, Tara Safavi, Sujay Kumar Jauhar, Yujin Kim, Scott Counts, Jennifer Neville, Siddharth Suri, et al. 2024. “TnT-LLM: Text Mining at Scale with Large Language Models.” In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024, edited by Ricardo Baeza-Yates and Francesco Bonchi, 5836–47. ACM. https://doi.org/10.1145/3637528.3671647.

Wang, Angelina, Jamie Morgenstern, and John P. Dickerson. 2024. “Large Language Models Cannot Replace Human Participants Because They Cannot Portray Identity Groups.” CoRR abs/2402.01908. https://doi.org/10.48550/ARXIV.2402.01908.

Wang, Fanyu, Chetan Arora, Yonghui Liu, Kaicheng Huang, Chakkrit Tantithamthavorn, Aldeida Aleti, Dishan Sambathkumar, and David Lo. 2025. “Multi-Modal Requirements Data-Based Acceptance Criteria Generation Using LLMs.” https://arxiv.org/abs/2508.06888.

Wang, Shuohang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. “Want to Reduce Labeling Cost? GPT-3 Can Help.” In Findings of the ACL: EMNLP 2021, edited by Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, 4195–205. Association for Computational Linguistics. https://doi.org/10.18653/V1/2021.FINDINGS-EMNLP.354.

Wang, Xinru, Hannah Kim, Sajjadur Rahman, Kushan Mitra, and Zhengjie Miao. 2024. “Human-LLM Collaborative Annotation Through Effective Verification of LLM Labels.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, edited by Florian ’Floyd’ Mueller, Penny Kyburz, Julie R. Williamson, Corina Sas, Max L. Wilson, Phoebe O. Toups Dugas, and Irina Shklovski, 303:1–21. ACM. https://doi.org/10.1145/3613904.3641960.

Wiesinger, Julia, Patrick Marlow, and Vladimir Vuskovic. 2025. “Agents,” February. https://gemini.google.com.

Xiao, Tao, Christoph Treude, Hideaki Hata, and Kenichi Matsumoto. 2024. “DevGPT: Studying Developer-ChatGPT Conversations.” In 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024, Lisbon, Portugal, April 15-16, 2024, edited by Diomidis Spinellis, Alberto Bacchelli, and Eleni Constantinou, 227–30. ACM. https://doi.org/10.1145/3643991.3648400.

Xu, Ruoxi, Yingfei Sun, Mengjie Ren, Shiguang Guo, Ruotong Pan, Hongyu Lin, Le Sun, and Xianpei Han. 2024. “AI for Social Science and Social Science of AI: A Survey.” Inf. Process. Manag. 61 (2): 103665. https://doi.org/10.1016/J.IPM.2024.103665.

Yan, Litao, Alyssa Hwang, Zhiyuan Wu, and Andrew Head. 2024. “Ivie: Lightweight Anchored Explanations of Just-Generated Code.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, edited by Florian ’Floyd’ Mueller, Penny Kyburz, Julie R. Williamson, Corina Sas, Max L. Wilson, Phoebe O. Toups Dugas, and Irina Shklovski, 140:1–15. ACM. https://doi.org/10.1145/3613904.3642239.

Zhao, Chenguang, Meirewuti Habule, and Wei Zhang. 2025. “Large Language Models (LLMs) as Research Subjects: Status, Opportunities and Challenges.” New Ideas in Psychology 79: 101167. https://doi.org/10.1016/j.newideapsych.2025.101167.

Zhou, Ruiyang, Lu Chen, and Kai Yu. 2024. “Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks.” In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, edited by Nicoletta Calzolari, Min-Yen Kan, Véronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, 9340–51. ELRA; ICCL. https://aclanthology.org/2024.lrec-main.816.

Zhou, Wangchunshu, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, et al. 2023. “Agents: An Open-Source Framework for Autonomous Language Agents.” CoRR abs/2309.07870. https://doi.org/10.48550/ARXIV.2309.07870.

Zhu, Yiming, Peixian Zhang, Ehsan ul Haq, Pan Hui, and Gareth Tyson. 2023. “Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks.” CoRR abs/2304.10145. https://doi.org/10.48550/ARXIV.2304.10145.