Study Types
This list of study types is currently a DRAFT and based on discussion sessions with researchers at the 2024 International Software Engineering Research Network (ISERN) meeting and at the 2nd Copenhagen Symposium on Human-Centered Software Engineering AI.
The development of empirical guidelines for studies involving large language models (LLMs) in software engineering is crucial for ensuring the validity and reproducibility of results. However, these guidelines must be tailored to different study types, as each poses unique methodological challenges. Therefore, understanding the classification of these studies is essential for developing appropriate guidelines. We envision that a mature set of guidelines provides specific guidance for each of these study types, addressing their individual methodological idiosyncrasies. Moreover, we currently focus on large language models, that is, on natural language processing. In the future, we might extend our focus to multimodal foundation models.
LLMs as Tools for Software Engineering Researchers
LLMs can be leveraged as powerful tools to assist researchers in conducting empirical studies. They can automate various tasks such as data collection, preprocessing, and analysis. For example, LLMs can apply pre-defined coding guides to large qualitative datasets (annotation), assess the quality of software artifacts (rating), generate summaries of research papers (synthesis), and even simulate human behavior in empirical studies (subject). This can significantly reduce the time and effort required by researchers, allowing them to focus on more complex aspects of their studies. However, all these applications also come with limitations, potential threats to validity, and implications for the reproducibility of study results. In our guidelines, the following study types are used to contextualize the recommendations we provide.
LLMs as Annotators
Description: LLMs can serve as annotators by automatically labeling artifacts with corresponding categories for data analysis based on a pre-defined coding guide. In qualitative data analysis, manually annotating or coding text passages, e.g., in software artifacts, open-ended survey responses, or interview transcripts, is often a time-consuming process. LLMs can be used to augment or even replace human annotations, provide suggestions for new codes (see synthesis), or automate the entire annotation process.
Example: For example, in a study analyzing code changes in version control systems, researchers may need to categorize each individual change into predefined categories. For that, they may use LLMs to analyze commit messages and categorize them using labels such as bug fixes, feature additions, or refactorings, based on a short description of each label.
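To make this concrete, the following minimal sketch shows how commit messages could be annotated with predefined categories. It assumes the OpenAI Python client; the model name, label set, and prompt wording are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: annotating commit messages with predefined labels.
# Assumes the OpenAI Python client; model, labels, and prompt are illustrative.
from openai import OpenAI

LABELS = ["bug fix", "feature addition", "refactoring", "other"]  # hypothetical coding guide

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_commit(message: str) -> str:
    """Ask the model to assign exactly one predefined label to a commit message."""
    prompt = (
        "You are annotating commit messages for a research study.\n"
        f"Assign exactly one of these labels: {', '.join(LABELS)}.\n"
        "Reply with the label only.\n\n"
        f"Commit message: {message}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        temperature=0,         # reduces (but does not eliminate) output variability
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Example usage
print(annotate_commit("Fix null pointer exception when parsing empty config files"))
```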
Promises: Recent research demonstrates several key advantages of using LLMs as annotators, which can be organized into the following categories:
- Cost-Effectiveness and Efficiency: LLM-based annotation dramatically reduces costs compared to human labeling, with studies showing cost reductions of 50-96% across various natural language tasks [1]. This efficiency can be further enhanced through lightweight classifiers trained on LLM-generated labels, which achieve comparable or better performance than direct LLM classification while being more manageable at scale [2] (see the sketch after this list).
- Performance and Accuracy Benefits: LLMs consistently demonstrate strong performance, with ChatGPT’s accuracy exceeding crowd workers by approximately 25% on average [3] and achieving particularly impressive results in specific tasks such as sentiment analysis (64.9% accuracy) and counterspeech detection (0.791 precision) [4]. They also show remarkably high intercoder agreement, surpassing both crowd workers and trained annotators [3]. Notably, models trained on LLM-generated labels can sometimes outperform the LLM itself [1].
- Enhanced Functionality: LLMs offer additional benefits beyond basic annotation, including the generation of explanations alongside their annotations to help verify label quality and assist human understanding [5]. Their versatility enables effective hybrid approaches, where combining LLM and human annotations achieves better results than either approach alone [1].
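As a minimal sketch of the distillation idea mentioned in the first promise above (lightweight classifiers trained on LLM-generated labels), the snippet below uses a TF-IDF plus logistic regression pipeline; the pipeline choice and toy data are illustrative assumptions, not the setup used in the cited studies.

```python
# Minimal sketch: distilling LLM-generated labels into a lightweight classifier.
# llm_labeled_texts / llm_labels would come from an LLM annotation step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

llm_labeled_texts = [
    "Fix crash when config file is empty",
    "Add dark mode toggle to settings page",
    "Extract helper methods from God class",
]
llm_labels = ["bug fix", "feature addition", "refactoring"]  # produced by the LLM annotator

# Train a cheap, deterministic classifier on the LLM-generated labels.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(llm_labeled_texts, llm_labels)

# The distilled model can now label large datasets without further LLM calls.
print(classifier.predict(["Fix off-by-one error in pagination"]))
```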
Perils: Several important challenges and limitations must also be considered, which can be grouped into the following categories:
- Performance and Reliability Issues: LLMs show significant variability in annotation quality across different tasks [5], [6], with particular challenges in complex tasks and for events that occurred after the models were trained. They are especially unreliable for high-stakes labeling tasks [5] and demonstrate notable performance disparities across different label categories [4].
- Human-LLM Interaction Challenges: LLMs can negatively impact human judgment when their labels are incorrect [6], and their overconfidence requires careful verification [2]. The quality of LLM-generated explanations significantly impacts human annotators’ performance and satisfaction [5].
- Systematic Biases and Errors: Research has identified consistent biases in label assignment, including tendencies to overestimate certain labels and misclassify neutral content, particularly in stance detection tasks [4]. These biases are especially pronounced for newer topics, for example in sentiment analysis related to the Russo-Ukrainian conflict.
- Resource Considerations: While generally cost-effective, LLM annotation requires careful management of per-token charges, particularly for longer texts [1].
Previous Work in SE: Recent work in software engineering has begun exploring the use of LLMs for annotation tasks. Ahmed et al. [7] conducted a comprehensive study evaluating whether LLMs could replace human annotators for software engineering artifacts. Their findings demonstrated that LLMs performed particularly well on name-value inconsistency detection and semantic similarity tasks, achieving human-comparable inter-rater agreement levels. However, they found LLMs struggled significantly with causality detection and static analysis warning classification, where human-model agreement was much lower than human-human agreement. For code summarization tasks, LLMs showed moderate success but still fell short of human performance. Based on these results, they recommend using LLMs to replace only one human rater rather than the entire annotation team, as model-model agreement strongly correlates with human-model agreement.
Colavito et al. [8] investigated the potential of GPT-like models for automated issue labeling in software repositories. Their study compared different prompting strategies (zero-shot and few-shot) using various versions of GPT-3.5 models with different context lengths. Their results demonstrated that GPT-like models could achieve performance comparable to state-of-the-art BERT-like models without requiring fine-tuning. In particular, GPT-3.5 achieved an F1-micro score of 0.8155 using only zero-shot learning (without any training data), which was close to the 0.8321 F1-micro score achieved by the SETFIT baseline model (Sentence Transformer Fine-Tuning, a framework optimized for fine-tuning transformer models on small datasets), despite SETFIT requiring fine-tuning on labeled training data. Notably, they observed substantial agreement between GPT-3.5 and human annotators (Cohen’s κ > 0.7) across different experimental settings, suggesting that these models could effectively support human annotation efforts in creating gold-standard datasets.
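The agreement and performance figures reported in such studies can be computed with standard metrics. The sketch below (with made-up labels) shows how F1-micro and Cohen's κ between human and LLM labels might be obtained using scikit-learn.

```python
# Minimal sketch: comparing LLM labels against human gold labels.
# The label lists are made up for illustration.
from sklearn.metrics import cohen_kappa_score, f1_score

human_labels = ["bug", "feature", "question", "bug", "feature", "bug"]
llm_labels   = ["bug", "feature", "bug",      "bug", "feature", "feature"]

print("F1-micro:", f1_score(human_labels, llm_labels, average="micro"))
print("Cohen's kappa:", cohen_kappa_score(human_labels, llm_labels))
```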
Huang et al. [9] proposed an approach leveraging multiple LLMs for joint annotation of mobile application reviews. Their Multi-model Joint Annotation Reviews (MJAR) dataset construction method used three 7B-parameter models (Llama3, Gemma, and Mistral) with an absolute majority voting rule (i.e., a label is only accepted if it receives more than half of the total votes from the models). Accordingly, the annotations fell into three categories: exact matches (where all models agreed), partial matches (where a majority agreed), and non-matches (where no majority was reached). The authors evaluated the quality of these annotations by training BERT and RoBERTa classifiers on different training sets. When trained on the MJAR dataset, BERT achieved an F1 score of 78.62% and accuracy of 80.36%. In comparison, BERT’s performance was lower when trained on data annotated by single models: Mistral annotations led to an F1 score of 72.56%, Gemma to 75.84%, and Llama3 to 77.48%. RoBERTa showed similar improvements when trained on the MJAR dataset, as compared to single-model annotations, with a 1.73% increase in F1 score.
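The absolute majority voting rule described above can be sketched as follows; the function is a simplified illustration of the aggregation idea, not the authors' implementation.

```python
# Minimal sketch: absolute majority voting over labels from multiple LLMs.
# A label is accepted only if it receives more than half of all votes.
from collections import Counter

def majority_label(votes: list[str]) -> str | None:
    """Return the label with an absolute majority, or None if there is no majority."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

print(majority_label(["bug report", "bug report", "feature request"]))  # 'bug report'
print(majority_label(["bug report", "feature request", "praise"]))      # None (no majority)
```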
Wang et al. [5] developed a human-LLM collaborative annotation framework with a novel verification component. The verifier is a Random Forest classifier trained on three types of features: characteristics of the input sample (such as text coherence and readability), LLM output features (including logits and probability scores), and features from LLM-generated explanations (including their coherence and sufficiency); the verifier then assigns confidence scores to LLM annotations, allowing the framework to identify which samples need human review. Their empirical evaluation found a moderate positive correlation between model-to-model agreement and human-model agreement (Spearman ρ = 0.65), suggesting that model agreement could be one useful indicator for when to trust LLM annotations. Using this approach, they demonstrated that up to 50-100% of annotations for certain tasks could be delegated to LLMs while maintaining quality comparable to human annotation. The system was particularly effective when LLMs showed high confidence in their predictions, though performance varied by task type.
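A simplified sketch of this verification idea follows, assuming pre-computed feature vectors (e.g., text readability, LLM probability scores, explanation coherence) and human verdicts on a small labeled subset; the feature names, data, and threshold are illustrative assumptions rather than the authors' configuration.

```python
# Minimal sketch: a verifier that scores how trustworthy an LLM annotation is.
# Each row is a feature vector derived from the input text, the LLM output,
# and the LLM-generated explanation; the target says whether the LLM label was correct.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative features: [text_readability, llm_top_prob, explanation_coherence]
X_train = np.array([
    [0.8, 0.95, 0.9],
    [0.4, 0.55, 0.3],
    [0.7, 0.88, 0.8],
    [0.3, 0.60, 0.4],
])
y_train = np.array([1, 0, 1, 0])  # 1 = LLM label matched the human label

verifier = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Confidence score for a new annotation; low scores are routed to human review.
new_annotation_features = np.array([[0.6, 0.70, 0.5]])
confidence = verifier.predict_proba(new_annotation_features)[0, 1]
print("route to human review" if confidence < 0.7 else "accept LLM label")
```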
These studies suggest that while LLMs show promise as annotation tools in SE, their optimal use may be in augmenting rather than replacing human annotators entirely, with careful consideration given to verification mechanisms and confidence thresholds.
LLMs as Raters
Description: In empirical studies, LLMs can act as raters to evaluate the quality or other properties of software artifacts such as code, documentation, and design patterns.
Example: For instance, LLMs can be prompted or fine-tuned to assess code readability, adherence to coding standards, or the quality of comments.
Promises: By providing—depending on the model configuration—consistent and relatively “objective” evaluations, LLMs can help mitigate certain biases and part of the variability that human raters might introduce. This can lead to more reliable and reproducible results in empirical studies.
Perils: However, when relying on the judgment of LLMs, researchers have to make sure to build a reliable process for generating ratings that considers the non-deterministic nature of LLMs and report the intricacies of that process transparently.
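One common way to address this non-determinism is to query the model several times and only accept ratings that are stable across runs. The sketch below illustrates that idea; `rate_artifact` is a hypothetical placeholder standing in for an actual LLM call, and the repetition count and stability threshold are illustrative choices.

```python
# Minimal sketch: repeated rating to handle LLM non-determinism.
# rate_artifact() is a placeholder for an actual LLM call returning a score, e.g., 1-5.
import random
import statistics

def rate_artifact(code_snippet: str) -> int:
    """Placeholder for an LLM readability rating; replace with a real model call."""
    return random.choice([3, 4, 4])  # simulated non-deterministic output

def stable_rating(code_snippet: str, runs: int = 5, max_spread: int = 1) -> int | None:
    """Query the rater several times; accept the median only if ratings are consistent."""
    ratings = [rate_artifact(code_snippet) for _ in range(runs)]
    if max(ratings) - min(ratings) > max_spread:
        return None  # too unstable: flag for human review and report this transparently
    return round(statistics.median(ratings))

print(stable_rating("def add(a, b):\n    return a + b"))
```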
Previous Work in SE: TODO: Examples of such studies in software engineering include…
LLMs for Synthesis
Description: LLMs can be used to synthesize large amounts of qualitative data.
Example: For example, they can summarize or compare papers for literature reviews or support researchers in deriving codes and developing coding guides during the initial phase of qualitative data analysis. Those codes can then be used later to annotate more data (see annotation).
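As a minimal sketch of the code-derivation use case, an LLM could be asked to propose candidate codes for a batch of open-ended survey responses, which researchers then refine into a coding guide. The snippet assumes the OpenAI Python client; the prompt, model choice, and responses are illustrative assumptions.

```python
# Minimal sketch: asking an LLM to propose candidate codes for qualitative analysis.
# Assumes the OpenAI Python client; model name, prompt, and data are illustrative.
from openai import OpenAI

client = OpenAI()

responses = [
    "The AI assistant saves me time on boilerplate, but I double-check its output.",
    "I stopped using it because the suggestions broke our coding conventions.",
]

prompt = (
    "You are supporting qualitative data analysis in a software engineering study.\n"
    "Propose 3-5 short candidate codes (2-4 words each) that capture recurring themes "
    "in the following survey responses. List one code per line.\n\n"
    + "\n".join(f"- {r}" for r in responses)
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)
```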
Promises: TODO
Perils: TODO
Previous Work in SE: TODO: Examples of such studies in software engineering include…
LLMs as Subjects
Description: LLMs can be used as subjects in empirical studies to simulate human behavior and interactions.
Example: For example, researchers can use LLMs to generate responses in user studies, simulate developer interactions in collaborative coding environments, or model user feedback in software usability studies.
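A minimal sketch of this idea, assuming the OpenAI Python client: a persona prompt is used to simulate a developer's survey response. The persona, question, and model choice are illustrative assumptions, not a validated simulation protocol.

```python
# Minimal sketch: simulating a study participant with a persona prompt.
# Assumes the OpenAI Python client; persona, question, and model are illustrative.
from openai import OpenAI

client = OpenAI()

persona = (
    "You are a backend developer with 8 years of experience who is skeptical "
    "about AI coding assistants. Answer survey questions in the first person, "
    "in 2-3 sentences."
)
question = "How has code review changed in your team since AI assistants were introduced?"

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": question},
    ],
)
print(completion.choices[0].message.content)
```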
Promises: This approach can provide valuable insights while reducing the need to recruit human participants, which can be time-consuming and costly. Additionally, using LLMs as subjects allows for controlled experiments with consistent and repeatable conditions.
Perils: However, it is important that researchers are aware of LLMs’ inherent biases [10] and limitations [11] when using them as study subjects.
Previous Work in SE: TODO: Examples of such studies in software engineering include…
References
[1] S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng, “Want to reduce labeling cost? GPT-3 can help,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November 2021, M.-F. Moens, X. Huang, L. Specia, and S. W. Yih, Eds., Association for Computational Linguistics, 2021, pp. 4195–4205. doi: 10.18653/V1/2021.FINDINGS-EMNLP.354.
[2] M. Wan et al., “TnT-LLM: Text mining at scale with large language models,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024, Barcelona, Spain, August 25-29, 2024, R. Baeza-Yates and F. Bonchi, Eds., ACM, 2024, pp. 5836–5847. doi: 10.1145/3637528.3671647.
[3] F. Gilardi, M. Alizadeh, and M. Kubli, “ChatGPT outperforms crowd-workers for text-annotation tasks,” CoRR, vol. abs/2303.15056, 2023, doi: 10.48550/ARXIV.2303.15056.
[4] Y. Zhu, P. Zhang, E. ul Haq, P. Hui, and G. Tyson, “Can ChatGPT reproduce human-generated labels? A study of social computing tasks,” CoRR, vol. abs/2304.10145, 2023, doi: 10.48550/ARXIV.2304.10145.
[5] X. Wang, H. Kim, S. Rahman, K. Mitra, and Z. Miao, “Human-LLM collaborative annotation through effective verification of LLM labels,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024, F. “Floyd” Mueller, P. Kyburz, J. R. Williamson, C. Sas, M. L. Wilson, P. O. T. Dugas, and I. Shklovski, Eds., ACM, 2024, pp. 303:1–303:21. doi: 10.1145/3613904.3641960.
[6] F. Huang, H. Kwak, and J. An, “Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech,” in Companion Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, Y. Ding, J. Tang, J. F. Sequeda, L. Aroyo, C. Castillo, and G.-J. Houben, Eds., ACM, 2023, pp. 294–297. doi: 10.1145/3543873.3587368.
[7] T. Ahmed, P. T. Devanbu, C. Treude, and M. Pradel, “Can LLMs replace manual annotation of software engineering artifacts?” CoRR, vol. abs/2408.05534, 2024, doi: 10.48550/ARXIV.2408.05534.
[8] G. Colavito, F. Lanubile, N. Novielli, and L. Quaranta, “Leveraging GPT-like LLMs to automate issue labeling,” in 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024, Lisbon, Portugal, April 15-16, 2024, D. Spinellis, A. Bacchelli, and E. Constantinou, Eds., ACM, 2024, pp. 469–480. doi: 10.1145/3643991.3644903.
[9] J. Huang et al., “Enhancing review classification via LLM-based data annotation and multi-perspective feature representation learning,” SSRN Electronic Journal, pp. 1–15, 2024, doi: 10.2139/ssrn.5002351.
[10] R. Crowell, “Why AI’s diversity crisis matters, and how to tackle it,” Nature Career Feature, 2023, doi: 10.1038/d41586-023-01689-4.
[11] J. Harding, W. D’Alessandro, N. G. Laskowski, and R. Long, “AI language models cannot replace human research participants,” AI Soc., vol. 39, no. 5, pp. 2603–2605, 2024, doi: 10.1007/S00146-023-01725-X.
LLMs as Tools for Software Engineers
TODO: Write section intro
Studying LLM Usage in Software Engineering
Description: Empirical studies can also focus on understanding how software engineers use LLMs in their workflows.
Examples: This involves investigating the adoption, usage patterns, and perceived benefits and challenges of LLM-based tools. Surveys, interviews, and observational studies can provide insights into how LLMs are integrated into development processes, how they influence decision-making, and what factors affect their acceptance and effectiveness. Such studies can inform the design of more user-friendly and effective LLM-based tools.
Promises: TODO
Perils: TODO
Previous Work in SE: Khojah et al. investigated the use of ChatGPT by professional software engineers in a week-long observational study [1].
LLMs for New Software Engineering Tools
Description: LLMs are being integrated into new tools designed to support software engineers in their daily tasks.
Examples: These tools can include intelligent code editors that provide real-time code suggestions, automated documentation generators, and advanced debugging assistants. Empirical studies can evaluate the effectiveness of these tools in improving productivity, code quality, and developer satisfaction.
Promises: By assessing the impact of LLM-powered tools, researchers can identify best practices and areas for further improvement.
Perils: TODO
Previous Work in SE: Choudhuri et al. conducted an experiment with students in which they measured the impact of ChatGPT on the correctness and time taken to solve programming tasks [2].
Benchmarking LLMs for Software Engineering Tasks
Description: Another typical type of study focuses on benchmarking LLM output quality on large-scale datasets. In these benchmarks, reference datasets such as HumanEval [3] play an important role in establishing standardized evaluation methods across studies. LLM output is often compared against a ground truth from the dataset using similarity metrics such as ROUGE, BLEU, or METEOR [4]. Moreover, the evaluation may be augmented by task-specific measures that focus on the type of SE artifact produced.
Examples: In software engineering, benchmarking may include evaluating LLMs’ ability to produce accurate and robust outputs for input data from real-world projects or from synthetically created SE datasets. Metrics can include code quality or performance metrics for code generation tasks, or writing-quality metrics for SE tasks that produce natural language artifacts such as requirements documents or domain descriptions.
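To illustrate, the sketch below compares a generated summary against a ground-truth reference using BLEU (via NLTK) and ROUGE-L (via the rouge-score package); the texts are made up, and metric choices in an actual benchmark would be task-specific.

```python
# Minimal sketch: scoring LLM output against a ground-truth reference.
# Requires: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Parses the configuration file and returns a dictionary of settings."
candidate = "Reads the config file and returns the settings as a dictionary."

# BLEU over token lists, with smoothing because the texts are short.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F-measure over the raw strings.
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(
    reference, candidate
)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}, ROUGE-L: {rouge_l:.3f}")
```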
Promises: TODO
Perils: Benchmark contamination, i.e., the leakage of benchmark data into a model’s training data, has recently been identified as an issue [5]. The careful selection of samples and construction of the corresponding input prompts is particularly important, as correlations between prompts may bias benchmark results [6].
Previous Work in SE: TODO
References
[1] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, “Beyond code generation: An observational study of ChatGPT usage in software engineering practice,” Proc. ACM Softw. Eng., vol. 1, no. FSE, pp. 1819–1840, 2024, doi: 10.1145/3660788.
[2] R. Choudhuri, D. Liu, I. Steinmacher, M. A. Gerosa, and A. Sarma, “How far are we? The triumphs and trials of generative AI in learning software engineering,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024, ACM, 2024, pp. 184:1–184:13. doi: 10.1145/3597503.3639201.
[3] M. Chen et al., “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021, Available: https://arxiv.org/abs/2107.03374
[4] X. Hou et al., “Large language models for software engineering: A systematic literature review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024, doi: 10.1145/3695988.
[5] S. Ahuja, V. Gumma, and S. Sitaram, “Contamination report for multilingual benchmarks,” CoRR, vol. abs/2410.16186, 2024, doi: 10.48550/ARXIV.2410.16186.
[6] C. Siska, K. Marazopoulou, M. Ailem, and J. Bono, “Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds., Association for Computational Linguistics, 2024, pp. 10406–10421. doi: 10.18653/V1/2024.ACL-LONG.560.