Comprehensive predictive analytics for collaborators’ answers, code quality, and dropout: stack overflow case study

Elijah Zolduoarrati; Sherlock A. Licorish; Nigel Stanger

doi:10.1007/s10664-025-10692-4

Journal article

Open access

Peer reviewed

Empirical software engineering, Vol.30(5), 147

23/07/2025

DOI: https://doi.org/10.1007/s10664-025-10692-4

Handle: https://hdl.handle.net/10523/47362

Keywords: answers; code quality; prediction; Stack Overflow; user dropout

Abstract

Previous studies that used data from Stack Overflow to develop predictive models often employed limited benchmarks of 3–5 models or adopted arbitrary selection methods. Although insightful, their limited scope suggests the need to benchmark more models to avoid overlooking untested algorithms. Our study evaluates 21 algorithms across three tasks: predicting the number of questions a user is likely to answer, their code quality violations, and their dropout status. We employed normalisation, standardisation, and logarithmic and power transformations, paired with Bayesian hyperparameter optimisation and genetic algorithms. CodeBERT, a pre-trained language model for both natural and programming languages, was fine-tuned to classify user dropout given their posts (questions and answers) and code snippets. We found that Bagging ensemble models combined with standardisation achieved the highest R² value (0.821) in predicting users’ answers. The Stochastic Gradient Descent regressor, followed by Bagging and Epsilon Support Vector Machine models, consistently outperformed the other benchmarked algorithms in predicting users’ code quality across multiple quality dimensions and languages. Extreme Gradient Boosting paired with log transformation achieved the highest F1-score (0.825) in predicting users’ dropout. CodeBERT classified users’ dropout with a final F1-score of 0.809, validating the performance of Extreme Gradient Boosting, which was based solely on numerical data. Overall, our benchmarking of 21 algorithms provides multiple insights. Researchers can leverage the findings regarding the most suitable models for specific target variables, and practitioners can use the identified optimal hyperparameters to reduce the initial search space during their own hyperparameter tuning.
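For readers unfamiliar with the terms in the abstract, the sketch below gives textbook definitions of three of the transformations it names (normalisation, standardisation, logarithmic transformation) and of the two reported metrics (R² and F1-score). It is an illustration only, not the authors' pipeline: the function names are hypothetical, `log1p` is an assumed choice of logarithmic transform, and the power transformation, hyperparameter optimisation and model fitting are left out.

```python
import math
from statistics import mean, pstdev


def normalise(values):
    """Min-max scaling to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """log(1 + x); an assumed variant that stays defined for zero counts."""
    return [math.log1p(v) for v in values]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mu) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, a regressor that always predicts the mean of the targets scores R² = 0, and a classifier with precision and recall both equal to 0.5 scores F1 = 0.5, which gives some context for the reported values of 0.821 and 0.825.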

Files and links (1)

URL: https://rdcu.be/ey3fZ

Details

Record Identifier: 9926758544601891
Title: Comprehensive predictive analytics for collaborators’ answers, code quality, and dropout: stack overflow case study
Creators: Elijah Zolduoarrati; Sherlock A. Licorish; Nigel Stanger
Publication Details: Empirical software engineering, Vol.30(5), 147
Academic Unit: Information Science; School of Computing
Publisher: Springer Nature
Date published ; e-published: 23/07/2025
Copyright: Copyright © The Author(s) 2025. All rights reserved. This work was first published in Empirical Software Engineering (Springer Nature). The open access link to the subscription article is provided under the Springer Nature SharedIt Content-Sharing Initiative (https://www.springernature.com/gp/researchers/sharedit) making the view-only full-text article freely and legally accessible to anyone for research purposes and private study via the link: https://rdcu.be/ey3fZ.
Comment: The published version is not available in full-text in OUR Archive. Where available, a link to the published version is provided (check the DOI and/or the Files and links section). The full-text item may be open access on the publisher's website. An earlier version of the work (such as authors' accepted manuscript following peer-review or unreviewed preprint/author's original version) may be available in the Files and links section of this record. Alternatively, readers may have subscription access to the full-text from the publisher.
Language: English
Resource Type ; Subtype: Journal article
