Understanding Regression Models on Stack Overflow Code: GBM Returns the Best Prediction Performance among Regression Techniques

Sherlock A. Licorish; Brendon Woodford; Lakmal Kiyaduwa Vithanage; Osayande Pascal Omondiagbe

doi:10.17706/jsw.20.2.51-65

Journal article

Open access

Peer reviewed

Journal of Software, Vol. 20(2), pp. 51-65

09/07/2025

DOI: https://doi.org/10.17706/jsw.20.2.51-65

Handle: https://hdl.handle.net/10523/47151

Keywords: evaluation study; regression methods; Stack Overflow; code quality; electrical grid stability; USA cancer mortality

Abstract

Practitioners are often dependent on Stack Overflow code during software development, where poor quality is occasionally reported. Research tends to focus on ranking content, identifying defects and predicting future content, but less attention is dedicated to identifying the most suitable techniques for modelling/prediction. Contextualizing the Stack Overflow code quality problem as regression-based, we examined the variables that predict Stack Overflow (Java) code quality, and the regression approach that provides the best predictive power. We observed answer count (β = 0.138), code length (β = 0.382), code spaces (β = 0.099) and lines of code (β = 1.959) as the strongest predictors of code quality on Stack Overflow. Six regression approaches were considered in our evaluation, where Gradient Boosting Machine (GBM) achieved superior performance (RMSE = 2.77, R² = 0.99, MAE = 0.79) compared to other methods including eXtreme Gradient Boosting (XGBoost) (RMSE = 3.12, R² = 0.97, MAE = 2.36), and Classification and Regression Trees (CART) (RMSE = 3.45, R² = 0.96, MAE = 1.77). In fact, even when evaluated against Deep Neural Networks (DeepNN), GBM’s superior performance is maintained. Follow-up evaluations using two independent datasets on Electrical Grid Stability and USA Cancer Mortality confirm GBM’s superior performance, supporting claims for generalizability of our findings. Outcomes here point to the value of the GBM ensemble learning mechanism and need for continued modelling techniques’ experimentation.
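For readers unfamiliar with the three metrics quoted in the abstract, the following is a minimal sketch of how RMSE, MAE and R² are conventionally computed for a regression model. It is not the authors' evaluation code. The function names and the toy values are assumptions made for illustration and are unrelated to the Stack Overflow dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical, for illustration only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
print("R2:  ", r_squared(y_true, y_pred))
```

Lower RMSE and MAE and a higher R² indicate a better fit. RMSE penalizes large individual errors more heavily than MAE, which is why the two metrics can rank models differently.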

Files and links (2)

pdf: JSW-V20N2-5072.36 MB
Published (Version of record), CC BY 4.0, Open Access

url: https://doi.org/10.17706/jsw.20.2.51-65
Published (Version of record), CC BY 4.0, Open Access

Details

Record Identifier: 9926755615801891
Title: Understanding Regression Models on Stack Overflow Code: GBM Returns the Best Prediction Performance among Regression Techniques
Creators: Sherlock A. Licorish; Brendon Woodford; Lakmal Kiyaduwa Vithanage; Osayande Pascal Omondiagbe
Publication Details: Journal of Software, Vol. 20(2), pp. 51-65
Academic Unit: School of Computing
Publisher: International Academy Publishing
Date published ; e-published: 09/07/2025
Copyright: Copyright © The Author(s) 2025. This work was first published in Journal of Software (International Academy Publishing). This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://www.creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, provided that the original work is properly attributed to the creator(s) and the source, a link to the Creative Commons license is provided, and any changes made are indicated.
Language: English
Resource Type ; Subtype: Journal article