Abstract
Practitioners often depend on Stack Overflow code during software development, even though poor-quality code is occasionally reported there. Research tends to focus on ranking content, identifying defects and predicting future content, while less attention is paid to identifying the most suitable modelling and prediction techniques. Framing Stack Overflow code quality as a regression problem, we examined which variables predict the quality of Stack Overflow (Java) code and which regression approach provides the best predictive power. We observed answer count (β = 0.138), code length (β = 0.382), code spaces (β = 0.099) and lines of code (β = 1.959) to be the strongest predictors of code quality on Stack Overflow. Of the six regression approaches we evaluated, Gradient Boosting Machine (GBM) achieved the best performance (RMSE = 2.77, R² = 0.99, MAE = 0.79), ahead of other methods including eXtreme Gradient Boosting (XGBoost) (RMSE = 3.12, R² = 0.97, MAE = 2.36) and Classification and Regression Trees (CART) (RMSE = 3.45, R² = 0.96, MAE = 1.77). GBM maintained its superior performance even when evaluated against Deep Neural Networks (DeepNN). Follow-up evaluations using two independent datasets, on Electrical Grid Stability and USA Cancer Mortality, confirm GBM's superior performance, supporting the generalizability of our findings. These outcomes point to the value of the GBM ensemble learning mechanism and to the need for continued experimentation with modelling techniques.
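
For readers less familiar with the three evaluation metrics reported above, the following is a minimal sketch of how RMSE, MAE and R² are conventionally computed. It assumes the standard textbook definitions; the function name `regression_metrics` and the sample values are hypothetical illustrations and are not taken from the study's data or code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers.

    Hypothetical helper using the standard definitions of each metric.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Residual sum of squares, shared by RMSE and R^2.
    ss_res = sum(e * e for e in errors)

    # Root mean squared error: penalises large errors more heavily.
    rmse = math.sqrt(ss_res / n)

    # Mean absolute error: average magnitude of the errors.
    mae = sum(abs(e) for e in errors) / n

    # R^2: share of the variance in y_true explained by the predictions.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return rmse, mae, r2


# Illustrative values only, not data from the study.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
```

Lower RMSE and MAE and a higher R² indicate a better fit, which is the sense in which the approaches above are compared.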