Data Quality in Empirical Software Engineering: An Investigation of Time-Aware Models in Software Effort Estimation
Bosu, Michael Franklin

Cite this item:
Bosu, M. F. (2016). Data Quality in Empirical Software Engineering: An Investigation of Time-Aware Models in Software Effort Estimation (Thesis, Doctor of Philosophy). University of Otago. Retrieved from http://hdl.handle.net/10523/6142
Permanent link to OUR Archive version:
http://hdl.handle.net/10523/6142
Abstract:
Since its inception as a recognized sub-discipline, empirical software engineering (ESE) has been plagued by data quality issues, and in recent years this has led to an increasing number of questions being raised about the accuracy and reliability of the models derived from ESE data. This general ‘data quality problem’ has been compounded by an imbalance in how data quality issues are addressed in the field: noise, outliers and incompleteness have received the most attention, to the near neglect of other challenges.

The research reported in this thesis first proposes a taxonomy of data quality challenges based on a survey of prior literature, in order to broaden the concept of data quality in ESE so that all of its relevant dimensions are considered. The survey identified eleven distinct data quality issues, which were classified into three major classes to form the taxonomy. The first class is Accuracy, which covers all properties of a dataset that can result in the development of inaccurate prediction systems. The second class is Relevance, which refers to characteristics of a dataset such that models derived from it are not applicable to another dataset. The third class is Provenance, which comprises factors that prevent or limit access to data, thus raising trust issues and hindering the replication of software engineering experiments.

A targeted systematic literature review investigating the treatment of data quality is then reported. This addresses three perspectives of data quality in terms of the practices reported in the literature: data collection, data pre-processing and data quality identification. The findings indicate that consideration of data quality is an unsystematic endeavour in the empirical software engineering community: just 11% of the 282 papers reviewed indicated some level of consideration of all three perspectives.
Thirteen publicly available effort estimation datasets that have been analysed in ESE studies are then benchmarked against the quality dimensions proposed in the taxonomy. The rationale for this is twofold: first, it provides a holistic assessment of the quality of many of the datasets commonly used in the field; second, it enables an initial evaluation of the utility of the taxonomy as a benchmarking mechanism. Multiple data quality issues beyond the three most often noted (noise, outliers and incompleteness) are identified among these datasets. The benchmarking also reveals inconsistent reporting associated with ESE datasets, as the same properties of a dataset are sometimes reported differently. A data collection and submission template is therefore proposed, both to proactively enhance the recording of high-quality data and to provide a transparent means of assessing the quality of existing datasets.

Both the data quality taxonomy and the benchmarking of effort datasets identified timeliness as one of the neglected data quality challenges. Two time-aware model-building approaches are therefore applied to the software effort estimation problem. The first uses Time-Aware Sequential Accumulation (TASA), in which projects are ordered according to their completion date and are then used to build models that estimate the effort of projects in a subsequent period (typically the next year). The second uses a Time-Aware Moving Window (TAMW), which modifies the TASA approach by removing old projects from the training set. These time-aware models are constructed for five datasets: four from the public domain and one proprietary. Model performance is evaluated using three unbiased accuracy measures, and these outcomes are compared with those obtained from three baseline models: mean, median and leave-one-out cross-validation (LOO). The vast majority of the time-aware models are more accurate than their associated mean and median baseline models.
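The accumulation and windowing schemes described above can be sketched as follows. This is a minimal illustration only: the project records, the one-year window, and the mean-effort predictor are hypothetical assumptions for demonstration, not the thesis's actual datasets, window sizes or estimation models.

```python
from datetime import date

# Hypothetical project records: (completion_date, size, effort).
# Values are illustrative, not drawn from any ESE dataset.
projects = [
    (date(2001, 3, 1), 10, 30),
    (date(2001, 9, 1), 20, 55),
    (date(2002, 4, 1), 15, 42),
    (date(2002, 11, 1), 25, 70),
    (date(2003, 6, 1), 30, 85),
]

def tasa_training_set(projects, target_year):
    """TASA: accumulate every project completed before the target year."""
    return [p for p in projects if p[0].year < target_year]

def tamw_training_set(projects, target_year, window_years):
    """TAMW: like TASA, but drop projects older than a trailing window."""
    return [p for p in projects
            if target_year - window_years <= p[0].year < target_year]

def mean_effort_model(training):
    """Stand-in predictor: mean effort of the training projects."""
    efforts = [effort for _, _, effort in training]
    return sum(efforts) / len(efforts)

# Estimate effort for projects completing in 2003 from earlier projects.
tasa_pred = mean_effort_model(tasa_training_set(projects, 2003))
tamw_pred = mean_effort_model(tamw_training_set(projects, 2003,
                                                window_years=1))
```

The key contrast is in the training sets: TASA keeps growing as time passes, while TAMW discards old projects so that only recent practice informs the next period's estimates.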
Although the optimistic LOO models might be expected to be superior to the time-aware models, there are several instances where the time-aware models are in fact more accurate than the LOO baselines. Perhaps more importantly, this analysis reveals that the form and nature of the two sets of models – time-aware and non-time-aware – are different. This establishes that it is both feasible and important to develop effort estimation models that take the timing of projects into consideration.

The research then proceeds to consider whether the processes underlying three effort estimation datasets have remained stationary over time. A Gaussian kernel estimator is used to generate non-uniform weights that are then applied to the datasets. This approach ensures that more recently completed projects are weighted more heavily than older projects, reflecting their relatively greater importance (due to assumed higher relevance) to the prediction models. Weighted regression models built using the non-uniform weightings (local models) are compared with uniformly weighted models (global models), in which no differential weightings are applied. The results indicate that, when the underlying process is non-stationary, uniform (global) models are more accurate than non-uniform (local) models over time. In addition, the accuracy of the models for non-stationary datasets is found to be worse than that obtained for the models of stationary datasets. A further finding indicates that, across time, the stationarity or otherwise of a dataset depends on the extent of its heterogeneity. Finally, this study confirms the importance of bandwidth values to the performance of models that employ kernel estimators.
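The kernel-weighting idea above can be sketched with a Gaussian kernel over project age and a closed-form weighted least-squares fit. The ages, sizes, efforts and bandwidth here are illustrative assumptions; the thesis's actual datasets, model forms and bandwidth selection are not reproduced.

```python
import math

def gaussian_weights(ages, bandwidth):
    """Gaussian kernel over project age: recent projects (small age)
    get weight near 1; older projects decay smoothly toward 0."""
    return [math.exp(-(a ** 2) / (2 * bandwidth ** 2)) for a in ages]

def weighted_linear_fit(x, y, w):
    """Closed-form weighted least squares for y ≈ b0 + b1 * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - xbar) * (yi - ybar)
              for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    b0 = ybar - b1 * xbar
    return b0, b1

# Illustrative data: ages in years since completion, size vs effort.
ages = [4, 3, 2, 1, 0]
size = [10, 20, 15, 25, 30]
effort = [30, 55, 42, 70, 85]

# Local model: Gaussian weights with an assumed bandwidth of 2 years.
w_local = gaussian_weights(ages, bandwidth=2.0)
local_fit = weighted_linear_fit(size, effort, w_local)

# Global model: uniform weights, so every project counts equally.
global_fit = weighted_linear_fit(size, effort, [1.0] * len(size))
```

With uniform weights the fit reduces to ordinary least squares, which is why the global model serves as the baseline; the bandwidth controls how sharply older projects are discounted, which is the sensitivity the study highlights.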
Date:
2016
Advisor:
MacDonell, Stephen; Whigham, Peter
Degree Name:
Doctor of Philosophy
Degree Discipline:
Information Science
Publisher:
University of Otago
Keywords:
Data Quality; Empirical Software Engineering; Time-Aware Models; Weighted linear regression; Stationary models; Data Benchmarking; Noise; Outliers; Inconsistency; Software Engineering; Effort Estimation
Research Type:
Thesis
Languages:
English
Collections
- Information Science
- Thesis - Doctoral