Abstract
Coffee consumption continues to increase throughout the world. Unfortunately, this growing demand has led to more reported cases of geographical origin fraud, a problem exacerbated by the COVID-19 pandemic. It is thus essential to develop an effective toolkit to verify the origins of coffee. Current methods mainly rely on subjective sensory reports and on the trust that consumers have in the product label. To objectively verify the origins of coffee, it is essential to test the actual product itself. Traditional analytical methods are costly and time-consuming, and they rely on harmful solvents and destructive sample preparation steps. Vibrational spectroscopic methods are affordable, solvent-free, rapid, and potentially non-destructive alternatives that have grown in popularity, driven by advancements in chemometrics, machine learning, and artificial intelligence (AI).
Nonetheless, gaps remain in the literature. These include a lack of understanding of the sensitivity of vibrational spectroscopy tools for classification, prediction, and marker selection in comparison to traditional analytical techniques, and a lack of understanding of the potential of a comprehensive data pipeline and advanced machine learning techniques for origin traceability. The primary objective of this thesis was to develop a rapid and non-destructive vibrational spectroscopy-based toolbox for origin classification and the prediction of traditional origin markers, coupled with advanced non-linear machine learning and data fusion techniques, and to compare it with traditional analytical methods.
To meet this objective, 24 green coffee bean (GCB) samples (arabica, wet washed) originating from three continents, eight countries, and 22 regions were analysed using a multi-omics and machine learning approach. The multi-omics techniques included geochemistry (stable isotopes and trace elements), metabolomics [gas chromatography-mass spectrometry (GC-MS) and nuclear magnetic resonance (NMR)], and vibrational spectroscopy [bulk near-infrared (NIR) and hyperspectral imaging-near infrared (HSI-NIR)]. Various linear and non-linear machine learning classification models were explored: partial least squares discriminant analysis (PLS-DA), random forest (RF), radial basis function support vector machine (RBF-SVM), linear SVM, eXtreme gradient boosting (XGB), and k-nearest neighbours (KNN). The linear and non-linear regression models used for prediction/estimation were partial least squares regression (PLS-R) and support vector regression (SVR).
Geochemical techniques demonstrated good sensitivity across the various origin scales, confirming their position as a gold standard for origin traceability and serving as a baseline against which the classification sensitivity of the other omics methods was compared. Several origin-discriminating markers were identified from each dataset, and performance improved when the stable isotope and trace element datasets were fused, reaching 96.00% accuracy at the continental level. Non-linear machine learning models such as RF and SVM also demonstrated potential for improving classification performance, achieving perfect prediction accuracy and identifying more relevant markers. Metabolomics methods provided classification sensitivity similar to or lower than that of geochemistry. In agreement with the geochemistry chapter, non-linear models and data fusion increased the classification performance of the metabolomics datasets, and origin-discriminating markers were identified. Several vibrational spectroscopy instruments were then optimised across several preprocessing steps to ensure that the most important information was extracted from each dataset. Bulk NIR and HSI-NIR performed better than mid-infrared and Raman spectroscopy. NIR is a rapid technique that demonstrated classification sensitivity at the continental and country levels, with potential at the regional level when coupled with non-linear models. Important wavelength regions contributing towards origin discrimination were also identified. Nonetheless, bulk NIR is limited by the need to grind samples to obtain a better signal-to-noise ratio and more representative sampling. HSI-NIR overcomes this limitation because it can analyse larger volumes of coffee as whole beans. HSI-NIR also showed classification sensitivity across all origin scales, with better prediction/estimation of traditional markers than bulk NIR, especially when paired with non-linear regression techniques.
This thesis served as a proof of principle, demonstrating that rapid and non-destructive analytical tools, combined with advanced machine learning techniques and a comprehensive data pipeline, have the potential to complement traditional methods for origin traceability. There is potential for HSI-NIR to be engineered for in-field use as a cost-effective, rapid, and non-destructive solution to the global origin fraud epidemic, with potential applications to other datasets and matrices.