Abstract
Soybean is one of the world’s most vital agricultural commodities, serving as a key source of protein and oil for both human consumption and livestock feed. Brazil, the largest soybean-producing country, is projected to produce 169 million metric tons (MMT) in the 2024/25 season, contributing significantly to the global production of over 420 MMT. While this high output strengthens Brazil’s position in international markets, it also makes the country a deforestation hotspot, particularly in ecologically sensitive regions like the Amazon and Cerrado. To combat deforestation-driven agricultural expansion, the European Union Deforestation Regulation (EUDR) mandates that soybean imports be traceable to their regions of origin. This underscores the urgent need for robust and commercially viable traceability methods that can authenticate soybean provenance with high accuracy.
Various analytical techniques have been employed for soybean traceability, ranging from traditional methods such as stable isotope and trace element analysis to rapid spectroscopic techniques such as near-infrared (NIR) spectroscopy and hyperspectral imaging (HSI-NIR). While traditional approaches such as geochemical analysis and metabolomics offer high specificity, they can be time-consuming and resource-intensive. Rapid spectroscopic methods, on the other hand, are cost-effective and scalable but often lack chemical specificity. Integrating both approaches, along with advanced machine learning techniques, presents a promising solution.
However, despite the increasing adoption of machine learning in food authentication, existing studies exhibit gaps in sensitivity, particularly in state-level classification. The use of advanced non-linear supervised models, data fusion strategies, and explainable artificial intelligence (XAI) in soybean traceability remains underexplored. To bridge this gap, this thesis aims to identify the most effective combination of analytical techniques and machine learning approaches for soybean traceability—one that ensures both high classification accuracy and commercial feasibility in compliance with international trade regulations.
A total of 60 soybean samples were collected from six Brazilian states spanning two distinct biomes. Five analytical techniques were employed: geochemical analysis (stable isotopes and trace elements), metabolomics (gas chromatography-mass spectrometry (GC-MS)), and vibrational spectroscopy (NIR and HSI-NIR). These were combined with various machine learning classifiers, including partial least squares discriminant analysis (PLS-DA), support vector machine with a linear kernel (SVM-Linear), random forest (RF), radial basis function support vector machine (RBF-SVM), and eXtreme gradient boosting (XGB). Data fusion techniques (low-, mid-, and high-level) were utilized to enhance classification performance, while XAI was applied to interpret RF model outputs and identify key chemical markers.
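As an illustration of the simplest of these strategies, low-level data fusion can be sketched as scaling each analytical block and concatenating the blocks sample by sample before a single classifier is trained. The sketch below is a minimal example under stated assumptions: the function names (`autoscale`, `low_level_fusion`), the choice of autoscaling, and the numeric values are hypothetical and are not taken from the thesis.

```python
from statistics import mean, pstdev


def autoscale(block):
    """Scale each column of one data block (rows = samples) to zero mean, unit variance."""
    scaled_cols = []
    for col in zip(*block):
        mu = mean(col)
        sd = pstdev(col) or 1.0  # constant columns have sd == 0; leave them centred
        scaled_cols.append([(v - mu) / sd for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def low_level_fusion(*blocks):
    """Concatenate autoscaled blocks row-wise so each sample has one fused feature vector."""
    n_samples = len(blocks[0])
    if any(len(b) != n_samples for b in blocks):
        raise ValueError("all blocks must describe the same samples in the same order")
    scaled = [autoscale(b) for b in blocks]
    return [[v for blk in scaled for v in blk[i]] for i in range(n_samples)]


# Hypothetical toy data: 3 samples, 2 isotope variables, 3 NIR variables.
isotopes = [[-25.1, 4.2], [-26.3, 5.0], [-24.8, 3.9]]
nir = [[0.41, 0.52, 0.60], [0.39, 0.55, 0.58], [0.44, 0.50, 0.63]]

fused = low_level_fusion(isotopes, nir)
print(len(fused), len(fused[0]))
```

Scaling before concatenation keeps a block with many variables or large numeric ranges (such as spectra) from dominating a smaller block (such as isotope ratios); other block-weighting schemes are equally plausible.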
Among the machine learning models, RF emerged as the most effective due to its superior classification accuracy and visualization capabilities. However, its interpretability remained a challenge, which was addressed through XAI. Among the analytical techniques, NIR spectroscopy demonstrated the highest classification performance but lacked chemical specificity. While geochemical techniques also produced strong results, volatile organic compounds (VOCs) proved to be the least effective in origin classification due to their high sensitivity and variability. Additionally, HSI-NIR was tested on both whole soybeans and powdered samples, revealing that powdered samples yielded better classification accuracy. The study also examined the long-term stability of soybean samples stored for over two years, with NIR spectroscopy detecting subtle compositional changes despite controlled storage conditions.
While all techniques showed strong classification capabilities at the biome level, state-level differentiation remained a challenge due to overlapping characteristics. To address this, low-, mid-, and high-level data fusion approaches were explored. High-level data fusion successfully resolved state-level overlaps, achieving a refined classification with clear regional differentiation.
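High-level (decision-level) fusion combines the outputs of separately trained models rather than their input features. The sketch below assumes one common rule, a weighted average of per-class probabilities; the abstract does not state which combination rule was used, and the function name `high_level_fusion`, the class labels, and the probability values are hypothetical.

```python
def high_level_fusion(prob_by_model, weights=None):
    """Fuse per-model class probabilities by weighted averaging.

    prob_by_model maps a model name to a dict of {class_label: probability}.
    Returns the winning class label and the fused probability dict.
    """
    names = list(prob_by_model)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weighting by default
    total = sum(weights[name] for name in names)
    classes = list(prob_by_model[names[0]])
    fused = {
        c: sum(weights[name] * prob_by_model[name][c] for name in names) / total
        for c in classes
    }
    label = max(fused, key=fused.get)
    return label, fused


# Hypothetical outputs for one sample from three single-technique classifiers.
predictions = {
    "nir": {"state_A": 0.50, "state_B": 0.40, "state_C": 0.10},
    "isotopes": {"state_A": 0.30, "state_B": 0.60, "state_C": 0.10},
    "trace_elements": {"state_A": 0.35, "state_B": 0.55, "state_C": 0.10},
}

label, fused = high_level_fusion(predictions)
print(label, fused)
```

In this toy case the NIR model alone would favour `state_A`, but the fused decision is `state_B` because the other two models agree on it. This shows how decision-level fusion can separate classes that overlap within a single technique.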
This research highlights the power of integrating traditional and modern analytical techniques with machine learning to establish a scalable and reliable soybean traceability framework. The findings demonstrate the potential of vibrational spectroscopy, particularly NIR and HSI-NIR, for in-field deployment as a rapid, cost-effective, and non-destructive solution to combat global origin fraud. The proposed methodologies extend beyond soybean, offering promising applications for other agricultural commodities and food authentication challenges worldwide.