Abstract
The PCOS subreddit is a large collection of posts and comments in which people describe their experiences with polycystic ovary syndrome (PCOS). This paper presents an ensemble machine learning approach for extracting feature and sentiment information relating to PCOS from the subreddit. Ensemble classifiers were built from convolutional neural networks (CNNs), keyword searches, and Bayesian theory. The output of each component of an ensemble classifier was weighted by its specificity or sensitivity on the testing dataset, and the weighted outputs were summed. Thresholds were calculated using probability theory to determine how high the summed output needed to be for a feature to be deemed present in the input text. For each feature, the machine learning output labels were randomly sampled to calculate precision. Most features of interest could be identified with suitably high precision. Over 100 different features were identified among the users, yielding hundreds of thousands of feature labels in the user dataset. Sentiment classification CNNs were also created and typically performed with high accuracy on the testing datasets. The result is a complete dataset of approximately 100,000 PCOS subreddit users, the features they presented with, and the sentiments they expressed. This large and detailed dataset has significant clinical potential.
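The weighting, thresholding, and precision-sampling steps summarized above can be illustrated with a minimal sketch. It is not the implementation used in this work. The function names, the example weights, and the threshold value are all hypothetical, and the threshold is passed in as a given number rather than derived from probability theory. The precision estimate assumes that each sampled label has been checked by hand and marked correct or incorrect.

```python
from typing import Sequence


def ensemble_score(outputs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of component outputs (e.g. CNN, keyword search, Bayesian).

    Each weight stands in for that component's specificity or sensitivity
    on a testing dataset; the values used here are illustrative only.
    """
    if len(outputs) != len(weights):
        raise ValueError("outputs and weights must have the same length")
    return sum(w * o for w, o in zip(weights, outputs))


def feature_present(outputs: Sequence[float],
                    weights: Sequence[float],
                    threshold: float) -> bool:
    """Deem a feature present when the weighted sum reaches the threshold.

    The threshold is an assumed input; how it is calculated is not shown.
    """
    return ensemble_score(outputs, weights) >= threshold


def sampled_precision(sample_is_correct: Sequence[bool]) -> float:
    """Precision estimated from a random sample of positive output labels.

    Each entry records whether a sampled label was confirmed on review.
    """
    if not sample_is_correct:
        raise ValueError("sample must not be empty")
    return sum(sample_is_correct) / len(sample_is_correct)


if __name__ == "__main__":
    # Hypothetical component outputs for one post and one feature.
    outputs = [1.0, 0.0, 1.0]
    weights = [0.9, 0.8, 0.5]
    print(ensemble_score(outputs, weights))
    print(feature_present(outputs, weights, threshold=1.0))
    print(sampled_precision([True, True, False, True]))
```

In this sketch a component that performed better on the testing dataset contributes more to the summed score, so a single weak component firing on its own is less likely to push a feature over the threshold.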