Abstract
Our project aimed to lay the groundwork for a tool that identifies foodborne illness cases from Twitter, as a first step toward an unofficial early warning system that could slow the spread of foodborne illness. To set up the framework for this system, we collected and stored historical tweets, created visualizations that display trends in foodborne illness over time, compared Twitter data to official foodborne illness data, and evaluated the machine learning model.
Motivation
Foodborne illness is a widespread issue: approximately 1 in 6 Americans gets sick from it every year. The CDC outbreak detection process tends to take weeks, and lab testing can be slow, costly, and potentially inaccessible, so word often does not get out until many people have already gotten sick (CDC, n.d.). An early warning system could slow the spread of foodborne illness. The goal of this project is to continue developing an efficient framework for creating and evaluating a tool that warns of the early signs of foodborne illness outbreaks.
Objectives
Our first objective was to create an efficient pipeline that collects the appropriate foodborne illness related data. To do this, we created a query made up of keywords that retrieves the tweets most likely to be related to foodborne illness. After retrieving these tweets, we passed them through our machine learning model to get predictions on whether or not each tweet indicates a case of foodborne illness. This data was then cleaned and stored so that it could be easily retrieved for analysis.
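The query and filtering steps of this pipeline can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the keyword list, the `ScoredTweet` type, and the function names are hypothetical, and the 0.9 cutoff is assumed from the ">90% Prediction" figure reported in the summary statistics.

```python
from dataclasses import dataclass

# Hypothetical keywords; the project's real query terms are not listed here.
KEYWORDS = ["food poisoning", "stomach flu", "foodborne"]


def build_query(keywords):
    """Join keywords into one OR query, quoting multi-word phrases."""
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return "(" + " OR ".join(terms) + ")"


@dataclass
class ScoredTweet:
    tweet_id: str
    text: str
    score: float  # model's predicted probability of a foodborne illness case


def keep_confident(tweets, threshold=0.9):
    """Keep only tweets whose prediction score is strictly above the threshold."""
    return [t for t in tweets if t.score > threshold]
```

Keeping the threshold as a parameter makes it easy to check how the number of retained tweets changes with a stricter or looser cutoff.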
Our second objective was to create a website that contains user-friendly visualizations. We want to convey our results to both laypeople and those involved in public policy or public health. By displaying interactive results on an easily accessible webpage, we hope to make it easy for users to understand which foods and symptoms are commonly related to foodborne illness and to see which areas of the US are most impacted.
Our third objective was to explore the validity of our collected Twitter data as an indicator of foodborne illness outbreaks. We compared the data collected by our pipeline to ground truth data obtained directly through the CDC/NORS. After first preparing both datasets in comparable formats, we analyzed how closely our collected data relates to real foodborne illness cases.
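One possible way to carry out such a comparison is sketched below. It assumes that both datasets are aggregated into monthly counts and compared with a Pearson correlation; the text above does not specify the aggregation level or the statistic used, so both choices, along with all function names, are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def monthly_counts(dates):
    """Count events per (year, month) from an iterable of date objects."""
    return Counter((d.year, d.month) for d in dates)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def compare_series(tweet_dates, official_counts):
    """Correlate monthly tweet counts with official monthly case counts.

    official_counts maps (year, month) to a case count; months missing
    from either source are treated as zero.
    """
    tweets = monthly_counts(tweet_dates)
    months = sorted(set(tweets) | set(official_counts))
    xs = [tweets.get(m, 0) for m in months]
    ys = [official_counts.get(m, 0) for m in months]
    return pearson(xs, ys)
```

Aligning both series on the union of months keeps gaps visible as zeros instead of silently dropping them, which matters when one source has no records for a given month.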
Our fourth objective was to evaluate the performance of our machine learning model on the data that we collected. To do this, we hand-labeled data to serve as ground truth and compared these labels to the model's predictions to see how accurate the predictions were.
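Comparing hand labels with model predictions could look like the following sketch. It assumes binary labels (1 for a foodborne illness case, 0 otherwise) and reports accuracy, precision, and recall; the text above mentions only accuracy, so the other two metrics and the function name `evaluate` are illustrative additions.

```python
def evaluate(labels, predictions):
    """Compare binary hand labels with binary model predictions."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # true positives
    fp = sum(1 for y, p in pairs if not y and p)      # false positives
    fn = sum(1 for y, p in pairs if y and not p)      # false negatives
    tn = sum(1 for y, p in pairs if not y and not p)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Fall back to 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Precision and recall are useful alongside accuracy when true illness reports are much rarer than unrelated tweets, since accuracy alone can look high even if most real cases are missed.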
Tweets collected from 2017 to 2022
- Tweets Collected: 114,572
- Tweets with >90% Prediction: 94,357
- Tweets with Identified Food Entities: 50,923
- Tweets with Identified Cities: 99,355
- #1 State by Tweet Volume: CA (17,204)
- #2 State by Tweet Volume: TX (11,531)
- #3 State by Tweet Volume: FL (7,542)
- #4 State by Tweet Volume: NY (6,608)
- #5 State by Tweet Volume: OH (4,621)