Abstract


Our project aimed to lay the groundwork for a tool that identifies foodborne illness cases from Twitter, as a first step toward an unofficial early warning system to slow the spread of foodborne illness. To set up the framework for this system, we collected and stored historical tweets, created visualizations that display trends in foodborne illness over time, compared Twitter data to official foodborne illness data, and evaluated the machine learning model.

Motivation


Foodborne illness is a widespread issue: approximately 1 in 6 Americans gets sick from it every year. The CDC outbreak detection process tends to take weeks, and lab testing can be slow, costly, and potentially inaccessible, so word often does not get out until many people have already gotten sick (CDC, n.d.). An early warning system could slow the spread of foodborne illness. The goal of this project is to continue developing an efficient framework for creating and evaluating a tool that warns of the early signs of foodborne illness outbreaks.

Objectives


Our first objective was to create an efficient pipeline that collects the appropriate foodborne illness data. To do this, we created a keyword query that retrieves the tweets most likely to be related to foodborne illness. After retrieving these tweets, we passed them through our machine learning model to predict whether or not each tweet indicates a case of foodborne illness. The data was then cleaned and stored so that it could be easily retrieved for analysis.
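The collection step described above can be sketched as follows. This is a minimal illustration only: the keyword list, the `build_query` and `keep_confident` helpers, the `score_fn` classifier stand-in, and the 0.9 threshold are assumptions made for the example, not the project's actual query or model.

```python
from dataclasses import dataclass

# Hypothetical keywords; the project's real query terms are not given here.
KEYWORDS = ["food poisoning", "stomach bug", "nausea"]


def build_query(keywords):
    """Join keywords with OR, quoting multi-word phrases."""
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return "(" + " OR ".join(terms) + ")"


@dataclass
class ScoredTweet:
    text: str
    probability: float  # predicted probability that the tweet reports illness


def keep_confident(tweets, score_fn, threshold=0.9):
    """Score each tweet and keep those above the threshold.

    score_fn stands in for the machine learning model: any callable that
    maps tweet text to a probability between 0 and 1.
    """
    scored = [ScoredTweet(text, score_fn(text)) for text in tweets]
    return [s for s in scored if s.probability > threshold]
```

Separating query construction from scoring keeps the keyword list easy to revise without touching the model step.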


Our second objective was to create a website that contains user-friendly visualizations. We want to convey our results both to laypeople and to those involved in public policy or public health. By displaying interactive results on an easily accessible webpage, we hope to make it easy for users to understand which foods and symptoms are commonly related to foodborne illness and to see which areas of the US are most impacted.


Our third objective was to explore how valid our collected Twitter data is as an indicator of foodborne illness outbreaks. We compared the data collected by our pipeline to ground truth data obtained directly through the CDC/NORS. After converting both datasets into comparable formats, we analyzed how closely our collected data relates to real foodborne illness cases.
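One way such a comparison could be carried out is to aggregate both sources into monthly counts and measure their correlation. The sketch below assumes that approach; the monthly granularity, the Pearson coefficient, and the helper names `monthly_counts`, `align`, and `pearson` are illustrative choices, not necessarily the analysis we performed.

```python
from collections import Counter
from datetime import date
from math import sqrt


def monthly_counts(dates):
    """Count events per (year, month) from a list of dates."""
    return Counter((d.year, d.month) for d in dates)


def align(counts_a, counts_b):
    """Return two equal-length series over the union of months, filling gaps with 0."""
    months = sorted(set(counts_a) | set(counts_b))
    return ([counts_a.get(m, 0) for m in months],
            [counts_b.get(m, 0) for m in months])


def pearson(xs, ys):
    """Pearson correlation of two equal-length series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

A coefficient near 1 would suggest that tweet volume rises and falls with reported cases, while a value near 0 would suggest little linear relationship.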


Our fourth objective was to evaluate the performance of our machine learning model on the data that we collected. To do this, we hand-labeled data to serve as ground truth and compared these labels with the model's predictions to see how accurate the predictions were.
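Comparing hand labels with model predictions typically comes down to a few standard metrics. The sketch below computes accuracy, precision, and recall for binary labels; the `evaluate` function and the choice of these three metrics are assumptions for illustration, not a record of the exact evaluation we ran.

```python
def evaluate(labels, predictions):
    """Compare binary hand labels (ground truth) with binary model predictions.

    Both arguments are equal-length sequences of 0/1 (or False/True) values,
    where 1 means the tweet indicates a case of foodborne illness.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # true positives
    fp = sum(1 for y, p in pairs if not y and p)      # false positives
    fn = sum(1 for y, p in pairs if y and not p)      # false negatives
    tn = sum(1 for y, p in pairs if not y and not p)  # true negatives

    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Precision and recall are worth reporting alongside accuracy because a query-filtered dataset is likely to be imbalanced, and accuracy alone can hide a high false positive or false negative rate.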


Tweets collected from 2017 to 2022

Tweets Collected: 114,572
Tweets with >90% Prediction: 94,357
Tweets with Identified Food Entities: 50,923
Tweets with Identified Cities: 99,355
#1 State by Tweet Volume: CA (17,204)
#2 State by Tweet Volume: TX (11,531)
#3 State by Tweet Volume: FL (7,542)
#4 State by Tweet Volume: NY (6,608)
#5 State by Tweet Volume: OH (4,621)