NLP on disaster response messages
This project is the fifth in a series of seven projects to be delivered as part of the Udacity Data Science Nanodegree
self-study.
Machine learning is critical to helping different organizations understand which messages are relevant to them and which
messages to prioritize.
It is during a disaster that these organizations have the least capacity to filter for the messages that matter, and basic
methods such as keyword searches yield only trivial results.
In this project, the goal is to analyze thousands of real messages, provided by Figure 8, that were sent during natural disasters either via social media or directly to disaster response organizations.
We have several steps to follow:
- Build an ETL pipeline that processes message and category data from CSV files and loads it into a SQLite database
- Build a Machine Learning pipeline that reads from the SQLite database to create and save a multi-output supervised learning model (a message can belong to several categories at once, so this is a multi-label classification problem). The goal is to categorize these messages so that each one can be sent to an appropriate disaster relief agency.
- Use the Flask framework to build a web app that will:
  - first launch both pipelines in order to populate everything that needs to be populated
  - provide some data visualizations
  - use the trained and saved model to classify new messages into 36 categories.
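The ETL step in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`id`, `message`, `genre`, `categories`), the `name-flag;name-flag` encoding of the category string, the table name `messages`, and the two sample rows are all assumptions made for the example. Only the Python standard library is used so that the sketch is self-contained.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the two CSV files.
MESSAGES_CSV = """id,message,genre
1,We need water and food,direct
2,People trapped under a collapsed building,news
"""

CATEGORIES_CSV = """id,categories
1,water-1;food-1;search_and_rescue-0
2,water-0;food-0;search_and_rescue-1
"""


def extract(messages_file, categories_file):
    """Read both CSV sources into dictionaries keyed by message id."""
    messages = {row["id"]: row for row in csv.DictReader(messages_file)}
    categories = {
        row["id"]: row["categories"] for row in csv.DictReader(categories_file)
    }
    return messages, categories


def transform(messages, categories):
    """Join on id and expand the category string into one 0/1 column each."""
    rows = []
    for msg_id, msg in messages.items():
        if msg_id not in categories:
            continue  # skip messages that have no labels
        flags = {}
        for item in categories[msg_id].split(";"):
            name, value = item.rsplit("-", 1)
            flags[name] = int(value)
        rows.append(
            {"id": int(msg_id), "message": msg["message"], "genre": msg["genre"], **flags}
        )
    return rows


def load(rows, connection, table="messages"):
    """Write the cleaned rows into a SQLite table, replacing any old copy."""
    columns = list(rows[0].keys())
    column_defs = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f'DROP TABLE IF EXISTS "{table}"')
    connection.execute(f'CREATE TABLE "{table}" ({column_defs})')
    connection.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    connection.commit()


connection = sqlite3.connect(":memory:")  # a file path would be used in practice
messages, categories = extract(io.StringIO(MESSAGES_CSV), io.StringIO(CATEGORIES_CSV))
rows = transform(messages, categories)
load(rows, connection)
print(connection.execute(
    "SELECT id, water, search_and_rescue FROM messages ORDER BY id"
).fetchall())
# → [(1, 1, 0), (2, 0, 1)]
```

Keeping extract, transform, and load as separate functions makes it easy for a web app to call the whole step on demand before training.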
spaCy is the Python package used to perform NLP preprocessing tasks such as lemmatization.
A scikit-learn TF-IDF vectorizer and a few other transformers are then combined into a single scikit-learn pipeline.
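As a rough sketch of such a pipeline, the example below chains a TF-IDF vectorizer with a multi-output classifier. Several parts are assumptions, since the text does not name them: `MultiOutputClassifier` wrapping `LogisticRegression` is one possible choice of estimator, the three category names and six training messages are invented, and a simple regex tokenizer stands in for spaCy lemmatization so that the example runs without downloading a language model.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical subset of the category names.
CATEGORIES = ["water", "food", "search_and_rescue"]


def tokenize(text):
    """Stand-in tokenizer; a spaCy-based lemmatizer could be plugged in here."""
    return re.findall(r"[a-z]+", text.lower())


def build_pipeline():
    """TF-IDF features followed by one binary classifier per category."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(tokenizer=tokenize, token_pattern=None)),
        ("clf", MultiOutputClassifier(LogisticRegression(max_iter=1000))),
    ])


# Tiny invented training set: one row of 0/1 flags per message.
train_messages = [
    "we need water and food",
    "water supply is cut, please send water",
    "people trapped under a collapsed building",
    "building collapsed, we need rescue",
    "food is running out in the shelter",
    "rescue teams needed, people trapped",
]
train_labels = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

model = build_pipeline()
model.fit(train_messages, train_labels)

# One row per input message, one column per category.
prediction = model.predict(["please send drinking water"])
print(dict(zip(CATEGORIES, prediction[0].tolist())))
```

Because preprocessing and classification live in one pipeline object, the same object can be fitted, saved, and later applied to raw text without repeating the preprocessing by hand.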
In the given use case, the web app is used by an emergency worker, who enters a new message and gets classification results across the categories.
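The request flow described here could look roughly like the sketch below. The route name `/classify`, the `query` parameter, the `create_app` factory, and the `KeywordStub` class are all hypothetical; the stub only mimics the `predict` interface of a trained model so that the example runs on its own, and it is not meant as a real classifier.

```python
from flask import Flask, jsonify, request

# Hypothetical subset of the category names.
CATEGORIES = ["water", "food", "search_and_rescue"]


class KeywordStub:
    """Placeholder with the same predict() shape as a trained pipeline."""

    def predict(self, texts):
        return [
            [int("water" in t.lower()), int("food" in t.lower()), 0]
            for t in texts
        ]


def create_app(model, categories):
    """Build a Flask app that classifies the message passed as ?query=..."""
    app = Flask(__name__)

    @app.route("/classify")
    def classify():
        query = request.args.get("query", "")
        flags = model.predict([query])[0]
        result = {name: int(flag) for name, flag in zip(categories, flags)}
        return jsonify(query=query, categories=result)

    return app


# Exercise the endpoint with Flask's built-in test client (no server needed).
app = create_app(KeywordStub(), CATEGORIES)
client = app.test_client()
response = client.get("/classify", query_string={"query": "We need water"})
print(response.get_json())
```

Passing the model into a factory function keeps the web layer independent of how the model was trained or loaded, so the stub can be replaced by the saved pipeline without touching the route.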
For more details, please refer to my GitHub repository for this project.