Churn prediction on Big Data
This project is the last in a series of seven projects delivered as part of the Udacity Data Science Nanodegree self-study program.
In this project, our goal is to predict user churn. We work for a fictitious company called “Sparkify”, a digital music service similar to Spotify or Deezer.
We are provided with two datasets in JSON format:
- a "tiny" one (still 128 MB)
- a full dataset (12GB)
Note that while the tiny dataset can easily be used on a single machine to start the exploration phase, this is not the case for the full one, which contains millions of events.
We are in a big data context, so we have to use appropriate tools.
To use the full dataset, one has to set up a Spark cluster in the cloud (AWS, IBM, or another provider).
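As a rough illustration of this workflow, the sketch below first peeks at a few records with the standard library and then reads the file with Spark. It is not the project's actual code: it assumes PySpark is installed and that the data is newline-delimited JSON (one event per line), and the function names `peek_fields` and `load_events` and the app name `sparkify-churn` are hypothetical.

```python
import json
from itertools import islice


def peek_fields(path, n=5):
    """Return the sorted field names found in the first n records.

    Assumes one JSON object per line (an assumption about the file layout).
    """
    fields = set()
    with open(path, encoding="utf-8") as fh:
        for line in islice(fh, n):
            line = line.strip()
            if line:
                fields.update(json.loads(line).keys())
    return sorted(fields)


def load_events(path, master=None):
    """Read the event log into a Spark DataFrame.

    Pass master="local[*]" to run on a single machine; leave it as None
    on a cluster so that the cluster configuration decides.
    """
    # Imported lazily so that peek_fields works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("sparkify-churn")  # hypothetical name
    if master is not None:
        builder = builder.master(master)
    spark = builder.getOrCreate()
    # spark.read.json expects newline-delimited JSON by default.
    return spark.read.json(path)
```

On a laptop, something like `load_events("tiny.json", master="local[*]").printSchema()` (with a placeholder file name) would show the inferred schema of the tiny dataset. On a cluster, the same function could be called with the full dataset's path and no `master` argument.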
You can find my blog post about this project here.
For more details, please refer to my GitHub repository for this project.