
Place du Tertre, Paris
Credit: Artiom Vallat - Unsplash

Paris, with more than 2,000 years of history [1], will host the 2024 Olympic Games and, even though it was the scene of a recent worldwide drama with the Notre-Dame fire, remains one of, if not the, most visited cities in the world (and I am not saying that because I am French!) [2][3].

If you are dying to visit the Eiffel Tower, stroll through the artistic districts of Montmartre or even sail on a barge on the Seine, here is some information that might be useful to you.

To visit Paris you can of course look for accommodation, such as a hotel, on your own. But in recent years an online service has grown considerably: Airbnb.

Evolution Airbnb places to rent in Paris
Evolution of the number of Airbnb places to rent in Paris since 2008 (note: this information has been extracted from the data)
Poker player
Credit: Michal Parzuchowski - Unsplash

If you have followed tech news recently you have surely heard about resounding successes such as:

  • those of DeepMind with AlphaZero, an AI that masters chess, shogi and Go through self-play [4][5]
  • or Pluribus, an AI bot that masters 6-player Texas hold’em poker [6]!

The advances of AI are numerous and affect many fields (commerce, finance, medicine, etc.).

Machine Learning and Artificial Intelligence are no longer a passing trend but a real field that has developed quickly in recent years:

Google Trends for "machine learning"
"Machine Learning" Google Trends worldwide over the past few years

Airbnb is a company that is very present and active in the machine learning sector. It does not hesitate to share some of its data, open-source the tools it has developed such as Airflow [7], and regularly publish posts on its technical blog [8] to share its thoughts and research areas.

In this post I propose to explore Airbnb’s data for the city of Paris, discover what we can extract from it, and see to what extent we can train algorithms to learn from these data.

NOTE: Everything that follows is based on personal work done as part of a project during my self-training via the Udacity Data Science nanodegree. The comments, interpretations and conclusions are therefore my own and are my sole responsibility. This work should not be considered for anything other than its learning value.

ACKNOWLEDGEMENT: the data come from Inside Airbnb and can be collected from here. I took the dataset from July 2019, so everything that follows is based on that period.


Let’s get this party started!

The remainder of this post is organized into two topics:

  • The first one is related to data analysis and will aim to answer questions from this scenario:

    Let’s put ourselves in the shoes of a tourist who would love to come and visit Paris. As a tourist/foreigner, an Airbnb user and also someone who does not know Paris at all, I would like to know the best period or the best place (or both) in order to plan my visit/holidays. As “best” can mean many things, here:

    1. I would like to have a pretty good idea of how the cost varies depending on the location
    2. Same for the visiting period
    3. I would be glad to know the odds of finding something available for a given period, so that I can perhaps adapt my holidays to the availability of places (there is an equivalent question for the homeowner who would like to be sure to rent at the time that maximizes profit, and so maybe adapt the listing’s availability to the market as well).
  • The second one is more related to machine learning algorithms:

    Let’s see whether we are able to predict the price of a given listing with pretty good accuracy/confidence. Indeed, as a user who plans to rent out my home, I can spend plenty of time looking at other places and their prices to find the best yet still competitive price and make sure I will be chosen. Or, more user friendly: based on the information I provide, the system predicts the right price!

1. At the right place…

In the data, Airbnb listings come with latitude and longitude coordinates. When we plot them on a scatterplot, here is what we get:

Paris Airbnb listings per neighbourhood
Paris Airbnb listings per neighbourhood

NOTE: we can clearly see that there are no listings in some parts of neighbourhoods 12 and 16 (Reuilly and Passy respectively): these areas are woods (Vincennes, Boulogne). We can also make out the Seine, the river that crosses the city.
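
For the curious, here is a minimal sketch of how such a scatterplot can be produced with pandas and matplotlib (I assume the Inside Airbnb listings.csv file and its latitude, longitude and neighbourhood_cleansed columns):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Inside Airbnb listings file (column names assumed from the dataset)
listings = pd.read_csv("listings.csv")

fig, ax = plt.subplots(figsize=(10, 10))
# One colour per neighbourhood: plot each group separately
for name, group in listings.groupby("neighbourhood_cleansed"):
    ax.scatter(group["longitude"], group["latitude"], s=2, label=name)

ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Paris Airbnb listings per neighbourhood")
ax.legend(markerscale=4, fontsize=6)
plt.show()
```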

Price per neighbourhood
Mean real-estate price in euros per square meter per neighbourhood, according to a French study

This French study published in April 2019 [9] states that “the price of real estate has reached stratospheric levels, totally disconnected from residents’ incomes. But what are the prices, neighbourhood by neighbourhood?”

According to this study, top 5 is:
– 6e (Luxembourg) (13 944 €/m2)
– 7e (Palais-Bourbon) (13 569 €/m2)
– 4e (Hotel-de-Ville) (13 262 €/m2)
– 8e (Elysee) (12 436 €/m2)
– 16e (Passy) (11 753 €/m2)

Whereas the 19e and 20e (Buttes-Chaumont and Menilmontant respectively) have the “cheapest” prices.


So, does the neighbourhood have a real impact on the price?

Is it something that we can also observe in our data? Can we confirm the French study’s observations?

Paris Airbnb mean price per neighbourhood
Paris Airbnb listings mean price per neighbourhood

Interesting: the top 5 from the study is within the top 6 by mean price! Here are some other insights gathered from the data:

Paris Airbnb number of listings and their availability per neighbourhood
Paris Airbnb number of listings and their availability per neighbourhood

Elysee is by far the most expensive neighbourhood and the one with the most availability (perhaps because of the price?) even though it has few listings (only Louvre has fewer listings to offer).
10 neighbourhoods have a mean price above the overall mean (116€) while 10 are below, so it’s a 50-50 split.
It is however important to notice that only 2 neighbourhoods have a mean price below the median (80€): Menilmontant and Buttes-Chaumont. Same result as the study! Those 2 neighbourhoods have a lot of listings but little availability per year, 60 days on average.
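
As a side note, the aggregation behind these charts boils down to a pandas groupby (a sketch, assuming the price column comes as a string like "$80.00", which is how Inside Airbnb stores it):

```python
import pandas as pd

listings = pd.read_csv("listings.csv")

# Price is stored as a string such as "$1,250.00": strip symbols and cast to float
listings["price"] = (listings["price"]
                     .str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(float))

mean_price = (listings.groupby("neighbourhood_cleansed")["price"]
              .mean()
              .sort_values(ascending=False))
print(mean_price.head(6))  # the 6 most expensive neighbourhoods on average
```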

But of course it depends on the type of room I will rent, right?

In the Airbnb data, there are 3 types of room available to rent:

  • Entire home/apt
  • Private room
  • Shared room

What is their share? Where are they? Does it really change anything about the price?

Paris Airbnb listing location per room type
Paris Airbnb listing location per room type
Paris Airbnb listing mean price per room type
Paris Airbnb listing mean price per room type

86% of the listings are ‘Entire home/apt’ whereas less than 1% are ‘Shared room’.
This is interesting: choosing a private room rather than an entire apartment has little influence on the price. On the other hand, you can expect to cut the mean price in half by choosing a shared room. There are fewer than 500 shared rooms available but they can be found almost everywhere.

Where are the most and least expensive listings?

We have seen that shared rooms are cheaper, so in this section they are filtered out so that they do not dominate the least expensive listings.

Paris Airbnb listing locations, most and least expensive
Paris Airbnb listing locations, most and least expensive

OBSERVATIONS: depending on your budget, sleeping in Paris can really cost you a lot. The most expensive neighbourhoods are the ones in the center of the city; you will pay gradually less as you move further out, but in those neighbourhoods the availability rate is not very high, because you will mostly find apartments in which people live most of the time and only rent out when they are away from home.
Tip: you can expect to cut the mean price in half by choosing a shared room, and they can be found almost everywhere in the city, so you are not stuck in a specific place. So, ready to share your room?
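
To reproduce that kind of map, a hypothetical filtering step could look like the sketch below (the 10% quantile thresholds are my own illustrative choice, not necessarily the ones used above):

```python
import pandas as pd

listings = pd.read_csv("listings.csv")
listings["price"] = (listings["price"].str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False).astype(float))

# Drop shared rooms so they do not dominate the "cheapest" side of the map
no_shared = listings[listings["room_type"] != "Shared room"]

cheapest = no_shared[no_shared["price"] <= no_shared["price"].quantile(0.10)]
most_expensive = no_shared[no_shared["price"] >= no_shared["price"].quantile(0.90)]
```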

2. …and at the right time!

Choosing the right period

Are there some periods of time with higher/lower availability?

Paris Airbnb listing mean availability evolution over time
Paris Airbnb listing mean availability over time

Note: the vertical dashed red lines correspond to the start/end dates of French school holidays. I wanted to see whether availability was impacted or not. The blue ones simply mark the end of each month.
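
This kind of chart is built from the calendar.csv file of the dataset; here is a minimal sketch of the aggregation (I assume the Inside Airbnb columns date and available, the latter being 't'/'f'):

```python
import pandas as pd
import matplotlib.pyplot as plt

calendar = pd.read_csv("calendar.csv", parse_dates=["date"])

# 'available' is 't' or 'f' in Inside Airbnb files: turn it into 0/1
calendar["available"] = (calendar["available"] == "t").astype(int)

# Mean availability across all listings, day by day
availability = calendar.groupby("date")["available"].mean()

availability.plot(figsize=(12, 4), title="Mean availability of Paris listings over time")
plt.ylabel("share of available listings")
plt.show()
```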

OBSERVATIONS:

  • availability is very low at the beginning (July), but I guess this behaviour is expected as most of the listings have already found guests for the coming weeks (remember that the data are from July).
  • there is a peak in August, the highest availability of all. Again, this is the summer holidays and I guess a lot of people leave and then mark their home/apartment as available.
  • we then see the 'Back to school' effect in September, with another availability bump during the autumn holidays.
  • availability then keeps increasing, with a new peak just after the Christmas holidays, for New Year (want to see the fireworks at the Eiffel Tower?).
  • after that it goes back to a "normal" and flat level, perhaps because it is so far ahead that people simply leave their calendar as it was initially filled, without changing anything.

Note: it seems that school holidays do not really have a big impact.

Variation of price over time

Paris Airbnb listing mean price evolution over time
Paris Airbnb listing mean price evolution over time

OBSERVATIONS:

It is typically the law of supply and demand:

  • prices are lower in August (because there are a lot of places on offer and homeowners prefer offering a discount to keeping the apartment empty and earning nothing in the end).
  • there is a peak during the Christmas holidays, so even if more apartments are available, you will spend more money if you plan to come for the New Year fireworks at the Eiffel Tower.
  • the "small" recurrent peaks correspond to weekends: prices are higher during weekends. There are no small savings.

Is there a big difference between neighbourhoods?

Paris Airbnb listing mean price evolution over time per neighbourhood
Paris Airbnb listing mean price evolution over time per neighbourhood

There are real differences depending on the neighbourhood: Entrepôt shows high variations whereas Hôtel-de-Ville stays more or less at the same price over the year.

No matter the neighbourhood, they all show, at different price levels, the same peak in January for New Year.

Paris New Year fireworks
Paris New Year fireworks - Credit: Joe DeSousa - Unsplash

OBSERVATIONS:

If we leave aside the period after Christmas (prices keep increasing while availability is lower; I guess people mostly set their availabilities for 2019 but had not yet done the same for 2020), we can see that:

  • the 'Christmas effect' results in +6% on average on prices, with some neighbourhoods above that (Elysee, where you can find the Champs Elysees, is at +10% for example).
  • the 'New Year effect' has a huge impact on central neighbourhoods such as Elysee, Palais-Bourbon or Temple, with +17 to +19%!

And now it’s time to try to predict the price!

Crystal ball
Crystal ball - Credit: Marc Schulte - Unsplash

What kind of problem is it and how can we evaluate our performance?

Here our goal is to predict the price, which is a quantitative value; that is why we call this a regression problem (as opposed to classification problems, for which we predict a qualitative value, for example “is it a dog or a cat?”).

Dartboard
Credit: Engin Akyurt - Pexels

To evaluate the performance I will use a metric called RMSE, which stands for Root Mean Squared Error. Do not be frightened by this strange name: it is actually very simple and often used as a performance metric.

Basically here is how it works:

  • we predict the price: let’s call the result y_pred
  • we compare y_pred with the real price y and compute the difference y - y_pred: this is our error. Note that it can be positive or negative depending on our prediction (did we predict above or below the real price?)
  • as this error can be positive or negative, when summing all errors to see how wrong we are overall, we do not want negative errors to cancel out positive ones. That is why each error is squared, so that it is always a positive number.
  • those squared errors are summed, and the sum is divided by the number of elements for which we are making a prediction, giving a mean value of the overall error.
  • in the end we take the square root
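
In code this gives something like the tiny sketch below (scikit-learn's mean_squared_error can of course be used instead):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between real and predicted prices."""
    errors = np.asarray(y_true) - np.asarray(y_pred)   # can be positive or negative
    return np.sqrt(np.mean(errors ** 2))               # square, average, square root

# Example with three listings: real prices vs predictions
print(rmse([100, 80, 150], [110, 75, 140]))  # ~8.66
```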

One of its advantages (among others) is that RMSE gives more weight to, and therefore punishes, large errors (due to the squaring). You can then easily understand that the lower the RMSE, the better our model. The objective of our model will therefore be to minimize this value.

OK, so which model should we use?

Actually I will not pick one model at random; instead I will try a few:

  • a very “naive” one that returns, as its prediction, the mean price of the neighbourhood the listing belongs to
  • another very simple statistical model: a linear regression (it will be our reference model to compare against)
  • a model based on decision trees
  • a more complex model (but still based on trees) named XGBoost.

Why?

Naive model
The motivation for the naive model is to compare more complex models against something very simple that does not even require machine learning, and see how much better (or not…) we do. I could have used the mean price of the whole dataset, which would have been even more naive, but let’s challenge it a little since we have seen that the price varies from one neighbourhood to another.
As a customer I would expect a listing to cost somewhere around the mean of its neighbourhood. Of course many more features impact the price (I guess the number of beds, accommodates and so on are part of it), but let’s keep things simple, it’s a naive predictor after all…
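
To make it concrete, a hypothetical implementation of such a naive predictor could be as simple as this (a sketch, not the exact code from the project):

```python
class NaivePricePredictor:
    """Predicts the mean price of the neighbourhood the listing belongs to."""

    def fit(self, X, y):
        # X is expected to hold a 'neighbourhood_cleansed' column, y the prices
        self.means_ = y.groupby(X["neighbourhood_cleansed"]).mean()
        self.global_mean_ = y.mean()  # fallback for an unseen neighbourhood
        return self

    def predict(self, X):
        return (X["neighbourhood_cleansed"]
                .map(self.means_)
                .fillna(self.global_mean_)
                .values)
```
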
Linear Regression
It is a very basic model, but one that sometimes gives good results and could be a dark horse given its low computation time.
Decision tree
The cool thing with trees is that it is possible to plot them, and therefore explain them. Interpretability of results is one of the major obstacles and challenges of Machine Learning [10]. With decision trees we can do it, because the algorithm is just a sequence of binary decisions (if the variable is higher than this value, go right, else go left, and so on). By plotting the full tree we are able to understand how we arrived at a given prediction.
On top of that, we can also look at feature (i.e. parameter) importances and see which ones contribute the most to the predicted value.
Extreme Gradient Boosting (XGBoost)
XGBoost is also based on trees and has proven to be a very powerful machine learning technique, often used in Kaggle competitions.

If XGBoost is also based on trees, why try it?
Because I would like to try an ensemble method: the key principle is to build a lot of simple decision trees instead of one complex model that tries to fit the data very closely or even find relationships between columns, which would obviously be hard to understand.
Each tree, taken independently, is very simple, but the strength is that combining the predictions of all those basic estimators yields a stronger model, and therefore better predictions.
There are several ensemble techniques [11]. With XGBoost, training is sequential: trees are added over iterations in order to reduce the errors made by the combination of all the previous ones.
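
For the record, the skeleton used to compare such candidates typically looks like the sketch below (I keep only a handful of numeric features for illustration; the real preprocessing and feature set live in the project repository):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

listings = pd.read_csv("listings.csv")
listings["price"] = (listings["price"].str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False).astype(float))

# A few numeric features only, assumed present in the Inside Airbnb file
features = ["accommodates", "bathrooms", "bedrooms", "beds", "minimum_nights"]
X = listings[features].fillna(0)
y = listings["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear_reg": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=8, random_state=42),
    "xgboost": XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: RMSE = {np.sqrt(mean_squared_error(y_test, preds)):.2f}")
```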

As an example, here is what we can produce with a Decision Tree:

Decision Tree
We can plot a Decision Tree so that its results are easier to interpret

Note: we can specify the depth of the tree (i.e. the number of levels) and, in the leaves, the value corresponds to the predicted price.
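
Such a plot can be obtained with scikit-learn's plotting helper (a sketch reusing X_train/y_train from the previous snippet; export_graphviz is an alternative):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Keep the tree shallow so the plot stays readable; leaves show the predicted price
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

plt.figure(figsize=(16, 8))
plot_tree(tree, feature_names=list(X_train.columns), filled=True, rounded=True)
plt.show()
```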

The results (at least!)

Algorithms predictions

After a mandatory data preparation phase and a few small transformations to handle missing values, here are the first results I got:

Algorithms results
First 15 price predictions for each of the algorithms

Note: the 1st column is the real price whereas the following ones are the predicted prices for each of the tried algorithms.

On the 15 displayed rows, most of the time the XGBoost prediction is not very far from the truth.
BUT sometimes it is totally wrong, for example the 540€ prediction when the truth was 119€ (line 7).
It is interesting to note that DecisionTree (y_dt) and LinearRegression (y_linear_reg) are also wrong there, with big differences as well (though XGBoost remains the worst).

To go further and perhaps get a better general intuition of how good or bad those results are, I decided to classify the predictions into groups based on the error percentage between the prediction and the ground truth. The motivation is that being 40€ off is not the same thing when the actual price is 100€ as when it is 1000€.

Here are then the 6 categories:

  • 1-very accurate (prediction is within 2% of the truth)
  • 2-pretty good: 2 to 5%
  • 3-good: 5 to 10%
  • 4-acceptable: 10 to 20%
  • 5-bad: 20 to 40%
  • 6-awful: more than 40%
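
This classification boils down to a percentage-error computation followed by a pandas cut (a sketch with hypothetical variable names, y_test being the real prices and y_pred a model's predictions):

```python
import numpy as np
import pandas as pd

# Absolute error expressed as a percentage of the real price
error_pct = 100 * np.abs(y_test - y_pred) / y_test

bins = [0, 2, 5, 10, 20, 40, np.inf]
labels = ["1-very accurate", "2-pretty good", "3-good",
          "4-acceptable", "5-bad", "6-awful"]

classes = pd.cut(error_pct, bins=bins, labels=labels)
print(classes.value_counts(normalize=True).sort_index())
```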

Algorithms predictions classification

Algorithms predictions classes
Algorithms predictions have been classified into 6 groups

XGBoost is our best candidate for further tuning. We can see how it reduced the number of elements in class 6 (the worst one) compared to the other models.

Classes 1 to 4 account for 48.75% of the predictions with XGBoost. It means that roughly one time out of two, our best model is off by more than 20% of the real price.

I said earlier that with trees it is possible to analyze which features (i.e. columns in the data table) had the most importance in the tree building.
Here is an example for XGBoost:

Feature importances for XGBoost
Most important features XGBoost found

Note: a value of 0.04 can be read as “this feature explains 4% of the price”. For example, the number of bathrooms (in 4th position) explains 2.5% of a listing’s price.
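
Feature importances are directly exposed by a fitted tree-based model; here is a sketch reusing the models dict from the earlier comparison snippet:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Reuse the fitted XGBoost model from the earlier comparison sketch
xgb_model = models["xgboost"]

importances = pd.Series(xgb_model.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind="barh", title="XGBoost feature importances")
plt.show()
```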

Tuning phase

The next step is to select the best candidate and tune it.

Clockwork-cogs
Credit: Pixabay - Pexels

Machine Learning algorithms may have from one to many parameters (so-called hyper-parameters) that the Data Scientist can play with.
The tuning phase consists in changing one, a few or all of them and seeing whether the results improve.

As you can easily imagine, depending on the number of hyper-parameters this task can be very time consuming.
Moreover, we will never have the absolute guarantee that the hyper-parameter combination we end up with is THE best one; it will just be the best among all our tries. There are technical ways to optimize the search, but they are far beyond the scope of this post.

Fortunately (or not), XGBoost has a lot of hyper-parameters (that is one of its drawbacks), so this tuning phase can very quickly turn into a never-ending one…
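
A classic way to explore a (small) part of that hyper-parameter space is an exhaustive grid search with cross-validation; here is a sketch with an arbitrary grid (the techniques actually used are in the project repository):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.2],
}

search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # GridSearchCV maximizes, hence the negated RMSE
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```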


Anyway, in order to improve our good but not fully satisfying results, tuning was done on XGBoost with several techniques (they will not be detailed here but you can find them in the code repository [12]).

Tuning results

Several tuning techniques (hence several models) were tried. Here are the prediction classification results:

Results after XGBoost tuning
XGBoost tuned models results compared to previous models

In spite of all the effort put into parameter tuning, I was unable to significantly improve the results of the first try. Even though I found a combination that slightly improved the RMSE value, a more detailed analysis shows that the first XGBoost remains the best choice, as it is the one with which most of the prices are near the truth.

To improve results, here are some things one could do:

  • Try another model, for example CatBoost from the Russian company Yandex (since we have a lot of categorical features), which is said to require less parameter tuning.
  • Try to reduce the number of input features with dimensionality reduction techniques (for example PCA).
  • Perform feature engineering by building new features.
  • Create an ensemble of models, stack them, then average the results.

WRAP-UP

Based on all our observations of the provided data, we have seen that:

  • It will cost you roughly 116€/night on average, but this varies a lot depending on the neighbourhood where you plan to stay.
  • The most expensive neighbourhoods are the ones in the center of the city; you will pay gradually less as you move further out, but be careful because in those neighbourhoods the availability rate is not very high. You will mostly find apartments in which people live most of the time and only rent out when they are away from home.
  • If money is an important criterion for your visit, you can cut the mean price in half by choosing a shared room, and they can be found almost everywhere (with a few exceptions) in the city, so you are not stuck in a specific place.
  • There are a lot of places to rent in August and prices are lower, but this is probably due to:
    • the fact that the data are from July: homeowners may have offered discounted prices to attract website visitors
    • a lot of people living in Paris being on holiday somewhere else => more places, lower prices.
  • The other 2 periods with more availability are early October and early January, but for the latter it will cost you much more (+6% on average during the Christmas holidays, rising to as much as +19% for New Year’s Day depending on the neighbourhood).

Price prediction gave encouraging first results which still need to be improved through further investigation.

Even if it is quite frustrating not to have improved the results further, let’s recall that the first goal of this project was to show the ability to follow the Data Science process, and this is what has been done through the different steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Tuning and Evaluation.
And, to see the glass half full, we were able to build a model whose prediction is within 20% of the real price half of the time. The naive model reached that level of accuracy only 25% of the time, and linear regression 33% of the time.


Architectural design
Credit: Philipp Birmes - Pexels

The world of data is a fascinating one and you could never run out of things to try and interpretations to make.

What you have just read is only one example of what is possible; we could imagine a lot of other things to try:

  • data are available since 2015: one could grab the data for each year, keep the listings that exist across all years and analyze trends (seasonal, recurrent). Why not time-series forecasting?
  • we have all the reviews: one could use NLP (Natural Language Processing) to classify reviews as positive/negative, or try to detect recurring topics per neighbourhood…

So, are you ready to give it a try yourself? Collect the data, it’s here!

Author: nidragedd


If you would like to check out the whole project you can see it from my Github repository.
You can also have a look directly at the Jupyter notebooks if you prefer.


  1. See Paris History on Wikipedia 

  2. Paris is the most visited city according to Telegraph.co.uk website 

  3. Paris is the 2nd most visited city according to a ranking compiled by Mastercard Inc. 

  4. AlphaZero mastering games on DeepMind blog 

  5. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play on Science 

  6. AI-Bot masters Texas hold’em poker on Science 

  7. Airflow tool was born at Airbnb and is now under Apache Software Foundation 

  8. Airbnb technical blog about AI 

  9. Paris, what price for your neighbourhood? (in French, on seloger.com) 

  10. “Please, explain the black-box” by Olga Mierzwa-Sulima 

  11. Ensemble methods: bagging, boosting and stacking, understanding the key concepts of ensemble learning 

  12. Project code repository on Github 

Feel free to leave a (nice) comment if you want
