pydata archives - Snippets y Más

This is how Airbnb is really being used in Brussels.

Airbnb claims to be part of the "sharing economy" and to be disrupting the hotel industry. However, data shows that the majority of Airbnb listings in most cities are entire homes, many of which are rented all year round, disrupting housing and communities.

In this article, we will explore how Airbnb is being used in my home city: Brussels, Belgium.

Room Type

Airbnb hosts can list entire homes/apartments, private or shared rooms.

Depending on the room type, availability, and activity, an Airbnb listing can operate more like a hotel, disturb neighbours, take housing off the market, and even be illegal.

I first wanted to get an idea of the distribution of room types in the city.

Room type distribution for Airbnb listings in Brussels, BE

No surprise here. Most listings are for entire homes or apartments, which is what most people look for when on vacation. Now, let’s take a look at the average price for rooms in each category:

Avg price for each type of Airbnb room in Brussels, BE

At these prices, hotel rooms are not really attractive. It is interesting to see that the average price for a shared room is not much cheaper than the price for a private room.
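The two charts above boil down to a count and a mean per room type. Here is a minimal pandas sketch of that computation. The rows are made up for illustration, and the column names `room_type` and `price` are assumed to match the Inside Airbnb listings file (where the price may be stored as text such as "$90.00" and need cleaning first).

```python
import pandas as pd

# Made-up sample standing in for the real listings file;
# `room_type` and `price` are assumed column names.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Shared room", "Entire home/apt"],
    "price": [90.0, 110.0, 50.0, 60.0, 45.0, 100.0],
})

# Share of listings per room type (fractions that sum to 1).
room_type_share = listings["room_type"].value_counts(normalize=True)

# Average nightly price per room type.
avg_price = listings.groupby("room_type")["price"].mean()

print(room_type_share)
print(avg_price)
```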

Activity

Airbnb guests may leave a review after their stay, and these reviews can be used as an indicator of Airbnb activity. This is how most metrics are estimated.

The minimum stay, price and number of reviews have been used to estimate the occupancy rate, the number of nights per year and the income per month for each listing. Everything is in the companion dataset.
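To make the estimation concrete, here is a rough sketch of how such figures could be derived from reviews, minimum stay and price. The review rate (50% of guests leave a review), the average stay (3 nights) and the occupancy cap (70%) are assumptions chosen for illustration, as is the use of reviews per month as input. They are not necessarily the values behind the companion dataset.

```python
def estimate_activity(reviews_per_month, minimum_nights, price,
                      review_rate=0.5, avg_stay_nights=3.0, max_occupancy=0.7):
    """Estimate nights/year, occupancy rate and monthly income for one listing.

    review_rate, avg_stay_nights and max_occupancy are assumed parameters.
    """
    # Each review is assumed to stand for 1 / review_rate bookings.
    bookings_per_year = reviews_per_month * 12 / review_rate
    # A stay lasts at least as long as the listing's minimum stay.
    nights_per_booking = max(avg_stay_nights, minimum_nights)
    # Cap the estimate so heavily reviewed listings stay plausible.
    nights_per_year = min(bookings_per_year * nights_per_booking,
                          365 * max_occupancy)
    occupancy_rate = nights_per_year / 365
    income_per_month = nights_per_year * price / 12
    return nights_per_year, occupancy_rate, income_per_month


nights, occupancy, income = estimate_activity(
    reviews_per_month=1.0, minimum_nights=2, price=80.0)
print(nights, occupancy, income)
```

With these assumed parameters, one review per month translates into 72 booked nights a year.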

Airbnb activity in Brussels from Jan 1, 2010 to Apr 1, 2020

Airbnb activity in Brussels from Jan 1, 2019 to Apr 1, 2020

You can definitely see a steep decrease towards the end there: the effect of Covid-19!

Some interesting questions worth answering are:

Does the number of nights booked per year make it impossible for a listing to be used for residential housing?
And what is renting to a tourist full-time rather than a resident doing to our neighbourhoods and cities?
How does the income from Airbnb compare to a long-term lease?

I will answer these in a future post.

Availability

An Airbnb host can set up a calendar for their listing so that it is only available for a few days or weeks a year.

Other listings are available all year round (except for when they are already booked).

Entire homes or apartments that are highly available year-round for tourists probably don’t have the owner present, could be illegal, and, more importantly, are displacing residents.

Let’s see the distribution of availability, from 1 to 365 days a year:

Now, let’s use a pie chart to compare the share of listings with low (fewer than 90 nights/year) and high (more than 90 nights/year) availability:
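The low/high split can be sketched with pandas as follows. The values are invented, and `availability_365` is assumed to be the column holding the number of days per year a listing is open for booking.

```python
import pandas as pd

# Made-up values; `availability_365` is an assumed column name.
listings = pd.DataFrame({"availability_365": [10, 45, 90, 91, 200, 365]})

# Label each listing using the 90 nights/year threshold.
# A listing at exactly 90 is counted as "low" here, a choice made for this sketch.
listings["availability"] = listings["availability_365"].apply(
    lambda days: "high" if days > 90 else "low"
)

availability_counts = listings["availability"].value_counts()
print(availability_counts)
# With matplotlib installed, availability_counts.plot.pie() would draw the chart.
```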

Listings Per Host

Some Airbnb hosts have multiple listings. A host may list separate rooms in the same apartment, or multiple apartments or homes available in their entirety.

Hosts with multiple listings are more likely to be running a business, are unlikely to be living in the property, and are likely in violation of most short-term rental laws designed to protect residential housing.

Number of listings per unique host registered on Airbnb Brussels.
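Counting listings per host is a simple group-by. The sketch below uses made-up rows and assumes the listings file exposes `id` and `host_id` columns.

```python
import pandas as pd

# Made-up sample; `id` and `host_id` are assumed column names.
listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "host_id": [100, 100, 100, 200, 300, 300],
})

# Number of listings for each unique host.
listings_per_host = listings.groupby("host_id")["id"].count()

# How many hosts have 1, 2, 3, ... listings.
hosts_by_listing_count = listings_per_host.value_counts().sort_index()

# Share of all listings that belong to hosts with more than one listing.
multi_hosts = listings_per_host[listings_per_host > 1]
multi_listing_share = multi_hosts.sum() / len(listings)

print(hosts_by_listing_count)
print(multi_listing_share)
```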

As we can see, Airbnb is not necessarily being used as originally intended. If you want to know more about the concerns related to these practices, read here, here and here.

Get the data

All data is available on http://insideairbnb.com/get-the-data.html

The Python notebook is on https://github.com/sansagara/kedro-experiment

Below is the PDF printout of the Jupyter Notebook.

How Airbnb is really used in Brussels (download)

Quick Reference: Python's Scikit-learn

Scikit-learn is an open-source library for Python that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms through a unified interface.

A Basic Example

from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)

Loading the Data

Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that can be converted to numeric arrays, such as pandas DataFrames, are also accepted.
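As a small illustration of the DataFrame case, the sketch below passes a pandas DataFrame straight to a scikit-learn transformer. The column names and values are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented numeric data; any all-numeric DataFrame would do.
df = pd.DataFrame({
    "height_cm": [160.0, 170.0, 180.0, 190.0],
    "weight_kg": [55.0, 65.0, 80.0, 90.0],
})

# scikit-learn converts the DataFrame to a NumPy array internally.
scaled = StandardScaler().fit_transform(df)
print(type(scaled), scaled.shape)
```

The rest of this reference works with plain NumPy arrays, created as follows.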

import numpy as np
X = np.random.random((10,5))
y = np.array(['M','M','F','F','M','F','M','M','F','F'])  # one label per row of X
X[X < 0.7] = 0

Preprocessing the Data

Standardization

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

Normalization

from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)

Binarization

from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)

Encoding Categorical Features

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)

Imputing Missing Values

from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer, removed in scikit-learn 0.22
imp = SimpleImputer(missing_values=0, strategy='mean')  # works column by column
imp.fit_transform(X_train)

Generating Polynomial Features

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
poly.fit_transform(X)

Training and Test Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)

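Create the Model

Supervised Estimators

Linear Regression

from sklearn.linear_model import LinearRegression
# The `normalize` argument was removed in scikit-learn 1.2; scale features with StandardScaler beforehand if needed.
lr = LinearRegression()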

Support Vector Machines (SVM)

from sklearn.svm import SVC
svc = SVC(kernel='linear')

Naive Bayes

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

KNN

from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised Estimators

Principal Component Analysis (PCA)

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)

K Means

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)

Model Fitting

Supervised Learning

lr.fit(X, y)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)

Unsupervised Learning

k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)

Prediction

Supervised Estimators

y_pred = svc.predict(np.random.random((2,5)))
y_pred = lr.predict(X_test)
y_pred = knn.predict_proba(X_test)

Unsupervised Estimators

y_pred = k_means.predict(X_test)

Evaluate the Model's Performance

Classification Metrics

Accuracy Score

knn.score(X_test, y_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Classification Report

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Confusion Matrix

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

Regression Metrics

Mean Absolute Error

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]
mean_absolute_error(y_true, y_pred)  # y_pred must hold one prediction per value in y_true

Mean Squared Error

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

R² Score

from sklearn.metrics import r2_score
r2_score(y_true, y_pred)

Clustering Metrics

Adjusted Rand Index

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)

Homogeneity

from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)

V-measure

from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)

Cross-Validation

from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X, y, cv=2))
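For a self-contained run, the sketch below cross-validates the scaler plus KNN combination from the basic example on the full iris dataset. Wrapping both steps in a pipeline with `make_pipeline` is a choice made for this example, not part of the original cheat sheet. It keeps the scaler from seeing the held-out part of each fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# The scaler is re-fitted on the training part of every fold,
# so nothing from the held-out part leaks into preprocessing.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# One accuracy score per fold.
scores = cross_val_score(model, X_iris, y_iris, cv=4)
print(scores, scores.mean())
```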

Tune the Model

Grid Search

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn,param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization

from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn,
                             param_distributions=params,
                             cv=4,
                             n_iter=8,
                             random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)

Taken from DataCamp, where there is a very handy downloadable version to print and keep at hand!

Functional Programming in Python

As part of my collaboration with the PyData Panama Meetup group, I have put together a short presentation introducing the functional programming concepts that can be used in Python.


In addition, as a companion, I am attaching Steven Lott's excellent book "Functional Python Programming", published by Packt Publishing.

Functional Python Programming – Steve Lott
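As a small taste of the style, here is a sketch of three building blocks that introductions to functional programming in Python commonly cover: `map`, `filter` and `functools.reduce`. The example is illustrative and not taken from the slides or the book.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# map applies a function to every element without mutating the input list.
squares = list(map(lambda n: n * n, numbers))

# filter keeps only the elements for which the predicate is true.
even_squares = list(filter(lambda n: n % 2 == 0, squares))

# reduce folds a sequence into a single value, starting from 0.
total = reduce(lambda acc, n: acc + n, even_squares, 0)

print(squares, even_squares, total)
```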

Spark and Project Tungsten

Without a doubt, Spark is the new star of the Big Data scene. Spark's goal has always been to offer a single platform where users can get the best distributed algorithms for any data processing task. To do this, Spark relies on an abstraction called RDD, or Resilient Distributed Dataset. However, RDDs have been thoroughly improved by Project Tungsten. The benefits of Tungsten can be seen from Spark 1.6 onwards and, best of all, Cloudera Manager has made it available since version 5.8.

Panama PyData 2: Spark vs Pandas Dataframes

For those who attended the PyData Panama Meetup on September 24, I am sharing the presentation that accompanies my workshop: Pandas vs Spark Dataframes – An introduction to distributed architectures.

The IPython notebooks that accompanied the workshop are available in my GitHub repo: https://github.com/sansagara/Panama-PyData

I invite you to join the group: http://meetu.ps/e/C15vZ/xHDxP/f