{"tags":["artificial-intelligence","machine-learning","regression"],"id":"d035bc33cd37467db92e5b428a7565fd_1","article_id":"d035bc33cd37467db92e5b428a7565fd","article_version":1,"title":"Regression with Python, Keras and Tensorflow","content":"{\"markdown\":\"In this tutorial we are going to do a quick and dirty estimation of house prices based on a dataset from a Kaggle competition. Kaggle is the leading data science competition platform and provides a lot of datasets you can use to improve your skills.\\n\\nFor simplicity's sake, we will build a simple model to get us started and explore how to improve it in later articles. Before we start, download the following file, which contains the training dataset, the test dataset and a sample submission (in case you want to see how your model fares in comparison to others by submitting it to the competition on Kaggle).\\n\\n[Download the dataset](https://www.kaggle.com/c/5407/download-all)\\n\\n[Link to the competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview)\\n\\nStart your Jupyter Notebook, create and name a new kernel, and let's begin by importing the dependencies that we'll need.\\n\\n```python\\nimport pandas as pd\\nimport numpy as np\\nimport seaborn as sns\\nimport tensorflow as tf\\nfrom tensorflow.keras.models import Sequential\\nfrom tensorflow.keras.layers import Dense, Activation\\nprint(tf.__version__)\\n```\\n\\nNext, we need to import the dataset into our kernel. Pandas provides a handy `read_csv` function for importing CSV files.\\n```python\\nraw_dataset = pd.read_csv('./train.csv', skipinitialspace=True)\\ntest_dataset = pd.read_csv('./test.csv', skipinitialspace=True)\\n```\\n\\nLet's visualise the first few rows of both datasets with `raw_dataset.head()` and `test_dataset.head()`.\\n![](https://api.kauri.io:443/ipfs/QmPHmJkQDmHrS7Cs7XTUey7q74KBFG4mZbicm61FZmYKjQ)\\n\\nAs we can see, we have a lot of columns, which we'll call features, of different types (you 
can run `raw_dataset.dtypes` to verify each column's data type), but for this tutorial we will focus on a small subset of features.\\n\\nFirst, let's extract our `SalePrice` column, which will be our label or dependent variable (the one we want to estimate), and display its distribution.\\n\\n```python\\nlabels = raw_dataset['SalePrice']\\nsns.distplot(labels)\\n```\\n![](https://api.kauri.io:443/ipfs/QmZHSbHas8HmfqDgyLKrpqwHSvCj95w8DTFSuTsUu9DsnE)\\n\\nThe price distribution is heavily skewed, with a long tail of expensive houses to the right, and definitely not normally distributed. While we can train a model using the labels as they are, a more normally distributed target will make training easier.\\n\\n```python\\nlabels = np.log1p(raw_dataset['SalePrice'])\\nsns.distplot(labels)\\n```\\n![](https://api.kauri.io:443/ipfs/Qmb9iuLirTD9iWSPhakbJ5VuaapmeKoZDkWMscL3vJkd5L)\\n\\nMuch better. Let's just remember that our model will now estimate the log of the price, so we will need to convert predictions back using `np.expm1()`, the inverse of `np.log1p()`.\\n\\nWe are now ready to filter our datasets down to the columns we are interested in:\\n\\n```python\\ntrain_data = raw_dataset[[\\n 'MoSold',\\n 'YrSold',\\n 'OverallCond',\\n 'OverallQual',\\n 'LotArea',\\n 'YearBuilt',\\n 'TotalBsmtSF',\\n 'GrLivArea',\\n 'GarageCars',\\n 'Neighborhood'\\n]]\\ntest_data = test_dataset[[\\n 'MoSold',\\n 'YrSold',\\n 'OverallCond',\\n 'OverallQual',\\n 'LotArea',\\n 'YearBuilt',\\n 'TotalBsmtSF',\\n 'GrLivArea',\\n 'GarageCars',\\n 'Neighborhood']]\\ntrain_data.head()\\n```\\n![](https://api.kauri.io:443/ipfs/QmaHX4QqPXKYn1En9TgVmLmkKRQitLeDZ7C6Yn1zoRgAVu)\\n\\nMuch more manageable! We now have a couple of problems. First, some of the numeric columns actually represent categories, like `GarageCars` or `OverallQual`. Second, our model will only accept numeric data, so we will need to convert our qualitative data into numbers. 
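As a quick aside before we do that: earlier we took `np.log1p` of the labels, and `np.expm1` undoes that transform exactly, which is what will let us convert predictions back into dollars later. A minimal check, using made-up prices:

```python
import numpy as np

# hypothetical sale prices, purely for illustration
prices = np.array([34900.0, 163000.0, 755000.0])

logged = np.log1p(prices)      # what we train the model on
recovered = np.expm1(logged)   # converting predictions back

print(np.allclose(recovered, prices))  # True: the round trip is lossless
```
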
Let's first convert the first set to string.\\n\\n```python\\ntrain_data['MoSold'] = train_data['MoSold'].apply(str)\\ntrain_data['YrSold'] = train_data['YrSold'].apply(str)\\ntrain_data['OverallCond'] = train_data['OverallCond'].apply(str)\\ntrain_data['OverallQual'] = train_data['OverallQual'].apply(str)\\n# train_data['YearBuilt'].apply(str)\\n# train_data['GarageCars'].apply(str)\\ntest_data['MoSold'] = test_data['MoSold'].apply(str)\\ntest_data['YrSold'] = test_data['YrSold'].apply(str)\\ntest_data['OverallCond'] = test_data['OverallCond'].apply(str)\\ntest_data['OverallQual'] = test_data['OverallQual'].apply(str)\\n# test_data['YearBuilt'].apply(str)\\n# test_data['GarageCars'].apply(str)\\ntrain_data.dtypes\\n```\\n\\nIgnore the warnings for now (Pandas is complaining about us setting values on a slice of the original DataFrame); as you can see, the columns in question are no longer integers. For the second problem, we are going to use a technique called one-hot encoding, in which each value of a categorical column gets its own numeric column containing a 1 or a 0, depending on whether the row matches that value.\\n\\n```python\\none_hot_train = pd.get_dummies(train_data)\\none_hot_test = pd.get_dummies(test_data)\\n```\\nFinally, we will need to address the same distribution problem we had with `SalePrice`. For example, if we plot `sns.distplot(one_hot_train['GrLivArea'])`, we'll see a similar skew in the distribution. We could use the log of the value as we did before, but for the inputs we can use another technique. 
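That technique is standardization: subtract each column's mean and divide by its standard deviation, so values end up centred on 0 with unit spread. A minimal sketch on a single hypothetical column:

```python
import pandas as pd

# hypothetical living-area values, just for illustration
area = pd.Series([850.0, 1200.0, 1710.0, 2450.0])

standardized = (area - area.mean()) / area.std()

# the result has (approximately) zero mean and unit standard deviation
print(round(standardized.mean(), 6), round(standardized.std(), 6))
```
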
We'll extract the stats of each column and normalize the data based on the `mean` and `std` of each column.\\n\\n```python\\nstats = one_hot_train.describe().transpose()\\n\\ndef norm(x):\\n return (x - stats['mean']) / stats['std']\\n\\n# note that we reuse the training-set stats to normalize the test set\\nnormed_train = norm(one_hot_train)\\nnormed_test = norm(one_hot_test)\\n\\nnormed_train.head()\\n```\\n\\nLastly, we want to discard the normalized versions of the one-hot columns and keep them as plain 0/1 indicators, using the normalized values only for the continuous columns, for a stronger input signal.\\n\\n```python\\ninput_train = one_hot_train\\ninput_train['LotArea'] = normed_train['LotArea']\\ninput_train['TotalBsmtSF'] = normed_train['TotalBsmtSF']\\ninput_train['GrLivArea'] = normed_train['GrLivArea']\\ninput_train['GarageCars'] = normed_train['GarageCars']\\ninput_train['YearBuilt'] = normed_train['YearBuilt']\\ninput_test = one_hot_test\\ninput_test['LotArea'] = normed_test['LotArea']\\ninput_test['TotalBsmtSF'] = normed_test['TotalBsmtSF']\\ninput_test['GrLivArea'] = normed_test['GrLivArea']\\ninput_test['GarageCars'] = normed_test['GarageCars']\\ninput_test['YearBuilt'] = normed_test['YearBuilt']\\n```\\n\\nOur final input data should look like this:\\n![](https://api.kauri.io:443/ipfs/QmR3J77skdWhmcDU6pGGsMK8WV3W2GyQZt7hNuXWe2Qgcp)\\n\\nLet's save these datasets to pickle files, so we don't have to redo all of this preprocessing when we want to reuse the data.\\n\\n```python\\nimport pickle\\n\\nITERATION = \\\"1.\\\" # a prefix used to version the pickle files\\n\\npickle_out = open(f\\\"{ITERATION}labels.pickle\\\",\\\"wb\\\")\\npickle.dump(labels, pickle_out)\\npickle_out.close()\\n\\npickle_out = open(f\\\"{ITERATION}input_train.pickle\\\",\\\"wb\\\")\\npickle.dump(input_train, pickle_out)\\npickle_out.close()\\n\\npickle_out = open(f\\\"{ITERATION}input_test.pickle\\\",\\\"wb\\\")\\npickle.dump(input_test, pickle_out)\\npickle_out.close()\\n```\\nYou can later load the data back with:\\n```python\\nimport pickle\\npickle_in = open(\\\"../input/house-prices-pickles-1/1.labels.pickle\\\",\\\"rb\\\")\\nlabels = pickle.load(pickle_in)\\n```\\n\\nTime to build our model and train it!\\n\\n```python\\nmodel 
= Sequential()\\n\\nmodel.add(Dense(32, input_shape=input_train.shape[1:]))\\nmodel.add(Activation('sigmoid'))\\nmodel.add(Dense(1))\\nmodel.add(Activation('relu'))\\n\\nmodel.compile(\\n loss='mean_squared_error',\\n optimizer='adam',\\n metrics=['mean_squared_error','mean_absolute_error']\\n)\\n\\nmodel.fit(\\n input_train,\\n labels,\\n batch_size=32,\\n epochs=30,\\n validation_split=0.1,\\n verbose=1\\n)\\n```\\nFor each epoch, you'll see some stats. Since we fed in the log of the price, we'll want to focus on the `mean_absolute_error`. After 30 epochs it will be around `0.135`, which means each prediction should fall within roughly ±0.135 of the log of the actual price. For a $500,000 house we can calculate the corresponding range like so:\\n\\n```python\\nlogged_price = np.log(500000) # 13.122363377404328\\nlower_boundary = np.exp(logged_price - 0.135) # 436857.95584401704\\nupper_boundary = np.exp(logged_price + 0.135) # 572268.3921756567 \\n```\\n\\nThat's roughly 13% off; not perfect, but not bad either. 
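Because the error lives in log space, the ±0.135 band translates into a percentage band around the true price that is the same for every house, slightly asymmetric between undershooting and overshooting. A quick check of the figures above:

```python
import numpy as np

mae = 0.135  # the model's mean absolute error, in log space

# a prediction off by -mae undershoots the true price by this fraction...
undershoot = 1 - np.exp(-mae)
# ...and one off by +mae overshoots it by this fraction
overshoot = np.exp(mae) - 1

print(f"{undershoot:.1%} below to {overshoot:.1%} above")
```

That comes out to roughly 12.6% below to 14.5% above, which is where the "around 13%" figure comes from.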
The score is calculated on a small held-out subset of the training data, whose size we defined with our `validation_split` parameter.\\n\\nIt is now time to generate some results on our `test_dataset`!\\n\\n```python\\npredictions = np.expm1(model.predict(input_test))\\nsns.distplot(predictions)\\n```\\nUnfortunately, we won't be able to render the chart, as our model wasn't able to estimate a few values. A reasonable approach for now is to simply replace them with the mean of the dataset.\\n\\n```python\\npredictions = np.expm1(model.predict(input_test))\\ntest_dataset['SalePrice'] = predictions\\nresults = test_dataset[['Id','SalePrice']]\\nresults = results.fillna(np.expm1(labels.describe()['mean']))\\nresults.isna().sum()\\nresults.head()\\n```\\n\\nFinally, let's render the two distribution plots for a quick eye check on how our model performs :)\\n\\n```python\\nsns.distplot(results['SalePrice'])\\nsns.distplot(np.expm1(labels))\\n```\\n\\n![](https://api.kauri.io:443/ipfs/QmXMJ8MquB61jor9Y6G4WyLWmQJudE719ndWVMgBaNUrK1)\\n\\nThat's it! You've built your first model for estimating the price of real estate! The model clearly needs some work, but we'll cover that in the following articles. If you want to get ahead, try tweaking some of the parameters, such as increasing the number of epochs, pre-processing the data a bit differently or changing the structure of the model, and see if you can improve it yourself.\\n\\nAlso feel free to join the competition on Kaggle and see how your model fares against fellow data nerds!\\nIf you have any questions or spot any errors, please feel free to comment or submit an update to this article :)\\n\"}","author":"4cd5d72ffd950260e47f9e14f45811c5ccdd0283","timestamp":1570637557192,"attributes":{"background":"https://api.kauri.io:443/ipfs/QmNWabGoWpGE821uheTAYrRkR5Koo8e7zsBMWUnw7gTNST"}}