{"tags":["cryptocurrency","artificial-intelligence","machine-learning","tensorflow","python"],"id":"badf8853afb9404196bd8b5cbeec61a2_2","article_id":"badf8853afb9404196bd8b5cbeec61a2","article_version":2,"title":"Predict cryptocurrency prices with Tensorflow as binary classification problem","content":"{\"markdown\":\"## Introduction\\nIn this tutorial we'll go through the prototype for a neural network that will allow us to estimate future cryptocurrency prices as a binary classification problem, using Keras and TensorFlow as our main clairvoyance tools.\\n\\nWhile this is most likely not the best way to approach the problem (after all, investment banks invest billions in developing such algorithms), if we can get it right more than 55% of the time, we are in the money!\\n\\n## What we'll be doing\\n- Download data using Binance API\\n- Preprocess the data\\n- Train our model(s)\\n- Feature engineering\\n- Evaluate best performing models\\n\\n## Downloading data using Binance API\\nFor this example we'll download the maximum amount of data that can be fetched in a single call.
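A single klines call caps out at 1000 candles. Here is a hedged sketch (not from the original article) of how several calls could be chained together; the paging arithmetic is kept separate from the network call, and `fetch_page` is a hypothetical callable standing in for a real Binance request:

```python
INTERVAL_MS = 15 * 60 * 1000  # one 15-minute candle, expressed in milliseconds

def next_start(last_open_time_ms):
    # Binance identifies each candle by its open time in milliseconds;
    # the next page starts one interval after the last candle received
    return last_open_time_ms + INTERVAL_MS

def paginate(start_ms, end_ms, fetch_page):
    # fetch_page is a hypothetical callable: given a start time in ms it
    # returns a list of candles, each a list whose first element is the
    # candle's open time (the row shape the klines endpoint returns)
    rows = []
    cursor = start_ms
    while cursor < end_ms:
        page = fetch_page(cursor)
        if not page:
            break  # no more data available
        rows.extend(page)
        cursor = next_start(page[-1][0])
    return rows
```

In practice `fetch_page` would issue the same request as the `get_bars` function in this section, with `startTime` set to the cursor, and the collected rows would be concatenated into one dataframe before saving to CSV.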
If you want to train a better model and use it in the real world (which is not recommended, by the way; you will likely lose real money), I would suggest gathering more data using multiple calls.\\n\\n```python\\nimport requests\\nimport json\\nimport pandas as pd\\nimport datetime as dt\\n\\nSTART_DATE = '2019-01-01'\\nEND_DATE = '2019-10-01'\\nINTERVAL = '15m'\\n\\ndef parse_date(x):\\n # Binance expects timestamps in milliseconds\\n return str(int(dt.datetime.fromisoformat(x).timestamp()) * 1000)\\n\\ndef get_bars(symbol, interval):\\n root_url = 'https://api.binance.com/api/v1/klines'\\n url = root_url + '?symbol=' + symbol + '&interval=' + interval + '&startTime=' + parse_date(START_DATE) + '&endTime=' + parse_date(END_DATE) + '&limit=1000'\\n data = json.loads(requests.get(url).text)\\n df = pd.DataFrame(data)\\n df.columns = ['open_time',\\n 'o', 'h', 'l', 'c', 'v',\\n 'close_time', 'qav', 'num_trades',\\n 'taker_base_vol', 'taker_quote_vol', 'ignore']\\n df.drop(['ignore', 'close_time'], axis=1, inplace=True)\\n return df\\n\\nethusdt = get_bars('ETHUSDT', INTERVAL)\\nethusdt.to_csv('./data.csv', index=False)\\n```\\n\\nIn this simple piece of code we import the necessary packages, set up a couple of parameters (I picked a 15-minute interval, but you can pick a more granular one for higher-frequency trading), define a couple of convenience functions, and save the data to CSV for future reuse. This should be self-explanatory, but if something confuses you, please feel free to leave a comment asking for clarifications :)\\n\\n## Preprocessing the data\\n\\nAs prices over time are a form of sequential data, we are going to use an LSTM (Long Short-Term Memory) layer as the first layer in our net. We want to provide data as a sequence of events that predicts the price at time `t+n`, where `t` is the current time and `n` defines how far in the future we want to predict. To do so, we'll feed the data as time windows of length `w`.
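As a tiny illustration of the `t+n` target (a sketch with toy numbers, not the article's dataset), this is how a future close lines up against the current one; the real pipeline builds the same column with `shift(-LOOKAHEAD)`:

```python
import pandas as pd

LOOKAHEAD = 2  # toy lookahead: predict 2 time units ahead
closes = pd.DataFrame({'c': [10.0, 11.0, 12.0, 13.0, 14.0]})
# shift(-LOOKAHEAD) aligns each row with the close LOOKAHEAD steps ahead;
# the last LOOKAHEAD rows have no known future and become NaN
closes['future_value'] = closes['c'].shift(-LOOKAHEAD)
```

The trailing NaN rows are why the dataset effectively loses `LOOKAHEAD` usable rows at the end.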
It will all be clearer once we look at the code. Let's start by importing the required packages.\\n\\n```python\\nimport pandas as pd\\nimport numpy as np\\nimport seaborn as sns\\nimport random\\nfrom tensorflow.keras.models import Sequential\\nfrom tensorflow.keras.layers import Dense, LSTM, Dropout\\nfrom tensorflow.keras.callbacks import TensorBoard\\nimport time\\nimport matplotlib.pyplot as plt\\n```\\n\\nThis imports Pandas, NumPy, all the TensorFlow functions we need to train our model, and a couple of other useful packages.\\n\\nNext, we want to define some constants and load our data from CSV (in case you are writing the training code in a different file):\\n\\n```python\\nWINDOW = 10 # how many time units we use to evaluate the future value; each time unit is 15 minutes, so we are looking at 15 * 10 = 150 minutes of trading data\\nLOOKAHEAD = 5 # how far ahead we want to estimate whether the price is going to be higher or lower; in this case 5 * 15 = 75 minutes in the future\\nVALIDATION_SAMPLES = 100 # we want to validate our model on data that wasn't used for training; this establishes how many data points we use for that\\n\\ndata = pd.read_csv('./data.csv')\\ndata['future_value'] = data['c'].shift(-LOOKAHEAD) # this defines a new column future_value as the value of c LOOKAHEAD (5) time units in the future\\ndata.drop([\\n 'open_time'\\n], axis=1, inplace=True) # we don't care about the timestamp for predicting future prices\\n```\\n\\nLet's define a function that determines whether the future value is higher or lower than the current close price:\\n```python\\ndef define_output(last, future):\\n if future > last:\\n return 1\\n else:\\n return 0\\n```\\nIt simply sets the target to 0 if the future price is lower than or equal to the current close, and to 1 if it is higher.\\nNow let's define a function that creates the moving time windows we need to feed to our neural network:\\n\\n```python\\ndef sequelize(x):\\n data = x.copy()\\n buys = []\\n sells = []\\n data_length = len(data)\\n for index, row in data.iterrows():\\n if index <= data_length - WINDOW:\\n last_index = index + WINDOW - 1\\n rowset = data[index : index + WINDOW]\\n row_stats = rowset.describe().transpose()\\n last_close = rowset['c'][last_index]\\n future_close = rowset['future_value'][last_index]\\n rowset = 2 * (rowset - row_stats['min']) / (row_stats['max'] - row_stats['min']) - 1\\n rowset.drop(['future_value'], axis=1, inplace=True)\\n rowset.fillna(0, inplace=True)\\n category = define_output(last_close, future_close)\\n if category == 1:\\n buys.append([rowset, category])\\n elif category == 0:\\n sells.append([rowset, category])\\n min_len = min(len(sells), len(buys))\\n results = sells[:min_len] + buys[:min_len]\\n return results\\n\\nsequences = sequelize(data)\\n```\\nOK, that's a lot of stuff going on there. Let's look at it bit by bit:\\n\\n```python\\n data = x.copy() # copy the dataframe, just in case\\n buys = []\\n sells = []\\n data_length = len(data)\\n```\\n\\nHere we do some preliminary work: we copy the dataframe to ensure we don't overwrite it (which can be annoying if you are using a Jupyter Notebook, for example) and set up arrays for buys and sells, which we'll use to balance our data.\\n\\n```python\\n for index, row in data.iterrows():\\n if index <= data_length - WINDOW:\\n last_index = index + WINDOW - 1\\n rowset = data[index : index + WINDOW]\\n```\\nAs we iterate over each row in the dataset, whenever at least WINDOW rows remain ahead of the current index, we create a new slice of the dataset that is exactly WINDOW rows long.
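To make the slicing concrete, here is a minimal sketch (a toy frame with `WINDOW = 3`, not the article's data) of exactly which windows the `data[index : index + WINDOW]` expression yields:

```python
import pandas as pd

WINDOW = 3  # toy window size for illustration
data = pd.DataFrame({'c': [10, 11, 12, 13, 14]})

windows = []
for index in range(len(data)):
    # same guard as in sequelize: keep only full-length windows
    if index <= len(data) - WINDOW:
        windows.append(data[index : index + WINDOW]['c'].tolist())
# windows is now [[10, 11, 12], [11, 12, 13], [12, 13, 14]]
```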
Before we store this data in another array, we need to normalize it with the following code:\\n\\n```python\\nrow_stats = rowset.describe().transpose()\\nlast_close = rowset['c'][last_index]\\nfuture_close = rowset['future_value'][last_index] # we'll need to save this separately from the rest of the data\\nrowset = 2 * (rowset - row_stats['min']) / (row_stats['max'] - row_stats['min']) - 1 # min-max scale each column to the [-1, 1] range\\n```\\n\\nWe also want to remove future_value from our dataset and replace any possible NaN with 0s (not ideal, but good enough for our purpose):\\n\\n```python\\nrowset.drop(['future_value'], axis=1, inplace=True)\\nrowset.fillna(0, inplace=True)\\n```\\n\\nFinally, we want to ensure that our sells and buys are balanced; if one occurs more often than the other, our network will quickly become biased toward the majority class and stop providing reliable estimations:\\n\\n```python\\n if category == 1:\\n buys.append([rowset, category])\\n elif category == 0:\\n sells.append([rowset, category])\\n # the following 2 lines ensure that we have an equal number of buys and sells\\n min_len = min(len(sells), len(buys))\\n results = sells[:min_len] + buys[:min_len]\\n return results\\n```\\n\\nWe then run this function on our data: `sequences = sequelize(data)`.\\n\\nIt's also a good idea to shuffle our data so that our model is not influenced by the order the dataset happens to be sorted in. The following code shuffles the dataset, splits it into training and testing sets, and displays the distribution of buys vs sells in both datasets.
Feel free to rerun this snippet to ensure a balanced distribution of buys and sells:\\n\\n```python\\nrandom.shuffle(sequences)\\ndef split_label_and_data(x):\\n length = len(x)\\n data_shape = x[0][0].shape\\n data = np.zeros(shape=(len(x), data_shape[0], data_shape[1]))\\n labels = np.zeros(shape=(length,))\\n for index in range(len(x)):\\n labels[index] = x[index][1]\\n data[index] = x[index][0]\\n return data, labels\\n\\nx_train, y_train = split_label_and_data(sequences[: -VALIDATION_SAMPLES])\\nx_test, y_test = split_label_and_data(sequences[-VALIDATION_SAMPLES :])\\nsns.distplot(y_test)\\nsns.distplot(y_train)\\nlen(y_train)\\n```\\n\\n![Dataset result](https://api.kauri.io:443/ipfs/QmRwCWoLMrUkRxuQHAaJohdAXsFAz3ay23gd6mxRgdqSA2)\\n\\nAfter running the snippet a couple of times, you should get something like this, with an even split of buys and sells (left vs right) across both datasets.\\n\\n\\n## Training the model(s)\\n\\nWe are now ready to train the model, but since we have yet to explore which hyperparameters work best with our model and data, we'll try a slightly more complex approach.
First, let's define four hyperparameter arrays:\\n\\n```python\\nDROPOUTS = [\\n 0.1,\\n 0.2,\\n]\\nHIDDENS = [\\n 32,\\n 64,\\n 128\\n]\\nOPTIMIZERS = [\\n 'rmsprop',\\n 'adam'\\n]\\nLOSSES = [\\n 'mse',\\n 'binary_crossentropy'\\n]\\n```\\n\\nThen we'll iterate through the arrays to train a model with each combination of hyperparameters, so that we can later compare them using TensorBoard:\\n\\n```python\\nfor DROPOUT in DROPOUTS:\\n for HIDDEN in HIDDENS:\\n for OPTIMIZER in OPTIMIZERS:\\n for LOSS in LOSSES:\\n train_model(DROPOUT, HIDDEN, OPTIMIZER, LOSS)\\n```\\n\\nNow we need to define the `train_model` function that actually creates and trains each model:\\n\\n```python\\ndef train_model(DROPOUT, HIDDEN, OPTIMIZER, LOSS):\\n NAME = f\\\"{HIDDEN} - Dropout {DROPOUT} - Optimizer {OPTIMIZER} - Loss {LOSS} - {int(time.time())}\\\"\\n tensorboard = TensorBoard(log_dir=f\\\"logs/{NAME}\\\", histogram_freq=1)\\n\\n model = Sequential([\\n LSTM(HIDDEN, activation='relu', input_shape=x_train[0].shape),\\n Dropout(DROPOUT),\\n Dense(HIDDEN, activation='relu'),\\n Dropout(DROPOUT),\\n Dense(1, activation='sigmoid')\\n ])\\n model.compile(\\n optimizer=OPTIMIZER,\\n loss=LOSS,\\n metrics=['accuracy']\\n )\\n model.fit(\\n x_train,\\n y_train,\\n epochs=60,\\n batch_size=64,\\n verbose=1,\\n validation_data=(x_test, y_test),\\n callbacks=[\\n tensorboard\\n ]\\n )\\n```\\n\\nFor now this is a very simple model: an LSTM layer as the first layer, one Dense intermediate layer, and one Dense output layer of size 1 with `sigmoid` activation.
This layer will output the probability (ranging from 0 to 1) that a specific sequence of size `WINDOW` will be followed by a higher closing price after `LOOKAHEAD` intervals, where values near 0 indicate a high probability of a lower closing price and values near 1 a high probability of a higher closing price.\\n\\nWe are also adding a TensorBoard callback, which will let us see how each model performs over each training cycle (epoch).\\n\\nFeel free to run this code and then launch TensorBoard from your terminal: `tensorboard --logdir=logs`\\n\\n## Feature engineering\\nThe best model should reach an accuracy on the validation data higher than 60%, which is already quite good. However, we can improve our model very quickly by extracting more data from our existing dataset. The process of extracting new features from existing ones is called `Feature Engineering`. Examples of feature engineering would be extracting a weekend boolean column from a date, or a country from a coordinate pair. In our case, we are going to add technical analysis data to our OHLC dataset.\\n\\nAt the top of your notebook or file, import the `ta` package: `from ta import *`.\\n\\nJust after loading the data from CSV, add the following line, which appends TA data to our existing dataset in the form of new columns:\\n```python\\ndata = pd.read_csv('./data.csv')\\n# add the following line\\nadd_all_ta_features(data, \\\"o\\\", \\\"h\\\", \\\"l\\\", \\\"c\\\", \\\"v\\\", fillna=True)\\ndata['future_value'] = data['c'].shift(-LOOKAHEAD)\\n```\\n\\nThat's it; in a few lines we have massively enriched our dataset.
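Under the hood, indicators like those `ta` adds are just functions of the OHLCV columns. As a minimal illustration of the idea with plain pandas (a sketch; `sma_3` is a made-up column name, not one `ta` produces), here is a single engineered feature, a 3-period simple moving average of the close:

```python
import pandas as pd

prices = pd.DataFrame({'c': [10.0, 11.0, 12.0, 13.0, 14.0]})
# a 3-period simple moving average of the close; the first two rows have
# no full window, so we fill the resulting NaNs with 0, as the article's
# pipeline does with fillna
prices['sma_3'] = prices['c'].rolling(3).mean().fillna(0)
```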
We can now run the model generator loop to figure out how our models perform with the new dataset. This will take quite a bit longer, but it should be worth the wait.\\n\\n![](https://api.kauri.io:443/ipfs/QmdvN4dQGfGqdhDxuaQWg7kzEvtneccnDrjUcTFa8XwYJY)\\n\\nA richer, more meaningful dataset should ensure a more accurate model, and in the image above we can clearly see how the richer dataset performs better than the simple one, with a validation accuracy hovering around the 80% mark!\\n\\n## Evaluating the best performing models\\nNow that we have some models that seem to perform nicely on paper, how do we evaluate which one should be used in a hypothetical trading system?\\n\\nThis can be quite subjective, but in my opinion a good approach is to look separately at the buys and sells from the known validation labels and plot the distribution of the corresponding predictions. Hopefully, for the buys, our model mostly predicts buys and only a few sells, and vice versa.\\n\\nLet's define a function that displays such a chart for each model:\\n```python\\ndef display_results(NAME, y_test, predictions):\\n plt.figure()\\n buys = []\\n sells = []\\n for index in range(len(y_test)):\\n if y_test[index] == 0:\\n sells.append(predictions[index])\\n elif y_test[index] == 1:\\n buys.append(predictions[index])\\n sns.distplot(buys, bins=10, color='green').set_title(NAME)\\n sns.distplot(sells, bins=10, color='red')\\n plt.show()\\n```\\n\\nLet's now call this function every time we finish training a model:\\n```python\\n model.fit(\\n x_train,\\n y_train,\\n epochs=60,\\n batch_size=64,\\n verbose=0,\\n validation_data=(x_test, y_test),\\n callbacks=[\\n tensorboard\\n ]\\n )\\n # after the model.fit call, add the following 2 lines\\n predictions = model.predict(x_test)\\n display_results(NAME, y_test, predictions)\\n```\\n\\nAs the different models train, we should now see images similar to the one below, where the buys are plotted in green (we want them on the right-hand side, clustered around 1) and the sells in red (clustered around 0, on the left). These should help us decide which model provides a more reliable estimation of future prices.\\n\\n![](https://api.kauri.io:443/ipfs/QmWJ4Zjwiwi5814oKRi7FF8tgEWQRXfC92FYGneX9qydmc)\\n\\n\\nAnd that's it, we now have a few prototypes to play with that provide a decent estimation of future prices.\\nAs an exercise, try the following:\\n- What happens if you increase the number of hidden layers in the network?\\n- What happens if your dataset is unbalanced?\\n- What happens if you increase the DROPOUT value?\\n- What happens if you test your best model on new data (e.g. by fetching a different time range from Binance)?\\n\\nIf you have any questions or suggestions, please feel free to comment below or suggest an update to this article :)\"}","author":"37648fc15a8365735289e002d65d44d80c505e8b","timestamp":1571673154731,"attributes":{"background":"https://api.kauri.io:443/ipfs/QmevA8kcH7ZfEVeepf72EvvUvmZBfUMZXuWhcC74XLrscB"}}