Normalizing Stock data for Machine Learning

When analyzing historical time frame data in machine learning it needs to be normalized. In this code example I will show how to get S&P data then convert it to a percent of daily increase/decrease as well as a logarithmic daily increase/decrease.

Colab can be found here – https://colab.research.google.com/drive/1Wvh7mRX0PUthBhXw1HD5tJAC8mQr6DYH?usp=sharing

The first part of this code will use yfinance as our datasource.

#we're going to use yfinance as our data source
!pip install yfinance --upgrade --no-cache-dir

import pandas as pd
import numpy as np
import yfinance as yf

Next we’re going to create a dataframe called df and download SPY data from 2000 to it. Finally we’ll print the result of df so you can get an idea of what is inside of it.

#Here we're creating a dataframe for spy data from 2000-current
df = yf.download('spy',
  start='2000-01-01',
  end='2020-08-21',
  progress=True)
  #dauto_adjust=True,
  #actions='inline',) #adjust for stock splits and dividends
#print the dataframe to see what lives in it
print(df)

We’re going to drop all the columns except Adj Close. Then we’ll rename it adj_close. Next we’ll create a column labeled simple_rtn. This is the daily simple return or percent increase/decrease. The next line of code creates a logarithmic increase/decrease. Logarithmic gives equal bearing to the Y axis and can be defined as follows, “A logarithmic price scale uses the percentage of change to plot data points, so, the scale prices are not positioned equidistantly. A linear price scale uses an equal value between price scales providing an equal distance between values.”

#only keep adj close
df = df.loc[:, ['Adj Close']]
df.rename(columns={'Adj Close':'adj_close'}, inplace=True)
#create column simple return
df['simple_rtn'] = df.adj_close.pct_change()
#create column logrithmic returns
df['log_rtn'] = np.log(df.adj_close/df.adj_close.shift(1))
print(df)

This next command just analyzes the data so you can spot check what you’ve created.

#here we can analyze our data
df.info()

This next section describes what the daily increase/decrease of the SPY looks like. You can see didn’t statistically relevant information about S&P here.

#get statistical data on the data frame
df.describe()

Next we can see a distribution of adjustable close, logarithmic return, and simple return.

#view chart of data to get an overview of what lives in the data
import matplotlib.pyplot as plt
df.hist(bins=50, figsize=(20,15))
plt.show()

This is all for data normalization. You can now apply different algorithmic analysis to the data.

Facebook Comments