# Normalizing Stock data for Machine Learning

When analyzing historical time frame data in machine learning it needs to be normalized. In this code example I will show how to get S&P data then convert it to a percent of daily increase/decrease as well as a logarithmic daily increase/decrease.

Colab can be found here

The first part of this code will use yfinance as our datasource.

``````#we're going to use yfinance as our data source

import pandas as pd
import numpy as np
import yfinance as yf
``````

Next we’re going to create a dataframe called df and download SPY data from 2000 to it. Finally we’ll print the result of df so you can get an idea of what is inside of it.

``````#Here we're creating a dataframe for spy data from 2000-current
start='2000-01-01',
end='2020-08-21',
progress=True)
#actions='inline',) #adjust for stock splits and dividends
#print the dataframe to see what lives in it
print(df)``````

We’re going to drop all the columns except Adj Close. Then we’ll rename it adj_close. Next we’ll create a column labeled simple_rtn. This is the daily simple return or percent increase/decrease. The next line of code creates a logarithmic increase/decrease. Logarithmic gives equal bearing to the Y axis and can be defined as follows, “A logarithmic price scale uses the percentage of change to plot data points, so, the scale prices are not positioned equidistantly. A linear price scale uses an equal value between price scales providing an equal distance between values.”

``````#only keep adj close
#create column simple return
#create column logrithmic returns
print(df)``````

This next command just analyzes the data so you can spot check what you’ve created.

``````#here we can analyze our data
df.info()
``````

This next section describes what the daily increase/decrease of the SPY looks like. You can see didn’t statistically relevant information about S&P here.

``````#get statistical data on the data frame
df.describe()``````

Next we can see a distribution of adjustable close, logarithmic return, and simple return.

``````#view chart of data to get an overview of what lives in the data
import matplotlib.pyplot as plt
df.hist(bins=50, figsize=(20,15))
plt.show()``````

This is all for data normalization. You can now apply different algorithmic analysis to the data.