80/20
Pareto relationships are one of those things that, once you learn what they are, you see everywhere. A Pareto relationship follows an 80/20 ratio: roughly 80% of the results come from 20% of the inputs. For example, in group projects for school, 20% of the members do 80% of the work. Similarly, in dating pools, 20% of the potential dates get 80% of the attention. In data analysis, the "data cleaning" process takes about 80% of the time while the actual analysis takes only about 20%.
Due to the tedious nature of data cleaning, data science tourists will half-ass it or avoid it altogether. However, doing the work forces you to better understand the data you are working with, which has its own benefits, such as surfacing problems in the collection process. Additionally, organizing data and filling gaps forces you to solve problems, which is the most effective way to improve your programming.
To provide some context, I have attached the data I used for my post The Minsky Moment below. Use it to follow along and try out some basic data cleaning and visualization techniques.
Normalizing
Perfectly formatted data without any problems only exists in theory land. Most of the time you will need to make sure each column is the right type. One of the first things I do is check the head of the dataframe and the types of the values in the columns to get a sense of the data.
import pandas as pd

spx = pd.read_csv('data_csv.csv')
print(spx.dtypes)
print(spx.head(5))
------------------------------Output-----------------------------------
Date object
SP500 float64
Dividend float64
Earnings float64
Consumer Price Index float64
Long Interest Rate float64
Real Price float64
Real Dividend float64
Real Earnings float64
PE10 float64
Unnamed: 10 float64
From object
dtype: object
Date SP500 Dividend Earnings Consumer Price Index \
0 1871-01-01 4.44 0.26 0.4 12.46
1 1871-02-01 4.50 0.26 0.4 12.84
2 1871-03-01 4.61 0.26 0.4 13.03
3 1871-04-01 4.74 0.26 0.4 12.56
4 1871-05-01 4.86 0.26 0.4 12.27
Long Interest Rate Real Price Real Dividend Real Earnings PE10 \
0 5.32 89.00 5.21 8.02 NaN
1 5.32 87.53 5.06 7.78 NaN
2 5.33 88.36 4.98 7.67 NaN
3 5.33 94.29 5.17 7.96 NaN
4 5.33 98.93 5.29 8.14 NaN
Most raw data comes in as plain text. Adjusting data types is a pretty straightforward process, and pandas makes it even easier with its built-in functions. The Date column in the dataframe is currently text (object) type. Converting it to a datetime type gives us more functionality, like quickly being able to parse out days, months, and years. To perform this change, just run the following code:
spx.Date = pd.to_datetime(spx.Date)
print(spx.dtypes)
------------------------------Output-----------------------------------
Date datetime64[ns]
SP500 float64
Dividend float64
Earnings float64
Consumer Price Index float64
Long Interest Rate float64
Real Price float64
Real Dividend float64
Real Earnings float64
PE10 float64
Unnamed: 10 float64
From object
dtype: object
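Once the Date column is a datetime, the .dt accessor exposes the individual components directly. Here is a quick sketch, just printing the parsed years and months from the converted column:
# with a datetime column, the .dt accessor pulls out the components
print(spx['Date'].dt.year.head())    # 1871, 1871, 1871, ...
print(spx['Date'].dt.month.head())   # 1, 2, 3, 4, 5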
N/A
Identifying and filling missing values is easy with pandas' built-in functions. While I did not have any N/A values in the dataset I used, they are a common enough occurrence that they need to be mentioned. If an N/A value is an error, it can be removed entirely. But if you need to fill those values, you could replace them with one of the measures of central tendency or some other calculation that makes sense for the context. Ultimately, as the analyst, you're going to have to make the call. A couple of functions I've found helpful for N/A identification and replacement are sketched below; see the pandas documentation for the full details.
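As a minimal sketch, using a hypothetical dataframe df that contains some missing values:
# count the missing values in each column
print(df.isna().sum())

# option 1: drop any rows that contain missing values
df_dropped = df.dropna()

# option 2: fill missing numeric values with the column mean
# (a median or a domain-specific calculation may make more sense)
df_filled = df.fillna(df.mean(numeric_only=True))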
Matplotlib
Plotting in Python or R is not as intuitive as it is in Excel or Tableau. However, getting the data into the correct format for plotting is relatively easy and is essentially the same process as setting up a pivot table in Excel. Structured data can be formatted in either long or wide form. Long data has one column containing the values and one column containing the context for each value. Wide data gives each variable its own column of values. This is easier to see with an example, so here is a quick one below.
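This is a hypothetical example (the table and its column names are made up just to show the two shapes). pd.melt reshapes wide data into long form, and pivot goes back the other way:
# wide form: each series gets its own column of values
wide_df = pd.DataFrame({
    'Year': [2016, 2017, 2018],
    'Dividend': [45.7, 48.9, 50.0],
    'Earnings': [106.3, 124.5, 132.4],
})

# long form: one column of values, one column of context
long_df = pd.melt(wide_df, id_vars='Year', var_name='Series', value_name='Value')
print(long_df)

# pivot back to wide form
back_to_wide = long_df.pivot(index='Year', columns='Series', values='Value')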
One of my professors said that the first thing you should do when you get a data set is plot it. Unless you are a math genius, it is much easier to identify trends or relationships when the data is presented visually.
Matplotlib is a Python package for data visualization. It is pretty easy to use, but I constantly forget the specific functions, so I often end up referencing the documentation or Googling. The package supports many different chart types with full control over the various chart elements, and the code is quite readable, basically English.
The first chart I wanted for The Minsky Moment was the distribution of annual returns for the S&P 500, to demonstrate the wide range of equity returns possible when investing in the United States. I had to calculate the annual returns from the monthly returns. From there it is easy to subset the dataframe to get it ready for plotting. Here's the code that I used to get a dataframe with each year and its annual return.
# subsetting the dataframe, calculating returns, and formatting for plotting
spx['Year'] = spx['Date'].dt.year  # parse the year out of the datetime Date column
small_spx = spx.loc[:, ('Date', 'SP500', 'Dividend', 'Year')]
small_spx['Adjusted'] = small_spx['SP500'] + small_spx['Dividend']
small_spx['% Change'] = small_spx['Adjusted'].pct_change()
small_spx['Annual Return (%)'] = small_spx['% Change'] + 1  # monthly growth factor, compounded per year later
print(small_spx.tail())
------------------------------Output-----------------------------------
Date SP500 Dividend Year Adjusted % Change \
1763 2017-12-01 2664.34 48.93 2017 2713.27 0.026863
1764 2018-01-01 2789.80 49.29 2018 2839.09 0.046372
1765 2018-02-01 2705.16 49.64 2018 2754.80 -0.029689
1766 2018-03-01 2702.77 50.00 2018 2752.77 -0.000737
1767 2018-04-01 2642.19 NaN 2018 NaN 0.000000
Annual Return (%)
1763 1.026863
1764 1.046372
1765 0.970311
1766 0.999263
1767 1.000000
# for loop to get the annual return for each year
years = small_spx['Year'].unique()
since_1900 = years[-119:]  # the last 119 years: 1900-2018

# separating the data by year
annual_rate = []
for year in since_1900:
    monthly_rate = small_spx.loc[small_spx['Year'] == year, ['Date', '% Change']]
    # annualizing the monthly returns by compounding them
    total = 1
    for index, row in monthly_rate.iterrows():
        total = total * (row['% Change'] + 1)
    annual_rate.append((total - 1) * 100)

# zipping the two lists together
combined_list = list(zip(since_1900, annual_rate))

# creating a dataframe from the list
final_df = pd.DataFrame(combined_list, columns=['Year', '% Return'])
final_df = final_df.round({'Year': 0, '% Return': 2})
print(final_df.head())
------------------------------Output-----------------------------------
Year % Return
0 1900 15.09
1 1901 15.34
2 1902 1.33
3 1903 -17.42
4 1904 23.70
The code above is pretty inefficient and isn't using pandas to its full potential. Coming back to it later, I found a way to use pandas' groupby functionality that is much more elegant from a coding perspective. This code replaces the second block above:
group_df = (small_spx[['Year', 'Annual Return (%)']].groupby(['Year']).prod() - 1) * 100
print(group_df.head())
------------------------------Output-----------------------------------
Annual Return (%)
Year
1871 6.382979
1872 7.400000
1873 -11.545624
1874 2.526316
1875 -4.106776
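Note that the groupby version covers the full 1871-2018 history, while the loop only looked at 1900 onward. To make the two match, you can slice the grouped result by its Year index, for example:
# restrict the grouped result to 1900 onward to match the loop version
print(group_df.loc[1900:].head())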
Now that the data is clean and ready to go, it's time for the easy part. Here is the code that I used to create the chart below. The ax object I created is necessary for manipulating the axes; if you want to set custom ticks, you're going to need it. So the only aspect of the visualization that requires any mental work is deciding what story you are going to tell.
import numpy as np
import matplotlib.pyplot as plt

# distribution of annual returns from 1900-2018
returns = final_df['% Return'].values
mean = round(np.average(returns), 2)
median = np.median(returns)

plt.figure(figsize=[10, 8])
plt.hist(returns, range=(-44, 43), bins=20, color='forestgreen', edgecolor='black')
plt.title("Annual SPX Returns 1900-2018")
plt.ylabel("Amount of Years")
plt.xlabel("Percentage Return")

# the ax object gives control over the axes, e.g. custom tick locations
ax = plt.subplot()
ax.set_xticks([-40, -30, -20, -10, 0, 10, 20, 30, 40])

# mean and median lines
plt.axvline(mean, color='red')
plt.axvline(median, color='darkblue')
plt.text(-5, 15.2, 'Mean: {:.2f}'.format(mean))
plt.text(13, 15.2, 'Median: {:.2f}'.format(median))

plt.savefig('1900-2018.png')
plt.show()

I will be doing more of these in the future as I improve my skills. If you liked the post or found it helpful, please let me know! And if you have some constructive criticism, I'd like to hear that as well!
