Data Cleaning and Visualization Basics

80/20

Pareto relationships are one of those things that, once you learn what they are, you see everywhere. They follow an 80/20 ratio: 80% of the results come from 20% of the inputs. For example, in group projects for school, 20% of the members do 80% of the work. Similarly, in dating pools, 20% of the potential dates get 80% of the attention. In data analysis, the "data cleaning" process takes about 80% of the time while the actual analysis takes only about 20%.

Because data cleaning is tedious, data science tourists half-ass it or avoid it entirely. However, doing the work forces you to better understand the data you are working with, which has its own benefits, such as gaining insight into problems in the collection process. Additionally, organizing data and filling gaps forces you to solve problems, which is the most effective way to improve your programming.

To provide some context, I have attached below the data I used for my post The Minsky Moment. Use it to follow along and see some examples of basic data cleaning and visualization techniques.

Normalizing

Perfectly formatted, problem-free data only exists in theory land. Most of the time you will need to ensure that the data is the right type. One of the first things I do is check the head of the data frame and the types of the columns to get a sense of the data.

import pandas as pd

spx = pd.read_csv('data_csv.csv')
print(spx.dtypes)
print(spx.head(5))

------------------------------Output-----------------------------------
Date                     object
SP500                   float64
Dividend                float64
Earnings                float64
Consumer Price Index    float64
Long Interest Rate      float64
Real Price              float64
Real Dividend           float64
Real Earnings           float64
PE10                    float64
Unnamed: 10             float64
From                     object
dtype: object
         Date  SP500  Dividend  Earnings  Consumer Price Index  \
0  1871-01-01   4.44      0.26       0.4                 12.46   
1  1871-02-01   4.50      0.26       0.4                 12.84   
2  1871-03-01   4.61      0.26       0.4                 13.03   
3  1871-04-01   4.74      0.26       0.4                 12.56   
4  1871-05-01   4.86      0.26       0.4                 12.27   

   Long Interest Rate  Real Price  Real Dividend  Real Earnings  PE10  \
0                5.32       89.00           5.21           8.02   NaN   
1                5.32       87.53           5.06           7.78   NaN   
2                5.33       88.36           4.98           7.67   NaN   
3                5.33       94.29           5.17           7.96   NaN   
4                5.33       98.93           5.29           8.14   NaN  

Most raw data arrives as plain text. Adjusting data types is a straightforward process, and Pandas makes it even easier with its built-in functions. The Date column in the dataframe is text (object) type. Converting it to a datetime type gives us more functionality, like quickly being able to parse out days, months, and years. To perform this change, just run the following code:

spx.Date = pd.to_datetime(spx.Date)
print(spx.dtypes)
------------------------------Output-----------------------------------
Date                    datetime64[ns]
SP500                          float64
Dividend                       float64
Earnings                       float64
Consumer Price Index           float64
Long Interest Rate             float64
Real Price                     float64
Real Dividend                  float64
Real Earnings                  float64
PE10                           float64
Unnamed: 10                    float64
From                            object
dtype: object
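With the column converted, the `.dt` accessor makes those datetime components directly available. A minimal sketch, using a small stand-in series rather than the full spx dataframe:

```python
import pandas as pd

# stand-in for the Date column after pd.to_datetime
dates = pd.Series(pd.to_datetime(['1871-01-01', '1871-02-01']))

print(dates.dt.year)   # 1871, 1871
print(dates.dt.month)  # 1, 2
print(dates.dt.day)    # 1, 1
```

The same accessor works on the real Date column, which is how a Year column can be derived for the grouping later on.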

N/A

Identifying and filling missing values is easy with pandas' built-in functions. While I did not have any n/a values in the dataset I used, it is a common enough occurrence that it needs to be mentioned. If an n/a value is an error, it can be removed entirely. But if you need to fill those values, you could replace them with one of the measures of central tendency or some other calculation that makes sense for the context. Ultimately, as the analyst, you're going to have to make the call. Here are a couple of pandas functions that I've found helpful for identifying and replacing n/a values.
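A minimal sketch of those functions in action, on a made-up series with a gap (assumed data, not from the Minsky Moment set):

```python
import pandas as pd

s = pd.Series([4.44, None, 4.61])

print(s.isna())              # flags the missing value
print(s.dropna())            # removes it entirely
print(s.fillna(s.median()))  # or fill with a measure of central tendency
```

`dropna` and `fillna` return new objects by default, so assign the result back (or pass `inplace=True`) to keep the change.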

Matplotlib

Plotting in Python or R is not as intuitive as it is in Excel or Tableau. However, getting the data into the correct format for plotting is relatively easy, and it is the same process as using a pivot table in Excel. Structured data can be formatted in either long or wide form. Long data has one column containing the values and one column containing the context. Wide data gives each variable its own column of values. This is easier to see visually, so here are some examples.
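A minimal illustration of the two forms, using a couple of made-up rows in the style of the spx data (assumed values, for illustration only):

```python
import pandas as pd

# wide form: each series gets its own column
wide = pd.DataFrame({
    'Date': ['1871-01-01', '1871-02-01'],
    'SP500': [4.44, 4.50],
    'Dividend': [0.26, 0.26],
})

# long form: one column of values, one column of context
long = wide.melt(id_vars='Date', var_name='Series', value_name='Value')
print(long)
```

`melt` converts wide to long; `pivot` goes the other way, much like rearranging a pivot table in Excel.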

One of my professors said that the first thing you should do when you have a data set is plot it. Unless you are a math genius, it is much easier to identify trends or relationships when the data is presented visually.

Matplotlib is a Python package for data visualization. It is pretty easy to use, but I find myself constantly forgetting the specific functions, so I often end up referencing the documentation or Googling. The package has many different chart types with full control over the various chart elements. The code is quite readable and is basically English.

The first chart I wanted in The Minsky Moment was the distribution of annual returns for the S&P 500, to demonstrate the wide range of equity returns possible when investing in the United States. I had to calculate the annual return from the monthly returns. From there it is easy to subset the dataframe to get it ready for plotting. Here's the code that I used to get a data frame with each year and its annual return.

# subsetting data frame, calculating returns, formatting for plotting

spx['Year'] = spx.Date.dt.year  # Year parsed from the datetime column
small_spx = spx.loc[:, ('Date', 'SP500', 'Dividend', 'Year')]

small_spx['Adjusted'] = small_spx['SP500'] + small_spx['Dividend']
small_spx['% Change'] = small_spx['Adjusted'].pct_change()
small_spx['Annual Return (%)'] = small_spx['% Change'] + 1
print(small_spx.tail())

------------------------------Output-----------------------------------
           Date    SP500  Dividend  Year  Adjusted  % Change  \
1763 2017-12-01  2664.34     48.93  2017   2713.27  0.026863   
1764 2018-01-01  2789.80     49.29  2018   2839.09  0.046372   
1765 2018-02-01  2705.16     49.64  2018   2754.80 -0.029689   
1766 2018-03-01  2702.77     50.00  2018   2752.77 -0.000737   
1767 2018-04-01  2642.19       NaN  2018       NaN  0.000000   

      Annual Return (%)  
1763           1.026863  
1764           1.046372  
1765           0.970311  
1766           0.999263  
1767           1.000000  
# for loop to get the annual return for each year

years = small_spx['Year'].unique()
since_1900 = years[-119:]
# separating the data by year

annual_rate = []
for year in since_1900:
    monthly_rate = small_spx.loc[small_spx['Year'] == year, ['Date', '% Change']]
    
    # annualizing monthly returns
    total = 1
    for index, row in monthly_rate.iterrows():
        total = total * (row['% Change']+1)
    annual_rate.append(((total - 1)*100))    
    
# zipping the two lists together    
combined_list=list(zip(since_1900, annual_rate))

# creating a dataframe from the list
final_df = pd.DataFrame(combined_list, columns = ['Year', '% Return'])
final_df = final_df.round({'Year':0, '% Return':2})

print(final_df.head())
------------------------------Output-----------------------------------
   Year  % Return
0  1900     15.09
1  1901     15.34
2  1902      1.33
3  1903    -17.42
4  1904     23.70

The code above is pretty inefficient and doesn't use pandas to its full potential. Going back to this later, I found a way to use the pandas functions that is much more elegant from a coding perspective. This code replaces the second block above:

group_df = (small_spx[['Year', 'Annual Return (%)']].groupby('Year').prod() - 1) * 100
print(group_df.head()) 
------------------------------Output-----------------------------------
      Annual Return (%)
Year                   
1871           6.382979
1872           7.400000
1873         -11.545624
1874           2.526316
1875          -4.106776

Now that the data is clean and ready to go, it's time for the easy part. Here is the code that I used to create the chart below. The ax object is necessary for manipulating the axes; if you want to set custom ticks, you'll need this part. So the only aspect of the visualization that requires any mental work is deciding what story you are going to tell.

import numpy as np
import matplotlib.pyplot as plt

# distribution of returns from 1900-2018

returns = final_df['% Return'].values
mean = round(np.average(returns), 2)
median = np.median(returns)

plt.figure(figsize=[10, 8])
plt.hist(returns, range=(-44, 43), bins=20, color='forestgreen', edgecolor='black')
plt.title("Annual SPX Returns 1900-2018")
plt.ylabel("Number of Years")
plt.xlabel("Percentage Return")

ax = plt.subplot()
ax.set_xticks([-40, -30, -20, -10, 0, 10, 20, 30, 40])

# mean and median lines
plt.axvline(mean, color='red')
plt.axvline(median, color='darkblue')
plt.text(-5, 15.2, 'Mean: {:.2f}'.format(mean))
plt.text(13, 15.2, 'Median: {:.2f}'.format(median))

plt.savefig('1900-2018.png')
plt.show()

I will be doing more of these in the future as I improve my skills. If you liked the post or found it helpful, please let me know! And if you have some constructive criticism, please let me know as well!
