Kinds of Correlation in Python with Pandas
And which one to use when.
Correlation in HR Analytics
When we tell stories about data, correlation helps drive the plot. Correlation can help the heroes of our story choose which path to take because correlation closes some doors of enquiry and opens others.
As data analysts, we are interested in correlation for two reasons:
- Identifying possible causation - who are the actors with agency in our story?
- Predictability of future events - how is the story going to end?
In Python, Pandas is the most commonly used library for calculating correlation, and it gives us the coefficient we need in a number of useful ways.
Pandas supports three correlation coefficients (Pearson, Kendall and Spearman), and in this article I am going to give a short rundown of the differences and explain why and when we should use each one.
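As a quick orientation, here is a minimal sketch of how the three coefficients can be requested through the `method` argument of `DataFrame.corr`. The column names and values are made up for illustration, and `method="kendall"` assumes SciPy is installed alongside Pandas.

```python
import pandas as pd

# Illustrative, made-up HR data (not taken from a real dataset).
df = pd.DataFrame(
    {
        "age": [23, 31, 38, 45, 52, 60],
        "service_years": [1, 4, 6, 12, 15, 30],
    }
)

# DataFrame.corr returns a correlation matrix; `method` picks the coefficient.
# Note: method="kendall" relies on SciPy being available.
results = {
    method: df.corr(method=method).loc["age", "service_years"]
    for method in ("pearson", "kendall", "spearman")
}

for method, value in results.items():
    print(f"{method:>8}: {value:.3f}")
```

In this toy data every older employee also has longer service, so the two rank-based coefficients come out at 1, while Pearson is lower because the relationship is not a straight line.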
Pearson Correlation Coefficient
This correlation is bivariate and looks at the linear relationship between two continuous variables.
Use when you:
- have pairs of variables which are continuous (stored as integers or floats).
- want to know the direction of the relationship (increasing or decreasing).
- want to know the strength of the relationship.
Do not use:
- if your data is categorical.
- if you need to prove causation; a correlation coefficient on its own cannot do that.
Your data should:
- be bivariate and normally distributed.
- have no internal relationship between variables.
- have no chance of one case influencing another case.
- be linear in relationship.
Example of when Pearson's coefficient is a good choice:
- Age and length of service in an organisation: You can test the hypothesis that older people change jobs less often, after removing impossible age and service combinations (for example, someone under 20 cannot have more than 2 years of service). Pearson's is a good choice because both variables are continuous and, while age makes longer service possible, it does not cause it. If the correlation is there, we can explore further to identify possible causal factors.
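The age and service example could be sketched roughly as follows. The `staff` records are invented for illustration, and the filter simply encodes the "under 20 with more than 2 years of service" rule mentioned in the example.

```python
import pandas as pd

# Hypothetical employee records; the values are made up for illustration.
staff = pd.DataFrame(
    {
        "age": [19, 19, 24, 29, 35, 41, 48, 56],
        "service_years": [1, 7, 2, 5, 4, 12, 9, 20],
    }
)

# Drop impossible combinations, using the rule from the example:
# nobody under 20 can have more than 2 years of service.
impossible = (staff["age"] < 20) & (staff["service_years"] > 2)
clean = staff[~impossible]

# Series.corr uses Pearson's coefficient by default.
r = clean["age"].corr(clean["service_years"])

print(f"Rows kept: {len(clean)} of {len(staff)}")
print(f"Pearson r: {r:.3f}")
```

A value close to 1 or -1 suggests a strong linear relationship, but it is still worth plotting the two columns to check that a straight line is a sensible description of the data.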
Kendall Tau Correlation Coefficient
This coefficient measures correlation based on the ranks of the data and works best with monotonic relationships. Because it is rank-based, it is nonparametric and does not assume a normal distribution.
Use when you:
- have data which does not meet the criteria for a Pearson calculation.
- have a smaller sample size.
- have lots of tied ranks.
- want to analyse ordinal data.
Your data should:
- be ordinal or continuous.
- preferably be monotonic.
- be rankable; it does not have to fit a normal distribution, because Kendall's tau is nonparametric.
Example of when Kendall's rank is a good choice:
- Customer satisfaction and delivery time: Both variables are ordinal. Satisfaction is a score from a Likert scale (e.g. Very Satisfied, Somewhat Satisfied, Neutral…) and delivery time is chosen from a drop-down or similar (< 30 minutes, 30 minutes to 1 hour, 1 to 2 hours, etc.).
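Pandas cannot correlate text labels directly, so ordinal answers need to be turned into ordered codes first. The sketch below shows one way to do that with ordered categoricals. The category labels, their ordering and the survey responses are all assumptions made for illustration, and `method="kendall"` assumes SciPy is installed.

```python
import pandas as pd

# Assumed category labels, ordered from low to high (illustrative only).
satisfaction_levels = [
    "Very Dissatisfied", "Somewhat Dissatisfied", "Neutral",
    "Somewhat Satisfied", "Very Satisfied",
]
delivery_levels = [
    "< 30 Minutes", "30 Minutes to 1 Hour", "1 to 2 Hours", "> 2 Hours",
]

# Made-up survey responses.
orders = pd.DataFrame(
    {
        "satisfaction": [
            "Very Satisfied", "Very Satisfied", "Somewhat Satisfied", "Neutral",
            "Somewhat Satisfied", "Neutral", "Somewhat Dissatisfied",
            "Very Dissatisfied",
        ],
        "delivery_time": [
            "< 30 Minutes", "< 30 Minutes", "30 Minutes to 1 Hour",
            "30 Minutes to 1 Hour", "1 to 2 Hours", "1 to 2 Hours",
            "> 2 Hours", "> 2 Hours",
        ],
    }
)

# Convert labels to ordered categoricals, then to integer codes (0, 1, 2, ...).
sat_dtype = pd.CategoricalDtype(satisfaction_levels, ordered=True)
del_dtype = pd.CategoricalDtype(delivery_levels, ordered=True)
sat_rank = orders["satisfaction"].astype(sat_dtype).cat.codes
del_rank = orders["delivery_time"].astype(del_dtype).cat.codes

# Kendall's tau on the ordinal codes (requires SciPy).
tau = sat_rank.corr(del_rank, method="kendall")
print(f"Kendall tau: {tau:.3f}")
```

A code of -1 would mean a label was not found in the category list, so it is worth checking for that before trusting the result. Here the coefficient is negative: longer delivery times go with lower satisfaction.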
Spearman Rank Correlation
This coefficient tells you the strength and direction of correlation for nonparametric data that is monotonic.
Use when you:
- have data which does not fit the normal distribution.
- are analysing two sets of ordinal variables.
Example of when to use Spearman rank:
- Induction hours and year one performance review: When you have an ordinal description of time spent in induction (8 hours, 6 hours, 4 hours, 2 hours) and a performance review (excellent, good, satisfactory, poor, unacceptable), you are unlikely to get a normal distribution but would expect a monotonic increasing relationship.
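A minimal sketch of the induction example, using invented records and an assumed numeric mapping for the review grades. It also shows what Spearman's coefficient does under the hood: it is Pearson's coefficient applied to the ranks of the data.

```python
import pandas as pd

# Assumed ordering of review grades, worst to best (illustrative mapping).
review_order = {
    "unacceptable": 0, "poor": 1, "satisfactory": 2, "good": 3, "excellent": 4,
}

# Made-up records for ten new starters.
starters = pd.DataFrame(
    {
        "induction_hours": [2, 2, 4, 4, 4, 6, 6, 8, 8, 8],
        "review": [
            "poor", "unacceptable", "poor", "satisfactory", "good",
            "satisfactory", "good", "good", "excellent", "excellent",
        ],
    }
)
starters["review_score"] = starters["review"].map(review_order)

cols = ["induction_hours", "review_score"]

# Spearman's rho straight from Pandas.
rho = starters[cols].corr(method="spearman").loc["induction_hours", "review_score"]

# The same value by hand: rank both columns (ties share the average rank),
# then apply the default Pearson correlation to the ranks.
ranks = starters[cols].rank()
rho_via_ranks = ranks["induction_hours"].corr(ranks["review_score"])

print(f"Spearman rho:        {rho:.3f}")
print(f"Pearson on the ranks: {rho_via_ranks:.3f}")
```

The two printed values should agree, which is a handy sanity check when you are unsure how tied values are being handled.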
Each of the three coefficients has advantages and should be chosen based on the data you are using. Here is a simple cheat sheet to get you started.
| Sample Size | Parametric | Linear | Monotonic | Strong Coefficient |