R : A First Date with Data
An entry point to Data Analysis can be a (Regression Example). But first things first, data should be prepared before the analysis can be performed. This step is called preprocessing. Real-world (raw) data can be inconsistent or incomplete and can even contain errors. Through the following lines, we will try to walk through a simple data preprocessing task using a famous dataset.
A statistician with the name Francis Galton wanted to see if there was a connection between the heights of sons and their fathers. He measured the height of children and their parents across 205 families.
We will look closely at the data he used (which can be found here) , explore it and see what different information it contains and do some data preprocessing.
We will use R in this post, here is the Python version. So let´s dive in :).
We can read the data into a Dataframe which is a data structure with columns of potentially different types. It basically looks like a spreadsheet or SQL table. And since we have a tab-separated data file (data separated with a tab rather than a comma in “comma-separated values”), We need to specify it as follows:
df <- read.table("galton-data.tab", sep = '\t',header = TRUE)
Now that we have our data in a Dataframe, we can use head() to see the first 5 rows.
head(df)
family | father | mother | gender | height | kids | male | female | |
---|---|---|---|---|---|---|---|---|
1 | 1 | 78.5 | 67.0 | M | 73.2 | 4 | 1.0 | 0.0 |
2 | 1 | 78.5 | 67.0 | F | 69.2 | 4 | 0.0 | 1.0 |
3 | 1 | 78.5 | 67.0 | F | 69.0 | 4 | 0.0 | 1.0 |
4 | 1 | 78.5 | 67.0 | F | 69.0 | 4 | 0.0 | 1.0 |
5 | 2 | 75.5 | 66.5 | M | 73.5 | 4 | 1.0 | 0.0 |
We can see that there are 8 columns. For each of the adult children of one family, we have the data about their height and gender as well as their parents height and the number of siblings.
Now, we will try to find more about our DataFrame by asking some basic questions:
# # How many rows and columns are in the dataframe ?
dim(df)
(898, 8)
# What are the available data concerning family number 7
df[df$family == "7",]
family | father | mother | gender | height | kids | male | female | |
---|---|---|---|---|---|---|---|---|
23 | 7 | 74.0 | 68.0 | M | 76.5 | 6 | 1.0 | 0.0 |
24 | 7 | 74.0 | 68.0 | M | 74.0 | 6 | 1.0 | 0.0 |
25 | 7 | 74.0 | 68.0 | M | 73.0 | 6 | 1.0 | 0.0 |
26 | 7 | 74.0 | 68.0 | M | 73.0 | 6 | 1.0 | 0.0 |
27 | 7 | 74.0 | 68.0 | F | 70.5 | 6 | 0.0 | 1.0 |
28 | 7 | 74.0 | 68.0 | F | 64.0 | 6 | 0.0 | 1.0 |
# what is the maximum height of fathers
max(df$father)
78.5
# what is the maximum height of sons
max(df[df$male == "1","height"])
79
# what is the maximum height of mothers
max(df$mother)
70.5
# what is the maximum height of daughters
max(df[df$female == "1","height"])
70.5
# what is the mean height of fathers
mean(df$father)
69.23285077951
# what is the mean height of sons
mean(df[df$male == "1","height"])
69.2288172043011
# What is the number of observed families
length(unique(df$family))
197
From the study of Francis Galton we know that he gathered data across 205 families. This means that we have some missing data.
A part of data analysis is dealing with missing or incomplete data. We will try to take a look into the last rows of our dataframe just out of curiosity. For this we will use tail() and specify that we want the last 20 rows
tail(df,20)
family | father | mother | gender | height | kids | male | female | |
---|---|---|---|---|---|---|---|---|
879 | 199 | 64.0 | 64.0 | F | 65.0 | 7 | 0.0 | 1.0 |
880 | 199 | 64.0 | 64.0 | F | 64.0 | 7 | 0.0 | 1.0 |
881 | 199 | 64.0 | 64.0 | F | 64.0 | 7 | 0.0 | 1.0 |
882 | 199 | 64.0 | 64.0 | F | 60.0 | 7 | 0.0 | 1.0 |
883 | 200 | 64.0 | 63.0 | M | 64.5 | 1 | 1.0 | 0.0 |
884 | 201 | 64.0 | 60.0 | M | 66.0 | 2 | 1.0 | 0.0 |
885 | 201 | 64.0 | 60.0 | F | 60.0 | 2 | 0.0 | 1.0 |
886 | 203 | 62.0 | 66.0 | M | 64.0 | 3 | 1.0 | 0.0 |
887 | 203 | 62.0 | 66.0 | F | 62.0 | 3 | 0.0 | 1.0 |
888 | 203 | 62.0 | 66.0 | F | 61.0 | 3 | 0.0 | 1.0 |
889 | 204 | 62.5 | 63.0 | M | 66.5 | 2 | 1.0 | 0.0 |
890 | 204 | 62.5 | 63.0 | F | 57.0 | 2 | 0.0 | 1.0 |
891 | 136A | 68.5 | 65.0 | M | 72.0 | 8 | 1.0 | 0.0 |
892 | 136A | 68.5 | 65.0 | M | 70.5 | 8 | 1.0 | 0.0 |
893 | 136A | 68.5 | 65.0 | M | 68.7 | 8 | 1.0 | 0.0 |
894 | 136A | 68.5 | 65.0 | M | 68.5 | 8 | 1.0 | 0.0 |
895 | 136A | 68.5 | 65.0 | M | 67.7 | 8 | 1.0 | 0.0 |
896 | 136A | 68.5 | 65.0 | F | 64.0 | 8 | 0.0 | 1.0 |
897 | 136A | 68.5 | 65.0 | F | 63.5 | 8 | 0.0 | 1.0 |
898 | 136A | 68.5 | 65.0 | F | 63.0 | 8 | 0.0 | 1.0 |
With a quick look into the family column we find that the number 202 is missing from the count and, as a bonus, there appears to be a strange value “136A” in the end of our dataframe.
# What is the data type of each column? (numerical or character values?)
str(df)
'data.frame': 898 obs. of 8 variables:
$ family: Factor w/ 197 levels "1","10","100",..: 1 1 1 1 108 108 108 108 123 123 ...
$ father: num 78.5 78.5 78.5 78.5 75.5 75.5 75.5 75.5 75 75 ...
$ mother: num 67 67 67 67 66.5 66.5 66.5 66.5 64 64 ...
$ gender: Factor w/ 2 levels "F","M": 2 1 1 1 2 2 1 1 2 1 ...
$ height: num 73.2 69.2 69 69 73.5 72.5 65.5 65.5 71 68 ...
$ kids : int 4 4 4 4 4 4 4 4 2 2 ...
$ male : num 1 0 0 0 1 1 0 0 1 0 ...
$ female: num 0 1 1 1 0 0 1 1 0 1 ...
We notice that each of the height columns : “father”, “mother”, “height”(of the adult children) are numeric values. For the column: “kid” as it indicates the number of children it should be obviously an integer.
Let´s look now at the remaining two culumns with the data type: “Factor”.
Data like gender, blood types, ratings are usually called a categorical data, because they can take on only a limited, and usually fixed, number of possible values. So, in R, they are called “Factor” columns.
First, the column “gender” is supposed to have one of two values: F for female, or M for male. let´s make sure that it is the case.
unique(df$gender)
M F
The last one is “family”. This is the column which can be considered as a family identification number. It should be an integer as well. But, since it has the “136A” value, it was given the type “Factor”.
We want to check this column for any missing values:
fml_id <- unique(df$family)
nb_id <- as.factor(1:205)
setdiff(levels(nb_id), levels(fml_id))
'13' '50' '84' '111' '120' '161' '189' '202' '205'
Well, we can see that we have 9 values missing. If we include the “136A” family, we should have data for 197, which is the number we found earlier.
We will see if the “136A” is the same as the family number 136, to verify it was not a mistyping case.
df[df$family == "136",]
family | father | mother | gender | height | kids | male | female | |
---|---|---|---|---|---|---|---|---|
589 | 136 | 68.0 | 64.0 | M | 71.0 | 10 | 1.0 | 0.0 |
590 | 136 | 68.0 | 64.0 | M | 68.0 | 10 | 1.0 | 0.0 |
591 | 136 | 68.0 | 64.0 | M | 68.0 | 10 | 1.0 | 0.0 |
592 | 136 | 68.0 | 64.0 | M | 67.0 | 10 | 1.0 | 0.0 |
593 | 136 | 68.0 | 64.0 | F | 65.0 | 10 | 0.0 | 1.0 |
594 | 136 | 68.0 | 64.0 | F | 64.0 | 10 | 0.0 | 1.0 |
595 | 136 | 68.0 | 64.0 | F | 63.0 | 10 | 0.0 | 1.0 |
596 | 136 | 68.0 | 64.0 | F | 63.0 | 10 | 0.0 | 1.0 |
597 | 136 | 68.0 | 64.0 | F | 62.0 | 10 | 0.0 | 1.0 |
598 | 136 | 68.0 | 64.0 | F | 61.0 | 10 | 0.0 | 1.0 |
df[df$family == "136A",]
family | father | mother | gender | height | kids | male | female | |
---|---|---|---|---|---|---|---|---|
891 | 136A | 68.5 | 65.0 | M | 72.0 | 8 | 1.0 | 0.0 |
892 | 136A | 68.5 | 65.0 | M | 70.5 | 8 | 1.0 | 0.0 |
893 | 136A | 68.5 | 65.0 | M | 68.7 | 8 | 1.0 | 0.0 |
894 | 136A | 68.5 | 65.0 | M | 68.5 | 8 | 1.0 | 0.0 |
895 | 136A | 68.5 | 65.0 | M | 67.7 | 8 | 1.0 | 0.0 |
896 | 136A | 68.5 | 65.0 | F | 64.0 | 8 | 0.0 | 1.0 |
897 | 136A | 68.5 | 65.0 | F | 63.5 | 8 | 0.0 | 1.0 |
898 | 136A | 68.5 | 65.0 | F | 63.0 | 8 | 0.0 | 1.0 |
Well, we can clearly see that there are two different families. it might be a good idea to replce the “136A” with an acual number. We can choose the number “205” since it is next on the list.
levels(df$family) <- c(levels(df$family), "205")
df$family[df$family == "136A"] <- "205"
df[df$family == "205",]
family | father | mother | gender | height | kids | male | female | |
---|---|---|---|---|---|---|---|---|
891 | 205 | 68.5 | 65.0 | M | 72.0 | 8 | 1.0 | 0.0 |
892 | 205 | 68.5 | 65.0 | M | 70.5 | 8 | 1.0 | 0.0 |
893 | 205 | 68.5 | 65.0 | M | 68.7 | 8 | 1.0 | 0.0 |
894 | 205 | 68.5 | 65.0 | M | 68.5 | 8 | 1.0 | 0.0 |
895 | 205 | 68.5 | 65.0 | M | 67.7 | 8 | 1.0 | 0.0 |
896 | 205 | 68.5 | 65.0 | F | 64.0 | 8 | 0.0 | 1.0 |
897 | 205 | 68.5 | 65.0 | F | 63.5 | 8 | 0.0 | 1.0 |
898 | 205 | 68.5 | 65.0 | F | 63.0 | 8 | 0.0 | 1.0 |
Now that we are sure every value in “family” is numeric, we can convert it to be so, as follows:
df$family <- as.numeric(as.character(df$family))
# We can check the change we made as follows
str(df)
'data.frame': 898 obs. of 8 variables:
$ family: num 1 1 1 1 2 2 2 2 3 3 ...
$ father: num 78.5 78.5 78.5 78.5 75.5 75.5 75.5 75.5 75 75 ...
$ mother: num 67 67 67 67 66.5 66.5 66.5 66.5 64 64 ...
$ gender: Factor w/ 2 levels "F","M": 2 1 1 1 2 2 1 1 2 1 ...
$ height: num 73.2 69.2 69 69 73.5 72.5 65.5 65.5 71 68 ...
$ kids : int 4 4 4 4 4 4 4 4 2 2 ...
$ male : num 1 0 0 0 1 1 0 0 1 0 ...
$ female: num 0 1 1 1 0 0 1 1 0 1 ...
As a last step, we should make sure there is no missing data, a.k.a NA values.
any(is.na(df))
FALSE