R for Data Exploration
When I have to work with a new dataset, I find it worthwhile to spend some time at the beginning exploring it and becoming familiar with it. The first impulse, especially when the data is in tabular form, is to open it in a spreadsheet program. The problem is that beyond simply looking at the data, a spreadsheet quickly becomes cumbersome for actual exploration.
By contrast, working with R doesn't feel as easy at first, but once you have prepared the data (set each variable to the correct type, dealt with missing values, fixed variable names, etc.), getting insights and answering new questions takes no time.
In addition to some standard R functions, dplyr is the best tool for data wrangling I have found. It works by chaining commands like in Unix pipelines: you start with a data source and then you filter, modify and aggregate the data until you get it into the format you want. Instead of the pipe character, you chain operations with the %>% operator.
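As a small sketch of this chaining style (assuming dplyr is installed; mtcars is a dataset built into R and is not part of the post's data):

```r
library(dplyr)

# Start with a data source, then chain steps with %>%;
# each step receives the previous step's result as its first argument.
result <- mtcars %>%
  filter(cyl == 4) %>%              # keep only 4-cylinder cars
  summarise(avg_mpg = mean(mpg))    # aggregate to a single number

print(result$avg_mpg)
```

Each intermediate result is an ordinary data frame, so you can break the chain at any point and inspect it.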
For visualisations I like to use the lattice package. All its functions have the same structure: function(formula, dataset, additional_opts). There are other libraries like ggplot2 that are more flexible and create more professional-looking graphics, but they are not as fast and simple to use, which is what you need when exploring data.
Let's see how to use them with an example.
Exploring London Crime Data
We can get crime data in London from the London Datastore. First we take a look at the file and see that, as the website indicates, it's a well-formed CSV file, so we can easily load it into an R data frame:
crimes <- read.csv("boro.csv.gz", stringsAsFactors=FALSE)
We usually don't want strings to be converted to factors by default, so we disable that with stringsAsFactors=FALSE. We can always convert them later if needed. We can use dplyr's top_n() to get the 3 rows with the highest values in the last column of the data frame we have just loaded:
library(dplyr)
crimes %>% top_n(3)
  X201212 Theft...Handling Theft.Taking.Of.Pedal.Cycle Barking.and.Dagenham  X17
1  201212 Theft & Handling                 Other Theft          Westminster 1730
2  201301 Theft & Handling                 Other Theft          Westminster 1596
3  201302 Theft & Handling                 Other Theft          Westminster 1490
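Since we disabled automatic factor conversion, converting a character column to a factor later is a one-liner. A minimal sketch on a toy vector (the values here are hypothetical, not from the dataset):

```r
# Hypothetical character vector of borough names
boroughs <- c("Camden", "Westminster", "Camden")

# Convert to a factor when needed, e.g. before tabulating or modelling
borough_factor <- as.factor(boroughs)

levels(borough_factor)  # distinct values become the factor levels
```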
Using the function str() (for structure, not string) we can check the structure of the data:
'data.frame': 22409 obs. of 5 variables:
 $ X201212                    : int 201212 201212 201212 201212 201212 201212 201212 201212 201212 201212 ...
 $ Theft...Handling           : chr "Criminal Damage" "Violence Against The Person" ...
 $ Theft.Taking.Of.Pedal.Cycle: chr "Criminal Damage To Other Building" "Harassment" ...
 $ Barking.and.Dagenham       : chr "Barking and Dagenham" "Barking and Dagenham" ...
 $ X17                        : int 14 73 20 12 43 42 43 71 42 33 ...
It tells us that the first column has been interpreted as an int, when it should be a date. We should also rename the columns to make them easier to work with:
names(crimes) <- c("date", "type", "specific_type", "borough", "num")
crimes$date <- as.Date(paste(crimes$date, "01", sep=""), "%Y%m%d")
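As a sanity check on the date conversion, here is the same transformation applied to a single toy value:

```r
# "201212" (year-month) plus a fixed day "01" parses as a proper Date
raw <- 201212
parsed <- as.Date(paste(raw, "01", sep=""), "%Y%m%d")

print(parsed)         # 2012-12-01
print(class(parsed))  # "Date"
```

Pinning every observation to the first of the month keeps the monthly granularity of the data while giving us a real Date we can plot and filter on.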
We have already seen the dplyr function top_n(); we'll now use filter() to get the crimes in the Westminster borough that happened in November 2014, and select() to get the columns that we want. We can do that by chaining them:
westminster.2014.11 <- crimes %>%
  filter(borough == "Westminster" & date == "2014-11-01") %>%
  select(borough, type, num)
If we now inspect westminster.2014.11, we'll see that there are multiple rows with the same type, but what we want is to add up the rows that share a type. To do that we'll use the dplyr functions group_by() and summarise(), and we'll also sort the rows so the most common crimes appear at the top using arrange():
westminster.2014.11 <- westminster.2014.11 %>%
  group_by(borough, type) %>%
  summarise(num = sum(num)) %>%
  arrange(desc(num))
      borough                        type  num
1 Westminster            Theft & Handling 2352
2 Westminster Violence Against The Person  768
3 Westminster                    Burglary  301
4 Westminster                       Drugs  217
5 Westminster             Criminal Damage  179
6 Westminster                     Robbery  143
7 Westminster   Other Notifiable Offences   65
8 Westminster             Sexual Offences   58
9 Westminster             Fraud & Forgery    4
Ok, that's great. Now let's see how each type of crime evolved over time. First, we'll create a new variable that contains the crime history in Westminster:
westminster <- crimes %>%
  filter(borough == "Westminster") %>%
  group_by(date, type) %>%
  summarise(num = sum(num))

dim(westminster)  # 'westminster' has 216 rows and 3 columns
But understanding crime trends in Westminster just by looking at the numbers is not easy, so we'll visualise them using lattice. The most common function is xyplot(formula, dataset), which draws bivariate scatterplots and time series:
library(lattice)
xyplot(num ~ date, westminster)
num ~ date basically means: represent num as a function of date. The scatterplot looks OK, but with some extra options we can make a time-series that is much more understandable (and good-looking):
xyplot(num ~ date, westminster,
       groups = type,                   # print groups with colors
       auto.key = list(space="right"),  # legend at the right
       t = "l")                         # use lines instead of dots
This is better! We can also draw separate time series by adding a "condition" to the formula. As in conditional probability, the condition is expressed with a pipe, and it creates a trellis chart:
xyplot(num ~ date | type, westminster, t="l")
We can also print the distribution of the number of drug-related crimes with a histogram:
histogram(~ num, westminster %>% filter(type == "Drugs"),
          breaks = 6)  # number of bins
In this case the graphic is univariate, so ~ num means: represent the histogram as a function of num. Also notice that instead of creating a new variable to store the drug-related crimes, we have used dplyr to filter them on the fly. Neat!
Let's now compare crime data across Westminster, Camden and Islington. As before, dplyr and lattice make it very easy:
crime.boroughs <- crimes %>%
  filter(borough %in% c("Westminster", "Camden", "Islington")) %>%
  group_by(date, borough) %>%
  summarise(num = sum(num))

xyplot(num ~ date, crime.boroughs,
       groups = borough,
       auto.key = list(columns=3),
       t = "l")
We can also create a box plot to visualise the range in which the number of crimes varies over time in each borough:
bwplot(num ~ borough, crime.boroughs)
To finish, we'll plot the average number of crimes per month in each borough. To do so, we'll first add up all the crimes in each month per borough, then extract the month number from each date and average the number of crimes across the different years. We will use the dplyr function mutate() to create a new month column from the date column:
crime.by.date <- crimes %>%
  filter(borough %in% c("Westminster", "Camden", "Islington")) %>%
  group_by(borough, date) %>%
  summarise(num = sum(num))

crime.by.month <- crime.by.date %>%
  mutate(month = as.numeric(format(date, "%m"))) %>%
  group_by(borough, month) %>%
  summarise(num = mean(num))
Notice that we could have avoided creating the crime.by.date variable and joined all the expressions into a single pipeline, but breaking them into two expressions makes the code easier to understand.
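For reference, the single-pipeline version would simply chain the two halves together. A self-contained sketch, using a tiny made-up stand-in for the crimes data frame (the values are hypothetical):

```r
library(dplyr)

# Toy stand-in for the crimes data frame
crimes <- data.frame(
  date    = as.Date(c("2013-01-01", "2014-01-01", "2013-02-01")),
  borough = c("Camden", "Camden", "Camden"),
  num     = c(10, 20, 6)
)

# The two-step computation from the post, joined into one pipeline
crime.by.month <- crimes %>%
  group_by(borough, date) %>%
  summarise(num = sum(num)) %>%
  mutate(month = as.numeric(format(date, "%m"))) %>%
  group_by(borough, month) %>%
  summarise(num = mean(num))
```

Here January appears in two different years (10 and 20 crimes), so its average is 15, while February keeps its single value of 6.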
We can now visualise this data using a bar chart:
barchart(num ~ month, crime.by.month,
         groups = borough,            # print groups with colors
         horizontal = FALSE,          # bars are horizontal by default
         auto.key = list(columns=3),  # print legend
         ylim = c(0, 5500))           # set the vertical limits
I hope that next time you have to work with data you'll give R, dplyr and lattice a try. You won't go back to spreadsheets except for the easiest tasks!
You can find the script and data used for this post in github.com/danisola/london-crime