January 25, 2015 · data · programming · R

R for Data Exploration

When I have to work with a new dataset, I find it worthwhile to spend some time at the beginning exploring it and becoming familiar with it. The first reaction, especially if the data is in tabular form, is usually to open it with a spreadsheet program. The problem is that beyond simply looking at the data, a spreadsheet quickly becomes cumbersome for exploring it.

By contrast, working with R doesn't feel as easy at first, but once you have prepared the data (set each variable to the correct type, dealt with missing values, fixed variable names, etc.) getting insights and answering new questions takes no time.
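To give an idea of what that preparation looks like, here is a made-up example (the data frame and its columns are invented, not the crime data used below):

# A made-up data frame with the usual problems: bad names, wrong types, missing values
df <- data.frame(V1 = c("201501", "201502"),
                 V2 = c("12", NA),
                 V3 = c("Camden", "Islington"),
                 stringsAsFactors = FALSE)

names(df) <- c("date", "num", "borough")             # fix variable names
df$num <- as.numeric(df$num)                         # set the variable to the correct type
df <- df[!is.na(df$num), ]                           # drop rows with missing values
df$date <- as.Date(paste0(df$date, "01"), "%Y%m%d")  # turn YYYYMM strings into dates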

In addition to some standard R functions, dplyr is the best tool for data wrangling I have found. It works by chaining commands, much like a Unix pipeline: you start with a data source and then filter, modify and aggregate the data until you get it in the format you want. Instead of the pipe character, you chain operations with the %>% operator.
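For example, a toy pipeline over the built-in mtcars dataset (nothing to do with the crime data below) reads from top to bottom:

library(dplyr)

# Keep only the 6-cylinder cars, then compute their average fuel consumption
mtcars %>%
  filter(cyl == 6) %>%
  summarise(avg_mpg = mean(mpg))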

For visualisations I like to use the lattice package. All its functions have the same structure: function(formula, dataset, additional_opts). There are other libraries like ggplot2 that are more flexible and create more professional-looking graphics, but they are not as fast and simple to use, which is what you need when exploring data.
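For example, plotting fuel consumption against weight in mtcars takes a single call that follows that structure (again, just a toy example):

library(lattice)

# formula: mpg as a function of wt; dataset: mtcars
xyplot(mpg ~ wt, mtcars)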

Let's see how to use them with an example.

Exploring London Crime Data

We can get crime data for London from the London Datastore. First we take a look at the file and see that, as the website indicates, it's a well-formed CSV file, so we can easily load it into R as a crimes variable:

crimes <- read.csv("boro.csv.gz", stringsAsFactors=FALSE)  

We usually don't want the strings to be converted to factors by default, so we disable it; we can always convert them later if needed. We can use the dplyr function top_n() to get the 3 rows with the largest count (by default it selects by the last column) from the data frame we have just loaded:

library(dplyr)  
crimes %>% top_n(3)  
  X201212 Theft...Handling Theft.Taking.Of.Pedal.Cycle Barking.and.Dagenham  X17
1  201212 Theft & Handling                 Other Theft          Westminster 1730  
2  201301 Theft & Handling                 Other Theft          Westminster 1596  
3  201302 Theft & Handling                 Other Theft          Westminster 1490  

Using the function str() (for structure, not string) we can check the structure of the data:

str(crimes)  
'data.frame':    22409 obs. of  5 variables:  
 $ X201212                    : int  201212 201212 201212 201212 201212 201212 201212 201212 201212 201212 ...
 $ Theft...Handling           : chr  "Criminal Damage" "Violence Against The Person" ...
 $ Theft.Taking.Of.Pedal.Cycle: chr  "Criminal Damage To Other Building" "Harassment" ...
 $ Barking.and.Dagenham       : chr  "Barking and Dagenham" "Barking and Dagenham" ...
 $ X17                        : int  14 73 20 12 43 42 43 71 42 33 ...

It tells us that the first column has been interpreted as an int when it should be a date. We should also give the columns meaningful names to make them easier to work with:

names(crimes) <- c("date", "type", "specific_type", "borough", "num")  
# Append a day ("01") so the YYYYMM values can be parsed as proper dates
crimes$date <- as.Date(paste(crimes$date, "01", sep=""), "%Y%m%d")  
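We can quickly check that the conversion worked:

class(crimes$date)  # "Date"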

We have already seen the dplyr function top_n(); now we'll use filter() to get the crimes that happened in the borough of Westminster in November 2014, and select() to keep only the columns we want. We can do that by chaining them:

westminster.2014.11 <- crimes %>%  
  filter(borough == "Westminster" & date == "2014-11-01") %>%
  select(borough, type, num)

If we now inspect westminster.2014.11, we'll see that there are multiple rows with the same type; what we want is to add up the rows that share a type. To do that we'll use the dplyr functions group_by() and summarise(), and we'll also sort the rows with arrange() so that the most common crimes appear at the top:

westminster.2014.11 <- westminster.2014.11 %>%  
  group_by(borough, type) %>%
  summarise(num = sum(num)) %>%
  arrange(desc(num))
      borough                        type  num
1 Westminster            Theft & Handling 2352  
2 Westminster Violence Against The Person  768  
3 Westminster                    Burglary  301  
4 Westminster                       Drugs  217  
5 Westminster             Criminal Damage  179  
6 Westminster                     Robbery  143  
7 Westminster   Other Notifiable Offences   65  
8 Westminster             Sexual Offences   58  
9 Westminster             Fraud & Forgery    4  

OK, that's great. Now let's see how each type of crime evolved over time. First, we'll create a new variable that contains the crime history in Westminster:

westminster <- crimes %>%  
  filter(borough == "Westminster") %>%
  group_by(date, type) %>%
  summarise(num = sum(num))

dim(westminster) # 'westminster' has 216 rows and 3 columns  

But understanding crime trends in Westminster just by looking at the numbers is not easy, so we'll visualise them using lattice. The most common function is xyplot(formula, dataset), which draws bivariate scatterplots and time-series plots:

library(lattice)  
xyplot(num ~ date, westminster)  

Basic scatterplot

The formula num ~ date basically means: represent num as a function of date. The scatterplot looks OK, but with a few extra options we can turn it into a time-series plot that is much easier to read (and better-looking):

xyplot(num ~ date, westminster,  
       groups = type,                # print groups with colors
       auto.key=list(space="right"), # legend at the right
       t="l")                        # use lines instead of dots

Time-series with legend

This is better! We can also draw separate time-series by adding a "condition" to the formula. As in conditional probability, the condition is expressed with a pipe (|), and it produces a trellis chart:

xyplot(num ~ date | type, westminster, t="l")  

Trellis time-series

We can also plot the distribution of the number of drug-related crimes with a histogram:

histogram(~ num,  
  westminster %>% filter(type == "Drugs"),
  breaks = 6)         # approximate number of bins

Histogram

In this case the graphic is univariate, so ~ num means: plot the distribution of num. Also notice that instead of creating a new variable to store the drug-related crimes, we have used dplyr to filter them on the fly. Neat!

Let's now compare crime data across Westminster, Camden and Islington. As before, dplyr and lattice make it very easy:

crime.boroughs <- crimes %>%  
  filter(borough %in% c("Westminster", "Camden", "Islington")) %>%
  group_by(date, borough) %>%
  summarise(num = sum(num))

xyplot(num ~ date, crime.boroughs,  
  groups=borough, auto.key=list(columns=3), t="l")

Comparing crime in different boroughs

We can also create a box plot to visualise the range within which the number of crimes varies over time in each borough:

bwplot(num ~ borough, crime.boroughs)  

Comparing crime in different boroughs

Finally, we'll plot the average number of crimes per month in each borough. To do so, we'll first add up all the crimes per borough in each month, then extract the month number from each date and average the number of crimes across the different years. We'll use the dplyr function mutate() to create a new month column from the date:

crime.by.date <- crimes %>%  
  filter(borough %in% c("Westminster", "Camden", "Islington")) %>%
  group_by(borough, date) %>%
  summarise(num = sum(num))

crime.by.month <- crime.by.date %>%  
  mutate(month = as.numeric(format(date, "%m"))) %>%
  group_by(borough, month) %>%
  summarise(num = mean(num))

Notice that we could have avoided creating the crime.by.date variable and joined all the expressions into a single pipeline, but breaking them into two expressions makes the code easier to understand.
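For reference, the single-pipeline version would look something like this:

crime.by.month <- crimes %>%
  filter(borough %in% c("Westminster", "Camden", "Islington")) %>%
  group_by(borough, date) %>%
  summarise(num = sum(num)) %>%
  mutate(month = as.numeric(format(date, "%m"))) %>%
  group_by(borough, month) %>%
  summarise(num = mean(num))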

We can now visualise this data using a bar chart:

barchart(num ~ month, crime.by.month,  
         groups = borough,           # print groups with colors
         horizontal=FALSE,           # bars are horizontal by default
         auto.key=list(columns=3),   # print legend
         ylim = c(0, 5500))          # set the vertical limits

Visualising crime across months

I hope that next time you have to work with data you'll give R, dplyr and lattice a try. You won't go back to spreadsheets except for the easiest tasks!

You can find the script and data used for this post at github.com/danisola/london-crime.
