Intro to Dplyr on nycflights13 (Updated Jan 18)
Introduction
In this entry I’m going to work through a sequence of analyses that cover some of the main operations in dplyr. dplyr is an R package that simplifies, and thus standardizes, operations that you would generally perform on data frames using base R functions. It allows us to query a data frame in a way similar to what is possible in SQL.
The main operations in dplyr are:
- Select
- Filter
- Arrange
- Mutate
- Summarise
Let’s begin
We’re going to be looking at the nycflights13 dataset, which contains records of all the flights departing from NYC airports in 2013. Let’s take a look at the data:
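One way to do that, assuming the dplyr and nycflights13 packages are already installed, is roughly the following sketch:

```r
library(dplyr)
library(nycflights13)

# flights is a data frame (tibble) with one row per departing flight
glimpse(flights)

# Alternatively, print just the first few rows
head(flights)
```

glimpse() lists every column with its type and first few values, which is handy for a wide table like this one.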
To get started with the select function, let’s figure out a few ways to select dep_time, dep_delay, arr_time, and arr_delay from flights.
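Here is a sketch of three possible ways to do this. The object names (by_name, by_prefix, by_regex) are made up for illustration, and the helper-based versions assume that no other column in flights starts with dep_ or arr_.

```r
library(dplyr)
library(nycflights13)

# 1. Name each column explicitly
by_name <- select(flights, dep_time, dep_delay, arr_time, arr_delay)

# 2. Use the starts_with() helper to match on column-name prefixes
by_prefix <- select(flights, starts_with("dep_"), starts_with("arr_"))

# 3. Use matches() with a regular expression
by_regex <- select(flights, matches("^(dep|arr)_(time|delay)$"))

names(by_name)
names(by_prefix)
names(by_regex)
```

Naming the columns explicitly is the most readable option; the helpers become more useful when there are many similarly named columns.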
Let’s now try mutate. We can use this function to create new variables in our data frame.
Perhaps we can create a variable that will give us the total delay of the flights.
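A minimal sketch, assuming that "total delay" means the departure delay plus the arrival delay (both in minutes). The name flights_delay is just an illustrative choice.

```r
library(dplyr)
library(nycflights13)

# Assumption: total delay = departure delay + arrival delay (minutes)
flights_delay <- mutate(flights, total_delay = dep_delay + arr_delay)

# Check the new column next to the two it was built from
select(flights_delay, dep_delay, arr_delay, total_delay)
```

Flights with a missing dep_delay or arr_delay (for example, cancelled flights) end up with a missing total_delay as well.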
What we see is a right-skewed distribution. The summary statistics are consistent with this: the mean (19.4505325) is larger than the median (-6), which is what we would expect when a long right tail pulls the mean upward.
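The mean and median can be computed with summarise(). This sketch again assumes total delay is dep_delay + arr_delay and drops missing values with na.rm = TRUE; the name delay_stats is illustrative.

```r
library(dplyr)
library(nycflights13)

delay_stats <- flights %>%
  mutate(total_delay = dep_delay + arr_delay) %>%  # assumed definition
  summarise(
    mean_delay   = mean(total_delay, na.rm = TRUE),
    median_delay = median(total_delay, na.rm = TRUE)
  )

delay_stats
```

The result can be compared against the figures quoted above; a mean well above the median points to a long right tail.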
Perhaps now we can look at all the flights with reasonable arrival and departure delays (e.g. let’s assume that anything up to 2 hrs is reasonable). We will do this by using the filter function supplied with dplyr.
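A sketch of such a filter, assuming "reasonable" means both delays are at most 120 minutes and that total_delay is defined as dep_delay + arr_delay:

```r
library(dplyr)
library(nycflights13)

ok_flights <- flights %>%
  mutate(total_delay = dep_delay + arr_delay) %>%  # assumed definition
  # Keep only flights where neither delay exceeds 2 hours (120 minutes)
  filter(dep_delay <= 120, arr_delay <= 120)

nrow(ok_flights)
```

Note that filter() also drops rows where a condition evaluates to NA, so flights with a missing delay are excluded here.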
Moreover, you may have noticed we used something new (%>%). This operator, known as the pipe, takes what’s on its left and uses it as the first argument of the function call on its right, so x %>% f(y) is equivalent to f(x, y).
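To make the equivalence concrete, here is a small sketch that runs the same filter both ways (the 120-minute threshold is the assumed cut-off for a reasonable delay, and the names piped and nested are illustrative):

```r
library(dplyr)
library(nycflights13)

# Piped form: flights becomes the first argument of filter()
piped <- flights %>% filter(dep_delay <= 120, arr_delay <= 120)

# Ordinary function-call form
nested <- filter(flights, dep_delay <= 120, arr_delay <= 120)

# Both forms should give the same rows
all.equal(piped, nested)
```

The benefit of the pipe shows up once several steps are chained, since the code then reads top to bottom rather than inside out.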
Now with this data set, we could ask: which airlines have the most “acceptable” delays (including flights ahead of schedule)?
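One possible way to approach this, sketched under the same assumptions as before (delays of at most 120 minutes count as acceptable, total delay is dep_delay + arr_delay). The column names n_ok and avg_total_delay are illustrative.

```r
library(dplyr)
library(nycflights13)

by_carrier <- flights %>%
  mutate(total_delay = dep_delay + arr_delay) %>%  # assumed definition
  filter(dep_delay <= 120, arr_delay <= 120) %>%   # assumed "acceptable" rule
  group_by(carrier) %>%
  summarise(
    n_ok            = n(),               # number of acceptable flights
    avg_total_delay = mean(total_delay)  # average total delay in minutes
  ) %>%
  arrange(desc(n_ok))

by_carrier
```

Sorting by the raw count favours carriers that simply fly more often, which is worth keeping in mind when reading the result.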
In this example we made use of multiple pipes. Moreover, we introduced the group_by() function, which was used to split the data frame into subsets based on the carrier. We then looked at the average total delay across all the flights of each airline using the summarise() function. Lastly, we used the arrange() function to look at the carriers with the most “acceptable” delays. In fact we do see UA and Delta, which probably handle the most traffic, and thus this analysis might not actually tell us much about the quality of their services.
If we were to use ok_flights, a more appropriate question would be which carrier had the shortest delays on average, given that the flights were acceptable.
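A sketch of that question, rebuilding ok_flights under the assumed definitions (both delays at most 120 minutes, total delay as their sum) so the snippet runs on its own:

```r
library(dplyr)
library(nycflights13)

ok_flights <- flights %>%
  mutate(total_delay = dep_delay + arr_delay) %>%  # assumed definition
  filter(dep_delay <= 120, arr_delay <= 120)       # assumed "acceptable" rule

shortest_delays <- ok_flights %>%
  group_by(carrier) %>%
  summarise(avg_total_delay = mean(total_delay)) %>%
  arrange(avg_total_delay)  # smallest average delay first

shortest_delays
```

Because this ranks by an average rather than a count, it is less sensitive to how many flights each carrier operates.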
One last thing we could do is take the previous bar graph and create one that shows, in a visual manner, what proportion of flights were acceptable. Although we can see this from the original, let’s just make use of some cool settings in ggplot.
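As a rough illustration of the general technique only (not necessarily the figure discussed here), geom_bar() with position = "fill" stacks bars to a common height so that each segment shows a proportion. The sketch below plots per carrier, assumes ggplot2 is installed, and uses the same assumed 120-minute rule; the acceptable column and the flights_flagged name are made up for the example.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

flights_flagged <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  # Assumed rule: acceptable if neither delay exceeds 120 minutes
  mutate(acceptable = dep_delay <= 120 & arr_delay <= 120)

p <- ggplot(flights_flagged, aes(x = carrier, fill = acceptable)) +
  geom_bar(position = "fill") +  # scale every bar to a total height of 1
  labs(x = "Carrier", y = "Proportion of flights", fill = "Acceptable")

print(p)
```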
Whoa, this is kind of weird: there are some unacceptable flights with total delays less than 0. But remember that the conditions for an acceptable flight were no arrival or departure delays longer than 2 hours; some of these flights are cases where, for example, dep_delay = 5 but arr_delay = -10. Phew! However, now the question is whether our conditions for acceptability are valid, given their inconsistency with total_delay, but that’s a question we’ll leave for the moment.
In closing
We took a brief look at the operations included in the dplyr package, and how simple it was for us to do some initial exploratory analysis on the data set.