Analysing nycflights13 Using Relational Structure of its DataFrames (Updated)
Introduction
In this post we’ll be using the nycflights13 data again, because the package contains several related data frames that let us use some of dplyr’s relational functions.
In fact there are the following datasets within this package:
flights, which contains information about all the flights out of New York, and is the most central data frame
airports, which gives us information about the airports, i.e. the name and location
planes, which gives us information about the particular planes used in flights
airlines, which gives us information about the airlines
weather, which gives us the weather conditions at the departure airports in New York
In this post we’ll take a look at flights, planes, and airports through the following two questions.
Let’s begin
The first question we’ll take a look at is:
Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays.
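One way to approach this is sketched below. It assumes that "delay" means the arrival delay (arr_delay) and uses the name dest_delays for the result; both are choices made for illustration. The join matches the dest column of flights to the faa code column of airports.

```r
library(dplyr)
library(nycflights13)

# Average arrival delay per destination.
# Assumption: arr_delay is the delay measure of interest.
dest_delays <- flights %>%
  group_by(dest) %>%
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n_flights = n()
  ) %>%
  # airports identifies each airport by its FAA code in the `faa` column
  inner_join(airports, by = c("dest" = "faa"))

head(dest_delays)
```

An inner_join keeps only destinations that have a matching row in airports, so every remaining row has the lat and lon values needed for a map.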
Now we’re ready to visualize this information. First, let’s make a plot of the United States, as the data seems to include only flights to other American cities. We use the map data made available through the ggplot2 package, and we set the variable states to the data frame that contains the coordinates needed to plot each state.
We then use geom_polygon in order to plot the map using the coordinate information.
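A minimal sketch of this base map follows. It assumes the maps package is installed, since ggplot2’s map_data() relies on it; the fill and outline colours and the name usa_map are arbitrary choices.

```r
library(ggplot2)

# map_data() requires the 'maps' package; "state" covers the lower 48 states
states <- map_data("state")

# Each state outline is a polygon; `group` keeps separate pieces from being joined
usa_map <- ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "grey60") +
  coord_quickmap()

usa_map
```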
Let’s now add the information about which destination airports have the highest average delays. We remove the points that are too far west to be plotted onto this map (i.e. airports outside the contiguous states, such as those in Alaska and Hawaii).
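As a rough sketch, the points can be layered over the state polygons as shown here. The longitude cutoff of -130, the use of arr_delay, and the colour scale are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Average arrival delay per destination, with airport coordinates attached
dest_delays <- flights %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  # Assumed cutoff: drop airports west of the contiguous states,
  # and destinations with no recorded arrival delay
  filter(lon > -130, !is.na(avg_delay))

states <- map_data("state")  # requires the 'maps' package

delay_map <- ggplot() +
  geom_polygon(data = states, aes(x = long, y = lat, group = group),
               fill = "white", colour = "grey60") +
  geom_point(data = dest_delays,
             aes(x = lon, y = lat, colour = avg_delay), size = 2) +
  scale_colour_viridis_c(name = "Avg delay (min)") +
  coord_quickmap()

delay_map
```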
Let’s make this even more detailed by adding labels to the worst and best destinations.
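One possible way to add such labels is sketched below. Labelling the three highest and three lowest average delays is an arbitrary choice, and slice_max() and slice_min() assume dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

dest_delays <- flights %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  filter(lon > -130, !is.na(avg_delay))  # assumed cutoff for the map

# Assumption: label the 3 worst and 3 best destinations by average delay
labelled <- bind_rows(
  slice_max(dest_delays, avg_delay, n = 3),
  slice_min(dest_delays, avg_delay, n = 3)
)

states <- map_data("state")  # requires the 'maps' package

labelled_map <- ggplot() +
  geom_polygon(data = states, aes(x = long, y = lat, group = group),
               fill = "white", colour = "grey60") +
  geom_point(data = dest_delays,
             aes(x = lon, y = lat, colour = avg_delay), size = 2) +
  # Label only the selected airports with their FAA code
  geom_text(data = labelled, aes(x = lon, y = lat, label = dest),
            vjust = -1, size = 3) +
  scale_colour_viridis_c(name = "Avg delay (min)") +
  coord_quickmap()

labelled_map
```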
As an added bonus, we’ll plot this information (including the points that are too far west for the ggplot map of the USA) using plotly.
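A hedged sketch of an interactive version using plotly’s plot_geo() is given here. The "albers usa" projection, the hover text, and the use of arr_delay are assumptions; the original plot may have been built differently.

```r
library(dplyr)
library(plotly)
library(nycflights13)

# Keep every destination this time, including those outside the lower 48
dest_delays <- flights %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  filter(!is.na(avg_delay))

fig <- plot_geo(dest_delays, lat = ~lat, lon = ~lon) %>%
  add_markers(
    color = ~avg_delay,
    text = ~paste0(name, ": ", round(avg_delay, 1), " min"),
    hoverinfo = "text"
  ) %>%
  # The "albers usa" projection draws Alaska and Hawaii as insets
  layout(geo = list(scope = "usa", projection = list(type = "albers usa")))

fig
```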
The second question is:
Is there a relationship between the age of a plane and its delays?
For this we use the year the plane was made as a proxy for its age, i.e. the latest models are the youngest planes.
Answering this question requires us to join the planes table with the flights table, in a similar fashion to what we did before:
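The join and plot could look roughly like the following sketch. It assumes one point per plane (average arrival delay per tailnum) and a degree-2 polynomial for the red curve; neither detail is confirmed by the text. Summarising flights first drops its own year column, so after the join year refers to the year the plane was made.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Assumption: average arrival delay per plane, joined to its manufacture year
plane_delays <- flights %>%
  group_by(tailnum) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(select(planes, tailnum, year), by = "tailnum") %>%
  filter(!is.na(year), !is.na(avg_delay))

age_plot <- ggplot(plane_delays, aes(x = year, y = avg_delay)) +
  geom_point(alpha = 0.3) +
  # Blue: linear fit; red: polynomial fit (degree 2 is an assumed choice)
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "blue") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, colour = "red")

age_plot
```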
From the looks of the plot, there doesn’t really seem to be a relationship. The average delay times are fairly spread out for all the years with a lot of planes. There does, however, seem to be a lot more variation in the early 2000s compared to the years that follow.
However, the blue line I plotted, which represents a linear model (avg_delay ~ year), seems to signal a positive relationship between the two variables. We can examine the model further to see whether this is valid.
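A minimal sketch of fitting and inspecting that linear model is shown here. It assumes the data are per-plane average arrival delays, so the numbers it prints may differ from those discussed in this post.

```r
library(dplyr)
library(nycflights13)

# Assumption: one row per plane, with its average arrival delay and manufacture year
plane_delays <- flights %>%
  group_by(tailnum) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(select(planes, tailnum, year), by = "tailnum") %>%
  filter(!is.na(year), !is.na(avg_delay))

# Linear model of average delay on manufacture year
model <- lm(avg_delay ~ year, data = plane_delays)

summary(model)            # coefficient estimates and p-values
summary(model)$r.squared  # share of variance explained by year
```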
We see that the coefficient for year is statistically significant, with an estimate of 0.16929, which is quite negligible in practical terms. In addition, the r-squared of the model, given by summary(model)$r.squared, indicates an extremely poor fit, which is evidence against a linear relationship between the year of the plane and its average delay. We also tried a polynomial model (indicated in red), but that returned a poor r-squared as well. Thus I don’t believe there’s a practical relationship between the age of a plane and the delays it faces, at least for the data in the nycflights13 dataset.
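The comparison between the linear and polynomial fits can be sketched as below. The polynomial degree of 2 and the per-plane aggregation are assumptions, since the post does not state them.

```r
library(dplyr)
library(nycflights13)

plane_delays <- flights %>%
  group_by(tailnum) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(select(planes, tailnum, year), by = "tailnum") %>%
  filter(!is.na(year), !is.na(avg_delay))

linear_model <- lm(avg_delay ~ year, data = plane_delays)
# Assumed degree: 2 (orthogonal polynomial terms via poly())
poly_model <- lm(avg_delay ~ poly(year, 2), data = plane_delays)

# Compare the share of variance explained by each model
r_squared <- c(
  linear = summary(linear_model)$r.squared,
  polynomial = summary(poly_model)$r.squared
)
r_squared
```

Because the linear model is nested within the polynomial one, the polynomial r-squared can never be lower; the question is whether it is high enough to matter in practice.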
Conclusion
In this post we used the join functions in the dplyr package to generate new data frames for our analysis. We used this information to visualize delays by airport on a map of the USA, and to answer the question of whether the age of a plane has an influence on its average delay time.