Chapter 2 dbplyr
2.1 Introduction
In this chapter, we will learn to query data from a database using dplyr.
We will use the following R packages:
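A minimal sketch of the setup, assuming the chapter relies on DBI, RSQLite, dplyr and dbplyr. This list is inferred from the functions used in the sections that follow and is not stated in the text, so adjust it to your own environment.

```r
# Assumed package set (inferred from the functions used in this chapter)
library(DBI)      # generic database interface: dbConnect()
library(RSQLite)  # SQLite driver used for the in-memory database
library(dplyr)    # verbs such as select(), filter(), summarise()
library(dbplyr)   # translates dplyr verbs into SQL
```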
All the data sets used in this chapter can be found here and code can be downloaded from here.
2.2 Connect to Database
Let us connect to an in-memory SQLite database using dbConnect().
We will copy the mtcars data to the database so that we can use it for running
dplyr statements.
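The two steps above might look like the following sketch. It assumes the RSQLite and dbplyr packages are installed. The object name `con` is an illustrative choice, while the table name `mtcars` matches the source shown in the printed output later in the chapter.

```r
library(DBI)
library(dplyr)

# `con` is an illustrative name for the connection object
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy the built-in mtcars data frame into the database;
# by default the table is named after the data frame ("mtcars")
dplyr::copy_to(con, mtcars)
```

An in-memory database disappears when the connection is closed, which makes it convenient for experiments like this one.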
2.3 Reference Data
In order to use dplyr functions, we need to reference the table in the database using
tbl().
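A sketch of creating the reference, with the setup repeated so that the snippet runs on its own. The name `mtcars2` is the one used in the rest of the chapter, and `con` is an assumed name for the connection.

```r
library(DBI)
library(dplyr)

# Setup: in-memory SQLite database holding a copy of mtcars
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dplyr::copy_to(con, mtcars)

# Create a lazy reference to the "mtcars" table; no data is loaded into R yet
mtcars2 <- dplyr::tbl(con, "mtcars")

# Printing the reference fetches only the first few rows
mtcars2
```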
## # Source:   table<mtcars> [?? x 11]
## # Database: sqlite 3.30.1 [:memory:]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with more rows
2.4 Query Data
We will look at some simple examples. Let us start by selecting mpg, cyl and drat
columns from mtcars2.
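A sketch of the column selection, assuming the same in-memory setup as before (repeated here so the snippet is self-contained). The name `selected` is illustrative.

```r
library(DBI)
library(dplyr)

# Setup: in-memory SQLite database and a lazy reference to the table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dplyr::copy_to(con, mtcars)
mtcars2 <- dplyr::tbl(con, "mtcars")

# Keep only the mpg, cyl and drat columns; the work is done by the database
selected <- mtcars2 %>%
  select(mpg, cyl, drat)

selected
```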
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.30.1 [:memory:]
##      mpg   cyl  drat
##    <dbl> <dbl> <dbl>
##  1  21       6  3.9 
##  2  21       6  3.9 
##  3  22.8     4  3.85
##  4  21.4     6  3.08
##  5  18.7     8  3.15
##  6  18.1     6  2.76
##  7  14.3     8  3.21
##  8  24.4     4  3.69
##  9  22.8     4  3.92
## 10  19.2     6  3.92
## # ... with more rows
We can filter data as well. Let us filter the rows of mtcars2 where mpg is greater than 25.
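A sketch of the filter, again with the setup repeated so it runs on its own. The name `efficient` is illustrative.

```r
library(DBI)
library(dplyr)

# Setup: in-memory SQLite database and a lazy reference to the table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dplyr::copy_to(con, mtcars)
mtcars2 <- dplyr::tbl(con, "mtcars")

# Keep the rows where mpg exceeds 25; this becomes a WHERE clause in SQL
efficient <- mtcars2 %>%
  filter(mpg > 25)

efficient
```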
## # Source:   lazy query [?? x 11]
## # Database: sqlite 3.30.1 [:memory:]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
## 2  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
## 3  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
## 4  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
## 5  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
## 6  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
Time to do some grouping and summarizing. Let us compute the average mileage for each number of cylinders.
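A sketch of the grouped summary, with the setup repeated so it runs on its own. It calls `mean(mpg)` without `na.rm = TRUE`, which is what triggers the warning in the output that follows. As that warning itself suggests, adding `na.rm = TRUE` silences it.

```r
library(DBI)
library(dplyr)

# Setup: in-memory SQLite database and a lazy reference to the table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dplyr::copy_to(con, mtcars)
mtcars2 <- dplyr::tbl(con, "mtcars")

# Average mpg per cylinder count; translated to AVG(...) with GROUP BY
mileages <- mtcars2 %>%
  group_by(cyl) %>%
  summarise(mileage = mean(mpg))

mileages
```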
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.30.1 [:memory:]
##     cyl mileage
##   <dbl>   <dbl>
## 1     4    26.7
## 2     6    19.7
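## 3     8    15.1
2.5 Show Query
If you want to view the SQL query generated in the above step, use show_query() or explain(). show_query() prints the generated SQL, while explain() prints the SQL together with the database's query plan.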
mileages <- 
  mtcars2 %>%
  group_by(cyl) %>%
  summarise(mileage = mean(mpg))
dplyr::show_query(mileages)
## <SQL>
## SELECT `cyl`, AVG(`mpg`) AS `mileage`
## FROM `mtcars`
## GROUP BY `cyl`
dplyr::explain(mileages)
## <SQL>
## SELECT `cyl`, AVG(`mpg`) AS `mileage`
## FROM `mtcars`
## GROUP BY `cyl`
## 
## <PLAN>
##   id parent notused                       detail
## 1  6      0       0            SCAN TABLE mtcars
## 2  8      0       0 USE TEMP B-TREE FOR GROUP BY
2.6 Collect Data
When working with databases, dplyr never pulls data into R unless you explicitly ask for it. In the previous example, dplyr does nothing until you ask for the mileages data: it generates the SQL and pulls down only a few rows when you print mileages.
So how do we pull all the data and store it for further analysis? collect() runs the query, pulls all the resulting rows into R and returns them as a tibble that you can use for any further analysis.
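A sketch of collecting the summarised result into R, with the setup repeated so it runs on its own. The name `mileages_local` is an illustrative choice and does not come from the original text.

```r
library(DBI)
library(dplyr)

# Setup: in-memory SQLite database and a lazy reference to the table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dplyr::copy_to(con, mtcars)
mtcars2 <- dplyr::tbl(con, "mtcars")

# Lazy query: nothing has been computed or transferred yet
mileages <- mtcars2 %>%
  group_by(cyl) %>%
  summarise(mileage = mean(mpg, na.rm = TRUE))

# collect() executes the query and returns the full result as a tibble
mileages_local <- collect(mileages)

mileages_local
```

Collecting late, after filtering and summarising in the database, keeps the amount of data transferred into R small.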
## # A tibble: 3 x 2
##     cyl mileage
##   <dbl>   <dbl>
## 1     4    26.7
## 2     6    19.7
## 3     8    15.1