15 min read

Christmas Trees

Introduction

In this post we will look at the total number of Christmas trees harvested in a year as well as well as the total number of operations in a state with tree sales. The data we have is from realchristmastrees.org and includes the years 2002, 2007 and 2012.

Our objective with the data will be to create choropleth maps for each year and each type of data. To accomplish this we use spatial data found here.

The primary packages we will use are shown below.

library(sp)
library(tmap)
library(here)
library(dplyr)
library(magrittr)
library(ggplot2)
library(stringr)
library(tools)
library(pdftools)

Data Cleaning

Our first objective is to load the Christmas tree data and make it usable. Since our original data starts as a pdf we need to carefully inspect it at each step to determine the correct next step. Using pdf_text from the pdftools package we are able to load our data into character vectors of length 2 and 1 for our trees harvested dataset and number of retailer operations dataset respectively.

f1 <- here("csv", "christmas_trees", "trees_harvested.pdf")
trees_harvested <- pdftools::pdf_text(pdf = f1)

f2 <-here("csv", "christmas_trees", "operations_with_sales.pdf")
retailers <- pdftools::pdf_text(pdf = f2)

Below we look at the first element of the trees_harvested dataset. This is the data we need to work with. So we can ignore the second element.

print(trees_harvested[1])
## [1] "                               STATES BY TOTAL TREES HARVESTED\n\n                       2012             2007         2002\nTOTAL            17,318,440       17,415,971   20,806,705\nOREGON            6,446,506        6,850,841    6,466,551\nNORTH CAROLINA    4,288,563        3,085,383    2,915,507\nMICHIGAN          1,739,538        1,572,208    2,380,173\nPENNSYLVANIA      1,028,888        1,179,733    1,724,419\nWISCONSIN           611,387          950,440    1,605,981\nWASHINGTON          587,047          785,304    1,164,139\nVIRGINIA            478,069          313,710      507,791\nNEW YORK            274,444          348,043      618,917\nMAINE               195,833          126,908      164,406\nCONNECTICUT         159,091          113,622      133,861\nOHIO                151,327          272,981      372,957\nVERMONT             134,504          168,206      151,249\nNEW HAMPSHIRE       131,876           82,124      107,725\nMINNESOTA           130,527          202,259      463,885\nCALIFORNIA          109,045          119,855      383,940\nTENNESSEE            93,874          166,542      149,770\nINDIANA              89,252          198,899      186,303\nNEW JERSEY           68,471           78,791      132,458\nILLINOIS             65,937          112,617      144,008\nMARYLAND             55,926           77,801       99,183\nMASSACHUSETTS        52,188           75,914       72,522\nGEORGIA              50,112           50,607       80,952\nWEST VIRGINIA        49,867           42,102       60,098\nTEXAS                38,645           42,327       80,914\nSOUTH CAROLINA       35,381           31,113       38,871\nMISSOURI             32,810           27,344       92,483\nIDAHO                27,732           46,145       55,083\nIOWA                 27,077           39,575       57,254\nNEBRASKA             22,513           15,160       24,215\nALABAMA              16,355           31,183       35,670\nFLORIDA              16,214           13,776       15,320\nMISSISSIPPI          15,997           20,889       39,594\nRHODE ISLAND         15,962           19,251       23,085\nKANSAS               11,350           15,731       29,094\nLOUISIANA            10,811           17,227       43,742\nMONTANA               9,028           32,104       43,793\nOKLAHOMA              8,760           14,310       18,818\nKENTUCKY              7,987           23,724       56,473\nCOLORADO              7,902           13,404       11,508\nDELAWARE              7,627           10,819       16,183\nARKANSAS              5,806           10,636       18,146\nSOUTH DAKOTA          2,620            4,161        3,715\nUTAH                  2,525            2,126        3,744\nHAWAII                2,007            8,323        4,233\nNORTH DAKOTA            735            1,150        2,007\nARIZONA                 300              289           (D)\nALASKA                    24               0             0\nNEVADA                   (D)               0           (D)\nNEW MEXICO               (D)             314        2,935\nWYOMING                    0               0        3,030\n"

We can see that the entire pdf table of data in the original dataset has been transformed into a single character string. We will have to do a few different things to get this data into a usable format. Our first step is the remove the new space characters \n and create a new character vector with each line. We do this below and display the first 10 lines.

th_lines <- strsplit(trees_harvested[1], '\n')[[1]]

th_lines %>% head(10) %>% print
##  [1] "                               STATES BY TOTAL TREES HARVESTED"
##  [2] ""                                                              
##  [3] "                       2012             2007         2002"     
##  [4] "TOTAL            17,318,440       17,415,971   20,806,705"     
##  [5] "OREGON            6,446,506        6,850,841    6,466,551"     
##  [6] "NORTH CAROLINA    4,288,563        3,085,383    2,915,507"     
##  [7] "MICHIGAN          1,739,538        1,572,208    2,380,173"     
##  [8] "PENNSYLVANIA      1,028,888        1,179,733    1,724,419"     
##  [9] "WISCONSIN           611,387          950,440    1,605,981"     
## [10] "WASHINGTON          587,047          785,304    1,164,139"

With our slightly cleaner data, we now have each row of our table as a single element in our new character vector. This is a good step, however, the first 4 lines are not really part of our state data. Our next step is to remove the top four rows and then split the character string into four parts corresponding to state, 2002, 2007 and 2012. Since we will have to do something similar for both datasets we create a function below to do this which returns a dataframe. Note that we also remove some confusing characters like (D) and extra commas.

df_from_chr <- function(lines, splits) {
    lines <- lines[- c(1:4)]  # remove first 4 lines
    df <- data.frame()  # create empty dataframe
    
    
    # loop over lines to go into dataframe
    for (line in lines) {
        # split lines along splits
        new_row = c(
            substr(line, 1, splits[1]),
            substr(line, splits[1] + 1, splits[2]),
            substr(line, splits[2] + 2, splits[3]),
            substr(line, splits[3] + 3, nchar(line))
        )
        # remove confusing characters
        for (i in 1:4) {
            new_row[i] %<>%
                trimws %>%
                str_replace_all("\\(D\\)", "NA") %>%
                str_remove_all(",")
        }
        df <- rbind(df, new_row)  # add line to dataframe
    }
    
    # add dataframe names and create numeric columns
    names(df) <- c("State", "2012", "2007", "2002")
    df$`2012` <- as.numeric(unlist(df["2012"]))
    df$`2007` <- as.numeric(unlist(df["2007"]))
    df$`2002` <- as.numeric(unlist(df["2002"]))
    
    # return value
    df
}

Below we apply our function to the harvested tree dataset using appropriately placed splitting points.

th_df <- df_from_chr(th_lines, c(15, 30, 47))

We print our new result below:

print(th_df)
##             State    2012    2007   2002
## 1          OREGON 6446506 6850841 466551
## 2  NORTH CAROLINA 4288563 3085383 915507
## 3        MICHIGAN 1739538 1572208 380173
## 4    PENNSYLVANIA 1028888 1179733 724419
## 5       WISCONSIN  611387  950440 605981
## 6      WASHINGTON  587047  785304 164139
## 7        VIRGINIA  478069  313710 507791
## 8        NEW YORK  274444  348043 618917
## 9           MAINE  195833  126908 164406
## 10    CONNECTICUT  159091  113622 133861
## 11           OHIO  151327  272981 372957
## 12        VERMONT  134504  168206 151249
## 13  NEW HAMPSHIRE  131876   82124 107725
## 14      MINNESOTA  130527  202259 463885
## 15     CALIFORNIA  109045  119855 383940
## 16      TENNESSEE   93874  166542 149770
## 17        INDIANA   89252  198899 186303
## 18     NEW JERSEY   68471   78791 132458
## 19       ILLINOIS   65937  112617 144008
## 20       MARYLAND   55926   77801  99183
## 21  MASSACHUSETTS   52188   75914  72522
## 22        GEORGIA   50112   50607  80952
## 23  WEST VIRGINIA   49867   42102  60098
## 24          TEXAS   38645   42327  80914
## 25 SOUTH CAROLINA   35381   31113  38871
## 26       MISSOURI   32810   27344  92483
## 27          IDAHO   27732   46145  55083
## 28           IOWA   27077   39575  57254
## 29       NEBRASKA   22513   15160  24215
## 30        ALABAMA   16355   31183  35670
## 31        FLORIDA   16214   13776  15320
## 32    MISSISSIPPI   15997   20889  39594
## 33   RHODE ISLAND   15962   19251  23085
## 34         KANSAS   11350   15731  29094
## 35      LOUISIANA   10811   17227  43742
## 36        MONTANA    9028   32104  43793
## 37       OKLAHOMA    8760   14310  18818
## 38       KENTUCKY    7987   23724  56473
## 39       COLORADO    7902   13404  11508
## 40       DELAWARE    7627   10819  16183
## 41       ARKANSAS    5806   10636  18146
## 42   SOUTH DAKOTA    2620    4161   3715
## 43           UTAH    2525    2126   3744
## 44         HAWAII    2007    8323   4233
## 45   NORTH DAKOTA     735    1150   2007
## 46        ARIZONA     300     289     NA
## 47         ALASKA      24       0      0
## 48         NEVADA      NA       0     NA
## 49     NEW MEXICO      NA     314   2935
## 50        WYOMING       0       0   3030

We now repeat our process for retail operations. Below we look at the data.

print(retailers)
## [1] "                         STATES BY NUMBER OF OPERATIONS WITH SALES\n\n                      2012      2007      2002\nTOTAL            12,976.00 13,366.00 14,677.00\nOREGON               1,250      1320      1106\nNORTH CAROLINA       1,151       934      1043\nPENNSYLVANIA         1,079      1205      1326\nNEW YORK               875       844      1001\nMICHIGAN               826       900      1076\nNEW JERSEY             700       884       899\nWISCONSIN              689       849       859\nOHIO                   534       594       775\nWASHINGTON             518       534       508\nVIRGINIA               512       352       514\nCONNECTICUT            490       338       382\nMASSACHUSETTS          409       280       306\nCALIFORNIA             328       322       403\nMAINE                  310       236       216\nMINNESOTA              276       280       327\nILLINOIS               254       248       338\nVERMONT                232       255       252\nNEW HAMPSHIRE          223       181       173\nINDIANA                189       202       265\nWEST VIRGINIA          179       173       184\nMARYLAND               151       180       204\nSOUTH CAROLINA         149       179       152\nFLORIDA                148        53        79\nIOWA                   137       169       215\nGEORGIA                136       144       188\nTENNESSEE              111       179       152\nMISSISSIPPI            110       147       162\nTEXAS                  108       254       310\nMISSOURI               105       131       196\nCOLORADO                92       122        91\nKENTUCKY                81       129       143\nIDAHO                   79        80        56\nNEBRASKA                77        71        84\nLOUISIANA               74        87       113\nALABAMA                 67        59        91\nKANSAS                  63        67       102\nRHODE ISLAND            51        49        60\nMONTANA                 46        51        54\nARKANSAS                31        51        41\nOKLAHOMA                31        63        73\nDELAWARE                25        41        58\nHAWAII                  23        44        19\nUTAH                    21        33        26\nSOUTH DAKOTA            13        19        14\nNORTH DAKOTA            12        12        15\nALASKA                   3         0         0\nARIZONA                  3         0         0\nNEW MEXICO               3        17        20\nNEVADA                   2         1         2\nWYOMING                  0         3         4\n"

Our next step is to remove new line characters \n and print the first 10 rows.

r_lines <- strsplit(retailers, '\n')[[1]]

r_lines %>% head(10) %>% print
##  [1] "                         STATES BY NUMBER OF OPERATIONS WITH SALES"
##  [2] ""                                                                  
##  [3] "                      2012      2007      2002"                    
##  [4] "TOTAL            12,976.00 13,366.00 14,677.00"                    
##  [5] "OREGON               1,250      1320      1106"                    
##  [6] "NORTH CAROLINA       1,151       934      1043"                    
##  [7] "PENNSYLVANIA         1,079      1205      1326"                    
##  [8] "NEW YORK               875       844      1001"                    
##  [9] "MICHIGAN               826       900      1076"                    
## [10] "NEW JERSEY             700       884       899"

We use our function df_from_chr to split the character rows and create a new dataframe.

r_df <- df_from_chr(r_lines, c(15, 27, 38))

We now print our new dataframe and after this we are ready to consider spatial choropleth maps.

print(r_df)
##             State 2012 2007 2002
## 1          OREGON 1250 1320 1106
## 2  NORTH CAROLINA 1151  934 1043
## 3    PENNSYLVANIA 1079 1205 1326
## 4        NEW YORK  875  844 1001
## 5        MICHIGAN  826  900 1076
## 6      NEW JERSEY  700  884  899
## 7       WISCONSIN  689  849  859
## 8            OHIO  534  594  775
## 9      WASHINGTON  518  534  508
## 10       VIRGINIA  512  352  514
## 11    CONNECTICUT  490  338  382
## 12  MASSACHUSETTS  409  280  306
## 13     CALIFORNIA  328  322  403
## 14          MAINE  310  236  216
## 15      MINNESOTA  276  280  327
## 16       ILLINOIS  254  248  338
## 17        VERMONT  232  255  252
## 18  NEW HAMPSHIRE  223  181  173
## 19        INDIANA  189  202  265
## 20  WEST VIRGINIA  179  173  184
## 21       MARYLAND  151  180  204
## 22 SOUTH CAROLINA  149  179  152
## 23        FLORIDA  148   53   79
## 24           IOWA  137  169  215
## 25        GEORGIA  136  144  188
## 26      TENNESSEE  111  179  152
## 27    MISSISSIPPI  110  147  162
## 28          TEXAS  108  254  310
## 29       MISSOURI  105  131  196
## 30       COLORADO   92  122   91
## 31       KENTUCKY   81  129  143
## 32          IDAHO   79   80   56
## 33       NEBRASKA   77   71   84
## 34      LOUISIANA   74   87  113
## 35        ALABAMA   67   59   91
## 36         KANSAS   63   67  102
## 37   RHODE ISLAND   51   49   60
## 38        MONTANA   46   51   54
## 39       ARKANSAS   31   51   41
## 40       OKLAHOMA   31   63   73
## 41       DELAWARE   25   41   58
## 42         HAWAII   23   44   19
## 43           UTAH   21   33   26
## 44   SOUTH DAKOTA   13   19   14
## 45   NORTH DAKOTA   12   12   15
## 46         ALASKA    3    0    0
## 47        ARIZONA    3    0    0
## 48     NEW MEXICO    3   17   20
## 49         NEVADA    2    1    2
## 50        WYOMING    0    3    4

Spatial Data

Now that our data is in a good format for use, we need to load up our spatial data. We do that below.

tree_geo <- here("csv", "christmas_trees",
                 "gadm36_USA_1_sp.rds") %>%
    readRDS()

Taking a look at our spatial data we can see that we are using a class of SpatialPolygonsDataFrame. The dataframe portion of this data contains 51 observations which corresponds to the 50 states of the U.S. as well as the District of Columbia.

str(tree_geo, 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  51 obs. of  10 variables:
##   ..@ polygons   :List of 51
##   ..@ plotOrder  : int [1:51] 2 44 27 5 32 29 3 23 38 6 ...
##   ..@ bbox       : num [1:2, 1:2] -179.2 18.9 179.8 72.7
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot

Our goal is to combine this spatial dataframe with our two Christmas tree datasets. Below we take a look at four rows of the dataframe in our spatial data. We can see that we have the country name, state name, and whether it is a state or other type of region. The other type of region will corresponds to the District of Columbia.

tree_geo@data[1:4, c("NAME_0", "NAME_1", "ENGTYPE_1")]
##           NAME_0   NAME_1 ENGTYPE_1
## 1  United States  Alabama     State
## 12 United States   Alaska     State
## 23 United States  Arizona     State
## 34 United States Arkansas     State

The next two steps check whether or not our tree datasets are missing data from our spatial dataframe.

# for th_df

# Determine spelling mismatches between dataframe and
# geographical dataframe missing from or mismatched
# with corresponding regions in tree_geo
excluded_names = c()
for (name in tree_geo$NAME_1){
    if (!(tolower(name) %in% tolower(th_df$State))) {
        excluded_names <- append(excluded_names, name)
    }
}
print(excluded_names)
## [1] "District of Columbia"
# for r_df

# Determine spelling mismatches between dataframe and
# geographical dataframe missing from or mismatched
# with corresponding regions in tree_geo
excluded_names = c()
for (name in tree_geo$NAME_1){
    if (!(tolower(name) %in% tolower(r_df$State))) {
        excluded_names <- append(excluded_names, name)
    }
}
print(excluded_names)
## [1] "District of Columbia"

We can see in both cases we do not have data for the District of Columbia. To resolved this we add to each of our tree dataframes a row containing the missing name along with NA values.

# add NA values as needed for District of Columbia
new_row <- data.frame("District of Columbia", NA, NA, NA)

names(new_row) <- names(th_df)
th_df <- rbind(th_df, new_row)
th_df <- th_df[order(th_df$State),]

names(new_row) <- names(r_df)
r_df <- rbind(r_df, new_row)
r_df <- r_df[order(r_df$State),]

Our next check for each dataframe is to determine if our spatial data and tree dataframes have the same order.

# We determine whether or not tree_geo and our th_df have
# same order for state / region names
incorrect_names <- c()
for (index in 1:length(tree_geo$NAME_1)){
    if (tolower(tree_geo$NAME_1[index]) != 
            tolower(th_df$State[index])) {
        incorrect_names <- append(incorrect_names, th_df$State[index])
    }
}
print(incorrect_names)
## NULL
# We determine whether or not tree_geo and our r_df have
# same order for state / region names
incorrect_names <- c()
for (index in 1:length(tree_geo$NAME_1)){
    if (tolower(tree_geo$NAME_1[index]) != 
            tolower(r_df$State[index])) {
        incorrect_names <- append(incorrect_names, r_df$State[index])
    }
}
print(incorrect_names)
## NULL

Since we get NULL for each result we are free to combine each dataframe which we do below.

th_df["State"] <- th_df$State %>% tolower %>% toTitleCase
th_spdf <- SpatialPolygonsDataFrame(Sr=tree_geo, data=th_df, match.ID = FALSE)

r_df["State"] <- r_df$State %>% tolower %>% toTitleCase
r_spdf <- SpatialPolygonsDataFrame(Sr=tree_geo, data=r_df, match.ID = FALSE)

Below we look at our new SpatialPolygonsDataFrame for each category of tree data and are now ready for spatial visualization.

str(th_spdf, 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  51 obs. of  4 variables:
##   ..@ polygons   :List of 51
##   ..@ plotOrder  : int [1:51] 2 44 27 5 32 29 3 23 38 6 ...
##   ..@ bbox       : num [1:2, 1:2] -179.2 18.9 179.8 72.7
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   ..$ comment: chr "TRUE"
str(r_spdf, 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  51 obs. of  4 variables:
##   ..@ polygons   :List of 51
##   ..@ plotOrder  : int [1:51] 2 44 27 5 32 29 3 23 38 6 ...
##   ..@ bbox       : num [1:2, 1:2] -179.2 18.9 179.8 72.7
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   ..$ comment: chr "TRUE"

Visualization

In this section we put everything together and showcase our results. We will look at our results for the harvested tree dataset first for each year and then for the retail operations dataset. Due to the small amount of harvested trees and making the map too stretched we remove Alaska and Hawaii from our results.

condition = th_spdf@data$State != "Alaska" & th_spdf@data$State != "Hawaii"
new_spdf <- th_spdf[condition,]

bbox_new <- sf::st_bbox(new_spdf) # current bounding box

xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

# bbox_new[1] <- bbox_new[1] - (0.5 * xrange) # xmin - left
# bbox_new[3] <- 0.1 * bbox_new[3] # xmax - right
# bbox_new[2] <- bbox_new[2] - (0.5 * yrange) # ymin - bottom
bbox_new[4] <- bbox_new[4] + (0.5 * yrange) # ymax - top

bbox_new <- bbox_new %>%  # take the bounding box ...
  sf::st_as_sfc() # ... and make it a sf polygon

tm_shape(new_spdf, bbox=bbox_new) +
    tm_polygons(col="2002", breaks=c(0, 500, 10000, 100000, 1000000, 5000000, 10000000)) +
    tm_borders(lty="solid") +
    tm_style("cobalt", bg.color="#9CC3CD") +
    tm_layout(frame=FALSE, title="Christmas Trees Harvested\nper State", 
            title.size=3,
            legend.title.size=1.5, legend.text.size=1,
            legend.position=c("right", "top"))

We now look at our values for retail operations.

condition = th_spdf@data$State != "Alaska" & th_spdf@data$State != "Hawaii"
new_spdf <- r_spdf[condition,]

bbox_new <- sf::st_bbox(new_spdf) # current bounding box

xrange <- bbox_new$xmax - bbox_new$xmin # range of x values
yrange <- bbox_new$ymax - bbox_new$ymin # range of y values

# bbox_new[1] <- bbox_new[1] - (0.5 * xrange) # xmin - left
# bbox_new[3] <- 0.1 * bbox_new[3] # xmax - right
# bbox_new[2] <- bbox_new[2] - (0.5 * yrange) # ymin - bottom
bbox_new[4] <- bbox_new[4] + (0.5 * yrange) # ymax - top

bbox_new <- bbox_new %>%  # take the bounding box ...
  sf::st_as_sfc() # ... and make it a sf polygon

tm_shape(new_spdf, bbox=bbox_new) +
    tm_fill("2002") +
    tm_borders(lty="solid") +
    tm_style("cobalt", bg.color="#9CC3CD") +
    tm_layout(frame=FALSE, title="Christmas Tree Retail\nOperations per State", 
            title.size=3,
            legend.title.size=1.5, legend.text.size=1,
            legend.position=c("right", "top"))

Conclusion

Based on our choropleth maps we can see that the total number of trees being harvested tends to be near the states with large bodies of water. This, however, does not extend to the southern states. We can also see a slight trend over time towards certain states such as Oregon and North Carolina harvesting the most trees.

In terms of retail operations the results are almost identical each year and show that most operations are either in the north east or north west of the country. Similar to the harvested tree dataset we see that Oregon and North Carolina are very prominent in the map.

Overall we can see that most operations and trees are grown either on the east or west coast and generally further north. We also see additional activity near the great lakes.

Further research in this topic could include determining where evergreen trees and species used for Christmas trees naturally grow and what type of climate they grow best in. Additionally, trying to find more recent data could be advantageous. We could also attempt a statistical analysis on the datasets to see if the numbers really are changing significantly between years.