Bank Box Penny Distribution

Introduction

In this post we look at a few key features of a dataset I put together. In November 2022 I requested a box of pennies from the bank. A box holds $25 worth of pennies, split into 50 rolls worth 50¢ each. I opened each roll and, for each penny, recorded the year, mint mark, whether it is 95% copper or clad, the country of origin, and the roll number, along with a few notes on special pennies.

All pennies but one were of US origin; the exception was a 2002 Canadian cent.

We begin our data exploration by loading up our R packages and dataset.

library(here)
library(readODS)
library(ggplot2)

fp <- here("csv", "pennies", "pennies_2022.ods")
pennies <- readODS::read_ods(fp)

Data Cleaning

Here we take a brief look at our data frame. Notice below that the date column is stored as a character type. This is because some of the dates were unreadable or too hard to determine accurately, so I used “unknown” as a placeholder in those cases.

str(pennies)

## 'data.frame':    2500 obs. of  6 variables:
##  $ date    : chr  "2022" "2022" "2018" "2019" ...
##  $ mint    : chr  "d" "d" "d" "d" ...
##  $ copper95: num  NA NA NA NA NA 1 NA NA NA NA ...
##  $ country : chr  NA NA NA NA ...
##  $ roll    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ notes   : chr  NA NA NA NA ...

We will now clean up the dataset. First, we remove the foreign cent from the mix. Next, we convert the date column to numeric, which turns the “unknown” placeholders into NA. Lastly, we recode the copper95 column so that clad pennies, currently recorded as NA, use 0.

pennies <- pennies[which(is.na(pennies$country)),]
pennies$date <- gsub("unknown", NA, pennies$date)
pennies$date <- as.numeric(pennies$date)
pennies$copper95[is.na(pennies$copper95)] <- 0
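
To make the effect of each step easier to see, here is a minimal sketch that runs the same four lines on a tiny made-up data frame. The `toy` rows, including the "canada" label in the country column, are invented for illustration and are not taken from the real dataset.

```r
# Made-up rows for illustration only; not taken from the real dataset.
toy <- data.frame(
  date     = c("2019", "unknown", "1975", "2002"),
  copper95 = c(NA, NA, 1, NA),
  country  = c(NA, NA, NA, "canada"),   # assumed label for the foreign cent
  stringsAsFactors = FALSE
)

# Same cleaning steps as above, applied to the toy data.
toy <- toy[which(is.na(toy$country)), ]     # keep only rows with no country noted
toy$date <- gsub("unknown", NA, toy$date)   # entries matching "unknown" become NA
toy$date <- as.numeric(toy$date)            # the remaining strings parse as numbers
toy$copper95[is.na(toy$copper95)] <- 0      # NA meant clad, so recode to 0

str(toy)
```

After these steps the toy data frame has three rows, a numeric date column with one NA, and a copper95 column containing only 0s and 1s.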

Data Visualization

With our data a little cleaner, we can now take a closer look. Below we create two histograms, one showing counts for each year and the other showing counts for each decade. Note that ggplot2 automatically drops the rows with NA dates from the histograms, which is what the warnings below report.

ggplot(pennies, aes(x=date, fill=mint, color=mint)) +
    geom_histogram(breaks=1920:2023) +
    xlab("Date") + ylab("Count") +
    ggtitle("Histogram of Pennies Sorted by Year")

## Warning: Removed 7 rows containing non-finite values (stat_bin).

ggplot(pennies, aes(x=date, fill=mint, color=mint)) +
    geom_histogram(breaks=seq(from=1920, to=2030, by=10)) +
    scale_x_continuous(breaks = 10 * 192:203) +
    xlab("Date") + ylab("Count") +
    ggtitle("Histogram of Pennies Sorted by Decade")

## Warning: Removed 7 rows containing non-finite values (stat_bin).

We can see from both histograms that most of our pennies carry the Denver mint mark, while the second most common category is no mint mark at all (NA). Note that pennies from the Philadelphia mint are not rare, but the P mint mark is: Philadelphia cents carried it only in 2017, so here it shows up only on 2017 pennies.

Also note that the 2017 penny recorded with an NA mint mark was probably miscategorized when I recorded the data and most likely has a P mint mark.
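
The ranking read off the histograms can also be checked with a simple count. Here is a minimal sketch using `table()` on a made-up vector of mint marks; the `toy_mint` values are invented for illustration, and on the real data the same call would take `pennies$mint`.

```r
# Made-up mint marks for illustration only; NA stands for "no mint mark".
toy_mint <- c("d", "d", NA, "d", "p", NA, "d")

# useNA = "ifany" keeps the NA group in the table instead of dropping it.
counts <- sort(table(toy_mint, useNA = "ifany"), decreasing = TRUE)
counts
```

Without `useNA`, `table()` silently leaves out the NA entries, which would hide the second-largest group in this dataset.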

Further Analysis

Before wrapping things up let’s extract a few interesting values from our sample of pennies. Some quick analysis is done below.

(total <- length(pennies$copper95))

## [1] 2499

(copper <- sum(pennies$copper95 == 1))

## [1] 580

(clad <- total - copper)

## [1] 1919

(proportion_copper <- sum(pennies$copper95 == 1) / total)

## [1] 0.2320928

So out of our 2499 pennies, 580 are copper, which is about 23.2% of the total.
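
For reporting, it is usually nicer to round the proportion than to print every digit. A small sketch of one way to do that with `sprintf()`, using the two counts from the output above (the variable `pct_label` is just a name picked for this example):

```r
total  <- 2499   # pennies kept after cleaning (from the output above)
copper <- 580    # pennies flagged as 95% copper (from the output above)

proportion_copper <- copper / total

# Format as a percentage with one decimal place; "%%" prints a literal percent sign.
pct_label <- sprintf("%.1f%%", 100 * proportion_copper)
print(pct_label)
```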

Now, one interesting fact is that 1982 has both copper and clad pennies. Let’s take a look at how many of each we found.

penny1982 <- pennies[(pennies$date == 1982) & !is.na(pennies$date) ,]
length(penny1982$date)  # number of observations

## [1] 46

sum(penny1982$copper95 == 1)

## [1] 37

So we ended up with 46 pennies from 1982, 37 of which are copper. Note that to determine whether a 1982 penny was copper or clad, we weighed it. Copper pennies are roughly 25% heavier than clad pennies.

Lastly, we count the wheat pennies in our dataset. Note that we could also get this number by counting the pennies dated 1958 or earlier.
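
The weight difference can be worked out from the nominal masses of the two compositions: 3.11 g for a 95% copper cent and 2.5 g for the later copper-plated zinc cent. The sketch below computes the difference and shows one way a weighed 1982 penny could be classified. The helper `classify_1982`, its midpoint cutoff, and the example weights are assumptions made for illustration, not the procedure actually used for this dataset.

```r
copper_g <- 3.11   # nominal mass of a 95% copper cent, in grams
zinc_g   <- 2.50   # nominal mass of a copper-plated zinc cent, in grams

# How much heavier the copper cent is, as a percentage of the zinc cent's mass.
pct_heavier <- (copper_g / zinc_g - 1) * 100
pct_heavier

# Hypothetical helper: label a weighed penny using the midpoint of the two
# nominal masses as the cutoff (an assumed threshold, not from the post).
classify_1982 <- function(weight_g, cutoff = (copper_g + zinc_g) / 2) {
  ifelse(weight_g > cutoff, "copper", "clad")
}

# Made-up scale readings for illustration.
classify_1982(c(3.08, 2.52, 3.15))
```

The gap between the two masses is large compared with normal wear, so a cutoff anywhere near the midpoint should separate the two groups cleanly on an ordinary kitchen or pocket scale.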

length(which(pennies$notes == "wheat"))

## [1] 12
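
The date-based alternative mentioned above can be sketched as follows. The wheat reverse was used through 1958, so counting dates of 1958 or earlier should agree with the notes column as long as every wheat penny has a readable date. The `toy` rows are made up for illustration and are not taken from the real dataset.

```r
# Made-up rows for illustration only; not taken from the real dataset.
toy <- data.frame(
  date  = c(1944, 1958, 1959, NA, 2017),
  notes = c("wheat", "wheat", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Count by the notes column; na.rm = TRUE skips pennies with no note.
by_notes <- sum(toy$notes == "wheat", na.rm = TRUE)

# Count by date; na.rm = TRUE skips pennies with an unknown date.
by_date <- sum(toy$date <= 1958, na.rm = TRUE)

c(by_notes = by_notes, by_date = by_date)
```

The two counts can differ on real data: a wheat penny with an unreadable date is only caught by the notes column, so comparing both is a cheap consistency check.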

Conclusion

This was a pretty neat project since I was able to collect the data myself. While I did a reasonable job of collecting the data, I think I could have recorded things in a more uniform fashion. In particular, I originally listed all of the pennies with no mint mark as P, since these coins should have been minted by the Philadelphia mint. This created two issues: first, some coins without a mint mark were not actually minted in Philadelphia, and second, 2017 pennies from Philadelphia explicitly carry the P mint mark. Domain-specific knowledge clearly mattered here, since it is what led to my observation about the Philadelphia mint. Overall, I think I did a decent job recording the data, but I can improve in the future.

In the future, this would be a cool project to reproduce, perhaps on a larger scale, although sorting through 2500 pennies did take around 10 hours. Aside from just getting a bigger dataset, it could also be pretty neat to apply some statistical analysis. For example, does the split between our 1982 copper and clad pennies match actual US Mint production numbers? Overall, this was a really fun project, and I hope to do something similar in the future where I can collect and analyze my own data!