Introduction
In this post we look at a few key features of a dataset I put together. In November 2022 I requested a box of pennies, which holds $25 worth of coins in 50 rolls worth 50¢ each. I opened each roll and, for each penny, recorded the coin’s year, mint mark, whether it is 95% copper (or clad), the country, the roll number, and a few notes on special pennies.
All pennies except one were of US origin; the exception was a 2002 Canadian cent.
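As a quick sanity check on the sample size, the arithmetic behind a box can be sketched as follows. The variable names are only illustrative; the dollar and cent values come from the description above.

```r
# Expected sample size from the box and roll values quoted in the text.
box_value_cents  <- 25 * 100   # a $25 box, expressed in cents
roll_value_cents <- 50         # each roll is worth 50 cents

n_rolls   <- box_value_cents / roll_value_cents  # rolls per box
n_pennies <- n_rolls * roll_value_cents          # one cent per penny

c(rolls = n_rolls, pennies = n_pennies)
```

This matches the 2500 observations reported for the data frame in the next section.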
We begin our data exploration by loading up our R packages and dataset.
library(here)
library(readODS)
library(ggplot2)
fp <- here("csv", "pennies", "pennies_2022.ods")
pennies <- readODS::read_ods(fp)
Data Cleaning
Here we take a brief look at our data frame. Notice below that our date column is stored as a character type. This is because some of the dates were unreadable or very hard to determine accurately, so I used “unknown” as a placeholder in these cases.
str(pennies)
## 'data.frame': 2500 obs. of 6 variables:
## $ date : chr "2022" "2022" "2018" "2019" ...
## $ mint : chr "d" "d" "d" "d" ...
## $ copper95: num NA NA NA NA NA 1 NA NA NA NA ...
## $ country : chr NA NA NA NA ...
## $ roll : num 1 1 1 1 1 1 1 1 1 1 ...
## $ notes : chr NA NA NA NA ...
We will now clean up the dataset. First, we remove the foreign cent from the mix. Next, we replace the “unknown” placeholder with NA and convert our date column to numeric. Lastly, we update our copper95 column to use 0 for clad pennies.
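# Keep only US cents (country is NA for every US coin)
pennies <- pennies[which(is.na(pennies$country)),]
# Turn the "unknown" placeholder into NA, then convert the years to numeric
pennies$date <- gsub("unknown", NA, pennies$date)
pennies$date <- as.numeric(pennies$date)
# Copper pennies are coded 1; code the remaining (clad) pennies as 0
pennies$copper95[is.na(pennies$copper95)] <- 0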
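To see what these steps do without loading the real file, here is a minimal sketch on a made-up four-row data frame. The toy values, including the "canada" label, are assumptions for illustration and are not taken from the actual dataset. It uses direct indexing instead of gsub, which is an equivalent way to replace the placeholder.

```r
# Toy data frame mimicking the structure shown by str(pennies); values are made up.
toy <- data.frame(
  date     = c("2022", "unknown", "1982", "2002"),
  mint     = c("d", NA, "d", NA),
  copper95 = c(NA, NA, 1, NA),
  country  = c(NA, NA, NA, "canada"),  # hypothetical label for the foreign cent
  stringsAsFactors = FALSE
)

toy <- toy[is.na(toy$country), ]           # keep US cents only
toy$date[toy$date %in% "unknown"] <- NA    # placeholder becomes a missing value
toy$date <- as.numeric(toy$date)           # safe to convert now, no coercion warning
toy$copper95[is.na(toy$copper95)] <- 0     # 0 marks clad pennies

str(toy)
```

Replacing the placeholder before calling as.numeric avoids the "NAs introduced by coercion" warning that a direct conversion would raise.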
Data Visualization
With our data a little cleaner, we can now take a closer look. Below we create two histograms, one showing counts for each year, the other showing counts for each decade. Note that ggplot2 automatically drops the rows whose date is NA from the histograms, which is what the warnings below report.
ggplot(pennies, aes(x=date, fill=mint, color=mint)) +
geom_histogram(breaks=1920:2023) +
xlab("Date") + ylab("Count") +
ggtitle("Histogram of Pennies Sorted by Year")
## Warning: Removed 7 rows containing non-finite values (stat_bin).
ggplot(pennies, aes(x=date, fill=mint, color=mint)) +
geom_histogram(breaks=seq(from=1920, to=2030, by=10)) +
scale_x_continuous(breaks = 10 * 192:203) +
xlab("Date") + ylab("Count") +
ggtitle("Histogram of Pennies Sorted by Decade")
## Warning: Removed 7 rows containing non-finite values (stat_bin).
We can see from both histograms that most of our pennies carry the Denver mint mark, while the second most common category is no mint mark at all (NA). Note that pennies from the Philadelphia mint are not rare; however, the P mint mark itself is uncommon on cents, showing up in this sample only on 2017 pennies.
Also note that the 2017 penny shown with an NA mint mark was probably miscategorized when I recorded the data and likely carried a P mint mark.
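The same breakdown can be read off as a table instead of a plot. The sketch below bins years into decades and cross-tabulates them against mint mark; the toy rows are made up for illustration and do not come from the real dataset.

```r
# Made-up sample with the same date and mint columns as the cleaned data.
toy <- data.frame(
  date = c(1975, 1982, 1989, 2017, 2017, 2022, NA),
  mint = c(NA, "d", "d", "p", NA, "d", "d"),
  stringsAsFactors = FALSE
)

# Floor each year to the start of its decade, e.g. 1982 becomes 1980.
toy$decade <- 10 * floor(toy$date / 10)

# useNA = "ifany" keeps missing dates and missing mint marks as their own cells.
counts <- table(decade = toy$decade, mint = toy$mint, useNA = "ifany")
counts
```

Unlike the histograms, the table keeps the rows with missing dates visible, so nothing is silently dropped.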
Further Analysis
Before wrapping things up, let’s extract a few interesting values from our sample of pennies with some quick calculations.
(total <- length(pennies$copper95))
## [1] 2499
(copper <- sum(pennies$copper95 == 1))
## [1] 580
(clad <- total - copper)
## [1] 1919
(proportion_copper <- sum(pennies$copper95 == 1) / total)
## [1] 0.2320928
So out of our 2499 pennies, 580 are copper, making up about 23.2% of the total.
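For reporting, the raw proportion can be rounded to a readable percentage. This is a small sketch using the two counts printed above; sprintf is base R.

```r
# Counts taken from the output above.
total  <- 2499
copper <- 580

proportion_copper <- copper / total

# Format as a percentage with one decimal place; "%%" prints a literal percent sign.
copper_pct <- sprintf("%.1f%%", 100 * proportion_copper)
copper_pct
```

Since copper95 is coded as 0 and 1, mean(pennies$copper95) would give the same proportion in a single step.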
Now, one interesting fact is that 1982 has both copper and clad pennies. Let’s take a look at how many of each we found.
penny1982 <- pennies[(pennies$date == 1982) & !is.na(pennies$date) ,]
length(penny1982$date) # number of observations
## [1] 46
sum(penny1982$copper95 == 1)
## [1] 37
So we ended up with 46 pennies from 1982, 37 of them copper and the remaining 9 clad. Note that to determine whether a 1982 penny was copper or clad we weighed it: copper pennies are roughly 25% heavier than clad pennies.
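A minimal sketch of how a weight cutoff could separate the two types, assuming the nominal weights of 3.11 g for copper cents and 2.5 g for clad cents. The scale readings and the midpoint cutoff are illustrative assumptions, not values recorded in the dataset.

```r
# Nominal weights in grams.
copper_nominal_g <- 3.11
clad_nominal_g   <- 2.50

# Relative difference: how much heavier a copper cent is than a clad one.
relative_excess <- copper_nominal_g / clad_nominal_g - 1

# ASSUMPTION: use the midpoint between the nominal weights as the cutoff.
threshold_g <- (copper_nominal_g + clad_nominal_g) / 2

# HYPOTHETICAL scale readings for four 1982 pennies.
weights_g <- c(3.08, 2.52, 3.12, 2.47)

# 1 = copper, 0 = clad, matching the coding of the copper95 column.
copper95 <- as.numeric(weights_g > threshold_g)
copper95
```

The gap between the two nominal weights is large relative to ordinary wear, which is why a simple cutoff is a reasonable first approach.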
Lastly, we do a quick calculation of the number of wheat pennies in our dataset. Note that we could also get this number by counting the pennies dated 1958 or earlier.
length(which(pennies$notes == "wheat"))
## [1] 12
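The date-based alternative can be sketched on a made-up sample like this. The toy rows are assumptions for illustration; the 1958 cutoff is the one mentioned above.

```r
# Made-up sample with the date and notes columns of the cleaned data.
toy <- data.frame(
  date  = c(1944, 1957, 1958, 1959, 1982, NA),
  notes = c("wheat", "wheat", "wheat", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Count using the notes column; na.rm = TRUE skips pennies without a note.
by_notes <- sum(toy$notes == "wheat", na.rm = TRUE)

# Count using the year; na.rm = TRUE skips pennies with an unreadable date.
by_date <- sum(toy$date <= 1958, na.rm = TRUE)

c(by_notes = by_notes, by_date = by_date)
```

The two counts agree only if every wheat penny has a readable date, so a mismatch on the real data would point to a wheat penny with an unknown year or a missing note.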
Conclusion
This was a pretty neat project since I was able to collect the data myself. While I did a reasonable job of collecting the data, I could have recorded things in a more uniform fashion and can improve in the future. In particular, I originally listed all of the pennies with no mint mark as P, on the assumption that these coins were minted in Philadelphia. This created two issues: first, some coins without a mint mark were not actually minted in Philadelphia; second, 2017 pennies explicitly carry the P mint mark. Domain-specific knowledge also mattered here, since it is what led to my observation about the Philadelphia mint.
In the future, this would be a cool project to reproduce, perhaps on a larger scale, although sorting through 2500 pennies did take around 10 hours. Aside from just getting a bigger dataset, it could also be pretty neat to apply some statistical analysis. For example, does the split between our 1982 copper and clad pennies match actual US Mint production numbers? Overall, this was a really fun project, and I hope to do something similar in the future where I can collect and analyze my own data!
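As a rough sketch of what such a comparison might look like, an exact binomial test could check the 1982 split against a production share. The value of p_copper_assumed below is a hypothetical placeholder, not a real US Mint figure, and would need to be replaced with the actual production proportion before drawing any conclusion.

```r
# Observed 1982 counts from the analysis above.
copper_1982 <- 37
total_1982  <- 46

# HYPOTHETICAL placeholder: share of 1982 cents struck in copper.
# Replace with the real figure from US Mint production numbers.
p_copper_assumed <- 0.5

# Exact binomial test of the observed share against the assumed share.
result <- binom.test(copper_1982, total_1982, p = p_copper_assumed)

result$estimate  # observed proportion of copper pennies
result$p.value   # small values suggest the sample differs from the assumed share
```

Keep in mind that coins in circulation are not a random sample of what was minted, so a mismatch would not necessarily mean anything about production itself.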