Five Essential Characteristics of Analytic Tools

Joseph Mohr
8 min read · Mar 18, 2024

As projects change over time, analysis needs to be agile. When developing your analytic software, there are five essential characteristics to consider, listed here from easiest to most difficult to implement based on my own experience: reproducibility, dynamicity, scalability, modularity, and automation.

A general rule in data analysis, which might appropriately be coined Wickham's Function Razor, states that once you have copied and pasted a code chunk twice during your analysis (meaning you now have three copies of the same code), it should be rewritten as a function.

Consider this scenario. You are an economist reporting on workforce trends in the Dallas-Fort Worth, Houston, and Austin Metropolitan Statistical Areas (MSAs). Your task is to calculate the monthly change and cumulative growth in employment and create some graphs that visualize the trends. Because the report looks at recovery from the COVID-19 pandemic, you need a five-year window from January 2019 to December 2023.

You download the Current Employment Statistics (CES) for the selected MSAs from 2019–2023. A few things to note about this data: it is originally downloaded as an Excel workbook, so before you start, resave it as a CSV. It is updated on the third Friday of each month and is always lagged by one month. However, benchmark estimates are updated in January, so the January release is delayed until the second Friday of March and the February release occurs two weeks later.
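If you would rather script that conversion than do it by hand, here is a minimal sketch (the file names data/ces.xlsx and data/ces.csv and the assumption that the series lives on the first sheet are mine, for illustration):

# convert the downloaded Excel workbook to CSV
library(readxl)
library(readr)

ces_raw <- read_excel("data/ces.xlsx", sheet = 1)  # assumed file name and sheet
write_csv(ces_raw, "data/ces.csv")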

Once you have your data, you open up RStudio and get to work:

# load packages
library(dplyr)
library(ggplot2)
library(janitor)
library(readr)
library(stringr) # for str_c()
library(tidyr)   # for replace_na()

# read in data
ces <- read_csv("data/ces.csv") %>%
  clean_names("screaming_snake")

# calculate monthly change and cumulative growth
ces_update <- ces %>%
  select("YEAR",
         "PERIOD",
         "AREA_NUMBER",
         "AREA",
         "ADJUSTMENT",
         "NO_OF_JOBS") %>%
  filter(ADJUSTMENT == "Adjusted",
         AREA == "Houston-The Woodlands-Sugar Land") %>%
  mutate(YEAR = as.numeric(YEAR)) %>%
  group_by(AREA_NUMBER) %>%
  mutate(MONTHLY_CHANGE = NO_OF_JOBS - lag(NO_OF_JOBS)) %>%
  replace_na(list(MONTHLY_CHANGE = 0)) %>%
  mutate(CUM_GROWTH = cumsum(MONTHLY_CHANGE)) %>%
  mutate(DATE = str_c(PERIOD, " 1, ", YEAR)) %>%
  mutate(DATE = as.Date(DATE, "%B %d, %Y"))

# plot simple line graph of cumulative growth
ggplot(ces_update, aes(DATE, CUM_GROWTH)) +
  geom_line() +
  scale_y_continuous(n.breaks = 10, labels = scales::comma) +
  scale_x_date(date_breaks = "12 months", date_labels = "%b-%y",
               limits = c(min(ces_update$DATE),
                          max(ces_update$DATE)),
               expand = c(0, 0)) +
  theme_classic() +
  ggtitle("Houston MSA Cumulative Job Growth") +
  labs(x = NULL, y = "No. of Jobs") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
Cumulative growth of the Houston MSA workforce from January 2019 to December 2023

Now you just copy and paste the code two more times, make some minor edits, and run it. You draft up the report and send it out for publishing.

However, your boss emails you an hour later asking if you can reproduce the same visuals for all 25 MSAs in Texas. Additionally, after reading your initial report, some stakeholders expressed interest in seeing how this recovery compares to the 2007–2008 Financial Crisis, so your boss suggests that you also expand your timeline to 20 years.

1. Reproducibility

The initial script executes a specific series of tasks for a specific dataset. To run this code as a function, the variables that are unique to each analysis need to be defined as arguments. Unique variables are identified by considering what changes from one analysis to the next; in this case, the analysis looks at different geographic regions.

calc_trends <- function(df, area){

  df %>%
    select("YEAR",
           "PERIOD",
           "AREA_NUMBER",
           "AREA",
           "ADJUSTMENT",
           "NO_OF_JOBS") %>%
    filter(ADJUSTMENT == "Adjusted",
           AREA == area) %>%
    mutate(YEAR = as.numeric(YEAR)) %>%
    group_by(AREA_NUMBER) %>%
    mutate(MONTHLY_CHANGE = NO_OF_JOBS - lag(NO_OF_JOBS)) %>%
    replace_na(list(MONTHLY_CHANGE = 0)) %>%
    mutate(CUM_GROWTH = cumsum(MONTHLY_CHANGE)) %>%
    mutate(DATE = str_c(PERIOD, " 1, ", YEAR)) %>%
    mutate(DATE = as.Date(DATE, "%B %d, %Y"))

}

calc_trends(ces, area = "Houston-The Woodlands-Sugar Land")

Also, when writing functions, it's best to name them with a verb. This is especially important when working in the tidyverse; review the tidyverse style guide for more info. To check that your function behaves as desired, use testthat or waldo to unit test its outputs!
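As a rough illustration of what that could look like (the expectations below are my assumptions about what the Houston output should contain, not values from the actual CES file):

library(testthat)

test_that("calc_trends returns one area with the expected columns", {
  result <- calc_trends(ces, area = "Houston-The Woodlands-Sugar Land")

  # only the requested area should remain after filtering
  expect_equal(unique(result$AREA), "Houston-The Woodlands-Sugar Land")

  # the derived columns should exist and contain no missing monthly changes
  expect_true(all(c("MONTHLY_CHANGE", "CUM_GROWTH", "DATE") %in% names(result)))
  expect_false(any(is.na(result$MONTHLY_CHANGE)))
})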

2. Dynamicity

Dynamicity is the ability of the analysis to adjust based on a given variable. For example, the window of time will move forward when the report is updated next month. Currently, specific months of data were downloaded, covering a 60-month window from January 2019 to December 2023. What if we wanted to produce a rolling 60-month graph? Unfortunately, the source requires the year and month to be selected separately, meaning that if our start month were February, we would still have to select January because otherwise it would be missing from the subsequent years. This also means that while the start month of our analysis is February, the start month in our data is still January, so it needs to be filtered out.

While filter is one possible solution here, we would not want to explicitly name the months to be filtered out, because as the window shifts, that set of months expands. Additionally, this would be a clunky call, something like filter(!(PERIOD == "January" & YEAR == 2019)), meaning the YEAR value would also need to be redefined every so often. Conversely, calling slice_tail(n = 61) extracts the last five years of data that we need and adjusts dynamically (remember that to calculate the monthly change we still need one month prior). To maintain reproducibility, the function adds a new months argument so that the window can be easily redefined.

calc_trends <- function(df, area, months){

  df %>%
    select("YEAR",
           "PERIOD",
           "AREA_NUMBER",
           "AREA",
           "ADJUSTMENT",
           "NO_OF_JOBS") %>%
    filter(ADJUSTMENT == "Adjusted",
           AREA == area) %>%
    mutate(YEAR = as.numeric(YEAR)) %>%
    group_by(AREA_NUMBER) %>%
    slice_tail(n = months) %>%
    mutate(MONTHLY_CHANGE = NO_OF_JOBS - lag(NO_OF_JOBS)) %>%
    replace_na(list(MONTHLY_CHANGE = 0)) %>%
    mutate(CUM_GROWTH = cumsum(MONTHLY_CHANGE)) %>%
    mutate(DATE = str_c(PERIOD, " 1, ", YEAR)) %>%
    mutate(DATE = as.Date(DATE, "%B %d, %Y"))

}

# 5-year timeframe
calc_trends(ces, "Houston-The Woodlands-Sugar Land", months = 61)

# 20-year timeframe
calc_trends(ces, "Houston-The Woodlands-Sugar Land", months = 241)

3. Scalability

Writing scalable code is partly a matter of properly ordering your operations, but it also extends beyond the scope of your current task. Consider the function as it is currently written:

calc_trends <- function(df, area, months){

  df %>%
    select("YEAR",
           "PERIOD",
           "AREA_NUMBER",
           "AREA",
           "ADJUSTMENT",
           "NO_OF_JOBS") %>%
    filter(ADJUSTMENT == "Adjusted",
           AREA == area) %>%
    mutate(YEAR = as.numeric(YEAR)) %>%
    group_by(AREA_NUMBER) %>%
    slice_tail(n = months) %>%
    mutate(MONTHLY_CHANGE = NO_OF_JOBS - lag(NO_OF_JOBS)) %>%
    replace_na(list(MONTHLY_CHANGE = 0)) %>%
    mutate(CUM_GROWTH = cumsum(MONTHLY_CHANGE)) %>%
    mutate(DATE = str_c(PERIOD, " 1, ", YEAR)) %>%
    mutate(DATE = as.Date(DATE, "%B %d, %Y"))

}

Notice that filter is called before group_by so that the grouping only operates on the rows we keep. Yet we don't really need both: since we filter down to a single area, grouping by AREA_NUMBER is extra work. The DATE calculation also has some issues, so we'll move it up and compute it in a single step; by loading lubridate and calling mdy() instead, we reduce the number of operations. It's also good practice to use functions from the same ecosystem where possible.

library(lubridate)

calc_trends <- function(df, area, months){

  df %>%
    select("YEAR",
           "PERIOD",
           "AREA_NUMBER",
           "AREA",
           "ADJUSTMENT",
           "NO_OF_JOBS") %>%
    mutate(DATE = mdy(str_c(PERIOD, " 1, ", YEAR))) %>%
    filter(ADJUSTMENT == "Adjusted",
           AREA == area) %>%
    slice_tail(n = months) %>%
    mutate(MONTHLY_CHANGE = NO_OF_JOBS - lag(NO_OF_JOBS)) %>%
    replace_na(list(MONTHLY_CHANGE = 0)) %>%
    mutate(CUM_GROWTH = cumsum(MONTHLY_CHANGE))

}
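With the area and window exposed as arguments, scaling to more regions is just a matter of iterating. Here is a minimal sketch using purrr (the three area names are placeholders standing in for the full list of 25 Texas MSAs):

library(purrr)

# placeholder vector; swap in all 25 Texas MSAs
areas <- c("Austin-Round Rock",
           "Dallas-Fort Worth-Arlington",
           "Houston-The Woodlands-Sugar Land")

# one trends data frame per MSA, named by area
trends_by_area <- areas %>%
  set_names() %>%
  map(\(a) calc_trends(ces, area = a, months = 61))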

4. Modularity

However, this function is still written very much like a pipeline and executes in a specific way. Consider what happens if we break our function apart.

# load packages
library(dplyr)
library(ggplot2)
library(janitor)
library(lubridate)
library(readr)
library(stringr)
library(tidyr)

# Import ----
read_ces <- function(file){

  read_csv(file) %>%
    clean_names("screaming_snake") %>%
    select("YEAR",
           "PERIOD",
           "AREA_NUMBER",
           "AREA",
           "ADJUSTMENT",
           "NO_OF_JOBS") %>%
    filter(ADJUSTMENT == "Adjusted") %>%
    mutate(DATE = mdy(str_c(PERIOD, " 1, ", YEAR)))

}

# Transform ----
calc_trends <- function(df, area, months){

  df %>%
    filter(AREA == area) %>%
    slice_tail(n = months) %>%
    mutate(MONTHLY_CHANGE = NO_OF_JOBS - lag(NO_OF_JOBS)) %>%
    replace_na(list(MONTHLY_CHANGE = 0)) %>%
    mutate(CUM_GROWTH = cumsum(MONTHLY_CHANGE))

}

# Visualize ----
plot_cumgrowth <- function(data, area){

  ggplot(data, aes(DATE, CUM_GROWTH)) +
    geom_line() +
    scale_y_continuous(n.breaks = 10, labels = scales::comma) +
    scale_x_date(date_breaks = "12 months", date_labels = "%b-%y",
                 limits = c(min(data$DATE),
                            max(data$DATE)),
                 expand = c(0, 0)) +
    theme_classic() +
    ggtitle(paste0(area, " MSA Cumulative Job Growth")) +
    labs(x = NULL, y = "No. of Jobs") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

}

Speed is paramount: selecting columns and filtering for adjusted values on every single call makes things slow, so the import step now does that work once. Notice below how the readability and maintainability of the workflow increase. The code above might be saved as a separate R file named functions.R and loaded by calling source(), while the code below becomes your main workflow.

ces <- read_ces("data/ces.csv")

ces_update <- ces %>%
  calc_trends("Houston-The Woodlands-Sugar Land", months = 61)

plot_cumgrowth(ces_update, "Houston-The Woodlands-Sugar Land")

With this setup, the data frame and the graph are produced separately, which improves the overall agility of the code and gives you flexibility in how the data and outputs are stored.
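For instance, because the transformed data frame and the plot are separate objects, you could write each out on its own (the output/ paths here are assumptions for illustration):

# save the transformed data alongside the figure
write_csv(ces_update, "output/houston_trends.csv")

ggsave("output/houston_cumgrowth.png",
       plot = plot_cumgrowth(ces_update, "Houston-The Woodlands-Sugar Land"),
       width = 8, height = 5, dpi = 300)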

5. Automation

Automation is also oriented toward large-scale operations, and its implementation requires some higher-level strategic thinking. Let's say you're using Quarto to generate economic reports on all of the MSAs in Texas on a monthly basis. In this case, it would be best to store all of your data as one file in its own directory, with a project structure like so:

MSA
|- code
|  |- functions.R
|- data
|  |- ces.csv
|- economics.qmd

The Quarto report named economics.qmd would look like this:

---
title: "Economic report: `r params$msa`"
date: "`r Sys.Date()`"
format:
  html:
    embed-resources: true
params:
  msa: "Houston-The Woodlands-Sugar Land"
---

```{r setup}

source("code/functions.R")

```

```{r}

ces <- read_ces("data/ces.csv")

ces_update <- ces %>%
  calc_trends(params$msa, months = 61)

plot_cumgrowth(ces_update, params$msa)

```

To render reports for the three MSAs, call purrr::walk in the console like so:

library(purrr)

areas <- c("Austin-Round Rock",
           "Dallas-Fort Worth-Arlington",
           "Houston-The Woodlands-Sugar Land")

walk(
  areas,
  \(x) quarto::quarto_render(
    "economics.qmd",
    execute_params = list(msa = x),
    output_file = glue::glue("{x}.html")
  )
)
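To cover all 25 Texas MSAs rather than these three, one option (a sketch, assuming every MSA appears in the AREA column of the CES file) is to derive the vector from the data itself and pass it to the same walk() call:

ces <- read_ces("data/ces.csv")

# every distinct MSA present in the data
areas <- sort(unique(ces$AREA))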

Now we have transformed a single R pipeline into an analytic toolkit behind a full-fledged report that can be produced at scale.

A few things to keep in mind

As these design characteristics are implemented, the scope of the operation often increases, which impacts the work of others. Coordination between teams and departments will most likely be necessary.

While this buildout resulted in a rendered Quarto report, that might not always be necessary or viable. It’s imperative to use the tools already in place and minimize disruption when building out tools, workflows, and operations.

With these five essential characteristics in mind, projects become easier to run as they are more agile, functions across projects also become easier to share, and the lead time to reporting is reduced. Overall, deliverables are more accurate and of higher quality, and your organization is that much closer to being a trusted source for analysis in whatever field you might work in.

Joseph Mohr is a Catholic, Athlete, Craftsman, Writer, Stoic, and, most importantly, a Floral Enthusiast. He graduated from Texas A&M in 2019 with a B.S. in Human Geography but is more so a polymath studying the intersection of politics, economics, construction, public health, and religion through the lens of demographics. He currently works as a data analyst for a non-profit working to improve education in Texas and runs The Night Watch, his personal consulting office. In his spare time, he enjoys studying art, architecture, cartography, psychology, sports medicine, and language, and is a lover of running, lifting, drawing, cold brew coffee, camping, hiking, hockey, leatherwork, and Texas Two-Stepping.
