Using R to remove email addresses from Excel files based on SurveyMonkey responses

This is part 2 of my email cleaning work for the youth mentoring service organization, “100 Black Men of Triangle East”. The national “100 Black Men” motto is “What They See is What They’ll Be”. Here is part 1.

Asking our email list whether they wanted to unsubscribe through a Gmail email created a mess of embedded emails that mixed unsubscribe requests, comments and questions, and bounced emails.

So our president, Nate Branscomb, advised we use SurveyMonkey‘s free ability to email a survey link that would trigger an automatic collection of bounced emails. The survey itself would simply include “unsubscribe” responses. Finally, I discovered the SurveyMonkey email included an “opt-out” link that would cause users’ email addresses to be stored as such.

I sent out the survey about a week and a half ago and have observed responses to have trickled off. SurveyMonkey allowed emailing 1000 individuals at a time and so I created two duplicate surveys. SurveyMonkey fortunately had a duplicate feature.

Respondents’ individual responses were stored in a detailed way. This was not part of my solution because it was too cumbersome to copy and paste email addresses and the “Export” feature was available only to those with a paid membership.

The feature I needed was found under “COLLECT RESPONSES”. Clicking through yielded a “Complete responses” screen (not shown) from which I could copy and paste the email addresses of those who wished to unsubscribe.

Chose “Complete responses” to see a list of emails associated with an unsubscribe response. These appeared in immutable groups of 10, as I recall, so this process would be unwieldly for a larger email list.

To identify bounced email addresses as well as those who opted-out, I clicked under my user-name at the top-right and chose Contacts.

In Notepad++, I pasted email addresses and then used the ALT+SPACE shortcut to vertically highlight as needed to eliminate leading spaces that might compromise parsing in R. To be clear, the “copy” that preceded the “paste” was a mouse click-and-drag on SurveyMonkey screens.

This R code shows how I then created the final output file, of which the first worksheet ultimately excluded (i) unsubscribed, (ii) bounced, and (iii) opted-out email addresses.

####################################################
## Reads email addresses copied and pasted from ##
## Surveymonkey screens into .txt files and then ##
## removes them from the original list. The ##
## final Excel workbook has the cleaned emails, ##
## a copy of the previously created second sheet ##
## (invalid emails, produced by ##
## excel_email_cleaner_new.R, found at: ##
## https://gist.github.com/RickPack/907a200cd40c786e19d045b379527f6d) ##
## and then worksheets with (i) bounced email ##
## addresses and (ii) opt-out + unsubscribes ##
####################################################
library(readr)
library(readxl)
library(dplyr)
library(openxlsx)
out_folder <- 'Output/'
fl <- paste0(out_folder, 'old100BM_Email_Contacts.xlsx')
# Read the first sheet in the last cleaned emails Excel file in the output folder
# Sheet is named Valid_2019_02_14
originalxls1 <- read_excel(fl, col_types = "text", sheet = 1)
# Now sheet 2, named Invalid_2019_02_14
originalxls2 <- read_excel(fl, col_types = "text", sheet = 2)
# Read the bounced email addresses copied and pasted from SurveyMonkey
# read_delim is used because fortunately, copying and pasting caused
# columns to be separated by spaces, which I set as the delimiter.
# This causes a false read of other columns but I only want the first
# (email_addresses), hence the use of select(1)
bounced_emails <- readr::read_delim('Survey1_bounces.txt', " ") %>%
select(1) %>%
mutate(Email_Address = toupper(stringr::str_trim(Email_Address)))
unsub_emails <- readr::read_delim('Survey1_unsubscribes.txt', " ") %>%
select(1) %>%
mutate(Email_Address = toupper(stringr::str_trim(Email_Address)))
# Confirm email counts match .txt files
nrow(bounced_emails)
nrow(unsub_emails)
# Now let's remove the email addresses that were either bounced or unsubscribes
remove_emails <- bind_rows(bounced_emails, unsub_emails)
## Are there any strange characters that we want to remove?
remove_emails %>%
dplyr::filter(grepl("[^A-Z0-9_@.\\-]", Email_Address))
## No additional cleaning needed so let's remove any email addresses
## from the original emails that were found in the remove list
newxls1 <- anti_join(originalxls1, remove_emails)
# Count of emails in list went from 1593 to 1388
# Finally, let's create the new output Excel file
###########################################
# This is code using openxlsx functions #
# that I often reuse. #
# It emboldens and enlarges the header #
# row and facilitates adding worksheets #
# with desired names. #
###########################################
### Create the Excel workbook
### style to embolden the first row
bldStyle <- createStyle(fontSize = 14, fontColour = "black", textDecoration = c("BOLD"))
### function to position data frames on separate worksheets, then save workbook at end
### When n_xlsx argument equals max_nxlsx argument, save workbook
xlsxformat <- function(wb, namxlsx="", wksht_name, df_inxlsx, nxlsx, max_nxlsx){
if (nxlsx == 1) {
wb <- createWorkbook()
}
addWorksheet(wb, wksht_name)
writeData(wb, nxlsx, df_inxlsx, colNames = TRUE, headerStyle = bldStyle)
setColWidths(wb, sheet = nxlsx, cols = 1:ncol(df_inxlsx), widths = "auto")
if (max_nxlsx == nxlsx) {
saveWorkbook(wb, paste0(namxlsx, ".xlsx"), overwrite = TRUE)
}
invisible(wb)
}
date_creation <- gsub("", "_", Sys.Date())
wb <- xlsxformat(wb,
wksht_name = paste0("Valid_", date_creation),
df_inxlsx = newxls1, nxlsx = 1, max_nxlsx = 4)
wb <- xlsxformat(wb, namxlsx = paste0(out_folder, "100BM_Email_Contacts"),
wksht_name = paste0("Invalid_", date_creation),
df_inxlsx = originalxls2, nxlsx = 2, max_nxlsx = 4)
wb <- xlsxformat(wb, namxlsx = paste0(out_folder, "100BM_Email_Contacts"),
wksht_name = "Bounced Emails",
df_inxlsx = bounced_emails, nxlsx = 3, max_nxlsx = 4)
wb <- xlsxformat(wb, namxlsx = paste0(out_folder, "100BM_Email_Contacts"),
wksht_name = "Unsubscribed Emails",
df_inxlsx = unsub_emails, nxlsx = 4, max_nxlsx = 4)

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s