Excel Email Address Cleaner & Aggregator – R function (#readxl #openxlsx)

Zillow $1M machine learning winner, Jordan Meyer, keynoting Analytics>Forward by Research Triangle Analysts in Durham, NC on March 9, 2019. Register using this link!

From the output Valid worksheet. The second column has valid email addresses. Other columns display the origin of the address including Excel filename and column name.

Using R packages including the tidyverse’s readxl (Jenny Bryan) and stringr (Hadley Wickham), I wrote an R script (code here and at the bottom of this post) called excel_email_cleaner to aggregate and filter out invalid email addresses found in any number of Excel workbooks with any number of worksheets that might contain email addresses.

This was created for a service organization of which I am a member, “100 Black Men of Triangle East” (serving Raleigh, Durham, and Chapel Hill, NC). If the name bothers you, please consider that the motto of the organization is “What They See is What They’ll Be”. The national “100 Black Men” organization counts among its original 1963 members Jackie Robinson and Dr. William Hayling. We mentor young men likely to be inspired by our image to commit to excellence and serving others (all beings).

As an aside: I believe we will eventually have mentoring programs that affirm all young people and build bridges between all of us through a courageous engagement with history and identity.

The excel_email_cleaner function:

  1. Finds Excel files as indicated by file extensions .xls or .xlsx
  2. Looks in each Excel file for column names that suggest an email address exists in that column
  3. Cleans email addresses found of invalid characters (anything not a letter, number, underscore, period or at-symbol [@])
  4. Labels the resulting email address as VALID or INVALID depending on whether the email address has more than one @ or zero periods [this could obviously use augmentation]
  5. Assigns each email address a corresponding name (e.g., “Tom Jones”) based on one or more found name columns (e.g., “First Name” or just “Name”), concatenating when appropriate
  6. Eliminates duplicate email address [this could use refinement – I currently use dplyr distinct() rather than a criterion like the most recent Excel file date, which could associate the email address with a more recent name
  7. Outputs an Microsoft Excel .xlsx workbook with:
    1. Name
    2. Cleaned Email Address
    3. Origin Excel filename
    4. Origin Excel sheetname
    5. Original uncleaned email address
    6. Name of column containing that original email address
    7. A flag indicating whether cleaning took place
    8. A flag indicating whether the cleaned email address appears to still be invalid
    9. a simple row-counter
    10. an email_group column that one could use to identify what email addresses to use each day to avoid getting blocked by an email provider. This is currently set to groups of 300 email addresses and can be adjusted by changing the number 300 in the script.
From the output Valid worksheet. The second column has valid email addresses. Other columns display the origin of the address including Excel filename and column name.
As the script runs, text indicates status and shows examples. Notice the invalid email addresses at the top, an Excel file was found with an email column but no name [still gets processed], and at the bottom are invalid email addresses that get cleaned prior to insertion in the output Excel workbook.
Status message. Notice that the above picture displays an older version of this message.
Worksheets in the output Excel file.

Something like this R script may already exist. Please let me know if you find another example. I know my current script could use refining.

Cleaning email addresses can be critical. Odd characters can appear in Excel cells like ¢ as one copies and pastes between file formats or some other file-encoding issue comes into play. I have not yet researched whether such values as ¢ are legal in email addresses but in the case of the served non-profit, those values were undesired suffixes (e.g., DONTEMAILME¢@AOL.COM).

Some discussion of these undesired characters and encoding challeges appear in these links:

This code is to remind me of how I (or some kind soul) might improve the script so it better handles where multiple email address exist in the same cell. I think tidyr separate_rows() could do it. Fortunately, in my case, there were just a few email addresses of this kind one could manually extract from the output Invalid worksheet.

modset <- starwars %>% select(name) %>% 
mutate(simple_nam_email = paste0("dummy (", name, "@aol.com) ; <", name, "second@yahoo.com>"))
modset_h <- head(modset) modset_h2 <- modset_h %>%
separate_rows(sep = "[^[:alnum:].]+", convert = FALSE)
https://stackoverflow.com/questions/43427786/conditionally-insert-rows-to-a-dataframe-using-dplyr/43427988#43427988
dplyr::filter(str_count(simple_nam_email, "@") == 2) %>%
group_by(row_number()) %>%
do (
rbind(., runif(length(.)))
) %>%
ungroup() %>%
bind_rows(., modset_h %>% dplyr::filter(str_count(simple_nam_email, "@") != 2)) %>%
mutate(new = purrr::map(data,~ addrow_multiemail(.x))) %>%
mutate(email_mod = case_when(
str_detect(simple_nam_email, "\<") ~ str_extract(simple_nam_email, "(?<=\<).(?=\>)"), str_detect(simple_nam_email, "\(") ~ str_extract(simple_nam_email, "(?<=\().(?=\))"), TRUE ~ simple_nam_email)) %>%
mutate(email_mod = str_replace_all(toupper(email_mod),
"[^A-Z0-9_@.\-]", "")
)
head(modset_h2)

# clean and aggregate email addresses from Excel workbooks with the Tidyverse
# 12tailedvampire name reference – link to Kevin Feasel reference – Microsoft AI bootcamp
# readxl for opening the Excel workbooks
# First test case is an email aggregation for the non-profit, scholarship-raising,
# youth mentorship program-leading organization 100 Black Men of Triangle East
# NOTE: Currently needs Excel column names on the first row
# !! Set this variable to the folder containing the Microsoft Excel files
dir_emails <- "C:/Users/rick2/Documents/100bm/content_agg/"
# Uncomment these lines to install R packages
# packages <- c("dplyr", "stringr", "readxl", "purrr", "openxlsx", "tidyr")
# if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
# install.packages(setdiff(packages, rownames(installed.packages())))
# }
# https://gist.github.com/RickPack/907a200cd40c786e19d045b379527f6d
date_creation <- gsub("", "_", Sys.Date())
library( readxl )
library( stringr )
library( dplyr )
library( purrr )
library( tidyr )
library( openxlsx )
excel_email_cleaner <- function(dir) {
out_folder <- paste0(dir_emails,"Output/")
dir.create(out_folder, showWarnings = FALSE)
# get the list of Excel files to process
fil_lst <- list.files(dir_emails)[grepl("(\\.xlsx)$|(\\.xls)$", tolower(list.files(dir_emails)))]
# Avoids processing temporary files (start with a tilde) like:
# ~$Copy of 100BMTE contacts (10-10-18).xlsx
fil_lst <- fil_lst[str_sub(fil_lst, 1, 1) != "~"]
fil_length <- length(fil_lst)
all_emails <- data.frame()
for (l in 1:fil_length){
fl <- fil_lst[l]
sht_length <- length( excel_sheets(fl) )
sht_names_var <- excel_sheets(fl)
for(sht in 1:sht_length){
xldf <- read_excel(fl, col_types = "text", sheet = sht)
sht_name <- sht_names_var[sht]
xldf_col <- colnames(xldf)
xldf_col_mod <- str_replace_all(toupper(xldf_col), "[^A-Z]", "")
xldf_col_names_all_num <- which(grepl("NAME", xldf_col_mod))
xldf_col_names_all_length <- length(which(grepl("NAME", xldf_col_mod)))
xldf_col_names_all <- xldf_col_mod[xldf_col_names_all_num]
name_simp_num <- which(str_trim(xldf_col_mod) == "NAME")
name_first <- which(grepl("FIRST", xldf_col_mod))
name_mid <- which(grepl("MIDDLE", xldf_col_mod))
name_last <- which(grepl("LAST", xldf_col_mod))
if(xldf_col_names_all_length > 0){
if(length(name_simp_num) > 0){
if(length(name_simp_num) > 1) {
message("too many NAME columns")
print(paste("e =", e, "of ", length(xldf_col_emails),
"sht =", sht, "of ", sht_length, "l =", l, "of", fil_length,
"finished"))}
if(length(name_simp_num) == 1) {
xldf <- xldf %>% mutate(final_name = .[[name_simp_num]])
}
} else if (xldf_col_names_all_length == 1){
xldf <- xldf %>% mutate(final_name = .[[xldf_col_names_all_num]])
} else {
# Zero (0) for the index will ultimately cause nothing to populate, for that
# value, the resulting vector that is a concatenation of
# first, middle, and last names
# (so no middle name, nothing appears between the first and last name,
# or only the NAME column will appear if that )
xldf_col_names_all2 <- c(name_first, name_mid, name_last)
xldf_col_names_all3 <- xldf_col_names_all2[!is.na(xldf_col_names_all2)]
xldf <- xldf %>% mutate(final_name =
purrr::reduce(.[xldf_col_names_all3], paste, sep = " ")) %>%
mutate(final_name =
str_replace_all(final_name, "(NA)", "")) %>%
mutate(final_name =
str_squish(final_name)) %>%
mutate(final_name =
case_when(final_name == "" ~ "UNKNOWN",
TRUE ~ final_name))
}
} else {
message(paste("No Email recipient name column found in workbook", fl, ", sheet ",
sht_name))
xldf$final_name = "UNKNOWN"
}
final_name_num <- which(colnames(xldf)=="final_name")
# store column names containing EMAIL as a numeric vector
xldf_col_emails_num <- which(grepl("EMAIL", xldf_col_mod))
xldf_col2 <- colnames(xldf)
xldf_col_emails <- xldf_col2[xldf_col_emails_num]
if(length(xldf_col_emails) == 0){
message(paste("No Email column found in workbook", fl, ", sheet ",
sht_name))
next
}
for (e in 1:length(xldf_col_emails)){
# Create a single-column
colnam_email <- xldf_col2[xldf_col_emails_num[e]]
xldf_sel <- xldf %>% select(final_name_num, xldf_col_emails[e]) %>%
mutate(simple_nam_email := toupper((!!as.name(colnam_email))))
colnam_email_1 <- quo_name(colnam_email)
em <- xldf_sel %>%
mutate(email_mod = case_when(
str_detect(simple_nam_email, "\\<") ~
str_extract(simple_nam_email,
"(?<=\\<).*(?=\\>)"),
str_detect(simple_nam_email, "\\(") ~
str_extract(simple_nam_email,
"(?<=\\().*(?=\\))"),
TRUE ~ simple_nam_email)) %>%
mutate(email_mod = str_replace_all(toupper(email_mod),
"[^A-Z0-9_@.\\-]", "")) %>%
distinct(email_mod, .keep_all = TRUE) %>%
mutate(CLEANED = case_when(
email_mod != simple_nam_email ~ "CLEANED",
TRUE ~ ""),
INVALID = case_when(
str_count(email_mod, "@") != 1 |
str_count(email_mod, ".") == 0 ~ "INVALID",
TRUE ~ ""))
em$Excel_filename <- fl
em$sheet_name <- sht_name
em$Name <- em$final_name
em$Email_Address <- em$email_mod
em$original_email <- em$simple_nam_email
em$original_email_field <- colnam_email
all_emails <- em %>%
distinct(Email_Address, .keep_all = TRUE) %>%
select(Name, Email_Address, Excel_filename, sheet_name, original_email,
original_email_field, CLEANED, INVALID) %>%
bind_rows(all_emails, .)
message("CLEANED")
print(all_emails %>% dplyr::filter(CLEANED=="CLEANED") %>%
slice(1:min(3, nrow(.))))
message("INVALID")
print(all_emails %>% dplyr::filter(INVALID=="INVALID") %>%
slice(1:min(3, nrow(.))))
print(paste("email_column =", e, "of ", length(xldf_col_emails),
"worksheet =", sht, "of ", sht_length, "Excel file =", l, "of", fil_length,
"finished"))
}
}
}
new_xl_frm <- all_emails %>%
mutate(name_priority = case_when(Name == "UNKNOWN" | str_trim(Name) == "" ~ 0,
TRUE ~ 1)) %>%
group_by(Email_Address) %>%
slice(which.max(name_priority)) %>%
ungroup() %>%
select(name_priority) %>%
replace(., is.na(.), "") %>%
dplyr::filter(str_length(Email_Address) > 4) %>%
distinct(Name, Email_Address, .keep_all = TRUE)
invalid_frm <- new_xl_frm %>% dplyr::filter(INVALID == "INVALID")
valid_frm <- new_xl_frm %>% dplyr::filter(INVALID != "INVALID") %>%
mutate(ct = row_number()) %>%
mutate(email_group = ceiling(ct/300))
### Create the Excel workbook
### style to embolden the first row
bldStyle <- createStyle(fontSize = 14, fontColour = "black", textDecoration = c("BOLD"))
### function to position data frames on separate worksheets, then save workbook at end
xlsxformat <- function(wb, namxlsx="", wksht_name, df_inxlsx, nxlsx, max_nxlsx){
if(nxlsx==1){
wb <- createWorkbook()
}
addWorksheet(wb, wksht_name)
writeData(wb, nxlsx, df_inxlsx, colNames=TRUE, headerStyle = bldStyle)
setColWidths(wb, sheet = nxlsx, cols = 1:ncol(df_inxlsx), widths = "auto")
if(max_nxlsx == nxlsx){
saveWorkbook(wb, paste0(namxlsx, ".xlsx"), overwrite = TRUE)
}
invisible(wb)
}
# When n_xlsx argument equals max_nxlsx argument, save workbook
wb <- xlsxformat(wb,
wksht_name = paste0("Valid_", date_creation),
df_inxlsx = valid_frm, nxlsx = 1, max_nxlsx = 2)
wb <- xlsxformat(wb, namxlsx = paste0(out_folder, "100BM_Email_Contacts"),
wksht_name = paste0("Invalid_", date_creation),
df_inxlsx = invalid_frm, nxlsx = 2, max_nxlsx = 2)
final_lst <- list()
final_lst[[1]] <- valid_frm
final_lst[[2]] <- invalid_frm
invisible(final_lst)
}
final_lst_out <- excel_email_cleaner(dir_emails)
valid_frm_out <- final_lst_out[[1]]
invalid_frm_out <- final_lst_out[[2]]
message("Notice invalid characters")
valid_frm_out %>% dplyr::filter(CLEANED != "")
middle <- function(d, n = 5){
N = nrow(d) / 2
print(tbl_df(d[(Nn/2):(N + n 1 n / 2),]), n = n)
}

4 thoughts on “Excel Email Address Cleaner & Aggregator – R function (#readxl #openxlsx)

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s