Zillow $1M machine learning winner, Jordan Meyer, keynoting Analytics>Forward by Research Triangle Analysts in Durham, NC on March 9, 2019. Register using this link!
Using R packages including the tidyverse’s readxl (Jenny Bryan) and stringr (Hadley Wickham), I wrote an R script (code here and at the bottom of this post) called excel_email_cleaner to aggregate and filter out invalid email addresses found in any number of Excel workbooks with any number of worksheets that might contain email addresses.
This was created for a service organization of which I am a member, “100 Black Men of Triangle East” (serving Raleigh, Durham, and Chapel Hill, NC). If the name bothers you, please consider that the motto of the organization is “What They See is What They’ll Be”. The national “100 Black Men” organization counts among its original 1963 members Jackie Robinson and Dr. William Hayling. We mentor young men likely to be inspired by our image to commit to excellence and serving others (all beings).
As an aside: I believe we will eventually have mentoring programs that affirm all young people and build bridges between all of us through a courageous engagement with history and identity.
The excel_email_cleaner function:
- Finds Excel files as indicated by file extensions .xls or .xlsx
- Looks in each Excel file for column names that suggest an email address exists in that column
- Cleans email addresses found of invalid characters (anything not a letter, number, underscore, period or at-symbol [@])
- Labels the resulting email address as VALID or INVALID depending on whether the email address has more than one @ or zero periods [this could obviously use augmentation]
- Assigns each email address a corresponding name (e.g., “Tom Jones”) based on one or more found name columns (e.g., “First Name” or just “Name”), concatenating when appropriate
- Eliminates duplicate email address [this could use refinement – I currently use dplyr distinct() rather than a criterion like the most recent Excel file date, which could associate the email address with a more recent name
- Outputs an Microsoft Excel .xlsx workbook with:
- Cleaned Email Address
- Origin Excel filename
- Origin Excel sheetname
- Original uncleaned email address
- Name of column containing that original email address
- A flag indicating whether cleaning took place
- A flag indicating whether the cleaned email address appears to still be invalid
- a simple row-counter
- an email_group column that one could use to identify what email addresses to use each day to avoid getting blocked by an email provider. This is currently set to groups of 300 email addresses and can be adjusted by changing the number 300 in the script.
Something like this R script may already exist. Please let me know if you find another example. I know my current script could use refining.
Cleaning email addresses can be critical. Odd characters can appear in Excel cells like ¢ as one copies and pastes between file formats or some other file-encoding issue comes into play. I have not yet researched whether such values as ¢ are legal in email addresses but in the case of the served non-profit, those values were undesired suffixes (e.g., DONTEMAILME¢@AOL.COM).
Some discussion of these undesired characters and encoding challeges appear in these links:
This code is to remind me of how I (or some kind soul) might improve the script so it better handles where multiple email address exist in the same cell. I think tidyr separate_rows() could do it. Fortunately, in my case, there were just a few email addresses of this kind one could manually extract from the output Invalid worksheet.
modset <- starwars %>% select(name) %>%
mutate(simple_nam_email = paste0("dummy (", name, "@aol.com) ; <", name, "email@example.com>"))
modset_h <- head(modset) modset_h2 <- modset_h %>%
separate_rows(sep = "[^[:alnum:].]+", convert = FALSE)
dplyr::filter(str_count(simple_nam_email, "@") == 2) %>%
bind_rows(., modset_h %>% dplyr::filter(str_count(simple_nam_email, "@") != 2)) %>%
mutate(new = purrr::map(data,~ addrow_multiemail(.x))) %>%
mutate(email_mod = case_when(
str_detect(simple_nam_email, "\<") ~ str_extract(simple_nam_email, "(?<=\<).(?=\>)"), str_detect(simple_nam_email, "\(") ~ str_extract(simple_nam_email, "(?<=\().(?=\))"), TRUE ~ simple_nam_email)) %>%
mutate(email_mod = str_replace_all(toupper(email_mod),