Excel Email Address Cleaner & Aggregator – R function (#readxl #openxlsx)

Zillow $1M machine learning winner, Jordan Meyer, keynoting Analytics>Forward by Research Triangle Analysts in Durham, NC on March 9, 2019. Register using this link!

From the output Valid worksheet. The second column has valid email addresses. Other columns display the origin of the address including Excel filename and column name.

Using R packages including the tidyverse’s readxl (Jenny Bryan) and stringr (Hadley Wickham), I wrote an R script (code here and at the bottom of this post) called excel_email_cleaner to aggregate and filter out invalid email addresses found in any number of Excel workbooks with any number of worksheets that might contain email addresses.

This was created for a service organization of which I am a member, “100 Black Men of Triangle East” (serving Raleigh, Durham, and Chapel Hill, NC). If the name bothers you, please consider that the motto of the organization is “What They See is What They’ll Be”. The national “100 Black Men” organization counts among its original 1963 members Jackie Robinson and Dr. William Hayling. We mentor young men likely to be inspired by our image to commit to excellence and serving others (all beings).

As an aside: I believe we will eventually have mentoring programs that affirm all young people and build bridges between all of us through a courageous engagement with history and identity.

The excel_email_cleaner function:

  1. Finds Excel files as indicated by file extensions .xls or .xlsx
  2. Looks in each Excel file for column names that suggest an email address exists in that column
  3. Cleans email addresses found of invalid characters (anything not a letter, number, underscore, period or at-symbol [@])
  4. Labels the resulting email address as VALID or INVALID depending on whether the email address has more than one @ or zero periods [this could obviously use augmentation]
  5. Assigns each email address a corresponding name (e.g., “Tom Jones”) based on one or more found name columns (e.g., “First Name” or just “Name”), concatenating when appropriate
  6. Eliminates duplicate email address [this could use refinement – I currently use dplyr distinct() rather than a criterion like the most recent Excel file date, which could associate the email address with a more recent name
  7. Outputs an Microsoft Excel .xlsx workbook with:
    1. Name
    2. Cleaned Email Address
    3. Origin Excel filename
    4. Origin Excel sheetname
    5. Original uncleaned email address
    6. Name of column containing that original email address
    7. A flag indicating whether cleaning took place
    8. A flag indicating whether the cleaned email address appears to still be invalid
    9. a simple row-counter
    10. an email_group column that one could use to identify what email addresses to use each day to avoid getting blocked by an email provider. This is currently set to groups of 300 email addresses and can be adjusted by changing the number 300 in the script.
From the output Valid worksheet. The second column has valid email addresses. Other columns display the origin of the address including Excel filename and column name.
As the script runs, text indicates status and shows examples. Notice the invalid email addresses at the top, an Excel file was found with an email column but no name [still gets processed], and at the bottom are invalid email addresses that get cleaned prior to insertion in the output Excel workbook.
Status message. Notice that the above picture displays an older version of this message.
Worksheets in the output Excel file.

Something like this R script may already exist. Please let me know if you find another example. I know my current script could use refining.

Cleaning email addresses can be critical. Odd characters can appear in Excel cells like ¢ as one copies and pastes between file formats or some other file-encoding issue comes into play. I have not yet researched whether such values as ¢ are legal in email addresses but in the case of the served non-profit, those values were undesired suffixes (e.g., DONTEMAILME¢@AOL.COM).

Some discussion of these undesired characters and encoding challeges appear in these links:

This code is to remind me of how I (or some kind soul) might improve the script so it better handles where multiple email address exist in the same cell. I think tidyr separate_rows() could do it. Fortunately, in my case, there were just a few email addresses of this kind one could manually extract from the output Invalid worksheet.

modset <- starwars %>% select(name) %>% 
mutate(simple_nam_email = paste0("dummy (", name, "@aol.com) ; <", name, "second@yahoo.com>"))
modset_h <- head(modset) modset_h2 <- modset_h %>%
separate_rows(sep = "[^[:alnum:].]+", convert = FALSE)
https://stackoverflow.com/questions/43427786/conditionally-insert-rows-to-a-dataframe-using-dplyr/43427988#43427988
dplyr::filter(str_count(simple_nam_email, "@") == 2) %>%
group_by(row_number()) %>%
do (
rbind(., runif(length(.)))
) %>%
ungroup() %>%
bind_rows(., modset_h %>% dplyr::filter(str_count(simple_nam_email, "@") != 2)) %>%
mutate(new = purrr::map(data,~ addrow_multiemail(.x))) %>%
mutate(email_mod = case_when(
str_detect(simple_nam_email, "\<") ~ str_extract(simple_nam_email, "(?<=\<).(?=\>)"), str_detect(simple_nam_email, "\(") ~ str_extract(simple_nam_email, "(?<=\().(?=\))"), TRUE ~ simple_nam_email)) %>%
mutate(email_mod = str_replace_all(toupper(email_mod),
"[^A-Z0-9_@.\-]", "")
)
head(modset_h2)

Advertisements

2 thoughts on “Excel Email Address Cleaner & Aggregator – R function (#readxl #openxlsx)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s