Linear Regression 9/25 Raleigh talk preview using rstats #rvest package

I am preparing a data science talk on linear regression with TriPASS leader Kevin Feasel for Tuesday, September 25th, at Vaco Raleigh (near PNC Arena).

Part of my presentation will explore predicting track running times from this summer’s Carolina Godiva Track Club Summer Track Series (2018). I have written a few blog posts (here’s one) about these fun track meets, which took place at Durham Academy.

The idea is that the time it takes someone to run, say, a 100m dash might be roughly predictable from age, gender, temperature, and how many events the runner had completed before that day’s event. I may ultimately show instead that linear regression cannot predict the track times well because a statistical assumption is violated, such as the assumption that the errors have constant variance (homoscedasticity).
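A rough sketch of that idea in base R, using simulated data rather than the actual meet results. The column names (age, gender, temp_f, prior_events, time_100m) and every number are invented for illustration, and the noise is deliberately made to grow with age so that the constant-variance assumption fails. The check at the end is a simplified version of the Breusch-Pagan idea, not a full diagnostic.

```r
set.seed(925)  # arbitrary seed so the simulated data are reproducible

# Simulated, invented runners: none of these numbers come from the Godiva results.
n <- 500
runners <- data.frame(
  age          = sample(10:70, n, replace = TRUE),
  gender       = factor(sample(c("F", "M"), n, replace = TRUE)),
  temp_f       = sample(75:95, n, replace = TRUE),
  prior_events = sample(0:3, n, replace = TRUE)
)

# Made-up relationship whose noise grows with age, so the
# constant-variance assumption is violated on purpose.
runners$time_100m <- 12 + 0.08 * runners$age +
  1.5 * (runners$gender == "F") + 0.02 * runners$temp_f +
  0.3 * runners$prior_events +
  rnorm(n, sd = 0.2 + 0.03 * runners$age)

# Ordinary least squares fit of time on the four predictors.
fit <- lm(time_100m ~ age + gender + temp_f + prior_events, data = runners)

# Rough check for non-constant variance: regress the squared residuals on the
# fitted values and compare n * R^2 to a chi-squared distribution with 1 df.
aux     <- lm(resid(fit)^2 ~ fitted(fit))
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(coef(fit))
print(bp_p)
```

A small p-value here would suggest that the residual spread changes with the fitted values. Plotting resid(fit) against fitted(fit) shows the same thing visually as a fan shape.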

I set out to use the rvest R package by Hadley Wickham to web scrape the 7pm (start-of-meet) temperature from the Carolina Godiva Track Club weather conditions webpage.

godiva_track.png

The Google Chrome web browser SelectorGadget tool helped me accomplish this task. I clicked the SelectorGadget icon and then carefully clicked within each cell displaying the temperature (Fahrenheit) under “7pm”. I discovered that I needed a “p” to show under the box.

p_box

I similarly clicked under Dew Point to de-select those cells (turn them red). I had some trouble selecting the correct temperature cells. They would sometimes turn red instead of green. De-selecting a field to the right of “2018 Min Value” (a table row?) allowed me to select all of the temperature cells without issue. SelectorGadget sometimes needs a little extra help to know what particular data you want, and you might need to search the page to see if any undesired element has been selected.

rvest_xpath_full.png
Notice the need to de-select (turn red) the field (a vertical rectangle in the image) at the bottom-left. Clicking XPath revealed the code for the rvest html_nodes function.

The copyable code in the XPath dialog box was then inserted into the rvest html_nodes function call (xpath argument) to get the numbers I wanted.

library(rvest)

# weather_2018_ht is assumed to hold the weather conditions page, already parsed with read_html()
weather_2018_ht %>% html_nodes(xpath="//tr[(((count(preceding-sibling::*) + 1) = 12) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 11) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 10) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 9) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 8) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 7) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 6) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 4) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span") %>%
    html_text() %>%         # pull the text out of each matched span
    as.numeric()            # convert the temperature strings to numbers
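The generated XPath above is long, but each piece only says "the span inside the second cell of row N" for rows 3 through 12. A minimal sketch of the same selection on a tiny inline table may make that easier to see. The table contents and the shorter position()-based XPath are my own illustration, not the real weather page, and the shorter form assumes that the rows are siblings only of other rows.

```r
library(rvest)

# Hypothetical miniature of the weather table: a header row, a "Min Value" row,
# then one row per meet. The numbers are invented for illustration only.
page <- read_html('
<table>
  <tr><td>Date</td><td>7pm</td><td>Dew Point</td></tr>
  <tr><td>2018 Min Value</td><td><span>70</span></td><td><span>60</span></td></tr>
  <tr><td>Meet 1</td><td><span>88</span></td><td><span>71</span></td></tr>
  <tr><td>Meet 2</td><td><span>84</span></td><td><span>69</span></td></tr>
  <tr><td>Meet 3</td><td><span>91</span></td><td><span>73</span></td></tr>
</table>')

# Rows 3 onward, second cell only: the same idea as the generated XPath,
# written with position() instead of count(preceding-sibling::*) + 1.
temps <- page %>%
  html_nodes(xpath = "//tr[position() >= 3]/td[2]//span") %>%
  html_text() %>%
  as.numeric()

print(temps)
```

On this toy table the result is the three invented 7pm values (88, 84, 91). The header row, the "Min Value" row, and the Dew Point column are all skipped.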

temp_r.png

Godiva_runners_SirWalterRaleigh
Carolina Godiva Track Club “Sir Walter Raleigh” runners. We all participated in quite a few of the Summer Track Series meets I hope to analyze with linear regression.

 
