# Linear Regression 9/25 Raleigh talk preview using the R rvest package

Together with TriPASS leader Kevin Feasel, I am preparing a data science talk on Linear Regression for Tuesday, September 25th at Vaco Raleigh (near PNC Arena).

Part of my presentation will explore predicting track running times from this summer’s Carolina Godiva Track Club Summer Track Series (2018). I have written a few blog posts (here’s one) about these fun track meets, which took place at Durham Academy.

The idea is that the time it takes someone to run, say, a 100m dash might be generally predictable from age, gender, temperature, and how many events the runner had completed prior to that day’s event. I may ultimately show instead that the track times cannot be predicted well by linear regression because a statistical assumption is violated. For example, linear regression assumes that the errors have constant variance (homoscedasticity), and the track data may instead show non-constant variance (heteroscedasticity).
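As a rough sketch of the kind of model described here, the R code below fits a linear regression and plots residuals against fitted values to eyeball the constant-variance assumption. Everything about the data is an assumption for illustration: the data frame `runners`, its column names (`time_100m`, `age`, `gender`, `temp_f`, `prior_events`), and the coefficients used to simulate it are made up and are not the Summer Track Series results.

```r
# Simulated, illustrative data only: NOT the actual Summer Track Series results.
set.seed(925)
n <- 200
runners <- data.frame(
  age          = sample(10:70, n, replace = TRUE),
  gender       = factor(sample(c("F", "M"), n, replace = TRUE)),
  temp_f       = round(runif(n, 75, 95)),
  prior_events = sample(0:3, n, replace = TRUE)
)

# Hypothetical data-generating process whose noise grows with age,
# i.e. a deliberate violation of constant variance (heteroscedasticity).
runners$time_100m <- 12 +
  0.08 * abs(runners$age - 25) +
  1.0 * (runners$gender == "F") +
  0.02 * (runners$temp_f - 75) +
  0.3 * runners$prior_events +
  rnorm(n, sd = 0.3 + 0.03 * runners$age)

# Fit a linear model with the four predictors named in the text.
fit <- lm(time_100m ~ age + gender + temp_f + prior_events, data = runners)
print(summary(fit))

# Residuals vs. fitted values: a funnel shape suggests non-constant variance.
plot(fitted(fit), resid(fit),
     xlab = "Fitted 100m time (s)", ylab = "Residual (s)")
abline(h = 0, lty = 2)
```

With real meet results, the same `lm()` call and residual plot would apply once the results and temperatures are combined into one data frame. Whether the residuals fan out is exactly the question the talk would need to answer.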

I set out to use the rvest R package by Hadley Wickham to scrape the 7pm (start-of-meet) temperature from the Carolina Godiva Track Club weather conditions webpage. The SelectorGadget tool for the Google Chrome browser helped me accomplish this. I clicked the SelectorGadget icon and then carefully clicked within each cell displaying the temperature (Fahrenheit) under “7pm”. I discovered that I needed a “p” to show under the box. I then clicked the cells under Dew Point to de-select them (turn them red).

I had some trouble selecting the correct temperature cells: they would sometimes turn red instead of green. De-selecting a field to the right of “2018 Min Value” (a table row?) allowed me to select all of the temperature cells without issue. SelectorGadget sometimes needs a little extra help to know which data you want, and you might need to search the page to see whether any undesired element has been selected. Notice the need to de-select (turn red) the field at the bottom-left (a vertical rectangle in the image). Clicking XPath then revealed the XPath expression to pass to the rvest html_nodes function.

I then pasted the expression from the XPath dialog box into the xpath argument of the rvest html_nodes function call to get the numbers I wanted.

```r
library(rvest)

# The page-loading step was lost from this listing; weather_2018_html is assumed
# to hold the weather page as parsed by read_html().
weather_2018_html %>%
  html_nodes(xpath="//tr[(((count(preceding-sibling::*) + 1) = 12) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 11) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 10) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 9) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 8) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 7) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 6) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 4) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span | //tr[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//span") %>%
  html_text() %>%
  as.numeric()
```

Carolina Godiva Track Club “Sir Walter Raleigh” runners. We all participated in quite a few of the Summer Track Series meets I hope to analyze with linear regression.
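The SelectorGadget XPath used with html_nodes spells out one branch per table row. As a sketch of the same idea on a toy table, an XPath `position()` predicate can select a range of rows in a single expression. The HTML here is a made-up miniature table, not the Carolina Godiva weather page, and the row and column positions are assumptions chosen to fit the toy example.

```r
library(rvest)

# Made-up miniature table for illustration; not the real weather page.
toy_page <- read_html('<table>
  <tr><th>Date</th><th>7pm</th><th>Dew Point</th></tr>
  <tr><td>Week 1</td><td><span>88</span></td><td><span>70</span></td></tr>
  <tr><td>Week 2</td><td><span>91</span></td><td><span>72</span></td></tr>
  <tr><td>Week 3</td><td><span>85</span></td><td><span>68</span></td></tr>
</table>')

# Rows 2 through 4, second cell of each row, then the span inside it.
temps <- toy_page %>%
  html_nodes(xpath = "//tr[position() >= 2 and position() <= 4]/td[2]//span") %>%
  html_text() %>%
  as.numeric()

print(temps)
```

On the real page the row range would need to match the rows SelectorGadget highlighted, so the generated expression remains the safer starting point. The compact form is mainly easier to read and to adjust by hand.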
