A text mining function for websites

For one of my projects I needed to download text from multiple websites, and I used rvest and dplyr to do it. Accessing the information you want is relatively easy if all sources come from the same website, but pretty tedious when the websites are heterogeneous. The reason is how the content is stored in the HTML of each website.[1] Assume that you want to extract the title, author information, publishing date, and of course the main article text. You can identify the location of that information via Cascading Style Sheets (CSS) selectors or XML Path Language (XPath) expressions. As soon as you have the CSS or XPath locations, you can access the content in R. The following text walks you through an example and provides the relevant code.

Where is the information I need?

Assume you want to get the relevant information from an article from The Guardian. Open this website in your browser. I recommend using Google Chrome because I will use a handy tool called SelectorGadget which allows you to easily find the CSS or XPath information via point and click. You know exactly what you want on that website, i.e. title, author information, main text and publishing date. But how to get that into R? We’ll start by loading the HTML page into R using rvest.

library(rvest)
url <- "https://www.theguardian.com/environment/2015/jan/08/mayors-failure-clean-up-londons-air-pollution-risks-childrens-health"
# Read the HTML document using try to handle 404 errors
try(html_document <- read_html(url))

print(html_document) # does not provide us the information we want. It just shows the HTML code.
## {xml_document}
## <html id="js-context" class="js-off is-not-modern id--signed-out" lang="en" data-page-path="/environment/2015/jan/08/mayors-failure-clean-up-londons-air-pollution-risks-childrens-health">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="top" class="" itemscope itemtype="http://schema.org/WebPag ...

In order to get the information we want, we need to find out where it is stored on the website, i.e. look at the website code. There are different ways to do that.

  • Highlight the text you want to access, e.g. the title on the website, then right-click and choose Inspect. This will open a window on the right (in Chrome), where the corresponding HTML code is highlighted; you can derive an XPath from it. In our specific example, it should look like <h1 class="content__headline " itemprop="headline">Mayor's failure to clean up London's air pollution 'risks children's health'</h1>
  • Select the SelectorGadget add-on in Google Chrome, then click on the headline. Make sure that only the content that you want to extract is highlighted in green or yellow, by clicking on the respective parts. At the end, you should see the CSS selector to access that information at the bottom. In our specific example, it should look like .content__headline.
  • Right-click an empty part of the website and click View page source. Then search for the headline text with Ctrl+F. Most often, this will give you multiple hits. That is useful when none of the other ways to access the information work, since it provides you with alternatives. In our specific example, one hit is the same as with the Inspect method. The others, e.g. "headline":"Mayor's failure to clean up London's air pollution 'risks children's health'", are difficult to access, however.
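
To see that a CSS selector and an XPath expression can point to the same element, here is a minimal sketch that runs offline. The HTML snippet and the headline text are made up for illustration; only the class name content__headline is taken from the example above.

```r
library(rvest)

# A tiny, made-up HTML snippet standing in for the real article page
snippet <- read_html('
  <html><body>
    <h1 class="content__headline " itemprop="headline">An example headline</h1>
    <p class="byline"><span>Jane Doe</span></p>
  </body></html>')

# CSS selector, in the form SelectorGadget reports it
css_title <- snippet %>%
  html_node(css = ".content__headline") %>%
  html_text(trim = TRUE)

# XPath expression, derived from the element shown by Inspect
xpath_title <- snippet %>%
  html_node(xpath = "//h1[contains(@class, 'content__headline')]") %>%
  html_text(trim = TRUE)

# Both routes should lead to the same text
identical(css_title, xpath_title)
```

Which of the two you use is mostly a matter of taste; CSS selectors tend to be shorter, while XPath lets you write conditions on attributes explicitly.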

How can I retrieve the information I need?

The following code shows you how to access the information found with the above means in R. First, with the XPath method, then with the CSS method. At the end, we will construct a data frame with that information.

library(dplyr)
# Specify the XPath for the headline in title_xpath
# Note that SelectorGadget provides you with: //*[contains(concat( " ", @class, " " ), concat( " ", "content__headline", " " ))], which selects the same node on this page
title_xpath <- "//h1[contains(@class, 'content__headline')]"
title_text <- html_document %>%
    html_node(xpath = title_xpath) # Only provides the node.

# In order to get the information we want, we need html_text, which extracts the text content of the node
title_text <- title_text %>%
    html_text(trim = TRUE) # Stores title in title_text
  
# Access author information (CSS)
author_css <- ".tone-colour span" # Using SelectorGadget ('.byline span' also works)
author_text <- html_document %>%
    html_node(css = author_css) %>%
    html_text(trim = TRUE) # Stores author in author_text

# Access article text information (XPath)
body_xpath <- "//div[contains(@class, 'content__article-body')]//p" # '.js-article__body > p' is also possible, but needs the css option in html_nodes
# The above location can be found when searching for the first two words of the article in the source code (or when inspecting the first two lines of the article).
# This provides you with the location information <div class="content__article-body"> followed by <p>
body_text <- html_document %>%
    html_nodes(xpath = body_xpath) %>%
    html_text(trim = TRUE) %>%
    paste0(collapse = "\n")
 
# Access publishing date information (XPath)
date_xpath <- "//time" # '.content__dateline-wpd--modified' does not work for some reason, although it is the output of SelectorGadget.
# In such a case, just try to look for alternatives with the other methods outlined above
library(lubridate) # to handle date information (important for later analysis including time)
date_text <- html_document %>%
    html_node(xpath = date_xpath) %>%
    html_attr(name = "datetime") %>% # accesses the attribute information datetime in //time (different from html_text above)
    as.Date() %>% 
    parse_date_time(., "ymd", tz = "UTC") 
  
# Store all information in a data frame called article
article <- data.frame(
    url = url,
    date = date_text,
    title = title_text,
    author = author_text,
    body = body_text
  )

print(as_tibble(article))
## # A tibble: 1 x 5
##                                                                           url
##                                                                        <fctr>
## 1 https://www.theguardian.com/environment/2015/jan/08/mayors-failure-clean-up
## # ... with 4 more variables: date <dttm>, title <fctr>, author <fctr>,
## #   body <fctr>
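
The difference between html_text and html_attr, and what happens when a selector matches nothing, is easier to see on a small example. The following is only a sketch on a made-up HTML snippet (the datetime value and the class name .no-such-class are invented for illustration), not the real Guardian page.

```r
library(rvest)

# Made-up snippet with a <time> element like the one used above
page <- read_html('<html><body>
  <time datetime="2015-01-08T10:37:03+0000">Thursday 8 January 2015</time>
</body></html>')

time_node <- page %>% html_node(xpath = "//time")

# html_text returns the visible text between the tags
visible_text <- html_text(time_node, trim = TRUE)

# html_attr returns the value of the named attribute
datetime_attr <- html_attr(time_node, name = "datetime")

# as.Date keeps only the leading year-month-day part of the attribute
article_date <- as.Date(datetime_attr)

# A selector that matches nothing gives a missing node, and its text is NA
missing_text <- page %>%
  html_node(css = ".no-such-class") %>%
  html_text(trim = TRUE)

print(visible_text)
print(article_date)
print(is.na(missing_text))
```

Checking for NA after each extraction is a cheap way to notice that a selector no longer fits the page.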

The next step is to wrap this code in a function, so that it can be run for multiple Guardian articles.

# Define the function
scrape_guardian_article <- function(url) {
  # Read the page; return NULL if the download fails (e.g. on a 404 error)
  html_document <- try(read_html(url), silent = TRUE)
  if (inherits(html_document, "try-error")) return(NULL)

  # Headline (XPath)
  title_xpath <- "//h1[contains(@class, 'content__headline')]"
  title_text <- html_document %>%
    html_node(xpath = title_xpath) %>%
    html_text(trim = TRUE)

  # Author (CSS)
  author_css <- ".tone-colour span"
  author_text <- html_document %>%
    html_node(css = author_css) %>%
    html_text(trim = TRUE)

  # Article text (XPath), one paragraph per line
  body_xpath <- "//div[contains(@class, 'content__article-body')]//p"
  body_text <- html_document %>%
    html_nodes(xpath = body_xpath) %>%
    html_text(trim = TRUE) %>%
    paste0(collapse = "\n")

  # Publishing date (XPath), taken from the datetime attribute
  date_xpath <- "//time"
  date_text <- html_document %>%
    html_node(xpath = date_xpath) %>%
    html_attr(name = "datetime") %>%
    as.Date() %>%
    lubridate::parse_date_time("ymd", tz = "UTC")

  # Collect everything in a one-row data frame
  article <- data.frame(
    url = url,
    date = date_text,
    title = title_text,
    author = author_text,
    body = body_text
  )
  return(article)
}

# Run the function for multiple links
articles <- data.frame()
links <- c("https://www.theguardian.com/environment/2015/jan/08/mayors-failure-clean-up-londons-air-pollution-risks-childrens-health", "https://www.theguardian.com/world/2016/dec/07/marshall-islands-natives-return-mass-exodus-climate-change", "https://www.theguardian.com/environment/2016/dec/14/queenslands-largest-solar-farm-plugs-into-the-grid-a-month-early")

for (i in seq_along(links)) { # Iterate over the links
  cat("Downloading", i, "of", length(links), "URL:", links[i], "\n")
  article <- scrape_guardian_article(links[i]) # Use the scraping function defined above for links[i]
  articles <- rbind(articles, article) # Append the new article to the previous ones
}
## Downloading 1 of 3 URL: https://www.theguardian.com/environment/2015/jan/08/mayors-failure-clean-up-londons-air-pollution-risks-childrens-health 
## Downloading 2 of 3 URL: https://www.theguardian.com/world/2016/dec/07/marshall-islands-natives-return-mass-exodus-climate-change 
## Downloading 3 of 3 URL: https://www.theguardian.com/environment/2016/dec/14/queenslands-largest-solar-farm-plugs-into-the-grid-a-month-early
print(as_tibble(articles))
## # A tibble: 3 x 5
##                                                                           url
##                                                                        <fctr>
## 1 https://www.theguardian.com/environment/2015/jan/08/mayors-failure-clean-up
## 2 https://www.theguardian.com/world/2016/dec/07/marshall-islands-natives-retu
## 3 https://www.theguardian.com/environment/2016/dec/14/queenslands-largest-sol
## # ... with 4 more variables: date <dttm>, title <fctr>, author <fctr>,
## #   body <fctr>

You can also run the function with lapply. The function itself does not need to change: lapply returns a list with one data frame per link, and do.call(rbind, ...) stacks them row-wise. (Wrapping the list in as.data.frame would instead bind the data frames side by side as columns.)

# Run the function over the vector of links and combine the results row-wise
text_df <- do.call(rbind, lapply(links, scrape_guardian_article))
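
When you run a scraping function over many links with lapply, some of them will fail sooner or later. Below is a minimal sketch of one way to skip failures and still combine the rest. It runs offline: parse_page is a hypothetical stand-in for scrape_guardian_article that accepts HTML strings, and the non-existent file name stands in for a dead link.

```r
library(rvest)

# Hypothetical stand-in for scrape_guardian_article(): reads an HTML string
# (or file) and returns NULL instead of stopping when reading fails
parse_page <- function(source) {
  tryCatch({
    doc <- read_html(source)
    data.frame(
      title = doc %>% html_node(css = "h1") %>% html_text(trim = TRUE),
      body  = doc %>% html_nodes(css = "p") %>% html_text(trim = TRUE) %>%
        paste0(collapse = "\n"),
      stringsAsFactors = FALSE
    )
  }, error = function(e) NULL)
}

pages <- c(
  "<html><body><h1>First</h1><p>One.</p><p>Two.</p></body></html>",
  "no-such-file.html", # does not exist, standing in for a failed download
  "<html><body><h1>Second</h1><p>Three.</p></body></html>"
)

results <- lapply(pages, parse_page)

# rbind ignores the NULL entry, so only successful pages end up in the result
articles_ok <- do.call(rbind, results)
print(nrow(articles_ok))
```

Collecting the results in a list and binding once at the end also avoids growing a data frame inside a loop.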

How can I adapt this to other websites?

Unfortunately, the above code won’t work for every website, and probably not even for every page of The Guardian, because these pages are built differently. Sometimes the main text is stored in plain p elements, whereas other sites require more elaborate CSS path specifications. Depending on the number of different websites you want to scrape, it can be pretty tedious to write a function with the adequate CSS or XPath specifiers for each one. However, as of now I do not know of a better way to do this.[2] I tried using RSelenium, which sets up a server, navigates to the respective website and clicks or copies whatever you specify. However, in this case, too, the tool cannot know what kind of information you want. Maybe there are machine learning methods that allow an algorithm to learn, based on the text, how best to identify the main text of a website, its title, etc. That sounds like a really interesting approach, but I am not yet aware of any such methods.
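
One way to keep the per-site work manageable is to store the selectors in a lookup table and reuse a single extraction function. The following is only a sketch under assumptions: the selectors list, the site names, the scrape_with function and the two HTML snippets are all made up for illustration, and real sites will need selectors found with the methods described above.

```r
library(rvest)

# Hypothetical selector table: one entry per site layout
selectors <- list(
  guardian = list(title = ".content__headline", body = ".content__article-body p"),
  plain    = list(title = "h1",                 body = "p")
)

# Extract title and body from a parsed document using the selectors of one site
scrape_with <- function(doc, site) {
  sel <- selectors[[site]]
  if (is.null(sel)) stop("No selectors defined for site: ", site)
  data.frame(
    site  = site,
    title = doc %>% html_node(css = sel$title) %>% html_text(trim = TRUE),
    body  = doc %>% html_nodes(css = sel$body) %>% html_text(trim = TRUE) %>%
      paste0(collapse = "\n"),
    stringsAsFactors = FALSE
  )
}

# Two made-up pages with different layouts
guardian_like <- read_html('<html><body>
  <h1 class="content__headline">Headline A</h1>
  <div class="content__article-body"><p>Text A.</p></div>
  <p>Unrelated footer text</p></body></html>')
plain_page <- read_html("<html><body><h1>Headline B</h1><p>Text B.</p></body></html>")

both <- rbind(
  scrape_with(guardian_like, "guardian"),
  scrape_with(plain_page, "plain")
)
print(both)
```

Adding a new website then only means adding one entry to the table, although finding the right selectors still has to be done by hand.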


  1. Disclaimer: I am not an expert at all on HTML or anything website related.

  2. I would be grateful for tips and tricks, though.
