A text mining function for websites

For one of my projects I needed to download text from multiple websites, which I did with rvest and dplyr. This is relatively easy when all the sources come from the same website, but it gets tedious when they are spread across many different hosts, because each site structures its HTML differently. Say you want to extract the title, author information, publication date, and of course the main article text. Each of these can be located with a CSS selector or an XPath expression, but the right selector differs from site to site. The following text will walk you through an example and provide the relevant code.
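To give a rough idea of the approach, here is a minimal sketch that assumes a per-host lookup of CSS selectors. It is not the code from the post itself: the host name, the selectors, the helper extract_article(), and the inline sample HTML are all made up for illustration, and real sites will need their own selectors. It also assumes the publication date sits in the datetime attribute of a time element, which is not true for every site.

```r
library(rvest)

# Hypothetical selector table: one entry per host, one CSS selector per field.
selectors <- list(
  "example-news.test" = list(
    title  = "h1.headline",
    author = "span.byline",
    date   = "time",
    text   = "div.article-body p"
  )
)

# Hypothetical helper: pull the four fields out of a parsed page.
extract_article <- function(page, sel) {
  paragraphs <- html_text2(html_elements(page, sel$text))
  list(
    title  = html_text2(html_element(page, sel$title)),
    author = html_text2(html_element(page, sel$author)),
    # Assumes the date is stored in the datetime attribute.
    date   = html_attr(html_element(page, sel$date), "datetime"),
    text   = paste(paragraphs, collapse = "\n")
  )
}

# Inline sample page so the sketch runs without network access.
page <- minimal_html('
  <h1 class="headline">A sample headline</h1>
  <span class="byline">Jane Doe</span>
  <time datetime="2020-01-15">15 January 2020</time>
  <div class="article-body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
')

article <- extract_article(page, selectors[["example-news.test"]])
str(article)
```

For a live site you would replace minimal_html() with read_html(url) and pick the selector entry that matches the host of the URL. Keeping the selectors in a table means a new host only needs a new entry, not a new function.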

Using RStudio and LaTeX

This post explains how to integrate RStudio and LaTeX, in particular how to include well-formatted tables and good-looking graphs and figures that were produced in RStudio in a LaTeX document. To follow along you will need RStudio, MS Excel, and LaTeX.

Using RStudio and Git version control

It is fairly easy to link GitHub or Bitbucket with RStudio. Doing so lets you put a project under version control, collaborate on a data project, scientific article, or book, or make your data or project publicly accessible.