This is the fourth in a series of posts on online learning resources for data science and programming.
By Dan Turner, Data Science Consultant
It used to be that for tasks like web scraping and text analysis, Python was the preferred language over R. After all, Python brings treats like list comprehensions and packages like Beautiful Soup, which make scraping raw data and imposing a structure on it as straightforward as it has ever been. But for people who primarily use R for their data analysis, it might make sense to design their whole data pipeline in one language.
If you are looking for ways to scrape the web using R, take a look at these resources – and read on for some tips. As with other guides in this series, we’re focusing on resources that can be accessed for free by members of the Northwestern community, and we’re focusing on resources other than full-length online courses.
Getting Started
It bears noting that most of the modern packages for web scraping and handling text in R are built on the tidy framework and its principles. That means you can string together multiple stages of data processing, from downloading an HTML file to interpreting one of its tables as a data frame, in quick, compact statements. If you’re an aspiring R programmer, I highly encourage you to experiment with the Tidyverse.
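To make the idea of stringing stages together concrete, here is a minimal sketch of a piped scrape. It assumes rvest 1.0 or later; the inline HTML and the item/price values are made up for illustration and stand in for a page you would normally download with read_html() and a URL.

```r
library(rvest)

# Hypothetical inline HTML standing in for a downloaded page.
page <- minimal_html('
  <table>
    <tr><th>item</th><th>price</th></tr>
    <tr><td>widget</td><td>10</td></tr>
    <tr><td>gadget</td><td>25</td></tr>
  </table>
')

# Each stage feeds the next: find the table, then parse it into a data frame.
prices <- page %>%
  html_element("table") %>%
  html_table()

print(prices)
```

With a real site, only the first step changes: replace minimal_html() with read_html() and the address of the page.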
Harvesting the web with rvest
Dmytro Perepolkin
The best R package for web scraping, in my opinion, is rvest. It’s partly inspired by the great Beautiful Soup Python package, it is part of the Tidyverse and thus fully compatible with most modern R packages used in data science, and it’s reasonably efficient at extracting information from the web.
This beginning-level tutorial comes from the programmers of rvest and it will walk you through how to use pattern matching and the hierarchical organization of HTML and CSS to extract information.
A quick note on some prerequisite knowledge: If you are not sure about the difference between HTML and CSS, or you want a refresher on how information flows across the internet, check out this informative tutorial by Mozilla, the creators of the Firefox browser.
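As a rough illustration of what extracting information through the HTML hierarchy and CSS selectors looks like, here is a small sketch. It is not taken from the tutorial; the markup, class names, and links are hypothetical, and it assumes rvest 1.0 or later.

```r
library(rvest)

# Hypothetical markup: two posts, each with a title and a link.
page <- minimal_html('
  <div class="post">
    <h2 class="title">First post</h2>
    <a href="/posts/1">Read more</a>
  </div>
  <div class="post">
    <h2 class="title">Second post</h2>
    <a href="/posts/2">Read more</a>
  </div>
')

# "div.post h2.title" matches every h2 with class "title" nested inside
# a div with class "post".
titles <- page %>%
  html_elements("div.post h2.title") %>%
  html_text2()

# Attributes, such as link targets, are pulled out with html_attr().
links <- page %>%
  html_elements("div.post a") %>%
  html_attr("href")

print(titles)
print(links)
```

The selector strings use the same syntax as a stylesheet, which is why a little CSS knowledge goes a long way here.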
Webscraping with R – from messy & unstructured to blisfully tidy
Kasia Kulma
This is a little more advanced than the rvest tutorial above, but it gets straight to the point of how to clean your data as you are scraping it. After all, the fewer steps you have to design to make sure your data is ready for analysis, the better. One particular problem I have wrestled with has been handling poorly designed HTML tables, and this tutorial uses a nifty-looking package called janitor to clean them up. I can’t wait to try it out on my own projects.
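Here is a small sketch of what cleaning as you scrape might look like. I am assuming janitor's clean_names() as the cleaning step, which may or may not be the function the tutorial leans on, and the table with its awkward headers is hypothetical.

```r
library(rvest)
library(janitor)

# Hypothetical table whose headers contain spaces and punctuation.
page <- minimal_html('
  <table>
    <tr><th>First Name</th><th>Sign-up Date</th></tr>
    <tr><td>Ada</td><td>2020-01-15</td></tr>
    <tr><td>Grace</td><td>2020-02-03</td></tr>
  </table>
')

# Scrape and tidy in one chain: clean_names() rewrites the headers as
# snake_case so the columns are easy to refer to later.
signups <- page %>%
  html_element("table") %>%
  html_table() %>%
  clean_names()

print(names(signups))
```

Putting the cleaning step in the same chain as the scrape means the data frame never exists in its messy form.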
Getting Better
Scraping the web can be tricky, and sometimes you cannot simply download an HTML or XML file, or some other structured data, and interpret it as-is. For example, some tables on the web are generated by JavaScript and only load data when the user scrolls or clicks. In these cases, the data is not just out there on the web ready to be harvested; your computer has to convince the server that a human is interacting with it. In my opinion, the best solution to this problem in R is the package RSelenium. Here are some advanced tutorials on how to use rvest and RSelenium.
rselenium tutorial
John Harrison
Selenium is not specific to R; it’s a general-purpose browser automation tool that you install on your computer and control using the R package RSelenium. This means that setting it up is not as easy as just installing the package, but if you have to scrape data that’s populated using JavaScript, this is your best option. You can use it to do things like simulate mouse clicks on certain visual elements of the web page, like buttons and links, or navigate a multi-frame page where the HTML source is spread out across multiple files.
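For a sense of what this looks like in practice, here is a hedged sketch of a helper that opens a page in a real browser, clicks one element, and hands the rendered HTML to rvest. The function name, its arguments, and the fixed pause are my own placeholders rather than anything from the tutorial, and the function only works once Selenium and a Firefox driver are set up on your machine.

```r
# Sketch only: `url` and `button_css` are hypothetical placeholders, and
# running this requires a working Selenium and Firefox setup.
scrape_after_click <- function(url, button_css, wait_seconds = 2) {
  # Start a Selenium server and open a browser session.
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  remote <- driver$client

  # Always close the browser and stop the server, even if a step fails.
  on.exit({
    remote$close()
    driver$server$stop()
  }, add = TRUE)

  # Load the page and simulate a mouse click on the chosen element.
  remote$navigate(url)
  button <- remote$findElement(using = "css selector", value = button_css)
  button$clickElement()

  # Crude fixed pause so JavaScript has time to populate the page.
  Sys.sleep(wait_seconds)

  # Pass the rendered source to rvest for the usual parsing.
  html <- remote$getPageSource()[[1]]
  rvest::read_html(html)
}
```

Once the rendered page comes back, the rest of the workflow is the same rvest code you would write for a static page.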
Cheat Sheet for Web Scraping using R
Yifu Yan
This cheat sheet gives lightning-fast introductions to many of the web scraping tools I’ve discussed here, like rvest and Selenium, and to some other packages that I have found to play more of a supporting role, like httr and RCurl. It’s a good place to start for advanced R users who just need a push in the right direction, and who can figure out the little details along the way. It found its way into my bookmarks!
Stuck?
If you have a question about web scraping in R, don’t know what resource to start with, or need to learn something not covered above, remember you can always request a free consultation with our data science consultants. We’re more than happy to answer questions and point you in the right direction.