7  Getting Data by Web Scraping

Published

November 6, 2024

Keywords

rvest, xml, chromote, rselenium

7.1 Introduction

7.1.1 Learning Outcomes

  • Check the Terms of Service and robots.txt prior to scraping.
  • Employ Cascading Style Sheet (CSS) selectors in web scraping.
  • Use the {rvest} package for web scraping static sites.
  • Apply techniques to tidy data after web scraping.

7.1.2 References:

7.1.2.1 Other References

Web scraping is a way to extract data from web sites that do not have APIs or whose APIs don’t provide access to the data one is interested in.

Web sites are composed of individual pages of HTML code that contain structure, content, and formatting information.

  • Web pages are composed based on the files in a specific directory in the website file structure.
  • The directory structure of a web site is often described in a root-level file at URL/sitemap.xml.

Scraping is the act of extracting data from the content of the individual web pages.

Note

This section focuses on scraping data from individual web pages where the data is static, i.e., the page is not dynamically adjusting its content based on user selections or using pop-up windows.

Programmatically navigating web sites to access data on multiple pages or programmatically making selections on dynamic web pages to manipulate content is more complicated.

The {RSelenium} package can support those types of operations.

Individuals can scrape web pages of interest. However, most scraping is done by companies such as data aggregators or search/recommender system providers who use automated tools.

  • These tools are known by a variety of terms: web scrapers, web-crawlers, web-bots, spiders, robots, bots, ….
  • They can run tens of thousands of searches a minute trying to gather information from targeted sites or all sites they can find.
  • They use the sitemap file to determine the pages on a site they might want to scrape.

7.2 To Scrape or Not to Scrape

Is web scraping legal? Yes, but …

  • Recent court cases have ruled in favor of the web scrapers in some circumstances.
  • Scraping data for personal use appears to fall under the “fair use” doctrine.
  • Commercial use often violates the Terms of Service for the web site owner.
  • Laws are being updated in the US and elsewhere to address data privacy, especially for personally identifiable information (PII).
  • The post Is Web Scraping Legal? Ethical Web Scraping Guide in 2023 Dilmegani (2023) has recent information and suggestions about responsible web scraping.

7.2.1 What is Robots.txt?

Websites use a “standard” approach to identify their “desires” or preferences for which portions of their site they will “allow” to be scraped.

  • They want to allow scrapers to find their sites so they can be found via search engines or other recommendation systems.
  • They want to protect their “proprietary” data and privacy information so they can preserve their ability to monetize their intellectual capital.

Most websites have a file called robots.txt in the root directory at the top of their structure, i.e., at URL/robots.txt.

The file uses “standard” terms (the Robots Exclusion Protocol) to indicate to people and automated web scrapers or “bots” which portions of the website are allowed to be scraped and which should not be scraped (disallowed).

  • The robots.txt file uses the directory structure described in the sitemap.xml.

  • The file contains “rules” (also known as “directives”) to specify actions by directory tree structure.

  • The file identifies User-agent as either everyone by using * or specific scraping tools by name.

    • A line starting with # is a comment and not a rule.

    • The following entries apply to everyone (the *) and nothing is disallowed:

      User-agent: *
      Disallow:

    • The following entries apply to everyone and everything is disallowed - do not access this web site.

      User-agent: *
      Disallow: /

Let’s look at several websites to see how the robots.txt can vary.

Different organizations use varying degrees of complexity in the rules they establish.

  • These rules are not law so compliance is a matter of responsible behavior.
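
You can look at a site’s robots.txt directly from R. This is a minimal sketch; the URL is just an example, and the optional {robotstxt} package check is shown as a comment.

library(rvest)

# Read a robots.txt file directly (the URL is just an example).
robots <- readLines("https://www.un.org/robots.txt")
head(robots, 15)

# Optional: the {robotstxt} package can check whether a specific path is allowed.
# robotstxt::paths_allowed("https://www.un.org/en/our-work/deliver-humanitarian-aid")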

7.2.2 General Guidelines for Scraping

  • Scraping can cause significant load on a web site’s servers and services.
  • Several sites offer suggested guidelines so you can avoid being denied access.
  • One example is Best Practices Kenny (2021).
  • Scrapers wanting to violate TOS may use multiple methods to disguise their activity, e.g., proxy servers.

7.2.3 The {polite} Package

If you need to scrape a complex site multiple times to get hundreds or thousands of elements, consider using the {polite} package to help manage your activity within the guidelines of the site and robots.txt.

  • The {polite} package Perepolkin (2023) is designed to help people scrape efficiently without violating the terms of service or robots.txt rules.
  • Per the README, the package’s two main functions, bow and scrape, define and realize a web harvesting session.
    • bow is used to introduce the client to the host and ask for permission to scrape (by inquiring against the host’s robots.txt file).
    • scrape is the main function for retrieving data from the remote server.
  • The package allows you to scrape multiple pages from a site without exceeding traffic constraints.
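
As a hedged sketch (the user-agent string is made up and the URL is the UN page used later in this chapter), a {polite} session might look like this:

library(polite)
library(rvest)

# bow() introduces the client, checks robots.txt, and sets a crawl delay
un_session <- bow("https://www.un.org/en/our-work/deliver-humanitarian-aid",
                  user_agent = "course-notes example (me@example.com)")
un_session                      # prints the robots.txt verdict and crawl delay

# scrape() retrieves the page, much like read_html()
un_page <- scrape(un_session)
html_elements(un_page, "h3") |> html_text2() |> head()
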
Tip

Do not download thousands of HTML files from a website to parse!

This could violate the site permissions and the site admins might block you if you send too many requests.

Download your file once in a separate code chunk from code that manipulates the downloaded object.

Save the downloaded web page to a location so you don’t have to download it again; you can just read it in.
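
Here is a sketch of that download-once pattern; the folder and file name are placeholders, and write_html() and read_html() are covered later in this chapter.

library(rvest)
library(xml2)

cache_path <- "data/un_aid_web_page.html"     # placeholder path
if (!file.exists(cache_path)) {
  # download and save once
  page <- read_html("https://www.un.org/en/our-work/deliver-humanitarian-aid")
  write_html(page, cache_path)
}
un_html_obj <- read_html(cache_path)          # reuse the saved copy afterwards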

7.3 HTML and Cascading Style Sheets (CSS) Selectors

7.3.1 The Web Page Document Object Model

A Web page is a document which can be either displayed in the browser window or as the HTML source; it is the same document in both cases.

The Document Object Model (DOM) Docs (2023) is an upside-down tree structure that describes the organization of the webpage in terms of nodes and objects.

  • The DOM is a data structure that can be manipulated using JavaScript.
  • Each branch of the tree ends in a node, and each node contains one or more objects.
  • All of the properties, methods, and events available for manipulating and creating web pages are organized into objects.
  • Browsers use the DOM to figure out how to organize the objects for presentation in the browser.
Note

Even though not every node is an HTML element (some are text or attribute nodes), it’s common to refer to all the nodes in the DOM as elements.

  • You will see the term “elements” used to refer to “nodes” regardless of the actual node type.
A depiction of a notional DOM as a tree with the html node at the top and descending branches with nodes for HTML elements and attribute nodes.
Figure 7.1: Notional DOM for a web page document.

7.3.2 Common HTML elements

The elements of the DOM are defined using HTML Tags - the “Hyper-Text Markup” of the language.

  • When you scrape an HTML document, and convert it into text, you will see HTML tags marking up the text.
  • These tags (usually in begin/end pairs, though some such as <br> and <img> stand alone) delineate the nodes in the document, the internal structure of the nodes, and can also convey formatting information.
  • There are over 100 standard HTML tags. Some commonly used tags are:
Common HTML Tags
Tag Usage
<html> At the start and end of an HTML document
<head> Header Information
<title> website title </title> Website Title
<body> Before and after all the content
<div> ... </div> Divide up page content into sections, and applying styles
<h?> heading </h?> Heading (h1 for largest to h6 for smallest)
<p> paragraph </p> Paragraph of Text
<a href="url"> link name </a> Anchor with a link to another page or website
<img src="filename.jpg"> Show an image
<ul> <li> list </li> </ul> Un-ordered, bullet-point list
<b> bold </b> Make text between tags bold
<i> italic </i> Make text between tags italic
<br> Line Break (force a new line)
<span style="color:red"> red </span> Use CSS style to change text color for part of a text, or a part of a document.

An HTML element is defined by a start tag, some content, and an end tag.

  • Elements can be nested inside other elements as seen in Figure 7.1.
  • The top level element in the DOM is the root element <html> which contains the whole document.
  • Every element can have additional attributes which provide additional information about the element, usually in a name/value pair such as name="my_name" or class="my_class".

7.3.3 CSS Selectors

We need to know a tiny bit about CSS to understand how to use its selectors to extract individual elements from a website.

CSS is a formatting language to tell browsers how to present/format HTML elements on a web page.

  • Every website is formatted with CSS.

Here is some example CSS:

    h3 {
      color: red;
      font-style: italic;
    }
    
    footer div.alert {
      display: none;
    }
  • The part before the curly braces is called a selector. The selector identifies the specific web page elements to be affected by the formatting that follows.
  • Specifically, for the two examples above, they correspond to:
    <h3>Some text</h3>
    
    <footer>
    <div class="alert">More text</div>
    </footer>
  • The code inside the CSS curly braces are the formatting properties.
    • The h3 properties say all h3 headings should be colored red and use an italic font style.
    • The second CSS chunk says all <div> tags of class "alert" in the <footer> should be hidden.
  • CSS applies the same formatting properties to each instance of the selector on the web page.
    • So every time we use the h3 tag, it will result in the h3 content being styled in red, italicized text.
  • The “cascading” in CSS means that the formatting “cascades” down to all child elements nested underneath any parent element unless there is another selector that takes priority.
    • CSS syntax has a variety of rules to establish priority among selectors so a higher-level format may be overruled.

CSS is a flexible and efficient way to identify HTML elements on a web page and assign formatting properties.

Important

We will use CSS selectors to identify the elements of a website we want to scrape.

7.3.3.1 There are five categories of CSS selectors.

  • Simple selectors (select elements based on name, id, class).
  • Combinator selectors (select elements based on a specific relationship between them).
  • Pseudo-class selectors (select elements based on a certain state).
  • Pseudo-elements selectors (select and style a part of an element).
  • Attribute selectors (select elements based on their attribute or attribute value).
  • Typically, we will see simple, combinator, and attribute selectors.

We can use any of these to select an element.

  • The name selector just uses the name value of the element such as the h3 above. All elements with the same name value will be selected.
  • The id selector uses a #, e.g., #my_id, to select a single element with id=my_id (all ids are unique within a page).
  • The class selector uses a ., e.g., .my_class, where class=my_class. All elements with the same class value will be selected.
  • We can combine selectors, e.g., . to require both a name and a class, a white space (" ") for descendants, or a comma to list several selectors at once.
    • A selector of my_name.my_class combines name and class to select all (only) elements with name=my_name and class=my_class.
  • The most important combinator is the white space, " ", the descendant combinator.
    • As an example, p a selects all <a> elements that are a child of (nested beneath) a <p> element in the DOM tree.
  • You can also find elements based on the values of attributes, e.g., find an element based on an attribute containing specific text.
    • For a partial text search you would use '[attribute_name*="my_text"]'. Note the combination of single quotes and double quotes so you have double quotes around the value.

These selectors provide substantial flexibility in choosing elements of interest.
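
To make these selector types concrete, here is a small self-contained sketch that uses rvest::minimal_html() to build a toy page (the HTML content is invented for illustration) and then applies each kind of selector.

library(rvest)

toy <- minimal_html('
  <div id="intro" class="note">Intro</div>
  <p class="note">First <a href="#one">link one</a></p>
  <p>Second <a href="#two">link two</a></p>
  <img src="plot.png" alt="scatter plot">
')

html_elements(toy, "p")                 # name selector: both <p> elements
html_elements(toy, "#intro")            # id selector: the single <div>
html_elements(toy, ".note")             # class selector: the <div> and first <p>
html_elements(toy, "p.note")            # name + class: only the first <p>
html_elements(toy, "p a")               # descendant combinator: <a> inside <p>
html_elements(toy, '[alt*="scatter"]')  # attribute contains text: the <img>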

7.4 Using SelectorGadget with Chrome

SelectorGadget is an open-source extension for the Google Chrome browser.

We will use SelectorGadget to identify the CSS selectors for a particular element (node) we are interested in on a webpage.

  • SelectorGadget is convenient for interactive use. It is not required to do programmatic web-scraping.

To install SelectorGadget, go to Chrome Web Store.

Once installed, manage your Chrome extensions to enable it and “pin” it so it appears as an icon on the right end of the search bar.

  • You click on the icon to turn SelectorGadget on and off.

7.4.1 United Nations Web Page on Deliver Humanitarian Aid.

Go to Deliver Humanitarian Aid in a Chrome browser.

  • This web page has several sections. Scroll down to the section called Where are the UN entities delivering humanitarian aid located?

Turn on Selector Gadget.

Move your mouse to the edge of the grey box and click it.

  • The background should turn green (indicating what we have selected) as in Figure 7.2.
A snapshot of the web page showing the box of agencies and locations is highlighted in green and the selection box shows 2 elements have been selected.
Figure 7.2: The Selected element turns green.
  • Look in the SelectorGadget info box at the bottom.
    • It shows .grey-well is the CSS selector associated with the box we clicked on.
    • It shows that the selector identified a total of two (2) elements.
    • The “XPath” button will pop up the XPath for the elements.
      • XPath is a standard approach for using XML format to identify the path from the top node of the Document in the DOM, down through the tree, to node (element) that was selected.
      • For the grey-well box, the XPath is //*[contains(concat( " ", @class, " " ), concat( " ", "grey-well", " " ))]
    • The “?” is the help button.
Tip

It’s important to visually inspect the selected elements throughout the whole HTML page/file, top to bottom as you inspect.

  • SelectorGadget doesn’t always get all of what you want, or it sometimes gets too much.
  • It can select hidden elements that don’t show up until you actually start working with the data.
  • Sometimes not every element is present for every item on the page as well.
  • Figure 7.3 shows two elements were selected and we expected only one.
  • If we scroll up we can see another grey-box element highlighted in yellow which means it is currently selected as well (Figure 7.3).
A snapshot of the web page showing two grey boxes on the page, both selected. The original is in green and the other is in yellow.
Figure 7.3: The two elements that were selected are both grey-boxes.
  • To de-select the yellow grey-box, click on the yellow item we don’t want. It turns red to indicate it is no longer selected.
  • However, there is a lot more yellow showing on the page, all elements we do not want.
  • Continue to click on the yellow areas until they go away and just the grey box is left in Green and there is no yellow.
  • Your screen should look like Figure 7.4 with Selector Gadget showing one item is selected.
A snapshot of the web page showing two grey boxes were on the page and only the original is in green.
Figure 7.4: Only the original grey-box is now selected and colored green.
  • The Selector Gadget selector is now .col-sm-12.grey-well, which means only items that have both the col-sm-12 and grey-well classes will be selected.

Scraping the whole box might just give us a lot of text with no clear way to organize it. Let’s try to select individual elements inside the box.

  • To start over, turn Selector Gadget off and then on (or click on the edge of the grey box to deselect it).

  • Now click on the edge of “OFFICE FOR THE COORDINATION OF HUMANITARIAN AFFAIRS (OCHA)” inside the grey box.

  • It should turn green and other offices should turn yellow. The yellow indicates all the other elements with the same selector as OCHA, namely .bottommargin-sm, as in Figure 7.5. Note that 16 items were selected.

Now the web page shows the OCHA highlighted in green, and other elements highlighted in yellow, and 16 items selected.
Figure 7.5: Selecting OCHA also selects 15 other elements.
  • However, in addition to selecting the office name element, it selected headings for office locations, e.g. “UNDP Locations”, which we do not want to include with the office names if we can avoid it.
  • If we click one of the yellow items we don’t want, it turns red. This indicates we don’t want to select it as in Figure 7.6.
  • We now see the selector as .horizontal-line-top-drk and there are 9 of them which we can verify are correct by scrolling the page up and down.
The web page shows the nine office names in green and yellow and the unwanted location element in red.
Figure 7.6: Only the nine office names are now selected.
  • One click was all we needed here. On other pages you may need to click several times until only the elements of interest remain in Green or Yellow.
  • What are the CSS selectors for the headquarters locations?
  • How many elements are selected?
  • It takes a few clicks but you can get to just the HQs locations with .horizontal-line-top-drk+ p.
  • There should be 9 items.
A new snapshot showing the web page elements for HQs locations with 9 location elements.
Figure 7.7: Office HQ locations are complete.

Selector Gadget is a useful tool for understanding the selectors associated with elements of interest on a static web page.

7.4.2 Chrome Developer Tools

More and more web sites are transitioning from static to dynamic. SelectorGadget may have challenges with a dynamic site that is using JavaScript to manage how the browser presents the page. It may be able to click on items but not find a valid path, or it may not even be able to click on items without activating an underlying hyperlink.

If Selector Gadget is not getting you what you want, you can also use the browser’s Developer Tools.

These notes are based on the Google Chrome browser. If you are using a different browser, its developer tools will look different but should have similar functionality.

  • A browser’s Developer Tools have many capabilities to help web page developers build, inspect, and test their web pages. We are primarily interested in using the Elements pane to navigate the DOM tree and identify elements and their attributes.

To open the Chrome developer tools, choose one the following to get a result as in Figure 7.8.

  • Open up the tools through the hamburger menu on the top right ⋮ > More tools > Developer tools.
  • Right-click on an element in the web page and select Inspect.
The web page is on the left with the Elements pane of the Chrome Developer Tools on the right.
Figure 7.8: Access the Chrome Developer Tools and the Elements pane
  • Clicking on the Elements menu item (top left of the developer tools) will show you the DOM tree of elements on the web page along with their names, ids, and attributes such as class.
  • You can scroll up or down through the elements or the web page.
  • As you move your mouse up and down the list, you may see different areas of the web page get highlighted and pop-ups appear with the CSS class.
  • Right-click on any element in the web page and select Inspect and the Elements pane will scroll to highlight the element.

As an example, right click on the grey background from before, choose Inspect, and you should see something like in Figure 7.9.

The web page is on the left with the grey-well highlighted in light green, with the Elements pane of the Chrome Developer Tools on the right showing the element in the DOM with its class.
Figure 7.9: Developer Tools Elements from Inspecting the grey background
  • Look for the name, id, class, and other attributes. They may be higher in the tree structure. If the name is the same as the HTML tag, e.g., div, you will not see a specific name= attribute.

  • You can use the arrows to the left of each element to collapse and expand the element to see any nested elements (further down in its branch of the DOM tree).

  • Right-click on a specific element in the Elements pane and you have several options in the pop-up menu:

    • You can add or edit attributes of the element (not for us).
    • You have several choices of information you can copy.
      • Copy element generally provides the class and some content.
      • Copy selector provides a very specific path down the DOM to this one element. Usually not helpful when you are interested in several elements of the same class.
      • Copy XPath and Copy full XPath use the XPath language for navigating web pages. Again, it provides a path to this one element.

XPath is its own language for navigating a DOM and has a lot of flexibility for selections. It can be a useful alternative to using CSS selectors depending upon the design or structure of the web site DOM and the types of information of interest. For more information try XPath Cheat Sheet for Web Scraping – Guide & Examples as another source.

  • Use the developer tools to inspect the Title for the OCHA.
  • What is the class for it? What is the CSS selector?
  • What is the XPath for it?
  • Right click on the OCHA title in the web page and Inspect the element or go to the Elements pane and click on the down arrows and scroll till you can see the element (and it is highlighted on the web page) as in Figure 7.10
  • You can see the class in the Elements pane, or,
  • Right-click, Copy element, and paste into your file to get <h3 class="caps horizontal-line-top-drk bottommargin-sm">Office for the Coordination of Humanitarian Affairs (OCHA)</h3>
  • You can see the multiple classes applied to this element. Note that you can pick out the horizontal-line-top-drk class found by Selector Gadget earlier.
  • Add a ‘.’ to the front to designate it as a CSS class selector: ".horizontal-line-top-drk".
The OCHA Title element on the left is highlighted, with the Elements pane on the right showing the location of the element in the DOM and its class attributes.
Figure 7.10: OCHA Title Element in the Elements Pane.
  • The XPath and Full XPath are:
    • //*[@id="node-120624"]/div/div[3]/div/div/div[11]/div[2]/div[1]/h3[1]
    • /html/body/div[2]/div[1]/div[1]/div/section/section/section/div/div/div[3]/div/div/div[11]/div[2]/div[1]/h3[1]
  • Note they are precise but not as useful for our purposes unless we need to select a single element.
  • The XPath language can be very flexible though for identifying groups of elements based on locations in the DOM.
  • The developer tools are helpful for identifying the selectors for one type of element at a time.
  • They are not as convenient as SelectorGadget to identify a set of selectors for different elements all at once.
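
If you do want to use XPath from R, {rvest} accepts it through the xpath argument of html_elements(). Here is a hedged example that selects the same office-name headings found earlier with SelectorGadget; results depend on the live page.

library(rvest)

un_page <- read_html("https://www.un.org/en/our-work/deliver-humanitarian-aid")
# select every h3 whose class attribute contains "horizontal-line-top-drk"
html_elements(un_page,
              xpath = '//h3[contains(@class, "horizontal-line-top-drk")]') |>
  html_text2()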

7.5 Scraping Data with the {rvest} Package

7.5.1 The {rvest} Package

The {rvest} package is designed to help users scrape (or “harvest”) data from HTML web pages.

  • There are functions that work with the elements of web pages and web page forms.
  • There are also functions to manage “sessions” where you simulate a user interacting with a website, using forms, and navigating from page to page.

The {rvest} package depends on the {xml2} package, so all the xml functions are available, and {rvest} adds a thin wrapper for working specifically with HTML documents.

  • The XML in “xml2” stands for “Extensible Markup Language”. It’s a markup language (like HTML and Markdown), useful for representing data.

  • You can parse XML documents with read_xml().

  • You can extract components from XML documents with xml_find_all(), xml_attr(), xml_attrs(), xml_text() and xml_name().

The {xml2} package has functions to write_html() or write_xml() to save the HTML/XML document to a path/file_name for later use.

Note

The {rvest} package has a dependency on the {xml2} package, so it will install it automatically. Many functions in {rvest} call {xml2} functions, so you do not need to library(xml2) to use {rvest}. However, {xml2} has functions that are not used by {rvest}, so you may need to attach it to use those functions, e.g., xml2::write_html().

7.5.2 Steps for Scraping Data

Once you have identified the elements you want to scrape and the CSS selectors to get them, the next step is to download the web page and scrape the content for those elements.

We will use {rvest} and {xml2} package functions to do three things:

  • Access the web page,
  • Extract the elements of interest, and,
  • Convert the extracted elements into a data structure you can manipulate to tidy and clean the data for analysis.
  1. Use read_html() to access the HTML document (from a URL, an HTML file, or an HTML string) to create a copy in memory.
    • Since read_html() is an {xml2} function, the result is a list object with class xml_document.
  2. Use html_element(doc, CSS) or html_elements(doc, CSS) to select the element/elements you want from the xml_document object with the CSS selectors you have identified.
    • The result is a list object with class xml_node if one element is selected, or xml_nodeset if more than one element is selected.
  3. Use the appropriate function to extract data from the selected elements:
  • html_name(x): extract the name of the element(s) (xml_document, xml_node, or xml_nodeset)
  • html_attr() or html_attrs(): extract the attribute(s) of the element(s). The result is always a string so may need to be cleaned.
  • html_text2(): extract the text inside the element with browser-style formatting of white space or html_text() for original text code formatting (which may look messy).
  • html_table(): extract data structured as an HTML table. It will return a tibble (if one element is selected) or a list of tibbles (when multiple elements are selected).
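
Put together, the three steps form a short pipeline. This sketch uses a placeholder URL and selector just to show the shape of the code.

library(rvest)

doc   <- read_html("https://example.com")   # 1. access the page (placeholder URL)
nodes <- html_elements(doc, css = "h1")     # 2. select elements with a CSS selector
html_text2(nodes)                           # 3. extract the text
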
Important

You can use different strategies for how to extract the elements of interest.

  • Extract each element individually.
  • Extract all of the elements at once.
  • Extract a parent element to get all the children elements.
  • Some combination of the above.

Once you have extracted the data from the elements, you can use multiple approaches to tidy and clean the data.

7.5.3 Scrape Text Data from the UN Site

Load the {tidyverse}, {rvest} and {xml2} packages.

library(tidyverse)
library(rvest)
library(xml2)

7.5.3.1 Use read_html() to Access the Web Page Document.

Use read_html() to create a variable with all the static content from a URL (or file or text).

  • It’s our local copy of the web page document.
  • It may have embedded JavaScript but it does not contain dynamic content such as ads that pop up or are inserted on the web page.
un_html_obj <- read_html("https://www.un.org/en/our-work/deliver-humanitarian-aid")

We can use xml2::write_html() to save the web page object to a location.

  • We can use read_html() to read it back in from a path instead of a URL.
  • Saving a static web page and reading it in allows us to scrape it without having to access the original site over the internet.
  • The R object will have class "xml_document" "xml_node".
my_path <- "./data/un_aid_web_page.html"
# xml2::write_html(un_html_obj, my_path)

un_html_obj <- read_html(my_path)
class(un_html_obj)
[1] "xml_document" "xml_node"    

The web page R object is an R list where the elements, here $head and $body, contain pointers to the elements in parts of the document.

  • The web page document is the top level, single node, for the page so the parent class is xml_node instead of "xml_nodeset".
  • We can extract the element name and the attributes as well.
html_name(un_html_obj)
[1] "html"
html_attrs(un_html_obj)
  dir  lang 
"ltr"  "en" 

7.5.3.2 Extract the elements we want from the xml_document object we just created.

Use html_element() (we only have one element) with the CSS selector we found (".col-sm-12.grey-well").

  • Insert the CSS selectors as a character string (not a vector) as the value for the css = argument.
    • The xpath argument is not needed if css = has a value.
  • The result will be an object of class xml_node since the CSS selects 1 element.
 html_element(un_html_obj, css = ".col-sm-12.grey-well") ->
  un_element_box
class(un_element_box)
[1] "xml_node"
  • html_name() shows us the name of an element.
html_name(un_element_box)
[1] "div"
  • Here the name is just the HTML tag for a division, div.

7.5.3.3 Extract the components of interest from the elements

Now that we have an object with the elements of interest, we can extract the content.

Here we want to extract the text.

  • Use html_text2() to extract the text from the selected elements and save as a variable.
  • The result will be a character vector of length 1 (here with 4,970 characters).
un_box_text <- html_text2(un_element_box)
str_sub(un_box_text, end = 900)
[1] "Office for the Coordination of Humanitarian Affairs (OCHA)\n\nBased at UN Headquarters in New York City, the Office for the Coordination of Humanitarian Affairs (OCHA, part of the UN Secretariat) has regional and country offices in West and Central Africa, the Middle East and North Africa, Asia and the Pacific, Southern and Eastern Africa, Latin America and the Caribbean, and Humanitarian Advisor Teams in many countries.\n\nUnited Nations Development Programme (UNDP)\n\nUNDP Headquarters is in New York City.\n\nUNDP locations\nAfrica\nRegional Service Centre: Addis Ababa, Ethiopia.\nThe Americas\nRegional Hub: Panama\nAsia and the Pacific\nRegional Hub: Bangkok, Thailand\nEurope and Central Asia\nUNDP has a regional office for Europe and Central Asia.\nThe Middle East\nThe UNDP Regional Bureau for Arab States (RBAS) is based in New York\nUnited Nations Refugee Agency (UNHCR)\n\nUNHCR Headquarters is in Genev"
str_length(un_box_text)
[1] 4970

We have “scraped” the data of interest and converted it into a character vector.

However, as suspected, it is a lot of text with only new-line characters to indicate any kind of structure. It would take a lot of cleaning to figure this out.

Instead, let’s try selecting individual elements to see if that is better.

  • Use the CSS selectors for the Office Names and their HQs locations and extract the text for those elements.
  • You should get a character vector of 9 + 9 = 18 elements.
  • Convert into a tidy data frame.
  • Approach 1. Scrape each individually and combine into data frame.
html_elements(un_html_obj, css = ".horizontal-line-top-drk") |> 
  html_text2() ->
  office
html_elements(un_html_obj, css = ".horizontal-line-top-drk+ p") |> 
  html_text2() ->
  location
un_aid_offices <- tibble(office, location)
  • Approach 2. Scrape together and tidy.
html_elements(un_html_obj, 
1              css = ".horizontal-line-top-drk, .horizontal-line-top-drk+ p") |>
2  html_text2() |>
3  tibble(un_text = _) |>
4  mutate(
    office_id = rep(1:9, each = 2),
5    id = if_else(row_number() %% 2 == 1, "office", "location")
    ) |> 
6  pivot_wider(names_from = id, values_from = un_text) |>
7  select(-office_id)
  un_aid_offices 
1
Combine the two (or more) selectors into a character string separated by commas.
2
Extract the text as a character vector.
3
Convert the vector into a tibble with column name un_text.
4
Add a column to uniquely identify each pair of office and location.
5
Add a column id with what will be the new column names.
6
Pivot wider based on the new column names column and un_text.
7
Remove the office_id as no longer needed.
# A tibble: 9 × 2
  office                                                                location
  <chr>                                                                 <chr>   
1 Office for the Coordination of Humanitarian Affairs (OCHA)            Based a…
2 United Nations Development Programme (UNDP)                           UNDP He…
3 United Nations Refugee Agency (UNHCR)                                 UNHCR H…
4 United Nations Children’s Fund (UNICEF)                               UNICEF …
5 United Nations Population Fund (UNFPA)                                UNFPA H…
6 World Food Programme (WFP)                                            WFP Hea…
7 Food and Agriculture Organization of the United Nations (FAO)         FAO Hea…
8 United Nations Relief and Works Agency for Palestine Refugees in the… UNRWA H…
9 World Health Organization (WHO)                                       WHO Hea…
# A tibble: 9 × 2
  office                                                                location
  <chr>                                                                 <chr>   
1 Office for the Coordination of Humanitarian Affairs (OCHA)            Based a…
2 United Nations Development Programme (UNDP)                           UNDP He…
3 United Nations Refugee Agency (UNHCR)                                 UNHCR H…
4 United Nations Children’s Fund (UNICEF)                               UNICEF …
5 United Nations Population Fund (UNFPA)                                UNFPA H…
6 World Food Programme (WFP)                                            WFP Hea…
7 Food and Agriculture Organization of the United Nations (FAO)         FAO Hea…
8 United Nations Relief and Works Agency for Palestine Refugees in the… UNRWA H…
9 World Health Organization (WHO)                                       WHO Hea…

Approach 2 is more complicated but shows techniques that may be useful for more complex data.

Now that you have a tidy tibble, you can clean the text as desired, e.g.,

un_aid_offices |> 
  mutate(location = str_replace(location, " Headquarters( is)? in ", ": "))
# A tibble: 9 × 2
  office                                                                location
  <chr>                                                                 <chr>   
1 Office for the Coordination of Humanitarian Affairs (OCHA)            Based a…
2 United Nations Development Programme (UNDP)                           UNDP: N…
3 United Nations Refugee Agency (UNHCR)                                 UNHCR: …
4 United Nations Children’s Fund (UNICEF)                               UNICEF:…
5 United Nations Population Fund (UNFPA)                                UNFPA H…
6 World Food Programme (WFP)                                            WFP: Ro…
7 Food and Agriculture Organization of the United Nations (FAO)         FAO: Ro…
8 United Nations Relief and Works Agency for Palestine Refugees in the… UNRWA: …
9 World Health Organization (WHO)                                       WHO: Ge…

7.5.4 IMDB.com Example

We want to scrape data about the top 100 greatest movies of all time from IMDB (per Chris Walczyk55).

Snapshot view of the web page showing the intro paragraph and the first two entries.
Figure 7.11: IMDB 100 Top Movies of All Time

7.5.4.1 Use read_html() to Access the Web Page Document.

Let’s access the 100 greatest movies page from IMDB.

Use read_html() to create a variable with all the static content from a URL (or file or text).

  • It’s our local copy of the web page document.
  • It does not contain dynamic content such as ads that pop-up or are inserted on the web page.
  • The R object will have class "xml_document" "xml_node".
imdb_html_obj <- read_html("https://www.imdb.com/list/ls055592025/")
class(imdb_html_obj)
[1] "xml_document" "xml_node"    

The web page R object is an R list where the $head and $body elements contain pointers to the elements in the parts of the document.

  • The web page document is the top level, single node, for the page so the parent class is xml_node instead of "xml_nodeset".

  • We can extract the element name and the attributes as well.

html_name(imdb_html_obj)
[1] "html"
html_attrs(imdb_html_obj)
                                  lang                               xmlns:og 
                               "en-US" "http://opengraphprotocol.org/schema/" 
                              xmlns:fb 
   "http://www.facebook.com/2008/fbml" 

The list object attributes include <html xmlns:og="http://opengraphprotocol.org/schema#" and xmlns:fb="http://www.facebook.com/2008/fbml">.

  • The term xmlns stands for the XML “namespace” which defines the list of HTML tags that are available for use in the web page. This tells the browser what standard to use to interpret the elements.

  • The XML namespace xmlns is a required attribute for the HTML document. If not defined, the default is the standard http://www.w3.org/1999/xhtml.

  • In this example there are two standards used; one for Open Graph and one for the Facebook Markup Language (which is now deprecated). Using the standard does not connect you to the Facebook web site.

We can use write_html() to save the web page object to a location.

  • We can use read_html() to read it back in from a path instead of a URL.
my_date = today()
my_path <- paste0("./data/imdb_top_100_movies_", my_date, ".html")
xml2::write_html(imdb_html_obj, my_path)

imdb_html_obj <- read_html(my_path)

The web page does not have a simple DOM structure, and Selector Gadget will not work on it.

  • Use the browser Developer Tools to find CSS selectors for the number and title and for the metascore.
  • Number and title are combined in .ipc-title__text.

The web page elements for number and title for the first movie highlighted, with the corresponding Developer Tools element highlighted on the right.

  • Metascore is by itself in .metacritic-score-box.

The web page elements for the Metacritic score for the first movie highlighted, with the corresponding Developer Tools element highlighted on the right.

Metacritic Score

7.5.4.2 Extract the elements we want from the xml_document object we just created.

Use html_elements() with the CSS selectors we found (in the previous section).

  • Insert the CSS selectors as a character string (not a vector) as the value for the css = argument.
    • The xpath argument is not needed if css= has a value.
  • The result will be an object of class xml_nodeset as we used html_elements() and there are multiple elements.
  • Note the length is 52, not the 200 we might expect (two elements for each of 100 movies). We will investigate this more as we get into the text.
imdb_elements <- html_elements(imdb_html_obj,
                css = ".ipc-title__text, .metacritic-score-box")
class(imdb_elements)
[1] "xml_nodeset"
length(imdb_elements)
[1] 52

html_name() shows us the name of each of the elements.

html_name(imdb_elements)
 [1] "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span"
[11] "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span"
[21] "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span"
[31] "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span"
[41] "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span" "h3"   "span"
[51] "h3"   "h3"  
  • We can see a pattern of the number and title as h3 and the metacritic score as span. These are generic HTML tags, so not so useful on their own.
  • h3 is the tag for a level-3 heading.
  • span is the name for the <span> tag, a generic section of a document that can be of any length.

html_attrs() shows us all the attributes for each element.

html_attrs(imdb_elements)[47:52]
[[1]]
            class 
"ipc-title__text" 

[[2]]
                                      class 
"sc-b0901df4-0 bXIOoL metacritic-score-box" 
                                      style 
                 "background-color:#54A72A" 

[[3]]
            class 
"ipc-title__text" 

[[4]]
                                      class 
"sc-b0901df4-0 bXIOoL metacritic-score-box" 
                                      style 
                 "background-color:#54A72A" 

[[5]]
            class 
"ipc-title__text" 

[[6]]
            class 
"ipc-title__text" 
  • Here we see a CSS class and for the metacritic score a style as well. Note the last two don’t fit the pattern.
  • Attributes can be used as CSS selectors as well (the fifth type above).
Note
  • At times you may want to scrape an <a> tag which is the URL for a hyperlink.
    • You will usually not see the full URL but a relative URL (like a relative path).
    • <a href="/title/tt0068646/?ref_=ls_t_1" ... </a>
    • Any relative paths are appended to the home directory for the site (the <base> URL) which in this case is https://www.imdb.com to get the final https://www.imdb.com/title/tt0068646/?ref_=ls_t_1.
    • If you use Developer Tools to look in the header, you can search for “canonical” to find the canonical link <link rel="canonical" href="https://www.imdb.com/list/ls055592025/"> which is the URL the developers want search engines to use for this page.
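
As a hedged sketch, relative hrefs can be collected with html_attr() and made absolute with xml2::url_absolute(); this assumes the imdb_html_obj object created above and just previews the first few links.

library(rvest)

html_elements(imdb_html_obj, "a") |>   # every <a> element on the page
  html_attr("href") |>                 # the (often relative) href attribute
  head(3) |>
  xml2::url_absolute(base = "https://www.imdb.com")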

7.5.4.3 Extract the components of interest from the elements

Now that we have an object with the elements of interest, we can extract the content.

Here we want to extract the text.

  • Use html_text2() to extract the text from the selected elements and save as a variable.
  • The result will be a character vector (here length 52).
imdb_text <- html_text2(imdb_elements)
head(imdb_text)
[1] "1. The Godfather"            "100"                        
[3] "2. The Shawshank Redemption" "82"                         
[5] "3. Schindler's List"         "95"                         
tail(imdb_text)
[1] "24. Chinatown"                    "92"                              
[3] "25. The Bridge on the River Kwai" "88"                              
[5] "More to explore"                  "Recently viewed"                 
  • Using tail(), we can see that we have picked up two extra elements that are not of interest.
    • Scroll down to the bottom of the page and you will find “More to Explore” and “Recently Viewed”.
    • These are also h3 titles but have no logical connection to the movie titles.
    • This is what can happen when web page designers “overuse” a class - it may be easier for them in the short run but makes the page harder to maintain and it can be a pain for us.
  • If we were scraping just this element it would be easy to just filter them out, however, if we are trying to get multiple elements, having extra rows will interfere with converting to a data frame later on.

To avoid scraping these elements, we need to get more specific in our CSS selector for number and title by adding an element higher up in the DOM tree, one that includes the movie numbers and titles but not the non-movie elements.

  • Scroll up in the DOM to find the div with class="sc-74bf520e-3 avhjJ dli-parent".
  • Click on it and a box should highlight on the web page that shows this element contains all of the information about the movie as in Figure 7.12.
The web page elements for the movie element for the first movie highlighted, with the Developer Tools element highlighted on the right.
Figure 7.12: The Movie element
  • We can replace ".ipc-title__text" in our original selector with ".dli-parent .ipc-title__text" (keeping the comma before ".metacritic-score-box") and try again.
imdb_elements <- html_elements(imdb_html_obj,
                css = ".dli-parent .ipc-title__text, .metacritic-score-box")
length(imdb_elements)
[1] 50
  • We now see we have 50 elements.

Let’s get the text.

imdb_text <- html_text2(imdb_elements)
head(imdb_text)
[1] "1. The Godfather"            "100"                        
[3] "2. The Shawshank Redemption" "82"                         
[5] "3. Schindler's List"         "95"                         
tail(imdb_text)
[1] "23. The Silence of the Lambs"     "86"                              
[3] "24. Chinatown"                    "92"                              
[5] "25. The Bridge on the River Kwai" "88"                              

We have “scraped” the data of interest and converted it into a character vector.

Why 50 elements and not 200?
  • IMDB is now using javascript to dynamically load additional movies as you scroll.
    • This enables them to load the page faster and if you are interested, they can then load more.
    • If you are not interested and go back or click somewhere else, they have saved you time and themselves resources.

They appear to have settled on 25 movies as a good place to stop. Thus 2 elements each for 25 movies is 50 elements.

We will have to use some additional methods later on to get all 100 movies.

7.5.4.4 Tidy and Clean the data

To convert a vector with character numbers and names into a tibble with two columns, one numeric and one character we can:

  • Turn the imdb_text vector into a tibble of one column called text (with 50 rows in this example).
  • Add a column called rownum with numeric values of the row_number()s.
  • Add a column called is_even with logical values indicating if the rownum value is even or odd (since we have two elements for each movie).
  • Add a column called movie with a pair of numbers for each movie tied to their rank, e.g., 1,1, 2,2, 3,3, …, 25,25.
  • Now the tibble is ready to be “tidy’ed”.
  • Select the columns, less rownum, which has served its purpose.
  • pivot_wider() to break out the ranks and movie titles with names_from = is_even and values_from = text.
  • Select only the columns named TRUE and FALSE and rename them as part of the select to Rank and Movie.
  • Now the tibble is tidy’ed and ready to be cleaned.
    • Use separate_wider_delim() to break out the Rank and Title.
    • Use parse_number() to convert Rank to a number.
  • Save as a new object.
tibble(text = imdb_text) |>
  mutate(
    rownum = row_number(),
1    is_even = rownum %% 2 == 0,
    movie = rep(1:25, each = 2)
  ) |>
  ## view()
  select(-rownum) |>
  pivot_wider(names_from = is_even, values_from = text) |>
  ## view()
  select(-movie, "Rank" = "FALSE", movie = "TRUE") |> 
  separate_wider_delim(Rank, delim = ". ", 
                       names = c("Rank", "Title"),
2                       too_many = "merge") |>
  mutate(Rank = parse_number(Rank)) ->
movierank

movierank |> head()
1
Use the modular division operator %% to see if the remainder is 0.
2
Use the too_many argument to handle E.T. the Extra-Terrestrial.
# A tibble: 6 × 3
   Rank Title                    movie
  <dbl> <chr>                    <chr>
1     1 The Godfather            100  
2     2 The Shawshank Redemption 82   
3     3 Schindler's List         95   
4     4 Raging Bull              90   
5     5 Casablanca               100  
6     6 Citizen Kane             100  

Voila! We have a tibble with the correct columns of the correct class.

7.5.5 Scraping from IMDB at the Movie level

We have scraped individual elements. As we will see later on, not every movie has every element. Scraping individual elements will generate lists of different lengths and it could require a lot of work to align the elements to movies.

Let’s create a strategy to scrape parent elements where we choose a parent that exists for all the data of interest and contains all the data (there is) for the elements we are interested in.

Our new strategy for getting the data is:

  1. Find a CSS selector for the “movie” that captures the whole set of data for each movie, including the fields we don’t want.
  2. Use html_elements() to select all of these movie elements.
  3. Use html_element() with the individual CSS selector for each desired element for all movies.
  4. Convert each of those to text with html_text2().
  5. If possible, combine into a tibble.
  • Now that we have a clear strategy, we can simplify it even more by combining steps 3 and 4 into a custom function.
  1. Find the CSS selector for movie (if it exists).
  • As seen in Figure 7.12, we can find a CSS selector at the movie level and it is .dli-parent.
  • There should be 100 of them but for now we can only scrape the first 25.
  2. Scrape the movies from our downloaded web page and see if we got 25.
imdb_movie_elements <- html_elements(imdb_html_obj, css = ".dli-parent")
length(imdb_movie_elements)
[1] 25

3-4. Build a custom function to get the text from an input xml object and a CSS selector.

get_text <- function(my_obj, my_css){
    html_element(my_obj, my_css) |> 
    html_text2()
}
  5. Put it all together inside a call to tibble() to get the rank_title and metascore elements using the selectors we identified earlier.
tibble(rank_title = get_text(imdb_movie_elements, ".ipc-title__text"),
        metascore = get_text(imdb_movie_elements, ".metacritic-score-box")
) ->
movie_df
glimpse(movie_df)
Rows: 25
Columns: 2
$ rank_title <chr> "1. The Godfather", "2. The Shawshank Redemption", "3. Schi…
$ metascore  <chr> "100", "82", "95", "90", "100", "100", "97", "92", "84", "1…

As an alternative, instead of finding individual CSS class selectors, let’s try just getting the data of interest using the div and span tags as CSS selectors.

  • This could be a shortcut instead of searching for all the classes.
  • Since we are searching within just the found movie elements, we don’t need to worry about getting every div or span on the page.
tibble(div = get_text(imdb_movie_elements, "div"),
       span = get_text(imdb_movie_elements, "span")
) 
# A tibble: 25 × 2
   div                                                                     span 
   <chr>                                                                   <chr>
 1 "1. The Godfather\n19722h 55mR\n9.2 (2.1M)Rate\n100Metascore"           1972 
 2 "2. The Shawshank Redemption\n19942h 22mR\n9.3 (3M)Rate\n82Metascore"   1994 
 3 "3. Schindler's List\n19933h 15mR\n9.0 (1.5M)Rate\n95Metascore"         1993 
 4 "4. Raging Bull\n19802h 9mR\n8.1 (387K)Rate\n90Metascore"               1980 
 5 "5. Casablanca\n19421h 42mPG\n8.5 (617K)Rate\n100Metascore"             1942 
 6 "6. Citizen Kane\n19411h 59mPG\n8.3 (473K)Rate\n100Metascore"           1941 
 7 "7. Gone with the Wind\n19393h 58mG\n8.2 (339K)Rate\n97Metascore"       1939 
 8 "8. The Wizard of Oz\n19391h 42mG\n8.1 (436K)Rate\n92Metascore"         1939 
 9 "9. One Flew Over the Cuckoo's Nest\n19752h 13mR\n8.7 (1.1M)Rate\n84Me… 1975 
10 "10. Lawrence of Arabia\n19623h 47mPG\n8.3 (321K)Rate\n100Metascore"    1962 
# ℹ 15 more rows
  • We can see this gets the data from the first div, iqHBGn and the first span within that div, <span class="sc-b189961a-8 hCbzGp dli-title-metadata-item">1972</span> as in Figure 7.13.
The web page elements for the movie element for the first movie highlighted, with the Developer Tools element highlighted on the right.
Figure 7.13: The Movie element first div and span elements
  • However, we saw this does not get all the data for each movie and what it does get will take a bit more work to clean.

What if we changed our function to get multiple elements, to see if we get all the data?

get_text_all <- function(my_obj, my_css){
1    html_elements(my_obj, my_css) |>
    html_text2()
}
1
Changed to html_elements() to select multiple elements and not just the first instance.
  • Let’s just scrape for all the divs and then scrape all the spans separately.
tibble(div = get_text_all(imdb_movie_elements, "div")) |> glimpse()
Rows: 425
Columns: 1
$ div <chr> "1. The Godfather\n19722h 55mR\n9.2 (2.1M)Rate\n100Metascore", "",…
tibble(span = get_text_all(imdb_movie_elements, "span")) |>  glimpse()
Rows: 479
Columns: 1
$ span <chr> "1972", "2h 55m", "R", "9.2 (2.1M)Rate\n100Metascore", "9.2 (2.1M…
  • We had to create two different tibbles as the results for div produced fewer rows than the results for span so we could not combine them into a single tibble without creating a lot of NAs.

  • The div tibble has 425 rows with a lot of duplication as there are divs within divs. All the data is contained within the 17 rows associated with each movie.

    • These can be cleaned to get any of the data elements with a fair amount of work.
  • The span tibble has 479 rows, which also means a lot of duplication, but not every element is present for every movie (479 is not a multiple of 25).

    • This can be cleaned but it would require extra effort to align the elements to the correct movie.

7.5.6 Scraping Strategies

So far we have scraped static web pages where the content was all present on the page without us having to manipulate the page.

We used a variety of strategies to accomplish three steps:

  1. Find CSS Selectors
    • Use Selector Gadget with Chrome browser to find individual element selectors or multiple element selectors on a web page.
    • Use browser Developer Tools to navigate the DOM tree to find individual elements and identify their attributes and possible selectors as well as parents/children.
  2. Scrape the Web Page using CSS selectors
  • Scrape individual elements using classes,
  • Scrape parent elements using classes and then scrape individual elements by class within each parent,
  • Scrape parent elements using classes and then scrape the first or multiple elements using HTML tags as selectors.
  3. Extract content into a tibble, tidy the tibble, and clean it.
  • We used html_text2() as we were interested in just text for now. We will see other functions later.
  • The approach to create a tibble depends upon the shape of the outputs from step 2.
  • The approach to cleaning also depends upon the content of the data from step 2.

There are trade-offs across these steps. Your final choice of strategy will depend upon the questions you want to answer, how the data of interest is aligned within the web page’s DOM tree, whether you need to worry about potential missing data or extra data, and how much work you will need to do to tidy and clean the data.

  • Typically, the more effort you put into scraping, the easier the tidying and cleaning.

There are three basic approaches that you can use and blend together.

7.5.6.1 Brute Force

  • Scrape all the individual elements at once.
  • Load into a tibble of one column with all the text.
  • Create a logical column for each field by using characteristics of the data, {stringr} functions, regex, and combinations of logical checks to identify which field each row belongs to.
    • As an example, detecting the presence of the ( ) around the year to create a year column.
    • This can take a fair amount of work whenever any element value is missing for one or more instances.
  • Use case_when() to create a final column with the field names for each of the rows.
  • Pivot wider using the field names to get to the tibble.
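
Below is a rough sketch of the brute-force idea applied to the imdb_text vector scraped earlier; the regex checks are illustrative and assume each row is either a "rank. title" string or a bare metascore number.

library(tidyverse)

tibble(text = imdb_text) |>
  mutate(
    field = case_when(
      str_detect(text, "^\\d+\\. ") ~ "rank_title",  # e.g., "1. The Godfather"
      str_detect(text, "^\\d+$")    ~ "metascore",   # e.g., "100"
      TRUE                          ~ NA_character_
    )
  ) |>
  filter(!is.na(field)) |>
  mutate(movie = cumsum(field == "rank_title")) |>   # pair each metascore with its title
  pivot_wider(names_from = field, values_from = text)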

7.5.6.2 Medium Effort

  • Scrape each element individually and convert into a character vector.
  • Scrape all the elements together into one long vector and convert to a tibble.
  • For those fields with all values, create a new column of logicals (using %in% and the corresponding character vector).
  • For those fields that might be missing values, create logical columns based on unique characteristics of the field.
    • This will often go faster if you can identify the field as being NOT one of the others, in addition to the {stringr} and regex combination approach, until done.
  • Then create the column with the field names and pivot wider.
  • This takes advantage of knowing precisely which elements have character vectors with all values of interest to reduce the amount of effort in cleaning.
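
A sketch of the medium-effort idea using the earlier UN example; it assumes office is the vector scraped individually and un_text is the combined vector of offices and locations scraped together.

library(tidyverse)

tibble(un_text = un_text) |>
  mutate(
    field     = if_else(un_text %in% office, "office", "location"),
    office_id = cumsum(field == "office")     # ties each location to its office
  ) |>
  pivot_wider(names_from = field, values_from = un_text) |>
  select(-office_id)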

7.5.6.3 Scrape the Parent Element

Fortunately, there is another strategy we might be able to use that takes advantage of a property of html_element().

  • It can seem counter-intuitive, but instead of selecting individual elements of interest for each movie, we scrape the parent elements using html_elements() to capture the elements with all of the data of interest and perhaps data of no interest.
  • Then use html_element() with each individual CSS selector to scrape the elements. If an element is missing from the parent element, it will return NA in the proper position in the output vector!
  • Extract the text with html_text2().
  • Convert into a tibble.

7.5.6.4 Time to Clean the data.

The steps we take here always depend upon the shape and content of the results we have obtained and what we need to do to transform them into a useful data structure for our analysis.

  • Look at the data to identify specific features or characteristics you can use to help in cleaning.
  • In some cases, you can use across() with tidy select functions (?tidyr_tidy_select) to clean multiple variables at once.
  • In other cases, you have to clean individual variables one at a time.
  • Use a build-a-little-test-a-little approach. Start with the tibble at the top, use a pipe between steps so it is easy to check the data, and do not save the data frame until you are sure of the result.
  • Consider using functions from {tidyr}, such as the separate_* functions, and the {readr} parse_* functions to save time and effort with {stringr} and regex. (Keep your cheat sheets handy!)

7.5.7 Manipulate a Dynamic Web Page with {rvest}

As we have seen, even though we can “see” all 100 elements on a page, they are not really there when the page is first loaded. This is due to the designers using a form of Infinite Scrolling to dynamically load content on the page.

We will have to programmatically “interact” with the page to see as much content as we want. That means using code to simulate a live user interacting with the page with a mouse or keyboard.

There are several options for interacting with a web page.

  • Minimalist approaches require less setup and can handle common actions such as scrolling or clicking.
  • Comprehensive approaches allow you to navigate and manipulate entire web sites. This includes the web-site testing capabilities seen with {rselenium} in Section 7.9 or {selenider} in Thorpe (2024).

We will use a minimalist approach based on the {rvest} package.

The {rvest} package includes functions to create a live session and interact with a web page, e.g., read_html_live(). See Interact with a live web page

  • These functions use the {chromote} package to establish the session and interact with the page.

The functions allow you to open a live session in the Chrome browser and use the live session’s methods to scroll across the page, type text, press a key, or click on buttons or links.

  • my_session <- read_html_live(my_url) creates a session object, e.g., my_session, where the url web page is active.
  • The my_session$view() method is useful for interactive debugging to see what is happening on the page as you execute other methods.
    • Be sure to comment out for rendering.
  • There are several methods to scroll on the page, e.g., my_session$scroll_by(top, left) scrolls the page by the given numbers of pixels vertically and horizontally, which can help with scrolling down dynamic pages.
    • You may need to call it several times to allow the page to scroll, load, and then scroll again.
  • The my_session$click("my_css") method moves to a button identified by a CSS selector and clicks it.
    • This can be useful when a page has a “load more” button as opposed to dynamic scrolling.

Once the page is fully loaded, you can use html_element() or html_elements() as usual with the my_session object.
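For instance, a page with a “load more” button might be handled along these lines. This is only a sketch; the URL and both CSS selectors are hypothetical.

library(rvest)

my_session <- read_html_live("https://example.com/movies")  # hypothetical URL
# my_session$view()                   # watch the page while debugging interactively

my_session$scroll_by(10000, 0)        # scroll down 10,000 pixels
my_session$click("#load-more")        # hypothetical selector for a "load more" button

my_session |>
  html_elements(".title") |>          # hypothetical selector for the items of interest
  html_text2()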

Here is an example of opening a page and scraping data without moving.

  • Remove the comment # on sess$view to see it open. You can close the browser tab manually.
library(rvest)
sess <- read_html_live("https://www.forbes.com/top-colleges/")
# sess$session$is_active()
# sess$view()
my_table <- html_element(sess,".ListTable_listTable__-N5U5") |> html_table()
row_data <- sess |> html_elements(".ListTable_tableRow__P838D") |> html_text()
my_table |> head()
# A tibble: 6 × 8
   Rank Name       State Type  `Av. Grant Aid` `Av. Debt` Median 10-year Salar…¹
  <int> <chr>      <chr> <chr> <chr>           <chr>      <chr>                 
1     1 Princeton… NJ    Priv… $59,792         $7,559     $189,400              
2     2 Stanford … CA    Priv… $60,619         $12,999    $177,500              
3     3 Massachus… MA    Priv… $45,591         $13,792    $189,400              
4     4 Yale Univ… CT    Priv… $63,523         $4,926     $168,300              
5     5 Universit… CA    Publ… $21,669         $7,238     $167,000              
6     6 Columbia … NY    Priv… $61,061         $16,849    $156,000              
# ℹ abbreviated name: ¹​`Median 10-year Salary`
# ℹ 1 more variable: `Financial Grade` <chr>
row_data[1:10]
 [1] "1Princeton UniversityNJPrivate not-for-profit$59,792$7,559$189,400A-"                 
 [2] "2Stanford UniversityCAPrivate not-for-profit$60,619$12,999$177,500A+"                 
 [3] "3Massachusetts Institute of TechnologyMAPrivate not-for-profit$45,591$13,792$189,400A"
 [4] "4Yale UniversityCTPrivate not-for-profit$63,523$4,926$168,300A+"                      
 [5] "5University of California, BerkeleyCAPublic$21,669$7,238$167,000"                     
 [6] "6Columbia UniversityNYPrivate not-for-profit$61,061$16,849$156,000A+"                 
 [7] "7University of PennsylvaniaPAPrivate not-for-profit$57,175$12,499$171,800A+"          
 [8] "8Harvard UniversityMAPrivate not-for-profit$61,801$9,004$171,400A"                    
 [9] "9Rice UniversityTXPrivate not-for-profit$51,955$10,818$152,100A"                      
[10] "10Cornell UniversityNYPrivate not-for-profit$54,219$8,309$155,400A+"                  

Let’s get the data for 100 movies from the IMDB web page.

  • The code below works fine interactively, but when rendering the file, IMDB will detect the default user agent value of NULL and block the request, leading to a time-out error. The first few lines use httr::user_agent() to configure a user agent other than NULL.

  • Remove the comment # on sess$view to see the activity in a Chrome browser tab. You can close the browser tab manually when finished.

Listing 7.1: Using live session to manipulate a web page.
# Define User-Agent string
user_agent_str <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.3"
# or
user_agent_str <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15"

# Create a user agent object
ua <- httr::user_agent(user_agent_str)

sess <- read_html_live("https://www.imdb.com/list/ls055592025/")
# sess$session$is_active()
# sess$view()
sess$scroll_by(10000,0)
sess$scroll_by(10000,0)
sess$scroll_by(10000,0)
sess$scroll_by(10000,0)
# sess$get_scroll_position()

sess |> 
html_elements(css = ".dli-parent") ->
  imdb_100_elements

tibble(rank_title = get_text(imdb_100_elements, ".ipc-title__text"),
        Metascore = get_text(imdb_100_elements, ".metacritic-score-box") 
) ->
movie_100_df 
glimpse(movie_100_df)
Rows: 75
Columns: 2
$ rank_title <chr> "1. The Godfather", "2. The Shawshank Redemption", "3. Schi…
$ Metascore  <chr> "100", "82", "95", "90", "100", "100", "97", "92", "84", "1…
  1. Create a new session.
  2. Check if the session is active - useful for interactive debugging.
  3. Open a browser to navigate to the desired web page
  4. Scroll down the web page as far as needed to load all the elements of interest. In this case, four scrolls of 10,000 pixels each were required to allow the page to load and re-scroll to catch all the movies. Other options may work.
  5. The actual location after scrolling - useful for interactive debugging.
  6. Scrape all 100 movie elements.
  7. Extract the elements of interest into a data frame.

7.5.8 Scrape a Dynamic Page with Chromote

If you need more flexibility than scrolling and clicking, you can try working directly with Chromote.

Important

The {chromote} package uses the open-source Chromium protocol for interacting with Google Chrome-like browsers so you must have Chrome (or a similar browser) installed.

There is limited documentation for {chromote}, so you may need to search for help on other options.

{chromote} uses what is known as a “headless browser” to interact with a web page.

  • A headless browser is a browser without a user interface screen - there is nothing for a human head to see or do.
  • These tools are commonly used for testing web sites since, without the need to load a user interface, the code can run much faster.
  • We can use a $view() method to create a screen we can observe.

Once the session is established, we can use {chromote} methods to manipulate the page and use {rvest} functions to get the elements we want as before.

Use the console to install the {chromote} package.

To scrape the data, we will load the {chromote} package and do the following:

  1. Create a new session.
  2. Create a custom user agent. This is not always necessary but can be useful to avoid appearing like a Bot. IMDB will block the default user agent.
  3. Navigate to the desired web page
  4. When running interactively, open a browser we can observe.
  5. Use Sys.sleep() to ensure the page loads all the way. The b$Page$loadEventFired() method will suspend R until the page reports that it is fully loaded. However, with IMDB it does not respond, so we manually impose a delay before the next step.
  6. Use a for-loop to scroll down the page, using Sys.sleep() to suspend operations so the dynamic content has loaded before scrolling again. This may take multiple iterations, which you can adjust.
  7. Once you have scrolled to the bottom of the page, use a Javascript function to create an object with the entire page.
  8. Switch to the usual {rvest} functions to get the html object, scrape the elements of interest, and tidy as before.
  9. Close the session. This does not close the view you created earlier; you will have to close that browser tab manually.
Listing 7.2: Using {chromote} to manipulate a web page.
library(chromote) 
options(chromote.verbose = TRUE)

b <- ChromoteSession$new()
b$Network$setUserAgentOverride(userAgent = "My fake browser")

b$Page$navigate("https://www.imdb.com/list/ls055592025/")
# b$Page$loadEventFired()
#b$view()
Sys.sleep(.3)
for (i in 1:6) {
b$Runtime$evaluate("window.scrollTo(0, document.body.scrollHeight);")
Sys.sleep(.3)
}
html <- b$Runtime$evaluate('document.documentElement.outerHTML')$result$value

page <- read_html(html)
imdb_100_elements <- html_elements(page, css = ".dli-parent")
tibble(rank_title = get_text(imdb_100_elements, ".ipc-title__text"),
        Metascore = get_text(imdb_100_elements, ".metacritic-score-box")
) ->
movie_100_df

b$close()
  1. Start a new session as a tab in a browser
  2. For IMDB, and perhaps other sites, create a custom agent to reduce the chance of being blocked as a bot and getting a status code 403 error.
  3. Navigate to the web page of interest.
  4. Un-comment if you want to see the page in the headless browser - for interactive use only.
  5. Use a delay to make sure the page is fully loaded before attempting to scroll
  6. Scroll to the bottom of the currently loaded page and wait, then scroll again, etc., until you get to the bottom. You may have to increase the number of loops for other sites.
  7. Create an object with the fully loaded page’s HTML.
  8. Use {rvest} to read in the HTML page as before.
  9. Scrape the movie elements.
  10. Use the custom function from before to extract the text for the CSS selectors of interest and convert to a tibble.
  11. Save the resulting tibble.
  12. Close the session and disconnect from the browser.

7.5.9 Clean the Data

Both {rvest} and {chromote} can generate a data frame with the desired data. However, it needs to be cleaned.

movie_100_df |> 
  separate_wider_delim(rank_title, delim = ". ", 
                       names = c("Rank", "Title"),
                       too_many = "merge") |>
  mutate(Rank = parse_number(Rank),
         Metascore = parse_number(Metascore)) ->
movie_100_df
skimr::skim(movie_100_df)
movie_100_df |> tail(10)
Data summary
Name                    movie_100_df
Number of rows          75
Number of columns       3
Column type frequency:  character 1, numeric 2
Group variables         None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Title 0 1 4 68 0 75 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Rank 0 1 38.00 21.79 1 19.5 38 56.5 75 ▇▇▇▇▇
Metascore 0 1 87.48 9.63 63 84.0 90 94.0 100 ▂▂▃▇▇
# A tibble: 10 × 3
    Rank Title                        Metascore
   <dbl> <chr>                            <dbl>
 1    66 Bonnie and Clyde                    86
 2    67 The French Connection               94
 3    68 City Lights                         99
 4    69 It Happened One Night               87
 5    70 A Place in the Sun                  76
 6    71 Midnight Cowboy                     79
 7    72 Mr. Smith Goes to Washington        73
 8    73 Rain Man                            65
 9    74 Annie Hall                          92
10    75 Fargo                               88
  • Use your Developer Tools to find the class for the value of Votes used to create the star rating.
  • Scrape the Rank, Title, Metascore, and Votes from the existing imdb_100_elements.
  • Plot Rank as a function of Votes. Use a linear smoother and non-linear smoother and use a log scale on the x axis.
  • Interpret the plot.
  • Run the linear model to check if the linear relationship is real and check the residuals with plot().

The CSS selector for Votes is class="ipc-rating-star--voteCount".

  • Since we already scraped the movie elements, we do not need to rerun the live session. We can scrape from the existing imdb_100_elements.
Show code
tibble(
  rank_title = get_text(imdb_100_elements, ".ipc-title__text"),
  Metascore = get_text(imdb_100_elements, ".metacritic-score-box"),
  Votes = get_text(imdb_100_elements, ".ipc-rating-star--voteCount")
) |>
  separate_wider_delim(rank_title,
    delim = ". ",
    names = c("Rank", "Title"),
    too_many = "merge"
  ) |>
  mutate(
    Rank = parse_number(Rank),
    Metascore = parse_number(Metascore),
    Votes = case_when(
      str_detect(Votes, "M") ~ 1000000 * parse_number(Votes),
      str_detect(Votes, "K") ~ 1000 * parse_number(Votes),
      .default = parse_number(Votes)
    )
  ) -> 
  movie_100_df2

movie_100_df2 |>
  ggplot(aes(Votes, Rank)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  geom_smooth(se = FALSE, color = "red") + 
  scale_x_log10()

Both the linear and loess smoothers show a negative relationship, but the data is highly variable around both smoothing lines. The non-linear smoother seems little better than the linear smoother.

Show code
lm_out <- lm(Rank ~ log(Votes), data = movie_100_df2)
summary(lm_out)

Call:
lm(formula = Rank ~ log(Votes), data = movie_100_df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.716 -14.643  -1.963  16.353  42.430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  140.063     28.149   4.976 4.19e-06 ***
log(Votes)    -7.955      2.186  -3.638 0.000509 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.19 on 73 degrees of freedom
Multiple R-squared:  0.1535,    Adjusted R-squared:  0.1419 
F-statistic: 13.24 on 1 and 73 DF,  p-value: 0.0005088

The \(p\)-value of 0.0005 for log(Votes) in the linear model suggests there is substantial evidence of a negative relationship with Rank.

However, the adjusted R-squared of 0.14 suggests the relationship explains only some of the variability of Rank, and it might help to add other variables (other than Metascore, which we already saw does not help).

Show code
plot(lm_out)

The residual plots do not show significant issues given the limited value of the model.

7.5.10 Scraping Tables with html_table()

When data on a web page is in the form of a table, you can use html_table() to convert to a useful form more easily.

Let’s look at a Wikipedia article on hurricanes: https://en.wikipedia.org/wiki/Atlantic_hurricane_season

  • This contains many tables which one could copy and paste into Excel, but that could be a pain to clean, and the data is not amenable to being rectangular in its current form.

Let’s try to quickly get this data into tibbles using html_table().

  • html_table() makes a few assumptions:
    • No cells span multiple rows.
    • Headers are in the first row.

As before, download and save the website content using read_html().

wiki_hurr_obj <- read_html("https://en.wikipedia.org/wiki/Atlantic_hurricane_season")
xml2::write_html(wiki_hurr_obj, "./data/wiki_hurr_obj.html")
wikixml <- read_html("./data/wiki_hurr_obj.html")

Use html_elements() to get the elements of interest, but this time we’ll extract all of the “table” elements.

  • If you inspect the tables, you see the element name table and the class wikitable; we can use either. Remember to add a . in front of wikitable to use it as a CSS class selector.
wiki_hurr_tables <- html_elements(wiki_hurr_obj, "table")
  • Use html_table() to get the tables from the table elements as a list of tibbles:
tablist <- html_table(wiki_hurr_tables)
class(tablist)
[1] "list"
length(tablist)
[1] 19
tablist[[19]] |>
  select(1:4)
# A tibble: 6 × 4
  Year     TC    TS     H
  <chr> <int> <int> <int>
1 2020     31    30    14
2 2021     21    21     7
3 2022     16    14     8
4 2023     21    20     7
5 2024     17    17    10
6 Total   106   102    46

Compare the table names across the data frames.

  • Use map() to get a list where each element is a character vector of the names.
  • Use map() to collapse each vector to a single string with the names separated by “:”.
  • Convert to a data frame for ease of viewing.
Show code
(map(tablist, \(df) names(df)) |> 
  map( \(v) tibble(names_vector = str_c(v, collapse = ":"))))[-1] |> list_rbind()
# A tibble: 18 × 1
   names_vector                                                                 
   <chr>                                                                        
 1 Year:TS:H:MH:ACE:Deaths:Strongest storm:Major landfalling storms:Notes       
 2 Year:TS:H:MH:ACE:Deaths:Strongest storm:Major landfalling storms:Notes       
 3 Year:TS:H:MH:ACE:Deaths:Strongest storm:Major landfalling storms:Notes       
 4 Year:TS:H:MH:ACE:Deaths:Strongest storm:Major landfalling storms:Notes       
 5 Year:TS:H:MH:ACE:Deaths:Strongest storm:Major landfalling storms:Notes       
 6 Year:TS:H:MH:ACE:Deaths:Damage:Strongest storm:Major landfalling storms:Notes
 7 Year:TS:H:MH:ACE:Deaths:Damage:Strongest storm:Major landfalling storms:Notes
 8 Year:TS:H:MH:ACE:Deaths:Damage:Strongest storm:Major landfalling storms:Notes
 9 Year:TS:H:MH:ACE:Deaths:Damage:Strongest storm:Major landfalling storms:Notes
10 Year:TS:H:MH:ACE:Deaths:Damage:Strongest storm:Major landfalling storms:Notes
11 Year:TS:H:MH:ACE:Deaths:Damage:Strongest storm:Retired names:Notes           
12 Year:TS:H:MH:ACE:Deaths:Damage:Strongeststorm:Retirednames:Notes             
13 Year:TS:H:MH:ACE:Deaths:Damage:Strongeststorm:Retirednames:Notes             
14 Year:TC:TS:H:MH:ACE:Deaths:Damage:Strongeststorm:Retirednames:Notes          
15 Year:TC:TS:H:MH:ACE:Deaths:Damage:Strongeststorm:Retirednames:Notes          
16 Year:TC:TS:H:MH:ACE:Deaths:Damage:Strongeststorm:Retired names:Notes         
17 Year:TC:TS:H:MH:ACE:Deaths:Damage:Strongeststorm:Retired names:Notes         
18 Year:TC:TS:H:MH:ACE:Deaths:Damage:Strongeststorm:Retirednames:Notes          
  • The tables have quite different structures.
  • You can reshape them to a common structure and bind them to suit your question of interest (see the sketch below).
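A minimal sketch (with {tidyverse} loaded), assuming, as the names above suggest, that Year, TS, H, MH, and ACE appear in every seasonal table; the columns are coerced to character to avoid potential type clashes before stacking.

tablist[-1] |>     # drop the first table, which was also excluded from the names comparison
  map(\(df) select(df, Year, TS, H, MH, ACE)) |>
  map(\(df) mutate(df, across(everything(), as.character))) |>
  list_rbind() ->
  hurricane_seasons_df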

7.5.11 Example: Taylor Swift Songs

Use the data on the Wikipedia page “List of Taylor Swift Songs” at https://en.wikipedia.org/wiki/List_of_songs_by_Taylor_Swift.

Create a line plot of the number of songs by the first year of release with nice title, caption (the source data), and axis labels.

  • Hint: Look at help for separate_wider_delim().
  • Find the CSS selector for the songs - which are in tables.
  • Scrape all of the songs on the page.
  • Extract as tables in a list.
  • Convert only the “Released Songs” into a tibble.
  • Look at the data for cleaning.
  • Clean and Save
    • Song: remove extra "
    • Year: Separate into two columns that are integer
    • Remove Ref..
  • Look at the data to check cleaning.
  • Group by first year
  • Summarize to get the count of the songs with n().
  • Plot with geom_line().
Show code
swift_html_obj <- read_html("https://en.wikipedia.org/wiki/List_of_songs_by_Taylor_Swift")
xml2::write_html(swift_html_obj, "./data/swift_html_obj.html")

swift_elements <- html_elements(swift_html_obj, ".wikitable")

html_table(swift_elements)[[2]] ->
  swift_songs_df

glimpse(swift_songs_df)
Rows: 281
Columns: 6
$ Song        <chr> "\"The 1\"", "\"1 Step Forward, 3 Steps Back\"", "\"22\"",…
$ `Artist(s)` <chr> "Taylor Swift", "Olivia Rodrigo", "Taylor Swift", "Taylor …
$ `Writer(s)` <chr> "Taylor SwiftAaron Dessner", "Olivia RodrigoTaylor SwiftJa…
$ `Album(s)`  <chr> "Folklore", "Sour", "Red and Red (Taylor's Version)", "Lov…
$ Releaseyear <chr> "2020", "2021", "2012 and 2021", "2019", "2024", "2024", "…
$ Ref.        <chr> "[19]", "[20]", "[21]", "[22]", "[23]", "[23]", "[24]", "[…
Show code
swift_songs_df |> 
  mutate(Song = str_replace_all(Song, '\\"', "")) |> 
  separate_wider_delim(Releaseyear, delim = " and ", 
                       names = c("Year_1st", "Year_2nd"),
                       too_few = "align_start") |> 
  mutate(Year_1st = parse_integer(Year_1st),
         Year_2nd = parse_integer(Year_2nd)) |> 
  select(-Ref.) ->
  swift_songs_df
glimpse(swift_songs_df)
Rows: 281
Columns: 6
$ Song        <chr> "The 1", "1 Step Forward, 3 Steps Back", "22", "Afterglow"…
$ `Artist(s)` <chr> "Taylor Swift", "Olivia Rodrigo", "Taylor Swift", "Taylor …
$ `Writer(s)` <chr> "Taylor SwiftAaron Dessner", "Olivia RodrigoTaylor SwiftJa…
$ `Album(s)`  <chr> "Folklore", "Sour", "Red and Red (Taylor's Version)", "Lov…
$ Year_1st    <int> 2020, 2021, 2012, 2019, 2024, 2024, 2023, 2023, 2012, 2021…
$ Year_2nd    <int> NA, NA, 2021, NA, NA, NA, NA, NA, 2021, NA, 2023, NA, NA, …
Show code
swift_songs_df |> 
  group_by(Year_1st) |> 
  summarize(total_songs = n()) |> 
ggplot(aes(Year_1st, total_songs)) +
  geom_line() +
  labs(title = "Taylor Swift Song Releases by First Year of Release",
       caption = "Source is https://en.wikipedia.org/wiki/List_of_songs_by_Taylor_Swift") +
  xlab("Year of First Release") +
  ylab("Number of Songs")

7.6 Example: Tables of the Oldest Mosques

The Wikipedia page on the oldest mosques in the world has tables of data about the oldest mosques in each country organized by different regions in the world.

The first step is to just look at the web page to get an idea of the data that is available and how it is organized.

We see a number of tables with different structures. However, each has data on the name of the mosque, the location country, when it was first built, and in some tables, the denomination. Many of the tables also have a heading for the region.

We want to scrape this data so we can plot the history of when were mosques first built across different regions.

  • That question suggests we need to scrape the tables to get the data into a data frame suitable for plotting.

Let’s build a strategy for going from web page tables to a single data frame with tidy and clean data so we can plot.

  1. Inspect the web page tables.
  2. Use the browser’s Developer Tools to identify the right CSS selectors.
  • Given the differences in the table structures, we can scrape the regional information in the headers for each table separately from the tables themselves.
  3. Download the web page and save it as an XML object so we only have to download once. Add code to read in the XML object.

  4. Extract the data of interest, both headers and tables. Check we have the correct (same) number of headers and tables so we have a header for each table.

  5. Tidy the data into a data frame.

  • Creating a data frame requires each data frame in the list to have the same variables with the same class across the data frames.
  • Clean the table headings (regions) and assign them as names for the table elements in the list of tables.
  • Check the data frames for the different structures and fix any issues.
  • Select the variables of interest.
  • Convert into a tibble.
  6. Clean the data for consistent analysis.
  • Ensure standard column names.
  • Remove parenthetical comments.
  • Convert the data to be consistent with the type desired for analysis.

We should now be able to create and interpret various plots.

7.6.1 Steps 1 and 2: Inspect Tables to Identify the CSS Selectors.

Go to this URL: https://en.wikipedia.org/wiki/List_of_the_oldest_mosques in your browser.

Look at the tables and inspect their structure/elements.

  • Right-click on the table “Mentioned in the Quran” and select Inspect to open the Developer Tools.
  • The Elements tab should be highlighting the element in the table where you right-clicked as in Figure 7.14.
  • Move your mouse up the elements until you are highlighting the line table class="wikitable sortable jquery-tablesorter".
    • You should see the table highlighted in the web page.
    • The class = "wikitable" tells us we want the CSS selector wikitable based on class. We don’t need the additional attributes.
Snapshot of mosque web page on the left and developer tools on the right showing the highlighted element with class `wikitable`.
Figure 7.14: First table selected.
  • Scroll down the elements to get to the next table so it is highlighted. Check this is also class = wikitable.
    • You can check more tables to confirm they are all class="wikitable"

Let’s look at the regions now. These are the table headings for each region within a continent, e.g., “Northeast Africa” is a region within the continent “Africa”.

Right-click on “Northeast Africa” and inspect it.

  • It has the element name of caption so we can use caption as a CSS selector.
    • Confirm with a few other tables that the region is a caption.

Note that the first table, under the heading “Mentioned in the Quran” does not have a region like the other tables.

  • Right-click on “Mentioned in the Quran” and inspect it.
  • It shows id="Mentioned_in_the_Quran". Since we only want this one instance, we can use the id with a # as a CSS selector, i.e., #Mentioned_in_the_Quran.
Snapshot of mosque web page on the left and developer tools on the right showing the highlighted element with `id="Mentioned_in_the_Quran"` and the name `caption` for the table heading.
Figure 7.15: Choosing the selectors for the table headings.

We now have selectors for each table and for the regions we can use for each table.

7.6.2 Step 3: Load (and Save) the Web Page.

Download the HTML page and save as a file.

  • Then set a code chunk option #| eval: false so you don’t re-download each time you render and can just read in the saved file.
library(tidyverse)
library(rvest)
mosque <- read_html("https://en.wikipedia.org/wiki/List_of_the_oldest_mosques")
xml2::write_html(mosque, "./data/mosque.html")

Read in the saved file.

mosque <- read_html("./data/mosque.html")
class(mosque)
[1] "xml_document" "xml_node"    

We now have the entire web page stored in an object of class xml_document.

We can now extract the elements of interest.

7.6.3 Step 4: Extract the Data of Interest

We want to extract both headers and tables.

mosque_headers <-   html_text2(html_elements(mosque,
  css = "caption , #Mentioned_in_the_Quran"
))

mosque_tables <- html_elements(mosque, css = ".wikitable")
mosque_tables <- html_table(mosque_tables, fill = TRUE)
## fill = TRUE, automatically fills rows with fewer
## than the maximum number of columns with NAs

Check that we have the same number of tables and header values.

length(mosque_headers)
[1] 25
length(mosque_tables)
[1] 25
  • There are 25 headers in a character vector and 25 tables in a list.

7.6.4 Step 5: Tidy the data

Use {purrr} and other functions to convert the list of tables to a tidy data frame.

7.6.4.1 Clean the headers and assign to the list.

While we could wait to clean the headers, cleaning them now will make our data easier to look at during the tidy process.

Look at the mosque_headers for issues.

mosque_headers
 [1] "Mentioned in the Quran"                                                                                                 
 [2] "Northeast Africa"                                                                                                       
 [3] "Northwest Africa"                                                                                                       
 [4] "Southeast Africa (including nearby islands of the Indian Ocean, but barring countries that are also in Southern Africa)"
 [5] "West Africa"                                                                                                            
 [6] "Southern Africa"                                                                                                        
 [7] "South America"                                                                                                          
 [8] "North America (including Central America and island-states of the Caribbean Sea)"                                       
 [9] "Arabian Peninsula (including the island-state of Bahrain)"                                                              
[10] "Greater China"                                                                                                          
[11] "East Asia (excluding Greater China)"                                                                                    
[12] "South Asia"                                                                                                             
[13] "Southeast Asia"                                                                                                         
[14] "Levant (for Cyprus and Greater Syria)"                                                                                  
[15] "Southwest Asia (excluding the Arabian peninsula, Caucasus, and Levant)"                                                 
[16] "Central Asia"                                                                                                           
[17] "Transcaucasia"                                                                                                          
[18] "Iberian Peninsula"                                                                                                      
[19] " Russia"                                                                                                                
[20] "Eastern Europe (excluding the Caucasus, European Russia and Nordic countries)"                                          
[21] "British Isles"                                                                                                          
[22] "Western-Central Europe (excluding the British Isles, Nordic countries, and countries that are also in Eastern Europe)"  
[23] "Nordic countries"                                                                                                       
[24] "Australasia"                                                                                                            
[25] "Melanesia"                                                                                                              
  • Several have parenthetical phrases we don’t need.
  • " Russia" has a leading space for some reason.
  • We want to get rid of the extra text.
  • Use {stringr} and regex.
mosque_headers |> 
str_replace_all(" \\(.+\\)", "") |>  # get rid of parenthetical comments
str_squish() ->
mosque_headers_clean

mosque_headers_clean
 [1] "Mentioned in the Quran" "Northeast Africa"       "Northwest Africa"      
 [4] "Southeast Africa"       "West Africa"            "Southern Africa"       
 [7] "South America"          "North America"          "Arabian Peninsula"     
[10] "Greater China"          "East Asia"              "South Asia"            
[13] "Southeast Asia"         "Levant"                 "Southwest Asia"        
[16] "Central Asia"           "Transcaucasia"          "Iberian Peninsula"     
[19] "Russia"                 "Eastern Europe"         "British Isles"         
[22] "Western-Central Europe" "Nordic countries"       "Australasia"           
[25] "Melanesia"             

The headers look good. We can assign them as names to the table elements in the list.

names(mosque_tables) # an un-named list
NULL
names(mosque_tables) <- mosque_headers_clean
names(mosque_tables) # now a named-list
 [1] "Mentioned in the Quran" "Northeast Africa"       "Northwest Africa"      
 [4] "Southeast Africa"       "West Africa"            "Southern Africa"       
 [7] "South America"          "North America"          "Arabian Peninsula"     
[10] "Greater China"          "East Asia"              "South Asia"            
[13] "Southeast Asia"         "Levant"                 "Southwest Asia"        
[16] "Central Asia"           "Transcaucasia"          "Iberian Peninsula"     
[19] "Russia"                 "Eastern Europe"         "British Isles"         
[22] "Western-Central Europe" "Nordic countries"       "Australasia"           
[25] "Melanesia"             

7.6.4.2 Convert the Data into a Tidy Data Frame.

We want to turn the list into a data frame with the desired columns.

To do that, we have to ensure each data frame has the same columns of interest and that they are of the same class across each data frame.

  • We are looking for Building, Location, Country, and First built.

Let’s get the names for each of the data frames in the mosque_tables list and convert into a data frame for ease of viewing.

  • Add an index column based on the row number.
  • If there were many rows, we could use code to check whether the column names of interest were all %in% the names of each data frame (a sketch of this check appears below).
map(mosque_tables, names) |> # we could stop here and just inspect for a few rows.
 tibble(names = _)  |>
  mutate(index = row_number(),
         names = map_chr(names, \(x) str_flatten_comma(x))) ->
  table_names_df
table_names_df[12:20,]
# A tibble: 9 × 2
  names                                                                index
  <chr>                                                                <int>
1 Building, Image, Location, Country, First built, Denomination, Notes    12
2 Building, Image, Location, Country, First built, Denomination, Notes    13
3 Building, Image, Location, Country, First built, Denomination, Notes    14
4 Building, Image, Location, Country, First built, Denomination, Notes    15
5 Building, Image, Location, Country, First built, Denomination, Notes    16
6 Building, Image, Location, Country, First built, Denomination, Notes    17
7 Building, Image, Location, Country, First built, Denomination, Notes    18
8 Building, Image, Location, First built, Denomination, Notes             19
9 Building, Image, Location, Country, First built, Denomination, Notes    20

Looking at all of the rows visually, we can see that element 19 is missing Country.
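A quick programmatic version of that %in% check against the four columns of interest (a sketch using map_lgl() from {purrr}):

required_cols <- c("Building", "Location", "Country", "First built")
map_lgl(mosque_tables, \(df) all(required_cols %in% names(df)))
# the Russia table (element 19) returns FALSE because it lacks Country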

  • Let’s add a column Country to the 19th table.
names(mosque_tables)[19]
[1] "Russia"
mosque_tables$Russia$Country <- "Russia"

Now that we have the same columns, we want to ensure they have consistent classes across each data frame.

  • When html_table() creates a tibble, it “guesses” at the appropriate class based on the values in the table. So tibbles with different content for a variable may wind up being guessed differently.

The simplest “brute-force” approach is to use a nested map() to convert all variables in all data frames to character and sort out the types during cleaning (sketched below).

  • Class character is the least restrictive on content so easiest to use as the default.
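A minimal sketch of that brute-force conversion (not run here, since we take the alternative route below); across() inside mutate() plays the role of the inner map().

mosque_tables |>
  map(\(df) mutate(df, across(everything(), as.character))) ->
  mosque_tables_chr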

As an alternative, use map to get the class of each column in each data frame and compare them visually to see if there are any issues.

Let’s select the columns of interest and visually inspect if they are the same class.

mosque_tables |>
  map(\(x) select(x, c(Building, Location, Country,`First built`))) ->
mosque_tables
(mosque_tables |> 
  map(\(df) map_chr(df, \(col) class(col))))[12:22]
$`South Asia`
   Building    Location     Country First built 
"character" "character" "character" "character" 

$`Southeast Asia`
   Building    Location     Country First built 
"character" "character" "character" "character" 

$Levant
   Building    Location     Country First built 
"character" "character" "character" "character" 

$`Southwest Asia`
   Building    Location     Country First built 
"character" "character" "character" "character" 

$`Central Asia`
   Building    Location     Country First built 
"character" "character" "character"   "integer" 

$Transcaucasia
   Building    Location     Country First built 
"character" "character" "character" "character" 

$`Iberian Peninsula`
   Building    Location     Country First built 
"character" "character" "character" "character" 

$Russia
   Building    Location     Country First built 
"character" "character" "character" "character" 

$`Eastern Europe`
   Building    Location     Country First built 
"character" "character" "character" "character" 

$`British Isles`
   Building    Location     Country First built 
"character" "character" "character" "character" 

$`Western-Central Europe`
   Building    Location     Country First built 
"character" "character" "character"   "integer" 

If there were thousands of columns we could do the brute force conversion, but that might take a while.

As an alternative let’s use code to compare the class for each variable for each row and identify only those rows and variables where there is an issue.

  • The goal is to compare the actual class for each variable in each data frame to your choice of the desired class for the variable.

Consider creating two data frames, one with the correct class for each variable and one with the actual classes for each variable in each data frame. If you want to bind them together, they must have the same structure.

  • What does that look like?

  • What steps would you take in your strategy to convert from the list to the final data frame with rows for tables with incorrect classes in one or more variables?

  • Create a data frame with a column for each variable and one row with the class you want for each column, call it correct_class.
  • Now create the second data frame with the same shape.
    • Use map() with mosque_tables to get a list where each element is a named vector with the classes of the columns for each data frame in mosque_tables.
    • Use list_rbind to convert to a single data frame and keep the table name as a new column.
    • Use pivot_wider to convert each variable to a column.
    • Save it as m_var_classes.
  • Use bind_rows to bind the correct_class and m_var_classes together so correct_class is now the first row.
    • Note the Region will just be NA which is fine.
  • Now you want to compare all the values in each column with the first value in the column.
    • Use mutate() with across() in columns 1:4 with an anonymous function for the comparison.
    • This converts each row to either TRUE if the match is correct, else FALSE
  • Now we want to find which rows have a FALSE in them in any column.
    • Use rowwise() to enable easy operations across the columns
    • Use mutate() to create a new variable row_trues with the value of c_across() with sum() on the first four columns.
  • Now you can filter on row_trues < 4 to find the specific rows with a variable with the incorrect class.

Try to implement this before unfolding the code solution.

This may not be the slickest way but it does demonstrate the integration of a number of functions from across the tidyverse to get to a solution.

  • This hard codes the number of columns at 4 for clarity but that could be changed to a function call for more general solutions.
  • If any functions are unfamiliar, suggest reviewing the help or the vignettes.
Show code
correct_class <- tibble(Building = "character", Location = "character",
                        Country = "character", `First built` = "character")

mosque_tables |> 
  map(\(df) enframe(map_chr(df, \(col) class(col)))) |> 
 list_rbind(names_to = "Region") |> 
  pivot_wider() ->
m_var_classes

bind_rows(correct_class, m_var_classes) |> 
  mutate(across(1:4,  \(col) col == col[1])) |> 
  rowwise() |> 
  mutate(row_trues = sum(c_across(1:4))) |> 
filter(row_trues < 4)
# A tibble: 2 × 6
# Rowwise: 
  Building Location Country `First built` Region                 row_trues
  <lgl>    <lgl>    <lgl>   <lgl>         <chr>                      <int>
1 TRUE     TRUE     TRUE    FALSE         Central Asia                   3
2 TRUE     TRUE     TRUE    FALSE         Western-Central Europe         3
  • The $Central Asia and $Western-Central Europe data frames have First built as integer rather than character.
    • This mix of classes suggests we will need to look at First built when it comes to cleaning.

Let’s convert the data frames with First built as integer to character.

mosque_tables$`Central Asia`$`First built` <- as.character(mosque_tables$`Central Asia`$`First built`) 
mosque_tables$`Western-Central Europe`$`First built` <- as.character(mosque_tables$`Western-Central Europe`$`First built`) 

Now that we have the same column names and classes for each data frame, we can convert the list to a tibble.

Let’s use the names_to= argument of list_rbind() to keep the element names as a new column, Region, in the data frame.

mosque_tables |> 
  list_rbind(names_to = "Region") ->
  mosque_df
mosque_df |> glimpse()
Rows: 179
Columns: 5
$ Region        <chr> "Mentioned in the Quran", "Mentioned in the Quran", "Men…
$ Building      <chr> "Al-Haram Mosque", "Haram al-Sharif, also known as the A…
$ Location      <chr> "Mecca", "Jerusalem (old city)", "Muzdalifah", "Medina",…
$ Country       <chr> "Saudi Arabia", "contested", "Saudi Arabia", "Saudi Arab…
$ `First built` <chr> "Unknown, considered the oldest mosque, associated with …

Our data of interest is now in a tidy data frame with 179 rows and five columns.

It is time to look at the data to see what cleaning needs to be done.

7.6.5 Step 6 : Clean the Data.

We have a non-standard column name in First built as it has a space in it.

  • For convenience, let’s rename it.
rename(mosque_df, First_built = `First built`) ->
  mosque_df

You can view() the data frame to look for other cleaning issues.

We already know there are issues with First_built since we had two different classes.

  • First_built values are a mix of text, dates as centuries, and dates as years.

  • There are multiple footnotes demarcated by [] which we do not need.

  • Let’s replace century dates with the last year of the century, e.g., “15th” becomes “1500”.

  • We can then extract any set of numbers with at least three digits which means we don’t have to worry about the footnotes.

mosque_df |> 
  mutate(First_built = str_replace_all(First_built, "(?<=\\d)th", "00") ,
         First_built = as.numeric(str_extract(First_built, "\\d\\d\\d+"))
         ) ->
mosque_df

Several of the country names have parenthetical comments. Let’s remove those.

mosque_df |> 
  mutate(Country = str_replace_all(Country, " \\(.*\\)", "")) ->
  mosque_df

We now have a tidy data frame with cleaned data for analysis.

Tip

There are no hard and fast rules for the choices you will make in cleaning other than to get the data in a state that is useful for analysis.

In this example, you could try to find missing dates in other sources but for now we will let them stay as missing.

7.6.6 Plotting the Data.

Let’s plot the building of the mosques over time by region.

  • Reorder the regions based on the earliest date a mosque was first built.
  • Make the title dynamic based on the number of mosques actually used in the plot.
    • Use the magrittr pipe %>% with the ggplot() call embraced with {} so you can refer to the data frame with .
mosque_df |>
  filter(!is.na(Region), !is.na(First_built)) %>% 
  {ggplot(., aes( x = fct_reorder(.f = Region, .x = First_built, .fun = min,
      na.rm = TRUE ),
    y = First_built )) + 
  geom_point(alpha = .5) +
  coord_flip() +
  labs(
    x = "Region", y = "First Built",
    title = glue::glue("First Built Dates for { nrow(.) } Prominent Mosques by Region"),
    subtitle = "Does not include mosques whose built date is unknown.",
    caption = "Source- Wikipedia"
  )}

Create a bar plot of the number of oldest mosques by Country for those countries with more than one oldest mosque.

  • Group by country
  • Mutate to add the count of mosques to the data frame
  • Ungroup
  • Filter to countries with more than one mosque
  • Plot
    • Reorder the countries by count
    • Rotate the X axis titles by 90 degrees.
    • Add Title, Caption, and X axis label.
Show code
mosque_df |> 
  group_by(Country) |> 
  mutate(count = n()) |> 
  ungroup() |> 
  filter(count > 1) |> 
ggplot(aes(x = fct_reorder(Country, .x = count) )) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  xlab("Country") +
  labs(title = "Countries with Two or More of the Oldest Mosques",
       caption = "source = Wikipedia https://en.wikipedia.org/wiki/List_of_the_oldest_mosques") 

7.7 Scraping Multiple Web Pages by Combining {purrr} and {rvest}

One can combine functions from the {rvest} package and the {purrr} package to scrape multiple pages automatically and get the results into a single tibble.

The goal is to create a tibble with variables page ID, URL, and CSS selectors and then add columns with the scraped elements of interest and the scraped text as a column of tibbles.

This way, all the data is in one object which documents how the data was collected in addition to the data and it can be stored for later analyses.

7.7.1 Example Sites

For this example we will scrape pages from vikaspedia and India Today.

Vikaspedia is a multi-language information portal sponsored by the government of India.

  • We are interested in scraping the data about e-government initiatives in each state listed on https://vikaspedia.in/e-governance/states.
  • This page lists each state with a hyperlink to a page with a list of the e-government initiatives in the state and some short information about the initiatives.
    • Note. There are 31 states but only 30 fit on a page so there are actually two pages of interest.
    • If we manually select page 2, we see the URL is https://vikaspedia.in/e-governance/states?pid=3848&pageno=2&size=30 which has the 31st state, West Bengal.
    • However, this is a URL that loads the first page and then triggers a javascript to load the second page.
    • Thus even though there is a different URL for the second page, we cannot scrape it directly and have to treat it as a dynamic page.
    • {rvest} does not yet handle javascript so that requires using {rselenium} functions to interact with dynamic pages.
    • Thus we are limited to getting only the first 30 states with {rvest}.
  • When you go to a state page, the data of interest is in the middle column of the page, but we don’t want any of the comments and extra material as you can see at https://vikaspedia.in/e-governance/states/andhra-pradesh.

India Today is a media conglomerate whose elements include news, magazines and other media.

Although the page layouts for the two sites have similarities in that we want the data in the middle of the page but not all the comments and extra material, the pages come from different sites so the DOM and CSS is quite different.

7.7.2 Scraping Strategy

Instead of scraping each page manually, we want to develop code to programmatically build a list of the pages to scrape with their URLs and the CSS needed for each page.

We will create a tibble with three columns (ID, URL, CSS).

  • Each row has the data for an ID for the page to be scraped, the URL for the page, and the CSS selectors for the elements of interest for that URL.

Then we can use map2() to process the URL and CSS columns in parallel and get the results as a list.

Then we can tidy and clean the list as appropriate for our analysis.

This strategy allows us to adjust the tibble URL and CSS contents as appropriate as the analysis needs change or the web pages change over time.

Tip

This use case is an example of a strategy template, if you will, where you have a set of data you want to use as arguments to a function.

  • In this example, we have a set of many URLs and CSS selectors we want to use as arguments to functions.

For these kinds of use cases, we recommend using a spreadsheet file to manage the arguments data (here, the URL and CSS data) instead of embedding it in your code.

Since it is data, not code, it is easier to collaborate on and configuration-manage the data in a shared spreadsheet file than in R code (especially with people not on GitHub).

  • Use {readxl} or {googlesheets4} package to read in the data, or,
  • Export a .csv file to a data_raw directory and use readr::read_csv() and a relative path to load the data into a tibble.
Note

A similar use case is where humans generate free-text data for fields, such as business names, for which the analysis needs a “canonical” or standard name so all data aligns to the correct name, but the humans are overly creative.

  • Say the canonical name is “First Union Bank”.
    • Humans might use “1st Union Bank” or “First Union Bk”.

Instead of trying to use Regex to manage all the variants, create a spreadsheet with the standard names in one column and all the erroneous variants you have had to fix in the second.

  • Read in the spreadsheet and use it to clean the data using a join.
  • You can also have the code identify any “new” errors which you can then add to the spreadsheet once you figure out what the correct standard is.

While you might think you can impose a standard on humans, unless you provide a drop-down form with the canonical names they will be entering free text, so it is more robust to manage the variants on the back end as data, outside of your code. A sketch of this approach follows.
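A minimal sketch of that back-end fix, using the variants from the example above and hypothetical object and column names (orders, business_name):

library(tidyverse)

# Lookup table maintained in a spreadsheet: one row per known erroneous variant
name_fixes <- tribble(
  ~variant,         ~canonical,
  "1st Union Bank", "First Union Bank",
  "First Union Bk", "First Union Bank"
)

orders |>                                       # hypothetical data frame of raw entries
  left_join(name_fixes, by = c("business_name" = "variant")) |>
  mutate(business_name = coalesce(canonical, business_name)) |>
  select(-canonical) ->
  orders_clean

# Entries that still do not match a canonical value may be new variants to review
orders_clean |>
  filter(!business_name %in% name_fixes$canonical)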

Since we do not have a tibble already with URLs and CSS, we will manually create the tibble.

7.7.3 Phase 1: Getting URLs for the State Pages

Let’s load {tidyverse} and {rvest}.

library(tidyverse)
library(rvest)

Let’s start by building a tibble of the ID of state names, URLs, and CSS that we will need to access the pages for each state.

If we go to the default https://vikaspedia.in/e-governance/states, we see the 30 states.

  • We can use this URL to get these 30.

Let’s go to the first page and look for the CSS selector we need to scrape the state names and URLs.

  • Using SelectorGadget we can see the selector is a class selector ".folderfile_name".
  • If we use the developer tools to inspect a state name, we can see the state names are hyperlinks (a) with class="folderfile_name", so the CSS selector would be the same as SelectorGadget identified, ".folderfile_name".

However, with the developer tools we can also see that the <a> has an href attribute whose value is a relative link for the state page.

  • We know it is a relative link as it starts with a / and not the https:// of an absolute link.
  • Thus, we have to convert it to an absolute URL in our tibble so we can use it later.
  • Hovering on the href attribute shows that the relative link is appended to the base link for the site, https://vikaspedia.in, which provides the initial part of the full URL.

Now we know we can get the state name as the text value and the URL as the attribute of each element.
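A quick interactive check of that idea might look like the sketch below; html_attr() pulls a named attribute directly and is an alternative to the html_attrs() approach used in the function further down.

library(rvest)

a_nodes <- read_html("https://vikaspedia.in/e-governance/states") |>
  html_elements(".folderfile_name")

html_text2(a_nodes) |> head()          # state names
html_attr(a_nodes, "href") |> head()   # relative links such as "/e-governance/states/andhra-pradesh"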

The next few steps are more than we need but they show how you could handle multiple URLs on a site.

  1. Create a tibble of the URL and CSS selectors for the page that list the 30 states.
tibble(urls = c(
  "https://vikaspedia.in/e-governance/states"),
  css = c(".folderfile_name")) ->
states_page
  1. Create a function to extract individual state names and URLs given a page URL and CSS.
get_states <- function(url, css) {
  # Scrape the elements for the state links on the page
  my_elements <- html_elements(read_html(url), css = css)
  # The element text is the state name
  html_text2(my_elements) ->
    state
  # Take the second attribute of each element, which holds the relative href link here
  html_attrs(my_elements) |>
    map(\(x) x[[2]]) |>
    unlist() ->
    temp_url
  # Prepend the site's base URL to convert the relative links to absolute URLs
  tibble(state, url = str_c("https://vikaspedia.in", temp_url, sep = ""))
}
  1. Use map2_df() to collect the data and return as a data frame.
map2_df(states_page$urls, states_page$css, get_states, .progress = TRUE ) ->
  states_df

We now have a data frame of all the state names and URLs.

Let’s select the URLs for n states of interest and add the India Today URL to a new tibble.

n <- 5
tibble(
  ID = c(states_df$state[1:n],  "India_Today"),
  URL = c(states_df$url[1:n], 
  "https://www.indiatoday.in/education-today/gk-current-affairs/story/list-of-indian-states-union-territories-with-capitals-1346483-2018-09-22"
)) ->
  pages_df

7.7.4 Phase 2: Getting the CSS for elements on the Pages

The CSS selectors will need to be done individually for the different structure pages.

  • For complicated pages, you may want to explore options described in the CSS Selectors Reference to fine-tune the selection and reduce the potential time spent tidying and cleaning the data.

For the vikaspedia pages, go to a page for a state, e.g., https://vikaspedia.in/e-governance/states/andhra-pradesh.

  • SelectorGadget is challenged by the site structure. You can identify #texttospeak div, h3, but it does not work well as it extracts the transport department twice and skips the available services under the other h3 headers. You can also try .col but that selects all the comments at the end.
  • Inspecting with the developer tools, you can find a div under texttospeak with id="MiddleColumn_internal" which appears to get all the data of interest, i.e., without the side columns, headers, and the comments at the bottom. However, it is one big block of text that will require substantial effort in tidying and cleaning.
  • Inspect shows the service areas are h3 and the lists of services are ul. Instead of selecting all of them on the page, explore using the parent-child (descendant) selector combinator (white space) to select only those that are children of the middle column.
my_url <- "https://vikaspedia.in/e-governance/states/andhra-pradesh"
html_elements(read_html(my_url), "#MiddleColumn_internal") |>
  html_text2()

html_elements(read_html(my_url), "#MiddleColumn_internal h3, 
              #MiddleColumn_internal p, #MiddleColumn_internal ul") |>
  html_text2()

For the India Today page:

  • A few tries with SelectorGadget get to the selector .description :nth-child(1). This yields one big blob of text that will take a lot of cleaning.
  • Using Inspect shows h3 for the state names and li for the list items. Using them alone captures extra list items.
  • Add in the parent div with class description to reduce the extra scraped items.
my_url <-  "https://www.indiatoday.in/education-today/gk-current-affairs/story/list-of-indian-states-union-territories-with-capitals-1346483-2018-09-22"
html_elements(read_html(my_url), ".description :nth-child(1)") |> 
  html_text2() 

html_elements(read_html(my_url), "h3, li") |> 
  html_text2() 

html_elements(read_html(my_url), ".description h3, .description li") |> 
  html_text2() 

Let’s add the best selectors we have found to the tibble for each page.

vika_css <- "#MiddleColumn_internal h3, #MiddleColumn_internal p, #MiddleColumn_internal ul"
india_css <- ".description h3, .description li"
pages_df$css <- c(rep(vika_css, nrow(pages_df) - 1), india_css)
Tip

These notes and examples just touch the surface of creating better CSS selectors.

Use the developer tools to inspect the page and then explore various options for selectors to scrape the text you want and only the text you want.

7.7.5 Phase 3: Get the Data

Use the tibble along with {purrr} and {rvest} functions to capture the web pages and then extract the CSS elements of interest as text.

  • These procedures could be modified to select tables as well by adding another column of CSS selectors to the tibble to use for the URLs with table data of interest.

The code uses the URL for each page and the CSS selectors for each URL to extract the elements of interest.

It then extracts the text from each element and converts it to a tibble.

All of the data is now in one data frame.

pages_df |>
  mutate(elements = map2(URL, css, \(.x, .y) html_elements(read_html(.x), .y),
         .progress = TRUE) ) |>
  mutate(text = map(elements, \(element) tibble(text = html_text2(element)))) ->
pages_df2
head(pages_df2)
# A tibble: 6 × 5
  ID                URL                                css   elements   text    
  <chr>             <chr>                              <chr> <list>     <list>  
1 Andhra Pradesh    https://vikaspedia.in/e-governanc… #Mid… <xml_ndst> <tibble>
2 Arunachal Pradesh https://vikaspedia.in/e-governanc… #Mid… <xml_ndst> <tibble>
3 Assam             https://vikaspedia.in/e-governanc… #Mid… <xml_ndst> <tibble>
4 Bihar             https://vikaspedia.in/e-governanc… #Mid… <xml_ndst> <tibble>
5 Chandigarh        https://vikaspedia.in/e-governanc… #Mid… <xml_ndst> <tibble>
6 India_Today       https://www.indiatoday.in/educati… .des… <xml_ndst> <tibble>

7.7.6 Phase 4: Review the Data

The text column contains all of the desired output in tibbles of one column.

  • Depending upon the structure of the web page and the CSS, it may be in one row or multiple rows.
  • RStudio sets default limits on how much text to display so you may see “truncated” when trying to view. Consider using writeLines() to see the text.
map(pages_df2$text, head)
[[1]]
# A tibble: 6 × 1
  text                                                                          
  <chr>                                                                         
1 "Online Citizen Friendly Services of Transport department (CFST)"             
2 "Available services"                                                          
3 "Issue of learner licenses\nIssue of driving licenses\nRenewal of driving lic…
4 "Visit https://aptransport.org/ for more information about Online Transport S…
5 "MeeSeva"                                                                     
6 "Available Services"                                                          

[[2]]
# A tibble: 6 × 1
  text                                                                          
  <chr>                                                                         
1 Apply for Inner Line Permit                                                   
2 Only Indian Nationals can Apply for eILP through this portal                  
3 Visit https://arunachalilp.com/index.jspfor more information.                 
4 Arunachal eServices                                                           
5 Various services related to district administration and district police are a…
6 Visit https://eservice.arunachal.gov.in/ for more information                 

[[3]]
# A tibble: 6 × 1
  text                                                                          
  <chr>                                                                         
1 "Electoral Rolls"                                                             
2 "Search your name in online available electoral rolls\nElectoral rolls are av…
3 "Visit http://ceoassam.nic.in/ for more information about Electoral Rolls"    
4 "Public Utility Forms"                                                        
5 "Department wise application forms for various citizen services\nApplication …
6 "BPL List"                                                                    

[[4]]
# A tibble: 6 × 1
  text                                                                          
  <chr>                                                                         
1 "Online Grievance Registration"                                               
2 "Online submission of grievance application for redressal\n\r"                
3 "Visit http://www.cspgc.bih.nic.in/ for more information about Online Grievan…
4 "Jankari"                                                                     
5 "Register your RTI complaint and get RTI related information to call on 15531…
6 "Visit http://www.biharonline.gov.in/rti/Index.aspx?AspxAutoDetectCookieSuppo…

[[5]]
# A tibble: 6 × 1
  text                                                                          
  <chr>                                                                         
1 "e-Jan Sampark"                                                               
2 "e-Jan Sampark project will enable residents to access information and avail …
3 "e-Jan Sampark project targets that the benefits of ICT should reach the mass…
4 "The type of services provided includes:"                                     
5 "All Procedures and Forms for all departments, which are frequently used by a…
6 "Visit http://www.chandigarh.gov.in/egov_jsmpk.htm for more information about…

[[6]]
# A tibble: 6 × 1
  text                                                                          
  <chr>                                                                         
1 Let's take a look at the states and their capitals:                           
2 1. Andhra Pradesh - Amravati                                                  
3 Hyderabad was the former capital of Andhra Pradesh but it got replaced by Amr…
4 The Buddhist Stupa at Amravati is the biggest stupa in the country with a dia…
5 Recognising Amravati's ancient Buddhist roots, the developers also plan to a …
6 According to the timeline of this mega project, Andhra's capital will be popu…
## print(pages_df2$text[[1]]$text)
## writeLines(pages_df2$text[[1]]$text)

Note: the text data is not tidy or clean.

  • As seen using print(), it can contain embedded newline and carriage return characters, although these can be helpful in cleaning at times.
  • Given the structure of the elements, it may also contain text not of interest.

Since each URL may be different but also share common issues, use custom functions with {tidyr} and {stringr} functions to reshape the data and clean it.
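For example, a small helper along the lines below could handle the common issues before page-specific cleaning. This is a sketch only, assuming the pages_df2 object from above; the splitting and filtering rules are illustrative.

library(tidyverse)

# illustrative tidying helper for the scraped text tibbles in pages_df2
clean_text <- function(df) {
  df |>
    separate_longer_delim(text, delim = "\n") |> # split embedded newlines into rows
    mutate(text = str_squish(text)) |>           # drop stray \r and repeated whitespace
    filter(text != "")                           # remove empty rows
}

pages_df2 |>
  mutate(text = map(text, clean_text)) ->
  pages_df3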

7.8 Scraping PDF with {PDE} and {tabulapdf}

7.8.1 tabulapdf

The {tabulapdf} package is designed for extracting tables from PDF files.

  • Install the package from GitHub instead of CRAN using install.packages("tabulapdf", repos = c("https://ropensci.r-universe.dev", "https://cloud.r-project.org")).
  • It may be tricky to install the Java tools on Windows machines. Read the README.
  • On a Mac, you need to ensure you have the latest version of Java.

Load the library and provide a relative path and file name of a PDF file.

Extract text; the output value is a character vector.

  • Here is an example of the House of Representatives Foreign Travel Report Feb 06 2023
library("tabulapdf")

## 
f <- list.files("./pde_pdf_docs/", pattern = "*.pdf", full.names = TRUE)
f
[1] "./pde_pdf_docs//congressional_record_travel_expenditures.pdf"
[2] "./pde_pdf_docs//house_commmittes_23_06_01.pdf"               
[3] "./pde_pdf_docs//List of the oldest mosques - Wikipedia.pdf"  
[4] "./pde_pdf_docs//pde_methotrexate_29973177_!.pdf"             
out1 <- extract_text(f[1])
stringr::str_sub(out1, end = 150)
[1] "CONGRESSIONAL RECORD — HOUSE H705 February 6, 2023 \nTomorrow, during the State of the \nUnion, I will be joined by Ms. Pamela \nWalker, a police account"

Extract tables from the PDF; the value is a list.

  • Here the elements of the list are data frames instead of the default matrices.
  • The data frames are broken up by page instead of by table.
  • The area= argument allows a bit more control (see the sketch after the output below).
out2 <- extract_tables(f[1], output = "tibble", method = "decide")
Error : 'DSKJLST7X2PROD with CONG-REC-ONLINE' does not exist in current working directory ('/Users/rressler/Library/CloudStorage/OneDrive-american.edu/Courses/DATA-413-613/lectures_book').
out2[[2]] |> tail(15)
# A tibble: 6 × 10
  ...1      ...2  Date  ...4  `Per diem 1` Transportation `Other purposes` ...8 
  <chr>     <chr> <chr> <chr> <chr>        <chr>          <chr>            <chr>
1 <NA>      <NA>  <NA>  <NA>  U.S. dollar  U.S. dollar    U.S. dollar      <NA> 
2 Name of … <NA>  <NA>  Coun… <NA>         <NA>           <NA>             <NA> 
3 <NA>      <NA>  <NA>  <NA>  Foreign equ… Foreign equiv… Foreign equival… Fore…
4 <NA>      Arri… Depa… <NA>  <NA>         <NA>           <NA>             <NA> 
5 <NA>      <NA>  <NA>  <NA>  currency or… currency or U… currency or U.S. curr…
6 <NA>      <NA>  <NA>  <NA>  currency 2   currency 2     currency 2       <NA> 
# ℹ 2 more variables: Total <lgl>, ...10 <chr>
out2[[3]] |> head(5)
# A tibble: 5 × 14
  Hon. Steven Palazzo .......…¹ ...2  `10/1` `10/3` Spain ..............…² ...6 
  <chr>                         <lgl> <chr>  <chr>  <chr>                  <lgl>
1 <NA>                          NA    10/3   10/5   Greece ..............… NA   
2 Hon. Betty McCollum ........… NA    10/10  10/12  South Korea .........… NA   
3 <NA>                          NA    10/12  10/15  Japan ...............… NA   
4 Hon. Debbie Wasserman Schult… NA    10/5   10/7   Palau ...............… NA   
5 <NA>                          NA    10/7   10/8   Australia ...........… NA   
# ℹ abbreviated names:
#   ¹​`Hon. Steven Palazzo ................................................`,
#   ²​`Spain ....................................................`
# ℹ 8 more variables: .......................7 <chr>, `442.98` <dbl>,
#   .......................9 <chr>, `3,282.07` <chr>,
#   .......................11 <chr>, .......................12 <chr>,
#   .......................13 <chr>, `3,725.00` <dbl>
out2[[3]] |> tail(15)
# A tibble: 15 × 14
   Hon. Steven Palazzo ......…¹ ...2  `10/1` `10/3` Spain ..............…² ...6 
   <chr>                        <lgl> <chr>  <chr>  <chr>                  <lgl>
 1 Matthew Bower .............… NA    10/22  10/24  Kuwait ..............… NA   
 2 <NA>                         NA    10/24  10/27  Iraq ................… NA   
 3 Ariana Sarar ..............… NA    10 /22 10/24  Kuwait ..............… NA   
 4 <NA>                         NA    10/24  10 /27 Iraq ................… NA   
 5 <NA>                         NA    10/27  10/29  Bahrain .............… NA   
 6 Nicholas Vance ............… NA    10/22  10/24  Kuwait ..............… NA   
 7 <NA>                         NA    10 /24 10/27  Iraq ................… NA   
 8 <NA>                         NA    10/27  10/29  Bahrain .............… NA   
 9 Shannon Richter ...........… NA    10/23  10 /26 Germany .............… NA   
10 <NA>                         NA    10/26  10 /28 Italy ...............… NA   
11 <NA>                         NA    10 /28 10/31  Tunisia .............… NA   
12 Laurie Mignone ............… NA    10/24  10/30  Ghana ...............… NA   
13 Stephen Steigleder ........… NA    10 /24 10/30  Ghana ...............… NA   
14 Kristin Clarkson ..........… NA    10/24  10/26  South Korea .........… NA   
15 <NA>                         NA    10/26  10/28  Thailand ............… NA   
# ℹ abbreviated names:
#   ¹​`Hon. Steven Palazzo ................................................`,
#   ²​`Spain ....................................................`
# ℹ 8 more variables: .......................7 <chr>, `442.98` <dbl>,
#   .......................9 <chr>, `3,282.07` <chr>,
#   .......................11 <chr>, .......................12 <chr>,
#   .......................13 <chr>, `3,725.00` <dbl>
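When the automatic detection struggles, the area= argument lets you point extract_tables() at a specific region of a page. The sketch below is hedged: the page number and the coordinates, given as c(top, left, bottom, right) in points, are placeholder values rather than ones measured for this report, and you would normally find them by trial and error or with the package's interactive locate_areas() helper.

# placeholder coordinates: pull one table from a fixed region of page 1
out3 <- extract_tables(
  f[1],
  pages  = 1,
  area   = list(c(150, 30, 700, 580)), # c(top, left, bottom, right) in points
  guess  = FALSE,                      # turn off automatic table detection
  output = "tibble"
)
out3[[1]]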

You can extract text from columns that look like tables but are not captured that way.

out4 <- extract_text(f[2])

out4 |> stringr::str_length()
[1] 42290

Both text and table output can require a lot of cleaning, especially if tables have multiple rows in a cell and/or a table covers more than one page.

However, it may be less effort than copying and pasting into Word or Excel.

7.8.2 PDE

PDE is an interactive GUI for extracting text and tables from PDF.

Follow the Vignette closely.

There are two components.

  • The Analyzer allows you to choose the files to analyze and adjust the parameters for the analysis. It runs the analysis and produces output.
  • The Reader allows you to review and annotate the output.

7.8.2.1 Setup Package and Software

  • The Reader requires you to install.packages("tcltk2").
  • The package requires the Xpdf command line tools by Glyph & Cog, LLC.
    • Install manually by downloading the tar file and moving the executable and man files to the three proper locations as described in the install.txt file. This may require creating new directories.
  • On Macs, you also have to individually allow each file to be opened using CTRL-Click.
  • install.packages("PDE", dependencies = TRUE)

7.8.2.2 Organize Files

Put the PDF files to be analyzed into a single directory.

If you want output to go to a different directory, create that as well.
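For example (the folder names here are hypothetical), the setup can be as simple as:

# hypothetical folder layout for a PDE run
dir.create("./pde_input", showWarnings = FALSE)  # PDFs to analyze
dir.create("./pde_output", showWarnings = FALSE) # where PDE should write results
list.files("./pde_input", pattern = "\\.pdf$")   # confirm the PDFs are in place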

Review the files to identify attributes for the tables of interest such as key words, names, and structure.

7.8.2.3 Analyzing PDFs

7.8.2.3.1 Open the Analyzer:
library("PDE")
PDE_analyzer_i()
  1. Fill in the directory names for input and output folders.

  2. Go through the tabs to identify specific attributes and parameters for the analysis (use the vignette as a guide).

  3. PDE will analyze all the files in the directory.

  • You can filter out files for the analysis by requiring Filter words to be present a certain number of times in the text.
  • You can also use keywords to select only specific parts of files to be extracted.
  4. Save the configuration as a TSV file for reuse.

  5. Click Start Analysis.

PDE will go through each file and attempt to extract text and/or tables based on the selected parameters across the tabs.

  • It will create different folders for the analysis output as in Figure 7.16.
Figure 7.16: PDE Output in same folder as input.

Here you see both the input and the output in the same directory.

  • There are different directories for text and tables.

You can manually review the data by navigating through the directories.

7.8.2.3.2 Open the Reader

The Reader GUI allows you to review the output and annotate it by moving through the files.

  • It also allows you to open up a PDF folder and TSV file to reanalyze a document.
PDE_reader_i()

The vignette has examples of the types of actions you can take.

Note

Both {tabulapdf} and {PDE} are general packages for working with PDFs so there is substantial cleaning to do.

They both appear to work best for small tables that are clearly labeled as “Table”.

They do offer an alternative if copy and paste into a document or spreadsheet is not working well.

If you can access the original document on the web, it might be easier to scrape it using custom code than to convert it to PDF and scrape the PDF with a general-purpose package.

7.9 Scraping Dynamic Web Pages with the {RSelenium} Package

7.9.1 Introduction

This is a case study on using the {RSelenium} package for scraping the Indian E-Courts website, an interactive and dynamic web site.

This case study uses multiple techniques for using Cascading Style Sheet selectors to identify different kinds of elements on a web page and scrape their content or their attributes to get data.

Warning

The functions in what follows are not documented nor do they include error checking for general use. They are designed to demonstrate how one can scrape data and tidy/clean it for analysis for this site.

The web site in this case study is constantly evolving to new page designs and uses different approaches and standards for similar designs.

The functions here worked on the test cases as of early June 2023. However, if the test web pages are changed or one is scraping additional district courts for different acts, it could lead to errors if they are using a different design or page structure.

The user should be able to compare the code for these test sites (old and new designs) and use their browser's Developer Tools to inspect the attributes of the elements on the erroring page to develop updated solutions.

The E-courts website allows users to search for and gather information on local and district civil court cases for most Indian states.

  • There are many thousands of possible cases, so the site is interactive: it requires users to progressively select the specific attributes for the cases of interest, which limits how much data is provided so users get the information they want.
  • The site is dynamic in that as a user makes selections, e.g., the state, the court, the applicable law(s), the web page will update to show the next set of selections and eventually the data based on the selected attributes.
  • The website also incorporates a requirement to solve a Captcha puzzle to get to the actual cases.
    • Including a Captcha is designed to force human interactivity and keep bots from being able to scrape everything automatically.

The case study focuses on getting data on court cases based on the states, judicial district, court, and legal act (law) for the cases. The E-courts site allows for search by other attributes such as case status.

The site has a separate web page for each district of which there are over 700.

  • There are two different designs for district sites.
    • The older design redirects to a services site for searching for cases using a dynamic page.
    • The newer design embeds the dynamic search capability from the service site into the district page.
    • The sites have different structures but similar attributes for the data of interest.

This diverse structure of many pages and dynamic pages means the strategy for scraping data must handle navigating to the various pages and accommodating the different designs.

The strategy uses {RSelenium} to provide interactivity and {rvest} to access and scrape data. The strategy has three major parts:

  1. Get Static Data: Use {rvest} and custom functions to access and scrape static pages about states and districts which create the URLs for the state-district dynamic pages.
  2. Get Dynamic Data - Pre-Captcha: Use {RSelenium}, {rvest} and custom functions to interact with the website to access and scrape data up to the requirement to solve the Captcha.
  3. Get Dynamic Data - Post-Captcha: Use {RSelenium}, {rvest} and custom functions to interact with the website once the Captcha is solved to access and scrape data.

7.9.1.1 The {RSelenium} Package

The {RSelenium} Package Harrison (2023a) helps R users work with the Selenium 2.0 Remote WebDriver.

The Selenium WebDriver is part of the Selenium suite of open-source tools for “automating all major web browsers.”

  • These tools are designed to help developers test all of the functionality of their website using code instead of humans.
  • The automating part is about using code to mimic the actions a human could take when using a specific web browser to interact with a webpage.
    • Actions include scrolling, hovering, clicking, selecting, filling in forms, etc.

Since developers want to test their sites on all the major browsers, there is a three-layer structure that provides flexibility for different browsers but consistency in top-level coding for the developer.

  • At the top layer, a developer writes R code to use {RSelenium} functions to represent human interactions with a web page using a browser.
  • In the middle layer, the {RSelenium} functions use the Selenium WebDriver standards to interact with “driver” software for the browser.
  • At the lower layer, the browser “driver” interacts with the specific brand and version of browser for which it was designed.
  • As the webpage (javascript) responds to the requested actions, it communicates updates back up the layers to the R code.
  • RSelenium allows the developer to observe the dynamic interactions on the web site through the browser.

Each brand and version of a browser, e.g., Chrome, Firefox, etc., has its own “driver” software, e.g., Chrome Driver for Chrome and Gecko Driver for Firefox.

Important

The Selenium construct is for testing if the website code operates in accordance with its specifications.

It does not test if humans will actually like or be able to use the web site.

Testing the user experience (UX) is a whole other specialty in web site design. Institute (2023)

7.9.1.2 Setup for Using the {RSelenium} Package

  • Setting up to use {RSelenium} can be a bit complex as you have to have matching versions of software for each of the three layers (R, Driver, Browser) and the software has to be in the right location, which is different for Windows and Mac.
  • Update all the software:
    • Update to the latest versions of RStudio and the {RSelenium} package.
    • Update your Java SDK to the latest version. See Java.
    • Update your browser to the latest version. Recommend Firefox or Chrome.
  • Install the correct driver for the browser you want to use.
    • If using Firefox, install the gecko web driver for your version. Go to https://github.com/mozilla/geckodriver/releases/.
      • For a Mac with homebrew installed, you can use brew install geckodriver.
    • If using Chrome, you need the version number to get the correct web driver. There are a lot of Stack Overflow issues about troubleshooting Chrome, so Firefox is recommended.
    • Test with the Firefox browser using the code below with netstat::free_port() to find an unused port.
library(RSelenium)
# start the Selenium server and open the browser
rdriver <- rsDriver(port = netstat::free_port(),
                    browser = "firefox")

# create a "client-user" object to connect to the browser
obj <- rdriver$client
  
# navigate to the url in the browser
obj$navigate("https://wikipedia.org/")
   
# close the browser
obj$close()
# remove the test objects from the environment
rm(obj, rdriver)

7.9.2 Scraping Strategy Overview

Accessing the court cases requires knowing the state, the judicial district, the court, the act(s) (laws), and case status of interest.

The scraping strategy was originally designed to collect all of the courts for each state and judicial district and save them for later collection of the acts for each court.

  • This design assumed the data would be stable in terms of the court names and option values. However, experience has shown the assumption is incorrect as they change rather frequently.
    • The functions below have not incorporated error checking yet, so if a court option value changes it causes an error and the process stops.
  • Thus the strategy was revised to identify only the courts for a given state and judicial district and then get the acts for those courts. This reduces the opportunities for errors.
  • The process below now saves the data for a state, district, and court along with the acts to a .rds file with the codes for the state and district. There are hundreds of these files.
  • The process also incorporates a random sleep period between 1-2 seconds between each state/district to minimize the load on the web site.
  • Scraping the acts for the almost 10,000 states, districts, and courts can take multiple hours.

Scraping the Data takes place in three steps.

7.9.2.1 Get Static Data

The base e-courts site is interactive but static. Users interact to select states and districts.

The web pages have the data needed to create URLs for accessing the state, district, and district-services pages. Custom functions are used to scrape the data and create the URLs. The data is saved to an .rds file.

7.9.2.2 Get Dynamic Data - Pre-Captcha

Each Judicial District has links to the “Services Website” search capability which is interactive and dynamic.

  • This is where the user has to make selections on the specific courts, acts, and status of cases of interest.
  • Web scraping gathers all of the options for the various kinds of courts and the many laws.
  • Since the website is dynamic, the code uses {RSelenium} to navigate to each state-district URL.
  • The two types of court and specific courts of each type are scraped for each state and district combination.
  • Each court has its own set of acts that apply to its cases.
  • The acts data for each is scraped.
  • Each state-district combination has its own data set of court types, specific courts, and acts. This data set is saved as an individual .rds file (several hundred of them).

This step ends with the saving of the data given the need to solve a Captcha puzzle to make progress.

While there are options for using code to attempt to break Captcha puzzles, these notes do not employ those methods.

7.9.2.3 Get Dynamic Data - Post-Captcha

Getting to the cases requires solving a Captcha puzzle for every combination of state-district-specific court, and acts.

That is a lot of puzzles to be done one at a time.

Thus one should develop a strategy for choosing which combinations are of interest.

  • That can involve filtering by state-district, court-type, court, and/or acts.

For each combination of interest, the code navigates to the appropriate site and enters the data up to the acts.

The user then has to choose the type of cases, pending or disposed, and then the Captcha puzzle has to be solved.

Once the Captcha puzzle is solved, the site opens a new page which provides access to the case data.

  • The case data can be scraped and saved for further analysis.

7.9.2.4 Manipulating the Data

Once the data has been scraped, there is code to help manipulate the data for additional analysis or scraping.

  • One can read in all the files into a single list.
  • Then one can reshape that list into a data frame with a row for each state, district and court and a list column of the acts.
  • Then one can identify a vector of key words for filtering the acts.
  • Then one can filter the acts for each court to reduce the number of acts.
  • Finally one can unnest the acts.

7.9.3 Step 1. Get Static Data

7.9.3.1 Getting Data for States and Judicial Districts

We start with the top level web site page, https://districts.ecourts.gov.in/ which has an interactive map showing each state.

  • Each state name is a hyperlink with a URL to open a page with an interactive map for the state showing the judicial districts in the state.
    • Note: The judicial districts do not align exactly to the state administrative districts. In several cases, multiple administrative districts are combined into a single judicial district.
    • The URLs for the state web pages use abbreviated state names.
  • The district names are also hyperlinks with a URL to open a page with administrative information about the judicial district.
  • The old design for a district page has a panel on the right for Services.
    • The Case Status/Act is a hyperlink with a URL to open a new page on the “services” website (base URL https://services.ecourts.gov.in) for the selected state and district.
    • This is the starting web page for interacting with the site to select courts, acts (or laws) and eventually cases.
  • The new design for a district page has an icon for Case Status which contains a link for Acts which opens up an embedded search capability from the services site.

7.9.3.2 Load Libraries

library(tidyverse)
library(rvest)
library(RSelenium)

7.9.3.3 Get States and Judicial Districts

This step only has to be done once until there are changes in the State and Judicial district alignments or names.

  • The final data set is saved as ./data/states_districts_df.rds.
7.9.3.3.1 Build Helper Functions

This step builds and tests several helper functions to get the text or the attributes from an element.

  • Set the root level url.
url_ecourts <- "https://districts.ecourts.gov.in"
  • Define a function for scraping text from a URL.
scrape_text <- function(url, css) {
  read_html(url) |>
  html_elements(css = css) |>
  html_text2() |>
  as_tibble()
}
  • Define a function for getting the HTML attributes of a selected object from a URL.
scrape_attrs <- function(url, css) {
  read_html(url) |>
  html_elements(css = css) |> 
  html_attrs() 
}
7.9.3.3.2 Scrape the official state names and state URL names.

Scrape the official state names.

  • Inspect the web page to get the CSS selector for the div with class="state-district"
  • That div is a parent of children list items, one for each state with the tag a for hyperlink.
  • Scrape the names using the combined selector for parent and child of ".state-district a".
  • Rename the scraped names and save to a data frame.
scrape_text(url_ecourts, ".state-district a") |>
  rename(state_name = value) ->
  states_df

Scrape the state URL names.

  • The state URL names are abbreviated.
  • They appear as the values of the href attributes of the children list items for each state.
    • The href values are designed to be the last part of a URL so each comes with a / on the front end.
  • Append the state_url_name to the end of the site root URL for a complete URL for each state.
  • Write the code to bind the result to states_df by columns and save.
bind_cols(states_df,
           (scrape_attrs(url_ecourts, ".state-district a") |> 
            unlist() |> 
              as.vector()) |> 
            tibble(state_url_name = _) |> 
            mutate(state_url = str_c(url_ecourts, state_url_name, sep = ""))
) ->
states_df
7.9.3.3.3 Scrape the state pages to get the official district names and URLs.

Scrape the district names and URLs for each state and save as a list-column of data frames.

  • Append the scraped url to the end of the root URL to get a complete URL for the page.

  • Note the map() progress bar in the console pane.

district_state_css <- ".state-district a"
states_df |>
  mutate(state_district_data = map(state_url, \(url){
    bind_cols(
      scrape_text(url, district_state_css) |>
        rename(district_name = value),
      (scrape_attrs(url, ".state-district a") |>
        unlist() |> 
         as.vector()) |>
        tibble(district_url_name = _) |>
        mutate(district_url = str_c(url_ecourts, district_url_name, sep = ""))
    )
  }, .progress = TRUE)
  ) ->
states_districts_ldf
7.9.3.3.4 Scrape district pages to get the URLs for their Services page.
  • Given the two different designs for the district web pages, the layout and list structure for each district are *not* the same.
    • The newer sites embed the services site into the page and others open a new window with the services site.
    • However, both designs use the attribute title="Act" for the link to the services page.
    • The selector for a named attribute uses '[attribute|="value"]' (see The Ultimate CSS Selectors Cheat Sheet).
    • Note the use of single and double quotes.

The two designs also differ in the value of the href for the services search capability.

  • The old design site provides a complete “absolute URL” since it is opening up a new page in a new window.
    • The older design uses numeric codes for state and district in the URL.
  • The newer design site is a redirect from the older-style district URL to a new site.
    • Instead of an absolute URL, it uses an internal “relative URL” to refresh the page with the embedded search capability now showing.
    • This relative URL must be appended to the “canonical URL” defined in the head of the web page to navigate to the page.
  • The results are different style URLs for districts and states.
    • Most states have an active search capability but not all.
  • About 53 districts have an “empty” link to the services site (0s in the code fields) so you cannot search for cases based on the acts for those districts.
7.9.3.3.4.1 Build a helper function to get the canonical URL for a page.
get_canonical_url <- function(url) {
  read_html(url) |> 
  html_elements(css = "link") |>
    html_attrs() |>
    tibble(value = _) |>
    unnest_wider(value) |>
    filter(rel == "canonical") |>
    select(href) |>
    unlist()
}
get_canonical_url("https://districts.ecourts.gov.in/jaisalmer")
get_canonical_url("https://districts.ecourts.gov.in/krishna")
7.9.3.3.4.2 Build a function to get the services URL given a base district URL and the css for the services link.

Since there are still more older design district pages than newer, only get the canonical URL if required for a new design page.

  • Use css = '[title|="Act"]' for both new and older design web pages.
  • The structures above the list items have different classes, but using the title attribute ignores them when searching the DOM.
get_s_d_url <- function(url, css) {
   (scrape_attrs(url, css = '[title|="Act"]') |>
         unlist() |>
         as.vector())[1] ->
   temp_url

 if (str_starts(temp_url, "/")) {
      get_canonical_url(url) |>
      str_c(str_sub(temp_url, start = 2L), sep = "") ->
      temp_url
}
  temp_url
}

Test the function get_s_d_url() on one or more districts with the css = '[title|="Act"]'.

get_s_d_url("https://districts.ecourts.gov.in/jaisalmer", css = '[title|="Act"]')
get_s_d_url("https://districts.ecourts.gov.in/krishna",   css = '[title|="Act"]')
7.9.3.3.4.3 Scrape the data and unnest to get state and district service URLS and codes where they exist.

Update the list-column of district data with the services URLs

  • Unnest to create a wider and longer data frame.
  • Extract the state and district codes for the older design URLs.
map(states_districts_ldf$state_district_data, \(df) {
    df$district_url |> 
     map_chr( \(url) get_s_d_url(url, '[title|="Act"]')) |> 
          tibble(state_district_services_url = _) ->
        temp
   bind_cols(df, temp)
  },
  .progress = TRUE
  ) -> 
  states_districts_ldf$state_district_data

states_districts_ldf |>
  unnest(state_district_data) |>
  mutate(
    state_code = str_extract(state_district_services_url, "(?<=state_cd=)\\d+"),
    district_code = str_extract(state_district_services_url, "(?<=dist_cd=)\\d+")
  ) ->
  states_districts_df
  • Write the states district data to "./data/states_districts_df.rds".
write_rds(states_districts_df, "./data/states_districts_df.rds")

This completes the scraping of static data.

7.9.4 Step 2. Get Dynamic Data - Pre-Captcha

In this step, the goal is to capture all the possible courts within each State and Judicial District and then capture all the possible acts that apply to each court.

That data can be then used to select which states, districts, courts, acts, and status to be scraped after solving the Captcha puzzle.

7.9.4.1 The Services Web Site is Interactive

The services site allows one to search for information by court and act and status on a web page for a given state and district.

There are two types of court structures: Complexes and Establishments in a district. They have different sets of courts.

  • The services site web pages are dynamic. As the user selects options, the page updates to reveal more options.
    • Note: each option has a value (hidden) and a name (what one sees on the page).
    • The values are needed for updating the page and the names are useful for identifying the values.
  • The services site URLs use numeric codes for states (1:36) and districts. However, the pages do not contain data on the state or district; only the URL does. The linkages between states, districts, and URLs were captured in Step 1.

The step begins with loading the data from step one for a given state and judicial district.

  • The user must select a type of court (via a radio button).

  • Once the type of court is selected, the user has to select a single court option.

    • The data is scraped to collect each possible court of each type for the district.
  • Once the court option is selected, the user can enter key words to limit the possible options for acts.

    • The search option is not used below as the goal was to collect all the acts.
    • The data is scraped for all possible act option values and names.
      • Some courts have over 2000 possible acts.
    • Once an act is selected, the user can also enter key words to limit to certain sections of the act in the Under Section search box. That is not used in this effort.
  • At this point the data with all the acts for all courts in a state and district is saved to a file.

  • Once an act is selected, the user must choose (using radio buttons) whether to search for Pending or Disposed cases.

  • Once that choice is made a Captcha puzzle is presented which must be solved.

The presentation of the Captcha ends step 2.

7.9.4.2 Build/Test Custom Functions

Let’s start with reading in the data from step 1.

  • Read in states-districts data frame from the RDS file.
states_districts_df <- read_rds("./data/states_districts_df.rds")

Let’s build and test functions to get data for each State and District and type of Courts

  • Define smaller functions that are then used in the next function as part of a modular approach and a build-a-little-test-a-little development practice.

Define functions for the following:

  • Access the pages for each state, district, and type of court.
  • Convert each html page into an {rvest} form.
  • Scrape Court Names and Values from each form based on the court type of Court Establishment or Court Complex.
  • Scrape the specific courts of both court establishments and court complexes for each state and district.
7.9.4.2.1 Access Court Pages as a Form for each URL

Define a function using rvest::html_form and rvest::session to access the page for a URL as a session form.

  • The new design has two elements in the form object and we want the second element.
get_court_page <- \(url) {
  if (str_detect(url, "ecourts")) {
    html_form(session(url))[[1]]
  } else {
    html_form(session(url))[[2]]
  } 
}
  • Test the function with a URL of each design.
test_url_old <- "https://services.ecourts.gov.in/ecourtindia_v4_bilingual/cases/s_actwise.php?state=D&state_cd=2&dist_cd=2"
test_url_new <- "https://krishna.dcourts.gov.in/case-status-search-by-act-type/"

# old design
get_court_page(test_url_old) 

# new design
get_court_page(test_url_new)

Define a function to get court pages as “forms” for a state and district.

  • Use the just-created get_court_page() function.
get_court_pages <- function(.df = states_districts_df, .s_name = "Andhra Pradesh", .d_name = "Anantpur") {
  stopifnot(is.data.frame(.df), is.character(.s_name), is.character(.d_name))

  dplyr::filter(.df, state_name == .s_name & district_name == .d_name) |>
  mutate(ec_form = map(state_district_services_url, \(url) get_court_page(url))) ->
  .df
  .df
}

Test the function get_court_pages().

  • Save the results to the variables test_pages_old and test_pages_new to use later on.
get_court_pages()$ec_form ->
  test_pages_old
test_pages_old

get_court_pages(states_districts_df, "Andhra Pradesh", "Krishna")$ec_form ->
  test_pages_new
test_pages_new

The result for each test should be a list object with one element which has class form.

The form elements are the fields of the form to which one can navigate.
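To see which fields (and thus which select boxes and options) a given form exposes, you can peek at its fields element. A quick check, assuming the test_pages_old list created above:

# names of the fields on the first (old-design) test form
names(test_pages_old[[1]]$fields)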

7.9.4.2.2 Use the Form to Make Choices and Access the Data

Define a function to extract court names and values from the page’s form based on type of court structure.

  • Handle both types of court structures as they have different form object names.
  • Use the names from the fields of the correct form and scrape the option names and values for the select box.
  • The first entry is the “Select Court …” default, so it can be removed.
get_court_data <- function(ec_form, court_type) {
  if (ec_form[[1]][1] == "frm") { # old_design

    if (court_type == "establishment") {
      tibble(
        court_names = names(ec_form$fields$court_code$options)[-1],
        court_values = ec_form$fields$court_code$options[-1]
      )
    } else if (court_type == "complex") {
      tibble(
        court_names = names(ec_form$fields$court_complex_code$options)[-1],
        court_values = ec_form$fields$court_complex_code$options[-1]
      )
    }
  } else { # new design
    if (court_type == "establishment") {
      tibble(
        court_names = names(ec_form$fields$est_code$options)[-1],
        court_values = ec_form$fields$est_code$options[-1]
      )
    } else if (court_type == "complex") {
      tibble(
        court_names = names(ec_form$fields$est_code$options)[-1],
        court_values = ec_form$fields$est_code$options[-1]
      )
    }
  }
}

Test the function get_court_data() with the saved test_pages_old and test_pages_new form lists.

map(test_pages_old, \(form) get_court_data(form, "establishment"))
map(test_pages_old, \(form) get_court_data(form, "complex"))

map(test_pages_new, \(form) get_court_data(form, "establishment"))
map(test_pages_new, \(form) get_court_data(form, "complex"))
  • The results should be lists of tibbles with the court names and values (for the new design, both court types return the same tibble since the code reads the est_code field for both).

Define a function to extract court establishments and court complexes for a state and district.

  • Select the identifying URL and court type and pivot to convert into character vectors of the names and values
  • Use a left join to add to the states_districts_df.
get_courts <- function(.df, .s_name, .d_name) {
  get_court_pages(.df, .s_name, .d_name) |>
  mutate(court_establishments = map(ec_form,
                                    \(form) get_court_data(form, "establishment")),
         court_complexes = map(ec_form,
                               \(form) get_court_data(form, "complex"))) |>
  select(-c(state_district_services_url, ec_form)) |>
  pivot_longer(cols = starts_with("court_"),
               names_to = "court_type",
               names_prefix = "court_",
               values_to = "data") |>
  unnest(data) |>
  ungroup() |>
  unique()  |>  ## get rid of duplicate data in one district court
  left_join(states_districts_df)  |>
  arrange(state_name, district_name, court_type) ->
  tempp
  tempp
}

Test the get_courts() function with examples of old and new designs.

get_courts(states_districts_df, "Andhra Pradesh", "Anantpur") -> test_courts_old
glimpse(test_courts_old)

get_courts(states_districts_df, "Andhra Pradesh", "Krishna") -> test_courts_new
glimpse(test_courts_new)
  • The results should be data frames with 12 columns, and 39 and 42 rows respectively.

Now that the functions are built and tested, they can be used in scraping the dynamic website.

7.9.4.3 Use {RSelenium} to Access Acts for each Court

Load {RSelenium} then start the driver and connect a client-user object to it.

We will use the running server as we build and test functions (that use our previous functions) to access the data in sequence instead of as isolated test cases.

library(RSelenium)
  • If you have to restart due to an error or a server timeout, you may have to close the server and garbage collect, which will close the Java server.
  • Ref: Stop the Selenium server
# remDr$close()
rm(rD) ## if necessary when rerunning
gc() ## Garbage Collection

Use RSelenium::rsDriver() to create a driver object to open the browser to be manipulated.

  • Use netstat::free_port() to identify a free port for the driver to use.
  • The resulting rD has class "rsClientServer" "environment".
  • Assign a name to the rD$client object (a remoteDriver) as a convenience for accessing your “user”.
rD <- rsDriver(browser = "firefox", #specify browser you want Selenium to open
               port = netstat::free_port(),
               verbose = FALSE) 
remDr <- rD$client

7.9.4.4 Define Functions for Navigation and Interactivity

Define functions to navigate the site and enter values based on the data frame.

  • As before, these functions build on each other and on the previous functions.
  • Each of these requires an active (running) RSelenium driver with an open browser.
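The functions below call an enter_court() helper that is not shown in these notes. As a placeholder, here is a minimal sketch of what such a helper might look like for the old-design pages; the radCourtEst id and the option-by-value xpath pattern are borrowed from the enter_act() function later in this section, and the new-design pages would need their own element ids.

# Hypothetical sketch of the enter_court() helper used below; it is not
# defined in these notes and the element ids are assumptions based on the
# old-design pages and the enter_act() pattern shown later.
enter_court <- function(url, court_type, court_value) {
  remDr$navigate(url)
  Sys.sleep(1) # give the dynamic page a moment to render

  if (court_type == "establishments") {
    # old-design pages use a radio button with id "radCourtEst"
    radio_button <- remDr$findElement(using = "id", value = "radCourtEst")
    radio_button$clickElement()
  }

  # select the specific court by its (hidden) option value, mirroring the
  # option-by-value xpath pattern used in enter_act()
  option <- remDr$findElement(
    using = "xpath",
    str_c("//*/option[@value = '", court_value, "']")
  )
  option$clickElement()
}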
7.9.4.4.2 Define a function to get the acts for an input court type and court value for the input state-district URL.
get_acts <- function(url, court_type, court_value) {

  enter_court(url, court_type, court_value)

  if (str_detect(url, "ecourts")) { ## old design
    if (length(remDr$findElements(using = 'id', value = "actcode")[[1]]) > 0) {

      option3 <- remDr$findElement(using = 'id', value = "actcode") # get acts

      ## convert acts to a data frame
      ## Have to be careful to get the option values as well as the names
      ## as the option values do not show up as html_text()
      act_values <- option3$getPageSource()[[1]] |>
        read_html() |>
        html_elements("#actcode")
      act_values <- act_values[[1]]
      act_values |>
        as.character() |>
        str_extract_all("(?<=value=).+(?=<)") |>
        unlist() |>
        tibble() |>
        separate(col = 1, sep = ">",
                 into = c("act_values", "act_names"),
                 extra = "drop") |>
        filter(act_values != '"0"') ## drop the first "select act type" row
    } else {
      NA
    }
  } else { ## new design
    if (length(remDr$findElements(using = 'id', value = "national_act_code")[[1]]) > 0) {

      option3 <- remDr$findElement(using = 'id', value = "national_act_code") # get acts

      ## convert acts to a data frame
      ## Have to be careful to get the option values as well as the names
      ## as the option values do not show up as html_text()
      act_values <- option3$getPageSource()[[1]] |>
        read_html() |>
        html_elements("#national_act_code")
      act_values <- act_values[[1]]
      act_values |>
        as.character() |>
        str_extract_all("(?<=value=).+(?=<)") |>
        unlist() |>
        tibble() |>
        separate(col = 1, sep = ">",
                 into = c("act_values", "act_names"),
                 extra = "drop") |>
        filter(str_detect(act_values, "data", negate = TRUE)) ## drop the first "select act type" row
    } else {
      NA
    }
  }
}

Test get_acts().

get_acts(test_url_old, "complexes", "1020002@11,12@N") -> test_acts_old_c
# Court of Senior Civil Judge,, Penukonda (Taluka)
# 
get_acts(test_url_old, "establishments", "1") -> test_acts_old_e
# District Courts, Ananthapur

get_acts(test_url_new, "complexes", "APKR1A") -> test_acts_new_c
# Junior Civil Courts, Vuyyuru
# 
get_acts(test_url_new, "establishments", "APKR11") -> test_acts_new_e
# Prl  JCJ Court Gannavaram 
  • The old design pages provide data frames of 118 and 119 rows tailored to the court, while the new design provides the same data frame of 2,320 acts for both.
7.9.4.4.3 Define a function to get all the acts for all courts in a state-district.
  • Note the use of pmap to handle each of the arguments in parallel.
  • You have to create the directory to be used by write_rds() (see the one-liner after the function below).
  • The Sys.sleep() function creates a random gap between each time this function is called to send a request to the web site.
get_acts_for_state_district <- function(.df = states_districts_df, 
                                        .s_name = "Andhra Pradesh",
                                        .d_name = "Anantpur") {
  stopifnot(is.data.frame(.df), is.character(.s_name), is.character(.d_name))
get_courts(.df, .s_name, .d_name) |>
  mutate(act_names = pmap(
    list(url = state_district_services_url, court_type = court_type,
         court_value = court_values),
    .f = get_acts,
    .progress = TRUE)) ->  tempp
write_rds(tempp,
          glue::glue("./data/court_act_rds/courts_acts_", .s_name, "_",
                     .d_name, ".rds"))
Sys.sleep(runif(1, 1, 2))
tempp
}
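Note that write_rds() will fail if the target folder does not exist, so create it once before the first run (a one-liner using the ./data/court_act_rds path above):

dir.create("./data/court_act_rds", recursive = TRUE, showWarnings = FALSE)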
  • Test get_acts_for_state_district().
get_acts_for_state_district(states_districts_df, "Andhra Pradesh", "Anantpur") -> test_acts_old
get_acts_for_state_district(states_districts_df, "Andhra Pradesh", "Krishna") -> test_acts_new

The results should be a tibble and a saved file with the proper names in the data/court_act_rds directory with all of the acts for each court in the state-district.

  • The old design returns just over 100 acts for each court in the state-district.
  • The new design test courts all return the same data frame with 2,320 acts.

7.9.5 Scope the Scraping for Steps 2 and 3.

Scraping the entire data could take a while since there are 722 districts across the states.

One can filter the states_districts_df data frame to choose a subset of districts based on state_name and/or district_name.

  • Create a data frame for which states and districts are to be scraped.
  • Adjust filters to expand or limit.
# choose states
chosen_states <- unique(states_districts_df$state_name)
set.seed(1234)
chosen_states <- sample(chosen_states, size = 2)

# choose districts
chosen_districts <- unique(states_districts_df$district_name)
set.seed(1234)
chosen_districts <- sample(chosen_districts, size = 5)

states_districts_df |>
  filter(state_name %in% chosen_states) ->
  chosen_states_districts
chosen_states_districts |> 
  group_by(state_name) |> 
slice_head(n = 3) |> 
  ungroup() ->
  chosen_states_districts

Scrape all acts for each court based on the states and districts in chosen_states_districts.

  • Open the server if required.
remDr$close()
rm(rD) ## if necessary when rerunning
gc() ## Garbage Collection
rD <- rsDriver(browser = "firefox", #specify browser you want Selenium to open
               port = netstat::free_port(),
               verbose = FALSE) 
remDr <- rD$client
  • Scrape all the acts for the states-courts of interest.
map2(.x = chosen_states_districts$state_name,
     .y = chosen_states_districts$district_name,
    \(.x, .y) get_acts_for_state_district(chosen_states_districts, .x, .y),
    .progress = TRUE) ->
chosen_court_acts
write_rds(chosen_court_acts, "./data/chosen_court_acts.rds")

Step 2 is now complete.

You can close the driver/browser for now.

remDr$close()

All of the acts are saved for all of the courts in the states and districts of interest.

7.9.6 Scope the Scraping for Step 3.

The data from Step 2 is saved in .rds files, each of which is identified by the state and district names.

  • Each .rds file contains a data frame with data on all of the courts and their applicable acts for the state and district.

Even without the requirement to solve a Captcha puzzle, one would probably want to filter the data prior to the Step 3 scraping of cases to reduce the volume of data to what is relevant for a given analysis.

Given the Captcha puzzle, one can only scrape for cases for one state, district and court at a time, for a selected set of acts.

However, one can scope the amount of scraping for Step 3 by selecting specific states and districts to be scraped as well as the acts to be scraped. This will greatly reduce the number of cases to just the relevant cases for the analysis.

  • The selected states and districts can be a subset for a state-district-focused analysis or a random sample for a broader analysis.
  • The selected acts can be based on either specific act names or key words that might appear in an act name.
    • Unfortunately, the act names are not consistent across all courts.
    • Thus, key words might provide better data (with an initial analysis focused on cleaning the data for consistent names for the analysis).

The following sections demonstrate methods for scoping the data for Step 3 in the following sequence.

Filter by State and District

  1. Load all the file names saved from Step 2 into a data frame.
  2. Identify the states and districts of interest.
  3. Filter by state-district.
  4. Load in the files for the file names of interest.

Filter the Courts

  1. Identify the courts of interest for the analysis.
  2. Filter the data frame based on either court_type or court_name.

Filter the Acts

  1. Identify the key words of interest for the acts.

7.9.6.1 Filter by File Name

The data from Step 2 is saved in .rds files, each identified by the state and district names.

  • Load all the file names saved from Step 2 into a data frame and append to the path.
    • The list.files() function retrieves all of the file names in the directory.
  • Create a character vector of the states and districts of interest by their names.
  • Filter the file names to only those with codes of interest.
  • Load in the files for the file names of interest and reshape as a data frame.
# Create list of file names
tibble(file_names = list.files("./data/court_act_rds")) |>
  mutate(file_names = str_c("./data/court_act_rds/", file_names)) ->
 court_act_rds_files

# Create vector of state_district names of interest
my_states_districts <- c("Andhra Pradesh_Anantpur", "Andhra Pradesh_Krishna")

court_act_rds_files |> 
  # extract the "State_District" portion of the file name (just before .rds)
  mutate(s_d = str_extract(file_names, "([:alpha:]+\\s?[:alpha:]+_[:alpha:]+\\s?[:alpha:]+)(?=\\.rds)")) |> 
  filter(s_d %in% my_states_districts) |>  
  select(-s_d) ->
 court_act_rds_files 
  
 ## read in the files of interest
all_acts <- map(court_act_rds_files$file_names, read_rds)
  • Reshape the list into a data frame with one list-column for act_names.
tibble(data = all_acts) |>
unnest_longer(col = data) |>
unnest(col = data) ->  
  court_acts_all

The court_acts_all data frame now contains all of the acts for the state-district courts of interest.

7.9.6.2 Filter Courts

The court_acts_all data frame has data on all of the courts.

One can filter the data frame based on the variables court_type and/or court_name.

  • Let’s assume we only want to filter to “establishment” courts.
court_acts_all |> 
  filter(court_type == "establishments") |> 
  nrow()
  • That would leave 46 courts in the data frame.
  • We will leave them all in for now.

7.9.6.3 Filter Acts based on Key Words

There are thousands of acts in the data, most of which are probably not of interest for a given analysis.

court_acts_all$act_names |> 
  map(\(df) nrow(df)) |> 
  unlist() |> 
  mean()     # average number of acts per court
  
court_acts_all$act_names |> 
  list_rbind() |> 
  nrow()     # total number of act rows across all courts

court_acts_all$act_names |> 
  list_rbind() |> 
  select(-1) |> 
  distinct() |> 
  nrow()     # number of distinct act names

We can try to scope the data to the acts of interest based on key words.

  • Create a vector of key words for the acts of interest.
    • As an example, let’s assume the question of interest is about land reform.
  • Convert the vector into a regex string.
key_words <- c("Land", "common", "revenue", "ceiling", "reform", "agriculture",
               "tenant", "tenancy", "lease", "conversion", "rent")
kw_regex <- str_c("(?i)", str_c(key_words, collapse =  "|"))
  • Define a function to filter a data frame to the acts of interest based on the key word regex.
filter_acts <- \(df) filter(df, str_detect(act_names, kw_regex))
  • Use map() to filter each data frame to the acts of interest.
  • Save the updated data to a file.
court_acts_all |>
  mutate(act_names = map(act_names, \(x) filter_acts(x))) ->
  court_acts_key
write_rds(court_acts_key, "./data/court_acts_key.rds")
court_acts_key$act_names |> 
  map(\(df) nrow(df)) |> 
  unlist() |> 
  mean()     # average number of key-word acts per court
  
court_acts_key$act_names |> 
  list_rbind() |> 
  nrow()     # total number of act rows after filtering

court_acts_key$act_names |> 
  list_rbind() |> 
  select(-1) |> 
  distinct() |> 
  nrow()     # number of distinct act names after filtering
  • Help from a domain expert will be needed to determine if Land Acquisition Act, Land Acquisition (Amendment) Act, Land Acquisition (Amendment and Validation) Act, and Land Acquisition (Mines) Act are truly different acts or different versions of the same act.

We now have a data frame with the states, districts, courts, and acts of interest for which we want to find cases in Step 3.

7.9.7 Step 3. Get Dynamic Data - Post-Captcha

Given the presence of the Captcha puzzle, Step 3 requires working one state, district, court, and act at a time.

  • The cases can have status “Pending” or “Disposed” which must be chosen before the Captcha Puzzle is activated as well.
  • There are 722 districts. Each can have between 10 and 30 courts. There are over 2,000 acts and an average of 100 plus for each court.
  • Thus, there is no practical way to scrape all of the data for even a single act across all courts and districts. That would require solving thousands of Captcha puzzles.

Many courts do not have any cases for many acts.

  • A domain expert might be able to assist in narrowing the parameters for a given question of interest.

The question of interest will shape how to proceed across the various attributes to get to the cases of interest.

Let’s assume one wants to look for cases under a single Act, “LAND ACQUISITION (A.P. AMENDMENT) ACT, 1953”, that are already Disposed.

Assuming one has identified a small set of states, districts, and courts, the following strategy can help get to and scrape the case data.

7.9.7.1 Strategy

  • Create the data frame of attributes of interest (state, district, court type, court, act, case status)
  • Start the selenium server web driver.
  • Start with the first record of interest.
  • Use the client object to navigate to the state-district web page with the services search capability.
  • Use functions to make entries for the court type and specific court and the act.
  • Use functions to enter the case status.
  • Manually enter the Captcha solution.
  • In many cases the response is “No records found.” If so, record that result, and move to the next entry.
  • If cases are found, a new page will open. Scrape the data and save to a file.
  • Move to the next record of interest.

7.9.7.2 Preparing for the Captcha

7.9.7.2.1 Create the Data Set

Let’s create a test case for state “Andhra Pradesh”, district “Anantpur”, all courts, act “LAND ACQUISITION (A.P. AMENDMENT) ACT, 1953” and disposed cases.

  • Read in the data for the state and district courts and acts.
read_rds("./data/court_act_rds/courts_acts_Andhra Pradesh_Anantpur.rds") ->
  court_acts_key 
  • Unnest and ungroup the acts data frame.
court_acts_key |>
  unnest(act_names) |>
  ungroup() ->
  court_acts_key_unnest
  • Filter to the “LAND ACQUISITION (A.P. AMENDMENT) ACT, 1953”
court_acts_key_unnest |> 
  filter(act_names  == "LAND ACQUISITION (A.P. AMENDMENT) ACT, 1953") ->
  court_acts_key_unnest
  • Use an index variable, indext, to select the row of interest and create variables for the attributes of interest.
indext <- 1
state_district_services_url <- court_acts_key_unnest$state_district_services_url[indext]
court_type <- court_acts_key_unnest$court_type[indext]
court_value <- court_acts_key_unnest$court_values[indext]
act_value <- court_acts_key_unnest$act_values[indext]
case_status <- "Disposed"

Create a variable for the first part of the file name for the cases.

f_name <- str_c(court_acts_key_unnest$state_name[indext],
                court_acts_key_unnest$district_name[indext],
                str_extract(court_acts_key_unnest$act_values[indext], "\\d+"),
                case_status,
                sep = "_")
7.9.7.2.2 Create Functions to Enter Act and Case Status
Warning

The functions in this section have only been developed and tested on old-design web pages. Updating the functions to the new design is an exercise left to the reader.

Define a function to enter acts after a court is entered.

enter_act <- function(act_value) {
  # Note: uses the globals court_type and remDr
  # Strip any embedded quote characters from the act value
  act_value2 <- str_replace_all(act_value, '\\"', "")
  if (court_type == "complexes") {
    # Court complexes: locate the act selection box (id = "actcode")
    act_box <- remDr$findElement(using = 'id', value = "actcode")
    # act_box$clickElement()
  } else {
    # Court establishments: locate the radio button (id = "radCourtEst")
    radio_button <- remDr$findElement(using = 'id', value = "radCourtEst")
    # radio_button$clickElement()
    # option <- remDr$findElement(using = 'id', value = "court_code")
  }
  ## Select the option whose value attribute matches the act
  option <- remDr$findElement(using = 'xpath',
                              str_c("//*/option[@value = ", act_value2, "]", sep = ""))
  option$clickElement() ## fire the change event for the act
}

Define a function to enter case status.

enter_status <- function(status) {
  if (status == "Pending") {
    ## Get pending Cases
    radio_button <- remDr$findElement(using = "id", value = "radP")
    radio_button$clickElement()
  } else {
    ## Get Disposed Cases
    radio_button <- remDr$findElement(using = "id", value = "radD")
    radio_button$clickElement()
  }
}

Test enter_act() and enter_status().

  • Open Selenium for testing
remDr$close()
rm(rD)
gc()
rD <- rsDriver(browser = "firefox", #specify browser you want Selenium to open
               port = netstat::free_port(),
               verbose = FALSE) 
remDr <- rD$client
enter_court(state_district_services_url, court_type, court_value)

enter_act(act_value)
enter_status(case_status)
  • Run the individual lines one at a time; the user must choose whether to search pending or disposed cases.
  • Then manually enter the Captcha solution and go to the next code chunk.

7.9.7.3 Take Action on the Browser via the Selenium Server

Open Selenium Server.

remDr$close()
rm(rD)
gc()
rD <- rsDriver(browser = "firefox", #specify browser you want Selenium to open
               port = netstat::free_port(),
               verbose = FALSE) 
remDr <- rD$client

Enter the data to be scraped.

enter_court(state_district_services_url, court_type, court_value)
enter_act(act_value)
enter_status(case_status)

## Enter Captcha Solution

7.9.7.4 Scrape Cases

Once the Captcha has been solved, the cases can be scraped with the following function.

  • Requires an active Selenium server where the court and act have already been entered.

Build a function to get a data frame with identification details for each case.

get_cases <- function() {

# Grab the text of the search-results table (id = "showList1")
case_table <- remDr$findElement(using = 'id', value = "showList1")
case_table$getElementText() |> 
  str_extract_all("(?<=\n).+") |>
  tibble(data = _) ->
  tempp
tempp |>
  unnest_longer(data) %>% # magrittr pipe needed here to support the "." pronoun
  mutate(table_col = rep_len(c("c1", "versus", "c3"), length.out = nrow(x = .))) |> 
  mutate(case_no = (row_number() + 2) %/% 3) |> 
  pivot_wider(names_from = table_col, values_from = data) |> 
  mutate(sr_no = as.integer(str_extract(c1, "^\\d+ ")),
         case_type = str_extract(c1, "(?<=^\\d{1,3}\\s).+ ")) |> 
  separate_wider_delim(cols = case_type, delim = " ", 
                       names = c("case_type", "temp"),
                       too_many = "drop") |> 
  mutate(petitioner = str_extract(c1, "(?<=\\d{4}\\s).+"),
         respondent = str_extract(c3, "^.+(?= Viewdetails)"),
         case_number = str_extract(c3,
                                   "(?<=Viewdetails for case number\\s).+")) |> 
  select(sr_no, case_type, petitioner, versus, respondent, case_number) |> 
  mutate(state_district_services_url = state_district_services_url,
         court_type = court_type,
         court_value = court_value,
         act_value = act_value,
         case_status = case_status,
         .before = 1) -> tempp
}

Scrape the case data for all the cases that are returned.

  • Scrape the data frame of cases.
get_cases() ->
  cases
  • Build a function to view the details of each case.
view_case <- function(case_no){
  # Reduce the case number to just its digits
  case_no_attr <- str_extract(case_no, "/\\d*/\\d*") |> 
    str_replace_all("/", "") 
  # Build a CSS attribute selector matching the Viewdetails link's onclick handler
  case_t <- str_c('[onclick*="', case_no_attr, '"]', sep = "")

  case_attr <- remDr$findElement(using = "css", 
                                 value = case_t)
  case_attr$clickElement()
}
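Tracing the function with the test case number used below shows how the CSS selector is built:

case_no <- "L.A.O.P/95/2011"                 # the test value used below
str_extract(case_no, "/\\d*/\\d*") |>        # "/95/2011"
  str_replace_all("/", "")                   # "952011"
# case_t becomes '[onclick*="952011"]', matching that case's Viewdetails link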
  • Test the view_case() function.
case_no <- "L.A.O.P/95/2011"
view_case(case_no)
  • The result should be that the site opens the second page with the details for the case, which looks like Figure 7.17:
Figure 7.17: Case Details page.

Build a function to get the case details, save them, and return (go back) to the first page for the next case.

  • This function only tidies the data into a data frame with a list-column of data frames, one per table. It does not clean the data in each table; that can be done later, once one is sure of the data to be analyzed.

  • Some pages were observed to use a table structure instead of the list structure, so consider extracting the table instead of the text; that would require a different approach to tidying (see the sketch after the function below).

get_case_details <- function() {
  # Grab the text of the details page (id = "secondpage") and cache the raw text
  case_details <- remDr$findElement(using = 'id', value = "secondpage")
  case_details$getElementText() -> tempp
  write_rds(tempp, "./data/tempp.rds")

  # Section headers that mark the start of each table on the details page
  table_type <- c("Case Details", "Case Status", "Petitioner and Advocate",
                  "Respondent and Advocate", "Acts", "IA Status",
                  "Case History", "Orders")

  # Split the text into lines, label each line with its table, and nest by table
  tempp |>
    tibble(text = _) |>
    unnest_longer(text) |>
    separate_longer_delim(cols = text, delim = "\n") |>
    mutate(t_type = (text %in% table_type),
           t_type = cumsum(t_type),
           t_type = c("Court Name", table_type[t_type])) |>
    filter(text != "") |>
    nest_by(t_type) ->
    tempp2

  # Extract the CNR number (assumed to be in row 9 of the second nested table)
  cnr <- str_sub(tempp2$data[[2]][[9, 1]], start = 3)
  write_rds(tempp2, glue::glue("./data/court_act_rds/case_details_", f_name, "_",
                               court_type, "_", court_value, "_", cnr, ".rds"))

  # Go back to the first page (the case list) for the next case
  go_back <- remDr$findElement(using = "css",
                               value = '[onclick*="funBack"]')
  go_back$clickElement()
}
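For the detail pages that use a table structure (noted above), a hedged alternative is to parse the current page source with {rvest} instead of reading the element text. This is a sketch only, assuming the tables sit inside the element with id secondpage:

# Sketch: pull the current page HTML from Selenium and parse the tables with {rvest}
page_tables <- remDr$getPageSource()[[1]] |>          # current page HTML
  rvest::read_html() |>
  rvest::html_elements("#secondpage table") |>        # assumed location of the tables
  rvest::html_table()
# page_tables is a list of tibbles, one per table, to be tidied separately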

Scrape all the detailed data and save it.

map(cases$case_type, \(x) {
    view_case(x)
    get_case_details()
},
.progress = TRUE
)

The result will be a list of NULL elements whose length equals the number of cases; the case details themselves are saved to .rds files by get_case_details().
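Long runs can fail partway through (a slow page load, a stale element, an unexpected page layout). A hedged variant of the loop above wraps get_case_details() in purrr::possibly() and adds a short pause so one bad case does not stop the whole run; the pause length is an assumption.

safe_details <- purrr::possibly(get_case_details, otherwise = NULL)

map(cases$case_type, \(x) {
    view_case(x)
    Sys.sleep(2)       # assumed pause to let the details page load
    safe_details()     # returns NULL instead of erroring on a problem page
},
.progress = TRUE
)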

7.9.7.5 Step 3: Summary of Steps

  1. Read in the data from step 2 for the state and district courts and acts of interest.
read_rds("./data/court_act_rds/courts_acts_Andhra Pradesh_Anantpur.rds") ->
  court_acts_key 
  2. Unnest and ungroup the acts data frame.
court_acts_key |>
  unnest(act_names) |>
  ungroup() ->
  court_acts_key_unnest
  3. Filter to the Act of interest, here, “LAND ACQUISITION (A.P. AMENDMENT) ACT, 1953”.
court_acts_key_unnest |> 
  filter(act_names  == "LAND ACQUISITION (A.P. AMENDMENT) ACT, 1953") ->
  court_acts_key_unnest
  • The result is a data frame of district courts for the chosen act.
  4. Use an index variable, indext, to select the row of interest and create variables for the attributes of interest.
  • Change the index, increment by 1, to work through each row of the data frame.
  • Note when a court has no cases of interest.
indext <- 1
state_district_services_url <- court_acts_key_unnest$state_district_services_url[indext]
court_type <- court_acts_key_unnest$court_type[indext]
court_value <- court_acts_key_unnest$court_values[indext]
act_value <- court_acts_key_unnest$act_values[indext]
case_status <- "Disposed"
  5. Create a variable for the first part of the file name for the cases.
f_name <- str_c(court_acts_key_unnest$state_name[indext],
                court_acts_key_unnest$district_name[indext],
                str_extract(court_acts_key_unnest$act_values[indext], "\\d+"),
                case_status,
                sep = "_")
  6. Open the Selenium Server and browser.
remDr$close()
rm(rD)
gc()
rD <- rsDriver(browser = "firefox", #specify browser you want Selenium to open
               port = netstat::free_port(),
               verbose = FALSE) 
remDr <- rD$client
  7. Enter the data to be scraped into the browser for the state-district services page.
enter_court(state_district_services_url, court_type, court_value)
enter_act(act_value)
enter_status(case_status)

## Enter Captcha Solution
  8. Solve the Captcha and note if there are no records.

  9. If there are records, scrape the data frame of cases.

get_cases() ->
  cases
  10. Scrape all the detailed data and save it.
map(cases$case_type, \(x) {
    view_case(x)
    get_case_details()
},
.progress = TRUE
)

7.9.8 Summary of E-courts Scraping

This case study shows multiple techniques useful in scraping interactive and dynamic sites with the {rvest} and {RSelenium} packages.

It also shows several challenges:

  1. When a site inserts a Captcha puzzle, it adds a manual step and inhibits automated scraping at scale.
  2. When a site is evolving, there can be major design differences across its pages, which requires multiple checks in the functions, with different code branches to handle the different designs.
  3. Even when a site has a consistent look and feel, pages built by different developers can use different approaches, which again requires additional changes to the code.

Recommendations:

  1. Work with a domain expert who understands how the overall system works to better shape the data to be scraped.
  • As an example, if the interest is in land reform acts, it may not make sense to search for cases in the family courts.
  2. Work with a system technical expert who understands how the overall system is evolving to better identify the differences in web page design across the site, rather than just discovering them as errors.

7.10 Summary of Web Scraping

When you want to analyze data from a web page and there is no data download or API access, you can try to scrape the data using the {rvest} package tools; a minimal sketch follows the checklist below.

  • Use SelectorGadget and/or the browser Developer Tools to find the best CSS selectors.
  • Download the page once and save it.
  • Scrape the elements of interest using the CSS selectors.
  • Create a strategy to convert the results into a tibble that is useful for analysis.
  • Execute the strategy, one step at a time.
    • Build-a-little-test-a-little, checking the results at each step.
  • Look at the data for cleaning and clean what you need.
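As a minimal sketch of that workflow (the URL, file path, and CSS selector below are placeholders, not taken from this chapter):

library(rvest)

# Download the page once and save it so repeated runs do not re-hit the site
html_file <- "./data/my_page.html"                      # hypothetical local copy
if (!file.exists(html_file)) {
  download.file("https://example.com/page", html_file)  # placeholder URL
}

# Scrape the elements of interest with a CSS selector found via SelectorGadget
read_html(html_file) |>
  html_elements("table.results td") |>                  # hypothetical selector
  html_text2() ->
  raw_values

# Convert to a tibble and clean as needed
tibble::tibble(value = raw_values)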

Use your new data to answer your question!