Web scraping: I


This is the first post in a series dedicated to web scraping.

There are at least three ways to analyze the HTML of a page. To scrape fairly regularly formatted data from large documents, a regular expression is the right solution and will be faster than a generic parser. But bear in mind that HTML is not always well formed and, as the name suggests, regular expressions are designed to deal with regular structures.
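As a minimal sketch of the regular-expression approach, assume a hypothetical, perfectly regular snippet (it is not the markup of any real page) in which every data row consists of exactly two adjacent cells:

```python
import re

# Hypothetical, regularly formatted markup used only for illustration.
html = """
<tr><td>Amlgmtd</td><td>Inactive</td></tr>
<tr><td>Bankrupt</td><td>Active</td></tr>
<tr><td>Header only</td></tr>
"""

# Two adjacent <td> cells per row; the second one must be Active or Inactive.
ROW_RE = re.compile(r"<tr><td>([^<]+)</td><td>(Active|Inactive)</td></tr>")

# findall returns one (left, right) tuple per matching row.
rows = ROW_RE.findall(html)
print(rows)
```

The third row is silently skipped because it does not fit the pattern. That is convenient here, but it is also the reason this approach breaks down as soon as the markup stops being regular (extra attributes, line breaks inside a row, nested tags).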

When the markup is not that regular, you use a parser (lxml, pyquery, beautifulsoup). The elements are then selected via XPath or, alternatively, with CSS selectors (à la jQuery). Whichever you choose, there is a very useful utility to help identify the selectors: Selector Gadget.
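To give a feel for selecting elements with a path expression, here is a small sketch that uses the standard library's ElementTree instead of the parsers named above, only so that it runs without extra dependencies. ElementTree supports just a subset of XPath and needs well-formed markup, and the snippet and its header names are made up:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; messy real-world HTML needs a
# forgiving parser such as lxml, pyquery or beautifulsoup.
snippet = """
<table class="borderless">
  <tr><th>Code</th><th>Status</th></tr>
  <tr><td>Amlgmtd</td><td>Inactive</td></tr>
  <tr><td>Bankrupt</td><td>Active</td></tr>
</table>
"""

root = ET.fromstring(snippet)

# ElementTree's XPath subset: the first <td> of every row, at any depth.
first_cells = [td.text for td in root.findall(".//tr/td[1]")]
print(first_cells)
```

With a CSS-selector library the same selection would roughly read `td:nth-child(1)`, which is the form used in the rest of this post.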

A friend of mine wanted to scrape some public company data. The page had a lot of different tables and the task was to extract the first and the second column from the ones satisfying the following conditions:

  1. The header of the first column should be NUANS Reports & Preliminary Searches
  2. The values of the second column should be Active or Inactive

Fortunately, since this is an old government website, what you see is what you get: the page is fully generated by the server (most probably it is just served statically), so no "special tricks" are needed. First we load the data with the wonderful pyquery:

In [2]:
from pyquery import PyQuery as pq

url = 'https://www.nuans.com/RTS2/en/jur_codes-codes_jur_en.cgi#Example_of_report_layouts'
d = pq(url)

Looking at the markup in the Developer Tools we see that there are several possible approaches. We could traverse all first-column td elements and check that the corresponding header matches condition 1:

In [140]:
rows = []
for td in d.items('.borderless td:nth-child(1)'):   # first cell of every row
    left = td.text()
    right = td.next().text()                        # second cell of the same row
    tr = td.parent()
    tbody = tr.parent()
    title = tbody('th:first').text()                # header of the first column
    if title == 'NUANS Reports & Preliminary Searches' and right in ['Active', 'Inactive']:
        rows.append([left, right])
1 loops, best of 3: 199 ms per loop

but that feels wrong, as the condition is checked for every td element in the whole page, even the ones that belong to a table with the wrong header. I also find the pattern of initializing an empty list and appending to it slightly un-Pythonic, so I came up with an improved version:

In [143]:
row_gen = ( [td.text(), td.next().text()]                # left, right element
          for table in d('.borderless').items()
          for td in table('td:nth-child(1)').items()   # left column
          if table('th:first').text() == 'NUANS Reports & Preliminary Searches' and 
             td.next().text() in ('Active', 'Inactive') )
10 loops, best of 3: 172 ms per loop

Now we have a nice generator expression that walks the page table by table. When a table's header doesn't match, the 'and' short-circuits and the status check is skipped for that table's rows. (The header test is still evaluated once per row; moving it right after the 'for table' clause would skip non-matching tables entirely.) As the timings indicate (timed with a list comprehension), this is not only more memory efficient but also slightly faster (yes I know, another case of premature optimization ;-)).
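How much work the filter saves depends on where it sits in the comprehension. The following sketch uses plain Python data as a hypothetical stand-in for the parsed tables (no pyquery involved) and counts how often the header test runs when it is placed directly after the outer loop:

```python
# Hypothetical stand-in for parsed tables: (header, rows) pairs.
tables = [
    ("NUANS Reports & Preliminary Searches",
     [("Amlgmtd", "Inactive"), ("Bankrupt", "Active"), ("Code", "n/a")]),
    ("Some other table", [("Foo", "Active")]),
]

WANTED = "NUANS Reports & Preliminary Searches"
header_checks = 0


def header_matches(header):
    """Compare a header with the wanted one and count the calls."""
    global header_checks
    header_checks += 1
    return header == WANTED


rows = [
    [left, right]
    for header, table_rows in tables
    if header_matches(header)              # evaluated once per table
    for left, right in table_rows          # skipped for non-matching tables
    if right in ("Active", "Inactive")
]
print(rows, header_checks)
```

Here the header test runs twice, once per table, and the rows of the second table are never visited. With the test placed after the inner loop it would run once per row instead.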

Finally we can save the results in a file:

In [6]:
import csv

# 'wb' is the Python 2 idiom; on Python 3 use open('companies.csv', 'w', newline='')
with open('companies.csv', 'wb') as csvfile:
    csv.writer(csvfile, delimiter=',').writerows(row_gen)
!head companies.csv
# Amlgmtd,Inactive
# Bankrupt,Active
# Cancelled,Inactive
# Cnttn_Out,Inactive
# Deleted,Inactive
# Dissolved,Inactive
# Historic,Inactive
# Lqdtd,Active
# LT_CrtOrd,Active
# Pnd_Rstrn,Active
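One thing to keep in mind when passing a generator to writerows is that it is consumed in the process. The sketch below assumes Python 3, writes to a temporary directory, and uses two made-up rows in place of the scraped ones:

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the generator built from the page.
rows = (pair for pair in [["Amlgmtd", "Inactive"], ["Bankrupt", "Active"]])

path = os.path.join(tempfile.mkdtemp(), "companies.csv")

# Python 3: text mode plus newline='' lets the csv module control line endings.
with open(path, "w", newline="") as csvfile:
    csv.writer(csvfile, delimiter=",").writerows(rows)

# Read the file back to check what was written.
with open(path, newline="") as csvfile:
    written = list(csv.reader(csvfile))
print(written)

# The generator is now exhausted; iterating it again yields nothing.
leftover = list(rows)
print(leftover)
```

If the rows are needed again after saving, build a list first (for example with a list comprehension) and write that instead.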