Web scraping: I
This is the first post in a series dedicated to web scraping.
There are at least three ways to analyze the HTML of a page. To scrape fairly regularly formatted data from large documents, a regular expression is a good fit and will usually be faster than a generic parser. But bear in mind that HTML is not always well formed and, as the name suggests, regular expressions are designed to deal with regular structures.
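As a minimal sketch of the regular-expression route, assume a document where every record has exactly the same shape. The HTML snippet and the pattern below are made up for illustration and are not taken from any real page.

```python
import re

# Hypothetical, perfectly regular markup: one <tr> per record, two <td> cells each.
html = """
<tr><td>Bankrupt</td><td>Active</td></tr>
<tr><td>Deleted</td><td>Inactive</td></tr>
<tr><td>Lqdtd</td><td>Active</td></tr>
"""

# Non-greedy groups capture the text of the two cells of each row.
ROW = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")

rows = ROW.findall(html)
print(rows)

# The same pattern silently misses a row as soon as the markup deviates,
# e.g. an attribute on the cell or a line break between the tags.
irregular = '<tr><td class="x">Historic</td>\n<td>Inactive</td></tr>'
missed = ROW.findall(irregular)
print(missed)
```

The second call finds nothing, which is the kind of silent failure that makes a real parser the safer choice once the markup stops being regular.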
When the markup is irregular, use a parser instead (lxml, pyquery, BeautifulSoup). The elements are then selected via XPath or, alternatively, using CSS selectors (à la jQuery). Whichever you choose, there is a very useful utility to help identify the selectors: Selector Gadget.
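Here is a sketch of the parser route. It uses the standard library's xml.etree.ElementTree only so that the example runs without extra dependencies. That choice is an assumption made for illustration: ElementTree needs well-formed markup and supports only a subset of XPath, so for real pages one of the parsers named above is the better tool. The snippet itself is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet (ElementTree cannot cope with broken HTML).
snippet = """
<table class="borderless">
  <tr><th>Code</th><th>Status</th></tr>
  <tr><td>Bankrupt</td><td>Active</td></tr>
  <tr><td>Deleted</td><td>Inactive</td></tr>
</table>
""".strip()

table = ET.fromstring(snippet)

# XPath-style selection: the first <td> of every row that has one.
# On this snippet it picks the same cells as the CSS selector "td:nth-child(1)".
first_cells = [td.text for td in table.findall("./tr/td[1]")]
print(first_cells)
```

The header row contains only th cells, so it contributes nothing to the selection.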
A friend of mine wanted to scrape some public company data. The page had a lot of different tables and the task was to extract the first and the second column from the ones satisfying the following conditions:
- The header of the first column should be NUANS Reports & Preliminary Searches
- The values of the second column should be Active or Inactive
Fortunately, this being an old government website, what you see is what you get: the page is fully generated by the server (most probably it is just served statically), so no "special tricks" are needed. First we load the data with the wonderful pyquery:
    from pyquery import PyQuery as pq

    url = 'https://www.nuans.com/RTS2/en/jur_codes-codes_jur_en.cgi#Example_of_report_layouts'
    d = pq(url)  # fetch and parse the page
Looking at the markup in the Developer Tools, we see that there are several possible approaches. We could traverse all td elements and check that the corresponding header matches the first condition:
    l = []
    for td in d.items('.borderless td:nth-child(1)'):  # every first-column cell
        left = td.text()
        right = td.next().text()
        tr = td.parent()
        tbody = tr.parent()
        title = tbody('th:first').text()  # first element
        if title == 'NUANS Reports & Preliminary Searches' and right in ['Active', 'Inactive']:
            l.append([left, right])
1 loops, best of 3: 199 ms per loop
But that feels wrong, as the condition is checked for every td element on the whole page, even the ones that belong to a table with the wrong header. I also find the pattern of initializing an empty list and appending to it slightly un-Pythonic, so I came up with an improved version:
    row_gen = (
        [td.text(), td.next().text()]  # left, right element
        for table in d('.borderless').items()
        for td in table('td:nth-child(1)').items()  # left column
        if table('th:first').text() == 'NUANS Reports & Preliminary Searches'
        and td.next().text() in ('Active', 'Inactive')
    )
10 loops, best of 3: 172 ms per loop
Now we have a nice generator expression that goes through the tables and looks up the header on the table itself instead of climbing up from every cell. As soon as the header doesn't match, the condition short-circuits and the second column of that row is never inspected (the rows of a non-matching table are still visited, though, because the filter sits after the inner loop). As the timings indicate (timed with a list comprehension), this is not only more memory efficient but also slightly faster (yes I know, another case of premature optimization ;-)).
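To illustrate where a filter sits in a nested generator expression, here is a sketch on plain Python data rather than on the parsed page. The tables list, the header_matches helper and the call counter are all hypothetical stand-ins. The point is only that an if placed right after the outer for is evaluated once per table, whereas an if placed after the inner for is evaluated once per row.

```python
WANTED = 'NUANS Reports & Preliminary Searches'

# Hypothetical stand-in for the parsed page: (header, rows) pairs.
tables = [
    ('Some other table', [('a', 'Active'), ('b', 'Inactive')]),
    (WANTED, [('Bankrupt', 'Active'), ('Deleted', 'Inactive'), ('x', 'n/a')]),
]

calls = {'per_row': 0, 'per_table': 0}


def header_matches(header, counter):
    """Hypothetical helper: compare the header and count how often it is called."""
    calls[counter] += 1
    return header == WANTED


# Filter after the inner loop: the header is tested for every row.
per_row = (
    [left, right]
    for header, rows in tables
    for left, right in rows
    if header_matches(header, 'per_row') and right in ('Active', 'Inactive')
)

# Filter hoisted next to the outer loop: the header is tested once per table.
per_table = (
    [left, right]
    for header, rows in tables
    if header_matches(header, 'per_table')
    for left, right in rows
    if right in ('Active', 'Inactive')
)

# Generator expressions are lazy: no header has been tested yet.
calls_before = dict(calls)

result_per_row = list(per_row)
result_per_table = list(per_table)
print(result_per_row == result_per_table, calls_before, calls)
```

Both variants produce the same rows; only the number of header tests differs. Keep in mind that a generator can be consumed only once, so a second list(per_row) would come back empty.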
Finally we can save the results in a file:
    import csv

    # Python 3: open in text mode with newline='' (on Python 2 this was open('companies.csv', 'wb'))
    with open('companies.csv', 'w', newline='') as csvfile:
        csv.writer(csvfile, delimiter=',').writerows(row_gen)

    !head companies.csv
    # Amlgmtd,Inactive
    # Bankrupt,Active
    # Cancelled,Inactive
    # Cnttn_Out,Inactive
    # Deleted,Inactive
    # Dissolved,Inactive
    # Historic,Inactive
    # Lqdtd,Active
    # LT_CrtOrd,Active
    # Pnd_Rstrn,Active
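The same writer call can be tried without touching the disk. The sketch below assumes Python 3 and feeds a hypothetical generator of rows (standing in for row_gen) into an in-memory buffer. By default csv.writer terminates each line with \r\n itself, which is why files opened for it should be created with newline=''.

```python
import csv
import io

# Hypothetical rows standing in for row_gen.
sample = [('Bankrupt', 'Active'), ('Deleted', 'Inactive')]
rows = ([name, status] for name, status in sample)

buffer = io.StringIO()
csv.writer(buffer, delimiter=',').writerows(rows)  # pulls rows from the generator one by one

content = buffer.getvalue()
print(repr(content))
```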