Web Scraping: II


This is the second post in a series dedicated to web scraping.

In some cases the data we see in the browser is generated after the page has been downloaded. For example, Gmail's servers don't send the whole page source each time a new email comes in; instead, JavaScript is responsible for updating the visible page (the DOM).

This is exactly what we see when we use "Inspect element" in Google Chrome or Safari. In contrast, if we use the "View Source" tool, we get the original HTML that was downloaded from the server. In this Stack Overflow question the OP wanted to get the live prices from the website of his electricity provider.

A naive approach would go like this:

In [ ]:
import urllib2
from bs4 import BeautifulSoup

url = "https://rrtp.comed.com/live-prices/"
soup = BeautifulSoup(urllib2.urlopen(url).read())
instantPrices = soup.findAll('div', 'instant prices')
print instantPrices

and we get:

In [ ]:
[<div class="instant prices">
</div>]

It's empty! If we select "View Source" (in Chrome: View->Developer->View Source) we see that this is indeed exactly what was sent by the server on the initial page load:

So what can be done? In many cases we will need to emulate the whole browser behavior (Selenium/PhantomJS; a sketch of that route appears at the end of this post). But before calling in the heavy guns, we should check how the data arrives at the client. For example, in the Chrome Developer Tools we can use the Network panel to record all requests made by the client. Then, examining the responses, we find:

Voilà! Now we can search the Headers sub-tab for the right request URL. In some cases we may also need to pass additional headers, cookies, etc. with the request, since some websites implement time-based protection against web scraping, but in this example it is enough to send a GET request to the URL endpoint.
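If extra headers or cookies had been required, they could be attached with urllib2.Request before opening the URL. Here is a minimal sketch; the header values below are only placeholders, not something this particular feed needs:

In [ ]:
import urllib2

url = "https://rrtp.comed.com/rrtp/ServletFeed?type=instant"
# Placeholder headers -- only needed when the server rejects bare requests.
headers = {"User-Agent": "Mozilla/5.0", "Cookie": "name=value"}
request = urllib2.Request(url, headers=headers)
s = urllib2.urlopen(request).read()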

Now it becomes a simple parsing problem. Since we know the structure of the HTML response, a regular expression is the right tool to extract the desired value:

In [1]:
import urllib2
import re

url = "https://rrtp.comed.com/rrtp/ServletFeed?type=instant"
s = urllib2.urlopen(url).read()
#"<p class='price'>3.1<small>&cent;</small><strong> per kWh </strong></p><p>5-minute Trend Price 7:40 PM&nbsp;CT</p>\r\n"

float(re.findall(r"\d+\.\d+", s)[0])  # extract the first decimal number (the price)
Out[1]:
3.1
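
As an aside, had no clean feed URL been available, the browser-emulation route mentioned above might look roughly like the sketch below. It assumes Selenium and PhantomJS are installed; the selector is the same one we tried before:

In [ ]:
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://rrtp.comed.com/live-prices/"
driver = webdriver.PhantomJS()  # headless browser that executes the page's JavaScript
driver.get(url)
html = driver.page_source       # the rendered DOM, as "Inspect element" shows it
driver.quit()

soup = BeautifulSoup(html)
print soup.find('div', 'instant prices')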