Web Scraping: III


This is the third post in a series dedicated to web scraping.

Some websites implement measures against scraping. A common reason is to protect their data; for example, scraping is sometimes used to gather product and pricing data in order to undercut rivals’ prices for goods and services (as in the Ryanair case).

In this post I will show a technique that can be used to bypass cookie-based protections. As an example, we are going to download data from [Kaggle](http://www.kaggle.com). This is useful if, for instance, you want to download a big data file directly to an Amazon EC2 instance. Kaggle only allows browser-based downloads, because you have to accept the competition rules before getting access to the data.

Fortunately, both wget and curl can send cookies along with the request. For example, with wget:

$ wget -x --load-cookies cookies.txt https://www.kaggle.com/c/digit-recognizer/download/knn_benchmark.csv
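The curl equivalent should look something like this (a sketch, assuming the same Netscape-format cookies.txt; -b reads cookies from the file and -O saves the response under the remote file name):

$ curl -b cookies.txt -O https://www.kaggle.com/c/digit-recognizer/download/knn_benchmark.csv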

Now we only need to get the cookies out of our browser. The easiest way I have found to do this is to use the cookie.txt export Chrome extension.

So just copy the exported data into a cookies.txt file and run wget (or curl).

If we want to do this from Python, we need to maintain a session object, which is straightforward with requests. The session makes sure that the corresponding cookies are kept between calls:

In [1]:
import shutil
import requests

login_url = 'https://www.kaggle.com/account/login'
download_url = 'https://www.kaggle.com/c/digit-recognizer/download/knn_benchmark.csv'
filename = download_url.split('/')[-1]
login_data = {'UserName':'kaggle_username', 
              'Password':'kaggle_password'}

with requests.Session() as s, open(filename, 'wb') as f:
    s.post(login_url, data=login_data)                  # log in to establish the session cookies
    response = s.get(download_url, stream=True)         # request the file without loading it all into memory
    shutil.copyfileobj(response.raw, f)                 # save the raw response body to the file
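Note that copyfileobj writes the body exactly as it arrives, so if the server gzip-compresses the response you end up with the compressed bytes. A common alternative, sketched below rather than taken from the original post, is to stream the body in chunks with iter_content, which decodes the content encoding for you:

import requests

with requests.Session() as s, open(filename, 'wb') as f:
    s.post(login_url, data=login_data)                      # log in as before
    response = s.get(download_url, stream=True)             # stream the download
    for chunk in response.iter_content(chunk_size=1 << 16): # read the body in 64 KB chunks
        f.write(chunk)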