Scraping JavaScript-Rendered Websites with Python and Selenium

One of the tricky things you’ll likely encounter while scraping web pages is JavaScript-rendered content. Some websites look as simple as can be, but still contain data that is filled in by JavaScript after the page loads, and you may need to deal with that. Take Blockchain.info for example:

Screenshot_1

Suppose that we’re interested in scraping ‘Blocks mined’ from the page. Taking a look at the dev console, we can confirm that the data is available as text on the page.

Screenshot_4

This looks like a piece of cake; all we need to do is get the page content into Python and find the element with id='n_blocks_mined'. Easy, right? Let’s try it out.

We’ll start with the bread and butter of Python web scraping: BeautifulSoup and requests.


import requests
from bs4 import BeautifulSoup

# Get raw HTML content:
url = 'https://blockchain.info/stats'
page_content = requests.get(url).content

# Parse the raw HTML into a delicious soup:
soup = BeautifulSoup(page_content, 'html.parser')

# Find the id we're interested in, 'n_blocks_mined':
blocks_mined = soup.find(id='n_blocks_mined')

Inspecting our results with:

print(blocks_mined)

gives us:

<td colspan="2" id="n_blocks_mined"></td>

But wait, where’s the data? In the browser it’s right here, clear as day!

Screenshot_2

It turns out that the data we want on this simple page is JavaScript-rendered. requests only receives the HTML as the server sent it and never runs the page’s scripts, so we get the page content as it was before the data of interest was loaded. No worries though, we can still get the job done!
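A quick way to confirm this diagnosis before reaching for a browser is to check whether the element exists in the raw HTML but has no text. The sketch below does that on hard-coded HTML strings so it runs without a network connection; the helper name is_empty_placeholder and the sample markup are made up for illustration and are not part of requests or BeautifulSoup.

```python
from bs4 import BeautifulSoup


def is_empty_placeholder(html, element_id):
    """Return True if the element exists in the HTML but contains no text.

    An element that is present but empty is a hint (not proof) that a
    script fills it in after the page loads.
    """
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find(id=element_id)
    return tag is not None and tag.get_text(strip=True) == ''


# Sample markup modelled on the output above (illustrative only)
raw_html = '<table><tr><td colspan="2" id="n_blocks_mined"></td></tr></table>'
rendered_html = '<table><tr><td colspan="2" id="n_blocks_mined">134</td></tr></table>'

print(is_empty_placeholder(raw_html, 'n_blocks_mined'))       # True
print(is_empty_placeholder(rendered_html, 'n_blocks_mined'))  # False
```

If the check returns True for the HTML that requests downloaded, the value is most likely being written into the page by a script.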


Introducing Selenium

With Selenium, we can operate a fully functioning web browser from within Python, allowing us to actually work with JavaScript-rendered data.

Once you have Selenium installed via pip install selenium or whatever other method you choose, you’ll need to download a web driver in order to actually use Selenium’s browser functionality. For a web scraping application like this, I highly recommend Chrome.

Download the web driver for your operating system here: http://chromedriver.chromium.org/downloads


import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Same page as in the requests example above
url = 'https://blockchain.info/stats'

# The path to where you have your chrome webdriver stored:
webdriver_path = './chromedriver.exe'

# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920,1080')

# Fire up the headless browser
# (Selenium 4 style; Selenium 3 took executable_path= and chrome_options= here)
browser = webdriver.Chrome(service=Service(executable_path=webdriver_path),
                           options=chrome_options)

# Load webpage
browser.get(url)

# It can be a good idea to wait for a few seconds before trying to parse the page
# to ensure that the page has loaded completely.
time.sleep(3)

# Parse HTML, close browser
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()
blocks_mined = soup.find(id='n_blocks_mined')

Inspecting our results again:

print(blocks_mined)

gives us:

<td colspan="2" id="n_blocks_mined">134</td>

Perfect! Selenium successfully loaded the JavaScript-rendered data, and we’re back in business. Finally, to get the data we’re really interested in, simply access the text attribute:

print(blocks_mined.text)

134
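The text attribute is a string, so if you want to do arithmetic with the value it needs converting first. Here is a minimal sketch of that step; parse_count is a hypothetical helper, and the handling of thousands separators is an assumption, since the value shown above has none.

```python
def parse_count(text):
    """Convert scraped text such as ' 1,234 ' to an int.

    Returns None when the text is empty or is not a whole number,
    which is what an unrendered element would give us.
    """
    cleaned = text.strip().replace(',', '')
    try:
        return int(cleaned)
    except ValueError:
        return None


print(parse_count('134'))    # 134
print(parse_count('1,234'))  # 1234
print(parse_count(''))       # None
```

With the soup from the script above, this would be called as parse_count(blocks_mined.text).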

We’re in business.
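One refinement worth considering: the fixed time.sleep(3) in the Selenium script either wastes time on a fast connection or is too short on a slow one. A common alternative is to poll until the data shows up, with a timeout as a safety net. The sketch below shows the idea in plain Python; wait_until and fake_page_text are illustrative names, and the fake page stands in for a real browser so the example runs on its own.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(poll_interval)


# Stand-in for a page whose data shows up on the third check (illustrative only)
attempts = {'count': 0}


def fake_page_text():
    attempts['count'] += 1
    return '134' if attempts['count'] >= 3 else ''


print(wait_until(fake_page_text, timeout=5, poll_interval=0.1))  # 134
```

With the browser from the script above, the condition could be a function that parses browser.page_source and returns the text of the n_blocks_mined element, called in place of time.sleep(3) and before browser.quit(). Selenium also ships its own explicit-wait helper, WebDriverWait in selenium.webdriver.support.ui, which serves the same purpose.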
