Suppose that we’re interested in scraping ‘Blocks mined’ from the page. Taking a look at the dev console, we can confirm that the data is available as text on the page.
This looks like a piece of cake; all we need to do is get the page content into Python and find the element with id='n_blocks_mined'. Easy, right? Let's try it out.
We’ll start with the bread and butter of Python web scraping: BeautifulSoup and requests.
import requests
from bs4 import BeautifulSoup

# Get raw HTML content:
url = 'https://blockchain.info/stats'
page_content = requests.get(url).content

# Parse the raw HTML into a delicious soup:
soup = BeautifulSoup(page_content, 'html.parser')

# Find the id we're interested in, 'n_blocks_mined':
blocks_mined = soup.find(id='n_blocks_mined')
Inspecting our result, we see:
<td colspan="2" id="n_blocks_mined"></td>
But wait, where’s the data? In the browser it’s right here, clear as day!
requests can only fetch the page's initial HTML — the content as it was before the page's JavaScript loaded the data of interest. No worries though, we can still get the job done!
Once you have Selenium installed via pip install selenium (or whatever other method you choose), you'll need to download a web driver in order to actually use Selenium's browser functionality. For a web scraping application like this, I highly recommend Chrome.
Download the web driver for your operating system here: http://chromedriver.chromium.org/downloads
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# The path to where you have your chrome webdriver stored:
webdriver_path = './chromedriver.exe'

# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920x1080')

# Fire up the headless browser
browser = webdriver.Chrome(executable_path=webdriver_path,
                           chrome_options=chrome_options)

# Load the webpage
url = 'https://blockchain.info/stats'
browser.get(url)

# Wait a few seconds before parsing, to give the page's JavaScript
# time to load the data completely.
time.sleep(3)

# Parse HTML, then close the browser
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()

blocks_mined = soup.find(id='n_blocks_mined')
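A fixed time.sleep(3) works, but it either wastes time on fast loads or still fails on slow ones. A more robust pattern is to poll the page source until the element actually has content. Here's a minimal sketch of that idea; the get_source callable is a hypothetical stand-in for browser.page_source so the logic can be shown without a live browser:

```python
import time
from bs4 import BeautifulSoup

def wait_for_element_text(get_source, element_id, timeout=10, poll=0.5):
    """Re-parse the page until the element with `element_id` has text,
    or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        soup = BeautifulSoup(get_source(), 'html.parser')
        tag = soup.find(id=element_id)
        if tag is not None and tag.get_text(strip=True):
            return tag
        time.sleep(poll)
    raise TimeoutError(f'#{element_id} never loaded')

# Stub standing in for browser.page_source: empty at first, filled later.
pages = iter(['<td id="n_blocks_mined"></td>',
              '<td id="n_blocks_mined">134</td>'])
tag = wait_for_element_text(lambda: next(pages), 'n_blocks_mined', poll=0)
print(tag.get_text())  # 134
```

With a real browser you'd call wait_for_element_text(lambda: browser.page_source, 'n_blocks_mined'). Selenium also ships its own explicit-wait machinery (WebDriverWait with expected conditions), which is worth a look for heavier use.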
Inspecting our results again:
<td colspan="2" id="n_blocks_mined">134</td>
We’re in business.
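From here, pulling the actual number out of the tag takes one more step: .text gives the string inside the element, and int() converts it. A quick sketch on the rendered HTML we got back (the '134' value is just the example from above):

```python
from bs4 import BeautifulSoup

# The rendered HTML from the example above:
html = '<td colspan="2" id="n_blocks_mined">134</td>'
soup = BeautifulSoup(html, 'html.parser')

blocks_mined = int(soup.find(id='n_blocks_mined').text)
print(blocks_mined)  # 134
```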