Multithreading Web Scraping with Python


In this entry we are going to cover a very popular topic, web scraping: how to extract useful information from the web using Python and concurrency.


Web Scraping

What is Web Scraping?

Web Scraping is a method by which software extracts meaningful information from a website, the same way a person would extract it by copying from the browser and pasting into some data file like a spreadsheet, but automatically, using a programming language or a scraping tool.

The approach using Python is similar to manual extraction: we start at the page URL, then, following some rules (most commonly CSS selector rules), we find patterns in the page's source code, extract our data and save it for later.

Foundations

In order to understand how to recognize patterns in the source code, we first have to cover some foundations.

HTML

HTML stands for HyperText Markup Language. It is not a programming language but a markup language: in essence, it is used to structure the information displayed by a web page. Using a tag system and parent-child relationships you can build trees of tag nodes; a well coded page has an organized structure, using recognizable tags that hold relationships among themselves.

These tags also have attributes, which, in combination with tag names, we can match to find patterns in this structure.

Classes

Classes are designated names that you use inside the class attribute of an HTML tag. Let's take a look at the following example.

<h1 class="page-heading">Main header</h1>

Usually, a class relates to the style of the page defined in a CSS file, but we are not going to cover that here. What matters about classes is that a well structured page uses them for the relevant elements in the page, and we need to recognize them.

Identifiers

Identifiers are attributes used to identify unique, especially relevant elements in our page; using the id attribute we can define a unique name for a tag. Let's take a look at the following example.

<h1 class="page-heading" id="main-heading">Main header</h1>

Two elements cannot have the same identifier, but two elements can share the same class.

Web developers use these two features to develop the visual style with CSS and the logic with JavaScript. What is relevant for us as web scrapers is to use them to select elements.

CSS selectors

CSS selectors are a set of rules we can use to select one or multiple elements in a page. The full set of rules is very broad, but we are going to focus on selecting by class and by identifier.

Selecting by class name.

.page-heading

If there is more than one element in the page with this class name, then we will select multiple elements, which becomes a list in Python.

Selecting by identifier.

#main-heading

This selects a single, unique element.
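As a quick preview of how these two selectors behave in Python, here is a minimal sketch using the BeautifulSoup library we install in a later section, on a made-up HTML snippet:

from bs4 import BeautifulSoup

html = """
<h1 class="page-heading" id="main-heading">Main header</h1>
<p class="page-heading">Another element sharing the class</p>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.select(".page-heading"))      # by class: a list of 2 elements
print(soup.select_one("#main-heading"))  # by id: the single unique element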

Web Scraping vs Web Crawling

Scraping is the process of extracting information from an HTML (or other markup language) page, while crawling is traversing the page for new links to follow.

Scraping Example

For this entry we are going to do some coding and scrape a website that holds information about all of the CPUs available in the market, PassMark Software. On the main page there is a table that classifies the CPUs by performance, value and hardware target; from the performance section we are going to crawl and scrape the High End CPU Chart page.

[Image: the High End CPU Chart table]

This table has a list of approximately 870 CPUs (869 at the time of writing), each row with a link to a full description page. Let's take AMD Ryzen Threadripper PRO 3995WX for example:

[Image: AMD Ryzen Threadripper PRO 3995WX detail page]

With all of this information in mind, and for the simplicity of this post, we are going to crawl only the high end chart and get, for every CPU in the list, its Socket, Clockspeed, Turbo Speed and Average CPU Mark.

Walking through the code

We are going to write a Python script that can perform this crawling and scraping task from scratch.

Dependencies

For this entry we are going to use the requests library to perform HTTP requests to the Internet and the BeautifulSoup library to extract elements from the HTML code of the web pages. We will also need the lxml package, since we will use it as the parser for BeautifulSoup. You can install all of them by executing the following in your terminal.

pip3 install requests
pip3 install beautifulsoup4
pip3 install lxml

We highly recommend doing this inside a virtual environment; check out our entry about venv.
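If you prefer to set one up right away, a typical flow on Linux or macOS looks like this (the environment name is arbitrary):

python3 -m venv scraping-env
source scraping-env/bin/activate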

Importing tools

Let's start by importing the required libraries

# encoding: utf-8
# web_scraping_cpu.py

'''
    Script for scraping high end cpus on www.cpubenchmark.net
    https://www.cpubenchmark.net/high_end_cpus.html
'''

import time

import requests
from bs4 import BeautifulSoup

The requests library mimics, to some degree, the behavior of a web browser: it can perform HTTP methods such as GET, POST, PUT and DELETE, among others. But instead of rendering the server's response in a user interface, it lets us hold the response in a variable.

The beautifulsoup4 library provides classes that allow us to parse the HTML code from the server response and traverse it, finding and matching information through different mechanisms such as CSS selectors.

Finally, we import the time module to measure how long the crawling and scraping process takes in a single thread, so we can later compare it with the multithreaded approach.

Finding patterns and CSS Selectors

In order to extract precise information from a web page we have to start by inspecting its HTML source code. You can do this with any browser's developer tools; we recommend Chrome Developer Tools.

Chrome Developer Tools

There are several ways to open the Developer Tools in the browser to inspect the HTML code: one option is pressing the F12 key, another is right clicking on any element in the page and then choosing Inspect from the context menu.

This last option opens the Elements tab of the Developer Tools with the specific element already highlighted.

[Image: the highlighted row in the Elements tab]

[Image: the Copy selector option in the context menu]

We can get the CSS selector for this element by right clicking on it and choosing Copy selector from the Copy menu in the context menu that pops up.

We will get the following.

#rk4207 > a > span.prdname

But this selector is too specific: instead of one element we want to select all rows from this CPU table. By inspecting some parent nodes we can find a ul tag (an unordered list), and inside this node all the li nodes, which are the rows we want. We can make this selection using the following CSS selector.

#mark .chartlist li

We can tune our CSS selector by specifying more parent nodes; this way we avoid selecting other elements with the same chartlist class name.

Testing selectors in the browser console

The browser developer tools include another useful tool for practicing our selectors: the Console, an interactive JavaScript console. Since this is not an entry about JavaScript, I'm only going to show enough JavaScript to test your CSS selectors.

In this console, execute the following ($$ is the console's shorthand for document.querySelectorAll).

$$("#mark .chartlist li")

You will see the following output

(869) [li#rk4207, li#rk3837.alt, li#rk4206, li#rk4300.alt, li#rk3674, li#rk4205.alt, li#rk3719, li#rk4251.alt, li#rk3623, li#rk3547.alt, li#rk3894, li#rk3851.alt, li#rk4391, li#rk4383.alt, li#rk3555, li#rk3604.alt, li#rk3617, li#rk3880.alt, li#rk4346, li#rk3538.alt, li#rk3862, li#rk4388.alt, li#rk3713, li#rk4407.alt, li#rk3600, li#rk3642.alt, li#rk3591, li#rk3753.alt, li#rk3846, li#rk3870.alt, li#rk3598, li#rk3732.alt, li#rk3662, li#rk3420.alt, li#rk3861, li#rk4272.alt, li#rk3858, li#rk3630.alt, li#rk3845, li#rk3111.alt, li#rk4403, li#rk3778.alt, li#rk3493, li#rk3309.alt, li#rk3610, li#rk3532.alt, li#rk3373, li#rk3650.alt, li#rk4070, li#rk3701.alt, li#rk3625, li#rk3563.alt, li#rk3671, li#rk3517.alt, li#rk3405, li#rk3316.alt, li#rk3575, li#rk3118.alt, li#rk4326, li#rk3850.alt, li#rk3770, li#rk3854.alt, li#rk3541, li#rk3472.alt, li#rk3092, li#rk3345.alt, li#rk3482, li#rk3639.alt, li#rk4420, li#rk3869.alt, li#rk3387, li#rk3358.alt, li#rk3631, li#rk3608.alt, li#rk3543, li#rk3182.alt, li#rk3891, li#rk3149.alt, li#rk3096, li#rk3540.alt, li#rk3058, li#rk3742.alt, li#rk3094, li#rk3311.alt, li#rk3085, li#rk3332.alt, li#rk3731, li#rk3632.alt, li#rk3127, li#rk3516.alt, li#rk4188, li#rk3817.alt, li#rk3215, li#rk3534.alt, li#rk3904, li#rk3354.alt, li#rk3728, li#rk4380.alt, li#rk3352, li#rk3389.alt,]

This output is a JavaScript array of 869 elements, and in this interactive console you can expand the array and inspect each JavaScript object. You can use this to confirm whether your CSS selector is right or needs more tuning.

Coding our scraper

For each row we have to access three data tags nested inside a hyperlink tag. Let's also mention that every tag has attributes and properties: the text content property, which holds the displayed information, and attributes under the hood, like href (when present), which targets a hyperlink to another page.

main_url = "https://www.cpubenchmark.net/high_end_cpus.html"
response = requests.get(main_url)
soup = BeautifulSoup(response.content, features='lxml')

cpu_rows = soup.select("#mark .chartlist li")

The first thing to do in Python is get the server response for the main URL; that's what requests is for. We use requests to perform an HTTP GET and receive a response back from the server.

This response object holds many properties, but the relevant one right now is content, which is the body of the page: the HTML code.

We can use the class BeautifulSoup to instantiate an object whose methods allow us to perform selections.

At this point we have taken our previously tested CSS selector and used it as the argument for the select method. This returns a list of all 869 CPU rows.
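A quick sanity check (the exact count will change as the site updates its chart):

print(len(cpu_rows))  # 869 at the time of writing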

Let's use a for loop to iterate over this and extract our data.

cpus = list()

base_url = "https://www.cpubenchmark.net/"

for row in cpu_rows:

    # row scraping

    name_tag = row.select_one("a span.prdname")
    link_tag = row.select_one("a")
    score_tag = row.select_one("a span.count")
    price_tag = row.select_one("a span.price-neww")

    data = dict()

    data["name"] = name_tag.text
    data["url"] = base_url + link_tag["href"]
    data["score"] = int(score_tag.text.replace(",", ""))
    data["price"] = price_tag.text

    cpus.append(data)

In this loop we extract the data from each row. By applying CSS selectors on each row we can make a precise selection, but in this case we used select_one, a method that returns only the first element matching the CSS selector.

From some of our selected tags we only want the text, but from others we want the href attribute. Something to point out is that we have to verify whether this link is relative or absolute; if it's relative, we have to concatenate a base URL onto it.
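As a side note, a more robust way to handle this is urljoin from Python's standard library, which only prepends the base when the link is actually relative. A minimal sketch (the example links are made up):

from urllib.parse import urljoin

base_url = "https://www.cpubenchmark.net/"

# relative link: the base is prepended
print(urljoin(base_url, "cpu.php?cpu=Some+CPU"))
# absolute link: returned unchanged
print(urljoin(base_url, "https://example.com/page"))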

Some string methods are used to clean our data, stripping out display formatting. In our case we don't want the comma (,) in the score, since we are going to convert it into an integer value.
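For example (with a made-up score value):

raw_score = "88,931"  # hypothetical displayed score
print(int(raw_score.replace(",", "")))  # 88931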

Crawling each CPU

Now let's create another code section in which we will perform our crawling.

tick = time.time()
fail_cpus = list()

for cpu in cpus:

    try:
        url = cpu["url"]

        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')

        socket_tag = soup.select_one("div.left-desc-cpu p:nth-child(2)")
        clockspeed_tag = soup.select_one("div.left-desc-cpu p:nth-child(3)")
        turbospeed_tag = soup.select_one("div.left-desc-cpu p:nth-child(4)")
        avg_mark_tag = soup.select_one("div.right-desc > span:nth-child(3)")

        socket = socket_tag.text.replace("Socket:", "").strip()
        clockspeed = clockspeed_tag.text.replace("Clockspeed:", "").strip()
        turbospeed = turbospeed_tag.text.replace("Turbo Speed:", "").strip()
        avg_mark = int(avg_mark_tag.text.replace(",", ""))

        cpu["socket"] = socket
        cpu["clockspeed"] = clockspeed
        cpu["turbospeed"] = turbospeed
        cpu["avg_mark"] = avg_mark

    except Exception:
        fail_cpus.append(cpu)

tock = time.time()
print("Took {} seconds..".format(tock - tick))

In this section we iterate over the CPUs scraped from the main URL and crawl to each CPU's detailed page, where we extract the relevant data using other CSS selectors.

Finally, we print out the amount of time it took to perform this web scraping task.

In this case we used exception handling with a try-except statement, because on some CPU pages some pieces of information are missing; when that happens, select_one returns None and accessing .text raises an exception during execution.

On my computer it took 673 seconds, approximately 11 minutes. Some of you may say that was fast, others that it was slow.

Since this is a network-intensive task, the program spends most of its time waiting on I/O, so we can take advantage of multithreading to improve this time by means of concurrency.

Crawling with Concurrency

There are many ways to use the threading module and its functions; in this post we are going to develop a Runnable class and target one instance of it for every worker thread.

But first, let's talk about how this process works: how we move from sequential crawling to concurrent crawling. Let's take a look at the next chart.

[Chart: sequential crawling vs. queue-fed worker threads]

The first part of this process we already have: extracting the list of CPUs. But now, instead of saving it into a list, we are going to save it in a Queue, a Python class from the queue module. This object piles up information in some order, but the important thing is that it can be used to share information between threads. We are going to make 4 instances of our Runnable class, and each of these threads is going to pop one element at a time off this queue and scrape the detailed information using the same code we already wrote. We will define another queue to save the extracted detailed information for every CPU. Eventually the input queue will be empty; at that point all Runnable instances can finish, and we will have our data stored in the output queue.

There are some advantages to this approach. The first is that the 4 threads work independently; it's like having four checkout lines in a grocery store instead of one, making the extraction process faster. And if there is a failure in any of the Runnable threads, that thread crashes without affecting the others; the remaining threads continue to work, making the process more reliable.

Let's take a look at the Runnable class.

# encoding: utf-8
# runnable.py

import queue
import threading

import requests
from bs4 import BeautifulSoup

input_cpus = queue.Queue()
output_cpus = queue.Queue()
fail_cpus = queue.Queue()

class Runnable:

    def __call__(self):

        message = "\nThread {} working hard!"

        def process_cpu(cpu):

            try:
                url = cpu["url"]

                response = requests.get(url)
                soup = BeautifulSoup(response.content, 'lxml')

                socket_tag = soup.select_one("div.left-desc-cpu p:nth-child(2)")
                clockspeed_tag = soup.select_one("div.left-desc-cpu p:nth-child(3)")
                turbospeed_tag = soup.select_one("div.left-desc-cpu p:nth-child(4)")
                avg_mark_tag = soup.select_one("div.right-desc > span:nth-child(3)")

                socket = socket_tag.text.replace("Socket:", "").strip()
                clockspeed = clockspeed_tag.text.replace("Clockspeed:", "").strip()
                turbospeed = turbospeed_tag.text.replace("Turbo Speed:", "").strip()
                avg_mark = int(avg_mark_tag.text.replace(",", ""))

                cpu["socket"] = socket
                cpu["clockspeed"] = clockspeed
                cpu["turbospeed"] = turbospeed
                cpu["avg_mark"] = avg_mark

                # publish the enriched cpu dict to the output queue
                output_cpus.put(cpu)

            except Exception:
                fail_cpus.put(cpu)

        while True:

            try:
                cpu = input_cpus.get(timeout=1)
            except queue.Empty:
                break

            print(message.format(id(self)))

            process_cpu(cpu)

In order for an instance of this class to be used as a thread target it has to be callable; we achieve this by defining the special method __call__.
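To illustrate the idea in isolation (a hypothetical Greeter class, not part of our scraper), any object defining __call__ can be passed as a thread target:

import threading

class Greeter:

    def __call__(self):
        print("Hello from", threading.current_thread().name)

thread = threading.Thread(target=Greeter())
thread.start()
thread.join()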

In this method we perform the same instructions as in the previous case, but now with a producer-consumer approach: the code consumes CPUs from an input queue and produces their scraped detailed information into an output queue, concurrently.

We use a perpetual while loop with a break statement for when there is no more information in our input queue: once the queue has been empty for the one-second timeout, get raises queue.Empty and the worker exits.

This class is defined in its own module and we will import it in our main module.

Defining Threads

We are going to adapt our main script by replacing the crawling section with multiple thread definitions and executions; the final code looks like this.

# encoding: utf-8
# web_scraping_multithreading_cpu.py

'''
    Script for scraping high end cpus on www.cpubenchmark.net
    https://www.cpubenchmark.net/high_end_cpus.html
'''

import time
import threading

import requests
from bs4 import BeautifulSoup

from runnable import Runnable, input_cpus, output_cpus

main_url = "https://www.cpubenchmark.net/high_end_cpus.html"
response = requests.get(main_url)
soup = BeautifulSoup(response.content, features='lxml')

cpu_rows = soup.select("#mark .chartlist li")

cpus = list()

base_url = "https://www.cpubenchmark.net/"

for row in cpu_rows:

    # row scraping

    name_tag = row.select_one("a span.prdname")
    link_tag = row.select_one("a")
    score_tag = row.select_one("a span.count")
    price_tag = row.select_one("a span.price-neww")

    data = dict()

    data["name"] = name_tag.text
    data["url"] = base_url + link_tag["href"]
    data["score"] = int(score_tag.text.replace(",", ""))
    data["price"] = price_tag.text

    input_cpus.put(data)

tick = time.time()

threads = list()

for i in range(4):

    thread = threading.Thread(target=Runnable())
    thread.start()
    threads.append(thread)

for thread in threads:

    thread.join()

tock = time.time()
print("Took {} seconds..".format(tock - tick))

Pretty much the same code, but at the end we have defined some threads, each using an instance of Runnable as its target.

By the end of the execution of this script you will have all the detailed information stored in the output_cpus queue.
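If you want to move the results into a regular list and persist them, a minimal sketch (assuming a JSON file is an acceptable destination; the file name is hypothetical) could be:

import json

results = list()
while not output_cpus.empty():
    results.append(output_cpus.get())

# hypothetical output file name
with open("cpus.json", "w") as f:
    json.dump(results, f, indent=2)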

Conclusion

This time, on my computer, it took 179 seconds, approximately 3 minutes: less than a third of the time taken without concurrency.

If we increase the number of threads from 4 to 8, the time drops to 98 seconds, approximately a minute and a half. But we cannot keep increasing the number of threads indefinitely, even if our computer can handle it; we have to use a balanced number. This way we avoid the server blocking us for performing too many requests from the same IP address in a short period of time, since some servers have security mechanisms that treat that kind of traffic as a network attack.
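One simple way to keep that balance, sketched here as a hypothetical helper you could call in place of requests.get inside process_cpu, is to add a short pause after every request:

import time
import requests

def polite_get(url, delay=0.5):
    """Fetch a URL, then pause briefly so each thread spaces
    out its requests. The delay value here is arbitrary."""
    response = requests.get(url)
    time.sleep(delay)
    return response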

We have come to the end of this post; I hope you found it helpful. Thanks for stopping by.
