How to build a web crawler?

In the era of big data, web scraping is a lifesaver.
To save even more time, you can pair ScrapingBot with a web crawling bot.


What is a web crawler?

A crawler, or spider, is an internet bot that visits and indexes every URL it encounters. Its goal is to visit a website from end to end, know what is on every webpage and be able to find the location of any information. The best-known web crawlers are those run by search engines, Googlebot for example. Once a website is online, those crawlers visit it and read its content so that it can be displayed in the relevant search result pages.

How does a web crawler work?

Starting from the root URL or a set of entries, the crawler fetches the webpages and finds other URLs to visit, called seeds, on each page. All the seeds found on a page are added to its list of URLs to be visited. This list is called the horizon. The crawler organises the links in two lists: the ones to visit, and the ones already visited. It keeps visiting links until the horizon is empty.

Because the list of seeds can be very long, the crawler has to organise them according to several criteria, and prioritise which ones to visit first and which to revisit. To know which pages are more important to crawl, the bot considers how many links point to a URL and how often it is visited by regular users.
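As a rough illustration of the first criterion, the sketch below orders URLs by how many inbound links they have, using a priority queue. The `link_graph` dictionary and the function name are hypothetical, and real crawlers combine many more signals (visit frequency, for instance, is not available in this toy example).

```python
import heapq

def order_by_inbound_links(link_graph):
    """Return URLs ordered from most to least inbound links.

    link_graph maps each page to the list of pages it links to
    (a hypothetical in-memory stand-in for data a crawler would collect).
    """
    inbound = {}
    for source, targets in link_graph.items():
        inbound.setdefault(source, 0)
        for target in targets:
            inbound[target] = inbound.get(target, 0) + 1

    # heapq is a min-heap, so the count is negated to pop the most
    # linked-to URL first; ties are broken alphabetically by URL.
    heap = [(-count, url) for url, count in inbound.items()]
    heapq.heapify(heap)

    ordered = []
    while heap:
        _, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered

link_graph = {
    "/home": ["/products", "/about"],
    "/products": ["/about", "/home"],
    "/about": ["/products"],
}
print(order_by_inbound_links(link_graph))
```

Here "/about" and "/products" each receive two links and "/home" only one, so "/home" is scheduled last.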

What is the difference between a web scraper and a web crawler?

Crawling, by definition, always implies the web. A crawler’s purpose is to follow links to reach numerous pages and analyze their metadata and content.

Scraping is also possible outside the web. For example, you can retrieve information from a database. Scraping means pulling data from the web or from a database.

Why do you need a web crawler?

With web scraping, you save a huge amount of time by automatically retrieving the information you need instead of looking for it and copying it manually. However, you still need to scrape page after page. Web crawling allows you to collect, organize and visit all the pages linked from the root page, with the possibility to exclude some links. The root page can be a search result or a category.

For example, you can pick a product category or a search result page from Amazon as an entry, crawl it to scrape all the product details, and limit it to the first 10 pages, along with the suggested products.

How to build a web crawler?

The first thing you need is two lists:

  • Visited URLs
  • URLs to be visited (queue)

To avoid crawling the same page over and over, each URL needs to move automatically to the visited URLs list once you’ve finished crawling it. On each webpage, you will find new URLs. Most of them will be added to the queue, but some of them might not add any value for your purpose. This is why you also need to set rules for URLs you’re not interested in.
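Such rules could look like the minimal sketch below. The host name, the excluded path prefixes and the function name `should_enqueue` are assumptions made for illustration; your own rules will depend on the site you crawl.

```python
from urllib.parse import urlparse

ALLOWED_HOST = "example.com"             # hypothetical target site
EXCLUDED_PREFIXES = ("/login", "/cart")  # hypothetical paths with no value for us

def should_enqueue(url, visited, queued):
    """Return True if the URL passes the rules and has not been seen yet."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, etc.
    if parts.netloc != ALLOWED_HOST:
        return False  # stay on the site being crawled
    if parts.path.startswith(EXCLUDED_PREFIXES):
        return False  # rule: URLs we are not interested in
    return url not in visited and url not in queued
```

Keeping `visited` and `queued` as sets makes the membership checks fast even when the crawl grows large.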

Deduplication is a critical part of web crawling. On some websites, and particularly on e-commerce ones, a single webpage can have multiple URLs. As you want to scrape this page only once, the best way to do so is to look for the canonical tag in the page’s HTML. All the pages with the same content should share the same canonical URL, and this is the only link you will have to crawl and scrape.

Here’s an example of a canonical tag in HTML:

<link rel="canonical" href="">
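One possible way to read that tag is with Python's standard `html.parser` module, as in the sketch below. The class and function names are hypothetical, and the example URL is made up; the function returns None when a page declares no canonical URL.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, in any letter case
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]

def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

page = (
    '<html><head><link rel="stylesheet" href="/style.css">'
    '<link rel="canonical" href="https://example.com/shoes"></head></html>'
)
print(find_canonical(page))
```

A crawler can then use the returned URL, rather than the URL it happened to fetch, as the key in its visited list.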

Here are the basic steps to build a crawler:

  • Step 1: Add one or several URLs to be visited.
  • Step 2: Pop a link from the URLs to be visited and add it to the Visited URLs list.
  • Step 3: Fetch the page’s content and scrape the data you’re interested in with the ScrapingBot API.
  • Step 4: Parse all the URLs present on the page, and add them to the URLs to be visited if they match the rules you’ve set and don’t match any of the Visited URLs.
  • Step 5: Repeat steps 2 to 4 until the URLs to be visited list is empty.

NB: Steps 1 and 2 must be synchronised.
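The steps above can be sketched as a single-threaded loop, shown below. This is only an illustration under a few assumptions: `fetch` is a placeholder for whatever HTTP client or scraping API you use (it is not the ScrapingBot API itself), `is_allowed` stands in for your own rules, and the names `crawl` and `LinkExtractor` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start_urls, fetch, is_allowed):
    """Crawl from start_urls and return a dict of {url: html}.

    fetch(url) must return the page HTML, or None on failure.
    is_allowed(url) applies the rules for URLs worth visiting.
    """
    to_visit = deque(start_urls)   # Step 1: URLs to be visited
    visited = set()
    pages = {}

    while to_visit:                # Step 5: repeat until the queue is empty
        url = to_visit.popleft()   # Step 2: pop a link...
        if url in visited:
            continue               # the same URL may have been queued twice
        visited.add(url)           # ...and mark it as visited

        html = fetch(url)          # Step 3: fetch (and scrape) the page
        if html is None:
            continue
        pages[url] = html

        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:  # Step 4: queue new URLs
            # Resolve relative links and drop any #fragment
            absolute, _fragment = urldefrag(urljoin(url, href))
            if is_allowed(absolute) and absolute not in visited:
                to_visit.append(absolute)

    return pages
```

Because `fetch` is passed in as a parameter, the loop can be tried out against a small dictionary of fake pages before it is pointed at a real website.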

As with web scraping, there are rules to respect when crawling a website. The robots.txt file specifies which areas of the site should not be visited by a crawler. Also, the crawler should avoid overloading a website by limiting its crawling rate, to maintain a good experience for human users. Otherwise, the website being crawled could decide to block the crawler’s IP or take other measures.
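Both rules can be sketched with Python's standard `urllib.robotparser` module. In the example below, the user agent name, the default delay and the helper names are assumptions, and `fetch` is again a placeholder for your own download function. The robots.txt content is passed in as text so that the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler"  # hypothetical crawler name
DEFAULT_DELAY = 1.0        # assumed pause in seconds when the site sets none

def load_rules(robots_txt):
    """Parse the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

def delay_for(rules):
    """Use the site's Crawl-delay if it declares one, else the default."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_DELAY

def crawl_politely(urls, rules, fetch, sleep=time.sleep):
    """Fetch only the allowed URLs, pausing between requests."""
    delay = delay_for(rules)
    results = {}
    first_request = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # robots.txt asks crawlers to stay away
        if not first_request:
            sleep(delay)  # limit the crawling rate
        first_request = False
        results[url] = fetch(url)
    return results

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
rules = load_rules(robots_txt)
print(rules.can_fetch(USER_AGENT, "https://example.com/private/data"))
print(delay_for(rules))
```

With this robots.txt, anything under /private/ is skipped and the crawler waits two seconds between requests.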
