How to build a web crawler?

Welcome to our blog post on building a web crawler! In this post, we will take you through the process of creating your own web crawler, step by step. We will cover everything from the basics of what a web crawler is and why it's useful, to the technical details of how to build one.

Whether you're a developer looking to add web crawling functionality to your projects or simply interested in learning more about how the internet works, this post is for you. By the end of this post, you will have a solid understanding of how to build a web crawler and be able to start experimenting with your own projects. So, let's get started!

In order to save time and maximize efficiency, it's a great idea to couple a scraping tool like ScrapingBot with a web crawling bot.

Do you want to use a scraping tool with your web crawling bot?

This allows the web crawler to first gather a list of URLs to scrape, and then the scraping tool can quickly and easily extract the desired data from those pages. By using both a web crawler and a scraping tool together, you can automate the process of collecting data from multiple websites, saving you a significant amount of time and effort.

What is a web crawler?

A crawler, or spider, is an internet bot that indexes and visits every URL it encounters. Its goal is to visit a website from end to end, know what is on every webpage and be able to find the location of any piece of information. The best-known web crawlers are the search engine ones, the GoogleBot for example. When a website is online, those crawlers visit it and read its content to display it in the relevant search result pages.

How does a web crawler work?

Starting from a root URL or a set of entry points, the crawler fetches the webpages and looks for other URLs to visit, called seeds. Every seed found on a page is added to the crawler's list of URLs to visit, known as the horizon. The crawler keeps its links organized in two lists: those still to be visited and those already visited. It keeps visiting links until the horizon is empty.

Because the horizon can grow very long, the crawler has to prioritize which URLs to visit first and which ones to revisit. To decide which pages are more important to crawl, the bot considers factors such as how many links point to a given URL and how often the page is visited by regular users, and focuses its efforts on those pages.
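
To make the horizon and the visited list concrete, here is a minimal sketch of that loop in Node.js. The fetchPage and extractLinks helpers are placeholders for any HTTP client and HTML parser (they are not part of the full example at the end of this post).

// Minimal sketch of the crawl loop: a queue (the horizon) and a set of visited URLs.
// fetchPage() and extractLinks() are placeholders for your HTTP client and HTML parser.
async function crawlSite(rootUrl) {
  const horizon = [rootUrl];     // URLs still to be visited
  const visited = new Set();     // URLs already visited

  while (horizon.length > 0) {
    const url = horizon.shift(); // take the next URL from the horizon
    if (visited.has(url)) continue;
    visited.add(url);

    const html = await fetchPage(url);        // placeholder: download the page
    for (const link of extractLinks(html)) {  // placeholder: find seeds on the page
      if (!visited.has(link)) horizon.push(link);
    }
  }
  return visited;
}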

Test ScrapingBot with your web crawler now for FREE!

What is the difference between a web scraper and a web crawler?

Crawling, by definition, always implies the web. A crawler’s purpose is to follow links to reach numerous pages and analyze their metadata and content.

Scraping, on the other hand, is also possible outside of the web. For example, you can retrieve information from a database. Scraping simply means pulling data, whether from the web or from a database.

Why do you need a web crawler?

Web scraping is a powerful tool that can save you a significant amount of time by automatically collecting the information you need from websites, without the need for manual data entry. However, scraping on its own still requires you to know and visit each page individually, which can be time-consuming.

Web crawling offers a solution to this problem by allowing you to collect, organize and visit all of the pages linked from a specific starting point, known as the root page. This can be a search result page or a category page on a website. With web crawling, you also have the option to exclude certain links that you don't need to scrape, making the process more efficient.

For example, you can use a product category or a search result page from Amazon as the root page, and then crawl through all the linked pages to scrape product details. You can even limit the number of pages to crawl, such as the first 10 pages of suggested products. This way you can easily extract the data you need and save a lot of time.
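
As a rough sketch of that idea, the snippet below only queues links that look like product pages and stops once a page budget is reached. The maxPages value and the /dp/ pattern are assumptions for illustration; adapt them to the site you crawl.

// Sketch: only queue product-looking links, and stop once a page budget is reached.
const maxPages = 10;   // e.g. crawl at most the first 10 suggested product pages
let pagesCrawled = 0;  // incremented each time a page is actually scraped

function shouldEnqueue(url) {
  if (pagesCrawled >= maxPages) return false; // budget reached, stop adding links
  return /\/dp\//.test(url);                  // hypothetical rule: Amazon product URLs contain /dp/
}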

How to build a web crawler?

The first thing you need to do is to set up two lists of URLs:

  • Visited URLs
  • URLs to be visited (queue)


To avoid crawling the same page over and over, a URL needs to automatically move to the Visited URLs list once you’ve finished crawling it. On each webpage, you will find new URLs. Most of them will be added to the queue, but some of them might not add any value for your purpose. That is why you also need to set rules for the URLs you’re not interested in.
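
One common way to express such rules is a small list of deny patterns that every candidate URL is checked against before it enters the queue. The patterns below are only examples; use whatever matches the pages you do not care about.

// Sketch: skip URLs matching patterns you are not interested in.
const denyPatterns = [
  /\/login/i,                  // authentication pages
  /\/cart/i,                   // shopping cart pages
  /\.(jpg|png|gif|css|js)$/i   // static assets rather than HTML pages
];

function isWanted(url) {
  return !denyPatterns.some((pattern) => pattern.test(url));
}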

Deduplication is a critical part of web crawling. On some websites, and particularly on e-commerce ones, a single webpage can have multiple URLs. As you want to scrape this page only once, the best way to do so is to look for the canonical tag in the code. All the pages with the same content will have this common canonical URL, and this is the only link you will have to crawl and scrape.

Here’s an example of a canonical tag in HTML:

<link rel="canonical" href="https://scraping-bot.io/how-to-build-a-crawler">
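
With cheerio (the HTML parser used in the full example below), reading that tag could look like this minimal sketch: if the page declares a canonical URL, deduplicate on it instead of the URL you actually fetched.

// Sketch: prefer the canonical URL for deduplication when the page declares one.
const cheerio = require('cheerio');

function canonicalOrSelf(html, fetchedUrl) {
  const $ = cheerio.load(html);
  const canonical = $('link[rel="canonical"]').attr('href');
  return canonical || fetchedUrl; // fall back to the URL that was actually fetched
}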

Here are the basic steps to build a crawler:

  • Step 1: Add one or several URLs to be visited.
  • Step 2: Pop a link from the URLs to be visited and add it to the Visited URLs list.
  • Step 3: Fetch the page’s content and scrape the data you’re interested in with the ScrapingBot API. 
  • Step 4: Parse all the URLs present on the page, and add them to the URLs to be visited if they match the rules you’ve set and don’t match any of the Visited URLs. 
  • Step 5: Repeat steps 2 to 4 until the URLs to be visited list is empty. 


NB: Steps 1 and 2 must be synchronized.

As with web scraping, there are some rules to respect when crawling a website. The robots.txt file specifies whether some areas of the site should not be visited by a crawler. The crawler should also avoid overloading a website by limiting its crawling rate, to maintain a good experience for human users. Otherwise, the website being crawled could decide to block the crawler’s IP or take other measures.
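
As a very simplified illustration (it ignores user-agent groups, Allow rules and wildcards, so it is not a full robots.txt parser), checking a path against the Disallow lines of a robots.txt file could look like this:

// Simplified sketch: check a path against the Disallow lines of a robots.txt file.
// A real crawler should also honour user-agent groups, Allow rules and wildcards.
function isDisallowed(robotsTxt, path) {
  return robotsTxt
    .split("\n")
    .filter((line) => line.trim().toLowerCase().startsWith("disallow:"))
    .map((line) => line.split(":")[1].trim())
    .some((rule) => rule !== "" && path.startsWith(rule));
}

// Example: isDisallowed("User-agent: *\nDisallow: /private/", "/private/page") === true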

Below is a crawler example using the ScrapingBot API, with only two dependencies: request and cheerio.
You need at least Node.js 8 because of the use of async/await.

const request = require("request");
const util = require("util");
const rp = util.promisify(request);
const sleep = util.promisify(setTimeout);
const cheerio = require('cheerio');
const { URL } = require('url');

let seenLinks = {};

let rootNode = {};
let currentNode = {};

let linksQueue = [];
let printList = [];

let previousDepth = 0;
let maxCrawlingDepth = 5;

let options = null;
let mainDomain = null;
let mainParsedUrl = null;

class CreateLink {
  constructor(linkURL, depth, parent) {
    this.url = linkURL;
    this.depth = depth;
    this.parent = parent;
    this.children = [];
  }
}
//your scraping bot credentials
let username = "yourUsername",
    apiKey = "yourApiKey",
    apiEndPoint = "http://api.scraping-bot.io/scrape/raw-html",
    auth = "Basic " + Buffer.from(username + ":" + apiKey).toString("base64");

let requestOptions = {
  method: 'POST',
  url: apiEndPoint,
  json: {
    url: "this will be replaced in the findLinks function",
    //ScrapingBot options
      options: {
          useChrome:false, //set to true to use headless Chrome. WARNING: two API calls will be consumed for this option
          premiumProxy:false, //set to true to use premium proxies to unblock Amazon, LinkedIn, etc. (consuming 10 API calls)
      }
  },
  headers: {
      Accept: 'application/json',
      Authorization : auth
  }
}

//Start the application: put here the address where you want to start your crawling
//The second parameter is the depth: with 1, it will scrape all the links found on the first page, but not the ones found on other pages
//With 2, it will scrape all links on the first page and all links found on second-level pages; be careful, on a huge website this can represent tons of pages to scrape
//It is recommended to limit the depth to 5 levels
crawlBFS("https://www.scraping-bot.io/", 1);

async function crawlBFS(startURL, maxDepth = 5) {
  try {
    mainParsedUrl = new URL(startURL);
  } catch (e) {
    console.log("URL is not valid", e);
    return;
  }

  mainDomain = mainParsedUrl.hostname;

  maxCrawlingDepth = maxDepth;
  let startLinkObj = new CreateLink(startURL, 0, null);
  rootNode = currentNode = startLinkObj;
  addToLinkQueue(currentNode);
  await findLinks(currentNode);
}

//Crawl a link: fetch the page and look for new links to add to the queue
async function crawl(linkObj) {
  //Add logs here if needed!
  //console.log(`Checking URL: ${options.url}`);
  await findLinks(linkObj);
}

//The goal is to get the HTML and look for the links inside the page.
async function findLinks(linkObj) {
  //let's set the URL we want to scrape
  requestOptions.json.url = linkObj.url;
  console.log("Scraping URL : " + linkObj.url);
  let response
  try {
    response = await rp(requestOptions);
    if (response.statusCode !== 200) {
      if (response.statusCode === 401 || response.statusCode === 405) {
        console.log("autentication failed check your credentials");
      } else {
        console.log("an error occurred check the URL" + response.statusCode, response.body);
      }
      return 
    }
    //response.body is the whole content of the page; if you want to store some data from the webpage, this is the place to do it
    let $ = cheerio.load(response.body);
    let links = $('body').find('a').filter(function (i, el) {
      return $(this).attr('href') != null;
    }).map(function (i, x) {
      return $(this).attr('href');
    });
    if (links.length > 0) {
      links.map(function (i, x) {
        let reqLink = checkDomain(x);
        if (reqLink) {
          if (reqLink != linkObj.url) {
            let newLinkObj = new CreateLink(reqLink, linkObj.depth + 1, linkObj);
            addToLinkQueue(newLinkObj);
          }
        }
      });
    } else {
      console.log("No more links found for " + requestOptions.url);
    }
    let nextLinkObj = getNextInQueue();
    if (nextLinkObj && nextLinkObj.depth <= maxCrawlingDepth) {
      //random sleep
      //It is very important to make this long enough to avoid spamming the website you want to scrape
      //if you choose a time that is too short, you may get blocked or even take down the website you want to crawl
      //time is in milliseconds here
      let minimumWaitTime = 500; //half a second; these values are very low, in a real-world example you should use at least 30000 (30 seconds between each call)
      let maximumWaitTime = 5000; //max five seconds
      let waitTime = Math.round(minimumWaitTime + (Math.random() * (maximumWaitTime-minimumWaitTime)));
      console.log("wait for " + waitTime + " milliseconds");
      await sleep(waitTime);
      //next url scraping
      await crawl(nextLinkObj);
    } else {
      setRootNode();
      printTree();
    }
  } catch (err) {
    console.log("Something Went Wrong...", err);
  }
}

//Go all the way up and set RootNode to the parent node
function setRootNode() {
  while (currentNode.parent != null) {
    currentNode = currentNode.parent;
  }
  rootNode = currentNode;
}

function printTree() {
  addToPrintDFS(rootNode);
  console.log(printList.join("\n|"));
}

function addToPrintDFS(node) {
  let spaces = Array(node.depth * 3).join("-");
  printList.push(spaces + node.url);
  //recursively add every child node to the print list
  node.children.forEach(function (child) {
    addToPrintDFS(child);
  });
}

//Check if the domain belongs to the site being checked
function checkDomain(linkURL) {
  let parsedUrl;
  let fullUrl = true;
  try {
    parsedUrl = new URL(linkURL);
  } catch (error) {
    fullUrl = false;
  }
  if (fullUrl === false) {
    if (linkURL.indexOf("/") === 0) {
      //relative to domain url
      return mainParsedUrl.protocol + "//" + mainParsedUrl.hostname + linkURL.split("#")[0];
    } else if (linkURL.indexOf("#") === 0) {
      //anchor avoid link
      return
    } else {
      //relative url
      let path = currentNode.url.match('.*\/')[0]
      return path + linkURL;
    }
  }

  let mainHostDomain = parsedUrl.hostname;

  if (mainDomain == mainHostDomain) {
    //console.log("returning Full Link: " + linkURL);
    parsedUrl.hash = "";
    return parsedUrl.href;
  } else {
    return;
  }
}

function addToLinkQueue(linkobj) {
  if (!linkInSeenListExists(linkobj)) {
    if (linkobj.parent != null) {
      linkobj.parent.children.push(linkobj);
    }
    linksQueue.push(linkobj);
    addToSeen(linkobj);
  }
}

function getNextInQueue() {
  let nextLink = linksQueue.shift();
  if (nextLink && nextLink.depth > previousDepth) {
    previousDepth = nextLink.depth;
    console.log(`------- CRAWLING ON DEPTH LEVEL ${previousDepth} --------`);
  }
  return nextLink;
}

function peekInQueue() {
  return linksQueue[0];
}

//Adds a link to the seen list so it is never queued twice
function addToSeen(linkObj) {
  seenLinks[linkObj.url] = linkObj;
}

//Returns whether the link has been seen.
function linkInSeenListExists(linkObj) {
  return seenLinks[linkObj.url] == null ? false : true;
}