How to build a web crawler?

In the era of big data, web scraping is a life saver.
To save even more time, you can couple ScrapingBot with a web crawling bot.


What is a web crawler?

A crawler, also called a spider, is an internet bot that indexes and visits every URL it encounters. Its goal is to visit a website from end to end, know what is on every webpage, and be able to find the location of any piece of information. The best-known web crawlers are the search engine bots, the GoogleBot for example. When a website goes online, these crawlers visit it and read its content so it can be displayed in the relevant search result pages.



How does a web crawler work?

Starting from the root URL or a set of entry points, the crawler fetches each webpage and looks for other URLs to visit, called seeds, in its content. All the seeds found on a page are added to its list of URLs to be visited. This list is called the horizon. The crawler organises the links into two lists: URLs to visit and URLs already visited. It keeps visiting links until the horizon is empty.
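This loop can be sketched in a few lines of Node.js. Here, `extractLinks` is a hypothetical stand-in for fetching a page and returning the URLs found on it:

```javascript
// Minimal sketch of the horizon / visited-URLs bookkeeping.
// `extractLinks` is a hypothetical callback: given a URL, it
// returns the list of URLs found on that page.
function crawl(rootUrl, extractLinks) {
  const horizon = [rootUrl];   // URLs to be visited
  const visited = new Set();   // URLs already visited

  while (horizon.length > 0) {
    const url = horizon.shift(); // FIFO order = breadth-first crawl
    if (visited.has(url)) continue;
    visited.add(url);
    // every seed found on the page joins the horizon
    for (const link of extractLinks(url)) {
      if (!visited.has(link)) horizon.push(link);
    }
  }
  return visited;
}
```

The crawl ends exactly when the horizon is empty, as described above.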

Because the list of seeds can be very long, the crawler has to organise it according to several criteria, and prioritise which URLs to visit first and which to revisit. To determine which pages are more important to crawl, the bot considers how many links point to a URL and how often regular users visit it.



What is the difference between a web scraper and a web crawler?

Crawling, by definition, always implies the web. A crawler’s purpose is to follow links to reach numerous pages and analyse their metadata and content.

Scraping, on the other hand, is also possible outside the web: for example, you can retrieve information from a database. Scraping means pulling data from the web or from a database.



Why do you need a web crawler?

With web scraping, you save a huge amount of time by automatically retrieving the information you need instead of looking for it and copying it manually. However, you still need to scrape page after page. Web crawling lets you collect, organise and visit all the pages reachable from the root page, with the possibility of excluding some links. The root page can be a search result or a category page.

For example, you can pick a product category or a search result page from Amazon as an entry point, crawl it to scrape all the product details, and limit the crawl to the first 10 pages of suggested products.



How to build a web crawler?

The first thing you need is two lists of URLs:

  • Visited URLs
  • URLs to be visited (queue)


To avoid crawling the same page over and over, a URL needs to move automatically to the visited URLs list once you’ve finished crawling it. On each webpage, you will find new URLs. Most of them will be added to the queue, but some might not add any value for your purpose. That is why you also need to set rules for URLs you’re not interested in.
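Such exclusion rules can be expressed as a small list of patterns tested against each candidate URL. A minimal sketch, where the patterns themselves are only hypothetical examples:

```javascript
// Hypothetical exclusion rules: URLs matching any of these
// patterns add no value for our purpose and are skipped.
const excludeRules = [
  /\.(jpg|png|gif|css|js)$/i, // static assets
  /\/login/,                  // login pages
  /\?sort=/                   // duplicate listings with sort parameters
];

// Returns true if the URL matches none of the exclusion rules.
function isAllowed(url) {
  return !excludeRules.some((rule) => rule.test(url));
}
```

Only URLs for which `isAllowed` returns true would then be added to the queue.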

Deduplication is a critical part of web crawling. On some websites, and particularly on e-commerce ones, a single webpage can have multiple URLs. As you want to scrape each page only once, the best way to do so is to look for the canonical tag in the code. All the pages with the same content share this canonical URL, and it is the only link you need to crawl and scrape.

Here’s an example of a canonical tag in HTML:

<link rel="canonical" href="https://scraping-bot.io/how-to-build-a-crawler">
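A crawler can extract this canonical URL from the fetched HTML before deciding whether the page is a duplicate. A minimal sketch, using a regular expression for brevity (production code should use a real HTML parser such as cheerio, as in the full example later in this article):

```javascript
// Returns the canonical URL declared in the page's HTML, or the
// page's own URL when no canonical tag is present.
// Note: the regex assumes the rel attribute appears before href,
// which is the common case; an HTML parser handles all cases.
function canonicalUrl(html, pageUrl) {
  const match = html.match(
    /<link[^>]*rel=["']canonical["'][^>]*href=["']([^"']+)["']/i
  );
  return match ? match[1] : pageUrl;
}
```

Deduplicating on the value returned here ensures each piece of content is scraped only once, regardless of how many URLs point to it.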


Here are the basic steps to build a crawler:

  • Step 1: Add one or several URLs to the list of URLs to be visited.
  • Step 2: Pop a link from the URLs to be visited and add it to the Visited URLs list.
  • Step 3: Fetch the page’s content and scrape the data you’re interested in with the ScrapingBot API.
  • Step 4: Parse all the URLs present on the page, and add them to the URLs to be visited if they match the rules you’ve set and don’t match any of the Visited URLs.
  • Step 5: Repeat steps 2 to 4 until the URLs to be visited list is empty.


NB: Steps 1 and 2 must be synchronised.


As with web scraping, there are rules to respect when crawling a website. The robots.txt file specifies which areas of the site map should not be visited by crawlers. The crawler should also avoid overloading the website by limiting its crawling rate, to maintain a good experience for human users. Otherwise, the scraped website could decide to block the crawler’s IP address or take other measures.
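A robots.txt check can be sketched as follows. This is a deliberately simplified sketch: it only honours `Disallow` rules in the `User-agent: *` group, whereas a real crawler should also handle per-agent groups and `Allow` rules:

```javascript
// Collects the Disallow paths that apply to all user agents ("*").
function disallowedPaths(robotsTxt) {
  const paths = [];
  let applies = false;
  for (const line of robotsTxt.split('\n')) {
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    switch (field.trim().toLowerCase()) {
      case 'user-agent':
        applies = value === '*'; // a new group starts here
        break;
      case 'disallow':
        if (applies && value) paths.push(value);
        break;
    }
  }
  return paths;
}

// Returns false if the path falls under a Disallow rule.
function isPathAllowed(robotsTxt, path) {
  return !disallowedPaths(robotsTxt).some((p) => path.startsWith(p));
}
```

The crawler would fetch `/robots.txt` once per domain and run every candidate path through this check before queueing it.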

Here is a crawler example using the ScrapingBot API, with only two dependencies: request and cheerio.
You need at least Node.js 8, because the code uses async/await.

const request = require("request");
const util = require("util");
const rp = util.promisify(request);
const sleep = util.promisify(setTimeout);
const cheerio = require('cheerio');
const { URL } = require('url');

let seenLinks = {};

let rootNode = {};
let currentNode = {};

let linksQueue = [];
let printList = [];

let previousDepth = 0;
let maxCrawlingDepth = 5;

let options = null;
let mainDomain = null;
let mainParsedUrl = null;

class CreateLink {
  constructor(linkURL, depth, parent) {
    this.url = linkURL;
    this.depth = depth;
    this.parent = parent;
    this.children = [];
  }
}
//your ScrapingBot credentials
let username = "yourUsername",
    apiKey = "yourApiKey",
    apiEndPoint = "http://api.scraping-bot.io/scrape/raw-html",
    auth = "Basic " + Buffer.from(username + ":" + apiKey).toString("base64");

let requestOptions = {
  method: 'POST',
  url: apiEndPoint,
  json: {
    url: "this will be replaced in the findLinks function",
    //scraping-bot options
      options: {
          useChrome:false, //set to true to use headless Chrome; WARNING: two api calls will be consumed for this option
          premiumProxy:false, //set to true to use premium proxies (e.g. to unblock Amazon or LinkedIn; consumes 10 calls)
      }
  },
  headers: {
      Accept: 'application/json',
      Authorization : auth
  }
}

//Start the application: put here the address where you want to start crawling
//the second parameter is the depth: with 1 it will scrape all the links found on the first page, but not the ones found on other pages
//with 2 it will scrape all links on the first page and all links found on second-level pages; be careful, on a huge website this can represent tons of pages to scrape
//it is recommended to limit the depth to 5 levels
crawlBFS("https://www.scraping-bot.io/", 1);

async function crawlBFS(startURL, maxDepth = 5) {
  try {
    mainParsedUrl = new URL(startURL);
  } catch (e) {
    console.log("URL is not valid", e);
    return;
  }

  mainDomain = mainParsedUrl.hostname;

  maxCrawlingDepth = maxDepth;
  let startLinkObj = new CreateLink(startURL, 0, null);
  rootNode = currentNode = startLinkObj;
  addToLinkQueue(currentNode);
  await findLinks(currentNode);
}

//
async function crawl(linkObj) {
  //Add logs here if needed!
  //console.log(`Checking URL: ${options.url}`);
  await findLinks(linkObj);
}

//The goal is to get the HTML and look for the links inside the page.
async function findLinks(linkObj) {
  //let's set the URL we want to scrape
  requestOptions.json.url = linkObj.url
  console.log("Scraping URL : " + linkObj.url);
  let response
  try {
    response = await rp(requestOptions);
    if (response.statusCode !== 200) {
      if (response.statusCode === 401 || response.statusCode === 405) {
        console.log("authentication failed, check your credentials");
      } else {
        console.log("an error occurred, check the URL: " + response.statusCode, response.body);
      }
      return 
    }
    //response.body is the whole content of the page if you want to store some kind of data from the web page you should do it here
    let $ = cheerio.load(response.body);
    let links = $('body').find('a').filter(function (i, el) {
      return $(this).attr('href') != null;
    }).map(function (i, x) {
      return $(this).attr('href');
    });
    if (links.length > 0) {
      links.map(function (i, x) {
        let reqLink = checkDomain(x);
        if (reqLink) {
          if (reqLink != linkObj.url) {
            let newLinkObj = new CreateLink(reqLink, linkObj.depth + 1, linkObj);
            addToLinkQueue(newLinkObj);
          }
        }
      });
    } else {
      console.log("No more links found for " + linkObj.url);
    }
    let nextLinkObj = getNextInQueue();
    if (nextLinkObj && nextLinkObj.depth <= maxCrawlingDepth) {
      //random sleep
      //It is very important to make this long enough to avoid spamming the website you want to scrape
      //if you choose a short time you will potentially be blocked or kill the website you want to crawl
      //time is in milliseconds here
      let minimumWaitTime = 500; //half a second; these values are very low, in a real-world example you should use at least 30000 (30 seconds between each call)
      let maximumWaitTime = 5000; //max five seconds
      let waitTime = Math.round(minimumWaitTime + (Math.random() * (maximumWaitTime-minimumWaitTime)));
      console.log("wait for " + waitTime + " milliseconds");
      await sleep(waitTime);
      //next url scraping
      await crawl(nextLinkObj);
    } else {
      setRootNode();
      printTree();
    }
  } catch (err) {
    console.log("Something Went Wrong...", err);
  }
}

//Go all the way up and set RootNode to the parent node
function setRootNode() {
  while (currentNode.parent != null) {
    currentNode = currentNode.parent;
  }
  rootNode = currentNode;
}

function printTree() {
  addToPrintDFS(rootNode);
  console.log(printList.join("\n|"));
}

function addToPrintDFS(node) {
  let spaces = Array(node.depth * 3).join("-");
  printList.push(spaces + node.url);
  if (node.children) {
    node.children.forEach(function (child) {
      addToPrintDFS(child);
    });
  }
}

//Check if the domain belongs to the site being checked
function checkDomain(linkURL) {
  let parsedUrl;
  let fullUrl = true;
  try {
    parsedUrl = new URL(linkURL);
  } catch (error) {
    fullUrl = false;
  }
  if (fullUrl === false) {
    if (linkURL.indexOf("/") === 0) {
      //relative to domain url
      return mainParsedUrl.protocol + "//" + mainParsedUrl.hostname + linkURL.split("#")[0];
    } else if (linkURL.indexOf("#") === 0) {
      //anchor link, skip it
      return
    } else {
      //relative url: resolve against the current page's path
      let path = currentNode.url.match(/.*\//)[0];
      return path + linkURL;
    }
  }

  let mainHostDomain = parsedUrl.hostname;

  if (mainDomain == mainHostDomain) {
    //console.log("returning Full Link: " + linkURL);
    parsedUrl.hash = "";
    return parsedUrl.href;
  } else {
    return;
  }
}

function addToLinkQueue(linkobj) {
  if (!linkInSeenListExists(linkobj)) {
    if (linkobj.parent != null) {
      linkobj.parent.children.push(linkobj);
    }
    linksQueue.push(linkobj);
    addToSeen(linkobj);
  }
}

function getNextInQueue() {
  let nextLink = linksQueue.shift();
  if (nextLink && nextLink.depth > previousDepth) {
    previousDepth = nextLink.depth;
    console.log(`------- CRAWLING ON DEPTH LEVEL ${previousDepth} --------`);
  }
  return nextLink;
}

function peekInQueue() {
  return linksQueue[0];
}

//Adds links we've visited to the seenList
function addToSeen(linkObj) {
  seenLinks[linkObj.url] = linkObj;
}

//Returns whether the link has been seen.
function linkInSeenListExists(linkObj) {
  return seenLinks[linkObj.url] != null;
}