Web scraping · 10 min read · Published: 06/05/2026

How to Build a Web Crawler: A Step-by-Step Guide

Building a web crawler is one of the most practical skills you can develop as a developer working with data. Rather than manually visiting pages one by one, a web crawler automates the entire process — following links, discovering URLs, and feeding them to a scraper. In this guide, you will learn exactly how a web crawler works, how to build one from scratch in Node.js, and how to combine it with ScrapingBot to extract structured data at scale.

1. What is a web crawler?

A web crawler — also called a spider or bot — is a program that systematically browses the internet by following hyperlinks from page to page. Starting from one or more entry URLs, it fetches each page, extracts all the links it finds, and adds them to a queue of pages to visit next. This process repeats until the queue is empty or a stopping condition is met.

The most well-known web crawlers are search engine bots such as Google's Googlebot or Bing's Bingbot. When you publish a new website, these crawlers will eventually find it, read its content, and index it so it appears in search results. Beyond search engines, however, developers use web crawlers daily for data collection, competitive intelligence, price monitoring, and more.

💡 Key concept: A web crawler discovers URLs. A web scraper extracts data from those URLs. The two tools work best together.

2. Web crawler vs web scraper — what's the difference?

These two terms are often confused, but they serve different purposes:

| Web Crawler | Web Scraper |
| --- | --- |
| Follows links to discover pages | Extracts data from specific pages |
| Always works on the web | Can work on the web or any data source |
| Builds a list of URLs | Parses page content into structured data |
| Output: a list of URLs | Output: JSON, CSV, database records |

In practice, a crawler and a scraper are typically used together: the crawler discovers all the product pages on an e-commerce site, and the scraper then extracts the price, title, and description from each one.

3. How does a web crawler work?

Understanding the internal mechanics of a crawler will help you build a reliable one. At its core, a crawler manages two lists:

  • The queue (also called the horizon) — URLs waiting to be visited
  • The visited set — URLs that have already been crawled

The crawling loop

Here is the basic flow, step by step:

| Step | Action |
| --- | --- |
| 1 | Add the root URL(s) to the queue |
| 2 | Pop the first URL from the queue |
| 3 | Add it to the visited set |
| 4 | Fetch the page content |
| 5 | Extract all links from the page |
| 6 | For each link: if not already visited and matches your rules → add to queue |
| 7 | Repeat from step 2 until the queue is empty |

URL prioritization

To prioritize which URLs to visit first, more advanced crawlers take into account signals such as the number of inbound links pointing to a URL or the frequency at which regular users visit the page. Consequently, the most important pages are crawled first, even when the queue contains thousands of URLs.
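
As an illustration, here is a minimal priority-queue sketch. The `score` field is a placeholder for whatever signal you track — in this hypothetical version, a count of inbound links discovered so far:

```javascript
// Minimal priority queue sketch: URLs with higher scores are crawled first.
// The score here stands in for any ranking signal, e.g. inbound-link count.
class PriorityQueue {
  constructor() {
    this.items = []; // Entries of shape { url, score }
  }

  push(url, score = 0) {
    this.items.push({ url, score });
  }

  // Bump a queued URL's score when another page links to it
  addInboundLink(url) {
    const item = this.items.find(i => i.url === url);
    if (item) item.score += 1;
  }

  // Pop the URL with the highest score (a linear scan keeps the sketch simple;
  // a real crawler would use a heap for large queues)
  pop() {
    if (this.items.length === 0) return null;
    let best = 0;
    for (let i = 1; i < this.items.length; i++) {
      if (this.items[i].score > this.items[best].score) best = i;
    }
    return this.items.splice(best, 1)[0].url;
  }
}
```

With this in place, step 2 of the crawling loop becomes `queue.pop()` instead of taking the first element, and everything else stays the same.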

4. Why do you need a web crawler?

Web scraping alone requires you to know every URL you want to scrape in advance. For small, well-defined datasets, this works fine. However, when dealing with large websites — e-commerce catalogues, news archives, job boards — manually listing every page is impossible.

A web crawler solves this by automating URL discovery. For instance, you can point your crawler at a product category page on Amazon, and it will automatically find and queue every product page linked from there.

Additionally, you can set rules to exclude irrelevant pages — login pages, cart pages, pagination — so only the pages you care about end up in your scraping queue. As a result, you save hours of manual work and collect far more complete datasets.

5. How to build a web crawler

Data structures you need

Before writing any code, set up two core data structures:

  • A queue — use an array or a proper queue structure to store URLs to visit. A FIFO (first in, first out) queue gives you breadth-first crawling, which is usually what you want.
  • A visited set — use a Set or hash map so URL lookups are O(1). This is critical for performance at scale.
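
A minimal sketch of that state, using a plain array for the queue and a `Set` for visited lookups:

```javascript
// Core crawler state: a FIFO queue of discovered URLs and a Set for
// O(1) "have we seen this?" lookups.
const queue = ['https://example.com/']; // URLs waiting to be visited
const visited = new Set();              // URLs already crawled

function enqueue(url) {
  // Only queue a URL once: skip it if already visited or already queued
  if (!visited.has(url) && !queue.includes(url)) {
    queue.push(url);
  }
}

function next() {
  const url = queue.shift(); // FIFO → breadth-first crawl order
  if (url) visited.add(url);
  return url;
}
```

Note that `queue.includes()` is an O(n) scan; at large scale you would mirror the queue's contents in a second `Set` to keep the duplicate check O(1) as well.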

Handling duplicate URLs with canonical tags

On many websites — especially e-commerce ones — a single page can be accessible via multiple URLs. For example:

https://example.com/product?id=123&ref=homepage
https://example.com/product?id=123&ref=search
https://example.com/product/blue-sneakers

All three might display the exact same content. To avoid scraping the same page multiple times, look for the canonical tag in the HTML head of each page:

<link rel="canonical" href="https://example.com/product/blue-sneakers" />

By using the canonical URL as the key in your visited set, you ensure that each unique page is crawled only once — regardless of how many different URLs point to it.
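
As a simplified sketch, the canonical URL can be extracted like this — a real crawler should use a proper HTML parser such as cheerio (as the full example below does) rather than a regex, which will miss tags whose attributes appear in a different order:

```javascript
// Simplified sketch: pull the canonical URL out of a page's HTML.
// Falls back to the URL the page was fetched from when no canonical
// tag is present.
function canonicalUrl(html, fallbackUrl) {
  const match = html.match(
    /<link[^>]*rel=["']canonical["'][^>]*href=["']([^"']+)["']/i
  );
  return match ? match[1] : fallbackUrl;
}
```

The returned value is what goes into the visited set: `visited.add(canonicalUrl(html, url))`.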

Setting URL filtering rules

Not every link on a page is worth crawling. Therefore, define filtering rules before you start. Common rules include:

  • Only follow links within the same domain (avoid leaving the target site)
  • Exclude URLs matching patterns like /login, /cart, /account
  • Exclude file extensions like .pdf, .jpg, .zip
  • Only include URLs matching a specific path prefix, e.g. /products/
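
The rules above can be sketched as a single filter function using Node's built-in `URL` class; the host, path prefix, and exclusion lists here are example values:

```javascript
// Sketch of the URL filtering rules, with example values.
const ALLOWED_HOST = 'example.com';
const PATH_PREFIX = '/products/';
const EXCLUDED_PATHS = ['/login', '/cart', '/account'];
const EXCLUDED_EXTENSIONS = ['.pdf', '.jpg', '.zip'];

function shouldCrawl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // Malformed URL
  }
  if (url.hostname !== ALLOWED_HOST) return false;         // Same domain only
  if (!url.pathname.startsWith(PATH_PREFIX)) return false; // Path prefix rule
  if (EXCLUDED_PATHS.some(p => url.pathname.startsWith(p))) return false;
  if (EXCLUDED_EXTENSIONS.some(ext => url.pathname.endsWith(ext))) return false;
  return true;
}
```

In the crawling loop, every extracted link passes through `shouldCrawl()` before being queued.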

Complete Node.js crawler example

Here is a working web crawler in Node.js using only two dependencies: axios for HTTP requests and cheerio for HTML parsing. This requires Node.js 8 or above for async/await support.

const axios = require('axios');
const cheerio = require('cheerio');

const ROOT_URL = 'https://example.com/products';
const DOMAIN   = 'https://example.com';

const queue   = [ROOT_URL];
const visited = new Set();

async function crawl(url) {
  if (visited.has(url)) return;
  visited.add(url);

  console.log(`Crawling: ${url}`);

  try {
    const { data } = await axios.get(url, { timeout: 10000 });
    const $ = cheerio.load(data);

    // Extract the canonical URL and use it as the deduplication key,
    // so a page reachable via several URLs is only processed once
    const canonical = $('link[rel="canonical"]').attr('href');
    const pageUrl = canonical || url;
    if (pageUrl !== url) {
      if (visited.has(pageUrl)) return; // Already crawled under its canonical URL
      visited.add(pageUrl);
    }

    // TODO: pass pageUrl to your ScrapingBot scraper here

    // Find and queue all links on the page
    $('a[href]').each((_, el) => {
      const href = $(el).attr('href');
      const absolute = toAbsolute(href, DOMAIN);

      if (
        absolute &&
        absolute.startsWith(DOMAIN) &&
        !visited.has(absolute) &&
        !isExcluded(absolute)
      ) {
        queue.push(absolute);
      }
    });

  } catch (err) {
    console.error(`Failed to crawl ${url}: ${err.message}`);
  }
}

function toAbsolute(href, base) {
  if (!href) return null;
  if (href.startsWith('http')) return href;
  if (href.startsWith('/')) return base + href;
  return null;
}

function isExcluded(url) {
  const excluded = ['/login', '/cart', '/account', '/checkout'];
  return excluded.some(pattern => url.includes(pattern));
}

// Main loop — process queue sequentially
async function run() {
  while (queue.length > 0) {
    const url = queue.shift(); // FIFO
    await crawl(url);
    await sleep(500); // Polite delay between requests
  }
  console.log(`Done. Visited ${visited.size} pages.`);
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

run();

💡 Note: The sleep(500) call adds a 500ms delay between requests. This is important — without it, your crawler may overload the target server and get your IP banned. See the best practices section below.

6. Best practices and rules to follow

Before deploying any crawler, it is essential to follow a set of rules — both technical and ethical:

| Rule | Why it matters |
| --- | --- |
| Check robots.txt | Specifies which paths crawlers are not allowed to visit. Always respect it. |
| Set a crawl delay | Avoid overloading the server. A 500ms–1s delay between requests is a good baseline. |
| Set a User-Agent header | Identify your crawler honestly in the request headers. |
| Handle errors gracefully | Use try/catch and retry logic for failed requests — don't let one bad URL crash your crawler. |
| Deduplicate aggressively | Use canonical tags and a visited Set to avoid crawling the same content twice. |
| Limit crawl depth | Set a maximum depth to prevent your crawler from going too deep into a site. |
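
Depth limiting in particular is easy to add to the queue-based design: each queue entry carries the depth at which its URL was discovered, and links found beyond a maximum depth are simply dropped. A minimal sketch, with `MAX_DEPTH` as an example value:

```javascript
// Sketch of depth limiting: queue entries carry the depth at which the
// URL was discovered, and links found beyond MAX_DEPTH are dropped.
const MAX_DEPTH = 3;
const queue = [{ url: 'https://example.com/', depth: 0 }];

function enqueueChild(parentDepth, url) {
  const depth = parentDepth + 1;
  if (depth > MAX_DEPTH) return false; // Too deep — ignore this link
  queue.push({ url, depth });
  return true;
}
```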

You can find the robots.txt file at the root of any website, e.g. https://example.com/robots.txt. Furthermore, some websites include Crawl-delay directives directly in their robots.txt — check for these and respect them.
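
As a rough illustration, here is a minimal parser that collects `Disallow` paths and the `Crawl-delay` for the wildcard user-agent. This is far from a full implementation of the robots exclusion protocol — for production use, reach for a dedicated library such as the robots-parser npm package:

```javascript
// Minimal robots.txt sketch: collect Disallow paths and Crawl-delay
// for the wildcard ("*") user-agent group only.
function parseRobots(txt) {
  const rules = { disallow: [], crawlDelayMs: null };
  let applies = false;
  for (const rawLine of txt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // Strip comments
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (!field || !value) continue;
    switch (field.trim().toLowerCase()) {
      case 'user-agent':
        applies = value === '*'; // Only track the wildcard group here
        break;
      case 'disallow':
        if (applies) rules.disallow.push(value);
        break;
      case 'crawl-delay':
        if (applies) rules.crawlDelayMs = Number(value) * 1000;
        break;
    }
  }
  return rules;
}

// Prefix match against the collected Disallow rules
function isAllowed(rules, path) {
  return !rules.disallow.some(prefix => path.startsWith(prefix));
}
```

In the crawler, you would fetch `/robots.txt` once at startup, skip any URL for which `isAllowed()` returns false, and use `crawlDelayMs` (when present) in place of the fixed 500ms sleep.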

7. Combining your crawler with ScrapingBot

Building a crawler to discover URLs is only half the work. Once you have a queue of pages to scrape, you still need to extract structured data from each one — and that's where anti-bot protections, JavaScript rendering, and IP bans become a problem.

ScrapingBot handles all of this for you. Rather than fetching pages directly in your crawler, pass each URL to the ScrapingBot API instead. As a result, you gain automatic IP rotation, JavaScript rendering, and CAPTCHA handling — without changing your crawler logic.

const axios = require('axios');

const USERNAME = 'your_username';
const API_KEY  = 'your_api_key';

async function scrapeWithBot(url) {
  const response = await axios.post(
    'https://api.scraping-bot.io/scrape/raw-html',
    { url },
    { auth: { username: USERNAME, password: API_KEY } }
  );
  return response.data; // Returns the rendered HTML
}

// In your crawler loop, replace direct axios.get() with:
const html = await scrapeWithBot(url);
const $ = cheerio.load(html);
// ... parse the content as usual

This approach gives you the best of both worlds: your crawler handles URL discovery and queue management, while ScrapingBot handles the hard part of actually fetching the pages reliably.

Ready to combine your web crawler with ScrapingBot? Get 1,000 free API calls when you sign up — no credit card required.

Try ScrapingBot for free →
