How to Build a Web Crawler: A Step-by-Step Guide
Building a web crawler is one of the most practical skills you can develop as a developer working with data. Rather than manually visiting pages one by one, a web crawler automates the entire process — following links, discovering URLs, and feeding them to a scraper. In this guide, you will learn exactly how a web crawler works, how to build one from scratch in Node.js, and how to combine it with ScrapingBot to extract structured data at scale.
Table of contents
1. What is a web crawler?
2. Web crawler vs web scraper — what's the difference?
3. How does a web crawler work?
4. Why do you need a web crawler?
5. How to build a web crawler
6. Best practices and rules to follow
7. Combining your crawler with ScrapingBot
1. What is a web crawler?
A web crawler — also called a spider or bot — is a program that systematically browses the internet by following hyperlinks from page to page. Starting from one or more entry URLs, it fetches each page, extracts all the links it finds, and adds them to a queue of pages to visit next. This process repeats until the queue is empty or a stopping condition is met.
The most well-known web crawlers are search engine bots such as Google's Googlebot or Bing's Bingbot. When you publish a new website, these crawlers will eventually find it, read its content, and index it so it appears in search results. Beyond search engines, however, developers use web crawlers daily for data collection, competitive intelligence, price monitoring, and more.
2. Web crawler vs web scraper — what's the difference?
These two terms are often confused, but they serve different purposes:
| Web Crawler | Web Scraper |
|---|---|
| Follows links to discover pages | Extracts data from specific pages |
| Always works on the web | Can work on the web or any data source |
| Builds a list of URLs | Parses page content into structured data |
| Output: a list of URLs | Output: JSON, CSV, database records |
In practice, a crawler and a scraper are typically used together: the crawler discovers all the product pages on an e-commerce site, and the scraper then extracts the price, title, and description from each one.
3. How does a web crawler work?
Understanding the internal mechanics of a crawler will help you build a reliable one. At its core, a crawler manages two lists:
- The queue (also called the crawl frontier or horizon) — URLs waiting to be visited
- The visited set — URLs that have already been crawled
The crawling loop
Here is the basic flow, step by step (a code sketch of the same loop follows the table):
| Step | Action |
|---|---|
| 1 | Add the root URL(s) to the queue |
| 2 | Pop the first URL from the queue |
| 3 | Add it to the visited set |
| 4 | Fetch the page content |
| 5 | Extract all links from the page |
| 6 | For each link: if not already visited and matches your rules → add to queue |
| 7 | Repeat from step 2 until the queue is empty |
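In code, the same loop fits in a few lines. The sketch below is deliberately simplified: fetchPage, extractLinks, and matchesRules are placeholder helpers, and a complete Node.js implementation follows in section 5.

```js
// Simplified crawling loop. fetchPage, extractLinks and matchesRules are
// placeholders here; see the full Node.js example in section 5.
async function crawlAll(rootUrl) {
  const queue = [rootUrl];   // URLs waiting to be visited
  const visited = new Set(); // URLs already crawled

  while (queue.length > 0) {
    const url = queue.shift();         // step 2: pop the first URL
    if (visited.has(url)) continue;
    visited.add(url);                  // step 3: mark as visited

    const html = await fetchPage(url); // step 4: fetch the page content
    for (const link of extractLinks(html)) { // step 5: extract all links
      if (!visited.has(link) && matchesRules(link)) {
        queue.push(link);              // step 6: queue new, relevant URLs
      }
    }
  }
}
```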
URL prioritization
To prioritize which URLs to visit first, more advanced crawlers take into account signals such as the number of inbound links pointing to a URL or the frequency at which regular users visit the page. Consequently, the most important pages are crawled first, even when the queue contains thousands of URLs.
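A simple way to implement prioritization is to keep a score for each URL and always pick the highest-scoring one instead of the oldest. The sketch below assumes a hypothetical inboundLinks map that counts how many pages link to each URL; real crawlers use richer signals, but the mechanics are the same.

```js
// Hypothetical priority signal: URL -> number of inbound links seen so far
const inboundLinks = new Map();

// Call this every time a link to `url` is discovered on a crawled page
function recordLink(url) {
  inboundLinks.set(url, (inboundLinks.get(url) || 0) + 1);
}

// Instead of shift()ing the oldest URL, pick the highest-scoring one.
// Sorting on every pick is fine for a sketch; a real priority queue scales better.
function nextUrl(queue) {
  queue.sort((a, b) => (inboundLinks.get(b) || 0) - (inboundLinks.get(a) || 0));
  return queue.shift();
}
```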
4. Why do you need a web crawler?
Web scraping alone requires you to know every URL you want to scrape in advance. For small, well-defined datasets, this works fine. However, when dealing with large websites — e-commerce catalogues, news archives, job boards — manually listing every page is impossible.
A web crawler solves this by automating URL discovery. For instance, you can point your crawler at a product category page on Amazon, and it will automatically find and queue every product page linked from there.
Additionally, you can set rules to exclude irrelevant pages — login pages, cart pages, pagination — so only the pages you care about end up in your scraping queue. As a result, you save hours of manual work and collect far more complete datasets.
5. How to build a web crawler
Data structures you need
Before writing any code, set up two core data structures (a minimal snippet follows the list):
- A queue — use an array or a proper queue structure to store URLs to visit. A FIFO (first in, first out) queue gives you breadth-first crawling, which is usually what you want.
- A visited set — use a Set or hash map so URL lookups are O(1). This is critical for performance at scale.
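In Node.js, both structures take one line each: push() and shift() on an array give FIFO behaviour, and a Set gives constant-time membership checks. A minimal sketch:

```js
const queue = ['https://example.com']; // FIFO: push() to the back, shift() from the front
const visited = new Set();             // O(1) lookups with visited.has(url)

queue.push('https://example.com/products'); // enqueue a discovered URL
const next = queue.shift();                 // dequeue the oldest URL
visited.add(next);                          // mark it as crawled
```

Note that Array.prototype.shift() is O(n), so for very large crawls a dedicated queue implementation (or simply tracking a head index) scales better.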
Handling duplicate URLs with canonical tags
On many websites — especially e-commerce ones — a single page can be accessible via multiple URLs. For example:
- https://example.com/product?id=123&ref=homepage
- https://example.com/product?id=123&ref=search
- https://example.com/product/blue-sneakers

All three might display the exact same content. To avoid scraping the same page multiple times, look for the canonical tag in the HTML head of each page:

```html
<link rel="canonical" href="https://example.com/product/blue-sneakers" />
```

By using the canonical URL as the key in your visited set, you ensure that each unique page is crawled only once — regardless of how many different URLs point to it.
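Reading the canonical tag takes only a couple of lines with cheerio, the same parser used in the complete example below. A minimal sketch, where canonicalKey is a helper name chosen here for illustration:

```js
const cheerio = require('cheerio');

// Returns the canonical URL of a page, falling back to the URL that was fetched
function canonicalKey(html, fetchedUrl) {
  const $ = cheerio.load(html);
  return $('link[rel="canonical"]').attr('href') || fetchedUrl;
}

// In the crawl loop:
// const key = canonicalKey(html, url);
// if (visited.has(key)) return; // same page reached via a different URL: skip it
// visited.add(key);
```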
Setting URL filtering rules
Not every link on a page is worth crawling. Therefore, define filtering rules before you start. Common rules include the following (a small helper combining them is sketched after the list):
- Only follow links within the same domain (avoid leaving the target site)
- Exclude URLs matching patterns like /login, /cart, /account
- Exclude file extensions like .pdf, .jpg, .zip
- Only include URLs matching a specific path prefix, e.g. /products/
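Here is one way to combine these rules into a single predicate. The domain, excluded paths, and the /products/ prefix are just example values matching the rules above; adapt them to your target site.

```js
const DOMAIN = 'https://example.com';
const EXCLUDED_PATHS = ['/login', '/cart', '/account'];
const EXCLUDED_EXTENSIONS = ['.pdf', '.jpg', '.zip'];
const REQUIRED_PREFIX = DOMAIN + '/products/';

function shouldCrawl(url) {
  if (!url.startsWith(DOMAIN)) return false;                            // same domain only
  if (EXCLUDED_PATHS.some(p => url.includes(p))) return false;          // skip login, cart, account
  if (EXCLUDED_EXTENSIONS.some(ext => url.endsWith(ext))) return false; // skip binary files
  return url.startsWith(REQUIRED_PREFIX);                               // only product pages
}
```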
Complete Node.js crawler example
Here is a working web crawler in Node.js using only two dependencies: axios for HTTP requests and cheerio for HTML parsing. This requires Node.js 8 or above for async/await support.
```js
const axios = require('axios');
const cheerio = require('cheerio');

const ROOT_URL = 'https://example.com/products';
const DOMAIN = 'https://example.com';

const queue = [ROOT_URL];
const visited = new Set();

async function crawl(url) {
  if (visited.has(url)) return;
  visited.add(url);
  console.log(`Crawling: ${url}`);

  try {
    const { data } = await axios.get(url, { timeout: 10000 });
    const $ = cheerio.load(data);

    // Extract canonical URL to avoid duplicates
    const canonical = $('link[rel="canonical"]').attr('href');
    const pageUrl = canonical || url;
    // Mark the canonical URL as visited too, so alternative URLs
    // pointing to the same page are skipped later
    if (canonical) visited.add(canonical);

    // TODO: pass pageUrl to your ScrapingBot scraper here

    // Find and queue all links on the page
    $('a[href]').each((_, el) => {
      const href = $(el).attr('href');
      const absolute = toAbsolute(href, DOMAIN);
      if (
        absolute &&
        absolute.startsWith(DOMAIN) &&
        !visited.has(absolute) &&
        !isExcluded(absolute)
      ) {
        queue.push(absolute);
      }
    });
  } catch (err) {
    console.error(`Failed to crawl ${url}: ${err.message}`);
  }
}

function toAbsolute(href, base) {
  if (!href) return null;
  if (href.startsWith('http')) return href;
  if (href.startsWith('/')) return base + href;
  return null;
}

function isExcluded(url) {
  const excluded = ['/login', '/cart', '/account', '/checkout'];
  return excluded.some(pattern => url.includes(pattern));
}

// Main loop — process queue sequentially
async function run() {
  while (queue.length > 0) {
    const url = queue.shift(); // FIFO
    await crawl(url);
    await sleep(500); // Polite delay between requests
  }
  console.log(`Done. Visited ${visited.size} pages.`);
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

run();
```

The sleep(500) call adds a 500ms delay between requests. This is important: without it, your crawler may overload the target server and get your IP banned. See the best practices section below.
6. Best practices and rules to follow
Before deploying any crawler, it is essential to follow a set of rules — both technical and ethical:
| Rule | Why it matters |
|---|---|
| Check robots.txt | Specifies which paths crawlers are not allowed to visit. Always respect it. |
| Set a crawl delay | Avoid overloading the server. A 500ms–1s delay between requests is a good baseline. |
| Set a User-Agent header | Identify your crawler honestly in the request headers. |
| Handle errors gracefully | Use try/catch and retry logic for failed requests — don't let one bad URL crash your crawler. |
| Deduplicate aggressively | Use canonical tags and a visited Set to avoid crawling the same content twice. |
| Limit crawl depth | Set a maximum depth to prevent your crawler from going too deep into a site (see the sketch after this table). |
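A common way to limit crawl depth is to store the depth alongside each queued URL rather than queuing bare strings. A minimal sketch, assuming a MAX_DEPTH of your choosing:

```js
const MAX_DEPTH = 3;
const queue = [{ url: 'https://example.com/products', depth: 0 }];

// When queuing links discovered on a page at depth `parentDepth`
function enqueueLinks(links, parentDepth) {
  if (parentDepth + 1 > MAX_DEPTH) return; // stop expanding beyond the limit
  for (const link of links) {
    queue.push({ url: link, depth: parentDepth + 1 });
  }
}
```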
You can find the robots.txt file at the root of any website, e.g. https://example.com/robots.txt. Furthermore, some websites include Crawl-delay directives directly in their robots.txt — check for these and respect them.
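The sketch below fetches robots.txt and pulls out the Disallow and Crawl-delay lines with a very naive parser. It only looks at the User-agent: * group, so treat it as a starting point rather than a complete robots.txt implementation.

```js
const axios = require('axios');

async function fetchRobotsRules(domain) {
  const rules = { disallow: [], crawlDelay: null };
  try {
    const { data } = await axios.get(`${domain}/robots.txt`, { timeout: 5000 });
    let applies = false;
    for (const rawLine of data.split('\n')) {
      const line = rawLine.trim();
      if (/^user-agent:/i.test(line)) {
        // Naive version: only keep the rules of the wildcard group
        applies = line.split(':')[1].trim() === '*';
      } else if (applies && /^disallow:/i.test(line)) {
        const path = line.split(':')[1].trim();
        if (path) rules.disallow.push(path);
      } else if (applies && /^crawl-delay:/i.test(line)) {
        rules.crawlDelay = parseFloat(line.split(':')[1]);
      }
    }
  } catch (err) {
    // No robots.txt (or unreachable): fall back to your default crawl settings
  }
  return rules;
}

// Usage: skip any URL whose path starts with a disallowed prefix, and if
// rules.crawlDelay is set (in seconds), use it instead of the default 500ms delay.
```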
7. Combining your crawler with ScrapingBot
Building a crawler to discover URLs is only half the work. Once you have a queue of pages to scrape, you still need to extract structured data from each one — and that's where anti-bot protections, JavaScript rendering, and IP bans become a problem.
ScrapingBot handles all of this for you. Rather than fetching pages directly in your crawler, pass each URL to the ScrapingBot API instead. As a result, you gain automatic IP rotation, JavaScript rendering, and CAPTCHA handling — without changing your crawler logic.
```js
const axios = require('axios');

const USERNAME = 'your_username';
const API_KEY = 'your_api_key';

async function scrapeWithBot(url) {
  const response = await axios.post(
    'https://api.scraping-bot.io/scrape/raw-html',
    { url },
    { auth: { username: USERNAME, password: API_KEY } }
  );
  return response.data; // Returns the rendered HTML
}

// In your crawler loop, replace direct axios.get() with:
const html = await scrapeWithBot(url);
const $ = cheerio.load(html);
// ... parse the content as usual
```

This approach gives you the best of both worlds: your crawler handles URL discovery and queue management, while ScrapingBot handles the hard part of actually fetching the pages reliably.
Ready to combine your web crawler with ScrapingBot? Get 1,000 free API calls when you sign up — no credit card required.
Try ScrapingBot for free →



