How to scrape a website without getting blocked?
When you want to collect and analyze data, whether for price comparison, statistics, or tracking a general trend, scraping is a great and essential time saver. However, many websites do not appreciate being heavily scraped, and some do not allow it at all. There are some general rules and tricks to follow if you do not want to be blocked from scraping a website, temporarily or permanently.
Rotating IP addresses is key when scraping websites. Most e-commerce websites do not appreciate being scraped.
When you’re scraping a website, you want the data to be collected fast. However, when a website receives many requests in a short time from a single IP address, it can infer that a scraper is behind them and block that address. To avoid being blacklisted, the best way is to use proxies, which route your requests through a pool of different IP addresses.
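The idea of routing requests through a pool of addresses can be sketched as follows. This is a minimal illustration, not a complete setup: the addresses in `PROXY_POOL` are placeholders from a documentation-only IP range, and the helper names `next_proxies` and `opener_for_next_proxy` are hypothetical.

```python
import urllib.request
from itertools import cycle

# Placeholder proxy endpoints (assumption); replace with the ones your provider gives you.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# cycle() walks the pool in order and starts over when it reaches the end.
_proxy_cycle = cycle(PROXY_POOL)


def next_proxies():
    """Return the next proxy in the pool as a scheme-to-proxy mapping."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}


def opener_for_next_proxy():
    """Build a urllib opener that sends its requests through the next proxy."""
    handler = urllib.request.ProxyHandler(next_proxies())
    return urllib.request.build_opener(handler)
```

Each call to `opener_for_next_proxy()` returns an opener bound to a different address, so calling `opener.open(url)` on a fresh opener for each request spreads the traffic across the pool. A simple round-robin is used here; picking proxies at random would work as well.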
The whole point of scraping is to collect data faster than you could manually. As a result, scrapers browse websites very fast. Websites can see how long you spend on each page and, if the pattern is not human-like, they will block you. That’s why it is worth limiting the speed, even if it means being less efficient. Find the optimal speed, and add some delays between pages and requests.
Unless told otherwise, a crawler will always take the most efficient route. This seems great, except that it contrasts sharply with human users, who navigate much more slowly. As a result, going fast makes the scraper very easy to spot and block. To avoid being blacklisted, you must mimic a standard user: set some delays between clicks, avoid repetitive browsing behavior, and add some mouse movements and random clicks. Basically, you need to program your robot to look less like a robot and more like a person.
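One way to add delays is to pause for a random interval between requests, so the timing is neither too fast nor perfectly regular. The sketch below assumes a range of 2 to 6 seconds, which is an arbitrary choice to tune per website; `polite_delay`, `crawl`, and the `fetch` callback are hypothetical names.

```python
import random
import time


def polite_delay(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Pause for a random duration between the two bounds and return it."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            polite_delay(min_seconds, max_seconds, sleep=sleep)
        results.append(fetch(url))
    return results
```

The `sleep` parameter exists only so the pause can be replaced in tests; in normal use the default `time.sleep` applies.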
Honeypot traps are links hidden in the HTML code. They are not visible to regular users visiting the website, so when one of those links is visited, the website knows that there is a scraper on the page and will block its IP address. The scraper needs to be able to detect whether a link has been made invisible. For example, a link can be set in the same colour as the background, so it is not visible to human users.
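A partial check for hidden links can be sketched with Python's standard HTML parser. This version only inspects each link's own `hidden` attribute and inline `style`, which is an assumption about how the trap is built. Links hidden through a stylesheet, a parent element, or a colour matching the background would need a rendered page to detect. The names `LinkCollector` and `split_links` are hypothetical.

```python
from html.parser import HTMLParser

# Inline-style fragments treated as "hidden" in this sketch (not an exhaustive list).
HIDDEN_STYLE_HINTS = ("display:none", "visibility:hidden")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, keeping hidden links separate."""

    def __init__(self):
        super().__init__()
        self.visible_links = []
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Normalise the style so "display: none" and "DISPLAY:NONE" both match.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        is_hidden = "hidden" in attributes or any(
            hint in style for hint in HIDDEN_STYLE_HINTS
        )
        (self.hidden_links if is_hidden else self.visible_links).append(href)


def split_links(html):
    """Return (visible_links, hidden_links) found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.visible_links, collector.hidden_links
```

A crawler would then follow only the first list and leave the second alone.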
Switch User Agents
The user agent is a string of characters that tells the website how you are visiting it: which browser, version and operating system you are using. As with the IP address, a single user agent used by a human will not send as many requests per minute as a crawler. Therefore, it is important to create a list of different user agents and regularly switch between them to avoid getting detected and blocked.
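Switching user agents can be as simple as picking one from a list for each request. In the sketch below, the strings in `USER_AGENTS` are illustrative examples that follow the usual browser format and should be kept up to date in practice. `build_request` is a hypothetical helper.

```python
import random
import urllib.request

# Example user-agent strings (assumption); maintain your own current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request(url):
    """Build a request whose User-Agent header is picked at random."""
    user_agent = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The returned object can be passed to `urllib.request.urlopen`, or to an opener configured with a proxy, so that both the IP address and the user agent vary between requests.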
Respect Robots.txt and the website in general
The robots.txt file is located at the root of the website. It sets the rules for crawling: which parts of the website should not be scraped and how frequently it can be scraped. Some websites do not allow anyone to scrape them.
If you scrape a website too frequently and send too many requests at a time, you might overload its servers and degrade its performance. The owners want their site to run smoothly for everyone, so they might block you to restore normal performance.
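Python's standard library includes a parser for these rules. The sketch below feeds it a made-up robots.txt, inlined so the example runs without network access. The file content, the `my-scraper` agent name, and the `load_rules` helper are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real website.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt):
    """Parse robots.txt text and return the configured parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# Ask before fetching: is this path allowed, and how long to wait between requests?
allowed = rules.can_fetch("my-scraper", "https://example.com/products")
blocked = rules.can_fetch("my-scraper", "https://example.com/private/data")
delay_seconds = rules.crawl_delay("my-scraper")
```

Against a live website, `set_url()` followed by `read()` on the parser downloads the file instead of parsing a local string. `crawl_delay()` returns `None` when the file does not specify a delay, so a default of your own is still needed.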