Even though web scraping is commonly used across most industries, most websites do not appreciate it and new anti-scraping methods are being developed regularly. The main reason is that aggressive web scraping can slow down the website for regular users, and in the worst-case result in a denial of service. To prevent you from scraping their websites, companies are using various strategies.
Limiting the scraping
IP rate limiting, also called requests throttling, is a commonly used anti-scraping method. A good practice of web scraping is to respect the website and scrape it slowly. This way, you will avoid monopolizing the bandwidth of the website. The goal is for the regular users to still have a smooth experience of the website in parallel of your scraping. IP rate limitation means that there is a maximum number of actions doable in a certain time on the website. Any request over this limit will simply not receive an answer.
Blocking the web scraping
While some websites are okay with a simple regulation of the web scraping, others are trying to prevent it all together. They are using many technics to detect and block scrapers: user agent, CAPTCHAs, behavioral analysis technology, blocking individual or entire IP ranges, AWS shield, … You can read more about how to scrape a website without being blocked in this article.
Some websites modify their HTML markups every month to protect their data. A scraping bot will look for an information in the places it found it last time. By changing the pattern of their HTML, the websites are trying to confuse the scraping tool, and making it harder to find the desired data.
In addition, the programmers can obfuscate the code. HTML obfuscation consist in making the code much harder to read, while keeping it perfectly functional. The information is still there but written in an extremely complex way.
Providing fake information
In our article about scraping without getting blocked, we’ve talked about Honeypots, those links that only bots will find and visit. Some other techniques are also intended to be seen only by bots and not by regular users. This is the case of cloaking. This is a hiding technique that returns an altered page when visited by a bot. Normal user only will be able to see the real pages. The bot will still collect information, without knowing that it is fake or incorrect. This method is really frowned upon by Google and other search engines. Websites using it are taking the risk to be removed from their index.
⬇️ If you want to get tips on scraping check out this article ⬇️