Some pages are harder to scrape than others. With infinite scrolling pages, scraping the URL as it is will result in incomplete data collection.
Infinite scrolling can also act as an obstacle to web scraping. You can read about other anti-scraping methods in this article.
What are Infinite Scroll Pages and what makes them harder to scrape?
If you scrape the webpage URL as it is, you will only collect the data of the first visible segment. It will also be problematic for a crawler, as it won’t find the URLs of the next pages.
When you visit this kind of webpage, the next content loads automatically when you reach the end of the page. One way to scrape the HTML of this kind of page is to simulate human behaviour with specific tools, such as Splash or Selenium. But there are simpler ways to do it.
In reality, an infinite scrolling page does have pagination, but it is hidden in the HTML code. On a classic page, the user clicks on the next page URL, whereas here the next page is requested dynamically when you reach the end area of the webpage.
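For reference, here is a minimal sketch of the "simulate a human" approach with Selenium. It assumes you already have a Selenium WebDriver instance; the function name scroll_to_bottom, the pause length and the max_rounds limit are illustrative choices, not something taken from a specific site. It keeps scrolling until the page height stops growing.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    `driver` is expected to be a Selenium WebDriver (anything exposing
    execute_script works). Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom, which triggers the loading of the next segment.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # leave the page some time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded: assume we reached the end
        last_height = new_height
    return rounds


# Possible usage (placeholder URL, requires Selenium and a browser driver):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("https://example.com/feed")
# scroll_to_bottom(driver)
# html = driver.page_source
# driver.quit()
```

This works, but it needs a full browser and is slow, which is why the request-based approach described next is usually preferable.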
Infinite Scroll Page Example:
We’re going to use this page as an example:
How do I find the URLs to scrape in the HTML code?
First, let’s open the Developer Tools of your browser. To do that, just right-click on the page and select Inspect Element.
Go to the Network tab. Most of the time, the requests we’re interested in will be under the XHR filter. On the webpage, scroll to the bottom to trigger the next page call. You will see the requests made by your browser appear in the Network tab. Click on one of them to see more details.
Here are the requests we can see for this page:
If you don’t see any requests in the XHR tab, go back to All, and search for Results.
As you can see, the browser is calling for the next pages.
To scrape those pages, you will need to call them the same way your browser does. So at the end of the URL, add /results?page=1
You can then either scrape the URLs manually, or adapt your web crawler to increment the page number. For that, your crawler needs to generate requests to the URL ending with ?page=X, increasing X until a request receives a 404 error, meaning the previous page was the last one.
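The loop described above could be sketched as follows in Python, using only the standard library. The names http_get and crawl_pages, the max_pages safety limit and the example.com URL are assumptions made for illustration. The /results?page=N pattern follows the example in this article, so adapt it to the requests you actually see in the Network tab.

```python
import urllib.error
import urllib.request


def http_get(url):
    """Return (status_code, body) for a GET request, without raising on HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def crawl_pages(base_url, fetch=http_get, start=1, max_pages=1000):
    """Request page 1, 2, 3... and stop at the first 404.

    `fetch` is any callable taking a URL and returning (status, body).
    `max_pages` is a safety limit in case the site never answers 404.
    """
    pages = []
    for n in range(start, start + max_pages):
        url = f"{base_url.rstrip('/')}/results?page={n}"
        status, body = fetch(url)
        if status == 404:
            break  # the previous page was the last one
        if status != 200:
            raise RuntimeError(f"Unexpected status {status} for {url}")
        pages.append(body)
    return pages


# Possible usage (placeholder URL):
# pages = crawl_pages("https://example.com/search")
# print(len(pages), "pages collected")
```

Keep in mind that not every site answers with a 404 after the last page. Some return an empty result instead, in which case the stop condition has to be adapted.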