"5 principles of bypassing blocks revealed by a web scraping expert"

Five principles for avoiding blocks while web crawling. Essential strategies include setting the User-Agent and rotating IP addresses, along with an overview of why crawlers get blocked and how to respond.

0. What is the cause of blocking during web crawling?

Any developer who has done web crawling has inevitably run into blocking.

Have you ever been frustrated because your crawler seemed perfectly fine, yet you had no idea where the problem occurred?

In this post, we will discuss the typical causes of blocking with a focus on solutions.

Crawling is a bit like walking into a store and picking up the products you want. When we enter a store, there are implicit rules we follow: wiping the dirt off our shoes before entering, folding our umbrella and placing it in the umbrella stand, and dressing in a way that does not inconvenience others.

Some places also have unique rules set by the owner, and you must follow those rules to use the store: for example, a cafe that only takes inquiries via DM, a restaurant where you must return your tray yourself, or a place where you leave your belongings on the seat.

Similarly, there are rules to follow when crawling. One typical reason for being blocked is a request that is missing the User-Agent or other expected parameters, or that has them set oddly; the website then identifies the request as coming from a bot and blocks it.

Therefore, the most basic step is to set the User-Agent so that your requests do not look like they come from a bot.

1. Let's set the User-Agent

Setting the User-Agent involves putting the User-Agent value in the HTTP request header. The User-Agent value is a string that indicates the type and version of the web browser or HTTP client.

For example, the Chrome browser uses a User-Agent value like the following.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36

By setting the User-Agent value in the crawler, the website will recognize the request as coming from a browser. Therefore, if the User-Agent value is not set, there is a high possibility that the website will block the request, assuming it is from a crawler.

How you set the User-Agent depends on the HTTP request library, but most libraries let you add it to the HTTP request header. With Python's requests library, you can set the User-Agent value as follows.

import requests

# Send a browser-like User-Agent so the site treats the request as coming from Chrome
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

response = requests.get("http://example.com", headers=headers)

As shown above, the requests library lets you set the User-Agent value in the HTTP request header through the headers parameter.

However, websites these days increasingly try to detect crawlers using signals beyond the User-Agent, so it is advisable to combine User-Agent setting with other techniques, such as frequently changing IP addresses or adjusting crawling speed.

2. Change IP addresses as frequently as possible

Sending requests continuously from the same IP address can look very suspicious to a website.

One simple way to change IP addresses frequently is to use a VPN. VPN stands for Virtual Private Network, a service that hides your real IP address and makes your traffic appear to come from an IP address in a different location.

With a VPN, your requests are seen as coming from the VPN rather than from your own connection, which reduces the likelihood of being blocked. It also means you do not have to keep using IP addresses that have already been blocked, allowing for more stable crawling.
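
A VPN is usually configured at the operating-system or network level rather than inside the crawler itself. Within the code, a similar effect can be achieved by routing requests through proxy servers, which the post also mentions under principle 3. Below is a minimal sketch using the requests library's proxies option; the proxy addresses are placeholders you would replace with proxies you actually have access to.

import random
import requests

# Hypothetical pool of proxy addresses; replace these with proxies you can actually use
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def get_with_random_proxy(url):
    # Pick a different exit IP for each request so they don't all come from one address
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = get_with_random_proxy("http://example.com")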

3. Irregularly adjust crawling speed

Requesting at precise timings and intervals like a machine increases the likelihood of being identified as a bot. Therefore, it is necessary to adjust the crawling speed irregularly.

There are two main methods to adjust crawling speed.

The first method is to adjust the intervals between crawling requests. Sending requests in rapid succession makes it more likely the server will notice and block you, so it is advisable to space requests out and randomize the delay between them (see the sketch at the end of this section).

The second method is to use a variety of IP addresses. Continuously sending crawling requests from the same IP address makes it more likely the server will recognize the pattern and block you, so it is recommended to spread requests across multiple IP addresses. Proxy servers can be used for this purpose, as in the rotation sketch under principle 2.
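
As a rough illustration of the first method, the snippet below inserts a random delay between requests so they are not sent at fixed, machine-like intervals; for the second method, the proxy rotation sketch under principle 2 applies. The URLs and the 2-7 second range are only examples, and a reasonable delay depends on the target site.

import random
import time
import requests

# Hypothetical list of pages to crawl
urls = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Sleep for a random interval so the requests are not sent at fixed, machine-like timings
    time.sleep(random.uniform(2.0, 7.0))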

4. Set parameters accurately

Earlier, we set the User-Agent by adding it to the HTTP request header.

When crawling, problems can arise if a parameter that should be present in the HTTP request header is missing, or if a parameter that should not be there is included.

For example, if the User-Agent value is not set, some websites will block the request on the assumption that it was not sent from a browser. Likewise, if you send cookie values that should not be in the HTTP request header, the website may notice this and block the request.

Therefore, when crawling, carefully check which parameters should be set in the HTTP request header and whether any parameters that should not be there are being sent.
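
Which headers a site expects varies from site to site, so the values below are only an illustration; a practical way to find out is to open the browser's developer tools (Network tab) and copy the headers the browser actually sends to the target page. A minimal sketch:

import requests

# Example headers modeled on what a desktop browser typically sends;
# adjust them to match what the target site actually expects
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "http://example.com/",
}

response = requests.get("http://example.com/products", headers=headers)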

5. Proper exception handling

If an error occurs during crawling and you keep sending the same request over and over, the website is more likely to identify the traffic as a bot and block it. Therefore, proper exception handling is necessary.

Exception handling refers to handling unexpected situations that occur during program execution.

For example, if the server does not respond to an HTTP request, the request fails. With exception handling, the program can recognize this, wait for a certain period, and then resend the request.

Moreover, if your crawling requests violate the website's rules, the website will block them. The program should detect this through exception handling, stop sending the request, and adjust its behavior. To do this, check the status code the website returns for each HTTP request and handle errors accordingly.

Exception handling is a crucial part of crawling. With proper exception handling, you can perform crawling more stably.
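
As a minimal sketch of this idea, the function below retries a failed request a few times with an increasing wait, and stops entirely when the site answers with a status code that suggests it is refusing or rate-limiting us. The retry count, wait times, and the status codes treated as "blocked" are assumptions you would adapt to your own situation.

import time
import requests

def fetch_with_retry(url, max_retries=3):
    # Retry a failed request a few times, waiting longer each time,
    # instead of hammering the server with the same request
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
        except requests.exceptions.RequestException as error:
            print(f"Request failed ({error}), waiting {attempt * 5} seconds before retrying...")
            time.sleep(attempt * 5)
            continue

        if response.status_code == 200:
            return response
        if response.status_code in (403, 429):
            # The site is refusing or rate-limiting us; stop instead of insisting
            print(f"Blocked or rate-limited (status {response.status_code}), stopping.")
            return None

        # Other error statuses: wait a bit and try again
        time.sleep(attempt * 5)

    return None

response = fetch_with_retry("http://example.com")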

Conclusion

To perform web crawling stably, it is essential to avoid website blocking by using various methods such as setting User-Agent, changing IP addresses, irregularly adjusting crawling speed, setting parameters accurately, and implementing proper exception handling.

If you run into blocking, I hope you will check whether the five principles above are reflected in your crawler and get good results.
