Precautions for web scraping and how to utilize cloud servers

Learn about the precautions to take when crawling and how to make use of cloud servers, covering IP blocking, data collection, and crawling systems.


1. Three things everyone should watch out for when crawling

Crawling has recently become an essential part of software education courses at academies and online education sites.

Crawling is a technology that belongs in any big data analysis course, but a crawler developed hastily may not collect even 10% of the data you want. Even after investing a lot of time in development, you can suffer losses when problems are discovered only later.

Let's first look at what crawling is and why it may collect only 10% of the data, and then explore how to solve that problem.

[Image: crawling education advertisement]

What is Crawling?

Crawling (or scraping) is the act of fetching a web page as-is and extracting data from it. The software that does the crawling is called a crawler.
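To make this concrete, here is a minimal crawler sketch in Python: it fetches a page and extracts text from the HTML. The URL and CSS selector are placeholders for illustration, not a real target site.

```python
# Minimal crawler sketch: fetch a web page and extract data from it.
# The URL and the CSS selector below are placeholders, not a real site.
import requests
from bs4 import BeautifulSoup

def crawl(url):
    # Fetch the web page as-is
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Parse the HTML and pull out the pieces of data we care about
    soup = BeautifulSoup(response.text, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h2.title")]

if __name__ == "__main__":
    print(crawl("https://example.com/articles"))
```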

Now let's look at what to be careful about when developing a crawler.

1) Is Python the best choice?

Python is the main language used in data analysis, so most textbooks and educational programs use Python to build crawlers. However, Python is not mandatory. Our company, Hashscraper, develops in Ruby.

If it gets the job done, choosing Python, a widely used language, is a fine decision. In an era where "know-where" matters more than know-how, picking a language whose problems are easy to solve through a quick search is a smart choice. (Still, I chose Ruby for its simplicity and convenience.)

2) IP Blocking

When you build a crawler by diligently typing along with a book and understanding it, it works well at first. However, once you start collecting data from large sites, you may run into situations like the following:

  • Access is blocked
  • Login is required
  • CAPTCHA appears
  • Redirected to the wrong page

Since the web server knows your IP, if you request web pages at short intervals, your IP may be blocked for a certain period.
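As a rough sketch of how a crawler might cope with this, the code below watches for those block symptoms and slows itself down. The status codes, the naive CAPTCHA check, and the delay values are illustrative assumptions, not a guaranteed way to avoid blocking.

```python
# Sketch: recognize the block symptoms listed above and back off.
# Status codes, the CAPTCHA marker, and delays are illustrative assumptions.
import time
import requests

BLOCK_STATUS = {403, 429}    # access blocked / too many requests
CAPTCHA_MARKER = "captcha"   # naive check for a CAPTCHA page

def fetch_politely(urls, delay=2.0):
    pages = []
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.status_code in BLOCK_STATUS or CAPTCHA_MARKER in response.text.lower():
            # Probably blocked: wait much longer before continuing
            time.sleep(delay * 10)
            continue
        pages.append(response.text)
        time.sleep(delay)    # keep the interval between requests long enough
    return pages
```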

How can you solve IP blocking? You need more IPs. It's a simple answer, but a difficult one to put into practice.

Therefore, Hashscraper has been collecting data using multiple AWS EC2 instances for about 3 years. Additionally, depending on the amount of data to be collected, AutoScaling technology is applied to automatically increase or decrease the number of servers.

Servers that keep failing shut themselves down, new instances are created, and the newly assigned IPs are used instead.
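The sketch below illustrates that self-replacement idea, assuming the worker runs on an EC2 instance inside an Auto Scaling group. The failure threshold, the IMDSv1 metadata lookup, and the worker loop are illustrative assumptions, not Hashscraper's actual code.

```python
# Sketch: a worker that terminates its own instance after repeated failures,
# letting the Auto Scaling group launch a replacement with a new IP.
# The threshold and metadata lookup are assumptions for illustration.
import boto3
import requests

MAX_CONSECUTIVE_FAILURES = 20  # assumed threshold

def replace_self():
    # Look up this instance's ID from the EC2 instance metadata service
    # (assumes IMDSv1 is reachable without a session token)
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text

    # Terminate this instance without lowering desired capacity, so the
    # Auto Scaling group starts a fresh instance with a newly assigned IP
    boto3.client("autoscaling").terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )

def run_worker(fetch_job):
    failures = 0
    while True:
        try:
            fetch_job()
            failures = 0
        except Exception:
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                replace_self()
                break
```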

3) IP Distribution

Quite a few operations crawl from EC2 the way Hashscraper does, and some sites have blocked the entire EC2 IP range. For those cases, Hashscraper secures "clean" IPs through domestic hosting companies and uses proxy IP servers when necessary.
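As a simple illustration of the proxy approach, a crawler can rotate its requests across a pool of proxy IPs. The addresses below are placeholders; in practice they would come from a hosting company or proxy provider.

```python
# Sketch: spread requests across a pool of proxy IPs.
# The proxy addresses are placeholders for illustration only.
import itertools
import requests

PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

def fetch_via_proxy(url):
    # Rotate to the next proxy so requests come from different IPs
    proxy = next(_proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```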

2. Conclusion

For assignments or collecting data for a report, a basic crawler is sufficient.

However, if you want to utilize it for work (marketing, trend analysis, basic platform data, influencer search, etc.), we recommend establishing a proper crawling system.

