27 reasons why web scraping stops


IP blocking, CAPTCHA, structural changes... Maintaining a crawler is ten times harder than building one

Reading time: 12 minutes | As of January 2026


Key Summary

When you first build a crawler, it runs smoothly for about a week. The problems start after that.

Websites change constantly, security gets stronger every month, and infrastructure fails without warning. Hashscraper has categorized the 27 types of failure it encountered while crawling more than 5,000 sites over 8 years, including how often each occurs, how hard it is to handle, and what it actually costs to resolve in-house.

| Category | Failure Types |
|---|---|
| Access Blocking | 8 |
| Site Changes | 6 |
| Infrastructure/Network | 5 |
| Authentication/Session | 4 |
| Data Quality | 4 |

Category 1: Access Blocking (8 types)

This is the most common obstacle crawlers face. Once the target site detects "you are a bot," data collection stops.

1. IP Blocking (Rate Limiting)

Symptom: Sudden 403 Forbidden or 429 Too Many Requests responses
Cause: Too many requests from the same IP in a short period
Frequency: Very common

This is the most basic form of blocking. It can be resolved by slowing down requests or using a proxy pool, but proxy management then becomes its own workload: monitoring IP quality, rotating out blocked IPs, and tracking availability.

Self-resolution cost: proxy service at 500,000~2,000,000 KRW/month, plus management personnel
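
As a minimal sketch of the slow-down-and-rotate approach (the proxy URLs are placeholders for whatever your provider issues):

```python
import random
import time
import requests

# Placeholder pool: substitute the endpoints your proxy provider issues.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

def fetch(url, max_retries=5):
    """Rotate proxies and back off exponentially on 403/429 responses."""
    for attempt in range(max_retries):
        proxy = random.choice(PROXY_POOL)
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        except requests.RequestException:
            continue  # dead proxy: try the next one
        if resp.status_code in (403, 429):
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... before retrying
            continue
        return resp
    raise RuntimeError(f"Blocked on all {max_retries} attempts: {url}")
```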

2. Akamai Bot Manager

Symptom: Only the Akamai logo and a waiting screen appear when accessing a page
Cause: A security solution specialized in bot detection analyzes browser fingerprints
Frequency: Common on large e-commerce sites

In Korea, Coupang is the typical example. Even when you access the site with Selenium or Playwright, it analyzes browser fingerprints, JavaScript execution patterns, mouse trajectories, and scroll speed. Bypassing it with conventional crawling tools is nearly impossible.

In a hands-on test in January 2026, both Firecrawl (including its Stealth Proxy) and Jina Reader were blocked by Coupang's Akamai. Hashscraper bypasses it with its own browser-emulation technology.

Self-resolution cost: Specialized personnel + continuous bypass technology development (annual cost in the millions)

3. CAPTCHA

Symptom: "Not a robot" verification screen
Cause: The site asks for human verification when it detects suspicious traffic patterns

reCAPTCHA and hCaptcha can be solved automatically through external solving services (2Captcha, Anti-Captcha). However, CAPTCHAs developed in-house, such as Naver Shopping's receipt CAPTCHA, cannot be handled by those services: you have to train a separate machine-learning model, and whenever the site changes its CAPTCHA images, the model must be retrained.

Self-resolution cost: General CAPTCHA solving costs 2~5 KRW per item + in-house CAPTCHA requires separate ML development
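
For the external-service route, the flow is submit-then-poll. A sketch against 2Captcha's documented HTTP API (the API key and timing values are placeholders):

```python
import time
import requests

API_KEY = "YOUR_2CAPTCHA_KEY"  # placeholder

def solve_recaptcha(site_key, page_url, timeout=180):
    """Submit a reCAPTCHA job to 2Captcha, then poll until the token is ready."""
    job = requests.post("http://2captcha.com/in.php", data={
        "key": API_KEY, "method": "userrecaptcha",
        "googlekey": site_key, "pageurl": page_url, "json": 1,
    }).json()
    deadline = time.time() + timeout
    while time.time() < deadline:
        time.sleep(5)  # the service needs a few seconds per solve
        res = requests.get("http://2captcha.com/res.php", params={
            "key": API_KEY, "action": "get", "id": job["request"], "json": 1,
        }).json()
        if res["status"] == 1:
            return res["request"]  # inject into the g-recaptcha-response field
    raise TimeoutError("CAPTCHA not solved within the time limit")
```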

4. JavaScript-based Bot Detection

Symptom: Blank screen or infinite redirects after page load
Cause: Client-side JavaScript verifies the browser environment

Plain HTTP clients (requests, urllib) are detected immediately. Even headless browsers give themselves away through objects such as navigator.webdriver and window.chrome. Tools like Puppeteer Stealth and undetected-chromedriver help, but each site's detection logic differs, so responses must be tailored individually.
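
A common first countermeasure is hiding navigator.webdriver before any page script runs. A Playwright sketch of the idea; note that serious detection checks far more signals than this:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Runs before any page script, so detection code sees a "normal" value.
    page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page.goto("https://example.com")  # placeholder URL
    print(page.evaluate("navigator.webdriver"))  # now undefined instead of true
    browser.close()
```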

5. User-Agent/Header Verification

Symptom: 403 Forbidden or abnormal responses
Cause: Request headers do not match actual browser patterns

This is the simplest blocking to deal with: just send User-Agent, Accept, and Referer headers that match a real browser. It is the first problem crawling beginners hit, but matching headers alone will not get past more advanced blocking.
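
A minimal example with requests, using headers copied from a real Chrome session (the target URL is a placeholder):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ko-KR,ko;q=0.9,en;q=0.8",
    "Referer": "https://www.example.com/",  # placeholder referrer
}
resp = requests.get("https://www.example.com/products", headers=headers)
```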

6. Geo-blocking

Symptom: Blocked or different content returned when accessed from overseas IPs
Cause: Access allowed only from specific country IPs

This commonly happens when crawling Korean sites from overseas servers such as AWS us-east. You need Korean-IP proxies or servers running in Korea.

7. Robots Exclusion Standard (robots.txt)

Symptom: Crawling is possible but legal risks exist
Cause: Site prohibits crawling specific paths in robots.txt
Frequency: Exists on most sites
Response Difficulty: Low technically / High legally

Technically, robots.txt can be ignored, but legally it is a different story. When crawling large corporate sites for commercial purposes, checking it beforehand is essential.
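
Checking is straightforward with Python's standard library; the site and user-agent below are placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder site
rp.read()

# Check a specific path against the rules declared for your user-agent.
if rp.can_fetch("MyCrawler/1.0", "https://www.example.com/products/123"):
    print("allowed by robots.txt")
else:
    print("disallowed: skip, or get a legal review first")
```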

8. WAF (Web Application Firewall)

Symptom: Sudden blocking, inconsistent responses
Cause: Cloudflare, AWS WAF, etc., analyze traffic patterns comprehensively

A WAF analyzes IP reputation, request frequency, browser fingerprints, and TLS handshake patterns together. Bypassing Cloudflare's "5-second challenge" requires a JavaScript execution environment. Since 2025, the number of sites replacing reCAPTCHA with Cloudflare Turnstile has been growing rapidly.


Category 2: Site Changes (6 types)

A crawler that was perfect when it was written suddenly starts returning empty data one day, and no one tells you why.

9. HTML Structure Changes

Symptom: Empty or incorrect data returned
Cause: Frontend updates on the target site
Frequency: The most common cause of failure

Naver Shopping updates its frontend dozens of times a year; so do Coupang, 11st, and Gmarket. Class names change from product-price to prd_price_v2, div structures are reorganized, and new components appear.

Actual data: each crawler needs structural-change fixes 6~12 times a year on average. With 10 crawlers, that is 60~120 fixes per year, meaning something breaks every 3 days.

Self-resolution cost: 3~5 hours per item × 8 times a year = 24~40 hours/year/crawler
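
One way to soften the blow is to try an ordered list of known selectors and alert when none match, instead of silently returning empty data. A sketch with BeautifulSoup (the selectors are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical selectors: the current one plus known fallbacks, tried in order.
PRICE_SELECTORS = [".prd_price_v2", ".product-price", "[data-testid='price']"]

def extract_price(html):
    """Return the first matching selector's text, or None so monitoring can alert."""
    soup = BeautifulSoup(html, "html.parser")
    for sel in PRICE_SELECTORS:
        node = soup.select_one(sel)
        if node:
            return node.get_text(strip=True)
    return None  # structure changed again: trigger an alert, don't fail silently
```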

10. SPA/Dynamic Rendering Transition

Symptom: Pages that used to be fetched well return only empty HTML
Cause: Transition to SPA with React/Vue/Angular, etc.

When a site moves from SSR to an SPA, an existing HTTP-based crawler becomes useless overnight. It must be rewritten around a headless browser, and resource consumption increases more than tenfold.
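
A minimal headless-browser fetch with Playwright, which returns the rendered DOM rather than the empty HTML shell (the URL is a placeholder):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # networkidle waits for the SPA's XHR calls to settle before we read the DOM
    page.goto("https://spa.example.com/products", wait_until="networkidle")
    html = page.content()  # rendered DOM, not the empty HTML shell
    browser.close()
```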

11. API Endpoint Changes

Symptom: 404 or response format change when calling the API
Cause: Internal API URL/schema changes

Calling an SPA site's internal REST/GraphQL API directly is more efficient than parsing HTML, but when the API version jumps from v2 to v3, all of the parsing logic has to be rewritten.
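
A sketch of the defensive pattern: call the internal API but validate the fields you depend on, so a silent schema change fails loudly (the endpoint and field names are assumptions):

```python
import requests

API_URL = "https://www.example.com/api/v2/products/{pid}"  # hypothetical endpoint

def fetch_product(pid):
    data = requests.get(API_URL.format(pid=pid), timeout=10).json()
    # Fail loudly if the schema shifted underneath us (e.g. a v2 -> v3 migration).
    missing = {"id", "name", "price"} - data.keys()
    if missing:
        raise ValueError(f"API schema changed; missing fields: {missing}")
    return data
```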

12. URL Pattern Changes

Symptom: Existing URLs return 404
Cause: URL structure overhaul

E.g., /product/12345 → /shop/items/12345. The crawler's URL-generation logic has to be updated.

13. Pagination Method Changes

Symptom: Failure to load the next page, collecting only the first page repeatedly
Cause: Page number → infinite scroll, or offset → cursor-based transition
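
A sketch of the cursor-based pattern the updated crawler needs (the endpoint and field names are assumptions):

```python
import requests

def crawl_all(url="https://www.example.com/api/items"):  # hypothetical endpoint
    cursor, items = None, []
    while True:
        params = {"cursor": cursor} if cursor else {}
        page = requests.get(url, params=params, timeout=10).json()
        items.extend(page["items"])
        cursor = page.get("next_cursor")
        if not cursor:  # servers typically mark the last page with a null cursor
            return items
```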

14. Content Loading Method Changes

Symptom: Only some data is collected, the rest is missing
Cause: Introduction of Lazy loading, Intersection Observer-based scroll triggers
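
The usual fix is to scroll until the page stops growing, giving the Intersection Observer triggers time to fire. A Playwright sketch (URL and timing are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.example.com/list")  # placeholder URL
    prev_height = 0
    while True:
        page.mouse.wheel(0, 2000)       # fires Intersection Observer callbacks
        page.wait_for_timeout(1000)     # give the triggered XHRs time to land
        height = page.evaluate("document.body.scrollHeight")
        if height == prev_height:       # nothing new appended: we hit the end
            break
        prev_height = height
    html = page.content()
    browser.close()
```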


Category 3: Infrastructure/Network (5 types)

The crawler code is fine, but problems arise in the execution environment.

15. Insufficient Server Resources

Symptom: Slow speed, OOM (Out of Memory) crashes
Cause: Insufficient memory, CPU, disk capacity

Headless browsers (Chromium) consume 200~500 MB of memory per tab, so 10 concurrent crawlers need 2~5 GB. Given memory leaks, periodic process restarts are essential.
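
A simple watchdog pattern: measure your own RSS and exit when it crosses a budget, letting the supervisor (systemd, Kubernetes, etc.) restart the process. The limit below is an assumption:

```python
import os
import sys
import psutil

MEMORY_LIMIT_MB = 1500  # assumed per-process budget; tune to your machine

def restart_if_leaking():
    """Exit when RSS exceeds the budget so a supervisor restarts the process."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    if rss_mb > MEMORY_LIMIT_MB:
        sys.exit(1)  # non-zero: systemd/k8s treats it as a crash and restarts
```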

16. Proxy Failure

Symptom: Connection timeouts, intermittent failures
Cause: Proxy server downtime, IP expiration, provider outages

17. DNS Resolution Failure

Symptom: "Host not found" error
Cause: DNS server failure, domain changes

18. SSL/TLS Certificate Issues

Symptom: SSL handshake failure
Cause: Target site certificate expiration/delayed renewal

19. Target Server Downtime

Symptom: 503 Service Unavailable, 504 Gateway Timeout
Cause: Site maintenance or outage
Response Difficulty: Handled by implementing retries and notifications
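
A sketch of the retry-plus-notification pattern (the notify callback is a stub for your alerting channel):

```python
import time
import requests

def fetch_with_retry(url, attempts=5, notify=print):
    """Retry 5xx/timeouts with exponential backoff, then alert via notify()."""
    for i in range(attempts):
        try:
            resp = requests.get(url, timeout=15)
            if resp.status_code < 500:
                return resp
        except requests.RequestException:
            pass  # connection error or timeout: treat like a 5xx
        time.sleep(2 ** i)  # 1s, 2s, 4s, 8s, 16s
    notify(f"Target still down after {attempts} attempts: {url}")
    return None
```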


Category 4: Authentication/Session (4 types)

Crawling sites that require login can be particularly troublesome.

20. Login Session Expiration

Symptom: Sudden redirect to the login page
Cause: Session cookie expiration, token TTL exceeded
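
A sketch of re-login detection with a requests session (the login endpoint and redirect check are assumptions; real sites vary):

```python
import requests

session = requests.Session()

def ensure_logged_in(resp):
    """Re-authenticate when the site bounced us to the login page."""
    if resp.status_code == 401 or "/login" in resp.url:  # assumed signals
        session.post("https://www.example.com/login",     # hypothetical endpoint
                     data={"id": "user", "pw": "secret"})
        return True  # caller should retry the original request
    return False
```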

21. 2FA/MFA Authentication Requirement

Symptom: Requires SMS/email verification
Cause: Security verification triggered when accessing from a new device/IP

Automating 2FA is technically very difficult and usually prohibited by the service's terms. It is nearly impossible to handle without manual intervention.

22. OAuth Token Refresh Failure

Symptom: 401 Unauthorized when calling the API
Cause: Refresh token expiration, OAuth app permission changes
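
A sketch of proactive token refresh, where a failed refresh is surfaced instead of silently retried (the token endpoint and field names are assumptions):

```python
import time
import requests

TOKEN_URL = "https://auth.example.com/oauth/token"  # hypothetical endpoint

def refresh(tokens):
    """Refresh shortly before expiry; surface failures for human re-authorization."""
    if time.time() < tokens["expires_at"] - 60:  # still valid: nothing to do
        return tokens
    r = requests.post(TOKEN_URL, data={
        "grant_type": "refresh_token",
        "refresh_token": tokens["refresh_token"],
        "client_id": "my-app",  # placeholder
    })
    r.raise_for_status()  # a 400/401 here usually means the refresh token expired
    new = r.json()
    new["expires_at"] = time.time() + new["expires_in"]
    return new
```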

23. Cookie Policy Changes

Symptom: Existing authentication flow suddenly breaks
Cause: Strengthened SameSite policy, cookie name/domain/path changes


Category 5: Data Quality (4 types)

The crawler may run smoothly while the data it collects is unreliable. The longer the problem goes undetected, the greater the damage.

24. Honeypot Data

Symptom: Fake information mixed in the collected data
Cause: Sites intentionally provide incorrect data to bots

This is the most cunning defense mechanism: the site shows different prices or nonexistent products only to bots. The contamination is hard to detect until someone cross-checks the data manually.

25. Personalized Content

Symptom: Different data collected each time from the same URL
Cause: Personalization algorithms, A/B testing, regional price differentials

26. Encoding Issues

Symptom: Korean character corruption, special character errors
Cause: Mixing UTF-8 and EUC-KR, character set mismatch
Frequency: Especially common on Korean sites

This is frequent on older Korean shopping malls and public-institution sites. There are still pages whose headers declare UTF-8 while the actual content is EUC-KR.
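
With requests, one mitigation is to trust the bytes over the header, using the library's content-based apparent_encoding guess:

```python
import requests

resp = requests.get("https://old-mall.example.co.kr/item")  # placeholder URL
# The header may claim UTF-8 while the body is EUC-KR; apparent_encoding
# guesses from the raw bytes instead of trusting the declared charset.
detected = resp.apparent_encoding
if detected and detected.lower() != (resp.encoding or "").lower():
    resp.encoding = detected
text = resp.text  # decoded with the detected charset
```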

27. Dynamic Price/Stock Mismatch

Symptom: Collected price differs from displayed price
Cause: Real-time price changes, regional/member-level price differentials
Frequency: An essential consideration for e-commerce


Actual Costs Incurred in Failure Response

How much does it cost to handle all 27 failures?

Personnel

| Role | Required Level | Salary (as of 2025) |
|---|---|---|
| Senior Crawling Developer | 5+ years of experience, hands-on experience bypassing blocks | 80~120 million KRW |
| Infrastructure Engineer | Server/proxy/monitoring operations | 60~90 million KRW |

With more than 5 crawlers, at least one person needs to work on crawling full time. Splitting the role across other duties delays failure response and creates data gaps.

Infrastructure

| Item | Monthly Cost |
|---|---|
| Server (crawler execution) | 500,000~2,000,000 KRW |
| Proxy service | 500,000~3,000,000 KRW |
| CAPTCHA-solving service | 100,000~500,000 KRW |
| Monitoring/notifications | 100,000~300,000 KRW |
| Total | 1,200,000~5,800,000 KRW/month |

Time

| Failure Category | Average Response Time | Monthly Occurrences | Annual Time Investment |
|---|---|---|---|
| Access blocking | 4~16 hours | 2~4 | 96~768 hours |
| Site changes | 2~8 hours | 1~3 | 24~288 hours |
| Infrastructure failures | 1~4 hours | 1~2 | 12~96 hours |
| Authentication issues | 2~6 hours | 0.3~0.7 | 7~50 hours |
| Total | | | 139~1,202 hours/year |

With 5 crawlers, 200~500 hours are spent annually on failure response alone. This is 10~25% of a senior developer's working hours.


3 Ways to Solve This Problem

Method 1: Build It Yourself

As the analysis above shows, building it yourself takes roughly 150 million~300 million KRW per year in personnel and infrastructure. This suits companies for which crawling is a core business or that already have specialized personnel.

Method 2: Subscription — All-in-One Outsourcing

Entrust everything from crawler development to operation, maintenance, and failure response.

| Failure Category | Direct Response | Hashscraper Subscription |
|---|---|---|
| Access blocking (8 types) | Develop and manage proxies/bypasses yourself | Handled with proprietary technology |
| Site changes (6 types) | Detect and fix manually | Automatic response within 24 hours |
| Infrastructure (5 types) | Operate servers/proxies yourself | Dedicated infrastructure included |
| Authentication/session (4 types) | Implement session management yourself | Automation included |
| Data quality (4 types) | Build verification logic yourself | Multi-stage quality verification |

Suitable for companies that need data but not crawlers. Starts from 300,000 KRW/month.

Method 3: MCP/API — Direct Integration by Development Team

For companies that have their own development team but want to outsource block circumvention and infrastructure management. Calling a crawling API from an AI agent also falls under this method.

Global services such as Firecrawl and Jina Reader are completely blocked on major Korean sites (Coupang, Naver, Instagram). Hashscraper's MCP server solves this with 8 years of accumulated block-circumvention technology.

You can start from 30,000 KRW per month with credit-based pricing.


Conclusion: Which Choice Is Right?

| Situation | Recommendation |
|---|---|
| Crawling is a core business + specialized personnel available | Build it yourself |
| Only need data, want to outsource development/operations | Subscription |
| Have a development team, need block circumvention + infrastructure management | MCP/API |
| Small-scale or irregular collection | Credits |

Are you ready to handle all 27 failures on your own? If not, lean on 8 years of experience.

Start with Credits →

Learn about MCP Server →

Consultation for Subscription →


Hashscraper — We handle crawling failures for you. You focus on the data.
