IP blocking, CAPTCHAs, structural changes... Maintaining a crawler is 10 times harder than building one
Reading time: 12 minutes | As of January 2026
Key Summary
A newly built crawler runs smoothly for about a week. The problems start after that.
Websites change constantly, security gets stronger every month, and infrastructure fails without warning. Hashscraper has categorized the 27 failure types it encountered while crawling over 5,000 sites for 8 years, along with each type's occurrence frequency, response difficulty, and the actual cost of resolving it yourself.
| Category | Number of Failure Types |
|---|---|
| Access Blocking | 8 |
| Site Changes | 6 |
| Infrastructure/Network | 5 |
| Authentication/Session | 4 |
| Data Quality | 4 |
Category 1: Access Blocking (8 types)
This is the most common obstacle crawlers face. Once the target site detects "you are a bot," data collection stops.
1. IP Blocking (Rate Limiting)
Symptom: Suddenly 403 Forbidden or 429 Too Many Requests
Cause: Mass requests from the same IP in a short period
Frequency: Very common
This is the most basic form of blocking. It can be resolved by reducing request speed or using a proxy pool. However, managing proxies becomes a job of its own: monitoring IP quality, replacing blocked IPs, and tracking availability.
Self-resolution cost: Proxy service at 500,000~2,000,000 KRW/month, plus management personnel
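A minimal sketch of both mitigations (throttling plus a rotating proxy pool), assuming placeholder proxy URLs; a production setup also needs the IP-quality monitoring described above:

```python
import itertools
import time

import requests

# Placeholder proxies -- replace with your own pool and credentials.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch(url: str, delay: float = 2.0):
    """Fetch one URL through the next proxy, throttling between requests."""
    proxy = next(proxy_cycle)
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    time.sleep(delay)  # keep the per-IP request rate below the site's limit
    if resp.status_code in (403, 429):
        return None  # blocked: caller should retry, which rotates to a new IP
    return resp
```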
2. Akamai Bot Manager
Symptom: Only Akamai logo and waiting screen displayed when accessing a page
Cause: Bot detection specialized security solutions analyze browser fingerprints
Frequency: Common on large e-commerce sites
In Korea, Coupang is the typical example. Even when you access the site with Selenium or Playwright, it analyzes browser fingerprints, JavaScript execution patterns, mouse trajectories, and scroll speeds. Bypassing it with conventional crawling tools is nearly impossible.
In a practical test in January 2026, both Firecrawl (including Stealth Proxy) and Jina Reader were blocked by Coupang Akamai. Hashscraper bypasses this using its own browser emulation technology.
Self-resolution cost: Specialized personnel + continuous bypass technology development (annual cost in the millions)
3. CAPTCHA
Symptom: "Not a robot" verification screen
Cause: Confirmation of human presence when suspicious traffic patterns are detected
reCAPTCHA and hCaptcha can be solved automatically through external solving services (2Captcha, Anti-Captcha). However, CAPTCHAs developed in-house, such as Naver Shopping's receipt CAPTCHA, cannot be handled by external services. They require training a dedicated machine-learning model, and whenever the site changes the CAPTCHA images, the model must be retrained.
Self-resolution cost: General CAPTCHA solving at 2~5 KRW per solve; in-house CAPTCHAs require separate ML development
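A hedged sketch of the external-service route using the 2captcha-python client for a standard reCAPTCHA; the site key and URL are placeholders, and the vendor's docs should be checked for exact usage:

```python
from twocaptcha import TwoCaptcha  # pip install 2captcha-python

solver = TwoCaptcha("YOUR_2CAPTCHA_API_KEY")

# Submits the target page's reCAPTCHA and polls until a worker returns a token.
# The sitekey comes from the page source; both values here are placeholders.
result = solver.recaptcha(
    sitekey="SITE_KEY_FROM_TARGET_PAGE",
    url="https://example.com/login",
)
token = result["code"]  # inject into the g-recaptcha-response form field
```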
4. JavaScript-based Bot Detection
Symptom: Blank screen or infinite redirects after page load
Cause: Client-side JavaScript verifies the browser environment
Simple HTTP requests (requests, urllib) are detected immediately. Even with headless browsers, the automation environment is identified through objects such as navigator.webdriver and window.chrome. Tools like Puppeteer Stealth and undetected-chromedriver help, but each site runs different detection logic, so responses must be tailored individually.
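A minimal sketch with Playwright that patches only the navigator.webdriver flag; real detection scripts check far more signals, so treat this as an illustration rather than a working bypass:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    # Runs before any page script, so detection code sees webdriver as undefined.
    context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page = context.new_page()
    page.goto("https://example.com")
    print(page.evaluate("navigator.webdriver"))  # None instead of True
    browser.close()
```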
5. User-Agent/Header Verification
Symptom: 403 Forbidden or abnormal responses
Cause: Request headers do not match actual browser patterns
This is the simplest blocking to deal with: just send User-Agent, Accept, and Referer headers that match a real browser. It is the first problem crawling beginners run into, but fixing it alone is not enough to bypass advanced blocking.
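A sketch of the fix with requests; the header values are examples and should be copied from a real browser session (the network tab shows exactly what your browser sends):

```python
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ko-KR,ko;q=0.9,en-US;q=0.8",
    "Referer": "https://example.com/",
}

resp = requests.get("https://example.com/products", headers=headers, timeout=10)
```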
6. Geo-blocking
Symptom: Blocked or different content returned when accessed from overseas IPs
Cause: Access allowed only from specific country IPs
This is common when crawling Korean sites from overseas servers such as AWS US-East. You need Korean-IP proxies or servers located in Korea.
7. Robots Exclusion Standard (robots.txt)
Symptom: Crawling is possible but legal risks exist
Cause: Site prohibits crawling specific paths in robots.txt
Frequency: Exists on most sites
Response Difficulty: Low (technical) / High (legal)
Technically, robots.txt can be ignored, but legally it is a different story. When crawling large corporate sites for commercial purposes, checking it is essential.
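A sketch of the check using only the standard library's urllib.robotparser; the crawler name and URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler/1.0", "https://example.com/products/123"):
    print("allowed by robots.txt")
else:
    print("disallowed -- get a legal review before crawling this path")
```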
8. WAF (Web Application Firewall)
Symptom: Sudden blocking, inconsistent responses
Cause: Cloudflare, AWS WAF, etc., analyze traffic patterns comprehensively
A WAF comprehensively analyzes IP, request frequency, browser fingerprints, and TLS handshake patterns. Bypassing Cloudflare's "5-second challenge" requires a JavaScript execution environment. Since 2025, the number of sites replacing reCAPTCHA with Cloudflare Turnstile has grown rapidly.
Category 2: Site Changes (6 types)
A crawler that worked perfectly when built suddenly returns empty data one day. No one tells you when it happens.
9. HTML Structure Changes
Symptom: Empty or incorrect data returned
Cause: Frontend updates on the target site
Frequency: Most common cause of failure
Naver Shopping updates its frontend dozens of times a year, and the same goes for Coupang, 11th Street, and Gmarket. Class names change from product-price to prd_price_v2, div structures change, and new components are added.
Actual data: Each crawler requires 6~12 structural-change fixes per year on average. With 10 crawlers, that's 60~120 fixes per year: something breaks every 3 to 6 days.
Self-resolution cost: 3~5 hours per item × 8 times a year = 24~40 hours/year/crawler
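One common mitigation is an ordered chain of fallback selectors plus an alert when every selector comes back empty, so a silent structure change becomes a visible incident. A sketch with BeautifulSoup, where the selector names reuse the rename example above and notify_oncall is a hypothetical alerting hook:

```python
from bs4 import BeautifulSoup

# Ordered fallbacks: newest selector first, older ones kept as a safety net.
PRICE_SELECTORS = [".prd_price_v2", ".product-price", "span[itemprop=price]"]

def extract_price(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    # Every selector failed: the site structure probably changed.
    notify_oncall("price selectors returned empty -- check the target site")
    return None
```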
10. SPA/Dynamic Rendering Transition
Symptom: Pages that used to be fetched well return only empty HTML
Cause: Transition to SPA with React/Vue/Angular, etc.
When a site moves from SSR to an SPA, existing HTTP-based crawlers become completely useless. A full rewrite on a headless-browser basis is required, and resource consumption increases more than tenfold.
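A minimal sketch of the headless-browser approach with Playwright, fetching the rendered DOM instead of the empty shell; the URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Wait until network activity settles so client-side rendering finishes.
    page.goto("https://example.com/products", wait_until="networkidle")
    html = page.content()  # the rendered DOM, not the empty SPA shell
    browser.close()
```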
11. API Endpoint Changes
Symptom: 404 or response format change when calling the API
Cause: Internal API URL/schema changes
Directly calling the internal REST/GraphQL API of SPA sites is more efficient than HTML parsing, but if the API version changes from v2 to v3, the entire parsing logic needs to be rewritten.
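One defensive pattern is pinning the API version in a single constant and validating the response shape, so a version bump fails loudly instead of silently corrupting data. A sketch against a hypothetical JSON endpoint:

```python
import requests

API_BASE = "https://example.com/api/v2"  # pin the version in one place
REQUIRED_FIELDS = {"id", "name", "price"}

resp = requests.get(f"{API_BASE}/products/12345", timeout=10)
resp.raise_for_status()
item = resp.json()

# Cheap schema check: a silent v2 -> v3 migration now fails loudly.
missing = REQUIRED_FIELDS - item.keys()
if missing:
    raise ValueError(f"API schema changed, missing fields: {missing}")
```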
12. URL Pattern Changes
Symptom: Existing URLs return 404
Cause: URL structure overhaul
E.g., /product/12345 → /shop/items/12345. The crawler's URL generation logic needs to be modified.
13. Pagination Method Changes
Symptom: Failure to load the next page, collecting only the first page repeatedly
Cause: Page number → infinite scroll, or offset → cursor-based transition
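When a site moves from offset paging to cursor-based paging, the crawl loop itself has to change shape. A sketch of the cursor pattern against a hypothetical JSON endpoint, where handle is a placeholder per-item handler:

```python
import requests

url = "https://example.com/api/items"
cursor = None

while True:
    params = {"limit": 100}
    if cursor:
        params["cursor"] = cursor
    page = requests.get(url, params=params, timeout=10).json()
    for item in page["items"]:
        handle(item)  # hypothetical per-item handler
    cursor = page.get("next_cursor")
    if not cursor:  # servers typically omit the cursor on the last page
        break
```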
14. Content Loading Method Changes
Symptom: Only some data is collected, the rest is missing
Cause: Introduction of Lazy loading, Intersection Observer-based scroll triggers
Category 3: Infrastructure/Network (5 types)
The crawler code is fine, but problems arise in the execution environment.
15. Insufficient Server Resources
Symptom: Slow speed, OOM (Out of Memory) crashes
Cause: Insufficient memory, CPU, disk capacity
Headless browsers (Chromium) consume 200~500MB of memory per tab. If you have 10 concurrent crawlers, you need 2~5GB. Considering memory leaks, periodic process restarts are essential.
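A sketch of one common countermeasure: recycling the browser process every N pages with Playwright. RESTART_EVERY is a tuning knob, not a recommended constant:

```python
from playwright.sync_api import sync_playwright

RESTART_EVERY = 50  # tuning knob, not a recommendation

def crawl(urls):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        for i, url in enumerate(urls, 1):
            page = browser.new_page()
            page.goto(url)
            # ... extract data here ...
            page.close()
            if i % RESTART_EVERY == 0:
                browser.close()  # releases memory leaked by long-lived Chromium
                browser = p.chromium.launch(headless=True)
        browser.close()
```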
16. Proxy Failure
Symptom: Connection timeouts, intermittent failures
Cause: Proxy server downtime, IP expiration, provider outages
17. DNS Resolution Failure
Symptom: "Host not found" error
Cause: DNS server failure, domain changes
18. SSL/TLS Certificate Issues
Symptom: SSL handshake failure
Cause: Target site certificate expiration/delayed renewal
19. Target Server Downtime
Symptom: 503 Service Unavailable, 504 Gateway Timeout
Cause: Site maintenance or outage
Response Difficulty: Handled with retries plus failure notifications
Category 4: Authentication/Session (4 types)
Crawling sites that require login can be particularly troublesome.
20. Login Session Expiration
Symptom: Sudden redirect to the login page
Cause: Session cookie expiration, token TTL exceeded
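A sketch of detecting the login redirect and re-authenticating once with a requests session; login stands in for whatever re-authentication routine the target site needs:

```python
import requests

session = requests.Session()

def fetch_with_relogin(url: str) -> requests.Response:
    resp = session.get(url, timeout=10)
    if "/login" in resp.url:  # we were redirected to the login page
        login(session)  # hypothetical re-authentication routine
        resp = session.get(url, timeout=10)
    return resp
```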
21. 2FA/MFA Authentication Requirement
Symptom: Requires SMS/email verification
Cause: Security verification triggered when accessing from a new device/IP
Automating 2FA is technically very challenging and mostly prohibited by service terms. It is almost impossible to resolve without manual intervention.
22. OAuth Token Refresh Failure
Symptom: 401 Unauthorized when calling the API
Cause: Refresh token expiration, OAuth app permission changes
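A sketch of the standard countermeasure: refresh the access token on 401 and retry once. The token endpoint and client credentials are placeholders:

```python
import requests

TOKEN_URL = "https://example.com/oauth/token"  # placeholder endpoint

def refresh_access_token(refresh_token: str) -> str:
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": "CLIENT_ID",
        "client_secret": "CLIENT_SECRET",
    }, timeout=10)
    resp.raise_for_status()  # fails loudly if the refresh token itself expired
    return resp.json()["access_token"]

def call_api(url: str, access_token: str, refresh_token: str) -> dict:
    headers = {"Authorization": f"Bearer {access_token}"}
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 401:  # access token expired: refresh and retry once
        headers["Authorization"] = f"Bearer {refresh_access_token(refresh_token)}"
        resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
```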
23. Cookie Policy Changes
Symptom: Existing authentication flow suddenly breaks
Cause: Strengthened SameSite policy, cookie name/domain/path changes
Category 5: Data Quality (4 types)
The crawler may run smoothly while the data it collects is unreliable. The longer this goes undetected, the greater the damage.
24. Honeypot Data
Symptom: Fake information mixed in the collected data
Cause: Sites intentionally provide incorrect data to bots
This is the most cunning defense mechanism: the site shows different prices or non-existent products only to bots. The contamination is hard to detect until the data is manually cross-checked.
25. Personalized Content
Symptom: Different data collected each time from the same URL
Cause: Personalization algorithms, A/B testing, regional price differentials
26. Encoding Issues
Symptom: Korean character corruption, special character errors
Cause: Mixing UTF-8 and EUC-KR, character set mismatch
Frequency: Especially common on Korean sites
This often occurs on older Korean shopping malls or public institution sites. There are still cases where the page header declares UTF-8 while the actual content is in EUC-KR.
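A sketch of a defensive decode for exactly that case: trust the bytes rather than the declared charset. The URL is a placeholder:

```python
import requests

resp = requests.get("https://example.com/old-mall", timeout=10)

try:
    text = resp.content.decode("utf-8")
except UnicodeDecodeError:
    # The declared charset was wrong; cp949 is a superset of EUC-KR.
    text = resp.content.decode("cp949", errors="replace")
```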
27. Dynamic Price/Stock Mismatch
Symptom: Collected price differs from displayed price
Cause: Real-time price changes, regional/member-level price differentials
Frequency: Essential consideration for e-commerce
Actual Costs Incurred in Failure Response
How much does it cost to handle all 27 failure types yourself?
Personnel
| Role | Required Level | Annual Salary (as of 2025) |
|---|---|---|
| Crawling Senior Developer | 5+ years experience, practical experience in bypassing blocks | 80 million~120 million KRW |
| Infrastructure Engineer | Server/proxy/monitoring operations | 60 million~90 million KRW |
If you operate more than 5 crawlers, at least one person must be dedicated to crawling. Splitting the role across other duties delays failure response and causes data gaps.
Infrastructure
| Item | Monthly Cost |
|---|---|
| Server (for crawler execution) | 500,000~2,000,000 KRW |
| Proxy Service | 500,000~3,000,000 KRW |
| CAPTCHA solving service | 100,000~500,000 KRW |
| Monitoring/Notification | 100,000~300,000 KRW |
| Total | 1,200,000~5,800,000 KRW/month |
Time
| Failure Category | Average Response Time | Monthly Occurrence | Annual Time Investment |
|---|---|---|---|
| Access Blocking | 4~16 hours | 2~4 times | 96~768 hours |
| Site Changes | 2~8 hours | 1~3 times | 24~288 hours |
| Infrastructure Failure | 1~4 hours | 1~2 times | 12~96 hours |
| Authentication Issues | 2~6 hours | 0.3~0.7 times | 7~50 hours |
| Total | | | 139~1,202 hours/year |
With 5 crawlers, 200~500 hours are spent annually on failure response alone. This is 10~25% of a senior developer's working hours.
3 Ways to Solve This Problem
Method 1: Build It Yourself
As the analysis above shows, personnel and infrastructure cost roughly 150 million~300 million KRW per year. This option suits companies for which crawling is a core business or that already have specialized personnel.
Method 2: Subscription — All-in-One Outsourcing
Entrust everything from crawler development to operation, maintenance, and failure response.
| Failure Category | Direct Response | Hashscraper Subscription |
|---|---|---|
| Access Blocking (8 types) | Develop and manage proxies/bypasses yourself | Solved with proprietary technology |
| Site Changes (6 types) | Detect and fix manually | Automatic response within 24 hours |
| Infrastructure (5 types) | Operate servers/proxies yourself | Dedicated infrastructure included |
| Authentication/Session (4 types) | Implement session management yourself | Automation included |
| Data Quality (4 types) | Develop verification logic yourself | Multi-stage quality verification |
Suitable for companies that need data but not crawlers. Starts from 300,000 KRW/month.
Method 3: MCP/API — Direct Integration by Development Team
For companies that have their own development team but want to outsource block circumvention and infrastructure management. Calling a crawling API from an AI agent also falls under this method.
Global services such as Firecrawl and Jina Reader are completely blocked on major Korean sites (Coupang, Naver, Instagram). Hashscraper's MCP server solves this with block-circumvention technology accumulated over 8 years.
You can start from 30,000 KRW per month with credit-based pricing.
Conclusion: Which Choice Is Right?
| Situation | Recommendation |
|---|---|
| Crawling is a core business + specialized personnel available | Build it yourself |
| Only need data, want to outsource development/operation | Subscription |
| Have a development team, need block circumvention+infrastructure management | MCP/API |
| Small-scale/irregular collection | Credit |
Are you ready to handle all 27 failure types on your own? If not, rely on 8 years of experience.
Start with Credits →
Learn about MCP Server →
Consultation for Subscription →
Hashscraper — We handle crawling failures for you. You focus on the data.