Automating Crawling Monitoring — How to Ensure Data Quality 24/7

A guide to automating crawl monitoring and keeping data quality high 24/7. We share the patterns in which crawlers break quietly and strategies for detecting and recovering from them automatically.

Creating a crawler is 20% of the project. The remaining 80% is operation.

"One day, a well-functioning crawler suddenly started spitting out empty data, and no one knew why." - If you have operated a crawling system, you may have experienced this at least once. This article summarizes patterns in which crawlers quietly break and how to automatically detect and recover from them.

5 Patterns in Which Crawlers Quietly Break

The most dangerous failure of a crawler is returning incorrect data without errors. Even though HTTP 200 is returned, the actual data may be empty or contain incorrect values.

1. HTML Structure Changes

When the target site is redesigned or running an A/B test, CSS selectors stop matching and data extraction fails. There are no errors; the result is simply None or an empty string.

2. Stronger Bot Blocking

IP blocks, CAPTCHAs, Cloudflare protection, and the like are suddenly applied. The response code is still 200, but the returned content is an "access denied" block page.
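
One way to catch this pattern is to inspect the response body for block-page markers instead of trusting the status code. Below is a minimal sketch of such a check; the is_block_page helper and its marker strings are illustrative assumptions (the same helper name is reused in the proxy rotation example later) and should be tuned per target site.

def is_block_page(html: str) -> bool:
    """Heuristic check for block/challenge pages served with HTTP 200.
    Marker strings are illustrative; adjust them for each target site."""
    markers = (
        "access denied",
        "verify you are human",
        "captcha",
        "attention required | cloudflare",
    )
    text = html.lower()
    return any(marker in text for marker in markers)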

3. Timeout & Network Errors

The response from the target server slows down or intermittently fails at specific times. Without retry logic, data may be missed.

4. Data Schema Changes

A field is renamed, say from price to salePrice, or the date format changes. The crawler keeps running normally, but problems surface in the downstream pipeline (DB loading, analysis, and so on).

5. Pagination/Infinite Scroll Changes

The "next page" button disappears, or the infinite scroll API endpoint changes. The situation arises where only the first page is collected.

What do these five patterns have in common? In the logs, everything looks normal.

4 Key Metrics to Monitor

To catch a crawler that breaks while pretending to be "normal," you need monitoring at the data level, not just error logs.

1. Success Rate

Track the percentage of responses that return genuinely valid data, not just an HTTP 200 status.

# Success-rate monitoring example
from datetime import datetime

def monitor_crawl_success(results):
    total = len(results)
    valid = sum(1 for r in results if r.get("title") and r.get("price"))
    success_rate = valid / total * 100 if total > 0 else 0

    # Alert when the success rate falls below the threshold
    if success_rate < 90:
        send_alert(
            level="warning" if success_rate >= 70 else "critical",
            message=f"Crawl success rate dropped: {success_rate:.1f}% ({valid}/{total})",
            timestamp=datetime.now().isoformat()
        )

    return {"success_rate": success_rate, "total": total, "valid": valid}

2. Response Time

If the average response time suddenly jumps to 2-3x its usual level, that signals throttling, blocking, or trouble on the target server.
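
As a minimal sketch, assuming you record per-request durations and keep a rolling baseline average from recent healthy runs (the function and parameter names below are hypothetical), the check could look like this:

from statistics import mean

def check_response_time(durations, baseline_avg, factor=2.5):
    """Alert when the average response time exceeds the baseline by a given factor.
    durations: per-request durations in seconds from the latest run.
    baseline_avg: rolling average from recent healthy runs (tracked elsewhere)."""
    if not durations or not baseline_avg:
        return None

    current_avg = mean(durations)
    if current_avg > baseline_avg * factor:
        send_alert(
            level="warning",
            message=f"Average response time {current_avg:.1f}s is {current_avg / baseline_avg:.1f}x the baseline ({baseline_avg:.1f}s)"
        )
    return current_avg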

3. Data Completeness

Check whether all required fields are filled: the share of results that contain a price, the share that contain an image URL, and so on.

def check_data_completeness(results, required_fields):
    """Check completeness of required fields."""
    if not results:
        return {field: 0.0 for field in required_fields}

    completeness = {}
    for field in required_fields:
        filled = sum(1 for r in results if r.get(field))
        completeness[field] = filled / len(results) * 100

    # A sharp drop in one field's completeness suggests a schema change
    for field, rate in completeness.items():
        if rate < 80:
            send_alert(
                level="warning",
                message=f"Field '{field}' completeness dropped to {rate:.1f}%, check for a schema change"
            )

    return completeness

4. Schema Change Detection

Periodically compare the structure of collected data. Send alerts if new fields appear or if the value format of existing fields changes.
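
A simple way to do this is to diff the set of field names seen in the current run against the previous run. The sketch below assumes you persist the previous run's field set yourself (for example in a file or database); the helper name is illustrative.

def detect_schema_change(results, previous_fields):
    """Compare the fields seen in this run against the previous run's field set.
    previous_fields: set of field names from the last run, persisted by the caller."""
    current_fields = set()
    for r in results:
        current_fields.update(r.keys())

    added = current_fields - previous_fields
    removed = previous_fields - current_fields

    if added or removed:
        send_alert(
            level="warning",
            message=f"Schema change detected. Added fields: {sorted(added)}, removed fields: {sorted(removed)}"
        )

    return current_fields  # persist as the baseline for the next run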

Setting Up Automated Alerts

Even if you are monitoring metrics, it is impossible for a person to watch the dashboard 24/7. Automated alerts are essential.

import requests
import smtplib
from email.mime.text import MIMEText

def send_slack_alert(webhook_url, message, level="warning"):
    """Send an alert through a Slack incoming webhook."""
    emoji = "⚠️" if level == "warning" else "🚨"
    payload = {
        "text": f"{emoji} *Crawl Monitoring Alert*\n{message}",
        "username": "Crawl Monitor",
    }
    requests.post(webhook_url, json=payload)

def send_email_alert(to_email, subject, body):
    """Send an alert email."""
    msg = MIMEText(body)
    msg["Subject"] = f"[Crawl Alert] {subject}"
    msg["From"] = "monitor@your-domain.com"
    msg["To"] = to_email

    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        server.login("your-email", "app-password")
        server.send_message(msg)

Alert Setup Tips (a minimal sketch of the first three follows the list):
- Tiered alerts: success rate below 90% → Slack alert; below 70% → email + PagerDuty
- Flapping prevention: alert only after 3 consecutive failures (ignore transient errors)
- Alert fatigue management: at most one alert per issue per hour
- Recovery alerts: send a "recovered" alert once the issue is resolved, so the team knows it is safe to stand down
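
The sketch below combines these rules, assuming an in-memory policy object; a real deployment would persist this state, and the class name and thresholds are illustrative.

import time

class AlertPolicy:
    """Tiered alerting with flapping prevention and per-issue rate limiting (illustrative sketch)."""

    def __init__(self, consecutive_needed=3, cooldown_seconds=3600):
        self.consecutive_needed = consecutive_needed
        self.cooldown_seconds = cooldown_seconds
        self.failure_streak = 0
        self.last_alert_at = {}  # issue key -> timestamp of the last alert sent

    def record(self, success_rate):
        # Flapping prevention: require several consecutive bad runs before alerting
        if success_rate >= 90:
            self.failure_streak = 0
            return
        self.failure_streak += 1
        if self.failure_streak < self.consecutive_needed:
            return

        # Tiered severity: warning below 90%, critical below 70%
        level = "critical" if success_rate < 70 else "warning"

        # Alert fatigue management: at most one alert per issue per cooldown window
        now = time.time()
        key = f"success_rate:{level}"
        if now - self.last_alert_at.get(key, 0) < self.cooldown_seconds:
            return
        self.last_alert_at[key] = now

        send_alert(
            level=level,
            message=f"Crawl success rate at {success_rate:.1f}% for {self.failure_streak} consecutive runs"
        )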

Automatic Recovery Strategies

Alerts alone are not enough; common failure patterns can be recovered from automatically.

1. Exponential Backoff Retries

import time
import random

def crawl_with_retry(url, max_retries=3):
    """Exponential backoff retries: recover automatically from transient errors."""
    for attempt in range(max_retries):
        try:
            result = crawl_page(url)  # site-specific fetch/parse function
            if result and result.get("data"):
                return result
        except Exception:
            pass

        # Retry interval: 1s -> 2s -> 4s (plus random jitter)
        wait = (2 ** attempt) + random.uniform(0, 1)
        time.sleep(wait)

    return None  # all retries failed: hand off to alerting

2. Proxy Rotation

Automatically switch to a different proxy when IP blocking is detected.

def crawl_with_proxy_rotation(url, proxies):
    """Proxy rotation: switch automatically when an IP is blocked."""
    for proxy in proxies:
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            if response.status_code == 200 and not is_block_page(response.text):
                return response
        except requests.RequestException:
            continue

    send_alert(level="critical", message=f"{url} is blocked on every proxy")
    return None

3. Fallback Strategy

If the main crawling method fails, switch to an alternative path (a minimal sketch follows the list below):
- CSS selector fails → Try XPath
- API endpoint changes → Switch to the mobile version page
- Specific IP range blocked → Use proxies from a different region
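
As an example of the first fallback, the sketch below tries a CSS selector first and falls back to an XPath expression, assuming lxml (with the cssselect extra) is installed; the selectors themselves are illustrative.

from lxml import html as lxml_html

def extract_price(page_html):
    """Try the primary CSS selector, then fall back to an XPath expression.
    The selectors below are illustrative; replace them with ones matching your target site."""
    tree = lxml_html.fromstring(page_html)

    # Primary path: CSS selector
    nodes = tree.cssselect("span.product-price")
    if nodes and nodes[0].text_content().strip():
        return nodes[0].text_content().strip()

    # Fallback path: XPath
    nodes = tree.xpath("//*[contains(@class, 'price')]")
    if nodes and nodes[0].text_content().strip():
        return nodes[0].text_content().strip()

    return None  # both paths failed: count as missing data and let monitoring catch it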

The reality? Building and maintaining all of this requires significant engineering resources.

Self-Operation vs. Managed Service - Cost Comparison

Let's honestly compare what it costs to build and run a crawl monitoring and operations setup yourself.

| Item | Self-Operation | Managed Service (HashScraper) |
| --- | --- | --- |
| Initial setup | 2-4 weeks of development time | Start immediately after setup |
| Proxy costs | $100-500+ per month | Included |
| Monitoring | Must be built manually | Built in |
| Failure response | Handled by developers | Automatic recovery + dedicated support |
| Site change response | Manual updates | Automatic detection + correction |
| Labor costs | Engineer hours (the largest cost) | Included in the service fee |

The biggest cost of self-operation is the invisible one: the time spent responding to a broken crawler in the middle of the night, figuring out site structure changes and updating selectors, and managing proxies. As those hours accumulate, they drain resources that should be going into product development.

HashScraper takes on this operational burden. With built-in monitoring, failure response, and site change tracking, the development team can focus solely on utilizing crawling data.

Conclusion

The real battle with a crawling system is not at the moment of creation but in operating it every day.

A crawler without monitoring is a time bomb. Just because it is working well now does not guarantee it will work well tomorrow. The target site changes daily, and bot blocking becomes more sophisticated.

Whether you build a monitoring system yourself or use a managed service, the key is the same: create a setup that tells you immediately when the crawler breaks.

If you are tired of operating crawling systems, contact HashScraper. As a managed crawling service with built-in monitoring and maintenance, we ensure data quality 24/7.
