Automating Crawling Monitoring — How to Ensure Data Quality 24/7

A guide to automating crawl monitoring and keeping data quality high 24/7. We share the patterns in which crawlers break quietly and strategies for detecting and recovering from them automatically.

Creating a crawler is 20% of the project. The remaining 80% is operation.

"One day, a well-functioning crawler suddenly started spitting out empty data, and no one knew why." - If you have operated a crawling system, you may have experienced this at least once. This article summarizes patterns in which crawlers quietly break and how to automatically detect and recover from them.

5 Patterns in Which Crawlers Quietly Break

The most dangerous failure of a crawler is returning incorrect data without errors. Even though HTTP 200 is returned, the actual data may be empty or contain incorrect values.

1. HTML Structure Changes

When the target site is redesigned or running an A/B test, CSS selectors stop matching and data extraction fails. There are no errors; the result is simply None or an empty string.

2. Stronger Bot Blocking

IP blocks, CAPTCHAs, Cloudflare protection, and the like are suddenly applied. The response code is still 200, but the returned content is an "access denied" block page.
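
One way to catch this pattern is to inspect the response body for block-page markers instead of trusting the status code. Below is a minimal sketch of such a check; the is_block_page helper and its marker strings are illustrative assumptions (the same helper name is reused in the proxy rotation example later) and should be tuned per target site.

def is_block_page(html: str) -> bool:
    """Heuristic check for block/challenge pages served with HTTP 200.
    Marker strings are illustrative; adjust them for each target site."""
    markers = (
        "access denied",
        "verify you are human",
        "captcha",
        "attention required | cloudflare",
    )
    text = html.lower()
    return any(marker in text for marker in markers)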

3. Timeout & Network Errors

The response from the target server slows down or intermittently fails at specific times. Without retry logic, data may be missed.

4. Data Schema Changes

A field is renamed, say from price to salePrice, or the date format changes. The crawler keeps running normally, but problems surface in the downstream pipeline (DB loading, analysis, and so on).

5. Pagination/Infinite Scroll Changes

The "next page" button disappears, or the infinite scroll API endpoint changes. The situation arises where only the first page is collected.

What do these five patterns have in common? In the logs, everything looks normal.

4 Key Metrics to Monitor

To catch a crawler that breaks while pretending to be "normal," you need monitoring at the data level, not just error logs.

1. Success Rate

Track the percentage of responses that return genuinely valid data, not just an HTTP 200 status.

# Success-rate monitoring example
from datetime import datetime

def monitor_crawl_success(results):
    total = len(results)
    valid = sum(1 for r in results if r.get("title") and r.get("price"))
    success_rate = valid / total * 100 if total > 0 else 0

    # Alert when the success rate falls below the threshold
    if success_rate < 90:
        send_alert(
            level="warning" if success_rate >= 70 else "critical",
            message=f"Crawl success rate dropped: {success_rate:.1f}% ({valid}/{total})",
            timestamp=datetime.now().isoformat()
        )

    return {"success_rate": success_rate, "total": total, "valid": valid}

2. Response Time

If the average response time suddenly jumps to 2-3x its usual level, that signals throttling, blocking, or trouble on the target server.
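
As a minimal sketch, assuming you record per-request durations and keep a rolling baseline average from recent healthy runs (the function and parameter names below are hypothetical), the check could look like this:

from statistics import mean

def check_response_time(durations, baseline_avg, factor=2.5):
    """Alert when the average response time exceeds the baseline by a given factor.
    durations: per-request durations in seconds from the latest run.
    baseline_avg: rolling average from recent healthy runs (tracked elsewhere)."""
    if not durations or not baseline_avg:
        return None

    current_avg = mean(durations)
    if current_avg > baseline_avg * factor:
        send_alert(
            level="warning",
            message=f"Average response time {current_avg:.1f}s is {current_avg / baseline_avg:.1f}x the baseline ({baseline_avg:.1f}s)"
        )
    return current_avg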

3. Data Completeness

Check whether all required fields are filled: the share of results that contain a price, the share that contain an image URL, and so on.

def check_data_completeness(results, required_fields):
    """Check completeness of required fields."""
    if not results:
        return {field: 0.0 for field in required_fields}

    completeness = {}
    for field in required_fields:
        filled = sum(1 for r in results if r.get(field))
        completeness[field] = filled / len(results) * 100

    # A sharp drop in one field's completeness suggests a schema change
    for field, rate in completeness.items():
        if rate < 80:
            send_alert(
                level="warning",
                message=f"Field '{field}' completeness dropped to {rate:.1f}%, check for a schema change"
            )

    return completeness

4. Schema Change Detection

Periodically compare the structure of collected data. Send alerts if new fields appear or if the value format of existing fields changes.
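
A simple way to do this is to diff the set of field names seen in the current run against the previous run. The sketch below assumes you persist the previous run's field set yourself (for example in a file or database); the helper name is illustrative.

def detect_schema_change(results, previous_fields):
    """Compare the fields seen in this run against the previous run's field set.
    previous_fields: set of field names from the last run, persisted by the caller."""
    current_fields = set()
    for r in results:
        current_fields.update(r.keys())

    added = current_fields - previous_fields
    removed = previous_fields - current_fields

    if added or removed:
        send_alert(
            level="warning",
            message=f"Schema change detected. Added fields: {sorted(added)}, removed fields: {sorted(removed)}"
        )

    return current_fields  # persist as the baseline for the next run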

Setting Up Automated Alerts

Even if you are monitoring metrics, it is impossible for a person to watch the dashboard 24/7. Automated alerts are essential.

import requests
import smtplib
from email.mime.text import MIMEText

def send_slack_alert(webhook_url, message, level="warning"):
    """Send an alert through a Slack incoming webhook."""
    emoji = "⚠️" if level == "warning" else "🚨"
    payload = {
        "text": f"{emoji} *Crawl Monitoring Alert*\n{message}",
        "username": "Crawl Monitor",
    }
    requests.post(webhook_url, json=payload)

def send_email_alert(to_email, subject, body):
    """Send an alert email."""
    msg = MIMEText(body)
    msg["Subject"] = f"[Crawl Alert] {subject}"
    msg["From"] = "monitor@your-domain.com"
    msg["To"] = to_email

    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        server.login("your-email", "app-password")
        server.send_message(msg)

Alert Setup Tips (a minimal sketch of the first three follows the list):
- Tiered alerts: success rate below 90% → Slack alert; below 70% → email + PagerDuty
- Flapping prevention: alert only after 3 consecutive failures (ignore transient errors)
- Alert fatigue management: at most one alert per issue per hour
- Recovery alerts: send a "recovered" alert once the issue is resolved, so the team knows it is safe to stand down
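
The sketch below combines these rules, assuming an in-memory policy object; a real deployment would persist this state, and the class name and thresholds are illustrative.

import time

class AlertPolicy:
    """Tiered alerting with flapping prevention and per-issue rate limiting (illustrative sketch)."""

    def __init__(self, consecutive_needed=3, cooldown_seconds=3600):
        self.consecutive_needed = consecutive_needed
        self.cooldown_seconds = cooldown_seconds
        self.failure_streak = 0
        self.last_alert_at = {}  # issue key -> timestamp of the last alert sent

    def record(self, success_rate):
        # Flapping prevention: require several consecutive bad runs before alerting
        if success_rate >= 90:
            self.failure_streak = 0
            return
        self.failure_streak += 1
        if self.failure_streak < self.consecutive_needed:
            return

        # Tiered severity: warning below 90%, critical below 70%
        level = "critical" if success_rate < 70 else "warning"

        # Alert fatigue management: at most one alert per issue per cooldown window
        now = time.time()
        key = f"success_rate:{level}"
        if now - self.last_alert_at.get(key, 0) < self.cooldown_seconds:
            return
        self.last_alert_at[key] = now

        send_alert(
            level=level,
            message=f"Crawl success rate at {success_rate:.1f}% for {self.failure_streak} consecutive runs"
        )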

Automatic Recovery Strategies

Alerts alone are not enough; common failure patterns can be recovered from automatically.

1. Exponential Backoff Retries

import time
import random

def crawl_with_retry(url, max_retries=3):
    """Exponential backoff retries: recover automatically from transient errors."""
    for attempt in range(max_retries):
        try:
            result = crawl_page(url)  # site-specific fetch/parse function
            if result and result.get("data"):
                return result
        except Exception:
            pass

        # Retry interval: 1s -> 2s -> 4s (plus random jitter)
        wait = (2 ** attempt) + random.uniform(0, 1)
        time.sleep(wait)

    return None  # all retries failed: hand off to alerting

2. Proxy Rotation

Automatically switch to a different proxy when IP blocking is detected.

def crawl_with_proxy_rotation(url, proxies):
    """Proxy rotation: switch automatically when an IP is blocked."""
    for proxy in proxies:
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            if response.status_code == 200 and not is_block_page(response.text):
                return response
        except requests.RequestException:
            continue

    send_alert(level="critical", message=f"{url} is blocked on every proxy")
    return None

3. Fallback Strategy

If the main crawling method fails, switch to an alternative path (a minimal sketch follows the list below):
- CSS selector fails → Try XPath
- API endpoint changes → Switch to the mobile version page
- Specific IP range blocked → Use proxies from a different region
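
As an example of the first fallback, the sketch below tries a CSS selector first and falls back to an XPath expression, assuming lxml (with the cssselect extra) is installed; the selectors themselves are illustrative.

from lxml import html as lxml_html

def extract_price(page_html):
    """Try the primary CSS selector, then fall back to an XPath expression.
    The selectors below are illustrative; replace them with ones matching your target site."""
    tree = lxml_html.fromstring(page_html)

    # Primary path: CSS selector
    nodes = tree.cssselect("span.product-price")
    if nodes and nodes[0].text_content().strip():
        return nodes[0].text_content().strip()

    # Fallback path: XPath
    nodes = tree.xpath("//*[contains(@class, 'price')]")
    if nodes and nodes[0].text_content().strip():
        return nodes[0].text_content().strip()

    return None  # both paths failed: count as missing data and let monitoring catch it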

The reality? Building and maintaining all of this requires significant engineering resources.

Self-Operation vs. Managed Service - Cost Comparison

Let's honestly compare what it costs to build and run a crawl monitoring and operations setup yourself.

| Item | Self-Operation | Managed Service (HashScraper) |
| --- | --- | --- |
| Initial setup | 2-4 weeks of development time | Start immediately after setup |
| Proxy costs | $100-500+ per month | Included |
| Monitoring | Must be built manually | Built in |
| Failure response | Handled by developers | Automatic recovery + dedicated support |
| Site change response | Manual updates | Automatic detection + correction |
| Labor costs | Engineer hours (the largest cost) | Included in the service fee |

The biggest cost of self-operation is the invisible one: the time spent responding to a broken crawler in the middle of the night, figuring out site structure changes and updating selectors, and managing proxies. As those hours accumulate, they drain resources that should be going into product development.

HashScraper takes on this operational burden. With built-in monitoring, failure response, and site change tracking, the development team can focus solely on utilizing crawling data.

Conclusion

The real battle with a crawling system is not at the moment of creation but in operating it every day.

A crawler without monitoring is a time bomb. Just because it is working well now does not guarantee it will work well tomorrow. The target site changes daily, and bot blocking becomes more sophisticated.

Whether you build a monitoring system yourself or use a managed service, the key is the same: create a setup that tells you immediately when the crawler breaks.

If you are tired of operating crawling systems, contact HashScraper. As a managed crawling service with built-in monitoring and maintenance, we ensure data quality 24/7.
