Python Web Scraping Introduction 2026 Complete Guide

A comprehensive guide to Python web crawling. Covers concepts of web crawling and scraping, legal issues, and the reality of large-scale crawling in detail.


"Manually check the prices of 3,000 products every day." - No one is fine even after hearing this. In reality, many domestic e-commerce companies automate this task with Python crawling, and the start is surprisingly simple. In this article, we have summarized from the basics of web scraping to real problems encountered in practice, along with actual working code.


Table of Contents

  1. What is Web Crawling?
  2. Python Crawling Basics — requests + BeautifulSoup
  3. Dynamic Page Crawling — Selenium and Playwright
  4. Common Problems Encountered in Crawling
  5. Crawling and Legal Issues — Things to Know
  6. Reality of Large-scale Crawling — Maintenance Hell
  7. Build vs Professional Services — When to Choose What?

What is Web Crawling?

Web Crawling is a technology where a program automatically visits web pages to extract desired data. It is also called Web Scraping, and in practice, these two terms are used almost interchangeably.

[Basic Flow of Web Crawling: HTTP Request → Receive HTML → Parsing → Data Extraction → Storage](images/seo1-crawling-flow.png)

Where can it be used? The scope is broader than you might think.

  • Price Monitoring — Automatically collect prices of 3,000 products from competitor shopping malls every day for tracking the lowest price.
  • Market Research — Collect product information, reviews, and ratings data from Naver Shopping, Coupang, etc., all at once.
  • Lead Generation — Collect a large amount of potential customer contacts and company information for sales use.
  • Content Monitoring — Real-time tracking of specific keywords from news, SNS, communities.
  • Data Analysis — Building large datasets such as real estate prices, job postings, academic data, etc.

The reason Python is the most common choice for crawling is clear: it has a rich ecosystem of libraries such as BeautifulSoup, Selenium, and Playwright, and its intuitive syntax allows for quick prototyping.


Python Crawling Basics — requests + BeautifulSoup

This is the most basic combination. Use requests to fetch HTML and BeautifulSoup to parse the desired data. This is sufficient for static HTML pages.

Installation

pip install requests beautifulsoup4

Basic Example: Collecting News Headlines

import requests
from bs4 import BeautifulSoup

# 1. Fetch the web page HTML
#    Send an HTTP GET request with requests.get()
url = "https://news.ycombinator.com/"

# 2. Set a User-Agent header
#    Many sites treat header-less requests as bots and block them
#    Set a User-Agent so the request looks like a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get(url, headers=headers)
# response.raise_for_status()  # Raise an exception on HTTP errors (optional)

# 3. Parse the HTML with BeautifulSoup
#    html.parser is Python's built-in parser (no extra installation needed)
soup = BeautifulSoup(response.text, "html.parser")

# 4. Extract the desired elements with a CSS selector
#    .titleline > a → <a> tags under class="titleline" (headline + link)
titles = soup.select(".titleline > a")
for i, title in enumerate(titles, 1):
    print(f"{i}. {title.text}")
    print(f"   Link: {title['href']}")

With these roughly 20 lines of code, you can collect every headline and link on the Hacker News front page. This is exactly why the barrier to entry for Python crawling is so low.

Crawling Multiple Pages (Pagination)

In practice, it's rare to collect data from just one page. Once you know how to walk through multiple pages, you can apply the same pattern to most list-style sites.

import requests
from bs4 import BeautifulSoup
import time

# Pattern that puts the page number into the URL; most list pages follow this structure
base_url = "https://example-blog.com/posts?page={}"
headers = {"User-Agent": "Mozilla/5.0"}  # as in the first example, look like a real browser
all_posts = []

for page in range(1, 11):  # iterate over pages 1 through 10
    response = requests.get(base_url.format(page), headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract the title, date, and URL from each post card
    posts = soup.select(".post-item")
    for post in posts:
        all_posts.append({
            "title": post.select_one(".post-title").text.strip(),
            "date": post.select_one(".post-date").text.strip(),
            "url": post.select_one("a")["href"]
        })

    # Pause between requests to avoid overloading the server
    # Even 1 second can be too short; checking Crawl-delay in the target site's robots.txt is recommended
    time.sleep(1)

print(f"Collected {len(all_posts)} posts")

Tip: Leaving an interval between requests with time.sleep() is basic web scraping etiquette. Requests that come too fast can overload the server and lead straight to an IP block. In general, an interval of 1-3 seconds is safe.
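
To avoid hitting the server at a perfectly regular rhythm, a common refinement is to randomize the delay within that 1-3 second window. Below is a minimal sketch; the polite_get helper is illustrative, not part of any library:

import random
import time

import requests

def polite_get(url, headers=None, min_delay=1.0, max_delay=3.0):
    """Send a GET request, then sleep for a random 1-3 seconds before returning."""
    response = requests.get(url, headers=headers, timeout=10)
    time.sleep(random.uniform(min_delay, max_delay))  # jitter so requests are not perfectly periodic
    return response

# Usage: a drop-in replacement for requests.get() inside a crawl loop
# response = polite_get("https://example-blog.com/posts?page=1")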


Dynamic Page Crawling — Selenium and Playwright

These days, many websites render content with JavaScript. In SPAs (Single Page Applications) built with React, Vue, or Next.js, fetching the HTML with requests often returns nothing but an empty <div id="root"></div>, because the data is only loaded after JavaScript runs.

In such cases, you need a browser automation tool. It involves launching a real browser, executing JavaScript, and extracting data from the rendered result.

[Comparison of Static Crawling vs Dynamic Crawling: requests only receives server HTML, Selenium/Playwright access DOM after JS rendering](images/seo1-static-vs-dynamic.png)

Selenium Crawling Example

Selenium is the longest-established browser automation tool, and it has the most abundant crawling-related learning resources.

pip install selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# 1. Launch Chrome in headless (no GUI) mode
#    Headless mode is required on servers that have no display
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    # 2. Open the target page
    driver.get("https://example-spa.com/products")

    # 3. Wait up to 10 seconds for JavaScript rendering to finish
    #    WebDriverWait moves on as soon as the elements appear
    #    If they do not appear within 10 seconds, a TimeoutException is raised
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-card"))
    )

    # 4. Extract data from the rendered DOM
    products = driver.find_elements(By.CSS_SELECTOR, ".product-card")
    for product in products:
        name = product.find_element(By.CSS_SELECTOR, ".product-name").text
        price = product.find_element(By.CSS_SELECTOR, ".product-price").text
        print(f"{name}: {price}")
finally:
    # 5. Quit the browser; leaving it open leaks memory
    driver.quit()

Playwright Crawling Example (Recommended for 2026)

Playwright is a next-generation browser automation library created by Microsoft. It is generally faster and more stable than Selenium, and its built-in auto-wait eliminates boilerplate like WebDriverWait. As of 2026, Playwright is the recommended choice for new projects.

pip install playwright
playwright install chromium  # downloads the browser binary

from playwright.sync_api import sync_playwright

# Playwright manages resources automatically via a context manager
with sync_playwright() as p:
    # 1. Launch a Chromium browser in headless mode
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # 2. Open the target page
    page.goto("https://example-spa.com/products")

    # 3. Playwright automatically waits for elements to appear
    #    No separate WebDriverWait code is needed, which is much cleaner
    cards = page.locator(".product-card").all()

    for card in cards:
        name = card.locator(".product-name").text_content()
        price = card.locator(".product-price").text_content()
        print(f"{name}: {price}")

    # 4. Close the browser
    browser.close()

Selenium vs Playwright — Which to Use?

[Selenium vs Playwright Comparison Infographic](images/seo1-selenium-vs-playwright.png)

| Comparison Item | Selenium | Playwright |
|---|---|---|
| Speed | Average | Generally considered faster than Selenium |
| Auto-wait | Manual setup required | Built-in |
| Browser Support | Chrome, Firefox, Edge, Safari | Chromium, Firefox, WebKit |
| Network Interception | Limited | Provided by default |
| Learning Resources | Very abundant (10+ years of history) | Growing rapidly |
| Recommended Situation | Legacy projects, simple automation | New projects, large-scale crawling |

Common Problems Encountered in Crawling

By now you might be thinking, "Crawling doesn't seem that hard?" In practice, however, you will run into problems that are far more challenging than writing the code. Here are five common problems most developers face when building their first crawler.

1. IP Blocking

Most sites block repeated requests from the same IP. The countermeasures are proxy rotation and longer request intervals, but the problem is cost. High-quality residential proxies cost $5-15 per GB, with monthly base fees of $50-500. Add proxy quality monitoring and replacing blocked IPs, and it is not a light expense.
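
As a rough illustration of what proxy rotation can look like with requests (a minimal sketch; the proxy addresses below are placeholders and would come from a paid provider):

import itertools

import requests

# Placeholder proxy endpoints; in practice these come from a paid proxy provider
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
proxy_cycle = itertools.cycle(PROXIES)

def get_with_rotation(url, retries=3):
    """Try the request through a different proxy on each attempt."""
    for _ in range(retries):
        proxy = next(proxy_cycle)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
        except requests.RequestException:
            continue  # proxy failed or was blocked; move on to the next one
    raise RuntimeError(f"All proxy attempts failed for {url}")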

2. CAPTCHA and Bot Detection

Cloudflare, PerimeterX, Akamai Bot Manager, DataDome: bot prevention solutions become more sophisticated every year. reCAPTCHA v3 analyzes mouse movements, scroll patterns, and typing speed. CAPTCHA solving services (such as 2Captcha) charge 1-5 cents per solve depending on the type, but at 100,000 solves a day the monthly bill can far exceed $6,000. Managing solve speed and success rates is not easy either.

3. Dynamic Rendering and Infinite Scroll

The number of sites that require JavaScript rendering keeps growing. Because you have to launch a browser instance, each page consumes 300-500MB of memory, and crawling runs 5-10 times slower than static crawling. Running 50 browsers simultaneously requires a server with at least 16GB of RAM, and the cloud infrastructure cost comes to around $150-300 per month.
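
One common way to keep memory in check is to reuse a single browser and cap how many pages are open at once. Below is a minimal sketch using Playwright's async API and an asyncio semaphore; the URL list and the limit of 5 concurrent pages are illustrative:

import asyncio

from playwright.async_api import async_playwright

URLS = [f"https://example-spa.com/products?page={i}" for i in range(1, 51)]

async def scrape_page(browser, sem, url):
    # The semaphore ensures at most 5 pages are open at the same time
    async with sem:
        page = await browser.new_page()
        try:
            await page.goto(url)
            return await page.locator(".product-card").count()
        finally:
            await page.close()  # free the page's memory as soon as we are done

async def main():
    sem = asyncio.Semaphore(5)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        counts = await asyncio.gather(*(scrape_page(browser, sem, u) for u in URLS))
        await browser.close()
    print(f"Collected card counts from {len(counts)} pages")

asyncio.run(main())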

4. Site Structure Changes

If the target site changes its design or HTML structure, the crawler breaks immediately. Actively maintained sites change their structure every 2-3 months. You have to keep running a cycle of change detection → code modification → testing → deployment, and with 20 target sites that adds up to 80-120 maintenance cycles per year.
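
A simple way to catch structure changes early is a health check that verifies the selectors you depend on still match something, and alerts when they don't. A minimal sketch; the site URL, selectors, and alert destination are illustrative:

import requests
from bs4 import BeautifulSoup

# Selectors this crawler depends on; if any stop matching, the site structure likely changed
EXPECTED_SELECTORS = [".post-item", ".post-title", ".post-date"]

def check_site_structure(url):
    """Return the list of selectors that no longer match anything on the page."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return [sel for sel in EXPECTED_SELECTORS if not soup.select(sel)]

broken = check_site_structure("https://example-blog.com/posts")
if broken:
    # In practice, send this to Slack, email, or a monitoring system
    print(f"Structure change suspected, selectors no longer match: {broken}")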

5. Data Cleaning

Expecting the collected data to be clean is a mistake. Every site uses different encodings (UTF-8, EUC-KR) and date formats (2026.01.30 vs 2026-01-30 vs Jan 30, 2026), and represents the same information in completely different ways. In my experience, data cleaning eats up 30-50% of the total collection time.
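
As a small illustration, here is a minimal sketch that normalizes the three date formats mentioned above into ISO 8601; a real pipeline accumulates many more formats than this:

from datetime import datetime

# Formats seen in the examples above; real crawls accumulate many more
DATE_FORMATS = ["%Y.%m.%d", "%Y-%m-%d", "%b %d, %Y"]

def normalize_date(raw):
    """Convert a scraped date string to ISO 8601, or None if no known format matches."""
    raw = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unknown format: log it and extend DATE_FORMATS

for raw in ["2026.01.30", "2026-01-30", "Jan 30, 2026"]:
    print(raw, "->", normalize_date(raw))  # all three print 2026-01-30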


Crawling and Legal Issues — Things to Know

Although crawling is technically easy, it is a gray area legally. "Can scrape" and "Should scrape" are completely different issues. If you are doing crawling for business purposes, make sure to check the following three things.

robots.txt — Basic Protocol for Crawling

robots.txt is a file in which a site owner declares "please do not crawl these paths." There is debate over whether it is legally binding, but in practice it is an industry standard, and crawling that ignores it can work against you in a lawsuit.

# Example: https://example.com/robots.txt
User-agent: *
Disallow: /private/
Disallow: /api/
Crawl-delay: 2          # asks crawlers to wait at least 2 seconds between requests

Checking it is simple: just append /robots.txt to the site's root URL (e.g., https://naver.com/robots.txt). Paths listed under Disallow should be excluded from collection, and if a Crawl-delay is specified, you must respect that interval.
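
You can also check this programmatically with Python's standard library module urllib.robotparser. A minimal sketch (crawl_delay() returns None when the directive is absent):

from urllib.robotparser import RobotFileParser

USER_AGENT = "Mozilla/5.0"

rp = RobotFileParser()
rp.set_url("https://news.ycombinator.com/robots.txt")
rp.read()  # downloads and parses the robots.txt file

url = "https://news.ycombinator.com/newest"
if rp.can_fetch(USER_AGENT, url):
    delay = rp.crawl_delay(USER_AGENT)  # None if no Crawl-delay is specified
    print(f"Allowed to fetch {url}, crawl delay: {delay}")
else:
    print(f"robots.txt disallows fetching {url}")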

Personal Information Protection Act — Most Critical Area

Korea's Personal Information Protection Act is based on the principle of obtaining consent when collecting personal information. If the data collected through crawling includes information that can identify individuals such as names, phone numbers, emails, storing or using that data itself can be illegal.

Points to note in practice:

  • Even if a web page is public, collecting or using personal information without consent is not allowed
  • Be cautious, especially with contact information, SNS profiles, user reviews (including nicknames)
  • If targeting EU services, GDPR applies — violation can result in fines of up to 4% of global revenue
  • In Korea, the sanctions from the Personal Information Protection Commission are increasingly being strengthened

Practical Advice: Collecting only non-personal information such as prices, product specifications, market data is the safest. If you need data that includes personal information, be sure to get legal review first.

Copyright Law — Rights of the Data Itself

Text, images, videos posted on web pages are mostly protected by copyright. While collecting data through crawling is a technical act, reproducing or commercially using the collected content as is can constitute copyright infringement.

Precedents and Standards:

  • Factual information (prices, numerical data) is generally not subject to copyright protection
  • Creative works (articles, reviews, images) are protected by copyright
  • Replicating an entire database can violate the Unfair Competition Prevention Act
  • In the hiQ Labs v. LinkedIn case in the U.S. (2022), it was ruled that scraping public profiles was not a violation of the CFAA, but LinkedIn later pursued a separate lawsuit based on violation of its terms of service and eventually settled through an agreement. Even for public data, there may be restrictions based on terms of service and contract law, so it is essential to check the laws of each country and the target site's ToS.

[Crawling Legal Checklist: Check robots.txt → Check for personal information → Possibility of copyright infringement → Review Terms of Service](images/seo1-legal-checklist.png)

In summary: Before starting crawling, make sure to check these four things: ① robots.txt, ② presence of personal information, ③ scope of use of collected data, ④ terms of service of the target site. If you are unsure, getting legal advice is the safest investment.


Reality of Large-scale Crawling — Maintenance Hell

For small projects, a single Python script is sufficient. However, if you need to continuously operate crawling for business, the situation changes completely.

To operate a large-scale web scraping system, you need the following:

  • Infrastructure — server, scheduler (Airflow, Celery, etc.), queue system, database
  • Proxy Management — purchase, rotation, quality monitoring of hundreds to thousands of proxies (monthly $50-500)
  • Error Handling — retry logic, failure alerts, automatic recovery (a minimal retry sketch follows this list)
  • Data Pipeline — automation of collection → cleaning → loading → validation
  • Monitoring — crawler status, data quality, detection of target site changes
  • Legal Compliance — continuous review of robots.txt, Personal Information Protection Act, Copyright Law
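
To make the error-handling item concrete, here is a minimal sketch of retry with exponential backoff plus a failure alert hook; send_alert is a placeholder for a Slack, email, or monitoring integration:

import time

import requests

def send_alert(message):
    # Placeholder: wire this up to Slack, email, or your monitoring system
    print(f"[ALERT] {message}")

def fetch_with_retry(url, max_attempts=4, base_delay=2):
    """Retry failed requests with exponential backoff: 2s, 4s, 8s, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            if attempt == max_attempts:
                send_alert(f"Giving up on {url} after {max_attempts} attempts: {exc}")
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))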

[Large-scale Crawling Architecture: Scheduler → Crawler Pool → Proxy Rotation → Data Pipeline → Storage](images/seo1-large-scale-architecture.png)

Consider the realistic costs: it takes one developer about 2 weeks ($3-4k) to build a crawler, and then about 40-60 hours per month ($2-3k) to maintain it. If the number of target sites grows to 10 or 20, managing the crawling infrastructure alone becomes a full-time job. Add proxy, server, and CAPTCHA solving costs, and it is not uncommon for crawling alone to exceed $5,000 per month.

"Creating a crawler is easy. Keeping it alive is the real job."
— Anyone who has operated a crawling system can relate to this statement


Build vs Professional Services — When to Choose What?

| Situation | Recommended Approach |
|---|---|
| Learning purposes, one-time collection | requests + BeautifulSoup |
| Dynamic sites, small projects | Playwright or Selenium |
| Business operations, multiple sites, stability required | Professional crawling services |

Learning Python crawling is definitely a valuable investment as a developer. For simple data collection or prototyping, the tools covered in this article are sufficient.

However, once crawling becomes a core data pipeline for the business, you need to compare the numbers coldly: the cost of building and maintaining it yourself (over $5,000 per month) versus the cost of a professional service. Development resources are far better spent on using the data than on collecting it.

If you go with a professional service, you can hand over everything from crawler development to maintenance and receive only the refined data. In Korea, B2B crawling services such as HashScraper operate on this model, letting you start data collection immediately without building any infrastructure.

Start now:
- Copy and run the code examples above — you can see the first crawling results in 5 minutes
- If you need large-scale collection, compare the costs of building it yourself vs. using an external service

If you have any questions about crawling, feel free to leave a comment. I will provide answers based on my experience.


