Playwright Crawling Complete Guide 2026 - From Installation to Anti-bot Bypass

This guide compiles everything from Playwright installation and crawling code to anti-bot circumvention in a single article, with practical examples in both Python and Node.js. Be sure to also check the speed and stability comparison table for Selenium and Puppeteer.


Playwright is the most widely used open-source browser automation tool for crawling dynamic websites as of 2026. Developed by Microsoft, it controls Chromium, Firefox, and WebKit with a single API and supports both Python and Node.js. It drives real browsers to extract data in situations that Requests or BeautifulSoup cannot handle: JavaScript-rendered pages, services that require login, and infinite-scroll feeds.

TL;DR
- Playwright is faster than Selenium and natively supports automatic waiting, network interception, and parallel processing
- Installation is completed with just 2 lines in Python (pip install playwright + playwright install)
- With the advancement of anti-bot systems like Cloudflare AI Labyrinth introduced in 2025, playwright-stealth alone has limitations
- The era of AI agents directly controlling browsers has begun with MCP (Model Context Protocol) integration
- Cost comparison: self-built production crawling (5.9-10.3 million won per month) vs managed services (from 3 million won per month)

This article summarizes Playwright installation, basic and intermediate crawling codes, a comparison table of Selenium and Puppeteer, anti-bot evasion strategies, MCP ecosystem, and cost analysis in one place.


1. What is Playwright — Why Use it for Crawling?

Playwright is an open-source browser automation framework released by Microsoft in 2020. It can control Chromium, Firefox, and WebKit (Safari engine) browsers with a single API and supports Python, Node.js, Java, and C#.

From a crawling perspective, Playwright is notable for three reasons.

Automatic Waiting: When selecting an element with page.locator(), it automatically waits until the element appears on the screen. There is no need for the traditional timing adjustment using time.sleep().

Network Interception: It can intercept and modify all HTTP requests that occur during page loading. By directly calling internal APIs without a web UI, data collection speed can be significantly increased.

Multi-Context: It allows running multiple independent sessions simultaneously in a single browser instance, making parallel crawling efficient.


2. Choosing Between Playwright, Selenium, and Puppeteer

When starting a new project, the question of which tool to use often arises. The core differences between the three tools are summarized in a table.

| Feature | Playwright | Selenium | Puppeteer |
| --- | --- | --- | --- |
| Release Year | 2020 | 2004 | 2018 |
| Browser Support | Chromium, Firefox, WebKit | All browsers | Chromium only |
| Language Support | Python, JS/TS, Java, C# | 7 languages | Node.js only |
| Automatic Waiting | Built-in | Manual setup required | Manual setup required |
| Execution Speed | Fastest | Slow | Fast |
| Parallel Processing | BrowserContext built-in | Separate implementation required | Separate implementation required |
| Stealth Support | playwright-stealth | selenium-wire | puppeteer-extra |
| Maintenance | Actively improved by Microsoft | Community-driven | Improved by Google |
| Community | Rapidly growing | Mature, with many legacy users | Intermediate |

Recommended Choice as of 2026:

  • Playwright: For new projects, dynamic websites, and cases requiring multi-browser compatibility
  • Selenium: For compatibility with legacy systems, or when older browsers like IE are needed
  • Puppeteer: When Chrome-only features are sufficient and your team is familiar with the Node.js ecosystem

3. How to Install Playwright?

Python Environment

# Install Playwright
pip install playwright

# Install browser binaries (all three: Chromium, Firefox, WebKit)
playwright install

# To install Chromium only
playwright install chromium

Node.js Environment

# Initialize a new project (interactive setup)
npm init playwright@latest

# Or install directly
npm install playwright
npx playwright install

Docker Environment

When running on a server environment rather than local development, using the official Docker image provided by Microsoft is the most convenient option.

FROM mcr.microsoft.com/playwright/python:v1.58.0-noble

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "crawler.py"]

4. Basic Crawling — Extracting Data in 5 Minutes

Python Example: Collecting News Headlines

from playwright.sync_api import sync_playwright

def scrape_headlines():
    with sync_playwright() as p:
        # Set headless=True to run without a visible browser window
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        page.goto("https://news.ycombinator.com")

        # Select all news titles
        items = page.locator(".titleline > a").all()
        headlines = [item.inner_text() for item in items]

        for i, title in enumerate(headlines[:10], 1):
            print(f"{i}. {title}")

        browser.close()
        return headlines

if __name__ == "__main__":
    scrape_headlines()

Node.js Example: Same Task

const { chromium } = require('playwright');

async function scrapeHeadlines() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://news.ycombinator.com');

  const headlines = await page.locator('.titleline > a').allInnerTexts();

  headlines.slice(0, 10).forEach((title, i) => {
    console.log(`${i + 1}. ${title}`);
  });

  await browser.close();
  return headlines;
}

scrapeHeadlines();

Taking screenshots and outputting PDFs can also be done in a single line of code.

# Save a screenshot
page.screenshot(path="screenshot.png", full_page=True)

# Save as PDF (supported in Chromium only)
page.pdf(path="page.pdf")

5. Advanced Crawling Techniques

Automatic Pagination Handling

from playwright.sync_api import sync_playwright

def scrape_paginated(base_url, max_pages=5):
    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        for page_num in range(1, max_pages + 1):
            page.goto(f"{base_url}?page={page_num}")
            # Wait until the elements have loaded
            page.wait_for_selector(".product-item")

            items = page.locator(".product-item").all()
            for item in items:
                results.append(item.inner_text())

            # Stop if there is no next-page button
            if not page.locator(".next-page").count():
                break

        browser.close()
    return results

Infinite Scroll Handling

def scrape_infinite_scroll(url, scroll_count=10):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        prev_count = 0
        for _ in range(scroll_count):
            # Scroll to the bottom of the page
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            # Wait for new content to load
            page.wait_for_timeout(1500)

            current_count = page.locator(".item").count()
            if current_count == prev_count:
                break  # Stop when nothing new loads
            prev_count = current_count

        items = page.locator(".item").all_inner_texts()
        browser.close()
        return items

Network Interception — Direct API Extraction

In JavaScript-rendered pages, capturing internal API responses directly allows obtaining clean JSON data without HTML parsing.

from playwright.sync_api import sync_playwright

def intercept_api(url):
    captured_data = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Register a handler that intercepts API responses
        def handle_response(response):
            if "api/products" in response.url and response.status == 200:
                try:
                    data = response.json()
                    captured_data.extend(data.get("items", []))
                except Exception:
                    pass

        page.on("response", handle_response)
        page.goto(url)
        page.wait_for_load_state("networkidle")

        browser.close()
    return captured_data

Multi-Page Parallel Processing

from playwright.sync_api import sync_playwright

def scrape_parallel(urls):
    results = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Create independent sessions with BrowserContext
        context = browser.new_context()

        pages = [context.new_page() for _ in urls]
        for page, url in zip(pages, urls):
            page.goto(url, wait_until="domcontentloaded")

        for page, url in zip(pages, urls):
            results[url] = page.title()

        browser.close()
    return results

Related Guides: Complete Guide to Crawling Coupang | Complete Guide to Crawling Instagram


6. How to Bypass Anti-Bot Systems? (Latest Techniques in 2026)

Major platforms like Coupang, Naver, and Instagram operate sophisticated anti-bot systems such as Cloudflare Turnstile, DataDome, and Akamai Bot Manager. Here are commonly used evasion methods as of 2026.

Applying playwright-stealth

playwright-stealth patches typical signals that detect headless browsers. It automatically handles tasks like removing the navigator.webdriver property, modifying User-Agent, and disguising WebGL renderers.

pip install playwright-stealth

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)  # apply stealth patches
    page.goto("https://target-site.com")

User-Agent Rotation

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
]

# `browser` is assumed to come from p.chromium.launch() as in the earlier examples
context = browser.new_context(
    user_agent=random.choice(USER_AGENTS)
)

Proxy Configuration

context = browser.new_context(
    proxy={
        "server": "http://proxy-server:8080",
        "username": "user",
        "password": "password"
    }
)

Human Behavior Simulation

import random

# Add mouse movement
page.mouse.move(
    random.randint(100, 800),
    random.randint(100, 600)
)

# Natural typing delay
page.locator("#search").type("search term", delay=random.randint(80, 150))

# Short pause after scrolling
page.evaluate("window.scrollBy(0, 300)")
page.wait_for_timeout(random.randint(800, 2000))

Major Anti-Bot Trends in 2026

Cloudflare AI Labyrinth: Introduced by Cloudflare in March 2025, this new defense mechanism guides bots through a maze of fake pages instead of blocking them. If a client navigates more than 4 levels, it is automatically identified as a bot and its fingerprint is registered. The purpose is identification and tracking rather than simple blocking.

Canvas/WebGL Fingerprinting: Anti-bot systems generate a unique browser fingerprint by combining Canvas rendering results, WebGL renderer information, and AudioContext processing results. It is challenging to completely bypass this with playwright-stealth alone, and additional fingerprint masking libraries like browserforge need to be used together.

Limitations: As stated in the official playwright-stealth documentation, this library only bypasses simple detection. Against enterprise-level anti-bot systems like DataDome and Akamai, the effectiveness of individual evasion techniques is limited.

Related Guides: Complete Guide to Crawling Legality


7. Playwright Ecosystem in 2026 — MCP and AI Automation

Playwright MCP Server

The most notable change in the Playwright ecosystem in 2026 is the integration with MCP (Model Context Protocol). By using the officially distributed playwright-mcp package from Microsoft, AI agents like Claude, GPT, and Cursor can directly control the browser with natural language commands.

# Connect Playwright MCP to Claude Desktop
npx @playwright/mcp@latest

When an AI agent instructs "fetch the product list on this page," the MCP server converts it into Playwright commands, executes them, and returns the results. This enables browser automation without writing code directly.
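For reference, registering the server in Claude Desktop typically goes through claude_desktop_config.json; the entry below is a sketch of the common shape (check the playwright-mcp README for the current schema):

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```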

However, caution is needed in terms of token consumption. According to Microsoft benchmarks, Playwright MCP consumes approximately 114,000 tokens for the same task, while Playwright CLI consumes around 27,000 tokens. For large-scale repetitive tasks, the CLI method is cost-effective.

Stagehand / Browser Use

Stagehand (Browserbase) and Browser Use are frameworks that layer AI on top of Playwright. By instructing in natural language like "click the next page button," AI analyzes the DOM to find the corresponding element and performs the action. The key advantage is that it works without recoding even if the site structure changes.

Hashscraper MCP Server

If you are looking for a crawling-specific MCP server, you can refer to @scrapi.ai/mcp-server provided by Hashscraper. With this, you can receive crawling results with just one MCP command without setting up the Playwright infrastructure.

Related Guides: Complete Guide to AI Agent Crawling


8. Is Playwright Sufficient? Limits and Realistic Alternatives

Limits of Playwright Itself

While Playwright is a powerful tool, it faces several practical limitations when continuously operating in a production environment.

  • IP Blocking: Repeated requests from the same IP address can lead to blocking. Proper proxy pool management is essential, and obtaining domestic IPs can be challenging.
  • Anti-Bot Updates: Services like Cloudflare and DataDome continuously update their detection logic; evasion code that works today may be blocked tomorrow.
  • Infrastructure Operation Burden: Crawling schedulers, error recovery, data pipelines, and monitoring systems all need to be built separately.

Cost Comparison of Self-Build vs Managed Services for Crawling

Comparing costs based on collecting 300,000 items per month, the breakdown is as follows.

| Item | Self-built Playwright Infrastructure | Hashscraper Subscription |
| --- | --- | --- |
| Developer salary (senior level) | 5-7 million won/month | Included |
| Proxy/IP costs | 0.5-2 million won/month | Included |
| Server infrastructure (AWS, etc.) | 0.2-0.8 million won/month | Included |
| Anti-bot update response | 0.2-0.5 million won/month (man-hours) | Included |
| Total monthly cost | 5.9-10.3 million won | From 3 million won/month |
| Initial setup time | 1-3 months | 1-2 weeks |
| On IP blocking | Manual response | Automatic handling |

"Writing code" and "operating stably in a production environment" are different issues. If it's a side project or internal tool with a clear development stack, direct self-building is the best option. However, for long-term stable data collection for business purposes, it is rational to consider both infrastructure operation costs and risks.

Before setting up Playwright infrastructure, schedule a free consultation with Hashscraper to share your current scale and requirements and find the optimal solution.

Related Guides: Web Crawling Service Comparison Guide | Complete Guide to Crawling Outsourcing Costs


9. FAQ

Q1. Is Playwright Crawling Legal?

Collecting publicly available data is legal in most countries. However, ignoring a site's robots.txt guidelines, bypassing logins, reproducing copyrighted content without authorization, or causing server overload can lead to legal issues. In Korea, a 2024 ruling on crawling Baemin, involving violations of service terms and obstruction of business, brought the legality of crawling into debate. For commercial-scale crawling, legal review appropriate to the purpose and method of collection is advised. For more details, refer to the Crawling Legality Guide.

Q2. How to Choose Between Playwright and Selenium?

For new projects as of 2026, Playwright is recommended. It excels in automatic waiting, multi-browser support, parallel processing, and fast execution speed. Selenium is effective for compatibility with legacy systems or when special browsers are required.

Q3. What is the Difference Between Headless and Headed Mode?

headless=True runs the browser in the background without displaying the browser window. This is the default mode for server environments. headless=False (headed mode) displays the actual browser window, allowing visual confirmation of actions for debugging. Some anti-bot systems find it easier to detect headless browsers, so trying headed mode can be a solution when evasion is needed.
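Switching between the two modes during development can be made explicit with a small helper that builds the launch keyword arguments; headless and slow_mo are real launch() options, while the helper itself is just an illustrative convention:

```python
def launch_options(debug: bool) -> dict:
    """Build keyword arguments for p.chromium.launch().

    debug=True opens a visible browser window and inserts a small
    delay between actions so each step can be watched.
    """
    if debug:
        # Headed mode: visible window, 250 ms pause between actions
        return {"headless": False, "slow_mo": 250}
    # Headless mode: no window, default speed (for servers)
    return {"headless": True}

# Usage: browser = p.chromium.launch(**launch_options(debug=True))
```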

Q4. Can Playwright Crawl Protected Cloudflare Sites?

Sites protected by Cloudflare Turnstile or Bot Management are difficult to bypass with playwright-stealth alone. Even with a combination of residential IP proxies, fingerprint masking, and human behavior simulation, success rates can be unstable. The AI Labyrinth introduced in 2025 wastes bot resources by guiding them through a maze of fake pages instead of blocking them, making detection even more challenging. For commercial-scale Cloudflare evasion, using specialized services is more practical.

Q5. How to Increase Playwright Crawling Speed?

There are three key ways to improve crawling speed. First, use network interception to block unnecessary resources such as images, fonts, and CSS. Second, use BrowserContext to run multiple pages in parallel. Third, where possible, capture internal API responses directly instead of parsing HTML. Applying all three often yields a 3-5x speed improvement.
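The resource-blocking idea can be sketched with Playwright's page.route(); the set of blocked resource types below is a judgment call for text-only crawls, not a fixed rule:

```python
# Resource types that are usually safe to skip when only text data
# is needed (adjust per site; "document" and "script" stay enabled).
BLOCKED_TYPES = {"image", "font", "stylesheet", "media"}

def should_abort(resource_type: str) -> bool:
    """Decide whether a request of this resource type should be aborted."""
    return resource_type in BLOCKED_TYPES

def install_blocking(page):
    """Register a route on the page that aborts heavy resources (sketch)."""
    def handler(route):
        if should_abort(route.request.resource_type):
            route.abort()
        else:
            route.continue_()
    page.route("**/*", handler)
```

Call install_blocking(page) right after new_page() and before goto(), so the route is in place for the first navigation.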

Q6. Is Playwright Suitable for Large-Scale Crawling?

Running Playwright on a single server is suitable for collecting tens of thousands to hundreds of thousands of items per month. Beyond that, you will need to set up distributed queues (Celery, RQ, etc.), multiple servers, proxy pool management, and error recovery systems separately. If you are collecting more than 300,000 items per month, it is a good time to consider managed services over self-building.
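Before reaching for a full queue system, the basic work-splitting step can be sketched as a pure function that shards a URL list across workers; each worker then runs its own Playwright instance on its batch (round-robin sharding is just one option):

```python
def shard_urls(urls, num_workers):
    """Round-robin split of a URL list into one batch per worker."""
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    # urls[i::n] takes every n-th element starting at i
    return [urls[i::num_workers] for i in range(num_workers)]

urls = [f"https://example.com/page/{i}" for i in range(7)]
batches = shard_urls(urls, 3)
# Worker 0 gets pages 0, 3, 6; worker 1 gets 1, 4; worker 2 gets 2, 5
```

In a Celery or RQ setup, each batch would be enqueued as one task, and the task body would contain the crawling loop from the earlier examples.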

Q7. Which is Better, Python or Node.js?

If you need to connect to data science, analysis, or machine learning pipelines, Python is advantageous. If you need to integrate with existing Node.js backends or require TypeScript type safety, choose Node.js. There is almost no difference in Playwright's own functionality. Choosing the language your team is more familiar with is the most practical approach.


Conclusion

As of 2026, Playwright is the most powerful open-source tool for crawling dynamic websites, leading on features: automatic waiting, network interception, parallel processing, dual Python/Node.js support, and AI-agent integration through MCP.

However, "writing code that works" and "operating stably in a production environment" are different matters. Responding to IP blocking, tracking anti-bot updates, and operating infrastructure are separate challenges from writing Playwright code. For internal tools or small projects, self-building is the best option; for long-term, business-critical data collection, infrastructure operation costs and risks must be weighed as well.

Hashscraper is a managed service that has crawled over 5,000 sites in 8 years. Before setting up Playwright infrastructure, schedule a free consultation to share your current requirements.
