XPath Crawling Complete Guide 2026: The Core of Precise Web Data Extraction

This guide covers web scraping with XPath from the fundamentals through practical applications, including Python lxml/requests examples, a comparison with CSS Selectors, and a reference of key expressions as of 2026.


When you first start web crawling, you usually learn CSS Selector first. However, when you encounter slightly complex pages in practice, it becomes difficult to extract the desired elements using only CSS Selector. This is when you need XPath.

XPath is a standard language for navigating specific nodes in XML/HTML documents. It is more powerful in expression than CSS Selector, and in some cases, it can make selections that are impossible with CSS. As of 2026, major crawling tools like Python lxml, Scrapy, Playwright, and Selenium all support XPath.

This article summarizes everything from the basic concepts of XPath to practical crawling code and debugging tips at once.


1. What is XPath?

XPath (XML Path Language) is a W3C standard query language for navigating the tree structure of XML and HTML documents. Since its initial release in 1999, it has become a core technology of web standards.

The HTML of a web page is a tree structure. Starting from the html root, elements like body, various div, span, and a tags are connected in parent-child relationships. XPath represents this tree structure like a file system path.

For example, just like specifying a path in a file system like /home/user/documents/report.txt, in HTML, you specify nodes in the form of /html/body/div/p.
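The analogy can be made concrete with a tiny document (the HTML string is invented for illustration):

```python
from lxml import html

# html → body → div → p, like directories in a file-system path
doc = html.fromstring("<html><body><div><p>Hello, XPath</p></div></body></html>")

# Absolute path from the root, analogous to /home/user/documents/report.txt
print(doc.xpath('/html/body/div/p/text()'))  # ['Hello, XPath']
```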

CSS Selector vs XPath: When to Use Each?

| Feature | CSS Selector | XPath |
| --- | --- | --- |
| Basic element selection | Concise | Possible |
| Selecting elements by text | Impossible | contains(text(), 'keyword') |
| Selecting a parent node | Impossible | .. or parent:: |
| Accessing sibling nodes | Limited | following-sibling::, preceding-sibling:: |
| Conditional filtering | Basic only | Complex conditions possible |
| Learning curve | Easy | Intermediate |
| In-browser performance | Fast | Slightly slower |

In general web crawling, it is effective to primarily use CSS Selector and use XPath as a supplement when CSS is not enough.


2. XPath Core Syntax Reference

Basic Path Expression

/html/body/div          # absolute path (full path from the root)
//div                   # every div element in the document
//div[@class='title']   # div whose class is 'title'
//a[@href]              # every a tag that has an href attribute

The double slash // is crucial. In practice, relative paths using // are used much more than absolute paths (/html/body/...) because they are relatively stable even if the page structure changes.
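The difference is easy to see with two hypothetical versions of the same page, where a redesign adds a wrapper element:

```python
from lxml import html

# v2 adds a <main> wrapper around the same content
v1 = html.fromstring('<html><body><div class="title">Sale</div></body></html>')
v2 = html.fromstring('<html><body><main><div class="title">Sale</div></main></body></html>')

absolute = '/html/body/div[@class="title"]/text()'
relative = '//div[@class="title"]/text()'

print(v1.xpath(absolute), v1.xpath(relative))  # ['Sale'] ['Sale']
print(v2.xpath(absolute), v2.xpath(relative))  # [] ['Sale']  — only // survives the redesign
```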

Attribute Selector

//div[@id='main']                    # div with id 'main'
//input[@type='submit']              # input with type 'submit'
//a[contains(@class, 'btn')]         # a whose class contains 'btn'
//img[@src and @alt]                 # img that has both src and alt attributes
//a[not(@href)]                      # a tags without an href attribute

contains() is similar to CSS [class*="btn"], but it can also be applied to text content, making it much more flexible.
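A quick sketch of that difference (the snippet and URLs are invented for illustration):

```python
from lxml import html

doc = html.fromstring(
    '<div>'
    '<a class="btn btn-primary" href="/order">Order</a>'
    '<a class="link" href="/sale">See discount offers</a>'
    '</div>')

# Attribute match — roughly equivalent to CSS a[class*="btn"]
print(doc.xpath('//a[contains(@class, "btn")]/@href'))      # ['/order']

# Text match — CSS Selector has no equivalent for this
print(doc.xpath('//a[contains(text(), "discount")]/@href')) # ['/sale']
```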

Text Selector

//h2[text()='이벤트 공지']                    # h2 whose text is exactly '이벤트 공지' (event notice)
//p[contains(text(), '할인')]                  # p containing the text '할인' (discount)
//button[normalize-space(text())='구매하기']   # '구매하기' (buy) button, whitespace trimmed

This is the core strength of XPath. While CSS Selector cannot select elements based on text content, XPath allows free text-based selection by combining text() and contains().
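A minimal sketch of text-based selection with lxml (the markup is invented for illustration):

```python
from lxml import html

doc = html.fromstring("""
<div>
  <h2>Event Notice</h2>
  <button>  Buy now  </button>
</div>""")

# Exact text match
print(doc.xpath('//h2[text()="Event Notice"]/text()'))  # ['Event Notice']

# normalize-space() trims surrounding whitespace before comparing,
# so the button matches despite the padding in its text node
print(doc.xpath('//button[normalize-space(text())="Buy now"]/text()'))  # ['  Buy now  ']
```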

Axis Traversal

//span[@class='price']/parent::div          # parent div of the matching span
//td[contains(@class, 'total')]/../td[1]    # first td among its siblings
//h3/following-sibling::p                   # sibling p tags that follow an h3
//li[last()]                                # last li item
//li[position() <= 5]                       # first five li items

Reverse parent traversal is something classic CSS Selectors cannot do (modern CSS :has() covers some cases in browsers, but most scraping libraries do not support it). It is useful when, for example, you want to retrieve the entire card container that contains a button with a specific class.
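The card-container case can be sketched like this (the markup and class names are invented for illustration):

```python
from lxml import html

# Two hypothetical product cards, only one containing a buy button
doc = html.fromstring("""
<section>
  <div class="card" id="card-1"><span>Sold out</span></div>
  <div class="card" id="card-2"><button class="btn-buy">Buy</button></div>
</section>""")

# Climb from the button back up to its card container
cards = doc.xpath('//button[contains(@class, "btn-buy")]/parent::div')
print([c.get("id") for c in cards])  # ['card-2']

# ancestor:: can climb multiple levels; [1] picks the nearest matching ancestor
nearest = doc.xpath('//button[contains(@class, "btn-buy")]/ancestor::div[1]')
```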


3. Crawling with XPath in Python (lxml)

Basic Installation and Structure

import requests
from lxml import html

url = "https://example.com/products"
response = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
})

tree = html.fromstring(response.content)

# Extract data with XPath
titles = tree.xpath('//h2[@class="product-title"]/text()')
prices = tree.xpath('//span[@class="price"]/text()')
links = tree.xpath('//a[@class="product-link"]/@href')

for title, price, link in zip(titles, prices, links):
    print(f"{title} | {price} | {link}")

Parse the HTML with lxml.html.fromstring() and pass an XPath expression to the .xpath() method. The return value is always a list.

Text vs @Attribute Selection

# Extract element text: /text()
titles = tree.xpath('//h1/text()')

# Extract attribute values: /@attribute-name
hrefs = tree.xpath('//a/@href')
images = tree.xpath('//img/@src')

# Getting the element objects themselves (when you need to search inside them)
product_cards = tree.xpath('//div[@class="product-card"]')
for card in product_cards:
    title = card.xpath('.//h2/text()')  # the dot (.) means relative to the current element
    price = card.xpath('.//span[@class="price"]/text()')
    print(title[0] if title else "N/A", price[0] if price else "N/A")

One important caveat: when iterating over element objects, prefix the inner .xpath() expression with a dot (.) so that it searches relative to the current element. Without the dot, //h2/text() silently searches the entire document again.
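A quick demonstration of the pitfall (the HTML is invented for illustration):

```python
from lxml import html

doc = html.fromstring("""
<section>
  <div class="card"><h2>A</h2></div>
  <div class="card"><h2>B</h2></div>
</section>""")

first_card = doc.xpath('//div[@class="card"]')[0]

# With the leading dot: searches only inside first_card
print(first_card.xpath('.//h2/text()'))  # ['A']

# Without the dot: silently searches the whole document again
print(first_card.xpath('//h2/text()'))   # ['A', 'B']
```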

Flexible Selection with contains()

# class contains a particular word
items = tree.xpath('//*[contains(@class, "item")]')

# links whose text contains a particular word ('할인' means discount)
discount_links = tree.xpath('//a[contains(text(), "할인")]/@href')

# combining multiple conditions (and/or)
main_buttons = tree.xpath('//button[contains(@class, "btn") and @type="submit"]')

4. Crawling Dynamic Pages: Playwright + XPath

Static HTML is sufficient with requests + lxml, but for dynamically rendered pages using JavaScript, you need Playwright or Selenium. Both tools support XPath.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://target-site.com")

    # Select elements with XPath
    # locator approach (recommended)
    price = page.locator('xpath=//span[@class="final-price"]').first.text_content()

    # Run an XPath query directly via evaluate
    result = page.evaluate("""
        () => {
            const el = document.evaluate(
                '//h1[@class="product-name"]/text()',
                document,
                null,
                XPathResult.FIRST_ORDERED_NODE_TYPE,
                null
            ).singleNodeValue;
            return el ? el.nodeValue : null;
        }
    """)

    print(price, result)
    browser.close()

For Playwright's locator(), add the xpath= prefix to the expression. (Selectors that start with // are also auto-detected as XPath, but the explicit prefix is clearer.)


5. XPath Debugging: Testing Directly in the Browser

Before writing crawling code, you can first check if the XPath is correct in the browser.

Using Chrome DevTools:
1. Press F12 or right-click → Inspect
2. Select the Console tab
3. Enter the code below:

// find matching elements
$x('//h1[@class="title"]')

// print the text of every match
$x('//span[@class="price"]').map(e => e.textContent)

// text of the first match
$x('//div[@class="product-name"]')[0]?.textContent

$x() is a DevTools console utility available in both Chrome and Firefox (it works only in the console, not in page scripts). Always verify that the desired elements are selected correctly before writing any code.


6. Practical Example: Extracting E-commerce Product Data

import requests
from lxml import html

def scrape_product_list(url):
    headers = {"User-Agent": "Mozilla/5.0 (compatible; DataCollector/1.0)"}
    resp = requests.get(url, headers=headers, timeout=10)
    tree = html.fromstring(resp.content)

    # Fetch the full list of product cards
    cards = tree.xpath('//li[contains(@class, "product-item")]')

    results = []
    for card in cards:
        name = card.xpath('.//strong[@class="product-name"]/text()')
        price = card.xpath('.//span[contains(@class, "price")]/text()')
        rating = card.xpath('.//span[@class="rating"]/text()')
        reviews = card.xpath('.//span[@class="review-count"]/text()')
        link = card.xpath('.//a[@class="product-link"]/@href')

        results.append({
            "name": name[0].strip() if name else None,
            "price": price[0].strip() if price else None,
            "rating": rating[0].strip() if rating else None,
            "reviews": reviews[0].strip() if reviews else None,
            "url": link[0] if link else None,
        })

    return results

products = scrape_product_list("https://example-shop.com/category/shoes")
print(f"Products collected: {len(products)}")

7. Common Errors and Solutions

IndexError: list index out of range

# Wrong: raises IndexError when there is no match
title = tree.xpath('//h1/text()')[0]

# Right: fall back to a default
titles = tree.xpath('//h1/text()')
title = titles[0] if titles else None
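
The pattern above can be wrapped in a small helper (the name first is hypothetical):

```python
from lxml import html

def first(nodes, default=None):
    """Return the first XPath result, or a default instead of raising IndexError."""
    return nodes[0] if nodes else default

doc = html.fromstring("<html><body><h1>Title</h1></body></html>")
print(first(doc.xpath('//h1/text()')))          # Title
print(first(doc.xpath('//h2/text()'), "N/A"))   # N/A
```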

Element Visible but Not Captured by XPath

The main reasons are namespaces or dynamic rendering issues.

# When the document declares a namespace (common with XHTML/XML feeds)
namespaces = {'ns': 'http://www.w3.org/1999/xhtml'}
result = tree.xpath('//ns:div[@id="content"]', namespaces=namespaces)

# Visible on screen but missing from XPath results → check for dynamic rendering
# If it is absent from the HTML returned by requests and appears only in the browser,
# switch to Playwright/Selenium

Handling Spaces in Text Extraction

# normalize-space() strips leading/trailing whitespace and collapses repeated spaces
clean_text = tree.xpath('normalize-space(//h1/text())')

# Post-processing in Python
texts = [t.strip() for t in tree.xpath('//p/text()') if t.strip()]

8. Limits of XPath and Alternatives

While XPath is powerful, it also has limitations.

Vulnerability to Structural Changes: an absolute path (/html/body/div[2]/div[3]/ul/li) breaks with a single site redesign. Whenever possible, anchor XPath expressions on @id and @class attributes instead.

JavaScript Rendering: XPath is an HTML parsing tool, not a rendering tool. Content loaded dynamically cannot be collected with requests + lxml alone.

Large-scale Crawling: When collecting thousands of pages, using Scrapy is much more efficient than using lxml alone. Scrapy supports both XPath and CSS Selector, asynchronous processing, middleware, and pipelines.


Web Scraping without XPath using HashScraper

Learning XPath is beneficial, but the cost of maintaining a production crawler yourself is higher than you think. You have to update XPath every time the site structure changes, handle JavaScript rendering, bypass IP blocking, etc.

HashScraper handles this process with just one API call. Simply pass the URL without separate XPath or CSS Selector, and it returns the JavaScript-rendered HTML in Markdown, JSON, or HTML format.

import requests

response = requests.post(
    "https://api.hashscraper.com/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://target-site.com/products",
        "renderJs": True,
        "outputFormat": "markdown"
    }
)

data = response.json()
print(data["content"])  # returned as cleaned Markdown

If you find it difficult to build your own crawler or want to spend less time managing crawling infrastructure, start with the free trial of HashScraper.


Summary

  • Use XPath for complex selections that cannot be solved with CSS Selector
  • Key strengths include text-based selection, reverse parent traversal, conditional filtering
  • In Python, the combination of lxml and .xpath() is most common
  • For dynamic pages, use Playwright with locator('xpath=...')
  • Debugging: Quickly test with Chrome console $x()
  • Consider using HashScraper API for large-scale crawling instead of managing XPath directly

If this article was helpful, also check out the Playwright Crawling Complete Guide and Scrapy Crawling Complete Guide.
