What is the syntax for accessing the title element in an HTML document using XPath?

The XPath for accessing the title element is '/html/head/title'.

What is the difference between absolute path and relative path in XPath?

An absolute path starts from the top root node, while a relative path uses '//' to skip intermediate nodes and directly access specified nodes.

How do you select elements using the 'contains' function in XPath?

You can select elements using the 'contains' function with syntax like '//div[contains(@class, "value")]'.

What does the '@' symbol represent in XPath expressions?

The '@' symbol represents attributes of elements in XPath expressions.

What is XPath? Let's explore its basic syntax and HTML code.

Q: What is XPath used for?

XPath is used for specifying paths to access specific elements or attributes in an XML document, commonly in web crawling tasks.

0. What is XPath?

XPath is an abbreviation for 'XML Path Language', a language for specifying paths to access specific elements or attributes in an XML document.

XPath is commonly used in web crawling tasks, so let's first look at the basic syntax of XPath.

1. Basic Syntax of XPath

&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
    &lt;meta charset="utf-8"&gt;
    &lt;title&gt;hashscraper&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;div id="container"&gt;
        &lt;div class="title"&gt;
            &lt;p class="content1"&gt;let's start crawling!&lt;/p&gt;
            &lt;p class="content2"&gt;crawling is fun!&lt;/p&gt;
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;

2. HTML Code

Here is a simple HTML code.

HTML code consists of elements and attributes, and each element and attribute are hierarchically related.

XPath represents an XML document in a tree structure and provides paths to extract all nodes, attributes, and data from the top node to the bottom node.

(*Here, a node refers to each part of an XML document such as elements, attributes, and text content.)

Let's find the path to access the title element in the code above.

In terms of the tree structure, the title element consists of the html element → head element → title element in order.

Therefore, the XPath for the title element is as follows.

/html/head/title

Additionally, in XPath, attributes that bind elements like class are represented with "@" symbol. Using "@" to represent the XPath of the first p element in the code above looks like this.

/html/body/div/div/p[@class='content1']

3. Two Expression Methods of XPath

XPath can be expressed in two ways, absolute path and relative path.

3.1. XPath: Absolute Path

An absolute path selects elements starting from the top root node, similar to the method used above.

html/body/div/div/p[@class='content1']

3.2. XPath: Relative Path

A relative path uses '//' to skip intermediate node paths and proceeds with sequential exploration from the specified node. Representing the above absolute path as a relative path looks like this.

//p[@class='content1']

4. Other Expression Syntax

In addition to the above, XPath uses various syntax to represent paths.

4.1. contains

: Retrieves cases containing the specified value.

#'aa'를 포함하는 class명을 가진 div 요소를 선택
//div[contains(@class, "aa")]

4.2. last

: Retrieves the last node among nodes that match the path.

//div[@class="aa")/span[last()]

4.3. and

: Retrieves nodes that satisfy both conditions.

#class명에 'aa'와 'bb'를 포함하는 img 요소를 선택
//img[contains(@class, "aa") and contains(@class, "bb")]

4.4. or

: Retrieves nodes that satisfy one or both conditions.

#class명에 'aa' 또는 'bb'를 포함하는 img 요소를 선택
//img[contains(@class, "aa") or contains(@class, "bb")]

4.5. not

: Retrieves nodes that do not satisfy the specified condition.

#class명에 'aa'를 포함하고 'bb'를 포함하지 않는 img 요소를 선택
//img[contains(@class, "aa") and not(contains(@class, "bb")

5. Example Practice

So far, we have looked at the basic syntax of XPath. Now, let's try to get the XPath of the desired part on a website.

5.1. Open Developer Tools

notion image

First, open the desired website, then click the left mouse button and press 'Inspect' to open the Chrome Developer Tools.

5.2. Identify the Desired Tag

notion image

In the Developer Tools, click the mouse icon in the top left corner. Then, hover the mouse over the part of the website you want to retrieve, and it will show as below.

notion image

Click on the area where you placed the mouse, and the Developer Tools will show the tag of the desired part in the actual HTML code.

notion image

5.3. Copy XPath

notion image

Copy the desired part, then Copy > Copy XPath, and paste the copied content to see the XPath of the desired part!

You can confirm that the XPath of the desired part has been correctly retrieved as shown below.

//*[@id="home"]/div/div/div[1]/h1

Then, how can this retrieved XPath be used in web crawling? Here is a snippet of crawling code.

get_browser.goto "&lt;https://www.hashscraper.com&gt;"
x = get_browser.element(xpath: "//*[@id='home']/div/div/div[1]/h1").text

5.4. Verification

By using XPath to retrieve the desired element and storing it in x, printing x will show the text you wanted as follows.

6. Conclusion: Study XPath for Proper Web Crawling

We have learned the basics of web crawling and XPath. To collect the data you want through crawling, you need to know how that data can be represented through paths, which can be easily expressed using XPath. We recommend starting with XPath study before diving into web crawling!

Also, read this article together:

Data Collection, Automate Now

Start crawling 5,000+ websites in 5 minutes without coding

Get started for free →

Exploring XPath: Basics of Web Crawling

0. What is XPath?

1. Basic Syntax of XPath

2. HTML Code

3. Two Expression Methods of XPath

3.1. XPath: Absolute Path

3.2. XPath: Relative Path

4. Other Expression Syntax

4.1. contains

4.2. last

4.3. and

4.4. or

4.5. not

5. Example Practice

5.1. Open Developer Tools

5.2. Identify the Desired Tag

5.3. Copy XPath

5.4. Verification

6. Conclusion: Study XPath for Proper Web Crawling

Also, read this article together:

Data Collection, Automate Now

Comments

Add Comment

Continue Reading

In the era of GPT, why is 'web crawling' still important?

Build a new customer database through web crawling.

How marketing professionals can utilize web scraping (crawling)

Automating web crawling using Python: schedule, Task Scheduler, crontab

Get notified of new posts