Shadow DOM Crawling Guide: Dissecting the Hidden Elements of the Web

This guide compares Shadow DOM and iframe in terms of style, script, and security, and introduces a method for crawling Shadow DOM.

1. What is Shadow DOM?

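Shadow DOM is a web platform feature that attaches a separate, scoped DOM tree (a shadow tree) to an element, which is called the shadow host. It is mainly used to encapsulate the structure and style of a component. Here, 'encapsulation' means that the markup and styles inside the shadow tree are kept apart from the rest of the page, so that neither side interferes with the other by accident.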

2. Difference between Shadow DOM and iframe

Shadow DOM and iframe both create isolated areas within a web page, but they serve different purposes and functions.

Below are three key differences:

2.1. Encapsulation vs Embedding

The core of Shadow DOM is 'encapsulation.' By using Shadow DOM, you can isolate the internal structure and style of a component from the rest of the page. This prevents style and selector collisions with other elements within the same web page.

On the other hand, iframe focuses on the concept of 'embedding.' It inserts another web page into the current page as a separate document. For example, when embedding a YouTube video or Google Maps on a web page, iframes are used.

2.2. Style and Script

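Styles defined outside a shadow tree generally do not match elements inside it, and styles defined inside do not leak out. The exceptions are inherited properties (such as color and font-family) and CSS custom properties, which still pass through the boundary. Scripts are a different matter: JavaScript inside a shadow tree runs in the same global context as the rest of the page, and only DOM lookups such as document.querySelector stop at the shadow boundary.

An iframe isolates more strongly. It loads a separate document with its own styles and its own JavaScript global environment, so styles from the parent page do not apply inside the iframe and styles inside the iframe do not affect the parent page. Scripts can reach across directly only when the parent and the iframe are same-origin; otherwise they have to communicate through postMessage.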
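For a scraper, the practical consequence of this separation is that iframe content lives in a separate document that the driver has to switch into, while Shadow DOM content is reached from its host element without switching. A minimal sketch of the iframe side, assuming a Selenium-style driver object that exposes switch_to.frame() and switch_to.parent_frame(); the helper name inside_frame is hypothetical:

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch a Selenium-style driver into `frame`, then back to the parent.

    `frame` can be anything switch_to.frame() accepts, for example
    a WebElement that points at the <iframe>.
    """
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Return to the parent browsing context even if the body raised.
        driver.switch_to.parent_frame()


# Usage sketch (the "iframe" selector is an assumption about the page):
# with inside_frame(driver, driver.find_element("css selector", "iframe")):
#     title = driver.find_element("css selector", "h1").text
```

Using a context manager keeps the switch back to the parent document from being forgotten, which is a common source of "element not found" errors after leaving an iframe.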

2.3. Security

Since the shadow tree is kept apart from the rest of the web page, reaching data or functionality inside it from outside takes an extra step. This is not a security boundary: when the shadow root is attached in 'open' mode, any script on the page can reach it through the host's shadowRoot property, and 'closed' mode only makes that property return null. What it does is help prevent unintended access to internal structures or data.

In the case of iframes, embedding an untrusted page carries a risk of exposing user data or information, and loading iframe content over plain HTTP can weaken the security of the page as a whole.

For more information on security issues related to iframes and web scraping, you can refer to previous posts.

3. Shadow DOM Scraping Method

Usually, we use tools like CSS selectors or XPath to locate HTML elements on a web page. However, a query that starts from the document does not cross a shadow boundary, so these tools cannot see the elements inside a Shadow DOM. Instead, we need to obtain the 'root' of the Shadow DOM from its host element using JavaScript and search from there.

Let's assume there is a web page structure as follows:

  • HTML Example:
<custom-settings>
  #shadow-root
    <settings-panel>
      #shadow-root
        <settings-content>
          #shadow-root
            <input id="customInput">
        </settings-content>
    </settings-panel>
</custom-settings>

As shown in the example above, Shadow DOM often has multiple layers of hierarchy. Therefore, to access <input id="customInput">, you need to navigate through the hierarchy step by step to find the Shadow Root.

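Below is Python code that uses Selenium to access the Shadow Root.

from selenium.webdriver.common.by import By


def get_shadow_root(element, driver):
    # Ask the browser for the element's shadow root.
    # The result is None if the element has no open shadow root.
    return driver.execute_script('return arguments[0].shadowRoot', element)

driver = ... # Initialize the Selenium WebDriver

# The find_element_by_* helpers were removed in recent Selenium 4 releases,
# so find_element(By..., ...) is used instead. Shadow roots are searched
# with CSS selectors.

# Access the first Shadow DOM
root1 = driver.find_element(By.CSS_SELECTOR, 'custom-settings')
shadow_root1 = get_shadow_root(root1, driver)

# Access the second Shadow DOM
root2 = shadow_root1.find_element(By.CSS_SELECTOR, 'settings-panel')
shadow_root2 = get_shadow_root(root2, driver)

# Access the third Shadow DOM
root3 = shadow_root2.find_element(By.CSS_SELECTOR, 'settings-content')
shadow_root3 = get_shadow_root(root3, driver)

# Access the desired element
input_element = shadow_root3.find_element(By.CSS_SELECTOR, '#customInput')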

With this code, you can sequentially explore the hierarchy of the Shadow DOM and access the desired elements.
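The same walk can be written as a loop when the nesting depth varies. The sketch below is an alternative rather than the code above rewritten: it assumes a recent Selenium 4 release, where WebElement exposes a shadow_root property, instead of calling execute_script. The function name find_through_shadow_roots is hypothetical.

```python
CSS = "css selector"  # the string value of Selenium's By.CSS_SELECTOR


def find_through_shadow_roots(driver, host_selectors, target_selector):
    """Walk nested shadow hosts in order and return the target element.

    `host_selectors` lists one CSS selector per shadow host, outermost first.
    """
    context = driver
    for selector in host_selectors:
        host = context.find_element(CSS, selector)
        # Assumption: Selenium 4 WebElement.shadow_root, which raises
        # an exception when the host has no accessible shadow root.
        context = host.shadow_root
    return context.find_element(CSS, target_selector)


# Usage sketch for the structure shown earlier:
# input_element = find_through_shadow_roots(
#     driver,
#     ["custom-settings", "settings-panel", "settings-content"],
#     "#customInput",
# )
```

Passing the hosts as a list keeps the page-specific part (the selectors) separate from the traversal logic, so a change in nesting depth only changes the list.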

The key point is that each layer within the Shadow DOM has an independent DOM, so you need to obtain the Shadow Root of that layer to access the elements of the next layer.
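Because each layer is reached through the previous layer's shadowRoot, the whole path can also be expressed as one JavaScript expression and sent to the browser in a single call. The following is a minimal sketch under that assumption; build_shadow_query and find_in_shadow are hypothetical names, and the generated script relies on optional chaining (?.), which older browsers do not support.

```python
import json


def build_shadow_query(host_selectors, target_selector):
    """Return a JavaScript snippet that pierces each shadow root in turn."""
    expr = "document"
    for selector in host_selectors:
        # json.dumps produces a safely quoted JavaScript string literal.
        # "?." stops the chain if a host is missing or its shadowRoot
        # is null (for example, a closed shadow root).
        expr += f".querySelector({json.dumps(selector)})?.shadowRoot?"
    return f"return {expr}.querySelector({json.dumps(target_selector)}) ?? null;"


def find_in_shadow(driver, host_selectors, target_selector):
    """Run the generated script; the result is the element, or None."""
    return driver.execute_script(
        build_shadow_query(host_selectors, target_selector)
    )


# Usage sketch for the structure shown earlier:
# input_element = find_in_shadow(
#     driver,
#     ["custom-settings", "settings-panel", "settings-content"],
#     "#customInput",
# )
```

Compared with fetching one shadow root per call, this trades step-by-step visibility for fewer round trips to the browser, and it returns None instead of raising when any layer is missing.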

4. Conclusion: Utilize Shadow DOM Scraping to Obtain Hidden Data Effectively

As web structures become more sophisticated, more pages will contain 'hidden elements' of this kind.

Today, we compared the characteristics of iframe and Shadow DOM and walked through a simple web scraping example. To keep up with advances in development technologies, it is essential to keep researching and mastering methods for accessing the data you need.
