Use Claude instead of ChatGPT to input large amounts of HTML.

Learn how to input large amounts of HTML using Claude, explore its features, HTML analysis, and methods for extracting web data.


As an example, I used Claude to crawl the real-time chart on Kream (https://kream.co.kr), a limited-edition resale trading platform.

1. What is Claude?


Claude is an AI chatbot service from Anthropic, an artificial intelligence startup founded by former OpenAI employees and strongly backed by Google.

Claude 2 was announced on July 11, 2023, and was launched in Korea on October 17, about 3 months later.

Claude supports chatting in Korean and attaching up to five files, each up to 10MB, including PDF, TXT, and CSV files. Its prompt accepts up to 100,000 tokens, far more than GPT-4's limit of 32,768.

2. Analyzing HTML with Claude

Now let's try crawling price change information from the real-time chart on Kream.


Let's copy the HTML of the <ul> tag containing the product list for analysis.

In a previous post on building a Coupang crawler bot with ChatGPT, GPT-4's token limit was the sticking point. Can GPT-4 analyze this HTML?


It seems unable to handle it due to token limits.

What about Claude?


First, we give Claude the HTML as a .txt attachment for analysis.

The analysis results include not only product information but also HTML tag information and the libraries used. This allows us to determine what information can be collected based on the analysis results.

3. Extracting Web Data with Claude

Now let's ask Claude for the code to extract data.

3.1. Writing the prompt

I entered the prompt as follows.

I want to extract web page data with Selenium. Use Python as the language.
From the HTML above, get the product image URL, product name, model number, size, price, price change, and price change percentage.
All of the above fields are required, so if any of them is missing, you must give me an error message telling me which data could not be found.
Find every HTML element by XPath.
With the above conditions, output the data in JSON format.

3.2. First Response


Claude responded with code that fetches the data using Selenium and XPath.

However, since the latest version of Selenium does not support find_element_by_xpath, I requested the following modification.

My Selenium version can't use `find_element_by_xpath`.
Import `By` and modify the code to find elements with `find_element`.

3.3. Second Response


It has been modified to use find_element and By.

Here is the complete code written by Claude.

from selenium import webdriver
from selenium.webdriver.common.by import By
import json

driver = webdriver.Chrome()

driver.get("https://www.example.com")

products = []

items = driver.find_elements(By.XPATH, "//ul[@class='trading-items-container']/li")
for item in items:

  try:
    img_url = item.find_element(By.XPATH, ".//picture/img").get_attribute("src")
  except:
    img_url = "Image URL not found"

  try:
    name = item.find_element(By.XPATH, ".//div[contains(@class,'name')]").text
  except:
    name = "Name not found"

  try:
    code = item.find_element(By.XPATH, ".//div[contains(@class,'style-code')]").text
  except:
    code = "Code not found"

  try:
    size = item.find_element(By.XPATH, ".//div[contains(@class,'option')]").text
  except:
    size = "Size not found"

  try:
    price = item.find_element(By.XPATH, ".//div[contains(@class,'current-price')]").text
  except:
    price = "Price not found"

  try:
    change = item.find_element(By.XPATH, ".//div[contains(@class,'change-price')]/span").text
  except:
    change = "Price change not found"

  try:
    change_pct = item.find_element(By.XPATH, ".//div[contains(@class,'change-percent')]").text.split()[0]
  except:
    change_pct = "Change percentage not found"

  product = {
    "image": img_url,
    "name": name,
    "code": code,
    "size": size,
    "price": price,
    "price_change": change,
    "price_change_pct": change_pct
  }

  products.append(product)

print(json.dumps(products, indent=2))

driver.quit()

Looking at the code, it satisfies every requirement of the prompt, from the exception handling to the JSON output.

3.4. First Modification

It seems we only need to modify the Chrome driver settings here.

I changed the following part, from the package imports through launching the Chrome browser.

import json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

url = "https://kream.co.kr/trading-chart"
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get(url)

3.5. First Execution

Now that everything seems ready, shall we give it a try?

[
  {
    "image": "https://kream-phinf.pstatic.net/MjAyMzA0MDNfMTE1/MDAxNjgwNDkxOTQ4NjQ0.W-mLA596MM6US4gSI6jNsJ9NhCbIeUR5mg0qxQ6p_AUg.Ec1REyClrXQ38dSM40zZhVidrYf0gxBsnc7C1SqK71kg.JPEG/a_e78ed04e3a1948a09a7b26bff5b27634.jpg?type=m",
    "name": "Nike x Peaceminusone LS T-Shirt Black (DR0098-010)",
    "code": "DR0098-010",
    "size": "M",
    "price": "152,000",
    "price_change": "\u25b2",
    "price_change_pct": "+1.3%"
  },
  {
    "image": "https://kream-phinf.pstatic.net/MjAyMTAxMjBfNTAg/MDAxNjExMTIyMzAzMTY2.0QpenidH075iEa1_gVzZTKOxJPiB8W6KyX4YmiELcfIg.ydUN0Aq9T8nWA5frqrsbwnbFfsNMNHDx6ezU5WxGAfQg.PNG/p_28c110281b3f4e109e30a6247d503e2a.png?type=m",
    "name": "Nike Dunk Low Pro Premium Medium Curry",
    "code": "DD1390-100",
    "size": "255",
    "price": "107,000",
    "price_change": "\u25b2",
    "price_change_pct": "+0.9%"
  },
  {
    "image": "https://kream-phinf.pstatic.net/MjAyMTAzMDVfMjQz/MDAxNjE0OTE0NzIzMTQ4.tsuFUJtHGm4g4KE5EDikVMScORptOQqIB7afi1Nz2Qwg.sftz3YQOuw48xpNSJa1tV4uEsz5iU4mjIvpllgHWEn8g.PNG/p_8f7b72adbc924b5bbf7c670d55865e6b.png?type=m",
    "name": "Adidas Samba OG Cloud White",
    "code": "B75806",
    "size": "215",
    "price": "130,000",
    "price_change": "\u25b2",
    "price_change_pct": "+0.8%"
  }
]
The change-price element's HTML looks like this:

<div data-v-4f90d828="" class="change-price up" style="color: rgb(241, 87, 70);">
  <span data-v-4f90d828="" class="arrow">▲</span>
  <span data-v-4f90d828="">27,000</span>
</div>

We have successfully retrieved almost all the information, but due to the HTML structure shown above, the XPath `.//div[contains(@class,'change-price')]/span` matches only the first <span>, so just the \u25b2 arrow character is retrieved and the change amount is missed.

3.6. Second Modification

Let's modify it to also retrieve the change amount:

try:
    change = item.find_element(By.XPATH, ".//div[contains(@class,'change-price')]").text.replace("\u25bc", "-").replace("\u25b2", "+")
except:
    change = "Price change not found"
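To see what the replace is doing on its own: targeting the whole `<div>` makes `.text` return the arrow and the amount together (e.g. "▲27,000"), and the arrow glyph is mapped to a sign character. The function name below is mine, just for illustration:

```python
def normalize_change(text):
    """Map the up/down arrow glyphs to sign characters, as in the
    modified extraction code."""
    return text.replace("\u25bc", "-").replace("\u25b2", "+")

print(normalize_change("\u25b227,000"))  # +27,000
print(normalize_change("\u25bc3,000"))   # -3,000
```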

3.7. Second Execution

Here are the results after running it again:

[
  {
    "image": "https://kream-phinf.pstatic.net/MjAyMTA3MzBfNTYg/MDAxNjI3NjI3OTg5ODM5.p-9pRcc3cQ_eaTl7UZdCmPJz8Astbjg2N6_HkEhLrHsg.ekjxXxJ-kcnoaZC6gCQH_-KBCUJORPTRC791UZ_C3ukg.PNG/p_e1ef5e002eda49adb7f5d0c8a41f798d.png?type=m",
    "name": "Jordan 1 x Travis Scott x Fragment Retro Low OG SP Military Blue",
    "code": "DM7866-140",
    "size": "260",
    "price": "1,875,000",
    "price_change": "+27,000",
    "price_change_pct": "+1.5%"
  },
  {
    "image": "https://kream-phinf.pstatic.net/MjAyMzA0MDNfMTE1/MDAxNjgwNDkxOTQ4NjQ0.W-mLA596MM6US4gSI6jNsJ9NhCbIeUR5mg0qxQ6p_AUg.Ec1REyClrXQ38dSM40zZhVidrYf0gxBsnc7C1SqK71kg.JPEG/a_e78ed04e3a1948a09a7b26bff5b27634.jpg?type=m",
    "name": "Nike x Peaceminusone LS T-Shirt Black (DR0098-010)",
    "code": "DR0098-010",
    "size": "M",
    "price": "152,000",
    "price_change": "+2,000",
    "price_change_pct": "+1.3%"
  },
  {
    "image": "https://kream-phinf.pstatic.net/MjAyMjEwMTJfMjQ3/MDAxNjY1NTYyODI3ODMw.YeSueVotfq5mgVXlQIVeV-uL_Q2otMk6nkvlbG2rHncg.vMJPWgosd9OjOn-s75jjEnjk44sB1gNqftQSN8CEkTAg.PNG/a_b2325461062d47e2887459a3e6928d88.png?type=m",
    "name": "Adidas Puffylette Core Black",
    "code": "GY4559",
    "size": "265",
    "price": "78,000",
    "price_change": "+1,000",
    "price_change_pct": "+1.3%"
  }
]

Now it seems to properly indicate both the change amount and its direction.

4. Conclusion

In this way, we have created a Kream real-time chart crawler using Claude.

With Claude, we were able to analyze long HTML documents, since it allows attaching files of up to 10MB.

While we would have had to trim the HTML down before asking ChatGPT to build a crawler, Claude handled the long HTML without any issues.

Although Claude's response speed is slower compared to ChatGPT, there were no shortcomings in the responses.

In conclusion, for creating web crawlers, Claude seems more useful than ChatGPT.

