Practical Guide to Connecting Web Crawling Data to RAG

"We want to create an AI chatbot that answers using our company's data." - We hear this request a lot these days. While ChatGPT is smart, to make it respond based on our company's latest data, we need RAG. And the performance of RAG ultimately depends on data quality.

In this article, we will summarize the entire process of connecting data collected through web crawling to the RAG pipeline with practical code examples.

What is RAG?

RAG (Retrieval-Augmented Generation) is simply "search + generation".

Before the LLM generates an answer, it first retrieves relevant documents from an external database and supplies them as context. This lets the model draw on the latest information, internal documents, and domain-specific data it was never trained on, which significantly reduces hallucination.

The key point is this: an LLM is smart, but it doesn't know everything. RAG is a structure that fills in what the LLM doesn't know through search.
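
To make the "search + generation" idea concrete, here is a minimal conceptual sketch. It is pseudocode-level Python: retrieve and llm are placeholders for whatever retriever and model you actually use, not real library calls.

def rag_answer(question, retrieve, llm):
    # 1. Search: find documents relevant to the question in an external store
    docs = retrieve(question, k=3)

    # 2. Augment: put the retrieved text into the prompt as context
    context = "\n\n".join(docs)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

    # 3. Generate: the LLM answers grounded in that context
    return llm(prompt)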

Overall Pipeline Flow

The overall flow of the RAG pipeline is divided into 5 stages:

Web Crawling → Text Chunking → Embedding → Vector DB Storage → LLM Query

Let's examine each stage with code.

Step 1: Crawling - Collecting Raw Data

The starting point of RAG is data. Product information, news, technical documents, and community posts collected from the web can all serve as RAG sources.

Here, clean text extraction is crucial. If you feed in data mixed with HTML tags, ads, and navigation menus, RAG performance drops significantly.

import requests
from bs4 import BeautifulSoup

def crawl_page(url):
    """단순 크롤링 예시 — 실전에서는 봇 차단, JS 렌더링 등 고려 필요"""
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    })
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract only the main body text (remove navigation, ads, etc.)
    for tag in soup(["nav", "header", "footer", "script", "style", "aside"]):
        tag.decompose()

    text = soup.get_text(separator="\n", strip=True)
    return text

However, reality is not that simple. With requests alone, you run into limitations on JavaScript-rendered single-page applications, pages that require login, and sites with bot-blocking measures.
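
One common workaround for JavaScript-rendered pages is to render them in a headless browser first. Below is a minimal sketch using Playwright (our own choice of tool, not required by the rest of the pipeline; it assumes pip install playwright plus playwright install chromium). The cleanup step is the same as above.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def crawl_js_page(url):
    """Render the page in a headless browser, then reuse the same BeautifulSoup cleanup."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["nav", "header", "footer", "script", "style", "aside"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

Even then, a headless browser does not solve bot blocking or login walls by itself, which is why a managed crawling service can be attractive.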

Step 2: Chunking - Dividing Text into Appropriate Sizes

Embedding an entire document as a single vector reduces search accuracy. It is crucial to split the text into appropriately sized chunks.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # up to 500 characters per chunk
    chunk_overlap=50,      # 50-character overlap between chunks (preserves context)
    separators=["\n\n", "\n", ". ", " "]  # split priority
)

documents = splitter.split_text(crawled_text)
print(f"Created {len(documents)} chunks")

Chunking Tips:
- If the chunks are too small, context is lost; if too large, search accuracy decreases.
- For Korean text, chunks of 500-800 characters generally show good performance.
- Set chunk_overlap to prevent sentences from being cut off.
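
A quick way to verify these settings on your own data is to inspect a few chunk lengths and the boundary between neighbouring chunks. An illustrative snippet (the overlap is approximate, because the splitter respects the separators):

for i, chunk in enumerate(documents[:3]):
    print(f"chunk {i}: {len(chunk)} chars")

if len(documents) > 1:
    print("end of chunk 0  :", documents[0][-60:])
    print("start of chunk 1:", documents[1][:60])  # the overlapping text should be visible here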

Step 3: Embedding + Storing in VectorDB

Convert the chunks into vectors (numeric arrays) and store them in a VectorDB. Here, we use OpenAI embeddings together with ChromaDB.

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Initialize the embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Store in the VectorDB
vectorstore = Chroma.from_texts(
    texts=documents,
    embedding=embeddings,
    collection_name="crawled_data",
    persist_directory="./chroma_db"
)

print(f"{len(documents)}개 청크를 벡터DB에 저장 완료")

Other VectorDBs (Pinecone, Weaviate, Qdrant, etc.) can also be used with the same interface in LangChain.
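
Before wiring in the LLM, it is worth sanity-checking retrieval directly on the vector store. A small sketch using the vectorstore built above (the query string is just an example):

# Retrieve the 3 chunks most similar to a test query
results = vectorstore.similarity_search("latest product price", k=3)
for doc in results:
    print(doc.page_content[:100], "...")

If the returned chunks look like navigation text or boilerplate, fix the crawling and cleaning steps before blaming the model.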

Step 4: RAG Query - Asking LLM Questions

Now, when a user's question comes in, we search the VectorDB for relevant documents and pass them to the LLM as context.

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# Configure the LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Build the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# Ask a question
answer = qa_chain.invoke("What is the latest price of this product?")
print(answer["result"])

Note: RetrievalQA is the legacy Chain interface of LangChain. While the latest versions recommend a RAG configuration based on LangGraph, this explanation uses the most intuitive method for conceptual understanding.
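
For reference, a roughly equivalent chain in the newer LCEL style might look like the sketch below. It reuses the llm and vectorstore from above; the prompt wording is our own assumption, not a fixed LangChain template.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the latest price of this product?"))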

This is the overall flow of RAG: Crawling → Chunking → Embedding → VectorDB → LLM Query. The structure itself is simple.

Data Quality Determines RAG Performance

The most underestimated aspect of RAG is the quality of the input data.

No matter how good the embedding model and the LLM are, if the input data is messy, the results will be messy too. In practice, most RAG performance issues stem from data problems rather than the model.

Common data quality issues:
- Navigation and sidebar text mixed with the main content → noise in search results
- Incomplete crawling (JS rendering failures) → crucial information missing
- Duplicate data → biased search results
- Encoding issues and special-character errors → degraded embedding quality

Garbage In, Garbage Out. Securing clean data during the crawling stage determines the success of the entire RAG pipeline.
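
As an illustration, some of these issues can be mitigated with a lightweight cleanup pass before embedding. This is a minimal sketch under our own assumptions (the normalization rules and the exact-duplicate check are examples; tune them to your data):

import hashlib
import unicodedata

def clean_chunks(chunks):
    """Normalize encoding/whitespace and drop exact-duplicate chunks before embedding."""
    seen = set()
    cleaned = []
    for chunk in chunks:
        # NFKC normalization fixes some mixed-width/encoding artifacts; then collapse whitespace
        text = unicodedata.normalize("NFKC", chunk)
        text = " ".join(text.split())
        if not text:
            continue
        # Skip exact duplicates so they don't bias the search results
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned

documents = clean_chunks(documents)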

Hashscraper MCP + LangChain Integration

Building a crawler yourself eats up time on ancillary issues like bot blocking, JS rendering, and proxy management. By using Hashscraper's MCP (Model Context Protocol) server, you can focus on the RAG pipeline without worrying about crawling infrastructure.

# Example: using Hashscraper MCP as a LangChain Document Loader
import requests

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Collect data via the Hashscraper API - bot-blocked sites are fine too
response = requests.post(
    "https://api.hashscraper.com/api/scrape",
    json={
        "url": "https://target-site.com/products",
        "format": "markdown"            # markdown or text
    },
    headers={
        "X-API-Key": "YOUR_API_KEY",
        "Content-Type": "application/json; version=2"
    }
).json()

# Convert to LangChain Documents
docs = [
    Document(
        page_content=response["data"]["content"],
        metadata={
            "url": response["data"]["url"],
            "title": response["data"]["title"]
        }
    )
]

# From here it's the same: chunking → embedding → VectorDB → RAG
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
vectorstore = Chroma.from_documents(chunks, embeddings)
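
A nice side effect of loading the result as a Document with metadata is that the url and title travel with every chunk through the splitter and the vector store, so retrieved answers can point back to their source. A small illustrative check (the query text is an example):

results = vectorstore.similarity_search("product pricing", k=3)
for doc in results:
    print(doc.metadata.get("url"), "-", doc.page_content[:80])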

Reasons to use Hashscraper as a RAG data source:
- Clean text extraction even from bot-blocked sites
- JavaScript rendering and login automation
- Built-in proxy management and IP rotation - no infrastructure to worry about
- Standardized data output → improved chunking/embedding quality

Conclusion

The RAG pipeline is structurally simple. However, what ultimately determines performance in practice is the quality of the first stage - data collection.

Instead of investing time in building and maintaining a crawler yourself, focusing on the RAG logic on top of a proven crawling infrastructure may be the more efficient choice.

If you need to build a RAG pipeline from crawled data, talk to Hashscraper. We provide hands-on support from data collection to AI integration, grounded in practical experience.
