The difference between web crawling and web scraping and basic examples implemented in Ruby

Explanation of the difference between web crawling and web scraping, basic examples implemented in Ruby, and an introduction to web data collection using Ruby.


What is the difference between scraping and crawling?

Scraping and crawling are two methods for collecting web data. Although the two terms are often used interchangeably, they differ in function and purpose.

  1. Crawling:
  • Crawling refers to the process of exploring multiple pages of a website.

  • Typically, a web crawler or spider automatically traverses a website by following links on pages or analyzing the site structure to collect specific data.

  • Search engines commonly use it to index the web.

  2. Scraping:
  • Scraping is the process of extracting desired data from a specific web page.

  • It is used when collecting information such as the price of a specific product or news articles from a web page.

  • Scraping focuses on extracting the desired data from the HTML of pages collected by a crawler, or from API results.

In summary, crawling is the process of exploring and collecting web pages, while scraping is the process of extracting specific information from the collected pages.
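This division of labor can be sketched without touching the network. In the minimal sketch below, `SITE` is a made-up in-memory site (the paths, links, and prices are all hypothetical), `crawl` discovers pages by following links breadth-first, and `scrape_prices` pulls one field out of pages that were already collected. A real crawler would fetch and parse HTML instead of reading from a hash.

```ruby
require 'set'

# Hypothetical in-memory "website": each path maps to the links found on
# that page and, for product pages, a price.
SITE = {
  '/' => { links: ['/products', '/about'], price: nil },
  '/products' => { links: ['/products/1', '/products/2', '/'], price: nil },
  '/about' => { links: ['/'], price: nil },
  '/products/1' => { links: ['/products'], price: 1200 },
  '/products/2' => { links: ['/products'], price: 850 }
}.freeze

# Crawling: explore pages by following links breadth-first, visiting each
# page once. Returns the paths in the order they were discovered.
def crawl(start)
  visited = Set.new([start])
  queue = [start]
  order = []
  until queue.empty?
    path = queue.shift
    order << path
    SITE.fetch(path)[:links].each do |link|
      next unless SITE.key?(link)     # skip links that lead off the site
      next unless visited.add?(link)  # add? returns nil if already seen
      queue << link
    end
  end
  order
end

# Scraping: extract one specific field from pages that were already collected.
def scrape_prices(paths)
  paths.each_with_object({}) do |path, prices|
    price = SITE.fetch(path)[:price]
    prices[path] = price if price
  end
end

pages = crawl('/')
prices = scrape_prices(pages)
puts "Crawled #{pages.size} pages: #{pages.join(', ')}"
prices.each { |path, price| puts "Scraped #{path}: #{price}" }
```

The crawler knows nothing about prices and the scraper knows nothing about links, which is why the two steps are often built and scheduled separately.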

A Taste of Crawling in Ruby

To try web crawling and scraping in Ruby, here is a simple example that uses the **nokogiri** and **open-uri** libraries.

First, install the required library. Run the following command in a terminal to install **nokogiri**. **open-uri** ships with Ruby's standard library, so it does not need to be installed separately.

gem install nokogiri

Here is a short Ruby script that fetches a web page, scrapes its title, and lists the links it contains.


require 'nokogiri'
require 'open-uri'

# Web page URL
url = 'http://example.com'

# Open the HTML at the URL and parse it into a Nokogiri object.
doc = Nokogiri::HTML(URI.open(url))

# Find the page title and print it.
title = doc.css('title').text
puts "Page Title: #{title}"

# Go through every link on the page and print it.
doc.css('a').each do |link|
  puts "Link: #{link['href']} Text: #{link.text}"
end

Code Explanation:

  1. Import the necessary libraries using **require 'nokogiri'** and **require 'open-uri'**.

  2. Use **Nokogiri::HTML(URI.open(url))** to open the HTML document of the given URL and convert it into a Nokogiri object.

  3. Extract the text of the **<title>** tag using **doc.css('title').text** to get the page title.

  4. Find all **<a>** tags (links) using **doc.css('a')** and print the **href** attribute and text of each link.

This code scrapes the title of the given web page and lists every link on it. It does not yet follow those links; a full crawler would visit each one in turn and repeat the process.
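The `href` values collected this way are often relative (`/about`, `../news`) or point to other sites, so a crawler that wants to follow them usually has to normalize them first. The sketch below shows one way to do that with only `URI` from the standard library. The helper name `followable_links` and the sample hrefs are hypothetical, and the sketch assumes that only HTTP(S) links on the same host are worth following.

```ruby
require 'uri'

# Hypothetical helper: turn raw href values into absolute, same-host URLs.
def followable_links(base_url, hrefs)
  base = URI.parse(base_url)
  hrefs.filter_map do |href|
    next if href.nil? || href.strip.empty?
    begin
      uri = URI.join(base_url, href.strip)  # resolve relative paths
    rescue URI::Error
      next                                  # skip hrefs that are not valid URIs
    end
    next unless uri.is_a?(URI::HTTP)        # URI::HTTPS is a subclass of URI::HTTP
    next unless uri.host == base.host       # stay on the same site
    uri.fragment = nil                      # "#section" anchors are the same page
    uri.to_s
  end.uniq
end

# Sample hrefs as they might come back from doc.css('a') (made up here).
hrefs = ['/about', 'post-1', '../news#top', 'https://other.example.org/',
         'mailto:team@example.com', nil, '/about']

followable_links('http://example.com/blog/index.html', hrefs).each do |link|
  puts link
end
```

Dropping the fragment and calling `uniq` keeps the crawler from queuing the same page twice under slightly different spellings.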

A Taste of Scraping in Ruby

Next is an example of extracting (scraping) specific data from a sample web page, again using **nokogiri** and **open-uri**.

The following code scrapes the latest news headlines from a hypothetical web page.

require 'nokogiri'
require 'open-uri'

# URL of a hypothetical web page
url = 'https://example.com/news'

# Fetch the HTML content and parse it with Nokogiri.
doc = Nokogiri::HTML(URI.open(url))

# Scrape the elements that hold the latest news headlines.
# Assumption: the latest news is contained in divs with the class 'headline'.
headlines = doc.css('div.headline')

# Extract the text from each headline element and print it.
headlines.each do |headline|
  puts headline.text.strip
end

Code Explanation:

  1. Import the necessary libraries using **require 'nokogiri'** and **require 'open-uri'**.

  2. Open and parse the HTML document of the given URL using **Nokogiri::HTML(URI.open(url))**.

  3. Select all **<div>** elements with the class **headline** using **doc.css('div.headline')**.

  4. Print the text of each headline element.

This code demonstrates how to extract data based on the HTML structure of a web page using CSS selectors. By adjusting the selectors to match the page you want to scrape, you can extract other kinds of information in the same way.
