What is the difference between web crawling and web scraping?

Web crawling is the process of exploring multiple pages of a website to collect data, while web scraping is the extraction of specific information from a particular web page.

How does web crawling work?

Web crawling involves a web crawler or spider that automatically traverses a website by following links to collect data.

What is web scraping used for?

Web scraping is used to extract specific information such as product prices or news articles from web pages.

How can I perform web crawling in Ruby?

You can perform web crawling in Ruby using the 'nokogiri' and 'open-uri' libraries to open and parse HTML documents.

What libraries are needed for web scraping in Ruby?

You need the 'nokogiri' library for parsing HTML and 'open-uri' for opening URLs. 'open-uri' is included with Ruby and does not require separate installation.

Web crawling vs. web scraping: the difference, with basic examples in Ruby

What is the difference between scraping and crawling?

Scraping and crawling are two methods for collecting web data. Although the two terms are often used interchangeably, they differ in function and purpose.

Crawling:

Crawling refers to the process of exploring multiple pages of a website.
Typically, a web crawler or spider automatically traverses a website by following links on pages or analyzing the site structure to collect specific data.
Search engines commonly use crawling to index the web.

Scraping:

Scraping is the process of extracting desired data from a specific web page.
It is used when collecting information such as the price of a specific product or news articles from a web page.
Scraping focuses on extracting the desired data from the HTML of pages collected by a crawler, or from API responses.

In summary, crawling is the process of exploring and collecting web pages, while scraping is the process of extracting specific information from the collected pages.

A Taste of Crawling in Ruby

To show web crawling and scraping in Ruby, here is a simple example that uses the **nokogiri** and **open-uri** libraries.

First, install the necessary library. Run the following command in the terminal to install **nokogiri**. There is no need to install **open-uri** separately, because it ships with Ruby's standard library.

gem install nokogiri

Here is a simple Ruby script that fetches a single web page, scrapes its title, and lists the links it contains.


require 'nokogiri'
require 'open-uri'

# Web page URL
url = 'http://example.com'

# Open the HTML at the URL and parse it into a Nokogiri object.
doc = Nokogiri::HTML(URI.open(url))

# Find and print the page title.
title = doc.css('title').text
puts "Page Title: #{title}"

# Find and print every link on the page.
doc.css('a').each do |link|
  puts "Link: #{link['href']} Text: #{link.text}"
end

Code Explanation:

Import the necessary libraries using **require 'nokogiri'** and **require 'open-uri'**.
Use **Nokogiri::HTML(URI.open(url))** to open the HTML document of the given URL and convert it into a Nokogiri object.
Extract the text of the **<title>** tag using **doc.css('title').text** to get the page title.
Find all **<a>** tags (links) using **doc.css('a')** and print the **href** attribute and text of each link.

This code scrapes the title of the given web page and lists every link on it. It does not follow those links; a full crawler would fetch each linked page and repeat the process.
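
The listing above stops at printing the links it finds. As a minimal sketch of the step that turns this into crawling, the loop below follows links breadth-first, resolves relative hrefs with **URI.join**, stays on the starting host, and remembers which pages it has already visited. The **crawl** method, its **links_for** and **max_pages** parameters, and the in-memory **site** hash are all hypothetical names made up for this illustration; the hash stands in for a real website so the sketch runs without network access.

```ruby
require 'set'
require 'uri'

# Breadth-first crawl starting at start_url.
# links_for is any callable that takes a URL string and returns the raw href
# values found on that page (a hypothetical interface for this sketch).
def crawl(start_url, links_for:, max_pages: 50)
  start_host = URI(start_url).host
  visited = Set.new
  queue = [start_url]

  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    next if visited.include?(url)

    visited << url

    links_for.call(url).each do |href|
      next if href.nil? || href.empty?

      begin
        # Resolve relative links such as "/about" against the current page.
        absolute = URI.join(url, href)
      rescue URI::InvalidURIError
        next # skip hrefs that are not valid URIs
      end

      # Follow only http(s) links on the starting host.
      next unless absolute.is_a?(URI::HTTP) && absolute.host == start_host

      absolute.fragment = nil # "#section" points into the same page
      queue << absolute.to_s
    end
  end

  visited.to_a
end

# In-memory stand-in for a website (assumed data, not a real site).
site = {
  'http://example.com/' => ['/about', 'news', '#top', 'mailto:hi@example.com'],
  'http://example.com/about' => ['/', 'http://other.example.org/'],
  'http://example.com/news' => ['/about', nil]
}

pages = crawl('http://example.com/', links_for: ->(url) { site.fetch(url, []) })
puts pages
```

To crawl a live site, **links_for** could instead wrap the calls from the listing above, for example a lambda that parses the page with **Nokogiri::HTML(URI.open(url))** and maps **doc.css('a')** to each link's **href**. Keeping the fetch step injectable is a design choice for this sketch: it lets the traversal logic be tried out and tested without touching the network.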

A Taste of Scraping in Ruby

Next is an example of extracting (scraping) specific data from a sample web page, again using **nokogiri** and **open-uri**.

The following code scrapes the latest news headlines from a hypothetical news page.

require 'nokogiri'
require 'open-uri'

# URL of a hypothetical web page
url = 'https://example.com/news'

# Fetch the HTML content and parse it with Nokogiri.
doc = Nokogiri::HTML(URI.open(url))

# Scrape the elements that contain the latest news headlines.
# Assumption: the latest news is contained in divs with the class 'headline'.
headlines = doc.css('div.headline')

# Extract and print the text of each headline element.
headlines.each do |headline|
  puts headline.text.strip
end

Code Explanation:

Import the necessary libraries using **require 'nokogiri'** and **require 'open-uri'**.
Open and parse the HTML document of the given URL using **Nokogiri::HTML(URI.open(url))**.
Select all **<div>** elements with the class **headline** using **doc.css('div.headline')**.
Print the text of each headline element.

This code demonstrates extracting the desired data from a web page's HTML structure using CSS selectors. By adjusting the selectors to match the page you want to scrape, you can extract many kinds of information.