Specialized web scraping for Chrome using Node.js

This article introduces web scraping specialized for Chrome using Node.js. With Puppeteer and Puppeteer-Extra, you can handle dynamic content and bypass common blocks.


Hello, today I will introduce how to create a web crawler using Node.js, not Ruby, Python, or Java. In particular, we will take a detailed look at Puppeteer, a powerful Node.js library that can control Google Chrome or Chromium, and its extension, Puppeteer-Extra.

What is Puppeteer?

Puppeteer is a Node.js library that lets your code scrape and interact with websites the same way a real user would. It primarily operates in headless mode, performing tasks in the background without a GUI, but it can also run in headful mode with the full browser UI when needed.
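
For example, a minimal sketch of launching the browser and reading a page title might look like this (assuming plain puppeteer is installed and the code runs in an .mjs file so top-level await is available; the URL is just a placeholder):

import puppeteer from "puppeteer";

// headless: true (the default) runs Chrome in the background without a window;
// switch it to false to watch the browser while debugging.
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://example.com");
console.log(await page.title()); // print the page title as a quick sanity check
await browser.close();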

Why use Puppeteer?

1. Optimized performance for Chrome

Puppeteer communicates directly with the Chrome browser using the Chrome DevTools Protocol. This allows for faster and more precise control than going through the WebDriver protocol used by other automation tools such as Selenium.
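
As a rough illustration, Puppeteer also exposes the protocol directly through CDP sessions. The sketch below (assuming a recent Puppeteer version that provides page.createCDPSession(), with a placeholder URL) enables the CDP Performance domain and reads the raw page metrics:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
const client = await page.createCDPSession(); // a raw DevTools Protocol channel to this tab

await client.send("Performance.enable"); // turn on the CDP Performance domain
await page.goto("https://example.com");
const { metrics } = await client.send("Performance.getMetrics"); // low-level metrics straight from Chrome
console.log(metrics);

await browser.close();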

2. Strong handling of dynamic web content

Puppeteer excels in handling not only static pages but also content generated dynamically with JavaScript. This means it can effectively handle the complex interactions and dynamic elements of modern websites.
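
For instance, a common pattern is to wait for an element that only appears after client-side JavaScript has run, then read its rendered text. The sketch below uses a made-up selector (.product-title) and a placeholder URL:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com", { waitUntil: "networkidle2" }); // wait until network activity quiets down

// Block until the dynamically rendered elements actually exist in the DOM.
await page.waitForSelector(".product-title");

// Run code inside the page to collect the rendered text.
const titles = await page.$$eval(".product-title", (els) => els.map((el) => el.textContent.trim()));
console.log(titles);

await browser.close();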

3. Bypassing blocks

In actual web scraping tasks, Puppeteer effectively bypasses blocks to collect data. Especially when using Puppeteer-Extra and the Stealth Plugin, you can collect data more reliably by bypassing automated detection mechanisms.

Puppeteer-Extra and Stealth Plugin

Puppeteer-Extra

Puppeteer-Extra is an extension built on top of the base Puppeteer library, allowing users to easily integrate additional features through a flexible plugin system. This enhances functionality and improves the usability of the web crawler.
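
The plugin pattern itself is simple: register each plugin once with use(), and every subsequent launch() picks them up. Here is a small sketch (the adblocker plugin is only an example of a second plugin and would require installing puppeteer-extra-plugin-adblocker separately):

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import AdblockerPlugin from "puppeteer-extra-plugin-adblocker"; // optional extra plugin, installed separately

puppeteer.use(StealthPlugin());
puppeteer.use(AdblockerPlugin({ blockTrackers: true }));

// From here on, the browser is used exactly like plain Puppeteer.
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
await browser.close();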

Stealth Plugin

One of the plugins in Puppeteer-Extra, the Stealth Plugin, effectively bypasses common automated detection mechanisms during web scraping. While many websites block crawlers, this plugin can make the automated crawler appear like a regular user's browser.
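
One concrete signal it masks: headless Chrome normally reports navigator.webdriver as true, which detection scripts frequently check. Below is a small sketch of verifying this with the plugin active (placeholder URL; the exact value may vary by Chrome version):

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://example.com");

// With the Stealth Plugin active this is expected to be false or undefined rather than true.
const webdriverFlag = await page.evaluate(() => navigator.webdriver);
console.log("navigator.webdriver:", webdriverFlag);

await browser.close();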

Real-world Usage Example

Now let's explore how to perform a simple web scraping task using Puppeteer-Extra and the Stealth Plugin.

1. Install required modules

Puppeteer-Extra builds on top of Puppeteer, so install puppeteer together with the puppeteer-extra and puppeteer-extra-plugin-stealth modules by entering the following command.

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

2. Import required modules

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import fs from "fs";

3. Handle URL input in commands

In preparation for later running a command like node node_js.mjs https://www.example.com to pass the URL directly, the script prints an error and exits if the URL argument is missing.

const args = process.argv.slice(2); // drop the Node.js executable path and the script path
if (args.length < 1) {
  console.error("A URL parameter is required.");
  process.exit(1); // exit with an error code
}

4. Apply the stealth plugin

puppeteer.use(StealthPlugin());

5. Implement browser interaction

puppeteer.launch({ headless: false }).then(async browser => {
  // headless: false opens a visible browser window instead of running in the background

  const url = args[0]; // take the URL from the first command-line parameter

  const page = await browser.newPage(); // open a new tab
  await page.goto(url); // navigate to the URL
  await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5 seconds for rendering to finish

  const currentUrl = page.url(); // store the current URL in a variable
  console.log("Current page URL:", currentUrl); // print it to confirm the URL was captured correctly

  // store the page's HTML in a variable
  const html = await page.content();

  // save the HTML content to a file
  fs.writeFile("pageContent.html", html, (err) => {
    if (err) {
      console.error("Error while saving the file:", err);
    } else {
      console.log("HTML saved successfully: pageContent.html");
    }
  });

  await browser.close(); // close the browser so the process can exit
});

6. Execute the command

Save the script as an .mjs file so that the ES module import syntax works, then run it with the target URL as an argument.

node [filename].mjs https://www.naver.com

A browser window will open the URL, and the output will be printed in the terminal.

7. Check the HTML file

After following the steps above, you can confirm that pageContent.html has been saved.

Conclusion

This script navigates to the URL specified by the user, retrieves the HTML of the page, and saves it as a pageContent.html file. By using Puppeteer-Extra and the Stealth Plugin, you can effectively bypass the website's automated detection logic.

I hope this article has helped you understand how to create a crawler using Node.js. By leveraging Puppeteer and its extensions, you can implement a more powerful and effective web crawling solution.
