Specialized web scraping for Chrome using Node.js

This article introduces web scraping specialized for Chrome using Node.js. With Puppeteer and Puppeteer-Extra, you can handle dynamic content and bypass common blocks.


Hello, today I will introduce how to create a web crawler using Node.js, not Ruby, Python, or Java. In particular, we will take a detailed look at Puppeteer, a powerful Node.js library that can control Google Chrome or Chromium, and its extension, Puppeteer-Extra.

What is Puppeteer?

Puppeteer is a Node.js library that lets your code scrape and interact with websites the same way a real user would. It primarily operates in headless mode, performing tasks in the background without a GUI, but it can also run in headful mode with the full browser UI when needed.
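
For example, a minimal sketch of launching the browser and reading a page title might look like this (assuming plain puppeteer is installed and the code runs in an .mjs file so top-level await is available; the URL is just a placeholder):

import puppeteer from "puppeteer";

// headless: true (the default) runs Chrome in the background without a window;
// switch it to false to watch the browser while debugging.
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://example.com");
console.log(await page.title()); // print the page title as a quick sanity check
await browser.close();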

Why use Puppeteer?

1. Optimized performance for Chrome

Puppeteer communicates directly with the Chrome browser using the Chrome DevTools Protocol. This allows for faster and more precise control than going through the WebDriver protocol used by other automation tools such as Selenium.
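
As a rough illustration, Puppeteer also exposes the protocol directly through CDP sessions. The sketch below (assuming a recent Puppeteer version that provides page.createCDPSession(), with a placeholder URL) enables the CDP Performance domain and reads the raw page metrics:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
const client = await page.createCDPSession(); // a raw DevTools Protocol channel to this tab

await client.send("Performance.enable"); // turn on the CDP Performance domain
await page.goto("https://example.com");
const { metrics } = await client.send("Performance.getMetrics"); // low-level metrics straight from Chrome
console.log(metrics);

await browser.close();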

2. Strong handling of dynamic web content

Puppeteer excels in handling not only static pages but also content generated dynamically with JavaScript. This means it can effectively handle the complex interactions and dynamic elements of modern websites.
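
For instance, a common pattern is to wait for an element that only appears after client-side JavaScript has run, then read its rendered text. The sketch below uses a made-up selector (.product-title) and a placeholder URL:

import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com", { waitUntil: "networkidle2" }); // wait until network activity quiets down

// Block until the dynamically rendered elements actually exist in the DOM.
await page.waitForSelector(".product-title");

// Run code inside the page to collect the rendered text.
const titles = await page.$$eval(".product-title", (els) => els.map((el) => el.textContent.trim()));
console.log(titles);

await browser.close();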

3. Bypassing blocks

In actual web scraping tasks, Puppeteer effectively bypasses blocks to collect data. Especially when using Puppeteer-Extra and the Stealth Plugin, you can collect data more reliably by bypassing automated detection mechanisms.

Puppeteer-Extra and Stealth Plugin

Puppeteer-Extra

Puppeteer-Extra is an extension built on top of the base Puppeteer library, allowing users to easily integrate additional features through a flexible plugin system. This enhances functionality and improves the usability of the web crawler.
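
The plugin pattern itself is simple: register each plugin once with use(), and every subsequent launch() picks them up. Here is a small sketch (the adblocker plugin is only an example of a second plugin and would require installing puppeteer-extra-plugin-adblocker separately):

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import AdblockerPlugin from "puppeteer-extra-plugin-adblocker"; // optional extra plugin, installed separately

puppeteer.use(StealthPlugin());
puppeteer.use(AdblockerPlugin({ blockTrackers: true }));

// From here on, the browser is used exactly like plain Puppeteer.
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
await browser.close();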

Stealth Plugin

One of the plugins in Puppeteer-Extra, the Stealth Plugin, effectively bypasses common automated detection mechanisms during web scraping. While many websites block crawlers, this plugin can make the automated crawler appear like a regular user's browser.
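
One concrete signal it masks: headless Chrome normally reports navigator.webdriver as true, which detection scripts frequently check. Below is a small sketch of verifying this with the plugin active (placeholder URL; the exact value may vary by Chrome version):

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://example.com");

// With the Stealth Plugin active this is expected to be false or undefined rather than true.
const webdriverFlag = await page.evaluate(() => navigator.webdriver);
console.log("navigator.webdriver:", webdriverFlag);

await browser.close();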

Real-world Usage Example

Now let's explore how to perform a simple web scraping task using Puppeteer-Extra and the Stealth Plugin.

1. Install required modules

Puppeteer-Extra builds on top of Puppeteer, so install puppeteer together with the puppeteer-extra and puppeteer-extra-plugin-stealth modules by entering the following command.

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

2. Import required modules

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import fs from "fs";

3. Handle URL input in commands

In preparation for later running a command like node node_js.mjs https://www.example.com to pass the URL directly, the script prints an error and exits if the URL argument is missing.

const args = process.argv.slice(2); // drop the Node.js executable path and the script path
if (args.length < 1) {
  console.error("A URL parameter is required.");
  process.exit(1); // exit with an error code
}

4. Apply the stealth plugin

puppeteer.use(StealthPlugin());

5. Implement browser interaction

puppeteer.launch({ headless: false }).then(async browser => {
  // headless: false opens a visible browser window instead of running in the background

  const url = args[0]; // take the URL from the first command-line parameter

  const page = await browser.newPage(); // open a new tab
  await page.goto(url); // navigate to the URL
  await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5 seconds for rendering to finish

  const currentUrl = page.url(); // store the current URL in a variable
  console.log("Current page URL:", currentUrl); // print it to confirm the URL was captured correctly

  // store the page's HTML in a variable
  const html = await page.content();

  // save the HTML content to a file
  fs.writeFile("pageContent.html", html, (err) => {
    if (err) {
      console.error("Error while saving the file:", err);
    } else {
      console.log("HTML saved successfully: pageContent.html");
    }
  });

  await browser.close(); // close the browser so the process can exit
});

6. Execute the command

Save the script as an .mjs file so that the ES module import syntax works, then run it with the target URL as an argument.

node [filename].mjs https://www.naver.com

A browser window will open the URL, and the output will be printed in the terminal.

7. Check the HTML file

After following the steps above, you can confirm that pageContent.html has been saved.

Conclusion

This script navigates to the URL specified by the user, retrieves the HTML of the page, and saves it as a pageContent.html file. By using Puppeteer-Extra and the Stealth Plugin, you can effectively bypass the website's automated detection logic.

I hope this article has helped you understand how to create a crawler using Node.js. By leveraging Puppeteer and its extensions, you can implement a more powerful and effective web crawling solution.
