Hello! In this post I will show how to build a web crawler with Node.js rather than Ruby, Python, or Java. In particular, we will take a detailed look at Puppeteer, a powerful Node.js library that can control Google Chrome or Chromium, and at its extension, Puppeteer-Extra.
What is Puppeteer?
Puppeteer is a Node.js library that lets a script browse and interact with websites much like a human user would. It primarily operates in headless mode, performing tasks in the background without a GUI, but it can also be switched to headful mode with the full browser UI when needed.
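To get a feel for the basic workflow, here is a minimal sketch using plain Puppeteer (www.example.com is just a placeholder target, and the headless option shown is illustrative, since defaults differ between Puppeteer versions):

import puppeteer from "puppeteer";

// Launch in headless mode: no window is shown and work happens in the background
const browser = await puppeteer.launch({ headless: true });
// Switching to { headless: false } opens a full, visible browser window (headful mode)
const page = await browser.newPage();
await page.goto("https://www.example.com");
console.log(await page.title()); // print the page title to confirm the visit
await browser.close();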
Why use Puppeteer?
1. Optimized performance for Chrome
Puppeteer communicates directly with the Chrome browser using the Chrome DevTools Protocol. This allows for faster and more precise control than using the WebDriver API provided by other automation tools like Selenium.
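As a sketch of that direct access, Puppeteer can open a raw DevTools Protocol session and send CDP commands itself, something WebDriver-based tools do not expose directly. The throttling numbers below are arbitrary, illustrative values:

// Assumes `page` was created as in the launch example above
const client = await page.createCDPSession(); // raw Chrome DevTools Protocol channel
await client.send("Network.enable");
// Simulate a slow connection via a low-level CDP command
await client.send("Network.emulateNetworkConditions", {
  offline: false,
  latency: 200, // added round-trip latency in ms (illustrative)
  downloadThroughput: (780 * 1024) / 8, // ~780 kbit/s down (illustrative)
  uploadThroughput: (330 * 1024) / 8, // ~330 kbit/s up (illustrative)
});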
2. Strong handling of dynamic web content
Puppeteer excels in handling not only static pages but also content generated dynamically with JavaScript. This means it can effectively handle the complex interactions and dynamic elements of modern websites.
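For example, rather than sleeping for a fixed time, you can wait until an element rendered by client-side JavaScript actually appears. The .product-list selector here is purely hypothetical:

// Assumes `page` has already been created
await page.goto("https://www.example.com");
// Block until the dynamically rendered element exists in the DOM (up to 10 s)
await page.waitForSelector(".product-list", { timeout: 10000 });
// Read text out of the freshly rendered content
const firstItem = await page.$eval(".product-list li", (el) => el.textContent);
console.log(firstItem);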
3. Bypassing blocks
In real-world scraping, many sites actively block automated clients, and Puppeteer holds up well here. Especially when combined with Puppeteer-Extra and the Stealth Plugin, it can bypass common automated-detection mechanisms and collect data more reliably.
Puppeteer-Extra and Stealth Plugin
Puppeteer-Extra
Puppeteer-Extra is an extension built on top of the base Puppeteer library that adds a flexible plugin system: extra capabilities are bolted on with a single call per plugin, which makes the resulting crawler both more capable and easier to maintain.
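The pattern is simple: every plugin is registered through the same use() call, so several can be stacked. A small sketch (puppeteer-extra-plugin-adblocker is a separately installed plugin, shown here only to illustrate the mechanism):

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import AdblockerPlugin from "puppeteer-extra-plugin-adblocker";

// Each plugin hooks into Puppeteer through the same use() interface
puppeteer.use(StealthPlugin());
puppeteer.use(AdblockerPlugin({ blockTrackers: true }));

const browser = await puppeteer.launch(); // launches with both plugins active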
Stealth Plugin
One of the plugins in Puppeteer-Extra, the Stealth Plugin, effectively bypasses common automated detection mechanisms during web scraping. While many websites block crawlers, this plugin can make the automated crawler appear like a regular user's browser.
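One quick way to see the effect: under plain automation, Chrome reports navigator.webdriver as true, which is one of the signals detection scripts inspect, and the Stealth Plugin masks it. A minimal sketch (exact results depend on your Puppeteer and plugin versions):

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch();
const page = await browser.newPage();
// navigator.webdriver normally betrays automation; stealth patches it
const flag = await page.evaluate(() => navigator.webdriver);
console.log("navigator.webdriver:", flag); // with stealth: false
await browser.close();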
Real-world Usage Example
Now let's explore how to perform a simple web scraping task using Puppeteer-Extra and the Stealth Plugin.
1. Install required modules
Install the puppeteer-extra and puppeteer-extra-plugin-stealth modules by running the following command:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
2. Import required modules
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import fs from "fs";
3. Handle URL input in commands
Since we will later pass the URL directly on the command line (e.g. node node_js.mjs https://www.example.com), we first read the arguments and exit with an error if no URL was supplied.
const args = process.argv.slice(2); // drop the first two elements (the Node.js binary path and the script path)
if (args.length < 1) {
  console.error("A URL parameter is required.");
  process.exit(1); // exit with an error code
}
4. Apply the stealth plugin
puppeteer.use(StealthPlugin());
5. Implement browser interaction
puppeteer.launch({ headless: false }).then(async (browser) => {
  // headless: false skips headless mode and opens a visible browser window
  const url = args[0]; // take the URL from the first command-line parameter
  const page = await browser.newPage(); // open a new tab
  await page.goto(url); // navigate to the URL
  await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5 seconds for rendering to finish (page.waitForTimeout was removed in recent Puppeteer versions)
  const currentUrl = page.url(); // page.url() is synchronous, so no await is needed
  console.log("Current page URL:", currentUrl); // confirm the URL was captured correctly
  // Store the page's HTML in a variable
  const html = await page.content();
  // Save the HTML content to a file
  fs.writeFile("pageContent.html", html, (err) => {
    if (err) {
      console.error("Error while saving the file:", err);
    } else {
      console.log("HTML saved successfully: pageContent.html");
    }
  });
  await browser.close(); // close the browser; the HTML is already captured in memory
});
6. Execute the command
The script is saved as an .mjs file so that the import syntax works; run it with the URL as an argument.
node [filename].mjs https://www.naver.com
The browser opens the URL directly, and the log output appears in the terminal.
7. Check the HTML file
After following the steps above, you can confirm that pageContent.html has been saved.
Conclusion
This script navigates to the user-supplied URL, retrieves the page's HTML, and saves it as pageContent.html. By using Puppeteer-Extra together with the Stealth Plugin, it can effectively bypass a website's automated detection logic along the way.
I hope this article has helped you understand how to create a crawler using Node.js. By leveraging Puppeteer and its extensions, you can implement a more powerful and effective web crawling solution.