Creating a CampusPick contest and extracurricular activity crawler with Python - Contest & Extracurricular Activity Automatic Crawling Project: Part 1

The first post in a series on building a crawler for CampusPick contests and external activities with Python, with a detailed walkthrough of crawling data using the Selenium and Openpyxl libraries.


0. Overview

Many of the job seekers and college students who read the Hashscraper development notes are likely interested in contests and external activities, so in this project we will build a Python crawler for CampusPick, a contest and external-activity site, to demonstrate more practical crawling techniques. The project is divided into a three-part series:

  1. Creating a CampusPick crawler

  2. Setting up and running the crawler using Crontab

  3. Sending the crawled data via email using Python and Gmail

This post is about the first part, Creating a CampusPick crawler.

  • This post does not provide guidance on basic Python installation.

1. Installing and Importing Libraries

1.1. Installing Libraries

First, install Selenium, the library used for crawling in Python, and Openpyxl for handling Excel files. Run the following commands in a terminal.

(If the pip command is not available, use pip3 instead.)

# Install Selenium
pip install selenium
# or
pip3 install selenium

# Install Openpyxl
pip install openpyxl
# or
pip3 install openpyxl
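
To confirm both libraries installed correctly, a quick sanity check is to import them and print their versions (any recent version should work for this tutorial):

# Quick sanity check: both imports should succeed and print a version string
import selenium
import openpyxl

print(selenium.__version__)
print(openpyxl.__version__)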

1.2. Importing Libraries

Then, import the Selenium and Openpyxl libraries, along with the built-in time module.

# Crawler libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# Excel library
import openpyxl
# Time library
import time

2. Developing the Crawler

2.1. Building Lists

Create a list to collect the crawled data, and a list of the URLs of the pages you want to crawl.

# List that collects the crawled data
earned_content = []

# URLs of the CampusPick contest page and external activity page
contest_url = "https://www.campuspick.com/contest"
activity_url = "https://www.campuspick.com/activity"
url_list = [contest_url, activity_url]

2.2. Automatically Opening a Browser Window

The following code opens a Chrome window, visits each URL, and walks through the list of posts one item at a time.

# Launch a Chrome window
driver = webdriver.Chrome()
for url in url_list:
    # Open the page
    driver.get(url)
    # Wait 3 seconds for the page to finish loading
    time.sleep(3)

    # The page uses infinite scroll, so grab only the first item in the list
    # and advance to each subsequent item via following-sibling
    lis = driver.find_elements(By.XPATH, '//*[@class="list"]/*[@class="item"]')
    li = lis[0] if lis else None

    while li:
        # Scroll the current item into view so further items keep loading
        driver.execute_script("arguments[0].scrollIntoView({behavior: 'smooth', block: 'center'});", li)

        # Read the deadline to check whether the post is already closed
        dday_elements = li.find_elements(By.XPATH, './/*[@class="dday highlight"]')
        if not dday_elements:
            print("No deadline found; stopping.")
            break
        due_date = dday_elements[0].text
        if "마감" in due_date:  # "마감" means the deadline has passed
            # Posts are listed in chronological order, so every post after a
            # closed one is also closed; skip it and end the loop
            print("Skipping closed posts and stopping.")
            break

        # Keywords
        keywords = li.find_elements(By.XPATH, './/*[@class="badges"]/span')
        keyword_str = ",".join(keyword.text for keyword in keywords)

        # Activity title
        title = li.find_element(By.XPATH, './/h2').text
        # Host
        company = li.find_element(By.XPATH, './/*[@class="company"]').text
        # View count
        view_count = li.find_element(By.XPATH, './/*[@class="viewcount"]').text
        # Thumbnail image URL
        thumbnail_url = li.find_element(By.XPATH, './/figure').get_attribute("data-image")

        results = [keyword_str, title, company, due_date, view_count, thumbnail_url]
        print(results)
        earned_content.append(results)

        # Check whether a next item exists and move on to it
        next_lis = li.find_elements(By.XPATH, './following-sibling::*[@class="item"]')
        if next_lis:
            print("Moving to the next item.")
            li = next_lis[0]
        else:
            print("No next item; stopping.")
            break
# Quit the Chrome driver
driver.quit()

When you run the above code, the crawled data will be saved in earned_content.
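
One note on the fixed time.sleep(3): it can be flaky on slow connections and wastes time on fast ones. Below is a minimal sketch of an alternative using Selenium's explicit waits, assuming the same item XPath as above. The headless options are optional but useful once the crawler runs unattended via Crontab in part 2, and the 10-second timeout is an arbitrary choice, not something CampusPick requires.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Headless Chrome is convenient when the script runs unattended (e.g. via cron)
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get("https://www.campuspick.com/contest")
# Instead of a fixed sleep, wait up to 10 seconds until at least one list item is present
lis = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@class="list"]/*[@class="item"]'))
)
print(f"{len(lis)} items loaded")
driver.quit()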

2.3. Converting and Saving Data to Excel

To create an Excel file with the saved data, write the following code.

# Create an Excel workbook
wb = openpyxl.Workbook()

# Select the active sheet and name it
ws = wb.active
ws.title = "캠퍼스픽 대외활동_공모전 정보"

# Add the column headers
column_titles = ["키워드", "활동명", "주최", "마감일", "조회수", "썸네일이미지 URL"]
ws.append(column_titles)

# Write the data to the sheet one row at a time
for row_data in earned_content:
    ws.append(row_data)

# Save the Excel file
wb.save("campuspick_info_data.xlsx")

This code writes the data collected by the crawler into an Excel file named campuspick_info_data.xlsx.
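
If you want to verify the saved file, a quick sketch using openpyxl's load_workbook, reading back the same campuspick_info_data.xlsx we just wrote:

# Reload the saved workbook and confirm its contents
wb_check = openpyxl.load_workbook("campuspick_info_data.xlsx")
ws_check = wb_check.active
# max_row counts the header row plus one row per crawled post
print(f"{ws_check.max_row - 1} rows of data saved")
print([cell.value for cell in ws_check[1]])  # header row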

3. Conclusion

In this post, we fetched the contests and external activities currently recruiting on the CampusPick site and saved them to an Excel file.

We chose CampusPick as the example channel, but setting up other channels in the same way, such as news platforms or government-supported project pages, can be very helpful for professionals and entrepreneurs.

In the next post, we will explore how to automate the collection process daily using Crontab based on the crawler we created today.
