A Complete Summary of Legal Issues in Web Scraping - The Boundary Between Legality and Illegality

A summary of the issues that separate legal from illegal web scraping, exploring where the boundaries lie under Korean law, US law, and EU regulations.


"Can I be caught if I crawl?"

This is a recurring question in the developer community. Some say, "It's public data, so it can be freely collected," while others say, "You can even face criminal penalties if you do it recklessly." The reason for the confusion is that both statements are correct. Depending on the situation, the same act can be legal or illegal.

In 2024-2025, a series of large lawsuits surrounding AI training data collection have made the legal boundaries of crawling a hotter issue than ever before. This article summarizes the legal issues of crawling based on Korean law, US law, and EU regulations. While it does not replace legal advice, it will help establish practical criteria for determining "what is safe and what is risky."


Table of Contents

  1. Is Crawling Legal in Itself?
  2. Laws Applicable in Korea
  3. Key Crawling Related Precedents in Korea
  4. Key Precedents and Laws in the US
  5. EU — GDPR and Database Directive
  6. AI Training Data and Crawling — New Frontiers in 2025
  7. Legal Effectiveness of robots.txt
  8. Is Violating Terms of Service (ToS) Illegal?
  9. Practical Checklist — How to Crawl Safely
  10. Reasons Corporations Use Crawling Services
  11. Frequently Asked Questions (FAQ)

Is Crawling Legal in Itself?

Short Answer: The technology of crawling itself is legal. Whether a given crawl becomes illegal depends on what you collect, how you collect it, and why.

Accessing information displayed on a website using a web browser is not a problem. Crawling is just a program performing this process. However, legal issues arise in the following situations:

| Situation | Risk Level | Related Laws |
|---|---|---|
| Collecting publicly available product prices | Low | - |
| Collecting non-public data after logging in | High | Information and Communication Network Act, CFAA |
| Collecting personal information (name, contact information, etc.) | Very High | Personal Information Protection Act, GDPR |
| Replicating entire copyrighted works | High | Copyright Act |
| Mass collection causing server overload | Medium to High | Information and Communication Network Act, Obstruction of Business |
| Ignoring robots.txt | Medium | Varies by precedent |
| Large-scale collection for AI model training | Under Debate | Copyright Act, New AI-related Legislation |

The key principle is this: "Collecting public data in a reasonable manner" is generally legal, while "circumventing access restrictions or collecting personal information or copyrighted works without permission" is risky.


Laws Applicable in Korea

In Korea, there are four main laws related to crawling. Since each law protects different subjects, multiple laws can apply to a single crawling activity.

1. Information and Communication Network Act (ICNA)

Key Provision: Article 48 (Prohibition of Information and Communication Network Intrusion Acts)

No one shall intrude into an information and communication network without legitimate access authority or beyond the authorized access authority.

This is the provision most often at issue in crawling cases. The key question is the scope of "legitimate access authority."

  • Accessing Public Web Pages: Generally legal. Accessing a page open to everyone using a program is considered "legitimate access."
  • Circumventing Login/Authentication: High likelihood of illegality. Circumventing CAPTCHA or accessing with someone else's account information may be deemed as exceeding access authority.
  • Bypassing IP Blocking: Gray area. If a site has blocked specific IPs and you bypass it with a proxy, it could be interpreted as an act exceeding "authorized access authority."
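
One practical way to stay on the safe side of the access-authority line described above is to treat a server's refusal as a stop signal instead of something to work around. The following is a minimal sketch under stated assumptions: the names `STOP_STATUSES`, `should_stop`, and `next_action` are hypothetical, and the choice of status codes is an illustrative policy, not a legal standard.

```python
# Hypothetical policy: HTTP status codes that signal the site is refusing access.
# 401 = authentication required, 403 = forbidden, 429 = too many requests,
# 451 = unavailable for legal reasons.
STOP_STATUSES = {401, 403, 429, 451}


def should_stop(status_code: int) -> bool:
    """Return True when the response indicates access is denied or throttled."""
    return status_code in STOP_STATUSES


def next_action(status_code: int) -> str:
    """Map a status code to a conservative crawler action."""
    if status_code == 429:
        return "back_off"  # throttled: slow down instead of switching IPs
    if should_stop(status_code):
        return "halt"      # login wall or explicit block: do not bypass it
    if 200 <= status_code < 300:
        return "process"   # normal public page
    return "skip"          # anything else (e.g. 404): move on
```

The point of the design is that a block is never answered with a proxy rotation or a borrowed account; the crawler either slows down or stops.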

Additionally, Article 48(2) prohibits the transmission or distribution of malicious programs that can interfere with the stable operation of an information and communication network, and Article 48(3) prohibits acts that cause disturbances to an information and communication network. Crawling causing excessive server load may fall under these provisions.
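
Because excessive server load is where this risk arises, a crawler can enforce a minimum gap between requests to the same host. The sketch below is illustrative only: the class name `PoliteThrottle` is hypothetical, and any interval you choose (2 seconds in the usage note) is an assumption, not a threshold taken from the statute.

```python
import time


class PoliteThrottle:
    """Enforce a minimum interval between consecutive requests to one host."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time at which the previous request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

In use, create one throttle per host, for example `throttle = PoliteThrottle(2.0)`, and call `throttle.wait()` immediately before each request.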

Penalty: Imprisonment of up to 5 years or a fine of up to 50 million won.

2. Personal Information Protection Act

With the 2020 Data 3 Acts amendment and the 2023 comprehensive amendment, regulations on personal data processing have been significantly strengthened.

Cases Where Crawling is Problematic:

  • Collecting Personal Information like names, phone numbers, emails: Illegal without the consent of the data subject. Even if the information is publicly available, collecting and using it for a purpose other than the one for which it was made public can be problematic.
  • Exceptions for Publicly Available Personal Information: The 2023 amendment specified standards for processing 'publicly available personal information.' Even if the information was disclosed directly by the data subject, processing is allowed only if the collection purpose is substantially related to the purpose of disclosure and does not unduly infringe on the data subject's interests.
  • Pseudonymization and Exceptions: If it is for statistical analysis, scientific research, etc., using data after pseudonymization without consent is allowed, but this is only possible under strict conditions.
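
To reduce the chance that personal information ends up in stored crawl results, scraped text can be passed through a redaction step before it is saved. The following is a rough sketch, not a compliance tool: the function name `redact_pii` and both patterns are assumptions, the phone pattern covers only one common Korean mobile format, and regular expressions cannot detect names at all.

```python
import re

# Simple email pattern; it will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed format: Korean mobile numbers such as 010-1234-5678 or 01012345678.
PHONE_RE = re.compile(r"\b01[016789][-. ]?\d{3,4}[-. ]?\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and mobile numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Redaction of this kind lowers exposure but does not by itself establish a lawful basis for processing; it is a safeguard applied on top of the rules above.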

Penalty: Depending on the type of violation, imprisonment of up to 5 years, a fine of up to 50 million won, or a penalty surcharge of up to 3% of total revenue.

3. Copyright Act

If web content qualifies as a work of authorship, reproducing or transmitting it is subject to copyright.

Key Points in Crawling:

  • Factual Information vs. Creative Works: Factual information like product prices, addresses, business hours is not protected by copyright. However, content with creative expressions like news articles, blog posts, product reviews is considered copyrighted.
  • Database Protection: Korean copyright law separately protects the rights of database creators (Article 93). Even if individual data is not a work of authorship, a systematically collected and organized database itself is protected. Replicating or distributing a substantial part or the entirety of a database is illegal.
  • Temporary Reproduction: Storing data temporarily in memory during crawling may technically qualify as reproduction, but Article 35-2 of the Copyright Act permits temporary reproduction to the extent needed for smooth and efficient information processing.
  • Text and Data Mining (TDM) Exception: Some countries recognize a copyright exception for non-commercial text and data mining for research purposes. While discussions are ongoing in Korea, there is no explicit exception provision yet.

4. Act on the Prevention of Unfair Competition and Protection of Trade Secrets

The general clause in Article 2(1)(k) can apply to crawling. This provision, established in 2013 and revised several times, now reads as follows:

Acts that infringe on another's economic interests by using their achievements made through significant investment or effort for one's own business in a manner contrary to fair trade practices or competition order.

In simple terms, if a company crawls a database that a competitor built with significant investment and uses it in its own service, the act may fall under this provision. The general clause is intended to capture acts of "data free-riding" that are difficult to address under other laws.


Key Crawling Related Precedents in Korea

It is difficult to determine where the line is drawn based on legal texts alone. Actual precedents must be examined.

Job Korea vs. Saramin (2017)

Case Overview: Saramin, a job platform, crawled job postings data from its competitor Job Korea and displayed it on its own service.

Court Ruling: The court deemed Saramin's act of crawling Job Korea's job posting database, which was built with significant investment and effort, and using it on their competitive service as an act of unfair competition.

Implications: Crawling a competitor's core data to use in the same business can be sanctioned under the Unfair Competition Prevention Act. The fact that data is "public" is not a free pass to use it however you like.

Controversies over Crawling Restaurant Reviews

In Korea, there have been repeated issues with crawling large amounts of restaurant reviews from portal sites, blog content, etc. In such cases, the court examines whether individual reviews qualify as copyrighted works and whether a substantial part of the review database was replicated.

Implications: Even user-generated content (UGC) can be considered a work of authorship if it exhibits creativity, and replicating it in bulk can violate copyright law and database protection regulations.

Circumvention of Technical Protection Measures and the Information and Communication Network Act

The Korean Supreme Court has consistently maintained the position that accessing data by bypassing technical protection measures constitutes "intrusion" under the Information and Communication Network Act. In particular, circumventing an explicit block (for example, bypassing IP blocking with a proxy or evading bot detection systems) is likely to be deemed illegal.


Key Precedents and Laws in the US

Legal discussions on crawling in the US have a global impact on practical work.

CFAA (Computer Fraud and Abuse Act)

The key issue in the US Computer Fraud and Abuse Act is what it means to "access a computer without authorization or exceed authorized access."

Van Buren v. United States (2021, Supreme Court)

In this case, a police officer used a database that he was authorized to access in the course of his duties for personal purposes. The Supreme Court ruled that "exceeding authorized access" means obtaining information from parts of a system that are off-limits to the user, not using information one is authorized to access for an improper purpose.

Impact on Crawling: Accessing information on public websites is not a violation of the CFAA, providing a significant basis for the legality of crawling public data.

hiQ Labs v. LinkedIn (2022, 9th Circuit Court of Appeals)

This is the most important US precedent regarding the legality of crawling.

Case Overview: Data analytics company hiQ Labs crawled public profile data from LinkedIn to provide employee attrition prediction services. When LinkedIn sent a cease-and-desist letter and technically blocked crawling, hiQ filed a lawsuit.

Key Ruling:
- Collecting publicly accessible data is not a violation of the CFAA.
- "Unauthorized access" applies only to systems with authentication barriers like passwords. It does not apply to publicly accessible web pages.
- A cease-and-desist letter from LinkedIn does not revoke authorization to access publicly accessible pages.

Implications: The later developments matter just as much. In November 2022, a federal district court ruled that hiQ had breached LinkedIn's User Agreement, and the parties then reached a settlement. So although there was no CFAA violation, contractual liability for breaching the User Agreement was recognized. The case shows clearly that crawling public data can be safe under the CFAA while civil risk remains a separate matter.

Meta Platforms v. Bright Data (2024)

Case Overview: Meta sued data collection company Bright Data for unauthorized collection of data from Facebook and Instagram.

Key Ruling: In January 2024, the court granted summary judgment for Bright Data on Meta's breach-of-contract claim, finding that Meta's terms did not prohibit collecting publicly accessible data without logging in. The dispute turned on contract law rather than the CFAA, and the ruling did not extend to data behind a login.

This ruling continued the trend from the hiQ case, clarifying the boundaries between crawling public data and data requiring login.

Key Summaries of US Precedents

| Principle | Explanation |
|---|---|
| Public Data Principle | Accessing publicly available data is not a violation of the CFAA |
| Technical Barrier Standard | Circumventing passwords, authentication, etc., may lead to illegality |
| ToS is Separate | Violating the User Agreement is a contractual issue separate from the CFAA |
| Purpose Irrelevant | Accessing information is not judged based on the purpose if access is authorized |

EU — GDPR and Database Directive

GDPR (General Data Protection Regulation)

The EU's General Data Protection Regulation is widely regarded as one of the strictest data protection laws in the world. When it comes to crawling, the following principles are crucial:

  • Need for Legal Grounds for Processing: To process personal data, one of the six legal grounds is required. "Legitimate interest" is the ground most often invoked for crawling, but it cannot be relied on where the data subject's rights and interests override it.
  • Purpose Limitation: Collected data should be used only for the purpose for which it was collected.
  • Data Minimization: Only the minimum necessary data should be collected.
  • Extraterritorial Application: GDPR can apply even if the company is outside the EU, for example when it offers goods or services to people in the EU or monitors their behavior. If a Korean company crawls data of EU users, GDPR can apply.
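
Data minimization can be enforced in code by keeping only an explicit allowlist of fields and discarding everything else before storage. This is a minimal sketch assuming a price-monitoring use case; the field names in `ALLOWED_FIELDS` and the function name `minimize` are hypothetical.

```python
# Hypothetical allowlist for a price-monitoring crawl: only these fields are stored.
ALLOWED_FIELDS = frozenset({"product_name", "price", "currency"})


def minimize(record: dict, allowed=ALLOWED_FIELDS) -> dict:
    """Return a copy of the record containing only allowlisted fields."""
    return {key: value for key, value in record.items() if key in allowed}
```

An allowlist is preferred here over a blocklist because a new, unexpected field (such as a seller's email address) is dropped by default instead of being stored by accident.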

Penalty: Up to 4% of the worldwide annual turnover or €20 million, whichever is higher.
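
The "whichever is higher" rule in the penalty cap above is simple arithmetic, sketched below for illustration. The function name `gdpr_max_fine_eur` is hypothetical, and the result is only the statutory ceiling stated above, not a prediction of any actual fine.

```python
def gdpr_max_fine_eur(worldwide_annual_turnover_eur: float) -> float:
    """Upper bound of the fine: the higher of 4% of turnover and EUR 20 million."""
    four_percent = worldwide_annual_turnover_eur * 4 / 100
    return max(four_percent, 20_000_000)
```

For a company with EUR 100 million in turnover, 4% is EUR 4 million, so the EUR 20 million figure is the cap; at EUR 1 billion in turnover, the 4% figure (EUR 40 million) takes over.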

EU Database Directive (96/9/EC)

The EU grants database makers a "sui generis" right. This right is separate from copyright and prohibits extracting or re-utilizing the whole or a substantial part of the contents of a database made with substantial investment, as well as repeated and systematic extraction of insubstantial parts.

Ryanair v. PR Aviation (2015, CJEU): In this case about the crawling of Ryanair's flight data, the CJEU ruled that where a database is protected by neither copyright nor the sui generis right, the Database Directive does not stop the site owner from restricting its use by contract. This means crawling can be restricted through the terms of use even if the database itself is not protected.


AI Training Data and Crawling — New Frontiers in 2025

In 2024-2025, the most significant change in the legal discussion of crawling is the wave of disputes over AI training data collection. Earlier discussions focused mainly on "using data from competitors," but the core issue now is web-wide crawling to train large language models (LLMs).

Major Lawsuits

  • The New York Times v. OpenAI & Microsoft (filed December 2023): The New York Times sued OpenAI and Microsoft for copyright infringement, alleging that its articles were used without permission to train GPT models. The key issue is whether AI training qualifies as 'fair use.'
  • Class Actions by Copyright Holders: Authors, photographers, programmers, and others have filed class actions against OpenAI, Meta, Stability AI, and other companies.
  • Reddit, X (Twitter) Data Monetization: In response to large-scale crawling by AI companies, Reddit has monetized its API, and X has tightened data access restrictions.

Responses in Various Countries

  • EU AI Act (2024): The EU AI Act obligates providers of general-purpose AI models to put in place a policy for complying with EU copyright law and to publish a summary of the content used for training. Additionally, the EU Copyright Directive (DSM Directive) provides a text and data mining (TDM) exception, but rightsholders can explicitly opt out of the general exception, in which case TDM of their works is not covered.
  • Japan: The 2018 amended Copyright Act relatively broadly allows the use of works for information analysis purposes, including AI training. However, it excludes cases that "unfairly harm the interests of the copyright holder."
  • Korea: There is no explicit legal provision for AI training data yet, and it is judged based on general principles of copyright law. Discussions on relevant legislation are ongoing.

Impact on Practical Crawling

The legal disputes over AI training data have indirect effects on general business crawling:

  1. Blocking AI Crawlers in robots.txt: Many websites have started blocking AI crawlers like GPTBot, CCBot, Google-Extended in robots.txt. This has sparked active discussions on the legal meaning of robots.txt.
  2. Enhanced Data Access Restrictions: Overall, websites are strengthening their crawling defenses in response to AI crawling issues.
  3. Importance of Legitimate Purpose: The purpose of collected data is becoming increasingly crucial in legal judgments.
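
To illustrate how per-crawler blocking in robots.txt works, the sketch below parses a sample file with Python's standard `urllib.robotparser` and checks a few user agents against it. The robots.txt content, the `example.com` URLs, and the crawler name `ExamplePriceBot` are made up for illustration; note also that the standard-library parser matches user-agent names by substring, which may differ from how a given site or crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: AI crawlers are blocked site-wide,
# every other crawler is only kept out of /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# GPTBot is disallowed everywhere; the generic bot only under /private/.
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/articles/1")
generic_allowed = parser.can_fetch("ExamplePriceBot", "https://example.com/articles/1")
generic_private = parser.can_fetch("ExamplePriceBot", "https://example.com/private/x")
```

With this sample file, `gptbot_allowed` and `generic_private` are `False` while `generic_allowed` is `True`.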

Legal Effectiveness of robots.txt

robots.txt is a standard protocol through which website owners tell crawlers which parts of a site they should not collect. However, it is not a legal document.

robots.txt is Not Law

  • robots.txt is advisory. Disregarding it does not immediately constitute a crime.
  • However, the act of ignoring robots.txt can be used as unfavorable evidence in legal disputes.
  • Some courts have treated robots.txt as evidence of the scope of access that the site owner has authorized.

Practical Significance

While robots.txt is not legally binding, compliance is wise.

  • When Compliant: It serves as a favorable basis when claiming the legality of crawling. It demonstrates that you respected the site owner's intention.
  • When Ignored: It serves as evidence of "accessing against the site owner's intention." In particular, crawling paths explicitly blocked in robots.txt can be interpreted as exceeding "authorized access authority."
  • Industry Standard: Major search engines like Google and Bing respect robots.txt. Disregarding it can be seen as a violation of industry best practices.
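
If compliance is meant to serve as a favorable basis later, it helps to be able to show that each URL was checked before it was fetched. The sketch below is one possible approach under stated assumptions: the names `RobotsDecision` and `check_and_record` are hypothetical, and whether such a log carries weight in a dispute is a legal question this code does not answer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.robotparser import RobotFileParser


@dataclass(frozen=True)
class RobotsDecision:
    """One recorded robots.txt check."""
    url: str
    user_agent: str
    allowed: bool
    checked_at: str  # ISO 8601 timestamp in UTC


def check_and_record(parser: RobotFileParser, user_agent: str,
                     url: str, log: list) -> bool:
    """Check robots.txt for the URL, append the decision to log, return it."""
    allowed = parser.can_fetch(user_agent, url)
    log.append(RobotsDecision(
        url=url,
        user_agent=user_agent,
        allowed=allowed,
        checked_at=datetime.now(timezone.utc).isoformat(),
    ))
    return allowed
```

A crawler would call `check_and_record` before every request and fetch only when it returns `True`, persisting the log alongside the collected data.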

Is Violating Terms of Service (ToS) Illegal?

While violating the User Agreement is not in itself a criminal offense, it can be the basis for a civil lawsuit.

In Korea

  • The User Agreement is a civil contract. Breach of the agreement can lead to claims for damages, but on its own it is not a basis for criminal penalties.
  • However, when combined with violations of the Information and Communication Network Act or the Copyright Act, the likelihood of criminal penalties increases.
  • Particularly, under the Act on Regulation of Terms of Contracts, overly unilateral terms in the User Agreement may be deemed void.

In the US

  • After the Van Buren ruling, violating the User Agreement alone does not constitute a violation of the CFAA.
  • However, violating the User Agreement can be the basis for a breach of contract lawsuit.
  • In Meta v. Bright Data, the court held that Meta's terms did not reach collection of public data performed while logged out, so whether a contract claim succeeds depends on the wording of the agreement and on whether the crawler was bound by it.

Key Summary

Violating the User Agreement ≠ Criminal Offense
However, Violating the User Agreement = Civil Lawsuit Risk

In short, a ToS violation alone is unlikely to bring criminal penalties, but it still exposes you to civil litigation, so review the terms before you crawl.


Practical Checklist — How to Crawl Safely

Before you start crawling, go through the following checklist.

Seven Principles for Safe Crawling

1. Collect Only Public Data
- Target only pages accessible without logging in.
- Do not circumvent CAPTCHA or authentication barriers.

2. Do Not Collect Personal Information
- Avoid data containing personal information like names, contact details, emails.
- If personal information is unavoidably included, immediately
