Introducing a method to efficiently label text data using ChatGPT. It includes how to use the API, costs, and pros and cons.


0. Building a Dataset to Save Time and Cost: Labeling Data with ChatGPT

To train an AI model, you need a dataset of paired 'questions' and 'answers' (the basis of supervised learning).

Attaching the correct answer to each piece of data is called 'labeling.'

1. The Importance of Data, What is High-Quality Data?

Those who develop AI models often feel that "the data does all the work."

Of course, various elements such as the latest algorithms and high-performance computing are important for a good AI model.

However, if you are a junior in AI or a data-related field, I think it is wise to first pay attention to the value of data, particularly high-quality data.

What is 'high-quality' data? Why is it important?

Most of the latest AI technologies are based on artificial neural network deep learning models.

These artificial neural networks mimic human learning methods.

Consider what happens before a newborn baby first utters the word 'mom':

  • The mother is referred to as 'mom' -> high-quality data (accurate labeling)

  • The baby witnesses this countless times -> a lot of data

  • If people referred to the mother as 'dad' -> low-quality data (inaccurate labeling), and the baby would be very confused

As time passes, the baby learns to call the person who gave birth to them, cares for them, and carries them 'mom.' -> Pattern learning

Learning meaningful patterns from data, this is the essence of AI model training.

What makes this easier is high-quality data.

High-quality data is a good teacher for AI models.

2. Reasons for Entrusting Labeling to ChatGPT

As mentioned earlier, the quantity and quality of data are important.

However, providing good labels to a large amount of data is a repetitive task that requires continuous attention, making it monotonous and tedious.

Therefore, many government agencies and companies often outsource data labeling tasks to external companies or short-term part-timers.

In this process, as Hashscraper has experienced, you inevitably face various problems.

**First, it cannot guarantee high-quality data.**

AI model developers have a direction for the AI model they envision,

and there are standards for data labeling that align with that direction.

Even if these standards are communicated in detail to the labelers, there may still be variations among labelers,

and even if one labeler works alone, their concentration may waver during repetitive tasks.

It is difficult for humans to maintain perfect consistency (human error is inevitable).

**Second, it incurs significant costs.**

The more data there is to process, and the faster the dataset needs to be built, the more skilled labor is required.

Labeling data with ChatGPT is not perfect either; crafting prompts can be difficult, especially for challenging problems.

However, in terms of cost savings, time reduction, and consistency, it is widely recognized as effective.

Moreover, for professionals in related fields, becoming familiar with large language models and learning how to interact with them is essential.

I dare to predict that prompt engineering will become a basic skill for work efficiency in the future.

3. Labeling Data with ChatGPT

Let's assume we are creating a sentiment analysis dataset.

3.1 Obtaining OpenAI's API Key

In this post, we will not label data online but utilize ChatGPT's API.

First, log in to OpenAI at the link below and register a payment method for API call charges.

https://platform.openai.com/

Next, you need to download the API key.

After generating the API key, be sure to store it separately.

Since the key is shown only once, if you lose it you will have to create a new one, so be careful.
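Rather than pasting the key directly into your code, a common pattern is to keep it in an environment variable and read it at startup. A minimal sketch (the variable name `OPENAI_API_KEY` is the common convention, and `load_api_key` is a hypothetical helper):

```python
import os

def load_api_key() -> str:
    """Read the OpenAI API key from an environment variable.

    Fails early with a clear message instead of erroring later
    inside an API call.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return key

# Then, before any API call:
# openai.api_key = load_api_key()
```

This also keeps the key out of version control if you ever share the script.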

3.2 Writing Prompts

text = '맛있는 거 먹어서 기분이 너무 좋다.' # text to run sentiment analysis on

prompt = f'''You're an assistant, labeling data on a consistent basis.
Label the given text with one of the following sentiments: Positive, Negative, or Neutral.
If you can't determine a sentiment, label the text as Neutral.
Do not enclose your output in double or single quotes, just the label.
Follow the example to analyze the sentiment.

given text: 나 너무 우울해
sentiment: Negative

given text: 오늘 아침 일찍 출근했다.
sentiment: Neutral

given text: {text}
sentiment: '''

Here are some tips for writing prompts:

  • Write in English

As of August 2023, language models have token limits.

You cannot ask unlimitedly long questions, and the longer the tokens, the more costly it becomes.

ChatGPT is most efficient with tokens in English.

In simple terms, asking the same question in English incurs lower costs.

Also, when comparing results, ChatGPT has been proven to perform best in English.

  • Assign a Role to ChatGPT

Assign a role before writing instructions for ChatGPT to follow.

  • Provide Examples

Provide guidance on the output format and method.

This is also known as Few-shot Learning.

  • Be Detailed yet Clear

This is for token efficiency and to induce accurate responses.

Provide detailed explanations but write as simply and clearly as possible.

Refer to the above points to write your prompt.
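Since the same template is reused for every row, it can help to wrap the prompt above in a small function so the role, instructions, and few-shot examples stay in one place. A sketch (`build_prompt` is a hypothetical helper name):

```python
def build_prompt(text: str) -> str:
    """Assemble the few-shot sentiment-labeling prompt for one input text."""
    return f'''You're an assistant, labeling data on a consistent basis.
Label the given text with one of the following sentiments: Positive, Negative, or Neutral.
If you can't determine a sentiment, label the text as Neutral.
Do not enclose your output in double or single quotes, just the label.
Follow the example to analyze the sentiment.

given text: 나 너무 우울해
sentiment: Negative

given text: 오늘 아침 일찍 출근했다.
sentiment: Neutral

given text: {text}
sentiment: '''
```

Changing the label set or adding examples then only touches this one function.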

3.3 Requesting API Responses (Setting Parameters)

import openai

openai.api_key = 'YOUR_API_KEY' # the key issued in 3.1

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    temperature=0,
    messages=[
        {"role": "user", "content": prompt}
    ]
)

sentiment = response.choices[0].message['content']

print(f'Text: {text}\nSentiment: {sentiment}')
Text: 맛있는 거 먹어서 기분이 너무 좋다.
Sentiment: Positive

If you run the code as it is, you should see the same results.

It is good to refer to the following explanations for some parameters.

  • temperature

The most important parameter.

This parameter controls the diversity of the model's responses.

Higher temperature results in more diverse and creative responses, while lower temperature results in consistent responses.

Since we are 'labeling data,' let's set this to 0 (the lowest value).

When the language model generates a response, it calculates probabilities for various responses.

With 'temperature=0,' the model always selects the most probable response among the candidates it generates. You can see this by repeating a more complex question: at 'temperature=0,' it will keep returning the same response.

Also, the reason we did not label data online is as follows.

1) Coding is required for repetitive labeling of numerous data.

2) The temperature of the ChatGPT web interface is fixed and cannot be adjusted.

  • messages' role: user, assistant, system

In the code, only the message from 'user' is written, but there are actually 'user,' 'assistant,' and 'system' messages.

For 'user' messages, write the questions or instructions the user wants to ask ChatGPT.

The 'assistant' message is for ChatGPT to refer to previous conversation content.

When conversing with ChatGPT online, you can see that ChatGPT remembers previous conversation content while conversing. This is possible because when the user asks a question, the previous conversation content is also transmitted to ChatGPT. Therefore, this is not necessary for labeling tasks.

The 'system' message specifies ChatGPT's behavior.

Although we assigned ChatGPT's role in the 'user' message, you can also write this content in the 'system' message. However, there is no need to write a separate system message for such a simple labeling task.
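For illustration, this is what the `messages` list would look like with the role assignment moved into a 'system' message (the message texts are just the ones from our example):

```python
# Equivalent structure with the role moved to a 'system' message.
system_msg = "You're an assistant, labeling data on a consistent basis."
user_msg = "given text: 맛있는 거 먹어서 기분이 너무 좋다.\nsentiment: "

messages = [
    {"role": "system", "content": system_msg},  # fixes ChatGPT's behavior
    {"role": "user", "content": user_msg},      # the actual labeling request
]
```

Either form works for this task; the 'system' message mainly pays off when the behavior spec is long or shared across many different user messages.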

3.4 Labeling a Large Amount of Data?

The process up to this point has shown labeling for one data entry. If you have a certain level of development skills, labeling a large amount of data repeatedly should not be too difficult.

However, what you need to be careful about is handling errors that occur during API calls.

Errors often occur due to too many calls in a short period or poor network connection, so handling exceptions will ensure stable labeling.

def sentiment_analyze(text):
    prompt = f'''You're an assistant, labeling data on a consistent basis.
    Label the given text with one of the following sentiments: Positive, Negative, or Neutral.
    If you can't determine a sentiment, label the text as Neutral.
    Do not enclose your output in double or single quotes, just the label.
    Follow the example to analyze the sentiment.

    given text: 나 너무 우울해
    sentiment: Negative

    given text: 오늘 아침 일찍 출근했다.
    sentiment: Neutral

    given text: {text}
    sentiment: '''

    try:
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            temperature=0,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )

        sentiment = response.choices[0].message['content']

        return sentiment
    except Exception:  # e.g. rate limits or network errors
        return 'error'

df['label'] = df['text'].apply(sentiment_analyze)
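Returning 'error' lets the loop keep going, but transient failures such as rate limits or network hiccups often succeed on a second attempt. A generic retry-with-backoff helper is one option; this is a sketch of my own, not part of the OpenAI SDK:

```python
import time

def with_retries(func, max_attempts=3, base_delay=1.0):
    """Call func(); on failure, wait and retry with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...

# Hypothetical usage with the function above:
# df['label'] = df['text'].apply(lambda t: with_retries(lambda: sentiment_analyze(t)))
```

Rows that still fail after all attempts raise, so you can re-run just those instead of silently collecting 'error' labels.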

4. How Much Does It Cost?

As of August 2023, the commonly used models are 'gpt-3.5-turbo' and 'gpt-4.'

Creating a simple classification dataset like sentiment analysis, as in the example, is sufficient with the 3.5 model.

Based on 'gpt-3.5-turbo,' for user questions, it costs $0.0015 per 1,000 tokens,

and for GPT responses, it costs $0.002 per 1,000 tokens.

The prompt used in the example is only about 200 tokens.

With a simple calculation, $3 can label approximately 10,000 sentiment analysis datasets.
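That estimate can be reproduced with simple arithmetic, assuming roughly 200 prompt tokens and a one-token label per call at the August 2023 'gpt-3.5-turbo' prices quoted above (`labeling_cost` is a hypothetical helper):

```python
INPUT_PRICE = 0.0015 / 1000   # $ per input token (gpt-3.5-turbo, Aug 2023)
OUTPUT_PRICE = 0.002 / 1000   # $ per output token

def labeling_cost(n_texts, prompt_tokens=200, output_tokens=1):
    """Rough total cost in dollars for labeling n_texts entries."""
    per_call = prompt_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
    return n_texts * per_call

print(f'${labeling_cost(10_000):.2f}')  # roughly $3 for 10,000 texts
```

Plugging in your own prompt length before a large run keeps the bill predictable.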

However, when dealing with complex problems, the command prompt becomes more specific and complex, and using the 3.5 model may have limitations.

In such cases, many problems can be solved by using the GPT-4 model, but considering that its cost is 20 to 30 times higher, careful decision-making is necessary.

5. Conclusion

For simple repetitive labeling tasks, there is no need to hesitate about using ChatGPT, considering its cost and performance. However, if your task is demanding enough to require the GPT-4 model, thorough consideration of costs is necessary.

There are drawbacks such as being limited to text datasets, depending on OpenAI's pricing policies, and the difficulty of creating AI models that outperform GPT models with labeled data.

However, the open-source community is rapidly advancing, and these drawbacks are expected to be resolved soon.

When building text datasets, I believe this method should be prioritized over human labelers.

ChatGPT is an assistant that can help with my work tirelessly and at a low cost.

I hope you make good use of it to achieve your goals.
