A company that dramatically reduced its process defect rate with an AI model? (feat. SHAP library)

A Hashscraper customer case in which AI models dramatically reduced the process defect rate, with a detailed walkthrough of data collection and preprocessing.


0. Overview

In one of Hashscraper's customer cases, an AI model was used to reduce the defect rate in a manufacturing process. I have written this article to explain the machine learning model's predictions in a deeper, more understandable way.

1. Problem Definition

1.1. Goal Setting

To briefly explain the customer case: the defect rate of each process varied depending on 128 variables. The goal was to predict defects with a machine learning model, analyze which variables caused them, and then adjust those variables to reduce the defect rate.

1.2. Hypothesis Establishment

We hypothesized that extracting the key variables with a machine learning model and adjusting them in the process would reduce the defect rate.

2. Data Collection

2.1. Determination of Data Sources

Our customer provided data from each process directly from the factory.

Since this is internal company data, it is difficult to disclose directly, so I will only show a screenshot of the data folders.

2.2. Data Collection

We requested a minimum of 10,000 data points, and asked for as much data as possible.

The raw data we received is as follows:
- Machine 1: 3,931 data points
- Machine 2: 16,473 data points
- Machine 3: 2,072 data points
- Machine 4: 16,129 data points
- Machine 5: 57,970 data points
- Machine 6: 78,781 data points

In total, we utilized approximately 175,000 data points to train the model.

3. Data Preprocessing

3.1. Data Cleansing

Data cleansing is crucial for training the model. I would say that more than 80% of machine learning work comes down to the data.

If poorly cleansed data is used for training, the model will not learn effectively. (Simply put, if you put garbage in, you get garbage out.)

3.2. Workflow

3.2.1. File Encoding Handling

When loading the files, we had to read CSV files saved in different encodings, such as 'cp949' and 'utf-8':

import pandas as pd

for file_path in file_paths:
    try:
        # Most of the factory files are saved in the Korean 'cp949' encoding
        df = pd.read_csv(file_path, encoding='cp949', header=None)
    except UnicodeDecodeError:
        # Fall back to 'utf-8' when 'cp949' decoding fails
        df = pd.read_csv(file_path, encoding='utf-8', header=None)

3.2.2. Labeling Task

To create the labels, we combined the date and time columns and linked each time window with the y-axis data recorded during that window:

for i in range(len(result_df_new) - 1):
    # Each label covers the time window between two consecutive rows
    start_time, end_time = result_df_new['Datetime'].iloc[i], result_df_new['Datetime'].iloc[i + 1]
    # Select the y-axis rows that fall inside this window
    selected_rows = df_yaxis[(df_yaxis['Datetime'] >= start_time) & (df_yaxis['Datetime'] < end_time)]
    # Label 1 only if every result ('결과') in the window contains 'OK', otherwise 0
    results.append(1 if all(selected_rows['결과'].str.contains('OK')) else 0)
# The last row has no following timestamp, so it is labeled 0
results.append(0)

3.2.3. Data Integration

Merging preprocessed data for each machine:

# Concatenate the preprocessed DataFrames from the individual machines
data = pd.concat([df1, df2, df3, df4, df6])
data.reset_index(drop=True, inplace=True)

3.2.4. Removing Duplicate Data

Duplicate data can introduce bias to the dataset and cause the model to have difficulty learning the diversity of data. Also, it can lead to overfitting issues, so all duplicate data was removed:

# Remove exact duplicate rows and reindex
data = data.drop_duplicates().reset_index(drop=True)

3.2.5. Handling Missing Values

We visualized the missing values using the missingno library, and columns with a large number of missing values were removed entirely. The reasoning is similar to the duplicate-data case above: such columns can bias the dataset and make it harder for the model to learn.
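As a minimal sketch of this step, assuming the merged DataFrame data from above and a hypothetical 50% missing-value threshold (the actual threshold used is not disclosed):

import missingno as msno
import matplotlib.pyplot as plt

# Visualize the missing-value pattern of every column
msno.matrix(data)
plt.show()

# Drop columns whose missing ratio exceeds the (assumed) 50% threshold
missing_ratio = data.isnull().mean()
data = data.drop(columns=missing_ratio[missing_ratio > 0.5].index)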

3.3. Feature Engineering

Feature engineering involves creating new features or transforming existing ones to improve the model's performance. Since we considered every feature potentially important and did not have a precise understanding of what each one measured, we did not perform separate feature engineering.

3.4. Exploratory Data Analysis (EDA)

Checking Data Distribution

We used graphs such as histograms and box plots to check the distribution of the data.

Correlation Analysis

We analyzed the correlation between features to identify important ones and to address multicollinearity issues. A sketch of both EDA steps is shown below.
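A rough sketch of this kind of EDA with pandas and seaborn (the column name 'feature_1' is a placeholder, not one of the actual process variables):

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of every numeric feature to inspect its distribution
data.hist(bins=50, figsize=(20, 15))
plt.show()

# Box plot of a single (placeholder) feature to spot outliers
data.boxplot(column='feature_1')
plt.show()

# Correlation matrix between numeric features, visualized as a heatmap
corr = data.select_dtypes('number').corr()
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()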

4. Sampling Types

4.1. Handling Data Imbalance

There was a significant data imbalance, so we tried combining various undersampling techniques with multiple models:

  • Random Under-sampling (RUS): Randomly removing data from the majority class
  • NearMiss: Keeping only the majority-class points that are closest to the minority class
  • Tomek Links: Finding pairs of nearest neighbors from opposite classes and removing the majority-class point in each pair
  • Edited Nearest Neighbors (ENN): Using a k-NN rule to remove majority-class points whose neighbors disagree with them
  • Neighbourhood Cleaning Rule (NCL): An extended version of ENN

Ultimately, we determined that ENN undersampling was the most suitable for the model. The snippet below shows one of the sampling runs, using NearMiss:

from imblearn.under_sampling import NearMiss

# Create a NearMiss instance
nm = NearMiss()

# Perform the undersampling
X_resampled, y_resampled = nm.fit_resample(data.drop('결과', axis=1), data['결과'])

# Convert the undersampling result back into a DataFrame
data_sample = pd.concat([X_resampled, y_resampled], axis=1)
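For reference, since ENN was the technique we ultimately settled on, an equivalent sketch with imbalanced-learn's EditedNearestNeighbours (same data and '결과' target as above; n_neighbors=3 is simply the library default, not a tuned value) would look like this:

from imblearn.under_sampling import EditedNearestNeighbours

# Create an ENN instance (n_neighbors=3 is the library default)
enn = EditedNearestNeighbours(n_neighbors=3)

# Perform the undersampling with ENN
X_resampled, y_resampled = enn.fit_resample(data.drop('결과', axis=1), data['결과'])

# Convert the result back into a DataFrame
data_sample = pd.concat([X_resampled, y_resampled], axis=1)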

5. Modeling

5.1. Model Selection

Depending on the problem type (classification, regression, clustering, etc.), an appropriate machine learning model is selected. While we did try running several models ourselves, we ultimately went with the best model selected by the PyCaret library.

PyCaret is an open-source data analysis and machine learning automation library for Python. PyCaret allows users to quickly build and experiment with the entire data analysis and machine learning pipeline with minimal code.
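A minimal sketch of what model comparison with PyCaret could look like for this binary classification problem (the target column name follows the '결과' label used above; session_id is an arbitrary seed):

from pycaret.classification import setup, compare_models

# Initialize the PyCaret experiment with the resampled data and target column
clf = setup(data=data_sample, target='결과', session_id=42)

# Train and compare the candidate models, ranked by AUC
best_model = compare_models(sort='AUC')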

5.2. Model Training

We trained the model using the training data. Ultimately, the CatBoost model yielded the highest AUC and F1-score values.
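For illustration, a bare-bones sketch of training a CatBoost classifier on such data (the train/test split and hyperparameters here are assumptions, not the values actually used):

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Hold out a test set, stratified to preserve the class ratio
X = data_sample.drop('결과', axis=1)
y = data_sample['결과']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Train the CatBoost model (iterations/depth are illustrative values)
model = CatBoostClassifier(iterations=500, depth=6, verbose=0)
model.fit(X_train, y_train)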

Explanation of Evaluation Metrics:

AUC (Area Under the Curve):
- AUC refers to the area under the ROC (Receiver Operating Characteristic) curve.
- The AUC value ranges between 0 and 1, with values closer to 1 indicating better performance of the classifier.
- AUC is particularly useful in imbalanced class distributions.

F1-Score:
- F1-Score is the harmonic mean of precision and recall.
- Precision is the ratio of true positives to all predicted positives, while recall is the ratio of true positives to all actual positives.
- The F1-Score value ranges between 0 and 1, with higher values indicating better model performance.
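Both metrics can be computed with scikit-learn, continuing from the hypothetical split and model sketched above:

from sklearn.metrics import roc_auc_score, f1_score

# AUC is computed from the predicted probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test, y_proba))

# F1-score is computed from the hard class predictions
y_pred = model.predict(X_test)
print('F1 :', f1_score(y_test, y_pred))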

6. Conclusion: Variable Extraction and a More Intuitive Interface

Ultimately, we fed the raw data generated in real time by the factory machines into this model and, using the SHAP library, extracted the variables related to defects.

Additionally, for the workers on the factory floor, we used PyInstaller to package this whole pipeline into an executable that writes its results to an easy-to-read Excel file, so that they can see the key variables and determine from the raw data whether a product is defective.
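As a rough illustration of that last step (the script and file names here are hypothetical):

import pandas as pd

# Inside the hypothetical prediction script (e.g. predict_defects.py):
# write the predictions next to the input features so workers can open them in Excel
report = X_test.copy()
report['defect_prediction'] = model.predict(X_test)
report.to_excel('defect_report.xlsx', index=False)

# The script can then be packaged into a single executable with PyInstaller:
#   pyinstaller --onefile predict_defects.py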

What is SHAP?

SHAP stands for SHapley Additive exPlanations, a tool used to explain how much each feature of a machine learning model influences predictions. This increases the model's 'transparency' and enhances confidence in how predictions are made.
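A minimal sketch of using SHAP with a tree-based model such as the CatBoost classifier assumed above:

import shap

# TreeExplainer supports tree-based models like CatBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: shows which features push predictions toward defect or non-defect
shap.summary_plot(shap_values, X_test)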

Conclusion

Hashscraper carries out projects like the one described here based on AI models. By providing a complete solution from data preprocessing to modeling, along with an intuitive interface for end users, we were able to effectively reduce the defect rate in the actual process.
