A case study of a customer who dramatically improved their defect rate (feat. SHAP library)

A case study of a Hashscraper client using AI models to reduce defect rates, explained in detail from data collection and preprocessing through modeling.


0. Overview

In one Hashscraper customer case, we used an AI model to reduce the defect rate of a manufacturing process. I wrote this article to explain the machine learning model's predictions in a detailed and understandable way.

1. Problem Definition

1.1. Goal Setting

To briefly summarize the customer case: each machine recorded 128 process variables, and defect rates differed from machine to machine. The goal was to reduce the defect rate by using a machine learning model to predict defects, analyze which variables caused them, and adjust those variables.

1.2. Hypothesis Establishment

We hypothesized that extracting and adjusting key variables in the process through a machine learning model would reduce the defect rate.

2. Data Collection

2.1. Determining Data Sources

The data was provided directly by the customer, collected from each machine in the factory. Since it is internal company data, we cannot disclose it directly, so only screenshots of the data folders could be shared.

2.2. Data Collection

We requested at least 10,000 rows of data, and as much more as possible. The raw data we received was: machine 1, 3,931 rows; machine 2, 16,473 rows; machine 3, 2,072 rows; machine 4, 16,129 rows; machine 5, 57,970 rows; and machine 6, 78,781 rows. In total, we used approximately 175,000 rows to train the model.

3. Data Preprocessing

3.1. Data Cleansing

Data cleansing is crucial for training the model. I would say that more than 80% of machine learning is data cleansing. If poorly cleansed data is used for training, the model will not learn well (in simple terms: garbage in, garbage out).

3.2. Workflow

3.2.1. Loading Files

First, when loading the files, I read the CSV files, which were saved in different encodings such as 'cp949' and 'utf-8'.

import pandas as pd

dfs = []
for file_path in file_paths:
    try:
        df = pd.read_csv(file_path, encoding='cp949', header=None)  # Korean Windows encoding
    except UnicodeDecodeError:
        df = pd.read_csv(file_path, encoding='utf-8', header=None)  # fallback for UTF-8 files
    dfs.append(df)  # collect each file's frame for later merging

3.2.2. Labeling

For labeling, I combined the date and time columns into a single timestamp and matched each time interval with the y-axis measurement data recorded during that interval.

results = []
for i in range(len(result_df_new) - 1):
    start_time, end_time = result_df_new['Datetime'].iloc[i], result_df_new['Datetime'].iloc[i + 1]
    # y-axis rows recorded between two consecutive timestamps
    selected_rows = df_yaxis[(df_yaxis['Datetime'] >= start_time) & (df_yaxis['Datetime'] < end_time)]
    # label 1 only if every result ('결과' means 'result') in this interval contains 'OK'
    results.append(1 if all(selected_rows['결과'].str.contains('OK')) else 0)
results.append(0)  # the final row has no following timestamp, so it is labeled 0

3.2.3. Merging Machine Data

Merge preprocessed data for each machine.

# combine the preprocessed data from all machines and reset the index
data = pd.concat([df1, df2, df3, df4, df5, df6])
data.reset_index(drop=True, inplace=True)

3.2.4. Removing Duplicates

Duplicate data can bias the dataset, make it harder for the model to learn the diversity of the data, and lead to overfitting. Therefore, all duplicate rows are removed.

data = data.drop_duplicates().reset_index(drop=True)

After this level of preprocessing, I conducted some exploratory data analysis (EDA).

3.2.5. Handling Missing Values

I visualized missing values using the missingno library. Columns with many missing values were removed entirely, for reasons similar to those above: they can hinder learning the diversity of the data and lead to overfitting. Of course, depending on the data being analyzed, missing values themselves can carry important information.
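A minimal sketch of this step (the 50% drop threshold is an assumption, not necessarily the exact cutoff we used):

import missingno as msno
import matplotlib.pyplot as plt

# visualize the missing-value pattern of every column
msno.matrix(data)
plt.show()

# drop columns where fewer than half of the values are present
data = data.dropna(axis=1, thresh=len(data) // 2)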

3.3. Feature Engineering

Feature engineering means creating new features or transforming existing ones to improve model performance. Since we did not know the exact meaning of each feature and considered all of them potentially important, we did not perform separate feature engineering.

3.4. EDA (only part of the output can be shown, for security reasons)

We checked the distribution of the data using graphs such as histograms and box plots.
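A minimal sketch of these checks (showing only the first 16 columns for readability):

import matplotlib.pyplot as plt

# histograms of the first 16 columns
data.iloc[:, :16].hist(bins=30, figsize=(16, 12))
plt.tight_layout()
plt.show()

# box plots of the same columns to inspect outliers
data.iloc[:, :16].plot(kind='box', subplots=True, layout=(4, 4), figsize=(16, 12), sharey=False)
plt.tight_layout()
plt.show()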

4. Sampling Types

4.1. Data Imbalance

Due to significant data imbalance, we tried combining various models with undersampling. Undersampling is used when there is a large class imbalance; its main purpose is to prevent overfitting, and our main goal was for the model to learn without being biased toward any specific class. The methods we considered are described below, followed by a code example.

4.2. Random Under-sampling (RUS)

Resolves class imbalance by randomly removing data points from the majority class. It is simple and fast to implement, but there is a risk of losing important information.

4.3. NearMiss

This method keeps only the majority-class data points that are nearest to the minority-class data points. NearMiss has several versions, each of which uses the distances to the minority-class points differently when choosing which majority-class points to keep.

4.4. Tomek Links

Finds the closest pairs of data points belonging to opposite classes (Tomek links) and removes the majority-class point from each pair, which sharpens the boundary between the classes.

4.5. Edited Nearest Neighbors (ENN)

Applies the k-NN algorithm to each majority-class data point; if the majority of its nearest neighbors belong to the minority class, that point is removed.

4.6. Neighbourhood Cleaning Rule (NCL)

An extended version of ENN that more effectively removes majority class data points to tidy up the area around the minority class.

We tried the undersampling methods described above, and also tried combining undersampling with oversampling. In the end, we found that ENN undersampling suited the model best, so we applied ENN. The snippet below shows the fit_resample workflow using NearMiss, one of the samplers we experimented with; a sketch using ENN follows after it.

from imblearn.under_sampling import NearMiss

# create a NearMiss instance
nm = NearMiss()

# perform undersampling; '결과' (meaning 'result') is the label column
X_resampled, y_resampled = nm.fit_resample(data.drop('결과', axis=1), data['결과'])

# convert the undersampling result back into a DataFrame
data_sample = pd.concat([X_resampled, y_resampled], axis=1)
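For reference, a minimal sketch of the ENN undersampling we ultimately applied, using imblearn's EditedNearestNeighbours as a drop-in replacement for NearMiss above:

from imblearn.under_sampling import EditedNearestNeighbours

# ENN removes majority-class points whose nearest neighbors mostly carry the other label
enn = EditedNearestNeighbours()
X_resampled, y_resampled = enn.fit_resample(data.drop('결과', axis=1), data['결과'])
data_sample = pd.concat([X_resampled, y_resampled], axis=1)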

5. Modeling

5.1. Model Selection

Select an appropriate machine learning model based on the type of problem (classification, regression, clustering, etc.).

In model selection, we tried running various models ourselves, and also referred to the best model selected by the PyCaret library. PyCaret is an open-source Python library for automating data analysis and machine learning; it allows users to quickly build and experiment with the entire pipeline using minimal code.
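As an illustration, a minimal PyCaret sketch for this kind of comparison (the target column '결과' and the session_id are assumptions):

from pycaret.classification import setup, compare_models

# set up the experiment; '결과' (result) is assumed to be the label column
exp = setup(data=data_sample, target='결과', session_id=42)

# train and rank candidate models, sorted by AUC
best_model = compare_models(sort='AUC')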

5.2. Model Training

Train the model using the training data.

Ultimately, the CatBoost model yielded the highest AUC and F1-Score values.
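A minimal training sketch under stated assumptions (the split ratio and CatBoost settings are illustrative, and '결과' is the label column):

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

X = data_sample.drop('결과', axis=1)
y = data_sample['결과']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# evaluate on the held-out set while training; verbose=0 silences per-iteration logs
model = CatBoostClassifier(eval_metric='AUC', verbose=0)
model.fit(X_train, y_train, eval_set=(X_test, y_test))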

  • AUC (Area Under the Curve):

AUC stands for the area under the ROC (Receiver Operating Characteristic) curve.

The ROC curve plots sensitivity (True Positive Rate) on the y-axis and 1-specificity (False Positive Rate) on the x-axis.

The AUC value ranges between 0 and 1, with higher values indicating better classifier performance. A value of 0.5 is equivalent to random classification.

AUC is particularly useful for imbalanced class distributions.

  • F1-Score:

The F1-Score is the harmonic mean of precision and recall.

Precision is the ratio of true positives among positive predictions, and recall is the ratio of true positives among actual positives.

The F1-Score indicates a balance between the two metrics and is used to overcome the limitations of models that optimize only one metric.

The F1-Score ranges between 0 and 1, with higher values indicating better model performance.
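For reference, both metrics can be computed with scikit-learn (variable names follow the training sketch above):

from sklearn.metrics import roc_auc_score, f1_score

# AUC needs the predicted probability of the positive class
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
f1 = f1_score(y_test, model.predict(X_test))
print(f'AUC: {auc:.3f}, F1: {f1:.3f}')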

6. Conclusion: Variable Extraction and Intuitive Results

Ultimately, this model receives real-time raw data generated by the machines in the factory and makes predictions. Using the SHAP library, we extracted the variables related to defective products.

Furthermore, for the ordinary workers in the factory, we used PyInstaller to wrap this whole pipeline in an .exe file: with just one click, they can view the key variables and see whether a product is predicted to be defective from the raw data, with the results saved to an easy-to-view Excel file.
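For reference, bundling a script into a single-file executable with PyInstaller looks like this (the script name is illustrative):

pyinstaller --onefile predict_defects.py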

What is SHAP?

SHAP stands for SHapley Additive exPlanations, a tool used to explain how much each feature of a machine learning model influences predictions. This tool increases the 'transparency' of the model, thereby enhancing confidence in how predictions are made.
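A minimal sketch of how per-feature contributions can be extracted for a tree-based model such as CatBoost (variable names follow the sketches above):

import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles like CatBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# rank features by their average impact on the defect prediction
shap.summary_plot(shap_values, X_test)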

Hashscraper carries out projects like the case described above based on such AI models.
