Wine Quality Prediction Project (Documentation) Jupyter Notebook
Wine Quality Prediction Project (Red Wine Variant)
Code of project: For code click here
1. Dataset Source:
Dataset: Wine Quality Dataset (Red Wine Variant)
Source: UCI Machine Learning Repository
Link: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
Direct CSV Link: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
2. Dataset Selection:
Why This Dataset?
This dataset is ideal for regression-based machine learning tasks because:
1. Real-World Relevance: Predicts wine quality using measurable chemical properties, mimicking industry needs.
2. Structured & Clean: No missing values, minimal preprocessing (e.g., scaling, duplicate removal).
3. Educational Value: Small size and clear features make it perfect for practicing workflows (preprocessing → modeling → evaluation).
4. Benchmarking: Widely used in research, enabling comparison with existing models (e.g., SVM vs. Linear Regression).
Dataset Properties:
a) Rows (Samples): 1,599 rows.
b) Columns (Features): 12 columns (11 input features plus the Quality target).
c) Feature Names: Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Density, pH, Sulphates, Alcohol, Quality.
3. Jupyter Notebook Implementation
Step 1: Import Libraries
We first import all the Python libraries needed to work with data, create plots, and build machine learning models:
· pandas: loads, preprocesses, and analyzes tabular data such as CSV files.
· numpy: used for numerical operations.
· matplotlib.pyplot: creates basic plots such as line charts.
· seaborn: generates more advanced visualizations such as heatmaps.
· sklearn.model_selection.train_test_split: splits the data into training and testing sets.
· sklearn.preprocessing.StandardScaler: standardizes features (mean = 0, variance = 1).
· sklearn.svm.SVR: implements Support Vector Regression for predictions.
· sklearn.linear_model.LinearRegression: fits a linear regression model.
· sklearn.metrics: calculates performance scores (MSE, R²).
Uses of These Libraries in the Project
· Preprocessing: Clean and scale data (pandas, numpy, StandardScaler).
· Modeling: Train regression models (SVR, LinearRegression).
· Evaluation: Measure prediction error and fit (mean_squared_error, r2_score).
· Visualization: Explore patterns and results (matplotlib, seaborn).
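The imports described above would typically sit in the first notebook cell. This is a sketch of what that cell might look like; the aliases (pd, np, plt, sns) are the usual community conventions and are an assumption, not taken from the original notebook.

```python
# Data handling and numerical operations
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Splitting, scaling, models, and metrics from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```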
Step 2: Load Dataset
I downloaded the raw dataset CSV to my PC, moved it into my local user folder, and then loaded it in Jupyter Notebook to build the model.
By default, CSV files use commas (,) to separate values, but in this file the values are separated by semicolons (;), so I pass delimiter=';' to read it correctly.
· raw_data contains the path to the CSV file.
· pd.read_csv loads the CSV file into a DataFrame.
· delimiter tells pandas how the values in each row are separated.
· print(data.head()) prints the first 5 rows of the dataset, with all columns.
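The loading step can be sketched as follows. To keep the example runnable without the downloaded file, a small in-memory sample with made-up rows and only four of the twelve columns stands in for the real CSV; the sample values are illustrative assumptions, not data from the file.

```python
import io
import pandas as pd

# In the notebook, raw_data would be the local path to the downloaded file,
# for example: raw_data = "winequality-red.csv"
# Here an in-memory, semicolon-separated sample (made-up rows) stands in for it.
raw_data = io.StringIO(
    "fixed acidity;volatile acidity;alcohol;quality\n"
    "7.4;0.70;9.4;5\n"
    "7.8;0.88;9.8;5\n"
    "11.2;0.28;9.8;6\n"
)

# delimiter=";" tells pandas that values are separated by semicolons, not commas
data = pd.read_csv(raw_data, delimiter=";")

# Show the first rows (up to 5) to confirm the columns were parsed correctly
print(data.head())
```

If the delimiter were left at its default, pandas would read each row as a single column, which is an easy way to spot the problem.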
Step 3: Data Preprocessing
Before building the model, we clean the data and then apply StandardScaler so that all features are on the same scale, which helps machine learning models perform better and train more efficiently.
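A minimal sketch of this preprocessing, assuming the cleaning consists of removing duplicate rows (mentioned earlier as a typical step for this dataset) followed by scaling. The tiny DataFrame and its values are made up for illustration; the names X, y, and X_scaled follow the ones used later in this post.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample standing in for the loaded wine data (third row repeats the second)
data = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.20, 3.30, 3.16],
    "quality": [5, 5, 5, 6, 7],
})

# Remove exact duplicate rows
data = data.drop_duplicates()

# Separate input features from the target column
X = data.drop(columns="quality")
y = data["quality"]

# Standardize each feature to mean 0 and variance 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Note that fitting the scaler on the full dataset before splitting lets test-set statistics influence the scaling; fitting it on the training split only is a common alternative.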
Step 4: Train-Test Split
This code splits the dataset into training and testing sets, where 80% of the data is used for training and 20% for testing (test_size=0.2). X_scaled contains the input features, and y is the target variable. Setting random_state=42 ensures the split is the same every time you run the code. It then prints the shape of both sets to confirm the split.
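The split described above might look like this. Random arrays stand in for the real X_scaled and y (an assumption so the sketch runs on its own); test_size and random_state are the values given in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the scaled features and the quality target
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(100, 11))
y = rng.integers(3, 9, size=100)

# 80% training, 20% testing; random_state fixes the shuffle so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Confirm the sizes of both sets
print("Train:", X_train.shape, y_train.shape)
print("Test:", X_test.shape, y_test.shape)
```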
Step 5: Model Implementation
We use two models, Support Vector Regression (SVR) and Linear Regression: Linear Regression is faster and easier to interpret, while SVR can capture more complex patterns, especially when the relationship between the features and quality is not linear.
Support Vector Regression (SVR):
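A sketch of fitting SVR, assuming the default RBF kernel and default hyperparameters; the settings used in the original notebook are not stated here. Synthetic data with a deliberately non-linear target stands in for the wine features.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic training and test data (stand-ins for X_train, y_train, X_test)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3))
y_train = 5 + X_train[:, 0] - 0.5 * X_train[:, 1] ** 2 + rng.normal(scale=0.1, size=80)
X_test = rng.normal(size=(20, 3))

# kernel="rbf" is scikit-learn's default; it is written out here for clarity
svr_model = SVR(kernel="rbf")
svr_model.fit(X_train, y_train)

# Predict on the held-out test features
y_pred_svr = svr_model.predict(X_test)
print(y_pred_svr[:5])
```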
Linear Regression:
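A matching sketch for Linear Regression. The synthetic target here is exactly linear in the features (an assumption chosen so the fitted coefficients are easy to check by eye); real wine data will not fit this cleanly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on the first two features only
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3))
y_train = 5 + 2 * X_train[:, 0] - 1 * X_train[:, 1]
X_test = rng.normal(size=(20, 3))

# Fit ordinary least squares and predict on the test features
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

# The coefficients show how much each feature contributes to the prediction
print(lr_model.coef_, lr_model.intercept_)
```

Being able to read the coefficients directly is what makes Linear Regression the easier of the two models to interpret.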
Step 6: Visualization
For visualization I used two charts.
· Actual vs Predicted Wine Quality (SVR): This chart shows how well the SVR model predicted wine quality compared to the actual values.
· Feature Correlation Heatmap: This chart displays the correlation between different features in the dataset, including their relationship with wine quality.
1. Actual vs Predicted Quality (SVR)
· Purpose: Evaluates how well the SVR model predicts wine quality.
· Interpretation: Each dot represents one prediction. If the dots are close to the red dashed line (y = x), it means the predictions are accurate.
2. Feature Correlation Heatmap
· Purpose: Shows how strongly each feature (column) is correlated with every other feature, including the target quality.
· Interpretation: The values range from -1 to 1. A value close to 1 or -1 means a strong positive or negative relationship, and a value near 0 means little linear relationship.
· Use Case: Helps identify which features are most relevant for predicting quality and whether there is multicollinearity (features highly correlated with each other).
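Both charts could be produced along the following lines. This is a sketch on synthetic data: the column names, the sample values, and styling choices such as the coolwarm colormap are assumptions, not the original notebook's settings.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-ins for the dataset, the test targets, and the SVR predictions
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 3)), columns=["alcohol", "pH", "sulphates"])
data["quality"] = (5 + data["alcohol"] + rng.normal(scale=0.5, size=50)).round()
y_test = data["quality"].to_numpy()
y_pred_svr = y_test + rng.normal(scale=0.4, size=50)

# Chart 1: actual vs predicted quality, with the y = x reference line in red dashes
fig1, ax1 = plt.subplots(figsize=(6, 5))
ax1.scatter(y_test, y_pred_svr, alpha=0.6)
low, high = y_test.min(), y_test.max()
ax1.plot([low, high], [low, high], "r--", label="y = x")
ax1.set_xlabel("Actual quality")
ax1.set_ylabel("Predicted quality")
ax1.set_title("Actual vs Predicted Wine Quality (SVR)")
ax1.legend()

# Chart 2: pairwise correlations between all columns, including quality
corr = data.corr()
fig2, ax2 = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax2)
ax2.set_title("Feature Correlation Heatmap")
```

In a Jupyter notebook the figures render inline at the end of the cell; in a plain script, a call to plt.show() would be needed.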
Step 7: Model Evaluation
To measure how well our models perform, we used three common evaluation metrics:
- Mean Squared Error (MSE)
MSE calculates the average squared difference between the actual and predicted values. A lower MSE means the model's predictions are closer to the actual values. Because the differences are squared, MSE is sensitive to large errors.
- Root Mean Squared Error (RMSE)
RMSE is simply the square root of MSE. It brings the error back to the same unit as the target variable (wine quality in our case), making it easier to interpret. Like MSE, lower RMSE values indicate better performance.
- R-squared Score (R²)
R² measures how much of the variation in the actual values is explained by the predictions. Its maximum is 1, and on test data it can be negative:
- 1 means perfect predictions.
- 0 means the model does no better than simply guessing the average.
- A negative value means the model does worse than guessing the average.
We applied these metrics to both the Support Vector Regression (SVR) and Linear Regression models using the test dataset. The results help us compare which model makes more accurate predictions for wine quality.
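The three metrics can be computed as sketched below. The short arrays are made-up values chosen for easy checking by hand, not results from the wine models; in the notebook, the same calls would be applied to y_test and each model's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted quality scores
y_test = np.array([5, 6, 7, 5])
y_pred = np.array([5, 6, 6, 6])

# MSE: average of the squared differences
mse = mean_squared_error(y_test, y_pred)

# RMSE: square root of MSE, in the same unit as the quality score
rmse = np.sqrt(mse)

# R²: 1 - (sum of squared errors) / (sum of squared deviations from the mean)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}  RMSE: {rmse:.4f}  R2: {r2:.4f}")
```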
Step 8: Save Processed Data:
This code creates a new, clean DataFrame called processed_data that includes all the scaled feature values (from X_scaled) and the original target values (quality).
It then saves this final DataFrame into a CSV file named wine_quality.csv, without including row index numbers (index=False). This allows you to store the cleaned and scaled data for future use, like sharing, analysis, or re-training the model.
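A sketch of the saving step, using a tiny made-up X_scaled and y and a hypothetical two-column feature list so that it runs on its own. The target is attached via to_numpy() so that values are assigned by position; this matters if rows were dropped earlier and the index of y is no longer 0, 1, 2, and so on.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins: scaled features as a NumPy array, target as a Series
feature_names = ["alcohol", "pH"]  # hypothetical subset of the real feature columns
X_scaled = np.array([[-1.0, 0.5], [0.0, -1.2], [1.0, 0.7]])
y = pd.Series([5, 6, 7], index=[0, 2, 5], name="quality")  # gaps as after dropping rows

# Rebuild a DataFrame from the scaled array and restore the column names
processed_data = pd.DataFrame(X_scaled, columns=feature_names)

# Attach the unscaled target by position rather than by index label
processed_data["quality"] = y.to_numpy()

# index=False keeps the row numbers out of the saved file
processed_data.to_csv("wine_quality.csv", index=False)
```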