
Wine Quality Prediction Project (Red Wine Variant)


1. Dataset Source:

Dataset: Wine Quality Dataset (Red Wine Variant)

Source: UCI Machine Learning Repository

Link: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Direct CSV Link: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

2. Dataset Selection:

Why This Dataset?

This dataset is ideal for regression-based machine learning tasks because:

1.      Real-World Relevance: Predicts wine quality from measurable chemical properties, mimicking industry needs.

2.      Structured & Clean: No missing values; only minimal preprocessing is needed (e.g., scaling, duplicate removal).

3.      Educational Value: Small size and clear features make it ideal for practicing a full workflow (preprocessing → modeling → evaluation).

4.      Benchmarking: Widely used in research, enabling comparison with existing models (e.g., SVM vs. Linear Regression).

Dataset Properties:

a)      Rows (Samples): 1,599

b)      Columns (Features): 12

c)      Feature Names:
Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Density, pH, Sulphates, Alcohol, Quality.

3. Jupyter Notebook Implementation

Step 1: Import Libraries

We first import all necessary Python libraries that help us work with data, create plots, and build machine learning models:

·         pandas: load, preprocess, and analyze tabular data such as CSV files.

·         numpy: numerical operations.

·         matplotlib.pyplot: create basic plots such as line charts.

·         seaborn: generate advanced visualizations such as heatmaps.

·         sklearn.model_selection.train_test_split: split the data into training and testing sets.

·         sklearn.preprocessing.StandardScaler: standardize features (mean = 0, variance = 1).

·         sklearn.svm.SVR: implement Support Vector Regression for predictions.

·         sklearn.linear_model.LinearRegression: fit a linear regression model.

·         sklearn.metrics: calculate performance scores (MSE, R²).
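The import cell described above might look like the following sketch (the aliases `pd`, `np`, `plt`, and `sns` are common conventions, not requirements):

```python
import pandas as pd                    # load, preprocess, and analyze tabular data
import numpy as np                     # numerical operations
import matplotlib.pyplot as plt        # basic plots such as line charts
import seaborn as sns                  # advanced visualizations such as heatmaps
from sklearn.model_selection import train_test_split   # split data for training/testing
from sklearn.preprocessing import StandardScaler       # standardize features (mean=0, var=1)
from sklearn.svm import SVR                            # support vector regression
from sklearn.linear_model import LinearRegression      # linear regression model
from sklearn.metrics import mean_squared_error, r2_score  # performance scores
```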

Uses of These Libraries in the Project

·         Preprocessing: Clean and scale data (pandas, numpy, StandardScaler).

·         Modeling: Train regression models (SVR, LinearRegression).

·         Evaluation: Measure accuracy (mean_squared_error, r2_score).

·         Visualization: Explore patterns and results (matplotlib, seaborn).

Step 2: Load Dataset

I downloaded the raw dataset CSV to my PC, moved it into my local user directory, and then imported that CSV file into the Jupyter notebook to build the model.

By default, CSV files use commas (,) to separate values, but in this file the values are separated by semicolons (;).

So I use delimiter=';' to tell pandas to split on semicolons.

·         raw_data contains the path to the CSV file.

·         pd.read_csv is used to load the CSV file.

·         delimiter tells pandas how the values in each row are separated.

·         print(data.head()) prints the first 5 rows of the CSV file, showing every column.
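The loading step can be sketched as below. In the notebook, `raw_data` would be the path to the downloaded file (e.g., `raw_data = "winequality-red.csv"` followed by `data = pd.read_csv(raw_data, delimiter=';')`); here a small inline sample with two rows from the UCI file stands in for the file so the effect of `delimiter=';'` is visible:

```python
import io
import pandas as pd

# Two data rows from winequality-red.csv, with the original semicolon-separated header:
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
)

data = pd.read_csv(sample, delimiter=';')  # semicolon-separated values
print(data.head())                         # first 5 rows (only 2 exist in this sample)
print(data.shape)                          # (2, 12): 12 columns parsed correctly
```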

Step 3: Data Preprocessing

Before building the model, we clean the data:

This code first checks for any missing values and removes duplicate rows to clean the dataset. Then, it separates the features from the target variable. Finally, it applies feature scaling using StandardScaler to ensure all features have the same scale, which helps machine learning models perform better and more efficiently.
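The preprocessing described above might look like this sketch; a tiny illustrative DataFrame (with an intentional duplicate row) stands in for the full wine dataset:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the wine DataFrame (rows 2 and 3 are duplicates):
data = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.5],
    "pH":      [3.51, 3.20, 3.20, 3.30],
    "quality": [5, 5, 5, 6],
})

print(data.isnull().sum())           # check for missing values
data = data.drop_duplicates()        # remove duplicate rows

X = data.drop("quality", axis=1)     # features
y = data["quality"]                  # target variable

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each feature now has mean 0, variance 1
```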

Step 4: Train-Test Split

Split the dataset for training and testing:

This code splits the dataset into training and testing sets, where 80% of the data is used for training and 20% for testing (test_size=0.2). X_scaled contains the input features, and y is the target variable. The random_state=42 ensures the split is the same every time you run the code. It then prints the shape of both sets to confirm the split.
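A sketch of this split, using random stand-ins for `X_scaled` and `y` (the real dataset has 1,599 samples; 10 are used here so the 80/20 shapes are easy to check):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the scaled features (11 feature columns) and target:
X_scaled = np.random.rand(10, 11)
y = np.random.randint(3, 9, size=10)

# 80% training / 20% testing; random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)   # (8, 11) (2, 11)
```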

Step 5: Model Implementation

We use two models, Support Vector Regression and Linear Regression: Linear Regression is faster and easier to understand, while SVR can capture complex patterns better, especially if the relationship in the data is non-linear.

Support Vector Regression (SVR):

Uses the RBF kernel to capture non-linear relationships. Can model more complex patterns in the data. Often performs better with scaled data and smaller datasets.

Linear Regression:

Assumes a linear relationship between features and the target. Simpler and easier to interpret. May not perform well if the relationship in data is non-linear.
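Fitting the two models might be sketched as follows; toy training data (with a known linear relationship) stands in for `X_train` and `y_train` from the split step:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

# Toy training data: 20 samples, 3 features, exactly linear target
rng = np.random.default_rng(0)
X_train = rng.random((20, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 5

svr = SVR(kernel="rbf")      # RBF kernel captures non-linear patterns
svr.fit(X_train, y_train)

lin = LinearRegression()     # assumes a linear feature-target relationship
lin.fit(X_train, y_train)

print(svr.predict(X_train[:2]))
print(lin.predict(X_train[:2]))
```

Because this toy target really is linear, Linear Regression recovers the coefficients exactly, while SVR approximates them through its kernel.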

Step 6: Visualization

For visualization, I used two charts:

·         Actual vs Predicted Wine Quality (SVR):
This chart shows how well the SVR model predicted wine quality compared to the actual values.

·         Feature Correlation Heatmap:
This chart displays the correlation between different features in the dataset, including their relationship with wine quality.

1.     Actual vs Predicted Quality (SVR)

·         Purpose: Evaluates how well the SVR model predicts wine quality.

·         Interpretation: Each dot represents one prediction. If the dots are close to the red dashed line (y = x), it means the predictions are accurate.

·         Use Case: Helps visually assess prediction accuracy and detect overfitting or underfitting.

2.     Feature Correlation Heatmap:

·         Purpose: Shows how strongly each feature (column) is correlated with every other feature, including the target quality.

·         Interpretation: The values range from -1 to 1. A high positive or negative value means a strong relationship.

·         Use Case: Helps identify which features are most relevant for predicting quality and if there’s multicollinearity (features highly correlated with each other).
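Both charts can be sketched as below; `y_test`, the SVR predictions, and a small DataFrame stand in for the real test results and the full wine DataFrame:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")             # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import seaborn as sns

# Stand-ins for the test targets and the SVR predictions:
y_test = np.array([5, 6, 5, 7, 6])
y_pred_svr = np.array([5.2, 5.8, 5.1, 6.6, 6.3])

# 1. Actual vs predicted quality (SVR): dots near the red dashed y = x line are accurate
plt.figure()
plt.scatter(y_test, y_pred_svr)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, "r--")
plt.xlabel("Actual quality")
plt.ylabel("Predicted quality")
plt.title("Actual vs Predicted Wine Quality (SVR)")

# 2. Feature correlation heatmap (in the notebook, `data` is the full wine DataFrame)
data = pd.DataFrame({"alcohol": [9.4, 9.8, 10.5, 11.0, 9.9],
                     "pH":      [3.51, 3.2, 3.3, 3.4, 3.1],
                     "quality": y_test})
plt.figure()
sns.heatmap(data.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.close("all")
```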


Step 7: Model Evaluation

To measure how well our models perform, we used three common evaluation metrics:

  1. Mean Squared Error (MSE)
    MSE calculates the average squared difference between the actual and predicted values. A lower MSE means the model is making fewer errors. It helps us understand how far off our predictions are, but it can be sensitive to large errors.
  2. Root Mean Squared Error (RMSE)
    RMSE is simply the square root of MSE. It brings the error back to the same unit as the target variable (wine quality in our case), making it easier to interpret. Like MSE, lower RMSE values indicate better performance.
  3. R-squared Score (R²)
    R² measures how well the predicted values match the actual values. It usually falls between 0 and 1 (it can even be negative for very poor models):
    • 1 means perfect predictions.
    • 0 means the model does no better than simply predicting the average.

We applied these metrics to both Support Vector Regression (SVR) and Linear Regression models using the test dataset. The results help us compare which model makes more accurate predictions for wine quality.
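The three metrics might be computed as in this sketch; small illustrative arrays stand in for the actual and predicted test values (RMSE is derived as the square root of MSE):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual vs predicted quality scores on a test set:
y_test = np.array([5, 6, 5, 7])
y_pred = np.array([5.5, 6.0, 5.0, 6.5])

mse = mean_squared_error(y_test, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same unit as the quality score
r2 = r2_score(y_test, y_pred)              # fraction of variance explained

print(f"MSE:  {mse:.3f}")    # 0.125
print(f"RMSE: {rmse:.3f}")   # 0.354
print(f"R²:   {r2:.3f}")     # 0.818
```

In the notebook the same calls would be made once with the SVR predictions and once with the Linear Regression predictions, and the scores compared side by side.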

Step 8: Save Processed Data:

This code creates a new, clean DataFrame called processed_data that includes all the scaled feature values (from X_scaled) and the original target values (quality). It then saves this final DataFrame into a CSV file named wine_quality.csv, without including row index numbers (index=False). This allows you to store the cleaned and scaled data for future use, like sharing, analysis, or re-training the model.
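A sketch of this final step; small stand-ins replace `X_scaled`, the feature names, and `y`, and the CSV is written to an in-memory buffer here (the notebook would write to `wine_quality.csv` instead):

```python
import io
import numpy as np
import pandas as pd

# Stand-ins for the scaled features, their column names, and the target:
feature_names = ["alcohol", "pH"]
X_scaled = np.array([[0.5, -0.5], [-0.5, 0.5]])
y = pd.Series([5, 6], name="quality")

# Rebuild a labeled DataFrame from the scaled array, then re-attach the target
processed_data = pd.DataFrame(X_scaled, columns=feature_names)
processed_data["quality"] = y.values

# In the notebook: processed_data.to_csv("wine_quality.csv", index=False)
buf = io.StringIO()
processed_data.to_csv(buf, index=False)   # index=False drops the row numbers
print(buf.getvalue())
```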
