WINE
Data Mining

White Wine
Quality
Classification

NATHALY INGOL ORANGE 3 TOOLKIT
UCI ML REPOSITORY
4,898
Wine instances
11
Chemical features
7
Quality classes (3–9)
6
Models compared
Orange 3 Workflow
1. Source
📄
File (XLSX)
white-wine-quality
↓
2. Prepare
๐Ÿ“Š
Data Table
Inspect 4,898 rows
โœ๏ธ
Edit Domain
quality โ†’ Categorical
↓
3. Models
↓
4. Evaluate
🧪
Test & Score
5-fold stratified CV
📈
ROC Analysis
AUC curves
🔢
Confusion Matrix
Per-class accuracy
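The Test & Score step above (5-fold stratified cross-validation) can be sketched with scikit-learn, the library used in the notebook section below. The tiny synthetic DataFrame here is a stand-in so the snippet runs on its own; in the real analysis `df` holds the 4,898 rows and 11 chemical features loaded from the XLSX file, and the feature names shown are only three of them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine DataFrame (real data comes from the XLSX).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)),
                  columns=["alcohol", "density", "pH"])
df["quality"] = rng.choice([5, 6, 7], size=200)

X, y = df.drop(columns="quality"), df["quality"]

# Stratified 5-fold CV keeps each fold's class proportions close to the
# full dataset's - important given the heavy 5/6 class imbalance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X, y,
                         cv=cv, scoring="accuracy")
print(scores.mean())
```

Swapping in any of the six compared classifiers only changes the estimator passed to `cross_val_score`.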
Model Comparison - 5-fold stratified cross-validation
Model | AUC | Accuracy | F1 | Precision | Recall | MCC
Feature Importance (Decision Tree, depth 5)
Quality Score Distribution
Classes 5 & 6 = 74% of all 4,898 instances
Python Analysis - Jupyter Notebook Deep Dive
📓
What is this section?
The Orange 3 analysis above gives us a visual workflow. Below, the same dataset was analyzed using Python (pandas, scikit-learn, matplotlib) in a Jupyter Notebook - replicating every result and going deeper. Each chart below includes an explanation so you can follow along even without a data science background.
View notebook on GitHub →
Step 1 - Class Distribution & Imbalance
Quality Distribution & Class Imbalance
What is this chart?
The bar chart (left) shows how many wines exist at each quality score from 3 to 9. The pie chart (right) groups them into Low (3–4), Medium (5–6), and High (7–9) buckets.
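The Low/Medium/High grouping behind the pie chart can be reproduced with `pd.cut`. A sketch, using a short hypothetical sample of quality scores in place of the notebook's full `df['quality']` column:

```python
import pandas as pd

# Hypothetical sample of quality scores; the notebook uses df['quality'].
quality = pd.Series([3, 4, 5, 5, 6, 6, 6, 7, 8, 9])

# Bin edges are exclusive on the left: (2,4] -> Low, (4,6] -> Medium, (6,9] -> High.
buckets = pd.cut(quality, bins=[2, 4, 6, 9],
                 labels=["Low (3–4)", "Medium (5–6)", "High (7–9)"])
print(buckets.value_counts())
```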
Why does this matter?
Quality scores 5 and 6 make up 74% of all 4,898 wines. This is called class imbalance - the model sees far more medium wines than low or high ones. As a result, accuracy alone is misleading: a model that just guesses '6' for everything would still be ~45% accurate. This is why we also look at AUC and F1.
Key Finding
Only 20 wines scored 3 and only 5 scored 9 - these rare classes are nearly impossible to classify correctly.
Python
vc = df['quality'].value_counts().sort_index()  # wines per quality score (3–9)
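The "~45% accurate by always guessing 6" claim can be checked directly from the class counts. The counts below are the published per-score totals for the UCI white-wine dataset (they sum to 4,898 and match the figures quoted in this section); in the notebook they come from the XLSX file rather than being hard-coded:

```python
import pandas as pd

# Per-score counts from the UCI white-wine dataset (hard-coded here;
# the notebook derives them from the loaded DataFrame).
df = pd.DataFrame({"quality": [3]*20 + [4]*163 + [5]*1457 + [6]*2198
                              + [7]*880 + [8]*175 + [9]*5})

vc = df['quality'].value_counts().sort_index()

# A "classifier" that always predicts the majority class is right
# exactly (majority count / total) of the time.
baseline = vc.max() / vc.sum()
print(f"majority class: {vc.idxmax()}, baseline accuracy: {baseline:.1%}")
```

This is why the comparison table reports AUC, F1, and MCC alongside raw accuracy.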