Data Science · Case study · 2024
Who subscribes to a term deposit?
A binary classification project on a Portuguese bank's telemarketing data. The real challenge isn't fitting a model — it's a severe class imbalance that makes accuracy lie. This write-up walks through the EDA, the preprocessing, the imbalance problem, and a head-to-head comparison of seven classifiers.
01
The business problem
A Portuguese retail bank runs telemarketing campaigns to sell term deposits. Calls cost money and annoy uninterested clients, so the bank wants to know in advance who is likely to subscribe. That is the target variable y (yes / no) — a classic binary classification problem on the well-known UCI Bank Marketing dataset.
The full dataset (bank-full.csv) holds 45,211 calls and 16 features describing the client (age, job, marital status, education, balance…), the contact (channel, month, duration), and the history of previous campaigns (campaign, pdays, previous, poutcome).
02
Exploration: a heavily imbalanced target
The data is clean to start with — no missing values, no duplicate rows — so the analysis went straight to distributions. The single most important finding sits in the target itself:

Breaking each categorical feature down by the target reveals where the signal lives. Job is a good example:

The numeric features comprise seven continuous variables. Boxplots against the target and outlier inspection guided the cleaning step that followed.
03
Preprocessing: cleaning, scaling, encoding
- Outlier removal: numeric columns were standardized to z-scores and rows with any |z| > 3 were dropped, trimming extreme values that can distort sensitive models.
- Normalization: StandardScaler on the numeric features so that large-magnitude variables (like balance) don't dominate distance-based models.
- Encoding: label encoding for binary columns (yes/no → 1/0), one-hot encoding for categorical variables with more than two values.
- A second, binned dataset: for Naive Bayes and the Decision Trees, continuous variables were discretized into categories (age groups, balance quantiles, call-duration bands…), which suits those models better.
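The first three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the data frame is a small synthetic stand-in, and the column groupings (`numeric_cols`, `binary_cols`, `multi_cols`) are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the bank data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n).round(),
    "balance": rng.normal(1500, 800, n).round(),
    "housing": rng.choice(["yes", "no"], n),
    "marital": rng.choice(["married", "single", "divorced"], n),
    "y": rng.choice(["yes", "no"], n, p=[0.12, 0.88]),
})
df.loc[0, "balance"] = 100_000  # plant one extreme value in row 0

# Hypothetical column groupings for this sketch.
numeric_cols = ["age", "balance"]
binary_cols = ["housing", "y"]
multi_cols = ["marital"]

# 1. Outlier removal: drop rows where any numeric column has |z| > 3.
z = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std(ddof=0)
clean = df[(z.abs() <= 3).all(axis=1)].copy()

# 2. Label-encode binary yes/no columns as 1/0.
for col in binary_cols:
    clean[col] = clean[col].map({"yes": 1, "no": 0})

# 3. One-hot encode categorical columns with more than two values.
clean = pd.get_dummies(clean, columns=multi_cols, dtype=int)

# 4. Standardize the numeric features to zero mean and unit variance.
clean[numeric_cols] = StandardScaler().fit_transform(clean[numeric_cols])

print(clean.head())
```

In a train/test workflow the scaler would normally be fit on the training split only and then applied to the test split; that detail is left out here to keep the sketch short.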
04
The core challenge: imbalance and SMOTE
With 88% of clients saying “no”, plain accuracy is a trap. A KNN baseline reached 90% accuracy but only 31% recall on the “yes” class: it missed 69% of the actual subscribers, who are exactly the clients the bank cares about. The right lens is the minority class's precision and recall, not the headline accuracy.
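To see why accuracy misleads at this class ratio, consider a degenerate model that always answers “no”. The sketch below uses a made-up sample of 100 clients with the 88/12 split quoted above; the helper names are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive="yes"):
    """Return (tp, fp, fn, tn) for the chosen positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    # Share of actual positives that the model found.
    return tp / (tp + fn) if (tp + fn) else 0.0


# 100 hypothetical clients: 88 say "no", 12 say "yes".
y_true = ["no"] * 88 + ["yes"] * 12
# A "model" that never predicts a subscriber.
y_pred = ["no"] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
baseline_accuracy = accuracy(tp, fp, fn, tn)
baseline_recall = recall(tp, fn)
print(baseline_accuracy, baseline_recall)
```

The do-nothing model scores 0.88 accuracy while finding none of the subscribers, so a 90% accuracy figure on this data says very little on its own.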
The remedy applied was SMOTE (Synthetic Minority Over-sampling Technique): instead of duplicating minority examples, it synthesizes new ones between existing neighbors. Crucially, SMOTE was fit on the training set only, never the test set, to avoid leaking synthetic information into evaluation.
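The interpolation idea and the train-only rule can be illustrated with a simplified NumPy sketch. It is not the implementation used in the project (the write-up does not say which one was used; in practice a library such as imbalanced-learn, with `SMOTE().fit_resample(X_train, y_train)`, is the usual choice). The function `smote_sketch` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style oversampling: each synthetic point lies on the
    segment between a minority sample and one of its k nearest minority
    neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)            # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Synthetic imbalanced data (about 12% positives), for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (rng.random(500) < 0.12).astype(int)

# Split FIRST, so the test set never sees synthetic points.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:400], idx[400:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Oversample the minority class of the training split only.
X_min = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_syn = smote_sketch(X_min, n_new)
X_train_bal = np.vstack([X_train, X_syn])
y_train_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])

print(np.bincount(y_train), np.bincount(y_train_bal), len(X_test))
```

Because every synthetic row is a convex combination of two real minority rows, the new points stay inside the region the minority class already occupies. The test split keeps its original class ratio.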
SMOTE consistently shifted the trade-off the same way — more recall on subscribers, at the cost of some precision. On KNN, minority recall jumped from 0.31 to 0.58 while precision fell from 0.57 to 0.42. On Random Forest it balanced out neatly to 0.55 precision / 0.55 recall on the “yes” class. The lesson: SMOTE doesn't conjure free performance — it rebalances which errors the model makes.
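One way to make the trade-off concrete is to convert the reported precision and recall into error counts per 100 actual subscribers. The arithmetic below uses only the figures quoted above; the function name and the per-100 framing are choices made for this illustration.

```python
def errors_per_100_positives(precision, recall):
    """Translate precision/recall on the "yes" class into counts,
    normalised to 100 actual subscribers."""
    caught = 100 * recall            # true positives
    missed = 100 - caught            # false negatives
    flagged = caught / precision     # everyone predicted "yes"
    false_alarms = flagged - caught  # false positives
    return {"caught": caught, "missed": missed, "false_alarms": false_alarms}


# (precision, recall) pairs as reported in the write-up.
reported = {
    "KNN, no SMOTE": (0.57, 0.31),
    "KNN, SMOTE": (0.42, 0.58),
    "Random Forest, SMOTE": (0.55, 0.55),
}

summary = {name: errors_per_100_positives(p, r) for name, (p, r) in reported.items()}
for name, row in summary.items():
    print(f"{name}: missed={row['missed']:.0f}, false alarms={row['false_alarms']:.0f}")
```

For KNN this works out to roughly 69 missed subscribers and 23 false alarms before SMOTE, against 42 missed and 80 false alarms after it: fewer lost sales, more wasted calls.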
05
The models
Seven classifiers were trained and compared, each evaluated with and without SMOTE: KNN, Random Forest, Logistic Regression, MLP (neural network), Naive Bayes, Decision Tree (Gini and Entropy), and SVM.
A practical note on SVM: with an RBF kernel it took several minutes to train on the full 45k rows, so it was evaluated on the smaller bank.csv sample — a real-world reminder that some models don't scale cheaply.
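A comparison loop of this kind might look like the sketch below. It runs on synthetic data from `make_classification`, uses scikit-learn defaults rather than the project's hyperparameters (which the write-up does not state), substitutes `GaussianNB` on continuous features for the binned Naive Bayes setup, and leaves out MLP and SVM to keep it quick. The `evaluate` helper is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly the same 88/12 class ratio.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default settings; these are assumptions, not the project's configuration.
models = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree (Gini)": DecisionTreeClassifier(criterion="gini", random_state=0),
    "Decision Tree (Entropy)": DecisionTreeClassifier(criterion="entropy", random_state=0),
}


def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each model and collect accuracy, macro precision and ROC AUC."""
    rows = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
        rows[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "macro_precision": precision_score(
                y_test, y_pred, average="macro", zero_division=0
            ),
            "roc_auc": roc_auc_score(y_test, y_score),
        }
    return rows


results = evaluate(models, X_train, y_train, X_test, y_test)
for name, row in results.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

To compare with and without resampling, the same `evaluate` call can be repeated with an oversampled training set while `X_test` and `y_test` stay untouched.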

06
Results
Ranking the models on accuracy, macro precision, and ROC AUC (with SMOTE applied) gives a clear winner:
| Model | Accuracy | Macro precision | ROC AUC |
|---|---|---|---|
| Random Forest (best) | 0.908 | 0.751 | 0.923 |
| Logistic Regression | 0.899 | 0.727 | 0.894 |
| KNN | 0.875 | 0.684 | 0.838 |
| Naive Bayes | 0.796 | 0.620 | 0.828 |
| Decision Tree (Gini) | 0.870 | 0.654 | 0.689 |
| Decision Tree (Entropy) | 0.870 | 0.652 | 0.682 |
Random Forest leads on every metric (ROC AUC 0.923), with Logistic Regression a close and far simpler runner-up (0.894). KNN and Naive Bayes trail; the single Decision Trees have decent accuracy but weak AUC, a sign they discriminate the two classes less reliably across thresholds. SVM (on the reduced sample) landed around 0.90 accuracy, with a low minority precision of 0.50.
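ROC AUC can be read as the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber, with ties counting half. The sketch below computes it that way on made-up scores and shows one plausible reason a single tree can score low: a fully grown tree tends to output hard 0/1 probabilities, which produces many ties. This is an illustration, not a diagnosis confirmed by the write-up.

```python
def roc_auc_pairwise(pos_scores, neg_scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly;
    ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical graded scores from a probabilistic model.
pos_graded = [0.9, 0.8, 0.4]       # actual subscribers
neg_graded = [0.7, 0.3, 0.2, 0.1]  # actual non-subscribers

# The same predictions collapsed to hard 0/1 outputs (threshold 0.5).
pos_hard = [1, 1, 0]
neg_hard = [1, 0, 0, 0]

auc_graded = roc_auc_pairwise(pos_graded, neg_graded)
auc_hard = roc_auc_pairwise(pos_hard, neg_hard)
print(round(auc_graded, 3), round(auc_hard, 3))
```

Both versions make identical decisions at the 0.5 threshold, so their accuracy is the same, yet the graded scores rank 11 of 12 pairs correctly (AUC about 0.917) while the hard outputs reach only about 0.708.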
07
Takeaways
| Step | Skill demonstrated |
|---|---|
| EDA on 45k rows, 16 features | Reading distributions and target relationships |
| Outlier removal, scaling, encoding | Preprocessing pipeline |
| Diagnosing class imbalance | Knowing accuracy lies — reading precision/recall |
| SMOTE on train only | Resampling without data leakage |
| Seven-model comparison | Breadth across classifier families |
| ROC AUC as the yardstick | Choosing the right metric for imbalance |
What I take away: on imbalanced data, the model choice matters less than understanding the metric. Every classifier here predicted “no” well and struggled with “yes”; the work was in surfacing that honestly, fixing it with SMOTE where it helped, and picking ROC AUC over accuracy to rank them.