Data Science · Case study · 2024

Who subscribes to a term deposit?

A binary classification project on a Portuguese bank's telemarketing data. The real challenge isn't fitting a model — it's a severe class imbalance that makes accuracy lie. This write-up walks through the EDA, the preprocessing, the imbalance problem, and a head-to-head comparison of seven classifiers.

Python · pandas · scikit-learn · imbalanced-learn (SMOTE) · seaborn · Matplotlib


The business problem

A Portuguese retail bank runs telemarketing campaigns to sell term deposits. Calls cost money and annoy uninterested clients, so the bank wants to know in advance who is likely to subscribe. That is the target variable y (yes / no) — a classic binary classification problem on the well-known UCI Bank Marketing dataset.

The full dataset (bank-full.csv) holds 45,211 calls and 16 features describing the client (age, job, marital status, education, balance…), the contact (channel, month, duration), and the history of previous campaigns (campaign, pdays, previous, poutcome).

Exploration: a heavily imbalanced target

The data is clean to start with — no missing values, no duplicate rows — so the analysis went straight to distributions. The single most important finding sits in the target itself:

Distribution of the target variable y. The target is heavily imbalanced: roughly 88% “no” vs 12% “yes”. Most clients decline, which means a model can score 88% accuracy by always predicting “no”. This single fact dictates the rest of the project.
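To make the "always predict no" trap concrete, here is a minimal sketch on synthetic labels with the same 88/12 split. The arrays are stand-ins, not the bank data, and DummyClassifier is used here only as an illustration of a majority-class baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in for the target: 88% "no" (0), 12% "yes" (1).
y = np.array([0] * 880 + [1] * 120)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall_yes = recall_score(y, y_pred, pos_label=1)

print(f"accuracy: {baseline_accuracy:.2f}")
print(f"recall on 'yes': {baseline_recall_yes:.2f}")
```

The baseline matches the majority share on accuracy while finding no subscribers at all, which is why any real model has to be judged against it on the minority class.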

Breaking each categorical feature down by the target reveals where the signal lives. Job is a good example:

Subscription counts by job category. Blue-collar clients skew strongly toward “no”; students and retirees show a relatively higher share of “yes”.

Other strong signals: previous-campaign success (poutcome = success) is the best predictor of a new subscription, and contact month matters (March, October, December convert better than the busy summer months).
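Raw counts can hide the signal when categories differ a lot in size, so it helps to look at the "yes" rate within each category. A minimal sketch with pandas, using a tiny hypothetical frame (the rows below are invented for illustration; only the column names job and y come from the dataset):

```python
import pandas as pd

# Tiny hypothetical sample; the real data has 45,211 rows.
df = pd.DataFrame({
    "job": ["blue-collar"] * 5 + ["student"] * 4 + ["retired"] * 4,
    "y":   ["no", "no", "no", "no", "yes",
            "no", "no", "yes", "yes",
            "no", "no", "no", "yes"],
})

# Row-normalized crosstab: each row sums to 1, so the "yes" column is
# the subscription rate within that job category.
rates = pd.crosstab(df["job"], df["y"], normalize="index")
yes_rate = rates["yes"].sort_values(ascending=False)
print(yes_rate)
```

The same call works for any categorical column (poutcome, month, and so on) by swapping the first argument.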

The dataset has seven numeric features. Boxplots against the target and outlier inspection guided the cleaning step that followed.

Preprocessing: cleaning, scaling, encoding

Outlier removal — numeric columns were standardized to a z-score and rows with any |z| > 3 were dropped, trimming extreme values that can distort sensitive models.
Normalization — StandardScaler on the numeric features so that large-magnitude variables (like balance) don't dominate distance-based models.
Encoding — label encoding for binary columns (yes/no → 1/0), one-hot encoding for categorical variables with more than two values.
A second, binned dataset — for Naive Bayes and the Decision Trees, continuous variables were discretized into categories (age groups, balance quantiles, call-duration bands…), which suits those models better.
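The four steps above can be sketched roughly as follows. This is not the project's actual code: the frame is randomly generated, only a few of the dataset's columns are used, and the bin edges and labels are assumptions picked for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small hypothetical frame with a subset of the dataset's columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "balance": rng.normal(1500, 800, size=n),
    "housing": rng.choice(["yes", "no"], size=n),
    "marital": rng.choice(["married", "single", "divorced"], size=n),
})
df.loc[0, "balance"] = 100_000  # plant one extreme value

numeric_cols = ["age", "balance"]
binary_cols = ["housing"]
multi_cols = ["marital"]

# 1. Outlier removal: drop rows where any numeric column has |z| > 3.
z = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
clean = df[(z.abs() <= 3).all(axis=1)].copy()

# 2. Binary yes/no columns -> 1/0.
for col in binary_cols:
    clean[col] = clean[col].map({"yes": 1, "no": 0})

# 3. Scale numeric columns, one-hot encode multi-valued categoricals,
#    and pass the already-encoded binary columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), multi_cols),
    ],
    remainder="passthrough",
)
X = preprocess.fit_transform(clean)

# 4. Binned variant: discretize continuous columns into categories.
#    Bin edges and labels here are illustrative assumptions.
binned = clean.copy()
binned["age_group"] = pd.cut(
    binned["age"], bins=[0, 30, 45, 60, 120],
    labels=["<=30", "31-45", "46-60", "61+"],
)
binned["balance_q"] = pd.qcut(
    binned["balance"], q=4, labels=["q1", "q2", "q3", "q4"],
)

print(len(df), "->", len(clean), "rows after outlier removal")
print("encoded matrix shape:", X.shape)
```

In a train/test setting, the scaler and encoder would normally be fit on the training split and then applied to the test split, for the same leakage reasons that apply to resampling.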

The core challenge: imbalance and SMOTE

With 88% of clients saying “no”, plain accuracy is a trap. A KNN baseline reached 90% accuracy but only 31% recall on the “yes” class: it missed more than two-thirds (69%) of the actual subscribers, who are exactly the clients the bank cares about. The right lens is the minority class's precision and recall, not the headline accuracy.
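A worked example shows how high accuracy and low minority recall can coexist. The confusion counts below are hypothetical, chosen only to land near the KNN figures quoted above; they are not the project's actual confusion matrix.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical confusion counts for 1,000 test calls (120 true "yes"):
#   true positives 37, false negatives 83,
#   false positives 28, true negatives 852.
tp, fn, fp, tn = 37, 83, 28, 852

# Expand the counts into label vectors (1 = "yes", 0 = "no").
y_true = [1] * tp + [1] * fn + [0] * fp + [0] * tn
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The 852 correctly rejected "no" calls dominate the accuracy figure, while the 83 missed subscribers only show up in recall.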

The remedy applied was SMOTE (Synthetic Minority Over-sampling Technique): instead of duplicating minority examples, it synthesizes new ones by interpolating between a minority example and its nearest minority-class neighbors. Crucially, SMOTE was applied to the training set only, never the test set, to avoid leaking synthetic information into evaluation.

SMOTE consistently shifted the trade-off the same way — more recall on subscribers, at the cost of some precision. On KNN, minority recall jumped from 0.31 to 0.58 while precision fell from 0.57 to 0.42. On Random Forest it balanced out neatly to 0.55 precision / 0.55 recall on the “yes” class. The lesson: SMOTE doesn't conjure free performance — it rebalances which errors the model makes.

The models

Seven classifiers were trained and compared, each evaluated with and without SMOTE: KNN, Random Forest, Logistic Regression, MLP (neural network), Naive Bayes, Decision Tree (Gini and Entropy), and SVM.
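A comparison like this can be run as a simple loop over a dictionary of estimators. The sketch below makes several assumptions: hyperparameters are scikit-learn defaults rather than the project's settings, GaussianNB stands in for the unspecified Naive Bayes variant, the data is synthetic, and only the without-SMOTE pass is shown to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (about 88/12 split).
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    weights=[0.88, 0.12], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

models = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree (Gini)": DecisionTreeClassifier(criterion="gini", random_state=0),
    "Decision Tree (Entropy)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "SVM": SVC(kernel="rbf"),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Use class-1 probabilities when available, otherwise the signed
    # distance to the decision boundary (SVC without probability=True).
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_test)[:, 1]
    else:
        y_score = model.decision_function(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_precision": precision_score(
            y_test, y_pred, average="macro", zero_division=0
        ),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

# Print the models ranked by ROC AUC, best first.
for name, m in sorted(results.items(), key=lambda kv: -kv[1]["roc_auc"]):
    print(f"{name:<24} acc={m['accuracy']:.3f} "
          f"prec={m['macro_precision']:.3f} auc={m['roc_auc']:.3f}")
```

The with-SMOTE pass would repeat the same loop after resampling X_train and y_train, leaving X_test and y_test as they are.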

A practical note on SVM: with an RBF kernel it took several minutes to train on the full 45k rows, so it was evaluated on the smaller bank.csv sample — a real-world reminder that some models don't scale cheaply.

ROC curve for the KNN model (AUC ≈ 0.82). ROC AUC is a far more honest summary than accuracy on imbalanced data, since it weighs performance across all decision thresholds.
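A small sketch of why ROC AUC separates models that accuracy cannot. Both scorers below are hypothetical: one gives every client the same score, the other ranks every subscriber above every non-subscriber but never crosses the default 0.5 threshold.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with the same 88/12 split as the target.
y_true = np.array([0] * 880 + [1] * 120)

# Scorer A: the same low score for everyone, i.e. always "no".
constant_scores = np.full(len(y_true), 0.1)

# Scorer B: ranks every "yes" above every "no", but all scores stay
# below 0.5, so thresholding at 0.5 still predicts "no" for everyone.
ranking_scores = np.where(y_true == 1, 0.4, 0.2)

acc_constant = accuracy_score(y_true, (constant_scores >= 0.5).astype(int))
acc_ranking = accuracy_score(y_true, (ranking_scores >= 0.5).astype(int))
auc_constant = roc_auc_score(y_true, constant_scores)
auc_ranking = roc_auc_score(y_true, ranking_scores)

print(f"constant scorer: accuracy={acc_constant:.2f} auc={auc_constant:.2f}")
print(f"ranking scorer:  accuracy={acc_ranking:.2f} auc={auc_ranking:.2f}")
```

Both scorers get identical accuracy at the 0.5 threshold, yet only the second carries usable ranking information, and only AUC shows it.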

Results

Ranking the models on accuracy, macro precision, and ROC AUC (with SMOTE applied) gives a clear winner:

Model	Accuracy	Precision	ROC AUC
Random Forest (best)	0.908	0.751	0.923
Logistic Regression	0.899	0.727	0.894
KNN	0.875	0.684	0.838
Naive Bayes	0.796	0.620	0.828
Decision Tree (Gini)	0.870	0.654	0.689
Decision Tree (Entropy)	0.870	0.652	0.682

Random Forest leads on every metric (ROC AUC 0.923), with Logistic Regression a close and far simpler runner-up (0.894). KNN and Naive Bayes trail; the two single Decision Trees have decent accuracy but weak AUC, a sign they discriminate the two classes less reliably across thresholds. SVM (on the reduced sample) landed around 0.90 accuracy but with a low minority precision of 0.50.

Takeaways

Step	Skill demonstrated
EDA on 45k rows, 16 features	Reading distributions and target relationships
Outlier removal, scaling, encoding	Preprocessing pipeline
Diagnosing class imbalance	Knowing accuracy lies — reading precision/recall
SMOTE on train only	Resampling without data leakage
Seven-model comparison	Breadth across classifier families
ROC AUC as the yardstick	Choosing the right metric for imbalance

What I take away: on imbalanced data, the model choice matters less than understanding the metric. Every classifier here predicted “no” well and struggled with “yes”; the work was in surfacing that honestly, fixing it with SMOTE where it helped, and picking ROC AUC over accuracy to rank them.