Data Science · Case study · 2026

Predicting mortgage default

An end-to-end case study. From raw Freddie Mac data to a deployed application, through modeling, explainability, and industrialization. This write-up tells the story of the project — the decisions, the pitfalls, and the results.

Python · pandas · scikit-learn · XGBoost · SHAP · Kedro · FAISS · Anthropic Claude · Streamlit


The business problem

When a bank grants a mortgage, it makes a bet: will the borrower repay? Credit risk is the probability that they default. Getting it wrong is costly both ways — too cautious, and you turn away good clients and lose revenue; too lax, and you pile up arrears and capital losses.

The goal of this project: build a model that estimates the probability of default from only the information available at origination — credit score, debt-to-income ratio, down payment, property type, and so on. No future information, because in the real world you decide before you know how the loan will behave.

A defining constraint in banking: the model must be explainable. EU regulation (GDPR, the right to explanation) and EBA guidelines require that a credit refusal can be justified. A high-performing but unexplainable “black box” is unusable in a regulated production setting. This requirement shaped several decisions in the project.

The data: Freddie Mac Single-Family Loan-Level Dataset

Freddie Mac (a U.S. government-sponsored enterprise that buys mortgages on the secondary market) publishes the performance history of millions of mortgages since 1999 as open data. It is an academic and industry reference for credit risk.

The dataset is structured as two files per vintage:

Origination: one row per loan, the characteristics known at origination (FICO, amount, rate, state, etc.).
Performance: one row per loan per month, the repayment history (delinquencies, balance, status).

For this project I used the 2017 and 2018 samples (50,000 loans each, 100,000 total), with performance tracked through September 2025. The vintage choice is deliberate: these loans have been observed long enough (7-8 years) for defaults to materialize, and the window covers the 2020-2021 COVID shock, which enriches the signal.

Building the target

The variable to predict, default, does not exist as such. I built it from the performance file: default = 1 if the loan reached 90+ days delinquency (delinquency ≥ 3) OR ended in foreclosure / repossession (zero-balance codes 03 or 09). Otherwise default = 0.
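As a rough sketch of that rule, the monthly records can be collapsed into one flag per loan. The row layout `(loan_id, delinquency_status, zero_balance_code)` and the function names are assumptions made for illustration, and non-numeric delinquency codes are simply ignored here, which may differ from the project's handling.

```python
from collections import defaultdict

# Hypothetical row layout: (loan_id, delinquency_status, zero_balance_code).
# Values arrive as strings; non-numeric delinquency codes are ignored (assumption).
DEFAULT_ZERO_BALANCE_CODES = {"03", "09"}
DELINQUENCY_THRESHOLD = 3  # 3 missed monthly payments = 90+ days


def is_default_event(delinquency_status, zero_balance_code):
    """Return True if one monthly record meets the default definition."""
    status = delinquency_status.strip()
    seriously_delinquent = status.isdigit() and int(status) >= DELINQUENCY_THRESHOLD
    bad_termination = zero_balance_code.strip() in DEFAULT_ZERO_BALANCE_CODES
    return seriously_delinquent or bad_termination


def build_target(performance_rows):
    """Collapse monthly records into one 0/1 default flag per loan."""
    target = defaultdict(int)
    for loan_id, status, zb_code in performance_rows:
        target[loan_id] |= int(is_default_event(status, zb_code))
    return dict(target)


rows = [
    ("A", "0", ""), ("A", "1", ""), ("A", "0", "01"),  # late once, never reached 90 days
    ("B", "2", ""), ("B", "3", ""),                    # reached 90+ days
    ("C", "0", ""), ("C", "0", "09"),                  # ended with zero-balance code 09
]
print(build_target(rows))
```

A loan is flagged as soon as any of its months qualifies, so a borrower who later cures still counts as a default under this definition.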

Observed default rate: 5.57%. This imbalance (the “default” class is rare) is typical of credit and has consequences for modeling and evaluation.

First pitfall: the samples didn't match

A concrete first obstacle. The Freddie Mac site offers generic “sample files”, but those example origination and performance files share no common loan identifier, so they can't be joined. The fix was to download the real annual files of the Standard Dataset (sample_orig_2017.txt, sample_svcg_2017.txt), where the loans do correspond.

Second pitfall: a silent column shift

Subtler, and instructive. Reading the files (separator |, no header), I first handed pandas a list of 27 column names, while the current format has 32. Pandas' behavior here is silent and misleading: it aligns the names to the last columns and, by default, treats the surplus leading columns as the index, so every name lands on the wrong field. As a result, the column I called dti actually held something else, and 100% of the “numeric” values became missing.

The symptom (100% missing values on columns meant to be full, and only 29 unique identifiers out of 100,000 rows) put me on the trail. The lesson: always validate the number of columns read, never trust silent parsing. I added assert statements on the column count so any future inconsistency fails loudly instead of quietly corrupting the data.
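One way to make that failure loud is to count the fields before parsing, so the check can run ahead of `pandas.read_csv`. This is a minimal sketch, not the project's actual assertion: the function name and the `n_check` parameter are hypothetical, and it assumes the separator never appears inside a quoted field.

```python
import io


def check_field_count(lines, expected_names, sep="|", n_check=100):
    """Fail loudly if any of the first rows has a different field count than the names list."""
    expected = len(expected_names)
    for line_no, line in enumerate(lines, start=1):
        if line_no > n_check:
            break
        n_fields = len(line.rstrip("\n").split(sep))
        assert n_fields == expected, (
            f"line {line_no}: {n_fields} fields in file, {expected} names supplied"
        )
    return expected


# Toy stand-in for an origination file with three fields per row.
sample = io.StringIO("1|2|3\n4|5|6\n")
print(check_field_count(sample, ["a", "b", "c"]))
```

Passing a two-name list against the same three-field sample raises an `AssertionError` that names the offending line, instead of producing a shifted table.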

Exploration: what the data says

Exploratory data analysis (EDA) confirmed the business intuitions and revealed the structure of the signal. The key numeric variables have clear relationships with default:

Default rate by numeric variable. FICO: clear monotone decline (13.5% → 1.7%). DTI rising. LTV flat up to 95%, then a sharp jump.

FICO score: clean, monotonically decreasing relationship. FICO 426-681 → 13.5% default; above 801 → 1.7%. By far the best single predictor.
DTI (debt-to-income): rising. Below 22% → 2.3%; 44-47% → 8.3%.
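LTV (loan-to-value): flat up to 95%, then a sharp jump to ~10%. Mortgage insurance is typically already required above 80% LTV on conventional loans, so the jump past 95% more plausibly reflects borrowers who start with almost no equity.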

The categorical variables carry signal too:

Default rate by categorical variable — First-time homebuyers: 7.5% vs 5%. Second homes: 3.2%. Counterintuitively, two borrowers are less risky than one — a co-borrower acts as a financial safety net.

Geography matters a lot:

Top 15 states by default rate. New York, Hawaii, Louisiana, Florida, Connecticut: 8 to 10% default, i.e. roughly 45% to 80% above the national average (5.57%). Overpriced coastal real estate and disaster-exposed areas.

One redundancy to fix: oltv and ocltv are correlated at 0.99 — they say almost the same thing. I kept ocltv (more complete) and dropped oltv, so as not to muddy the model's interpretation.

Pearson correlation matrix. oltv and ocltv are correlated at 0.99; the redundancy was removed.
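A redundancy check of this kind can be sketched in a few lines. The 0.95 threshold, the function names, and the toy values are all assumptions for illustration; none of the numbers come from the dataset.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Toy values, not taken from the dataset.
toy = {
    "oltv": [60, 75, 80, 90, 95, 97],
    "ocltv": [60, 75, 80, 90, 95, 99],
    "dti": [45, 20, 38, 25, 41, 30],
}
flagged = redundant_pairs(toy)
print(flagged)
```

Which member of a flagged pair to drop remains a judgment call; here the more complete column was kept.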

Feature engineering: injecting domain knowledge

Models learn better when the signal is presented in a usable form. I created 11 derived variables encoding credit-domain knowledge; the table lists 10 of them by family:

Family	Variables	Idea
Risk flags	`is_subprime`, `is_high_ltv`, `is_high_dti`	Recognized regulatory thresholds (CFPB Qualified Mortgage)
Cumulative score	`risk_count` (0 to 3)	How many risk criteria stack up
Interactions	`fico_dti_interaction`, `fico_ltv_interaction`, `dti_ltv_interaction`	Capture profiles combining several weaknesses
Financial effort	`monthly_payment`, `payment_to_upb_ratio`	Real monthly payment via the amortization formula
Pricing	`rate_spread`	Gap to the vintage's average rate — a proxy for the lender's internal scoring

The risk_count variable illustrates the value of this approach well: the default rate goes from 4% (no criterion) to 20% (all three combined). A powerful and perfectly readable signal.
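Two of these families are simple enough to sketch. The amortization formula is the standard fixed-rate one; the three cut-offs below (620, 95, 43) are illustrative assumptions, since the write-up does not state the exact values used.

```python
def monthly_payment(principal, annual_rate_pct, term_months):
    """Fixed-rate amortization: P * r / (1 - (1 + r) ** -n), with r the monthly rate."""
    r = annual_rate_pct / 100 / 12
    if r == 0:
        return principal / term_months
    return principal * r / (1 - (1 + r) ** -term_months)


# Illustrative cut-offs only; the exact thresholds are not given in the write-up.
SUBPRIME_FICO = 620
HIGH_LTV = 95
HIGH_DTI = 43


def risk_flags(fico, ltv, dti):
    """Binary risk flags plus their sum (risk_count, from 0 to 3)."""
    flags = {
        "is_subprime": int(fico < SUBPRIME_FICO),
        "is_high_ltv": int(ltv > HIGH_LTV),
        "is_high_dti": int(dti > HIGH_DTI),
    }
    flags["risk_count"] = sum(flags.values())
    return flags


print(round(monthly_payment(200_000, 4.0, 360), 2))
print(risk_flags(fico=600, ltv=97, dti=50))
```

Because `risk_count` is just the sum of the flags, it stays readable for a credit officer while giving a linear model a compact handle on stacked weaknesses.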

The crucial methodological point here: the order of operations. The train/test split is done before encoding the geographic variables. Why? Because those variables are encoded with target encoding — each state is replaced by its mean default rate. If you computed that rate over the full data, test-set information would “leak” into training (data leakage). The rate is therefore computed on the training set only, with smoothing that pulls under-represented areas toward the global mean.
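The fit-on-train, apply-everywhere pattern can be sketched as follows. The additive smoothing formula and the weight of 20 are assumptions chosen for illustration; the project may use a different scheme.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=20.0):
    """Learn smoothed per-category default rates from TRAINING rows only."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Small categories are pulled toward the global mean by the smoothing weight.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode any split; categories unseen in training fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Toy training data: 50 NY loans, 200 CA loans, 2 WY loans.
train_states = ["NY"] * 50 + ["CA"] * 200 + ["WY"] * 2
train_default = [1] * 5 + [0] * 45 + [1] * 8 + [0] * 192 + [1, 0]
mapping, prior = fit_target_encoding(train_states, train_default)
test_encoded = apply_target_encoding(["NY", "WY", "ZZ"], mapping, prior)
print(mapping, test_encoded)
```

In the toy data, WY has a raw rate of 50% on two loans, but its encoded value lands close to the global mean, which is the intended effect of the smoothing.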

Modeling: simple sometimes beats sophisticated

I trained and compared two models:

Logistic regression — the historical credit-scoring standard: linear, interpretable, validated by regulators.
XGBoost — the state of the art on tabular data: gradient boosting, able to capture complex interactions.

Model	ROC AUC	Average Precision
Logistic regression (chosen for production)	0.740	0.147
XGBoost (challenger + SHAP)	0.735	0.142

ROC and Precision-Recall curves for both models. The two models nearly overlap: logistic regression matches XGBoost.
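For readers less familiar with the two metrics in the table, here is a small from-scratch sketch (in practice scikit-learn's `roc_auc_score` and `average_precision_score` would be used). One reading aid: a ranking with no skill has an Average Precision close to the default rate, so 0.147 should be compared with 5.57%, roughly 2.6 times the baseline. The toy labels and scores are made up, and the Average Precision function assumes no tied scores and at least one defaulter.

```python
def roc_auc(y_true, scores):
    """Probability that a random defaulter is scored above a random non-defaulter."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Ties count for half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k where a defaulter appears (no ties assumed)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


# Made-up labels and scores for eight loans.
y_demo = [0, 0, 1, 0, 1, 0, 0, 0]
s_demo = [0.02, 0.10, 0.35, 0.40, 0.80, 0.05, 0.03, 0.20]
print(roc_auc(y_demo, s_demo), average_precision(y_demo, s_demo))
```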

Logistic regression matches (even slightly beats) XGBoost. Counterintuitive when you expect the “more powerful” model to always win. XGBoost's learning curve showed it was overfitting: excellent on train, no gain on validation. Three reasons explain this ceiling:

1. The relationships are near-linear. FICO ↘ default, DTI ↗ default, LTV ↗ default. Logistic regression captures monotone effects like these well.
2. Feature engineering already digested the signal. The interactions XGBoost might have found, I gave it explicitly.
3. The residual signal is macroeconomic noise. Job loss, COVID, rate hikes: unknown at origination, therefore unpredictable.

An AUC of 0.74 falls within the range usually quoted for credit scoring at origination (0.70-0.85), toward its lower end. The verdict: keep logistic regression in production, since it offers equivalent performance while being interpretable and compliant. XGBoost stays useful for fine-grained analysis via SHAP.

Portfolio lesson: knowing how to recognize that a simple model is enough, and justifying it, beats piling on gratuitous complexity.

The most instructive moment: importance ≠ usefulness

This is the episode I'm most proud of methodologically. In an earlier version, the postal_code variable came out as driver #1 according to the SHAP analysis — well ahead of FICO. Tempting to keep, then.

But on reflection: postal code has enormous cardinality (thousands of zones, many with fewer than 10 loans). Target encoding at that granularity memorizes the training set without generalizing — a disguised geographic overfit.

The decisive test: I removed postal_code and the AUC increased (from 0.72 to 0.735 on XGBoost). The variable wasn't merely useless: it was actively degrading generalization.

Lesson: a variable's importance measures what the model exploits, not its real predictive value. Only validation on unseen data settles it. I kept msa and property_state, coarser and therefore more stable.
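A cheap pre-check can flag this risk before any model is trained: measure how much of a categorical variable lives in tiny categories. The function name and the `min_count` default of 10 (echoing the "fewer than 10 loans" remark) are assumptions for illustration.

```python
from collections import Counter


def sparsity_report(values, min_count=10):
    """Share of categories, and of rows, sitting in categories with fewer than min_count rows."""
    counts = Counter(values)
    sparse = {cat: n for cat, n in counts.items() if n < min_count}
    return {
        "n_categories": len(counts),
        "share_sparse_categories": len(sparse) / len(counts),
        "share_rows_in_sparse": sum(sparse.values()) / len(values),
    }


# Toy data: one large zone and twenty zones with two loans each.
zones = ["10001"] * 60 + [f"z{i}" for i in range(20) for _ in range(2)]
report = sparsity_report(zones)
print(report)
```

A high share of rows in sparse categories suggests that a target-encoded version of the variable will mostly echo individual training labels; the held-out AUC comparison remains the deciding test.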

Explainability: opening the black box with SHAP

For a credit model, explaining why matters as much as predicting how much. SHAP (SHapley Additive exPlanations) decomposes each prediction additively: prediction = base value + each variable's contribution.
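The additive structure is easiest to see on a linear model, where (treating features as independent) each contribution in log-odds reduces to the coefficient times the feature's deviation from its average. The sketch below uses made-up coefficients and values, not the project's model; tree-based explainers used for XGBoost compute contributions differently but satisfy the same additivity.

```python
import math


def sigmoid(z):
    """Map log-odds to a probability."""
    return 1 / (1 + math.exp(-z))


def linear_contributions(weights, intercept, x, background_mean):
    """Additive decomposition of a linear model's log-odds around the average borrower."""
    base_value = intercept + sum(weights[f] * background_mean[f] for f in weights)
    contributions = {f: weights[f] * (x[f] - background_mean[f]) for f in weights}
    return base_value, contributions


# Made-up coefficients and values, for illustration only.
weights = {"credit_score": -0.012, "dti": 0.03, "ocltv": 0.02}
intercept = 3.0
background_mean = {"credit_score": 750.0, "dti": 34.0, "ocltv": 75.0}
loan = {"credit_score": 640.0, "dti": 48.0, "ocltv": 97.0}

base_value, contributions = linear_contributions(weights, intercept, loan, background_mean)
log_odds = base_value + sum(contributions.values())          # base value + contributions
direct = intercept + sum(weights[f] * loan[f] for f in weights)  # model's own log-odds
print(contributions, sigmoid(log_odds))
```

The base value plus the contributions recovers the model's log-odds for the loan, which is the property that lets each refusal be itemized variable by variable.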

SHAP beeswarm plot of global risk drivers. 5 of the top 8 are engineered features: the model reasons over business concepts, not opaque codes.

The main drivers: fico_dti_interaction, msa, number_of_borrowers, fico_ltv_interaction, credit_score. At the individual level, the “waterfall” plot decomposes a specific loan:

SHAP waterfall decomposition of a loan at 62% risk — Decomposition of a loan rated at 62% risk: metro area (+1.29), FICO×LTV (+0.54), FICO×DTI (+0.49), above-market rate (+0.33), single borrower (+0.25). You can literally explain the refusal to the client — exactly what GDPR and the EBA require.

A documentation assistant (RAG)

To make the project usable by a non-specialist, I added a RAG (Retrieval-Augmented Generation) chatbot that answers questions about the model, the variables, and the methodology. The architecture, built by hand to fully master the mechanics:

Indexing — the documentation (data dictionary, model card, methodology) is split into sections, each encoded as a vector by sentence-transformers.
Retrieval — the vectors are stored in a FAISS index; for each question, the closest passages are retrieved (cosine similarity).
Generation — those passages serve as context for Claude (Anthropic), which writes an answer grounded only in the documentation.

Two notable technical points: grounding (the model is instructed to answer only from the context and to say “I don't know” otherwise — an anti-hallucination guardrail), and prompt caching on the system instructions, which cuts the cost and latency of repeated calls.
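The retrieval and grounding steps can be sketched without any external service. Toy three-dimensional vectors stand in for sentence embeddings, a sorted list stands in for the FAISS index, and the prompt wording and function names are hypothetical; the actual application keeps the grounding rules in the system instructions rather than in a single string.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve(query_vec, indexed_sections, k=2):
    """Return the k sections whose vectors are most similar to the query."""
    ranked = sorted(
        indexed_sections, key=lambda s: cosine(query_vec, s["vector"]), reverse=True
    )
    return ranked[:k]


def build_prompt(question, sections):
    """Assemble a grounded prompt: instructions, retrieved context, then the question."""
    context = "\n\n".join(f"[{s['title']}]\n{s['text']}" for s in sections)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy 3-dimensional vectors stand in for real sentence embeddings.
docs = [
    {"title": "Data dictionary", "text": "dti is the debt-to-income ratio.", "vector": [0.9, 0.1, 0.0]},
    {"title": "Model card", "text": "The production model is a logistic regression.", "vector": [0.1, 0.9, 0.2]},
    {"title": "Methodology", "text": "Target encoding is fitted on the training set.", "vector": [0.0, 0.3, 0.9]},
]
top = retrieve([0.2, 0.8, 0.1], docs, k=2)
prompt = build_prompt("Which model is in production?", top)
print(prompt)
```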

Industrialization: from notebook to Kedro pipeline

Notebooks tell the exploration story, but they are neither reproducible nor deployable as-is. I industrialized the data → model path with Kedro (created by QuantumBlack / McKinsey). Two pipelines orchestrated as a dependency graph:

data_processing: loading → target construction → join → feature engineering → split + encoding (5 nodes).
data_science: train logistic regression + XGBoost → evaluation (3 nodes).
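The dependency-graph idea can be illustrated with the standard library alone. This is a toy model of the concept, not Kedro's API: each node declares the datasets it reads and writes, and the run order is derived from those declarations. Node and dataset names are illustrative.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each node declares the datasets it reads and writes; names are illustrative.
NODES = {
    "load": {"inputs": [], "outputs": ["orig_raw", "perf_raw"]},
    "build_target": {"inputs": ["perf_raw"], "outputs": ["target"]},
    "join": {"inputs": ["orig_raw", "target"], "outputs": ["loans"]},
    "engineer_features": {"inputs": ["loans"], "outputs": ["features"]},
    "split_and_encode": {"inputs": ["features"], "outputs": ["train", "test"]},
}


def execution_order(nodes):
    """Order nodes so that each one runs after the producers of its inputs."""
    producer = {out: name for name, spec in nodes.items() for out in spec["outputs"]}
    graph = {name: {producer[i] for i in spec["inputs"]} for name, spec in nodes.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(NODES)
print(order)
```

Declaring inputs and outputs rather than call order is what makes the pipeline replayable: the order is computed, so adding a node cannot silently run it too early.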

Configuration (paths, target thresholds, hyperparameters) is externalized in YAML. A single command, kedro run, replays the whole pipeline. The most satisfying validation: the pipeline reproduces the notebook metrics to three decimal places (LogReg AUC 0.7402 vs 0.7400; XGBoost 0.7352 vs 0.7351). The code shared between notebooks and application was factored into a credit_risk package, giving a single source of truth.
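That comparison can be turned into an automated regression check. In this sketch the tolerance of 5e-4 and the function name are assumptions, as is the attribution of each figure to the notebook or to the pipeline.

```python
import math

# Assumed attribution: the first figure of each pair is the pipeline, the second the notebook.
NOTEBOOK_METRICS = {"logreg_auc": 0.7400, "xgboost_auc": 0.7351}
PIPELINE_METRICS = {"logreg_auc": 0.7402, "xgboost_auc": 0.7352}


def metric_drift(reference, candidate, abs_tol=5e-4):
    """Return the metrics whose values differ by more than abs_tol."""
    return {
        name: (reference[name], candidate[name])
        for name in reference
        if not math.isclose(reference[name], candidate[name], rel_tol=0.0, abs_tol=abs_tol)
    }


drift = metric_drift(NOTEBOOK_METRICS, PIPELINE_METRICS)
print(drift)  # an empty dict means the pipeline matches within tolerance
```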

Deployment: an interactive application

The final deliverable is a two-tab Streamlit application:

Scoring — you enter a borrower profile, you get the default probability and the SHAP explanation in real time. A risky profile (FICO 620, DTI 50, LTV 97, NY, first-time buyer) comes out at ~32%; a safe profile (FICO 800, DTI 20, LTV 60, CA, two borrowers) at ~0.6%.
Assistant — the RAG chatbot, accessible in natural language.

The technical challenge: reconstructing the 39-variable vector the model expects from a raw profile. This requires replaying the preprocessing pipeline exactly at inference time — hence saving a “preprocessor” (encoding mappings, imputation medians, column order). A good MLOps practice: persist the entire transformation chain, not just the model.
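A minimal sketch of such a preprocessor, assuming it is stored as JSON with three parts (medians, encodings, column order). The structure, key names, and values are hypothetical, and the derived features (flags, interactions, payment) are left out for brevity.

```python
import json


def build_feature_vector(raw_profile, preprocessor):
    """Replay saved preprocessing on one raw profile; return values in training order."""
    row = dict(raw_profile)
    # 1. Impute missing numeric fields with medians learned at training time.
    for name, median in preprocessor["medians"].items():
        if row.get(name) is None:
            row[name] = median
    # 2. Replace categories with their saved encodings; unseen values use the fallback.
    for name, spec in preprocessor["encodings"].items():
        row[name] = spec["mapping"].get(row.get(name), spec["fallback"])
    # 3. Emit features in the exact column order the model was trained on.
    return [float(row[name]) for name in preprocessor["column_order"]]


# Made-up values; a real preprocessor would cover all 39 columns.
preprocessor = {
    "medians": {"dti": 36.0, "credit_score": 755.0},
    "encodings": {
        "property_state": {"mapping": {"NY": 0.087, "CA": 0.041}, "fallback": 0.0557}
    },
    "column_order": ["credit_score", "dti", "property_state"],
}
restored = json.loads(json.dumps(preprocessor))  # round-trip as if saved to disk
vector = build_feature_vector(
    {"credit_score": 620, "dti": None, "property_state": "NY"}, restored
)
print(vector)
```

Keeping the column order inside the saved artifact matters as much as the mappings: a model fed the right values in the wrong order fails silently, much like the column shift described earlier.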

Takeaways

This project covers the full chain of a data science use case in banking:

Step	Skill demonstrated
Loading, parsing, joining real data	Data engineering, robustness to format
EDA, detecting redundancy and leakage	Analytical rigor
Domain feature engineering	Credit-domain knowledge
Model comparison, overfit diagnosis	Critical thinking, no gratuitous complexity
The postal_code decision	Deep understanding of validation
SHAP	Explainability, regulatory compliance
Kedro pipeline	MLOps, reproducibility
RAG + Claude	Applied LLM, modern architecture
Streamlit app	Deployment, presentation

What I take away: a good data science project isn't the one that piles on the most complex models, but the one that frames the right problem, treats the data with rigor, justifies its choices through validation, and stays explainable and reproducible end to end.