Data Science · Case study · 2026
Predicting mortgage default
An end-to-end case study. From raw Freddie Mac data to a deployed application, through modeling, explainability, and industrialization. This write-up tells the story of the project — the decisions, the pitfalls, and the results.
01
The business problem
When a bank grants a mortgage, it makes a bet: will the borrower repay? Credit risk is the probability that they default. Getting it wrong is costly both ways — too cautious, and you turn away good clients and lose revenue; too lax, and you pile up arrears and capital losses.
The goal of this project: build a model that estimates the probability of default from only the information available at origination — credit score, debt-to-income ratio, down payment, property type, and so on. No future information, because in the real world you decide before you know how the loan will behave.
One constraint shapes everything in banking: the model must be explainable. EU regulation (GDPR and its so-called right to explanation for automated decisions) and EBA guidelines require a lender to be able to justify a credit refusal. A high-performing but unexplainable “black box” is unusable in a regulated production setting. This requirement shaped several decisions in the project.
02
The data: Freddie Mac Single-Family Loan-Level Dataset
Freddie Mac (a U.S. government-sponsored enterprise that buys mortgages from lenders on the secondary market) publishes the performance history of millions of mortgages originated since 1999 as open data. It is a standard reference for credit risk in both academia and industry.
The dataset is structured as two files per vintage:
- Origination: one row per loan, the characteristics known at origination (FICO, amount, rate, state, etc.).
- Performance: one row per loan per month, the repayment history (delinquencies, balance, status).
For this project I used the 2017 and 2018 samples (50,000 loans each, 100,000 total), with performance tracked through September 2025. The vintage choice is deliberate: these loans have enough hindsight for defaults to be observed (7-8 years), and they span the 2020-2021 COVID shock, which enriches the signal.
Building the target
The variable to predict, default, does not exist as such. I built it from the performance file: default = 1 if the loan ever reached 90+ days of delinquency (delinquency status ≥ 3, i.e. three or more missed monthly payments) OR ended in foreclosure / repossession, identified by zero-balance code 03 (short sale or charge-off) or 09 (REO disposition). Otherwise default = 0.
Observed default rate: 5.57%. This imbalance (the “default” class is rare) is typical of credit and has consequences for modeling and evaluation.
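The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the dictionary keys (loan_id, delinquency_status, zero_balance_code) are hypothetical names, codes are assumed to be zero-padded strings, and non-numeric delinquency statuses are simply skipped, which is a simplification.

```python
# Zero-balance codes treated as default, as stated in the text.
DEFAULT_ZERO_BALANCE_CODES = {"03", "09"}
DELINQUENCY_THRESHOLD = 3  # status >= 3 means 90+ days past due


def build_default_flags(performance_rows):
    """Collapse monthly performance rows into one default flag per loan.

    Each row is a dict with hypothetical keys 'loan_id',
    'delinquency_status' and 'zero_balance_code' (strings, possibly empty).
    """
    flags = {}
    for row in performance_rows:
        loan_id = row["loan_id"]
        flags.setdefault(loan_id, 0)  # every loan appears, even if never late
        status = row.get("delinquency_status", "").strip()
        code = row.get("zero_balance_code", "").strip()
        # Non-numeric status codes are skipped here (a simplification).
        seriously_late = status.isdigit() and int(status) >= DELINQUENCY_THRESHOLD
        if seriously_late or code in DEFAULT_ZERO_BALANCE_CODES:
            flags[loan_id] = 1  # once flagged, a loan stays flagged
    return flags


rows = [
    {"loan_id": "A", "delinquency_status": "0", "zero_balance_code": ""},
    {"loan_id": "A", "delinquency_status": "3", "zero_balance_code": ""},
    {"loan_id": "B", "delinquency_status": "1", "zero_balance_code": ""},
    {"loan_id": "B", "delinquency_status": "0", "zero_balance_code": "01"},
    {"loan_id": "C", "delinquency_status": "2", "zero_balance_code": "09"},
]
labels = build_default_flags(rows)
default_rate = sum(labels.values()) / len(labels)
```

The flag is "sticky": a loan that was 90+ days late once counts as a default even if it later cured, which matches the "ever reached" reading of the rule.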
First pitfall: the samples didn't match
A concrete first obstacle. The Freddie Mac site offers generic “sample files”, but those example origination and performance files share no common loan identifier, so they cannot be joined. The fix was to download the actual annual files of the Standard Dataset (sample_orig_2017.txt, sample_svcg_2017.txt), in which the loans do correspond.
Second pitfall: a silent column shift
Subtler, and instructive. Reading the files (separator |, no header), I first handed pandas a list of 27 column names, while the current format has 32. Pandas raises no error in that situation: it treats the surplus leading columns as the index and assigns the names to the last 27 columns, so every name ends up on the wrong field. As a result, the column I called dti actually held something else, and 100% of the “numeric” values became missing.
The symptom (100% missing values on columns meant to be full, and only 29 unique identifiers out of 100,000 rows) put me on the trail. The lesson: always validate the number of columns read, never trust silent parsing. I added assert statements on the column count so any future inconsistency fails loudly instead of quietly corrupting the data.
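A guard of this kind could look like the sketch below. It checks the raw field count before any parsing happens. Two choices here are assumptions rather than the project's code: it raises ValueError instead of using assert (asserts are skipped when Python runs with -O), and it assumes no field contains the separator.

```python
import io

EXPECTED_COLUMNS = 32  # column count of the current origination format, per the text


def check_column_count(handle, expected, sep="|", max_lines=1000):
    """Raise ValueError if any of the first `max_lines` lines has an unexpected field count."""
    for line_number, line in enumerate(handle, start=1):
        if line_number > max_lines:
            break
        found = line.rstrip("\n").count(sep) + 1
        if found != expected:
            raise ValueError(
                f"line {line_number}: expected {expected} fields, found {found}"
            )


# A well-formed line passes silently.
good = io.StringIO("|".join(["x"] * 32) + "\n")
check_column_count(good, EXPECTED_COLUMNS)

# A 27-field line (the mismatch described above) fails loudly.
bad = io.StringIO("|".join(["x"] * 27) + "\n")
try:
    check_column_count(bad, EXPECTED_COLUMNS)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Running the check on the file handle before handing the path to the parser means a format change stops the pipeline at the first bad line instead of producing a shifted table.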
03
Exploration: what the data says
Exploratory data analysis (EDA) confirmed the business intuitions and revealed the structure of the signal. The key numeric variables have clear relationships with default:

- FICO score: clean, monotonically decreasing relationship. FICO 426-681 → 13.5% default; above 801 → 1.7%. By far the best single predictor.
- DTI (debt-to-income): rising. Below 22% → 2.3%; 44-47% → 8.3%.
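- LTV (loan-to-value): flat up to 95%, then a sharp jump to ~10% default. Above that level the borrower has less than 5% equity in the property; mortgage insurance is typically already required from 80% LTV on conforming loans, so the jump comes on top of that requirement.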
The categorical variables carry signal too:

Geography matters a lot:

One redundancy to fix: oltv and ocltv are correlated at 0.99 — they say almost the same thing. I kept ocltv (more complete) and dropped oltv, so as not to muddy the model's interpretation.

04
Feature engineering: injecting domain knowledge
Models learn better when the signal is presented in a usable form. I created 11 derived variables encoding credit-domain knowledge:
| Family | Variables | Idea |
|---|---|---|
| Risk flags | is_subprime, is_high_ltv, is_high_dti | Recognized regulatory thresholds (CFPB Qualified Mortgage) |
| Cumulative score | risk_count (0 to 3) | How many risk criteria stack up |
| Interactions | fico_dti_interaction, fico_ltv_interaction, dti_ltv_interaction | Capture profiles combining several weaknesses |
| Financial effort | monthly_payment, payment_to_upb_ratio | Real monthly payment via the amortization formula |
| Pricing | rate_spread | Gap to the vintage's average rate — a proxy for the lender's internal scoring |
The risk_count variable illustrates the value of this approach well: the default rate goes from 4% (no criterion) to 20% (all three combined). A powerful and perfectly readable signal.
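Two of the families in the table are simple enough to sketch. The cut-offs below are illustrative assumptions: the write-up names the flags but not their thresholds (620 for FICO, 95 for LTV and 43 for DTI are plausible choices, not confirmed ones). The payment uses the standard fixed-rate amortization formula, payment = P · r / (1 − (1 + r)^−n), with r the monthly rate and n the number of monthly payments.

```python
# Illustrative thresholds: the text names the flags but not their cut-offs.
SUBPRIME_FICO = 620
HIGH_LTV = 95.0
HIGH_DTI = 43.0


def monthly_payment(principal, annual_rate_pct, n_months):
    """Fixed monthly payment of a fully amortizing loan."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        return principal / n_months  # zero-interest edge case
    return principal * r / (1 - (1 + r) ** -n_months)


def risk_features(fico, ltv, dti):
    """Three binary risk flags plus their sum (0 to 3)."""
    flags = {
        "is_subprime": int(fico < SUBPRIME_FICO),
        "is_high_ltv": int(ltv > HIGH_LTV),
        "is_high_dti": int(dti > HIGH_DTI),
    }
    flags["risk_count"] = sum(flags.values())
    return flags


payment = monthly_payment(200_000, 6.0, 360)   # about 1199.10 per month
risky = risk_features(fico=600, ltv=97, dti=50)
safe = risk_features(fico=780, ltv=60, dti=20)
```

With these assumed thresholds, the risky profile gets risk_count = 3 and the safe one gets 0.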
The crucial methodological point here: the order of operations. The train/test split is done before encoding the geographic variables. Why? Because those variables are encoded with target encoding — each state is replaced by its mean default rate. If you computed that rate over the full data, test-set information would “leak” into training (data leakage). The rate is therefore computed on the training set only, with smoothing that pulls under-represented areas toward the global mean.
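The fit-on-train, apply-everywhere pattern can be sketched as follows. The additive smoothing formula, (sum of defaults + m × global rate) / (count + m), is one common choice and an assumption here, since the write-up does not say which smoothing it used; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=20.0):
    """Learn smoothed per-category default rates from TRAINING data only."""
    global_mean = sum(targets) / len(targets)
    counts, sums = defaultdict(int), defaultdict(float)
    for cat, y in zip(categories, targets):
        counts[cat] += 1
        sums[cat] += y
    # Small categories are pulled toward the global mean by `smoothing` pseudo-rows.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode any split; categories unseen in training fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Toy training set: 8 loans in CA (1 default), 2 loans in NY (2 defaults).
train_states = ["CA"] * 8 + ["NY"] * 2
train_default = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
mapping, prior = fit_target_encoding(train_states, train_default, smoothing=2.0)

# The test set is encoded with the training mapping; it never contributes to it.
test_encoded = apply_target_encoding(["CA", "NY", "TX"], mapping, prior)
```

In this toy case NY has a raw default rate of 100% on two loans, but its encoded value is 0.65 because the smoothing pulls it toward the global rate of 0.30. TX, absent from training, receives the global rate.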
05
Modeling: simple sometimes beats sophisticated
I trained and compared two models:
- Logistic regression — the historical credit-scoring standard: linear, interpretable, validated by regulators.
- XGBoost — the state of the art on tabular data: gradient boosting, able to capture complex interactions.
| Model | ROC AUC | Average Precision |
|---|---|---|
| Logistic regression (chosen for production) | 0.740 | 0.147 |
| XGBoost (challenger + SHAP) | 0.735 | 0.142 |

Logistic regression matches (even slightly beats) XGBoost. Counterintuitive when you expect the “more powerful” model to always win. XGBoost's learning curve showed it was overfitting: excellent on train, no gain on validation. Three reasons explain this ceiling:
1. The relationships are near-linear. FICO ↘ default, DTI ↗ default, LTV ↗ default. Logistic regression models that perfectly.
2. Feature engineering already digested the signal. The interactions XGBoost might have found, I gave it explicitly.
3. The residual signal is macroeconomic noise. Job loss, COVID, rate hikes: all unknown at origination, therefore unpredictable.
An AUC of 0.74 sits squarely in the professional credit-scoring range (0.70-0.85). The verdict: keep logistic regression in production — equivalent performance, but interpretable and compliant. XGBoost stays useful for fine-grained analysis via SHAP.
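To make the two columns of the comparison table concrete, here is a from-scratch sketch of both metrics (an illustration, not the code used in the project). ROC AUC is the probability that a randomly chosen defaulter is scored above a randomly chosen non-defaulter. Average precision averages the precision measured at the rank of each true defaulter. A random ranking scores about 0.5 on the first and about the default rate on the second, so an average precision of 0.147 is best read against a base rate near 5.57%, assuming the test set has roughly the overall default rate. The labels and scores below are made up.

```python
def roc_auc(y_true, scores):
    """Probability that a random defaulter outranks a random non-defaulter (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision-at-rank over the ranks of the true defaulters (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` loans
    return total / hits


y = [0, 0, 1, 0, 1, 0]
p = [0.10, 0.40, 0.35, 0.05, 0.80, 0.20]
auc = roc_auc(y, p)            # 7 of 8 (defaulter, non-defaulter) pairs are ordered correctly
ap = average_precision(y, p)   # defaulters sit at ranks 1 and 3
```

The pairwise AUC loop is quadratic and only meant for reading; on 100,000 loans a rank-based implementation from a library is the practical choice.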
Portfolio lesson: knowing how to recognize that a simple model is enough, and justifying it, beats piling on gratuitous complexity.
06
The most instructive moment: importance ≠ usefulness
This is the episode I'm most proud of methodologically. In an earlier version, the postal_code variable came out as driver #1 according to the SHAP analysis — well ahead of FICO. Tempting to keep, then.
But on reflection: postal code has enormous cardinality (thousands of zones, many with fewer than 10 loans). Target encoding at that granularity memorizes the training set without generalizing — a disguised geographic overfit.
The decisive test: I removed postal_code and the AUC increased (from 0.72 to 0.735 on XGBoost). The variable wasn't merely useless: it was actively degrading generalization.
Lesson: a variable's importance measures what the model exploits, not its real predictive value. Only validation on unseen data settles it. I kept msa and property_state, coarser and therefore more stable.
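The cardinality problem can be spotted before any model is trained. The sketch below measures how much of a categorical column sits in sparsely populated categories; the function name and the toy data are hypothetical, and the threshold of 10 rows simply echoes the "fewer than 10 loans" remark above.

```python
from collections import Counter


def sparse_category_share(values, min_count=10):
    """Return (share of categories, share of rows) found in categories with < min_count rows."""
    counts = Counter(values)
    sparse = {cat: n for cat, n in counts.items() if n < min_count}
    return len(sparse) / len(counts), sum(sparse.values()) / len(values)


# Toy data: one dense zone and 25 zones with 2 loans each, mimicking a postal-code-like column.
zones = ["Z000"] * 50 + [f"Z{i:03d}" for i in range(1, 26) for _ in range(2)]
cat_share, row_share = sparse_category_share(zones)
```

Here 25 of the 26 zones are sparse and they hold half of the rows: a target encoding fitted on such a column would mostly store per-loan outcomes. The diagnostic is only a warning sign; the decisive evidence remains the validation score with and without the variable.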
07
Explainability: opening the black box with SHAP
For a credit model, explaining why matters as much as predicting how much. SHAP (SHapley Additive exPlanations) decomposes each prediction additively: prediction = base value + each variable's contribution.
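The additive property is easiest to see on a linear model, where the decomposition has a closed form. Under the usual feature-independence assumption, the contribution of feature i to the log-odds is w_i · (x_i − mean_i), and the base value is the model output at the mean profile. The sketch below is an illustration of that identity, not the tree-based SHAP computation applied to XGBoost in the project; the coefficients and the profile are invented.

```python
import math


def linear_shap(weights, intercept, background_means, x):
    """Additive decomposition of a linear model's log-odds output.

    Under feature independence, feature i contributes w_i * (x_i - mean_i);
    the base value is the model output at the mean profile.
    """
    base_value = intercept + sum(w * m for w, m in zip(weights, background_means))
    contributions = [w * (xi - m) for w, xi, m in zip(weights, x, background_means)]
    return base_value, contributions


# Hypothetical coefficients for standardized (fico, dti, ltv); not the project's fitted values.
weights = [-0.8, 0.4, 0.3]
intercept = -3.0
means = [0.0, 0.0, 0.0]          # standardized features have mean 0
profile = [-1.5, 1.0, 2.0]       # low FICO, high DTI, high LTV

base, contribs = linear_shap(weights, intercept, means, profile)
log_odds = base + sum(contribs)  # prediction = base value + contributions
probability = 1 / (1 + math.exp(-log_odds))
```

The sum of the base value and the contributions reproduces the model's log-odds exactly, which is the property a waterfall plot visualizes. Note that the decomposition is additive in log-odds space, not in probability space.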

The main drivers: fico_dti_interaction, msa, number_of_borrowers, fico_ltv_interaction, credit_score. At the individual level, the “waterfall” plot decomposes a specific loan:

08
A documentation assistant (RAG)
To make the project usable by a non-specialist, I added a RAG (Retrieval-Augmented Generation) chatbot that answers questions about the model, the variables, and the methodology. The architecture, built by hand to fully master the mechanics:
- Indexing — the documentation (data dictionary, model card, methodology) is split into sections, each encoded as a vector by sentence-transformers.
- Retrieval — the vectors are stored in a FAISS index; for each question, the closest passages are retrieved (cosine similarity).
- Generation — those passages serve as context for Claude (Anthropic), which writes an answer grounded only in the documentation.
Two notable technical points: grounding (the model is instructed to answer only from the context and to say “I don't know” otherwise — an anti-hallucination guardrail), and prompt caching on the system instructions, which cuts the cost and latency of repeated calls.
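The retrieval and grounding steps can be sketched without any external service. In this minimal illustration, toy 3-dimensional vectors stand in for sentence-transformers embeddings, a sorted list stands in for the FAISS index, and the prompt wording is hypothetical; the call to the language model itself is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, indexed_sections, k=2):
    """Return the k section texts whose vectors are closest to the query."""
    ranked = sorted(
        indexed_sections, key=lambda item: cosine(query_vec, item[0]), reverse=True
    )
    return [text for _, text in ranked[:k]]


def build_prompt(question, passages):
    """Assemble a grounded prompt: answer from the context or say 'I don't know'."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, reply \"I don't know\".\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy index: (vector, section text) pairs standing in for embedded documentation.
index = [
    ([0.9, 0.1, 0.0], "credit_score: borrower FICO score at origination."),
    ([0.1, 0.9, 0.0], "The target is 1 when a loan reaches 90+ days delinquency."),
    ([0.0, 0.2, 0.9], "The pipeline is run with a single command."),
]
passages = retrieve([0.8, 0.2, 0.1], index, k=2)
prompt = build_prompt("What does credit_score mean?", passages)
```

Only the retrieved passages reach the prompt, so the third section never enters the context for this question. Keeping the instruction part of the prompt identical across calls is also what makes it a good candidate for caching.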
09
Industrialization: from notebook to Kedro pipeline
Notebooks tell the exploration story, but they are neither reproducible nor deployable as-is. I industrialized the data → model path with Kedro (created by QuantumBlack / McKinsey). Two pipelines orchestrated as a dependency graph:
- data_processing: loading → target construction → join → feature engineering → split + encoding (5 nodes).
- data_science: train logistic regression + XGBoost → evaluation (3 nodes).
Configuration (paths, target thresholds, hyperparameters) is externalized in YAML. A single command, kedro run, replays the whole pipeline. The most satisfying validation: the pipeline reproduces the notebook metrics to within 0.0002 AUC (LogReg 0.7402 vs 0.7400; XGBoost 0.7352 vs 0.7351). The code shared between notebooks and application was factored into a credit_risk package, giving a single source of truth.
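A reproducibility claim like this is easy to turn into an automated check. The sketch below compares two sets of metrics within a tolerance; the AUC values are the ones quoted above, while the tolerance of 5e-4, the function name and the assignment of each value to notebook or pipeline are assumptions.

```python
import math

# AUC values quoted in the text; which side is the notebook is an assumption.
NOTEBOOK_AUC = {"logreg": 0.7400, "xgboost": 0.7351}
PIPELINE_AUC = {"logreg": 0.7402, "xgboost": 0.7352}


def metrics_match(reference, candidate, abs_tol=5e-4):
    """True when every reference metric is reproduced within abs_tol."""
    return all(
        name in candidate and math.isclose(value, candidate[name], abs_tol=abs_tol)
        for name, value in reference.items()
    )


reproduced = metrics_match(NOTEBOOK_AUC, PIPELINE_AUC)
```

A check of this kind can run at the end of the evaluation step, so that a refactoring which silently changes the data or the split shows up as a failed comparison rather than as a slightly different number in a report.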
10
Deployment: an interactive application
The final deliverable is a two-tab Streamlit application:
- Scoring — you enter a borrower profile, you get the default probability and the SHAP explanation in real time. A risky profile (FICO 620, DTI 50, LTV 97, NY, first-time buyer) comes out at ~32%; a safe profile (FICO 800, DTI 20, LTV 60, CA, two borrowers) at ~0.6%.
- Assistant — the RAG chatbot, accessible in natural language.
The technical challenge: reconstructing the 39-variable vector the model expects from a raw profile. This requires replaying the preprocessing pipeline exactly at inference time — hence saving a “preprocessor” (encoding mappings, imputation medians, column order). A good MLOps practice: persist the entire transformation chain, not just the model.
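What such a preprocessor has to carry can be sketched with a small dataclass. This is an illustration under assumptions, not the project's implementation: the class layout, the JSON storage and every numeric value are hypothetical, and the example uses 3 columns instead of 39 to stay readable.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Preprocessor:
    """Everything needed to turn a raw profile into the model's input vector."""
    column_order: list       # exact feature order the model was trained on
    medians: dict            # imputation value per numeric column
    encodings: dict          # column -> {category: encoded value}
    encoding_default: float  # fallback for categories unseen in training

    def transform(self, profile):
        """Impute, encode and order one raw profile (a dict) into a list of floats."""
        row = []
        for col in self.column_order:
            value = profile.get(col)
            if col in self.encodings:
                value = self.encodings[col].get(value, self.encoding_default)
            elif value is None:
                value = self.medians[col]
            row.append(float(value))
        return row

    def to_json(self):
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload):
        return cls(**json.loads(payload))


# Hypothetical fitted values, saved at training time and reloaded at inference time.
prep = Preprocessor(
    column_order=["credit_score", "dti", "property_state"],
    medians={"credit_score": 750.0, "dti": 35.0},
    encodings={"property_state": {"CA": 0.04, "NY": 0.07}},
    encoding_default=0.0557,
)
restored = Preprocessor.from_json(prep.to_json())
vector = restored.transform({"credit_score": 620, "property_state": "NY"})
```

The missing dti is imputed with the stored median and the state is replaced by its stored encoding, so the application never recomputes statistics at inference time. Storing the column order explicitly matters as much as the values: a model fed the right numbers in the wrong order fails silently.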
11
Takeaways
This project covers the full chain of a data science use case in banking:
| Step | Skill demonstrated |
|---|---|
| Loading, parsing, joining real data | Data engineering, robustness to format |
| EDA, detecting redundancy and leakage | Analytical rigor |
| Domain feature engineering | Credit-domain knowledge |
| Model comparison, overfit diagnosis | Critical thinking, no gratuitous complexity |
| The postal_code decision | Deep understanding of validation |
| SHAP | Explainability, regulatory compliance |
| Kedro pipeline | MLOps, reproducibility |
| RAG + Claude | Applied LLM, modern architecture |
| Streamlit app | Deployment, presentation |
What I take away: a good data science project isn't the one that piles on the most complex models, but the one that frames the right problem, treats the data with rigor, justifies its choices through validation, and stays explainable and reproducible end to end.