Claude Code for Credit Scoring Models (2026)
Why Claude Code for Credit Scoring
Credit scoring models must satisfy both statistical rigor and regulatory requirements. Scorecards built for Basel II/III IRB use need documented variable selection, Weight of Evidence (WoE) transformation, scorecard scaling, and validation using the GINI coefficient, the Kolmogorov-Smirnov (KS) statistic, and the population stability index (PSI). Regulators (OCC, PRA, ECB) audit the entire model lifecycle: development data, feature engineering decisions, model performance, and ongoing monitoring. Getting the WoE binning wrong, for example by merging monotonicity-violating bins without justification, can result in an MRA (Matter Requiring Attention) finding.
Claude Code generates credit scorecard pipelines that follow regulatory best practices: coarse classing with monotone WoE, information value (IV) feature selection, logistic regression with interpretable coefficients, and the full validation suite that model risk management requires.
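For reference, here are the quantities named above, written in the convention the code in this article follows. Within bin $i$ of a variable, $g_i$ and $b_i$ are the counts of goods and bads, and $G$ and $B$ are their totals, so higher WoE means lower risk. $F_{\text{good}}$ and $F_{\text{bad}}$ are the cumulative score distributions of goods and bads. $e_j$ and $a_j$ are the expected (development) and actual shares of accounts in score band $j$. The scaling lines assume the usual points-to-double-odds (PDO) convention, with odds expressed as good:bad.

```latex
\begin{aligned}
\mathrm{WoE}_i &= \ln\!\left(\frac{g_i/G}{b_i/B}\right),
&\quad \mathrm{IV} &= \sum_i \left(\frac{g_i}{G} - \frac{b_i}{B}\right)\mathrm{WoE}_i, \\
\mathrm{Gini} &= 2\,\mathrm{AUC} - 1,
&\quad \mathrm{KS} &= \max_s \left| F_{\text{good}}(s) - F_{\text{bad}}(s) \right|, \\
\mathrm{PSI} &= \sum_j (a_j - e_j)\,\ln\frac{a_j}{e_j},
&\quad \text{Score} &= \text{Offset} + \text{Factor}\cdot\ln(\text{odds}), \\
\text{Factor} &= \frac{\text{PDO}}{\ln 2},
&\quad \text{Offset} &= \text{BaseScore} - \text{Factor}\cdot\ln(\text{BaseOdds}).
\end{aligned}
```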
The Workflow
Step 1: Credit Modeling Setup
```bash
pip install numpy pandas scikit-learn scipy
pip install optbinning # optimal WoE binning
pip install scorecardpy # scorecard development toolkit
pip install matplotlib seaborn
mkdir -p src/scorecard src/validation data/ reports/
```
Step 2: Build WoE Scorecard
```python
# src/scorecard/woe_scorecard.py
"""Credit scorecard: WoE transformation, logistic regression, scaling."""
import warnings
from dataclasses import dataclass

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


@dataclass
class WoEBin:
    variable: str
    bin_edges: list
    woe_values: list
    iv: float  # Information Value


@dataclass
class ScorecardResult:
    model: LogisticRegression
    woe_bins: list
    gini: float
    ks: float
    base_score: int
    pdo: int  # points to double odds
    base_odds: float = 50.0  # good:bad odds anchored at base_score


def compute_woe_iv(feature: pd.Series, target: pd.Series,
                   n_bins: int = 10) -> WoEBin:
    """Compute Weight of Evidence and Information Value for a feature.

    WoE = ln(%good / %bad), so higher WoE means lower risk. Quantile bins
    do not enforce monotone WoE; check the trend before sign-off.
    """
    # Create bins; retbins=True returns the exact edges (category labels are rounded)
    try:
        bins, edges = pd.qcut(feature, n_bins, duplicates='drop', retbins=True)
    except ValueError:
        bins, edges = pd.cut(feature, n_bins, duplicates='drop', retbins=True)

    # Bad/good counts per bin; observed=False keeps empty bins so that
    # WoE positions line up with the bin codes used in build_scorecard
    target = pd.Series(np.asarray(target), index=feature.index)
    n_bad = target.groupby(bins, observed=False).sum()
    n_good = target.groupby(bins, observed=False).count() - n_bad
    total_good, total_bad = n_good.sum(), n_bad.sum()
    if total_good == 0 or total_bad == 0:
        return WoEBin(feature.name, [], [], 0.0)  # single class: WoE undefined

    pct_good = n_good / total_good
    pct_bad = n_bad / total_bad

    # Floor shares at 1e-4 to avoid log(0) (a clip, not Laplace smoothing)
    pct_good = pct_good.clip(lower=0.0001)
    pct_bad = pct_bad.clip(lower=0.0001)
    woe = np.log(pct_good / pct_bad)

    # Information Value
    iv = ((pct_good - pct_bad) * woe).sum()

    return WoEBin(
        variable=feature.name,
        bin_edges=[float(e) for e in edges],
        woe_values=woe.values.tolist(),
        iv=float(iv)
    )


def select_features_by_iv(features: pd.DataFrame,
                          target: pd.Series,
                          iv_threshold: float = 0.02,
                          max_features: int = 15) -> list:
    """Select features based on Information Value.

    IV rule of thumb: <0.02 unpredictive, 0.02-0.1 weak, 0.1-0.3 medium,
    >0.3 strong; >0.5 is suspicious (possible leakage) and is excluded.
    """
    iv_scores = {}
    for col in features.columns:
        woe_bin = compute_woe_iv(features[col], target)
        iv_scores[col] = woe_bin.iv

    # Sort by IV descending
    ranked = sorted(iv_scores.items(), key=lambda x: x[1], reverse=True)

    # Keep threshold <= IV <= 0.5; higher IV suggests a leak and needs review
    selected = [name for name, iv in ranked
                if iv_threshold <= iv <= 0.5][:max_features]
    assert len(selected) >= 3, f"Only {len(selected)} features pass IV filter"
    return selected


def build_scorecard(X_train: pd.DataFrame, y_train: pd.Series,
                    base_score: int = 600,
                    pdo: int = 20,
                    base_odds: float = 50.0) -> ScorecardResult:
    """Build logistic regression scorecard with WoE features."""
    # WoE transform all features
    woe_bins = []
    X_woe = pd.DataFrame(index=X_train.index)
    for col in X_train.columns:
        woe_bin = compute_woe_iv(X_train[col], y_train)
        woe_bins.append(woe_bin)
        # Re-bin with the fitted edges; open outer edges catch out-of-range values
        bins = pd.cut(X_train[col],
                      bins=[-np.inf] + woe_bin.bin_edges[1:-1] + [np.inf])
        bin_to_woe = dict(enumerate(woe_bin.woe_values))
        # Missing values (code -1) get WoE 0, i.e. neutral evidence
        X_woe[col] = bins.cat.codes.map(bin_to_woe).fillna(0)

    # Fit logistic regression on P(default). No class reweighting: scaling
    # scores to base_odds needs an intercept calibrated to the real default rate
    model = LogisticRegression(
        penalty='l2', C=1.0, solver='lbfgs', max_iter=1000
    )
    model.fit(X_woe, y_train)

    # Validate signs: higher WoE = more good -> lower default probability,
    # so every coefficient on a WoE-transformed feature should be negative
    wrong_sign = [c for c, b in zip(X_woe.columns, model.coef_[0]) if b >= 0]
    if wrong_sign:
        warnings.warn(f"Non-negative coefficients for WoE features: {wrong_sign}")

    prob = model.predict_proba(X_woe)[:, 1]
    gini = compute_gini(y_train, prob)
    ks = compute_ks(y_train, prob)
    return ScorecardResult(model, woe_bins, gini, ks, base_score, pdo, base_odds)


def compute_gini(y_true: pd.Series, y_prob: np.ndarray) -> float:
    """Gini coefficient = 2 * AUC - 1."""
    auc = roc_auc_score(y_true, y_prob)
    return float(2 * auc - 1)


def compute_ks(y_true: pd.Series, y_prob: np.ndarray) -> float:
    """Kolmogorov-Smirnov statistic: max separation between the cumulative
    score distributions of goods and bads."""
    # Plain arrays avoid index alignment between a Series and an ndarray
    df = pd.DataFrame({'prob': np.asarray(y_prob), 'target': np.asarray(y_true)})
    df = df.sort_values('prob')
    total_good = (df['target'] == 0).sum()
    total_bad = (df['target'] == 1).sum()
    cum_good = (df['target'] == 0).cumsum() / total_good
    cum_bad = (df['target'] == 1).cumsum() / total_bad
    ks = float(np.max(np.abs(cum_good - cum_bad)))
    return ks
```
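build_scorecard records base_score and pdo (and accepts base_odds) but stops before turning coefficients into points. The sketch below shows the standard points-to-double-odds scaling under stated assumptions. It assumes the model predicts P(default) from WoE inputs with WoE = ln(%good/%bad), as in compute_woe_iv. It spreads the intercept evenly across features, which is one common convention. `scaling_constants` and `attribute_points` are hypothetical helpers, not part of the module above. Because the offset is anchored to base_odds, the intercept must reflect the real default rate. A model fitted with reweighted classes (for example `class_weight='balanced'`) would need its intercept corrected first.

```python
import math


def scaling_constants(base_score: float = 600, pdo: float = 20,
                      base_odds: float = 50.0) -> tuple:
    """Return (factor, offset) with score = offset + factor * ln(good:bad odds)."""
    factor = pdo / math.log(2)  # points added each time the odds double
    offset = base_score - factor * math.log(base_odds)  # base_odds maps to base_score
    return factor, offset


def attribute_points(intercept: float, coefs: dict, woe_values: dict,
                     base_score: float = 600, pdo: float = 20,
                     base_odds: float = 50.0) -> dict:
    """Points for every bin of every feature in a P(default) model on WoE inputs.

    ln(good:bad odds) = -(intercept + sum_j coef_j * woe_j), so bin i of
    feature j is worth -(coef_j * woe_ij + intercept / n) * factor + offset / n,
    where n is the number of features (intercept and offset split evenly).
    """
    factor, offset = scaling_constants(base_score, pdo, base_odds)
    n = len(coefs)
    return {
        name: [-(coefs[name] * w + intercept / n) * factor + offset / n
               for w in woe_values[name]]
        for name in coefs
    }


# Toy inputs (made up): intercept chosen so WoE = 0 everywhere gives odds 50:1
points = attribute_points(
    intercept=-math.log(50),
    coefs={"debt_ratio": -0.9, "num_delinquencies": -0.7},
    woe_values={"debt_ratio": [0.6, 0.0, -0.8],
                "num_delinquencies": [0.3, 0.0, -1.1]},
)
for name, pts in points.items():
    print(name, [round(p, 1) for p in pts])
```

With a fitted ScorecardResult, the inputs map to `result.model.intercept_[0]`, `dict(zip(feature_names, result.model.coef_[0]))` and `{b.variable: b.woe_values for b in result.woe_bins}`. An applicant's score is then the sum of the points for the bins they fall into.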
Step 3: Model Validation
```python
# src/validation/model_validation.py
"""Credit model validation: PSI, GINI stability, backtesting."""
import numpy as np
import pandas as pd


def population_stability_index(expected: np.ndarray,
                               actual: np.ndarray,
                               n_bins: int = 10) -> float:
    """PSI: measures shift in score distribution over time.

    PSI < 0.1: no significant shift
    PSI 0.1-0.25: moderate shift, investigate
    PSI > 0.25: significant shift, rebuild model
    """
    breakpoints = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    breakpoints[0] = -np.inf
    breakpoints[-1] = np.inf

    expected_pct = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=breakpoints)[0] / len(actual)

    # Avoid log(0)
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)

    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return float(psi)
```
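The validation module's docstring mentions backtesting, and the CLAUDE.md standards list Hosmer-Lemeshow calibration and predicted vs actual default rates by score band, but only PSI is implemented here. Below is a minimal sketch that assumes predicted PDs as input. Because the score is a monotone transform of the PD, PD-sorted bands coincide with score bands. `backtest_by_band` is a hypothetical helper name. The equal-count bands and the chi-square reference with `n_bands - 2` degrees of freedom follow the usual Hosmer-Lemeshow setup on a development sample. The sketch also assumes at least `n_bands` accounts.

```python
import numpy as np
from scipy.stats import chi2


def backtest_by_band(pd_pred, defaults, n_bands: int = 10):
    """Predicted vs observed default rates in equal-count bands sorted by PD.

    Returns (rows, hl_stat, p_value); each row is
    (accounts, mean predicted PD, observed default rate).
    """
    pd_pred = np.asarray(pd_pred, dtype=float)
    defaults = np.asarray(defaults, dtype=float)
    order = np.argsort(pd_pred, kind="stable")  # lowest PD (best scores) first

    rows, hl_stat = [], 0.0
    for idx in np.array_split(order, n_bands):
        n = len(idx)
        expected = pd_pred[idx].sum()   # expected defaults in the band
        observed = defaults[idx].sum()  # actual defaults in the band
        p_bar = expected / n
        rows.append((n, p_bar, observed / n))
        # Hosmer-Lemeshow term (O - E)^2 / (n * p * (1 - p)); clip guards p = 0 or 1
        p_safe = min(max(p_bar, 1e-12), 1 - 1e-12)
        hl_stat += (observed - expected) ** 2 / (n * p_safe * (1 - p_safe))

    # Development-sample convention: chi-square with n_bands - 2 degrees of freedom
    p_value = float(chi2.sf(hl_stat, n_bands - 2))
    return rows, float(hl_stat), p_value


# Toy data (made up): outcomes drawn from the PDs, so calibrated by construction
rng = np.random.default_rng(0)
pd_hat = rng.uniform(0.01, 0.30, 5000)
outcome = (rng.uniform(size=5000) < pd_hat).astype(int)
rows, hl_stat, p_value = backtest_by_band(pd_hat, outcome)
for n, pred, obs in rows:
    print(f"{n:5d}  predicted {pred:.3f}  observed {obs:.3f}")
print(f"HL = {hl_stat:.2f}, p = {p_value:.3f}")
```

A small p-value signals that observed default rates depart from the predicted ones in at least some bands. The per-band rows show where the departure occurs.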
Step 4: Verify
```bash
python3 -c "
import numpy as np
import pandas as pd
from src.scorecard.woe_scorecard import (
compute_woe_iv, select_features_by_iv, build_scorecard
)
np.random.seed(42)
N = 5000
# Synthetic credit data
X = pd.DataFrame({
'income': np.random.lognormal(10.5, 0.5, N),
'debt_ratio': np.random.beta(2, 5, N),
'credit_age_months': np.random.exponential(60, N),
'num_delinquencies': np.random.poisson(0.5, N),
'utilization': np.random.beta(3, 7, N),
})
# Default probability increases with debt_ratio and delinquencies
logit = -3 + 2*X['debt_ratio'] + 0.5*X['num_delinquencies'] - 0.01*X['income']/1000
prob = 1 / (1 + np.exp(-logit))
y = (np.random.rand(N) < prob).astype(int)
print(f'Default rate: {y.mean():.2%}')
# Feature selection
selected = select_features_by_iv(X, y)
print(f'Selected features: {selected}')
# Build scorecard
result = build_scorecard(X[selected], y)
print(f'GINI: {result.gini:.3f}')
print(f'KS: {result.ks:.3f}')
assert result.gini > 0.3, f'GINI too low: {result.gini}'
assert result.ks > 0.2, f'KS too low: {result.ks}'
print('Credit scorecard: PASS')
"
```
CLAUDE.md for Credit Scoring
```markdown
# Credit Scoring Model Development
## Regulatory Framework
- Basel II/III IRB approach
- SR 11-7 / OCC Bulletin 2011-12 (US model risk management guidance)
- SS1/23 (PRA Model Risk Management)
- EBA Guidelines on loan origination and monitoring (creditworthiness assessment)
## Model Development Standards
- WoE transformation with monotone bins (no non-monotone WoE allowed)
- Information Value for feature selection (IV: 0.02-0.5 range)
- Logistic regression for interpretability (required by most regulators)
- GINI > 0.40 for application scorecards, > 0.30 for behavioral
## Validation Metrics
- GINI coefficient (discrimination power)
- KS statistic (rank ordering)
- PSI (population stability, < 0.10 green)
- Hosmer-Lemeshow (calibration)
- Backtesting: predicted vs actual default rates by score band
## Libraries
- optbinning (optimal WoE binning with constraints)
- scorecardpy (scorecard development)
- scikit-learn (logistic regression)
- statsmodels (statistical tests)
## Common Commands
- `python3 src/scorecard/woe_scorecard.py` (build scorecard)
- `python3 src/validation/model_validation.py` (run validation)
```
Common Pitfalls
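- Non-monotone WoE bins: Regulators reject models where WoE is not monotonic across a variable's ordered bins (for example, WoE should fall steadily as debt ratio rises). Merging adjacent bins post hoc to fix this without economic justification is also flagged. Claude Code uses constrained optimal binning (optbinning) that enforces monotonicity during bin construction; the quantile binning in compute_woe_iv above is a simplified starting point and does not enforce it.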
- Suspiciously high IV features: IV > 0.5 usually indicates data leakage (e.g., a delinquency flag that is set after default). Claude Code flags features with IV > 0.5 and requires manual review before inclusion.
- Missing PSI monitoring: A model deployed without ongoing score distribution monitoring degrades silently as the applicant population shifts. Claude Code generates monthly PSI reports and triggers alerts when PSI exceeds 0.10.
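The quantile bins in compute_woe_iv do not guarantee the monotone WoE that the first pitfall describes. optbinning enforces monotonicity through a constrained optimization. As a lighter, dependency-free illustration, the sketch below swaps in a different, standard technique: fine quantile classing followed by pool-adjacent-violators (PAV) merging, the same pooling step used in isotonic regression. `monotone_woe_bins` is a hypothetical helper, not optbinning's API. Because it still merges adjacent bins, the trend direction needs the documented economic rationale regulators expect. The sketch infers the direction from the sign of the raw correlation, which is only an assumption.

```python
import numpy as np


def monotone_woe_bins(x, y, n_fine: int = 20):
    """Fine quantile classes merged by pool-adjacent-violators (PAV) so the
    bad rate, and therefore WoE = ln(%good/%bad), is strictly monotone.

    Returns (edges, woe): merged bin k covers (edges[k-1], edges[k]] with
    open outer ends, i.e. right-closed like pd.cut's default.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    assert 0 < y.sum() < len(y), "need both goods and bads"

    # Fine classing on unique interior quantiles
    inner = np.unique(np.quantile(x, np.linspace(0, 1, n_fine + 1)[1:-1]))
    fine = np.searchsorted(inner, x, side="left")  # right-closed bin index
    bads = np.bincount(fine, weights=y, minlength=len(inner) + 1)
    totals = np.bincount(fine, minlength=len(inner) + 1)

    # Assumed trend direction: sign of the raw correlation. In practice this
    # should come from a documented economic rationale, not from the data.
    sign = 1.0 if np.corrcoef(x, y)[0, 1] >= 0 else -1.0

    blocks = []  # each block: [bads, total, index of its right edge in inner]
    for k in range(len(totals)):
        if totals[k] == 0:
            continue  # empty fine class, absorbed by its neighbours
        blocks.append([bads[k], totals[k], k])
        # Pool while the last two blocks tie or break the trend
        while len(blocks) > 1 and (
            sign * blocks[-2][0] / blocks[-2][1]
            >= sign * blocks[-1][0] / blocks[-1][1]
        ):
            b, t, k_right = blocks.pop()
            blocks[-1][0] += b
            blocks[-1][1] += t
            blocks[-1][2] = k_right

    bad = np.array([blk[0] for blk in blocks], dtype=float)
    good = np.array([blk[1] for blk in blocks], dtype=float) - bad
    # Same 1e-4 floor as compute_woe_iv; with near-empty blocks the floor itself
    # can distort the trend, so keep fine classes well populated
    pct_good = np.clip(good / good.sum(), 1e-4, None)
    pct_bad = np.clip(bad / bad.sum(), 1e-4, None)
    woe = np.log(pct_good / pct_bad)
    edges = np.array([inner[blk[2]] for blk in blocks[:-1]])
    return edges, woe


# Toy data (made up): default risk rises with x, plus sampling noise
rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 5000)
y = (rng.uniform(size=5000) < 1 / (1 + np.exp(2 - 3 * x))).astype(int)
edges, woe = monotone_woe_bins(x, y)
print(np.round(edges, 3), np.round(woe, 3))
```

To apply the bins to new data, `np.searchsorted(edges, values, side="left")` returns each value's bin index into `woe`. This matches the right-closed `pd.cut` convention used in build_scorecard.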