Quickstart
Three short snippets that cover the full surface: the row-level
primitives, the sklearn splitter you will use most, and Combinatorial
Purged Cross-Validation with backtest-path reconstruction and the
deflated-Sharpe statistics. All snippets are runnable on a fresh
pip install purgedcv.
1. Row-level primitives
The two functions purge and apply_embargo are the building blocks
that every splitter is built on. They take positional train and test
indices plus the prediction and evaluation timestamps for every row,
and return the training indices that survive.
import numpy as np
import pandas as pd
from purgedcv import apply_embargo, purge
n = 1000
pred = pd.Series(pd.date_range("2024-01-01", periods=n, freq="D"))
evalu = pred + pd.Timedelta(days=5) # 5-day forward label
train_idx = np.arange(0, 800)
test_idx = np.arange(800, 900)
train_kept = purge(
    train_idx, test_idx,
    prediction_times=pred,
    evaluation_times=evalu,
    purge_horizon="5D",
)
train_kept = apply_embargo(
    train_kept, test_idx,
    prediction_times=pred,
    evaluation_times=evalu,
    embargo="2D",
)
print(len(train_idx), "->", len(train_kept), "after purge + embargo")
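To make the two steps concrete, here is a minimal NumPy-only sketch of the idea behind purging and embargoing. It is an illustration under stated assumptions, not purgedcv's implementation: the names purge_sketch and embargo_sketch are hypothetical, timestamps are plain integer day offsets, and the embargo is assumed to start at the last evaluation time of the test fold. Unlike the snippet above, the training set here sits on both sides of the test fold so that the embargo has something to remove.

```python
import numpy as np


def purge_sketch(train_idx, test_idx, pred, evalu):
    """Drop training rows whose label window [pred, evalu] overlaps the test window."""
    test_start = pred[test_idx].min()
    test_end = evalu[test_idx].max()
    overlaps = (pred[train_idx] <= test_end) & (evalu[train_idx] >= test_start)
    return train_idx[~overlaps]


def embargo_sketch(train_idx, test_idx, pred, evalu, embargo):
    """Drop training rows predicted within `embargo` time units after the test window."""
    test_end = evalu[test_idx].max()
    in_embargo = (pred[train_idx] > test_end) & (pred[train_idx] <= test_end + embargo)
    return train_idx[~in_embargo]


# Daily rows with a 5-day forward label, stored as integer day offsets (assumption).
pred = np.arange(1000)
evalu = pred + 5
test_idx = np.arange(800, 900)
train_idx = np.setdiff1d(np.arange(1000), test_idx)  # rows before AND after the test fold

kept = purge_sketch(train_idx, test_idx, pred, evalu)
kept = embargo_sketch(kept, test_idx, pred, evalu, embargo=2)
print(len(train_idx), "->", len(kept), "after purge + embargo (sketch)")
```

In this sketch the purge removes the five rows on each side of the test fold whose label windows touch it, and the embargo removes two further rows immediately after it.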
2. PurgedKFold in cross_val_score
Every splitter follows the scikit-learn splitter protocol, so it can be
passed as cv= to cross_val_score and GridSearchCV, including when the
estimator is a Pipeline, without glue code.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from purgedcv import PurgedKFold
n, h = 1000, 5
rng = np.random.default_rng(0)
features = rng.standard_normal((n, 4))
labels = rng.standard_normal(n)
pred = pd.Series(pd.date_range("2024-01-01", periods=n, freq="D"))
evalu = pred + pd.Timedelta(days=h)
cv = PurgedKFold(
    n_splits=5,
    prediction_times=pred,
    evaluation_times=evalu,
    purge_horizon=f"{h}D",
    embargo=f"{h}D",
)
scores = cross_val_score(GradientBoostingRegressor(), features, labels, cv=cv)
print("honest R^2 per fold:", scores)
PurgedGroupKFold (entity-level holdout) and WalkForwardSplit
(chronological train-on-the-past, expanding or rolling) follow the same
constructor pattern.
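The splitter protocol mentioned above amounts to two methods: split, which yields (train, test) index arrays, and get_n_splits. The following sketch shows a bare-bones object that satisfies it. The class name GapKFoldSketch and its row-count gap parameter are hypothetical and much cruder than timestamp-based purging; the point is only to show the shape of the interface that the purgedcv splitters are said to follow.

```python
import numpy as np


class GapKFoldSketch:
    """Hypothetical splitter: contiguous test folds, `gap` rows dropped on each side."""

    def __init__(self, n_splits=5, gap=0):
        self.n_splits = n_splits
        self.gap = gap

    def get_n_splits(self, X=None, y=None, groups=None):
        # scikit-learn calls this to learn how many folds to expect.
        return self.n_splits

    def split(self, X, y=None, groups=None):
        # Yield (train_idx, test_idx) pairs of positional indices.
        positions = np.arange(len(X))
        for test_idx in np.array_split(positions, self.n_splits):
            lo = test_idx[0] - self.gap
            hi = test_idx[-1] + self.gap
            train_mask = (positions < lo) | (positions > hi)
            yield positions[train_mask], test_idx


cv_sketch = GapKFoldSketch(n_splits=5, gap=5)
X_demo = np.zeros((1000, 4))
for fold_train, fold_test in cv_sketch.split(X_demo):
    print(len(fold_train), "train rows,", len(fold_test), "test rows")
```

Because scikit-learn only relies on these two methods, an object like this can be handed to the cv= argument of cross_val_score in the same way as the PurgedKFold instance above.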
3. CPCV + backtest paths + deflated Sharpe
The full workflow from chapter 12 of Advances in Financial Machine
Learning: enumerate C(N, K) purged folds, fit-predict on each, assemble
the per-path out-of-sample predictions with reconstruct_paths, and
correct the resulting Sharpe ratios for the number of model trials.
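As a rough illustration of the combinatorics, the standard-library sketch below enumerates the C(N, K) test-group combinations for N = 6 and K = 2 and assigns each group's test appearances to backtest paths. The assignment rule (the k-th split that tests a group feeds path k) is an assumption made for illustration; the source does not state how reconstruct_paths orders them.

```python
from itertools import combinations
from math import comb

n_groups, n_test_groups = 6, 2

# Every way of choosing which groups form the test set.
splits = list(combinations(range(n_groups), n_test_groups))

# Each group is tested in C(N, K) * K / N splits, which is the number of full paths.
n_paths = comb(n_groups, n_test_groups) * n_test_groups // n_groups

# path_table[p][g] = id of the split whose predictions fill group g on path p.
path_table = [[None] * n_groups for _ in range(n_paths)]
times_seen = [0] * n_groups
for split_id, test_groups in enumerate(splits):
    for g in test_groups:
        path_table[times_seen[g]][g] = split_id
        times_seen[g] += 1

print("splits:", len(splits), "paths:", n_paths)
for p, row in enumerate(path_table):
    print("path", p, "uses splits", row)
```

With six groups and two test groups per split this gives 15 splits and 5 complete paths, each path covering every group exactly once.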
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from purgedcv import CombinatorialPurgedCV, deflated_sharpe_ratio
n = 800
rng = np.random.default_rng(0)
features = rng.standard_normal((n, 5))
labels = rng.standard_normal(n)
pred = pd.Series(pd.date_range("2024-01-01", periods=n, freq="D"))
evalu = pred + pd.Timedelta(days=3)
cpcv = CombinatorialPurgedCV(
    n_splits=6,
    n_test_groups=2,
    prediction_times=pred,
    evaluation_times=evalu,
    purge_horizon="3D",
)
paths = cpcv.backtest_paths(Ridge(), features, labels)
# paths.shape == (n_paths, n_samples); NaN = unseen position
# Treat each path's mean/std of predicted scores as a Sharpe-like statistic
# and correct for the number of model trials we actually ran.
per_path_sharpe = np.nanmean(paths, axis=1) / np.nanstd(paths, axis=1)
dsr = deflated_sharpe_ratio(
    per_path_sharpe.mean(),
    n_trials=len(per_path_sharpe),
    var_sharpe=per_path_sharpe.var(ddof=1),
)
print("Deflated Sharpe (probability skill is real):", dsr)
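For intuition about what the correction does, here is a standard-library sketch of the deflated Sharpe ratio as published by Bailey and López de Prado: the observed Sharpe is compared against the maximum Sharpe expected from n_trials unskilled strategies, then passed through a normal CDF. The function names, the n_obs, skew, and kurt parameters, and their defaults are assumptions for illustration; purgedcv's deflated_sharpe_ratio may take different arguments and handle these terms differently.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_norm = NormalDist()


def expected_max_sharpe(n_trials, var_sharpe):
    """Expected best Sharpe among n_trials (>= 2) strategies with zero true skill."""
    z1 = _norm.inv_cdf(1 - 1 / n_trials)
    z2 = _norm.inv_cdf(1 - 1 / (n_trials * e))
    return sqrt(var_sharpe) * ((1 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe_sketch(sharpe, n_trials, var_sharpe, n_obs, skew=0.0, kurt=3.0):
    """Probability that `sharpe` (non-annualized) exceeds the no-skill benchmark."""
    benchmark = expected_max_sharpe(n_trials, var_sharpe)
    denom = sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe**2)
    return _norm.cdf((sharpe - benchmark) * sqrt(n_obs - 1) / denom)


# Hypothetical inputs: 15 trials, per-period Sharpe of 0.10 over 800 observations.
benchmark = expected_max_sharpe(n_trials=15, var_sharpe=0.001)
dsr_sketch = deflated_sharpe_sketch(0.10, n_trials=15, var_sharpe=0.001, n_obs=800)
print("no-skill benchmark Sharpe:", benchmark)
print("deflated Sharpe (sketch):", dsr_sketch)
```

The benchmark grows with both the number of trials and the variance of Sharpe ratios across trials, which is why the same observed Sharpe is less convincing after a larger search.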
For the full numerical example matching §7.4.1 of the book, see
examples/energy_demand_pjm.ipynb.