split_cv applies rsample::vfold_cv to each dataset in a named or
unnamed list, returning a list of data.table objects that each contain
the CV split objects alongside the corresponding training and validation
sets.
split_cv(
  split_dt,
  v = 10L,
  repeats = 1L,
  strata = NULL,
  breaks = 4L,
  pool = 0.1,
  ...
)
split_dt: A list whose every element is a data.frame or data.table. Must be non-empty.
v: Number of folds. Must be a single integer >= 2. Default is 10.
repeats: Number of repeats. Must be a single integer >= 1. Default is 1.
strata: A single character string naming the stratification column. The column must exist in every dataset. Set to NULL for no stratification. Default is NULL.
breaks: Number of bins when stratifying a numeric variable. Used only when strata is non-NULL. Default is 4.
pool: Proportion threshold for pooling small strata. Used only when strata is non-NULL. Default is 0.1.
...: Additional arguments forwarded to rsample::vfold_cv.
A list of data.table objects (one per input dataset), each
containing:
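splits: rsample split objects.
id: primary resample identifier (always present). Holds the fold label (Fold1, Fold2, ...) when repeats = 1 and the repeat label (Repeat1, Repeat2, ...) when repeats > 1.
id2: fold label within each repeat (present only when repeats > 1).
train: list-column of training data frames.
validate: list-column of validation data frames.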
The output list preserves the names of split_dt.
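The way the identifier columns depend on repeats can be checked against rsample directly. The snippet below is a small sketch that calls rsample::vfold_cv on iris rather than split_cv itself; it assumes rsample and data.table are installed, and the object names single and repeated are illustrative only.

```r
library(rsample)
library(data.table)

set.seed(1)

# One pass of 3-fold CV: only the `id` column identifies the folds.
single <- as.data.table(vfold_cv(iris, v = 3, repeats = 1))

# Two repeats of 3-fold CV: `id` holds the repeat, `id2` holds the fold.
repeated <- as.data.table(vfold_cv(iris, v = 3, repeats = 2))

names(single)         # "splits" "id"
names(repeated)       # "splits" "id" "id2"
unique(repeated$id)   # "Repeat1" "Repeat2"
unique(repeated$id2)  # "Fold1" "Fold2" "Fold3"
```

Code that consumes the output should therefore look up the fold label in id2 when repeats > 1 and in id otherwise.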
Inputs are validated once, before the processing loop. Then, for each
dataset in split_dt, the function:
Builds a vfold_cv argument list, appending stratification
parameters only when strata is non-NULL to avoid passing
unsupported arguments to rsample.
Converts the rsample tibble to a data.table in a single
as.data.table() call, preserving whichever fold-identifier columns
are present (id, id2) without branching on the value of repeats.
Appends train and validate list-columns by reference via :=.
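The per-dataset steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the function's actual source: cv_one_sketch is a hypothetical name, it handles a single dataset only, and it assumes rsample and data.table are installed.

```r
library(rsample)
library(data.table)

# Hypothetical helper mirroring the documented steps for ONE dataset.
cv_one_sketch <- function(dt, v = 10L, repeats = 1L, strata = NULL,
                          breaks = 4L, pool = 0.1, ...) {
  # Build the vfold_cv argument list; stratification arguments are
  # appended only when a strata column was requested.
  args <- list(data = dt, v = v, repeats = repeats, ...)
  if (!is.null(strata)) {
    args <- c(args, list(strata = strata, breaks = breaks, pool = pool))
  }

  # Convert the rsample tibble in one call; id (and id2, if present)
  # are carried over as-is.
  cv <- as.data.table(do.call(rsample::vfold_cv, args))

  # Append the list-columns by reference.
  cv[, `:=`(
    train    = lapply(splits, rsample::training),
    validate = lapply(splits, rsample::testing)
  )]
  cv[]
}

set.seed(42)
res <- cv_one_sketch(iris, v = 5L, strata = "Species")
names(res)  # "splits" "id" "train" "validate"
```

Applying such a helper over the input list with lapply() would keep the list names, which matches the documented behaviour of the output.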
When strata is specified, it must exist in all datasets;
a missing column raises an error rather than silently falling back
to unstratified CV.
breaks and pool are forwarded to rsample::vfold_cv only
when strata is non-NULL, preventing invalid-argument errors.
as.data.table() on an already-data.table input is a no-op
(no copy is made).
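The strata check described in the notes could look something like the sketch below. The function name check_strata_sketch and the error wording are hypothetical; the real function's messages may differ. Only base R is used.

```r
# Hypothetical up-front validation: fail loudly if the strata column is
# missing from any dataset instead of falling back to unstratified CV.
check_strata_sketch <- function(split_dt, strata = NULL) {
  stopifnot(is.list(split_dt), length(split_dt) > 0L)
  if (is.null(strata)) return(invisible(TRUE))
  stopifnot(is.character(strata), length(strata) == 1L)

  has_col <- vapply(split_dt, function(d) strata %in% names(d), logical(1))
  if (!all(has_col)) {
    stop("strata column '", strata,
         "' is missing from dataset(s) at position(s): ",
         paste(which(!has_col), collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

datasets <- list(a = iris, b = iris[, 1:4])  # `b` lacks Species

check_strata_sketch(datasets)  # passes: no stratification requested

# With strata set, the missing column triggers an error.
msg <- tryCatch(
  check_strata_sketch(datasets, strata = "Species"),
  error = conditionMessage
)
msg
```

Running the check once, before any splitting starts, means a bad strata value is reported before any folds are computed.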
rsample::vfold_cv() — underlying cross-validation function
rsample::training() — extract training set from a split
rsample::testing() — extract validation set from a split
nest_cv() — nested data.table variant of this utility
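As a brief illustration of how the rsample functions listed above fit together, the sketch below creates folds on iris and pulls the training and validation rows out of the first split. It assumes rsample is installed; the object names are illustrative.

```r
library(rsample)

set.seed(123)
folds <- vfold_cv(iris, v = 3)

# Each element of `splits` is a single train/validation partition.
first_split <- folds$splits[[1]]
train_set <- training(first_split)  # rows used for fitting
valid_set <- testing(first_split)   # held-out rows for this fold

nrow(train_set)  # 100
nrow(valid_set)  # 50
```

The train and validate list-columns described earlier hold these same extracts, one pair per row.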
# Prepare example data: Convert first 3 columns of iris dataset to long format and split
dt_split <- w2l_split(data = iris, cols2l = 1:3)
# dt_split is now a list containing 3 data tables for Sepal.Length, Sepal.Width, and Petal.Length
# Example 1: Single cross-validation (no repeats)
split_cv(
  split_dt = dt_split,  # Input list of split data
  v = 3,                # Set 3-fold cross-validation
  repeats = 1           # Perform cross-validation once (no repeats)
)
#> $Sepal.Length
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#>
#> $Sepal.Width
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#>
#> $Petal.Length
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#>
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data
# Example 2: Repeated cross-validation
split_cv(
  split_dt = dt_split,  # Input list of split data
  v = 3,                # Set 3-fold cross-validation
  repeats = 2           # Perform cross-validation twice
)
#> $Sepal.Length
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#>
#> $Sepal.Width
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#>
#> $Petal.Length
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#>
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: repeat numbers (Repeat1, Repeat2)
# - id2: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data