split_cv applies rsample::vfold_cv to each dataset in a named or unnamed list, returning a list of data.table objects that each contain the CV split objects alongside the corresponding training and validation sets.

split_cv(
  split_dt,
  v = 10L,
  repeats = 1L,
  strata = NULL,
  breaks = 4L,
  pool = 0.1,
  ...
)

Arguments

split_dt

A list whose every element is a data.frame or data.table. Must be non-empty.

v

Number of folds. Must be a single integer >= 2. Default is 10.

repeats

Number of repeats. Must be a single integer >= 1. Default is 1.

strata

A single character string naming the stratification column. The column must exist in every dataset. Set to NULL for no stratification. Default is NULL.

breaks

Number of bins when stratifying a numeric variable. Used only when strata is non-NULL. Default is 4.

pool

Proportion threshold for pooling small strata. Used only when strata is non-NULL. Default is 0.1.

...

Additional arguments forwarded to rsample::vfold_cv.
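
As a rough illustration of what the stratification arguments do once they reach rsample, the sketch below calls rsample::vfold_cv directly on iris rather than going through split_cv. It assumes split_cv forwards strata, breaks and pool unchanged; the object names cv and counts are made up for this example.

```r
library(rsample)

set.seed(1)

# Direct call to rsample::vfold_cv with the same stratification arguments
# that split_cv exposes. `strata` is given as a character string, as the
# split_cv documentation requires.
cv <- vfold_cv(iris, v = 3, strata = "Species", breaks = 4, pool = 0.1)

# Tabulate the stratification column in each validation (assessment) set.
# With stratification, each species should be spread almost evenly
# across the three folds.
counts <- lapply(cv$splits, function(s) table(assessment(s)$Species))
counts
```

Because Species is a factor, breaks has no effect here; it only matters when the stratification column is numeric and has to be binned.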

Value

A list of data.table objects (one per input dataset), each containing:

  • splits — rsample split objects.

  • id — fold identifier (Fold1, Fold2, ...) when repeats = 1; repeat identifier (Repeat1, Repeat2, ...) when repeats > 1. Always present.

  • id2 — fold identifier (present only when repeats > 1).

  • train — list-column of training data frames.

  • validate — list-column of validation data frames.

The output list preserves the names of split_dt.
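
The identifier columns come straight from rsample, so their meaning can be checked on a bare vfold_cv call. The following is a minimal sketch using iris directly (not split_cv); the names single and repeated are illustrative only.

```r
library(rsample)

set.seed(1)

single   <- vfold_cv(iris, v = 3, repeats = 1)
repeated <- vfold_cv(iris, v = 3, repeats = 2)

# Without repeats there is one identifier column, holding the fold label.
names(single)         # "splits" "id"
unique(single$id)     # "Fold1" "Fold2" "Fold3"

# With repeats, id holds the repeat label and id2 holds the fold label.
names(repeated)       # "splits" "id" "id2"
unique(repeated$id)   # "Repeat1" "Repeat2"
unique(repeated$id2)  # "Fold1" "Fold2" "Fold3"
```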

Details

The function proceeds as follows:

  1. Validates inputs once, before looping over the datasets in split_dt.

  2. Builds a vfold_cv argument list, appending the stratification parameters (strata, breaks, pool) only when strata is non-NULL to avoid passing unsupported arguments to rsample.

  3. For each dataset, converts the rsample tibble to a data.table in a single as.data.table() call, preserving all fold-identifier columns (id, id2) without branching on the value of repeats.

  4. Appends train and validate list-columns by reference via :=.
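
The four steps above can be sketched roughly as follows. This is a hypothetical re-implementation written for illustration, not the package source: the name split_cv_sketch, the exact validation checks and the error message are all assumptions.

```r
library(data.table)

# Hypothetical sketch of the documented steps; not the actual split_cv code.
split_cv_sketch <- function(split_dt, v = 10L, repeats = 1L, strata = NULL,
                            breaks = 4L, pool = 0.1, ...) {
  # Step 1: validate once, before the loop (assumed checks).
  stopifnot(
    is.list(split_dt), length(split_dt) > 0L,
    all(vapply(split_dt, is.data.frame, logical(1)))
  )
  if (!is.null(strata)) {
    has_col <- vapply(split_dt, function(d) strata %in% names(d), logical(1))
    if (!all(has_col)) {
      stop("`strata` column is missing from at least one dataset")
    }
  }

  # Step 2: stratification arguments are appended only when strata is set.
  cv_args <- list(v = v, repeats = repeats, ...)
  if (!is.null(strata)) {
    cv_args <- c(cv_args, list(strata = strata, breaks = breaks, pool = pool))
  }

  # lapply() keeps the names of split_dt on the result.
  lapply(split_dt, function(d) {
    # Step 3: a single conversion keeps id (and id2 when it exists).
    cv <- as.data.table(do.call(rsample::vfold_cv, c(list(data = d), cv_args)))

    # Step 4: add list-columns by reference. The extra list() wrapper makes
    # data.table treat the result as one list-column.
    cv[, train := list(lapply(splits, rsample::training))]
    cv[, validate := list(lapply(splits, rsample::assessment))]
    cv[]
  })
}

set.seed(1)
res <- split_cv_sketch(list(iris = iris), v = 3L, repeats = 2L)
names(res$iris)
```

Because as.data.table() copies whatever columns rsample returns, the same code path handles both repeats = 1 (id only) and repeats > 1 (id and id2).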

Note

  • When strata is specified, the named column must exist in every dataset; a missing column raises an error rather than silently falling back to unstratified CV.

  • breaks and pool are forwarded to rsample::vfold_cv only when strata is non-NULL, preventing invalid-argument errors.

  • as.data.table() on an already-data.table input is a no-op (no copy is made).

Examples

# Prepare example data: Convert first 3 columns of iris dataset to long format and split
dt_split <- w2l_split(data = iris, cols2l = 1:3)
# dt_split is now a list containing 3 data tables for Sepal.Length, Sepal.Width, and Petal.Length

# Example 1: Single cross-validation (no repeats)
split_cv(
  split_dt = dt_split,  # Input list of split data
  v = 3,                # Set 3-fold cross-validation
  repeats = 1           # Perform cross-validation once (no repeats)
)
#> $Sepal.Length
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#> 
#> $Sepal.Width
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#> 
#> $Petal.Length
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#> 
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data

# Example 2: Repeated cross-validation
split_cv(
  split_dt = dt_split,  # Input list of split data
  v = 3,                # Set 3-fold cross-validation
  repeats = 2           # Perform cross-validation twice
)
#> $Sepal.Length
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#> 
#> $Sepal.Width
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#> 
#> $Petal.Length
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#> 
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: repeat numbers (Repeat1, Repeat2)
# - id2: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data