Creates V-fold cross-validation splits for every dataset in a list, with optional repeats and stratified sampling.

split_cv(
  split_dt,
  v = 10,
  repeats = 1,
  strata = NULL,
  breaks = 4,
  pool = 0.1,
  ...
)

Arguments

split_dt

A list of input datasets.

  • Each element must be a data.frame or data.table

  • Every element is split independently, giving one result per dataset

  • The list cannot be empty

v

The number of partitions of the data set.

repeats

The number of times to repeat the V-fold partitioning.

strata

A variable in each dataset (single character or name) used to conduct stratified sampling. When not NULL, each resample is created within the stratification variable. Numeric strata are binned into quartiles by default (see breaks).

breaks

A single number giving the number of bins desired to stratify a numeric stratification variable.

pool

A proportion of data used to determine if a particular group is too small and should be pooled into another group. We do not recommend decreasing this argument below its default of 0.1 because of the dangers of stratifying groups that are too small.

...

These dots are for future extensions and must be empty.

Value

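A list of data.table objects, one per input dataset, each containing the columns:

  • splits: Cross-validation split objects

  • id: Fold identifier (the repeat identifier when repeats > 1)

  • id2: Fold identifier, present only when repeats > 1

  • train: Training dataset subsets

  • validate: Validation dataset subsets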

Details

Processing steps:

  1. Input dataset validation

  2. Stratified or unstratified sampling

  3. Flexible fold generation

  4. Train-validate set creation
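
The four steps above can be sketched with rsample and data.table. This is a minimal sketch, not the package's actual implementation. It assumes the function wraps rsample::vfold_cv() (the vfold_split objects in the printed output suggest this), the helper name sketch_split_cv is hypothetical, and stratification is left out for brevity.

```r
library(rsample)
library(data.table)

# Hypothetical helper for illustration only; not the package's source code.
sketch_split_cv <- function(split_dt, v = 10, repeats = 1) {
  # Step 1: validate the input list
  if (!is.list(split_dt) || length(split_dt) == 0) {
    stop("split_dt must be a non-empty list")
  }
  if (!all(vapply(split_dt, is.data.frame, logical(1)))) {
    stop("every element of split_dt must be a data.frame or data.table")
  }

  lapply(split_dt, function(d) {
    d <- as.data.table(d)

    # Steps 2-3: unstratified V-fold partitioning, optionally repeated
    folds <- vfold_cv(d, v = v, repeats = repeats)

    # Identifier columns: "id" alone, or "id" and "id2" when repeats > 1
    id_cols <- setdiff(names(folds), "splits")
    ids <- setNames(lapply(id_cols, function(nm) folds[[nm]]), id_cols)

    # Step 4: store the train and validate subsets of every fold
    as.data.table(c(
      list(splits = folds$splits),
      ids,
      list(
        train    = lapply(folds$splits, analysis),
        validate = lapply(folds$splits, assessment)
      )
    ))
  })
}

set.seed(123)  # fixing the RNG seed makes the fold assignment repeatable
res <- sketch_split_cv(list(iris = iris, mtcars = mtcars), v = 3)
names(res$iris)
```

Each list element holds one row per fold, so the train and validate columns store a full copy of the corresponding subsets.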

Sampling Strategies:

  • Supports multiple dataset processing

  • Handles stratified and unstratified sampling

  • Generates reproducible cross-validation splits
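
To see what stratified sampling changes, the sketch below calls rsample::vfold_cv() directly on iris. It rests on the assumption that split_cv() forwards strata, breaks, and pool to that function; the argument descriptions match, but this page does not confirm it. With strata = Species, each validation fold should hold roughly the same number of rows per species.

```r
library(rsample)

set.seed(42)  # assumption: a fixed seed is what makes the splits repeatable
folds <- vfold_cv(iris, v = 3, strata = Species)

# Species counts in the validation (assessment) part of each fold:
# rows are species, columns are folds
val_counts <- sapply(folds$splits, function(s) table(assessment(s)$Species))
colnames(val_counts) <- folds$id
val_counts
```

Without strata, the per-species counts in a fold can drift further apart, which matters most when some groups are small.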

Note

Important Constraints:

  • Requires non-empty input datasets

  • All datasets must be data.frame or data.table

  • Strata column must exist if specified

  • Processing time and memory use grow with dataset size, because train and validate subsets are stored for every fold

Examples

# Prepare example data: Convert first 3 columns of iris dataset to long format and split
dt_split <- w2l_split(data = iris, cols2l = 1:3)
# dt_split is now a list containing 3 data tables for Sepal.Length, Sepal.Width, and Petal.Length

# Example 1: Single cross-validation (no repeats)
split_cv(
  split_dt = dt_split,  # Input list of split data
  v = 3,                # Set 3-fold cross-validation
  repeats = 1           # Perform cross-validation once (no repeats)
)
#> $Sepal.Length
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#> 
#> $Sepal.Width
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#> 
#> $Petal.Length
#>                         splits     id               train           validate
#>                         <list> <char>              <list>             <list>
#> 1: <vfold_split[100x50x150x3]>  Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]>  Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]>  Fold3 <data.table[100x3]> <data.table[50x3]>
#> 
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data

# Example 2: Repeated cross-validation
split_cv(
  split_dt = dt_split,  # Input list of split data
  v = 3,                # Set 3-fold cross-validation
  repeats = 2           # Perform cross-validation twice
)
#> $Sepal.Length
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#> 
#> $Sepal.Width
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#> 
#> $Petal.Length
#>                         splits      id    id2               train
#>                         <list>  <char> <char>              <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1  Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1  Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1  Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2  Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2  Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2  Fold3 <data.table[100x3]>
#>              validate
#>                <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#> 
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: repeat numbers (Repeat1, Repeat2)
# - id2: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data
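
The train and validate list columns can be consumed fold by fold. The sketch below is an assumption-laden illustration: instead of calling split_cv(), it builds a stand-in table (cv_dt, a hypothetical name) with the same id, train, and validate columns from iris, then fits a linear model on each training subset and scores it on the matching validation subset.

```r
library(data.table)

# Stand-in for one element of the returned list (hypothetical data,
# built by hand so the example runs without the package)
set.seed(1)
fold_id <- sample(rep(1:3, length.out = nrow(iris)))
cv_dt <- data.table(
  id       = paste0("Fold", 1:3),
  train    = lapply(1:3, function(k) as.data.table(iris[fold_id != k, ])),
  validate = lapply(1:3, function(k) as.data.table(iris[fold_id == k, ]))
)

# Fit on each training subset, compute RMSE on the matching validation subset
fold_rmse <- mapply(function(tr, va) {
  fit <- lm(Sepal.Length ~ Petal.Length, data = tr)
  sqrt(mean((va$Sepal.Length - predict(fit, newdata = va))^2))
}, cv_dt$train, cv_dt$validate)

cv_dt[, rmse := fold_rmse]
cv_dt[, .(id, rmse)]
```

The same loop should apply to each element of the list returned by split_cv(), for example via lapply() over the result.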