Creates V-fold cross-validation splits for every dataset in a list, with optional repeats and stratified sampling.
split_cv(
  split_dt,
  v = 10,
  repeats = 1,
  strata = NULL,
  breaks = 4,
  pool = 0.1,
  ...
)
split_dt: A list of input datasets. Every element must be a data.frame or data.table, and the list cannot be empty.
v: The number of partitions of each dataset.
repeats: The number of times to repeat the V-fold partitioning.
strata: A variable in each dataset (single character or name) used to conduct stratified sampling. When not NULL, each resample is created within the stratification variable. Numeric strata are binned into quartiles by default (see breaks).
breaks: A single number giving the number of bins desired to stratify a numeric stratification variable.
pool: A proportion of data used to determine if a particular group is too small and should be pooled into another group. Decreasing this argument below its default of 0.1 is not recommended because of the dangers of stratifying groups that are too small.
...: These dots are for future extensions and must be empty.
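The effect of strata and breaks can be illustrated with rsample::vfold_cv() directly. This is a minimal sketch that assumes split_cv() forwards these arguments to that function unchanged; the iris data and the seed are chosen only for illustration.

```r
library(rsample)

set.seed(42)  # arbitrary seed so the folds can be regenerated

# Categorical strata: each fold keeps roughly the same Species proportions
folds_cat <- vfold_cv(iris, v = 3, strata = Species)
species_counts <- table(assessment(folds_cat$splits[[1]])$Species)
print(species_counts)

# Numeric strata: Sepal.Length is binned into `breaks` quantile groups
# before sampling; `pool` is left at its default of 0.1
folds_num <- vfold_cv(iris, v = 3, strata = Sepal.Length, breaks = 4)
print(nrow(folds_num))
```

Because iris has 50 rows per species, a 3-fold stratified split places 16 or 17 rows of each species in every validation fold.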
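A list of data.table objects, one per input dataset, each containing:
splits: Cross-validation split objects
id: Fold identifiers (repeat identifiers when repeats > 1, with fold identifiers in id2)
train: Training dataset subsets
validate: Validation dataset subsets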
Advanced Cross-Validation Mechanism:
Input dataset validation
Stratified or unstratified sampling
Flexible fold generation
Train-validate set creation
Sampling Strategies:
Supports multiple dataset processing
Handles stratified and unstratified sampling
Generates reproducible cross-validation splits
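Reproducibility here presumably depends on R's random number generator, since rsample::vfold_cv() assigns folds by random sampling. A short sketch, assuming split_cv() draws its folds through that function, shows that setting the same seed before each call yields identical folds (mtcars and the seed value are illustrative choices):

```r
library(rsample)

# Same seed before each call
set.seed(2024)
folds_a <- vfold_cv(mtcars, v = 4)
set.seed(2024)
folds_b <- vfold_cv(mtcars, v = 4)

# Compare which rows land in the training part of every fold
same_rows <- identical(
  lapply(folds_a$splits, function(s) rownames(analysis(s))),
  lapply(folds_b$splits, function(s) rownames(analysis(s)))
)
print(same_rows)
```

Without a call to set.seed(), two runs will generally produce different folds.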
Important Constraints:
Requires non-empty input datasets
All datasets must be data.frame or data.table
Strata column must exist if specified
Large datasets increase memory use and run time, because training and validation subsets are stored for every fold
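The steps listed above (input validation, fold generation, train-validate set creation) could be sketched roughly as follows. This is not the package's implementation: sketch_split_cv() is a hypothetical helper written for illustration, built only on rsample::vfold_cv(), rsample::analysis(), rsample::assessment(), and data.table.

```r
library(data.table)
library(rsample)

# Hypothetical helper (an assumption, not the package source): split each
# dataset in a list into V folds and store the train/validate subsets.
sketch_split_cv <- function(split_dt, v = 10, repeats = 1) {
  # Input validation mirroring the documented constraints
  stopifnot(is.list(split_dt), length(split_dt) > 0)
  stopifnot(all(vapply(split_dt, is.data.frame, logical(1))))

  lapply(split_dt, function(d) {
    folds <- rsample::vfold_cv(d, v = v, repeats = repeats)
    data.table(
      id       = folds$id,
      splits   = folds$splits,
      # analysis() returns the training rows, assessment() the held-out rows
      train    = lapply(folds$splits, function(s) as.data.table(analysis(s))),
      validate = lapply(folds$splits, function(s) as.data.table(assessment(s)))
    )
  })
}

# Usage on a toy list holding one 30-row dataset
toy <- list(a = data.frame(x = 1:30, y = rnorm(30)))
res <- sketch_split_cv(toy, v = 3)
print(nrow(res$a$train[[1]]))
print(nrow(res$a$validate[[1]]))
```

With 30 rows and v = 3, every fold holds 20 training rows and 10 validation rows. The sketch omits the strata, breaks, and pool arguments for brevity.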
rsample::vfold_cv(): Core cross-validation function
# Prepare example data: Convert first 3 columns of iris dataset to long format and split
dt_split <- w2l_split(data = iris, cols2l = 1:3)
# dt_split is now a list containing 3 data tables for Sepal.Length, Sepal.Width, and Petal.Length
# Example 1: Single cross-validation (no repeats)
split_cv(
  split_dt = dt_split, # Input list of split data
  v = 3,               # Set 3-fold cross-validation
  repeats = 1          # Perform cross-validation once (no repeats)
)
#> $Sepal.Length
#> splits id train validate
#> <list> <char> <list> <list>
#> 1: <vfold_split[100x50x150x3]> Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]> Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]> Fold3 <data.table[100x3]> <data.table[50x3]>
#>
#> $Sepal.Width
#> splits id train validate
#> <list> <char> <list> <list>
#> 1: <vfold_split[100x50x150x3]> Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]> Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]> Fold3 <data.table[100x3]> <data.table[50x3]>
#>
#> $Petal.Length
#> splits id train validate
#> <list> <char> <list> <list>
#> 1: <vfold_split[100x50x150x3]> Fold1 <data.table[100x3]> <data.table[50x3]>
#> 2: <vfold_split[100x50x150x3]> Fold2 <data.table[100x3]> <data.table[50x3]>
#> 3: <vfold_split[100x50x150x3]> Fold3 <data.table[100x3]> <data.table[50x3]>
#>
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data
# Example 2: Repeated cross-validation
split_cv(
  split_dt = dt_split, # Input list of split data
  v = 3,               # Set 3-fold cross-validation
  repeats = 2          # Perform cross-validation twice
)
#> $Sepal.Length
#> splits id id2 train
#> <list> <char> <char> <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1 Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1 Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1 Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2 Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2 Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2 Fold3 <data.table[100x3]>
#> validate
#> <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#>
#> $Sepal.Width
#> splits id id2 train
#> <list> <char> <char> <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1 Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1 Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1 Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2 Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2 Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2 Fold3 <data.table[100x3]>
#> validate
#> <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#>
#> $Petal.Length
#> splits id id2 train
#> <list> <char> <char> <list>
#> 1: <vfold_split[100x50x150x3]> Repeat1 Fold1 <data.table[100x3]>
#> 2: <vfold_split[100x50x150x3]> Repeat1 Fold2 <data.table[100x3]>
#> 3: <vfold_split[100x50x150x3]> Repeat1 Fold3 <data.table[100x3]>
#> 4: <vfold_split[100x50x150x3]> Repeat2 Fold1 <data.table[100x3]>
#> 5: <vfold_split[100x50x150x3]> Repeat2 Fold2 <data.table[100x3]>
#> 6: <vfold_split[100x50x150x3]> Repeat2 Fold3 <data.table[100x3]>
#> validate
#> <list>
#> 1: <data.table[50x3]>
#> 2: <data.table[50x3]>
#> 3: <data.table[50x3]>
#> 4: <data.table[50x3]>
#> 5: <data.table[50x3]>
#> 6: <data.table[50x3]>
#>
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: repeat numbers (Repeat1, Repeat2)
# - id2: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data