Select Top Percentage of Data and Statistical Summarization

The top_perc function selects the top (or bottom) percentage of rows in a dataset, ranked by a specified trait, and computes summary statistics on the selection. Selection can be performed within groups defined by additional columns, the type of summary statistics is configurable, and the selected rows can be returned alongside the statistics.

top_perc(data, perc, trait, by = NULL, type = "mean_sd", keep_data = FALSE)

Arguments

data

A data.frame containing the source dataset for analysis

Supports various data frame-like structures
Automatically converts non-data frame inputs

perc

Numeric vector of percentages for data selection, given as proportions (for example, 0.1 for 10%)

Range: -1 to 1
Positive values: Select top percentiles
Negative values: Select bottom percentiles
Multiple percentiles supported

trait

Character string specifying the selection column

Must be a valid column name in the input data
Used as the basis for top/bottom percentage selection

by

Optional character vector of grouping columns

Default is NULL
Enables stratified analysis
Allows granular percentage selection within groups

type

Statistical summary type

Default: "mean_sd"
Controls the type of summary statistics computed
Supports various summary methods from rstatix

keep_data

Logical flag for data retention

Default: FALSE
TRUE: Return both summary statistics and selected data
FALSE: Return only summary statistics

Value

A list or data frame:

If keep_data is FALSE, a data frame with summary statistics.
If keep_data is TRUE, a list with one element per trait and percentage (for example, Petal.Width_0.1); each element is itself a list containing the summary statistics (stat) and the selected rows (data).
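
When keep_data is TRUE, the nested list can be flattened back into a single table of statistics. The following is a minimal sketch: collect_stats is a hypothetical helper, not part of mintyr, and it relies only on the stat/data element layout described above. The call to top_perc is guarded so the snippet still runs when mintyr is not installed.

```r
# Hypothetical helper (not part of mintyr): stack the `stat` tables from a
# keep_data = TRUE result into one data frame.
collect_stats <- function(res) {
  stats <- lapply(res, function(el) as.data.frame(el$stat))
  out <- do.call(rbind, stats)
  rownames(out) <- NULL
  out
}

# Runs only when mintyr is available.
if (requireNamespace("mintyr", quietly = TRUE)) {
  res <- mintyr::top_perc(iris,
                          perc = c(0.1, -0.2),
                          trait = "Petal.Width",
                          keep_data = TRUE)
  print(names(res))             # one element per trait/percentage
  print(collect_stats(res))     # all summary rows in one table
  print(head(res[[1]]$data))    # selected rows for the first element
}
```

Indexing by position (res[[1]]) avoids depending on the exact element names, which are generated by the function.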

Note

The perc parameter accepts values between -1 and 1. Positive values select the top percentage, while negative values select the bottom percentage.
The function performs initial checks to ensure required arguments are provided and valid.
Grouping by additional columns (by) is optional and allows for more granular analysis.
The type parameter specifies the type of summary statistics to compute, with "mean_sd" as the default.
If keep_data is set to TRUE, the function will return both the summary statistics and the selected top data for each percentage.
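
The sign convention for perc can be illustrated with a small base R sketch. This is not the package's implementation: select_by_perc is a hypothetical helper, and keeping every row tied with the cutoff value is an assumption, chosen because it is consistent with n = 17 (rather than 15) in Example 1.

```r
# Hypothetical helper, not part of mintyr: keep the top (perc > 0) or
# bottom (perc < 0) share of rows, ranked by the column named in `trait`.
# Assumption: rows tied with the cutoff value are all kept.
select_by_perc <- function(data, trait, perc) {
  stopifnot(is.numeric(perc), length(perc) == 1,
            perc != 0, abs(perc) <= 1,
            trait %in% names(data))
  x <- data[[trait]]
  k <- floor(abs(perc) * nrow(data))   # target number of rows
  if (k == 0) return(data[0, , drop = FALSE])
  if (perc > 0) {
    cutoff <- sort(x, decreasing = TRUE)[k]   # k-th largest value
    keep <- x >= cutoff
  } else {
    cutoff <- sort(x)[k]                      # k-th smallest value
    keep <- x <= cutoff
  }
  data[which(keep), , drop = FALSE]           # which() drops NA comparisons
}

top10    <- select_by_perc(iris, "Petal.Width", 0.1)
bottom20 <- select_by_perc(iris, "Petal.Width", -0.2)

nrow(top10)                  # 17, matching n in Example 1
range(bottom20$Petal.Width)  # smallest Petal.Width values only
```

Because ties are kept, the number of selected rows can exceed the nominal percentage of the data.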

Examples

library(mintyr)

# Example 1: Basic usage with single trait
# This example selects the top 10% of observations based on Petal.Width
# keep_data=TRUE returns both summary statistics and the filtered data
top_perc(iris, 
         perc = 0.1,                # Select top 10%
         trait = c("Petal.Width"),  # Column to analyze
         keep_data = TRUE)          # Return both stats and filtered data
#> $Petal.Width_0.1
#> $Petal.Width_0.1$stat
#> # A tibble: 1 × 5
#>   variable        n  mean    sd top_perc
#>   <fct>       <dbl> <dbl> <dbl> <chr>   
#> 1 Petal.Width    17  2.34   0.1 10%     
#> 
#> $Petal.Width_0.1$data
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 1           6.3         3.3          6.0         2.5 virginica
#> 2           6.5         3.0          5.8         2.2 virginica
#> 3           7.2         3.6          6.1         2.5 virginica
#> 4           5.8         2.8          5.1         2.4 virginica
#> 5           6.4         3.2          5.3         2.3 virginica
#> 6           7.7         3.8          6.7         2.2 virginica
#> 7           7.7         2.6          6.9         2.3 virginica
#> 8           6.9         3.2          5.7         2.3 virginica
#> 9           6.4         2.8          5.6         2.2 virginica
#> 10          7.7         3.0          6.1         2.3 virginica
#> 11          6.3         3.4          5.6         2.4 virginica
#> 12          6.7         3.1          5.6         2.4 virginica
#> 13          6.9         3.1          5.1         2.3 virginica
#> 14          6.8         3.2          5.9         2.3 virginica
#> 15          6.7         3.3          5.7         2.5 virginica
#> 16          6.7         3.0          5.2         2.3 virginica
#> 17          6.2         3.4          5.4         2.3 virginica
#> 
#> 

# Example 2: Using grouping with 'by' parameter
# This example performs the same analysis but separately for each Species
# keep_data is left at FALSE, so a single tibble of summary statistics is returned, one row per group
top_perc(iris, 
         perc = 0.1,                # Select top 10%
         trait = c("Petal.Width"),  # Column to analyze
         by = "Species")            # Group by Species
#> # A tibble: 3 × 6
#>   Species    variable        n  mean    sd top_perc
#>   <fct>      <fct>       <dbl> <dbl> <dbl> <chr>   
#> 1 setosa     Petal.Width     9 0.433 0.071 10%     
#> 2 versicolor Petal.Width     5 1.66  0.089 10%     
#> 3 virginica  Petal.Width     6 2.45  0.055 10%     

# Example 3: Complex example with multiple percentages and grouping variables
# Reshape data from wide to long format for Sepal.Length and Sepal.Width
iris |> 
  tidyr::pivot_longer(1:2,
                      names_to = "names", 
                      values_to = "values") |> 
  mintyr::top_perc(
    perc = c(0.1, -0.2),
    trait = "values",
    by = c("Species", "names"),
    type = "mean_sd")
#> # A tibble: 12 × 7
#>    Species    names        variable     n  mean    sd top_perc
#>    <fct>      <chr>        <fct>    <dbl> <dbl> <dbl> <chr>   
#>  1 setosa     Sepal.Length values       5  5.64 0.134 10%     
#>  2 setosa     Sepal.Width  values       6  4.08 0.194 10%     
#>  3 versicolor Sepal.Length values       6  6.8  0.126 10%     
#>  4 versicolor Sepal.Width  values       5  3.26 0.089 10%     
#>  5 virginica  Sepal.Length values       5  7.74 0.089 10%     
#>  6 virginica  Sepal.Width  values       5  3.6  0.2   10%     
#>  7 setosa     Sepal.Length values      11  4.53 0.135 -20%    
#>  8 setosa     Sepal.Width  values      12  2.97 0.219 -20%    
#>  9 versicolor Sepal.Length values      11  5.28 0.244 -20%    
#> 10 versicolor Sepal.Width  values      13  2.35 0.151 -20%    
#> 11 virginica  Sepal.Length values      11  5.79 0.336 -20%    
#> 12 virginica  Sepal.Width  values      11  2.56 0.15  -20%
