The top_perc function selects the top (or bottom) percentage of rows, ranked by a specified trait, and computes summary statistics for the selection.
Rows can be grouped by additional columns, and the type of summary statistics is configurable.
The function can also return the selected rows alongside the statistics.
top_perc(data, perc, trait, by = NULL, type = "mean_sd", keep_data = FALSE)
data: A data.frame containing the source dataset for analysis
  Supports various data frame-like structures
  Automatically converts non-data frame inputs
perc: Numeric vector of proportions used for data selection (for example, 0.1 for 10%)
  Range: -1 to 1
  Positive values: Select top percentiles
  Negative values: Select bottom percentiles
  Multiple percentiles supported
trait: Character string specifying the 'selection column'
  Must be a valid column name in the input data
  Used as the basis for top/bottom percentage selection
by: Optional character vector of 'grouping columns'
  Default is NULL
  Enables stratified analysis
  Percentage selection is applied within each group
type: Statistical summary type
  Default: "mean_sd"
  Controls the type of summary statistics computed
  Supports various summary methods from rstatix
keep_data: Logical flag for data retention
  Default: FALSE
  TRUE: Return both summary statistics and selected data
  FALSE: Return only summary statistics
A list or data frame:
If keep_data is FALSE, a data frame with summary statistics.
If keep_data is TRUE, a list where each element is a list containing summary statistics (stat) and the selected top data (data).
The perc parameter accepts values between -1 and 1. Positive values select the top percentage, while negative values select the bottom percentage.
The function performs initial checks to ensure required arguments are provided and valid.
Grouping by additional columns (by) is optional and allows for more granular analysis.
The type parameter specifies the type of summary statistics to compute, with "mean_sd" as the default.
If keep_data is set to TRUE, the function will return both the summary statistics and the selected top data for each percentage.
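The sign convention for perc can be illustrated with a minimal base R sketch. The helper select_frac below is hypothetical and not part of mintyr. It assumes the selection keeps every row tied at the cutoff, which is how dplyr::top_frac() behaves and which is consistent with Example 1 returning 17 rows rather than 15 for the top 10% of 150 observations.

```r
# Hypothetical helper (not part of mintyr): tie-inclusive top/bottom selection.
select_frac <- function(data, perc, trait) {
  stopifnot(is.numeric(perc), length(perc) == 1, abs(perc) <= 1)
  x <- data[[trait]]
  k <- abs(perc) * nrow(data)  # target number of rows before ties
  # Positive perc ranks from the largest value, negative from the smallest.
  r <- if (perc > 0) {
    rank(-x, ties.method = "min")
  } else {
    rank(x, ties.method = "min")
  }
  data[r <= k, , drop = FALSE]  # every row tied at the cutoff is kept
}

top10 <- select_frac(iris, 0.1, "Petal.Width")      # largest Petal.Width values
bottom10 <- select_frac(iris, -0.1, "Petal.Width")  # smallest Petal.Width values
nrow(top10)  # 17, because three rows tie at the cutoff value 2.2
```

Because ties are kept, the number of selected rows can exceed perc times the number of rows, so the n column in the summary is worth checking.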
rstatix::get_summary_stats(): Statistical summary computation
dplyr::top_frac(): Percentage-based data selection
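To make the combination of grouped selection and a "mean_sd" summary concrete, here is a rough base R approximation. The helper top_mean_sd is hypothetical: it does not call rstatix or dplyr, handles only a single positive perc and a single grouping column, and assumes tie-inclusive selection within each group. Under those assumptions it reproduces the n and mean columns shown for Example 2.

```r
# Hypothetical helper (not part of mintyr or rstatix): per-group top selection
# followed by an n/mean/sd summary, using base R only.
top_mean_sd <- function(data, perc, trait, by) {
  groups <- split(data, data[[by]])
  rows <- lapply(names(groups), function(g) {
    x <- groups[[g]][[trait]]
    k <- perc * length(x)                         # cutoff within this group
    sel <- x[rank(-x, ties.method = "min") <= k]  # ties at the cutoff are kept
    data.frame(group = g, n = length(sel), mean = mean(sel), sd = sd(sel))
  })
  do.call(rbind, rows)
}

res <- top_mean_sd(iris, 0.1, "Petal.Width", "Species")
res  # one row per Species, with n = 9, 5 and 6
```

The selection is applied inside each group, so each Species contributes its own top 10% instead of the overall top 10% being split across groups.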
# Example 1: Basic usage with single trait
# This example selects the top 10% of observations based on Petal.Width
# keep_data=TRUE returns both summary statistics and the filtered data
top_perc(iris,
         perc = 0.1,                # Select top 10%
         trait = c("Petal.Width"),  # Column to analyze
         keep_data = TRUE)          # Return both stats and filtered data
#> $Petal.Width_0.1
#> $Petal.Width_0.1$stat
#> # A tibble: 1 × 5
#>   variable        n  mean    sd top_perc
#>   <fct>       <dbl> <dbl> <dbl> <chr>
#> 1 Petal.Width    17  2.34   0.1 10%
#>
#> $Petal.Width_0.1$data
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 1           6.3         3.3          6.0         2.5 virginica
#> 2           6.5         3.0          5.8         2.2 virginica
#> 3           7.2         3.6          6.1         2.5 virginica
#> 4           5.8         2.8          5.1         2.4 virginica
#> 5           6.4         3.2          5.3         2.3 virginica
#> 6           7.7         3.8          6.7         2.2 virginica
#> 7           7.7         2.6          6.9         2.3 virginica
#> 8           6.9         3.2          5.7         2.3 virginica
#> 9           6.4         2.8          5.6         2.2 virginica
#> 10          7.7         3.0          6.1         2.3 virginica
#> 11          6.3         3.4          5.6         2.4 virginica
#> 12          6.7         3.1          5.6         2.4 virginica
#> 13          6.9         3.1          5.1         2.3 virginica
#> 14          6.8         3.2          5.9         2.3 virginica
#> 15          6.7         3.3          5.7         2.5 virginica
#> 16          6.7         3.0          5.2         2.3 virginica
#> 17          6.2         3.4          5.4         2.3 virginica
#>
#>
# Example 2: Using grouping with 'by' parameter
# This example performs the same analysis but separately for each Species
# keep_data defaults to FALSE, so a single tibble of summary statistics is returned
top_perc(iris,
         perc = 0.1,                # Select top 10%
         trait = c("Petal.Width"),  # Column to analyze
         by = "Species")            # Group by Species
#> # A tibble: 3 × 6
#>   Species    variable        n  mean    sd top_perc
#>   <fct>      <fct>       <dbl> <dbl> <dbl> <chr>
#> 1 setosa     Petal.Width     9 0.433 0.071 10%
#> 2 versicolor Petal.Width     5 1.66  0.089 10%
#> 3 virginica  Petal.Width     6 2.45  0.055 10%
# Example 3: Complex example with multiple percentages and grouping variables
# Reshape Sepal.Length and Sepal.Width from wide to long format, then select
# the top 10% and bottom 20% of values within each Species/names group
iris |>
  tidyr::pivot_longer(1:2,
                      names_to = "names",
                      values_to = "values") |>
  mintyr::top_perc(
    perc = c(0.1, -0.2),         # Top 10% and bottom 20%
    trait = "values",
    by = c("Species", "names"),
    type = "mean_sd")
#> # A tibble: 12 × 7
#>    Species    names        variable     n  mean    sd top_perc
#>    <fct>      <chr>        <fct>    <dbl> <dbl> <dbl> <chr>
#>  1 setosa     Sepal.Length values       5  5.64 0.134 10%
#>  2 setosa     Sepal.Width  values       6  4.08 0.194 10%
#>  3 versicolor Sepal.Length values       6  6.8  0.126 10%
#>  4 versicolor Sepal.Width  values       5  3.26 0.089 10%
#>  5 virginica  Sepal.Length values       5  7.74 0.089 10%
#>  6 virginica  Sepal.Width  values       5  3.6  0.2   10%
#>  7 setosa     Sepal.Length values      11  4.53 0.135 -20%
#>  8 setosa     Sepal.Width  values      12  2.97 0.219 -20%
#>  9 versicolor Sepal.Length values      11  5.28 0.244 -20%
#> 10 versicolor Sepal.Width  values      13  2.35 0.151 -20%
#> 11 virginica  Sepal.Length values      11  5.79 0.336 -20%
#> 12 virginica  Sepal.Width  values      11  2.56 0.15  -20%