| Title: | A Moment-Targeting Normality Transformation Based on Tukey g-h Distribution |
|---|---|
| Description: | Implements a moment-targeting normality transformation based on the simultaneous optimization of Tukey g-h distribution parameters. The method is designed to minimize both asymmetry (skewness) and excess peakedness (kurtosis) in non-normal data by mapping it to a standard normal distribution (Cebeci et al., 2026 <doi:10.3390/sym18030458>). Optimization is performed by minimizing an objective function derived from the Anderson-Darling goodness-of-fit statistic with Stephens's correction factor, utilizing the L-BFGS-B algorithm for robust parameter estimation. This approach provides an effective alternative to power transformations like Box-Cox and Yeo-Johnson, particularly for data requiring precise tail-behavior adjustment. |
| Authors: | Zeynel Cebeci [aut, cre], Figen Ceritoglu [aut], Melis Celik Guney [aut], Adnan Unalan [aut] |
| Maintainer: | Zeynel Cebeci <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.1.2 |
| Built: | 2026-05-20 05:54:40 UTC |
| Source: | https://github.com/zcebeci/osktnorm |
Adaptive Box-Cox (ABC) transformation (Yu et al, 2022) is a data transformation method to transform a non-normal numeric variable toward normality by tuning a power parameter based on a normality test result. The method selects the optimal transformation parameter by maximizing the Shapiro-Wilk normality test p-value.
abc(x, lrange = seq(-3, 3, 0.01))abc(x, lrange = seq(-3, 3, 0.01))
x |
A numeric vector containing the data to be transformed. If non-positive values are present, the data are automatically shifted to ensure positivity. |
lrange |
A numeric vector specifying the range of lambda values to be evaluated. The default is a sequence from -3 to 3 with step size 0.01. |
The ABC method searches over a predefined range of transformation parameters () and
applies a Box-Cox-type transformation for each candidate value.
For each , the transformed data are tested for normality using the
Shapiro-Wilk test. The optimal is selected as the value that maximizes
the logarithm of the Shapiro-Wilk p-value.
The transformation is defined as:
In this implementation, to satisfy the Box-Cox requirement of strictly positive data, the function automatically shifts the input vector if any non-positive values are detected.
A list with the following components:
transformed |
A numeric vector containing the transformed data. |
lambda |
The selected lambda value that maximizes the Shapiro–Wilk log p-value. |
Zeynel Cebeci
Yu, H., Sang, P., & Huan, T. (2022). Adaptive Box-Cox transformation: A highly flexible feature-specific data transformation to improve metabolomic data normality for better statistical analysis. Analytical Chemistry, 94(23), 8267-8276. doi:10.1021/acs.analchem.2c00503.
set.seed(12) x <- rexp(100) result <- abc(x) result$lambda hist(result$transformed, main = "ABC Transformed Data")set.seed(12) x <- rexp(100) result <- abc(x) result$lambda hist(result$transformed, main = "ABC Transformed Data")
Performs numerical back-transformation (inverse transformation) for the
Optimized Skewness and Kurtosis Transformation (OSKT) based on the Tukey
family (Tukey, 1977). The function recovers the original-scale data from the
OSKT-transformed values using either a Newton–Raphson algorithm or a
bracketing root-finding method.
backoskt(Z, X_mean, X_sd, g, h, tol = 1e-10, maxiter = 1e6, method = c("ur", "nr"))backoskt(Z, X_mean, X_sd, g, h, tol = 1e-10, maxiter = 1e6, method = c("ur", "nr"))
Z |
A numeric vector of OSKT-transformed and standardized observations. |
X_mean |
Sample mean of the original data used during standardization. |
X_sd |
Sample standard deviation of the original data used during standardization. |
g |
Skewness parameter of the Tukey |
h |
Kurtosis (tail heaviness) parameter of the Tukey |
tol |
Numerical tolerance for convergence of the root-finding algorithms.
Default is |
maxiter |
Maximum number of iterations allowed for the |
method |
Character string specifying the numerical inversion method.
Either |
The OSKT transformation is based on the Tukey transformation applied
to standardized data. Since the inverse transformation has no closed-form
solution, numerical methods are required.
Two inversion strategies are provided:
"nr": Newton–Raphson iteration initialized at ,
offering fast convergence when the derivative is well-behaved.
"ur": A safe bracketing method using uniroot,
ensuring convergence at the expense of computational speed.
After inversion, the results are de-standardized using the supplied mean and standard deviation to recover the original data scale.
A list with the following components:
X_orig: Back-transformed observations on the original scale.
X_s: Back-transformed standardized values.
time_seconds: Total computation time in seconds.
method: The numerical inversion method used.
Zeynel Cebeci
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
Headrick, T. C., & Pant, M. D. (2012). Characterizing Tukey and distributions through their moments.
Journal of Statistical Distributions and Applications, 1(1), 1–20.
set.seed(123) x <- rt(200, df = 5) # Example parameters (typically estimated via oskt) g <- 0.2 h <- 0.15 # Standardize and apply forward g-h transformation x_s <- scale(x) z <- ((exp(g * x_s) - 1) / g) * exp(0.5 * h * x_s^2) # Back-transformation res <- backoskt( Z = z, X_mean = mean(x), X_sd = sd(x), g = g, h = h, method = "nr" ) head(x) head(res$X_orig) plot(x, res$X_orig, xlab="Original", ylab="Back transformed", col="blue", pch=19) hist(x) hist(res$X_orig)set.seed(123) x <- rt(200, df = 5) # Example parameters (typically estimated via oskt) g <- 0.2 h <- 0.15 # Standardize and apply forward g-h transformation x_s <- scale(x) z <- ((exp(g * x_s) - 1) / g) * exp(0.5 * h * x_s^2) # Back-transformation res <- backoskt( Z = z, X_mean = mean(x), X_sd = sd(x), g = g, h = h, method = "nr" ) head(x) head(res$X_orig) plot(x, res$X_orig, xlab="Original", ylab="Back transformed", col="blue", pch=19) hist(x) hist(res$X_orig)
Computes the inverse of the Optimized Skewness and Kurtosis Transformation (OSKT) using high-performance numerical root-finding algorithms implemented in C++. The function efficiently recovers original-scale observations from OSKT-transformed values by solving a nonlinear equation for each observation.
backosktfast( Z, X_mean, X_sd, g, h, method = "auto", tol = 1e-10, maxiter_nr = 1000, maxiter_brent = 2000 )backosktfast( Z, X_mean, X_sd, g, h, method = "auto", tol = 1e-10, maxiter_nr = 1000, maxiter_brent = 2000 )
Z |
Numeric vector of OSKT-transformed values to be inverted.
Missing values ( |
X_mean |
Numeric scalar. Mean of the original data before standardization. |
X_sd |
Numeric scalar. Standard deviation of the original data before standardization. |
g |
Numeric scalar. Optimized skewness parameter returned by the OSKT transformation function. Values close to zero are handled using a numerically stable limiting form. |
h |
Numeric scalar. Optimized kurtosis parameter returned by the OSKT transformation function.
Must be non-negative ( |
method |
Character string specifying the numerical root-finding strategy:
|
tol |
Positive numeric scalar specifying the convergence tolerance for the root-finding algorithms. |
maxiter_nr |
Positive integer. Maximum number of iterations allowed for the Newton–Raphson phase. |
maxiter_brent |
Positive integer. Maximum number of iterations allowed for the Brent–Dekker phase. |
The Optimized Skewness and Kurtosis Transformation (OSKT) is defined as
where is the standardized variable.
When , the transformation is defined by the continuous limit
This function numerically solves the nonlinear equation
for , and then applies the inverse standardization
Numerical Methods:
Newton–Raphson uses analytic derivatives and exhibits quadratic convergence near the solution but may fail for extreme values or poor initial guesses.
Brent–Dekker is a robust bracketing algorithm combining bisection, secant, and inverse quadratic interpolation (Brent, 1973). Convergence is guaranteed if a root is bracketed.
Auto mode combines both approaches, achieving high performance while retaining robustness.
All heavy numerical computations are implemented in C++ via Rcpp.
A list with the following components:
Numeric vector of inverse-transformed values on the original scale.
Entries are NA where inversion failed or input values were NA.
Character vector of the same length as Z, indicating which method
succeeded for each observation:
"failed"Root-finding failed to converge.
"nr"Newton–Raphson succeeded (when method = "nr").
"brent"Brent–Dekker succeeded (when method = "brent").
"auto-nr"Newton–Raphson succeeded in auto mode.
"auto-brent"Brent–Dekker fallback succeeded in auto mode.
Zeynel Cebeci
Brent, R. P. (1973). Algorithms for Minimization without Derivatives. Prentice-Hall, Englewood Cliffs, NJ.
oskt for the forward OSKT transformation. osktfast for the fast forward OSKT transformation. backoskt for the forward OSKT transformation using uniroot in R. uniroot for R's base root-finding routine.
# Example data set.seed(123) X <- c(-50, -10, 0, 10, 50) Z <- scale(X) # Newton–Raphson res_nr <- backosktfast(Z, 0, 1, g = 0.5, h = 0.1, method = "nr") res_nr$X_orig # Brent–Dekker res_br <- backosktfast(Z, 0, 1, g = 0.5, h = 0.1, method = "brent") res_br$X_orig # Auto mode res <- backosktfast(Z, X_mean = 0, X_sd = 1, g = 0.5, h = 0.1) res$X_orig table(res$method_used) # Handling missing values Z_na <- c(-10, 0, 10, NA) backosktfast(Z_na, 0, 1, g = 0.3, h = 0.05)$X_orig# Example data set.seed(123) X <- c(-50, -10, 0, 10, 50) Z <- scale(X) # Newton–Raphson res_nr <- backosktfast(Z, 0, 1, g = 0.5, h = 0.1, method = "nr") res_nr$X_orig # Brent–Dekker res_br <- backosktfast(Z, 0, 1, g = 0.5, h = 0.1, method = "brent") res_br$X_orig # Auto mode res <- backosktfast(Z, X_mean = 0, X_sd = 1, g = 0.5, h = 0.1) res$X_orig table(res$method_used) # Handling missing values Z_na <- c(-10, 0, 10, NA) backosktfast(Z_na, 0, 1, g = 0.3, h = 0.05)$X_orig
Performs a Box-Cox transformation on a numeric vector. Optionally, the data can be shifted to ensure all values are positive before applying the transformation. If lambda is not provided, it is estimated via maximum likelihood.
boxcox(x, lambda = NULL, makepositive = FALSE, eps = 1e-06)boxcox(x, lambda = NULL, makepositive = FALSE, eps = 1e-06)
x |
A numeric vector to be transformed. |
lambda |
Optional numeric value of the Box–Cox transformation parameter.
If |
makepositive |
Logical. If |
eps |
A small positive constant used for numerical stability.
It is added implicitly when enforcing positivity or to avoid taking the logarithm
of zero. Default is |
The Box-Cox transformation is defined as:
If makepositive = TRUE, the function shifts the data by abs(min(x)) + 1
if there are zero or negative values, to make all values positive.
Lambda is estimated via maximum likelihood over a grid of values from -4 to 4 (default 500 points) if not specified.
A list with the following components:
transformed |
The transformed numeric vector. |
lambda |
The lambda used in the transformation (either provided or estimated). |
shift |
The amount by which the original data was shifted to make it positive (0 if no shift). |
Zeynel Cebeci
# Generate positively skewed example data set.seed(123) x <- rlnorm(50, meanlog = 0, sdlog = 1) # Box-Cox with estimated lambda (MLE) res <- boxcox(x) head(res$transformed) res$lambda res$shift # Box-Cox with specified lambda res2 <- boxcox(x, lambda = 0.25) head(res2$transformed) # Box-Cox with automatic shift for nonpositive values x2 <- x - quantile(x, 0.2) res3 <- boxcox(x2, makepositive = TRUE) head(res3$transformed) res3$shift# Generate positively skewed example data set.seed(123) x <- rlnorm(50, meanlog = 0, sdlog = 1) # Box-Cox with estimated lambda (MLE) res <- boxcox(x) head(res$transformed) res$lambda res$shift # Box-Cox with specified lambda res2 <- boxcox(x, lambda = 0.25) head(res2$transformed) # Box-Cox with automatic shift for nonpositive values x2 <- x - quantile(x, 0.2) res3 <- boxcox(x2, makepositive = TRUE) head(res3$transformed) res3$shift
Ajanların konumlarına ve belirlenen algılama yarıçapına (R) göre sistemin topolojisini belirleyen bitişiklik matrisini oluşturur. Tezdeki Denklem 2.2 uyarınca, birbirine R mesafesinden yakın olan ajanlar komşu kabul edilir.
compute_adjacency_fast(pos, R)compute_adjacency_fast(pos, R)
pos |
N x n boyutunda sayısal matris. Ajanların öklid uzayındaki konumlarını içerir. |
R |
Sayısal değer (skaler). Algılama yarıçapını (Sensing Radius) belirtir. İki ajan arasındaki mesafe R'den küçükse aralarında bir bağ olduğu varsayılır. |
Fonksiyon, R'ın yerleşik dist fonksiyonunu kullanarak tüm ajanlar arasındaki mesafe matrisini çıkarır.
Ardından vektörel mantıksal karşılaştırma ile şu koşulu kontrol eder:
Kendisiyle olan bağlantı (mesafe > 0 koşulu ile) hariç tutulur. Sonuç, bellek verimliliği için tamsayı (integer) tipine dönüştürülür.
N x N boyutunda, 1 ve 0'lardan oluşan simetrik bir matris döndürür. A(i,j) = 1 ise i ve j ajanları komşudur.
Sürü Kontrol Araştırma Grubu
Chapter 2, Equation 2.2: Neighborhood definition. Source Code: Matris Tabanlı Sürü Kontrol Algoritmaları.
compute_control_input_fast
Performs a Monte Carlo based Cramér-von Mises (CVM) goodness-of-fit test for normality. The p-value is approximated via simulation under the standard normal distribution.
cvmtest(x, nsim = 10000, seed = NULL)cvmtest(x, nsim = 10000, seed = NULL)
x |
A numeric vector of observations to be tested for normality. |
nsim |
Number of Monte Carlo simulations used to approximate the null distribution of the CVM statistic.
Default is |
seed |
Optional integer seed for reproducibility. If |
The Cramér-von Mises statistic measures the squared distance between the empirical distribution function of the data and the theoretical cumulative distribution function under the null hypothesis.
In this implementation, the null hypothesis assumes that the data follow a standard normal distribution. The CVM statistic is computed as:
where denotes the standard normal cumulative distribution function
and are the ordered observations.
Because the exact distribution of the statistic is not used, the p-value is estimated via Monte Carlo simulation by repeatedly generating samples from the standard normal distribution and recomputing the CVM statistic.
A list with the following components:
statistic |
Observed Cramér-von Mises test statistic. |
p.value |
Monte Carlo estimated p-value. |
Zeynel Cebeci
Cramér, H. (1928). On the composition of elementary errors: First paper: Mathematical deductions. Scandinavian Actuarial Journal, 1928(1), 13-74. doi:10.1080/03461238.1928.10416862.
von Mises, R. (1931). Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und theoretischen Physik.
# Generate normal distributed data set.seed(123) x <- rnorm(50) result <- cvmtest(x, nsim = 1000) # Increase in real apps result$statistic result$p.value # Generate positively skewed example data set.seed(123) x <- rlnorm(50, meanlog = 0, sdlog = 1) result <- cvmtest(x, nsim = 1000) # Increase in real apps result$statistic result$p.value# Generate normal distributed data set.seed(123) x <- rnorm(50) result <- cvmtest(x, nsim = 1000) # Increase in real apps result$statistic result$p.value # Generate positively skewed example data set.seed(123) x <- rlnorm(50, meanlog = 0, sdlog = 1) result <- cvmtest(x, nsim = 1000) # Increase in real apps result$statistic result$p.value
Performs a rank-based inverse normal transformation (INT) that maps a numeric vector to approximately standard normal scores.
int(x, ties.method = "average", na.action = "keep")int(x, ties.method = "average", na.action = "keep")
x |
A numeric vector to be transformed. Missing values (NA) are allowed. |
ties.method |
A character string specifying how ties are handled in ranking.
Default is |
na.action |
A character string indicating how missing values are treated
during ranking. Default is |
The inverse normal transformation (INT) is a nonparametric normalization method based on data ranks. The procedure first maps ranks to uniform quantiles and then applies the inverse standard normal distribution function.
For each non-missing observation , the transformation is defined as:
where is the number of non-missing observations.
A list with one element:
A numeric vector of the same length as x containing
inverse normal transformed values. Missing values are preserved.
x <- c(5, 2, 8, 8, 3) res <- int(x) res$transformed # With missing values x2 <- c(1.2, NA, 3.4, 2.1) int(x2)$transformedx <- c(5, 2, 8, 8, 3) res <- int(x) res$transformed # With missing values x2 <- c(1.2, NA, 3.4, 2.1) int(x2)$transformed
Transforms a non-normal variable into a Gaussian (Normal) distribution using the Iterative Generalized Method of Moments (IGMM) for Lambert W x F transforms.
It handles skewed (s), heavy-tailed (h), or both (hh) distributions.
lambert(x, type = c("s", "h", "hh"), maxiter = 200, tol = 1e-06, step_gamma = 0.25, step_delta = 0.1)lambert(x, type = c("s", "h", "hh"), maxiter = 200, tol = 1e-06, step_gamma = 0.25, step_delta = 0.1)
x |
A numeric vector to be transformed. |
type |
Character string specifying the type of transformation: |
maxiter |
Maximum number of IGMM iterations. Default is 200. |
tol |
Convergence tolerance for parameter updates. Default is 1e-6. |
step_gamma |
The damping factor for the skewness parameter ( |
step_delta |
The damping factor for the heavy-tail parameter ( |
The function uses a robust Halley's method to solve the Lambert W function internally. The IGMM algorithm iteratively updates the transformation parameters
( and ) to minimize the skewness and excess kurtosis of the latent variable.
A list containing:
transformed |
The numeric vector of Gaussianized values, maintaining NA positions. |
params |
A list of estimated parameters: |
iterations |
Number of iterations performed until convergence. |
converged |
Logical indicating if the algorithm converged within |
method |
Character string indicating the estimation method ( |
Zeynel Cebeci
Goerg, G. M. (2011). Lambert W random variables - A new family of generalized skewed distributions with applications to risk estimation. The Annals of Applied Statistics, 5(3), 2197-2230. doi:10.1214/11-AOAS457
Goerg, G. M. (2015). The Lambert Way to Gaussianize heavy-tailed data with the inverse of Tukey's h transformation as a special case. The Scientific World Journal, 2015, 1-18. doi:10.1155/2015/909231
Corless, R. M., Gonnet, G. H., Hare, D. E. G., Jeffrey, D. J., & Knuth, D. E. (1996). On the Lambert W function. Advances in Computational Mathematics, 5(1), 329-359. doi:10.1007/BF02124750
# Generate skewed data using a Gamma distribution set.seed(123) skewed_data <- rgamma(500, shape = 2, scale = 2) # Apply the Lambert W transformation (skewed type) result <- lambert(skewed_data, type = "s") # Visualization op <- par(mfrow = c(1, 2)) hist(skewed_data, main = "Original (Gamma)", col = "orange", breaks = 30) hist(result$transformed, main = "Gaussianized (Lambert)", col = "skyblue", breaks = 30) par(op)# Generate skewed data using a Gamma distribution set.seed(123) skewed_data <- rgamma(500, shape = 2, scale = 2) # Apply the Lambert W transformation (skewed type) result <- lambert(skewed_data, type = "s") # Visualization op <- par(mfrow = c(1, 2)) hist(skewed_data, main = "Original (Gamma)", col = "orange", breaks = 30) hist(result$transformed, main = "Gaussianized (Lambert)", col = "skyblue", breaks = 30) par(op)
Applies the Tukey g-and-h transformation to a numeric vector and estimates optimal skewness and tail-heaviness parameters
by minimizing the Anderson-Darling normality test statistic (A).
oskt(x, init_params = c(0.1, 0.1), lower_bounds = c(-1, 0), upper_bounds = c(1, 0.5))oskt(x, init_params = c(0.1, 0.1), lower_bounds = c(-1, 0), upper_bounds = c(1, 0.5))
x |
A numeric vector of observations to be transformed. The data are internally standardized to zero mean and unit variance. |
init_params |
Numeric vector of length two giving the initial values of the Tukey g-and-h parameters |
lower_bounds |
Numeric vector of length two specifying lower bounds for |
upper_bounds |
Numeric vector of length two specifying upper bounds for |
The Tukey g-and-h transformation is defined as:
where controls skewness and controls tail heaviness.
To assess normality of the transformed data, the Anderson–Darling statistic
is computed directly from its analytical form under the standard
normal distribution. The implementation includes the Stephens correction
used when location and scale parameters are estimated from the data:
Optimal parameters are obtained by minimizing the corrected statistic
using constrained optimization via the "L-BFGS-B" algorithm.
A list with the following components:
transformed |
Numeric vector containing the transformed data. |
g |
Estimated skewness parameter of the Tukey g-and-h transformation. |
h |
Estimated tail-heaviness parameter of the Tukey g-and-h transformation. |
This function does not rely on external normality testing packages. The Anderson–Darling statistic is computed explicitly to ensure numerical consistency and package independence.
If optimization fails, the standardized input data are returned and
g and h are set to NA.
Zeynel Cebeci
Stephens, M. A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347), 730-737.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
set.seed(123) x <- rexp(100) restrans <- oskt(x) restrans$g restrans$h hist(restrans$transformed, xlab="Transformed Values", main = "OSKT Transformed Data") set.seed(123) x <- rlnorm(300, meanlog = 0, sdlog = 0.75) restrans <- oskt(x) restrans$g restrans$h hist(restrans$transformed, xlab="Transformed Values", main = "OSKT Transformed Data")set.seed(123) x <- rexp(100) restrans <- oskt(x) restrans$g restrans$h hist(restrans$transformed, xlab="Transformed Values", main = "OSKT Transformed Data") set.seed(123) x <- rlnorm(300, meanlog = 0, sdlog = 0.75) restrans <- oskt(x) restrans$g restrans$h hist(restrans$transformed, xlab="Transformed Values", main = "OSKT Transformed Data")
Performs a g-and-h transformation to reduce skewness and kurtosis, minimizing the Anderson-Darling A2 statistic. The transformation aims to normalize non-normal data by targeting skewness and kurtosis simultaneously.
osktfast(x, init_params = c(0.1, 0.1), lower_bounds = c(-1, 0), upper_bounds = c(1, 0.5), maxiter = 200)osktfast(x, init_params = c(0.1, 0.1), lower_bounds = c(-1, 0), upper_bounds = c(1, 0.5), maxiter = 200)
x |
Numeric vector of input data. |
init_params |
Numeric vector of length 2: initial values for g and h. Default is c(0.1, 0.1). |
lower_bounds |
Numeric vector of length 2: lower bounds for g and h. Default is c(-1, 0). |
upper_bounds |
Numeric vector of length 2: upper bounds for g and h. Default is c(1, 0.5). |
maxiter |
Integer: maximum number of iterations for the optimizer. Default is 200. |
This function uses a pure C++ implementation of the L-BFGS-B algorithm to optimize the g-and-h transformation parameters for normality. The objective function is the modified Anderson-Darling A2 statistic with Stephen's correction for small samples. It is suitable for moment-targeting normalization is desired.
A list containing:
transformed |
The transformed numeric vector. |
g |
Estimated g parameter of the g-and-h transformation. |
h |
Estimated h parameter of the g-and-h transformation. |
value |
Anderson-Darling A2 statistic at the optimum. |
Zeynel Cebeci
set.seed(123) x <- rnorm(100, mean=5, sd=2) # Simulate non-normal data res <- osktfast(x) # Check transformed data head(res$transformed) res$g res$h res$valueset.seed(123) x <- rnorm(100, mean=5, sd=2) # Simulate non-normal data res <- osktfast(x) # Check transformed data head(res$transformed) res$g res$h res$value
Applies Optimal Skewness–Kurtosis Transformation (OSKT) column-wise to numeric variables in a matrix or data frame. Non-numeric columns are automatically excluded and reported. Optional normality diagnostics can be computed for transformed traits.
osktnorm(data, normtests = FALSE, nsim = 100, shapiro_limit = 5000, verbose = TRUE, keep_raw = FALSE)osktnorm(data, normtests = FALSE, nsim = 100, shapiro_limit = 5000, verbose = TRUE, keep_raw = FALSE)
data |
A |
normtests |
Controls which normality diagnostics are computed for transformed traits. Possible values:
Available test identifiers (case-insensitive):
|
nsim |
Integer. Number of Monte Carlo simulations used in |
shapiro_limit |
Integer. Maximum sample size allowed for Shapiro–Wilk test.
If the sample size exceeds this limit, the Shapiro p-value
is returned as |
verbose |
Logical. If |
keep_raw |
Logical. If |
The function performs the following steps:
Validates that the input is a matrix or data frame.
Detects numeric columns.
Excludes non-numeric columns and reports them (if verbose = TRUE).
Stops with an error if no numeric columns remain.
Applies osktfast() to each numeric trait.
Optionally computes selected normality diagnostics on the transformed data.
Normality tests are done on the transformed values after normalization.
Even if a single test is requested, the output in normtests remains a data frame
organized by trait (rows).
An object of class "osktnorm" containing:
normalized: A data frame containing the transformed numeric traits.
parameters: A data frame of optimized OSKT parameters (g, h, and objective value A2) for each trait.
normtests: A data frame of selected normality diagnostics where each row corresponds to a trait and each column to a test. Returns NULL if normtests = FALSE.
removed_columns: A character vector of excluded non-numeric column names.
osktfast,
zatest,
zctest,
cvmtest,
pearsonp
set.seed(123) origdata <- data.frame( id = factor(sprintf("id%03d", 1:100)), trait1 = rexp(100), trait2 = rchisq(100, df = 3), group = factor(sample(letters[1:3], 100, TRUE)) ) res1 <- osktnorm(origdata) head(res1$normalized) res1$parameters res2 <- osktnorm(origdata, normtests = "cvm") head(res2$normalized) res2$parameters res2$normtests res3 <- osktnorm(origdata, normtests = c("cvm","sw","ppm")) head(res3$normalized) res3$parameters res3$normtests res4 <- osktnorm(origdata, normtests = "all") print(res4$normtests)set.seed(123) origdata <- data.frame( id = factor(sprintf("id%03d", 1:100)), trait1 = rexp(100), trait2 = rchisq(100, df = 3), group = factor(sample(letters[1:3], 100, TRUE)) ) res1 <- osktnorm(origdata) head(res1$normalized) res1$parameters res2 <- osktnorm(origdata, normtests = "cvm") head(res2$normalized) res2$parameters res2$normtests res3 <- osktnorm(origdata, normtests = c("cvm","sw","ppm")) head(res3$normalized) res3$parameters res3$normtests res4 <- osktnorm(origdata, normtests = "all") print(res4$normtests)
Computes the Pearson P metric for assessing deviation from normality. The statistic is defined as the Pearson Chi-square goodness-of-fit statistic divided by its degrees of freedom. This scaled form is used as a normality metric rather than a formal hypothesis test.
pearsonp(x, nbins = NULL)pearsonp(x, nbins = NULL)
x |
A numeric vector of observations. Missing values are removed prior to computation. |
nbins |
Optional integer specifying the number of equal-probability bins.
If |
The data are standardized using the sample mean and standard deviation. The standardized values are then grouped into equal-probability bins defined by the quantiles of the standard normal distribution.
Let denote the Pearson Chi-square statistic and
the degrees of freedom, accounting for estimation of
the mean and variance. The Pearson P metric is defined as
.
Unlike nortest::pearson.test, this function does not compute
a p-value and should be interpreted as a descriptive normality metric.
Smaller values indicate closer agreement with normality.
An object of class "htest" containing:
statistic |
The Pearson P metric ( |
method |
A character string describing the metric. |
data.name |
The name of the input data. |
df |
Degrees of freedom used in the scaling. |
Zeynel Cebeci
set.seed(28) x <- rnorm(100) res <- pearsonp(x) res$statistic set.seed(42) x <- rlnorm(100) pearsonp(x)$statistic res <- pearsonp(x, nbins = 8) res$statisticset.seed(28) x <- rnorm(100) res <- pearsonp(x) res$statistic set.seed(42) x <- rlnorm(100) pearsonp(x)$statistic res <- pearsonp(x, nbins = 8) res$statistic
A comprehensive dataset containing 37 morphological, agronomic, and quality traits measured across 193 rice genotypes. The data covers flowering times across different locations, seed morphology, and grain quality parameters.
data(phenodata)data(phenodata)
A data frame with 193 observations on the following 37 variables:
IIDCharacter vector; Individual Identifier.
FTANumeric; Flowering time at Arkansas.
FTFInteger; Flowering time at Faridpur.
FTBInteger; Flowering time at Aberdeen.
RTANumeric; Flowering time ratio of Arkansas/Aberdeen.
RTFNumeric; Flowering time ratio of Faridpur/Aberdeen.
CULMNumeric; Culm habit (stem growth pattern).
LPUBInteger; Leaf pubescence (0: absent, 1: present).
FLLNumeric; Flag leaf length.
FLWNumeric; Flag leaf width.
AWNInteger; Awn presence (0: absent, 1: present).
PNPNumeric; Panicle number per plant.
PHTNumeric; Plant height.
PLENNumeric; Panicle length.
PPBNNumeric; Primary panicle branch number.
SNPPNumeric; Seed number per panicle.
FLPPNumeric; Florets per panicle.
PFRTNumeric; Panicle fertility.
SDLNumeric; Seed length.
SDWNumeric; Seed width.
SDVNumeric; Seed volume.
SDSANumeric; Seed surface area.
BRLNumeric; Brown rice seed length.
BRWNumeric; Brown rice seed width.
BRSANumeric; Brown rice surface area.
BRVNumeric; Brown rice volume.
SLWRNumeric; Seed length/width ratio.
BLWRNumeric; Brown rice length/width ratio.
SCOLInteger; Seed color.
PCOLInteger; Pericarp color.
STRHNumeric; Straighthead susceptibility.
BLSTInteger; Blast resistance score.
AMYNumeric; protlose content.
ASVNumeric; Alkali spreading value.
PROTNumeric; Protein content.
Y07ANumeric; Year 2007 flowering time at Arkansas.
Y06ANumeric; Year 2006 flowering time at Arkansas.
The dataset is subject to an omit action for missing values, with several genotypes excluded due to incomplete phenotypic records.
These traits are essential for quantitative trait loci (QTL) mapping and genome-wide association studies (GWAS) in rice diversity research.
A reduced version of data, obtained from the Rice Diversity Project. Original phenotypic and genotypic data are available at http://www.ricediversity.org/data/.
Zhao, K., Tung, C. W., Eizenga, G. C., Wright, M. H., Ali, M. L., Price, A. H., ... & McCouch, S. R. (2011). Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nature Communications, 2(1), 467. doi:10.1038/ncomms1467
data(phenodata) # Summary of plant height across the population summary(phenodata$PHT) # Correlation between Seed Length and Brown Rice Seed Length plot(phenodata$SDL, phenodata$BRL, xlab = "Seed Length", ylab = "Brown Rice Length", main = "Seed Morphology Correlation") # Normalize protein data prot <- phenodata$PROT prot <- as.matrix(prot[!is.na(prot)]) sw_before <- shapiro.test(prot) prot_oskt <- osktfast(prot)$transformed sw_after <- shapiro.test(prot_oskt) par(mfrow = c(1, 2)) hist(prot, breaks = 20, col = "lightgreen", main = "Original", xlab = "Values") hist(prot_oskt, breaks = 20, col = "skyblue", main = "OSKT Normalized", xlab = "Transformed Values") op <- par(mfrow = c(1, 1)) print("Shapiro-Wilk Test (Before OSKT)") print(sw_before) print("\nShapiro-Wilk Test (After OSKT)") print(sw_after) par(op)data(phenodata) # Summary of plant height across the population summary(phenodata$PHT) # Correlation between Seed Length and Brown Rice Seed Length plot(phenodata$SDL, phenodata$BRL, xlab = "Seed Length", ylab = "Brown Rice Length", main = "Seed Morphology Correlation") # Normalize protein data prot <- phenodata$PROT prot <- as.matrix(prot[!is.na(prot)]) sw_before <- shapiro.test(prot) prot_oskt <- osktfast(prot)$transformed sw_after <- shapiro.test(prot_oskt) par(mfrow = c(1, 2)) hist(prot, breaks = 20, col = "lightgreen", main = "Original", xlab = "Values") hist(prot_oskt, breaks = 20, col = "skyblue", main = "OSKT Normalized", xlab = "Transformed Values") op <- par(mfrow = c(1, 1)) print("Shapiro-Wilk Test (Before OSKT)") print(sw_before) print("\nShapiro-Wilk Test (After OSKT)") print(sw_after) par(op)
These functions provide visualization tools to reproduce the graphical results presented in Chapter 3 of the thesis. They use ggplot2 to plot agent trajectories, tracking errors, and statistical consensus rates.
plot_trajectories(history_pos, leader_history, is_informed, T_step) plot_tracking_errors(history_pos, leader_history, is_informed, T_step) plot_consensus_stats(results_df)plot_trajectories(history_pos, leader_history, is_informed, T_step) plot_tracking_errors(history_pos, leader_history, is_informed, T_step) plot_consensus_stats(results_df)
history_pos |
A numeric array of dimensions |
leader_history |
A numeric matrix of dimensions |
is_informed |
A logical vector of length |
T_step |
A numeric scalar representing the sampling time step used in the simulation (e.g., 0.1). |
results_df |
A data frame summarizing Monte Carlo simulation results. Must contain columns: |
plot_trajectories: Visualizes the movement paths of the agents and the virtual leader in X and Y dimensions. This function reproduces the style of Figures 3.5 - 3.8 in the thesis. Informed agents are plotted in blue, uninformed in black/gray, and the virtual leader in dashed red.
plot_tracking_errors: Plots the position error () over time for each agent. This function reproduces the style of Figures 3.1 - 3.4 in the thesis. It allows visual verification of whether the swarm successfully tracks the leader (error converges to zero).
plot_consensus_stats: Visualizes the relationship between consensus rate, sensing radius, and the percentage of informed agents. This reproduces the statistical analysis plots found in Figures 3.33 - 3.34.
A ggplot object that can be printed or modified further.
Cagatay Cebeci
prepare_plot_data
# --- Example for Trajectory and Error Plots --- # Assuming 'history_pos' and 'leader_history' are generated from a simulation: N <- 20 is_informed <- rep(FALSE, N); is_informed[1:2] <- TRUE # 10% informed T_step <- 0.1 ## Not run: # Plot Trajectories p1 <- plot_trajectories(history_pos, leader_history, is_informed, T_step) print(p1) # Plot Errors p2 <- plot_tracking_errors(history_pos, leader_history, is_informed, T_step) print(p2) ## End(Not run) # --- Example for Statistical Plot --- # Mock data representing thesis Table 3.1 results results <- data.frame( Radius = rep(c(5, 7, 10, 20), each = 2), AgentCount = rep(c(20, 50), 4), ConsensusRate = runif(8, 0.5, 1.0), InformedPercentage = 10 ) p3 <- plot_consensus_stats(results) print(p3)# --- Example for Trajectory and Error Plots --- # Assuming 'history_pos' and 'leader_history' are generated from a simulation: N <- 20 is_informed <- rep(FALSE, N); is_informed[1:2] <- TRUE # 10% informed T_step <- 0.1 ## Not run: # Plot Trajectories p1 <- plot_trajectories(history_pos, leader_history, is_informed, T_step) print(p1) # Plot Errors p2 <- plot_tracking_errors(history_pos, leader_history, is_informed, T_step) print(p2) ## End(Not run) # --- Example for Statistical Plot --- # Mock data representing thesis Table 3.1 results results <- data.frame( Radius = rep(c(5, 7, 10, 20), each = 2), AgentCount = rep(c(20, 50), 4), ConsensusRate = runif(8, 0.5, 1.0), InformedPercentage = 10 ) p3 <- plot_consensus_stats(results) print(p3)
Performs a robust reweighted maximum likelihood estimation of the Box-Cox (RBC) transformation parameter for univariate data, following the methodology of Raymaekers and Rousseeuw (2024). The procedure aims at achieving central normality by iteratively downweighting outlying observations using Huber-type weights.
rewbc(x, lrange = seq(-3, 3, by = 0.01), rwsteps = 2, k = 1.5)rewbc(x, lrange = seq(-3, 3, by = 0.01), rwsteps = 2, k = 1.5)
x |
A numeric vector of observations. Missing values are removed. If non-positive values are present, the data are automatically shifted to ensure positivity. |
lrange |
A numeric vector specifying the grid of candidate Box-Cox
transformation parameters |
rwsteps |
An integer specifying the number of iterative reweighting steps used in the reweighted maximum likelihood procedure. |
k |
The tuning constant for the Huber-weight function applied to the standardized transformed data. Smaller values lead to stronger downweighting of extreme observations. |
The function first computes the classical maximum likelihood estimate (MLE) of the Box-Cox transformation parameter assuming normality of the transformed data.
In subsequent iterations, the data are transformed using the current
estimate of . Robust estimates of location and scale
(median and MAD) are used to compute standardized residuals, from which
Huber-type weights are derived. These weights are then used to
re-maximize the Box-Cox log-likelihood over the specified grid of
values.
The weighted log-likelihood maximized at each step is given by
where denotes the (possibly weighted) variance of the
transformed data, and robustness enters through the estimation of
via observation weights.
A list with the following components:
transformed |
The Box-Cox transformed data using the final estimated |
lambda |
The estimated Box-Cox transformation parameter. |
weights |
The final Huber-type weights assigned to each observation. |
steps |
The number of reweighting iterations performed. |
Zeynel Cebeci
Raymaekers, J., & Rousseeuw, P. J. (2024). Transforming variables to central normality. Machine Learning, 113(8), 4953–4975.
boxcox, yeojohnson, abc, rewyj, osktnorm
# Generate non-normal data set.seed(123) x <- c(rnorm(90), rnorm(10, mean = 5)) head(x) shapiro.test(x) # Reweigted Box-Cox with estimated lambda res <- rewbc(x) res$lambda head(res$transformed) shapiro.test(res$transformed) hist(res$transformed, main = "Reweighted Box-Cox Transformed Data") # Reweigted Box-Cox with specified lambda res2 <- rewbc(x, lrange = c(-1, 0.5, 1)) res2$lambda head(res2$transformed) shapiro.test(res2$transformed)# Generate non-normal data set.seed(123) x <- c(rnorm(90), rnorm(10, mean = 5)) head(x) shapiro.test(x) # Reweigted Box-Cox with estimated lambda res <- rewbc(x) res$lambda head(res$transformed) shapiro.test(res$transformed) hist(res$transformed, main = "Reweighted Box-Cox Transformed Data") # Reweigted Box-Cox with specified lambda res2 <- rewbc(x, lrange = c(-1, 0.5, 1)) res2$lambda head(res2$transformed) shapiro.test(res2$transformed)
Performs a robust, reweighted maximum likelihood estimation of the Yeo-Johnson transformation parameter for univariate data. Outliers are downweighted using Huber-type weights in an iteratively reweighted likelihood framework.
rewyj(x, lrange = seq(-3, 3, by = 0.01), rwsteps = 2, k = 1.5)rewyj(x, lrange = seq(-3, 3, by = 0.01), rwsteps = 2, k = 1.5)
x |
A numeric vector of observations. Missing values are removed. The data may contain both positive and negative values. |
lrange |
A numeric vector specifying the grid of candidate
|
rwsteps |
An integer specifying the number of reweighting iterations in the iteratively reweighted maximum likelihood procedure. |
k |
Tuning constant for the Huber weight function (Huber, 1981). Larger values reduce robustness, while smaller values increase downweighting of extreme observations. |
The function implements the reweighted maximum likelihood (RewML) approach for the Yeo-Johnson transformation (Yeo &Johnson, 2000) as described by Raymaekers and Rousseeuw (2024).
In the first step, the classical maximum likelihood estimate (MLE)
of the Yeo-Johnson transformation parameter is obtained
under a normality assumption.
Subsequently, the algorithm iteratively:
Transforms the data using the current estimate of .
Computes robust location and scale estimates using the median and median absolute deviation (MAD).
Standardizes the transformed data and computes Huber-type weights.
Re-maximizes a weighted Yeo–Johnson log-likelihood over the
specified grid of values.
The Jacobian term of the Yeo-Johnson transformation is included unweighted, following the formulation in Raymaekers and Rousseeuw (2024).
The weighted log-likelihood has the form
where is the weighted variance of the transformed data
and denotes the Jacobian contribution of the
Yeo-Johnson transformation.
A list with the following components:
transformed |
The Yeo-Johnson transformed data using the estimated |
lambda |
The estimated Yeo–Johnson transformation parameter. |
weights |
Final robust weights assigned to each observation. |
steps |
Number of reweighting iterations performed. |
Zeynel Cebeci
Raymaekers, J., & Rousseeuw, P. J. (2024). Transforming variables to central normality. Machine Learning, 113(8), 4953-4975.
Yeo, I.-K. & Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954-959. doi:10.1093/biomet/87.4.954.
Huber, P. J. (1981). Robust Statistics. Wiley.
yeojohnson, boxcox, abc, rewbc, osktnorm
# Generate non-normal data set.seed(123) x <- c(rnorm(90), rnorm(10, mean = 5)) head(x) shapiro.test(x) # Reweigted Yeo-Johnson with estimated lambda res <- rewyj(x) res$lambda head(res$transformed) shapiro.test(res$transformed) # Reweigted Yeo-Johnson with specified lambda res2 <- rewyj(x, lrange = c(-1, 0.5, 1)) res2$lambda head(res2$transformed) shapiro.test(res2$transformed) hist(res2$transformed, main = "Reweighted Yeo-Johnson Transformed Data")# Generate non-normal data set.seed(123) x <- c(rnorm(90), rnorm(10, mean = 5)) head(x) shapiro.test(x) # Reweigted Yeo-Johnson with estimated lambda res <- rewyj(x) res$lambda head(res$transformed) shapiro.test(res$transformed) # Reweigted Yeo-Johnson with specified lambda res2 <- rewyj(x, lrange = c(-1, 0.5, 1)) res2$lambda head(res2$transformed) shapiro.test(res2$transformed) hist(res2$transformed, main = "Reweighted Yeo-Johnson Transformed Data")
Performs the Gel-Gastwirth robust version of the Jarque-Bera normality test using quantile-based measures of skewness and kurtosis. The test is designed to reduce sensitivity to outliers by avoiding moment-based estimators.
rjbtest(x)rjbtest(x)
x |
A numeric vector of observations. Missing values are removed prior to computation. |
The classical Jarque–Bera test relies on moment-based estimates of skewness and kurtosis, making it highly sensitive to outliers.
The Gel–Gastwirth robust Jarque–Bera (RJB) test replaces these moments with robust, quantile-based measures.
Robust skewness is measured using the Bowley skewness:
where denotes the empirical -quantile.
Robust kurtosis is measured using the Moors kurtosis (excess form):
The robust Jarque–Bera test statistic is defined as
Under the null hypothesis of normality, the statistic is asymptotically distributed as a chi-squared distribution with 2 degrees of freedom.
An object of class "htest" containing the following components:
statistic |
The value of the robust Jarque–Bera test statistic. |
p.value |
The asymptotic p-value computed from the chi-squared distribution with 2 degrees of freedom. |
method |
A character string describing the test. |
data.name |
A character string giving the name of the data. |
Zeynel Cebeci
Gel, Y. R. and Gastwirth, J. L. (2008). A robust modification of the Jarque–Bera test of normality. Economics Letters, 99(1), 30-32.
# Generate normal distributed data set.seed(123) x <- rnorm(150) result <- rjbtest(x) result$statistic result$p.value # Generate positively skewed example data set.seed(123) x <- rlnorm(150, meanlog = 0, sdlog = 1) result <- rjbtest(x) result$statistic result$p.value# Generate normal distributed data set.seed(123) x <- rnorm(150) result <- rjbtest(x) result$statistic result$p.value # Generate positively skewed example data set.seed(123) x <- rlnorm(150, meanlog = 0, sdlog = 1) result <- rjbtest(x) result$statistic result$p.value
Performs a Yeo-Johnson transformation (Yeo & Johnson, 2000) on a numeric vector. The transformation estimates the optimal lambda via maximum likelihood if not provided. Optionally, the transformed data can be standardized.
yeojohnson(x, lambda = NULL, standardize = TRUE, eps = 1e-6)yeojohnson(x, lambda = NULL, standardize = TRUE, eps = 1e-6)
x |
A numeric vector to transform. |
lambda |
Optional numeric value of lambda for the Yeo-Johnson transformation.
If |
standardize |
Logical. If |
eps |
Numeric tolerance used to handle cases where lambda is approximately 0 or 2.
Default is |
The Yeo-Johnson transformation is a generalization of the Box-Cox transformation that can handle both positive and negative values:
If lambda is not specified, it is estimated via maximum likelihood over a
grid of values from -3 to 3 (step 0.01). Standardization is optional and centers
the transformed data at mean 0 with standard deviation 1.
A list with the following components:
transformed |
The transformed numeric vector. |
lambda |
The lambda value used in the transformation (either provided or estimated via MLE if not defined). |
Zeynel Cebeci
Yeo, I.-K. and Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954-959. doi:10.1093/biomet/87.4.954.
abc, boxcox, rewbc, rewyj, osktnorm
# Generate log-normal data set.seed(123) x <- rlnorm(50) head(x) shapiro.test(x) # Yeo-Johnson with estimated lambda res <- yeojohnson(x) res$lambda head(res$transformed) shapiro.test(res$transformed) # Yeo-Johnson with specified lambda res2 <- yeojohnson(x, lambda = -1) res2$lambda head(res2$transformed) shapiro.test(res2$transformed) # Standardization turned off res3 <- yeojohnson(x, standardize = FALSE) res3$lambda head(res3$transformed) shapiro.test(res3$transformed)# Generate log-normal data set.seed(123) x <- rlnorm(50) head(x) shapiro.test(x) # Yeo-Johnson with estimated lambda res <- yeojohnson(x) res$lambda head(res$transformed) shapiro.test(res$transformed) # Yeo-Johnson with specified lambda res2 <- yeojohnson(x, lambda = -1) res2$lambda head(res2$transformed) shapiro.test(res2$transformed) # Standardization turned off res3 <- yeojohnson(x, standardize = FALSE) res3$lambda head(res3$transformed) shapiro.test(res3$transformed)
Performs the Zhang-Wu ZA test for assessing normality. The test is based on a weighted empirical distribution function statistic and uses Monte Carlo simulation to obtain p-values under the null hypothesis of normality with unknown mean and variance.
zatest(x, nsim = 10000, eps = 1e-10, ncores = 1, seed = NULL)zatest(x, nsim = 10000, eps = 1e-10, ncores = 1, seed = NULL)
x |
A numeric vector of observations. Missing and non-finite values are removed prior to computation. |
nsim |
Number of Monte Carlo simulations used to approximate the null distribution of the test statistic. Larger values improve accuracy at the cost of increased computation time. |
eps |
A small positive constant used to truncate probability values away from 0 and 1 to ensure numerical stability in logarithmic computations. |
ncores |
Number of CPU cores to be used for parallel Monte Carlo simulation.
Must be a positive integer. The default |
seed |
Optional integer value used to set the random number generator seed
for reproducibility of the Monte Carlo simulations.
If |
Let denote the ordered sample.
The data are standardized using the sample mean and standard deviation,
and the standard normal cumulative distribution function is evaluated at
the standardized observations.
The Zhang–Wu ZA test statistic is defined as
where denotes the standard normal cumulative distribution
function.
Because the null distribution of the statistic depends on estimated parameters, asymptotic critical values are not available. Instead, p-values are obtained via Monte Carlo simulation under the null hypothesis by repeatedly generating samples from a normal distribution, re-estimating the mean and variance, and recomputing the test statistic.
Parallel computation is supported via the ncores argument.
An object of class "htest" containing the following components:
statistic |
The observed value of the ZA test statistic. |
p.value |
Monte Carlo p-value for the test. |
method |
A character string describing the test. |
data.name |
A character string giving the name of the data. |
Zeynel Cebeci
Zhang, J. & Wu, Y. (2005). Likelihood-ratio tests for normality. Computational Statistics & Data Analysis, 49(3), 709-721. doi:10.1016/j.csda.2004.05.034.
# Normal data set.seed(123) x <- rnorm(50) resx <- zatest(x, nsim=100) resx$statistic # Test statistic resx$p.value # Log-normal data (non-normal) set.seed(123) y <- rlnorm(50, meanlog = 0, sdlog = 1) resy <- zatest(y, nsim=100) resy$statistic resy$p.value # Exponential data (non-normal) set.seed(123) w <- rexp(50) resw <- zatest(w, nsim=100, ncores=1) resw$p.value # Parallel execution using multiple CPU cores ## Not run: z <- rt(100, 5,2) cores <- parallel::detectCores()-1 resz <- zatest(z, nsim = 10000, ncores = cores) resz$statistic resz$p.value ## End(Not run)# Normal data set.seed(123) x <- rnorm(50) resx <- zatest(x, nsim=100) resx$statistic # Test statistic resx$p.value # Log-normal data (non-normal) set.seed(123) y <- rlnorm(50, meanlog = 0, sdlog = 1) resy <- zatest(y, nsim=100) resy$statistic resy$p.value # Exponential data (non-normal) set.seed(123) w <- rexp(50) resw <- zatest(w, nsim=100, ncores=1) resw$p.value # Parallel execution using multiple CPU cores ## Not run: z <- rt(100, 5,2) cores <- parallel::detectCores()-1 resz <- zatest(z, nsim = 10000, ncores = cores) resz$statistic resz$p.value ## End(Not run)
Performs the Zhang-Wu ZC goodness-of-fit test for assessing normality. The test is based on a likelihood-ratio type statistic proposed by Zhang (2002)<10.1111/1467-9868.00337> and further discussed by Zhang and Wu (2005)<10.1016/j.csda.2004.05.034>. The p-value is obtained using a Monte Carlo procedure with parameter re-estimation.
zctest(x, nsim = 10000, eps = 1e-10, ncores = 1, seed = NULL)zctest(x, nsim = 10000, eps = 1e-10, ncores = 1, seed = NULL)
x |
A numeric vector of observations. Missing and non-finite values are removed prior to computation. |
nsim |
Number of Monte Carlo simulations used to approximate the null distribution of the test statistic. Larger values yield more accurate p-values at the expense of increased computation time. |
eps |
A small positive constant used to truncate normal probabilities away from 0 and 1 to ensure numerical stability in logarithmic computations. |
ncores |
Number of CPU cores to be used for parallel Monte Carlo simulation.
The default value |
seed |
Optional integer value used to set the random number generator seed
for reproducibility of the Monte Carlo simulations.
If |
Let denote the observed data.
The data are standardized using the sample mean and standard deviation,
and the ordered standardized values are transformed to
normal scores , where denotes the
standard normal distribution function.
The ZC test statistic is defined as
Because the finite-sample null distribution of the statistic is not
available in closed form, the p-value is computed using a Monte Carlo
procedure. In each simulation, a normal sample of size is
generated, standardized using its own sample mean and standard
deviation, and the ZC statistic is recomputed.
The Monte Carlo p-value is computed using the unbiased estimator
where is the observed test statistic and
are the simulated statistics.
To ensure numerical stability, probabilities are truncated to lie in
the interval .
An object of class "htest" with the following components:
statistic |
The value of the ZC test statistic. |
p.value |
Monte Carlo p-value for the test. |
method |
A character string describing the test. |
data.name |
A character string giving the name of the data. |
Zeynel Cebeci
Zhang, J. (2002). Powerful goodness-of-fit tests based on the likelihood ratio. Journal of the Royal Statistical Society: Series B, 64, 281-294. doi:10.1111/1467-9868.00337.
Zhang, J. & Wu, Y. (2005). Likelihood-ratio tests for normality. Computational Statistics & Data Analysis, 49, 709-721. doi:10.1016/j.csda.2004.05.034.
# Normal data set.seed(123) x <- rnorm(50) resx <- zctest(x, nsim=100) resx$statistic # Test statistic resx$p.value # Log-normal data (non-normal) set.seed(123) y <- rlnorm(50, meanlog = 0, sdlog = 1) resy <- zctest(y, nsim=100) resy$statistic resy$p.value # Exponential data (non-normal) set.seed(123) w <- rexp(50) resw <- zctest(w, nsim=100, ncores=1) resw$p.value # Parallel execution using multiple CPU cores ## Not run: z <- rt(100, 5,2) cores <- parallel::detectCores()-1 resz <- zctest(z, nsim = 10000, ncores = cores) resz$statistic resz$p.value ## End(Not run)# Normal data set.seed(123) x <- rnorm(50) resx <- zctest(x, nsim=100) resx$statistic # Test statistic resx$p.value # Log-normal data (non-normal) set.seed(123) y <- rlnorm(50, meanlog = 0, sdlog = 1) resy <- zctest(y, nsim=100) resy$statistic resy$p.value # Exponential data (non-normal) set.seed(123) w <- rexp(50) resw <- zctest(w, nsim=100, ncores=1) resw$p.value # Parallel execution using multiple CPU cores ## Not run: z <- rt(100, 5,2) cores <- parallel::detectCores()-1 resz <- zctest(z, nsim = 10000, ncores = cores) resz$statistic resz$p.value ## End(Not run)