NAME Statistics::CaseResampling - Efficient resampling and calculation of medians with confidence intervals SYNOPSIS use Statistics::CaseResampling ':all'; my \$sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ... my \$confidence = 0.95; # ~2*sigma or "90% within confidence limits" #my \$confidence = 0.37; # ~1*sigma or "~66% within confidence limits" # calculate the median of the sample with lower and upper confidence # limits using resampling/bootstrapping: my (\$lower_cl, \$median, \$upper_cl) = median_simple_confidence_limits(\$sample, \$confidence); # There are many auxiliary functions: my \$resampled = resample(\$sample); # \$resampled is now a random set of measurements from \$sample, # including potential duplicates my \$medians = resample_medians(\$sample, \$n_resamples); # \$medians is now an array reference containing the medians # of \$n_resamples resample runs # This is vastly more efficient that doing the same thing with # repeated resample() calls! # Analogously: my \$means = resample_means(\$sample, \$n_resamples); # you can get the cl's from a set of separately resampled medians, too: my (\$lower_cl, \$median, \$upper_cl) = simple_confidence_limits_from_samples(\$median, \$medians, \$confidence); # utility functions: print median([1..5]), "\n"; # prints 3 print mean([1..5]), "\n"; # prints 3, too, surprise! print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum DESCRIPTION The purpose of this (XS) module is to calculate the median (or in principle also other statistics) with confidence intervals on a sample. To do that, it uses a technique called bootstrapping. In a nutshell, it resamples the sample a lot of times and for each resample, it calculates the median. From the distribution of medians, it then calculates the confidence limits. In order to implement the confidence limit calculation, various other functions had to be implemented efficiently (both algorithmically efficient and done in C). These functions may be useful in their own right and are thus exposed to Perl. Most notably, this exposes a median (and general selection) algorithm that works in linear time as opposed to the trivial implementation that requires "O(n*log(n))". Random numbers The resampling involves drawing many random numbers. Therefore, the module comes with an embedded Mersenne twister (taken from "Math::Random::MT"). If you want to change the seed of the RNG, do this: \$Statistics::CaseResampling::Rnd = Statistics::CaseResampling::RdGen::setup(\$seed); or \$Statistics::CaseResampling::Rnd = Statistics::CaseResampling::RdGen::setup(@seed); Do not use the embedded random number generator for other purposes. Use "Math::Random::MT" instead! At this point, you cannot change the type of RNG. EXPORT None by default. Can export any of the functions that are documented below using standard "Exporter" semantics, including the customary ":all" group. FUNCTIONS This list of functions is loosely sorted from *basic* to *comprehensive* because the more complicated functions are usually (under the hood, in C) implemented using the basic building blocks. Unfortunately, that means you may want to read the documentation backwards :) All of these functions are written in C for speed. approx_erf(\$x) Calculates an approximatation of the error function of *x*. Implemented after Winitzki, Sergei (6 February 2008). "A handy approximation for the error function and its inverse" (PDF). http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf Quoting: Relative precision better than 1.3e-4. approx_erf_inv(\$x) Calculates an approximation of the inverse of the error function of *x*. Algorithm from the same source as "approx_erf". Quoting: Relative precision better than 2e-3. nsigma_to_alpha(\$nsigma) Calculates the probability that a measurement from a normal distribution is further away from the mean than \$nsigma standard deviations. The confidence level (what you pass as the "CONFIDENCE" parameter to some functions in this module) is "1 - nsigma_to_alpha(\$nsigma)". alpha_to_nsigma(\$alpha) Inverse of "nsigma_to_alpha()". mean(ARRAYREF) Calculates the mean of a sample. median(ARRAYREF) Calculates the median (second quartile) of a sample. Works in linear time thanks to using a selection instead of a sort. Unfortunately, the way this is implemented, the median of an even number of parameters is, here, defined as the "n/2-1"th largest number and not the average of the "n/2-1"th and the "n/2"th number. This shouldn't matter for nontrivial sample sizes! median_absolute_deviation(ARRAYREF) Calculates the median absolute deviation (MAD) in what I believe is O(n). Take care to rescale the MAD before using it in place of a standard deviation. first_quartile(ARRAYREF) Calculates the first quartile of the sample. third_quartile(ARRAYREF) Calculates the third quartile of the sample. select_kth(ARRAYREF, K) Selects the kth smallest element from the sample. This is the general function that implements the median/quartile calculation. You get the median with this definition of K: my \$k = int(@\$sample/2) + (@\$sample & 1); my \$median = select_kth(\$sample, \$k); resample(ARRAYREF) Returns a reference to an array containing N random elements from the input array, where N is the length of the original array. This is different from shuffling in that it's random drawing without removing the drawn elements from the sample. resample_medians(ARRAYREF, NMEDIANS) Returns a reference to an array containing the medians of "NMEDIANS" resamples of the original input sample. resample_means(ARRAYREF, NMEANS) Returns a reference to an array containing the means of "NMEANS" resamples of the original input sample. simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES, CONFIDENCE) Calculates the confidence limits for *STATISTIC*. Returns the lower confidence limit, the statistic, and the upper confidence limit. *STATISTIC_SAMPLES* is the output of, for example, "resample_means(\@sample)". *CONFIDENCE* indicates the fraction of data within the confidence limits (assuming a normal, symmetric distribution of the statistic => *simple confidence limits*). For example, to get the 90% confidence (~2 sigma) interval for the mean of your sample, you can do the following: my \$sample = []; my \$means = resample_means(\$sample, \$n_resamples); my (\$lower_cl, \$mean, \$upper_cl) = simple_confidence_limits_from_samples(mean(\$sample), \$means, 0.90); If you want to apply this logic to other statistics such as the first or third quartile, you have to manually resample and lose the advantage of doing it in C: my \$sample = []; my \$quartiles = []; foreach (1..1000) { push @\$quartiles, first_quartile(resample(\$sample)); } my (\$lower_cl, \$firstq, \$upper_cl) = simple_confidence_limits_from_samples( first_quartile(\$sample), \$quartiles, 0.90 ); For a reliable calculation of the confidence limits, you should use at least 1000 samples if not more. Yes. This whole procedure is expensive. For medians, however, use the following function: median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES]) Calculates the confidence limits for the "CONFIDENCE" level for the median of *SAMPLE*. Returns the lower confidence limit, the median, and the upper confidence limit. In order to calculate the limits, a lot of resampling has to be done. *NSAMPLES* defaults to 1000. Try running this a couple of times on your data interactively to see how the limits still vary a little bit at this setting. TODO One could calculate more statistics in C for performance. SEE ALSO Math::Random::MT On the approximation of the error function: Winitzki, Sergei (6 February 2008). "A handy approximation for the error function and its inverse" (PDF). http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf AUTHOR Steffen Mueller, COPYRIGHT AND LICENSE Copyright (C) 2010, 2011, 2012 by Steffen Mueller This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.0 or, at your option, any later version of Perl 5 you may have available.