Module dstats.summary

Summary statistics such as mean, median, sum, variance, skewness, kurtosis. Except for median and median absolute deviation, which cannot be calculated online, all summary statistics have both an input range interface and an output range interface.

Notes

The put method on the structs defined in this module returns this by ref. The use case for returning this is to enable these structs to be used with std.algorithm.reduce. The rationale for returning by ref is that the return value usually won't be used, and the overhead of returning a large struct by value should be avoided.
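
As a rough illustration of this convention, the sketch below defines a small accumulator whose put returns this by ref and then folds it over an array with std.algorithm.reduce. The struct name OnlineMean and its fields are hypothetical stand-ins chosen for this example; they are not the dstats implementation.

```d
import std.algorithm : reduce;
import std.stdio : writeln;

// Hypothetical stand-in for the module's output-range structs; the name
// OnlineMean and its fields are illustrative, not part of dstats.
struct OnlineMean
{
    private double total = 0;
    private size_t n = 0;

    // Returns this by ref, so callers that ignore the result pay no copy.
    ref OnlineMean put(double x) return
    {
        total += x;
        ++n;
        return this;
    }

    // One branch guards against dividing by zero.
    double mean() const
    {
        return n == 0 ? double.nan : total / n;
    }
}

void main()
{
    double[] data = [1.0, 2.0, 3.0, 4.0];

    // Because put returns the accumulator, it can serve directly as the
    // binary function that reduce folds over the data.
    auto acc = reduce!((a, x) => a.put(x))(OnlineMean(), data);
    writeln(acc.mean());

    // Used as a plain output range, the return value is simply ignored.
    OnlineMean m;
    foreach (x; data)
        m.put(x);
    assert(m.mean() == acc.mean());
}
```

In the reduce call the lambda returns by value, so reduce works on copies of the accumulator; the by-ref return matters for the plain put loop, where nothing is copied.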

Bugs

This whole module assumes that input will be doubles or types implicitly convertible to double. No allowances are made for user-defined numeric types such as BigInts. This is necessary for simplicity. However, if you have a function that converts your data to doubles, most of these functions work with any input range, so you can simply map this function onto your range.
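
The workaround described above might look like the following sketch. The Cents type, the toDouble function, and plainMean are all hypothetical names invented for this example; plainMean merely stands in for any function in this module that accepts an input range of values convertible to double.

```d
import std.algorithm : map;
import std.stdio : writeln;

// Hypothetical user-defined numeric type: a fixed-point amount in cents.
// It does not convert implicitly to double.
struct Cents
{
    long raw;
}

// Hypothetical conversion function from the user type to double.
double toDouble(Cents c)
{
    return c.raw / 100.0;
}

// Stand-in for a function from this module: it only requires an input
// range whose elements are implicitly convertible to double.
double plainMean(R)(R data)
{
    double total = 0;
    size_t n = 0;
    foreach (double x; data)
    {
        total += x;
        ++n;
    }
    return n == 0 ? double.nan : total / n;
}

void main()
{
    auto prices = [Cents(150), Cents(250), Cents(350)];

    // map is lazy, so no intermediate array of doubles is allocated.
    auto asDoubles = prices.map!toDouble;
    writeln(plainMean(asDoubles));
}
```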

Author

David Simcha

Functions

Name Description
geometricMean(data) Calculates the geometric mean of any input range that has elements implicitly convertible to double
interquantileRange(data, quantile) Computes the interquantile range of data at the given quantile value in O(N) time. For example, a quantile value of either 0.25 or 0.75 gives the interquartile range. (This is the default, since the interquartile range is apparently the most commonly used interquantile range.) A quantile value of 0.2 or 0.8 gives the interquintile range.
kurtosis(data) Excess kurtosis relative to the normal distribution. High kurtosis means that the variance is due to infrequent, large deviations from the mean. Low kurtosis means that the variance is due to frequent, small deviations from the mean. The normal distribution is defined as having kurtosis of 0. Input must be an input range with elements implicitly convertible to double.
mean(data) Finds the arithmetic mean of any input range whose elements are implicitly convertible to double.
meanStdev(data) Puts all elements of data into a MeanSD struct, then returns this struct. This can be faster than doing so manually because of instruction-level parallelism (ILP) optimizations.
median(data) Finds the median of an input range in O(N) time on average. In the case of an even number of elements, the mean of the two middle elements is returned. This is a convenience function designed specifically for numeric types, where averaging the two middle elements is desired. A more general selection algorithm that can handle any type with a total ordering, as well as select any position in the ordering, can be found at dstats.sort.quickSelect() and dstats.sort.partitionK(). Allocates memory and does not reorder input data.
medianAbsDev(data) Calculates the median absolute deviation of a dataset. This is the median of all absolute differences from the median of the dataset.
medianPartition(data) Median finding as in median(), but partitions the input data such that elements less than the median have smaller indices than the median, and elements larger than the median have larger indices than the median. Useful both for its partitioning and to avoid memory allocations. Requires a random access range with swappable elements.
skewness(data) Skewness is a measure of symmetry of a distribution. Positive skewness means that the right tail is longer/fatter than the left tail. Negative skewness means the left tail is longer/fatter than the right tail. Zero skewness indicates a symmetrical distribution. Input must be an input range with elements implicitly convertible to double.
stdev(data) Calculates the standard deviation of an input range with members implicitly convertible to double.
sum(data) Finds the sum of an input range whose elements implicitly convert to double. The return type U defaults to T, the input element type. The user can specify a different U to prevent overflow when summing large arrays.
summary(data) Convenience function. Puts all elements of data into a Summary struct, and returns this struct.
variance(data) Finds the variance of an input range with members implicitly convertible to double.
zScore(range) Returns a range with whatever properties T has (forward range, random access range, bidirectional range, hasLength, etc.), of the z-scores of the underlying range. A z-score of an element in a range is defined as (element - mean(range)) / stdev(range).
zScore(range, mean, sd) Allows the construction of a ZScore range with precomputed mean and stdev.
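
To make a few of the definitions above concrete, here is a minimal sketch of the median, the median absolute deviation, and z-scores from a precomputed mean and standard deviation. The function names sortedMedian, sortedMedianAbsDev, and lazyZScore are hypothetical, and the median here sorts a copy in O(N log N) rather than using the O(N) average-case selection the module describes.

```d
import std.algorithm : equal, map, sort;
import std.array : array;
import std.stdio : writeln;

// Median via a sorted copy. Like median(), this allocates and leaves the
// input untouched, but it is O(N log N) rather than O(N) on average.
double sortedMedian(const(double)[] data)
{
    assert(data.length > 0, "median of empty data is undefined here");
    auto copy = data.dup;
    sort(copy);
    immutable mid = copy.length / 2;
    // Even count: average the two middle elements.
    return copy.length % 2 == 0 ? (copy[mid - 1] + copy[mid]) / 2 : copy[mid];
}

// Median of all absolute differences from the median of the dataset.
double sortedMedianAbsDev(const(double)[] data)
{
    immutable med = sortedMedian(data);
    return sortedMedian(data.map!(x => x < med ? med - x : x - med).array);
}

// Lazy z-scores, (element - mean) / sd, from precomputed mean and sd.
auto lazyZScore(const(double)[] data, double mean, double sd)
{
    return data.map!(x => (x - mean) / sd);
}

void main()
{
    double[] data = [1.0, 2.0, 3.0, 4.0, 100.0];
    writeln(sortedMedian(data));
    writeln(sortedMedianAbsDev(data));
    writeln(lazyZScore([1.0, 3.0], 2.0, 0.5));

    assert(sortedMedian(data) == 3.0);
    assert(sortedMedianAbsDev(data) == 1.0);
    assert(lazyZScore([1.0, 3.0], 2.0, 0.5).equal([-2.0, 2.0]));
}
```

The outlier 100.0 barely moves either the median or the median absolute deviation, which is the usual reason for preferring them to the mean and standard deviation on contaminated data.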

Structs

Name Description
GeometricMean Output range to calculate the geometric mean online. Operates similarly to dstats.summary.Mean.
Mean Output range to calculate the mean online. Getter for mean costs a branch to check for N == 0. This struct uses O(1) space and does *NOT* store the individual elements.
MeanSD Output range to compute mean, stdev, variance online. Getter methods for stdev, var cost a few floating point ops. Getter for mean costs a single branch to check for N == 0. This struct uses relatively expensive floating point ops, so if you only need the mean, try Mean. This struct uses O(1) space and does *NOT* store the individual elements.
MedianAbsDev Plain old data holder struct for the median and the median absolute deviation. The median absolute deviation member is declared alias this, so the struct converts implicitly to that value.
Summary Output range to compute mean, stdev, variance, skewness, kurtosis, min, and max online. Using this struct is relatively expensive, so if you just need mean and/or stdev, try MeanSD or Mean. Getter methods for stdev, var cost a few floating point ops. Getter for mean costs a single branch to check for N == 0. Getters for skewness and kurtosis cost a whole bunch of floating point ops. This struct uses O(1) space and does *NOT* store the individual elements.
ZScore Range of z-scores of an underlying range, as constructed by zScore().
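
For readers curious how an output range can report mean, variance, and standard deviation in O(1) space without storing elements, the sketch below uses Welford's update, one well-known approach. It is not taken from the dstats source: the struct name OnlineMeanSD and its fields are hypothetical, and the N - 1 (sample) denominator is an assumption of this example.

```d
import std.math : sqrt;
import std.stdio : writeln;

// Hypothetical sketch using Welford's update; not the dstats source.
struct OnlineMeanSD
{
    private double m = 0;   // running mean
    private double m2 = 0;  // running sum of squared deviations from the mean
    private size_t n = 0;   // number of elements seen so far

    // O(1) work and O(1) state per element; nothing is stored.
    ref OnlineMeanSD put(double x) return
    {
        ++n;
        immutable delta = x - m;
        m += delta / n;
        m2 += delta * (x - m);
        return this;
    }

    // A single branch checks for N == 0.
    double mean() const { return n == 0 ? double.nan : m; }

    // Assumption: sample variance with an N - 1 denominator.
    double var() const { return n < 2 ? double.nan : m2 / (n - 1); }

    // A few floating point ops on top of var.
    double stdev() const { return sqrt(var()); }
}

void main()
{
    OnlineMeanSD acc;
    foreach (x; [1.0, 2.0, 3.0, 4.0, 5.0])
        acc.put(x);

    writeln(acc.mean(), " ", acc.var(), " ", acc.stdev());
    assert(acc.mean() == 3.0);
    assert(acc.var() == 2.5);
}
```

Keeping only the running mean and the sum of squared deviations avoids the cancellation problems of accumulating a raw sum of squares, at the price of one division per element.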