Module dstats.sort

A comprehensive sorting library for statistical functions. Each function takes N arguments, which are arrays or array-like objects, sorts the first and sorts the rest in lockstep. For merge and insertion sort, if the last argument is a ulong*, increments the dereference of this ulong* by the bubble sort distance between the first argument and the sorted version of the first argument. This is useful for some statistical calculations.

All sorting functions have the precondition that all parallel input arrays must have the same length.

Notes

Comparison functions must be written such that compFun(x, x) == false. For example, "a < b" is good, "a <= b" is not.

These functions are heavily optimized for sorting arrays of ints and floats (by far the most common case when doing statistical calculations). In these cases, they can be several times faster than the equivalent functions in std.algorithm. Since sorting is extremely important for non-parametric statistics, this results in important real-world performance gains. However, it comes at a price in terms of generality:

1. They assume that what they are sorting is cheap to copy via normal assignment.

2. They don't work at all with general ranges, only arrays and maybe ranges very similar to arrays.

3. All tuning and micro-optimization is done with ints and floats, not classes, large structs, strings, etc.

Examples

auto foo = [3, 1, 2, 4, 5].dup;
auto bar = [8, 6, 7, 5, 3].dup;
qsort(foo, bar);
assert(foo == [1, 2, 3, 4, 5]);
assert(bar == [6, 7, 8, 5, 3]);
auto baz = [1.0, 0, -1, -2, -3].dup;
mergeSort!("a > b")(bar, foo, baz);
assert(bar == [8, 7, 6, 5, 3]);
assert(foo == [3, 2, 1, 4, 5]);
assert(baz == [-1.0, 0, 1, -2, -3]);

Author

David Simcha

Functions

Name Description
heapSort Heap sort. Unstable, O(N log N) time average and worst case, O(1) space, large constant term in time complexity.
heapSortImpl
insertionSort Insertion sort. O(N2) time worst, average case, O(1) space, VERY small constant, which is why it's useful for sorting small subarrays in divide and conquer algorithms. If last argument is a ulong*, increments the dereference of this argument by the bubble sort distance between the input array and the sorted version of the input.
insertionSortImpl
intIsNaN
makeMultiHeap
medianOf3
merge
mergeInPlace
mergeSort Merge sort. O(N log N) time, O(N) space, small constant. Stable sort. If last argument is a ulong* instead of an array-like type, the dereference of the ulong* will be incremented by the bubble sort distance between the input array and the sorted version. This is useful in some statistics functions such as Kendall's tau.
mergeSortImpl
mergeSortInPlace In-place merge sort, based on C++ STL's stable_sort(). O(N log2 N) time complexity, O(1) space complexity, stable. Much slower than plain old mergeSort(), so only use it if you really need the O(1) space.
mergeSortInPlaceImpl
mergeSortTemp Merge sort, allowing caller to provide a temp variable. This allows recycling instead of repeated allocations. If D is data, T is temp, and U is a ulong* for calculating bubble sort distance, this can be called as mergeSortTemp(D, D, D, T, T, T, U) or mergeSortTemp(D, D, D, T, T, T) where each D has a T of corresponding type.
multiSiftDown
partitionK Partitions the input data according to compFun, such that position k contains the kth largest/smallest element according to compFun. For all elements e with indices < k, !compFun(data[k], e) is guaranteed to be true. For all elements e with indices > k, !compFun(e, data[k]) is guaranteed to be true. For example, if compFun is "a < b", all elements with indices < k will be <= data[k], and all elements with indices larger than k will be >= k. Reorders any additional input arrays in lockstep.
partitionKImpl
postProcess
postProcess
prepareForSorting
qsort Quick sort. Unstable, O(N log N) time average, worst case, O(log N) space, small constant term in time complexity.
qsortImpl
quickSelect Returns the kth largest/smallest element (depending on compFun, 0-indexed) in the input array in O(N) time. Allocates memory, does not modify input array.
removeWhitespace
rotateLeft
rotateRight

Classes

Name Description
SortException

Structs

Name Description
TopN Given a set of data points entered through the put function, this output range maintains the invariant that the top N according to compFun will be contained in the data structure. Uses a heap internally, O(log N) insertion time. Good for finding the largest/smallest N elements of a very large dataset that cannot be sorted quickly in its entirety, and may not even fit in memory. If less than N datapoints have been entered, all are contained in the structure.

Templates

Name Description
isSimpleComparison

Enum values

Name Type Description
DATA
TEMP

Aliases

Name Type Description
ArrayElemType T

Authors

Copyright

License