Package com.tdunning.math.stats
Class MergingDigest
java.lang.Object
com.tdunning.math.stats.TDigest
com.tdunning.math.stats.AbstractTDigest
com.tdunning.math.stats.MergingDigest
- All Implemented Interfaces:
Serializable
Maintains a t-digest by collecting new points in a buffer that is then sorted occasionally and merged
into a sorted array that contains previously computed centroids.
This can be very fast because the cost of sorting and merging is amortized over several insertions. If we keep N centroids total and the input buffer is k entries long, then the amortized cost per insertion is something like
N/k + log k
These costs even out when N/k = log k (for compression 100, for instance, N ≈ 157, k ≈ 42, and 157/42 ≈ ln 42 ≈ 3.7). Balancing costs is often a good place to start in optimizing an algorithm. For different values of the compression factor, the following table shows estimated asymptotic values of N and suggested values of k:
Compression | N | k |
50 | 78 | 25 |
100 | 157 | 42 |
200 | 314 | 73 |
The virtues of this kind of t-digest implementation include:
- No allocation is required after initialization
- The data structure automatically compresses existing centroids when possible
- No Java object overhead is incurred for centroids since data is kept in primitive arrays
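As a minimal usage sketch (the class name, sample data, and compression value here are illustrative; the calls themselves are the public TDigest API documented below):

import com.tdunning.math.stats.MergingDigest;
import java.util.Random;

public class MergingDigestExample {
    public static void main(String[] args) {
        // Compression of 100 keeps roughly 157 centroids (see the table above).
        MergingDigest digest = new MergingDigest(100);

        // New points land in the temporary buffer and are merged into the
        // main centroid array only occasionally.
        Random rand = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(rand.nextGaussian());
        }

        System.out.println("median  = " + digest.quantile(0.5));
        System.out.println("p99     = " + digest.quantile(0.99));
        System.out.println("cdf(0)  = " + digest.cdf(0.0));
        System.out.println("samples = " + digest.size());
    }
}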
Field Summary
Modifier and Type | Field |
private final double | compression |
private int | lastUsedCell |
private final double[] | mean |
private final int[] | order |
private final double[] | tempMean |
private int | tempUsed |
private final double[] | tempWeight |
private double | totalWeight |
private double | unmergedWeight |
private static boolean | usePieceWiseApproximation |
private static boolean | useWeightLimit |
private final double[] | weight |
Fields inherited from class com.tdunning.math.stats.AbstractTDigest
gen, recordAllData
-
Constructor Summary
Constructor | Description |
MergingDigest(double compression) | Allocates a buffer merging t-digest. |
MergingDigest(double compression, int bufferSize) | If you know the size of the temporary buffer for incoming points, you can use this entry point. |
MergingDigest(double compression, int bufferSize, int size) | Fully specified constructor. |
-
Method Summary
Modifier and Type | Method | Description |
private void | add | |
void | add(double x, int w) | Adds a sample to a histogram. |
(package private) void | add | |
private void | add | |
void | add | |
void | asBytes(ByteBuffer buf) | Serialize this TDigest into a byte buffer. |
(package private) static double | asinApproximation(double x) | |
void | asSmallBytes(ByteBuffer buf) | Serialize this TDigest into a byte buffer. |
private static double | bound(double v) | |
int | byteSize() | Returns the number of bytes required to encode this TDigest using #asBytes(). |
double | cdf(double x) | Returns the fraction of all points added which are <= x. |
int | centroidCount() | |
Collection<Centroid> | centroids() | A Collection that lets you go through the centroids in ascending order by mean. |
(package private) int | checkWeights() | Exposed for testing. |
private int | checkWeights(double[] w, double total, int last) | |
void | compress() | Re-examines a t-digest to determine whether some centroids are redundant. |
double | compression() | Returns the current compression factor. |
private static double | eval(double[] model, double[] vars) | |
static MergingDigest | fromBytes(ByteBuffer buf) | |
private double | integratedLocation(double q) | Converts a quantile into a centroid scale value. |
private double | integratedQ(double k) | |
private void | merge(double[] incomingMean, double[] incomingWeight, int incomingCount, List<List<Double>> incomingData, int[] incomingOrder, double unmergedWeight) | |
private void | mergeNewValues() | |
double | quantile(double q) | Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff. |
TDigest | recordAllData() | Turns on internal data recording. |
long | size() | Returns the number of points that have been added to this TDigest. |
int | smallByteSize() | Returns the number of bytes required to encode this TDigest using #asSmallBytes(). |
Methods inherited from class com.tdunning.math.stats.AbstractTDigest
add, add, createCentroid, decode, encode, interpolate, isRecording, quantile, weightedAverage
Methods inherited from class com.tdunning.math.stats.TDigest
checkValue, createAvlTreeDigest, createDigest, createMergingDigest, getMax, getMin, setMinMax
-
Field Details
-
compression
private final double compression -
lastUsedCell
private int lastUsedCell -
totalWeight
private double totalWeight -
weight
private final double[] weight -
mean
private final double[] mean -
data
-
unmergedWeight
private double unmergedWeight -
tempUsed
private int tempUsed -
tempWeight
private final double[] tempWeight -
tempMean
private final double[] tempMean -
tempData
-
order
private final int[] order -
usePieceWiseApproximation
private static boolean usePieceWiseApproximation -
useWeightLimit
private static boolean useWeightLimit
-
-
Constructor Details
-
MergingDigest
public MergingDigest(double compression)
Allocates a buffer merging t-digest. This is the normally used constructor that allocates default-sized internal arrays. Other versions are available, but should only be used for special cases.
- Parameters:
compression
- The compression factor
-
MergingDigest
public MergingDigest(double compression, int bufferSize)
If you know the size of the temporary buffer for incoming points, you can use this entry point.
- Parameters:
compression
- Compression factor for t-digest. Same as 1/δ in the paper.
bufferSize
- How many samples to retain before merging.
-
MergingDigest
public MergingDigest(double compression, int bufferSize, int size)
Fully specified constructor. Normally only used for deserializing a buffered t-digest.
- Parameters:
compression
- Compression factor
bufferSize
- Number of temporary centroids
size
- Size of main buffer
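For example, a buffer-size sketch (the value 500 is an arbitrary illustration, not a tuned recommendation): a larger temporary buffer means merges into the main centroid array happen less often, at the cost of more temporary memory.

MergingDigest digest = new MergingDigest(100, 500);  // compression 100, buffer of 500 points
digest.add(3.14);                                    // buffered now, merged later in bulk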
-
-
Method Details
-
recordAllData
public TDigest recordAllData()
Turns on internal data recording.
- Overrides:
recordAllData in class AbstractTDigest
- Returns:
- This TDigest so that configurations can be done in fluent style.
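A small sketch of that fluent style (assumes an import of com.tdunning.math.stats.TDigest; recording retains every sample, so it is only appropriate for tests and debugging):

// recordAllData() returns the digest itself, so configuration chains onto construction.
TDigest digest = new MergingDigest(100).recordAllData();
digest.add(1.0);
digest.add(2.0);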
-
add
- Specified by:
add
in class AbstractTDigest
-
add
public void add(double x, int w)
Description copied from class: TDigest
Adds a sample to a histogram.
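A brief sketch of weighted insertion (the values are illustrative, e.g. pre-aggregated counts):

MergingDigest digest = new MergingDigest(100);
// A sample added with weight w counts as w identical observations of x.
digest.add(10.0, 3);                  // three observations of 10.0
digest.add(25.0, 1);                  // one observation of 25.0
System.out.println(digest.size());    // prints 4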
add
-
add
-
add
-
mergeNewValues
private void mergeNewValues() -
merge
-
checkWeights
int checkWeights()
Exposed for testing.
-
checkWeights
private int checkWeights(double[] w, double total, int last) -
integratedLocation
private double integratedLocation(double q)
Converts a quantile into a centroid scale value. The centroid scale is nominally the number k of the centroid that a quantile point q should belong to. Due to round-offs, however, we can't align things perfectly without splitting points and centroids. We don't want to do that, so we have to allow for offsets. In the end, the criterion is that any quantile range that spans a centroid scale range of more than one should be split across more than one centroid if possible. This won't be possible if the quantile range refers to a single point or an already existing centroid. This mapping is steep near q=0 or q=1 so each centroid there will correspond to less q range. Near q=0.5, the mapping is flatter so that centroids there will represent a larger chunk of quantiles.
- Parameters:
q
- The quantile scale value to be mapped.
- Returns:
- The centroid scale value corresponding to q.
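As an illustration, here is a sketch of an arcsine-based scale function with the shape described above (a rescaled version of the k_1 mapping from the t-digest paper, chosen so that k(0) = 0 and k(1) = compression; it is not guaranteed to match this class's internal constants exactly):

// Steep near q = 0 and q = 1, flat near q = 0.5: extreme quantiles map to
// many small centroids, the middle to fewer large ones.
// k(0) = 0, k(0.5) = compression / 2, k(1) = compression.
static double scaleFunctionSketch(double compression, double q) {
    return compression * (Math.asin(2 * q - 1) + Math.PI / 2) / Math.PI;
}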
-
integratedQ
private double integratedQ(double k) -
asinApproximation
static double asinApproximation(double x) -
eval
private static double eval(double[] model, double[] vars) -
bound
private static double bound(double v) -
compress
public void compress()
Description copied from class: TDigest
Re-examines a t-digest to determine whether some centroids are redundant. If your data are perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space. The cost is roughly the same as adding as many data points as there are centroids. This is typically < 10 * compression, but could be as high as 100 * compression. This is a destructive operation that is not thread-safe.
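For instance, one possible (illustrative) pattern is to compress once after a bulk load, before querying or serializing:

MergingDigest digest = new MergingDigest(100);
for (int i = 0; i < 50_000; i++) {
    digest.add(Math.random());
}
// Merge any buffered points and combine redundant centroids.
// Destructive and not thread-safe: call only while no other thread is adding points.
digest.compress();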
size
public long size()
Description copied from class: TDigest
Returns the number of points that have been added to this TDigest.
-
cdf
public double cdf(double x)
Description copied from class: TDigest
Returns the fraction of all points added which are <= x.
-
quantile
public double quantile(double q)
Description copied from class: TDigest
Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
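A short sketch showing that quantile and cdf are approximately inverse to each other (values are illustrative; accuracy depends on the compression factor):

MergingDigest digest = new MergingDigest(100);
for (int i = 1; i <= 1000; i++) {
    digest.add(i);                    // roughly uniform data 1..1000
}
double p95 = digest.quantile(0.95);   // approximately 950
double back = digest.cdf(p95);        // approximately 0.95
System.out.println(p95 + " " + back);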
centroidCount
public int centroidCount()
- Specified by:
centroidCount in class TDigest
-
centroids
public Collection<Centroid> centroids()
Description copied from class: TDigest
A Collection that lets you go through the centroids in ascending order by mean. Centroids returned will not be re-used, but may or may not share storage with this TDigest.
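A small sketch of walking the centroids (assumes an import of com.tdunning.math.stats.Centroid, whose mean() and count() accessors are used below):

MergingDigest digest = new MergingDigest(100);
for (int i = 0; i < 10_000; i++) {
    digest.add(Math.random());
}
// Centroids come back in ascending order by mean.
for (Centroid c : digest.centroids()) {
    System.out.printf("mean=%.4f count=%d%n", c.mean(), c.count());
}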
compression
public double compression()
Description copied from class: TDigest
Returns the current compression factor.
- Specified by:
compression in class TDigest
- Returns:
- The compression factor originally used to set up the TDigest.
-
byteSize
public int byteSize()
Description copied from class: TDigest
Returns the number of bytes required to encode this TDigest using #asBytes().
-
smallByteSize
public int smallByteSize()
Description copied from class: TDigest
Returns the number of bytes required to encode this TDigest using #asSmallBytes(). Note that this is just as expensive as actually compressing the digest. If you don't care about time, but want to never over-allocate, this is fine. If you care about compression and speed, you pretty much just have to over-allocate by using #byteSize() bytes.
- Specified by:
smallByteSize in class TDigest
- Returns:
- The number of bytes required.
-
asBytes
public void asBytes(ByteBuffer buf)
Description copied from class: TDigest
Serialize this TDigest into a byte buffer. Note that the serialization used is very straightforward and is considerably larger than strictly necessary.
-
asSmallBytes
public void asSmallBytes(ByteBuffer buf)
Description copied from class: TDigest
Serialize this TDigest into a byte buffer. Some simple compression is used, such as using variable-byte representation to store the centroid weights and using delta-encoding on the centroid means so that floats can reasonably be used to store the centroid means.
- Specified by:
asSmallBytes in class TDigest
- Parameters:
buf
- The byte buffer into which the TDigest should be serialized.
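A sketch of the calling pattern for the compact encoding (assumes an import of java.nio.ByteBuffer; the data is illustrative):

MergingDigest digest = new MergingDigest(100);
for (int i = 0; i < 100_000; i++) {
    digest.add(Math.random());
}
// smallByteSize() compresses the digest internally, so it is exact but not free.
ByteBuffer small = ByteBuffer.allocate(digest.smallByteSize());
digest.asSmallBytes(small);
System.out.println("compact encoding: " + small.position() + " bytes");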
-
fromBytes
public static MergingDigest fromBytes(ByteBuffer buf)
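And a round-trip sketch using the verbose encoding (assumes an import of java.nio.ByteBuffer and that fromBytes reads a buffer previously written by asBytes):

MergingDigest original = new MergingDigest(100);
for (int i = 0; i < 100_000; i++) {
    original.add(Math.random());
}
ByteBuffer buf = ByteBuffer.allocate(original.byteSize());  // enough room for the verbose encoding
original.asBytes(buf);

buf.flip();                                   // switch the buffer from writing to reading
MergingDigest restored = MergingDigest.fromBytes(buf);
System.out.println(original.quantile(0.5) + " ~ " + restored.quantile(0.5));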