Good evening! This evening, a note on high-dynamic-range (HDR) histograms.

### problem

How should one record garbage collector pause times?

A few options present themselves: you could just record the total pause time. Or, also record the total number of collections, which allows you to compute the average. Maximum is easy enough, too. But then you might also want the median or the p90 or the p99, and these percentile values are more gnarly: you either need to record all the pause times, which can itself become a memory leak, or you need to approximate via a histogram.

Let’s assume that you decide on the histogram approach. How should you compute the bins? It would be nice to have microsecond accuracy on the small end, but if you bin by microsecond you could end up having millions of bins, which is not what you want. On the other end you might have multi-second GC pauses, and you definitely want to be able to record those values.

Consider, though, that it’s not so important to have microsecond
precision for a 2-second pause. This points in a direction of wanting
bins that are *relatively* close to each other, but whose *absolute*
separation can vary depending on whether we are measuring microseconds
or milliseconds. You want approximately uniform precision over a high
dynamic range.

### logarithmic binning

The basic observation is that you should be able to make a histogram that gives you, say, 3 significant figures on measured values. Such a histogram would count anything between 1230 and 1240 in the same bin, and similarly for 12300 and 12400. The gap between bins increases as the number of digits grows.

Of course computers prefer base-2 numbers over base-10, so let’s do
that. Say we are measuring nanoseconds, and the maximum number of
seconds we expect is 100 or so. There are about 2^{30}
nanoseconds in a second, and 100 is a little less than 2^{7}, so
that gives us a range of 37 bits. Let’s say we want a precision of 4
significant base-2 digits, or 4 bits; then we will have one set of
2^{4} bins for 10-bit values, another for 11-bit values, and
so-on, for a total of 37 × 2^{4} bins, or 592 bins. If we use a
32-bit integer count per bin, such a histogram would be 2.5kB or so,
which I think is acceptable.

Say you go to compute the bin for a value. Firstly, note that there are
some values that do not have 4 significant bits: if you record a
measurement of 1 nanosecond, presumably that is just 1 significant
figure. These are like the
denormals in
floating-point numbers. Let’s just say that recording a value *val* in
[0, 2^{4}-1] goes to bin *val*.

If *val* is 2^{4} or more, then we compute the *major* and
*minor* components. The major component is the number of bits needed to
represent *val*, minus the 4 precision bits. We can define it like this
in C, assuming that *val* is a 64-bit value:

#define max_value_bits 37 #define precision 4 uint64_t major = 64ULL - __builtin_clzl(val) - precision;

The `64 - __builtin_clzl(val)` gives us the ceiling of the base-2
logarithm of the value. And actually, to take into account the denormal
case, we do this instead:

uint64_t major = val < (1ULL << precision) ? 0ULL : 64ULL - __builtin_clzl(val) - precision;

Then to get the minor component, we right-shift *val* by *major* bits,
unless it is a denormal, in which case the minor component is the value
itself:

uint64_t minor = val < (1 << precision) ? val : (val >> (major - 1ULL)) & ((1ULL << precision) - 1ULL);

Then the histogram bucket for a given value can be computed directly:

uint64_t idx = (major << precision) | minor;

Of course, we would prefer to bound our storage, hence the consideration about 37 total bits in 100 seconds of nanoseconds. So let’s do that, and record any out-of-bounds value in the special last bucket, indicating that we need to expand our histogram next time:

if (idx >= (max_value_bits << precision)) idx = max_value_bits << precision;

The histogram itself can then be defined simply as having enough buckets for all major and minor components in range, plus one for overflow:

struct histogram { uint32_t buckets[(max_value_bits << precision) + 1]; };

Getting back the lower bound for a bucket is similarly simple, again with a case for denormals:

uint64_t major = idx >> precision; uint64_t minor = idx & ((1ULL << precision) - 1ULL); uint64_t min_val = major ? ((1ULL << precision) | minor) << (major - 1ULL) : minor;

### y u no library

How many lines of code does something need to be before you will include it as a library instead of re-implementing? If I am honest with myself, there is no such limit, as far as code goes at least: only a limit of time. I am happy to re-implement anything; time is my only enemy. But strategically speaking, time too is the fulcrum: if I can save time by re-implementing over integrating a library, I will certainly hack.

The canonical library in this domain is HdrHistogram. But even the C port is thousands of lines of code! A histogram should not take that much code! Hence this blog post today. I think what we have above is sufficient. HdrHistogram’s documentation speaks in terms of base-10 digits of precision, but that is easily translated to base-2 digits: if you want 0.1% precision, then in base-2 you’ll need 10 bits; no problem.

I should of course mention that HdrHistogram includes an API that compensates for coordinated omission, but I think such an API is straigtforward to build on top of the basics.

My code, for what it is worth, and which may indeed be buggy, is over here. But don’t re-use it: write your own. It could be much nicer in C++ or Rust, or any other language.

Finally, I would note that somehow this feels very basic; surely there is prior art? I feel like in 2003, Google would have had a better answer than today; alack. Pointers appreciated to other references, and if you find them, do tell me more about your search strategy, because mine is inadequate. Until then, gram you later!