This page is a design discussion about a potential new element, "audiopeaks".
Goal
Make audio waveform previews easy for gstreamer-based editing applications like PiTiVi, Jokosher and Buzztard. Let them the share the low-level code that computes waveform properties and does the drawing. The python applications will get good performance without polluting their python codebase.
Method
A new element ("audiopeaks") with one sink pad that takes in audio and one source pad that spits out data required for the waveform preview.
Requirements
- Provide statistics of consecutive audio blocks of user-specifiable duration.
- Provide at least max, min, and rms (the minimum needed to clone the Audacity waveform view)
- Handle mono and stereo.
- Allow waveform views to be rendered incrementally
- Allow rendering of live streams
- Cache data to avoid needing to re-read the entire source if a different block size is requested
- Operate correctly without a disk buffer if specified by the user
- buffer in memory if there is no disk buffer and the user requests memory buffering
- but limit memory use to a user-specified maximum, even to 0 (i.e. bufferless)
- When zooming and panning, allow users to achieve perceived quickness comparable to the best existing waveform previews
Preliminary design
The element (audiopeaks) would have input caps of "audio/x-raw-*" and output caps of a novel type "audiopeaks/x-raw-int" or "audiopeaks/x-raw-float", with standard audio properties ("channels", "width", "depth", "endianness"). Its "rate" property would be a rational fraction (e.g. 3/2) instead of an integer. It would also have an additional property called "statistics", giving a list of block statistics like "max", "min", "rms", "maxabs", "mean", "rmsdb", etc.. The output buffers will be like audio buffers, but with the statistics interleaved at the finest level.
The element would have properties "memlimit" (the max # of bytes to use as a RAM cache) and "peakfile" indicating the location of a peak file. If memlimit is zero and the peak file is disabled, audiopeaks will not cache any data. (If the peak file is available, the memory cache will probably not be used.)
Other possible properties: disklimit, maxlatency, exactonly, maxbitrate.
Behaviors
Whenever a source and sink are both connected, and the state of the pipeline is playing, audiopeaks will send buffers out of its src pad that provide (approximately) correct statistics about audio blocks whose size corresponds to the rate in the downstream caps. If memory or disk cache is available, the element will cache statistics to enable fast servicing of future requests, including requests at different rates (i.e. different block sizes).
When a seek event is received on the src pad, if the seek can easily be satisfied from memory cache or the peak file, the audiopeaks element will not forward the event upstream, and will instead service the request from cache. Otherwise it will forward the event upstream as necessary.
Buffers provided at the src file will be large enough to avoid excessive downstream overhead (unless limited by the maxlatency property). This means output buffers will typically represent much longer time intervals than input buffers.
If the peak file already exists, the element will assume that the provided file is correct for whatever is connected to the sink pad. If a peak file is connected but not an input stream, the element will serve seek requests from the peak file when possible, or report a seek failure otherwise.
Peak Files
A peak file is a file commonly used by audio editing programs to cache precalculated statistics about input audio in order to speed up waveform drawing calculations. The data is stored in a file to avoid spending inordinate amounts of RAM, and also to avoid recalculating the statistics every time the program is started. SoundForge, Wavelab, CoolEdit, Cakewalk, and Logic all use undocumented proprietary formats that have not been reverse-engineered. Reaper (a proprietary audio editor) uses a publicly documented format that is probably not appropriate for us because it does not cache RMS values. Of the open source editors, Qtractor and GrooveManager use undocumented formats, and Audacity stores the relevant data in hundreds of separate block files. None of these are convenient for us, and the benefits of compatibility are small, so we will likely design our own peak file format.
Our peak file format should ideally permit storage of statistics about N input samples without knowing N in advance, and without requiring the file to be rearranged after it is written before it can be used. It should permit determination of statistics of b blocks of size k in O(b*log(N)) time, regardless of k. (Some current peakfile formats are only design to operate well at a few zoom levels corresponding to the options available in the UI, but we should strive to work well for all block sizes because we are abstracted from the UI.) It should appropriately balance space-efficiency, speed, and accuracy, keeping in mind the latency of disk seeks/OS buffer misses/L1 cache misses and the CPU cost of seeking and decoding the input stream.
Peak File Design
This is way too much fun. The various possible peak file designs have a variety of tradeoffs: average vs. peak complexity, cache friendliness, ability to stay under hard limits and work on constrained systems. Ideally audiopeaks might offer more than one, and keep an abstraction barrier (i.e. a lookup function) around the exact implementation.
Design 1: Self-folding postorder trees
A logical structure for the peak file is a binary tree in postorder traversal, i.e. A, B, C=AoB, D, E, F=DoE, G=CoF, etc.. This allows the file to be written by purely appending values to the peak file as a stream arrives, without having to know in advance how long the stream is. Only log(N) memory is required (negligible).
The case of a file size limited to M values is trickier. One option is to rewrite the file when this happens, discarding the bottom layer of leaf nodes from the tree and collapsing the tree into the front half of the file. However, this is a large operation that creates a sudden spike in I/O.
A trickier solution would be to maintain a position folding function F(N,M)->[1,M] that tells you where to write something that you want to write to position N. For N<M, F(N,M)=N. For N>=M, F(N,M) should direct you to the position that will overwrite the least important node, or (arbitrary special value) if N is not more important than the least important node. (Decoding such a file requires knowing the last value of N, i.e. the length of the waveform, which can be indicated in the header.)
Let BP(N,M)->[1,M] (N<M) be the location of the N'th-highest-priority element of a postorder tree of length M, i.e. a breadth-first retraversal of the postorder. (This makes BP a permutation on M elements.)
Let TS(N,k)->[1,N/2k]U{Null} be the "tree squash" operation: given the element at location N in a postorder traversal, TS(N,k) maps the bottom k rows to Null, and otherwise gives the location of the node at N in a new postorder tree without these nodes.
Then for M<=N<2M, F(N,M) = BP(TS(N,1),M), i.e. overwriting the last M/2 original nodes, which are the leaf nodes.
For N>=2M, no original leaf nodes remain, so we must raise the bar again, calling TS twice to discard the second level of nodes as well, and begin overwriting the second-level nodes already written. These fall into two categories: the second-level nodes in the original tree (N<M), and the second level nodes in the second pass (M<=N<2M). For 2N<=M<3M we overwrite the original nodes: F(N,M) = BP(TS(N,2)-M/4,M). For 3M<=N<4M, we overwrite the new second-level nodes: F(N,M) = BP(BP(TS(N,2)-3M/4,M/2),M).
For 4M<=N<8M there are 4 different kinds of level-3 nodes to replace, nodes that: are original, replaced original level-2 nodes, replaced level-1 nodes, or replaced level-2 nodes that replaced level-1 nodes. This is getting messy; we need a recursive definition.
A recursion for F(N,M) is:
if N <= M return N
else return F(BP(TS(N,log(N/M)+1),M*2log(N/M)),M)
I am visualizing this as:
- Take the top M/2 nodes from the right half-tree (TS)
- Map them onto the level of nodes in the left half-tree that has M/2 nodes (in postorder) (BP)
- Zoom in to the left half-tree
- Repeat this folding operation until the input node has been folded below M. (F)
BP and TS both operate in time O(log(M)) so F operates in time O(log(N/M)*log(M)). The density of calls to F falls like 1/N as, on each successive pass, the base blocks grow larger. Therefore, for large N the cost will be dominated by other factors. In other words, F is so cheap it's free.
There's nothing about this algorithm that intrinsically requires postorder storage, but the folding behavior seems to ensure that no other storage order enjoys a significant advantage.
Postorder Tree Algorithms
Parent(N)
// O(log(N)), binary search from the smallest ancestor of [0,N]
ancestor = 2^nextpow2(N+2)-1
step = (ancestor+1)/2
do {
left = ancestor-step
step /= 2
if (left > N) {
ancestor = left
} else {
right = ancestor - 1
if (right != N) {
ancestor = right
}
}
} while (left != N and right != N)
return ancestor // Now the parent
Height(N)
// Every node has 2^Height(N)-1 nodes in its subtree (including itself)
ancestor = 2^nextpow2(N+2)-1
height = nextpow2(N+2)-1
step = (ancestor+1)/2
do {
left = ancestor-step
step /= 2
if (left > N) {
ancestor = left
height--
} else {
right = ancestor - 1
if (right != N) {
ancestor = right
height--
}
}
} while (left != N and right != N)
return height
Left(N)
return N - 2^(Height(N)-1)
Right(N)
return N-1
TreeSquash(N, k)
if (Height(N) <= k) return Null
// Starting from a known ancestor, walk down towards N, and also recapitulate the same behavior from a new ancestor squashed by k
ancestor = 2^nextpow2(N+2)-1
step = (ancestor+1)/2
new_ancestor = ancestor>>k
new_step = step>>k
do {
left = ancestor - step
new_left = new_ancestor - new_step
step /= 2
new_step /= 2
if (left >= N) {
ancestor = left
new_ancestor = new_left
else {
ancestor--
new_ancestor--
}
} while (ancestor != N)
return new_ancestorTakes time O(log(N)-k)
BreadthfirstPostorder(N,M)
// Given the index of a node in the breadth-first traversal of a tree, return the index of the same node in a postorder traversal.
// M is the number of nodes in the tree. For simplicity we presently assume that the tree is complete (M+1 is a power of 2)
// Method: Walk down from the root toward the node, recording each L/R choice. Breadthfirst indices directly encode this information.
// Every node has 2^Height(N)-1 nodes in its subtree (including itself)
ancestor = M
step = (ancestor+1)/2
bf_index = 1 // 1 is the root in breadth-first
while (ancestor != N) {
left = ancestor-step
step /= 2
bf_index <<= 1
if (left >= N) {
ancestor = left
} else {
ancestor--
bf_index++
}
}
}
return bf_indexTakes time O(log(M)). ==== Notable problems with Design 1==== If every block computation requires log(N) reads in essentially random locations, then log(N) disk seeks may be required. Even if log(N) is 1, this means ~10ms per block, or 10 seconds of "grinding" to produce a 1000-pixel preview. This is bad; current editors (that use breadth-first-type storage and fixed block sizes) can do it in a single seek and a small amount of high-speed linear IO. This technique will only have viable performance if much of the file is cached by the OS in memory. It doesn't help that a synchronous interface like mmap doesn't provide any way to service reads out of order. Similar considerations apply in RAM regarding cache misses.
A branching factor of 2 is silly and wasteful here (but higher branching factors are going to be very tricky to work out).
Haven't worked out how to set a file size limit that doesn't correspond to a complete tree ... especially silly because real inputs are not going to have size 2n.
Design 2: Multifile
If we can rely on the filesystem's allocation heuristic, and don't mind having more than one file on disk, then a much easier solution is possible: store each level of the tree in a separate file. The algorithms for this are trivial, even for large radix. Size limits are easily handled: just delete the finest-grained file. The number of files is log_R(N) for radix R and N samples, but R is likely to be large (Audacity seems to use 256). Efficiency is maximized; seeks are minimized.
Not all applications may permit multiple files on disk, and the filesystem may fragment the files badly if they are all appending simultaneously. This approach is also difficult to generalize to the RAM case with fixed-size allocations. In memory, typical solutions involve calls to malloc (or realloc). We may be able to schedule these calls so that the user-specified limit is not temporarily exceeded, but even so, memory fragmentation can occur, and we are relying on realloc's heuristics.
Design 3: Pseudo-multifile
If we have a single file or memory block, we can write an allocator into the algorithm, essentially requiring an O(M) reallocation operation every time the limit is reached. This is especially irritating for the unlimited case, which in proposals 1 and 2 has almost no apparent overhead during writing. It is similar to Design 2 in the case where there is no additional free space on the system, so the allocator must work entirely within the limit.
In the unlimited case the allocator could use a very large sparse file to make allocations less frequent ... but this risks creating unexpectedly enormous files in various circumstances, so better not.
A compromise would be to write in Design 1, and then rewrite the file once it's finished. This makes it difficult to update, should a new seek event arrive covering a different section of the input.
Seek-oriented design
Modern disk systems tend to have read bandwidths, for contiguous reads, of 100 MB/sec and seek times of 10 ms or more. That means it's cheaper to read 1 MB than to perform 1 seek. If our on-disk structures cost 4 bytes each, this represents 250,000 nodes. If one seek were required for every measurement, each measurement would have to represent 5 seconds of audio ... equivalent to visualizing a whole 2-hour recording across one screen. Thus, it is probably better to assume that no seeks are ever performed during a rendering operation, as this can only impact the most extremely large zoom factors.
Suppose that nodes are stored representing a set of reduction factors F. If a desired reduction factor f is in the set, it may be recovered by a direct linear scan. If it is not in the set, then the renderer must look for the divisors of f, incurring an efficiency penalty, and if none are in F, then the reduction cannot be computed. For this reason, most renderers limit themselves to powers of 2, and store a handful of such powers in F to limit the bandwidth amplification. (They must also restrict themselves to offsets that are multiples of the reduction factor, seemingly a reasonable requirement.)
If one seek is required to find the start of the segment of interest, then there is no sense optimizing to read much less than 1 MB of data, below which the seek time dominates the total time. For a 1000-pixel visualization, this implies that we may read up to 1000 bytes per pixel, or about 256 nodes. Using a greedy algorithm, I have determined that we may satisfy every scaling from 1 to 256,000, with an amplification factor of at most 256, with 26310 scale factors, requiring 5.26 times as many nodes as input samples. (This is probably too many to be tenable.)
For truly continuous zooming interfaces even this is not enough, because the number of samples per pixel may not be an integer, especially in editors (such as Pitivi) that operate without any single project samplerate. The only sensible thing to do in this case is to allow a certain amount of "slop" in the number of samples contributing to each pixel. If we define a slop of 0 to be exact scaling, and a slop of 1 to indicate that the scale may vary by a factor of 2, then a given scale factor f may be computed from any scale <= f*slop. To maintain an amplification factor of at most A, we need a scale in F at every integer power of A*slop. Conversely, if F contains every power of B, then the max amplification factor is B/slop.
This suggests that a reasonable design might store reduction factors that are powers of 4, so that slops of approximately 2% and higher remain seek-limited.

