May 20, 2023

Andy Wingo: approaching cps soup

(Andy Wingo)

Good evening, hackers. Today's missive is more of a massive, in the sense that it's another presentation transcript-alike; these things always translate to many vertical pixels.

In my defense, I hardly ever give a presentation twice, so not only do I miss out on the usual per-presentation cost amortization and on the incremental improvements of repetition, but, more direly, whatever message I might have can only ever reach a subset of those that it might interest. Here at least I can be more or less sure that if the presentation would interest someone, they will find it.

So for the time being I will try to share presentations here, in the spirit of, well, why the hell not.

CPS Soup

A functional intermediate language

10 May 2023 – Spritely

Andy Wingo

Igalia, S.L.

Last week I gave a training talk to Spritely Institute collaborators on the intermediate representation used by Guile's compiler.

CPS Soup

Compiler: Front-end to Middle-end to Back-end

Middle-end spans gap between high-level source code (AST) and low-level machine code

Programs in middle-end expressed in intermediate language

CPS Soup is the language of Guile’s middle-end

An intermediate representation (IR) (or intermediate language, IL) is just another way to express a computer program. Specifically it's the kind of language that is appropriate for the middle-end of a compiler, and by "appropriate" I mean that an IR serves a purpose: there has to be a straightforward transformation to the IR from the high-level abstract syntax trees (ASTs) produced by the front-end, and there has to be a straightforward translation from the IR to machine code.

There is also usually a set of necessary source-to-source transformations on the IR to "lower" it, meaning to make it closer to the back-end than to the front-end. And there is usually a set of optional transformations to the IR to make the program run faster or allocate less memory or be simpler: these are the optimizations.

"CPS soup" is Guile's IR. This talk presents the essentials of CPS soup in the context of more traditional IRs.

How to lower?

High-level:

(+ 1 (if x 42 69))

Low-level:

  cmpi $x, #f
  je L1
  movi $t, 42
  j L2  
L1:
  movi $t, 69
L2:
  addi $t, 1

How to get from here to there?

Before we dive in, consider what we might call the dynamic range of an intermediate representation: we start with what is usually an algebraic formulation of a program and we need to get down to a specific sequence of instructions operating on registers (unlimited in number, at this stage; allocating to a fixed set of registers is a back-end concern), with explicit control flow between them. What kind of a language might be good for this? Let's attempt to answer the question by looking into what the standard solutions are for this problem domain.

1970s

Control-flow graph (CFG)

graph := array<block>
block := tuple<preds, succs, insts>
inst  := goto B
       | if x then BT else BF
       | z = const C
       | z = add x, y
       ...

BB0: if x then BB1 else BB2
BB1: t = const 42; goto BB3
BB2: t = const 69; goto BB3
BB3: t2 = addi t, 1; ret t2

Assignment, not definition

Of course in the early days, there was no intermediate language; compilers translated ASTs directly to machine code. It's been a while since I dove into all this but the milestone I have in my head is that it's the 70s when compiler middle-ends come into their own right, with Fran Allen's work on flow analysis and optimization.

In those days the intermediate representation for a compiler was a graph of basic blocks, but unlike today the paradigm was assignment to locations rather than definition of values. By that I mean that in our example program, we get t assigned to in two places (BB1 and BB2); the actual definition of t is implicit, as a storage location, and our graph consists of assignments to the set of storage locations in the program.

1980s

Static single assignment (SSA) CFG

graph := array<block>
block := tuple<preds, succs, phis, insts>
phi   := z := φ(x, y, ...)
inst  := z := const C
       | z := add x, y
       ...
BB0: if x then BB1 else BB2
BB1: v0 := const 42; goto BB3
BB2: v1 := const 69; goto BB3
BB3: v2 := φ(v0,v1); v3 := addi v2, 1; ret v3

Phi is phony function: v2 is v0 if coming from first predecessor, or v1 from second predecessor

These days we still live in Fran Allen's world, but with a twist: we no longer model programs as graphs of assignments, but rather graphs of definitions. The introduction in the mid-80s of so-called "static single-assignment" (SSA) form graphs means that instead of having two assignments to t, we would define two different values v0 and v1. Then later instead of reading the value of the storage location associated with t, we define v2 to be either v0 or v1: the former if we reach the use of t in BB3 from BB1, the latter if we are coming from BB2.

If you think on the machine level, in terms of what the resulting machine code will be, this either function isn't a real operation; probably register allocation will put v0, v1, and v2 in the same place, say $rax. The function linking the definition of v2 to the inputs v0 and v1 is purely notational; in a way, you could say that it is phony, or not real. But when the creators of SSA went to submit this notation for publication they knew that they would need something that sounded more rigorous than "phony function", so they instead called it a "phi" (φ) function. Really.

2003: MLton

Refinement: phi variables are basic block args

graph := array<block>
block := tuple<preds, succs, args, insts>

Inputs of phis implicitly computed from preds

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

SSA is still where it's at, as a conventional solution to the IR problem. There have been some refinements, though. I learned of one of them from MLton; I don't know if they were first but they had the idea of interpreting phi variables as arguments to basic blocks. In this formulation, you don't have explicit phi instructions; rather the "v2 is either v1 or v0" property is expressed by v2 being a parameter of a block which is "called" with either v0 or v1 as an argument. It's the same semantics, but an interesting notational change.

Refinement: Control tail

Often nice to know how a block ends (e.g. to compute phi input vars)

graph := array<block>
block := tuple<preds, succs, args, insts,
               control>
control := if v then L1 else L2
         | L(v, ...)
         | switch(v, L1, L2, ...)
         | ret v

One other refinement to SSA is to note that basic blocks consist of some number of instructions that can define values or have side effects but which otherwise exhibit fall-through control flow, followed by a single instruction that transfers control to another block. We might as well store that control instruction separately; this would let us easily know how a block ends, and in the case of phi block arguments, easily say what values are the inputs of a phi variable. So let's do that.
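With the control tail split out from the fall-through instructions, the running example might be rendered along these lines:

BB0(a0): insts: ();                 control: if a0 then BB1() else BB2()
BB1():   insts: (v0 := const 42);   control: BB3(v0)
BB2():   insts: (v1 := const 69);   control: BB3(v1)
BB3(v2): insts: (v3 := addi v2, 1); control: ret v3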

Refinement: DRY

Block successors directly computable from control

Predecessors graph is inverse of successors graph

graph := array<block>
block := tuple<args, insts, control>

Can we simplify further?

At this point we notice that we are repeating ourselves; the successors of a block can be computed directly from the block's terminal control instruction. Let's drop those as a distinct part of a block, because when you transform a program it's unpleasant to have to needlessly update something in two places.

While we're doing that, we note that the predecessors array is also redundant, as it can be computed from the graph of block successors. Here we start to wonder: am I simplifying, or am I removing something that is fundamental to the algorithmic complexity of the various graph transformations that I need to do? We press on, though, hoping we will get somewhere interesting.

Basic blocks are annoying

Ceremony about managing insts; array or doubly-linked list?

Nonuniformity: “local” vs “global” transformations

Optimizations transform graph A to graph B; mutability complicates this task

  • Desire to keep A in mind while making B
  • Bugs because of spooky action at a distance

Recall that the context for this meander is Guile's compiler, which is written in Scheme. Scheme doesn't have expandable arrays built-in. You can build them, of course, but it is annoying. Also, in Scheme-land, functions with side-effects are conventionally suffixed with an exclamation mark; after too many of them, both the writer and the reader get fatigued. I know it's a silly argument but it's one of the things that made me grumpy about basic blocks.

If you permit me to continue with this introspection, I find there is an uneasy relationship between instructions and locations in an IR that is structured around basic blocks. Do instructions live in a function-level array, with a basic block being an array of instruction indices? How do you get from an instruction to its basic block? How would you hoist an instruction to another basic block; might you need to reallocate the block itself?

And when you go to transform a graph of blocks... well how do you do that? Is it in-place? That would be efficient; but what if you need to refer to the original program during the transformation? Might you risk reading a stale graph?

It seems to me that there are too many concepts, that in the same way that SSA itself moved away from assignment to a more declarative language, that perhaps there is something else here that might be more appropriate to the task of a middle-end.

Basic blocks, phi vars redundant

Blocks: label with args sufficient; “containing” multiple instructions is superfluous

Unify the two ways of naming values: every var is a phi

graph := array<block>
block := tuple<args, inst>
inst  := L(expr)
       | if v then L1() else L2()
       ...
expr  := const C
       | add x, y
       ...

I took a number of tacks here, but the one I ended up on was to declare that basic blocks themselves are redundant. Instead of containing an array of instructions with fallthrough control-flow, why not just make every instruction a control instruction? (Yes, there are arguments against this, but do come along for the ride, we get to a funny place.)

While you are doing that, you might as well unify the two ways in which values are named in an MLton-style compiler: instead of distinguishing between basic block arguments and values defined within a basic block, we might as well make all names into basic block arguments.

Arrays annoying

Array of blocks implicitly associates a label with each block

Optimizations add and remove blocks; annoying to have dead array entries

Keep labels as small integers, but use a map instead of an array

graph := map<label, block>

In the traditional SSA CFG IR, a graph transformation would often not touch the structure of the graph of blocks. But now having given each instruction its own basic block, we find that transformations of the program necessarily change the graph. Consider an instruction that we elide; before, we would just remove it from its basic block, or replace it with a no-op. Now, we have to find its predecessor(s), and forward them to the instruction's successor. It would be useful to have a more capable data structure to represent this graph. We might as well keep labels as being small integers, but allow for sparse maps and growth by using an integer-specialized map instead of an array.

This is CPS soup

graph := map<label, cont>
cont  := tuple<args, term>
term  := continue to L
           with values from expr
       | if v then L1() else L2()
       ...
expr  := const C
       | add x, y
       ...

SSA is CPS

This is exactly what CPS soup is! We came at it "from below", so to speak; instead of the heady fumes of the lambda calculus, we get here from down-to-earth basic blocks. (If you prefer the other way around, you might enjoy this article from a long time ago.) The remainder of this presentation goes deeper into what it is like to work with CPS soup in practice.
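As a point of reference, here is the running example rendered in that grammar, one cont per label; the ret at the end stands in for one of the elided term kinds.

L0(a0): if a0 then L1() else L2()
L1(): continue to L3 with values from const 42
L2(): continue to L3 with values from const 69
L3(v2): continue to L4 with values from add v2, 1
L4(v3): ret v3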

Scope and dominators

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

What vars are “in scope” at BB3? a0 and v2.

Not v0; not all paths from BB0 to BB3 define v0.

a0 always defined: its definition dominates all uses.

BB0 dominates BB3: All paths to BB3 go through BB0.

Before moving on, though, we should discuss what it means in an SSA-style IR that variables are defined rather than assigned. If you consider variables as locations to which values can be assigned and which initially hold garbage, you can read them at any point in your program. You might get garbage, though, if the variable wasn't assigned something sensible on the path that led to reading the location's value. It sounds bonkers but it is still the C and C++ semantic model.

If we switch instead to a definition-oriented IR, then a variable never has garbage; the single definition always precedes any uses of the variable. That is to say that all paths from the function entry to the use of a variable must pass through the variable's definition, or, in the jargon, that definitions dominate uses. This is an invariant of an SSA-style IR, that all variable uses be dominated by their associated definition.

You can flip the question around to ask what variables are available for use at a given program point, which might be read equivalently as which variables are in scope; the answer is, all definitions from all program points that dominate the use site. The "CPS" in "CPS soup" stands for continuation-passing style, a dialect of the lambda calculus, which also has a history of use as a compiler intermediate representation. But it turns out that if we use the lambda calculus in its conventional form, we end up needing to maintain a lexical scope nesting at the same time that we maintain the control-flow graph, and the lexical scope tree can fail to reflect the dominator tree. I go into this topic in more detail in an old article, and if it interests you, please do go deep.

CPS soup in Guile

Compilation unit is intmap of label to cont

cont := $kargs names vars term
      | ...
term := $continue k src expr
      | ...
expr := $const C
      | $primcall 'add #f (a b)
      | ...

Conventionally, entry point is lowest-numbered label

Anyway! In Guile, the concrete form that CPS soup takes is that a program is an intmap of label to cont. A cont is the smallest labellable unit of code. You can call them blocks if that makes you feel better. One kind of cont, $kargs, binds incoming values to variables. It has a list of variables, vars, and also has an associated list of human-readable names, names, for debugging purposes.

A $kargs contains a term, which is like a control instruction. One kind of term is $continue, which passes control to a continuation k. Using our earlier language, this is just goto k, with values, as in MLton. (The src is a source location for the term.) The values come from the term's expr, of which there are a dozen kinds or so, for example $const which passes a literal constant, or $primcall, which invokes some kind of primitive operation, which above is add. The primcall may have an immediate operand, in this case #f, and some variables that it uses, in this case a and b. The number and type of the produced values is a property of the primcall; some are just for effect, some produce one value, some more.

CPS soup

term := $continue k src expr
      | $branch kf kt src op param args
      | $switch kf kt* src arg
      | $prompt k kh src escape? tag
      | $throw src op param args

Expressions can have effects, produce values

expr := $const val
      | $primcall name param args
      | $values args
      | $call proc args
      | ...

There are other kinds of terms besides $continue: there is $branch, which proceeds either to the false continuation kf or the true continuation kt depending on the result of performing op on the variables args, with immediate operand param. In our running example, we might have made the initial term via:

(build-term
  ($branch BB1 BB2 'false? #f (a0)))

The definition of build-term (and build-cont and build-exp) is in the (language cps) module.

There is also $switch, which takes an unboxed unsigned integer arg and performs an array dispatch to the continuations in the list kt, or kf otherwise.

There is $prompt which continues to its k, having pushed on a new continuation delimiter associated with the var tag; if code aborts to tag before the prompt exits via an unwind primcall, the stack will be unwound and control passed to the handler continuation kh. If escape? is true, the continuation is escape-only and aborting to the prompt doesn't need to capture the suspended continuation.

Finally there is $throw, which doesn't continue at all, because it causes a non-resumable exception to be thrown. And that's it; it's just a handful of kinds of term, determined by the different shapes of control-flow (how many continuations the term has).

When it comes to values, we have about a dozen expression kinds. We saw $const and $primcall, but I want to explicitly mention $values, which simply passes on some number of values. Often a $values expression corresponds to passing an input to a phi variable, though $kargs vars can get their definitions from any expression that produces the right number of values.
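For example, the term in BB1 of our running example, which forwards v0 to the join point, might be built along the following lines; this is modeled on the build-term example above, with BB3, src, and v0 standing in for a real label, source location, and variable.

(build-term
  ($continue BB3 src ($values (v0))))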

Kinds of continuations

Guile functions untyped, can return multiple values

Error if too few values, possibly truncate too many values, possibly cons as rest arg...

Calling convention: contract between val producer & consumer

  • both on call and return side

Continuation of $call unlike that of $const

When a $continue term continues to a $kargs with a $const 42 expression, there are a number of invariants that the compiler can ensure: that the $kargs continuation is always passed the expected number of values, that the vars that it binds can be allocated to specific locations (e.g. registers), and that, because all predecessors of the $kargs are known, those predecessors can place their values directly into the variable's storage locations. Effectively, the compiler determines a custom calling convention between each $kargs and its predecessors.

Consider the $call expression, though; in general you don't know what the callee will do to produce its values. You don't even generally know that it will produce the right number of values. Therefore $call can't (in general) continue to $kargs; instead it continues to $kreceive, which expects the return values in well-known places. $kreceive will check that it is getting the right number of values and then continue to a $kargs, shuffling those values into place. A standard calling convention defines how functions return values to callers.

The conts

cont := $kfun src meta self ktail kentry
      | $kclause arity kbody kalternate
      | $kargs names syms term
      | $kreceive arity kbody
      | $ktail

$kclause, $kreceive very similar

Continue to $ktail: return

$call and return (and $throw, $prompt) exit first-order flow graph

Of course, a $call expression could be a tail-call, in which case it would continue instead to $ktail, indicating an exit from the first-order function-local control-flow graph.

The calling convention also specifies how to pass arguments to callees, and likewise those continuations have a fixed calling convention; in Guile we start functions with $kfun, which has some metadata attached, and then proceed to $kclause which bridges the boundary between the standard calling convention and the specialized graph of $kargs continuations. (Many details of this could be tweaked, for example that the case-lambda dispatch built-in to $kclause could instead dispatch to distinct functions instead of to different places in the same function; historical accidents abound.)

As a detail, if a function is well-known, in that all its callers are known, then we can lighten the calling convention, moving the argument-count check to callees. In that case $kfun continues directly to $kargs. Similarly for return values, optimizations can make $call continue to $kargs, though there is still some value-shuffling to do.

High and low

CPS bridges AST (Tree-IL) and target code

High-level: vars in outer functions in scope

Closure conversion between high and low

Low-level: Explicit closure representations; access free vars through closure

CPS soup is the bridge between parsed Scheme and machine code. It starts out quite high-level, notably allowing for nested scope, in which expressions can directly refer to free variables. Variables are small integers, and for high-level CPS, variable indices have to be unique across all functions in a program. CPS gets lowered via closure conversion, which chooses specific representations for each closure that remains after optimization. After closure conversion, all variable access is local to the function; free variables are accessed via explicit loads from a function's closure.

Optimizations at all levels

Optimizations before and after lowering

Some exprs only present in one level

Some high-level optimizations can merge functions (higher-order to first-order)

Because of the broad remit of CPS, the language itself has two dialects, high and low. The high level dialect has cross-function variable references, first-class abstract functions (whose representation hasn't been chosen), and recursive function binding. The low-level dialect has only specific ways to refer to functions: labels and specific closure representations. It also includes calls to function labels instead of just function values. But these are minor variations; some optimization and transformation passes can work on either dialect.

Practicalities

Intmap, intset: Clojure-style persistent functional data structures

Program: intmap<label,cont>

Optimization: program→program

Identify functions: (program,label)→intset<label>

Edges: intmap<label,intset<label>>

Compute succs: (program,label)→edges

Compute preds: edges→edges

I mentioned that programs were intmaps, and specifically in Guile they are Clojure/Bagwell-style persistent functional data structures. By functional I mean that intmaps (and intsets) are values that can't be mutated in place (though we do have the transient optimization).

I find that immutability has the effect of deploying a sense of calm to the compiler hacker -- I don't need to worry about data structures changing out from under me; instead I just structure all the transformations that you need to do as functions. An optimization is just a function that takes an intmap and produces another intmap. An analysis associating some data with each program label is just a function that computes an intmap, given a program; that analysis will never be invalidated by subsequent transformations, because the program to which it applies will never be mutated.

This pervasive feeling of calm allows me to tackle problems that I wouldn't have otherwise been able to fit into my head. One example is the novel online CSE pass; one day I'll either wrap that up as a paper or just capitulate and blog it instead.
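As a small illustration of this style, here is a sketch of the "compute preds" function from the slide above, written as a pure function that inverts a successors map. It assumes the intmap and intset operations exported by Guile's (language cps intmap) and (language cps intset) modules (intmap-fold, intmap-add with a merge procedure, intset-fold, intset-add, intset-union), so read it as an approximation rather than the code Guile actually ships.

(use-modules (language cps intmap)
             (language cps intset))

;; Invert a successors map (intmap of label -> intset of successors)
;; into a predecessors map of the same shape.
(define (invert-edges succs)
  (intmap-fold
   (lambda (label succ-set preds)
     ;; For each edge label -> succ, add label to preds[succ],
     ;; merging with intset-union if succ already has an entry.
     (intset-fold
      (lambda (succ preds)
        (intmap-add preds succ
                    (intset-add empty-intset label)
                    intset-union))
      succ-set
      preds))
   succs
   empty-intmap))

Note that labels with no predecessors (such as the entry) simply don't appear in the result; adjust to taste.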

Flow analysis

A[k] = meet(A[p] for p in preds[k])
         - kill[k] + gen[k]

Compute available values at labels:

  • A: intmap<label,intset<val>>
  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union
  • kill[k]: values invalidated by cont because of side effects
  • gen[k]: values defined at k

But to keep it concrete, let's take the example of flow analysis. You might want to compute the "available values" at a given label: these are the values that are candidates for common subexpression elimination. For example, if a term is dominated by a car x primcall whose value is bound to v, and no path from that definition to a subsequent car x primcall can invalidate the value (say, by mutating x), then we can replace the second, duplicate operation with $values (v) instead.

There is a standard solution for this problem, which is to solve the flow equation above. I wrote about this at length ages ago, but looking back on it, the thing that pleases me is how easy it is to decompose the task of flow analysis into manageable parts, and how the types tell you exactly what you need to do. It's easy to compute an initial analysis A, easy to define your meet function when your maps and sets have built-in intersect and union operators, easy to define what addition and subtraction mean over sets, and so on.
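To give a flavor, here is a sketch of a single step of the equation at one label k, again assuming Guile's intmap/intset operations plus SRFI-1's fold, and hypothetical kill and gen tables (intmaps from label to intset); a real solver iterates steps like this one over the labels until it reaches a fixed point.

(use-modules (language cps intmap)
             (language cps intset)
             (srfi srfi-1))

;; One step of the flow equation at label k: intersect the out-sets of
;; k's predecessors, subtract kill[k], then add gen[k].  Assumes preds
;; maps every label, including the entry, to a (possibly empty) intset.
(define (flow-step analysis preds kill gen k)
  (let* ((pred-sets (intset-fold
                     (lambda (p sets)
                       (cons (intmap-ref analysis p) sets))
                     (intmap-ref preds k)
                     '()))
         (in (if (null? pred-sets)
                 empty-intset
                 (fold intset-intersect
                       (car pred-sets) (cdr pred-sets)))))
    (intset-union (intset-subtract in (intmap-ref kill k))
                  (intmap-ref gen k))))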

Persistent data structures FTW

  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union

Naïve: O(nconts * nvals)

Structure-sharing: O(nconts * log(nvals))

Computing an analysis isn't free, but it is manageable in cost: the structure-sharing means that meet is usually trivial (for fallthrough control flow) and the cost of + and - is proportional to the log of the problem size.

CPS soup: strengths

Relatively uniform, orthogonal

Facilitates functional transformations and analyses, lowering mental load: “I just have to write a function from foo to bar; I can do that”

Encourages global optimizations

Some kinds of bugs prevented by construction (unintended shared mutable state)

We get the SSA optimization literature

Well, we're getting to the end here, and I want to take a step back. Guile has used CPS soup as its middle-end IR for about 8 years now, enough time to appreciate its fine points while also understanding its weaknesses.

On the plus side, it has what to me is a kind of low cognitive overhead, and I say that not just because I came up with it: Guile's development team is small and not particularly well-resourced, and we can't afford complicated things. The simplicity of CPS soup works well for our development process (flawed though that process may be!).

I also like how by having every variable be potentially a phi, that any optimization that we implement will be global (i.e. not local to a basic block) by default.

Perhaps best of all, we get these benefits while also being able to use the existing SSA transformation literature. Because CPS is SSA, the lessons learned in SSA (e.g. loop peeling) apply directly.

CPS soup: weaknesses

Pointer-chasing, indirection through intmaps

Heavier than basic blocks: more control-flow edges

Names bound at continuation only; phi predecessors share a name

Over-linearizes control, relative to sea-of-nodes

Overhead of re-computation of analyses

CPS soup is not without its drawbacks, though. It's not suitable for JIT compilers, because it imposes some significant constant-factor (and sometimes algorithmic) overheads. You are always indirecting through intmaps and intsets, and these data structures involve significant pointer-chasing.

Also, there are some forms of lightweight flow analysis that can be performed naturally on a graph of basic blocks without looking too much at the contents of the blocks; for example, our available values analysis could run over blocks instead of individual instructions. In these cases, basic blocks themselves are an optimization, as they can reduce the size of the problem space, with corresponding reductions in time and memory use for analyses and transformations. Of course you could overlay a basic block graph on top of CPS soup, but it's not a well-worn path.

There is a little detail that not all phi predecessor values have names, since names are bound at successors (continuations). But this is a detail; if these names are important, little $values trampolines can be inserted.

Probably the main drawback as an IR is that the graph of conts in CPS soup over-linearizes the program. There are other intermediate representations that don't encode ordering constraints where there are none; perhaps it would be useful to marry CPS soup with sea-of-nodes, at least during some transformations.

Finally, CPS soup does not encourage a style of programming where an analysis is incrementally kept up to date as a program is transformed in small ways. The result is that we end up performing much redundant computation within each individual optimization pass.

Recap

CPS soup is SSA, distilled

Labels and vars are small integers

Programs map labels to conts

Conts are the smallest labellable unit of code

Conts can have terms that continue to other conts

Compilation simplifies and lowers programs

Wasm vs VM backend: a question for another day :)

But all in all, CPS soup has been good for Guile. It's just SSA by another name, in a simpler form, with a functional flavor. Or, it's just CPS, but first-order only, without lambda.

In the near future, I am interested in seeing what a new GC will do for CPS soup; will bump-pointer allocation palliate some of the costs of pointer-chasing? We'll see. A tricky thing about CPS soup is that I don't think that anyone else has tried it in other languages, so it's hard to objectively understand its characteristics independent of Guile itself.

Finally, it would be nice to engage in the academic conversation by publishing a paper somewhere; I would like to see interesting criticism, and blog posts don't really participate in the citation graph. But in the limited time available to me, faced with the choice between hacking on something and writing a paper, it's always been hacking, so far :)

Speaking of limited time, I probably need to hit publish on this one and move on. Happy hacking to all, and until next time.

by Andy Wingo at May 20, 2023 07:10 AM

May 19, 2023

GStreamer: GStreamer 1.22.3 stable bug fix release

(GStreamer)

The GStreamer team is pleased to announce the third bug fix release in the stable 1.22 release series of your favourite cross-platform multimedia framework!

This release only contains bugfixes and it should be safe to update from 1.22.x.

Highlighted bugfixes:

  • avdec: fix occasional video decoder deadlock on seeking with FFmpeg 6.0
  • decodebin3: fix regression handling input streams without CAPS or TIME segment such as e.g. udpsrc or pushfilesrc
  • bluez: a2dpsink: fix Bluetooth SIG Certification test failures
  • osxvideosink: fix deadlock upon closing output window
  • qtdemux: fix edit list handling regression and AV1 codec box parsing
  • qtmux: fix extraction of CEA608 closed caption data from S334-1A packets
  • rtspsrc: Fix handling of * control path
  • splitmux: timestamp handling improvements
  • v4l2videodec: Rework dynamic resolution change handling (needed for IMX6 mainline codec)
  • videoflip: fix regression with automatically rotating video based on tags
  • d3d11: many d3d11videosink and d3d11compositor fixes
  • webrtc, rtp: numerous data race fixes and stability fixes
  • cerbero: Add support for RHEL9 and Rocky Linux; build timecodestamper plugin with libltc support
  • various bug fixes, memory leak fixes, and other stability and reliability improvements

See the GStreamer 1.22.3 release notes for more details.

Binaries for Android, iOS, Mac OS X and Windows will be available shortly.

Release tarballs can be downloaded directly here:

May 19, 2023 10:00 AM

May 02, 2023

Andy Wingo: structure and interpretation of ark

(Andy Wingo)

Hello, dear readers! Today's article describes Ark, a new JavaScript-based mobile development platform. If you haven't read them yet, you might want to start by having a look at my past articles on Capacitor, React Native, NativeScript, and Flutter; having a common understanding of the design space will help us understand where Ark is similar and where it differs.

Ark, what it is

If I had to bet, I would guess that you have not heard of Ark. (I certainly hadn't either, when commissioned to do this research series.) To a first approximation, Ark—or rather, what I am calling Ark; I don't actually know the name for the whole architecture—is a loosely Flutter-like UI library implemented on top of a dialect of JavaScript, with build-time compilation to bytecode (like Hermes) but also with support for just-in-time and ahead-of-time compilation of bytecode to native code. It is made by Huawei.

At this point if you are already interested in this research series, I am sure this description raises more questions than it answers. Flutter-like? A dialect? Native compilation? Targetting what platforms? From Huawei? We'll get to all of these, but I think we need to start with the last question.

How did we get here?

In my last article on Flutter, I told a kind of just-so history of how Dart and Flutter came to their point in the design space. Thanks to corrections from a kind reader, it happened to also be more or less correct. In this article, though, I haven't talked with Ark developers at all; I don't have the benefit of a true claim on history. And yet, the only way I can understand Ark is by inventing a narrative, so here we go. It might even be true!

Recall that in 2018, Huawei was a dominant presence in the smartphone market. They were shipping excellent hardware at good prices both to the Chinese and to the global markets. Like most non-Apple, non-Google manufacturers, they shipped Android, and like most Android OEMs, they shipped Google's proprietary apps (mail, maps, etc.).

But then, over the next couple years, the US decided that allowing Huawei to continue on as before was, like, against national security interests or something. Huawei was barred from American markets, a number of suppliers were forbidden from selling hardware components to Huawei, and even Google was prohibited from shipping its mobile apps on Huawei devices. The effect on Huawei's market share for mobile devices was enormous: its revenue was cut in half over a period of a couple years.

In this position, as Huawei, what do you do? I can't even imagine, but specifically looking at smartphones, I think I would probably do about what they did. I'd fork Android, for starters, because that's what you already know and ship, and Android is mostly open source. I'd probably plan on continuing to use its lower-level operating system pieces indefinitely (kernel and so on) because that's not a value differentiator. I'd probably ship the same apps on top at first, because otherwise you slip all the release schedules and lose revenue entirely.

But, gosh, there is the risk that your product will be perceived as just a worse version of Android: that's not a good position to be in. You need to be different, and ideally better. That will take time. In the meantime, you claim that you're different, without actually being different yet. It's a somewhat ridiculous position to be in, but I can understand how you get here; Ars Technica published a scathing review poking fun at the situation. But, you are big enough to ride it out, knowing that somehow eventually you will be different.

Up to now, this part of the story is relatively well-known; the part that follows is more speculative on my part. Firstly, I would note that Huawei had been working for a while on a compiler and language run-time called Ark Compiler, with the goal of getting better performance out of Android applications. If I understand correctly, this compiler took the Java / Dalvik / Android Run Time bytecodes as its input, and outputted native binaries along with a new run-time implementation.

As I can attest from personal experience, having a compiler leads to hubris: you start to consider source languages like a hungry person looks at a restaurant menu. "Wouldn't it be nice to ingest that?" That's what we say at restaurants, right, fellow humans? So in 2019 and 2020 when the Android rug was pulled out from underneath Huawei, I think having in-house compiler expertise allowed them to consider whether they wanted to stick with Java at all, or whether it might be better to choose a more fashionable language.

Like black, JavaScript is always in fashion. What would it mean, then, to retool Huawei's operating system -- by then known by the name "HarmonyOS" -- to expose a JavaScript-based API as its primary app development framework? You could use your Ark compiler somehow to implement JavaScript (hubris!) and then you need a UI framework. Having ditched Java, it is now thinkable to ditch all the other Android standard libraries, including the UI toolkit: you start anew, in a way. So are you going to build a Capacitor, a React Native, a NativeScript, a Flutter? Surely not precisely any of these, but what will it be like, and how will it differ?

Incidentally, I don't know the origin story for the name Ark, but to me it brings to mind tragedy and rebuilding: in the midst of being cut off from your rich Android ecosystem, you launch a boat into the sea, holding a promise of a new future built differently. Hope and hubris in one vessel.

Two programming interfaces

In the end, Huawei builds two things: something web-like and something like Flutter. (I don't mean to suggest copying or degeneracy here; it's rather that I can only understand things in relation to other things, and these are my closest points of comparison for what they built.)

The web-like programming interface specifies UIs using an XML dialect, HML, and styles the resulting node tree with CSS. You augment these nodes with JavaScript behavior; the main app is a set of DOM-like event handlers. There is an API to dynamically create DOM nodes, but unlike the other systems we have examined, the HarmonyOS documentation doesn't really sell you on using a high-level framework like Angular.

If this were it, I think Ark would not be so compelling: the programming model is more like what was available back in the DHTML days. I wouldn't expect people to be able to make rich applications that delight users, given these primitives, though CSS animation and the HML loop and conditional rendering from the template system might be just expressive enough for simple applications.

The more interesting side is the so-called "declarative" UI programming model which exposes a Flutter/React-like interface. The programmer describes the "what" of the UI by providing a tree of UI nodes in its build function, and the framework takes care of calling build when necessary and of rendering that tree to the screen.

Here I need to show some example code, because it is... weird. Well, I find it weird, but it's not too far from SwiftUI in flavor. A small example from the fine manual:

@Entry
@Component
struct MyComponent {
  build() {
    Stack() {
        Image($rawfile('Tomato.png'))
        Text('Tomato')
            .fontSize(26)
            .fontWeight(500)
    }
  }
}

The @Entry decorator (*) marks this struct (**) as being the main entry point for the app. @Component marks it as being a component, like a React functional component. Components conform to an interface (***) which defines them as having a build method which takes no arguments and returns no values: it creates the tree in a somewhat imperative way.

But as you see the flavor is somewhat declarative, so how does that work? Also, build() { ... } looks syntactically a lot like Stack() { ... }; what's the deal, are they the same?

Before going on to answer this, note my asterisks above: these are concepts that aren't in JavaScript. Indeed, programs written for HarmonyOS's declarative framework aren't JavaScript; they are in a dialect of TypeScript that Huawei calls ArkTS. In this case, an interface is a TypeScript concept. Decorators would appear to correspond to an experimental TypeScript feature, looking at the source code.

But struct is an ArkTS-specific extension, and Huawei has actually extended the TypeScript compiler to specifically recognize the @Component decorator, such that when you "call" a struct, for example as above in Stack() { ... }, TypeScript will parse that as a new expression type EtsComponentExpression, which may optionally be followed by a block. When Stack() is invoked, its children (instances of Image and Text, in this case) will be populated via running the block.

Now, though TypeScript isn't everyone's bag, it's quite normalized in the JavaScript community and not a hard sell. Language extensions like the handling of @Component pose a more challenging problem. Still, Facebook managed to sell people on JSX, so perhaps Huawei can do the same for their dialect. More on that later.

Under the hood, it would seem that we have a similar architecture to Flutter: invoking the components creates a corresponding tree of elements (as with React Native's shadow tree), which then are lowered to render nodes, which draw themselves onto layers using Skia, in a multi-threaded rendering pipeline. Underneath, the UI code actually re-uses some parts of Flutter, though from what I can tell HarmonyOS developers are replacing those over time.

Restrictions and extensions

So we see that the source language for the declarative UI framework is TypeScript, but with some extensions. It also has its restrictions, and to explain these, we have to talk about implementation.

Of the JavaScript mobile application development frameworks we discussed, Capacitor and NativeScript used "normal" JS engines from web browsers, while React Native built their own Hermes implementation. Hermes is also restricted, in a way, but mostly inasmuch as it lags the browser JS implementations; it relies on source-to-source transpilers to get access to new language features. ArkTS—that's the name of HarmonyOS's "extended TypeScript" implementation—has more fundamental restrictions.

Recall that the Ark compiler was originally built for Android apps. There you don't really have the ability to load new Java or Kotlin source code at run-time; in Java you have class loaders, but those load bytecode. On an Android device, you don't have to deal with the Java source language. If we use a similar architecture for JavaScript, though, what do we do about eval?

ArkTS's answer is: don't. As in, eval is not supported on HarmonyOS. In this way the implementation of ArkTS can be divided into two parts, a frontend that produces bytecode and a runtime that runs the bytecode, and you never have to deal with the source language on the device where the runtime is running. Like Hermes, the developer produces bytecode when building the application and ships it to the device for the runtime to handle.

Incidentally, before we move on to discuss the runtime, there are actually two front-ends that generate ArkTS bytecode: one written in C++ that seems to only handle standard TypeScript and JavaScript, and one written in TypeScript that also handles "extended TypeScript". The former has a test262 runner with about 10k skipped tests, and the latter doesn't appear to have a test262 runner. Note, I haven't actually built either one of these (or any of the other frameworks, for that matter).

The ArkTS runtime is itself built on a non-language-specific common Ark runtime, and the set of supported instructions is the union of the core ISA and the JavaScript-specific instructions. Bytecode can be interpreted, JIT-compiled, or AOT-compiled.

On the side of design documentation, it's somewhat sparse. There are some core design docs; readers may be interested in the rationale to use a bytecode interface for Ark as a whole, or the optimization overview.

Indeed ArkTS as a whole has a surfeit of optimizations, to an extent that makes me wonder which ones are actually needed. There are source-to-source optimizations on bytecode, which I expect are useful if you are generating ArkTS bytecode from JavaScript, where you probably don't have a full compiler implementation. There is a completely separate optimizer in the eTS part of the run-time, based on what would appear to be a novel "circuit-based" IR that bears some similarity to sea-of-nodes. Finally the whole thing appears to bottom out in LLVM, which of course has its own optimizer. I can only assume that this situation is somewhat transitory. Also, ArkTS does appear to generate its own native code sometimes, notably for inline cache stubs.

Of course, when it comes to JavaScript, there are some fundamental language semantics and there is also a large and growing standard library. In the case of ArkTS, this standard library is part of the run-time, like the interpreter, compilers, and the garbage collector (generational concurrent mark-sweep with optional compaction).

All in all, when I step back from it, it's a huge undertaking. Implementing JavaScript is no joke. It appears that ArkTS has done the first 90% though; the proverbial second 90% should only take a few more years :)

Evaluation

If you told a younger me that a major smartphone vendor switched from Java to JavaScript for their UI, you would probably hear me react in terms of the relative virtues of the programming languages in question. At this point in my career, though, the only thing that comes to mind is what an expensive proposition it is to change everything about an application development framework. 200 people over 5 years would be my estimate, though of course teams are variable. So what is it that we can imagine that Huawei bought with a thousand person-years of investment? Towards what other local maximum are we heading?

Startup latency

I didn't mention it before, but it would seem that one of the goals of HarmonyOS is in the name: Huawei wants to harmonize development across the different range of deployment targets. To the extent possible, it would be nice to be able to write the same kinds of programs for IoT devices as you do for feature-rich smartphones and tablets and the like. In that regard one can see through all the source code how there is a culture of doing work ahead-of-time and preventing work at run-time; for example see the design doc for the interpreter, or for the file format, or indeed the lack of JavaScript eval.

Of course, this wide range of targets also means that the HarmonyOS platform bears the burden of a high degree of abstraction; not only can you change the kernel, but also the JavaScript engine (using JerryScript on "lite" targets).

I mention this background because sometimes in news articles and indeed official communication from recent years there would seem to be some confusion that HarmonyOS is just for IoT, or aimed to be super-small, or something. In this evaluation I am mostly focussed on the feature-rich side of things, and there my understanding is that the developer will generate bytecode ahead-of-time. When an app is installed on-device, the AOT compiler will turn it into a single ELF image. This should generally lead to fast start-up.

However it would seem that the rendering library that paints UI nodes into layers and then composites those layers uses Skia in the way that Flutter did pre-Impeller, which to be fair is a quite recent change to Flutter. I expect therefore that Ark (ArkTS + ArkUI) applications also experience shader compilation jank at startup, and that they may be well-served by tessellating their shapes into primitives like Impeller does so that they can precompile a fixed, smaller set of shaders.

Jank

Maybe it's just that apparently I think Flutter is great, but ArkUI's fundamental architectural similarity to Flutter makes me think that jank will not be a big issue. There is a render thread that is separate from the ArkTS thread, so like with Flutter, async communication with main-thread interfaces is the main jank worry. And on the ArkTS side, ArkTS even has a number of extensions to be able to share objects between threads without copying, should that be needed. I am not sure how well-developed and well-supported these extensions are, though.

I am hedging my words, of course, because I am missing a bit of social proof; HarmonyOS is still in infant days, and it doesn't have much in the way of a user base outside China, from what I can tell, and my ability to read Chinese is limited to what Google Translate can do for me :) Unlike other frameworks, therefore, I haven't been as able to catch a feel of the pulse of the ArkUI user community: what people are happy about, what the pain points are.

It's also interesting that unlike iOS or Android, HarmonyOS is only exposing these "web-like" and "declarative" UI frameworks for app development. This makes it so that the same organization is responsible for the software from top to bottom, which can lead to interesting cross-cutting optimizations: functional reactive programming isn't just a developer-experience narrative, but it can directly affect the shape of the rendering pipeline. If there is jank, someone in the building is responsible for it and should be able to fix it, whether it is in the GPU driver, the kernel, the ArkTS compiler, or the application itself.

Peak performance

I don't know how to evaluate ArkTS for peak performance. Although there is a JIT compiler, I don't have the feeling that it is as tuned for adaptive optimization as V8 is.

At the same time, I find it interesting that HarmonyOS has chosen to modify JavaScript. While it is doing that, could they switch to a sound type system, to allow the kinds of AOT optimizations that Dart can do? It would be an interesting experiment.

As it is, though, if I had to guess, I would say that ArkTS is well-positioned for predictably good performance with AOT compilation, although I would be interested in seeing the results of actually running it.

Aside: On the importance of storytelling

In this series I have tried to be charitable towards the frameworks that I review, to give credit to what they are trying to do, even while noting where they aren't currently there yet. That's part of why I need a plausible narrative for how the frameworks got where they are, because that lets me have an idea of where they are going.

In that sense I think that Ark is at an interesting inflection point. When I started reading documentation about ArkUI and HarmonyOS and all that, I bounced out—there were too many architectural box diagrams, too many generic descriptions of components, too many promises with buzzwords. It felt to me like the project was trying to justify itself to a kind of clueless management chain. Was there actually anything here?

But now when I see the adoption of a modern rendering architecture and a bold new implementation of JavaScript, along with the willingness to experiment with the language, I think that there is an interesting story to be told, but this time not to management but to app developers.

Of course you wouldn't want to market to app developers when your system is still a mess because you haven't finished rebuilding an MVP yet. Retaking my charitable approach, then, I can only think that all the architectural box diagrams were a clever blind to avoid piquing outside interest while the app development kit wasn't ready yet :) As and when the system starts working well, presumably over the next year or so, I would expect HarmonyOS to invest much more heavily in marketing and developer advocacy; the story is interesting, but you have to actually tell it.

Aside: O platform, my platform

All of the previous app development frameworks that we looked at were cross-platform; Ark is not. It could be, of course: it does appear to be thoroughly open source. But HarmonyOS devices are the main target. What implications does this have?

A similar question arises in perhaps a more concrete way if we start with the mature Flutter framework: what would it mean to make a Flutter phone?

The first thought that comes to mind is that having a Flutter OS would allow for the potential for more cross-cutting optimizations that cross abstraction layers. But then I think, what does Flutter really need? It has the GPU drivers, and we aren't going to re-implement those. It has the bridge to the platform-native SDK, which is not such a large and important part of the app. You get input from the platform, but that's also not so specific. So maybe optimization is not the answer.

On the other hand, a Flutter OS would not have to solve the make-it-look-native problem; because there would be no other "native" toolkit, your apps won't look out of place. That's nice. It's not something that would make the platform compelling, though.

HarmonyOS does have this embryonic concept of app mobility, where like you could put an app from your phone on your fridge, or something. Clearly I am not doing it justice here, but let's assume it's a compelling use case. In that situation it would be nice for all devices to present similar abstractions, so you could somehow install the same app on two different kinds of devices, and they could communicate to transfer data. As you can see here though, I am straying far from my domain of expertise.

One reasonable way to "move" an app is to have it stay running on your phone and the phone just communicates pixels with your fridge (or whatever); this is the low-level solution. I think HarmonyOS appears to be going for the higher-level solution where the app actually runs logic on the device. In that case it would make sense to ship UI assets and JavaScript / extended TypeScript bytecode to the device, which would run the app with an interpreter (for low-powered devices) or use JIT/AOT compilation. The Ark runtime itself would live on all devices, specialized to their capabilities.

In a way this is the Apple WatchOS solution (as I understand it); developers publish their apps as LLVM bitcode, and Apple compiles it for the specific devices. A FlutterOS with a Flutter run-time on all devices could do something similar. As with WatchOS you wouldn't have to ship the framework itself in the app bundle; it would be on the device already.

Finally, publishing apps as some kind of intermediate representation also has security benefits: as the OS developer, you can ensure some invariants via the toolchain that you control. Of course, you would have to ensure that the Flutter API is sufficiently expressive for high-performance applications, while also not having arbitrary machine code execution vulnerabilities; there is a question of language and framework design as well as toolchain and runtime quality of implementation. HarmonyOS could be headed in this direction.

Conclusion

Ark is a fascinating effort that holds much promise. It's also still in motion; where will it be when it anneals to its own local optimum? It would appear that the system is approaching usability, but I expect a degree of churn in the near-term as Ark designers decide which language abstractions work for them and how to, well, harmonize them with the rest of JavaScript.

For me, the biggest open question is whether developers will love Ark in the way they love, say, React. In a market where Huawei is still a dominant vendor, I think the material conditions are there for a good developer experience: people tend to like Flutter and React, and Ark is similar. Huawei "just" needs to explain their framework well (and where it's hard to explain, to go back and change it so that it is explainable).

But in a more heterogeneous market, to succeed Ark would need to make a cross-platform runtime like the one Flutter has and engage in some serious marketing efforts, so that developers don't have to limit themselves to targetting the currently-marginal HarmonyOS. Selling extensions to JavaScript will be much more difficult in a context where the competition is already established, but perhaps Ark will be able to productively engage with TypeScript maintainers to move the language so it captures some of the benefits of Dart that facilitate ahead-of-time compilation.

Well, that's it for my review round-up; hope you have enjoyed the series. I have one more pending article, speculating about some future technologies. Until then, happy hacking, and see you next time.

by Andy Wingo at May 02, 2023 09:13 AM

April 27, 2023

Stéphane Cerveau: ESExtractor: how to integrate a dependency-free library into the Khronos CTS

ESExtractor, how to integrate a dependency-free library into the Khronos CTS

Since the Vulkan CTS is now able to test Vulkan Video support, including video decoding, it was necessary to define the kind of media container to be used inside the test cases, as well as the library used to extract the necessary encoded data from it.

In a first attempt, the FFmpeg media toolkit had been chosen to extract the video packets from the A/V ISO base media format chosen as a container reference. This library was provided as a binary package and loaded dynamically at each test run.

As Vulkan Video aims to test only video content, it was not necessary to choose a complex media container, so all the videos were first converted to the elementary stream format for H264 and H265 content. This is a very elementary format based on MPEG start codes and NAL unit identification.

To avoid another multimedia solution that could only be integrated as binaries, a first attempt to replace FFmpeg was to use GStreamer along with an in-house helper library called demuxeres. It was smaller, but it still needed to be shipped as a binary to avoid the glib/GStreamer system dependencies (a self-contained library). It was still a no-go, because the binary package would be awkward to support on the various platforms targetted by the Khronos CTS.

So at Igalia, we decided to implement a minimal, dependency-free, custom library, written in C++ to be compliant with the Khronos CTS and simple to integrate into any build system.

This library is called ESExtractor.

What is ESExtractor ?

ESExtractor aims to be a simple elementary stream extractor. For its first revision it was able to extract video data from a file in the NAL-based elementary stream format. The first official release was 0.2.4; in that release only NAL extraction was supported, covering both H264 and H265 streams.

As Vulkan Video aims to support more than H264 and H265, including formats such as AV1 and VP9, ESExtractor had to support multiple formats. A redesign was started to that end and is now available in version 0.3.2.

How ESExtractor works

A simple C interface is provided to maximise portability and use by other languages.

es_extractor_new

ESExtractor extractor = es_extractor_new(filePath, "options");

This function returns the main object which will give you access to the packets, according to the file path and the options given as arguments. The returned object is ready to use: the stream has already been parsed during creation to determine the kind of video it contains.

Then you can check the video format with:

ESEVideoFormat eVideoFormat = es_extractor_video_format(extractor);

or the video codec with:

ESEVideoCodec eVideoCodec = es_extractor_video_codec(extractor);

It supports H264, H265, AV1 and VP9 for now.

es_extractor_read_packet

This API is the main function to retrieve the available packets from the file. Each time it is called, the library will return the next available packet according to the format and the specific alignment (i.e. NAL), along with a status that lets the application decide what to do next. The packet should be freed using es_extractor_clear_packet.

A CI powered by GitHub

To test the library usage, we have implemented a testing framework in addition to a CI infrastructure. As GitHub offers a very powerful workflow, we decided to use this platform to test the library on various architectures and platforms. The CI is now configured to release packages for 64-bit and 32-bit Linux and Windows.

As usual, if you would like to learn more about Vulkan Video, ESExtractor or any other open multimedia framework, please contact us!

April 27, 2023 10:10 AM

April 26, 2023

Andy Wingo: structure and interpretation of flutter

(Andy Wingo)

Good day, gentle hackfolk. Like an old-time fiddler I would appear to be deep in the groove, playing endless variations on a theme, in this case mobile application frameworks. But one can only recognize novelty in relation to the familiar, and today's note is a departure: we are going to look at Flutter, a UI toolkit based not on JavaScript but on the Dart language.

The present, from the past

Where to start, even? The problem is big enough that I'll approach it from three different angles: from the past, from the top, and from the bottom.

With the other frameworks we looked at, we didn't have to say much about their use of JavaScript. JavaScript is an obvious choice, in 2023 at least: it is ubiquitous, has high quality implementations, and as a language it is quite OK and progressively getting better. Up to now, "always bet on JS" has had an uninterrupted winning streak.

But winning is not the same as unanimity, and Flutter and Dart represent an interesting pole of contestation. To understand how we got here, we have to go back in time. Ten years ago, JavaScript just wasn't a great language: there were no modules, no async functions, no destructuring, no classes, no extensible iteration, no optional arguments to functions. In addition it was hobbled with a significant degree of what can only be called accidental sloppiness: with, which can dynamically alter a lexical scope; direct eval, which can define new local variables; Function.caller; and so on. Finally, larger teams were starting to feel the need for more rigorous language tooling that could use types to prohibit some classes of invalid programs.

All of these problems in JavaScript have been addressed over the last decade, mostly successfully. But in 2010 or so if you were a virtual machine engineer, you might look at JavaScript and think that in some other world, things could be a lot better. That's effectively what happened: the team that originally built V8 broke off and started to work on what became Dart.

Initially, Dart was targetted for inclusion in the Chrome web browser as an alternate "native" browser language. This didn't work, for various reasons, but since then Dart grew the Flutter UI toolkit, which has breathed new life into the language. And this is a review of Flutter, not a review of Dart, not really anyway; to my eyes, Dart is spiritually another JavaScript, different but in the same family. Dart's implementation has many interesting aspects as well that we'll get into later on, but all of these differences are incidental: they could just as well be implemented on top of JavaScript, TypeScript, or another source language in that family. Even if Flutter isn't strictly part of the JavaScript-based mobile application development frameworks that we are comparing, it is valuable to the extent that it shows what is possible, and in that regard there is much to say.

Flutter, from the top

At its most basic, Flutter is a UI toolkit for Dart. In many ways it is like React. Like React, its interface follows the functional-reactive paradigm: programmers describe the "what", and Flutter takes care of the "how". Also, like the phenomenon in which new developers can learn React without really knowing JavaScript, Flutter is the killer app for Dart: Flutter developers mostly learn Dart at the same time that they pick up Flutter.

In some other ways, Flutter is the logical progression of React, going in the same direction but farther along. Whereas React-on-the-web takes the user's declarative specifications of what the UI should look like and lowers them into DOM trees, and React Native lowers them to platform-native UI widgets, Flutter has its own built-in layout, rasterization, and compositing engine: Flutter draws all the pixels.

This has the predictable challenge that Flutter has to make significant investments so that its applications don't feel out-of-place on their platform, but on the other hand it opens up a huge space for experimentation and potential optimization: Flutter has the potential to beat native at its own game. Recall that with React Native, the result of the render-commit-mount process is a tree of native widgets. The native platform will surely then perform a kind of layout on those widgets, divide them into layers that correspond to GPU textures, paint those layers, then composite them to the screen -- basically, what a web engine will do.

What if we could instead skip the native tree and go directly to the lower GPU layer? That is the promise of Flutter. Flutter has the potential to create much more smooth and complex animations than the other application development frameworks we have mentioned, with lower overhead and energy consumption.

In practice... that's always the question, isn't it? Again, please accept my usual caveat that I am a compilers guy moonlighting in the user interface domain, but my understanding is that Flutter mostly lives up to its promise, but with one significant qualification which we'll get to in a minute. But before that, let's traverse Flutter from the other direction, coming up from Dart.

Dart, from the bottom

To explain some aspects of Dart I'm going to tell a just-so story that may or may not be true. I know and like many of the Dart developers, and we have similar instincts, so it's probably not too far from the truth.

Let's say you are the team that originally developed V8, and you decide to create a new language. You write a new virtual machine that looks like V8, taking Dart source code as input and applying advanced adaptive compilation techniques to get good performance. You can even be faster than JS because your language is just a bit more rigid than JavaScript is: you have traded off expressivity for performance. (Recall from our discussion of NativeScript that expressivity isn't a value judgment: there can be reasons to pay for more "mathematically appealing operational equivalences", in Felleisen's language, in exchange for applying more constraints on a language.)

But, you fail to ship the VM in a browser; what do you do? The project could die; that would be annoying, but you work for Google, so it happens all the time. However, a few interesting things happen around the same time that will cause you to pivot. One is a concurrent experiment by Chrome developers to pare the web platform down to its foundations and rebuild it. This effort will eventually become Flutter; while it was originally based on JS, eventually they will choose to switch to Dart.

The second thing that happens is that recently-invented smart phones become ubiquitous. Most people have one, and the two platforms are iOS and Android. Flutter wants to target them. You are looking for your niche, and you see that mobile application development might be it. As the Flutter people continue to experiment, you start to think about what it would mean to target mobile devices with Dart.

The initial Dart VM was made to JIT, but as we know, Apple doesn't let people do this on iOS. So instead you look to write a quick-and-dirty ahead-of-time compiler, based on your JIT compiler that takes your program as input, parses and baseline-compiles it, and generates an image that can be loaded at runtime. It ships on iOS. Funnily enough, it ships on Android too, because AOT compilation allows you to avoid some startup costs; forced to choose between peak performance via JIT and fast startup via AOT, you choose fast startup.

It's a success, you hit your product objectives, and you start to look further, to a proper ahead-of-time compiler to native code that can stand alone without the full Dart run-time. After all, if you have to compile at build-time, you might as well take the time to do some proper optimizations. You actually change the language to have a sound typing system so that the compiler can make program transformations that are valid as long as it can rely on the program's types.

Fun fact: I am told that the shift to a sound type system actually started before Flutter and thus before AOT, because of a Dart-to-JavaScript compiler that you inherited from the time in which you thought the web would be the main target. The Dart-to-JS compiler used to be a whole-program compiler; this enabled it to do flow-sensitive type inference, resulting in faster and smaller emitted JavaScript. But whole-program compilation doesn't scale well in terms of compilation time, so Dart-to-JS switched to separate per-module compilation. But then you lose lots of types! The way to recover the fast-and-small-emitted-JS property was through a stronger, sound type system for Dart.

At this point, you still have your virtual machine, plus your ahead-of-time compiler, plus your Dart-to-JS compiler. Such riches, such bounty! It is not a bad situation to be in, in 2023: you can offer a good development experience via the just-in-time compiled virtual machine. Apparently you can even use the JIT on iOS in developer mode, because attaching ptrace to a binary allows for native code generation. Then when you go to deploy, you make a native binary that includes everything.

For the web, you also have your nice story, even nicer than with JavaScript in some ways: because the type checker and ahead-of-time compiler are integrated in Dart, you don't have to worry about WebPack or Vite or minifiers or uglifiers or TypeScript or JSX or Babel or any of the other things that JavaScript people are used to. Granted, the tradeoff is that innovation is mostly centralized with the Dart maintainers, but currently Google seems to be investing enough so that's OK.

Stepping back, this story is not unique to Dart; many of its scenes also played out in the world of JavaScript over the last 5 or 10 years as well. Hermes (and QuickJS, for that matter) does ahead-of-time compilation, albeit only to bytecode, and V8's snapshot facility is a form of native AOT compilation. But the tooling in the JavaScript world is more diffuse than with Dart. With the perspective of developing a new JavaScript-based mobile operating system in mind, the advantages that Dart (and thus Flutter) has won over the years are also on the table for JavaScript to win. Perhaps even TypeScript could eventually migrate to have a sound type system, over time; it would take a significant investment but the JS ecosystem does evolve, if slowly.

(Full disclosure: while the other articles in this series were written without input from the authors of the frameworks under review, through what I can only think was good URL guesswork, a draft copy of this article leaked to Flutter developers. Dart hacker Slava Egorov kindly sent me a mail correcting a number of misconceptions I had about Dart's history. Fair play on whoever guessed the URL, and many thanks to Slava for the corrections; any remaining errors are wholly mine, of course!)

Evaluation

So how do we expect Flutter applications to perform? If we were writing a new mobile OS based on JavaScript, what would it mean in terms of performance to adopt a Flutter-like architecture?

Startup latency

Flutter applications are well-positioned to start fast, with ahead-of-time compilation. However they have had problems realizing this potential, with many users seeing a big stutter when they launch a Flutter app.

To explain this situation, consider the structure of a typical low-end Android mobile device: you have a small number of not-terribly-powerful CPU cores, but attached to the same memory you also have a decent GPU with many cores. For example, the SoC in the low-end Moto E7 Plus has 8 CPU cores and 128 GPU cores (texture shader units). You could paint widget pixels into memory from either the CPU or the GPU, but you'd rather do it in the GPU because it has so many more cores: in the time it takes to compute the color of a single pixel on the CPU, on the GPU you could do, like, 128 times as many, given that the comparison is often between multi-threaded rasterization on the GPU versus single-threaded rasterization on the CPU.

Flutter has always tried to paint on the GPU. Historically it has done so via a GPU back-end to the Skia graphics library, notably used by Chrome among other projects. But, Skia's API is a drawing API, not a GPU API; Skia is the one responsible for configuring the GPU to draw what we want. And here's the problem: configuring the GPU takes time. Skia generates shader code at run-time for rasterizing the specific widgets used by the Flutter programmer. That shader code then needs to be compiled to the language the GPU driver wants, which looks more like Vulkan or Metal. The process of compilation and linking takes time, potentially seconds, even.

The solution to "too much startup shader compilation" is much like the solution to "too much startup JavaScript compilation": move this phase to build time. The new Impeller rendering library does just that. However to do that, it had to change the way that Flutter renders: instead of having Skia generate specialized shaders at run-time, Impeller instead lowers the shapes that it draws to a fixed set of primitives, and then renders those primitives using a smaller, fixed set of shaders. These primitive shaders are pre-compiled at build time and included in the binary. By switching to this new renderer, Flutter should be able to avoid startup jank.

Jank

Of all the application development frameworks we have considered, to my mind Flutter is the best positioned to avoid jank. It has the React-like asynchronous functional layout model, but "closer to the metal"; by skipping the tree of native UI widgets, it can potentially spend less time for each frame render.

When you start up a Flutter app on iOS, the shell of the application is actually written in Objective-C++. On Android it's the same, except that it's Java. That shell then creates a FlutterView widget and spawns a new thread to actually run Flutter (and the user's Dart code). Mostly, Flutter runs on its own, rendering frames to the GPU resources backing the FlutterView directly.

If a Flutter app needs to communicate with the platform, it passes messages across an asynchronous channel back to the main thread. Although these messages are asynchronous, this is probably the largest potential source of jank in a Flutter app, outside the initial frame paint: any graphical update which depends on the answer to an asynchronous call may lag.

Peak performance

Dart's type system and ahead-of-time compiler optimize for predictable good performance rather than the more variable but potentially higher peak performance that could be provided by just-in-time compilation.

This story should probably serve as a lesson to any future platform. The people that developed the original Dart virtual machine had a built-in bias towards just-in-time compilation, because it allows the VM to generate code that is specialized not just to the program but also to the problem at hand. A given system with ahead-of-time compilation can always be made to perform better via the addition of a just-in-time compiler, so the initial focus was on JIT compilation. On iOS of course this was not possible, but on Android and other platforms where this was available it was the default deployment model.

However, even Android switched to ahead-of-time compilation instead of the JIT model in order to reduce startup latency: doing any machine code generation at all at program startup was more work than was needed to get to the first frame. One could add JIT back again on top of AOT but it does not appear to be a high priority.

I would expect that Capacitor could beat Dart in some raw throughput benchmarks, given that Capacitor's JavaScript implementation can take advantage of the platform's native JIT capability. Does it matter, though, as long as you are hitting your frame budget? I do not know.

Aside: An escape hatch to the platform

What happens if you want to embed a web view into a Flutter app?

If you think on the problem for a moment I suspect you will arrive at the unsatisfactory answer, which is that for better or for worse, at this point it is too expensive even for Google to make a new web engine. Therefore Flutter will have to embed the native WebView. However Flutter runs on its own threads; the native WebView has its own process and threads but its interface to the app is tied to the main UI thread.

Therefore either you need to make the native WebView (or indeed any other native widget) render itself to (a region of) Flutter's GPU backing buffer, or you need to copy the native widget's pixels into their own texture and then composite them in Flutter-land. It's not so nice! The Android and iOS platform view documentation discuss some of the tradeoffs and mitigations.

Aside: For want of a canvas

There is a very funny situation in the React Native world in which, if the application programmer wants to draw to a canvas, they have to embed a whole WebView into the React Native app and then proxy the canvas calls into the WebView. Flutter is happily able to avoid this problem, because it includes its own drawing library with a canvas-like API. Of course, Flutter also has the luxury of defining its own set of standard libraries instead of necessarily inheriting them from the web, so when and if they want to provide equivalent but differently-shaped interfaces, they can do so.

Flutter manages to be more expressive than React Native in this case, without losing much in the way of understandability. Few people will have to reach to the canvas layer, but it is nice to know it is there.

Conclusion

Dart and Flutter are terribly attractive from an engineering perspective. They offer a delightful API and a high-performance, flexible runtime with a built-in toolchain. Could this experience be brought to a new mobile operating system as its primary programming interface, based on JavaScript? React Native is giving it a try, but I think there may be room to take things further to own the application from the program all the way down to the pixels.

Well, that's all from me on Flutter and Dart for the time being. Next up, a mystery guest; see you then!

by Andy Wingo at April 26, 2023 01:50 PM

April 24, 2023

Andy Wingo: structure and interpretation of nativescript

(Andy Wingo)

Greetings, hackers tall and hackers small!

We're only a few articles in to this series on mobile application development frameworks, but I feel like we are already well into our journey. We started our trip through the design space with a look at Ionic / Capacitor, which defines its user interface in terms of the web platform, and only calls out to iOS or Android native features as needed. We proceeded on to React Native, which moves closer to native by rendering to platform-provided UI widgets, layering a cross-platform development interface on top.

Today's article takes an in-depth look at NativeScript, whose point in the design space is further on the road towards the platform, unabashedly embracing the specificities of the API available on iOS and Android, exposing these interfaces directly to the application programmer.

In practice what this looks like is that a NativeScript app is a native app which simply happens to call JavaScript on the main UI thread. That JavaScript has access to all native APIs, directly, without the mediation of serialization or message-passing over a bridge or message queue.

The first time I heard this I thought that it couldn't actually be all native APIs. After all, new versions of iOS and Android come out quite frequently, and surely it would take some effort on the part of NativeScript developers to expose the new APIs to JavaScript. But no, it really includes all of the various native APIs: the NativeScript developers wrote a build-time inspector that uses the platform's native reflection capabilities to grovel through all available APIs and to automatically generate JavaScript bindings, with associated TypeScript type definitions so that the developer knows what is available.

Some of these generated files are checked into source, so you can get an idea of the range of interfaces that are accessible to programmers; for example, see the iOS type definitions for x86-64. There are bindings for, like, everything.
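To make the "bindings for everything" point concrete, here is a tiny TypeScript sketch of what calling a platform API directly from JavaScript looks like. The declaration below is a stand-in of my own for the generated iOS typings; the exact shape of the real bindings is an assumption here, not something checked against NativeScript's output.

declare const UIDevice: {
  // Stand-in for the generated binding to UIKit's UIDevice class; the real
  // generated typings are far richer than this.
  currentDevice: { systemName: string; systemVersion: string; batteryLevel: number };
};

// Ordinary property accesses that cross directly into Objective-C on the main
// UI thread, with no bridge, serialization, or message queue in between.
const device = UIDevice.currentDevice;
console.log(`${device.systemName} ${device.systemVersion}`);
console.log(`battery: ${device.batteryLevel}`);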

Given access to all the native APIs, how do you go about making an app? You could write the same kind of programs that you would in Swift or Kotlin, but in JavaScript. But this would require more than just the ability to access native capabilities when needed: it needs a thorough knowledge of the platform interfaces, plus NativeScript itself on top. Most people don't have this knowledge, and those that do are probably programming directly in Swift or Kotlin already.

On one level, NativeScript's approach is to take refuge in that most ecumenical of adjectives, "unopinionated". Whereas Ionic / Capacitor encourages use of web platform interfaces, and React Native only really supports React as a programming paradigm, NativeScript provides a low-level platform onto which you can layer a number of different high-level frameworks.

Now, most high-level JavaScript application development frameworks are oriented to targetting the web: they take descriptions of user interfaces and translate them to the DOM. When targetting NativeScript, you could make it so that they target native UI widgets instead. However given the baked-in assumptions of how widgets should be laid out (notably via CSS), there is some impedance-matching to do between DOM-like APIs and native toolkits.

NativeScript's answer to this problem is a middle layer: a cross-platform UI library that provides DOM-like abstractions and CSS layout in a way that bridges the gap between web-like and native. You can even define parts of the UI using a NativeScript-specific XML vocabulary, which NativeScript compiles to native UI widget calls at run-time. Of course, there is no CSS engine in UIKit or Android's UI toolkit, so NativeScript includes its own, implemented in JavaScript of course.

You could program directly to this middle layer, but I suspect that its real purpose is in enabling Angular, Vue, Svelte, or the like. The pitch would be that NativeScript lets app developers use pleasant high-level abstractions, but while remaining close to the native APIs; you can always drop down for more power and expressiveness if needed.

Diving back down to the low level, as we mentioned all of the interactions between JavaScript and the native platform APIs happen on the main application UI thread. NativeScript does also allow programmers to create background threads, using an implementation of the Web Worker API. One could even in theory run a React-based UI in a worker thread and proxy native UI updates to the main thread; as an unopinionated platform, NativeScript can support many different frameworks and paradigms.
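As a sketch of that worker mechanism, the following uses the standard Web Worker API that NativeScript implements; the './heavy-work.js' path and the message shapes are invented for illustration, and the way worker scripts are resolved in a NativeScript project may differ from the web.

// Main thread: spawn a background worker and hand it a job.
const worker = new Worker('./heavy-work.js');
worker.postMessage({ kind: 'thumbnail', width: 512 });
worker.onmessage = (event: MessageEvent) => {
  // Back on the main thread, where it is safe to touch the native UI.
  console.log('worker finished:', event.data);
};
// In heavy-work.js, running on the worker thread, the counterpart is roughly:
//   onmessage = (event) => { /* long-running work */ postMessage(result); };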

Finally, there is the question of how NativeScript runs the JavaScript in an application. Recall that Ionic / Capacitor uses the native JS engine, by virtue of using the native WebView, and that React Native used to use JavaScriptCore on both platforms but now uses its own Hermes implementation. NativeScript is another point in the design space, using V8 on both platforms. (They used to use JavaScriptCore on iOS but switched to V8 once V8 was able to run on iOS in "jitless" mode.) Besides the reduced maintenance burden of using a single implementation on all platforms, this also has the advantage of being able to use V8 snapshots to move JavaScript parse-and-compile work to build-time, even on iOS.

Evaluation

NativeScript is fundamentally simple: it's V8 running in an application's main UI thread, with access to all platform native APIs. So how do we expect it to perform?

Startup latency

In theory, applications with a NativeScript-like architecture should have no problem with startup time, because they can pre-compile all of their JavaScript into V8 snapshots. Snapshots are cheap to load up because they are already in a format that V8 is ready to consume.

In practice, it would seem that V8 snapshots do not perform as expected for NativeScript. There are a number of aspects about this situation that I don't understand, which I suspect relate to the state of the tooling around V8 rather than to the fundamental approach of ahead-of-time compilation. V8 is really made for Chrome, and it could be that not enough maintenance resources have been devoted to this snapshot facility.

In the meantime, NativeScript instead uses V8's code cache feature, which caches the result of parsing and compiling JavaScript files on the device. In this way the first time an app is installed or updated, it might start up slowly, but subsequent runs are faster. If you were designing a new operating system, you'd probably want to move this work to app install-time.

As we mentioned above, NativeScript apps have access to all native APIs. That is a lot of APIs, and only some of those interfaces will actually be used by any given app. In an ideal world, we would expect the build process to only include JavaScript code for those APIs that are needed by the application. However in the presence of eval and dynamic property lookup, pruning the native API surface to the precise minimum is a hard problem for a bundler to perform on its own. The solution for the time being is to manually allow and deny subsets of the platform native API. It's not an automatic process though, so it can be error-prone.

Besides the work that the JavaScript engine has to do to load an application's code, the other startup overhead involves whatever work that JavaScript might need to perform before the first frame is shown. In the case of NativeScript, more work is done before the initial layout than one would think: the main UI XML file is parsed by an XML parser written in JavaScript, any needed CSS files are parsed and loaded (again by JavaScript), and the tree of XML elements is translated to a tree of UI elements. The layout of the items in the view tree is then computed (in JavaScript, but calling into native code to measure text and so on), and then the app is ready.

At this point, I am again going to wave my "I am just a compiler engineer" flag: I am not a UI specialist, much less a NativeScript specialist. As in compilers, performance measurement and monitoring are key to UI development, but I suspect that also as in compilers there is a role for gut instinct. Incremental improvements are best driven by metrics, but qualitative leaps are often the result of somewhat ineffable hunches or even guesswork. In that spirit I can only surmise that React Native has an advantage over NativeScript in time-to-first-frame, because its layout is performed in C++ and because its element tree is computed directly from JavaScript instead of having JavaScript interpret XML and CSS files. In any case, I look forward to the forthcoming part 2 of the NativeScript and React Native performance investigations that were started in November 2022.

If I were NativeScript and using NativeScript's UI framework, and if startup latency proves to actually be a problem, I would lean into something in the shape of Angular's ahead-of-time compilation mode, but for the middle NativeScript UI layer.

Jank

On the face of it, NativeScript is the most jank-prone of the three frameworks we have examined, because it runs JavaScript on the main application UI thread, interleaved with UI event handling and painting and all of that. If an app's JavaScript takes too long to run, the app might miss frames or fail to promptly handle an event.

On the other hand, relative to React Native, the user's code is much closer to the application's behavior. There's no asynchrony between the application's logic and its main loop: in NativeScript it is easy to identify the code causing jank and eventually fix it.

The other classic JavaScript-on-the-main-thread worry relates to garbage collection pauses. V8's garbage collector does try to minimize the stop-the-world phase by tracing the heap concurrently and leveraging parallelism during pauses. Also, the user interface of a mobile app runs in an event loop, and typically spends most of its time idle; V8 exposes some API that can take advantage of this idle time to perform housekeeping tasks instead of needing to do them when handling high-priority events.

That said, having looked into the code of both the iOS and Android run-times, NativeScript does not currently take advantage of this facility. I dug deeper and it would seem that V8 itself is in flux, as the IdleNotificationDeadline API is on its way out; is the thought that concurrent tracing is largely sufficient? I would expect that if conservative stack scanning lands, we will see a re-introduction of this kind of API, as it does make sense to synchronize with the event loop when scanning the main thread stack.

Peak performance

As we have seen in our previous evaluations, this question boils down to "is the JavaScript engine state-of-the-art, and can it perform just-in-time compilation". In the case of NativeScript, the answers are yes and maybe, respectively: V8 is state-of-the-art, and it can JIT on Android, but not on iOS.

Perhaps the mitigation here is that the hardware that iOS runs on tends to be significantly more powerful than median Android devices; if you had to pick a subset of users to penalize with an interpreter-only run-time, people with iPhones are the obvious choice, because they can afford it.

Aside: Are markets wise?

Recall that our perspective in this series is that of the designer of a new JavaScript-based mobile development platform. We are trying to answer the question of what would it look like if a new platform offered a NativeScript-like experience. In this regard, only the structure of NativeScript is of interest, and notably its "market success" is not relevant, except perhaps in some Hayekian conception of the world in which markets are necessarily smarter than, well, me, or you, or any one of us.

It must be said, though, that React Native is the 800-pound gorilla of JavaScript mobile application development. The 2022 State of JS survey shows that among survey respondents, more people are aware of React Native than any other mobile framework, and people are generally more positive about React Native than other frameworks. Does NativeScript's more modest market share indicate something about its architecture, or does it speak more to the size of Facebook's budget, both on the developer experience side and on marketing?

Aside: On the expressive power of application frameworks

Oddly, I think the answer to the market wisdom question might be found in a 35-year-old computer science paper, "On the expressive power of programming languages" (PDF).

In this paper, Matthias Felleisen considers the notion of what it means for one programming language to be more expressive than another. For example, is a language with just for less expressive than a language with both for and while? Intuitively we would say no, these are similar things; you can make a simple local transformation of while (x) {...} to for (;x;) {...} and you have exactly the same program semantics. On the other hand a language with just for is less expressive than one which also has goto; there is no simple local rewrite that can turn goto into for.

In the same way, we can consider the question of what it would mean for one library to be more expressive than another. After all, the API of a library exposes a language in which its user can write programs; we should be able to reason about these languages. So between React Native and NativeScript, which one is more expressive?

By Felleisen's definitions, NativeScript is clearly the more expressive language: there is no simple local transformation that can turn imperative operations on native UI widgets into equivalent functional-reactive programs. Yes, with enough glue code React Native can reach directly to native APIs in a similar way as NativeScript, but everything that touches the native UI tree is expressly under React Native's control: there is no sanctioned escape hatch.

You might think that "more expressive" is always better, but Felleisen's take is more nuanced than that. Yes, he says, more expressive languages do allow programmers to make more concise programs, because they allow programmers to define abstractions that encapsulate patterns, and this is a good thing. However he also concludes that "an increase in expressive power is related to a decrease of the set of 'natural' (mathematically appealing) operational equivalences." Less expressive programming languages are easier to reason about, in general, and indeed that is one of the recognized strengths of React's programming model: it is easy to compose components and have confidence that the result will work.

Summary

A NativeScript-like architecture offers the possibility of performance: the developer has all the capabilities needed for writing pleasant-to-use applications that blend in with the platform-native experience. It is up to the developers to choose how to use the power at their disposal. In the wild, I expect that the low-level layer of NativeScript's API is used mainly by expert developers, who know how to assemble well-functioning machines from the parts on offer.

As a primary programming interface for a new JavaScript-based mobile platform, though, just providing a low-level API would seem to be not enough. NativeScript rightly promotes the use of more well-known high-level frameworks on top: Angular, Vue, Svelte, and such. Less experienced developers should use an opinionated high-level UI framework; these developers don't have good opinions yet and the API should lead them in the right direction.

That's it for today. Thanks for reading these articles, by the way; I have enjoyed diving into this space. Next up, we'll take a look beyond JavaScript, to Flutter and Dart. Until then, happy hacking!

by Andy Wingo at April 24, 2023 09:10 AM

April 21, 2023

Andy Wingo: structure and interpretation of react native

(Andy Wingo)

Hey hey! Today's missive continues exploring the space of JavaScript and mobile application development.

Yesterday we looked into Ionic / Capacitor, giving a brief structural overview of what Capacitor apps look like under the hood and how this translates to three aspects of performance: startup latency, jank, and peak performance. Today we'll apply that same approach to another popular development framework, React Native.

Background: React

I don't know about you, but I find that there is so much marketing smoke and lights around the whole phenomenon that is React and React Native that sometimes it's hard to see what's actually there. This is compounded by the fact that the programming paradigm espoused by React (and its "native" cousin that we are looking at here) is so effective at enabling JavaScript UI programmers to focus on the "what" and not the "how" that the machinery supporting React recedes into the background.

At its most basic, React is what they call a functional reactive programming model. It is functional in the sense that the user interface elements render as a function of the global application state. The reactive comes into how user input is handled, but I'm not going to focus on that here.

React's rendering process starts with a root element tree, describing the root node of the user interface. An element is a JavaScript object with a type property. To render an element tree, if the value of the type property is a string, then the element is terminal and doesn't need further lowering, though React will visit any node in the children property of the element to render them as needed.

Otherwise if the type property of an element is a function, then the element node is functional. In that case React invokes the node's render function (the type property), passing the JavaScript element object as the argument. React will then recursively re-render the element tree produced as a result of rendering the component until all nodes are terminal. (Functional element nodes can instead have a class as their type property, but the concerns are pretty much the same.)

(In the language of React Native, a terminal node is a React Host Component, and a functional node is a React Composite Component, and both are React Elements. There are many imprecisely-used terms in React and I will continue this tradition by using the terms I mention above.)

The rendering phase of a React application is thus a function from an element tree to a terminal element tree. Nodes of element trees can be either functional or terminal. Terminal element trees are composed only of terminal elements. Rendering lowers all functional nodes to terminal nodes. This description applies both to React (targetting the web) and React Native (which we are reviewing here).
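Here is a schematic TypeScript sketch of that lowering pass; the type and function names are invented for illustration, and this shows the shape of the recursion rather than React's actual implementation.

interface ReactElement {
  // A string for terminal ("host") elements, a render function otherwise.
  type: string | ((el: ReactElement) => ReactElement);
  children: ReactElement[];
}

function lower(el: ReactElement): ReactElement {
  if (typeof el.type === 'string') {
    // Terminal element: nothing to expand here, but lower the children too.
    return { ...el, children: el.children.map(lower) };
  }
  // Functional element: invoke the render function (the type property) on the
  // element itself, then keep lowering until the result is terminal.
  return lower(el.type(el));
}

The real thing adds props, keys, state, and scheduling, but the overall contract is the one described above: element tree in, terminal element tree out.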

It's probably useful to go deeper into what React does with a terminal element tree, before building to the more complex pipeline used in React Native, so here we go. The basic idea is that React-on-the-web does impedance matching between the functional description of what the UI should have, as described by a terminal element tree, and the stateful tree of DOM nodes that a web browser uses to actually paint and display the UI. When rendering yields a new terminal element tree, React will compute the difference between the new and old trees. From that difference React then computes the set of imperative actions needed to mutate the DOM tree to correspond to what the new terminal element tree describes, and finally applies those changes.

In this way, small changes to the leaves of a React element tree should correspond to small changes in the DOM. Additionally, since rendering is a pure function of the global application state, we can avoid rendering at all when the application state hasn't changed. We'll dive into performance more deeply later on in the article.

React Native doesn't use a WebView

React Native is similar to React-on-the-web in intent but different in structure. Instead of using a WebView on native platforms, as Ionic / Capacitor does, React Native renders the terminal element tree to platform-native UI widgets.

When a React Native functional element renders to a terminal element, it will create not just a JS object for the terminal node as React-on-the-web does, but also a corresponding C++ shadow object. The fully lowered tree of terminal elements will thus have a corresponding tree of C++ shadow objects. React Native will then calculate the layout for each node in the shadow tree, and then commit the shadow tree: as on the web, React Native computes the set of imperative actions needed to change the current UI so that it corresponds to what the shadow tree describes. These changes are then applied on the main thread of the application.

The twisty path that leads one to implement JavaScript

The description above of React Native's rendering pipeline applies to the so-called "new architecture", which has been in the works for some years and is only now (April 2023) starting to be deployed. The key development that has allowed React Native to move over to this architecture is tighter integration and control over its JavaScript implementation. Instead of using the platform's JavaScript engine (JavaScriptCore on iOS or V8 on Android), Facebook went and made their own whole new JavaScript implementation, Hermes. Let's step back a bit to see if we can imagine why anyone in their right mind would make a new JS implementation.

In the last article, I mentioned that the only way to get peak JS performance on iOS is to use the platform's WKWebView, which enables JIT compilation of JavaScript code. React Native doesn't want a WebView, though. I guess you could create an invisible WebView and just run your JavaScript in it, but the real issue is that the interface to the JavaScript engine is so narrow as to be insufficiently expressive. You can't cheaply synchronously create a shadow tree of layout objects, for example, because every interaction with JavaScript has to cross a process boundary.

So, it may be that JIT is just not worth paying for, if it means having to keep JavaScript at arm's distance from other parts of the application. How do you do JavaScript without a browser on mobile, though? Either you use the platform's JavaScript engine, or you ship your own. It would be nice to use the same engine on iOS and Android, though. When React Native was first made, V8 wasn't able to operate in a mode that didn't JIT, so React Native went with JavaScriptCore on both platforms.

Bundling your own JavaScript engine has the nice effect that you can easily augment it with native extensions, for example to talk to the Swift or Java app that actually runs the main UI. That's what I describe above with the creation of the shadow tree, but that's not quite what the original React Native did; I can only speculate but I suspect that there was a fear that JavaScript rendering work (or garbage collection!) could be heavy enough to cause the main UI to drop frames. Phones were less powerful in 2016, and JavaScript engines were less good. So the original React Native instead ran JavaScript in a separate thread. When a render would complete, the resulting terminal element tree would be serialized as JSON and shipped over to the "native" side of the application, which would actually apply the changes.

This arrangement did work, but it ran into problems whenever the system needed synchronous communication between native and JavaScript subsystems. As I understand it, this was notably the case when React layout would need the dimensions of a native UI widget; to avoid a stall, React would assume something about the dimensions of the native UI, and then asynchronously re-layout once the actual dimensions were known. This was particularly gnarly with regards to text measurements, which depend on low-level platform-specific rendering details.

To recap: React Native had to interpret its JS on iOS and was using a "foreign" JS engine on Android, so they weren't gaining anything by using a platform JS interpreter. They would sometimes have some annoying layout jank when measuring native components. And what's more, React Native apps would still experience the same problem as Ionic / Capacitor apps, in that application startup time was dominated by parsing and compiling the JavaScript source files.

The solution to this problem was partly to switch to the so-called "new architecture", which doesn't serialize and parse so much data in the course of rendering. But the other side of it was to find a way to move parsing and compiling JavaScript to the build phase, instead of having to parse and compile JS every time the app was run. On V8, you would do this by generating a snapshot. On JavaScriptCore, which React Native used, there was no such facility. Faced with this problem and armed with Facebook's bank account, the React Native developers decided that the best solution would be to make a new JavaScript implementation optimized for ahead-of-time compilation.

The result is Hermes. If you are familiar with JavaScript engines, it is what you might expect: a JavaScript parser, originally built to match the behavior of Esprima; an SSA-based intermediate representation; a set of basic optimizations; a custom bytecode format; an interpreter to run that bytecode; a GC to manage JS objects; and so on. Of course, given the presence of eval, Hermes needs to include the parser and compiler as part of the virtual machine, but the hope is that most user code will be parsed and compiled ahead-of-time.

If this were it, I would say that Hermes seems to me to be a dead end. V8 is complete; Hermes is not. For example, Hermes doesn't have with, async function implementation has been lagging, and so on. Why Hermes when you can V8 (with snapshots), now that V8 doesn't require JIT code generation?

I thought about this for a while and in the end, given that V8's main target isn't as an embedded library in a mobile app, perhaps the binary size question is the one differentiating factor (in theory) for Hermes. By focussing on lowering distribution size, perhaps Hermes will be a compelling JS engine in its own right. In any case, Facebook can afford to keep Hermes running for a while, regardless of whether it has a competitive advantage or not.

It sounds like I'm criticising Hermes here but that's not really the point. If you can afford it, it's good to have code you control. For example one benefit that I see React Native getting from Hermes is that they control the threading model; they can mostly execute JS in its own thread, but interrupt that thread and switch to synchronous main-thread execution in response to high-priority events coming from the user. You might be able to do that with V8 at some point but the mobile-apps-with-JS domain is still in flux, so it's nice to have a sandbox that React Native developers can use to explore the system design space.

Evaluation

With that long overview out of the way, let's take a look to what kinds of performance we can expect out of a React Native system.

Startup latency

Because React Native apps have their JavaScript code pre-compiled to Hermes bytecode, we can expect that the latency imposed by JavaScript during application startup is lower than is the case with Ionic / Capacitor, which needs to parse and compile the JavaScript at run-time.

However, it must be said that as a framework, React tends to result in large application sizes and incurs significant work at startup time. One of React's strengths is that it allows development teams inside an organization to compose well: because rendering is a pure function, it's easy to break down the task of making an app into subtasks to be handled by separate groups of people. Could this strength lead to a kind of weakness, in that there is less of a need for overall coordination on the project management level, such that in the end nobody feels responsible for overall application performance? I don't know. I think the concrete differences between React Native and React (the C++ shadow object tree, the multithreading design, precompilation) could mean that React Native is closer to an optimum in the design space than React. It does seem to me though that whether a platform's primary development toolkit should be React-like remains an open question.

Jank

In theory React Native is well-positioned to avoid jank. JavaScript execution is mostly off the main UI thread. The threading model changes to allow JavaScript rendering to be pre-empted onto the main thread do make me wonder, though: what if that work takes too much time, or what if there is a GC pause during that pre-emption? I would not be surprised to see an article in the next year or two from the Hermes team about efforts to avoid GC during high-priority event processing.

Another question I would have about jank relates to interactivity. Say the user is dragging around a UI element on the screen, and the UI needs to re-layout itself. If rendering is slow, then we might expect to see a lag between UI updates and the dragging motion; the app technically isn't dropping frames, but the render can't complete in the 16 milliseconds needed for a 60 frames-per-second update frequency.

Peak perf

But why might rendering be slow? On the one side, there is the fact that Hermes is not a high-performance JavaScript implementation. It uses a simple bytecode interpreter, and will never be able to meet the performance of V8 with JIT compilation.

However the other side of this is the design of the application framework. In the limit, React suffers from the O(n) problem: any change to the application state requires the whole element tree to be recomputed. Rendering and layout work is proportional to the size of the application, which may have thousands of nodes.

Of course, React tries to minimize this work, by detecting subtrees whose layout does not change, by avoiding re-renders when state doesn't change, by minimizing the set of mutations to the native widget tree. But the native widgets aren't the problem: the programming model is, or it can be anyway.
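To illustrate the "avoid re-renders when state doesn't change" mitigation, here is a small memoization sketch in TypeScript. The helper names are invented for the purpose and this is not React's actual API (React.memo plays a comparable role); the point is the trade-off: a cheap shallow comparison up front in exchange for skipping a potentially large render.

function shallowEqual(a: Record<string, unknown>, b: Record<string, unknown>): boolean {
  const ka = Object.keys(a);
  const kb = Object.keys(b);
  return ka.length === kb.length && ka.every((k) => a[k] === b[k]);
}

// Cache the last props and the last result; reuse the cached result when the
// props are shallow-equal, skipping the render work entirely.
function memoizeRender<P extends Record<string, unknown>, R>(
  render: (props: P) => R
): (props: P) => R {
  let lastProps: P | undefined;
  let lastResult: R | undefined;
  return (props: P): R => {
    if (lastProps !== undefined && lastResult !== undefined && shallowEqual(lastProps, props)) {
      return lastResult;
    }
    lastProps = props;
    lastResult = render(props);
    return lastResult;
  };
}

Applied at every functional element, something like this lets unchanged subtrees drop out of the O(n) traversal.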

Aside: As good as native?

Again in theory, React Native can be used to write apps that are as good as if they were written directly against platform-native APIs in Kotlin or Swift, because it uses the same platform UI toolkits as native applications. React Native can also do this at the same time as being cross-platform, targetting iOS and Android with the same code. In practice, besides the challenge of designing suitable cross-platform abstractions, React Native has to grapple with potential performance and memory use overheads of JavaScript, but the result has the potential to be quite satisfactory.

Aside: Haven't I seen that rendering model somewhere?

As I mentioned in the last article, I am a compiler engineer, not a UI specialist. In the course of my work I do interact with a number of colleagues working on graphics and user interfaces, notably in the context of browser engines. I was struck when reading about React Native's rendering pipeline about how much it resembled what a browser itself will do as part of the layout, paint, and render pipeline: translate a tree of objects to a tree of immutable layout objects, clip those to the viewport, paint the ones that are dirty, and composite the resulting textures to the screen.

It's funny to think about how many levels we have here: the element tree, the recursively expanded terminal element tree, the shadow object tree, the platform-native widget tree, surely a corresponding platform-native layout tree, and then the GPU backing buffers that are eventually composited together for the user to see. Could we do better? I could certainly imagine any of these mobile application development frameworks switching to their own Metal/Vulkan-based rendering architecture at some point, to flatten out these layers.

Summary

By all accounts, React Native is a real delight to program for; it makes developers happy. The challenge is to make it perform well for users. With its new rendering architecture based on Hermes, React Native may well be on the path to addressing many of these problems. Bytecode pre-compilation should go a long way towards solving startup latency, provided that React's expands-to-fit-all-available-space tendency is kept in check.

If you were designing a new mobile operating system from the ground up, though, I am not sure that you would necessarily end up with React Native as it is. At the very least, you would include Hermes and the base run-time as part of your standard library, so that every app doesn't have to incur the space costs of shipping the run-time. Also, in the same way that Android can ahead-of-time and just-in-time compile its bytecode, I would expect that a mobile operating system based on React Native would extend its compiler with on-device post-install compilation and possibly JIT compilation as well. And at that point, why not switch back to V8?

Well, that's food for thought. Next up, NativeScript. Until then, happy hacking!

by Andy Wingo at April 21, 2023 08:20 AM

April 20, 2023

Andy Wingo: structure and interpretation of capacitor programs

(Andy Wingo)

Good day, hackers! Today's note is a bit of a departure from compiler internals. A client at work recently asked me to look into cross-platform mobile application development and is happy for the results to be shared publicly. This, then, is the first in a series of articles.

Mobile apps and JavaScript: how does it work?

I'll be starting by taking a look at Ionic/Capacitor, React Native, NativeScript, Flutter/Dart, and then a mystery guest. This article will set the stage and then look into Ionic/Capacitor.

The angle I am taking is, if you were designing a new mobile operating system that uses JavaScript as its native application development language, what would it mean to adopt one of these as your primary app development toolkit? It's a broad question but I hope we can come up with some useful conclusions.

I'm going to approach the problem from the perspective of a compiler engineer. Compilers translate high-level languages that people like to write into low-level languages that machines like to run. Compilation is a bridge between developers and users: developers experience the source language interface to the compiler, while users experience the result of compiling those source languages. Note, though, that my expertise is mostly on the language and compiler side of things; though I have worked on a few user-interface libraries in my time, I am certainly not a graphics specialist, nor a user-experience specialist.

I'm explicitly not including the native application development toolkits from Android and iOS, because these are most useful to me as a kind of "control" to the experiment. A cross-platform toolkit that compiles down to use native widgets can only be as good as the official SDK, but it might be worse; and a toolkit that routes around the official widgets by using alternate drawing primitives has the possibility to be better, but at the risk of being inconsistent with other platform apps, along with the more general risk of not being as good as the native UI libraries.

And, of course, there is the irritation that SwiftUI / UIKit / AppKit aren't open source, and that neither iOS nor Android's native UI library is cross-platform. If we are designing a new JavaScript-based mobile operating system—that's the conceit of the original problem statement—then we might as well focus on the existing JS-based toolkits. (Flutter of course is the odd one out, but it's interesting enough to include in the investigation.)

Ionic / Capacitor

Let's begin with the Ionic Framework UI toolkit, based on the Capacitor web run-time.

Overview

Capacitor is like an alternate web browser development kit for phones. It uses the platform's native WebView: WKWebView on iOS or WebView on Android. The JavaScript APIs available within that WebView are extended with Capacitor plugins which can access native APIs; these plugins let JavaScript do the sort of things you wouldn't be able to do directly in a web browser.

(Over time, most of the features provided by Capacitor end up in the web platform in some way. For example, there are web standards to allow JavaScript to detect or lock screen rotation. The Fugu project is particularly avant-garde in pushing these capabilities to the web. But, the wheel of web standards grinds very finely, and for an app to have access to, say, the user's contact list now, it is pragmatic to use an extended-webview solution like Capacitor.)
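To make the screen-rotation example concrete, here is a minimal TypeScript sketch of the standardized Screen Orientation API. Detection is widely available in WebViews; locking is still unevenly supported, so the hand-typed lock() shape below is an assumption about your target WebView rather than a guarantee.

// Detect the current orientation and react to changes; this part is standard.
console.log(screen.orientation.type); // e.g. "portrait-primary"
screen.orientation.addEventListener('change', () => {
  console.log(`rotated to ${screen.orientation.type}`);
});

// Locking is where support gets patchy; on some engines it also requires
// fullscreen, so an extended-webview plugin may still be the pragmatic choice.
const orientation = screen.orientation as ScreenOrientation & {
  lock?: (type: string) => Promise<void>;
};
if (orientation.lock) {
  orientation.lock('portrait').catch(() => { /* refused or unsupported */ });
}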

I call Capacitor a "web browser development kit" because creating a Capacitor project effectively copies the shell of a specialized browser into your source tree, usually with an iOS and Android version. Building the app makes a web browser with your extensions that runs your JS, CSS, image, and other assets. Running the app creates a WebView and invokes your JavaScript; your JS should then create the user interface using the DOM APIs that are standard in a WebView.

As you can see, Capacitor is quite full-featured when it comes to native platform capabilities but is a bit bare-bones when it comes to UI. To provide a richer development environment, the Ionic Framework adds a set of widgets and some application life-cycle abstractions on top of the Capacitor WebView. Ionic is not like a "normal" app framework like Vue or Angular because it takes advantage of Capacitor APIs, and because it tries to mimic the presentation and behavior of the native platform (iOS or Android, mainly), so that apps built with it don't look out-of-place.

The Ionic Framework itself is built using Web Components, which end up creating styled DOM nodes with JavaScript behavior. These components can compose with other app frameworks such as Vue or Angular, allowing developers to share some of their app code between web and mobile apps---you write the main app in Vue, and just implement the components one way on the web and using something like Ionic Framework on mobile.

To recap, an Ionic app uses Ionic Framework libraries on top of a Capacitor run-time. You could use other libraries on top of Capacitor instead of Ionic. In any case, I'm most interested in Capacitor, so let's continue our focus there.

Evaluation

Performance has a direct effect on the experience that users have when interacting with an application, and here I would like to break down the performance in three ways: one, startup latency; two, jank; and three, peak throughput. As Capacitor is structurally much like a web browser, many of these concerns, pitfalls, and mitigation techniques are similar to those on the web.

Startup latency

Startup latency is how long the app makes you wait after you run it before you can interact with it. The app may hide some of this work behind a splash screen, but if there is too much work to do at startup-time, the risk is that the user might get bored or frustrated and switch away. The goal, therefore, is to minimize startup latency. Most people consider time-to-interactive of less than a second to be sufficient; there's room to improve but for better or for worse, user expectations here are lower than they could be.

In the case of Capacitor, when a user launches an application, the app loads a minimal skeleton program that loads a WebView. On iOS and most Android systems, the rendering and JS parts of the WebView run in a separate process that has access to the app's bundled assets: JavaScript files, images, CSS, and so on. The WebView then executes the JavaScript main function to create the UI.

Capacitor application startup time will be dominated by parsing and compiling the JavaScript source files, as well as any initial work needed to boot the app framework (which could require running a significant amount of JS), and will depend on how fast the user's device is. The most effective performance mitigation here is to reduce the amount of JavaScript loaded, but this can be difficult for complex apps. The main techniques to reduce app size are toolchain-based. Tree-shaking, bundling, and minification reduce the number of JS bytes that the engine needs to parse without changing application logic. Code splitting can defer some work until it is needed, but this might just result in jank later on.
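As a concrete illustration of code splitting, here is a minimal sketch using a dynamic import; the './settings' module and renderSettings function are hypothetical, but any of the usual bundlers (webpack, Rollup, Vite, esbuild) will emit the lazily-imported module as a separate chunk.

// Only the code needed for first paint ships in the main bundle; the heavy
// settings screen is fetched, parsed and compiled the first time it is used.
async function openSettings(): Promise<void> {
  const { renderSettings } = await import('./settings'); // hypothetical module
  renderSettings(document.getElementById('app')!);
}

document.getElementById('settings-button')!
  .addEventListener('click', () => { void openSettings(); });

The parse-and-compile cost does not disappear, of course; as noted above, it just moves to the first time the user opens that screen, which is where the jank risk comes in.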

Common Ionic application sizes would seem to be between 500 kB and 5 MB of JavaScript. In 2021, Alex Russell suggested that the soon-to-be standard Android performance baseline should be the Moto E7 Plus which, if its performance relative to the mid-range Moto G4 phone that he measured in 2019 translates to JavaScript engine speed, should be able to parse and compile uncompressed JavaScript at a rate of about 1 MB/s. That's not very much, if you want to get startup latency under a second, and a JS payload size of 5 MB would lead to 5-second startup delays for many users. To my mind, this is the biggest challenge for a Capacitor-based app.

(Calculation detail: The Moto G4 JavaScript parse-and-compile throughput is measured at 170 kB/s, for compressed JS. Assuming a compression ratio of 4 and that the Moto E7 Plus is 50% faster than the Moto G4, that gets us to 1 MB/s for uncompressed JS, for what was projected to be a performance baseline in 2023.)
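For readers who prefer the arithmetic spelled out, here is the same projection as a tiny script; every constant is one of the assumptions above, not a new measurement.

// Back-of-the-envelope version of the projection in the text.
const motoG4CompressedRate = 170; // kB/s of compressed JS, measured in 2019
const compressionRatio = 4;       // assumed uncompressed-to-compressed ratio
const e7PlusSpeedup = 1.5;        // assumed: Moto E7 Plus 50% faster than G4

const uncompressedRate =
  motoG4CompressedRate * compressionRatio * e7PlusSpeedup; // ~1020 kB/s

for (const appSizeKB of [500, 5000]) {
  console.log(`${appSizeKB} kB of JS: ~${(appSizeKB / uncompressedRate).toFixed(1)} s`);
}
// 500 kB of JS: ~0.5 s; 5000 kB of JS: ~4.9 s -- hence the 5-second figure.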

Jank

Jank is when an application's animation and interaction aren't smooth, because the application somehow missed rendering one or more frames. Technically startup time is a form of jank, but it is useful to treat startup separately.

Generally speaking, the question of whether a Capacitor application will show jank depends on who is doing the rendering: if application JavaScript is driving an animation, then this could cause jank, in the same way as it would on the web. The mitigation is to lean on the web platform so that the WebView is the one in charge of ensuring smooth interaction, for example by using CSS animations or the native scrolling capability.
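As a sketch of what "leaning on the web platform" means in practice, the snippet below hands a hypothetical toast animation to the WebView via the Web Animations API; because it only touches transform and opacity, the engine can usually run it on the compositor even while application JavaScript is busy. The element and timing values are made up for illustration.

function slideIn(toast: HTMLElement): Animation {
  // Declarative keyframes: the WebView owns the per-frame work, not our JS.
  return toast.animate(
    [
      { transform: 'translateY(100%)', opacity: 0 },
      { transform: 'translateY(0)', opacity: 1 },
    ],
    { duration: 250, easing: 'ease-out', fill: 'forwards' }
  );
}
// The anti-pattern is a requestAnimationFrame loop mutating style.top every
// frame: that version misses frames as soon as the main thread is blocked.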

There may always be an impetus to do some animation in JavaScript, though, for example if the design guidelines require a specific behavior that can't be created using raw CSS.

Peak perf

Peak throughput is how much work an app can do per unit time. For an application written in JavaScript, this is a question of how fast the JavaScript engine can make the user's code run.

Here we need to make an important aside on the strategies that a JavaScript engine uses to run a user's code. A standard technique that JS engines use to make JavaScript run fast is just-in-time (JIT) compilation, in which the engine emits specialized machine code at run-time and then runs that code. However, on iOS the platform prohibits JIT code generation for most applications. The usual justification for this restriction is that the low-level capability granted to a program that allows it to JIT-compile increases the likelihood of security exploits. The only current exception to the no-JIT policy on iOS is for WebView (including the one in the system web browser), which iOS maintainers consider to be sufficiently sandboxed, so that any security exploit that uses JIT code generation won't be able to access other capabilities possessed by the application process.

So much for the motivation, as I understand it. The upshot is, there is no JIT on iOS outside web browsers or web views. If a cross-platform application development toolkit wants peak JavaScript performance on iOS, it has to run that JS in a WebView. Capacitor does this, so it has fast JS on iOS. Note that these restrictions are not in place on Android; you can have fast JS on Android without using a WebView, by using the system's JavaScript libraries.

Of course, actually getting peak throughput out of JavaScript is an art but the well-known techniques for JavaScript-on-the-web apply here; we won't cover them in this series.

The bridge

All of these performance observations are common to all web browsers, but with Capacitor there is the additional wrinkle that a Capacitor application can access native capabilities, for example access to a user's contacts. When application JavaScript goes to access the contacts API, Capacitor will send a message to the native side of the application over a bridge. The message will be serialized to JSON. Capacitor will include an implementation for the native side of the bridge into your app's source code, written in Swift for iOS or Java for Android. The native end of the bridge parses the JSON message, performs the requested action, and sends a message back with the result. Some messages may need to be proxied to the main application thread, because some native APIs can only be processed there.
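To make the shape of the bridge concrete, here is a hedged sketch of what a plugin call looks like from the JavaScript side. The ContactsPlugin interface is hypothetical (real contacts plugins have a different surface), but registerPlugin is how Capacitor exposes native plugins to the WebView, and both the options object and the result cross the bridge as JSON.

import { registerPlugin } from '@capacitor/core';

// Hypothetical plugin surface, for illustration only.
interface ContactsPlugin {
  getContacts(options: { limit: number }): Promise<{ names: string[] }>;
}

const Contacts = registerPlugin<ContactsPlugin>('Contacts');

async function firstFewContacts(): Promise<string[]> {
  // Serialized to JSON, handled by the Swift/Java side of the bridge,
  // and the result is serialized back and resolved asynchronously.
  const { names } = await Contacts.getContacts({ limit: 20 });
  return names;
}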

As you can imagine, the bridge has an overhead. Apps that have high-bandwidth access to native capabilities will also have JSON encoding overhead, as well as general asynchronous coordination overhead. It may even be possible that encoding or decoding a large JSON message causes the WebView to miss a frame, for example when accessing a large file in local storage.

The bridge is a necessary component to the design, though; an iOS WebView can't have direct in-process access to native capabilities. For Android, the WebView APIs do not appear to allow this either, though it is theoretically possible to ship your own WebView that could access native APIs directly. In any case, the Capacitor multi-process solution does allow for some parallelism, and the enforced asynchronous nature of the APIs should lead to less modal application programming.

Aside: Can WebView act like native widgets?

Besides performance, one way that users experience application behavior is by way of expectations: users expect (but don't require) a degree of uniformity between different apps on the same platform, for example the same scrolling behavior or the same kinds of widgets. This aspect isn't the most important one to my investigation, because I'm more concerned with a new operating system that might come with its own design language with its own standard JavaScript-based component library, but it's one to note; an Ionic app is starting at a disadvantage relative to a platform-native app. Sometimes ensuring a platform-native UI is just a question of CSS styling and using the right library (like Ionic Framework), but sometimes it might take significant engineering or possibly even new additions to the web platform.
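One small, low-tech part of that styling work can be sketched like this: ask Capacitor which platform the WebView is running on and let CSS specialize from there. The class-name convention is an assumption of mine; getPlatform() is part of current Capacitor releases, but double-check against the version you use.

import { Capacitor } from '@capacitor/core';

// Tag the document with the current platform so stylesheets can specialize,
// e.g. body.platform-ios vs body.platform-android rules.
// getPlatform() returns 'ios', 'android' or 'web'.
document.body.classList.add(`platform-${Capacitor.getPlatform()}`);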

Aside: Is a WebView as good as native widgets?

As a final observation on user experience, it's worth asking the question of whether you can actually make a user interface using the web platform that is "as good" as native widgets. After all, the API and widget set (DOM) available to a WebView is very different from that available to a native application; is a WebView able to use CPU and GPU resources efficiently to create a smooth interface? Are web standards, CSS, and the DOM API somehow a constraint that prevents efficient user interface construction? Here, somewhat embarrassingly, I can't actually say. I am instead going to just play the "compiler engineer" card, that I'm not a UI specialist; to me it is an open question.

Summary

Capacitor, with or without the Ionic Framework on top, is the always-bet-on-the-web solution to the cross-platform mobile application development problem. It uses stock system WebViews and JavaScript, augmented with native capabilities as needed. Consistency with the UX of the specific platform (iOS or Android) may require significant investment, though.

Performance-wise, my evaluation is that Capacitor apps are well-positioned for peak JS performance due to JIT code generation and low jank due to offloading animations to the web platform, but may have problems with startup latency, as Capacitor needs to parse and compile a possibly significant amount of JavaScript each time it is run.

Next time, a different beast: React Native. Until then, happy hacking!

by Andy Wingo at April 20, 2023 10:20 AM

April 19, 2023

Sebastian DrögeBuilding a GStreamer plugin in Rust with meson instead of cargo

(Sebastian Dröge)

Over the Easter holidays I spent some time on a little experiment: How hard is it to build a GStreamer plugin in Rust with meson instead of using the default Rust build system cargo?

meson is a community-developed, open-source, cross-platform (including Windows), multi-language build system that is supposed to be fast and easy to use. It’s nowadays used as a build system by a lot of components of the GNOME desktop, by GStreamer, systemd, Mesa, X.org, QEMU and many other pieces of Free Software. Mostly for building C or C++ code, but also for configuring and installing Python or JavaScript code, Vala, and other languages.

Wouldn’t it be nice if we could also build software in Rust with it, build it together with existing code in other languages and have a common build system for it all? What would be the advantages and disadvantages of that, and what’s the current status of Rust support in meson? How much effort is it to make use of all the existing 100k software packages (“crates”) that are available for Rust?

Especially as most of the projects mentioned before are looking into adopting Rust more or less seriously as a safer and more modern successor of C, these seem like useful questions to investigate. Anecdotally, I also heard that a maintainer of one of these projects said that being able to use the same build system as the rest of the codebase would be a requirement to even consider the language. Another project is starting to write some components in Rust and building them with meson, but without depending on any external Rust dependencies for now.

Another reason for looking into this was that there seems to be the opinion that you can’t really use any build system apart from cargo for building Rust code, and that using meson would be very hard or even impossible and involve a lot of disadvantages. This has led to all GNOME applications written in Rust currently having a chimera of a build system that uses both meson and cargo, because neither of the two does everything these applications need. Such a setup is hard to maintain and debug, and probably almost nobody really understands it. cargo’s design does not make embedding into another build system easy, and both cargo and meson have very specific, and to some degree incompatible, opinions about how things have to be done. Let’s see if that’s actually necessary and what’s missing to move away from that. As Facebook is apparently using buck to build part of their Rust code, and Google bazel, this shouldn’t really be impossible.

As I’m a GStreamer developer, the maintainer of its Rust bindings and the one who started writing plugins for it in Rust, trying to build a GStreamer plugin in Rust with meson instead of cargo seemed like the obvious choice for this experiment.

However, everything here applies in the same way to building GTK applications with its Rust bindings or similarly to any of the software mentioned before for writing components of them in Rust.

EDIT: After publishing the first version I was told that meson actually supports a couple of things that I missed before. For example, running Rust tests is already supported just fine (but more verbose in the build definition than cargo), and meson can already install executables setuid root by itself. Also there is an equivalent to cargo init, meson init, which makes starting new projects a bit more convenient. Also I wrote that cargo clippy is not really supported yet. While true, you can get around that by telling meson (or any other build system) to use clippy-driver as compiler instead of using rustc. I’ve updated the text below accordingly.

Summary

The code for my experiment can be found here, but at the point of writing it needs some changes to meson that are not merged yet. A list of those changes with some explanation can be found further below. The git repository also includes a cargo build system that gives the same build results for comparison purposes.

I should also make clear that this is only an experiment at this point and while it works fine, it is more manual work than necessary and if you depend on a lot of existing crates from crates.io then you probably want to wait a bit more before considering meson. Also more on that later. However, if you won’t have to depend on a lot of crates, your codebase is relatively self-contained and maybe even has to be built together with C code, then meson is already a viable alternative and has some advantages to offer compared to cargo. But also some disadvantages.

Almost all of the manual work I did as part of this experiment can be automated, and a big part of that is not going to be a lot of work either. I didn’t do that here to get an idea of the problems that would actually be encountered in practice when implementing such an automated system. I’ll get to that in more detail at the very end.

In summary I would say that

  • meson is less convenient and less declarative than cargo, but in exchange more powerful and flexible
  • meson’s Rust support is not very mature yet and there’s very little tooling integration
  • cargo is great and easy to use if your project falls into the exact pattern it handles but easily becomes annoying for you and your users otherwise
  • the developer experience is much better with cargo currently but the build and deployment experience is better with meson

More on each of these items below.

Procedure

As a goal I wanted to build one of the parts of the gst-plugins-rs tutorial, specifically an identity-kind of element that simply passes through its input, and build that into a GStreamer plugin that can be loaded into any GStreamer process. For that it has to be built as a cdylib in Rust terms: a shared library that offers a C ABI. For this purpose, meson already has a big advantage over cargo in that it can actually install the build result in its correct place, while cargo can only install executables right now. But more on that part later too.

The main task for this was translating all the Rust crate dependencies from cargo to meson, manually one by one. Overall 44 dependencies were needed, and I translated them into so-called meson wrap dependencies. The wrap dependency system of meson, similar to cargo, allows downloading the source code of dependencies from another location (in this case crates.io) and extending it with a patch, in this case to add the meson-based build system for each.

In practice this meant creating a lot of meson.build files based on the Cargo.toml of each crate. The following is the meson.build for the toml_edit crate.

project('toml_edit-rs', 'rust',
  version : '0.19.8',
  meson_version : '>= 1.0.0',
  default_options : ['buildtype=debugoptimized',
                     'rust_std=2021'
                    ]
)

rustc = meson.get_compiler('rust')

toml_datetime_dep = dependency('toml_datetime-rs', version : ['>= 0.6', '< 0.7'])
winnow_dep = dependency('winnow-rs', version : ['>= 0.4', '< 0.5'])
indexmap_dep = dependency('indexmap-rs', version : ['>= 1.9.1', '< 2.0'])

features = []
features += ['--cfg', 'feature="default"']

toml_edit = static_library('toml_edit', 'src/lib.rs',
  rust_args : features,
  rust_crate_type : 'rlib',
  dependencies : [toml_datetime_dep, winnow_dep, indexmap_dep],
  pic : true,
)

toml_edit_dep = declare_dependency(link_with : toml_edit)

and the corresponding wrap file

[wrap-file]
directory = toml_edit-0.19.8
source_url = https://crates.io/api/v1/crates/toml_edit/0.19.8/download
source_filename = toml_edit-0.19.8.tar.gz
source_hash = 239410c8609e8125456927e6707163a3b1fdb40561e4b803bc041f466ccfdc13
diff_files = toml_edit-rs.meson.patch

[provide]
toml_edit-rs = toml_edit_dep

As can be seen from the above, this could all be autogenerated from the corresponding Cargo.toml and that’s the case for a lot of crates. There have also been plans and ideas in the meson community for quite a while to actually develop such a tool, but so far this hasn’t materialized yet. Maybe my experiment can provide some motivation to actually start with that work.

For simplicity, when translating these by hand I didn’t consider including

  • any optional dependencies unless needed for my tasks
  • any tests, examples, executables as part of the crate
  • any cargo feature configuration other than exactly what I needed

All of this can be easily done with meson though, and an automated tool for translating Cargo.toml into meson wraps should easily be able to handle that. For cargo features there are multiple ways of mapping them to meson, so some conventions have to be defined first of all. Similarly for naming Rust crates as meson dependencies it will be necessary to define some conventions to allow sharing between projects.

The hard part for translating some of the crates were the cargo build.rs build scripts. These build scripts allow running arbitrary Rust code as part of the build, which will also make automatic translation challenging. More on that later.

Once I had translated all the 44 dependencies, including the GStreamer Rust bindings, various procedural Rust macros and a lot of basic crates of the Rust ecosystem, I copied the plugin code into the new repository, changed some minor things and could then write a meson.build file for it.

project('gst-plugin-meson-test', 'rust',
  version : '0.0.1',
  meson_version : '>= 1.0.0',
  default_options : ['buildtype=debugoptimized',
                     'rust_std=2021'
                    ]
)

plugins_install_dir = '@0@/gstreamer-1.0'.format(get_option('libdir'))

rustc = meson.get_compiler('rust')

add_global_arguments(
  '-C', 'embed-bitcode=no',
  language: 'rust'
)

gst_dep = dependency('gstreamer-rs', version : ['>= 0.20', '< 0.21'])
once_cell_dep = dependency('once_cell-rs', version : ['>= 1.0', '< 2.0'])

gst_plugin_meson_test = dynamic_library('gstmesontest', 'src/lib.rs',
  rust_crate_type : 'cdylib',
  rust_dependency_map : {
    'gstreamer' : 'gst',
  },
  dependencies : [gst_dep, once_cell_dep],
  install : true,
  install_dir : plugins_install_dir,
)

To get everything building like this with the latest meson version (1.1.0), at this point some additional changes are needed: 1, 2, 3 and 4. Also currently all of this only supports native compilation. Cross-compilation of proc-macro in meson is currently broken but should be not too hard to fix in the future.

A careful reader will also notice that currently all crates with a dash (-) in their name are named like that for the dependency name, but the library they’re building has the dash replaced by an underscore in its name. This is due to a rustc bug (or undocumented behaviour?), and meson will likely require this translation of forbidden characters in crate names for the foreseeable future.

With all that in place, building the project is a matter of running

$ meson builddir
$ ninja -C builddir

compared to a single command with cargo

$ cargo build

Note that the meson build will be slower because by default meson already builds with optimizations (-C opt-level=2) while cargo does not.

Comparison between cargo and meson

In the following section I’m going to compare some aspects of cargo and meson, and give my general impression of both. To give away the conclusion up front, my perfect build system would include aspects of both cargo and meson, and currently both are lacking in their own way. Like with any other tool (or programming language, for that matter): if you don’t have anything to complain about it, you don’t know it well enough yet or don’t know any of the alternatives.

To avoid having people focus only on the negative aspects, I’m also going to try to describe the shortcomings of one build system as good features of the other. This is not a competition; both meson and cargo can learn a lot from each other.

Build Times

Above I briefly mentioned build times. And because that’s something everybody is interested in, and a common complaint about Rust, let’s start with that topic even if it’s in my opinion the most boring one.

Overall you would expect both build systems to behave the same if they do the same work as they are both basically just sophisticated Rust compiler process spawners. If you want to improve build times then your time is probably spent better on the Rust compiler itself and LLVM.

All times below are measured on my system very unscientifically with time. This is all only to give an idea of the general behaviour and to check if there are conceptual inefficiencies or problems. Also, when reproducing these results make sure to pass -Dbuildtype=debug to meson for comparable results between meson and cargo.

  • Running meson (build configuration): 1.4s (1.0s user, 0.4s system)
  • Running ninja (actual build): 10.4s (23.5s user, 3.2s system)
  • Running cargo: 11.0s (34.1s user, 6.6s system)

One thing that shows here immediately is that both need approximately the same time. The build alone takes slightly less time than cargo, the configure and build steps together slightly more. So far, this is what would be expected. However, cargo uses almost 45% more CPU time. I didn’t investigate this in great detail but the two main reasons for that are likely

  • cargo is building and running a build.rs build script in 23 of the 44 dependencies, which takes a while and also pulls in some more dependencies that are not needed for the meson build
  • meson is currently parallelizing less well compared to cargo when building Rust code, or otherwise the build step would likely be 10% or more faster

cargo build scripts

Interestingly, the main pain point when translating the crate build systems from cargo to meson also seems to be the main factor for cargo being more inefficient than it could be. This also seems to be a known fact in the wider Rust community by now.

But in addition to being inefficient and painful to translate (especially automatically), it is in my opinion also a maintenance nightmare and literally the worst part of cargo. It’s not declarative in what the build script is actually trying to do, it’s very easy to write build scripts that don’t work correctly in other environments and two crates doing the same things in a build script are generally going to behave differently in non-obvious ways.

For all the crates that I translated, the reasons why the build scripts existed in 23 of 44 crates should really be features that cargo should provide directly and that meson can do directly:

  • Checking which version of the Rust compiler is used and based on that enabling/disabling various features
  • Checking features or versions of the underlying platform or operating system
  • Checking for existence of native, external libraries or even building them
  • Code generation (a common reason for build scripts in general, though not in these crates)

Especially the last two points are painful for build and deployment of Rust code, and that every crate has its own special way of solving it in a build script doesn’t make it better. And in the end both tasks are exactly what you have a build system for: building things, tracking dependencies between them and generating an ideal build plan or schedule.

The system-deps crate provides a way to express external dependencies declaratively as part of Cargo.toml, and a similar system built into cargo and integrated with the platform’s mechanism for finding libraries would be a huge improvement. Similar approaches for the other aspects would also be helpful. And not just for making a different build system for crates, but also for developers using cargo.

I’m sure that once cargo gained features for handling the four items above there will still be a need for custom build scripts in some situations, but these four items should cover more than 95% of the currently existing build scripts. It’s a mystery to me why there are no apparent efforts being made to improve cargo in this regard.

Good Parts of meson

Now let’s focus on some of the good parts of meson in comparison to cargo.

More flexible and expressive

Generally, meson has a much more flexible and expressive build definition language. It looks more like a scripting language than the toml files used by cargo, and as such appears more complex.

However thanks to this approach, almost anything for which cargo requires custom build scripts, with the disadvantages listed above, can be written directly as part of the meson build definitions. Expressing many of these things in toml for cargo would likely become convoluted and not very straightforward, as can already be seen nowadays with e.g. platform-specific dependencies.

meson provides built-in modules with support for e.g.

  • finding and checking versions of compilers of different languages and other build tools, e.g. for code generation, including the Rust bindgen tool
  • testing compilation of code snippets
  • finding external library dependencies and checking their versions
  • defining shared/static library, executable and code generation build targets together with how (and if) they should be installed later
  • installing (and generating) data files, including specific support for e.g. library metadata (pkg-config), documentation, gobject-introspection, …

As an escape hatch, whenever something can’t be expressed by just meson, it is also possible to run external configuration/build/install scripts that could be written as a Python/shell script, C or Rust executable, or anything else really. This is rarely needed though and the meson project seems to add support for anything that people actually need in practice.

Build configuration system

meson provides an extensible configuration mechanism for the build, which allows the user building the software to customize the build process and its results.

  1. The built-in options allow for configuring things like switching between debug and release builds, defining toolchains and defining where to install build results.
  2. The build options allow each project to extend the above with custom configuration, e.g. for enabling/disabling optional features of the project or generally selecting between different configurations. Apart from simple boolean flags this also allows for other types of configuration, including integers and strings.

cargo only provides the first of the two, in the form of fixed, non-extensible compiler profiles and configuration.

The second is not to be confused with cargo’s feature flags, which are more a mechanism for selecting the configuration of dependencies by the developer of the software. The meson build configuration however is for the builder of the software to select different configurations. While for executables cargo feature flags are used in a similar way for boolean configurations, cargo does not provide anything for this in general.

As a workaround for this, a couple of Rust crates are using environment variables together with the env! macro for build configuration, but this is fragile and not discoverable, and more of a hack than a real solution.

    Support for Rust and non-Rust code in the same project

    meson supports mixing Rust and non-Rust code in the same project, and allows tracking dependencies between targets using different languages in the same way.

    While it is not supported to mix Rust and e.g. C code in the same build target due to Rust’s compilation model, it is possible to e.g. build a static Rust library, a static C library and link both together into e.g. a D application or Python module. An example for this would be this Python module that combines C, C++, Rust and Fortran code.

    Code generation can be handled in a similar way as in the end code generation is just another transformation from one format into another.

    cargo doesn’t directly support anything but Rust code. As usual, build scripts provide a mechanism to get around this limitation. The cc crate for example is widely used to build C code and there are also crates for building meson or cmake based software as part of the build process.

    All of this is completely opaque to cargo though and can’t be taken into account for defining an optimal build schedule, can’t be configured from the outside and regularly fails in non-standard build situations (e.g. cross-compilation).

    Installation of other files than executables

    meson allows every build result to be installed in a configurable location. This is especially useful for more complex applications that might have to provide various data files or come as an executable plus multiple libraries or plugins, or simply for projects that only provide a shared library. If any of the built-in installation mechanisms are not sufficient (e.g. the executable should get specific process capabilities set via setcap), meson also allows customizing the install process via scripts.

    cargo only allows installing executables right now. There are cargo extensions that also allow for more complex tasks, e.g. cargo xtask, but there is no standard mechanism. There once was an RFC to make cargo’s installation process extensible, but the way this was proposed would also suffer from the same problems as the cargo build scripts.

    External dependency and library support

    In addition to mixing build targets with multiple languages in the same project, meson also has a mechanism to find external dependencies in different ways. If an external dependency is not found, it can be provided and built as part of the project via the wrap mechanism mentioned before. The latter is similar to how cargo handles dependencies, but the former is missing completely and currently implemented via build scripts instead, e.g. by using the pkg-config crate.

    As Rust does not currently provide a stable ABI and provides no standard mechanism to locate library crates on the system, this mostly applies to library dependencies written in other languages. meson does support building Rust shared/static libraries and installing them too, but because of the lack of a stable ABI this has to be made use of very carefully.

    On the other hand, Rust allows building shared/static libraries that provide the stable C ABI of the platform (cdylib, staticlib crate types). meson allows building these correctly too, and also offers mechanisms for installing them together with their (potentially autogenerated) header files and locating them again later in other projects via e.g. pkg-config.

    For cargo this job can be taken care of by cargo-c, including actually building shared libraries correctly by setting e.g. the soname correctly and setting other kinds of versioning information.

    Good Parts of cargo

    After writing so much about meson and how great it is, let’s now look at some aspects of cargo that are better than what meson provides. Like I said before, both have their good sides.

    Simpler and more declarative build definitions

    The cargo manifest format is clearly a lot simpler and more declarative than the meson build definition format.

    For a simple project it looks more like a project description than something written in a scripting language.

    [package]
    name = "hello_world"
    version = "0.1.0"
    edition = "2021"
    
    [dependencies]
    anyhow = "1.0"
    

    As long as a project stays in the boundaries of what cargo makes easy to express, which should be the case for the majority of existing Rust projects, it is going to be simpler than meson. The lack of various features in cargo, which then requires the use of a build script, currently prevents this for many crates, but that seems like something that could be easily improved.

    meson on the other hand feels more like writing actual build scripts in some kind of scripting language, and information like “what dependencies does this have” are not as easily visible as from something like a Cargo.toml.

    Tooling integration

    cargo provides a lot of development tools that make development with it a very convenient and smooth experience. There are also dozens of cargo extension commands that provide additional features on top of cargo.

    cargo init creates a new project, cargo add adds new dependencies, cargo check type-checks code without full compilation, cargo clippy runs a powerful linter, cargo doc builds the documentation (incl. dependencies), cargo bench and cargo test allow running tests and benchmarks, cargo show-asm shows the generated, annotated assembly for the code, cargo udeps finds unused dependencies, …

    All of this makes development of a Rust project a smooth and well-integrated experience.

    In addition rust-analyzer provides a lot of IDE features via the LSP protocol for various editors and IDEs.

    Right now, IDEs and editors are assuming Rust projects to make use of cargo and offer integration with its features.

    On the other hand, when using meson almost none of this is currently provided and development feels less well integrated. Right now the only features provided by meson for making Rust development easier are the generation of a rust-project.json to make use of rust-analyzer, the ability to run tests in a similar way to cargo, and of course actually building the code. Building of documentation could be easily added to meson and is supported for other languages; something like cargo add for wrap dependencies exists already and adding crates.io support to it would be possible, but it’s going to take a while and a bit of effort to handle crates with cargo build scripts. Making use of all the cargo extension commands without actually using cargo seems unrealistic.

    In the end, cargo is the default build system for Rust and everything currently assumes usage of cargo so using cargo offers the best developer experience.

    Rust dependencies

    As briefly mentioned above, cargo add makes it extremely easy to add new Rust dependencies to a project and build them as part of the project. This helps a lot with code reuse and modularity. That an average Rust project has dozens of direct dependencies and maybe a hundred or more indirect dependencies shows this quite clearly, as does the encapsulation of very small tasks in separate dependencies compared to huge multi-purpose libraries that are common with other languages.

    cargo also directly handles updating of dependencies via cargo update, including making sure that only semver compatible versions are automatically updated and allows including multiple incompatible versions of dependencies in the same build if necessary.

    In addition to just adding dependencies, cargo features allow for conditional compilation and for defining which parts and features of a dependency should be enabled or not.

    meson has some explicit handling of dependencies and the wrap system also allows building external dependencies as part of the project, but adding or updating dependencies is a more manual process unless they are in the meson wrapdb. For now it is completely manual with regards to Rust crates. It is no different from adding dependencies in non-Rust languages though.

    There is also no direct equivalent of the cargo feature flags, which potentially seems like a useful addition to meson.

    Next steps

    Considering all of the above, there is not really a simple answer to which of the two choices is the best for your project. As of now I would personally always use cargo for Rust projects if I can, especially if they will have other Rust dependencies. It’s a lot more convenient to develop with cargo.

    However various features of meson might make it a better choice for some projects, maybe already now or at least in the future. For example the ability to handle multiple languages, to handle dependencies and shared libraries correctly or the ability to install data files together with the application. Or simply because the remainder of the project already uses meson for compiling e.g. C code, meson might be the more natural choice for adding Rust code to the mix.

    As outlined above there are many areas of meson that could be improved and where Rust support is not very mature yet, and to make meson a more viable alternative to cargo these will have to happen sooner or later. Similarly, there are various areas where cargo also could be improved and learn from meson, and where improvements to cargo, in addition to making cargo more flexible and easier to use, would also make it easier to handle other build systems.

    From a meson point of view, I would consider the following the next steps.

    Various bugs and missing features

    rustc and cargo

    On the Rust compiler side, there are currently two minor issues that would need looking into.

    • rustc#80792: Adding support for passing environment variables via the command-line instead of using actual environment variables. Environment variables are currently used for build configuration by many crates. Allowing them to be provided via the command-line would allow for cleaner and more reliable build rules without accidentally leaking actual environment variables into the build.
    • rustc#110460: Undocumented and maybe unintentional library filename requirement based on the crate name. This currently requires meson to disallow dashes in crate names and to have them explicitly replaced by underscores, which is something cargo does implicitly. Implicit conversion inside meson would also be feasible but probably not desirable because there would then be a mismatch between the name of the build target in the build definition and the actual name of the build target when building it. This is similar to the mild confusion that some people ran into when noticing that a crate with a dash in the name can only be referenced with underscores from Rust code.

    In addition it would be useful to look into moving various features from build scripts into proper cargo, as outlined above. This will probably need a lot of design and discussion effort, and will likely also take years after implementation until the important crates are moved to it due to very conservative Rust toolchain version requirement policies in various of these crates.

    So on the cargo side that’s more something for the long run but also something that would greatly benefit users of cargo.

    meson

    On the meson side there are a couple of bugs of different severity that will have to be solved, and probably more that will show up once more people are starting to use meson to build Rust projects.

    In addition to the ones I mentioned above and that are merged already into the meson git repository, at the time of this writing there were for example the following outstanding issues

    • meson#11681: Add a feature to meson to allow renaming crates when using as a dependency. This is used throughout the Rust ecosystem for handling multiple versions of the same crate at once, or also simply for having a more convenient local name of a dependency.
    • meson#11695: Parallelize the Rust build better by already starting to build the next Rust targets when the metadata of the dependencies is available. This should bring build time improvements of 10% or more on machines with enough cores, and the lack of it is the main reason why the meson build in my experiment was “only” as fast as the cargo build and not faster.
    • meson#11702: Cross-compilation of proc-macro crates is currently using the wrong toolchain and simply doesn’t work.
    • meson#11694: Indirect dependencies of all dependencies are passed onwards on the later compiler invocations, which brings the risk of unnecessary name conflicts and simply causes more work for the compiler than necessary.
    • meson#10030: Add support for passing environment variables to the Rust compiler. As mentioned above, many crates are currently using this for build configuration so this would have to be supported by meson in one way or another.

    Apart from the second one in this list these should all be doable relatively fast and generally getting fixes and improvements required for Rust merged into meson was a fast and pleasant experience so far. I didn’t encounter any unnecessary bikeshedding or stop energy.

    Tooling for managing cargo dependencies

    During my experiment I wrote all the meson wrap files manually. This does not really scale, is inconvenient, and also makes it harder to update dependencies later.

    The goal here would be to provide a tool in the shape of cargo add and cargo update that allows automatically adding cargo-based Rust dependencies to a meson project. This is something that was discussed a lot in the past and various people in the meson community have ideas and plans around this. meson already has something similar for the wrapdb, meson wrap add and meson wrap update, but the idea would be to have something similar (or integration into that) to directly support crates from crates.io, so Rust dependencies can be added to a meson project with as little effort as they currently can be to a cargo project.

    Apart from the cargo build scripts this shouldn’t be a lot of effort and a project for an afternoon at most, so maybe I’ll give that a try one of these days if nobody beats me to it.

    As part of such a tool, it will also be necessary to define conventions about naming, mapping of cargo features, versioning, etc. of Rust crates inside meson, and this should ideally be done early on to avoid unnecessary churn. The way I did it as part of my experiment has various drawbacks with regards to versioning and needs improvements.

    Handling cargo build scripts is a bigger issue though. As my experiment showed, about half of the crates had build scripts. While all of them were more or less trivial, automatically translating this into meson build definitions seems unrealistic to me.

    It might be possible to have meson use the cargo build scripts directly in one way or another, or they would have to be translated manually to meson build definitions. The latter would considerably improve build times, so it seems like a better approach for common crates at least. And for those the meson build definitions could be stored in a central place, like the meson wrapdb, or maybe even be included in the crates on crates.io if their maintainers feel like dealing with two build systems.

    Together with all this, some thought will also have to be put into how to locate such Rust dependencies, similar to how pkg-config allows locating shared libraries. For example, Linux distributions will want to package such dependencies and make sure that a project built for such a distribution is making use of the packaged dependencies instead of using any other version, or worse downloading some version from the Internet at build time. The way this is currently handled by cargo is also not optimal for Linux distributions and a couple of other build and deployment scenarios.

    Because of the lack of a stable Rust ABI this would mean locating Rust source code.

    Tooling integration

    And last, as mentioned above there is basically no tooling integration right now apart from being able to build Rust code and using rust-analyzer. meson should at least support the most basic tasks that cargo supports, and that meson already supports for other languages: running tests and benchmarks, running linters and building documentation.

    Once those basic tasks are done, it might be worth investigating other tooling integration like the various cargo extension commands offer or extending those commands to handle other build systems, e.g. via the rust-project.json that rust-analyzer uses for that purpose.

    by slomo at April 19, 2023 02:26 PM

    April 18, 2023

    Andy Wingosticking point

    (Andy Wingo)

    Good evening, gentle readers. A brief note tonight, on a sticky place.

    See, I have too many projects right now.

    In and of itself this is not so much of a problem as a condition. I know my limits; I keep myself from burning out by shedding load, and there is a kind of priority list of which projects keep adequate service levels.

    First come the tiny humans that are in my care who need their butts wiped and bodies translated to and from school or daycare and who -- well you know the old Hegelian trope, that the dialectic crank of history doesn't turn itself, that it takes actions from people to synthesize the thesis and the antithesis, and that even History itself isn't always monotonic; in the same way, bedtime is a reality, there are the material conditions of sleepiness and you're-gonna-be-so-tired-tomorrow but without the World-Historical Actor whose name is Dada ain't nobody getting into pyjamas, not to mention getting in bed and going to sleep.

    Lower in the priority queue there is work, and work is precious: a whole day, a day of just thinking about things and solving problems; yes, I wish I could choose all of the problems that I work on but the simple fact of being able to think and solve is a gift, or rather a reward, time stolen from the arbitrary chaotic preliterate interruptors. I love my kids -- and here I have to choose my conjunction wisely -- and also they love me. Which is nice! They express this through wanting to sit on my lap and demand that I read to them when I am thinking about things. We all have our love languages, I suppose.

    And the house? There are 5 of us now, and that's, you know, more than 5 times the work it was before because the 3 wee ones are more chaotic than the adults. There is so much laundry. We are now a cook-the-whole-500g-bag-of-pasta family, and we're far from the teenager stage.

    Finally there are the weird side projects, of which there are a few. I have some sporty things I do that I can't avoid because I am socially coerced into doing them; all in all a fine configuration. I have some volunteer work, which has parts I like that take time, which I am fine with, and parts I don't that I neglect. There's the garden, also neglected but nice to muck about in sometimes. Sometimes I see friends? Rarely, so rarely, but sometimes.

    But, netfriends, if I find a free couple hours, I hack. Not less than a couple hours, because otherwise I can't get into things. More? Yes please! But usually no, because priorities.

    I write this note because I realized that in the last month or so, hain't been no hacktime; there's barely 30 minutes between the end of the evening kitchen-clean and the onset of zzzz. That's fine, really; ultimately I need to carve out work time for this sort of thing, somehow or other.

    But I was also realizing that I painted myself into a corner with my GC work. See, I need some new benchmarks, and I can't afford to use Guile itself as my benchmark yet, because I can't afford to fix deep bugs right now. I need something small. Specifically I need a little language implementation with efficient GC integration. I was looking at porting the little Scheme implementation from my Wasm JIT work to C -- because the rest of whippet is in C -- but is that a good enough reason? I got mired in the swamps of precise GC roots in C and let me tell you, it is disgusting. You don't have RAII like you do in C++, without GCC extensions. You don't have stack maps, like you would like. I might be able to push through but it's the sort of project you need a day for, and those days have not been forthcoming recently.

    Anyway, just a little hack log tonight, not detailing a success, but rather a condition. Perhaps observing the position of this logjam will change its velocity. Here's hoping, and until next time, happy hacking!

    by Andy Wingo at April 18, 2023 08:23 PM

    April 11, 2023

    GStreamerGStreamer 1.22.2 stable bug fix release

    (GStreamer)

    The GStreamer team is pleased to announce the second bug fix release in the stable 1.22 release series of your favourite cross-platform multimedia framework!

    This release only contains bugfixes and it should be safe to update from 1.22.x.

    Highlighted bugfixes:

    • avdec_h264: fix decoder deadlocks with FFmpeg 6.0
    • rtspsrc: fix regression with URI protocols in OPTIONS requests for RTSP over TLS
    • rtspsrc: improved control url handling compatibility for broken servers
    • decklink: fix 10 bit RGB (r210) format auto detection for capture and fix playout if video caps are configured before audio caps
    • d3d11videosink: Fix tearing in case of fullscreen mode
    • playbin: fix deadlock when stopping stream with subtitles visible (even more)
    • typefinding: fix regression not detecting application/dash+xml in some corner cases
    • osxvideosink: fix broken aspect ratio and frame drawing region
    • decodebin3, parsebin: Improve elementary stream handling when decoders are not present and fix hang when removing a failing stream
    • urisourcebin: Propagate sticky events from parsebin, so that the STREAM_START event with the GstStream info is always available when pads get exposed
    • v4l2: Add support for YVU420M format; mark JPEG content as parsed
    • h264decoder, h265decoder: DPB bumping process and latency reporting fixes
    • Opus: Fix reading of extended channel config in MPEG-TS and fix missing sample rate when remuxing from RTP to Matroska
    • zxing: add support for building against zxing-c++ 2.0
    • cerbero: Fix packaging of Rust plugins on Android; fix modern Gentoo distro detection
    • various bug fixes, memory leak fixes, and other stability and reliability improvements

    See the GStreamer 1.22.2 release notes for more details.

    Binaries for Android, iOS, Mac OS X and Windows will be available shortly.

    Release tarballs can be downloaded directly here:

    April 11, 2023 08:00 PM

    March 20, 2023

    Andy Wingoa world to win: webassembly for the rest of us

    (Andy Wingo)

    Good day, comrades!

    Today I'd like to share the good news that WebAssembly is finally coming for the rest of us weirdos.

    A world to win

    WebAssembly for the rest of us

    17 Mar 2023 – BOB 2023

    Andy Wingo

    Igalia, S.L.

    This is a transcript-alike of a talk that I gave last week at BOB 2023, a gathering in Berlin of people that are using "technologies beyond the mainstream" to get things done: Haskell, Clojure, Elixir, and so on. PDF slides here, and I'll link the video too when it becomes available.

    WebAssembly, the story

    WebAssembly is an exciting new universal compute platform

    WebAssembly: what even is it? Not a programming language that you would write software in, but rather a compilation target: a sort of assembly language, if you will.

    WebAssembly, the pitch

    Predictable portable performance

    • Low-level
    • Within 10% of native

    Reliable composition via isolation

    • Modules share nothing by default
    • No nasal demons
    • Memory sandboxing

    Compile your code to WebAssembly for easier distribution and composition

    If you look at what the characteristics of WebAssembly are as an abstract machine, to me there are two main areas in which it is an advance over the alternatives.

    Firstly it's "close to the metal" -- if you compile for example an image-processing library to WebAssembly and run it, you'll get similar performance when compared to compiling it to x86-64 or ARMv8 or what have you. (For image processing in particular, native still generally wins because the SIMD primitives in WebAssembly are more narrow and because getting the image into and out of WebAssembly may imply a copy, but the general point remains.) WebAssembly's instruction set covers a broad range of low-level operations that allows compilers to produce efficient code.

    The novelty here is that WebAssembly is portable while also being successful. We language weirdos know that it's not enough to do something technically better: you have to also succeed in getting traction for your alternative.

    The second interesting characteristic is that WebAssembly is (generally speaking) a principle-of-least-authority architecture: a WebAssembly module starts with access to nothing but itself. Any capabilities that an instance of a module has must be explicitly shared with it by the host at instantiation-time. This is unlike DLLs which have access to all of main memory, or JavaScript libraries which can mutate global objects. This characteristic allows WebAssembly modules to be reliably composed into larger systems.
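    A minimal host-side sketch of that capability passing, assuming a module that imports a single host.log function and exports a main function (the file name and import names are assumptions, not part of any particular module):

    async function run(): Promise<void> {
      const bytes = await fetch('module.wasm').then((r) => r.arrayBuffer());
      // Everything the instance can do to the outside world is listed here.
      const { instance } = await WebAssembly.instantiate(bytes, {
        host: { log: (x: number) => console.log(x) },
      });
      // The DOM, the network, other modules' memories: simply not reachable.
      (instance.exports.main as () => void)();
    }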

    WebAssembly, the hype

    It’s in all browsers! Serve your code to anyone in the world!

    It’s on the edge! Run code from your web site close to your users!

    Compose a library (eg: Expat) into your program (eg: Firefox), without risk!

    It’s the new lightweight virtualization: Wasm is what containers were to VMs! Give me that Kubernetes cash!!!

    Again, the remarkable thing about WebAssembly is that it is succeeding! It's on all of your phones, all your desktop web browsers, all of the content distribution networks, and in some cases it seems set to replace containers in the cloud. Launch the rocket emojis!

    WebAssembly, the reality

    WebAssembly is a weird backend for a C compiler

    Only some source languages are having success on WebAssembly

    What about Haskell, Ocaml, Scheme, F#, and so on – what about us?

    Are we just lazy? (Well...)

    So why aren't we there? Where is Clojure-on-WebAssembly? Where are the F#, the Elixir, the Haskell compilers? Some early efforts exist, but they aren't really succeeding. Why is that? Are we just not putting in the effort? Why is it that Rust gets to ride on the rocket ship but Scheme does not?

    WebAssembly, the reality (2)

    WebAssembly (1.0, 2.0) is not well-suited to garbage-collected languages

    Let’s look into why

    As it turns out, there is a reason that there is no good Scheme implementation on WebAssembly: the initial version of WebAssembly is a terrible target if your language relies on the presence of a garbage collector. There have been some advances but this observation still applies to the current standardized and deployed versions of WebAssembly. To better understand this issue, let's dig into the guts of the system to see what the limitations are.

    GC and WebAssembly 1.0

    Where do garbage-collected values live?

    For WebAssembly 1.0, only possible answer: linear memory

    (module
      (global $hp (mut i32) (i32.const 0))
      (memory $mem 10)) ;; 640 kB

    The primitive that WebAssembly 1.0 gives you to represent your data is what is called linear memory: just a buffer of bytes to which you can read and write. It's pretty much like what you get when compiling natively, except that the memory layout is more simple. You can obtain this memory in units of 64-kilobyte pages. In the example above we're going to request 10 pages, for 640 kB. Should be enough, right? We'll just use it all for the garbage collector, with a bump-pointer allocator. The heap pointer / allocation pointer is kept in the mutable global variable $hp.

    (func $alloc (param $size i32) (result i32)
      (local $ret i32)
      (loop $retry
        (local.set $ret (global.get $hp))
        (global.set $hp
          (i32.add (local.get $size) (local.get $ret)))
    
        (br_if 1 ;; branch to the function exit, returning $ret
          (local.get $ret)
          (i32.lt_u (i32.shr_u (global.get $hp) (i32.const 16))
                    (memory.size)))
    
        (call $gc)
        (br $retry)))
    

    Here's what an allocation function might look like. The allocation function $alloc is like malloc: it takes a number of bytes and returns a pointer. In WebAssembly, a pointer to memory is just an offset, which is a 32-bit integer (i32). (Having the option of a 64-bit address space is planned but not yet standard.)

    If this is your first time seeing the text representation of a WebAssembly function, you're in for a treat, but that's not the point of the presentation :) What I'd like to focus on is the (call $gc) -- what happens when the allocation pointer reaches the end of the region?

    GC and WebAssembly 1.0 (2)

    What hides behind (call $gc) ?

    Ship a GC over linear memory

    Stop-the-world, not parallel, not concurrent

    But... roots.

    The first thing to note is that you have to provide the $gc yourself. Of course, this is doable -- this is what we do when compiling to a native target.

    Unfortunately though the multithreading support in WebAssembly is somewhat underpowered; it lets you share memory and use atomic operations but you have to create the threads outside WebAssembly. In practice probably the GC that you ship will not take advantage of threads and so it will be rather primitive, deferring all collection work to a stop-the-world phase.

    GC and WebAssembly 1.0 (3)

    Live objects are

    • the roots
    • any object referenced by a live object

    Roots are globals and locals in active stack frames

    No way to visit active stack frames

    What's worse though is that you have no access to roots on the stack. A GC has to keep live objects, as defined circularly as any object referenced by a root, or any object referenced by a live object. It starts with the roots: global variables and any GC-managed object referenced by an active stack frame.

    But there we run into problems, because in WebAssembly (any version, not just 1.0) you can't iterate over the stack, so you can't find active stack frames, so you can't find the stack roots. (Sometimes people want to support this as a low-level capability but generally speaking the consensus would appear to be that overall performance will be better if the engine is the one that is responsible for implementing the GC; but that is foreshadowing!)

    GC and WebAssembly 1.0 (3)

    Workarounds

    • handle stack for precise roots
    • spill all possibly-pointer values to linear memory and collect conservatively

    Handle book-keeping a drag for compiled code

    Given the noniterability of the stack, there are basically two work-arounds. One is to have the compiler and run-time maintain an explicit stack of object roots, which the garbage collector can know for sure are pointers. This is nice because it lets you move objects. But, maintaining the stack is overhead; the state of the art solution is rather to create a side table (a "stack map") associating each potential point at which GC can be called with instructions on how to find the roots.
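
    To make that first workaround concrete, here is a minimal sketch of an explicit root stack kept in linear memory. The names ($root-sp, $push-root, $pop-root) and the reserved region are assumptions of mine, not anything from an actual implementation; compiled code would bracket any call that can trigger a collection with a push and a pop:

    ;; Hypothetical root stack, growing down from an assumed reserved
    ;; boundary in linear memory.
    (global $root-sp (mut i32) (i32.const 0x10000))

    (func $push-root (param $obj i32)
      (global.set $root-sp (i32.sub (global.get $root-sp) (i32.const 4)))
      (i32.store (global.get $root-sp) (local.get $obj)))

    (func $pop-root (result i32)
      (local $obj i32)
      (local.set $obj (i32.load (global.get $root-sp)))
      (global.set $root-sp (i32.add (global.get $root-sp) (i32.const 4)))
      (local.get $obj))

    ;; Around any call that can allocate:
    ;;   (call $push-root (local.get $x))
    ;;   ... the call ...
    ;;   (local.set $x (call $pop-root)) ;; reload; $x may have moved

    The collector then scans the words between $root-sp and the top of the reserved region, and because it knows they are all object pointers, it is free to update them if it moves objects.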

    The other workaround is to spill the whole stack to memory. Or, possibly just pointer-like values; anyway, you conservatively scan all words for things that might be roots. But instead of having access to the memory to which the WebAssembly implementation would spill your stack, you have to do it yourself. This can be OK but it's sub-optimal; see my recent post on the Whippet garbage collector for a deeper discussion of the implications of conservative root-finding.

    GC and WebAssembly 1.0 (4)

    Cycles with external objects (e.g. JavaScript) uncollectable

    A pointer to a GC-managed object is an offset to linear memory, need capability over linear memory to read/write object from outside world

    No way to give back memory to the OS

    Gut check: gut says no

    If that were all, it would already be not so great, but it gets worse! Another problem with linear-memory GC is that it limits the potential for composing a number of modules and the host together, because the garbage collector that manages JavaScript objects in a web browser knows nothing about your garbage collector over your linear memory. You can easily create memory leaks in a system like that.

    Also, it's pretty gross that a reference to an object in linear memory requires arbitrary read-write access over all of linear memory in order to read or write object fields. How do you build a reliable system without invariants?

    Finally, once you collect garbage, and maybe you manage to compact memory, you can't give anything back to the OS. There are proposals in the works but they are not there yet.

    If the BOB audience had to choose between Worse is Better and The Right Thing, I think the BOB audience is much closer to the Right Thing. People like that feel instinctual revulsion to ugly systems and I think GC over linear memory describes an ugly system.

    GC and WebAssembly 1.0 (5)

    There is already a high-performance concurrent parallel compacting GC in the browser

    Halftime: C++ N – Altlangs 0

    The kicker is that WebAssembly 1.0 requires you to write and deliver a terrible GC when there is already probably a great GC just sitting there in the host, one that has hundreds of person-years of effort invested in it, one that will surely do a better job than you could ever do. WebAssembly as hosted in a web browser should have access to the browser's garbage collector!

    I have the feeling that while those of us with a soft spot for languages with garbage collection have been standing on the sidelines, Rust and C++ people have been busy on the playing field scoring goals. Tripping over the ball, yes, but eventually they do manage to get within striking distance.

    Change is coming!

    Support for built-in GC set to ship in Q4 2023

    With GC, the material conditions are now in place

    Let’s compile our languages to WebAssembly

    But to continue the sportsball metaphor, I think in the second half our players will finally be able to get out on the pitch and give it the proverbial 110%. Support for garbage collection is coming to WebAssembly users, and I think even by the end of the year it will be shipping in major browsers. This is going to be big! We have a chance and we need to seize it.

    Scheme to Wasm

    Spritely + Igalia working on Scheme to WebAssembly

    Avoid truncating language to platform; bring whole self

    • Value representation
    • Varargs
    • Tail calls
    • Delimited continuations
    • Numeric tower

    Even with GC, though, WebAssembly is still a weird machine. It would help to see the concrete approaches that some languages of interest manage to take when compiling to WebAssembly.

    In that spirit, the rest of this article/presentation is a walkthrough of the approach that I am taking as I work on a WebAssembly compiler for Scheme. (Thanks to Spritely for supporting this work!)

    Before diving in, a meta-note: when you go to compile a language to, say, JavaScript, you are mightily tempted to cut corners. For example you might implement numbers as JavaScript numbers, or you might omit implementing continuations. In this work I am trying to not cut corners, and instead to implement the language faithfully. Sometimes this means I have to work around weirdness in WebAssembly, and that's OK.

    When thinking about Scheme, I'd like to highlight a few specific areas that have interesting translations. We'll start with value representation, which stays in the GC theme from the introduction.

    Scheme to Wasm: Values

    ;;       any  extern  func
    ;;        |
    ;;        eq
    ;;     /  |   \
    ;; i31 struct  array
    

    The unitype: (ref eq)

    Immediate values in (ref i31)

    • fixnums with 30-bit range
    • chars, bools, etc

    Explicit nullability: (ref null eq) vs (ref eq)

    The GC extensions for WebAssembly are phrased in terms of a type system. Oddly, there are three top types; as far as I understand it, this is the result of a compromise about how WebAssembly engines might want to represent these different kinds of values. For example, an opaque JavaScript value flowing into a WebAssembly program would have type (ref extern). On a system with NaN boxing, you would need 64 bits to represent a JS value. On the other hand a native WebAssembly object would be a subtype of (ref any), and might be representable in 32 bits, either because it's a 32-bit system or because of pointer compression.

    Anyway, three top types. The user can define subtypes of struct and array, instantiate values of those types, and access their fields. The life cycle of reference-typed objects is automatically managed by the run-time, which is just another way of saying they are garbage-collected.

    For Scheme, we need a common supertype for all values: the unitype, in Bob Harper's memorable formulation. We can use (ref any), but actually we'll use (ref eq) -- this is the supertype of values that can be compared by (pointer) identity. So now we can code up eq?:

    (func $eq? (param $a (ref eq)) (param $b (ref eq))
               (result i32)
      (ref.eq (local.get $a) (local.get $b)))
    

    Generally speaking in a Scheme implementation there are immediates and heap objects. Immediates can be encoded in the bits of a value, whereas for heap objects the bits of a value encode a reference (pointer) to an object on the garbage-collected heap. We usually represent small integers as immediates, as well as booleans and other oddball values.

    Happily, WebAssembly gives us an immediate value type, i31. We'll encode our immediates there, and otherwise represent heap objects as instances of struct subtypes.
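
    As a concrete sketch of one possible encoding -- my own tagging choice here, not necessarily the one Guile will use -- we could reserve the low bit of the i31 payload to distinguish fixnums from the other immediates, leaving 30 bits of fixnum range:

    ;; Hypothetical immediate tagging: low payload bit 0 for fixnums,
    ;; low payload bit 1 for chars, booleans, and other oddballs.
    (func $make-fixnum (param $n i32) (result (ref i31))
      (i31.new (i32.shl (local.get $n) (i32.const 1))))

    (func $fixnum-value (param $v (ref i31)) (result i32)
      (i32.shr_s (i31.get_s (local.get $v)) (i32.const 1)))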

    Scheme to Wasm: Values (2)

    Heap objects subtypes of struct; concretely:

    (type $heap-object
      (struct (field $tag-and-hash i32)))
    (type $pair
      (sub $heap-object
        (struct (field i32) (field (ref eq)) (field (ref eq)))))
    

    GC proposal allows subtyping on structs, functions, arrays

    Structural type equivalence: explicit tag useful

    We actually need to have a common struct supertype as well, for two reasons. One is that we need to be able to hash Scheme values by identity, but for this we need an embedded lazily-initialized hash code. It's a bit annoying to take the per-object memory hit but it's a reality, and the JVM does it this way, so it must not be so terrible.

    The other reason is more subtle: WebAssembly's type system is built in such a way that types that are "structurally" equivalent are indistinguishable. So a pair has two fields, besides the hash, but there might be a number of other fundamental object types that have the same shape; you can't fully rely on WebAssembly's dynamic type checks (ref.test et al) to be able to query the type of a value. Instead we re-use the low bits of the hash word to include a type tag, which might be 1 for pairs, 2 for vectors, 3 for closures, and so on.

    Scheme to Wasm: Values (3)

    (func $cons (param (ref eq)
                       (ref eq))
                (result (ref $pair))
      (struct.new_canon $pair
        ;; Assume heap tag for pairs is 1.
        (i32.const 1)
        ;; Car and cdr.
        (local.get 0)
        (local.get 1)))
    
    (func $%car (param (ref $pair))
                (result (ref eq))
      (struct.get $pair 1 (local.get 0)))
    

    With this knowledge we can define cons, as a simple call to struct.new_canon pair.

    I didn't have time for this in the talk, but there is a ghost haunting this code: the ghost of nominal typing. See, in a web browser at least, every heap object will have its first word point to its "hidden class" / "structure" / "map" word. If the engine ever needs to check that a value is of a specific shape, it can do a quick check on the map word's value; if it needs to do deeper introspection, it can dereference that word to get more details.

    Under the hood, testing whether a (ref eq) is a pair or not should be a simple check that it's a (ref struct) (and not a fixnum), and then a comparison of its map word to the run-time type corresponding to $pair. If subtyping of $pair is allowed, we start to want inline caches to handle polymorphism, but checking the map word is still the basic mechanism.

    However, as I mentioned, we only have structural equality of types; two (struct (ref eq)) type definitions will define the same type and have the same map word (run-time type / RTT). Hence the _canon in the name of struct.new_canon $pair: we create an instance of $pair, with the canonical run-time-type for objects having $pair-shape.

    In earlier drafts of the WebAssembly GC extensions, users could define their own RTTs, which effectively amounts to nominal typing: not only does this object have the right structure, but it was also created with respect to this particular RTT. But this facility was cut from the first release, and it left ghosts in the form of these _canon suffixes on type constructor instructions.

    For the Scheme-to-WebAssembly effort, we effectively add back in a degree of nominal typing via type tags. For better or for worse this results in a so-called "open-world" system: you can instantiate a separately-compiled WebAssembly module that happens to define the same types and use the same type tags and it will be able to happily access the contents of Scheme values from another module. If you were to use nominal types, you wouldn't be able to do so, unless there were some common base module that defined and exported the types of interest, and which any extension module would need to import.

    (func $car (param (ref eq)) (result (ref eq))
      (local (ref $pair))
      (block $not-pair
        (br_if $not-pair
          (i32.eqz (ref.test $pair (local.get 0))))
        (local.set 1 (ref.cast $pair (local.get 0)))
        (br_if $not-pair
          (i32.ne
            (i32.const 1)
            (i32.and
              (i32.const 0xff)
              (struct.get $heap-object 0 (local.get 1)))))
        (return_call $%car (local.get 1)))
    
      (call $type-error)
      (unreachable))
    

    In the previous example we had $%car, with a funny % in the name, taking a (ref $pair) as an argument. But in the general case (barring compiler heroics) car will take an instance of the unitype (ref eq). To know that it's actually a pair we have to make two checks: one, that it is a struct and has the $pair shape, and two, that it has the right tag. Oh well!

    Scheme to Wasm

    • Value representation
    • Varargs
    • Tail calls
    • Delimited continuations
    • Numeric tower

    But with all of that I think we have a solid story on how to represent values. I went through all of the basic value types in Guile and checked that they could all be represented using GC types, and it seems that all is good. Now on to the next point: varargs.

    Scheme to Wasm: Varargs

    (list 'hey)      ;; => (hey)
    (list 'hey 'bob) ;; => (hey bob)

    Problem: Wasm functions strongly typed

    (func $list (param ???) (result (ref eq))
      ???)

    Solution: Virtualize calling convention

    In WebAssembly, you define functions with a type, and it is impossible to call them in an unsound way. You must call $car with exactly 2 arguments or it will not compile, and those arguments have to be of specific types, and so on. But Scheme doesn't enforce these restrictions on the language level, bless its little miscreant heart. You can call car with 5 arguments, and you'll get a run-time error. There are some functions that can take a variable number of arguments, doing different things depending on incoming argument count.

    How do we square these two approaches to function types?

    ;; "Registers" for args 0 to 3
    (global $arg0 (mut (ref eq)) (i31.new (i32.const 0)))
    (global $arg1 (mut (ref eq)) (i31.new (i32.const 0)))
    (global $arg2 (mut (ref eq)) (i31.new (i32.const 0)))
    (global $arg3 (mut (ref eq)) (i31.new (i32.const 0)))
    
    ;; "Memory" for the rest
    (type $argv (array (mut (ref eq))))
    (global $argN (ref $argv)
            (array.new_canon $argv
              (i31.new (i32.const 0)) (i32.const 42)))

    Uniform function type: argument count as sole parameter

    Callee moves args to locals, possibly clearing roots

    The approach we are taking is to virtualize the calling convention. In the same way that when calling an x86-64 function, you pass the first argument in $rdi, then $rsi, and eventually if you run out of registers you put arguments in memory, in the same way we'll pass the first argument in the $arg0 global, then $arg1, and eventually in memory if needed. The function will receive the number of incoming arguments as its sole parameter; in fact, all functions will be of type (func (param i32)).

    The expectation is that after checking the argument count, the callee will load its arguments from globals / memory into locals, which the compiler can optimize much better than globals. We might not even emit code to null out the argument globals; that might leak a little memory, but it would probably be a win.
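
    As a sketch of what a compiled entry point for a two-argument procedure might look like under this convention, here is a hypothetical trampoline for cons, delegating to the typed $cons from before; I'm assuming for illustration that a single return value travels back through $arg0 as well:

    (func $cons-entry (param $nargs i32)
      ;; Check the incoming argument count.
      (if (i32.ne (local.get $nargs) (i32.const 2))
          (then (unreachable))) ;; really: signal a wrong-number-of-args error
      ;; Fetch the arguments from the virtual registers, do the work, and
      ;; leave the result where the caller expects it.
      (global.set $arg0
        (call $cons (global.get $arg0) (global.get $arg1))))

    ;; A call site for (cons x y) would then look something like:
    ;;   (global.set $arg0 (local.get $x))
    ;;   (global.set $arg1 (local.get $y))
    ;;   (call $cons-entry (i32.const 2))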

    You can imagine a world in which $arg0 actually gets globally allocated to $rdi, because it is only live during the call sequence; but I don't think that world is this one :)

    Scheme to Wasm

    • Value representation
    • Varargs
    • Tail calls
    • Delimited continuations
    • Numeric tower

    Great, two points out of the way! Next up, tail calls.

    Scheme to Wasm: Tail calls

    ;; Call known function
    (return_call $f arg ...)
    
    ;; Call function by value
    (return_call_ref $type callee arg ...)

    Friends -- I almost cried making this slide. We Schemers are used to working around the lack of tail calls, and I could have done so here, but it's just such a relief that these functions are just going to be there and I don't have to think much more about them. Technically speaking the proposal isn't merged yet; checking the phases document it's at the last station before heading to the great depot in the sky. But, soon soon it will be present and enabled in all WebAssembly implementations, and we should build systems now that rely on it.

    Scheme to Wasm

    • Value representation
    • Varargs
    • Tail calls
    • Delimited continuations
    • Numeric tower

    Next up, my favorite favorite topic: delimited continuations.

    Scheme to Wasm: Prompts (1)

    Problem: Lightweight threads/fibers, exceptions

    Possible solutions

    • Eventually, built-in coroutines
    • binaryen’s asyncify (not yet ready for GC); see Julia
    • Delimited continuations

    “Bring your whole self”

    Before diving in though, one might wonder why bother. Delimited continuations are a building-block that one can use to build other, more useful things, notably exceptions and light-weight threading / fibers. Could there be another way of achieving these end goals without having to implement this relatively uncommon primitive?

    For fibers, it is possible to implement them in terms of a built-in coroutine facility. The standards body seems willing to include a coroutine primitive, but it seems far off to me; not within the next 3-4 years I would say. So let's put that to one side.

    There is a more near-term solution, to use asyncify to implement coroutines somehow; but my understanding is that asyncify is not ready for GC yet.

    For the Guile flavor of Scheme at least, delimited continuations are table stakes in their own right, so given that we will have them on WebAssembly, we might as well use them to implement fibers and exceptions in the same way as we do on native targets. Why compromise if you don't have to?

    Scheme to Wasm: Prompts (2)

    Prompts delimit continuations

    (define k
      (call-with-prompt 'foo
        ; body
        (lambda ()
          (+ 34 (abort-to-prompt 'foo)))
        ; handler
        (lambda (continuation)
          continuation)))
    
    (k 10)       ;; ⇒ 44
    (- (k 10) 2) ;; ⇒ 42

    k is the _ in (lambda () (+ 34 _))

    There are a few ways to implement delimited continuations, but my usual way of thinking about them is that a delimited continuation is a slice of the stack. One end of the slice is the prompt established by call-with-prompt, and the other by the continuation of the call to abort-to-prompt. Capturing a slice pops it off the stack, copying it out to the heap as a callable function. Calling that function splats the captured slice back on the stack and resumes it where it left off.

    Scheme to Wasm: Prompts (3)

    Delimited continuations are stack slices

    Make stack explicit via minimal continuation-passing-style conversion

    • Turn all calls into tail calls
    • Allocate return continuations on explicit stack
    • Breaks functions into pieces at non-tail calls

    This low-level intuition of what a delimited continuation is leads naturally to an implementation; the only problem is that we can't slice the WebAssembly call stack. The workaround here is similar to the varargs case: we virtualize the stack.

    The mechanism to do so is a continuation-passing-style (CPS) transformation of each function. Functions that make no calls, such as leaf functions, don't need to change at all. The same goes for functions that make only tail calls. For functions that make non-tail calls, we split them into pieces that preserve the only-tail-calls property.

    Scheme to Wasm: Prompts (4)

    Before a non-tail-call:

    • Push live-out vars on stacks (one stack per top type)
    • Push continuation as funcref
    • Tail-call callee

    Return from call via pop and tail call:

    (return_call_ref (call $pop-return)
                     (i32.const 0))

    After return, continuation pops state from stacks

    Consider a simple function:

    (define (f x y)
      (+ x (g y)))
    

    Before making a non-tail call, a "tailified" function will instead push all live data onto an explicitly-managed stack and tail-call the callee. It also pushes on the return continuation. Returning from the callee pops the return continuation and tail-calls it. The return continuation pops the previously-saved live data and continues.

    In this concrete case, tailification would split f into two pieces:

    (define (f x y)
      (push! x)
      (push-return! f-return-continuation-0)
      (g y))
    
    (define (f-return-continuation-0 g-of-y)
      (define k (pop-return!))
      (define x (pop!))
      (k (+ x g-of-y)))
    

    Now there are no non-tail calls, besides calls to run-time routines like push! and + and so on. This transformation is implemented by tailify.scm.

    Scheme to Wasm: Prompts (5)

    abort-to-prompt:

    • Pop stack slice to reified continuation object
    • Tail-call new top of stack: prompt handler

    Calling a reified continuation:

    • Push stack slice
    • Tail-call new top of stack

    No need to wait for effect handlers proposal; you can have it all now!

    The salient point is that the stacks on which push! operates (in reality, probably four or five stacks: one in linear memory or an array for types like i32 or f64, one for each of the managed top types any, extern, and func, and one for the stack of return continuations) are managed by us, so we can slice them.
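
    As a sketch of one of these explicitly-managed stacks -- the names and the fixed size are mine, and a real implementation would grow the array on overflow -- here is the stack of return continuations, as an array of references to the uniform function type:

    (type $kvarargs (func (param i32)))   ;; the uniform function type
    (type $return-stack (array (mut (ref null $kvarargs))))

    (global $ret-stack (ref $return-stack)
      (array.new_canon_default $return-stack (i32.const 1024)))
    (global $ret-sp (mut i32) (i32.const 0))

    (func $push-return! (param $k (ref $kvarargs))
      (array.set $return-stack (global.get $ret-stack)
                 (global.get $ret-sp) (local.get $k))
      (global.set $ret-sp (i32.add (global.get $ret-sp) (i32.const 1))))

    (func $pop-return! (result (ref $kvarargs))
      (global.set $ret-sp (i32.sub (global.get $ret-sp) (i32.const 1)))
      (ref.as_non_null
        (array.get $return-stack (global.get $ret-stack)
                   (global.get $ret-sp))))

    Capturing a continuation then amounts to copying the live segments of these stacks out to a heap object; reinstating one copies them back and resets the stack pointers.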

    Someone asked in the talk about whether the explicit memory traffic and avoiding the return-address-buffer branch prediction is a source of inefficiency in the transformation and I have to say, yes, but I don't know by how much. I guess we'll find out soon.

    Scheme to Wasm

    • Value representation
    • Varargs
    • Tail calls
    • Delimited continuations
    • Numeric tower

    Okeydokes, last point!

    Scheme to Wasm: Numbers

    Numbers can be immediate: fixnums

    Or on the heap: bignums, fractions, flonums, complex

    Supertype is still ref eq

    Consider imports to implement bignums

    • On web: BigInt
    • On edge: Wasm support module (mini-gmp?)

    Dynamic dispatch for polymorphic ops, as usual

    First, I would note that sometimes the compiler can unbox numeric operations. For example if it infers that a result will be an inexact real, it can use unboxed f64 instead of library routines working on heap flonums ((struct i32 f64); the initial i32 is for the hash and tag). But we still need a story for the general case that involves dynamic type checks.

    The basic idea is that we get to have fixnums and heap numbers. Fixnums will handle most of the integer arithmetic that we need, and will avoid allocation. We'll inline most fixnum operations as a fast path and call out to library routines otherwise. Of course fixnum inputs may produce a bignum output as well, so the fast path sometimes includes another slow-path callout.
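
    As a sketch of such a fast path, here is what an inlined binary + might look like, reusing the low-bit-0 fixnum tagging I sketched earlier and calling out to a hypothetical $slow-add library routine for everything else; a real version would also distinguish fixnums from other i31 immediates rather than just testing for i31:

    (func $fixnum-add (param $a (ref eq)) (param $b (ref eq))
                      (result (ref eq))
      (local $sum i32)
      (block $slow
        (br_if $slow (i32.eqz (ref.test i31 (local.get $a))))
        (br_if $slow (i32.eqz (ref.test i31 (local.get $b))))
        ;; Tagged fixnums add directly; the sum is only a valid fixnum
        ;; if it still fits in the 31-bit payload.
        (local.set $sum
          (i32.add (i31.get_s (ref.cast i31 (local.get $a)))
                   (i31.get_s (ref.cast i31 (local.get $b)))))
        (br_if $slow
          (i32.ne (local.get $sum)
                  (i32.shr_s (i32.shl (local.get $sum) (i32.const 1))
                             (i32.const 1))))
        (return (i31.new (local.get $sum))))
      (return_call $slow-add (local.get $a) (local.get $b)))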

    We want to minimize binary module size. In an ideal compile-to-WebAssembly situation, a small program will have a small module size, down to a minimum of a kilobyte or so; larger programs can be megabytes, if the user experience allows for the download delay. Binary module size will be dominated by code, so that means we need to plan for aggressive dead-code elimination, minimize the size of fast paths, and also minimize the size of the standard library.

    For numbers, we try to keep module size down by leaning on the platform. In the case of bignums, we can punt some of this work to the host; on a JavaScript host, we would use BigInt, and on a WASI host we'd compile an external bignum library. So that's the general story: inlined fixnum fast paths with dynamic checks, and otherwise library routine callouts, combined with aggressive whole-program dead-code elimination.

    Scheme to Wasm

    • Value representation
    • Varargs
    • Tail calls
    • Delimited continuations
    • Numeric tower

    Hey I think we did it! Always before when I thought about compiling Scheme or Guile to the web, I got stuck on some point or another, was tempted down the corner-cutting alleys, and eventually gave up before starting. But finally it would seem that the stars are aligned: we get to have our Scheme and run it too.

    Miscellanea

    Debugging: The wild west of DWARF; prompts

    Strings: stringref host strings spark joy

    JS interop: Export accessors; Wasm objects opaque to JS. externref.

    JIT: A whole ’nother talk!

    AOT: wasm2c

    Of course, like I said, WebAssembly is still a weird machine: as a compilation target but also at run-time. Debugging is a right proper mess; perhaps some other article on that some time.

    How to represent strings is a surprisingly gnarly question; there is tension within the WebAssembly standards community between those that think that it's possible for JavaScript and WebAssembly to share an underlying string representation, and those that think that it's a fool's errand and that copying is the only way to go. I don't know which side will prevail; perhaps more on that as well later on.

    Similarly the whole interoperation with JavaScript question is very much in its early stages, with the current situation choosing to err on the side of nothing rather than the wrong thing. You can pass a WebAssembly (ref eq) to JavaScript, but JavaScript can't do anything with it: it has no prototype. The state of the art is to also ship a JS run-time that wraps each wasm object, proxying exported functions from the wasm module as object methods.

    Finally, some language implementations really need JIT support, like PyPy. There, that's a whole 'nother talk!

    WebAssembly for the rest of us

    With GC, WebAssembly is now ready for us

    Getting our languages on WebAssembly now a S.M.O.P.

    Let’s score some goals in the second half!

    (visit-links
     "gitlab.com/spritely/guile-hoot-updates"
     "wingolog.org"
     "wingo@igalia.com"
     "igalia.com"
     "mastodon.social/@wingo")

    WebAssembly has proven to have some great wins for C, C++, Rust, and so on -- but now it's our turn to get in the game. GC is coming and we as a community need to be getting our compilers and language run-times ready. Let's put on the coffee and bang some bytes together; it's still early days and there's a world to win out there for the language community with the best WebAssembly experience. The game is afoot: happy consing!

    by Andy Wingo at March 20, 2023 09:06 AM

    March 14, 2023

    Víctor JáquezReview of Igalia Multimedia activities (2022)

    We, Igalia’s multimedia team, would like to share with you our list of achievements from 2022.

    WebKit Multimedia

    WebRTC

    Phil already wrote a first blog post, of a series, on this regard: WebRTC in WebKitGTK and WPE, status updates, part I. Please, be sure to give it a glance, it has nice videos.

    Long story short, last year we started to support Media Capture and Streams in WebKitGTK and WPE using GStreamer, covering input devices (camera and microphone), desktop sharing, web audio, and web canvas. But this is just the first step. We are currently working on RTCPeerConnection, also using GStreamer, to share all these captured streams with other web peers. Meanwhile, we’ll wait for the second episode of Phil’s series 🙂

    MediaRecorder

    We worked on an initial implementation of MediaRecorder with GStreamer (1.20 or newer). The specification is about allowing a web browser to record a selected stream: for example, a voice-memo or video application which could encode and upload a capture of your microphone / camera.

    Gamepad

    While WebKitGTK already has Gamepad support, WPE lacked it. We did the implementation last year, and there’s a blog post about it: Gamepad in WPEWebkit, with a video showing a demo of it.

    Capture encoded video streams from webcams

    Some webcams only provide high-resolution frames encoded in H.264 or similar formats. In order to support those resolutions with those webcams we added support for negotiating such formats and decoding them internally to handle the streams, though we are just at the beginning of more efficient support.

    Flatpak SDK maintenance

    A lot of effort went into maintaining the Flatpak SDK for WebKit. It is a set of runtimes that allows for reproducible builds of WebKit, independent of the Linux distribution in use. Nowadays the Flatpak SDK is used in WebKit’s EWS, and by many developers.

    Among the features added during the year we can highlight Rust support, a full integrity check before upgrading, and a way to override dependencies with local projects.

    MSE/EME enhancements

    As every year, massive work was done in the WebKit ports using GStreamer for Media Source Extensions and Encrypted Media Extensions, improving the user experience with different streaming services on the Web, such as Odysee, Amazon, DAZN, etc.

    In the case of encrypted media, GStreamer-based WebKit ports provide the stubs to communicate with an external Content Decryption Module (CDM). If you’re willing to support this in your platform, you can reach us.

    We also worked on a video demo showing how MSE/EME works on a Raspberry Pi 3 using WPE:

    WebAudio demo

    We also spent time recording video demos, such as this one, showing WebAudio using WPE on a desktop computer.

    GStreamer

    We managed to merge a lot of bug fixes in GStreamer, which in many cases can be harder to solve than implementing new features, though new features are more interesting to talk about, such as those related to making Rust a main development language for GStreamer besides C.

    Rust bindings and GStreamer elements for Vonage Video API / OpenTok

    OpenTok is the legacy name of Vonage Video API, and is a PaaS (Platform As a Service) to ease the development and deployment of WebRTC services and applications.

    We published on GitHub our Rust bindings for both the Client SDK for Linux and the Server SDK’s REST API, along with a GStreamer plugin to publish and subscribe to video and audio streams.

    GstWebRTCSrc

    In the beginning there was webrtcbin, an element that implements the majority of the W3C RTCPeerConnection API. It’s so flexible and powerful that it’s rather hard to use for the most common cases. Then appeared webrtcsink, a wrapper of webrtcbin, written in Rust, which receives GStreamer streams to be offered and streamed to web peers. Later on, we developed webrtcsrc, the webrtcsink counterpart: an element whose source pads push streams from web peers, such as another browser, forwarding those Web streams as GStreamer ones in a pipeline. Both webrtcsink and webrtcsrc are written in Rust.

    Behavior-Driven Development test framework for GStreamer

    Behavior-Driven Development is gaining relevance with tools like Cucumber for Java and its domain-specific language, Gherkin, for defining software behaviors. Rustaceans have picked up these ideas and developed cucumber-rs. The logical consequence was obvious: why not GStreamer?

    Last year we tinkered with GStreamer-Cucumber, a BDD framework to define behavior tests for GStreamer pipelines.

    GstValidate Rust bindings

    There has been some discussion about whether BDD is the best way to test GStreamer pipelines; there is also GstValidate, and last year we added its Rust bindings.

    GStreamer Editing Services

    Not everything was Rust, though: we also worked hard on GStreamer’s nuts and bolts.

    Last year, we gathered the team to hack GStreamer Editing Services, particularly to explore adding OpenGL and DMABuf support, such as downloading or uploading a texture before processing, and selecting a proper filter to avoid those transfers.

    GstVA and GStreamer-VAAPI

    We helped in the maintenance of GStreamer-VAAPI and the development of its near replacement, GstVA, adding new elements such as the H.264 encoder, the compositor, and the JPEG decoder. We also participated in the discussion and code review of negotiating DMABuf streams in the pipeline.

    Vulkan decoder and parser library for CTS

    You might have heard that Vulkan has now integrated video decoding into its API, while encoding is currently a work in progress. We devoted time to helping Khronos with the Vulkan Video Conformance Tests (CTS), particularly with a parser based on GStreamer, and to developing an H.264 decoder in GStreamer using the Vulkan Video API.

    You can check out the presentation we gave at the last Vulkanised.

    WPE Android Experiment

    In a joint adventure with Igalia’s WebKit team we did some experiments to port WPE to Android. This is just an internal proof of concept so far, but we are looking forward to seeing how this will evolve in the future, and what new possibilities it might open up.

    If you have any questions about WebKit, GStreamer, Linux video stack, compilers, etc., please contact us.

    by vjaquez at March 14, 2023 10:45 AM

    March 10, 2023

    Andy Wingopre-initialization of garbage-collected webassembly heaps

    (Andy Wingo)

    Hey comrades, I just had an idea that I won't be able to work on in the next couple months and wanted to release it into the wild. They say if you love your ideas, you should let them go and see if they come back to you, right? In that spirit I abandon this idea to the woods.

    Basically the idea is Wizer-like pre-initialization of WebAssembly modules, but for modules that store their data on the GC-managed heap instead of just in linear memory.

    Say you have a WebAssembly module with GC types. It might look like this:

    (module
      (type $t0 (struct (ref eq)))
      (type $t1 (struct (ref $t0) i32))
      (type $t2 (array (mut (ref $t1))))
      ...
      (global $g0 (ref null eq)
        (ref.null eq))
      (global $g1 (ref $t1)
        (struct.new_canon $t1
          (struct.new_canon $t0 (i31.new (i32.const 42)))
          (i32.const 0)))
      ...
      (func $f0 ...)
      ...)
    

    You define some struct and array types, there are some global variables, and some functions to actually do the work. (There are probably also tables and other things but I am simplifying.)

    If you consider the object graph of an instantiated module, you will have some set of roots R that point to GC-managed objects. The live objects in the heap are the roots and any object referenced by a live object.

    Let us assume a standalone WebAssembly module. In that case the set of types T of all objects in the heap is closed: it can only be one of the types $t0, $t1, and so on that are defined in the module. These types have a partial order and can thus be sorted from most to least specific. Let's assume that this sort order is just the reverse of the definition order, for now. Therefore we can write a general type introspection function for any object in the graph:

    (func $introspect (param $obj anyref)
      (block $L2 (result (ref $t2))
        (block $L1 (result (ref $t1))
          (block $L0 (result (ref $t0))
            (br_on_cast $L2 $t2 (local.get $obj))
            (br_on_cast $L1 $t1 (local.get $obj))
            (br_on_cast $L0 $t0 (local.get $obj))
            (unreachable))
          ;; Do $t0 things...
          (return))
        ;; Do $t1 things...
        (return))
      ;; Do $t2 things...
      (return))
    

    In particular, given a WebAssembly module, we can generate a function to trace edges in an object graph of its types. Using this, we can identify all live objects, and what's more, we can take a snapshot of those objects:

    (func $snapshot (result (ref (array (mut anyref))))
      ;; Start from roots, use introspect to find concrete types
      ;; and trace edges, use a worklist, return an array of
      ;; all live objects in topological sort order
      )
    

    Having a heap snapshot is interesting for introspection purposes, but my interest is in having fast start-up. Many programs have a kind of "initialization" phase where they get the system up and running, and only then proceed to actually work on the problem at hand. For example, when you run python3 foo.py, Python will first spend some time parsing and byte-compiling foo.py, importing the modules it uses and so on, and then will actually run foo.py's code. Wizer lets you snapshot the state of a module after initialization but before the real work begins, which can save on startup time.

    For a GC heap, we actually have similar possibilities, but the mechanism is different. Instead of generating an array of all live objects, we could generate a serialized state of the heap as bytecode, and another function to read the bytecode and reload the heap:

    (func $pickle (result (ref (array (mut i8))))
      ;; Return an array of bytecode which, when interpreted,
      ;; can reconstruct the object graph and set the roots
      )
    (func $unpickle (param (ref (array (mut i8))))
      ;; Interpret the bytecode, building object graph in
      ;; topological order
      )
    

    The unpickler is module-dependent: it will need one case to construct each concrete type $tN in the module. Therefore the bytecode grammar would be module-dependent too.

    What you would get with a bytecode-based $pickle/$unpickle pair would be the ability to serialize and reload heap state many times. But for the pre-initialization case, probably that's not precisely what you want: you want to residualize a new WebAssembly module that, when loaded, will rehydrate the heap. In that case you want a function like:

    (func $make-init (result (ref (array (mut i8))))
      ;; Return an array of WebAssembly code which, when
      ;; added to the module as a function and invoked, 
      ;; can reconstruct the object graph and set the roots.
      )
    

    Then you would use binary tools to add that newly generated function to the module.

    In short, there is a space open for a tool which takes a WebAssembly+GC module M and produces M', a module which contains a $make-init function. Then you use a WebAssembly+GC host to load the module and call the $make-init function, resulting in a WebAssembly function $init which you then patch in to the original M to make M'', which is M pre-initialized for a given task.

    Optimizations

    Some of the object graph is constant; for example, an instance of a struct type that has no mutable fields. These objects don't have to be created in the init function; they can be declared as new constant global variables, which an engine may be able to initialize more efficiently.
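
    For example, an immutable instance of $t0 that the snapshot finds reachable could be residualized directly as a constant global in the pre-initialized module, rather than rebuilt by the generated init function (a sketch, reusing the types from above):

    (global $const-42 (ref $t0)
      (struct.new_canon $t0 (i31.new (i32.const 42))))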

    The pre-initialized module will still have an initialization phase in which it builds the heap. This is a constant function and it would be nice to avoid it. Some WebAssembly hosts will be able to run pre-initialization and then snapshot the GC heap using lower-level facilities (copy-on-write mappings, pointer compression and relocatable cages, pre-initialization on an internal level...). This would potentially decrease latency and may allow for cross-instance memory sharing.

    Limitations

    There are five preconditions to be able to pickle and unpickle the GC heap:

    1. The set of concrete types in a module must be closed.

    2. The roots of the GC graph must be enumerable.

    3. The object-graph edges from each live object must be enumerable.

    4. To prevent cycles, we have to know when an object has been visited: objects must have identity.

    5. We must be able to create each type in a module.

    I think there are three limitations to this pre-initialization idea in practice.

    One is externref; these values come from the host and are by definition not introspectable by WebAssembly. Let's keep the closed-world assumption and consider the case where the set of external reference types is closed also. In that case if a module allows for external references, we can perhaps make its pickling routines call out to the host to (2) provide any external roots, (3) identify edges on externref values, (4) compare externref values for identity, and (5) indicate some imported functions which can be called to re-create external objects.

    Another limitation is funcref. In practice in the current state of WebAssembly and GC, you will only have a funcref which is created by ref.func, and which (3) therefore has no edges and (5) can be re-created by ref.func. However, neither WebAssembly nor the JS API has a way of knowing which function index corresponds to a given funcref. Including function references in the graph would therefore require some sort of host-specific API. Relatedly, function references are not comparable for equality (func is not a subtype of eq), which is a little annoying but not so bad considering that function references can't participate in a cycle. Perhaps a solution though would be to assume (!) that the host representation of a funcref is constant: the JavaScript (e.g.) representations of (ref.func 0) and (ref.func 0) are the same value (in terms of ===). Then you could compare a given function reference against a set of known values to determine its index. Note, when function references are expanded to include closures, we will have more problems in this area.

    Finally, there is the question of roots. Given a module, we can generate a function to read the values of all reference-typed globals and of all entries in all tables. What we can't get at are any references from the stack, so our object graph may be incomplete. Perhaps this is not a problem though, because when we unpickle the graph we won't be able to re-create the stack anyway.
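
    Concretely, the root-visiting function that such a tool could generate for the example module above might look something like this; the $visitor type and the names are assumptions of mine:

    (type $visitor (func (param (ref null eq))))

    (func $visit-roots (param $visit (ref $visitor))
      ;; One call per reference-typed global; a real version would also
      ;; visit each entry of each reference-typed table.
      (call_ref $visitor (global.get $g0) (local.get $visit))
      (call_ref $visitor (global.get $g1) (local.get $visit)))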

    OK, that's my idea. Have at it, hackers!

    by Andy Wingo at March 10, 2023 09:20 AM

    March 07, 2023

    Robert McQueenFlathub in 2023

    (Robert McQueen)

    It’s been quite a few months since the most recent updates about Flathub last year. We’ve been busy behind the scenes, so I’d like to share what we’ve been up to at Flathub and why—and what’s coming up from us this year. I want to focus on:

    • Where Flathub is today as a strong ecosystem with 2,000 apps
    • Our progress on evolving Flathub from a build service to an app store
    • The economic barrier to growing the ecosystem, and its consequences
    • What’s next to overcome our challenges with focused initiatives

    Today

    Flathub is going strong: we offer 2,000 apps from over 1,500 collaborators on GitHub. We’re averaging 700,000 app downloads a day, with 898 million HTTP requests totalling 88.3 TB served by our CDN each day (thank you Fastly!). Flatpak has, in my opinion, solved the largest technical issue which has held back the mainstream growth and acceptance of Linux on the desktop (or other personal computing devices) for the past 25 years: namely, the difficulty for app developers to publish their work in a way that makes it easy for people to discover, download (or sideload, for people in challenging connectivity environments), install and use. Flathub builds on that to help users discover the work of app developers and helps that work reach users in a timely manner.

    Initial results of this disintermediation are promising: even with its modest size so far, Flathub has hundreds of apps that I have never, ever heard of before—and that’s even considering I’ve been working in the Linux desktop space for nearly 20 years and spent many of those staring at the contents of dselect (showing my age a little) or GNOME Software, attending conferences, and reading blog posts, news articles, and forums. I am also heartened to see that many of our OS distributor partners have recognised that this model is hugely complementary and additive to the indispensable work they are doing to bring the Linux desktop to end users, and that “having more apps available to your users” is a value-add allowing you to focus on your core offering and not a zero-sum game that should motivate infighting.

    Ongoing Progress

    Getting Flathub into its current state has been a long ongoing process. Here’s what we’ve been up to behind the scenes:

    Development

    Last year, we concluded our first engagement with Codethink to build features into the Flathub web app to move from a build service to an app store. That includes accounts for users and developers, payment processing via Stripe, and the ability for developers to manage upload tokens for the apps they control. In parallel, James Westman has been working on app verification and the corresponding features in flat-manager to ensure app metadata accurately reflects verification and pricing, and to provide authentication for paying users for app downloads when the developer enables it. Only verified developers will be able to make direct uploads or access payment settings for their apps.

    Legal

    So far, the GNOME Foundation has acted as an incubator and legal host for Flathub even though it’s not purely a GNOME product or initiative. Distributing software to end users along with processing and forwarding payments and donations also has a different legal profile in terms of risk exposure and nonprofit compliance than the current activities of the GNOME Foundation. Consequently, we plan to establish an independent legal entity to own and operate Flathub which reduces risk for the GNOME Foundation, better reflects the independent and cross-desktop interests of Flathub, and provides flexibility in the future should we need to change the structure.

    We’re currently in the process of reviewing legal advice to ensure we have the right structure in place before moving forward.

    Governance

    As Flathub is something we want to set outside of the existing Linux desktop and distribution space—and ensure we represent and serve the widest community of Linux users and developers—we’ve been working on a governance model that ensures that there is transparency and trust in who is making decisions, and why. We have set up a working group with myself and Martín Abente Lahaye from GNOME, Aleix Pol Gonzalez, Neofytos Kolokotronis, and Timothée Ravier from KDE, and Jorge Castro flying the flag for the Flathub community. Thanks also to Neil McGovern and Nick Richards who were also more involved in the process earlier on.

    We don’t want to get held up here creating something complex with memberships and elections, so at first we’re going to come up with a simple/balanced way to appoint people into a board that makes key decisions about Flathub and iterate from there.

    Funding

    We have received one grant for 2023 of $100K from Endless Network which will go towards the infrastructure, legal, and operations costs of running Flathub and setting up the structure described above. (Full disclosure: Endless Network is the umbrella organisation which also funds my employer, Endless OS Foundation.) I am hoping to grow the available funding to $250K for this year in order to cover the next round of development on the software, prepare for higher operations costs (e.g., accounting gets more complex), and bring in a second full-time staff member in addition to Bartłomiej Piotrowski to handle enquiries, reviews, documentation, and partner outreach.

    We’re currently in discussions with NLnet about funding further software development, but have been unfortunately turned down for a grant from the Plaintext Group for this year; this Schmidt Futures project around OSS sustainability is not currently issuing grants in 2023. However, we continue to work on other funding opportunities.

    Remaining Barriers

    My personal hypothesis is that our largest remaining barrier to Linux desktop scale and impact is economic. On competing platforms—mobile or desktop—a developer can offer their work for sale via an app store or direct download with payment or subscription within hours of making a release. While we have taken the “time to first download” time down from months to days with Flathub, as a community we continue to have a challenging relationship with money. Some creators are lucky enough to have a full-time job within the FLOSS space, while a few “superstar” developers are able to nurture some level of financial support by investing time in building a following through streaming, Patreon, Kickstarter, or similar. However, a large proportion of us have to make do with the main payback from our labours being a stream of bug reports on GitHub interspersed with occasional conciliatory beers at FOSDEM (other beverages and events are available).

    The first and most obvious consequence is that if there is no financial payback for participating in developing apps for the free and open source desktop, we will lose many people in the process—despite the amazing achievements of those who have brought us to where we are today. As a result, we’ll have far fewer developers and apps. If we can’t offer access to a growing base of users or the opportunity to offer something of monetary value to them, the reward in terms of adoption and possible payment will be very small. Developers would be forgiven for taking their time and attention elsewhere. With fewer apps, our platform has less to entice and retain prospective users.

    The second consequence is that this also represents a significant hurdle for diverse and inclusive participation. We essentially require that somebody is in a position of privilege and comfort that they have internet, power, time, and income—not to mention childcare, etc.—to spare so that they can take part. If that’s not the case for somebody, we are leaving them shut out from our community before they even have a chance to start. My belief is that free and open source software represents a better way for people to access computing, and there are billions of people in the world we should hope to reach with our work. But if the mechanism for participation ensures their voices and needs are never represented in our community of creators, we are significantly less likely to understand and meet those needs.

    While these are my thoughts, you’ll notice a strong theme to this year will be leading a consultation process to ensure that we are including, understanding and reflecting the needs of our different communities—app creators, OS distributors and Linux users—as I don’t believe that our initiative will be successful without ensuring mutual benefit and shared success. Ultimately, no matter how beautiful, performant, or featureful the latest versions of the Plasma or GNOME desktops are, or how slick the newly rewritten installer is from your favourite distribution, all of the projects making up the Linux desktop ecosystem are subdividing between ourselves an absolutely tiny market share of the global market of personal computers. To make a bigger mark on the world, as a community, we need to get out more.

    What’s Next?

    After identifying our major barriers to overcome, we’ve planned a number of focused initiatives and restructuring this year:

    Phased Deployment

    We’re working on deploying the work we have been doing over the past year, starting first with launching the new Flathub web experience as well as the rebrand that Jakub has been talking about on his blog. This also will finally launch the verification features so we can distinguish those apps which are uploaded by their developers.

    In parallel, we’ll also be able to turn on the Flatpak repo subsets that enable users to select only verified and/or FLOSS apps in the Flatpak CLI or their desktop’s app center UI.

    Consultation

    We would like to make sure that the voices of app creators, OS distributors, and Linux users are reflected in our plans for 2023 and beyond. We will be launching this in the form of Flathub Focus Groups at the Linux App Summit in Brno in May 2023, followed up with surveys and other opportunities for online participation. We see our role as interconnecting communities and want to be sure that we remain transparent and accountable to those we are seeking to empower with our work.

    Whilst we are being bold and ambitious with what we are trying to create for the Linux desktop community, we also want to make sure we provide the right forums to listen to the FLOSS community and prioritise our work accordingly.

    Advisory Board

    As we build the Flathub organisation up in 2023, we’re also planning to expand its governance by creating an Advisory Board. We will establish an ongoing forum with different stakeholders around Flathub: OS vendors, hardware integrators, app developers and user representatives to help us create the Flathub that supports and promotes our mutually shared interests in a strong and healthy Linux desktop community.

    Direct Uploads

    Direct app uploads are close to ready, and they enable exciting stuff like allowing Electron apps to be built outside of flatpak-builder, or driving automatic Flathub uploads from GitHub actions or GitLab CI flows; however, we need to think a little about how we encourage these to be used. Even with its frustrations, our current Buildbot ensures that the build logs and source versions of each app on Flathub are captured, and that the apps are built on all supported architectures. (Is 2023 when we add RISC-V? Reach out if you’d like to help!). If we hand upload tokens out to any developer, even if the majority of apps are open source, we will go from this relatively structured situation to something a lot more unstructured—and we fear many apps will be available on only 64-bit Intel/AMD machines.

    My sketch here is that we need to establish some best practices around how to integrate Flathub uploads into popular CI systems, so that we preserve the transparency and reproducibility properties that we don’t want to lose. If anyone is a CI wizard and would like to work with us as a thought partner on how we can achieve this—making it more flexible where and how build tasks can be hosted, without losing these cross-platform and inspectability properties—we’d love to hear from you.

    Donations and Payments

    Once the work around legal and governance reaches a decent point, we will be in the position to move ahead with our Stripe setup and switch on the third big new feature in the Flathub web app. At present, we have already implemented support for one-off payments, either as donations or as a required purchase. We would like to go further than that, in line with what we were describing earlier about helping developers sustainably work on apps for our ecosystem: we would also like to enable developers to offer subscriptions. This will allow us to create a relationship between users and creators that funds ongoing work, rather than only the one-off payments we already have.

    Security

    For Flathub to succeed, we need to make sure that as we grow, we continue to be a platform that can give users confidence in the quality and security of the apps we offer. To that end, we are planning to set up infrastructure to help ensure developers are shipping the best products they possibly can to users. For example, we’d like to set up automated linting and security scanning on the Flathub back-end to help developers avoid bad practices, unnecessary sandbox permissions, outdated dependencies, etc. and to keep users informed and as secure as possible.

    Sponsorship

    Fundraising is a forever task—as is running such a big and growing service. We hope that one day we can cover our costs through some modest fees built into our payments—but until we reach that point, we’re going to be seeking a combination of grant funding and sponsorship to keep our roadmap moving. Our hope is very much that we can encourage different organisations that buy into our vision and will benefit from Flathub to help us support it and ensure we can deliver on our goals. If you have any suggestions of who might like to support Flathub, we would be very appreciative if you could reach out and put us in touch.

    Finally, Thank You!

    Thanks to you all for reading this far and supporting the work of Flathub, and also to our major sponsors and donors without whom Flathub could not exist: GNOME Foundation, KDE e.V., Mythic Beasts, Endless Network, Fastly, and Equinix Metal via the CNCF Community Cluster. Thanks also to the tireless work of the Freedesktop SDK community to give us the runtime platform most Flatpaks depend on, particularly Seppo Yli-Olli, Codethink and others.

    I wanted to also give my personal thanks to a handful of dedicated people who keep Flathub working as a service and as a community: Bartłomiej Piotrowski is keeping the infrastructure working essentially single-handedly (in his spare time from keeping everything running at GNOME); Kolja Lampe and Bart built the new web app and backend API for Flathub on which all of the new functionality has been built; and Filippe LeMarchand maintains the checker bot which helps keep all of the Flatpaks up to date.

    And finally, all of the submissions to Flathub are reviewed to ensure quality, consistency and security by a small dedicated team of reviewers, with a huge amount of work from Hubert Figuière and Bart to keep the submissions flowing. Thanks to everyone—named or unnamed—for building this vision of the future of the Linux desktop together with us.

    (originally posted to Flathub Discourse, head there if you have any questions or comments)

    by ramcq at March 07, 2023 11:00 AM

    March 04, 2023

    GStreamerGStreamer 1.22.1 stable bug fix release

    (GStreamer)

    The GStreamer team is pleased to announce the first bug fix release in the stable 1.22 release series of your favourite cross-platform multimedia framework!

    This release only contains bugfixes and it should be safe to update from 1.22.0.

    Highlighted bugfixes:

    • audio channel-mix: allow up to 64 channels (instead of up to 63 channels)
    • avfvideosrc: Don't wait on main thread for permissions request
    • avvidenc: avoid generating inaccurate output timestamps, especially with variable framerate streams
    • AV1 video codec caps signalling improvements in various elements
    • codectimestamper: Fix timestamping on sequence update
    • d3d11overlaycompositor: fix texture width and height
    • d3d11videosink: Fix rendering on external handle
    • dashdemux2: fix seek operation taking a long time to finish for some streams
    • nvencoder: Fix B-frame encoding on Linux and min buffers in auto GPU mode
    • playbin3: fixing buffering for live pipelines
    • playbin: fix potential deadlock when stopping stream with subtitles visible
    • redenc: fix setting of extension ID for twcc
    • rtspsrc: improved compatibility with more broken RTSP servers
    • v4l2h264dec: Fix video not playing in applications on Raspberry Pi 4
    • vtdec: fix jittery playback of H.264 Level 4.1 movies on macOS
    • vtdec: Fix non-deterministic frame output after flushing seeks
    • vtenc: fix handling of interlaced ProRes on Apple M1 hardware
    • vtenc: don't advertise ARGB/RGBA64 input caps on M1 Pro/Max with macOS <13
    • wasapi2src: Fix loopback capture on Windows 10 Anniversary Update
    • tools: better handling of non-ASCII command line arguments on Windows
    • gst-libav: fix build against newer ffmpeg versions
    • gst-python: Use arch-specific install dir for gi overrides
    • cerbero: Fix setuptools site.py breakage in Python 3.11
    • macOS packages: Fix broken binaries on macos < 11.0
    • various bug fixes, memory leak fixes, and other stability and reliability improvements

    See the GStreamer 1.22.1 release notes for more details.

    Binaries for Android, iOS, Mac OS X and Windows will be available shortly.

    Release tarballs can be downloaded directly here:

    March 04, 2023 02:00 PM

    February 27, 2023

    Thomas Vander SticheleMeet Me in the Bathroom

    (Thomas Vander Stichele)

    "Welcome to pre-9/11 New York City, when the world was unaware of the profound political and cultural shifts about to occur, and an entire generation was thirsty for more than the post–alternative pop rock plaguing MTV. In the cafés, clubs, and bars of the Lower East Side there convened a group of outsiders and misfits full of ambition and rock star dreams."

    Music was the main reason I wanted to move to New York - I wanted to walk the same streets that the Yeah Yeah Yeahs, the National, Interpol, the Walkmen, the Antlers and Sonic Youth were walking. In my mind they'd meet up and have drinks with each other at the same bars, live close to each other, and I'd just run into them all the time myself. I'm not sure that romantic version of New York ever existed. Paul Banks used to live on a corner in between where I live and where my kids go to school now, but that was two decades ago (though for a while, we shared a hairdresser). On one of my first visits to New York before moving here, I had a great chat with Thurston Moore at a café right before taking the taxi back to the airport. And that's as close as I got to living my dream.

    But now the documentary "Meet me in the Bathroom" (based on the book of the same name) shows that version of New York that only existed for a brief moment in time.

    "Meet Me In The Bathroom — inspired by Lizzy Goodman’s book of the same name — chronicles the last great romantic age of rock ’n’ roll through the lens of era-defining bands."

    Read the book, watch the documentary (available on Google Play among other platforms), or listen to the Spotify playlist Meet Me in the Bathroom: Every Song From The Book In Chronological Order. For bonus points, listen to Losing My Edge (every band mentioned in the LCD Soundsystem song, in the order they're mentioned).

    Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

    by Thomas at February 27, 2023 10:31 AM

    February 23, 2023

    GStreamerGStreamer 1.20.6 old-stable bug fix release

    (GStreamer)

    The GStreamer team is pleased to announce another bug fix release in the now old-stable 1.20 release series of your favourite cross-platform multimedia framework!

    This release only contains bug fixes. It should be safe to update from 1.20.x.

    Highlighted bugfixes:

    • audio: channel-mix: allow up to 64 channels instead of up to 63 channels
    • AOM AV1 encoder timestamp handling improvements
    • AV1 video codec caps handling improvements in aom plugin, isomp4 and matroska muxers/demuxers
    • avvidenc: fix bitrate control and timestamps of FFmpeg-based video encoders
    • h264parse: fix missing timestamps on outputs when splitting a frame
    • rtspsrc: more workarounds for servers with broken control uri handling
    • playbin3: fix issue with UDP streams, making sure there's enough buffering
    • qmlglsrc: Fix deadlock when stopping and some other fixes
    • qtmux: fix default timescale unit for N/1001 framerates
    • v4l2h264dec: Fix video not playing in applications on Raspberry Pi 4
    • vtdec: Fix non-deterministic frame output after seeks
    • wasapi2src: Fix loopback capture on Windows 10 Anniversary Update
    • macOS, iOS: Fix Xcode 14 ABI breakage with older Xcode
    • cerbero: Fix some regressions for CentOS in the 1.20 branch
    • cerbero: Fix setuptools site.py breakage in Python 3.11
    • Fix gst-libav build against FFmpeg from git
    • gobject-introspection annotation fixes for bindings
    • Miscellaneous bug fixes, memory leak fixes, and other stability and reliability improvements
    • Performance improvements

    See the GStreamer 1.20.6 release notes for more details.

    Binaries for Android, iOS, macOS and Windows will be available shortly.

    Release tarballs can be downloaded directly here:

    The GStreamer 1.20 series has now been superseded by the new stable GStreamer 1.22 series and we recommend you upgrade at your earliest convenience.

    February 23, 2023 11:50 PM

    February 20, 2023

    Thomas Vander SticheleAll Teams

    (Thomas Vander Stichele)

    "I have hands but I am doing all I can to have daily independence so I can’t be ‘all hands’ at work. I can share ideas, show up, and ask for help when I physically can’t use my hands. Belonging means folks with one hand, no hand and limited hands are valued in the workplace." - Dr. Akilah Cadet

    If you've been wondering why over the past few months you're seeing a lot more "All Teams" meetings on your calendar, it's because language is ever evolving with the times, and people are becoming more aware of ableist language and replacing it.

    Read more:

    If your team still has "all hands" meetings, propose a more creative name, or default to "all teams" instead. Take some time to familiarize yourself with other ableist language and alternatives.

    Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

    by Thomas at February 20, 2023 10:28 AM

    February 16, 2023

    Phil NormandWebRTC in WebKitGTK and WPE, status updates, part I

    (Phil Normand)

    Some time ago we at Igalia embarked on the journey to ship a GStreamer-powered WebRTC backend. This is a long journey, it is not over, but we made some progress …

    by Philippe Normand at February 16, 2023 08:30 PM

    February 13, 2023

    Thomas Vander SticheleHow to Support Through Layoffs

    (Thomas Vander Stichele)

    The number one question I've gotten in the past week has been "how can I support people affected by the layoff?"

    First, some general advice:

    • For everyone, remember to "comfort in, dump out" - vent your frustrations to people less affected, support those who are more affected than you.
    • To support Xooglers, don't ask "how can I help?" as that places more burden on them. Offer specific ways you can help them - "Can I write you a LinkedIn recommendation? Can I connect you with this person I know at this company who's hiring?". People are affected disproportionately, and if you want to prioritize your help, consider starting with the people on a visa who are now on a tight deadline to find a new sponsor or face leaving the country.
    • To support your colleagues still here, remember we're not all having the same experience. In particular, people outside of the US will be in limbo for weeks or months to come. People can be anywhere on a spectrum of "long time Googler, first mass layoff" to "I've had to go through worse". Don't assume, lead with curiosity, and listen.

    Some resources people shared you might find helpful:

    Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

    by Thomas at February 13, 2023 10:29 AM

    GStreamerGStreamer Rust bindings 0.20 / Rust Plugins 0.10 release

    (GStreamer)

    Version 0.20 of the GStreamer Rust bindings was released. Together with the bindings, also version 0.10 of the GStreamer Rust plugins was released.

    As usual this release follows the latest gtk-rs 0.17 release and the corresponding API changes.

    This release features relatively few changes and mostly contains the addition of some convenience APIs, the addition of bindings for some minor APIs and various optimizations.

    As part of the optimizations there will be a lot fewer temporary allocations with the new version (especially for NUL-terminated strings), and improved code generation, which not only improves performance but also reduces the size of the resulting binaries considerably. Various GStreamer plugins saw size reductions of 15-20%.

    The new release also brings a lot of bugfixes, most of which were already part of the 0.19.x bugfix releases.

    Details can be found in the release notes for gstreamer-rs.

    The code and documentation for the bindings is available on the freedesktop.org GitLab

    as well as on crates.io.

    The new 0.10 version of the GStreamer Rust plugins features many improvements to the existing plugins as well as a few new plugins. A majority of the changes were already backported to the 0.9 release series and its bugfix releases, which is part of GStreamer 1.22.

    A full list of available plugins can be seen in the repository's README.md.

    Details for this release can be found in the release notes for gst-plugins-rs.

    If you find any bugs, notice any missing features or other issues please report them in GitLab for the bindings or the plugins.

    February 13, 2023 10:00 AM

    February 07, 2023

    Andy Wingowhippet: towards a new local maximum

    (Andy Wingo)

    Friends, you might have noted, but over the last year or so I really caught the GC bug. Today's post sums up that year, in the form of a talk I gave yesterday at FOSDEM. It's long! If you prefer video, you can have a look instead at the FOSDEM event page.

    Whippet: A New GC for Guile

    4 Feb 2023 – FOSDEM

    Andy Wingo

    Guile is...

    Mostly written in Scheme

    Also a 30 year old C library

    // API
    SCM scm_cons (SCM car, SCM cdr);
    
    // Many third-party users
    SCM x = scm_cons (a, b);

    So the context for the whole effort is that Guile has this part of its implementation which is in C. It also exposes a lot of that implementation to users as an API.

    Putting the C into GC

    SCM x = scm_cons (a, b);

    Live objects: the roots, plus anything a live object refers to

    How to include x into roots?

    • Refcounting
    • Register (& later unregister) &x with gc
    • Conservative roots

    So what constraints does this kind of API impose on the garbage collector?

    Let's start by considering the simple cons call above. In a garbage-collected environment, the GC is responsible for reclaiming unused memory. How does the GC know that the result of a scm_cons call is in use?

    Generally speaking there are two main strategies for automatic memory management. One is reference counting: you associate a count with an object, incremented once for each referrer; in this case, the stack would hold a reference to x. When removing the reference, you decrement the count, and if it goes to 0 the object is unused and can be freed.
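
    To make that concrete, here is a minimal sketch of the refcounting discipline from the C side; the struct and function names are invented for illustration, not anything from Guile:

    #include <stdlib.h>
    
    struct obj {
      size_t refcount;   // number of referrers, including stack references
      // ... payload ...
    };
    
    static inline void obj_ref (struct obj *o)   { o->refcount++; }
    
    static inline void obj_unref (struct obj *o)
    {
      if (--o->refcount == 0)
        free (o);   // no referrers remain, so reclaim immediately
      // note: objects in a cycle never reach zero, hence the over-approximation
    }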

    We GC people used to laugh at reference-counting as a memory management solution because it over-approximates the live object set in the presence of cycles, but it would seem that refcounting is coming back. Anyway, this isn't what Guile does, not right now anyway.

    The other strategy we can use is tracing: the garbage collector periodically finds all of the live objects on the system and then recycles the memory for everything else. But how to actually find the first live objects to trace?

    One way is to inform the garbage collector of the locations of all roots: references to objects originating from outside the heap. This can be done explicitly, as in V8's Handle<> API, or implicitly, in the form of a side table generated by the compiler associating code locations with root locations. This is called precise rooting: the GC is aware of all root locations at all code positions where GC might happen. Generally speaking you want the side table approach, in which the compiler writes out root locations to stack maps, because it doesn't impose any overhead at run-time to register and unregister locations. However for run-time routines implemented in C or C++, you won't be able to get the C compiler to do this for you, so you need the explicit approach if you want precise roots.
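
    To show the shape of the explicit approach, here is a small sketch; gc_register_root, gc_unregister_root and the other helpers are hypothetical names, just illustrating the bookkeeping a C caller would be signing up for:

    // Hypothetical precise-rooting API: the collector is told exactly
    // which stack slots may hold object references.
    extern void gc_register_root (void **slot);
    extern void gc_unregister_root (void **slot);
    extern void *allocate_object (void);
    extern void use_object (void *obj);
    
    void example (void)
    {
      void *x = NULL;
      gc_register_root (&x);       // from here on, the GC scans this slot
      x = allocate_object ();      // GC may run during allocation; x is a root
      use_object (x);
      gc_unregister_root (&x);     // forgetting this is the classic bug
    }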

    Conservative roots

    Treat every word in stack as potential root; over-approximate live object set

    1993: Bespoke GC inherited from SCM

    2006 (1.8): Added pthreads, bugs

    2009 (2.0): Switch to BDW-GC

    BDW-GC: Roots also from extern SCM foo;, etc

    The other way to find roots is very much not The Right Thing. Call it cheeky, call it sloppy, call it yolo, call it what you like, but in the trade it's known as conservative root-finding. This strategy looks like this:

    // Scan every word between the current frame and the stack base,
    // treating each as a potential pointer; the helper functions here
    // stand in for platform- and heap-specific lookups.
    uintptr_t *limit = stack_base_for_platform();
    uintptr_t *sp = __builtin_frame_address(0);
    for (; sp < limit; sp++) {
      void *obj = object_at_address(*sp);   // NULL if not a heap object
      if (obj) add_to_live_objects(obj);
    }
    

    You just look at every word on the stack and pretend it's a pointer. If it happens to point to an object in the heap, we add that object to the live set. Of course this algorithm can find a spicy integer whose value just happens to correspond to an object's address, even if that object wouldn't have been counted as live otherwise. This approach doesn't compute the minimal live set, but rather a conservative over-approximation. Oh well. In practice this doesn't seem to be a big deal?

    Guile has used conservative root-finding since its beginnings, 30 years ago and more. We had our own bespoke mark-sweep GC in the beginning, but it's now going on 15 years or so that we switched to the third-party Boehm-Demers-Weiser (BDW) collector. It's been good to us! It's better than what we had, it's mostly just worked, and it works correctly with threads.

    Conservative roots

    +: Ergonomic, eliminates class of bugs (handle registration), no compiler constraints

    -: Potential leakage, no compaction / object motion; no bump-pointer allocation, calcifies GC choice

    Conservative root-finding does have advantages. It's quite pleasant to program with, in environments in which the compiler is unable to produce stack maps for you, as it eliminates a set of potential bugs related to explicit handle registration and unregistration. Like stack maps, it also doesn't impose run-time overhead on the user program. And although the compiler isn't constrained to emit code to clear roots, it generally does, and sometimes does so more promptly than would be the case with explicit handle deregistration.

    But, there are disadvantages too. The potential for leaks is one, though I have to say that in 20 years of using conservative-roots systems, I have not found this to be a problem. It's a source of anxiety whenever a program has memory consumption issues but I've never identified it as being the culprit.

    The more serious disadvantage, though, is that conservative edges prevent objects from being moved by the GC. If you know that a location holds a pointer, you can update that location to point to a new location for an object. But if a location only might be a pointer, you can't do that.

    In the end, the ergonomics of conservative collection lead to a kind of calcification in Guile, that we thought that BDW was as good as we could get given the constraints, and that changing to anything else would require precise roots, and thus an API and ABI change, losing users, and so on.

    What if I told you

    You can find roots conservatively and

    • move objects and compact the heap
    • do fast bump-pointer allocation
    • incrementally migrate to precise roots

    BDW is not the local maximum

    But it turns out, that's not true! There is a way to have conservative roots and also use more optimal GC algorithms, and one which preserves the ability to incrementally refactor the system to have more precision if that's what you want.

    Immix

    Fundamental GC algorithms

    • mark-compact
    • mark-sweep
    • evacuation
    • mark-region

    Immix is a mark-region collector

    Let's back up to a high level. Garbage collector implementations are assembled from instances of algorithms, and there are only so many kinds of algorithms out there.

    There's mark-compact, in which the collector traverses the object graph once to find live objects, then once again to slide them down to one end of the space they are in.

    There's mark-sweep, where the collector traverses the graph once to find live objects, then traverses the whole heap, sweeping dead objects into free lists to be used for future allocations.

    There's evacuation, where the collector does a single pass over the object graph, copying the objects outside their space and leaving a forwarding pointer behind.

    The BDW collector used by Guile is a mark-sweep collector, and its use of free lists means that allocation isn't as fast as it could be. We want bump-pointer allocation and all the other algorithms give it to us.

    Then in 2008, Stephen Blackburn and Kathryn McKinley put out their Immix paper that identified a new kind of collection algorithm, mark-region. A mark-region collector will mark the object graph and then sweep the whole heap for unmarked regions, which can then be reused for allocating new objects.

    Allocate: Bump-pointer into holes in thread-local block, objects can span lines but not blocks

    Trace: Mark objects and lines

    Sweep: Coarse eager scan over line mark bytes

    Blackburn and McKinley's paper also describes a new mark-region GC algorithm, Immix, which is interesting because it gives us bump-pointer allocation without requiring that objects be moveable. The diagram above, from the paper, shows the organization of an Immix heap. Allocating threads (mutators) obtain 64-kilobyte blocks from the heap. Blocks contain 128-byte lines. When Immix traces the object graph, it marks both objects and the line the object is on. (Usually blocks are part of 2MB aligned slabs, with line mark bits/bytes stored in a packed array at the start of the slab. When marking an object, it's easy to find the associated line mark just with address arithmetic.)
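
    As a sketch of that address arithmetic, assuming the 2MB aligned slabs and 128-byte lines described above, with the line mark bytes packed at the start of the slab (the layout constants here are illustrative, not lifted from Immix's code):

    #include <stdint.h>
    #include <stddef.h>
    
    #define SLAB_SIZE ((uintptr_t) 2 * 1024 * 1024)   // 2MB aligned slab
    #define LINE_SIZE ((uintptr_t) 128)               // bytes per line
    
    // Mark byte for the line containing addr: round down to the slab,
    // compute the line index, and index into the packed array at the
    // start of the slab.
    static inline uint8_t *line_mark_byte (uintptr_t addr)
    {
      uintptr_t slab = addr & ~(SLAB_SIZE - 1);
      size_t line = (addr - slab) / LINE_SIZE;
      return (uint8_t *) slab + line;
    }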

    Immix reclaims memory in units of lines. A set of contiguous lines that were not marked in the previous collection form a hole (a region). Allocation proceeds into holes, in the usual bump-pointer fashion, giving us good locality for contemporaneously-allocated objects, unlike freelist allocation. The slow path, if the object doesn't fit in the hole, is to look for the next hole in the block, or if needed to acquire another block, or to stop for collection if there are no more blocks.
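
    The fast path of that allocator is the familiar bump-pointer check; here is a minimal sketch, with a hypothetical slow-path function standing in for "find the next hole, get another block, or collect":

    #include <stddef.h>
    #include <stdint.h>
    
    struct allocator {
      uintptr_t hole;       // next free byte in the current hole
      uintptr_t hole_end;   // one past the end of the current hole
    };
    
    extern void *allocate_slow (struct allocator *a, size_t bytes);
    
    // Assumes bytes is already rounded up to the heap's alignment.
    static inline void *allocate (struct allocator *a, size_t bytes)
    {
      uintptr_t p = a->hole;
      if (bytes <= a->hole_end - p) {
        a->hole = p + bytes;              // bump within the hole: the fast path
        return (void *) p;
      }
      return allocate_slow (a, bytes);    // next hole, next block, or GC
    }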

    Immix: Opportunistic evacuation

    Before trace, determine if compaction needed. If not, mark as usual

    If so, select candidate blocks and evacuation target blocks. When tracing in that block, try to evacuate, fall back to mark

    The neat thing that Immix adds is a way to compact the heap via opportunistic evacuation. As Immix allocates, it can end up skipping over holes and leaving them unpopulated, and as subsequent cycles of GC occur, it could be that a block ends up with many small holes. If that happens to many blocks it could be time to compact.

    To fight fragmentation, Immix decides at the beginning of a GC cycle whether to try to compact or not. If things aren't fragmented, Immix marks in place; it's cheaper that way. But if compaction is needed, Immix selects a set of blocks needing evacuation and another set of empty blocks to evacuate into. (Immix has to keep around a couple percent of memory in empty blocks in reserve for this purpose.)

    As Immix traverses the object graph, if it finds that an object is in a block that needs evacuation, it will try to evacuate instead of marking. It may or may not succeed, depending on how much space is available to evacuate into. Maybe it will succeed for all objects in that block, and you will be left with an empty block, which might even be given back to the OS.

    Immix: Guile

    Opportunistic evacuation compatible with conservative roots!

    Bump-pointer allocation

    Compaction!

    1 year ago: start work on WIP GC implementation

    Tying this back to Guile, this gives us all of our desiderata: we can evacuate, but we don't have to, allowing us to cause referents of conservative roots to be marked in place instead of moved; we can bump-pointer allocate; and we are back on the train of modern GC implementations. I could no longer restrain myself: I started hacking on a work-in-progress garbage collector workbench about a year ago, and ended up with something that seems to take us in the right direction.

    Whippet vs Immix: Tiny lines

    Immix: 128B lines + mark bit in object

    Whippet: 16B “lines”; mark byte in side table

    More size overhead: 1/16 vs 1/128

    Less fragmentation (1 live obj = 2 lines retained)

    More alloc overhead? More small holes

    What I ended up building wasn't quite Immix. Guile's object representation is very thin and doesn't currently have space for a mark bit, for example, so I would have to have a side table of mark bits. (I could have changed Guile's object representation but I didn't want to require it.) I actually chose mark bytes instead of bits because both the Immix line marks and BDW's own side table of marks were bytes, to allow for parallel markers to race when setting marks.

    Then, given that you have a contiguous table of mark bytes, why not remove the idea of lines altogether? Or, what amounts to the same thing, why not make the line size 16 bytes and do away with per-object mark bits? You can then bump-pointer into holes in the mark byte array. The only thing you need for that is to be able to cheaply find the end of an object, so you can skip to the next hole while sweeping; you don't want to have to chase pointers to do that. But consider: you've already paid the cost of having a mark byte associated with every possible start of an object, so if your basic object alignment is 16 bytes, that's a memory overhead of 1/16, or 6.25%; OK. Let's put that mark byte to work and include an "end" bit, indicating the end of the object. Allocating an object has to store into the mark byte array to initialize this "end" marker, but you need to write the mark byte anyway to allow for conservative roots ("does this address hold an object?"); writing the end at the same time isn't so bad, perhaps.
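
    A sketch of that metadata scheme, with 16-byte granules and one byte of metadata per granule; the bit assignments and the lookup function are invented for illustration and aren't the actual Whippet encoding:

    #include <stdint.h>
    #include <stddef.h>
    
    #define GRANULE_SIZE 16   // basic object alignment: 1 mark byte per 16 bytes
    
    enum {
      OBJECT_START = 1 << 0,  // an object begins at this granule
      OBJECT_END   = 1 << 1,  // this granule is the last one of its object
      MARKED       = 1 << 2,  // set by the tracer when the object is found live
    };
    
    extern uint8_t *metadata_byte_for (uintptr_t addr);  // hypothetical lookup
    
    static void note_allocation (uintptr_t obj, size_t granules)
    {
      uint8_t *meta = metadata_byte_for (obj);
      meta[0] |= OBJECT_START;             // answers "does this address hold an object?"
      meta[granules - 1] |= OBJECT_END;    // makes finding the object's end cheap
    }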

    The expected outcome would be that relative to 128-byte lines, Whippet ends up with more, smaller holes. Such a block would be a prime target for evacuation, of course, but during allocation this is overhead. Or, it could be a source of memory efficiency; who knows. There is some science yet to do to properly compare this tactic to original Immix, but I don't think I will get around to it.

    While I am here and I remember these things, I need to mention two more details. If you read the Immix paper, it describes "conservative line marking", which is related to how you find the end of an object; basically Immix always marks the line an object is on and the next one, in case the object spans the line boundary. Only objects larger than a line have to precisely mark the line mark array when they are traced. Whippet doesn't do this because we have the end bit.

    The other detail is the overflow allocator; in the original Immix paper, if you allocate an object that's smallish but still larger than a line or two, but there's no hole big enough in the block, Immix keeps around a completely empty block per mutator in which to bump-pointer-allocate these medium-sized objects. Whippet doesn't do that either, instead relying on such failure to allocate in a block to cause fragmentation and thus hurry along the process of compaction.

    Whippet vs Immix: Lazy sweeping

    Immix: “cheap” eager coarse sweep

    Whippet: just-in-time lazy fine-grained sweep

    Corollary: Data computed by sweep available when sweep complete

    Live data at previous GC only known before next GC

    Empty blocks discovered by sweeping

    Having a fine-grained line mark array means that it's no longer a win to do an eager sweep of all blocks after collecting. Instead Whippet applies the classic "lazy sweeping" optimization to make mutators sweep their blocks just before allocating into them. This introduces a delay in the collection algorithm: Whippet doesn't find out about e.g. fragmentation until the whole heap is swept, but by the time we fully sweep the heap, we've exhausted it via allocation. It introduces a different flavor to the GC, not entirely unlike original Immix, but foreign.

    Whippet vs BDW

    Compaction/defrag/pinning, heap shrinking, sticky-mark generational GC, threads/contention/allocation, ephemerons, precision, tools

    Right! With that out of the way, let's talk about what Whippet gives to Guile, relative to BDW-GC.

    Whippet vs BDW: Motion

    Heap-conservative tracing: no object moveable

    Stack-conservative tracing: stack referents pinned, others not

    Whippet: If whole-heap fragmentation exceeds threshold, evacuate most-fragmented blocks

    Stack roots scanned first; marked instead of evacuated, implicitly pinned

    Explicit pinning: bit in mark byte

    If all edges in the heap are conservative, then you can't move anything, because you don't know if an edge is a pointer that can be updated or just a spicy integer. But most systems aren't actually like this: you have conservative edges from the stack, but you can precisely enumerate intra-object edges on the heap. In that case, you have a known set of conservative edges, and you can simply visit those edges first, marking their referents in place instead of evacuating. (Marking an object instead of evacuating implicitly pins it for the duration of the current GC cycle.) Then you visit heap edges precisely, possibly evacuating objects.
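
    Schematically, a cycle in that configuration visits the conservative edges first and the precise ones second; every name in this sketch is invented to show the structure, not taken from Whippet:

    struct heap;
    
    typedef void (*root_visitor) (struct heap *h, uintptr_t maybe_ref);
    typedef void (*edge_visitor) (struct heap *h, uintptr_t *edge);
    
    extern void for_each_conservative_root (struct heap *h, root_visitor visit);
    extern void for_each_precise_heap_edge (struct heap *h, edge_visitor visit);
    extern void mark_in_place (struct heap *h, uintptr_t maybe_ref);
    extern void evacuate_or_mark (struct heap *h, uintptr_t *edge);
    
    void collect (struct heap *heap)
    {
      // Phase 1: conservative roots. Referents are marked in place,
      // which implicitly pins them for this cycle.
      for_each_conservative_root (heap, mark_in_place);
      // Phase 2: precise heap edges. Referents in fragmented blocks
      // may be evacuated and the edge updated; otherwise just mark.
      for_each_precise_heap_edge (heap, evacuate_or_mark);
    }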

    I should note that Whippet has a bit in the mark byte for use in explicitly pinning an object. I'm not sure how to manage who is responsible for setting that bit, or what the policy will be; the current idea is to set it for any object whose identity-hash value is taken. We'll see.

    Whippet vs BDW: Shrinking

    Lazy sweeping finds empty blocks: potentially give back to OS

    Need empty blocks? Do evacuating collection

    Possibility to do http://marisa.moe/balancer.html

    With the BDW collector, your heap can only grow; it will never shrink (unless you enable a non-default option and you happen to have verrry low fragmentation). But with Whippet and evacuation, we can rearrange objects so as to produce empty blocks, which can then be returned to the OS if so desired.

    In one of my microbenchmarks I have the system allocating long-lived data, interspersed with garbage (objects that are dead after allocation) whose size is in a power-law distribution. This should produce quite some fragmentation, eventually, and it does. But then Whippet decides to defragment, and it works great! Since Whippet doesn't keep a whole 2x reserve like a semi-space collector, it usually takes more than one GC cycle to fully compact the heap; usually about 3 cycles, from what I can see. I should do some more measurements here.

    Of course, this is just mechanism; choosing the right heap sizing policy is a different question.

    wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm

    Card marking barrier (256B); compare to BDW mprotect / SIGSEGV

    The Boehm collector also has a non-default mode in which it uses mprotect and a SIGSEGV handler to enable sticky-mark-bit generational collection. I haven't done a serious investigation, but I see it actually increasing run-time by 20% on one of my microbenchmarks that is actually generation-friendly. I know that Azul's C4 collector used to use page protection tricks but I can only assume that BDW's algorithm just doesn't work very well. (BDW's page barriers have another purpose, to enable incremental collection, in which marking is interleaved with allocation, but this mode is off if parallel markers are supported, and I don't know how well it works.)

    Anyway, it seems we can do better. The ideal would be a semi-space nursery, which is the usual solution, but because of conservative roots we are limited to the sticky mark-bit algorithm. Some benchmarks aren't very generation-friendly; the first pair of bars in the chart above shows the mt-gcbench microbenchmark running with and without generational collection, and there's no difference. But in the second, for the quads benchmark, we see a 2x speedup or so.

    Of course, to get generational collection to work, we require mutators to use write barriers, which are little bits of code that run when an object is mutated that tell the GC where it might find links from old objects to new objects. Right now in Guile we don't do this, but this benchmark shows what can happen if we do.
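
    For a sense of scale, a card-marking write barrier is just a couple of instructions at each mutation; this sketch assumes the 256-byte cards from the slide above, and the card-table lookup is a hypothetical helper:

    #include <stdint.h>
    
    extern uint8_t *card_byte_for (uintptr_t addr);  // hypothetical lookup
    
    // Run on every store of a reference into a heap object: record that
    // this 256-byte region may now contain an old-to-new edge, so the
    // next minor collection rescans it.
    static inline void write_barrier (uintptr_t obj, uintptr_t *field, uintptr_t value)
    {
      *field = value;
      *card_byte_for (obj) = 1;
    }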

    Whippet vs BDW: Scale

    BDW: TLS segregated-size freelists, lock to refill freelists, SIGPWR for stop

    Whippet: thread-local block, sweep without contention, wait-free acquisition of next block, safepoints to stop with ragged marking

    Both: parallel markers

    Another thing Whippet can do better than BDW is performance when there are multiple allocating threads. The Immix heap organization facilitates minimal coordination between mutators, and maximum locality for each mutator. Sweeping is naturally parallelized according to how many threads are allocating. For BDW, on the other hand, every time a mutator needs to refill its thread-local free lists, it grabs a global lock; sweeping is lazy but serial.

    Here's a chart showing whippet versus BDW on one microbenchmark. On the X axis I add more mutator threads; each mutator does the same amount of allocation, so I'm increasing the heap size also by the same factor as the number of mutators. For simplicity I'm running both whippet and BDW with a single marker thread, so I expect to see a linear increase in elapsed time as the heap gets larger (as with 4 mutators there are roughly 4 times the number of live objects to trace). This test is run on a Xeon Silver 4114, taskset to free cores on a single socket.

    What we see is that as I add workers, elapsed time increases linearly for both collectors, but more steeply for BDW. I think (but am not sure) that this is because whippet effectively parallelizes sweeping and allocation, whereas BDW has to contend over a global lock to sweep and refill free lists. Both have the linear factor of tracing the object graph, but BDW has the additional linear factor of sweeping, whereas whippet scales with mutator count.

    Incidentally you might notice that at 4 mutator threads, BDW randomly crashed, when constrained to a fixed heap size. I have noticed that if you fix the heap size, BDW sometimes (and somewhat randomly) fails. I suspect the crash is due to fragmentation and an inability to compact, but who knows; multiple threads allocating is a source of indeterminism. Usually when you run BDW you let it choose its own heap size, but for these experiments I needed to have a fixed heap size instead.

    Another measure of scalability is, how does the collector do as you add marker threads? This chart shows that for both collectors, runtime decreases as you add threads. It also shows that whippet is significantly slower than BDW on this benchmark, which is Very Weird, and I didn't have access to the machine on which these benchmarks were run when preparing the slides in the train... so, let's call this chart a good reminder that Whippet is a WIP :)

    While in the train to Brussels I re-ran this test on the 4-core laptop I had on hand, and got the results that I expected: that whippet performed similarly to BDW, and that adding markers improved things, albeit marginally. Perhaps I should look on a different microbenchmark.

    Incidentally, when you configure Whippet for parallel marking at build-time, it uses a different implementation of the mark stack than the serial marker does, even when only 1 marker is enabled. Certainly the parallel marker could use some tuning.

    Whippet vs BDW: Ephemerons

    BDW: No ephemerons

    Whippet: Yes

    Another deep irritation I have with BDW is that it doesn't support ephemerons. In Guile we have a number of facilities (finalizers, guardians, the symbol table, weak maps, et al) built on what BDW does have (finalizers, weak references), but the implementations of these facilities in Guile are hacky, slow, sometimes buggy, and don't compose (try putting an object in a guardian and giving it a finalizer to see what I mean). It would be much better if the collector API supported ephemerons natively, specifying their relationship to finalizers and other facilities, allowing us to build what we need in terms of those primitives. With our own GC, we can do that, and do it in such a way that it doesn't depend on the details of the specific collection algorithm. The exception of course is that as BDW doesn't support ephemerons per se, what we get is actually a weak-key association instead, whose value can keep the key alive. Oh well, it's no worse than the current situation.

    Whippet vs BDW: Precision

    BDW: ~Always stack-conservative, often heap-conservative

    Whippet: Fully configurable (at compile-time)

    Guile in mid/near-term: C stack conservative, Scheme stack precise, heap precise

    Possibly fully precise: unlock semi-space nursery

    Conservative tracing is a fundamental design feature of the BDW collector, both of roots and of inter-heap edges. You can tell BDW how to trace specific kinds of heap values, but the default is to do a conservative scan, and the stack is always scanned conservatively. In contrast, these tradeoffs are all configurable in Whippet. You can scan the stack and heap precisely, or stack conservatively and heap precisely, or vice versa (though that doesn't make much sense), or both conservatively.

    The long-term future in Guile is probably to continue to scan the C stack conservatively, to continue to scan the Scheme stack precisely (even with BDW-GC, the Scheme compiler emits stack maps and installs a custom mark routine), but to scan the heap as precisely as possible. It could be that a user uses some of our hoary ancient APIs to allocate an object that Whippet can't trace precisely; in that case we'd have to disable evacuation / object motion, but we could still trace other objects precisely.

    If Guile ever moved to a fully precise world, that would be a boon for performance, in two ways: first, that we would get the ability to use a semi-space nursery instead of the sticky-mark-bit algorithm, and relatedly that we wouldn't need to initialize mark bytes when allocating objects. Second, we'd gain the option to use must-move algorithms for the old space as well (mark-compact, semi-space) if we wanted to. But it's just an option, one that Whippet opens up for us.

    Whippet vs BDW: Tools?

    Can build heap tracers and profilers more easily

    More hackable

    (BDW-GC has as many preprocessor directives as whippet has source lines)

    Finally, relative to BDW-GC, whippet has a more intangible advantage: I can actually hack on it. Just as an indication, 15% of BDW source lines are pre-processor directives, and there is one file that has like 150 #ifdef's, not counting #elif's, many of them nested. I haven't done all that much to BDW itself, but I personally find it excruciating to work on.

    Hackability opens up the possibility to build more tools to help us diagnose memory use problems. They aren't in Whippet yet, but there can be!

    Engineering Whippet

    Embed-only, abstractions, migration, modern; timeline

    OK, that rounds out the comparison between BDW and Whippet, at least on a design level. Now I have a few words about how to actually get this new collector into Guile without breaking the bug budget. I try to arrange my work areas on Guile in such a way that I spend a minimum of time on bugs. Part of my strategy is negligence, I will admit, but part also is anticipating problems and avoiding them ahead of time, even if it takes more work up front.

    Engineering Whippet: Embed-only

    github.com/wingo/whippet-gc/

    Semi: 6 kB; Whippet: 22 kB; BDW: 184 kB

    Compile-time specialization:

    • for embedder (e.g. how to forward objects)
    • for selected GC algorithm (e.g. semi-space vs whippet)

    Built apart, but with LTO to remove library overhead

    So the BDW collector is typically shipped as a shared library that you dynamically link to. I should say that we've had an overall good experience with upgrading BDW-GC in the past; its maintainer (Ivan Maidanski) does a great and responsible job on a hard project. It's been many, many years since we had a bug in BDW-GC. But still, BDW is a dependency, and all things being equal we prefer to remove moving parts.

    The approach that Whippet is taking is to be an embed-only library: it's designed to be compiled into your project. It's not an include-only library; it still has to be compiled, but with link-time-optimization and a judicious selection of fast-path interfaces, Whippet is mostly able to avoid abstractions being a performance barrier.

    The result is that Whippet is small, both in source and in binary, which minimizes its maintenance overhead. Taking additional stripped optimized binary size as the metric, by my calculations a semi-space collector (with a large object space and ephemeron support) takes about 6 kB of object file size, whereas Whippet takes 22 and BDW takes 184. Part of how Whippet gets so small is that it is configured in major ways at compile-time (choice of main GC algorithm), and specialized against the program it's embedded in (e.g. how to patch in a forwarding pointer). Having all of the API be internal and visible to LTO, instead of going through ELF symbol resolution, helps in a minor way as well.

    Engineering Whippet: Abstract performance

    User API abstracts over GC algorithm, e.g. semi-space or whippet

    Expose enough info to allow JIT to open-code fast paths

    Inspired by mmtk.io

    Abstractions permit change: of algorithm, over time

    From a composition standpoint, Whippet is actually a few things. Firstly there is an abstract API to make a heap, create per-thread mutators for a heap, and allocate objects for a mutator. There is the aforementioned embedder API, for having the embedding program indicate how to trace objects and install forwarding pointers. Then there is some common code (for example ephemeron support). There are implementations of the different spaces: semi-space, large object, whippet/immix; and finally collector implementations that tie together the spaces into a full implementation of the abstract API. (In practice the more iconic spaces are intertwingled with the collector implementations they define.)

    I don't think I would have gone down this route without seeing some prior work, for example libpas, but it was really MMTk that convinced me that it was worth spending a little time thinking about the GC not as a structureless blob but as a system made of parts and exposing a minimal interface. In particular, I was inspired by seeing that MMTk is able to get good performance while also being abstract, exposing representation details such as how to tell a JIT compiler about allocation fast-paths, but in a principled way. So, thanks MMTk people, for this and so many things!

    I'm particularly happy that the API is abstract enough that it frees up not only the garbage collector to change implementations, but also Guile and other embedders, in that they don't have to bake in a dependency on specific collectors. The semi-space collector has been particularly useful here in ensuring that the abstractions don't accidentally rely on support for object pinning.

    Engineering Whippet: Migration

    API implementable by BDW-GC (except ephemerons)

    First step for Guile: BDW behind Whippet API

    Then switch to whippet/immix (by default)

    The collector API can actually be implemented by the BDW collector. Whippet includes a collector that is a thin wrapper around the BDW API, with support for fast-path allocation via thread-local freelists. In this way we can always check the performance of any given collector against an external fixed point (BDW) as well as a theoretically known point (the semi-space collector).

    Indeed I think the first step for Guile is precisely this: refactor Guile to allocate through the Whippet API, but using the BDW collector as the implementation. This will ensure that the Whippet API is sufficient, and then allow an incremental switch to other collectors.
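
    The shape of that first step is something like the following; the gc_mutator and gc_allocate names here are stand-ins to show the pattern, not the real Whippet API, and GC_MALLOC is the usual BDW allocation entry point:

    #include <stddef.h>
    #include <gc/gc.h>   // BDW-GC
    
    struct gc_mutator { int unused; };   // per-thread allocation context
    
    // Abstract allocation API; today it bottoms out in BDW, later the
    // same call can bottom out in the whippet/immix collector.
    void *gc_allocate (struct gc_mutator *mut, size_t bytes)
    {
      (void) mut;
      return GC_MALLOC (bytes);
    }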

    Incidentally, when it comes to integrating Whippet, there are some choices to be made. I mentioned that it's quite configurable, and this chart can give you some idea. On the left side is one microbenchmark (mt-gcbench) and on the right is another (quads). The first generates a lot of fragmentation and has a wide range of object sizes, including some very large objects. The second is very uniform and many allocations die young.

    (I know these images are small; right-click to open in new tab or pinch to zoom to see more detail.)

    Within each set of bars we have 10 different scenarios, corresponding to different Whippet configurations. (All of these tests are run on my old 4-core laptop with 4 markers if parallel marking is supported, and a 2x heap.)

    The first bar in each side is serial whippet: one marker. Then we see parallel whippet: four markers. Great. Then there's generational whippet: one marker, but just scanning objects allocated in the current cycle, hoping that produces enough holes. Then generational parallel whippet: the same as before, but with 4 markers.

    The next 4 bars are the same: serial, parallel, generational, parallel-generational, but with one difference: the stack is scanned conservatively instead of precisely. You might be surprised but all of these configurations actually perform better than their precise counterparts. I think the reason is that the microbenchmark uses explicit handle registration and deregistration (it's a stack) instead of compiler-generated stack maps in a side table, but I'm not precisely (ahem) sure.

    Finally the next 2 bars are serial and parallel collectors, but marking everything conservatively. I have generational measurements for this configuration but it really doesn't make much sense to assume that you can emit write barriers in this context. These runs are slower than the previous configuration, mostly because there are some non-pointer locations that get scanned conservatively that wouldn't get scanned precisely. I think conservative heap scanning is less efficient than precise but I'm honestly not sure, there are some instruction locality arguments in the other direction. For mt-gcbench though there's a big array of floating-point values that a precise scan will omit, which causes significant overhead there. Probably for this configuration to be viable Whippet would need the equivalent of BDW's API to allocate known-pointerless objects.

    Engineering Whippet: Modern

    stdatomic

    constexpr-ish

    pthreads (for parallel markers)

    No void*; instead struct types: gc_ref, gc_edge, gc_conservative_ref, etc

    Embed-only lib avoids any returns-struct-by-value ABI issue

    Rust? MMTk; supply chain concerns

    Platform abstraction for conservative root finding

    I know it's a sin, but Whippet is implemented in C. I know. The thing is, in the Guile context I need to not introduce wild compile-time dependencies, because of bootstrapping. And I know that Rust is a fine language to use for GC implementation, so if that's what you want, please do go take a look at MMTk! It's a fantastic project, written in Rust, and it can just slot into your project, regardless of the language your project is written in.

    But if what you're looking for is something in C, well then you have to pick and choose your C. In the case of Whippet I try to use the limited abilities of C to help prevent bugs; for example, I generally avoid void* and instead wrap pointers or addresses into single-field structs that can't be automatically cast, for example to prevent a struct gc_ref that denotes an object reference (or NULL; it's an option type) from being confused with a struct gc_conservative_ref, which might not point to an object at all.
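
    Concretely, the single-field-struct trick looks something like this; the exact helpers are a sketch, not necessarily Whippet's definitions:

    #include <stdint.h>
    
    // Distinct struct types: the compiler refuses to implicitly convert
    // one into the other, unlike bare void* or uintptr_t.
    struct gc_ref              { uintptr_t value; };  // object reference, or 0 for "none"
    struct gc_conservative_ref { uintptr_t value; };  // address that might not be an object
    
    static inline struct gc_ref gc_ref (uintptr_t value)
    {
      return (struct gc_ref) { value };
    }
    
    static inline int gc_ref_is_null (struct gc_ref ref)
    {
      return ref.value == 0;
    }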

    (Of course, by "C" I mean "C as compiled by gcc and clang with -fno-strict-aliasing". I don't know if it's possible to implement even a simple semi-space collector in C without aliasing violations. Can you access a Foo* object within a mmap'd heap through its new address after it has been moved via memcpy? Maybe not, right? Thoughts are welcome.)

    As a project written in the 2020s instead of the 1990s, Whippet gets to assume a competent C compiler, for example relying on the compiler to inline and fold branches where appropriate. As in libpas, Whippet liberally passes functions as values to inline functions, and relies on the compiler to boil away function calls. Whippet only uses the C preprocessor when it absolutely has to.

    Finally, there is a clean abstraction for anything that's platform-specific, for example finding the current stack bounds. I haven't compiled this code on Windows or macOS yet, but I am not anticipating too many troubles.

    Engineering Whippet: Timeline

    As time permits

    Whippet TODO: heap growth/shrinking, finalizers, safepoint API

    Guile TODO: safepoints; heap-conservative first

    Precise heap TODO: gc_trace_object, SMOBs, user structs with raw ptr fields, user gc_malloc usage; 3.2

    6 months for 3.1.1; 12 for 3.2.0 ?

    So where does this get us? Where are we now?

    For Whippet itself, I think it's mostly done -- enough to start shifting focus to some different phase. It's missing some needed features, notably the ability to grow the heap at all, as I've been in fixed-heap-size-only mode during development. It's also missing finalizers. And, something needs to be done to unify Guile's handling of safepoints and processing of asynchronous signals with Whippet's need to stop all mutators. Some details remain.

    But, I think we are close to ready to start integrating in Guile. At first this is just porting Guile to use the Whippet API to access BDW instead of using BDW directly. This whole thing is a side project for me that I work on when I can, so it doesn't exactly proceed at full pace. Perhaps this takes 6 months. Then we can cut a new unstable release, and hopefully release 3.2 with support for the Immix-flavored collector in another 6 or 9 months.

    I thought that we would be forced to make ABI changes, if only because some legacy APIs assume conservative tracing of object contents. But after a discussion at FOSDEM with Carlo Piovesan I realized this isn't true: because the decision to evacuate or not is made on a collection-by-collection basis, I could simply disable evacuation if the user ever uses a facility that might prohibit object motion, for example if they ever define a SMOB type. If the user wants evacuation, they need to be more precise with their data types, but either way Guile is ready.

    Whippet: A Better GC?

    An Immix-derived GC

    github.com/wingo/whippet-gc/

    https://wingolog.org/tags/gc/

    Guile 3.2 ?

    Thanks to MMTk authors for inspiration!

    And that's it! Thanks for reading all the way here. Comments are quite welcome.

    As I mentioned in the very beginning, this talk was really about Whippet in the context of Guile. There is a different talk to be made about Guile+Whippet versus other language implementations, for example those with concurrent marking or semi-space nurseries or the like. Yet another talk is Whippet in the context of other GC algorithms. But this is a start. It's something I've been working on for a while now already and I'm pleased that it's gotten to a point where it seems to be at least OK, at least an improvement with respect to BDW-GC in some ways.

    But before leaving you, another chart, to give a more global idea of the state of things. Here we compare a single mutator thread performing a specific microbenchmark that makes trees and also lots of fragmentation, across three different GC implementations and a range of heap sizes. The heap size multipliers in this and in all the other tests in this post are calculated analytically based on what the test thinks its maximum heap size should be, not by measuring minimum heap sizes that work. This size is surely lower than the actual maximum required heap size due to internal fragmentation, but the tests don't know about this.

    The three collectors are BDW, a semi-space collector, and whippet. Semi-space manages to squeeze in less than 2x of a heap multiplier because it has (and whippet has) a separate large object space that isn't ever evacuated.

    What we expect is that tighter heaps impose more GC time, and indeed we see that times are higher on the left side than the right.

    Whippet is the only implementation that manages to run at a 1.3x heap, but it takes some time. It's slower than BDW at a 1.5x heap but better from there on out, until what appears to be a bug or pathology makes it take longer at 5x. Adding memory should always decrease run time.

    The semi-space collector starts working at 1.75x and then surpasses all collectors from 2.5x onwards. We expect the semi-space collector to win for big heaps, because its overhead is proportional to live data only, whereas mark-sweep and mark-region collectors have to sweep, which is proportional to heap size, and indeed that's what we see.

    I think this chart shows we have some tuning yet to do. The range between 2x and 3x is quite acceptable, but we need to see what's causing Whippet to be slower than BDW at 1.5x. I haven't done as much performance tuning as I would like to but am happy to finally be able to know where we stand.

    And that's it! Happy hacking, friends, and may your heap sizes be ever righteous.

    by Andy Wingo at February 07, 2023 01:14 PM

    February 06, 2023

    Thomas Vander SticheleSRE Philosophy With Jennifer Mace

    (Thomas Vander Stichele)

    "Even the most junior SRE on call starts having director authority. [..] There is a power in that relationship that SRE does have when they think something is in danger. And it's a power we have to be careful not to misuse. But it's important, because that's our job."

    Macey is the guest on Episode 1 of SRE Prodcast, Google's podcast about Site Reliability Engineering. She goes in-depth on some of the core tenets of SRE, including risk, on-call, toil, design involvement, and more. (As a side note, I'm reasonably certain that I'm not the entertaining Belgian that was causing her team failure loops, but I'm too afraid to ask.)

    The whole series is worth a listen, but, just like the podcast itself, start with Macey's advice.

    "My definition of toil: toil is boring or repetitive work that does not gain you a permanent improvement."

    Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

    by Thomas at February 06, 2023 09:25 AM

    February 04, 2023

    Thomas Vander SticheleHow Not to Say the Wrong Thing

    (Thomas Vander Stichele)

    "if you’re going to open your mouth, ask yourself if what you are about to say is likely to provide comfort and support. If it isn’t, don’t say it. Don’t, for example, give advice."

    Susan Silk's Ring Theory is a helpful model to navigate what not to say during times of grief and traumatic events.

    Picture a center ring, and inside it the people most affected by what's going on. Picture a larger circle around it, containing the people closest to those in the center. Repeat outwards.

    The person in the center ring can say anything they want to anyone, anywhere.
    Everyone else can say those things too, but only to people in the larger outside rings. Otherwise, you support and comfort.

    Now, consider where in this diagram you are, and where the people you are talking to are.

    "Comfort IN, dump OUT."

    This model applies in other situations - for example, managers are better off complaining to their own managers or peers, while supporting their own reports and absorbing their complaints with empathy and compassion.

    Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

    by Thomas at February 04, 2023 02:24 AM

    January 27, 2023

    Andy Wingothree approaches to heap sizing

    (Andy Wingo)

    How much memory should a program get? Tonight, a quick note on sizing for garbage-collected heaps. There are a few possible answers, depending on what your goals are for the system.

    you: doctor science

    Sometimes you build a system and you want to study it: to identify its principal components and see how they work together, or to isolate the effect of altering a single component. In that case, what you want is a fixed heap size. You run your program a few times and determine a heap size that is sufficient for your problem, and then in future run the program with that new fixed heap size. This allows you to concentrate on the other components of the system.

    A good approach to choosing a fixed heap size for a program is to determine the minimum heap size the program can run in by bisection, then multiply that size by a constant factor. Garbage collection is a space/time tradeoff: the factor you choose represents a point on the space/time tradeoff curve. I would choose 1.5 in general, but this is arbitrary; I'd go more with 3 or even 5 if memory isn't scarce and I'm really optimizing for throughput.
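
    To make that concrete, here is a minimal sketch of the bisection in C. Nothing here is Guile API; the runs_in_heap predicate is a hypothetical stand-in for "run the program with this fixed heap size and see whether it completes", and the stub below just pretends the program needs 300 MB.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-in: pretend the program needs 300 MB to complete. */
    static bool runs_in_heap(size_t bytes) {
      return bytes >= 300u * 1024 * 1024;
    }

    static size_t min_heap_size(size_t lo, size_t hi) {
      /* Invariant: the program fails at lo and succeeds at hi. */
      while (hi - lo > 1024 * 1024) {   /* stop at 1 MB precision */
        size_t mid = lo + (hi - lo) / 2;
        if (runs_in_heap(mid)) hi = mid; else lo = mid;
      }
      return hi;
    }

    int main(void) {
      size_t min = min_heap_size(1024 * 1024, (size_t)4 << 30);
      printf("minimum heap: %zu bytes; fixed heap at 1.5x: %zu bytes\n",
             min, min + min / 2);
      return 0;
    }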

    Note that a fixed-size heap is not generally what you want. It's not a good user experience for running ./foo at the command line, for example. The reason is that program memory use is usually a function of the program's input; you only sometimes know what the input will look like, and even then you don't know what effect a given input has on memory until you run the program. Still, if you have a team of operations people that knows what input patterns look like and has experience with a GC-using server-side process, fixed heap sizes could be a good solution there too.

    you: average josé/fina

    On the other end of the spectrum is the average user. You just want to run your program. The program should have the memory it needs! Not too much of course; that would be wasteful. Not too little either; I can tell you, my house is less than 100m², and I spend way too much time shuffling things from one surface to another. If I had more space I could avoid this wasted effort, and in a similar way, you don't want to be too stingy with a program's heap. Do the right thing!

    Of course, you probably have multiple programs running on a system, all making similar heap sizing choices at the same time, and the relative needs and importances of these programs can change over time, for example as you switch tabs in a web browser. So "the right thing" really refers to overall system performance, whereas what you are controlling is just one process's heap size; what is the Right Thing, anyway?

    My corner of the GC discourse agrees that something like the right solution was outlined by Kirisame, Shenoy, and Panchekha in a 2022 OOPSLA paper, in which the optimum heap size depends on the allocation rate and the GC cost for a process, both measured on an ongoing basis. Interestingly, their heap size calculation can be made by each process without coordination, yet it results in a whole-system optimum.
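
    To give a flavor of the calculation (this is my paraphrase, not the paper's exact formulation): each process periodically measures its live data, its allocation rate, and its collection speed, and sets its heap limit to the live data plus an extra-space term that grows as roughly the square root of live data times allocation rate over collection speed, scaled by a tuning parameter. A sketch in C, with the constant and units made up:

    #include <math.h>
    #include <stddef.h>

    /* A hand-wavy adaptive sizer in the spirit of MemBalancer; the
       square-root shape is my reading of the paper, and the tuning
       parameter and units are assumptions of this sketch. */
    static size_t heap_limit(double live_bytes, double alloc_bytes_per_sec,
                             double gc_speed_bytes_per_sec, double tuning) {
      double extra = sqrt(live_bytes * alloc_bytes_per_sec
                          / (gc_speed_bytes_per_sec * tuning));
      return (size_t)(live_bytes + extra);
    }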

    There are some details but you can imagine some instinctive results: for example, when a program stops allocating because it's waiting for some external event like user input, it doesn't need so much memory, so it can start shrinking its heap. After all, it might be quite a while before the program has new input. If the program starts allocating again, perhaps because there is new input, it can grow its heap rapidly, and might then shrink again later. The mechanism by which this happens is pleasantly simple, and I salute (again!) the authors for identifying the practical benefits that an abstract model brings to the problem domain.

    you: a damaged, suspicious individual

    Hoo, friends-- I don't know. I've seen some things. Not to exaggerate, I like to think I'm a well-balanced sort of fellow, but there's some suspicion too, right? So when I imagine a background thread determining that my web server hasn't gotten so much action in the last 100ms and that really what it needs to be doing is shrinking its heap, kicking off additional work to mark-compact it or whatever, when the whole point of the virtual machine is to run that web server and not much else, only to have to probably give it more heap 50ms later, I-- well, again, I exaggerate. The MemBalancer paper has a heartbeat period of 1 Hz and a smoothing function for the heap size, but it just smells like danger. Do I need danger? I mean, maybe? Probably in most cases? But maybe it would be better to avoid danger if I can. Heap growth is usually both necessary and cheap when it happens, but shrinkage is never necessary and is sometimes expensive because you have to shuffle around data.

    So, I think there is probably a case for a third mode: not fixed, not adaptive like the MemBalancer approach, but just growable: grow the heap when and if its size is less than a configurable multiplier (e.g. 1.5) of live data. Never shrink the heap. If you ever notice that a process is taking too much memory, manually kill it and start over, or whatever. Default to adaptive, of course, but when you start to troubleshoot high GC overhead in a long-lived process, perhaps switch to growable to see its effect.
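
    As a sketch, such a growable policy is almost embarrassingly small; the struct and hook name below are made up for illustration, not Guile's actual interface.

    #include <stddef.h>

    /* "Growable": after each collection, raise the limit to
       multiplier * live data if we are below it; never lower it. */
    struct growable_heap { size_t limit_bytes; double multiplier; };

    static void growable_after_gc(struct growable_heap *h, size_t live_bytes) {
      size_t target = (size_t)(live_bytes * h->multiplier);
      if (target > h->limit_bytes)
        h->limit_bytes = target;   /* grow */
      /* else: keep the current limit; we never shrink */
    }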

    unavoidable badness

    There is some heuristic badness that one cannot avoid: even with the adaptive MemBalancer approach, you have to choose a point on the space/time tradeoff curve. Regardless of what you do, your system will grow a hairy nest of knobs and dials, and if your system is successful there will be a lively aftermarket industry of tuning articles: "Are you experiencing poor object transit? One knob you must know"; "Four knobs to heaven"; "It's raining knobs"; "GC engineers DO NOT want you to grab this knob!!"; etc. (I hope that my British readers are enjoying this.)

    These ad-hoc heuristics are just part of the domain. What I want to say though is that having a general framework for how you approach heap sizing can limit knob profusion, and can help you organize what you have into a structure of sorts.

    At least, this is what I tell myself; inshallah. Now I have told you too. Until next time, happy hacking!

    by Andy Wingo at January 27, 2023 09:45 PM

    January 24, 2023

    Andy Wingoparallel ephemeron tracing

    (Andy Wingo)

    Hello all, and happy new year. Today's note continues the series on implementing ephemerons in a garbage collector.

    In our last dispatch we looked at a serial algorithm to trace ephemerons. However, production garbage collectors are parallel: during collection, they trace the object graph using multiple worker threads. Our problem is to extend the ephemeron-tracing algorithm with support for multiple tracing threads, without introducing stalls or serial bottlenecks.

    Recall that we ended up having to define a table of pending ephemerons:

    struct gc_pending_ephemeron_table {
      struct gc_ephemeron *resolved;
      size_t nbuckets;
      struct gc_ephemeron *buckets[0];
    };
    

    This table holds pending ephemerons that have been visited by the graph tracer but whose keys haven't been found yet, as well as a singly-linked list of resolved ephemerons that are waiting to have their values traced. As a global data structure, the pending ephemeron table is a point of contention between tracing threads that we need to design around.

    a confession

    Allow me to confess my sins: things would be a bit simpler if I didn't allow tracing workers to race.

    As background, if your GC supports marking in place instead of always evacuating, then there is a mark bit associated with each object. To reduce the overhead of contention, a common strategy is to actually use a whole byte for the mark bit, and to write to it using relaxed atomics (or even raw stores). This avoids the cost of a compare-and-swap, but at the cost that multiple marking threads might see that an object's mark was unset, go to mark the object, and think that they were the thread that marked the object. As far as the mark byte goes, that's OK because everybody is writing the same value. The object gets pushed on the to-be-traced grey object queues multiple times, but that's OK too because tracing should be idempotent.
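
    In code, the racy mark-byte idiom looks something like this (a sketch with C11 atomics, not Whippet's actual code):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* One mark byte per object, written with relaxed atomics.  Two
       threads can both observe 0 and both "win"; that's fine, because
       they store the same value and tracing is idempotent. */
    static bool try_mark(_Atomic uint8_t *mark_byte) {
      if (atomic_load_explicit(mark_byte, memory_order_relaxed))
        return false;                          /* already marked */
      atomic_store_explicit(mark_byte, 1, memory_order_relaxed);
      return true;                             /* possibly a duplicate "win" */
    }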

    This is a common optimization for parallel marking, and it doesn't have any significant impact on other parts of the GC--except ephemeron marking. For ephemerons, because the state transition isn't simply from unmarked to marked, we need more coordination.

    high level

    The parallel ephemeron marking algorithm modifies the serial algorithm in just a few ways:

    1. We have an atomically-updated state field in the ephemeron, used to know if e.g. an ephemeron is pending or resolved;

    2. We use separate fields for the pending and resolved links, to allow for concurrent readers across a state change;

    3. We introduce "traced" and "claimed" states to resolve races between parallel tracers on the same ephemeron, and track the "epoch" at which an ephemeron was last traced;

    4. We remove resolved ephemerons from the pending ephemeron hash table lazily, and use atomic swaps to pop from the resolved ephemerons list;

    5. We have to re-check key liveness after publishing an ephemeron to the pending ephemeron table.

    Regarding the first point, there are four possible values for the ephemeron's state field:

    enum {
      TRACED, CLAIMED, PENDING, RESOLVED
    };
    

    The state transition diagram looks like this:

      ,----->TRACED<-----.
     ,         | ^        .
    ,          v |         .
    |        CLAIMED        |
    |  ,-----/     \---.    |
    |  v               v    |
    PENDING--------->RESOLVED
    

    With this information, we can start to flesh out the ephemeron object itself:

    struct gc_ephemeron {
      uint8_t state;
      uint8_t is_dead;
      unsigned epoch;
      struct gc_ephemeron *pending;
      struct gc_ephemeron *resolved;
      void *key;
      void *value;
    };
    

    The state field holds one of the four state values; is_dead indicates if a live ephemeron was ever proven to have a dead key, or if the user explicitly killed the ephemeron; and epoch is the GC count at which the ephemeron was last traced. Ephemerons are born TRACED in the current GC epoch, and the collector is responsible for incrementing the current epoch before each collection.

    algorithm: tracing ephemerons

    When the collector first finds an ephemeron, it does a compare-and-swap (CAS) on the state from TRACED to CLAIMED. If that succeeds, we check the epoch; if it's current, we revert to the TRACED state: there's nothing to do.

    (Without marking races, you wouldn't need either TRACED or CLAIMED states, or the epoch; it would be implicit in the fact that the ephemeron was being traced at all that you had a TRACED ephemeron with an old epoch.)

    So now we have a CLAIMED ephemeron with an out-of-date epoch. We update the epoch and clear the pending and resolved fields, setting them to NULL. If, then, the ephemeron is_dead, we are done, and we go back to TRACED.

    Otherwise we check if the key has already been traced. If so we forward it (if evacuating) and then trace the value edge as well, and transition to TRACED.

    Otherwise we have a live E but we don't know about K; this ephemeron is pending. We transition E's state to PENDING and add it to the front of K's hash bucket in the pending ephemerons table, using CAS to avoid locks.

    We then have to re-check if K is live, after publishing E, to account for other threads racing to mark K while we mark E; if indeed K is live, then we transition to RESOLVED and push E on the global resolved ephemeron list, using CAS, via the resolved link.

    So far, so good: either the ephemeron is fully traced, or it's pending and published, or (rarely) published-then-resolved and waiting to be traced.
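
    Here is that claim-and-check dance as a C sketch. The struct is restated with an atomic state field so we can CAS on it, and the key-marking, value-tracing, pending-table and resolved-list helpers are hypothetical stand-ins for the real collector hooks (forwarding for a moving GC is elided); the real implementation does quite a bit more.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Same shape as the struct above, except that for this sketch the
       state field is declared _Atomic so we can CAS on it. */
    struct gc_ephemeron {
      _Atomic uint8_t state;
      uint8_t is_dead;
      unsigned epoch;
      struct gc_ephemeron *pending;
      struct gc_ephemeron *resolved;
      void *key;
      void *value;
    };

    enum { TRACED, CLAIMED, PENDING, RESOLVED };

    /* Hypothetical collector hooks. */
    extern bool key_is_marked(void *key);
    extern void trace_value(void *value);
    extern void pending_table_push(struct gc_ephemeron *e);
    extern void resolved_list_push(struct gc_ephemeron *e);

    static void trace_ephemeron(struct gc_ephemeron *e, unsigned epoch) {
      uint8_t expected = TRACED;
      if (!atomic_compare_exchange_strong(&e->state, &expected, CLAIMED))
        return;                          /* pending, resolved, or claimed elsewhere */
      if (e->epoch == epoch) {           /* already traced this cycle */
        atomic_store(&e->state, TRACED);
        return;
      }
      e->epoch = epoch;
      e->pending = e->resolved = NULL;
      if (e->is_dead) {                  /* killed, or key proven dead before */
        atomic_store(&e->state, TRACED);
        return;
      }
      if (key_is_marked(e->key)) {       /* key already live: trace the value */
        trace_value(e->value);
        atomic_store(&e->state, TRACED);
        return;
      }
      atomic_store(&e->state, PENDING);
      pending_table_push(e);             /* publish on the key's hash bucket */
      expected = PENDING;                /* re-check the key after publishing */
      if (key_is_marked(e->key) &&
          atomic_compare_exchange_strong(&e->state, &expected, RESOLVED))
        resolved_list_push(e);           /* onto the global resolved list */
    }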

    algorithm: tracing objects

    The annoying thing about tracing ephemerons is that it potentially impacts tracing of all objects: any object could be the key that resolves a pending ephemeron.

    When we trace an object, we look it up in the pending ephemeron hash table. But, as we traverse the chains in a bucket, we also load each node's state. If we find a node that's not in the PENDING state, we atomically forward its predecessor to point to its successor. This is correct for concurrent readers because the end of the chain is always reachable: we only skip nodes that are not PENDING, nodes never become PENDING after they transition away from being PENDING, and we only add PENDING nodes to the front of the chain. We even leave the pending field in place, so that any concurrent reader of the chain can still find the tail, even when the ephemeron has gone on to be RESOLVED or even TRACED.

    (I had thought I would need Tim Harris' atomic list implementation, but it turns out that since I only ever insert items at the head, having annotated links is not necessary.)

    If we find a PENDING ephemeron that has K as its key, then we CAS its state from PENDING to RESOLVED. If this works, we CAS it onto the front of the resolved list. (Note that we also have to forward the key at this point, for a moving GC; this was a bug in my original implementation.)
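
    Continuing the sketch, the per-object check and the lazy unlinking might look like this; in the real thing the loads and stores on the chain links are atomic, which I elide here for brevity.

    /* Walk the pending-ephemeron bucket for a just-marked object KEY:
       lazily unlink nodes that are no longer PENDING, and resolve any
       PENDING ephemeron whose key this is. */
    static void resolve_pending_for_key(struct gc_ephemeron **bucket, void *key) {
      struct gc_ephemeron **link = bucket;   /* predecessor's pending field */
      for (struct gc_ephemeron *e = *link; e; e = *link) {
        if (atomic_load(&e->state) != PENDING) {
          *link = e->pending;   /* snip; e->pending stays in place for readers */
          continue;
        }
        if (e->key == key) {
          uint8_t expected = PENDING;
          if (atomic_compare_exchange_strong(&e->state, &expected, RESOLVED))
            resolved_list_push(e);   /* (and forward the key, if moving) */
        }
        link = &e->pending;
      }
    }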

    algorithm: resolved ephemerons

    Periodically a thread tracing the graph will run out of objects to trace (its mark stack is empty). That's a good time to check if there are resolved ephemerons to trace. We atomically exchange the global resolved list with NULL, and if there were any resolved ephemerons, we trace their values and transition them to TRACED.
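
    In the sketch, pumping the resolved list is a single atomic exchange; the atomic list head is again an assumption of the sketch.

    /* Grab the whole resolved list at once and trace what we got. */
    static void pump_resolved(struct gc_ephemeron *_Atomic *resolved_head) {
      for (struct gc_ephemeron *e = atomic_exchange(resolved_head, NULL);
           e; e = e->resolved) {
        trace_value(e->value);
        atomic_store(&e->state, TRACED);
      }
    }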

    At the very end of the GC cycle, we sweep the pending ephemeron table, marking any ephemerons that are still there as is_dead, transitioning them back to TRACED, and clearing the buckets of the pending ephemeron table as we go.

    nits

    So that's it. There are some drawbacks, for example that this solution takes at least three words per ephemeron. Oh well.

    There is also an annoying point of serialization, related to the lazy ephemeron resolution optimization. Checking the pending ephemeron table on every object visit is overhead, and it would be nice to avoid it. So instead, we start in "lazy" mode, in which pending ephemerons are never resolved by marking; then once the mark stack / grey object worklist fully empties, we sweep through the pending ephemeron table, checking each ephemeron's key to see if it was visited in the end, and resolving those ephemerons; we then switch to "eager" mode in which each object visit can potentially resolve ephemerons. In this way the cost of ephemeron tracing is avoided for the part of the graph that is strongly reachable. However, with parallel markers, do you switch to eager mode when any thread runs out of objects to mark, or when all threads run out? You get the greatest parallelism with the former, but you run the risk of some workers running out of work prematurely, while there is still a significant part of the strongly-reachable graph left to traverse. If you wait for all threads to be done, you introduce a serialization point. There is a related question of when to pump the resolved ephemerons list. But these are engineering details.

    Speaking of details, there are some gnarly pitfalls, particularly that you have to be very careful about pre-visit versus post-visit object addresses. For a semi-space collector, visiting an object will move it, and the pending ephemeron table is by definition keyed by pre-visit (fromspace) object addresses; so you need to be sure to trace the ephemeron key on any transition to RESOLVED, and there are a few places this happens (the re-check after publishing, sweeping the table after transitioning from lazy to eager, and resolving eagerly).

    implementation

    If you've read this far, you may be interested in the implementation; it's only a few hundred lines long. It took me quite a while to whittle it down!

    Ephemerons are challenging from a software engineering perspective, because they are logically a separate module, but they interact both with users of the GC and with the collector implementations. It's tricky to find the abstractions that work for all GC algorithms, whether they mark in place or move their objects, and whether they mark the heap precisely or if there are some conservative edges. But if this is the sort of thing that interests you, voilà the API for users and the API to and from collector implementations.

    And, that's it! I am looking forward to climbing out of this GC hole, one blog at a time. There are just a few more features before I can seriously attack integrating this into Guile. Until the next time, happy hacking :)

    by Andy Wingo at January 24, 2023 10:48 AM

    January 23, 2023

    GStreamerGStreamer 1.22.0 new major stable release

    (GStreamer)

    The GStreamer team is excited to announce a new major feature release of your favourite cross-platform multimedia framework!

    As always, this release is again packed with new features, bug fixes and many other improvements.

    The 1.22 release series adds new features on top of the previous 1.20 series and is part of the API and ABI-stable 1.x release series of the GStreamer multimedia framework.

    Highlights:

    • AV1 video codec support improvements
    • New HLS, DASH and Microsoft Smooth Streaming adaptive streaming clients
    • Qt6 support for rendering video inside a QML scene
    • Minimal builds optimised for binary size, including only the individual elements needed
    • Playbin3, Decodebin3, UriDecodebin3, Parsebin enhancements and stabilisation
    • WebRTC simulcast support and support for Google Congestion Control
    • WebRTC-based media server ingestion/egress (WHIP/WHEP) support
    • New easy to use batteries-included WebRTC sender plugin
    • Easy RTP sender timestamp reconstruction for RTP and RTSP
    • ONVIF timed metadata support
    • New fragmented MP4 muxer and non-fragmented MP4 muxer
    • New plugins for Amazon AWS storage and audio transcription services
    • New gtk4paintablesink and gtkwaylandsink renderers
    • New videocolorscale element that can convert and scale in one go for better performance
    • High bit-depth video improvements
    • Touchscreen event support in navigation API
    • Rust plugins now shipped in macOS and Windows/MSVC binary packages
    • H.264/H.265 timestamp correction elements for PTS/DTS reconstruction before muxers
    • Improved design for DMA buffer sharing and modifier handling for hardware-accelerated video decoders/encoders/filters and capturing/rendering on Linux
    • Video4Linux2 hardware accelerated decoder improvements
    • CUDA integration and Direct3D11 integration and plugin improvements
    • New H.264 / AVC, H.265 / HEVC and AV1 hardware-accelerated video encoders for AMD GPUs using the Advanced Media Framework (AMF) SDK
    • applemedia: H.265 / HEVC video encoding + decoding support
    • androidmedia: H.265 / HEVC video encoding support
    • New "force-live" property for audiomixer, compositor, glvideomixer, d3d11compositor etc.
    • Lots of new plugins, features, performance improvements and bug fixes

    For more details check out the GStreamer 1.22 release notes.

    Binaries for Android, iOS, macOS and Windows will be provided in due course.

    You can download release tarballs directly here: gstreamer, gst-plugins-base, gst-plugins-good, gst-plugins-ugly, gst-plugins-bad, gst-libav, gst-rtsp-server, gst-python, gst-editing-services, gst-devtools, gstreamer-vaapi, gstreamer-sharp, gst-omx, or gstreamer-docs.

    January 23, 2023 10:00 PM

    January 14, 2023

    GStreamerGStreamer 1.21.90 pre-release (1.22 rc1)

    (GStreamer)

    The GStreamer team is excited to announce the first release candidate for the upcoming stable 1.22 release series.

    This 1.21.90 pre-release is for testing and development purposes in the lead-up to the stable 1.22 series, which is now feature-frozen and scheduled for release very soon. Any newly-added API can still change until the stable release, although that is extremely unlikely to happen at this point.

    Depending on how things go there might be more release candidates in the next couple of days, but in any case we're aiming to get 1.22.0 out as soon as possible.

    Preliminary release notes highlighting all the new features, bugfixes, performance optimizations and other important changes will be made available in the next few days.

    Binaries for Android, iOS, Mac OS X and Windows will be available at the usual location in due course.

    Release tarballs can be downloaded directly here:

    As always, please let us know of any issues you run into by filing an issue in Gitlab.

    January 14, 2023 01:00 AM

    January 03, 2023

    Víctor JáquezGStreamer compilation with third party libraries

    Suppose that you have to hack on a GStreamer element which requires a library that is not (yet) packaged by your distribution, nor wrapped as a Meson subproject. What do you do?

    In our case, we needed the latest versions of the Vulkan-Headers, Vulkan-Loader and Vulkan-Tools repositories, which are interrelated CMake projects.

    For these cases, GStreamer’s uninstalled development scripts can use a special directory: gstreamer/prefix. As the README.md says:

    NOTE: In the development environment, a fully usable prefix is also configured in gstreamer/prefix where you can install any extra dependency/project.

    This means that the gstenv.py script (responsible for setting up the uninstalled development environment) will add

    • gstreamer/prefix/bin in PATH for executable files.
    • gstreamer/prefix/lib and gstreamer/prefix/share/gstreamer-1.0 in GST_PLUGIN_PATH, for out-of-tree elements.
    • gstreamer/prefix/lib in GI_TYPELIB_PATH for GObject Introspection metadata.
    • gstreamer/prefix/lib/pkgconfig in PKG_CONFIG_PATH for third party dependencies (our case!).
    • gstreamer/prefix/etc/xdg in XDG_CONFIG_DIRS for XDG-compliant configuration files.
    • gstreamer/prefix/lib and gstreamer/prefix/lib64 in LD_LIBRARY_PATH for third party libraries.

    Therefore, the general idea is to compile those third party libraries with gstreamer/prefix as their installation prefix.

    In our case, the Vulkan repositories are interrelated, so they need to be compiled in a certain order. Also, for self-containment, we decided to clone them in gstreamer/subprojects.

    Vulkan-Headers

    $ cd ~/gst/gstreamer/subprojects
    $ git clone git@github.com:KhronosGroup/Vulkan-Headers.git
    $ cd Vulkan-Headers
    $ mkdir build
    $ cd build
    $ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
    $ cmake --build . --target install
    

    Vulkan-Loader

    $ cd ~/gst/gstreamer/subprojects
    $ git clone git@github.com:KhronosGroup/Vulkan-Loader.git
    $ cd Vulkan-Loader
    $ mkdir build
    $ cd build
    $ cmake -DCMAKE_BUILD_TYPE=Debug -DVULKAN_HEADERS_INSTALL_DIR=/home/vjaquez/gst/gstreamer/prefix -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
    $ cmake --build . --target install
    

    Vulkan-Tools

    $ cd ~/gst/gstreamer/subprojects
    $ git clone git@github.com:KhronosGroup/Vulkan-Tools.git
    $ cd Vulkan-Tools
    $ mkdir build
    $ cd build
    $ cmake -DCMAKE_BUILD_TYPE=Debug -DVULKAN_HEADERS_INSTALL_DIR=/home/vjaquez/gst/gstreamer/prefix -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
    $ cmake --build . --target install
    

    Right now we have the Vulkan headers and the Vulkan loader pkg-config file in place. And we should be able to compile GStreamer. Right?

    Not exactly, because gstenv.py only sets the environment variables for the development environment, not for GStreamer compilation. But the solution is simple, since everything is already in the right place: just set PKG_CONFIG_PATH when executing meson setup:

    $ PKG_CONFIG_PATH=/home/vjaquez/gst/gstreamer/prefix/lib/pkgconfig meson setup --buildtype=debug build
    

    by vjaquez at January 03, 2023 06:35 PM