March 20, 2023

Andy Wingo: a world to win: webassembly for the rest of us

(Andy Wingo)

Good day, comrades!

Today I'd like to share the good news that WebAssembly is finally coming for the rest of us weirdos.

A world to win

WebAssembly for the rest of us

17 Mar 2023 – BOB 2023

Andy Wingo

Igalia, S.L.

This is a transcript-alike of a talk that I gave last week at BOB 2023, a gathering in Berlin of people who are using "technologies beyond the mainstream" to get things done: Haskell, Clojure, Elixir, and so on. PDF slides here, and I'll link the video too when it becomes available.

WebAssembly, the story

WebAssembly is an exciting new universal compute platform

WebAssembly: what even is it? Not a programming language that you would write software in, but rather a compilation target: a sort of assembly language, if you will.

WebAssembly, the pitch

Predictable portable performance

  • Low-level
  • Within 10% of native

Reliable composition via isolation

  • Modules share nothing by default
  • No nasal demons
  • Memory sandboxing

Compile your code to WebAssembly for easier distribution and composition

If you look at what the characteristics of WebAssembly are as an abstract machine, to me there are two main areas in which it is an advance over the alternatives.

Firstly it's "close to the metal" -- if you compile for example an image-processing library to WebAssembly and run it, you'll get similar performance when compared to compiling it to x86-64 or ARMv8 or what have you. (For image processing in particular, native still generally wins because the SIMD primitives in WebAssembly are narrower and because getting the image into and out of WebAssembly may imply a copy, but the general point remains.) WebAssembly's instruction set covers a broad range of low-level operations, allowing compilers to produce efficient code.

The novelty here is that WebAssembly is not only portable but also successful. We language weirdos know that it's not enough to do something technically better: you have to also succeed in getting traction for your alternative.

The second interesting characteristic is that WebAssembly is (generally speaking) a principle-of-least-authority architecture: a WebAssembly module starts with access to nothing but itself. Any capabilities that an instance of a module has must be explicitly shared with it by the host at instantiation-time. This is unlike DLLs which have access to all of main memory, or JavaScript libraries which can mutate global objects. This characteristic allows WebAssembly modules to be reliably composed into larger systems.
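To make that concrete, here is a minimal sketch of a module whose only capability is a single host-provided import; the "host" / "log" names are invented for the example:

(module
  ;; The one capability the host chose to share with us.
  (import "host" "log" (func $log (param i32)))
  ;; Private linear memory, invisible to other modules and to the
  ;; host unless we choose to export it.
  (memory $mem 1)
  (func (export "run")
    (call $log (i32.const 42))))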

WebAssembly, the hype

It’s in all browsers! Serve your code to anyone in the world!

It’s on the edge! Run code from your web site close to your users!

Compose a library (eg: Expat) into your program (eg: Firefox), without risk!

It’s the new lightweight virtualization: Wasm is what containers were to VMs! Give me that Kubernetes cash!!!

Again, the remarkable thing about WebAssembly is that it is succeeding! It's on all of your phones, all your desktop web browsers, all of the content distribution networks, and in some cases it seems set to replace containers in the cloud. Launch the rocket emojis!

WebAssembly, the reality

WebAssembly is a weird backend for a C compiler

Only some source languages are having success on WebAssembly

What about Haskell, OCaml, Scheme, F#, and so on – what about us?

Are we just lazy? (Well...)

So why aren't we there? Where is Clojure-on-WebAssembly? Where are the F#, the Elixir, the Haskell compilers? Some early efforts exist, but they aren't really succeeding. Why is that? Are we just not putting in the effort? Why is it that Rust gets to ride on the rocket ship but Scheme does not?

WebAssembly, the reality (2)

WebAssembly (1.0, 2.0) is not well-suited to garbage-collected languages

Let’s look into why

As it turns out, there is a reason that there is no good Scheme implementation on WebAssembly: the initial version of WebAssembly is a terrible target if your language relies on the presence of a garbage collector. There have been some advances but this observation still applies to the current standardized and deployed versions of WebAssembly. To better understand this issue, let's dig into the guts of the system to see what the limitations are.

GC and WebAssembly 1.0

Where do garbage-collected values live?

For WebAssembly 1.0, only possible answer: linear memory

(module
  (global $hp (mut i32) (i32.const 0))
  (memory $mem 10)) ;; 640 kB

The primitive that WebAssembly 1.0 gives you to represent your data is what is called linear memory: just a buffer of bytes to which you can read and write. It's pretty much like what you get when compiling natively, except that the memory layout is simpler. You can obtain this memory in units of 64-kilobyte pages. In the example above we're going to request 10 pages, for 640 kB. Should be enough, right? We'll just use it all for the garbage collector, with a bump-pointer allocator. The heap pointer / allocation pointer is kept in the mutable global variable $hp.

(func $alloc (param $size i32) (result i32)
  (local $ret i32)
  (loop $retry (result i32)
    (local.set $ret (global.get $hp))
    (global.set $hp
      (i32.add (local.get $size) (local.get $ret)))

    ;; Return the old heap pointer if the new one still fits
    ;; within the current memory size (in units of 64 kB pages).
    (br_if 1 (local.get $ret)
      (i32.lt_u (i32.shr_u (global.get $hp) (i32.const 16))
                (memory.size)))

    (call $gc)
    (br $retry)))

Here's what an allocation function might look like. The allocation function $alloc is like malloc: it takes a number of bytes and returns a pointer. In WebAssembly, a pointer to memory is just an offset, which is a 32-bit integer (i32). (Having the option of a 64-bit address space is planned but not yet standard.)
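As a usage sketch, a hypothetical constructor might call $alloc and then initialize the new object's fields; the tag value and function name are invented:

(func $make-thing (result i32)
  (local $obj i32)
  (local.set $obj (call $alloc (i32.const 16)))
  ;; Initialize the first word with a tag; 1 is an invented value.
  (i32.store (local.get $obj) (i32.const 1))
  (local.get $obj))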

If this is your first time seeing the text representation of a WebAssembly function, you're in for a treat, but that's not the point of the presentation :) What I'd like to focus on is the (call $gc) -- what happens when the allocation pointer reaches the end of the region?

GC and WebAssembly 1.0 (2)

What hides behind (call $gc) ?

Ship a GC over linear memory

Stop-the-world, not parallel, not concurrent

But... roots.

The first thing to note is that you have to provide the $gc yourself. Of course, this is doable -- this is what we do when compiling to a native target.

Unfortunately though the multithreading support in WebAssembly is somewhat underpowered; it lets you share memory and use atomic operations but you have to create the threads outside WebAssembly. In practice probably the GC that you ship will not take advantage of threads and so it will be rather primitive, deferring all collection work to a stop-the-world phase.

GC and WebAssembly 1.0 (3)

Live objects are

  • the roots
  • any object referenced by a live object

Roots are globals and locals in active stack frames

No way to visit active stack frames

What's worse though is that you have no access to roots on the stack. A GC has to keep live objects, defined recursively as any object referenced by a root, or any object referenced by a live object. It starts with the roots: global variables and any GC-managed object referenced by an active stack frame.

But there we run into problems, because in WebAssembly (any version, not just 1.0) you can't iterate over the stack, so you can't find active stack frames, so you can't find the stack roots. (Sometimes people want to support this as a low-level capability but generally speaking the consensus would appear to be that overall performance will be better if the engine is the one that is responsible for implementing the GC; but that is foreshadowing!)

GC and WebAssembly 1.0 (4)

Workarounds

  • handle stack for precise roots
  • spill all possibly-pointer values to linear memory and collect conservatively

Handle book-keeping a drag for compiled code

Given the noniterability of the stack, there are basically two work-arounds. One is to have the compiler and run-time maintain an explicit stack of object roots, which the garbage collector can know for sure are pointers. This is nice because it lets you move objects. But, maintaining the stack is overhead; the state of the art solution is rather to create a side table (a "stack map") associating each potential point at which GC can be called with instructions on how to find the roots.
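To make the first workaround concrete, here is a minimal sketch of an explicit handle stack kept at the bottom of linear memory; $root-sp, $push-root and $pop-root are invented names:

(global $root-sp (mut i32) (i32.const 0))

(func $push-root (param $obj i32)
  (i32.store (global.get $root-sp) (local.get $obj))
  (global.set $root-sp
    (i32.add (global.get $root-sp) (i32.const 4))))

(func $pop-root (result i32)
  (global.set $root-sp
    (i32.sub (global.get $root-sp) (i32.const 4)))
  (i32.load (global.get $root-sp)))

The collector can then scan the words in [0, $root-sp) as precise roots, and because they are precise, it is free to update them when it moves objects.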

The other workaround is to spill the whole stack to memory. Or, possibly just pointer-like values; anyway, you conservatively scan all words for things that might be roots. But instead of having access to the memory to which the WebAssembly implementation would spill your stack, you have to do it yourself. This can be OK but it's sub-optimal; see my recent post on the Whippet garbage collector for a deeper discussion of the implications of conservative root-finding.

GC and WebAssembly 1.0 (5)

Cycles with external objects (e.g. JavaScript) uncollectable

A pointer to a GC-managed object is an offset to linear memory, need capability over linear memory to read/write object from outside world

No way to give back memory to the OS

Gut check: gut says no

If that were all, it would already be not so great, but it gets worse! Another problem with linear-memory GC is that it limits the potential for composing a number of modules and the host together, because the garbage collector that manages JavaScript objects in a web browser knows nothing about your garbage collector over your linear memory. You can easily create memory leaks in a system like that.

Also, it's pretty gross that a reference to an object in linear memory requires arbitrary read-write access over all of linear memory in order to read or write object fields. How do you build a reliable system without invariants?

Finally, once you collect garbage, and maybe you manage to compact memory, you can't give anything back to the OS. There are proposals in the works but they are not there yet.

If the BOB audience had to choose between Worse is Better and The Right Thing, I think the BOB audience is much closer to the Right Thing. People like that feel an instinctive revulsion toward ugly systems, and I think GC over linear memory describes an ugly system.

GC and WebAssembly 1.0 (6)

There is already a high-performance concurrent parallel compacting GC in the browser

Halftime: C++ N – Altlangs 0

The kicker is that WebAssembly 1.0 requires you to write and deliver a terrible GC when there is already probably a great GC just sitting there in the host, one that has hundreds of person-years of effort invested in it, one that will surely do a better job than you could ever do. WebAssembly as hosted in a web browser should have access to the browser's garbage collector!

I have the feeling that while those of us with a soft spot for languages with garbage collection have been standing on the sidelines, Rust and C++ people have been busy on the playing field scoring goals. Tripping over the ball, yes, but eventually they do manage to get within striking distance.

Change is coming!

Support for built-in GC set to ship in Q4 2023

With GC, the material conditions are now in place

Let’s compile our languages to WebAssembly

But to continue the sportsball metaphor, I think in the second half our players will finally be able to get out on the pitch and give it the proverbial 110%. Support for garbage collection is coming to WebAssembly users, and I think even by the end of the year it will be shipping in major browsers. This is going to be big! We have a chance and we need to seize it.

Scheme to Wasm

Spritely + Igalia working on Scheme to WebAssembly

Avoid truncating language to platform; bring whole self

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Even with GC, though, WebAssembly is still a weird machine. It would help to see the concrete approaches that some languages of interest manage to take when compiling to WebAssembly.

In that spirit, the rest of this article/presentation is a walkthrough of the approach that I am taking as I work on a WebAssembly compiler for Scheme. (Thanks to Spritely for supporting this work!)

Before diving in, a meta-note: when you go to compile a language to, say, JavaScript, you are mightily tempted to cut corners. For example you might implement numbers as JavaScript numbers, or you might omit implementing continuations. In this work I am trying to not cut corners, and instead to implement the language faithfully. Sometimes this means I have to work around weirdness in WebAssembly, and that's OK.

When thinking about Scheme, I'd like to highlight a few specific areas that have interesting translations. We'll start with value representation, which stays in the GC theme from the introduction.

Scheme to Wasm: Values

;;       any  extern  func
;;        |
;;        eq
;;     /  |   \
;; i31 struct  array

The unitype: (ref eq)

Immediate values in (ref i31)

  • fixnums with 30-bit range
  • chars, bools, etc

Explicit nullability: (ref null eq) vs (ref eq)

The GC extensions for WebAssembly are phrased in terms of a type system. Oddly, there are three top types; as far as I understand it, this is the result of a compromise about how WebAssembly engines might want to represent these different kinds of values. For example, an opaque JavaScript value flowing into a WebAssembly program would have type (ref extern). On a system with NaN boxing, you would need 64 bits to represent a JS value. On the other hand a native WebAssembly object would be a subtype of (ref any), and might be representable in 32 bits, either because it's a 32-bit system or because of pointer compression.

Anyway, three top types. The user can define subtypes of struct and array, instantiate values of those types, and access their fields. The life cycle of reference-typed objects is automatically managed by the run-time, which is just another way of saying they are garbage-collected.

For Scheme, we need a common supertype for all values: the unitype, in Bob Harper's memorable formulation. We can use (ref any), but actually we'll use (ref eq) -- this is the supertype of values that can be compared by (pointer) identity. So now we can code up eq?:

(func $eq? (param $a (ref eq)) (param $b (ref eq))
           (result i32)
  (ref.eq (local.get $a) (local.get $b)))

Generally speaking in a Scheme implementation there are immediates and heap objects. Immediates can be encoded in the bits of a value, whereas for heap objects the bits of a value encode a reference (pointer) to an object on the garbage-collected heap. We usually represent small integers as immediates, as well as booleans and other oddball values.

Happily, WebAssembly gives us an immediate value type, i31. We'll encode our immediates there, and otherwise represent heap objects as instances of struct subtypes.
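As a concrete sketch, tagging and untagging fixnums comes down to the i31 accessors; the function names are invented, and the range checks a real compiler would emit are elided:

(func $untag-fixnum (param $v (ref i31)) (result i32)
  ;; Sign-extend the 31-bit payload.
  (i31.get_s (local.get $v)))

(func $tag-fixnum (param $n i32) (result (ref i31))
  ;; Truncate to 31 bits and box as an immediate.
  (i31.new (local.get $n)))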

Scheme to Wasm: Values (2)

Heap objects subtypes of struct; concretely:

(type $heap-object
  (sub (struct (field $tag-and-hash i32))))
(type $pair
  (sub $heap-object
    (struct (field $tag-and-hash i32)
            (field $car (ref eq))
            (field $cdr (ref eq)))))

GC proposal allows subtyping on structs, functions, arrays

Structural type equivalence: explicit tag useful

We actually need to have a common struct supertype as well, for two reasons. One is that we need to be able to hash Scheme values by identity, but for this we need an embedded lazily-initialized hash code. It's a bit annoying to take the per-object memory hit but it's a reality, and the JVM does it this way, so it must not be so terrible.

The other reason is more subtle: WebAssembly's type system is built in such a way that types that are "structurally" equivalent are indistinguishable. So a pair has two fields, besides the hash, but there might be a number of other fundamental object types that have the same shape; you can't fully rely on WebAssembly's dynamic type checks (ref.test et al) to be able to query the type of a value. Instead we re-use the low bits of the hash word to include a type tag, which might be 1 for pairs, 2 for vectors, 3 for closures, and so on.

Scheme to Wasm: Values (3)

(func $cons (param (ref eq)
                   (ref eq))
            (result (ref $pair))
  (struct.new_canon $pair
    ;; Assume heap tag for pairs is 1.
    (i32.const 1)
    ;; Car and cdr.
    (local.get 0)
    (local.get 1)))

(func $%car (param (ref $pair))
            (result (ref eq))
  (struct.get $pair 1 (local.get 0)))

With this knowledge we can define cons, as a simple call to struct.new_canon $pair.

I didn't have time for this in the talk, but there is a ghost haunting this code: the ghost of nominal typing. See, in a web browser at least, every heap object will have its first word point to its "hidden class" / "structure" / "map" word. If the engine ever needs to check that a value is of a specific shape, it can do a quick check on the map word's value; if it needs to do deeper introspection, it can dereference that word to get more details.

Under the hood, testing whether a (ref eq) is a pair or not should be a simple check that it's a (ref struct) (and not a fixnum), and then a comparison of its map word to the run-time type corresponding to $pair. If subtyping of $pair is allowed, we start to want inline caches to handle polymorphism, but checking the map word is still the basic mechanism.

However, as I mentioned, we only have structural equality of types; two (struct (ref eq)) type definitions will define the same type and have the same map word (run-time type / RTT). Hence the _canon in the name of struct.new_canon $pair: we create an instance of $pair, with the canonical run-time-type for objects having $pair-shape.

In earlier drafts of the WebAssembly GC extensions, users could define their own RTTs, which effectively amounts to nominal typing: checking not only that an object has the right structure, but also that it was created with respect to this particular RTT. But, this facility was cut from the first release, and it left ghosts in the form of these _canon suffixes on type constructor instructions.

For the Scheme-to-WebAssembly effort, we effectively add back in a degree of nominal typing via type tags. For better or for worse this results in a so-called "open-world" system: you can instantiate a separately-compiled WebAssembly module that happens to define the same types and use the same type tags and it will be able to happily access the contents of Scheme values from another module. If you were to use nominal types, you wouldn't be able to do so, unless there were some common base module that defined and exported the types of interest, and which any extension module would need to import.

(func $car (param (ref eq)) (result (ref eq))
  (local (ref $pair))
  (block $not-pair
    (br_if $not-pair
      (i32.eqz (ref.test $pair (local.get 0))))
    (local.set 1 (ref.cast $pair (local.get 0)))
    (br_if $not-pair
      (i32.ne
        (i32.const 1)
        (i32.and
          (i32.const 0xff)
          (struct.get $heap-object 0 (local.get 1)))))
    (return_call $%car (local.get 1)))

  (call $type-error)
  (unreachable))

In the previous example we had $%car, with a funny % in the name, taking a (ref $pair) as an argument. But in the general case (barring compiler heroics) car will take an instance of the unitype (ref eq). To know that it's actually a pair we have to make two checks: one, that it is a struct and has the $pair shape, and two, that it has the right tag. Oh well!

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

But with all of that I think we have a solid story on how to represent values. I went through all of the basic value types in Guile and checked that they could all be represented using GC types, and it seems that all is good. Now on to the next point: varargs.

Scheme to Wasm: Varargs

(list 'hey)      ;; => (hey)
(list 'hey 'bob) ;; => (hey bob)

Problem: Wasm functions strongly typed

(func $list (param ???) (result (ref eq))
  ???)

Solution: Virtualize calling convention

In WebAssembly, you define functions with a type, and it is impossible to call them in an unsound way. You must call $car with exactly 2 arguments or it will not compile, and those arguments have to be of specific types, and so on. But Scheme doesn't enforce these restrictions on the language level, bless its little miscreant heart. You can call car with 5 arguments, and you'll get a run-time error. There are some functions that can take a variable number of arguments, doing different things depending on incoming argument count.

How do we square these two approaches to function types?

;; "Registers" for args 0 to 3
(global $arg0 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg1 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg2 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg3 (mut (ref eq)) (i31.new (i32.const 0)))

;; "Memory" for the rest
(type $argv (array (mut (ref eq))))
(global $argN (ref $argv)
        (array.new_canon $argv (i31.new (i32.const 0))
                         (i32.const 42)))

Uniform function type: argument count as sole parameter

Callee moves args to locals, possibly clearing roots

The approach we are taking is to virtualize the calling convention. In the same way that when calling an x86-64 function, you pass the first argument in $rdi, then $rsi, and eventually if you run out of registers you put arguments in memory, in the same way we'll pass the first argument in the $arg0 global, then $arg1, and eventually in memory if needed. The function will receive the number of incoming arguments as its sole parameter; in fact, all functions will be of type (func (param i32)).

The expectation is that after checking the argument count, the callee will load its arguments from globals / memory into locals, which the compiler can do a better job optimizing than globals. We might not even emit code to null out the argument globals; that might leak a little memory, but it would probably be a win.
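To make that concrete, here is a sketch of the prologue of a compiled two-argument procedure under this convention; $raise-arity-error is an invented helper:

(func $my-proc (param $nargs i32)
  (local $a (ref null eq))
  (local $b (ref null eq))
  ;; Check the argument count before touching the argument registers.
  (if (i32.ne (local.get $nargs) (i32.const 2))
      (then (call $raise-arity-error)
            (unreachable)))
  ;; Move arguments into locals for the body to use.
  (local.set $a (global.get $arg0))
  (local.set $b (global.get $arg1))
  ;; ... procedure body goes here ...
  )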

You can imagine a world in which $arg0 actually gets globally allocated to $rdi, because it is only live during the call sequence; but I don't think that world is this one :)

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Great, two points out of the way! Next up, tail calls.

Scheme to Wasm: Tail calls

;; Call known function
(return_call $f arg ...)

;; Call function by value
(return_call_ref $type callee arg ...)

Friends -- I almost cried making this slide. We Schemers are used to working around the lack of tail calls, and I could have done so here, but it's just such a relief that these functions are just going to be there and I don't have to think much more about them. Technically speaking the proposal isn't merged yet; checking the phases document, it's at the last station before heading to the great depot in the sky. But soon, soon it will be present and enabled in all WebAssembly implementations, and we should build systems now that rely on it.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Next up, my favorite favorite topic: delimited continuations.

Scheme to Wasm: Prompts (1)

Problem: Lightweight threads/fibers, exceptions

Possible solutions

  • Eventually, built-in coroutines
  • binaryen’s asyncify (not yet ready for GC); see Julia
  • Delimited continuations

“Bring your whole self”

Before diving in though, one might wonder why bother. Delimited continuations are a building-block that one can use to build other, more useful things, notably exceptions and light-weight threading / fibers. Could there be another way of achieving these end goals without having to implement this relatively uncommon primitive?

For fibers, it is possible to implement them in terms of a built-in coroutine facility. The standards body seems willing to include a coroutine primitive, but it seems far off to me; not within the next 3-4 years I would say. So let's put that to one side.

There is a more near-term solution, to use asyncify to implement coroutines somehow; but my understanding is that asyncify is not ready for GC yet.

For the Guile flavor of Scheme at least, delimited continuations are table stakes in their own right, so given that we will have them on WebAssembly, we might as well use them to implement fibers and exceptions in the same way as we do on native targets. Why compromise if you don't have to?
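To see why prompts suffice for exceptions, here is a rough sketch in Scheme -- not Guile's actual implementation; the prompt tag and names are invented -- in terms of the call-with-prompt and abort-to-prompt operators shown on the next slide:

;; Aborting to the prompt discards the captured continuation,
;; which is exactly what throwing an exception does.
(define (with-exception-handler* handler thunk)
  (call-with-prompt 'exn
    thunk
    (lambda (k condition) (handler condition))))

(define (raise* condition)
  (abort-to-prompt 'exn condition))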

Scheme to Wasm: Prompts (2)

Prompts delimit continuations

(define k
  (call-with-prompt 'foo
    ; body
    (lambda ()
      (+ 34 (abort-to-prompt 'foo)))
    ; handler
    (lambda (continuation)
      continuation)))

(k 10)       ;; ⇒ 44
(- (k 10) 2) ;; ⇒ 42

k is the _ in (lambda () (+ 34 _))

There are a few ways to implement delimited continuations, but my usual way of thinking about them is that a delimited continuation is a slice of the stack. One end of the slice is the prompt established by call-with-prompt, and the other by the continuation of the call to abort-to-prompt. Capturing a slice pops it off the stack, copying it out to the heap as a callable function. Calling that function splats the captured slice back on the stack and resumes it where it left off.

Scheme to Wasm: Prompts (3)

Delimited continuations are stack slices

Make stack explicit via minimal continuation-passing-style conversion

  • Turn all calls into tail calls
  • Allocate return continuations on explicit stack
  • Breaks functions into pieces at non-tail calls

This low-level intuition of what a delimited continuation is leads naturally to an implementation; the only problem is that we can't slice the WebAssembly call stack. The workaround here is similar to the varargs case: we virtualize the stack.

The mechanism to do so is a continuation-passing-style (CPS) transformation of each function. Functions that make no calls, such as leaf functions, don't need to change at all. The same goes for functions that make only tail calls. For functions that make non-tail calls, we split them into pieces that preserve the only-tail-calls property.

Scheme to Wasm: Prompts (4)

Before a non-tail-call:

  • Push live-out vars on stacks (one stack per top type)
  • Push continuation as funcref
  • Tail-call callee

Return from call via pop and tail call:

;; Pop the return continuation and tail-call it, passing zero
;; arguments under the uniform (param i32) convention.
(return_call_ref (i32.const 0)
                 (call $pop-return))

After return, continuation pops state from stacks

Consider a simple function:

(define (f x y)
  (+ x (g y)))

Before making a non-tail call, a "tailified" function will instead push all live data onto an explicitly-managed stack and tail-call the callee. It also pushes on the return continuation. Returning from the callee pops the return continuation and tail-calls it. The return continuation pops the previously-saved live data and continues.

In this concrete case, tailification would split f into two pieces:

(define (f x y)
  (push! x)
  (push-return! f-return-continuation-0)
  (g y))

(define (f-return-continuation-0 g-of-y)
  (define k (pop-return!))
  (define x (pop!))
  (k (+ x g-of-y)))

Now there are no non-tail calls, besides calls to run-time routines like push! and + and so on. This transformation is implemented by tailify.scm.

Scheme to Wasm: Prompts (5)

abort-to-prompt:

  • Pop stack slice to reified continuation object
  • Tail-call new top of stack: prompt handler

Calling a reified continuation:

  • Push stack slice
  • Tail-call new top of stack

No need to wait for effect handlers proposal; you can have it all now!

The salient point is that the stacks on which push! operates (in reality, probably four or five: one in linear memory or an array for types like i32 or f64, one for each of the managed top types any, extern, and func, and one for the stack of return continuations) are managed by us, so we can slice them.
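Concretely, the types of those stacks might look something like the following sketch; the names are invented:

;; One stack per managed top type, plus one for return continuations;
;; raw values like i32 and f64 go to linear memory or a further array.
(type $any-stack    (array (mut (ref null any))))
(type $extern-stack (array (mut (ref null extern))))
(type $func-stack   (array (mut (ref null func))))
(type $return-stack (array (mut (ref null func))))
(global $any-sp (mut i32) (i32.const 0)) ;; one stack pointer per stack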

Someone asked in the talk about whether the explicit memory traffic and avoiding the return-address-buffer branch prediction is a source of inefficiency in the transformation and I have to say, yes, but I don't know by how much. I guess we'll find out soon.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Okeydokes, last point!

Scheme to Wasm: Numbers

Numbers can be immediate: fixnums

Or on the heap: bignums, fractions, flonums, complex

Supertype is still ref eq

Consider imports to implement bignums

  • On web: BigInt
  • On edge: Wasm support module (mini-gmp?)

Dynamic dispatch for polymorphic ops, as usual

First, I would note that sometimes the compiler can unbox numeric operations. For example if it infers that a result will be an inexact real, it can use unboxed f64 instead of library routines working on heap flonums ((struct i32 f64); the initial i32 is for the hash and tag). But we still need a story for the general case that involves dynamic type checks.

The basic idea is that we get to have fixnums and heap numbers. Fixnums will handle most of the integer arithmetic that we need, and will avoid allocation. We'll inline most fixnum operations as a fast path and call out to library routines otherwise. Of course fixnum inputs may produce a bignum output as well, so the fast path sometimes includes another slow-path callout.
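As a sketch of the fast path for +, assuming an invented $slow-add library routine, and eliding the overflow check a real implementation needs before re-tagging the sum (the cast syntax follows the draft-era spelling used elsewhere in this post):

(func $add (param $a (ref eq)) (param $b (ref eq)) (result (ref eq))
  (block $slow
    ;; Fast path: both operands are fixnums (i31 immediates).
    (br_if $slow (i32.eqz (ref.test i31 (local.get $a))))
    (br_if $slow (i32.eqz (ref.test i31 (local.get $b))))
    ;; Untag, add, re-tag. (A real implementation checks for
    ;; overflow here and boxes a bignum if needed.)
    (return
      (i31.new
        (i32.add (i31.get_s (ref.cast i31 (local.get $a)))
                 (i31.get_s (ref.cast i31 (local.get $b)))))))
  ;; Slow path: heap numbers, via an assumed library routine with
  ;; the same signature.
  (return_call $slow-add (local.get $a) (local.get $b)))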

We want to minimize binary module size. In an ideal compile-to-WebAssembly situation, a small program will have a small module size, down to a minimum of a kilobyte or so; larger programs can be megabytes, if the user experience allows for the download delay. Binary module size will be dominated by code, so that means we need to plan for aggressive dead-code elimination, minimize the size of fast paths, and also minimize the size of the standard library.

For numbers, we try to keep module size down by leaning on the platform. In the case of bignums, we can punt some of this work to the host; on a JavaScript host, we would use BigInt, and on a WASI host we'd compile an external bignum library. So that's the general story: inlined fixnum fast paths with dynamic checks, and otherwise library routine callouts, combined with aggressive whole-program dead-code elimination.
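On a JavaScript host, for example, the slow path could bottom out in imports like the following sketch; the module and field names are invented, with BigInt values flowing in and out as externref:

(import "rt" "bignum_add"
  (func $bignum-add (param externref externref) (result externref)))
(import "rt" "i64_to_bignum"
  (func $i64->bignum (param i64) (result externref)))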

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Hey I think we did it! Always before when I thought about compiling Scheme or Guile to the web, I got stuck on some point or another, was tempted down the corner-cutting alleys, and eventually gave up before starting. But finally it would seem that the stars are aligned: we get to have our Scheme and run it too.

Miscellanea

Debugging: The wild west of DWARF; prompts

Strings: stringref host strings spark joy

JS interop: Export accessors; Wasm objects opaque to JS. externref.

JIT: A whole ’nother talk!

AOT: wasm2c

Of course, like I said, WebAssembly is still a weird machine: as a compilation target but also at run-time. Debugging is a right proper mess; perhaps some other article on that some time.

How to represent strings is a surprisingly gnarly question; there is tension within the WebAssembly standards community between those that think that it's possible for JavaScript and WebAssembly to share an underlying string representation, and those that think that it's a fool's errand and that copying is the only way to go. I don't know which side will prevail; perhaps more on that as well later on.

Similarly the whole interoperation with JavaScript question is very much in its early stages, with the current situation choosing to err on the side of nothing rather than the wrong thing. You can pass a WebAssembly (ref eq) to JavaScript, but JavaScript can't do anything with it: it has no prototype. The state of the art is to also ship a JS run-time that wraps each wasm object, proxying exported functions from the wasm module as object methods.

Finally, some language implementations really need JIT support, like PyPy. There, that's a whole 'nother talk!

WebAssembly for the rest of us

With GC, WebAssembly is now ready for us

Getting our languages on WebAssembly now a S.M.O.P.

Let’s score some goals in the second half!

(visit-links
 "gitlab.com/spritely/guile-hoot-updates"
 "wingolog.org"
 "wingo@igalia.com"
 "igalia.com"
 "mastodon.social/@wingo")

WebAssembly has proven to have some great wins for C, C++, Rust, and so on -- but now it's our turn to get in the game. GC is coming and we as a community need to be getting our compilers and language run-times ready. Let's put on the coffee and bang some bytes together; it's still early days and there's a world to win out there for the language community with the best WebAssembly experience. The game is afoot: happy consing!

by Andy Wingo at March 20, 2023 09:06 AM

March 14, 2023

Víctor Jáquez: Review of Igalia Multimedia activities (2022)

We, Igalia’s multimedia team, would like to share with you our list of achievements from the past year, 2022.

WebKit Multimedia

WebRTC

Phil already wrote the first blog post of a series on this topic: WebRTC in WebKitGTK and WPE, status updates, part I. Please be sure to give it a glance; it has nice videos.

Long story short, last year we started to support Media Capture and Streams in WebKitGTK and WPE using GStreamer, for input devices (camera and microphone), desktop sharing, web audio, and web canvas. But this is just the first step. We are currently working on RTCPeerConnection, also using GStreamer, to share all these captured streams with other web peers. Meanwhile, we’ll wait for the second episode of Phil’s series 🙂

MediaRecorder

We worked on an initial implementation of MediaRecorder with GStreamer (1.20 or newer). The specification is about allowing a web browser to record a selected stream: for example, a voice-memo or video application which could encode and upload a capture of your microphone / camera.

Gamepad

While WebKitGTK already has Gamepad support, WPE lacked it. We did the implementation last year, and there’s a blog post about it: Gamepad in WPEWebkit, with a video showing a demo of it.

Capture encoded video streams from webcams

Some webcams only provide high resolution frames encoded in H.264 or so. In order to support those resolutions with those webcams, we added support for negotiating those formats and decoding them internally to handle the streams, though we are just at the beginning of more efficient support.

Flatpak SDK maintenance

A lot of effort went into maintaining the Flatpak SDK for WebKit. It is a set of runtimes that allows reproducible builds of WebKit, independent of the Linux distribution used. Nowadays the Flatpak SDK is used in WebKit’s EWS and by many developers.

Among the features added during the year we can highlight Rust support, a full integrity check before upgrading, and a way to override dependencies with local projects.

MSE/EME enhancements

As every year, massive work was done in WebKit ports using GStreamer for Media Source Extensions and Encrypted Media Extensions, improving the user experience with different streaming services on the Web, such as Odysee, Amazon, DAZN, etc.

In the case of encrypted media, GStreamer-based WebKit ports provide the stubs to communicate with an external Content Decryption Module (CDM). If you’re willing to support this in your platform, you can reach us.

We also worked on a video demo showing how MSE/EME works on a Raspberry Pi 3 using WPE:

WebAudio demo

We also spent time recording video demos, such as this one, showing WebAudio using WPE on a desktop computer.

GStreamer

We managed to merge a lot of bug fixes in GStreamer, which in many cases can be harder to solve than implementing new features, though the latter are more interesting to talk about, such as those related to making Rust the main development language for GStreamer besides C.

Rust bindings and GStreamer elements for Vonage Video API / OpenTok

OpenTok is the legacy name of Vonage Video API, and is a PaaS (Platform As a Service) to ease the development and deployment of WebRTC services and applications.

We published on GitHub our Rust bindings for both the Client SDK for Linux and the Server SDK using the REST API, along with a GStreamer plugin to publish and subscribe to video and audio streams.

GstWebRTCSrc

In the beginning there was webrtcbin, an element that implements the majority of the W3C RTCPeerConnection API. It’s so flexible and powerful that it’s rather hard to use for the most common cases. Then appeared webrtcsink, a wrapper of webrtcbin, written in Rust, which receives GStreamer streams to be offered and streamed to web peers. Later on, we developed webrtcsrc, the webrtcsink counterpart: an element whose source pads push streams from web peers, such as another browser, forwarding those Web streams as GStreamer ones in a pipeline. Both webrtcsink and webrtcsrc are written in Rust.

Behavior-Driven Development test framework for GStreamer

Behavior-Driven Development is gaining relevance with tools like Cucumber for Java and its domain-specific language, Gherkin, for defining software behaviors. Rustaceans have picked up these ideas and developed cucumber-rs. The logical consequence was obvious: why not GStreamer?

Last year we tinkered with GStreamer-Cucumber, a BDD framework to define behavior tests for GStreamer pipelines.

GstValidate Rust bindings

There has been some discussion about whether BDD is the best way to test GStreamer pipelines; there’s also GstValidate, and last year we added its Rust bindings.

GStreamer Editing Services

Though not everything was Rust. We work hard on GStreamer’s nuts and bolts.

Last year, we gathered the team to hack on GStreamer Editing Services, particularly to explore adding OpenGL and DMABuf support, such as downloading or uploading a texture before processing, and selecting a proper filter to avoid those transfers.

GstVA and GStreamer-VAAPI

We helped in the maintenance of GStreamer-VAAPI and the development of its near replacement, GstVA, adding new elements such as the H.264 encoder, the compositor, and the JPEG decoder, along with participating in the debate and code review of negotiating DMABuf streams in the pipeline.

Vulkan decoder and parser library for CTS

You might have heard that Vulkan has now integrated video decoding into its API, while encoding is currently a work in progress. We devoted time to helping Khronos with the Vulkan Video Conformance Tests (CTS), particularly with a parser based on GStreamer and the development of an H.264 decoder in GStreamer using the Vulkan Video API.

You can check the presentation we did at the last Vulkanised.

WPE Android Experiment

In a joint adventure with Igalia’s WebKit team we did some experiments to port WPE to Android. This is just an internal proof of concept so far, but we are looking forward to seeing how this will evolve in the future, and what new possibilities this might open up.

If you have any questions about WebKit, GStreamer, Linux video stack, compilers, etc., please contact us.

by vjaquez at March 14, 2023 10:45 AM

March 10, 2023

Andy Wingo: pre-initialization of garbage-collected webassembly heaps

(Andy Wingo)

Hey comrades, I just had an idea that I won't be able to work on in the next couple months and wanted to release it into the wild. They say if you love your ideas, you should let them go and see if they come back to you, right? In that spirit I abandon this idea to the woods.

Basically the idea is Wizer-like pre-initialization of WebAssembly modules, but for modules that store their data on the GC-managed heap instead of just in linear memory.

Say you have a WebAssembly module with GC types. It might look like this:

(module
  (type $t0 (struct (ref eq)))
  (type $t1 (struct (ref $t0) i32))
  (type $t2 (array (mut (ref $t1))))
  ...
  (global $g0 (ref null eq)
    (ref.null eq))
  (global $g1 (ref $t1)
    (struct.new_canon $t1
      (struct.new_canon $t0 (i31.new (i32.const 42)))
      (i32.const 42)))
  ...
  (func $f0 ...)
  ...)

You define some struct and array types, there are some global variables, and some functions to actually do the work. (There are probably also tables and other things but I am simplifying.)

If you consider the object graph of an instantiated module, you will have some set of roots R that point to GC-managed objects. The live objects in the heap are the roots and any object referenced by a live object.

Let us assume a standalone WebAssembly module. In that case the set of types T of all objects in the heap is closed: it can only be one of the types $t0, $t1, and so on that are defined in the module. These types have a partial order and can thus be sorted from most to least specific. Let's assume that this sort order is just the reverse of the definition order, for now. Therefore we can write a general type introspection function for any object in the graph:

(func $introspect (param $obj anyref)
  (block $L2 (result (ref $t2))
    (block $L1 (result (ref $t1))
      (block $L0 (result (ref $t0))
        (br_on_cast $L2 $t2 (local.get $obj))
        (br_on_cast $L1 $t1 (local.get $obj))
        (br_on_cast $L0 $t0 (local.get $obj))
        (unreachable))
      ;; Do $t0 things...
      (return))
    ;; Do $t1 things...
    (return))
  ;; Do $t2 things...
  (return))

In particular, given a WebAssembly module, we can generate a function to trace edges in an object graph of its types. Using this, we can identify all live objects, and what's more, we can take a snapshot of those objects:

;; Assuming (type $objv (array (mut anyref))) is defined in the module.
(func $snapshot (result (ref $objv))
  ;; Start from roots, use introspect to find concrete types
  ;; and trace edges, use a worklist, return an array of
  ;; all live objects in topological sort order
  )

Having a heap snapshot is interesting for introspection purposes, but my interest is in having fast start-up. Many programs have a kind of "initialization" phase where they get the system up and running, and only then proceed to actually work on the problem at hand. For example, when you run python3 foo.py, Python will first spend some time parsing and byte-compiling foo.py, importing the modules it uses and so on, and then will actually run foo.py's code. Wizer lets you snapshot the state of a module after initialization but before the real work begins, which can save on startup time.

For a GC heap, we actually have similar possibilities, but the mechanism is different. Instead of generating an array of all live objects, we could generate a serialized state of the heap as bytecode, and another function to read the bytecode and reload the heap:

;; Assuming (type $bytes (array (mut i8))) is defined in the module.
(func $pickle (result (ref $bytes))
  ;; Return an array of bytecode which, when interpreted,
  ;; can reconstruct the object graph and set the roots
  )
(func $unpickle (param (ref $bytes))
  ;; Interpret the bytecode, building object graph in
  ;; topological order
  )

The unpickler is module-dependent: it will need one case to construct each concrete type $tN in the module. Therefore the bytecode grammar would be module-dependent too.

What you would get with a bytecode-based $pickle/$unpickle pair would be the ability to serialize and reload heap state many times. But for the pre-initialization case, probably that's not precisely what you want: you want to residualize a new WebAssembly module that, when loaded, will rehydrate the heap. In that case you want a function like:

(func $make-init (result (ref $bytes))  ;; same $bytes type as above
  ;; Return an array of WebAssembly code which, when
  ;; added to the module as a function and invoked,
  ;; can reconstruct the object graph and set the roots.
  )

Then you would use binary tools to add that newly generated function to the module.

In short, there is a space open for a tool which takes a WebAssembly+GC module M and produces M', a module which contains a $make-init function. Then you use a WebAssembly+GC host to load the module and call the $make-init function, resulting in a WebAssembly function $init which you then patch in to the original M to make M'', which is M pre-initialized for a given task.

Optimizations

Some of the object graph is constant; for example, an instance of a struct type that has no mutable fields. These objects don't have to be created in the init function; they can be declared as new constant global variables, which an engine may be able to initialize more efficiently.
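For example, reusing the $t0 type from the module above, such an object could be residualized directly as a constant global initializer; this is a sketch, and the global name is invented:

;; An immutable instance of $t0 declared as a constant global
;; rather than rebuilt by the init function.
(global $const42 (ref $t0)
  (struct.new_canon $t0 (i31.new (i32.const 42))))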

The pre-initialized module will still have an initialization phase in which it builds the heap. This is a constant function and it would be nice to avoid it. Some WebAssembly hosts will be able to run pre-initialization and then snapshot the GC heap using lower-level facilities (copy-on-write mappings, pointer compression and relocatable cages, pre-initialization on an internal level...). This would potentially decrease latency and may allow for cross-instance memory sharing.

Limitations

There are five preconditions to be able to pickle and unpickle the GC heap:

  1. The set of concrete types in a module must be closed.

  2. The roots of the GC graph must be enumerable.

  3. The object-graph edges from each live object must be enumerable.

  4. To prevent cycles, we have to know when an object has been visited: objects must have identity.

  5. We must be able to create each type in a module.

I think there are three limitations to this pre-initialization idea in practice.

One is externref; these values come from the host and are by definition not introspectable by WebAssembly. Let's keep the closed-world assumption and consider the case where the set of external reference types is closed also. In that case if a module allows for external references, we can perhaps make its pickling routines call out to the host to (2) provide any external roots (3) identify edges on externref values (4) compare externref values for identity and (5) indicate some imported functions which can be called to re-create external objects.

Another limitation is funcref. In practice in the current state of WebAssembly and GC, you will only have a funcref which is created by ref.func, and which (3) therefore has no edges and (5) can be re-created by ref.func. However neither WebAssembly nor the JS API has any way of knowing which function index corresponds to a given funcref. Including function references in the graph would therefore require some sort of host-specific API. Relatedly, function references are not comparable for equality (func is not a subtype of eq), which is a little annoying but not so bad considering that function references can't participate in a cycle. Perhaps a solution though would be to assume (!) that the host representation of a funcref is constant: the JavaScript (e.g.) representations of (ref.func 0) and (ref.func 0) are the same value (in terms of ===). Then you could compare a given function reference against a set of known values to determine its index. Note, when function references are expanded to include closures, we will have more problems in this area.

Finally, there is the question of roots. Given a module, we can generate a function to read the values of all reference-typed globals and of all entries in all tables. What we can't get at are any references from the stack, so our object graph may be incomplete. Perhaps this is not a problem though, because when we unpickle the graph we won't be able to re-create the stack anyway.
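A sketch of the generated root-enumeration function for the example module, assuming an invented $visit-root helper that takes an anyref:

(func $enumerate-roots
  (call $visit-root (global.get $g0))  ;; (ref null eq) flows as anyref
  (call $visit-root (global.get $g1))  ;; (ref $t1) flows as anyref
  ;; ... and likewise every entry of every table, via table.get ...
  )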

OK, that's my idea. Have at it, hackers!

by Andy Wingo at March 10, 2023 09:20 AM

March 07, 2023

Robert McQueen: Flathub in 2023

(Robert McQueen)

It’s been quite a few months since the most recent updates about Flathub last year. We’ve been busy behind the scenes, so I’d like to share what we’ve been up to at Flathub and why—and what’s coming up from us this year. I want to focus on:

  • Where Flathub is today as a strong ecosystem with 2,000 apps
  • Our progress on evolving Flathub from a build service to an app store
  • The economic barrier to growing the ecosystem, and its consequences
  • What’s next to overcome our challenges with focused initiatives

Today

Flathub is going strong: we offer 2,000 apps from over 1,500 collaborators on GitHub. We’re averaging 700,000 app downloads a day, with 898 million HTTP requests totalling 88.3 TB served by our CDN each day (thank you Fastly!). Flatpak has, in my opinion, solved the largest technical issue which has held back the mainstream growth and acceptance of Linux on the desktop (or other personal computing devices) for the past 25 years: namely, the difficulty for app developers to publish their work in a way that makes it easy for people to discover, download (or sideload, for people in challenging connectivity environments), install and use. Flathub builds on that to help users discover the work of app developers and helps that work reach users in a timely manner.

Initial results of this disintermediation are promising: even with its modest size so far, Flathub has hundreds of apps that I have never, ever heard of before—and that’s even considering I’ve been working in the Linux desktop space for nearly 20 years and spent many of those staring at the contents of dselect (showing my age a little) or GNOME Software, attending conferences, and reading blog posts, news articles, and forums. I am also heartened to see that many of our OS distributor partners have recognised that this model is hugely complementary and additive to the indispensable work they are doing to bring the Linux desktop to end users, and that “having more apps available to your users” is a value-add allowing you to focus on your core offering and not a zero-sum game that should motivate infighting.

Ongoing Progress

Getting Flathub into its current state has been a long ongoing process. Here’s what we’ve been up to behind the scenes:

Development

Last year, we concluded our first engagement with Codethink to build features into the Flathub web app to move from a build service to an app store. That includes accounts for users and developers, payment processing via Stripe, and the ability for developers to manage upload tokens for the apps they control. In parallel, James Westman has been working on app verification and the corresponding features in flat-manager to ensure app metadata accurately reflects verification and pricing, and to provide authentication for paying users for app downloads when the developer enables it. Only verified developers will be able to make direct uploads or access payment settings for their apps.

Legal

So far, the GNOME Foundation has acted as an incubator and legal host for Flathub even though it’s not purely a GNOME product or initiative. Distributing software to end users along with processing and forwarding payments and donations also has a different legal profile in terms of risk exposure and nonprofit compliance than the current activities of the GNOME Foundation. Consequently, we plan to establish an independent legal entity to own and operate Flathub which reduces risk for the GNOME Foundation, better reflects the independent and cross-desktop interests of Flathub, and provides flexibility in the future should we need to change the structure.

We’re currently in the process of reviewing legal advice to ensure we have the right structure in place before moving forward.

Governance

As Flathub is something we want to set outside of the existing Linux desktop and distribution space—and ensure we represent and serve the widest community of Linux users and developers—we’ve been working on a governance model that ensures that there is transparency and trust in who is making decisions, and why. We have set up a working group with myself and Martín Abente Lahaye from GNOME, Aleix Pol Gonzalez, Neofytos Kolokotronis, and Timothée Ravier from KDE, and Jorge Castro flying the flag for the Flathub community. Thanks also to Neil McGovern and Nick Richards who were also more involved in the process earlier on.

We don’t want to get held up here creating something complex with memberships and elections, so at first we’re going to come up with a simple/balanced way to appoint people into a board that makes key decisions about Flathub and iterate from there.

Funding

We have received one grant for 2023 of $100K from Endless Network which will go towards the infrastructure, legal, and operations costs of running Flathub and setting up the structure described above. (Full disclosure: Endless Network is the umbrella organisation which also funds my employer, Endless OS Foundation.) I am hoping to grow the available funding to $250K for this year in order to cover the next round of development on the software, prepare for higher operations costs (e.g., accounting gets more complex), and bring in a second full-time staff member in addition to Bartłomiej Piotrowski to handle enquiries, reviews, documentation, and partner outreach.

We’re currently in discussions with NLnet about funding further software development, but have been unfortunately turned down for a grant from the Plaintext Group for this year; this Schmidt Futures project around OSS sustainability is not currently issuing grants in 2023. However, we continue to work on other funding opportunities.

Remaining Barriers

My personal hypothesis is that our largest remaining barrier to Linux desktop scale and impact is economic. On competing platforms—mobile or desktop—a developer can offer their work for sale via an app store or direct download with payment or subscription within hours of making a release. While we have taken the “time to first download” time down from months to days with Flathub, as a community we continue to have a challenging relationship with money. Some creators are lucky enough to have a full-time job within the FLOSS space, while a few “superstar” developers are able to nurture some level of financial support by investing time in building a following through streaming, Patreon, Kickstarter, or similar. However, a large proportion of us have to make do with the main payback from our labours being a stream of bug reports on GitHub interspersed with occasional conciliatory beers at FOSDEM (other beverages and events are available).

The first and most obvious consequence is that if there is no financial payback for participating in developing apps for the free and open source desktop, we will lose many people in the process—despite the amazing achievements of those who have brought us to where we are today. As a result, we’ll have far fewer developers and apps. If we can’t offer access to a growing base of users or the opportunity to offer something of monetary value to them, the reward in terms of adoption and possible payment will be very small. Developers would be forgiven for taking their time and attention elsewhere. With fewer apps, our platform has less to entice and retain prospective users.

The second consequence is that this also represents a significant hurdle for diverse and inclusive participation. We essentially require that somebody is in a position of privilege and comfort that they have internet, power, time, and income—not to mention childcare, etc.—to spare so that they can take part. If that’s not the case for somebody, we are leaving them shut out from our community before they even have a chance to start. My belief is that free and open source software represents a better way for people to access computing, and there are billions of people in the world we should hope to reach with our work. But if the mechanism for participation ensures their voices and needs are never represented in our community of creators, we are significantly less likely to understand and meet those needs.

While these are my thoughts, you’ll notice a strong theme to this year will be leading a consultation process to ensure that we are including, understanding and reflecting the needs of our different communities—app creators, OS distributors and Linux users—as I don’t believe that our initiative will be successful without ensuring mutual benefit and shared success. Ultimately, no matter how beautiful, performant, or featureful the latest versions of the Plasma or GNOME desktops are, or how slick the newly rewritten installer is from your favourite distribution, all of the projects making up the Linux desktop ecosystem are subdividing between ourselves an absolutely tiny market share of the global market of personal computers. To make a bigger mark on the world, as a community, we need to get out more.

What’s Next?

After identifying our major barriers to overcome, we’ve planned a number of focused initiatives and restructuring this year:

Phased Deployment

We’re working on deploying the work we have been doing over the past year, starting first with launching the new Flathub web experience as well as the rebrand that Jakub has been talking about on his blog. This also will finally launch the verification features so we can distinguish those apps which are uploaded by their developers.

In parallel, we’ll also be able to turn on the Flatpak repo subsets that enable users to select only verified and/or FLOSS apps in the Flatpak CLI or their desktop’s app center UI.

Consultation

We would like to make sure that the voices of app creators, OS distributors, and Linux users are reflected in our plans for 2023 and beyond. We will be launching this in the form of Flathub Focus Groups at the Linux App Summit in Brno in May 2023, followed up with surveys and other opportunities for online participation. We see our role as interconnecting communities and want to be sure that we remain transparent and accountable to those we are seeking to empower with our work.

Whilst we are being bold and ambitious with what we are trying to create for the Linux desktop community, we also want to make sure we provide the right forums to listen to the FLOSS community and prioritise our work accordingly.

Advisory Board

As we build the Flathub organisation up in 2023, we’re also planning to expand its governance by creating an Advisory Board. We will establish an ongoing forum with different stakeholders around Flathub: OS vendors, hardware integrators, app developers and user representatives to help us create the Flathub that supports and promotes our mutually shared interests in a strong and healthy Linux desktop community.

Direct Uploads

Direct app uploads are close to ready, and they enable exciting stuff like allowing Electron apps to be built outside of flatpak-builder, or driving automatic Flathub uploads from GitHub Actions or GitLab CI flows; however, we need to think a little about how we encourage these to be used. Even with its frustrations, our current Buildbot ensures that the build logs and source versions of each app on Flathub are captured, and that the apps are built on all supported architectures. (Is 2023 when we add RISC-V? Reach out if you’d like to help!) If we hand out upload tokens to any developer, even if the majority of apps are open source, we will go from this relatively structured situation to something a lot more unstructured—and we fear many apps will be available only on 64-bit Intel/AMD machines.

My sketch here is that we need to establish and encourage best practices for integrating Flathub uploads into popular CI systems, so that we preserve the transparency and reproducibility we don’t want to lose. If anyone is a CI wizard and would like to work with us as a thought partner on how we can achieve this—making it more flexible where and how build tasks can be hosted, without losing these cross-platform and inspectability properties—we’d love to hear from you.

Donations and Payments

Once the work around legal and governance reaches a decent point, we will be in a position to move ahead with our Stripe setup and switch on the third big new feature in the Flathub web app. We have already implemented support for one-off payments, either as donations or as a required purchase. We would like to go further than that, in line with what we described earlier about helping developers sustainably work on apps for our ecosystem: we would also like to enable developers to offer subscriptions. This will allow us to create a relationship between users and creators that funds ongoing work, rather than the one-off transactions we already support.

Security

For Flathub to succeed, we need to make sure that as we grow, we continue to be a platform that can give users confidence in the quality and security of the apps we offer. To that end, we are planning to set up infrastructure to help ensure developers are shipping the best products they possibly can to users. For example, we’d like to set up automated linting and security scanning on the Flathub back-end to help developers avoid bad practices, unnecessary sandbox permissions, outdated dependencies, etc. and to keep users informed and as secure as possible.

Sponsorship

Fundraising is a forever task—as is running such a big and growing service. We hope that one day we can cover our costs through some modest fees built into our payments—but until we reach that point, we’re going to be seeking a combination of grant funding and sponsorship to keep our roadmap moving. Our hope is very much that we can encourage different organisations that buy into our vision and will benefit from Flathub to help us support it and ensure we can deliver on our goals. If you have any suggestions of who might like to support Flathub, we would be very appreciative if you could reach out and put us in touch.

Finally, Thank You!

Thanks to you all for reading this far and supporting the work of Flathub, and also to our major sponsors and donors without whom Flathub could not exist: GNOME Foundation, KDE e.V., Mythic Beasts, Endless Network, Fastly, and Equinix Metal via the CNCF Community Cluster. Thanks also to the Freedesktop SDK community—particularly Seppo Yli-Olli, Codethink and others—for their tireless work on the runtime platform most Flatpaks depend on.

I wanted to also give my personal thanks to a handful of dedicated people who keep Flathub working as a service and as a community: Bartłomiej Piotrowski is keeping the infrastructure working essentially single-handedly (in his spare time from keeping everything running at GNOME); Kolja Lampe and Bart built the new web app and backend API for Flathub on which all of the new functionality has been built; and Filippe LeMarchand maintains the checker bot which helps keep all of the Flatpaks up to date.

And finally, all of the submissions to Flathub are reviewed to ensure quality, consistency and security by a small dedicated team of reviewers, with a huge amount of work from Hubert Figuière and Bart to keep the submissions flowing. Thanks to everyone—named or unnamed—for building this vision of the future of the Linux desktop together with us.

(originally posted to Flathub Discourse, head there if you have any questions or comments)

by ramcq at March 07, 2023 11:00 AM

March 04, 2023

GStreamerGStreamer 1.22.1 stable bug fix release

(GStreamer)

The GStreamer team is pleased to announce the first bug fix release in the stable 1.22 release series of your favourite cross-platform multimedia framework!

This release only contains bugfixes and it should be safe to update from 1.22.0.

Highlighted bugfixes:

  • audio channel-mix: allow up to 64 channels (instead of up to 63 channels)
  • avfvideosrc: Don't wait on main thread for permissions request
  • avvidenc: avoid generating inaccurate output timestamps, especially with variable framerate streams
  • AV1 video codec caps signalling improvements in various elements
  • codectimestamper: Fix timestamping on sequence update
  • d3d11overlaycompositor: fix texture width and height
  • d3d11videosink: Fix rendering on external handle
  • dashdemux2: fix seek operation taking a long time to finish for some streams
  • nvencoder: Fix B-frame encoding on Linux and min buffers in auto GPU mode
  • playbin3: fixing buffering for live pipelines
  • playbin: fix potential deadlock when stopping stream with subtitles visible
  • redenc: fix setting of extension ID for twcc
  • rtspsrc: improved compatibility with more broken RTSP servers
  • v4l2h264dec: fix video not playing in some applications on Raspberry Pi 4
  • vtdec: fix jittery playback of H.264 Level 4.1 movies in macOS
  • vtdec: Fix non-deterministic frame output after flushing seeks
  • vtenc: fix handling of interlaced ProRes on Apple M1 hardware
  • vtenc: don't advertise ARGB/RGBA64 input caps on M1 Pro/Max with macOS <13
  • wasapi2src: Fix loopback capture on Windows 10 Anniversary Update
  • tools: better handling of non-ASCII command line arguments on Windows
  • gst-libav: fix build against newer ffmpeg versions
  • gst-python: Use arch-specific install dir for gi overrides
  • cerbero: Fix setuptools site.py breakage in Python 3.11
  • macOS packages: Fix broken binaries on macos < 11.0
  • various bug fixes, memory leak fixes, and other stability and reliability improvements

See the GStreamer 1.22.1 release notes for more details.

Binaries for Android, iOS, Mac OS X and Windows will be available shortly.

Release tarballs can be downloaded directly here:

March 04, 2023 02:00 PM

February 27, 2023

Thomas Vander SticheleMeet Me in the Bathroom

(Thomas Vander Stichele)

"Welcome to pre-9/11 New York City, when the world was unaware of the profound political and cultural shifts about to occur, and an entire generation was thirsty for more than the post–alternative pop rock plaguing MTV. In the cafés, clubs, and bars of the Lower East Side there convened a group of outsiders and misfits full of ambition and rock star dreams."

Music was the main reason I wanted to move to New York - I wanted to walk the same streets that the Yeah Yeah Yeahs, the National, Interpol, the Walkmen, the Antlers and Sonic Youth were walking. In my mind they'd meet up and have drinks with each other at the same bars, live close to each other, and I'd just run into them all the time myself. I'm not sure that romantic version of New York ever existed. Paul Banks used to live on a corner in between where I live and where my kids go to school now, but that was two decades ago (though for a while, we shared a hairdresser). On one of my first visits to New York before moving here, I had a great chat with Thurston Moore at a café right before taking the taxi back to the airport. And that's as close as I got to living my dream.

But now the documentary "Meet me in the Bathroom" (based on the book of the same name) shows that version of New York that only existed for a brief moment in time.

"Meet Me In The Bathroom — ??inspired by Lizzy Goodman’s book of the same name — chronicles the last great romantic age of rock ’n’ roll through the lens of era-defining bands."

Read the book, watch the documentary (available on Google Play among other platforms), or listen to the Spotify playlist Meet Me in the Bathroom: Every Song From The Book In Chronological Order. For bonus points, listen to Losing My Edge (every band mentioned in the LCD Soundsystem song in the order they're mentioned).

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at February 27, 2023 10:31 AM

February 23, 2023

GStreamerGStreamer 1.20.6 old-stable bug fix release

(GStreamer)

The GStreamer team is pleased to announce another bug fix release in the now old-stable 1.20 release series of your favourite cross-platform multimedia framework!

This release only contains bug fixes. It should be safe to update from 1.20.x.

Highlighted bugfixes:

  • audio: channel-mix: allow up to 64 channels instead of up to 63 channels
  • AOM AV1 encoder timestamp handling improvements
  • AV1 video codec caps handling improvements in aom plugin, isomp4 and matroska muxers/demuxers
  • avvidenc: fix bitrate control and timestamps of FFmpeg-based video encoders
  • h264parse: fix missing timestamps on outputs when splitting a frame
  • rtspsrc: more workarounds for servers with broken control uri handling
  • playbin3: fix issue with UDP streams, making sure there's enough buffering
  • qmlglsrc: Fix deadlock when stopping and some other fixes
  • qtmux: fix default timescale unit for N/1001 framerates
  • v4l2h264dec: fix video not playing in some applications on Raspberry Pi 4
  • vtdec: Fix non-deterministic frame output after seeks
  • wasapi2src: Fix loopback capture on Windows 10 Anniversary Update
  • macOS, iOS: Fix Xcode 14 ABI breakage with older Xcode
  • cerbero: Fix some regressions for CentOS in the 1.20 branch
  • cerbero: Fix setuptools site.py breakage in Python 3.11
  • Fix gst-libav build against FFmpeg from git
  • gobject-introspection annotation fixes for bindings
  • Miscellaneous bug fixes, memory leak fixes, and other stability and reliability improvements
  • Performance improvements

See the GStreamer 1.20.6 release notes for more details.

Binaries for Android, iOS, macOS and Windows will be available shortly.

Release tarballs can be downloaded directly here:

The GStreamer 1.20 series has now been superseded by the new stable GStreamer 1.22 series and we recommend you upgrade at your earliest convenience.

February 23, 2023 11:50 PM

February 20, 2023

Thomas Vander SticheleAll Teams

(Thomas Vander Stichele)

"I have hands but I am doing all I can to have daily independence so I can’t be ‘all hands’ at work. I can share ideas, show up, and ask for help when I physically can’t use my hands. Belonging means folks with one hand, no hand and limited hands are valued in the workplace." - Dr. Akilah Cadet

If you've been wondering why over the past few months you're seeing a lot more "All Teams" meetings on your calendar, it's because language is ever evolving with the times, and people are becoming more aware of ableist language and replacing it.

Read more:

If your team still has "all hands" meetings, propose a more creative name, or default to "all teams" instead. Take some time to familiarize yourself with other ableist language and alternatives.

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at February 20, 2023 10:28 AM

February 16, 2023

Phil NormandWebRTC in WebKitGTK and WPE, status updates, part I

(Phil Normand)

Some time ago we at Igalia embarked on the journey to ship a GStreamer-powered WebRTC backend. This is a long journey, and it is not over, but we have made some progress …

by Philippe Normand at February 16, 2023 08:30 PM

February 13, 2023

Thomas Vander SticheleHow to Support Through Layoffs

(Thomas Vander Stichele)

The number one question I've gotten in the past week has been "how can I support people affected by the layoff?"

First, some general advice:

  • For everyone, remember to "comfort in, dump out" - vent your frustrations to people less affected, support those who are more affected than you.
  • To support Xooglers, don't ask "how can I help?" as that places more burden on them. Offer specific ways you can help them - "Can I write you a LinkedIn recommendation? Can I connect you with this person I know at this company who's hiring?". People are affected disproportionately, and if you want to prioritize your help, consider starting with the people on a visa who are now on a tight deadline to find a new sponsor or face leaving the country.
  • To support your colleagues still here, remember we're not all having the same experience. In particular, people outside of the US will be in limbo for weeks or months to come. People can be anywhere on a spectrum of "long time Googler, first mass layoff" to "I've had to go through worse". Don't assume, lead with curiosity, and listen.

Some resources people shared you might find helpful:

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at February 13, 2023 10:29 AM

GStreamerGStreamer Rust bindings 0.20 / Rust Plugins 0.10 release

(GStreamer)

Version 0.20 of the GStreamer Rust bindings was released. Together with the bindings, also version 0.10 of the GStreamer Rust plugins was released.

As usual this release follows the latest gtk-rs 0.17 release and the corresponding API changes.

This release features relatively few changes and mostly contains the addition of some convenience APIs, the addition of bindings for some minor APIs and various optimizations.

As part of the optimizations there will be a lot fewer temporary allocations with the new version (especially for NUL-terminated strings), and improved code generation that not only improves performance but also reduces the size of the resulting binaries considerably. Various GStreamer plugins saw size reductions of 15-20%.

The new release also brings a lot of bugfixes, most of which were already part of the 0.19.x bugfix releases.

Details can be found in the release notes for gstreamer-rs.

The code and documentation for the bindings is available on the freedesktop.org GitLab as well as on crates.io.

The new 0.10 version of the GStreamer Rust plugins features many improvements to the existing plugins as well as a few new plugins. A majority of the changes were already backported to the 0.9 release series and its bugfix releases, which is part of GStreamer 1.22.

A full list of available plugins can be seen in the repository's README.md.

Details for this release can be found in the release notes for gst-plugins-rs.

If you find any bugs, notice any missing features or other issues please report them in GitLab for the bindings or the plugins.

February 13, 2023 10:00 AM

February 07, 2023

Andy Wingowhippet: towards a new local maximum

(Andy Wingo)

Friends, you might have noted, but over the last year or so I really caught the GC bug. Today's post sums up that year, in the form of a talk I gave yesterday at FOSDEM. It's long! If you prefer video, you can have a look instead at the FOSDEM event page.

Whippet: A New GC for Guile

4 Feb 2023 – FOSDEM

Andy Wingo

Guile is...

Mostly written in Scheme

Also a 30 year old C library

// API
SCM scm_cons (SCM car, SCM cdr);

// Many third-party users
SCM x = scm_cons (a, b);

So the context for the whole effort is that Guile has this part of its implementation which is in C. It also exposes a lot of that implementation to users as an API.

Putting the C into GC

SCM x = scm_cons (a, b);

Live objects: the roots, plus anything a live object refers to

How to include x into roots?

  • Refcounting
  • Register (& later unregister) &x with gc
  • Conservative roots

So what constraints does this kind of API impose on the garbage collector?

Let's start by considering the simple cons call above. In a garbage-collected environment, the GC is responsible for reclaiming unused memory. How does the GC know that the result of a scm_cons call is in use?

Generally speaking there are two main strategies for automatic memory management. One is reference counting: you associate a count with an object, incremented once for each referrer; in this case, the stack would hold a reference to x. When removing the reference, you decrement the count, and if it goes to 0 the object is unused and can be freed.
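
As a minimal illustration of that first strategy (a sketch, not anything Guile does), refcounting looks roughly like this; reclaim is a hypothetical routine that returns an object's memory to the allocator:

struct obj { size_t refcount; /* ...payload... */ };

void reclaim (struct obj *o);  // hypothetical: return o's memory to the allocator

// Increment when a new reference to o is stored somewhere.
void incref (struct obj *o) { o->refcount++; }

// Decrement when a reference is dropped; reclaim when none remain.
void decref (struct obj *o) {
  if (--o->refcount == 0)
    reclaim (o);
}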

We GC people used to laugh at reference-counting as a memory management solution because it over-approximates the live object set in the presence of cycles, but it would seem that refcounting is coming back. Anyway, this isn't what Guile does, not right now anyway.

The other strategy we can use is tracing: the garbage collector periodically finds all of the live objects on the system and then recycles the memory for everything else. But how to actually find the first live objects to trace?

One way is to inform the garbage collector of the locations of all roots: references to objects originating from outside the heap. This can be done explicitly, as in V8's Handle<> API, or implicitly, in the form of a side table generated by the compiler associating code locations with root locations. This is called precise rooting: the GC is aware of all root locations at all code positions where GC might happen. Generally speaking you want the side table approach, in which the compiler writes out root locations to stack maps, because it doesn't impose any overhead at run-time to register and unregister locations. However for run-time routines implemented in C or C++, you won't be able to get the C compiler to do this for you, so you need the explicit approach if you want precise roots.
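
To make the explicit approach concrete, here is a minimal sketch of handle registration in C; gc_register_root and gc_unregister_root are illustrative names, not any particular collector's API:

void gc_register_root (void **location);    // hypothetical precise-roots API
void gc_unregister_root (void **location);

void f (void) {
  SCM x = scm_cons (a, b);
  gc_register_root ((void **) &x);    // GC now knows &x may hold a root
  // ... code that may allocate, and therefore trigger GC ...
  gc_unregister_root ((void **) &x);  // must happen before x goes out of scope
}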

Conservative roots

Treat every word in stack as potential root; over-approximate live object set

1993: Bespoke GC inherited from SCM

2006 (1.8): Added pthreads, bugs

2009 (2.0): Switch to BDW-GC

BDW-GC: Roots also from extern SCM foo;, etc

The other way to find roots is very much not The Right Thing. Call it cheeky, call it sloppy, call it yolo, call it what you like, but in the trade it's known as conservative root-finding. This strategy looks like this:

uintptr_t *limit = stack_base_for_platform();              // bottom-most frame
uintptr_t *sp = (uintptr_t *) __builtin_frame_address(0);  // current frame
for (; sp < limit; sp++) {  // assumes the stack grows down
  void *obj = object_at_address(*sp);  // does this word point into the heap?
  if (obj) add_to_live_objects(obj);
}

You just look at every word on the stack and pretend it's a pointer. If it happens to point to an object in the heap, we add that object to the live set. Of course this algorithm can find a spicy integer whose value just happens to correspond to an object's address, even if that object wouldn't have been counted as live otherwise. This approach doesn't compute the minimal live set, but rather a conservative over-approximation. Oh well. In practice this doesn't seem to be a big deal?

Guile has used conservative root-finding since its beginnings, 30 years ago and more. We had our own bespoke mark-sweep GC in the beginning, but it's now going on 15 years or so that we switched to the third-party Boehm-Demers-Weiser (BDW) collector. It's been good to us! It's better than what we had, it's mostly just worked, and it works correctly with threads.

Conservative roots

+: Ergonomic, eliminates class of bugs (handle registration), no compiler constraints

-: Potential leakage, no compaction / object motion; no bump-pointer allocation, calcifies GC choice

Conservative root-finding does have advantages. It's quite pleasant to program with, in environments in which the compiler is unable to produce stack maps for you, as it eliminates a set of potential bugs related to explicit handle registration and unregistration. Like stack maps, it also doesn't impose run-time overhead on the user program. And although the compiler isn't constrained to emit code to clear roots, it generally does, and sometimes does so more promptly than would be the case with explicit handle deregistration.

But, there are disadvantages too. The potential for leaks is one, though I have to say that in 20 years of using conservative-roots systems, I have not found this to be a problem. It's a source of anxiety whenever a program has memory consumption issues but I've never identified it as being the culprit.

The more serious disadvantage, though, is that conservative edges prevent objects from being moved by the GC. If you know that a location holds a pointer, you can update that location to point to a new location for an object. But if a location only might be a pointer, you can't do that.

In the end, the ergonomics of conservative collection led to a kind of calcification in Guile: we thought that BDW was as good as we could get given the constraints, and that changing to anything else would require precise roots, and thus an API and ABI change, losing users, and so on.

What if I told you

You can find roots conservatively and

  • move objects and compact the heap
  • do fast bump-pointer allocation
  • incrementally migrate to precise roots

BDW is not the local maximum

But it turns out, that's not true! There is a way to have conservative roots and also use more optimal GC algorithms, and one which preserves the ability to incrementally refactor the system to have more precision if that's what you want.

Immix

Fundamental GC algorithms

  • mark-compact
  • mark-sweep
  • evacuation
  • mark-region

Immix is a mark-region collector

Let's back up to a high level. Garbage collector implementations are assembled from instances of algorithms, and there are only so many kinds of algorithms out there.

There's mark-compact, in which the collector traverses the object graph once to find live objects, then once again to slide them down to one end of the space they are in.

There's mark-sweep, where the collector traverses the graph once to find live objects, then traverses the whole heap, sweeping dead objects into free lists to be used for future allocations.

There's evacuation, where the collector does a single pass over the object graph, copying the objects outside their space and leaving a forwarding pointer behind.

The BDW collector used by Guile is a mark-sweep collector, and its use of free lists means that allocation isn't as fast as it could be. We want bump-pointer allocation and all the other algorithms give it to us.

Then in 2008, Stephen Blackburn and Kathryn McKinley put out their Immix paper that identified a new kind of collection algorithm, mark-region. A mark-region collector will mark the object graph and then sweep the whole heap for unmarked regions, which can then be reused for allocating new objects.

Allocate: Bump-pointer into holes in thread-local block, objects can span lines but not blocks

Trace: Mark objects and lines

Sweep: Coarse eager scan over line mark bytes

Blackburn and McKinley's paper also describes a new mark-region GC algorithm, Immix, which is interesting because it gives us bump-pointer allocation without requiring that objects be moveable. The diagram above, from the paper, shows the organization of an Immix heap. Allocating threads (mutators) obtain 64-kilobyte blocks from the heap. Blocks contain 128-byte lines. When Immix traces the object graph, it marks both objects and the lines the objects are on. (Usually blocks are part of 2MB aligned slabs, with line mark bits/bytes stored in a packed array at the start of the slab. When marking an object, it's easy to find the associated line mark just with address arithmetic.)
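
That address arithmetic might look like the following sketch, assuming the 2MB aligned slabs and 128-byte lines just described; the names and exact layout are illustrative:

#define SLAB_SIZE (2 * 1024 * 1024)
#define LINE_SIZE 128

// Find the mark byte for the line containing obj: round the address
// down to the slab, then index the packed mark array at the slab start.
uint8_t *line_mark_for (void *obj) {
  uintptr_t addr = (uintptr_t) obj;
  uintptr_t slab = addr & ~(uintptr_t) (SLAB_SIZE - 1);
  return (uint8_t *) slab + (addr - slab) / LINE_SIZE;
}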

Immix reclaims memory in units of lines. A set of contiguous lines that were not marked in the previous collection form a hole (a region). Allocation proceeds into holes, in the usual bump-pointer fashion, giving us good locality for contemporaneously-allocated objects, unlike freelist allocation. The slow path, if the object doesn't fit in the hole, is to look for the next hole in the block, or if needed to acquire another block, or to stop for collection if there are no more blocks.
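
In code, the allocation fast path is just a pointer bump; this sketch uses illustrative names for the mutator's current hole, with all of the Immix-specific cleverness hidden behind the slow path:

struct mutator { uintptr_t hole_ptr, hole_limit; };

void *gc_alloc_slow (struct mutator *mut, size_t size);  // next hole, next block, or GC

void *gc_alloc (struct mutator *mut, size_t size) {
  uintptr_t new_hp = mut->hole_ptr + size;
  if (new_hp <= mut->hole_limit) {  // fast path: object fits in the current hole
    void *obj = (void *) mut->hole_ptr;
    mut->hole_ptr = new_hp;
    return obj;
  }
  return gc_alloc_slow (mut, size);
}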

Immix: Opportunistic evacuation

Before trace, determine if compaction needed. If not, mark as usual

If so, select candidate blocks and evacuation target blocks. When tracing in that block, try to evacuate, fall back to mark

The neat thing that Immix adds is a way to compact the heap via opportunistic evacuation. As Immix allocates, it can end up skipping over holes and leaving them unpopulated, and as subsequent cycles of GC occur, it could be that a block ends up with many small holes. If that happens to many blocks it could be time to compact.

To fight fragmentation, Immix decides at the beginning of a GC cycle whether to try to compact or not. If things aren't fragmented, Immix marks in place; it's cheaper that way. But if compaction is needed, Immix selects a set of blocks needing evacuation and another set of empty blocks to evacuate into. (Immix has to keep around a couple percent of memory in empty blocks in reserve for this purpose.)

As Immix traverses the object graph, if it finds that an object is in a block that needs evacuation, it will try to evacuate instead of marking. It may or may not succeed, depending on how much space is available to evacuate into. Maybe it will succeed for all objects in that block, and you will be left with an empty block, which might even be given back to the OS.

Immix: Guile

Opportunistic evacuation compatible with conservative roots!

Bump-pointer allocation

Compaction!

1 year ago: start work on WIP GC implementation

Tying this back to Guile, this gives us all of our desiderata: we can evacuate, but we don't have to, allowing us to cause referents of conservative roots to be marked in place instead of moved; we can bump-pointer allocate; and we are back on the train of modern GC implementations. I could no longer restrain myself: I started hacking on a work-in-progress garbage collector workbench about a year ago, and ended up with something that seems to take us in the right direction.

Whippet vs Immix: Tiny lines

Immix: 128B lines + mark bit in object

Whippet: 16B “lines”; mark byte in side table

More size overhead: 1/16 vs 1/128

Less fragmentation (1 live obj = 2 lines retained)

More alloc overhead? More small holes

What I ended up building wasn't quite Immix. Guile's object representation is very thin and doesn't currently have space for a mark bit, for example, so I would have to have a side table of mark bits. (I could have changed Guile's object representation but I didn't want to require it.) I actually chose mark bytes instead of bits because both the Immix line marks and BDW's own side table of marks were bytes, to allow for parallel markers to race when setting marks.

Then, given that you have a contiguous table of mark bytes, why not remove the idea of lines altogether? Or, what amounts to the same thing, why not make the line size 16 bytes and do away with per-object mark bits? You can then bump-pointer into holes in the mark byte array. The only thing you need for that is to be able to cheaply find the end of an object, so you can skip to the next hole while sweeping; you don't want to have to chase pointers to do that. But consider: you've already paid the cost of having a mark byte associated with every possible start of an object, so if your basic object alignment is 16 bytes, that's a memory overhead of 1/16, or 6.25%; OK. Let's put that mark byte to work and include an "end" bit, indicating the end of the object. Allocating an object has to store into the mark byte array to initialize this "end" marker, but you need to write the mark byte anyway to allow for conservative roots ("does this address hold an object?"); writing the end at the same time isn't so bad, perhaps.
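
One possible encoding of such a mark byte, as a sketch (the actual bit assignments in Whippet may differ):

// One mark byte per 16-byte granule.
enum {
  GRANULE_MARK = 1 << 0,  // set by the tracer on a live object's granule
  GRANULE_END  = 1 << 1,  // set at allocation time to record where the object ends
  GRANULE_PIN  = 1 << 2,  // explicit pinning, discussed below
};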

The expected outcome would be that relative to 128-byte lines, Whippet ends up with more, smaller holes. Such a block would be a prime target for evacuation, of course, but during allocation this is overhead. Or, it could be a source of memory efficiency; who knows. There is some science yet to do to properly compare this tactic to original Immix, but I don't think I will get around to it.

While I am here and I remember these things, I need to mention two more details. If you read the Immix paper, it describes "conservative line marking", which is related to how you find the end of an object; basically Immix always marks the line an object is on and the next one, in case the object spans the line boundary. Only objects larger than a line have to precisely mark the line mark array when they are traced. Whippet doesn't do this because we have the end bit.

The other detail is the overflow allocator; in the original Immix paper, if you allocate an object that's smallish but still larger than a line or two, but there's no hole big enough in the block, Immix keeps around a completely empty block per mutator in which to bump-pointer-allocate these medium-sized objects. Whippet doesn't do that either, instead relying on such failure to allocate in a block to cause fragmentation and thus hurry along the process of compaction.

Whippet vs Immix: Lazy sweeping

Immix: “cheap” eager coarse sweep

Whippet: just-in-time lazy fine-grained sweep

Corollary: Data computed by sweep available when sweep complete

Live data at previous GC only known before next GC

Empty blocks discovered by sweeping

Having a fine-grained line mark array means that it's no longer a win to do an eager sweep of all blocks after collecting. Instead Whippet applies the classic "lazy sweeping" optimization to make mutators sweep their blocks just before allocating into them. This introduces a delay in the collection algorithm: Whippet doesn't find out about e.g. fragmentation until the whole heap is swept, but by the time we fully sweep the heap, we've exhausted it via allocation. It introduces a different flavor to the GC, not entirely unlike original Immix, but foreign.

Whippet vs BDW

Compaction/defrag/pinning, heap shrinking, sticky-mark generational GC, threads/contention/allocation, ephemerons, precision, tools

Right! With that out of the way, let's talk about what Whippet gives to Guile, relative to BDW-GC.

Whippet vs BDW: Motion

Heap-conservative tracing: no object moveable

Stack-conservative tracing: stack referents pinned, others not

Whippet: If whole-heap fragmentation exceeds threshold, evacuate most-fragmented blocks

Stack roots scanned first; marked instead of evacuated, implicitly pinned

Explicit pinning: bit in mark byte

If all edges in the heap are conservative, then you can't move anything, because you don't know if an edge is a pointer that can be updated or just a spicy integer. But most systems aren't actually like this: you have conservative edges from the stack, but you can precisely enumerate intra-object edges on the heap. In that case, you have a known set of conservative edges, and you can simply visit those edges first, marking their referents in place instead of evacuating. (Marking an object instead of evacuating implicitly pins it for the duration of the current GC cycle.) Then you visit heap edges precisely, possibly evacuating objects.

I should note that Whippet has a bit in the mark byte for use in explicitly pinning an object. I'm not sure how to manage who is responsible for setting that bit, or what the policy will be; the current idea is to set it for any object whose identity-hash value is taken. We'll see.

Whippet vs BDW: Shrinking

Lazy sweeping finds empty blocks: potentially give back to OS

Need empty blocks? Do evacuating collection

Possibility to do http://marisa.moe/balancer.html

With the BDW collector, your heap can only grow; it will never shrink (unless you enable a non-default option and you happen to have verrry low fragmentation). But with Whippet and evacuation, we can rearrange objects so as to produce empty blocks, which can then be returned to the OS if so desired.

In one of my microbenchmarks I have the system allocating long-lived data, interspersed with garbage (objects that are dead after allocation) whose size is in a power-law distribution. This should produce quite some fragmentation, eventually, and it does. But then Whippet decides to defragment, and it works great! Since Whippet doesn't keep a whole 2x reserve like a semi-space collector, it usually takes more than one GC cycle to fully compact the heap; usually about 3 cycles, from what I can see. I should do some more measurements here.

Of course, this is just mechanism; choosing the right heap sizing policy is a different question.

wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm

Card marking barrier (256B); compare to BDW mprotect / SIGSEGV

The Boehm collector also has a non-default mode in which it uses mprotect and a SIGSEGV handler to enable sticky-mark-bit generational collection. I haven't done a serious investigation, but I see it actually increasing run-time by 20% on one of my microbenchmarks that is actually generation-friendly. I know that Azul's C4 collector used to use page protection tricks but I can only assume that BDW's algorithm just doesn't work very well. (BDW's page barriers have another purpose, to enable incremental collection, in which marking is interleaved with allocation, but this mode is off if parallel markers are supported, and I don't know how well it works.)

Anyway, it seems we can do better. The ideal would be a semi-space nursery, which is the usual solution, but because of conservative roots we are limited to the sticky mark-bit algorithm. Some benchmarks aren't very generation-friendly; the first pair of bars in the chart above shows the mt-gcbench microbenchmark running with and without generational collection, and there's no difference. But in the second, for the quads benchmark, we see a 2x speedup or so.

Of course, to get generational collection to work, we require mutators to use write barriers, which are little bits of code that run when an object is mutated that tell the GC where it might find links from old objects to new objects. Right now in Guile we don't do this, but this benchmark shows what can happen if we do.
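
For the card-marking flavor mentioned in the slide above (256-byte cards), the barrier can be as small as this sketch; the table layout and names are illustrative:

#define CARD_SIZE 256
extern uint8_t *card_table;   // one dirty byte per 256-byte card of the heap
extern uintptr_t heap_base;

// Store val into a field of obj, and dirty obj's card so the next minor
// collection knows to re-scan it for old-to-new edges.
void write_field (void *obj, void **field, void *val) {
  *field = val;
  card_table[((uintptr_t) obj - heap_base) / CARD_SIZE] = 1;
}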

Whippet vs BDW: Scale

BDW: TLS segregated-size freelists, lock to refill freelists, SIGPWR for stop

Whippet: thread-local block, sweep without contention, wait-free acquisition of next block, safepoints to stop with ragged marking

Both: parallel markers

Another thing Whippet can do better than BDW is performance when there are multiple allocating threads. The Immix heap organization facilitates minimal coordination between mutators, and maximum locality for each mutator. Sweeping is naturally parallelized according to how many threads are allocating. For BDW, on the other hand, every time a mutator needs to refill its thread-local free lists, it grabs a global lock; sweeping is lazy but serial.

Here's a chart showing whippet versus BDW on one microbenchmark. On the X axis I add more mutator threads; each mutator does the same amount of allocation, so I'm increasing the heap size also by the same factor as the number of mutators. For simplicity I'm running both whippet and BDW with a single marker thread, so I expect to see a linear increase in elapsed time as the heap gets larger (as with 4 mutators there are roughly 4 times the number of live objects to trace). This test is run on a Xeon Silver 4114, taskset to free cores on a single socket.

What we see is that as I add workers, elapsed time increases linearly for both collectors, but more steeply for BDW. I think (but am not sure) that this is because whippet effectively parallelizes sweeping and allocation, whereas BDW has to contend over a global lock to sweep and refill free lists. Both have the linear factor of tracing the object graph, but BDW has the additional linear factor of sweeping, whereas whippet scales with mutator count.

Incidentally you might notice that at 4 mutator threads, BDW randomly crashed, when constrained to a fixed heap size. I have noticed that if you fix the heap size, BDW sometimes (and somewhat randomly) fails. I suspect the crash is due to fragmentation and inability to compact, but who knows; multiple threads allocating is a source of indeterminism. Usually when you run BDW you let it choose its own heap size, but for these experiments I needed to have a fixed heap size instead.

Another measure of scalability is, how does the collector do as you add marker threads? This chart shows that for both collectors, runtime decreases as you add threads. It also shows that whippet is significantly slower than BDW on this benchmark, which is Very Weird, and I didn't have access to the machine on which these benchmarks were run when preparing the slides in the train... so, let's call this chart a good reminder that Whippet is a WIP :)

While in the train to Brussels I re-ran this test on the 4-core laptop I had on hand, and got the results that I expected: that whippet performed similarly to BDW, and that adding markers improved things, albeit marginally. Perhaps I should look on a different microbenchmark.

Incidentally, when you configure Whippet for parallel marking at build-time, it uses a different implementation of the mark stack when compared to the serial marker, even when only 1 marker is enabled. Certainly the parallel marker could use some tuning.

Whippet vs BDW: Ephemerons

BDW: No ephemerons

Whippet: Yes

Another deep irritation I have with BDW is that it doesn't support ephemerons. In Guile we have a number of facilities (finalizers, guardians, the symbol table, weak maps, et al) built on what BDW does have (finalizers, weak references), but the implementations of these facilities in Guile are hacky, slow, sometimes buggy, and don't compose (try putting an object in a guardian and giving it a finalizer to see what I mean). It would be much better if the collector API supported ephemerons natively, specifying their relationship to finalizers and other facilities, allowing us to build what we need in terms of those primitives. With our own GC, we can do that, and do it in such a way that it doesn't depend on the details of the specific collection algorithm. The exception of course is that as BDW doesn't support ephemerons per se, what we get is actually a weak-key association instead, whose value can keep the key alive. Oh well, it's no worse than the current situation.
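
As a sketch of what native ephemeron support in a collector API could look like (hypothetical names; the semantics are that the value stays alive only while the key is live, and the value does not itself keep the key alive):

struct gc_ref { uintptr_t value; };  // object reference wrapper; 0 means absent
struct gc_mutator;
struct gc_ephemeron;                 // a weak key/value pair managed by the GC

struct gc_ephemeron *gc_ephemeron_new (struct gc_mutator *mut,
                                       struct gc_ref key,
                                       struct gc_ref value);
// Both return an absent ref once the key has been collected.
struct gc_ref gc_ephemeron_key (struct gc_ephemeron *e);
struct gc_ref gc_ephemeron_value (struct gc_ephemeron *e);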

Whippet vs BDW: Precision

BDW: ~Always stack-conservative, often heap-conservative

Whippet: Fully configurable (at compile-time)

Guile in mid/near-term: C stack conservative, Scheme stack precise, heap precise

Possibly fully precise: unlock semi-space nursery

Conservative tracing is a fundamental design feature of the BDW collector, both of roots and of inter-heap edges. You can tell BDW how to trace specific kinds of heap values, but the default is to do a conservative scan, and the stack is always scanned conservatively. In contrast, these tradeoffs are all configurable in Whippet. You can scan the stack and heap precisely, or stack conservatively and heap precisely, or vice versa (though that doesn't make much sense), or both conservatively.

The long-term future in Guile is probably to continue to scan the C stack conservatively, to continue to scan the Scheme stack precisely (even with BDW-GC, the Scheme compiler emits stack maps and installs a custom mark routine), but to scan the heap as precisely as possible. It could be that a user uses some of our hoary ancient APIs to allocate an object that Whippet can't trace precisely; in that case we'd have to disable evacuation / object motion, but we could still trace other objects precisely.

If Guile ever moved to a fully precise world, that would be a boon for performance, in two ways: first, we would get the ability to use a semi-space nursery instead of the sticky-mark-bit algorithm, and relatedly we wouldn't need to initialize mark bytes when allocating objects. Second, we'd gain the option to use must-move algorithms for the old space as well (mark-compact, semi-space) if we wanted to. But it's just an option, one that Whippet opens up for us.

Whippet vs BDW: Tools?

Can build heap tracers and profilers more easily

More hackable

(BDW-GC has as many preprocessor directives as whippet has source lines)

Finally, relative to BDW-GC, whippet has a more intangible advantage: I can actually hack on it. Just as an indication, 15% of BDW source lines are pre-processor directives, and there is one file that has like 150 #ifdef's, not counting #elif's, many of them nested. I haven't done all that much to BDW itself, but I personally find it excruciating to work on.

Hackability opens up the possibility to build more tools to help us diagnose memory use problems. They aren't in Whippet yet, but there can be!

Engineering Whippet

Embed-only, abstractions, migration, modern; timeline

OK, that rounds out the comparison between BDW and Whippet, at least on a design level. Now I have a few words about how to actually get this new collector into Guile without breaking the bug budget. I try to arrange my work areas on Guile in such a way that I spend a minimum of time on bugs. Part of my strategy is negligence, I will admit, but part also is anticipating problems and avoiding them ahead of time, even if it takes more work up front.

Engineering Whippet: Embed-only

github.com/wingo/whippet-gc/

Semi: 6 kB; Whippet: 22 kB; BDW: 184 kB

Compile-time specialization:

  • for embedder (e.g. how to forward objects)
  • for selected GC algorithm (e.g. semi-space vs whippet)

Built apart, but with LTO to remove library overhead

So the BDW collector is typically shipped as a shared library that you dynamically link to. I should say that we've had an overall good experience with upgrading BDW-GC in the past; its maintainer (Ivan Maidanski) does a great and responsible job on a hard project. It's been many, many years since we had a bug in BDW-GC. But still, BDW is a dependency, and all things being equal we prefer to remove moving parts.

The approach that Whippet is taking is to be an embed-only library: it's designed to be compiled into your project. It's not an include-only library; it still has to be compiled, but with link-time-optimization and a judicious selection of fast-path interfaces, Whippet is mostly able to avoid abstractions being a performance barrier.

The result is that Whippet is small, both in source and in binary, which minimizes its maintenance overhead. Taking additional stripped optimized binary size as the metric, by my calculations a semi-space collector (with a large object space and ephemeron support) takes about 6 kB of object file size, whereas Whippet takes 22 and BDW takes 184. Part of how Whippet gets so small is that it is configured in major ways at compile-time (choice of main GC algorithm), and specialized against the program embedding it (e.g. how to patch in a forwarding pointer). Having all of the API internal and visible to LTO, instead of going through ELF symbol resolution, helps in a minor way as well.

Engineering Whippet: Abstract performance

User API abstracts over GC algorithm, e.g. semi-space or whippet

Expose enough info to allow JIT to open-code fast paths

Inspired by mmtk.io

Abstractions permit change: of algorithm, over time

From a composition standpoint, Whippet is actually a few things. Firstly there is an abstract API to make a heap, create per-thread mutators for a heap, and allocate objects for a mutator. There is the aforementioned embedder API, for having the embedding program indicate how to trace objects and install forwarding pointers. Then there is some common code (for example ephemeron support). There are implementations of the different spaces: semi-space, large object, whippet/immix; and finally collector implementations that tie together the spaces into a full implementation of the abstract API. (In practice the more iconic spaces are intertwingled with the collector implementations they define.)
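
In outline, the abstract API has roughly this shape; take these declarations as a sketch of the structure rather than Whippet's exact signatures:

struct gc_options;   // heap size, algorithm selection, etc.
struct gc_heap;      // the shared heap
struct gc_mutator;   // one per allocating thread

struct gc_heap *gc_heap_make (struct gc_options *options);
struct gc_mutator *gc_mutator_attach (struct gc_heap *heap);
void *gc_allocate (struct gc_mutator *mut, size_t size);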

I don't think I would have gone down this route without seeing some prior work, for example libpas, but it was really MMTk that convinced me that it was worth spending a little time thinking about the GC not as a structureless blob but as a system made of parts and exposing a minimal interface. In particular, I was inspired by seeing that MMTk is able to get good performance while also being abstract, exposing representation details such as how to tell a JIT compiler about allocation fast-paths, but in a principled way. So, thanks MMTk people, for this and so many things!

I'm particularly happy that the API is abstract enough that it frees up not only the garbage collector to change implementations, but also Guile and other embedders, in that they don't have to bake in a dependency on specific collectors. The semi-space collector has been particularly useful here in ensuring that the abstractions don't accidentally rely on support for object pinning.

Engineering Whippet: Migration

API implementable by BDW-GC (except ephemerons)

First step for Guile: BDW behind Whippet API

Then switch to whippet/immix (by default)

The collector API can actually be implemented by the BDW collector. Whippet includes a collector that is a thin wrapper around the BDW API, with support for fast-path allocation via thread-local freelists. In this way we can always check the performance of any given collector against an external fixed point (BDW) as well as a theoretically known point (the semi-space collector).

Indeed I think the first step for Guile is precisely this: refactor Guile to allocate through the Whippet API, but using the BDW collector as the implementation. This will ensure that the Whippet API is sufficient, and then allow an incremental switch to other collectors.

Incidentally, when it comes to integrating Whippet, there are some choices to be made. I mentioned that it's quite configurable, and this chart can give you some idea. On the left side is one microbenchmark (mt-gcbench) and on the right is another (quads). The first generates a lot of fragmentation and has a wide range of object sizes, including some very large objects. The second is very uniform and many allocations die young.

(I know these images are small; right-click to open in new tab or pinch to zoom to see more detail.)

Within each set of bars we have 10 different scenarios, corresponding to different Whippet configurations. (All of these tests are run on my old 4-core laptop with 4 markers if parallel marking is supported, and a 2x heap.)

The first bar in each side is serial whippet: one marker. Then we see parallel whippet: four markers. Great. Then there's generational whippet: one marker, but just scanning objects allocated in the current cycle, hoping that produces enough holes. Then generational parallel whippet: the same as before, but with 4 markers.

The next 4 bars are the same: serial, parallel, generational, parallel-generational, but with one difference: the stack is scanned conservatively instead of precisely. You might be surprised but all of these configurations actually perform better than their precise counterparts. I think the reason is that the microbenchmark uses explicit handle registration and deregistration (it's a stack) instead of compiler-generated stack maps in a side table, but I'm not precisely (ahem) sure.

Finally the next 2 bars are serial and parallel collectors, but marking everything conservatively. I have generational measurements for this configuration but it really doesn't make much sense to assume that you can emit write barriers in this context. These runs are slower than the previous configuration, mostly because there are some non-pointer locations that get scanned conservatively that wouldn't get scanned precisely. I think conservative heap scanning is less efficient than precise but I'm honestly not sure, there are some instruction locality arguments in the other direction. For mt-gcbench though there's a big array of floating-point values that a precise scan will omit, which causes significant overhead there. Probably for this configuration to be viable Whippet would need the equivalent of BDW's API to allocate known-pointerless objects.

Engineering Whippet: Modern

stdatomic

constexpr-ish

pthreads (for parallel markers)

No void*; instead struct types: gc_ref, gc_edge, gc_conservative_ref, etc

Embed-only lib avoids any returns-struct-by-value ABI issue

Rust? MMTk; supply chain concerns

Platform abstraction for conservative root finding

I know it's a sin, but Whippet is implemented in C. I know. The thing is, in the Guile context I need to not introduce wild compile-time dependencies, because of bootstrapping. And I know that Rust is a fine language to use for GC implementation, so if that's what you want, please do go take a look at MMTk! It's a fantastic project, written in Rust, and it can just slot into your project, regardless of the language your project is written in.

But if what you're looking for is something in C, well then you have to pick and choose your C. In the case of Whippet I try to use the limited abilities of C to help prevent bugs; for example, I generally avoid void* and instead wrap pointers or addresses into single-field structs that can't be automatically cast, for example to prevent a struct gc_ref that denotes an object reference (or NULL; it's an option type) from being confused with a struct gc_conservative_ref, which might not point to an object at all.
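
Concretely, the pattern is single-field wrapper structs along these lines, a sketch in the spirit of the real types:

// Distinct single-field structs: the compiler rejects accidental mixing.
struct gc_ref { uintptr_t value; };               // object reference, or NULL
struct gc_conservative_ref { uintptr_t value; };  // maybe-pointer, e.g. a stack word

struct gc_ref gc_ref_from_addr (uintptr_t addr) {
  return (struct gc_ref) { addr };
}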

(Of course, by "C" I mean "C as compiled by gcc and clang with -fno-strict-aliasing". I don't know if it's possible to implement even a simple semi-space collector in C without aliasing violations. Can you access a Foo* object within a mmap'd heap through its new address after it has been moved via memcpy? Maybe not, right? Thoughts are welcome.)

As a project written in the 2020s instead of the 1990s, Whippet gets to assume a competent C compiler, for example relying on the compiler to inline and fold branches where appropriate. As in libpas, Whippet liberally passes functions as values to inline functions, and relies on the compiler to boil away function calls. Whippet only uses the C preprocessor when it absolutely has to.

Finally, there is a clean abstraction for anything that's platform-specific, for example finding the current stack bounds. I haven't compiled this code on Windows or macOS yet, but I am not anticipating too many troubles.

Engineering Whippet: Timeline

As time permits

Whippet TODO: heap growth/shrinking, finalizers, safepoint API

Guile TODO: safepoints; heap-conservative first

Precise heap TODO: gc_trace_object, SMOBs, user structs with raw ptr fields, user gc_malloc usage; 3.2

6 months for 3.1.1; 12 for 3.2.0 ?

So where does this get us? Where are we now?

For Whippet itself, I think it's mostly done -- enough to start shifting focus to some different phase. It's missing some needed features, notably the ability to grow the heap at all, as I've been in fixed-heap-size-only mode during development. It's also missing finalizers. And, something needs to be done to unify Guile's handling of safepoints and processing of asynchronous signals with Whippet's need to stop all mutators. Some details remain.

But, I think we are close to ready to start integrating in Guile. At first this is just porting Guile to use the Whippet API to access BDW instead of using BDW directly. This whole thing is a side project for me that I work on when I can, so it doesn't exactly proceed at full pace. Perhaps this takes 6 months. Then we can cut a new unstable release, and hopefully release 3.2 with support for the Immix-flavored collector in another 6 or 9 months.

I thought that we would be forced to make ABI changes, if only because some legacy APIs assume conservative tracing of object contents. But after a discussion at FOSDEM with Carlo Piovesan I realized this isn't true: because the decision to evacuate or not is made on a collection-by-collection basis, I could simply disable evacuation if the user ever uses a facility that might prohibit object motion, for example if they ever define a SMOB type. If the user wants evacuation, they need to be more precise with their data types, but either way Guile is ready.

Whippet: A Better GC?

An Immix-derived GC

github.com/wingo/whippet-gc/

https://wingolog.org/tags/gc/

Guile 3.2 ?

Thanks to MMTk authors for inspiration!

And that's it! Thanks for reading all the way here. Comments are quite welcome.

As I mentioned in the very beginning, this talk was really about Whippet in the context of Guile. There is a different talk to be made about Guile+Whippet versus other language implementations, for example those with concurrent marking or semi-space nurseries or the like. Yet another talk is Whippet in the context of other GC algorithms. But this is a start. It's something I've been working on for a while now already and I'm pleased that it's gotten to a point where it seems to be at least OK, at least an improvement with respect to BDW-GC in some ways.

But before leaving you, another chart, to give a more global idea of the state of things. Here we compare a single mutator thread performing a specific microbenchmark that makes trees and also lots of fragmentation, across three different GC implementations and a range of heap sizes. The heap size multipliers in this and in all the other tests in this post are calculated analytically based on what the test thinks its maximum heap size should be, not by measuring minimum heap sizes that work. This size is surely lower than the actual maximum required heap size due to internal fragmentation, but the tests don't know about this.

The three collectors are BDW, a semi-space collector, and whippet. Semi-space manages to squeeze in at less than a 2x heap multiplier because it (like whippet) has a separate large object space that isn't ever evacuated.

What we expect is that tighter heaps impose more GC time, and indeed we see that times are higher on the left side than the right.

Whippet is the only implementation that manages to run at a 1.3x heap, but it takes some time. It's slower than BDW at a 1.5x heap but better from there on out, until what appears to be a bug or pathology makes it take longer at 5x. Adding memory should always decrease run time.

The semi-space collector starts working at 1.75x and then surpasses all collectors from 2.5x onwards. We expect the semi-space collector to win for big heaps, because its overhead is proportional to live data only, whereas mark-sweep and mark-region collectors have to sweep, which is proportional to heap size, and indeed that's what we see.

I think this chart shows we have some tuning yet to do. The range between 2x and 3x is quite acceptable, but we need to see what's causing Whippet to be slower than BDW at 1.5x. I haven't done as much performance tuning as I would like to but am happy to finally be able to know where we stand.

And that's it! Happy hacking, friends, and may your heap sizes be ever righteous.

by Andy Wingo at February 07, 2023 01:14 PM

February 06, 2023

Thomas Vander Stichele: SRE Philosophy With Jennifer Mace

(Thomas Vander Stichele)

"Even the most junior SRE on call starts having director authority. [..] There is a power in that relationship that SRE does have when they think something is in danger. And it's a power we have to be careful not to misuse. But it's important, because that's our job."

Macey is the guest on Episode 1 of SRE Prodcast, Google's podcast about Site Reliability Engineering. She goes in-depth on some of the core tenets of SRE, including risk, on-call, toil, design involvement, and more. (As a side note, I'm reasonably certain that I'm not the entertaining Belgian that was causing her team failure loops, but I'm too afraid to ask.)

The whole series is worth a listen, but just like the podcast itself - start with Macey's advice.

"My definition of toil: toil is boring or repetitive work that does not gain you a permanent improvement."

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)


by Thomas at February 06, 2023 09:25 AM

February 04, 2023

Thomas Vander Stichele: How Not to Say the Wrong Thing

(Thomas Vander Stichele)

"if you’re going to open your mouth, ask yourself if what you are about to say is likely to provide comfort and support. If it isn’t, don’t say it. Don’t, for example, give advice."

Susan Silk's Ring Theory is a helpful model to navigate what not to say during times of grief and traumatic events.

Picture a center ring, and inside it the people most affected by what's going on. Picture a larger circle around it, containing the people closest to those in the center. Repeat outwards.

The person in the center ring can say anything they want to anyone, anywhere.
Everyone else can say those things too, but only to people in the larger outside rings. Otherwise, you support and comfort.

Now, consider where in this diagram you are, and where the people you are talking to are.

"Comfort IN, dump OUT."

This model applies in other situations - for example, managers are better off complaining to their own managers or peers, while supporting their own reports and absorbing their complaints with empathy and compassion.

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)


by Thomas at February 04, 2023 02:24 AM

January 27, 2023

Andy Wingo: three approaches to heap sizing

(Andy Wingo)

How much memory should a program get? Tonight, a quick note on sizing for garbage-collected heaps. There are a few possible answers, depending on what your goals are for the system.

you: doctor science

Sometimes you build a system and you want to study it: to identify its principal components and see how they work together, or to isolate the effect of altering a single component. In that case, what you want is a fixed heap size. You run your program a few times and determine a heap size that is sufficient for your problem, and then in future run the program with that new fixed heap size. This allows you to concentrate on the other components of the system.

A good approach to choosing the fixed heap size for a program is to determine the minimum heap size the program can run with by bisection, then to multiply that size by a constant factor. Garbage collection is a space/time tradeoff: the factor you choose represents a point on the space/time tradeoff curve. I would choose 1.5 in general, but this is arbitrary; I'd go more with 3 or even 5 if memory isn't scarce and I'm really optimizing for throughput.
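To make that concrete, here is a sketch of the bisection, where run_with_heap_size is a stand-in for "run the program to completion with a fixed heap of this many bytes":

#include <stddef.h>

extern int run_with_heap_size(size_t bytes);  /* 1 on success, 0 on heap exhaustion */

/* Invariant: lo bytes fails, hi bytes works. Bisect to the minimum,
   then scale it by the chosen space/time factor, e.g. 1.5. */
size_t choose_fixed_heap_size(size_t lo, size_t hi, double factor) {
  while (lo + 1 < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (run_with_heap_size(mid))
      hi = mid;
    else
      lo = mid;
  }
  return (size_t)(hi * factor);
}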

Note that a fixed-size heap is not generally what you want. It's not good user experience for running ./foo at the command line, for example. The reason for this is that program memory use is usually a function of the program's input, and only in some cases do you know what the input might look like, and until you run the program you don't know what the exact effect of input on memory is. Still, if you have a team of operations people that knows what input patterns look like and has experience with a GC-using server-side process, fixed heap sizes could be a good solution there too.

you: average josé/fina

On the other end of the spectrum is the average user. You just want to run your program. The program should have the memory it needs! Not too much of course; that would be wasteful. Not too little either; I can tell you, my house is less than 100m², and I spend way too much time shuffling things from one surface to another. If I had more space I could avoid this wasted effort, and in a similar way, you don't want to be too stingy with a program's heap. Do the right thing!

Of course, you probably have multiple programs running on a system that are making similar heap sizing choices at the same time, and the relative needs and importances of these programs could change over time, for example as you switch tabs in a web browser, so the right thing really refers to overall system performance, whereas what you are controlling is just one process' heap size; what is the Right Thing, anyway?

My corner of the GC discourse agrees that something like the right solution was outlined by Kirisame, Shenoy, and Panchekha in a 2022 OOPSLA paper, in which the optimum heap size depends on the allocation rate and the gc cost for a process, which you measure on an ongoing basis. Interestingly, their formulation of heap size calculation can be made by each process without coordination, but results in a whole-system optimum.

There are some details but you can imagine some instinctive results: for example, when a program stops allocating because it's waiting for some external event like user input, it doesn't need so much memory, so it can start shrinking its heap. After all, it might be quite a while before the program has new input. If the program starts allocating again, perhaps because there is new input, it can grow its heap rapidly, and might then shrink again later. The mechanism by which this happens is pleasantly simple, and I salute (again!) the authors for identifying the practical benefits that an abstract model brings to the problem domain.
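As a concrete sketch of the square-root rule as I read it (an approximation of the paper, not a faithful reimplementation; the smoothing factor and constants here are invented):

#include <math.h>
#include <stddef.h>

/* Sketch of a MemBalancer-style heap limit: live data plus headroom
   proportional to sqrt(live * allocation rate / collection speed),
   with both rates smoothed over time. */
struct heap_sizer {
  double alloc_rate;  /* bytes allocated per second, smoothed */
  double gc_speed;    /* bytes collected per second of GC work, smoothed */
};

static double smooth(double old, double sample, double alpha) {
  return alpha * sample + (1.0 - alpha) * old;
}

size_t heap_limit(struct heap_sizer *s, size_t live_bytes,
                  double alloc_rate, double gc_speed, double c) {
  s->alloc_rate = smooth(s->alloc_rate, alloc_rate, 0.5);
  s->gc_speed = smooth(s->gc_speed, gc_speed, 0.5);
  double headroom = sqrt(c * (double)live_bytes * s->alloc_rate / s->gc_speed);
  return live_bytes + (size_t)headroom;
}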

you: a damaged, suspicious individual

Hoo, friends-- I don't know. I've seen some things. Not to exaggerate, I like to think I'm a well-balanced sort of fellow, but there's some suspicion too, right? So when I imagine a background thread determining that my web server hasn't gotten so much action in the last 100ms and that really what it needs to be doing is shrinking its heap, kicking off additional work to mark-compact it or whatever, when the whole point of the virtual machine is to run that web server and not much else, only to have to probably give it more heap 50ms later, I-- well, again, I exaggerate. The MemBalancer paper has a heartbeat period of 1 Hz and a smoothing function for the heap size, but it just smells like danger. Do I need danger? I mean, maybe? Probably in most cases? But maybe it would be better to avoid danger if I can. Heap growth is usually both necessary and cheap when it happens, but shrinkage is never necessary and is sometimes expensive because you have to shuffle around data.

So, I think there is probably a case for a third mode: not fixed, not adaptive like the MemBalancer approach, but just growable: grow the heap when and if its size is less than a configurable multiplier (e.g. 1.5) of live data. Never shrink the heap. If you ever notice that a process is taking too much memory, manually kill it and start over, or whatever. Default to adaptive, of course, but when you start to troubleshoot a high GC overhead in a long-lived process, perhaps switch to growable to see its effect.
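A growable-only policy is almost trivial by comparison; a sketch, with invented names:

/* Sketch: grow-only heap sizing. After a collection, grow the limit if
   it is below multiplier * live data; never shrink. */
size_t grow_only_limit(size_t current_limit, size_t live_bytes,
                       double multiplier) {
  size_t wanted = (size_t)((double)live_bytes * multiplier);
  return wanted > current_limit ? wanted : current_limit;
}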

unavoidable badness

There is some heuristic badness that one cannot avoid: even with the adaptive MemBalancer approach, you have to choose a point on the space/time tradeoff curve. Regardless of what you do, your system will grow a hairy nest of knobs and dials, and if your system is successful there will be a lively aftermarket industry of tuning articles: "Are you experiencing poor object transit? One knob you must know"; "Four knobs to heaven"; "It's raining knobs"; "GC engineers DO NOT want you to grab this knob!!"; etc. (I hope that my British readers are enjoying this.)

These ad-hoc heuristics are just part of the domain. What I want to say though is that having a general framework for how you approach heap sizing can limit knob profusion, and can help you organize what you have into a structure of sorts.

At least, this is what I tell myself; inshallah. Now I have told you too. Until next time, happy hacking!

by Andy Wingo at January 27, 2023 09:45 PM

January 24, 2023

Andy Wingo: parallel ephemeron tracing

(Andy Wingo)

Hello all, and happy new year. Today's note continues the series on implementing ephemerons in a garbage collector.

In our last dispatch we looked at a serial algorithm to trace ephemerons. However, production garbage collectors are parallel: during collection, they trace the object graph using multiple worker threads. Our problem is to extend the ephemeron-tracing algorithm with support for multiple tracing threads, without introducing stalls or serial bottlenecks.

Recall that we ended up having to define a table of pending ephemerons:

struct gc_pending_ephemeron_table {
  struct gc_ephemeron *resolved;
  size_t nbuckets;
  struct gc_ephemeron *buckets[0];
};

This table holds pending ephemerons that have been visited by the graph tracer but whose keys haven't been found yet, as well as a singly-linked list of resolved ephemerons that are waiting to have their values traced. As a global data structure, the pending ephemeron table is a point of contention between tracing threads that we need to design around.

a confession

Allow me to confess my sins: things would be a bit simpler if I didn't allow tracing workers to race.

As background, if your GC supports marking in place instead of always evacuating, then there is a mark bit associated with each object. To reduce the overhead of contention, a common strategy is to actually use a whole byte for the mark bit, and to write to it using relaxed atomics (or even raw stores). This avoids the cost of a compare-and-swap, but at the cost that multiple marking threads might see that an object's mark was unset, go to mark the object, and think that they were the thread that marked the object. As far as the mark byte goes, that's OK because everybody is writing the same value. The object gets pushed on the to-be-traced grey object queues multiple times, but that's OK too because tracing should be idempotent.
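In C11 atomics the two approaches look something like this sketch (not any collector's actual code):

#include <stdatomic.h>
#include <stdint.h>

/* Racy marking with relaxed atomics: two threads can both "win", so an
   object may be enqueued twice; that's fine, as tracing is idempotent. */
static int try_mark_racy(atomic_uint_fast8_t *mark) {
  if (atomic_load_explicit(mark, memory_order_relaxed))
    return 0;  /* already marked */
  atomic_store_explicit(mark, 1, memory_order_relaxed);
  return 1;    /* we marked it (and possibly so did someone else) */
}

/* The non-racy alternative: a compare-and-swap guarantees exactly one
   winner, at a higher synchronization cost. */
static int try_mark_cas(atomic_uint_fast8_t *mark) {
  uint_fast8_t expected = 0;
  return atomic_compare_exchange_strong(mark, &expected, 1);
}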

This is a common optimization for parallel marking, and it doesn't have any significant impact on other parts of the GC--except ephemeron marking. For ephemerons, because the state transition isn't simply from unmarked to marked, we need more coordination.

high level

The parallel ephemeron marking algorithm modifies the serial algorithm in just a few ways:

  1. We have an atomically-updated state field in the ephemeron, used to know if e.g. an ephemeron is pending or resolved;

  2. We use separate fields for the pending and resolved links, to allow for concurrent readers across a state change;

  3. We introduce "traced" and "claimed" states to resolve races between parallel tracers on the same ephemeron, and track the "epoch" at which an ephemeron was last traced;

  4. We remove resolved ephemerons from the pending ephemeron hash table lazily, and use atomic swaps to pop from the resolved ephemerons list;

  5. We have to re-check key liveness after publishing an ephemeron to the pending ephemeron table.

Regarding the first point, there are four possible values for the ephemeron's state field:

enum {
  TRACED, CLAIMED, PENDING, RESOLVED
};

The state transition diagram looks like this:

  ,----->TRACED<-----.
 ,         | ^        .
,          v |         .
|        CLAIMED        |
|  ,-----/     \---.    |
|  v               v    |
PENDING--------->RESOLVED

With this information, we can start to flesh out the ephemeron object itself:

struct gc_ephemeron {
  uint8_t state;
  uint8_t is_dead;
  unsigned epoch;
  struct gc_ephemeron *pending;
  struct gc_ephemeron *resolved;
  void *key;
  void *value;
};

The state field holds one of the four state values; is_dead indicates if a live ephemeron was ever proven to have a dead key, or if the user explicitly killed the ephemeron; and epoch is the GC count at which the ephemeron was last traced. Ephemerons are born TRACED in the current GC epoch, and the collector is responsible for incrementing the current epoch before each collection.
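For concreteness, initialization might look like this sketch; the heap's epoch field is my invention for illustration:

/* Sketch: ephemerons are born TRACED in the current GC epoch. */
void gc_ephemeron_init(struct gc_ephemeron *e, struct gc_heap *heap,
                       void *key, void *value) {
  e->state = TRACED;
  e->is_dead = 0;
  e->epoch = heap->epoch;  /* assumed field holding the current GC count */
  e->pending = NULL;
  e->resolved = NULL;
  e->key = key;
  e->value = value;
}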

algorithm: tracing ephemerons

When the collector first finds an ephemeron, it does a compare-and-swap (CAS) on the state from TRACED to CLAIMED. If that succeeds, we check the epoch; if it's current, we revert to the TRACED state: there's nothing to do.

(Without marking races, you wouldn't need either TRACED or CLAIMED states, or the epoch; it would be implicit in the fact that the ephemeron was being traced at all that you had a TRACED ephemeron with an old epoch.)

So now we have a CLAIMED ephemeron with an out-of-date epoch. We update the epoch and clear the pending and resolved fields, setting them to NULL. If, then, the ephemeron is_dead, we are done, and we go back to TRACED.

Otherwise we check if the key has already been traced. If so we forward it (if evacuating) and then trace the value edge as well, and transition to TRACED.

Otherwise we have a live E but we don't know about K; this ephemeron is pending. We transition E's state to PENDING and add it to the front of K's hash bucket in the pending ephemerons table, using CAS to avoid locks.

We then have to re-check if K is live, after publishing E, to account for other threads racing to mark K while we mark E; if indeed K is live, then we transition to RESOLVED and push E on the global resolved ephemeron list, using CAS, via the resolved link.

So far, so good: either the ephemeron is fully traced, or it's pending and published, or (rarely) published-then-resolved and waiting to be traced.
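Condensed into code, the claim-and-trace step might look like the following sketch. It assumes the state field is declared _Atomic uint8_t, and helpers like key_is_traced, forward_key_if_moving, trace_edge, push_pending and maybe_resolve are invented stand-ins, not the real API:

#include <stdatomic.h>

void trace_ephemeron(struct gc_ephemeron *e, struct gc_heap *heap) {
  uint8_t traced = TRACED;
  if (!atomic_compare_exchange_strong(&e->state, &traced, CLAIMED))
    return;                        /* lost the race; someone else owns it */
  if (e->epoch == heap->epoch) {   /* already traced in this collection */
    atomic_store(&e->state, TRACED);
    return;
  }
  e->epoch = heap->epoch;
  e->pending = e->resolved = NULL;
  if (e->is_dead) {
    atomic_store(&e->state, TRACED);
    return;
  }
  if (key_is_traced(e->key)) {     /* key already known to be live */
    forward_key_if_moving(e);
    trace_edge(&e->value, heap);
    atomic_store(&e->state, TRACED);
    return;
  }
  atomic_store(&e->state, PENDING);
  push_pending(heap->pending_ephemerons, e);  /* CAS onto bucket chain */
  if (key_is_traced(e->key))       /* re-check after publishing */
    maybe_resolve(heap->pending_ephemerons, e);
}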

algorithm: tracing objects

The annoying thing about tracing ephemerons is that it potentially impacts tracing of all objects: any object could be the key that resolves a pending ephemeron.

When we trace an object, we look it up in the pending ephemeron hash table. But, as we traverse the chains in a bucket, we also load each node's state. If we find a node that's not in the PENDING state, we atomically forward its predecessor to point to its successor. This is correct for concurrent readers because the end of the chain is always reachable: we only skip nodes that are not PENDING, nodes never become PENDING after they transition away from being PENDING, and we only add PENDING nodes to the front of the chain. We even leave the pending field in place, so that any concurrent reader of the chain can still find the tail, even when the ephemeron has gone on to be RESOLVED or even TRACED.

(I had thought I would need Tim Harris' atomic list implementation, but it turns out that since I only ever insert items at the head, having annotated links is not necessary.)

If we find a PENDING ephemeron that has K as its key, then we CAS its state from PENDING to RESOLVED. If this works, we CAS it onto the front of the resolved list. (Note that we also have to forward the key at this point, for a moving GC; this was a bug in my original implementation.)
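A sketch of that bucket scan, with the same caveats: the links and state are assumed to be atomic, and bucket_for, forward_key_if_moving and push_resolved are invented names.

void scan_pending_bucket(struct gc_pending_ephemeron_table *t,
                         struct gc_obj *key) {
  struct gc_ephemeron **link = bucket_for(t, key);
  struct gc_ephemeron *e;
  while ((e = atomic_load(link))) {
    if (atomic_load(&e->state) != PENDING) {
      /* Lazy removal: skip the node. Its pending link stays intact, so
         concurrent readers can still reach the tail of the chain. */
      atomic_store(link, e->pending);
    } else if (e->key == key) {
      uint8_t pending = PENDING;
      if (atomic_compare_exchange_strong(&e->state, &pending, RESOLVED)) {
        forward_key_if_moving(e);  /* moving GC: the key has a new address */
        push_resolved(t, e);       /* CAS onto the global resolved list */
      }
      /* Either way the node is no longer PENDING; the next loop
         iteration will unlink it. */
    } else {
      link = &e->pending;          /* keep walking the chain */
    }
  }
}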

algorithm: resolved ephemerons

Periodically a thread tracing the graph will run out of objects to trace (its mark stack is empty). That's a good time to check if there are resolved ephemerons to trace. We atomically exchange the global resolved list with NULL, and then if there were resolved ephemerons, then we trace their values and transition them to TRACED.
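A sketch of that pump, assuming the resolved field is _Atomic and trace_edge is a stand-in:

int trace_resolved_ephemerons(struct gc_heap *heap) {
  /* Grab the whole resolved list in one atomic exchange. */
  struct gc_ephemeron *e =
      atomic_exchange(&heap->pending_ephemerons->resolved, NULL);
  if (!e)
    return 0;                      /* no work to do */
  for (; e; e = e->resolved) {
    trace_edge(&e->value, heap);   /* the value edge is now known live */
    atomic_store(&e->state, TRACED);
  }
  return 1;                        /* did work; check again before finishing */
}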

At the very end of the GC cycle, we sweep the pending ephemeron table, marking any ephemeron that's still there as is_dead, transitioning them back to TRACED, clearing the buckets of the pending ephemeron table as we go.

nits

So that's it. There are some drawbacks, for example that this solution takes at least three words per ephemeron. Oh well.

There is also an annoying point of serialization, which is related to the lazy ephemeron resolution optimization. Consider that checking the pending ephemeron table on every object visit is overhead; it would be nice to avoid this. So instead, we start in "lazy" mode, in which pending ephemerons are never resolved by marking; and then once the mark stack / grey object worklist fully empties, we sweep through the pending ephemeron table, checking each ephemeron's key to see if it was visited in the end, and resolving those ephemerons; we then switch to "eager" mode in which each object visit could potentially resolve ephemerons. In this way the cost of ephemeron tracing is avoided for that part of the graph that is strongly reachable. However, with parallel markers, would you switch to eager mode when any thread runs out of objects to mark, or when all threads run out of objects? You would get the greatest parallelism with the former, but you run the risk of some workers switching prematurely, while there is still a significant part of the strongly-reachable graph to traverse. If you wait for all threads to be done, you introduce a serialization point. There is a related question of when to pump the resolved ephemerons list. But these are engineering details.

Speaking of details, there are some gnarly pitfalls, particularly that you have to be very careful about pre-visit versus post-visit object addresses. For a semi-space collector, visiting an object will move it, so in the pending ephemeron table, which by definition is keyed by pre-visit (fromspace) object addresses, you need to be sure to trace the ephemeron key for any transition to RESOLVED. There are a few places this happens: the re-check after publish, sweeping the table after transitioning from lazy to eager, and when resolving eagerly.

implementation

If you've read this far, you may be interested in the implementation; it's only a few hundred lines long. It took me quite a while to whittle it down!

Ephemerons are challenging from a software engineering perspective, because they are logically a separate module, but they interact both with users of the GC and with the collector implementations. It's tricky to find the abstractions that work for all GC algorithms, whether they mark in place or move their objects, and whether they mark the heap precisely or if there are some conservative edges. But if this is the sort of thing that interests you, voilà the API for users and the API to and from collector implementations.

And, that's it! I am looking forward to climbing out of this GC hole, one blog at a time. There are just a few more features before I can seriously attack integrating this into Guile. Until the next time, happy hacking :)

by Andy Wingo at January 24, 2023 10:48 AM

January 23, 2023

GStreamer: GStreamer 1.22.0 new major stable release

(GStreamer)

The GStreamer team is excited to announce a new major feature release of your favourite cross-platform multimedia framework!

As always, this release is again packed with new features, bug fixes and many other improvements.

The 1.22 release series adds new features on top of the previous 1.20 series and is part of the API and ABI-stable 1.x release series of the GStreamer multimedia framework.

Highlights:

  • AV1 video codec support improvements
  • New HLS, DASH and Microsoft Smooth Streaming adaptive streaming clients
  • Qt6 support for rendering video inside a QML scene
  • Minimal builds optimised for binary size, including only the individual elements needed
  • Playbin3, Decodebin3, UriDecodebin3, Parsebin enhancements and stabilisation
  • WebRTC simulcast support and support for Google Congestion Control
  • WebRTC-based media server ingestion/egress (WHIP/WHEP) support
  • New easy to use batteries-included WebRTC sender plugin
  • Easy RTP sender timestamp reconstruction for RTP and RTSP
  • ONVIF timed metadata support
  • New fragmented MP4 muxer and non-fragmented MP4 muxer
  • New plugins for Amazon AWS storage and audio transcription services
  • New gtk4paintablesink and gtkwaylandsink renderers
  • New videocolorscale element that can convert and scale in one go for better performance
  • High bit-depth video improvements
  • Touchscreen event support in navigation API
  • Rust plugins now shipped in macOS and Windows/MSVC binary packages
  • H.264/H.265 timestamp correction elements for PTS/DTS reconstruction before muxers
  • Improved design for DMA buffer sharing and modifier handling for hardware-accelerated video decoders/encoders/filters and capturing/rendering on Linux
  • Video4Linux2 hardware accelerated decoder improvements
  • CUDA integration and Direct3D11 integration and plugin improvements
  • New H.264 / AVC, H.265 / HEVC and AV1 hardware-accelerated video encoders for AMD GPUs using the Advanced Media Framework (AMF) SDK
  • applemedia: H.265 / HEVC video encoding + decoding support
  • androidmedia: H.265 / HEVC video encoding support
  • New "force-live" property for audiomixer, compositor, glvideomixer, d3d11compositor etc.
  • Lots of new plugins, features, performance improvements and bug fixes

For more details check out the GStreamer 1.22 release notes.

Binaries for Android, iOS, macOS and Windows will be provided in due course.

You can download release tarballs directly here: gstreamer, gst-plugins-base, gst-plugins-good, gst-plugins-ugly, gst-plugins-bad, gst-libav, gst-rtsp-server, gst-python, gst-editing-services, gst-devtools, gstreamer-vaapi, gstreamer-sharp, gst-omx, or gstreamer-docs.

January 23, 2023 10:00 PM

January 14, 2023

GStreamer: GStreamer 1.21.90 pre-release (1.22 rc1)

(GStreamer)

The GStreamer team is excited to announce the first release candidate for the upcoming stable 1.22 release series.

This 1.21.90 pre-release is for testing and development purposes in the lead-up to the stable 1.22 series which is now feature frozen and scheduled for release very soon. Any newly-added API can still change until that point, although it is extremely unlikely for that to happen at this point.

Depending on how things go there might be more release candidates in the next couple of days, but in any case we're aiming to get 1.22.0 out as soon as possible.

Preliminary release notes highlighting all the new features, bugfixes, performance optimizations and other important changes will be made available in the next few days.

Binaries for Android, iOS, Mac OS X and Windows will be available at the usual location in due course.

Release tarballs can be downloaded directly here:

As always, please let us know of any issues you run into by filing an issue in Gitlab.

January 14, 2023 01:00 AM

January 03, 2023

Víctor Jáquez: GStreamer compilation with third party libraries

Suppose that you have to hack a GStreamer element which requires a library that is not (yet) packaged by your distribution, nor wrapped as a Meson subproject. What do you do?

In our case, we needed the latest versions of Vulkan-Headers, Vulkan-Loader and Vulkan-Tools, which are interrelated CMake projects.

For these cases, GStreamer’s uninstalled development scripts can use a special directory: gstreamer/prefix. As the README.md says:

NOTE: In the development environment, a fully usable prefix is also configured in gstreamer/prefix where you can install any extra dependency/project.

This means that the gstenv.py script (the one responsible for setting up the uninstalled development environment) will add

  • gstreamer/prefix/bin in PATH for executable files.
  • gstreamer/prefix/lib and gstreamer/prefix/share/gstreamer-1.0 in GST_PLUGIN_PATH, for out-of-tree elements.
  • gstreamer/prefix/lib in GI_TYPELIB_PATH for GObject Introspection metadata.
  • gstreamer/prefix/lib/pkgconfig in PKG_CONFIG_PATH for third party dependencies (our case!).
  • gstreamer/prefix/etc/xdg in XDG_CONFIG_DIRS for XDG-compliant configuration files.
  • gstreamer/prefix/lib and gstreamer/prefix/lib64 in LD_LIBRARY_PATH for third party libraries.

Therefore, the general idea is to compile those third party libraries with their installation prefix set to gstreamer/prefix.

In our case, the Vulkan repositories are interrelated, so they need to be compiled in a certain order. Also, we decided, for self-containment, to clone them in gstreamer/subprojects.

Vulkan-Headers

$ cd ~/gst/gstreamer/subprojects
$ git clone git@github.com:KhronosGroup/Vulkan-Headers.git
$ cd Vulkan-Headers
$ mkdir build
$ cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --target install

Vulkan-Loader

$ cd ~/gst/gstreamer/subprojects
$ git clone git@github.com:KhronosGroup/Vulkan-Loader.git
$ cd Vulkan-Loader
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug -DVULKAN_HEADERS_INSTALL_DIR=/home/vjaquez/gst/gstreamer/prefix -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --target install

Vulkan-Tools

$ cd ~/gst/gstreamer/subprojects
$ git clone git@github.com:KhronosGroup/Vulkan-Tools.git
$ cd Vulkan-Tools
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug -DVULKAN_HEADERS_INSTALL_DIR=/home/vjaquez/gst/gstreamer/prefix -DCMAKE_INSTALL_PREFIX=/home/vjaquez/gst/gstreamer/prefix ..
$ cmake --build . --target install

Right now we have the Vulkan headers and the Vulkan loader pkg-config file in place. And we should be able to compile GStreamer. Right?

Not exactly, because gstenv.py only sets the environment variables for the development environment, not for the GStreamer compilation. But the solution is simple, because we have everything set up in the proper order: just set PKG_CONFIG_PATH when executing meson setup:

$ PKG_CONFIG_PATH=/home/vjaquez/gst/gstreamer/prefix/lib/pkgconfig meson setup --buildtype=debug build

by vjaquez at January 03, 2023 06:35 PM

December 27, 2022

Jean-François Fortin Tam: The post-lockdown work rave, abuse, and shortened fuses

This article is a mix of personal stories, social & interpersonal psychology, and technology (with some drive-by remarks about Free & Open Source software). I originally drafted this article in the fall of 2021 and winter of 2022, but I needed more time to rewrite it a fourth time (because why not). So, while this post ends up landing straight into the twilight zone “between xmas and New Year”, it has more insights and illustrations and arguably makes for a better-rounded story, even if it’s a 7 to 9 minute read rather than 5-6.


For all its faults and unpleasant aspects, 2020 had at least one thing going for it, in my case: the relative lack of opportunities in my line of work at the time gave me some more space to think and justification to breathe, at a time where the world was coming to terms with “It’s okay to not be 100% productive, it’s okay to try to enjoy small things in life, and just be“. I was privileged to be able to afford that suspension of affairs (due to the society I live in and the way my lifestyle is structured), and I was fortunate enough to not have had casualties in my direct entourage in light of that year’s context.

Then suddenly, in 2021, after a year of nothingness, work picked up. Hold unto your buckets!

As I often say in business, when the train passes through the station, you get onto the damned train. Or perhaps I should say, while the horse is still around and you're being chased by a colossus, you jump onto that f'ing horse. Well, it turned out to be a surprisingly long and sporty ride.

Shadow of the Colossus pixelart” by Fernando Henrique, used with permission for editorial purposes.
Check out his amazing gallery!

The Work Rave

I initially thought 2021’s work surge would last only a short while, that I would soon have time to rest and take a step back “when the client work pipeline dries up because of some new pandemic/economic SNAFU.” The intense “hustle while you can” marathon—in reaction to the dry business year of 2020—was necessary, but it was also quite exhausting.

Continuous work with very little physical activities (lockdowns and curfews didn’t exactly improve my already sedentary lifestyle) also had a tangible impact. I don’t think I’ve ever taken as many boiling hot baths as during that year, where I had to regularly unbreak my hands—that were frequently on the verge of RSI, from too much typing and mousing—and rest my eyes (which were almost always dried out near the end of the day). My body was forcing me to slow down (later in the year for example, I had to stay away from the keyboard for two weeks or so because the hands just couldn’t type), and this was supremely annoying to me; after all, it’s not like I was sick, nor worried about food & shelter, so how could I not make the utmost productive use of my time given my arguably envious position in the context of the pandemic?

Summer went by without work drying up that year (nice problem to have, I know 🫠), so after maybe two weeks of semi-vacation (read: working on my own tasks and a GTG release, instead of client work) I went right back to the work treadmill. By that time I had experienced two micro-burnouts in the months prior, but, again, strike the iron while it’s hot, seize the work while it’s available, etc.—we didn’t know what would happen in the fall season, after all.

Witnessing abuse

Around the same time, I was also trying to support two good friends who were undergoing nervous breakdown due to criminal harassment and who, in late summer, were in deep distress. Now, if you’ve paid attention to my last few years’ retrospectives so far, you know I’ve had my share of dead friends in the past, so no, not this time, not on my watch.

I told my clients about my reduced availability during that time, and focused on helping those friends—from mere emotional and logistical support (“I have extra capacity. Use it.”, I said) to “last minute evac to the airport”, as they had to exile themselves for their own safety (because this is the harassment they’ve been going through). They have now left the country forever. It pains me that I won’t be able to see them around much anymore, but if that’s what it takes for them to stay alive, so be it.

It started out as a dark, life-threatening situation, and they are now safe and sound, albeit an ocean apart. I guess that’s a reasonable outcome.

Facing abuse

Shortly afterwards however, an acquaintance of mine (unrelated to the situation described above) started gaslighting and bullying me. Figures! That person had previously been a good friend of mine, and they used to live by the ethos of openness and generosity towards others… until then, that is.

Indeed, within the span of a few months, money and power had irreversibly shifted that person’s worldview and abruptly brought out the worst in them, to the point where they could no longer be reasoned with—any such attempt would only bring out more furious and outlandish demands from them, and they continued to try to bully me into submission and oblivion.

I was in shock and awe at the selfish and petty power trip unfolding before my eyes.

What had happened to that person’s previous noble ideals and “standing up for your colleagues” principles? Alas, that person had now survived long enough to become the exact embodiment of the kind of patronal hubris and unfairness they had vocally despised before (when they were on the other side of the fence). My discontentment was profound:

“You were the chosen one!
It was said that you would destroy the haute bourgeoisie, not join them!
Bring fairness to the proletariat, not leave it in darkness!”

By that climactic scene, their continued bullying had brought me to the verge of 2021’s third micro-burnout, and I had now lost literally an entire week of sleep thinking about the situation and how disappointed I was in their self-centeredness… and that was enough. It had to stop.

At that point in the year—and after the “above-and-beyond” energy spent trying to salvage other difficult relationships in previous years, some with real life-threatening circumstances—I had no patience left for this kind of meaningless abuse, and sought to gracefully yet swiftly end that relationship. It is unfortunate for them as I was pretty much the only friend remaining in their life (they had previously alienated all other friends and family members… see a pattern here?), but I’m not here to save everyone against themselves, particularly not when they show aggressiveness towards me.

Sure, demonstrable facts and reality were on my side, but it wouldn’t matter: that person could no longer be reasoned with; their paranoia made them imagine adversaries everywhere they went, and counter-PsyOps would have been way above my paygrade. I have heard enough stories of similar situations, and we ain’t got no time for that, so… Neutralize and move on. Have a nice life, buddy.

Pictured: “You want to boss me around?
Oh no, you don’t. I’m too old for this sh!t.”

Facing short fuses: a challenge of our times

You know, I understand how most of us have been in a generally irritable mood since 2020—heck, even I have much lower levels of tolerance compared to the inordinate amounts of B.S. I can normally endure, and generally speaking, I think something broke in all of us—so I try to be patient, helpful, understanding and loving to everyone I meet out there… but I have definitely seen people around me having a much shorter fuse and getting offended about everything and anything.

Even something as mundane as an email where I say, “Here’s my phone number as you requested, call me when you want, if I don’t answer I’m probably outside doing errands”. Or for cracking too many jokes in social encounters among friends. Or for pointing out that a PowerPoint or Photoshop mockup is not a MVP and that you shouldn’t erroneously use that term in front of investors if you don’t want to be instantly discredited. Or for taking more than 48 hours to reply to a friend’s email about some drying machine’s warranty enforcement options, while I receive an average of 200 to 250 emails per week. Seriously, some people around me got offended about those things! And I’m like:

At some point, my empathy has limits, and I’m not a punching bag for the perpetually-offended and self-indulgent. Yes, we’ve been living through difficult times, and yes, I do have a degree in psychology and inordinate amounts of patience, but you’re not paying me to be the übertragung medium for your personality laid bare. I’ve already had enough people attempt to play mind games on me in the last few years; no more.

In business and relationships these days, when someone is being unappreciative of my contributions, dismissive of my value, or accusing me of this or that perceived emotional slight, I am now cutting them out. Life is too short to be spent comforting self-centered or inconsiderate people.

👉 As an aside: similarly, I can empathize with various FLOSS project maintainers having very limited tolerance left for passive-aggressive commenters lately, and thus their tendency to quickly lock heated tickets in their issue trackers, as we’ve seen in some areas of GNOME in recent years. I, for one, am incredibly grateful for the many improvements we have seen throughout the GNOME applications & technology stack from 2020 to 2022 (even if I sometimes disagree with some design decisions), but details of my appreciation will probably warrant some standalone blog post or two! In the meantime: thank you, fellow contributors. Your work is not going unnoticed.


With this mindset, and after what transpired through the first three quarters of 2021, I had to take a break from any non-essential work at the start of the fall season in 2021. I had reached the limit on the amount of madness for the year, and staring at the inevitability of the dreaded year-end administrative & financial planning tasks, I decided to phase out existing client work and not take on any new clients during 2021’s last quarter.

Thus in 2022, to avoid burning myself again, I’ve then been very selective about what client work I take on (luckily, my client assignations have been very pleasant that year); I have then been trying to focus my remaining time on draining the swamp, keeping enough free time for myself and my involvement in FLOSS projects and local non-profit orgs, and preparing to scale the business in 2023-2024.

…but scaling, my friends, is a story for another day. I’ve yet to write a retrospective summary of all the cool things that happened throughout 2022, too. Assuredly, my next few blog posts will be quite a bit more upbeat! 😊

by Jeff at December 27, 2022 05:02 PM

December 19, 2022

GStreamer: GStreamer 1.20.5 stable bug fix release

(GStreamer)

The GStreamer team is pleased to announce another bug fix release in the stable 1.20 release series of your favourite cross-platform multimedia framework!

This release only contains bug fixes. It should be safe to update from 1.20.x.

Highlighted bugfixes:

  • systemclock waiting fixes for certain 32-bit platforms/libcs
  • alphacombine: robustness improvements for corner case scenarios
  • avfvideosrc: Report latency when doing screen capture
  • d3d11videosink: various thread-safety and stability fixes
  • decklink: fix performance issue when HDMI signal has been lost for a long time
  • flacparse: Fix handling of headers advertising 32 bits per sample
  • mpegts: Handle when iconv doesn't support ISO 6937 (e.g. musl libc)
  • opengl: fix automatic dispmanx detection for rpi4 and fix usage of eglCreate/DestroyImage
  • opusdec: Various channel-related fixes
  • textrender: event handling fixes, esp. for GAP event
  • subparse: Fix non-closed tag handling
  • videoscale: fix handling of unknown buffer metas
  • videosink: reverse playback handling fixes
  • qtmux: Prefill mode fixes, especially for raw audio
  • multiudpsink: allow binding to IPv6 address
  • rtspsrc: Fix usage of IPv6 connections in SETUP
  • rtspsrc: Only EOS on timeout if all streams are timed out/EOS
  • splitmuxsrc: fix playback stall if there are unlinked pads
  • v4l2: Fix SIGSEGV on state change during format changes
  • wavparse robustness fixes
  • Fix static linking on macOS (opengl, vulkan)
  • gstreamer-vaapi: fix headless build against mesa >= 22.3.0
  • GStreamer Editing Services library: Fix build with tools disabled
  • webrtc example/demo fixes
  • unit test fixes for aesdec and rtpjitterbuffer
  • Cerbero: Fix ios cross-compile with cmake on M1; some recipe updates and other build fixes
  • Binary packages: pkg-config file fixes for various recipes (ffmpeg, taglib, gstreamer)
  • Binary packages: Enable high bitdepth support for libvpx (VP8/VP9 encoding/decoding)
  • Binary packages: ship aes plugin
  • Performance improvements
  • Miscellaneous bug fixes, memory leak fixes, and other stability and reliability improvements

See the GStreamer 1.20.5 release notes for more details.

Binaries for Android, iOS, macOS and Windows will be available shortly.

Release tarballs can be downloaded directly here:

December 19, 2022 11:00 PM

December 18, 2022

Víctor Jáquez: Video decoding in GStreamer with Vulkan Video extension (part 2)

It has been a while since I reported my tinkering with the Vulkan Video provisional extension. Now the specification will have its final release soonish, and there has also been more engagement within the open source communities, such as the work-in-progress FFmpeg implementation by Lynne (please, please, read that post), and the also-work-in-progress Mesa 3D drivers, both for AMD and Intel, by Dave Airlie! Along with the well known NVIDIA beta drivers for Vulkan.

From our side, we have been trying to provide an open source alternative to the video parser used by the Conformance Test Suite and the NVIDIA vk_video_samples, using GStreamer: GstVkVideoParser, which intends to be a drop-in replacement of the current proprietary parser library.

Along the way, we have sketched the Vulkan Video support in gfxreconstruct, for getting traces of the API usage. Sadly, it's kind of bit-rotten right now, even more so because the specification has changed since then.

Regarding the H.264 decoder for GStreamer, we have just restarted hacking on it. The merge request was moved to the monorepo, but for the sake of the well needed complete re-write, we changed the branch to this one (vkh264dec). We needed to re-write it because, besides the specification updates, we have learned many things along the journey, such as the out-of-band parameters update, Vulkan's recommendation for memory pre-allocation as much as possible, the DPB/references handling, the debate about buffer vs. slice uploading, and other friction points that Lynne has spotted for future early adopters.

The way to compile it is to grab the branch and compile it as GStreamer is usually compiled, with Meson:

meson setup builddir -Dgst-plugins-bad:vulkan-video=enabled --buildtype=debug
ninja -C builddir

And run simple pipelines such as

gst-launch-1.0 filesrc location=INPUT ! parsebin ! vulkanh264dec ! fakesink -v

Our objective is to have a functional demo for the next Vulkanised in February. We are very ambitious: we want it to work on Linux and Windows, and on as many GPUs as possible. Wish us luck. And happy December festivities!

by vjaquez at December 18, 2022 01:14 PM

December 15, 2022

Andy Wingo: ephemeral success

(Andy Wingo)

Good evening, patient hackers :) Today finishes off my series on implementing ephemerons in a garbage collector.

Last time, we had a working solution for ephemerons, but it involved recursively visiting any pending ephemerons from within the copy routine—the bit of a semi-space collector that is called when traversing the object graph and we see an object that we hadn't seen yet. This recursive visit could itself recurse, and so we could overflow the control stack.

The solution, of course, is "don't do that": instead of visiting recursively, enqueue the ephemeron for visiting later. Iterate, don't recurse. But here we run into a funny problem: how do we add an ephemeron to a queue or worklist? It's such a pedestrian question ("just... enqueue it?") but I think it illustrates some of the particular concerns of garbage collection hacking.

speak, memory

The issue is that we are in the land of "can't use my tools because I broke my tools with my tools". You can't make a standard List<T> because you can't allocate list nodes inside the tracing routine: if you had memory in which you could allocate, you wouldn't be calling the garbage collector.

If the collector needs a data structure whose size doesn't depend on the connectivity of the object graph, you can pre-allocate it in a reserved part of the heap. This adds memory overhead, of course; for a 1000 MB heap, say, you used to be able to make graphs 500 MB in size (for a semi-space collector), but now you can only do 475 MB because you have to reserve 50 MB (say) for your data structures. Another way to look at it is, if you have a 400 MB live set and then you allocate 2GB of garbage, if your heap limit is 500 MB you will collect 20 times, but if it's 475 MB you'll collect 26 times, which is more expensive. This is part of why GC algorithms are so primitive; implementors have to be stingy, and so we don't get to have nice things / data structures.

However in the case of ephemerons, we will potentially need one worklist entry per ephemeron in the object graph. There is no optimal fixed size for a worklist of ephemerons. Most object graphs will have no or few ephemerons. Some, though, will have practically the whole heap.

For data structure needs like this, the standard solution is to reserve the needed space for a GC-managed data structure in the object itself. For example, for concurrent copying collectors, the GC might reserve a word in the object for a forwarding pointer, instead of just clobbering the first word. If you needed a GC-managed binary tree for a specific kind of object, you'd reserve two words. Again there are strong pressures to minimize this overhead, but in the case of ephemerons it seems sensible to make them pay their way on a per-ephemeron basis.

so let's retake the thing

So sometimes we might need to put an ephemeron in a worklist. Let's add a member to the ephemeron structure:

struct gc_ephemeron {
  struct gc_obj header;
  int dead;
  struct gc_obj *key;
  struct gc_obj *value;
  struct gc_ephemeron *gc_link; // *
};

Incidentally this also solves the problem of how to represent the struct gc_pending_ephemeron_table; just reserve 0.5% of the heap or so as a bucket array for a buckets-and-chains hash table, and use the gc_link as the intrachain links.

struct gc_pending_ephemeron_table {
  struct gc_ephemeron *resolved;
  size_t nbuckets;
  struct gc_ephemeron *buckets[0];
};

An ephemeron can end up in three states, then:

  1. Outside a collection: gc_link can be whatever.

  2. In a collection, the ephemeron is in the pending ephemeron table: gc_link is part of a hash table.

  3. In a collection, the ephemeron's key has been visited, and the ephemeron is on the to-visit worklist; gc_link is part of the resolved singly-linked list.

Instead of phrasing the interface to ephemerons in terms of visiting edges in the graph, the verb is to resolve ephemerons. Resolving an ephemeron adds it to a worklist instead of immediately visiting any edge.

struct gc_ephemeron **
pending_ephemeron_bucket(struct gc_pending_ephemeron_table *table,
                         struct gc_obj *key) {
  return &table->buckets[hash_pointer(key) % table->nbuckets];
}

void add_pending_ephemeron(struct gc_pending_ephemeron_table *table,
                           struct gc_obj *key,
                           struct gc_ephemeron *ephemeron) {
  struct gc_ephemeron **bucket = pending_ephemeron_bucket(table, key);
  ephemeron->gc_link = *bucket;
  *bucket = ephemeron;
}

void resolve_pending_ephemerons(struct gc_pending_ephemeron_table *table,
                                struct gc_obj *obj) {
  struct gc_ephemeron **link = pending_ephemeron_bucket(table, obj);
  struct gc_ephemeron *ephemeron;
  while ((ephemeron = *link)) {
    if (ephemeron->key == obj) {
      *link = ephemeron->gc_link;
      add_resolved_ephemeron(table, ephemeron);
    } else {
      link = &ephemeron->gc_link;
    }
  }
}

Copying an object may add it to the set of pending ephemerons, if it is itself an ephemeron, and also may resolve other pending ephemerons.

void resolve_ephemerons(struct gc_heap *heap, struct gc_obj *obj) {
  resolve_pending_ephemerons(heap->pending_ephemerons, obj);

  struct gc_ephemeron *ephemeron;
  if ((ephemeron = as_ephemeron(forwarded(obj)))
      && !ephemeron->dead) {
    if (is_forwarded(ephemeron->key))
      add_resolved_ephemeron(heap->pending_ephemerons,
                             ephemeron);
    else
      add_pending_ephemeron(heap->pending_ephemerons,
                            ephemeron->key, ephemeron);
  }
}

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  ...
  resolve_ephemerons(heap, obj); // *
  return new_obj;
}

Finally, we need to add something to the core collector to scan resolved ephemerons:

int trace_some_ephemerons(struct gc_heap *heap) {
  struct gc_ephemeron *resolved = heap->pending_ephemerons->resolved;
  if (!resolved) return 0;
  heap->pending_ephemerons->resolved = NULL;
  while (resolved) {
    resolved->key = forwarded(resolved->key);
    visit_field(&resolved->value, heap);
    resolved = resolved->gc_link;
  }
  return 1;
}

void kill_pending_ephemerons(struct gc_heap *heap) {
  struct gc_ephemeron *ephemeron;
  struct gc_pending_ephemeron_table *table = heap->pending_ephemerons;
  for (size_t i = 0; i < table->nbuckets; i++) {
    for (struct gc_ephemeron *chain = table->buckets[i];
         chain;
         chain = chain->gc_link)
      chain->dead = 1;    
    table->buckets[i] = NULL;
  }
}

void collect(struct gc_heap *heap) {
  flip(heap);
  uintptr_t scan = heap->hp;
  trace_roots(heap, visit_field);
  do { // *
    while(scan < heap->hp) {
      struct gc_obj *obj = (struct gc_obj*)scan;
      scan += align_size(trace_heap_object(obj, heap, visit_field));
    }
  } while (trace_some_ephemerons(heap)); // *
  kill_pending_ephemerons(heap); // *
}

The result is... not so bad? It makes sense to make ephemerons pay their own way in terms of memory, having an internal field managed by the GC. In fact I must confess that in the implementation I have been woodshedding, I actually have three of these damn things; perhaps more on that in some other post. But the perturbation to the core algorithm is perhaps less than the original code. There are still some optimizations to make, notably postponing hash-table lookups until the whole strongly-reachable graph is discovered; but again, another day.

And with that, thanks for coming along with me for my journeys into ephemeron-space.

I would like to specifically thank Erik Corry and Steve Blackburn for their advice over the years, and patience with my ignorance; I can only imagine that it's quite amusing when you have experience in a domain to see someone new and eager come in and make many of the classic mistakes. They have both had a kind of generous parsimony in the sense of allowing me to make the necessary gaffes but also providing insight where it can be helpful.

I'm thinking of many occasions, but I especially appreciate the advice to start with a semi-space collector when trying new things, be it benchmarks or test cases or API design or new functionality, as it's a simple algorithm, hard to get wrong on the implementation side, and perfect for bringing out any bugs in other parts of the system. In this case the difference between fromspace and tospace pointers makes a material difference to how you structure the ephemeron implementation; it's not something you can do just in a trace_heap_object function, as you don't have the old pointers there, and the pending ephemeron table is indexed by old object addresses.

Well, until some other time, gentle hackfolk, do accept my sincerest waste disposal greetings. As always, yours in garbage, etc.,

by Andy Wingo at December 15, 2022 09:21 PM

December 14, 2022

Christian Schaller: Update from the world of Fedora Workstation

(Christian Schaller)

Let me start by saying that I now also have a Mastodon account, so please follow me on @Cfkschaller@fosstodon.org for news and updates around Fedora Workstation, PipeWire, Wayland, LVFS and Linux in general.

Fedora vs Fedora Workstation
Before I start with the general development update I want to mention something I have mentioned quite a few times before, and that is the confusion I often find people have about what is Fedora and what is Fedora Workstation.

Fedora is our overall open source project and community working on packaging components and software for the various outputs that the Fedora community delivers. Think of the Fedora community a bit like a big group of people providing well tested and maintained building blocks to be used to build operating systems and applications. As part of that bigger community you have a lot of working groups and special interest groups working to build something with those building blocks, and Fedora Workstation is the most popular thing being built from those building blocks. That means that Fedora Workstation isn’t ‘Fedora’; it is something created by the Fedora community alongside a lot of other projects like Fedora Server, Silverblue and Fedora spins like Fedora KDE and Fedora Kinoite. But all of them should be considered separate efforts built using a shared set of building blocks.
Putting together an operating system like Fedora Workstation is more than just assembling a list of software components to include though, it is also about setting policies, default configurations, testing and QE and marketing. This means that while Fedora Workstation contains many of the same components as other things like the Fedora KDE Plasma Desktop spin, the XFCE Desktop spin, the Cinnamon spin and so on, they are not the same thing. And that is not just because the feature set of GNOME is different from the feature set of XFCE, it is because each variant is able to set their own policies, their own configurations and do their own testing and QE. Different variants adopted different technologies at different times, for instance Fedora Workstation was an early adopter of new technologies like Wayland and PipeWire. So the reason I keep stressing this point is that I to this day often see comments or feedback about ‘Fedora’, feedback which, as someone spending a lot of effort on Fedora Workstation, sometimes makes no sense to me, only to reach out and discover that they were not using Fedora Workstation, but one of the spins. So I do ask people, especially those who are members of the technology press, to be more precise in their reviews about whether they are talking about Fedora Workstation or another project that is housed under the Fedora umbrella, and not just shorten it all to ‘Fedora’.



Anyway, onwards to updating you on some of the work we are doing in and around Fedora Workstation currently.

High Dynamic Range – HDR


A major goal for us at Red Hat is to get proper HDR support into Fedora and RHEL. Sebastian Wick is leading that effort for us and working with partners and contributors across the community and the industry. For those who read my past updates you know I have been talking about this for a while, but it has been slow moving because it is a very complex issue that needs changes from the kernel graphics drivers up through the desktop shell and into the application GUI toolkits. As Sebastian put it when I spoke with him, HDR forces us (the Linux ecosystem) to take colors seriously for the first time and thus it reveals a lot of shortcomings up through our stack, many of which are not even directly HDR related.
Support for High Dynamic Range (HDR) requires compositors to produce framebuffers in specific HDR encodings from the framebuffers of different clients. Most clients nowadays are unaware of HDR and Wide Color Gamuts (WCG) and instead produce pixels with RGB encodings which look okay on most displays, which approximate sRGB characteristics. Converting different encodings to the common HDR encoding requires clients to communicate their encoding, the compositor to adjust and convert between the different color spaces, and the driver to enable an HDR mode in the display. Converting between the encodings should ideally be done by the scanout engine of the GPU to keep the latency and power benefits of direct scanout. Similarly, applications and toolkits have to understand different encodings and how to convert between them to make use of HDR and WCG features.

Essentially, HDR forces us to handle color correctly and makes color management an integral part of the system.
So in no particular order here are some of the issues we have found and are looking at:

  • Display brightness controls are today limited to a single, internal monitor and setting the brightness level to zero might or might not turn off the screen. The luminance setting might scale linearly or approximate the human brightness perception depending on hardware. Hans de Goede on our team is working on fixing these issues to get as coherent behaviour from different hardware as possible.
  • GStreamer's glcolorconvert element doesn't actually convert colors, which breaks playback of some HDR and WCG encoded videos.
  • There are also some challenges with limitations in the current KMS API. User space cannot select the bit depth of the colors sent to the display, nor whether a YUV format with subsampling is used. Both are driver internal magic which can reduce the quality of the image but might be required to light up displays with the limited bandwidth available. The handling of limited and full range colors is driver defined and an override for badly behaving sinks is only available on Intel and VC4. And offloading the color transformations to the scanout hardware is impossible with the KMS API. Available features, their order and configurability vary a lot between hardware. To avoid any discrepancy between the offloaded and the software case the entire color pipeline has to be predictable by user space. We’re currently working together with hardware vendors to look at ways forward here.
  • To handle the communication of the color metadata between clients and compositors Sebastian is working with the wider community to define two work-in-progress wayland protocols. The color representation protocol describes how to go from a tristimulus value to the encoding (this includes YUV to RGB conversions, full vs limited color range, chroma siting) and the color-management protocol describes the tristimulus values (primaries, whitepoint, OETF, ICC profile, dynamic range).
  • There is also now a repository on FreeDesktop.org’s GitLab instance to document and discuss all things related to colors on Linux. This required a lot of research, thinking and discussing about how everything is supposed to fit together.
  • Compositors have to look up information from EDID and DisplayID for HDR. The data is notoriously unreliable currently so Simon Ser, Pekka Paalanen and Sebastian started a new project called libdisplay-info for parsing and quirking EDID+DisplayID.
  • Mutter is now in charge of ICC profile handling and we’re planning to deprecate the old colord daemon. Work on supporting high bit depth framebuffers and enabling HDR mode is ongoing and should give us an experimental setup for figuring out how to combine different SDR and HDR content.
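
As referenced in the list above, here is a hedged sketch of what consuming libdisplay-info could look like. The library is brand new, so these high-level calls are an assumption on my part and may not match the final API exactly.

#include <stdio.h>
#include <stdlib.h>
#include <libdisplay-info/info.h>

// Parse an EDID blob with libdisplay-info's high-level API and print
// the monitor model, with quirks applied by the library.
static void print_monitor_model(const void *edid, size_t size) {
  struct di_info *info = di_info_parse_edid(edid, size);
  if (!info)
    return; // malformed EDID
  char *model = di_info_get_model(info); // caller must free
  if (model)
    printf("monitor model: %s\n", model);
  free(model);
  di_info_destroy(info);
}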

Headless display and automated testing of GNOME Shell

Jonas Ådahl has been working for a while on getting headless displays working correctly under Wayland, and as part of that on using the headless display support to enable automated testing of gnome-shell. The core work to let GNOME Shell run headless on top of Wayland finished some time ago, but there are still some related tasks that need further work.

  • The most critical is remote login. With some hand-holding it works, and it supports TPM-based authentication for single-user sessions, but it currently lacks a nice way to start it. This item is at the very top of Jonas' todo list, so hopefully we can cross it off soon.
  • Remote multi-user login: This feature is currently in active development and will make it possible to get a GDM login screen via gnome-remote-desktop, where one can type one's username and password to get a headless session. The remote multi-user login is a collaboration between Jonas Ådahl, Pascal Nowack (the author of the RDP backend in gnome-remote-desktop) and Joan Torres from SUSE.
  • Testing: Jonas has also been putting in a lot of work recently to improve automated testing of GNOME Shell on the back of the headless display support. You can read more about that in his recent blog post on the subject.

HID-BPF

Benjamin Tissoires has been working for some time now on a set of kernel patches implementing something called HID-BPF. HID refers to Human Interface Devices (mice, keyboards, joysticks and so on), and BPF is a mechanism for safely running small user-supplied programs inside the Linux kernel, widely used for observability. If you want to learn more about BPF in general, I recommend this Linux Journal article on the subject. This blog post will talk specifically about HID-BPF and not BPF in general. We expect this patchset to land in kernel 6.2, though it might get delayed to 6.3.
The kernel documentation explains at length and through examples how and when to use HID-BPF. The summary is that in the HID world we often have to tweak a single byte on a device to make it work, simply because hardware makers only test their devices under Windows, where they can provide a "driver" that carries the fix in software. So most of the time the fix comes down to changing a few bytes in the report descriptor of the HID device, or changing bytes on the fly when we receive an event. For years we have been doing these sorts of fixes in the kernel through the normal kernel driver model. But that is a relatively slow way of doing it, and we should be able to be more efficient.

The kernel driver model process involves multiple steps and requires the reporter of the issue to test the fix at every step. The reporter has an issue and contacts a kernel developer to fix that device (sometimes the reporter and the kernel developer are the same person). To be able to submit the fix, it has to be tested, and thus compiled against the upstream kernel tree, which means that we require regular users to compile their own kernel. This is not an easy feat, but that's how we do it. Then the patch is sent to the mailing list and reviewed. This often leads to a v2 and a v3 of the patch, each requiring the user to recompile a new kernel. Once it's accepted, we still need to wait for the patch to land in Linus' tree, and then for the kernel to be picked up by distributions; only then can the reporter drop the custom kernel build. Of course, during all that time the reporter has a choice to make: should I update my distro kernel to get security updates and drop my fix, should I keep recompiling my own kernel to have both, or should I just stick with my kernel version because my fix is way more important?
So how can HID-BPF make this easier?

So instead of the extensive process above, a fix can be written as a BPF program which can be run on any kernel. Rather than asking the reporter to recompile the upstream kernel, a kernel developer can compile the BPF program and provide both the sources and the binary to the tester. The tester then just has to drop that binary into a given directory to have a userspace component automatically bind the program to the device. During the review process we can easily ask testers to update their BPF program, and they don't even need to reboot (just unplug and replug the target device). On the developer side, we continue doing the review, and we merge the program directly into the source tree; not a single #define needs to differ. Then an automated process (yet to be fully determined, but Benjamin has an RFC out there) builds that program into the kernel tree, and we can safely ship the tested program. When the program hits the distribution kernel, the user doesn't even notice: given that the kernel already ships that BPF program, the userspace component will simply not attach its copy to the device, meaning that the test user benefits from the latest fixes, if any, without any manual intervention. And of course, if the user wants to update the kernel because of a security problem or another bug, the user gets both: the security fix, with the HID fix still in place.
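
To give a feel for what such a fix looks like, here is a hedged sketch modeled on the examples in the kernel documentation. The API (names like hid_bpf_device_event and hid_bpf_get_data) comes from the patchset and may still change before it lands; the device quirk being fixed here is hypothetical.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Rewrite events from a (hypothetical) device that reports button 2
// in the wrong bit of the second byte of its input report.
SEC("fmod_ret/hid_bpf_device_event")
int BPF_PROG(fix_buttons, struct hid_bpf_ctx *hctx)
{
	__u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 4 /* size */);

	if (!data)
		return 0; /* EPERM check: event too small for us */

	if (data[1] & 0x04) {      /* device sets bit 2... */
		data[1] &= ~0x04;
		data[1] |= 0x02;   /* ...where userspace expects bit 1 */
	}
	return 0;
}

char _license[] SEC("license") = "GPL";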

That workflow is probably the most appealing thing about HID-BPF. And of course it translates very well to the enterprise Linux world: when a customer has an issue with a HID device that can be fixed by HID-BPF, we can provide that customer with a pre-compiled binary to drop into the filesystem, without having to rely on a kernel update.
But there is more: we can now start implementing a HID firewall (so that, for example, only fwupd can change the firmware of a HID device), and we can tell people not to invent ad-hoc kernel APIs but to rely on eBPF to expose that functionality (so that if the userspace program is not there, we don't even expose that capability). And we can be even more creative, for example deciding to change how the kernel presents a device to userspace.

Benjamin speaking at Kernel Recipes


For more details I recommend checking out the talk Benjamin gave at the Kernel Recipes conference.

MIPI Camera
Kate Hsuan and Hans de Goede have been working together to get the Linux support for MIPI cameras into shape. MIPI cameras are the next generation of PC cameras, and you are going to see more and more laptops shipping with them, so it is critical for us to get them working and working well. This work is also done in the context of our close collaboration with Lenovo around the Fedora laptops they offer.
Without working support, Fedora Workstation users who buy a laptop equipped with a MIPI camera will only get a black image. The long-term solution for MIPI cameras is a library called libcamera, a project led by Laurent Pinchart and Kieran Bingham from IdeasOnBoard and sponsored by Google. Desktop Linux applications will probably not interact directly with libcamera, though; instead our expectation is that applications use libcamera through PipeWire and the Flatpak portal. Thanks to the work of the community we now have support for the Intel IPU3 MIPI stack in the kernel and through libcamera, but the Intel IPU6 MIPI stack is the one we expect to see head into laptops in a major way, so our current effort is focused on bringing support for that to Fedora users in a useful form.

Intel has so far provided a combination of binary and limited open source code to support these cameras under Linux, using GStreamer to provide an emulated V4L2 device. This is not a great long-term solution, but it does give us at least an intermediate solution until we can get IPU6 working with libcamera. Based on his successful work on IPU3, Hans de Goede has been working on upstreaming the necessary parts of the Intel open source driver for IPU6, since that is the basic interface to control the IPU6 and the image sensor. Kate has been working on packaging the remaining Intel non-free binary and software releases into RPM packages. The packages will provide the v4l2loopback solution, which should work with any software supporting V4L2, and are planned to go live soon in the RPM Fusion nonfree repository.

LVFS – Linux Vendor Firmware Service
Richard Hughes is still making great strides with the now-ubiquitous LVFS service. A few months ago we pushed the UEFI dbx update to the LVFS, and it has now been downloaded over 4 million times. This pushed the total downloads to a new high of 5 million updates in the four weeks of September alone, although growth has returned to a more normal pattern now.

The LVFS now supports 1,163 different devices and is used by 137 different vendors. Since we started all those years ago we have provided at least 74 million updates to end users, and probably more, given that many Red Hat customers mirror the entire LVFS for internal use. Not to mention that, thanks to Richard's collaboration with Google, LVFS is now an integral part of the "Works with ChromeOS" program.

On the LVFS we also now show end-user-provided HSI (Host Security ID) reports for a lot of popular hardware. This is really useful for checking how secure a device is likely to be before buying it, regardless of whether the vendor uploads to the LVFS. We have also asked the ODMs and OEMs who do use the LVFS to submit signed reports. Once that is in place we can continue to scale up, aiming for 10,000 supported devices and 100 million downloads.

LVFS analytics

Latest LVFS statistics

PipeWire & OBS Studio

In addition to a bunch of bug fixes and smaller improvements around audio in PipeWire, Wim Taymans has recently spent time getting the video side of PipeWire into shape. We decided to use OBS Studio as our proof-of-concept application because there was already a decent set of patches from Georges Stavracas and Columbarius that Wim could build upon. Getting the Linux developer community to adopt PipeWire for video will be harder than it was for audio, since we cannot do a drop-in replacement: unlike for audio, where PipeWire provides binary-compatible implementations of the ALSA, PulseAudio and JACK APIs, we cannot provide a re-implementation of the V4L2 API that runs transparently and reliably in place of the actual V4L2, so it will require active porting by application developers.

That said, Wim has created a tool called pw-v4l2 which tries to redirect the v4l2 calls into PipeWire, so you can use that for some testing, for example by running "pw-v4l2 cheese" on the command line; Cheese will then appear in the PipeWire patchbay applications Helvum and qpwgraph. It is not reliable enough, though, to be something we could have GNOME Shell do by default for all camera-handling applications. Instead we will rely on application developers to look at the work we did in OBS Studio as a best-practice example and implement camera input through PipeWire that way. Doing so brings a lot of advantages, like transparent support for libcamera alongside v4l2 and easy sharing of video streams between multiple applications.
One thing to note is that at the moment we have two "equal" video backends in PipeWire, V4L2 and libcamera, which means you get offered the same device twice, once from v4l2 and once from libcamera. Obviously this is a little confusing and annoying. Since libcamera is still in heavy development, the v4l2 backend is more reliable for now, so Fedora Workstation will ship with the v4l2 backend out of the box but allow you to easily install the libcamera backend. As libcamera matures we will switch over to the libcamera backend (and allow you to install the v4l2 backend if you still need or want it for some reason).

Helvum showing OBS Studio with native PipeWire support and Cheese using the pw-v4l2 redirector.

Of course the most critical thing to port to PipeWire for camera handling is the web browsers. Luckily, thanks to the work of Pengutronix, there is a patchset ready to be merged into WebRTC, the implementation used by both Chromium/Chrome and Firefox. While I can make no promises, we are currently looking into whether it is viable for us to start shipping that patch in the Fedora Firefox package soon.

And finally, thanks to the work of Jan Grulich in collaboration with Google engineers, PipeWire is now included in the Chromium test build, which has allowed Google to enable PipeWire-based screen sharing by default. Having PipeWire in the test builds is also going to be critical for getting the camera-handling patch merged and enabled.

Flathub, Flatpak and Fedora

As I have spoken about before, we have a clear goal of moving as close as we can to a Flatpak-only model for Fedora Workstation. There are a lot of reasons for this: making applications more robust (the OS doesn't keep moving fast underneath the application), making updates more reliable (an application updating its dependencies doesn't risk conflicting with the needs of other applications), making applications more portable in the sense that they will not need to be rebuilt for each variety of operating system, providing better security since applications can be sandboxed in a container with clear access limitations, and allowing us to move to an OS model like the one we have in Silverblue, with an immutable core.
So over the last years we have spent a lot of effort, alongside other members of the Linux community, preparing the ground to go in this direction, with improvements to GTK+ and Qt, for instance, to let them work better in sandboxes like the one provided by Flatpak.
There has also been strong support for and growth of Flathub, which now provides a wide range of applications in Flatpak format and is used by most major Linux distributions out there. As part of that we have been working to figure out the policies and user interface that would allow us to enable Flathub fully in Fedora (currently only a small allowlisted selection is available when you enable third-party software). This change didn't make it into Fedora Workstation 37, but we do hope to have it ready for Fedora Workstation 38. As part of that effort we also hope to take another look at the process for building Flatpaks inside Fedora, to reduce the barrier for Fedora developers to do so.

So how do we see things evolving in terms of distribution packaging? That is a good question. First of all, a huge part of what goes into a distribution is not Flatpak material, considering that Flatpaks are squarely aimed at shipping GUI desktop applications. There is a huge list of libraries and tools that are either used to build the distribution itself, like Wayland, GNOME Shell, libinput and PipeWire, or used by developers on top of the operating system, like Python, Perl and Rust tooling. These will need to be RPM packaged for the distribution regardless. And there will be cases where you still want certain applications to be RPM packaged going forward. For instance, many of you are hopefully aware of Container Toolbx, our effort to make pet containers a great tool for developers. What you install into your toolbox, including some GUI applications like many IDEs, will still need to be packaged as RPMs and installed into each toolbox, as those applications have no support for interacting with a toolbox from the host. Over time we hope more IDEs will follow in GNOME Builder's footsteps and become container-aware, and thus able to run from the host system as a Flatpak, but apart from Owen Taylor's VS Code integration plugin most of them are not there yet, and so they need to be installed inside your Toolbx.

As for building Flatpaks in Fedora as opposed to on Flathub, we are working on improving the developer experience around that. There are many good reasons why one might want to maintain a Fedora Flatpak, like preferring the Fedora content and security policies, or simply being more familiar with using the tested and vetted Fedora packages. Of course there are also good reasons why developers might prefer maintaining applications on Flathub. We are fine with either, but we want to make sure that whichever path you choose, we have a great developer and package-maintainer experience for you.

Multi-Stream Transport

Multi-monitor setups have become more and more common and popular, so one effort we have spent time on over the last few years is Lyude Paul's work to clean up the MST support in the kernel. MST is a specification for DisplayPort that allows multiple monitors to be driven from a single DisplayPort port by multiplexing several video streams into a single stream and sending it to a branch device, which demultiplexes the signal into the original streams. DisplayPort MST will usually take the form of a single USB-C or DisplayPort connection. More recently we have also seen higher-resolution displays, and complicated technologies like DSC (Display Stream Compression), which need proper driver support in order to function.

Making setups like docks work is no easy task. In the Linux kernel we have a set of shared code that any video driver can use to much more easily implement support for features like DisplayPort MST. We have put quite a lot of work into making this code both viable for any driver to use and a base on which to build new functionality, for features such as DSC. Our hope is that with this we can encourage both broader support for functionality like MST and support for further features like DSC from vendors. Since this code is shared, it also comes with the benefit that any new functionality implemented through this path is far easier to add to other drivers.
Lyude has mostly finished this work now and has recently been focusing on fixing some regressions that accidentally landed upstream in amdgpu. The main thing she was working on beforehand was a lot of code cleanup, particularly removing a bunch of the legacy MST code. For context: with kernel modesetting we have legacy modesetting and atomic modesetting. Atomic modesetting is what modern drivers use, and it is a great deal simpler to work with than legacy modesetting. Most of the MST helpers were written before atomic was a thing, and as a result there was a pretty big mess of code that didn't really need to be there and that actively made it a lot more difficult to implement new functionality, or to figure out whether bug fixes being submitted to Lyude were even correct. Now that this is cleaned up, the MST helpers make heavy use of atomic, which has definitely simplified the code quite a bit.

by uraeus at December 14, 2022 02:37 PM

December 12, 2022

Andy Wingoi'm throwing ephemeron party & you're invited

(Andy Wingo)

Good day, hackfolk. Today's note tries to extend our semi-space collector with support for ephemerons. Spoiler alert: we fail in a subtle and interesting way. See if you can spot it before the end :)

Recall that, as we concluded in an earlier article, a memory manager needs to incorporate ephemerons as a core part of the tracing algorithm. Ephemerons are not macro-expressible in terms of object trace functions.

Instead, to support ephemerons we need to augment our core trace routine. When we see an ephemeron E, we need to check if the key K is already visited (and therefore live); if so, we trace the value V directly, and we're done. Otherwise, we add E to a global table of pending ephemerons T, indexed under K. Finally whenever we trace a new object O, ephemerons included, we look up O in T, to trace any pending ephemerons for O.

So, taking our semi-space collector as a workbench, let's start by defining what an ephemeron is.

struct gc_ephemeron {
  struct gc_obj header;
  int dead;
  struct gc_obj *key;
  struct gc_obj *value;
};

enum gc_obj_kind { ..., EPHEMERON, ... };

static struct gc_ephemeron* as_ephemeron(struct gc_obj *obj) {
  uintptr_t ephemeron_tag = NOT_FORWARDED_BIT | (EPHEMERON << 1);
  if (obj->tag == ephemeron_tag)
    return (struct gc_ephemeron*)obj;
  return NULL;
}

First we need to allow the GC to know when an object is an ephemeron or not. This is somewhat annoying, as you would like to make this concern entirely the responsibility of the user, and let the GC be indifferent to the kinds of objects it's dealing with, but it seems to be unavoidable.

The heap will need some kind of data structure to track pending ephemerons:

struct gc_pending_ephemeron_table;

struct gc_heap {
  ...
  struct gc_pending_ephemeron_table *pending_ephemerons;
}

struct gc_ephemeron *
pop_pending_ephemeron(struct gc_pending_ephemeron_table*,
                      struct gc_obj*);
void
add_pending_ephemeron(struct gc_pending_ephemeron_table*,
                      struct gc_obj*, struct gc_ephemeron*);
struct gc_ephemeron *
pop_any_pending_ephemeron(struct gc_pending_ephemeron_table*);

Now let's define a function to handle ephemeron shenanigans:

void visit_ephemerons(struct gc_heap *heap, struct gc_obj *obj) {
  // We are visiting OBJ for the first time.
  // OBJ is the old address, but it is already forwarded.
  ASSERT(is_forwarded(obj));

  // First, visit any pending ephemeron for OBJ.
  struct gc_ephemeron *ephemeron;
  while ((ephemeron =
            pop_pending_ephemeron(heap->pending_ephemerons, obj))) {
    ASSERT(obj == ephemeron->key);
    ephemeron->key = forwarded(obj);
    visit_field(&ephemeron->value, heap);
  }

  // Then if OBJ is itself an ephemeron, trace it.
  if ((ephemeron = as_ephemeron(forwarded(obj)))
      && !ephemeron->dead) {
    if (is_forwarded(ephemeron->key)) {
      ephemeron->key = forwarded(ephemeron->key);
      visit_field(&ephemeron->value, heap);
    } else {
      add_pending_ephemeron(heap->pending_ephemerons,
                            ephemeron->key, ephemeron);
    }
  }
}

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  ...
  visit_ephemerons(heap, obj); // *
  return new_obj;
}

We wire it into the copy routine, as that's the bit of the collector that is called only once per object and which has access to the old address. We actually can't process ephemerons during the Cheney field scan, as there we don't have old object addresses.

Then at the end of collection, we kill any ephemeron whose key hasn't been traced:

void kill_pending_ephemerons(struct gc_heap *heap) {
  struct gc_ephemeron *ephemeron;
  while ((ephemeron =
            pop_any_pending_ephemeron(heap->pending_ephemerons)))
    ephemeron->dead = 1;    
}

void collect(struct gc_heap *heap) {
  // ...
  kill_pending_ephemerons(heap);
}

First observation: Gosh, this is quite a mess. It's more code than the core collector, and it's gnarly. There's a hash table, for goodness' sake. Goodbye, elegant algorithm!

Second observation: Well, at least it works.

Third observation: Oh. It works in the same way as tracing in the copy routine works: well enough for shallow graphs, but catastrophically for arbitrary graphs. Calling visit_field from within copy introduces unbounded recursion, as tracing one value can cause more ephemerons to resolve, ad infinitum.

Well. We seem to have reached a dead-end, for now. Will our hero wrest victory from the jaws of defeat? Tune in next time to find out: same garbage time (unpredictable), same garbage channel (my wordhoard). Happy hacking!

by Andy Wingo at December 12, 2022 02:02 PM

December 11, 2022

Andy Wingowe iterate so that you can recurse

(Andy Wingo)

Sometimes when you see an elegant algorithm, you think "looks great, I just need it to also do X". Perhaps you are able to build X directly out of what the algorithm gives you; fantastic. Or, perhaps you can alter the algorithm a bit, and it works just as well while also doing X. Sometimes, though, you alter the algorithm and things go pear-shaped.

Tonight's little note builds on yesterday's semi-space collector article and discusses a worse alternative to the Cheney scanning algorithm.

To recall, we had this visit_field function that takes an edge in the object graph, as the address of a field in memory containing a struct gc_obj*. If the edge points to an object that was already copied, visit_field updates it to the forwarded address. Otherwise it copies the object, thus computing the new address, and then updates the field.

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  return new_obj;
}

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

Although a newly copied object is in tospace, all of its fields still point to fromspace. The Cheney scan algorithm later visits the fields in the newly copied object with visit_field, which both discovers new objects and updates the fields to point to tospace.

One disadvantage of this approach is that the order in which the objects are copied is a bit random. Given a hierarchical memory system, it's better if objects that are accessed together in time are close together in space. This is an impossible task without instrumenting the actual data access in a program and then assuming future accesses will be like the past. Instead, the generally-accepted solution is to ensure that objects that are allocated close together in time be adjacent in space. The bump-pointer allocator in a semi-space collector provides this property, but the evacuation algorithm above does not: it would need to preserve allocation order, but instead its order is driven by graph connectivity.

I say that the copying algorithm above is random but really it favors a breadth-first traversal; if you have a binary tree, first you will copy the left and the right nodes of the root, then the left and right children of the left, then the left and right children of the right, then grandchildren, and so on. Maybe it would be better to keep parent and child nodes together? After all they are probably allocated that way.

So, what if we change the algorithm:

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  trace_heap_object(new_obj, heap, visit_field); // *
  return new_obj;
}

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

void collect(struct gc_heap *heap) {
  flip(heap);
  trace_roots(heap, visit_field);
}

Here we favor a depth-first traversal: we eagerly call trace_heap_object within copy. No need for the Cheney scan algorithm; tracing does it all.

The thing is, this works! It might even have better performance for some workloads, depending on access patterns. And yet, nobody does this. Why?

Well, consider a linked list with a million nodes; you'll end up with a million recursive calls to copy, as visiting each link eagerly traverses the next. While I am all about unbounded recursion, an infinitely extensible stack is something that a language runtime has to provide to a user, and here we're deep into implementing-the-language-runtime territory. At some point a user's deep heap graph is going to cause a gnarly system failure via stack overflow.

Ultimately, stack space needed by a GC algorithm counts towards collector memory overhead. In the case of a semi-space collector you already need twice the amount of memory of your live object graph, and if you recursed instead of iterated this might balloon to 3x or more, depending on the heap graph shape.

Hey that's my note! All this has been context for some future article, so this will be on the final exam. Until then!

by Andy Wingo at December 11, 2022 09:19 PM

December 10, 2022

Andy Wingoa simple semi-space collector

(Andy Wingo)

Good day, hackfolk. Today's article is about semi-space collectors. Many of you know what these are, but perhaps not so many have seen an annotated implementation, so let's do that.

Just to recap, the big picture here is that a semi-space collector divides a chunk of memory into two equal halves or spaces, called the fromspace and the tospace. Allocation proceeds linearly across tospace, from one end to the other. When the tospace is full, we flip the spaces: the tospace becomes the fromspace, and the fromspace becomes the tospace. The collector copies out all live data from the fromspace to the tospace (hence the names), starting from some set of root objects. Once the copy is done, allocation then proceeds in the new tospace.

In practice when you build a GC, it's parameterized in a few ways, one of them being how the user of the GC will represent objects. Let's take as an example a simple tag-in-the-first-word scheme:

struct gc_obj {
  union {
    uintptr_t tag;
    struct gc_obj *forwarded; // for GC
  };
  uintptr_t payload[0];
};

We'll divide all the code in the system into GC code and user code. Users of the GC define how objects are represented. When user code wants to know what the type of an object is, it looks at the first word to check the tag. But, you see that GC has a say in what the representation of user objects needs to be: there's a forwarded member too.

static const uintptr_t NOT_FORWARDED_BIT = 1;
int is_forwarded(struct gc_obj *obj) {
  return (obj->tag & NOT_FORWARDED_BIT) == 1;
}
void* forwarded(struct gc_obj *obj) { 
  return obj->forwarded;
}
void forward(struct gc_obj *from, struct gc_obj *to) {
  from->forwarded = to;
}

forwarded is a forwarding pointer. When GC copies an object from fromspace to tospace, it clobbers the first word of the old copy in fromspace, writing the new address there. It's like when you move to a new flat and have your mail forwarded from your old to your new address.

There is a contract between the GC and the user in which the user agrees to always set the NOT_FORWARDED_BIT in the first word of its objects. That bit is a way for the GC to check if an object is forwarded or not: a forwarded pointer will never have its low bit set, because allocations are aligned on some power-of-two boundary, for example 8 bytes.

struct gc_heap;

// To implement by the user:
size_t heap_object_size(struct gc_obj *obj);
size_t trace_heap_object(struct gc_obj *obj, struct gc_heap *heap,
                         void (*visit)(struct gc_obj **field,
                                       struct gc_heap *heap));
size_t trace_roots(struct gc_heap *heap,
                   void (*visit)(struct gc_obj **field,
                                 struct gc_heap *heap));

The contract between GC and user is in practice one of the most important details of a memory management system. As a GC author, you want to expose the absolute minimum interface, to preserve your freedom to change implementations. The GC-user interface does need to have some minimum surface area, though, for example to enable inlining of the hot path for object allocation. Also, as we see here, there are some operations needed by the GC which are usually implemented by the user: computing the size of an object, tracing its references, and tracing the root references. If this aspect of GC design interests you, I would strongly recommend having a look at MMTk, which has been fruitfully exploring this space over the last two decades.
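
To make the contract concrete, here is a hypothetical user side for a heap containing only pairs; the kind constant and layout are made up for illustration, and trace_roots would similarly visit whatever the runtime treats as roots.

// Hypothetical user representation: a pair, whose first word overlays
// struct gc_obj's tag / forwarding-pointer union.
enum user_kind { PAIR = 1 };

struct pair {
  uintptr_t tag;           // (PAIR << 1) | NOT_FORWARDED_BIT when live
  struct gc_obj *car;
  struct gc_obj *cdr;
};

size_t heap_object_size(struct gc_obj *obj) {
  return sizeof(struct pair);   // only one kind in this sketch
}

size_t trace_heap_object(struct gc_obj *obj, struct gc_heap *heap,
                         void (*visit)(struct gc_obj **field,
                                       struct gc_heap *heap)) {
  struct pair *p = (struct pair*)obj;
  if (p->car) visit(&p->car, heap);
  if (p->cdr) visit(&p->cdr, heap);
  return sizeof(struct pair);
}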

struct gc_heap {
  uintptr_t hp;
  uintptr_t limit;
  uintptr_t from_space;
  uintptr_t to_space;
  size_t size;
};

Now we get to the implementation of the GC. With the exception of how to inline the allocation hot-path, none of this needs to be exposed to the user. We start with a basic definition of what a semi-space heap is, above, and below we will implement collection and allocation.

static uintptr_t align(uintptr_t val, uintptr_t alignment) {
  return (val + alignment - 1) & ~(alignment - 1);
}
static uintptr_t align_size(uintptr_t size) {
  return align(size, sizeof(uintptr_t));
}

All allocators have some minimum alignment, which is usually a power of two at least as large as the target language's ABI alignment. Usually it's a word or two; here we just use one word (4 or 8 bytes).

struct gc_heap* make_heap(size_t size) {
  size = align(size, getpagesize());
  struct gc_heap *heap = malloc(sizeof(struct gc_heap));
  void *mem = mmap(NULL, size, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  heap->to_space = heap->hp = (uintptr_t) mem;
  heap->from_space = heap->limit = heap->hp + size / 2;
  heap->size = size;
  return heap;
}

Making a heap is just requesting a bunch of memory and dividing it in two. How you get that space differs depending on your platform; here we use mmap and also the platform malloc for the struct gc_heap metadata. Of course you will want to check that both the mmap and the malloc succeed :)

struct gc_obj* copy(struct gc_heap *heap, struct gc_obj *obj) {
  size_t size = heap_object_size(obj);
  struct gc_obj *new_obj = (struct gc_obj*)heap->hp;
  memcpy(new_obj, obj, size);
  forward(obj, new_obj);
  heap->hp += align_size(size);
  return new_obj;
}

void flip(struct gc_heap *heap) {
  heap->hp = heap->from_space;
  heap->from_space = heap->to_space;
  heap->to_space = heap->hp;
  heap->limit = heap->hp + heap->size / 2;
}  

void visit_field(struct gc_obj **field, struct gc_heap *heap) {
  struct gc_obj *from = *field;
  struct gc_obj *to =
    is_forwarded(from) ? forwarded(from) : copy(heap, from);
  *field = to;
}

void collect(struct gc_heap *heap) {
  flip(heap);
  uintptr_t scan = heap->hp;
  trace_roots(heap, visit_field);
  while(scan < heap->hp) {
    struct gc_obj *obj = (struct gc_obj*)scan;
    scan += align_size(trace_heap_object(obj, heap, visit_field));
  }
}

Here we have the actual semi-space collection algorithm! It's a tiny bit of code about which people have written reams of prose, and to be fair there are many things to say—too many for here.

Personally I think the most interesting aspect of a semi-space collector is the so-called "Cheney scanning algorithm": when we see an object that's not yet traced, in visit_field, we copy() it to tospace, but don't actually look at its fields. Instead collect keeps track of the partition of tospace that contains copied objects which have not yet been traced, which are those in [scan, heap->hp). The Cheney scan sweeps through this space, advancing scan, possibly copying more objects and extending heap->hp, until such a time as the needs-tracing partition is empty. It's quite a neat solution that requires no additional memory.

inline struct gc_obj* allocate(struct gc_heap *heap, size_t size) {
retry:
  uintptr_t addr = heap->hp;
  uintptr_t new_hp = align_size(addr + size);
  if (heap->limit < new_hp) {
    collect(heap);
    if (heap->limit - heap->hp < size) {
      fprintf(stderr, "out of memory\n");
      abort();
    }
    goto retry;
  }
  heap->hp = new_hp;
  return (struct gc_obj*)addr;
}

Finally, we have the allocator: the reason we have the GC in the first place. The fast path just returns heap->hp, and arranges for the next allocation to return heap->hp + size. The slow path calls collect() and then retries.
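
Continuing the hypothetical pair representation from earlier, the user-side allocation path might look like the sketch below. Note that a real runtime would also make car and cdr visible to trace_roots (via a shadow stack, say), so that they get updated if allocate() triggers a collection.

// Allocate and initialize a pair, honoring the tagging contract.
// Assumes car and cdr are traced as roots across a possible
// collection inside allocate(); see the caveat above.
struct pair* allocate_pair(struct gc_heap *heap,
                           struct gc_obj *car, struct gc_obj *cdr) {
  struct pair *p = (struct pair*)allocate(heap, sizeof(struct pair));
  p->tag = (PAIR << 1) | NOT_FORWARDED_BIT;
  p->car = car;
  p->cdr = cdr;
  return p;
}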

Welp, that's a semi-space collector. Until next time for some notes on ephemerons again. Until then, have a garbage holiday season!

by Andy Wingo at December 10, 2022 08:50 PM

December 05, 2022

GStreamerGStreamer 1.21.3 unstable development release

(GStreamer)

The GStreamer team is pleased to announce another development release in the unstable 1.21 release series on the road to the 1.22 stable release.

The unstable 1.21 release series adds new features on top of the current stable 1.20 series and is part of the API and ABI-stable 1.x release series of the GStreamer multimedia framework.

The unstable 1.21 release series is for testing and development purposes in the lead-up to the stable 1.22 series which is scheduled for release in December 2022. Any newly-added API can still change until that point.

A feature freeze is now in effect for the 1.21 series, but newly-added API might still change until the final 1.22.0 stable release, and minor features may also still be added until then.

This development release is primarily for developers and early adopters.

Packagers: please note that plugins (e.g. xingmux) may have moved between modules, so please take extra care and make sure inter-module version dependencies are such that users can only upgrade all modules in one go, instead of seeing a mix of different 1.21 and/or 1.20 versions on their system.

Binaries for Android, iOS, Mac OS X and Windows are also available at the usual location.

Release tarballs can be downloaded directly here:

As always, please let us know of any issues you run into by filing an issue in Gitlab.

December 05, 2022 02:00 AM

November 28, 2022

Andy Wingoare ephemerons primitive?

(Andy Wingo)

Good evening :) A quick note, tonight: I've long thought that ephemerons are primitive and can't be implemented with mark functions and/or finalizers, but today I think I have a counterexample.

For context, one of the goals of the GC implementation I have been working on is to replace Guile's current use of the Boehm-Demers-Weiser (BDW) conservative collector. Of course, changing the garbage collector of a production language runtime is risky, and for Guile one of the mitigation strategies for this work is that the new collector is behind an abstract API whose implementation can be chosen at compile-time, without requiring changes to user code. That way we can first switch to BDW-implementing-the-new-GC-API, and then switch the implementation behind that API to something else.

Abstracting GC is a tricky problem to get right, and I thank the MMTk project for showing that this is possible -- you have user-facing APIs that need to be implemented by concrete collectors, but also extension points so that the user can provide some compile-time configuration too, for example to provide field-tracing visitors that take into account how a user wants to lay out objects.

Anyway. As we discussed last time, ephemerons usually have explicit support from the GC, so we need an ephemeron abstraction as part of the abstract GC API. The question is, can BDW-GC provide an implementation of this API?

I think the answer is "yes, but it's very gnarly and will kill performance so bad that you won't want to do it."

the contenders

Consider that the primitives that you get with BDW-GC are custom mark functions, run on objects when they are found to be live by the mark workers; disappearing links, a kind of weak reference; and finalizers, which receive the object being finalized, can allocate, and indeed can resurrect the object.

BDW-GC's finalizers are a powerful primitive, but not one that is useful for implementing the "conjunction" aspect of ephemerons, as they cannot constrain the marker's idea of graph connectivity: a finalizer can only prolong the life of an object subgraph, not cut it short. So let's put finalizers aside.

Weak references have a tantalizingly close kind of conjunction property: if the weak reference itself is alive, and the referent is also otherwise reachable, then the weak reference can be dereferenced. However this primitive only involves the two objects E and K; there's no way to then condition traceability of a third object V on E and K.

We are left with mark functions. These are an extraordinarily powerful interface in BDW-GC, but somewhat expensive also: not inlined, and going against the grain of what BDW-GC is really about (heaps in which the majority of all references are conservative). But, OK. The way they work is, your program allocates a number of GC "kinds", and associates mark functions with those kinds. Then when you allocate objects, you use those kinds. BDW-GC will call your mark functions when tracing an object of those kinds.

Let's assume firstly that you have a kind for ephemerons; then when you go to mark an ephemeron E, you mark the value V only if the key K has been marked. Problem solved, right? Only halfway: you also have to handle the case in which E is marked first, then K. So you publish E to a global hash table, and... well. You would mark V when you mark a K for which there is a published E. But, for that you need a hook into marking K, and K can be any object...

So now we assume additionally that all objects are allocated with user-provided custom mark functions, and that every mark function checks whether the marked object is in the published table of pending ephemerons, and if so marks the corresponding values. This is essentially what a proper ephemeron implementation would do, though there are some optimizations one can do to avoid checking the table for each object before the mark stack runs empty for the first time. In this case, yes, you can do it! Additionally, if you register disappearing links for the K field in each E, you can know if an ephemeron E was marked dead in a previous collection. Add a pre-mark hook (something BDW-GC provides) to clear the pending ephemeron table, and you are in business.
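
As a hedged sketch of what the ephemeron half of this scheme might look like, using the mark-procedure interface from BDW-GC's gc_mark.h; the pending-ephemeron table and its publish_pending helper are hypothetical, and every other kind's mark function would have to consult that table too.

#include <gc/gc_mark.h>

struct ephemeron { void *key; void *value; };

// Hypothetical global table of pending ephemerons, keyed by K.
extern void publish_pending(struct ephemeron *e);

static struct GC_ms_entry *
mark_ephemeron(GC_word *addr, struct GC_ms_entry *mark_stack_ptr,
               struct GC_ms_entry *mark_stack_limit, GC_word env) {
  struct ephemeron *e = (struct ephemeron *)addr;
  if (e->key && GC_is_marked(e->key))
    // Key already live: push the value onto the mark stack.
    mark_stack_ptr = GC_mark_and_push(e->value, mark_stack_ptr,
                                      mark_stack_limit,
                                      (void **)&e->value);
  else
    // Key not (yet) marked: defer, in case K is marked later.
    publish_pending(e);
  return mark_stack_ptr;
}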

yes, but no

So, it is possible to implement ephemerons with just custom mark functions. I wouldn't want to do it, though: missing the mostly-avoid-pending-ephemeron-check optimization would be devastating, and really what you want is support in the GC implementation. I think that for the BDW-GC implementation in whippet I'll just implement weak-key associations, in which the value is always marked strongly unless the key was dead on a previous collection, using disappearing links on the key field. That way a (possibly indirect) reference from a value V to a key K can indeed keep K alive, but oh well: it's a conservative approximation of what should happen, and not worse than what Guile has currently.

Good night and happy hacking!

by Andy Wingo at November 28, 2022 09:11 PM

November 07, 2022

GStreamerGStreamer 1.21.2 unstable development release

(GStreamer)

The GStreamer team is pleased to announce a second development release in the unstable 1.21 release series.

The unstable 1.21 release series adds new features on top of the current stable 1.20 series and is part of the API and ABI-stable 1.x release series of the GStreamer multimedia framework.

The unstable 1.21 release series is for testing and development purposes in the lead-up to the stable 1.22 series which is scheduled for release in December 2022. Any newly-added API can still change until that point.

This development release is primarily for developers and early adopters.

Binaries for Android, iOS, Mac OS X and Windows are also available at the usual location.

Release tarballs can be downloaded directly here:

As always, please let us know of any issues you run into by filing an issue in Gitlab.

November 07, 2022 11:30 PM