October 03, 2022

Andy Wingo: on "correct and efficient work-stealing for weak memory models"

(Andy Wingo)

Hello all, a quick post today. Inspired by Rust as a Language for High Performance GC Implementation by Yi Lin et al, a few months ago I had a look to see how the basic Rust concurrency facilities that they used were implemented.

One of the key components that Lin et al used was a Chase-Lev work-stealing double-ended queue (deque). The 2005 article Dynamic Circular Work-Stealing Deque by David Chase and Yossi Lev is a nice read defining this data structure. It's used when you have a single producer of values, but multiple threads competing to claim those values. This is useful when implementing per-CPU schedulers or work queues; each CPU pushes the items it produces onto its own deque and pops them from it as well, but when it runs out of work, it goes to see if it can steal work from other CPUs.

The 2013 paper Correct and Efficient Work-Stealing for Weak Memory Models by Nhat Min Lê et al updates the Chase-Lev paper by relaxing the concurrency primitives from the original big-hammer sequential-consistency operations to an appropriate mix of C11 relaxed, acquire/release, and sequentially-consistent operations. The paper therefore has a C11 translation of the original algorithm, and a proof of correctness. It's quite pleasant. Here's a version in Rust's crossbeam crate, and here's the same thing in C.

I had been using this updated C11 Chase-Lev deque implementation for a while with no complaints in a parallel garbage collector. Each worker thread would keep a local unsynchronized work queue, which when it grew too large would donate half of its work to a per-worker Chase-Lev deque. Then if it ran out of work, it would go through all the workers, seeing if it could steal some work.
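To make the shape of that loop concrete, here's a hypothetical C sketch; the helper names (local_pop, chase_lev_steal, and so on) are illustrative, not the collector's actual API, and the real code has termination and synchronization details that are elided here.

/* Hypothetical sketch of the scheme described above; the helper names are
 * illustrative, not the collector's actual API. */
#include <stdbool.h>
#include <stddef.h>

struct gc_ref;          /* an object to trace */
struct local_queue;     /* unsynchronized, per-worker */
struct chase_lev_deque; /* shared; push by owner, steal by anyone */

bool local_pop(struct local_queue *, struct gc_ref **);
size_t local_size(struct local_queue *);
void local_donate_half(struct local_queue *, struct chase_lev_deque *);
bool chase_lev_steal(struct chase_lev_deque *, struct gc_ref **);
void trace_object(struct gc_ref *);

#define LOCAL_LIMIT 256

void worker_trace_loop(struct local_queue *local,
                       struct chase_lev_deque *own,
                       struct chase_lev_deque **others, size_t n_others) {
  struct gc_ref *obj;
  for (;;) {
    /* Drain local work, donating half to our public deque when it grows. */
    while (local_pop(local, &obj)) {
      if (local_size(local) > LOCAL_LIMIT)
        local_donate_half(local, own);
      trace_object(obj);
    }
    /* Out of local work: try to steal from the other workers' deques. */
    bool found = false;
    for (size_t i = 0; i < n_others && !found; i++)
      found = chase_lev_steal(others[i], &obj);
    if (!found)
      return; /* the real collector rendezvous/terminates here */
    trace_object(obj);
  }
}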

My use of the deque was thus limited to only the push and steal primitives, but not take (using the language of the Lê et al paper). take is like steal, except that it takes values from the producer end of the deque, and it can't run concurrently with push. In practice, take is only used by the thread that also calls push. Cool.

Well I thought, you know, before a worker thread goes to steal from some other thread, it might as well do a cheap take on its own deque first, to see if it could take back some work that it had previously offloaded there. But here I ran into a bug. A brief internet search didn't turn up anything, so here we are to mention it.

Specifically, there is a bug in the Lê et al paper that is not in the Chase-Lev paper. The original paper is in Java, and the C11 version is in, well, C11. The issue is.... integer overflow! In brief, push will increment bottom, and steal increments top. take, on the other hand, can decrement bottom. The C11 code represents bottom with size_t, an unsigned type. I think you see where this is going: if you take on an empty deque in the initial state, bottom wraps around, and you create a situation that looks just like a deque with (size_t)-1 elements, causing garbage reads and all kinds of delightful behavior.
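Here's a minimal, runnable C sketch of the failure mode; it leaves out the atomics and the circular buffer of the real deque and keeps only the bottom/top bookkeeping, which is enough to show the unsigned wrap-around.

/* Minimal sketch of the bug: no atomics, no circular buffer, just the
 * bottom/top bookkeeping. take decrements an unsigned bottom on an empty
 * deque and it wraps around. */
#include <stdio.h>

int main(void) {
  size_t top = 0, bottom = 0;   /* freshly initialized, empty deque */

  /* take: speculatively reserve the bottom element... */
  size_t b = bottom - 1;        /* 0 - 1 wraps to SIZE_MAX */
  bottom = b;

  /* ...then compare against top to decide whether the deque was non-empty. */
  if (top <= b) {               /* "non-empty", so take proceeds */
    printf("take will read slot %zu of an empty deque\n", b);
    /* buffer[b % capacity] is garbage; pushes and steals go downhill from here */
  }
  return 0;
}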

The funny thing is that I looked at the proof and I looked at the industrial applications of the deque and I thought well, I just have to transcribe the algorithm exactly and I'll be golden. But it just goes to show that proving one property of an algorithm doesn't necessarily imply that the algorithm is correct.

by Andy Wingo at October 03, 2022 08:24 AM

September 01, 2022

Andy Wingo: new month, new brainworm

(Andy Wingo)

Today, a brainworm! I had a thought a few days ago and can't get it out of my head, so I need to pass it on to another host.

So, imagine a world in which there is a drive to build a kind of Kubernetes on top of WebAssembly. Kubernetes nodes are generally containers, associated with additional metadata indicating their place in the overall system topology (network connections and so on). (I am not a Kubernetes specialist, as you can see; corrections welcome.) Now in a WebAssembly cloud, the nodes would be components, probably also with additional topological metadata. VC-backed companies will duke it out for dominance of the WebAssembly cloud space, and in a couple years we will probably emerge with an open source project that has become a de-facto standard (though it might be dominated by one or two players).

In this world, Kubernetes and Spiffy-Wasm-Cloud will coexist. One of the success factors for Kubernetes was that you can just put your old database binary inside a container: it's the same ABI as when you run your database in a virtual machine, or on (so-called!) bare metal. The means of composition are TCP and UDP network connections between containers, possibly facilitated by some kind of network fabric. In contrast, in Spiffy-Wasm-Cloud we aren't starting from the kernel ABI, with processes and such: instead there's WASI, which is more of a kind of specialized and limited libc. You can't just drop in your database binary, you have to write code to get it to conform to the new interfaces.

One consequence of this situation is that I expect WASI and the component model to develop a rich network API, to allow WebAssembly components to interoperate not just with end-users but also other (micro-)services running in the same cloud. Likewise there is room here for a company to develop some complicated network fabrics for linking these things together.

However, WebAssembly-to-WebAssembly links are better expressed via typed functional interfaces; it's more expressive and can be faster. Not only can you end up having fine-grained composition that looks more like lightweight Erlang processes, you can also string together components in a pipeline with communications overhead approaching that of a simple function call. Relative to Kubernetes, there are potential 10x-100x improvements to be had, in throughput and in memory footprint, at least in some cases. It's the promise of this kind of improvement that can drive investment in this area, and eventually adoption.

But, you still have some legacy things running in containers. What to do? Well... Maybe recompile them to WebAssembly? That's my brain-worm.

A container is a file system image containing executable files and data. Starting with the executable files, they are in machine code, generally x64, and interoperate with system libraries and the run-time via an ABI. You could compile them to WebAssembly instead. You could interpret them as data, or JIT-compile them as webvm does, or directly compile them to WebAssembly. This is the sort of thing you hire Fabrice Bellard to do ;) Then you have the filesystem. Let's assume it is stateless: any change to the filesystem at runtime doesn't need to be preserved. (I understand this is a goal, though I could be wrong.) So you could put the filesystem in memory, as some kind of addressable data structure, and you make the libc interface access that data structure. It's something like the microkernel approach. And then you translate whatever topological connectivity metadata you had for Kubernetes to your Spiffy-Wasm-Cloud's format.

Anyway in the end you have a WebAssembly module and some metadata, and you can run it in your WebAssembly cloud. Or on the more basic level, you have a container and you can now run it on any machine with a WebAssembly implementation, even on other architectures (coucou RISC-V!).

Anyway, that's the tweet. Have fun, whoever gets to work on this :)

by Andy Wingo at September 01, 2022 10:12 AM

August 24, 2022

Arun Raghavan: GStreamer for your backend services

For the last year and a half, we at Asymptotic have been working with the excellent team at Daily. I’d like to share a little bit about what we’ve learned.

Daily is a real-time calling platform as a service. One standard feature that users have come to expect in their calls is the ability to record them, or to stream their conversations to a larger audience. This involves mixing together all the audio/video from each participant and then storing it, or streaming it live via YouTube, Twitch, or any other third-party service.

As you might expect, GStreamer is a good fit for building this kind of functionality, where we consume a bunch of RTP streams, composite/mix them, and then send them out to one or more external services (Amazon’s S3 for recordings and HLS, or a third-party RTMP server).

I’ve written about how we implemented this feature elsewhere, but I’ll summarise briefly.

This is a slightly longer post than usual, so grab a cup of your favourite beverage, or jump straight to the summary section for the tl;dr.

Architecture

Participants in a call send their audio and video to a “Selective Forwarding Unit” (SFU) which, as the name suggests, is responsible for forwarding media to all the other participants. The media is usually compressed with lossy codecs such as VP8/H.264 (video) or Opus (audio), and packaged in RTP packets.

In addition to this, the SFU can also forward the media to a streaming and recording backend, which is responsible for combining the streams based on client-provided configuration. Configuration might include such details as which participants to capture, or how the video should be laid out (grid and active participant being two common layouts).

Daily’s streaming architecture

While the illustration itself is static, reality is often messier. Participants may join and leave the call at any time, audio and video might be muted, or screens might be shared — we expect media to come and go dynamically, and the backend must be able to adapt to this smoothly.

Let’s zoom into the box labelled Streaming and Recording Backend. This is an internal service, at the core of which is a GStreamer-based application which runs a pipeline that looks something like this:

High-level view of the GStreamer pipeline

This is, of course, a simplified diagram. There are a number of finer details, such as per-track processing, decisions regarding how we build the composition, and how we manage the set of outputs.

With this architecture in mind, let’s dive into how we built this out, and what approaches we found worked well.

Modular output

It should be no surprise to folks familiar with GStreamer that this architecture allows us to easily expand the set of outputs we support.

We first started with just packaging the stream in FLV and streaming to RTMP with rtmp2sink. Downstream services like YouTube and Twitch could then repackage and stream this to a larger number of users, both in real time and for watching later.

We then added recording support, which was “just” a matter of adding another branch to the pipeline to package the media in MP4 and stream straight to Amazon S3 using the rusotos3sink (now awss3sink after we rewrote the plugin using the official Amazon Rust SDK).

At a high level, this part of the GStreamer pipeline looks something like:

Modular outputs of the GStreamer pipeline

During an internal hackathon, we created an s3hlssink (now upstream) which allows us to write out a single-variant HLS stream from the pipeline directly to an S3 bucket. We also have an open merge request to write out multi-variant HLS directly to S3.

This will provide an HLS output alternative for Daily customers who wish to manage their own media content distribution.

Of course, we can add more outputs as needed. For example, streaming using newer protocols such as SRT would be relatively straightforward.
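As a rough sketch of that branching shape, here is a self-contained C program using gst_parse_launch, with a test source standing in for the composited call and local filesinks standing in for the rtmp2sink and awss3sink used in the real pipeline:

/* Sketch only: one encoded stream fanned out via tee into two containers,
 * FLV (as for RTMP) and MP4 (as for recordings). filesink stands in for
 * rtmp2sink/awss3sink. */
#include <gst/gst.h>

int main(int argc, char *argv[]) {
  gst_init(&argc, &argv);

  GError *error = NULL;
  GstElement *pipeline = gst_parse_launch(
      "videotestsrc num-buffers=300 ! videoconvert ! x264enc ! h264parse "
      "  ! tee name=t "
      "t. ! queue ! flvmux streamable=true ! filesink location=stream.flv "
      "t. ! queue ! mp4mux ! filesink location=recording.mp4 ",
      &error);
  if (!pipeline) {
    g_printerr("Failed to build pipeline: %s\n", error->message);
    return 1;
  }

  gst_element_set_state(pipeline, GST_STATE_PLAYING);
  GstBus *bus = gst_element_get_bus(pipeline);
  GstMessage *msg = gst_bus_timed_pop_filtered(
      bus, GST_CLOCK_TIME_NONE, GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
  if (msg)
    gst_message_unref(msg);
  gst_element_set_state(pipeline, GST_STATE_NULL);
  gst_object_unref(bus);
  gst_object_unref(pipeline);
  return 0;
}

The point is just the tee: each additional output is another queue ! mux ! sink branch hanging off it.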

Complexity management

The GStreamer application started out as a monolithic Python program (for which we have a pretty mature set of bindings). Due to the number of features we wanted to support, and how dynamic they were (the set of inputs and outputs can change at runtime), things started getting complex quickly.

A good way to manage this complexity is to refactor functionality into independent modules. We found that GStreamer bins provide a good way to encapsulate state and functionality, while providing a well-defined control interface to the application. We decided to implement these bins in Rust, for reasons I hope to get into in a future blog post.

For example, we added the ability to stream to multiple RTMP services concurrently. We wrapped this functionality in an “RTMP streaming bin”, with a simple interface to specify the set of RTMP URLs being streamed to. This bin then internally manages adding and removing sinks within the running pipeline without the application having to keep track of the details. This is illustrated below.

RTMP streaming bin

Doing this for each bit of dynamic functionality has made both development and maintenance of the code a lot easier.
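For illustration, here is a stripped-down C sketch of such a bin; Daily's actual bin is written in Rust and manages a dynamic set of RTMP sinks, whereas this one wraps a single fixed URL and only shows the encapsulation and ghost-pad idea.

/* Illustrative sketch: a bin that accepts raw video on a ghost "sink" pad
 * and streams it to one RTMP URL. The application only sees the bin and
 * its pad; the encoder/muxer/sink plumbing is hidden inside. */
#include <gst/gst.h>

static GstElement *make_rtmp_bin(const gchar *rtmp_url) {
  GstElement *bin = gst_bin_new("rtmp-streaming-bin");
  GstElement *queue = gst_element_factory_make("queue", NULL);
  GstElement *convert = gst_element_factory_make("videoconvert", NULL);
  GstElement *enc = gst_element_factory_make("x264enc", NULL);
  GstElement *mux = gst_element_factory_make("flvmux", NULL);
  GstElement *sink = gst_element_factory_make("rtmp2sink", NULL);

  g_object_set(mux, "streamable", TRUE, NULL);
  g_object_set(sink, "location", rtmp_url, NULL);

  gst_bin_add_many(GST_BIN(bin), queue, convert, enc, mux, sink, NULL);
  gst_element_link_many(queue, convert, enc, mux, sink, NULL);

  /* Expose the queue's sink pad as the bin's single "sink" pad. */
  GstPad *target = gst_element_get_static_pad(queue, "sink");
  gst_element_add_pad(bin, gst_ghost_pad_new("sink", target));
  gst_object_unref(target);

  return bin;
}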

Performance and threading

As we started using our pipeline with larger calls, we realised that we did have some performance bottlenecks. When we had more than 9-10 video inputs, the output frame rate would drop from the expected 20fps to 5-10fps. We run the pipeline on a fairly beefy EC2 instance, so we had to dig in to understand what was going on.

We went with a low-tech approach to figure this out, and just used top to give us a sense of the overall CPU consumption. It turned out we were well below 50% of overall CPU utilisation, but still dropping frames.

So we switched top to “threads-mode” (H while it’s running, or top -H), to see individual threads’ CPU consumption. GStreamer elements provide names for the threads they spawn, which helps us identify where the bottlenecks are.

We identified these CPU-intensive tasks in our pipeline:

  1. Decoding and resizing input video streams
  2. Composing all the inputs together
  3. Encoding video for the output

The composition step looked something like:

Older single-threaded compositor setup

The bottleneck turned out to be in compositor‘s processing loop. We factored out the expensive processing into individual threads for each video track. With this in place, we’re able to breeze through with even 25 video inputs. The new pipeline looked something like:

Compositor with processing in upstream threads

Note: Subsequent to this work, compositor gained the ability to fan out per-input processing into multiple threads. We still kept our manual approach as we have some processing that is not (yet) implemented in compositor. An example of this is a plugin we developed for drawing rounded corners (also available upstream).

The key takeaway here is that understanding GStreamer’s threading model is a vital tool for managing the performance of your pipeline.
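To make the thread-boundary idea concrete, here is a simplified, self-contained C sketch (not Daily's pipeline, and it assumes a local video sink is available): placing a queue in front of each compositor input decouples that input's conversion and scaling from the compositor's own processing loop.

/* Simplified sketch: each "queue" creates a thread boundary, so the
 * per-input videoconvert/videoscale work runs on its own streaming thread
 * rather than serialising behind the compositor. */
#include <gst/gst.h>

int main(int argc, char *argv[]) {
  gst_init(&argc, &argv);

  GError *error = NULL;
  GstElement *pipeline = gst_parse_launch(
      "compositor name=mix ! videoconvert ! autovideosink "
      "videotestsrc is-live=true ! videoconvert ! videoscale "
      "  ! video/x-raw,width=320,height=240 ! queue ! mix. "
      "videotestsrc is-live=true pattern=ball ! videoconvert ! videoscale "
      "  ! video/x-raw,width=320,height=240 ! queue ! mix. ",
      &error);
  if (!pipeline) {
    g_printerr("Failed to build pipeline: %s\n", error->message);
    return 1;
  }

  gst_element_set_state(pipeline, GST_STATE_PLAYING);
  GstBus *bus = gst_element_get_bus(pipeline);
  GstMessage *msg = gst_bus_timed_pop_filtered(
      bus, GST_CLOCK_TIME_NONE, GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
  if (msg)
    gst_message_unref(msg);
  gst_element_set_state(pipeline, GST_STATE_NULL);
  gst_object_unref(bus);
  gst_object_unref(pipeline);
  return 0;
}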

Buffers and Observability

One thing we wanted to keep in mind from the start was observability — we wanted to be able to easily characterise the performance of a running pipeline, as well as be able to aggregate these statistics to form a view of the health of the system as we added features and made releases.

During the performance analysis I described in the previous section, we found that queue elements provide a good proxy for pipeline health. The queue creates a thread boundary in the pipeline, and looking at fill level and overruns gives us a good sense of how each thread is performing. If a queue runs full and overruns, it means the elements downstream of that queue aren’t able to keep up and that’s something to look at.

This is illustrated in the figure below. The compositor is not able to keep up in real time, causing the queues upstream of it to start filling up, while those downstream of it are empty.

Slow composition causes upstream queues to fill up

We periodically probe our pipeline for statistics from queues and other elements, and feed them into applications logs for analysis and aggregation. This gives us a very useful first point of investigation when problems are reported, as well as a basis for proactive monitoring.
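As a rough illustration of that kind of probing (a sketch, not Daily's telemetry code), the stock queue element already exposes what we need: current-level-* properties for fill levels and an overrun signal for when downstream falls behind. Something like the following, assuming a GLib main loop is running:

#include <gst/gst.h>

/* Called when the queue is full and overruns: downstream isn't keeping up. */
static void on_overrun(GstElement *queue, gpointer user_data) {
  g_print("queue %s overran\n", GST_OBJECT_NAME(queue));
}

/* Periodically log a queue's current fill level. */
static gboolean log_queue_level(gpointer user_data) {
  GstElement *queue = GST_ELEMENT(user_data);
  guint buffers = 0;
  guint64 time_ns = 0;
  g_object_get(queue,
               "current-level-buffers", &buffers,
               "current-level-time", &time_ns,
               NULL);
  g_print("queue %s: %u buffers, %" G_GUINT64_FORMAT " ns queued\n",
          GST_OBJECT_NAME(queue), buffers, time_ns);
  return G_SOURCE_CONTINUE;
}

/* Hook a queue up for monitoring; assumes a GLib main loop is running. */
static void monitor_queue(GstElement *queue) {
  g_signal_connect(queue, "overrun", G_CALLBACK(on_overrun), NULL);
  g_timeout_add_seconds(5, log_queue_level, queue);
}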

There has been recent work to add a queue tracer to generalise tracking these statistics, which we may be able to adopt and expand.

Summary

Hopefully this post has given you a sense of the platform we have built. In the process, we also set the rails for a very exciting custom composition engine.

To summarise what we learned:

  • While I am not an impartial observer, I think using GStreamer helped us ship a large number of features in a relatively brief time
  • The ability to dynamically add and remove both inputs and outputs allows us to realise complex use-cases with relative ease
  • We manage complexity by encapsulating functionality in bins
  • Analysing and tuning performance is very tractable with the right tools and mental model
  • Building for observability early on can pay rich dividends

I hope you enjoyed this post. If you think we can help you with your GStreamer needs, do get in touch.

Thank you for your time!

by Arun at August 24, 2022 03:51 PM

August 23, 2022

Andy Wingo: accessing webassembly reference-typed arrays from c++

(Andy Wingo)

The WebAssembly garbage collection proposal is coming soonish (really!) and will extend WebAssembly with the capability to create and access arrays whose memory is automatically managed by the host. As long as some system component has a reference to an array, it will be kept alive, and as soon as nobody references it any more, it becomes "garbage" and is thus eligible for collection.

(In a way it's funny to define the proposal this way, in terms of what happens to garbage objects that by definition aren't part of the program's future any more; really the interesting thing is the new things you can do with live data, defining new data types and representing them outside of linear memory and passing them between components without copying. But "extensible-arrays-structs-and-other-data-types" just isn't as catchy as "GC". Anyway, I digress!)

One potential use case for garbage-collected arrays is for passing large buffers between parts of a WebAssembly system. For example, a webcam driver could produce a stream of frames as reference-typed arrays of bytes, and then pass them by reference to a sandboxed WebAssembly instance to, I don't know, identify cats in the images or something. You get the idea. Reference-typed arrays let you avoid copying large video frames.

A lot of image-processing code is written in C++ or Rust. With WebAssembly 1.0, you just have linear memory and no reference-typed values, which works well for these languages that like to think of memory as having a single address space. But once you get reference-typed arrays in the mix, you effectively have multiple address spaces: you can't address the contents of the array using a normal pointer, as you might be able to do if you mmap'd the buffer into a program's address space. So what do you do?

reference-typed values are special

The broader question of C++ and GC-managed arrays is, well, too broad for today. The set of array types is infinite, because it's not just arrays of i32, it's also arrays of arrays of i32, and arrays of those, and arrays of records, and so on.

So let's limit the question to just arrays of i8, to see if we can make some progress. So imagine a C function that takes an array of i8:

void process(array_of_i8 array) {
  // ?
}

If you know WebAssembly, there's a clear translation of the sort of code that we want:

(func (param $array (ref (array i8)))
  ; operate on local 0
  )

The WebAssembly function will have an array as a parameter. But, here we start to run into more problems with the LLVM toolchain that we use to compile C and other languages to WebAssembly. When the C front-end of LLVM (clang) compiles a function to the LLVM middle-end's intermediate representation (IR), it models all local variables (including function parameters) as mutable memory locations created with alloca. Later optimizations might turn these memory locations back to SSA variables and thence to registers or stack slots. But, a reference-typed value has no bit representation, and it can't be stored to linear memory: there is no alloca that can hold it.

Incidentally this problem is not isolated to future extensions to WebAssembly; the externref and funcref data types that landed in WebAssembly 2.0 and in all browsers are also reference types that can't be written to main memory. Similarly, the table data type which is also part of shipping WebAssembly is not dissimilar to GC-managed arrays, except that tables are statically allocated at compile-time.

At Igalia, my colleagues Paulo Matos and Alex Bradbury have been hard at work to solve this gnarly problem and finally expose reference-typed values to C. The full details and final vision are probably a bit too much for this article, but some bits on the mechanism will help.

Firstly, note that LLVM has a fairly traditional breakdown between front-end (clang), middle-end ("the IR layer"), and back-end ("the MC layer"). The back-end can be quite target-specific, and though it can be annoying, we've managed to get fairly good support for reference types there.

In the IR layer, we are currently representing GC-managed values as opaque pointers into non-default, non-integral address spaces. LLVM attaches an address space (an integer less than 2^24 or so) to each pointer, mostly for OpenCL and GPU sorts of use-cases, and we abuse this to prevent LLVM from doing much reasoning about these values.

This is a bit of a theme, incidentally: get the IR layer to avoid assuming anything about reference-typed values. We're fighting the system, in a way. As another example, because LLVM is really oriented towards lowering high-level constructs to low-level machine operations, it doesn't necessarily preserve types attached to pointers on the IR layer. Whereas for WebAssembly, we need exactly that: we reify types when we write out WebAssembly object files, and we need LLVM to pass some types through from front-end to back-end unmolested. We've had to change tack a number of times to get a good way to preserve data from front-end to back-end, and probably will have to do so again before we end up with a final design.

Finally on the front-end we need to generate an alloca in different address spaces depending on the type being allocated. And because reference-typed arrays can't be stored to main memory, there are semantic restrictions as to how they can be used, which need to be enforced by clang. Fortunately, this set of restrictions is similar enough to what is imposed by the ARM C Language Extensions (ACLE) for scalable vector (SVE) values, which also don't have a known bit representation at compile-time, so we can piggy-back on those. This patch hasn't landed yet, but who knows, it might land soon; in the mean-time we are going to run ahead of upstream a bit to show how you might define and use an array type definition. Further tacks here are also expected, as we try to thread the needle between exposing these features to users and not imposing too much of a burden on clang maintenance.

accessing array contents

All this is a bit basic, though; it just gives you enough to have a local variable or a function parameter of a reference-valued type. Let's continue our example:

void process(array_of_i8 array) {
  uint32_t sum = 0;
  for (size_t idx = 0; idx < __builtin_wasm_array_length(array); idx++)
    sum += (uint8_t)__builtin_wasm_array_ref_i8(array, idx);
  // ...
}

The most basic way to extend C to access these otherwise opaque values is to expose some builtins, say __builtin_wasm_array_length and so on. Probably you need different intrinsics for each scalar array element type (i8, i16, and so on), and one for arrays which return reference-typed values. We'll talk about arrays of references another day, but focusing on the i8 case, the C builtin then lowers to a dedicated LLVM intrinsic, which passes through the middle layer unscathed.

In C++ I think we can provide some nicer syntax which preserves the syntactic illusion of array access.

I think this is going to be sufficient as an MVP, but there's one caveat: SIMD. You can indeed have an array of i128 values, but you can only access that array's elements as i128; worse, you can't load multiple data from an i8 array as i128 or even i32.

Compare this to the memory control proposal, which instead proposes to map buffers to non-default memories. In WebAssembly, you can in theory (and perhaps soon in practice) have multiple memories. The easiest way I can see on the toolchain side is to use the address space feature in clang:

void process(uint8_t *array __attribute__((address_space(42))),
             size_t len) {
  uint32_t sum = 0;
  for (size_t idx = 0; idx < len; idx++)
    sum += array[idx];
  // ...
}

How exactly to plumb the mapping between address spaces which can only be specified by number from the front-end to the back-end is a little gnarly; really you'd like to declare the set of address spaces that a compilation unit uses symbolically, and then have the linker produce a final allocation of memory indices. But I digress, it's clear that with this solution we can use SIMD instructions to load multiple bytes from memory at a time, so it's a winner with respect to accessing GC arrays.

Or is it? Perhaps there could be SIMD extensions for packed GC arrays. I think it makes sense, but it's a fair amount of (admittedly somewhat mechanical) specification and implementation work.

& future

In some future bloggies we'll talk about how we will declare new reference types: first some basics, then some more integrated visions for reference types and C++. Lots going on, and this is just a brain-dump of the current state of things; thoughts are very much welcome.

by Andy Wingo at August 23, 2022 10:35 AM

August 18, 2022

Andy Wingo: just-in-time code generation within webassembly

(Andy Wingo)

Just-in-time (JIT) code generation is an important tactic when implementing a programming language. Generating code at run-time allows a program to specialize itself to the specific data it is run against. For a program that implements a programming language, that specialization is with respect to the program being run, and possibly with respect to the data that program uses.

The way this typically works is that the program generates bytes for the instruction set of the machine it's running on, and then transfers control to those instructions.

Usually the program has to put its generated code in memory that is specially marked as executable. However, this capability is missing in WebAssembly. How, then, to do just-in-time compilation in WebAssembly?

webassembly as a harvard architecture

In a von Neumann machine, like the ones that you are probably reading this on, code and data share an address space. There's only one kind of pointer, and it can point to anything: the bytes that implement the sin function, the number 42, the characters in "biscuits", or anything at all. WebAssembly is different in that its code is not addressable at run-time. Functions in a WebAssembly module are numbered sequentially from 0, and the WebAssembly call instruction takes the callee as an immediate parameter.

So, to add code to a WebAssembly program, somehow you'd have to augment the program with more functions. Let's assume we will make that possible somehow -- that your WebAssembly module that had N functions will now have N+1 functions, and with function N being the new one your program generated. How would we call it? Given that the call instructions hard-code the callee, the existing functions 0 to N-1 won't call it.

Here the answer is call_indirect. A bit of a reminder: this instruction takes the callee as an operand, not an immediate parameter, allowing it to choose the callee function at run-time. The callee operand is an index into a table of functions. Conventionally, table 0 is called the indirect function table as it contains an entry for each function which might ever be the target of an indirect call.
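For a C-level intuition (a small sketch, not from the article): when C is compiled to WebAssembly, a call through a function pointer is what becomes call_indirect, and the pointer's value is an index into the indirect function table rather than a code address.

/* Sketch: with a WebAssembly C toolchain, apply() compiles to a
 * call_indirect, and the value passed for "add" is its index in the
 * indirect function table, not a machine address. */
typedef int (*binop)(int, int);

static int add(int a, int b) { return a + b; }

int apply(binop f, int x, int y) {
  return f(x, y);               /* lowers to call_indirect */
}

int main(void) {
  return apply(add, 2, 3);      /* "add" gets an entry in the table */
}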

With this in mind, our problem has two parts, then: (1) how to augment a WebAssembly module with a new function, and (2) how to get the original module to call the new code.

late linking of auxiliary webassembly modules

The key idea here is that to add code, the main program should generate a new WebAssembly module containing that code. Then we run a linking phase to actually bring that new code to life and make it available.

System linkers like ld typically require a complete set of symbols and relocations to resolve inter-archive references. However when performing a late link of JIT-generated code, we can take a short-cut: the main program can embed memory addresses directly into the code it generates. Therefore the generated module would import memory from the main module. All references from the generated code to the main module can be directly embedded in this way.

The generated module would also import the indirect function table from the main module. (We would ensure that the main module exports its memory and indirect function table via the toolchain.) When the main module makes the generated module, it also embeds a special patch function in the generated module. This function would add the new functions to the main module's indirect function table, and perform any relocations onto the main module's memory. All references from the main module to generated functions are installed via the patch function.

We plan on two implementations of late linking, but both share the fundamental mechanism of a generated WebAssembly module with a patch function.

dynamic linking via the run-time

One implementation of a linker is for the main module to cause the run-time to dynamically instantiate a new WebAssembly module. The run-time would provide the memory and indirect function table from the main module as imports when instantiating the generated module.

The advantage of dynamic linking is that it can update a live WebAssembly module without any need for re-instantiation or special run-time checkpointing support.

In the context of the web, JIT compilation can be triggered by the WebAssembly module in question, by calling out to functionality from JavaScript, or we can use a "pull-based" model to allow the JavaScript host to poll the WebAssembly instance for any pending JIT code.

For WASI deployments, you need a capability from the host. Either you import a module that provides run-time JIT capability, or you rely on the host to poll you for data.

static linking via wizer

Another idea is to build on Wizer's ability to take a snapshot of a WebAssembly module. You could extend Wizer to also be able to augment a module with new code. In this role, Wizer is effectively a late linker, linking in a new archive to an existing object.

Wizer already needs the ability to instantiate a WebAssembly module and to run its code. Causing Wizer to ask the module if it has any generated auxiliary module that should be instantiated, patched, and incorporated into the main module should not be a huge deal. Wizer can already run the patch function, to perform relocations to patch in access to the new functions. After having done that, Wizer (or some other tool) would need to snapshot the module, as usual, but also adding in the extra code.

As a technical detail, in the simplest case in which code is generated in units of functions which don't directly call each other, this is as simple as appending the functions to the code section, appending the generated element segments to the main module's element segment, and updating each appended function reference to its new value by adding to it the number of functions the main module had before the new module was concatenated.
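In other words, the relocation is just an offset; a hypothetical sketch (illustrative names, not any real tool's API):

/* Hypothetical sketch of that renumbering: function references coming from
 * the generated (auxiliary) module are shifted by the number of functions
 * the main module already had, since the new functions are appended after. */
#include <stdint.h>
#include <stddef.h>

void relocate_func_refs(uint32_t *refs, size_t n_refs,
                        uint32_t main_module_func_count) {
  for (size_t i = 0; i < n_refs; i++)
    refs[i] += main_module_func_count;
}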

late linking appears to be async codegen

From the perspective of a main program, WebAssembly JIT code generation via late linking appears the same as asynchronous code generation.

For example, take the C program:

#include <stdint.h>

struct Value;
struct Func {
  struct Expr *body;
  void *jitCode;
};

void recordJitCandidate(struct Func *func);
uint8_t* flushJitCode(); // Call to actually generate JIT code.

struct Value* interpretCall(struct Expr *body,
                            struct Value *arg);

struct Value* call(struct Func *func,
                   struct Value* val) {
  if (func->jitCode) {
    struct Value* (*f)(struct Value*) = (struct Value* (*)(struct Value*))func->jitCode;
    return f(val);
  } else {
    recordJitCandidate(func);
    return interpretCall(func->body, val);
  }
}

Here the C program allows for the possibility of JIT code generation: there is a slot in a Func instance to fill in with a code pointer. If this program generates code for a given Func, it won't be able to fill in the pointer -- it can't add new code to the image. But, it could tell Wizer to do so, and Wizer could snapshot the program, link in the new function, and patch &func->jitCode. From the program's perspective, it's as if the code becomes available asynchronously.

demo!

So many words, right? Let's see some code! As a sketch for other JIT compiler work, I implemented a little Scheme interpreter and JIT compiler, targetting WebAssembly. See interp.cc for the source. You compile it like this:

$ /opt/wasi-sdk/bin/clang++ -O2 -Wall \
   -mexec-model=reactor \
   -Wl,--growable-table \
   -Wl,--export-table \
   -DLIBRARY=1 \
   -fno-exceptions \
   interp.cc -o interplib.wasm

Here we are compiling with WASI SDK. I have version 14.

The -mexec-model=reactor argument means that this WASI module isn't just a run-once thing, after which its state is torn down; rather it's a multiple-entry component.

The two -Wl, options tell the linker to export the indirect function table, and to allow the indirect function table to be augmented by the JIT module.

The -DLIBRARY=1 is used by interp.cc; you can actually run and debug it natively but that's just for development. We're instead compiling to wasm and running with a WASI environment, giving us fprintf and other debugging niceties.

The -fno-exceptions is because WASI doesn't support exceptions currently. Also we don't need them.

WASI is mainly for non-browser use cases, but this module does so little that it doesn't need much from WASI and I can just polyfill it in browser JavaScript. So that's what we have here:

[Interactive wasm-jit demo omitted; see the wasm-jit web page for more information.]

Each time you enter a Scheme expression, it will be parsed to an internal tree-like intermediate language. You can then run a recursive interpreter over that tree by pressing the "Evaluate" button. Press it a number of times, you should get the same result.

As the interpreter runs, it records any closures that it created. The Func instances attached to the closures have a slot for a C++ function pointer, which is initially NULL. Function pointers in WebAssembly are indexes into the indirect function table; the first slot is kept empty so that calling a NULL pointer (a pointer with value 0) causes an error. If the interpreter gets to a closure call and the closure's function's JIT code pointer is NULL, it will interpret the closure's body. Otherwise it will call the function pointer.

If you then press the "JIT" button above, the module will assemble a fresh WebAssembly module containing JIT code for the closures that it saw at run-time. Obviously that's just one heuristic: you could be more eager or more lazy; this is just a detail.

Although the particular JIT compiler isn't of much interest---the point being to see JIT code generation at all---it's nice to see that the fibonacci example sees a good speedup; try it yourself, and try it on different browsers if you can. Neat stuff!

not just the web

I was wondering how to get something like this working in a non-webby environment and it turns out that the Python interface to wasmtime is just the thing. I wrote a little interp.py harness that can do the same thing that we can do on the web; just run as `python3 interp.py`, after having `pip3 install wasmtime`:

$ python3 interp.py
...
Calling eval(0x11eb0) 5 times took 1.716s.
Calling jitModule()
jitModule result: <wasmtime._module.Module object at 0x7f2bef0821c0>
Instantiating and patching in JIT module
... 
Calling eval(0x11eb0) 5 times took 1.161s.

Interestingly it would appear that the performance of wasmtime's code (0.232s/invocation) is somewhat better than both SpiderMonkey (0.392s) and V8 (0.729s).

reflections

This work is just a proof of concept, but it's a step in a particular direction. As part of previous work with Fastly, we enabled the SpiderMonkey JavaScript engine to run on top of WebAssembly. When combined with pre-initialization via Wizer, you end up with a system that can start in microseconds: fast enough to instantiate a fresh, shared-nothing module on every HTTP request, for example.

The SpiderMonkey-on-WASI work left out JIT compilation, though, because, you know, WebAssembly doesn't support JIT compilation. JavaScript code actually ran via the C++ bytecode interpreter. But as we just found out, actually you can compile the bytecode: just-in-time, but at a different time-scale. What if you took a SpiderMonkey interpreter, pre-generated WebAssembly code for a user's JavaScript file, and then combined them into a single freeze-dried WebAssembly module via Wizer? You get the benefits of fast startup while also getting decent baseline performance. There are many engineering considerations here, but as part of work sponsored by Shopify, we have made good progress in this regard; details in another missive.

I think a kind of "offline JIT" has a lot of value for deployment environments like Shopify's and Fastly's, and you don't have to limit yourself to "total" optimizations: you can still collect and incorporate type feedback, and you get the benefit of taking advantage of adaptive optimization without having to actually run the JIT compiler at run-time.

But if we think of more traditional "online JIT" use cases, it's clear that relying on host JIT capabilities, while a good MVP, is not optimal. For one, you would like to be able to freely emit direct calls from generated code to existing code, instead of having to call indirectly or via imports. I think it still might make sense to have a language run-time express its generated code in the form of a WebAssembly module, though really you might want native support for compiling that code (asynchronously) from within WebAssembly itself, without calling out to a run-time. Most people I have talked to that work on WebAssembly implementations in JS engines believe that a JIT proposal will come some day, but it's good to know that we don't have to wait for it to start generating code and taking advantage of it.

& out

If you want to play around with the demo, do take a look at the wasm-jit Github project; it's fun stuff. Happy hacking, and until next time!

by Andy Wingo at August 18, 2022 03:36 PM

August 17, 2022

Bastien Nocera: Speeding up the kernel testing loop

(Bastien Nocera)

When I create kernel contributions, I usually rely on specific hardware, which makes deploying kernels to a separate test system too complicated or time-consuming to be worth it. Yes, I'm an idiot that hacks the kernel directly on their main machine, though in my defense, I usually just need to compile drivers rather than full kernels.

But sometimes I work on a part of the kernel that can't be easily swapped out, like the USB sub-system. In which case I need to test out full kernels.

I usually prefer compiling full kernels as RPMs, on my Fedora system as it makes it easier to remove old test versions and clearly tag more information in the changelog or version numbers if I need to.

Step one, build as non-root

First, if you haven't already done so, create an ~/.rpmmacros file (I know...), and add a few lines so you don't need to be root, or write stuff in /usr to create RPMs.

$ cat ~/.rpmmacros
%_topdir        /home/hadess/Projects/packages
%_tmppath        %{_topdir}/tmp

Easy enough. Now we can use fedpkg or rpmbuild to create RPMs. Don't forget to run those under “powerprofilesctl launch” to speed things up a bit.

Step two, build less

We're hacking the kernel, so let's try and build from upstream. Instead of the aforementioned fedpkg, we'll use “make binrpm-pkg” in the upstream kernel, which builds the kernel locally, as it normally would, and then packages just the binaries into an RPM. This means that you can't really redistribute the results of this command, but it's fine for our use.

 If you choose to build a source RPM using “make rpm-pkg”, know that this one will build the kernel inside rpmbuild, this will be important later.

 Now that we're building from the kernel sources, that's our time to activate the cheat code. Run “make localmodconfig”. It will generate a .config file containing just the currently loaded modules. Don't forget to modify it to include your new driver, or driver for a device you'll need for testing.

Step three, build faster

If running “make rpm-pkg” is the same as running “make ; make modules” and then packaging up the results, does that mean that the “%{?_smp_mflags}” RPM macro is ignored, I hear you ask rhetorically. The answer is yes. “make -j16 rpm-pkg”. Boom. Faster.

Step four, build fasterer

As we're building in the kernel tree locally before creating a binary package, already compiled modules and binaries are kept, and shouldn't need to be recompiled. This last trick can however be used to speed up compilation significantly if you use multiple kernel trees, or need to clean the build tree for whatever reason. In my tests, it made things slightly slower for a single tree compilation.

$ sudo dnf install -y ccache
$ make CC="ccache gcc" -j16 binrpm-pkg

Easy.

And if you want to speed up the rpm-pkg build:

$ cat ~/.rpmmacros
[...]
%__cc            ccache gcc
%__cxx            ccache g++

More information is available in Speeding Up Linux Kernel Builds With Ccache.

Step five, package faster

Now, if you've implemented all this, you'll see that the compilation still stops for a significant amount of time just before writing “Wrote kernel...rpm”. A quick look at top will show a single CPU core pegged to 100% CPU. It's rpmbuild compressing the package that you will just install and forget about.

$ cat ~/.rpmmacros
[...]
%_binary_payload    w2T16.xzdio

More information is available in Accelerating Ceph RPM Packaging: Using Multithreaded Compression.

TL;DR and further work

All those changes sped up the kernel compilation part of my development from around 20 minutes to less than 2 minutes on my desktop machine.

$ cat ~/.rpmmacros
%_topdir        /home/hadess/Projects/packages
%_tmppath        %{_topdir}/tmp
%__cc            ccache gcc
%__cxx            ccache g++
%_binary_payload    w2T16.xzdio


$ powerprofilesctl launch make CC="ccache gcc" -j16 binrpm-pkg

I believe there's still significant speed ups that could be done, in the kernel, by parallelising some of the symbols manipulation, caching the BTF parsing for modules, switching the single-threaded vmlinux bzip2 compression, and not generating a headers RPM (note: tested this last one, saves 4 seconds :)

 

The results of my tests. YMMV, etc.

Command | Time spent | Notes
koji build --scratch --arch-override=x86_64 f36 kernel.src.rpm | 129 minutes | It's usually quicker, but that day must have been particularly busy
fedpkg local | 70 minutes | No rpmmacros changes except setting the workdir in $HOME
powerprofilesctl launch fedpkg local | 25 minutes |
localmodconfig / binrpm-pkg | 19 minutes | Defaults to "-j2"
localmodconfig -j16 / binrpm-pkg | 1:48 minutes |
powerprofilesctl launch localmodconfig ccache -j16 / binrpm-pkg | 7 minutes | Cold cache
powerprofilesctl launch localmodconfig ccache -j16 / binrpm-pkg | 1:45 minutes | Hot cache
powerprofilesctl launch localmodconfig xzdio -j16 / binrpm-pkg | 1:20 minutes |

by Bastien Nocera (noreply@blogger.com) at August 17, 2022 10:14 AM

August 15, 2022

Sebastian Pölsterl: scikit-survival 0.18.0 released

I’m pleased to announce the release of scikit-survival 0.18.0, which adds support for scikit-learn 1.1.

In addition, this release adds the return_array argument to all models providing predict_survival_function and predict_cumulative_hazard_function. That means you can now choose whether you want the survival function (or cumulative hazard function) automatically evaluated at the unique event times. This is particularly useful for plotting. Previously, you would have to evaluate each survival function before plotting:

estimator = CoxPHSurvivalAnalysis()
estimator.fit(X_train, y_train)
pred_surv = estimator.predict_survival_function(X_test)
times = pred_surv[0].x
for surv_func in pred_surv:
    plt.step(times, surv_func(times), where="post")

Now, you can pass return_array=True and directly get probabilities of the survival function:

estimator = CoxPHSurvivalAnalysis()
estimator.fit(X_train, y_train)
pred_surv_probs = estimator.predict_survival_function(X_test, return_array=True)
times = estimator.event_times_
for probs in pred_surv_probs:
    plt.step(times, probs, where="post")

Finally, support for Python 3.7 has been dropped and the minimum required versions of the following dependencies have been raised:

  • numpy 1.17.3
  • pandas 1.0.5
  • scikit-learn 1.1.0
  • scipy 1.3.2

For a full list of changes in scikit-survival 0.18.0, please see the release notes.

Install

Pre-built conda packages are available for Linux, macOS (Intel), and Windows, either

via pip:

pip install scikit-survival

or via conda

 conda install -c sebp scikit-survival

August 15, 2022 03:18 PM

August 07, 2022

Jean-François Fortin Tam: Unsettled by Unison’s Fadeaway from Fedora

This is in part a rallying cry for packagers, but also a story illustrating how fragile user workflows can be, and how some seemingly inconsequential decisions at the distro level can have disastrous consequences on the ability of individuals to continue running your FLOSS platform.

I’m hoping this 12-minute read will prove entertaining and insightful to you. This article is the third and (hopefully?!) last part of my Pulitzer-winning “Software upgrade treadmill” series. Go read part two if you haven’t already, I’ll wait. I guarantee you’re going to laugh much harder when reading the beginning of part three below.


The year is 2021. After the two-years war against the Machine, and the two-years in the software carbon freezer, I had finally been enjoying two years of blissful efficiency, where I could stay on top of the latest software versions with the Fedora Advanced Pragmatic Geek Workstation Operating System. Certainly I could continue like this unimpeded. As the protagonists said in la Cité de la peur, “Nothing bad can happen to us now 🎶”

But karma had other plans. As I wanted to upgrade from Fedora 33 to Fedora 34 (and newer) in the spring of 2021, I found out that Unison, another mission-critical app I’ve been depending on everyday for the last 15+ years, has been orphaned from Fedora. Déjà-vu, anyone?

Pictured: Phil the groundhog was disapproving of my springtime plans.

Apparently, Unison was not meant to come back into Fedora until someone fixes it upstream, which I assumed to mean this, this and this.

The core issue for downstream packagers was that Unison was very fragile and incompatible with itself across different versions of OCaml, the language it is programmed in. Personally, for my own usecase, I don’t care about compatibility with “other distros’ versions of Unison” or even “across different releases of Fedora”—both are nice to have, but apparently hellish to manage—I am happy enough if I can already be compatible with myself, within the same version of Fedora with all my machines upgraded in lockstep. But when I started drafting this blog post at the end of 2021, I couldn’t even consider that option, because the package no longer existed in Fedora.

In practice, I was once more stranded on an old unsupported version of Fedora (in this case, F33) across all my personal computers, because of a single package that my daily life depends on.

F33 was declared dead and buried at the end of 2021, and as I would no longer be able to receive any updates for it, I spent the better part of the 2022 winter season on a system with wildly outdated packages (I was still on Firefox 95 until recently, for example). Yes, I did hear thousands of infosec voices cry out in terror as I wrote the above.

Here is what sitting in the cockpit of a Fedora battleworkstation feels like when the cord gets pulled again:

I couldn’t be waiting in suspense for such a long time anymore with only the vague hope that maybe someday the situation might come to a resolution. I’ve been there, done that, got the t-shirt and the grey hair to prove it, as you’ve seen in part two of this trilogy. At some point, above-and-beyond loyalty becomes “acharnement thérapeutique” (therapeutic obstinacy), and you start questioning your life choices.

Back then I wrote this in my journal:

I could survive like that a little longer, but there comes a time where the unsupported, time-frozen Fedora 33 on my machines becomes untenable—too bitrotten or insecure to use—and I don’t know what I will do then. This might, very regretfully, force me to switch away from Fedora—which I had been using every day for over 11 years by now—to another Linux distribution entirely in 2022-2023, and I am really not looking forward to that distro-hopping nightmare, I would really like to avoid that outcome. None of the alternative distros/OSes excite me.

I want to keep using Fedora, but I’m backed against the wall here, and unless someone repackages Unison as a flatpak, or some new folks fixes the issues upstream so that it can be repackaged in Fedora, then I eventually will have to abandon Fedora Workstation as a platform for myself—and doing so for one remaining damned application breaks my heart.

– Me, some months ago

Throughout the winter, I spent countless hours researching and discussing this with potential collaborators (I did do some lobbying to encourage folks to review and merge the existing work too, I think). I believe I have rewritten this blog post at least four times as the situation kept evolving quite a bit during the span of two months this winter:

  • Back when I originally drafted this blog post near the end of January 2022, the situation looked quite bleak: there was an attempt at doing an OCaml-agnostic version of Unison in late 2020, but it seemed to have stalled for over a year, so I thought the situation might take years to “naturally resolve” in the then-current state of things;
  • Unison’s compatibility issues arguably were both an upstream and downstream problem: if it bites Fedora, it’s likely to bite other distros at some point in time, and the impact of that on users is quite significant. Unison is this kind of non-glorious software that has quietly been used by many people around the world for decades, and I can’t possibly be the only one out there using it regularly (even Debian’s incomplete opt-in statistics show thousands of users);
  • Hubert Figuière valiantly made a first attempt at flatpaking OCaml + Unison as a personal challenge, but the effort stalled (I presume because of technical difficulties of some kind). I’m not even sure it is possible for Unison to work in a flatpak; this remains an open theoretical question (see further below).
  • Thankfully, sometime during the late winter of 2022, in the upstream Unison project, the code was reviewed, merged and bugfixed, so those three upstream issues (linked above near the start of this article) have been resolved, leading to the highest activity level in years (if not decades), and a new release bundling all those improvements. The ball is now back in the court of the Linux distros.
  • Later on, I discovered a third-party repository that provided packages for Fedora (see further below), which would allow me to keep my systems working.

So nowadays we are missing two things: someone to package Unison back into the main Fedora repositories, and ideally someone to package Unison as a flatpak on Flathub as well (if that is even possible).


Help needed to bring back the Unison packages in Fedora

In recent months, I discovered a third-party Unison repository for Fedora by Chris Roadfeldt, where he packages the latest version of Unison. This is the main reason why I’m not running an insecure operating system right now. I asked why this was in a COPR, rather than Fedora’s main repositories, and I was told this:

“I package unison for fedora for my own reasons, which I suspect are like yours, I use it and use it often. I would not be opposed to submitting my edits to the spec file to the fedora main repo, after all it was fork from there years ago.
There is a decent amount of work needed to clean up the package for general consumption and maintainability though, and I am not in a position to do that right now. For the time being I will continue to maintain my copr repo and let others make use of it as they want. Should it be of use in the main fedora repo, the fedora project is welcome to include it.” […]

Looks like we’re already “mostly there”. If you are an experienced Fedora/RPM packager, your help would be very welcome in mainstreaming this (or reviving the previous packages from the main repos). Will you pick up that guitar?

Could there be a Flatpak package for Unison too?

Currently, there is no such thing as a Unison flatpak package. Upstream does not want to maintain or support packages (their policy is to only ship source code), and I don’t even know if it is conceptually possible for a networked system app like unison—which must act as a client and server, commandline and GUI—to work as a flatpak.

The way Unison works is… sorcery. It somehow requires Unison to be installed on both sides (historically with the exact same versions of everything, otherwise things tend to break, but with the 2.52 release this problem may finally be gone), yet it does not run as a system service/daemon, so… how does it behave like a client & server without being a server daemon listening? 🤔 I don’t know how such arcane magic works. But it sure brings up the question: is it actually possible for this to work while inside Flatpak on both sides? I have no idea. I would love it if someone proved it possible and published it on Flathub, thereby solving this interoperability madness once and for all, across all Linux distros (Fedora is not the only one struggling with this; Debian has also frequently been encountering this problem, and therefore so have all its derivatives)… though arguably, with the new OCaml-version-independent line of Unison releases (2.52.x), this may no longer be an issue (I hope).


“Why are you blogging about this instead of fixing it yourself?”

There are a couple of reasons. In no particular order:

  • I think it’s interesting to raise awareness about these types of struggles users face, and how they can influence the choice of a platform. I was this close to leaving Fedora, and yet I’m someone with inordinate amounts of patience and geekiness; that should tell you something. Furthermore, I would argue that general users shouldn’t have to learn code surgery and packaging (or containers or some other exotic OS mechanisms) to be able to continue using their everyday productivity software. To have mass appeal in the market, Linux-based OSes need to be able to solve everyday productivity problems for users who are not necessarily developers or packagers.
  • I don’t have the technical skill required at all; I don’t know OCaml or any of that fancy advanced coding they use in Unison, nor do I have the time/skill/inclination to become a packager of any kind. It is really not supposed to be my area of expertise & focus as a management & marketing professional.
  • I really don’t have the time for hands-on deep technical wrangling anymore, other than filing bug reports; even just keeping on top of the thousands of bug reports I have filed in recent years (in the GNOME & freedesktop stack) is challenging enough, as the picture below illustrates:
Pictured: my “bug reports” email notifications/comments folder. This does not include the GTG or Pitivi bug reports, which are filtered into separate folders.

After all, one of the reasons I haven’t blogged much in recent years is that I am trying to build and scale not one, not two, but three businesses. Good bug reports take time and effort, and beyond testing & feedback, I can’t be fixing every piece of software myself at the same time. The reason I have spent much time and effort writing this longform article is that I find it interesting as a case study on the vulnerability of user workflows to modern operating systems’ fast-paced changes.


“Why don’t you just use The SyncThing?”

Syncthing is not solving the same problem as Unison—and never will—as the Syncthing developers say themselves in various discussion/support threads (for example here, here, here, etc.).

I do love and admire Syncthing; it is the best of breed in distributed continuous file synchronization (according to my own casual testing and the many recommendations I’ve heard from people around me). Back in the 2005-2010 era, I would have killed to have this kind of software available to me. Syncthing has, since 2013, been fulfilling the promise of the long-lost and buried peer-to-peer version of iFolder (from the Novell Linux desktop era) that I was never able to run except on some of the very early versions of Ubuntu (4.10 to 5.04, maybe?). I wish someone had told me about Syncthing’s existence earlier than 2021! This month, I have successfully deployed it for a friend of mine who was struggling with horrible proprietary slow-synchronizing NAS bullshit.

Syncthing is a technological marvel, and I’d want to use it, but I can’t, at least not for my most important files and application data. My needs have evolved, especially as I want to be able to selectively sync my Liferea data folder, my GTG data folder, my LibreOffice profiles, GNote, SSH configuration files, fonts, critical business documents, and all that while my home partitions/drives are always 99.9% full:

Through some miracle of stupendous efficiency in digital packratting, I am still using the same 2TB home drive from 2012 and can just barely fit everything in it. This is criminal, I know.

There is no other open-source alternative to Unison, which is not only peer-to-peer and resource-efficient, but also based on the concept of manually controlled batches (you can review/approve/override each file/directory being synchronized). I am not the only one out there to have evaluated Syncthing and gone right back to Unison. It gives you the fine-grained surgical precision you cannot have with other software.

There are many cases where Unison, with its ability to let me override the direction of a sync for any file or folder, has saved my ass, letting me revert mistakes from another computer.

“Sounds like a lot of community management work. Why don’t you just switch distros?”

Pretty much no other distro does exactly what Fedora does (and listing the possible alternative distros here would be beyond the scope of this blog post).

For me, the whole point of using a Linux distribution like Fedora is for it to “Just Work™” and for me to move on with other problems in my life, while still being able to run the latest versions of software so that I am not burdening developers with outdated bugs when reporting issues (so no, unless I decide to stop contributing to GNOME, I cannot really run Debian Stable, Ubuntu LTS, or elementary, sorry!). Under normal circumstances, Fedora provides the perfect “Goldilocks” balance between freshness and stability. If I wanted to constantly break and fix my computer and to mess with packages, I would probably be running Slackware, Gentoo, Arch btw, or whatever übergeek distro du jour, but I don’t want that. I want simple things, and I want the perfect balance that Fedora had struck for me for the past ten years:

  • it provides the latest (and vanilla) versions of all software, while still being stable and tested enough to not constantly pull the rug from under my feet; I want the latest GNOME every six months (so that I don’t get yelled at for using ancient software versions when reporting issues) but I want it pre-tested and not shipped on “release day zero” (rolling distros scare the hell out of me in terms of productivity).
  • it does not force me to stay on top of 1000-1500 updates per week like you often see in rolling distros (my uptime on my laptop and desktop workstation is 30 to 45 days on average) nor require ostree and BTRFS to have a semblance of safety (because those two bring problems of their own that I am not ready to accept yet);
  • it gives me the ability to wait 1-2 months after each Fedora release before upgrading, so that I can let Fedora early-upgraders (along with Arch and OpenSUSE Tumbleweed users), take the first wave of bugs in their face 😉

Conclusion to the Upgrade Trilogy

When I originally wrote the first few drafts of this third article in the winter, the jury was still out on whether I actually would have to switch away from Fedora, because, from a timing perspective, I assuredly would not be meddling with my computers in the winter (you do not mess with your productivity tools during income tax season). I vowed to leave the problem on ice until spring/summer at the earliest, thinking, “Perhaps someone will have come up with a solution by then.” Still, not knowing whether I would still be able to use my favorite operating system six months down the road was a big concern of mine.

Other than writing this educational editorial piece, I did not know what to do while I was sitting again on a ticking timebomb. As a user, I felt trapped and cornered, and that’s speaking as someone who has been around the freedesktop software platform for 19 years and knows how to fix most common Linux desktop usage problems! Imagine what happens when users are not so unreasonably persistent. What is the path forward for users left between a rock and a hard place?

With this trilogy, you now have a poignantly detailed picture of how rough of a ride our FLOSS platforms can be when a seemingly “tiny piece” of the system breaks down. I hope you have enjoyed my writing and found it insightful in some way.

If you’ve read through all three parts of this series, I commend you for your dedication to reading my epics. Feel free to share this trilogy or leave a comment so that I can give you a high-five whenever we meet someday.

by Jeff at August 07, 2022 06:35 PM

Andy Wingocoarse or lazy?

(Andy Wingo)

sweeping, coarse and lazy

One of the things that had perplexed me about the Immix collector was how to effectively defragment the heap via evacuation while keeping just 2-3% of space as free blocks for an evacuation reserve. The original Immix paper states:

To evacuate the object, the collector uses the same allocator as the mutator, continuing allocation right where the mutator left off. Once it exhausts any unused recyclable blocks, it uses any completely free blocks. By default, immix sets aside a small number of free blocks that it never returns to the global allocator and only ever uses for evacuating. This headroom eases defragmentation and is counted against immix's overall heap budget. By default immix reserves 2.5% of the heap as compaction headroom, but [...] is fairly insensitive to values ranging between 1 and 3%.

To Immix, a "recyclable" block is partially full: it contains surviving data from a previous collection, but also some holes in which to allocate. But when would you have recyclable blocks at evacuation-time? Evacuation occurs as part of collection. Collection usually occurs when there's no more memory in which to allocate. At that point any recyclable block would have been allocated into already, and won't become recyclable again until the next trace of the heap identifies the block's surviving data. Of course after the next trace they could become "empty", if no object survives, or "full", if all lines have survivor objects.

In general, after a full allocation cycle, you don't know much about the heap. If you could easily know where the live data and the holes were, a garbage collector's job would be much easier :) Any algorithm that starts from the assumption that you know where the holes are can't be used before a heap trace. So I was not sure what the Immix paper meant here about allocating into recyclable blocks.

Thinking on it again, I realized that Immix might trigger collection early sometimes, before it has exhausted the previous cycle's set of blocks in which to allocate. As we discussed earlier, there is a case in which you might want to trigger an early compaction: when a large object allocator runs out of blocks to decommission from the immix space. And if one evacuating collection didn't yield enough free blocks, you might trigger the next one early, reserving some recyclable and empty blocks as evacuation targets.

when do you know what you know: lazy and eager

Consider a basic question, such as "how many bytes in the heap are used by live objects". In general you don't know! Indeed you often never know precisely. For example, concurrent collectors often have some amount of "floating garbage" which is unreachable data but which survives across a collection. And of course you don't know the difference between floating garbage and precious data: if you did, you would have collected the garbage.

Even the idea of "when" is tricky in systems that allow parallel mutator threads. Unless the program has a total ordering of mutations of the object graph, there's no one timeline with respect to which you can measure the heap. Still, Immix is a stop-the-world collector, and since such collectors synchronously trace the heap while mutators are stopped, these are times when you can exactly compute properties about the heap.

Let's return to the question of measuring live bytes. For an evacuating semi-space, knowing the number of live bytes after a collection is trivial: all survivors are packed into to-space. But for a mark-sweep space, you would have to compute this information. You could compute it at mark-time, while tracing the graph, but doing so takes time, which means delaying the time at which mutators can start again.

Alternately, for a mark-sweep collector, you can compute free bytes at sweep-time. This is the phase in which you go through the whole heap and return any space that wasn't marked in the last collection to the allocator, allowing it to be used for fresh allocations. This is the point in the garbage collection cycle in which you can answer questions such as "what is the set of recyclable blocks": you know what is garbage and you know what is not.

Though you could sweep during the stop-the-world pause, you don't have to; sweeping only touches dead objects, so it is correct to allow mutators to continue and then sweep as the mutators run. There are two general strategies: spawn a thread that sweeps as fast as it can (concurrent sweeping), or make mutators sweep as needed, just before they allocate (lazy sweeping). But this introduces a lag between when you know and what you know—your count of total live heap bytes describes a time in the past, not the present, because mutators have moved on since then.

For most collectors with a sweep phase, deciding between eager (during the stop-the-world phase) and deferred (concurrent or lazy) sweeping is very easy. You don't immediately need the information that sweeping allows you to compute; it's quite sufficient to wait until the next cycle. Moving work out of the stop-the-world phase is a win for mutator responsiveness (latency). Usually people implement lazy sweeping, as it is naturally incremental with the mutator, naturally parallel for parallel mutators, and any sweeping overhead due to cache misses can be mitigated by immediately using swept space for allocation. The case for concurrent sweeping is less clear to me, but if you have cores that would otherwise be idle, sure.

eager coarse sweeping

Immix is interesting in that it chooses to sweep eagerly, during the stop-the-world phase. Instead of sweeping irregularly-sized objects, however, it sweeps over its "line mark" array: one byte for each 128-byte "line" in the mark space. For 32 kB blocks, there will be 256 bytes per block, and line mark bytes in each 4 MB slab of the heap are packed contiguously. Therefore you get relatively good locality, but this just mitigates a cost that other collectors don't have to pay. So what does eager sweeping over these coarse 128-byte regions buy Immix?

Firstly, eager sweeping buys you eager identification of empty blocks. If your large object space needs to steal blocks from the mark space, but the mark space doesn't have enough empties, it can just trigger collection and then it knows if enough blocks are available. If no blocks are available, you can grow the heap or signal out-of-memory. If the lospace (large object space) runs out of blocks before the mark space has used all recyclable blocks, that's no problem: evacuation can move the survivors of fragmented blocks into these recyclable blocks, which have also already been identified by the eager coarse sweep.

Without eager empty block identification, if the lospace runs out of blocks, firstly you don't know how many empty blocks the mark space has. Sweeping is a kind of wavefront that moves through the whole heap; empty blocks behind the wavefront will be identified, but those ahead of the wavefront will not. Such a lospace allocation would then have to either wait for a concurrent sweeper to advance, or perform some lazy sweeping work. The expected latency of a lospace allocation would thus be higher, without eager identification of empty blocks.

Secondly, eager sweeping might reduce allocation overhead for mutators. If allocation just has to identify holes and not compute information or decide on what to do with a block, maybe it go brr? Not sure.
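To make that concrete, here is a minimal sketch of an eager coarse sweep over one block's line mark bytes, classifying the block for the next cycle. The sizes, type names, and three-way classification are illustrative rather than taken from any particular implementation; note that the marks are read, not cleared, so that mutators can later scan them to find holes.

    #include <stddef.h>

    /* Illustrative sizes: 32 kB blocks divided into 128-byte lines,
       so 256 line mark bytes per block. */
    #define LINE_SIZE       128
    #define BLOCK_SIZE      (32 * 1024)
    #define LINES_PER_BLOCK (BLOCK_SIZE / LINE_SIZE)

    enum block_state { BLOCK_EMPTY, BLOCK_RECYCLABLE, BLOCK_FULL };

    /* Run during the stop-the-world pause, over every block. */
    static enum block_state
    classify_block(const unsigned char line_marks[LINES_PER_BLOCK])
    {
      size_t marked = 0;
      for (size_t i = 0; i < LINES_PER_BLOCK; i++)
        if (line_marks[i])
          marked++;
      if (marked == 0)
        return BLOCK_EMPTY;       /* evacuation target or lospace candidate */
      if (marked == LINES_PER_BLOCK)
        return BLOCK_FULL;        /* nothing to allocate into this cycle */
      return BLOCK_RECYCLABLE;    /* has holes between surviving lines */
    }

Because the loop only touches the densely packed mark bytes and never the objects themselves, each block costs a handful of cache lines to classify, which is what keeps the eager pass affordable.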

lines, lines, lines

The original Immix paper also notes a relative insensitivity of the collector to line size: 64 or 256 bytes could have worked just as well. This was a somewhat surprising result to me but I think I didn't appreciate all the roles that lines play in Immix.

Obviously line size affects the worst-case fragmentation, though this is mitigated by evacuation (which evacuates objects, not lines). This I got from the paper. In this case, smaller lines are better.

Line size affects allocation-time overhead for mutators, though which way I don't know: scanning for holes will be easier with fewer lines in a block, but smaller lines would contain more free space and thus result in fewer collections. I can only imagine though that with smaller line sizes, average hole size would decrease and thus medium-sized allocations would be harder to service. Something of a wash, perhaps.

However if we ask ourselves the thought experiment, why not just have 16-byte lines? How crazy would that be? I think the impediment to having such a precise line size would mainly be Immix's eager sweep, as a fine-grained traversal of the heap would process much more data and incur possibly-unacceptable pause time overheads. But, in such a design you would do away with some other downsides of coarse-grained lines: a side table of mark bytes would make the line mark table redundant, and you eliminate much possible "dark matter" hidden by internal fragmentation in lines. You'd need to defer sweeping. But then you lose eager identification of empty blocks, and perhaps also the ability to evacuate into recyclable blocks. What would such a system look like?

Readers that have gotten this far will be pleased to hear that I have made some investigations in this area. But, this post is already long, so let's revisit this in another dispatch. Until then, happy allocations in all regions.

by Andy Wingo at August 07, 2022 09:44 AM

July 20, 2022

Andy Wingounintentional concurrency

(Andy Wingo)

Good evening, gentle hackfolk. Last time we talked about heuristics for when you might want to compact a heap. Compacting garbage collection is nice and tidy and appeals to our orderly instincts, and it enables heap shrinking and reallocation of pages to large object spaces and it can reduce fragmentation: all very good things. But evacuation is more expensive than just marking objects in place, and so a production garbage collector will usually just mark objects in place, and only compact or evacuate when needed.

Today's post is more details!

dedication

Just because it's been, oh, a couple decades, I would like to reintroduce a term I learned from Marnanel years ago on advogato, a nerdy group blog kind of a site. As I recall, there is a word that originates in the Oxbridge social environment, "narg", from "Not A Real Gentleman", and which therefore denotes things that not-real-gentlemen do: nerd out about anything that's not, like, fox-hunting or golf; or generally spending time on something not because it will advance you in conventional hierarchies but because you just can't help it, because you love it, because it is just your thing. Anyway, in the spirit of pursuits that are really not What One Does With One's Time, this post is dedicated to the word "nargery".

side note, bis: immix-style evacuation versus mark-compact

In my last post I described Immix-style evacuation, and noted that it might take a few cycles to fully compact the heap, and that it has a few pathologies: the heap might never reach full compaction, and that Immix might run out of free blocks in which to evacuate.

With these disadvantages, why bother? Why not just do a single mark-compact pass and be done? I implicitly asked this question last time but didn't really answer it.

For some people the answer will be: yep, yebo, mark-compact is the right answer. And yet, there are a few reasons that one might choose to evacuate a fraction of the heap instead of compacting it all at once.

The first reason is object pinning. Mark-compact systems assume that all objects can be moved; you can't usefully relax this assumption. Most algorithms "slide" objects down to lower addresses, squeezing out the holes, and therefore every live object's address needs to be available to use when sliding down other objects with higher addresses. And yet, it would be nice sometimes to prevent an object from being moved. This is the case, for example, when you grant a foreign interface (e.g. a C function) access to a buffer: if garbage collection happens while in that foreign interface, it would be nice to be able to prevent garbage collection from moving the object out from under the C function's feet.

Another reason to want to pin an object is because of conservative root-finding. Guile currently uses the Boehm-Demers-Weiser collector, which conservatively scans the stack and data segments for anything that looks like a pointer to the heap. The garbage collector can't update such a global root in response to compaction, because you can't be sure that a given word is a pointer and not just an integer with an inconvenient value. In short, objects referenced by conservative roots need to be pinned. I would like to support precise roots at some point but part of my interest in Immix is to allow Guile to move to a better GC algorithm, without necessarily requiring precise enumeration of GC roots. Optimistic partial evacuation allows for the possibility that any given evacuation might fail, which makes it appropriate for conservative root-finding.

Finally, as moving objects has a cost, it's reasonable to want to only incur that cost for the part of the heap that needs it. In any given heap, there will likely be some data that stays live across a series of collections, and which, once compacted, can't be profitably moved for many cycles. Focussing evacuation on only the part of the heap with the lowest survival rates avoids wasting time on copies that don't result in additional compaction.

(I should admit one thing: sliding mark-compact compaction preserves allocation order, whereas evacuation does not. The memory layout of sliding compaction is more optimal than evacuation.)

multi-cycle evacuation

Say a mutator runs out of memory, and therefore invokes the collector. The collector decides for whatever reason that we should evacuate at least part of the heap instead of marking in place. How much of the heap can we evacuate? The answer depends primarily on how many free blocks you have reserved for evacuation. These are known-empty blocks that haven't been allocated into by the last cycle. If you don't have any, you can't evacuate! So probably you should keep some around, even when performing in-place collections. The Immix papers suggest 2% and that works for me too.

Then you evacuate some blocks. Hopefully the result is that after this collection cycle, you have more free blocks. But you haven't compacted the heap, at least probably not on the first try: not into 2% of total space. Therefore you tell the mutator to put any empty blocks it finds as a result of lazy sweeping during the next cycle onto the evacuation target list, and then the next cycle you have more blocks to evacuate into, and more and more and so on until after some number of cycles you fall below some overall heap fragmentation low-watermark target, at which point you can switch back to marking in place.

I don't know how this works in practice! In my test setup, which triggers compaction at 10% fragmentation and continues until it drops below 5%, it's rare that it takes more than 3 cycles of evacuation until the heap drops to effectively 0% fragmentation. Of course I had to introduce fragmented allocation patterns into the microbenchmarks to even cause evacuation to happen at all. I look forward to some day soon testing with real applications.
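If it helps, here is a rough sketch of the cycle-to-cycle bookkeeping such a scheme could use; the watermarks, the fragmentation metric, and all the names are placeholders rather than anything definitive.

    #include <stddef.h>

    struct block;                               /* an aligned 32 kB region */
    struct block_list { struct block *head; };
    void block_list_push(struct block_list *l, struct block *b);  /* assumed */

    struct defrag_state {
      int evacuating;              /* inside a multi-cycle defragmentation episode? */
      struct block_list targets;   /* empty blocks reserved as evacuation targets */
    };

    /* Run at the end of each collection: hysteresis between a high and a
       low fragmentation watermark decides whether the next cycle
       evacuates or marks in place. */
    void update_defrag_mode(struct defrag_state *d, double fragmentation) {
      if (!d->evacuating && fragmentation > 0.10)
        d->evacuating = 1;
      else if (d->evacuating && fragmentation < 0.05)
        d->evacuating = 0;
    }

    /* Called when a mutator's lazy sweep turns up an empty block: while
       defragmenting, donate it to the target list instead of allocating
       into it.  Returns nonzero if the block was taken. */
    int offer_empty_block(struct defrag_state *d, struct block *b) {
      if (!d->evacuating)
        return 0;
      block_list_push(&d->targets, b);
      return 1;
    }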

concurrency

Just as a terminological note, in the world of garbage collectors, "parallel" refers to multiple threads being used by a garbage collector. Parallelism within a collector is essentially an implementation detail; when the world is stopped for collection, the mutator (the user program) generally doesn't care if the collector uses 1 thread or 15. On the other hand, "concurrent" means the collector and the mutator running at the same time.

Different parts of the collector can be concurrent with the mutator: for example, sweeping, marking, or evacuation. Concurrent sweeping is just a detail, because it just visits dead objects. Concurrent marking is interesting, because it can significantly reduce stop-the-world pauses by performing most of the computation while the mutator is running. It's tricky, as you might imagine; the collector traverses the object graph while the mutator is, you know, mutating it. But there are standard techniques to make this work. Concurrent evacuation is a nightmare. It's not that you can't implement it; you can. But it's very very hard to get an overall performance win from concurrent evacuation/copying.

So if you are looking for a good bargain in the marketplace of garbage collector algorithms, it would seem that you need to avoid concurrent copying/evacuation. It's an expensive product that would seem to not buy you very much.

All that is just a prelude to an observation that there is a funny source of concurrency even in some systems that don't see themselves as concurrent: mutator threads marking their own roots. To recall, when you stop the world for a garbage collection, all mutator threads have to somehow notice the request to stop, reach a safepoint, and then stop. Then the collector traces the roots from all mutators and everything they reference, transitively. Then you let the threads go again. Thing is, once you get more than a thread or four, stopping threads can take time. You'd be tempted to just have threads notice that they need to stop, then traverse their own stacks at their own safepoint to find their roots, then stop. But, this introduces concurrency between root-tracing and other mutators that might not have seen the request to stop. For marking, this concurrency can be fine: you are just setting mark bits, not mutating the roots. You might need to add an additional mark pattern that can be distinguished from marked-last-time and marked-the-time-before-but-dead-now, but that's a detail. Fine.
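As a sketch of the marking case, the safepoint path for each mutator might look roughly like the following; the helper functions, the counters, and the assumption that the stack can be scanned as a flat array of words are mine, for illustration only.

    #include <stdatomic.h>

    extern atomic_int gc_requested;     /* set by whichever thread triggers GC */
    extern atomic_int mutators_parked;  /* how many mutators have stopped */

    struct mutator { void **stack_base, **stack_top; };

    void mark_root_in_place(void *maybe_ptr);           /* only sets mark bits */
    void park_until_collection_done(struct mutator *m); /* blocks the thread */

    void safepoint(struct mutator *m) {
      if (!atomic_load_explicit(&gc_requested, memory_order_acquire))
        return;
      /* Conservatively mark everything this thread's stack might reference.
         Setting mark bits is idempotent, so racing with mutators that have
         not yet noticed the stop request is harmless; moving objects here
         would not be. */
      for (void **p = m->stack_base; p < m->stack_top; p++)
        mark_root_in_place(*p);
      atomic_fetch_add_explicit(&mutators_parked, 1, memory_order_acq_rel);
      park_until_collection_done(m);
    }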

But if you instead start an evacuating collection, the gates of hell open wide and toothy maws and horns fill your vision. One thread could be stopping and evacuating the objects referenced by its roots, while another hasn't noticed the request to stop and is happily using the same objects: chaos! You are trying to make a minor optimization to move some work out of the stop-the-world phase but instead everything falls apart.

Anyway, this whole article was really to get here and note that you can't do ragged-stops with evacuation without supporting full concurrent evacuation. Otherwise, you need to postpone root traversal until all threads are stopped. Perhaps this is another argument that evacuation is expensive, relative to marking in place. In practice I haven't seen the ragged-stop effect making so much of a difference, but perhaps that is because evacuation is infrequent in my test cases.

Zokay? Zokay. Welp, this evening's nargery was indeed nargy. Happy hacking to all collectors out there, and until next time.

by Andy Wingo at July 20, 2022 09:26 PM

July 18, 2022

Víctor JáquezGstVA H.264 encoder, compositor and JPEG decoder

There are, right now, three new GstVA elements merged in main: vah264enc, vacompositor and vajpegdec.

Just to recap, GstVA is a GStreamer plugin in gst-plugins-bad (yes, we agree it’s not a great name anymore), to differentiate it from gstreamer-vaapi. Both plugins use libva to access stateless video processing operations; the main difference is, precisely, how the stream’s state is handled: while GstVA uses GStreamer libraries shared by other hardware accelerated plugins (such as d3d11 and v4l2codecs), gstreamer-vaapi uses an internal, tightly coupled and convoluted library.

Also, note that right now (release 1.20) GstVA elements are ranked NONE, while gstreamer-vaapi ones are mostly PRIMARY+1.

Back to the three new elements in GstVA, the most complex one is vah264enc, written almost completely by He Junyan, from Intel. For it, He had to write an H.264 bitwriter which is, to a certain extent, the opposite of the H.264 parser: it constructs the bitstream buffer from H.264 structures such as PPS, SPS, slice header, etc. This API is part of libgstcodecparsers, ready to be reused by other plugins or applications. Currently vah264enc is fairly complete and functional, dealing with profiles and rate controls, among other parameters. It still has rough spots, but we’re working on them. But He Junyan is restless and he already has in the pipeline a common encoder class along with HEVC and AV1 encoders.

The second element is vacompositor, written by Artie Eoff. It’s the replacement of vaapioverlay in gstreamer-vaapi. The compositor suffix is preferred to follow the name of the primary (software-based) video mixing element: compositor, successor of videomixer. See this discussion for further details. The purpose of this element is to compose a single video stream from multiple video streams. It works with Intel’s media-driver with support for the alpha channel (in other words, a custom degree of transparency), and also works with AMD Mesa Gallium, but without the alpha channel.

The last, but not the least, element is vajpegdec, which I worked on. The main issue was not the decoder itself, but jpegparse, which didn’t signal the image caps required for the hardware accelerated decoders. For instance, VA only decodes images with SOF marker 0 (Baseline DCT). It wasn’t needed before because the main and only consumer of the parser is jpegdec, which deals with any type of JPEG image. Long story short, we revamped jpegparse and now it signals the SOF marker, color space (YUV, RGB, etc.) and chroma subsampling (if it has a YUV color space), along with comments and EXIF-like metadata as pipeline tags. Thus vajpegdec will expose in its caps template the color spaces and chroma subsamplings supported by the driver. For example, Intel supports (more or less) the RGB color space, while AMD Mesa Gallium doesn’t.

And that’s all for now. Thanks.

by vjaquez at July 18, 2022 06:27 PM

June 21, 2022

Andy Wingoan optimistic evacuation of my wordhoard

(Andy Wingo)

Good morning, mallocators. Last time we talked about how to split available memory between a block-structured main space and a large object space. Given a fixed heap size, making a new large object allocation will steal available pages from the block-structured space by finding empty blocks and temporarily returning them to the operating system.

Today I'd like to talk more about nothing, or rather, why might you want nothing rather than something. Given an Immix heap, why would you want it organized in such a way that live data is packed into some blocks, leaving other blocks completely free? How bad would it be if instead the live data were spread all over the heap? When might it be a good idea to try to compact the heap? Ideally we'd like to be able to translate the answers to these questions into heuristics that can inform the GC when compaction/evacuation would be a good idea.

lospace and the void

Let's start with one of the more obvious points: large object allocation. With a fixed-size heap, you can't allocate new large objects if you don't have empty blocks in your paged space (the Immix space, for example) that you can return to the OS. To obtain these free blocks, you have four options.

  1. You can continue lazy sweeping of recycled blocks, to see if you find an empty block. This is a bit time-consuming, though.

  2. Otherwise, you can trigger a regular non-moving GC, which might free up blocks in the Immix space but which is also likely to free up large objects, which would result in fresh empty blocks.

  3. You can trigger a compacting or evacuating collection. Immix can't actually compact the heap all in one go, so you would preferentially select evacuation-candidate blocks by choosing the blocks with the least live data (as measured at the last GC), hoping that little data will need to be evacuated.

  4. Finally, for environments in which the heap is growable, you could just grow the heap instead. In this case you would configure the system to target a heap size multiplier rather than a heap size, which would scale the heap to be e.g. twice the size of the live data, as measured at the last collection.

If you have a growable heap, I think you will rarely choose to compact rather than grow the heap: you will either collect or grow. Under constant allocation rate, the rate of empty blocks being reclaimed from freed lospace objects will be equal to the rate at which they are needed, so if collection doesn't produce any, then that means your live data set is increasing and so growing is a good option. Anyway let's put growable heaps aside, as heap-growth heuristics are a separate gnarly problem.

The question becomes, when should large object allocation force a compaction? Absent growable heaps, the answer is clear: when allocating a large object fails because there are no empty pages, but the statistics show that there is actually ample free memory. Good! We have one heuristic, and one with an optimum: you could compact in other situations but from the point of view of lospace, waiting until allocation failure is the most efficient.
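A minimal sketch of that heuristic, with invented names and an arbitrarily chosen fudge factor for what counts as "ample":

    #include <stddef.h>

    struct heap_stats {
      size_t heap_bytes;            /* total committed heap size */
      size_t live_bytes_estimate;   /* as measured at the last collection */
    };

    enum gc_kind { GC_IN_PLACE, GC_EVACUATING };

    /* Decide what kind of collection to run when a large-object
       allocation cannot find enough empty blocks to steal. */
    enum gc_kind on_lospace_block_shortage(const struct heap_stats *s,
                                           size_t bytes_needed) {
      size_t free_estimate = s->heap_bytes - s->live_bytes_estimate;
      if (free_estimate >= 2 * bytes_needed)
        return GC_EVACUATING;   /* the space exists but is fragmented: compact */
      return GC_IN_PLACE;       /* a normal collection may free large objects;
                                   failing that, grow the heap or report OOM */
    }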

shrinkage

Moving on, another use of empty blocks is when shrinking the heap. The collector might decide that it's a good idea to return some memory to the operating system. For example, I enjoyed this recent paper on heuristics for optimum heap size, that advocates that you size the heap in proportion to the square root of the allocation rate, and that as a consequence, when/if the application reaches a dormant state, it should promptly return memory to the OS.

Here, we have a similar heuristic for when to evacuate: when we would like to release memory to the OS but we have no empty blocks, we should compact. We use the same evacuation candidate selection approach as before, also, aiming for maximum empty block yield.

fragmentation

What if you go to allocate a medium object, say 4kB, but there is no hole that's 4kB or larger? In that case, your heap is fragmented. The smaller your heap size, the more likely this is to happen. We should compact the heap to make the maximum hole size larger.

side note: compaction via partial evacuation

The evacuation strategy of Immix is... optimistic. A mark-compact collector will compact the whole heap, but Immix will only be able to evacuate a fraction of it.

It's worth dwelling on this a bit. As described in the paper, Immix reserves around 2-3% of overall space for evacuation overhead. Let's say you decide to evacuate: you start with 2-3% of blocks being empty (the target blocks), and choose a corresponding set of candidate blocks for evacuation (the source blocks). Since Immix is a one-pass collector, it doesn't know how much data is live when it starts collecting. It may not know that the blocks that it is evacuating will fit into the target space. As specified in the original paper, if the target space fills up, Immix will mark in place instead of evacuating; an evacuation candidate block with marked-in-place objects would then be non-empty at the end of collection.

In fact if you choose a set of evacuation candidates hoping to maximize your empty block yield, based on an estimate of live data instead of limiting to only the number of target blocks, I think it's possible to actually fill the targets before the source blocks empty, leaving you with no empty blocks at the end! (This can happen due to inaccurate live data estimations, or via internal fragmentation with the block size.) The only way to avoid this is to never select more evacuation candidate blocks than you have in target blocks. If you are lucky, you won't have to use all of the target blocks, and so at the end you will end up with more free blocks than not, so a subsequent evacuation will be more effective. The defragmentation result in that case would still be pretty good, but the yield in free blocks is not great.
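Here is a small sketch of the conservative candidate-selection policy just described; it assumes blocks[] is already sorted by ascending live bytes as measured at the previous collection, and all the names and sizes are illustrative.

    #include <stddef.h>

    #define BLOCK_SIZE (32 * 1024)

    struct block_meta {
      size_t live_bytes_last_gc;    /* survivor bytes seen by the last trace */
      int    evacuation_candidate;  /* will be evacuated next cycle if set */
    };

    /* Never select more survivor data than the reserved target blocks can
       hold, so the targets cannot fill up before the sources are emptied. */
    void select_evacuation_candidates(struct block_meta *blocks, size_t nblocks,
                                      size_t n_target_blocks) {
      size_t budget = n_target_blocks * BLOCK_SIZE;
      for (size_t i = 0; i < nblocks; i++) {
        if (blocks[i].live_bytes_last_gc > budget)
          break;                     /* most fragmented blocks come first */
        budget -= blocks[i].live_bytes_last_gc;
        blocks[i].evacuation_candidate = 1;
      }
    }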

In a production garbage collector I would still be tempted to be optimistic and select more evacuation candidate blocks than available empty target blocks, because it will require fewer rounds to compact the whole heap, if that's what you wanted to do. It would be a relatively rare occurrence to start an evacuation cycle. If you ran out of space while evacuating, in a production GC I would just temporarily commission some overhead blocks for evacuation and release them promptly after evacuation is complete. If you have a small heap multiplier in your Immix space, occasional partial evacuation in a long-running process would probably reach a steady state with blocks being either full or empty. Fragmented blocks would represent newer objects and evacuation would periodically sediment these into longer-lived dense blocks.

mutator throughput

Finally, the shape of the heap has its inverse in the shape of the holes into which the mutator can allocate. It's most efficient for the mutator if the heap has as few holes as possible: ideally just one large hole per block, which is the limit case of an empty block.

The opposite extreme would be having every other "line" (in Immix terms) be used, so that free space is spread across the heap in a vast spray of one-line holes. Even if fragmentation is not a problem, perhaps because the application only allocates objects that pack neatly into lines, having to stutter all the time to look for holes is overhead for the mutator. Also, the result is that contemporaneous allocations are more likely to be placed farther apart in memory, leading to more cache misses when accessing data. Together, allocator overhead and access overhead lead to lower mutator throughput.

When would this situation get so bad as to trigger compaction? Here I have no idea. There is no clear maximum. If compaction were free, we would compact all the time. But it's not; there's a tradeoff between the cost of compaction and mutator throughput.

I think here I would punt. If the heap is being actively resized based on allocation rate, we'll hit the other heuristics first, and so we won't need to trigger evacuation/compaction based on mutator overhead. You could measure this, though, in terms of average or median hole size, or average or maximum number of holes per block. Since evacuation is partial, all you need to do is to identify some "bad" blocks and then perhaps evacuation becomes attractive.

gc pause

Welp, that's some thoughts on when to trigger evacuation in Immix. Next time, we'll talk about some engineering aspects of evacuation. Until then, happy consing!

by Andy Wingo at June 21, 2022 12:21 PM

June 20, 2022

Andy Wingoblocks and pages and large objects

(Andy Wingo)

Good day! In a recent dispatch we talked about the fundamental garbage collection algorithms, also introducing the Immix mark-region collector. Immix mostly leaves objects in place but can move objects if it thinks it would be profitable. But when would it decide that this is a good idea? Are there cases in which it is necessary?

I promised to answer those questions in a followup article, but I didn't say which followup :) Before I get there, I want to talk about paged spaces.

enter the multispace

We mentioned that Immix divides the heap into blocks (32kB or so), and that no object can span multiple blocks. "Large" objects -- defined by Immix to be more than 8kB -- go to a separate "large object space", or "lospace" for short.

Though the implementation of a large object space is relatively simple, I found that it has some points that are quite subtle. Probably the most important of these points relates to heap size. Consider that if you just had one space, implemented using mark-compact maybe, then the procedure to allocate a 16 kB object would go:

  1. Try to bump the allocation pointer by 16kB. Is it still within range? If so we are done.

  2. Otherwise, collect garbage and try again. If after GC there isn't enough space, the allocation fails.

In step (2), collecting garbage could decide to grow or shrink the heap. However when evaluating collector algorithms, you generally want to avoid dynamically-sized heaps.

cheatery

Here is where I need to make an embarrassing admission. In my role as co-maintainer of the Guile programming language implementation, I have long noodled around with benchmarks, comparing Guile to Chez, Chicken, and other implementations. It's good fun. However, I only realized recently that I had a magic knob that I could turn to win more benchmarks: simply make the heap bigger. Make it start bigger, make it grow faster, whatever it takes. For a program that does its work in some fixed amount of total allocation, a bigger heap will require fewer collections, and therefore generally take less time. (Some amount of collection may be good for performance as it improves locality, but this is a marginal factor.)

Of course I didn't really go wild with this knob but it now makes me doubt all benchmarks I have ever seen: are we really using benchmarks to select for fast implementations, or are we in fact selecting for implementations with cheeky heap size heuristics? Consider even any of the common allocation-heavy JavaScript benchmarks, DeltaBlue or Earley or the like; to win these benchmarks, web browsers are incentivised to have large heaps. In the real world, though, a more parsimonious policy might be more appreciated by users.

Java people have known this for quite some time, and are therefore used to fixing the heap size while running benchmarks. For example, people will measure the minimum amount of memory that can allow a benchmark to run, and then configure the heap to be a constant multiplier of this minimum size. The MMTK garbage collector toolkit can't even grow the heap at all currently: it's an important feature for production garbage collectors, but as they are just now migrating out of the research phase, heap growth (and shrinking) hasn't yet been a priority.

lospace

So now consider a garbage collector that has two spaces: an Immix space for allocations of 8kB and below, and a large object space for, well, larger objects. How do you divide the available memory between the two spaces? Could the balance between immix and lospace change at run-time? If you never had large objects, would you be wasting space at all? Conversely is there a strategy that can also work for only large objects?

Perhaps the answer is obvious to you, but it wasn't to me. After much reading of the MMTK source code and pondering, here is what I understand the state of the art to be.

  1. Arrange for your main space -- Immix, mark-sweep, whatever -- to be block-structured, and able to dynamically decommission or recommission blocks, perhaps via MADV_DONTNEED. This works if the blocks are even multiples of the underlying OS page size.

  2. Keep a counter of however many bytes the lospace currently has.

  3. When you go to allocate a large object, increment the lospace byte counter, and then round up to the number of blocks to decommission from the main paged space. If this is more than are currently decommissioned, find some empty blocks and decommission them.

  4. If no empty blocks were found, collect, and try again. If the second try doesn't work, then the allocation fails.

  5. Now that the paged space has shrunk, lospace can allocate. You can use the system malloc, but probably better to use mmap, so that if these objects are collected, you can just MADV_DONTNEED them and keep them around for later re-use.

  6. After GC runs, explicitly return the memory for any object in lospace that wasn't visited when the object graph was traversed. Decrement the lospace byte counter and possibly return some empty blocks to the paged space.
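Here is a rough, hypothetical sketch of steps 3 through 5; the counters and find_empty_block() stand in for whatever bookkeeping the paged space actually keeps, and error handling is minimal.

    #include <stddef.h>
    #include <sys/mman.h>

    #define BLOCK_SIZE (32 * 1024)   /* a multiple of the OS page size */

    extern size_t lospace_bytes;           /* bytes currently held by the lospace */
    extern size_t decommissioned_blocks;   /* paged-space blocks returned to the OS */

    void *find_empty_block(void);          /* NULL if none; may sweep eagerly */

    void *allocate_large(size_t bytes) {
      size_t blocks_needed =
        (lospace_bytes + bytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
      while (decommissioned_blocks < blocks_needed) {
        void *block = find_empty_block();
        if (!block)
          return NULL;                              /* caller collects, then retries */
        madvise(block, BLOCK_SIZE, MADV_DONTNEED);  /* give the pages back */
        decommissioned_blocks++;
      }
      lospace_bytes += bytes;
      /* Map the object with mmap so that when it dies we can MADV_DONTNEED
         it too and keep the mapping around for reuse. */
      void *obj = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return obj == MAP_FAILED ? NULL : obj;
    }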

There are some interesting aspects about this strategy. One is, the memory that you return to the OS doesn't need to be contiguous. When allocating a 50 MB object, you don't have to find 50 MB of contiguous free space, because any set of blocks that adds up to 50 MB will do.

Another aspect is that this adaptive strategy can work for any ratio of large to non-large objects. The user doesn't have to manually set the sizes of the various spaces.

This strategy does assume that address space is larger than heap size, but only by a factor of 2 (modulo fragmentation for the large object space). Therefore our risk of running afoul of user resource limits and kernel overcommit heuristics is low.

The one underspecified part of this algorithm is... did you see it? "Find some empty blocks". If the main paged space does lazy sweeping -- only scanning a block for holes right before the block will be used for allocation -- then after a collection we don't actually know very much about the heap, and notably, we don't know what blocks are empty. (We could know it, of course, but it would take time; you could traverse the line mark arrays for all blocks while the world is stopped, but this increases pause time. The original Immix collector does this, however.) In the system I've been working on, instead I have it so that if a mutator finds an empty block, it puts it on a separate list, and then takes another block, only allocating into empty blocks once all blocks are swept. If the lospace needs blocks, it sweeps eagerly until it finds enough empty blocks, throwing away any nonempty blocks. This causes the next collection to happen sooner, but that's not a terrible thing; this only occurs when rebalancing lospace versus paged-space size, because if you have a constant allocation rate on the lospace side, you will also have a complementary rate of production of empty blocks by GC, as they are recommissioned when lospace objects are reclaimed.

What if your main paged space has ample space for allocating a large object, but there are no empty blocks, because live objects are equally peppered around all blocks? In that case, often the application would be best served by growing the heap, but maybe not. In any case in a strict-heap-size environment, we need a solution.

But for that... let's pick up another day. Until then, happy hacking!

by Andy Wingo at June 20, 2022 02:59 PM

June 15, 2022

GStreamerGStreamer 1.20.3 stable bug fix release

(GStreamer)

The GStreamer team is pleased to announce the third bug fix release in the stable 1.20 release series of your favourite cross-platform multimedia framework!

This release only contains important security fixes. It should be safe to update from 1.20.x and we recommend you upgrade at your earliest convenience.

Highlighted bugfixes:

  • Security fixes in Matroska, MP4 and AVI demuxers
  • Fix scrambled video playback with hardware-accelerated VA-API decoders on certain Intel hardware
  • playbin3/decodebin3 regression fix for unhandled streams
  • Fragmented MP4 playback fixes
  • Android H.265 encoder mapping
  • Playback of MXF files produced by older FFmpeg versions
  • Fix rtmp2sink crashes on 32-bit platforms
  • WebRTC improvements
  • D3D11 video decoder and screen recorder fixes
  • Performance improvements
  • Support for building against OpenCV 4.6 and other build fixes
  • Miscellaneous bug fixes, memory leak fixes, and other stability and reliability improvements

See the GStreamer 1.20.3 release notes for more details.

Binaries for Android, iOS, macOS and Windows will be available shortly.

Release tarballs can be downloaded directly here:

June 15, 2022 11:00 PM

Andy Wingodefragmentation

(Andy Wingo)

Good morning, hackers! Been a while. It used to be that I had long blocks of uninterrupted time to think and work on projects. Now I have two kids; the longest such time-blocks are on trains (too infrequent, but it happens) and in a less effective but more frequent fashion, after the kids are sleeping. As I start writing this, I'm in an airport waiting for a delayed flight -- my first since the pandemic -- so we can consider this to be the former case.

It is perhaps out of mechanical sympathy that I have been using my reclaimed time to noodle on a garbage collector. Managing space and managing time have similar concerns: how to do much with little, efficiently packing different-sized allocations into a finite resource.

I have been itching to write a GC for years, but the proximate event that pushed me over the edge was reading about the Immix collection algorithm a few months ago.

on fundamentals

Immix is a "mark-region" collection algorithm. I say "algorithm" rather than "collector" because it's more like a strategy or something that you have to put into practice by making a concrete collector, the other fundamental algorithms being copying/evacuation, mark-sweep, and mark-compact.

To build a collector, you might combine a number of spaces that use different strategies. A common choice would be to have a semi-space copying young generation, a mark-sweep old space, and maybe a treadmill large object space (a kind of copying collector, logically; more on that later). Then you have heuristics that determine what object goes where, when.

On the engineering side, there's quite a number of choices to make there too: probably you make some parts of your collector to be parallel, maybe the collector and the mutator (the user program) can run concurrently, and so on. Things get complicated, but the fundamental algorithms are relatively simple, and present interesting fundamental tradeoffs.


figure 1 from the immix paper

For example, mark-compact is most parsimonious regarding space usage -- for a given program, a garbage collector using a mark-compact algorithm will require less memory than one that uses mark-sweep. However, mark-compact algorithms all require at least two passes over the heap: one to identify live objects (mark), and at least one to relocate them (compact). This makes them less efficient in terms of overall program throughput and can also increase latency (GC pause times).

Copying or evacuating spaces can be more CPU-efficient than mark-compact spaces, as reclaiming memory avoids traversing the heap twice; a copying space copies objects as it traverses the live object graph instead of after the traversal (mark phase) is complete. However, a copying space's minimum heap size is quite high, and it only reaches competitive efficiencies at large heap sizes. For example, if your program needs 100 MB of space for its live data, a semi-space copying collector will need at least 200 MB of space in the heap (a 2x multiplier, we say), and will only run efficiently at something more like 4-5x. It's a reasonable tradeoff to make for small spaces such as nurseries, but as a mature space, it's so memory-hungry that users will be unhappy if you make it responsible for a large portion of your memory.

Finally, mark-sweep is quite efficient in terms of program throughput, because like copying it traverses the heap in just one pass, and because it leaves objects in place instead of moving them. But! Unlike the other two fundamental algorithms, mark-sweep leaves the heap in a fragmented state: instead of having all live objects packed into a contiguous block, memory is interspersed with live objects and free space. So the collector can run quickly but the allocator stops and stutters as it accesses disparate regions of memory.

allocators

Collectors are paired with allocators. For mark-compact and copying/evacuation, the allocator consists of a pointer to free space and a limit. Objects are allocated by bumping the allocation pointer, a fast operation that also preserves locality between contemporaneous allocations, improving overall program throughput. But for mark-sweep, we run into a problem: say you go to allocate a 1 kilobyte byte array, do you actually have space for that?

Generally speaking, mark-sweep allocators solve this problem via freelist allocation: the allocator has an array of lists of free objects, one for each "size class" (say 2 words, 3 words, and so on up to 16 words, then more sparsely up to the largest allocatable size maybe), and services allocations from their appropriate size class's freelist. This prevents the 1 kB free space that we need from being "used up" by a 16-byte allocation that could just as well have gone elsewhere. However, freelists prevent objects allocated around the same time from being deterministically placed in nearby memory locations. This increases variance and decreases overall throughput, both for the allocation operations themselves and for pointer-chasing in the course of the program's execution.
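For illustration, a freelist allocator along those lines might look something like this sketch: word-sized size classes up to 16 words, with the slow path and all the names being placeholders of my own.

    #include <stddef.h>

    #define WORD_SIZE        sizeof(void *)
    #define NUM_SIZE_CLASSES 16

    struct free_cell { struct free_cell *next; };

    /* One list of free cells per size class, indexed by size in words. */
    static struct free_cell *freelists[NUM_SIZE_CLASSES + 1];

    /* Slow path: sweep a page, refill the list, or fall back to a
       larger-object path.  Assumed to exist elsewhere. */
    void *allocate_slow(size_t words);

    void *freelist_allocate(size_t bytes) {
      size_t words = (bytes + WORD_SIZE - 1) / WORD_SIZE;
      if (words == 0)
        words = 1;
      if (words > NUM_SIZE_CLASSES)
        return allocate_slow(words);
      struct free_cell *cell = freelists[words];
      if (!cell)
        return allocate_slow(words);
      freelists[words] = cell->next;   /* pop in O(1), at the cost of locality */
      return cell;
    }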

Also, in a mark-sweep collector, we can still reach a situation where there is enough space on the heap for an allocation, but that free space is broken up into too many pieces: the heap is fragmented. For this reason, many systems that perform mark-sweep collection can choose to compact, if heuristics show it might be profitable. Because the usual strategy is mark-sweep, though, they still use freelist allocation.

on immix and mark-region

Mark-region collectors are like mark-sweep collectors, except that they do bump-pointer allocation into the holes between survivor objects.

Sounds simple, right? To my mind, though, the fundamental challenge in implementing a mark-region collector is how to handle fragmentation. Let's take a look at how Immix solves this problem.


part of figure 2 from the immix paper

Firstly, Immix partitions the heap into blocks, which might be 32 kB in size or so. No object can span a block. Block size should be chosen to be a nice power-of-two multiple of the system page size, and not so small that common object allocations wouldn't fit. Allocations of "large" objects -- greater than 8 kB, for Immix -- go to a separate space that is managed in a different way.

Within a block, Immix divides space into lines -- maybe 128 bytes long. Objects can span lines. Any line that does not contain (a part of) an object that survived the previous collection is part of a hole. A hole is a contiguous span of free lines in a block.

On the allocation side, Immix does bump-pointer allocation into holes. If a mutator doesn't currently have a hole, it scans the current block (obtaining one if needed) for the next hole, via a side table of per-line mark bits: one bit per line. Lines without the mark are in holes. Scanning for holes is fairly cheap, because the line size is not too small. Note that there are per-object mark bits as well; just because you've marked a line doesn't mean that you've traced all objects on that line.
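
Here is a sketch of what that hole scan might look like, using the block and line sizes mentioned in this post; all names are made up, and the published Immix additionally treats the line right after a marked line conservatively, which this sketch omits.

const LINE_SIZE: usize = 128;
const LINES_PER_BLOCK: usize = 32 * 1024 / LINE_SIZE;

/// Find the next hole at or after `start_line` in a block's side table of
/// per-line mark bits, returned as a half-open range of line indices.
fn next_hole(line_marks: &[bool; LINES_PER_BLOCK], start_line: usize) -> Option<(usize, usize)> {
    // Skip marked lines to find where a hole begins.
    let mut lo = start_line;
    while lo < LINES_PER_BLOCK && line_marks[lo] {
        lo += 1;
    }
    if lo == LINES_PER_BLOCK {
        return None; // no hole left in this block
    }
    // Extend over the contiguous run of unmarked lines.
    let mut hi = lo;
    while hi < LINES_PER_BLOCK && !line_marks[hi] {
        hi += 1;
    }
    Some((lo, hi))
}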

Allocating into a hole has good expected performance as well, as it's bump-pointer, and the minimum size isn't tiny. In the worst case of a hole consisting of a single line, you have 128 bytes to work with. This size is large enough for the majority of objects, given that most objects are small.

mitigating fragmentation

Immix still has some challenges regarding fragmentation. There is some loss in that a single (piece of an) object can keep a line marked, wasting any free space on that line. Also, when an object can't fit into a hole, any space left in that hole is lost, at least until the next collection. This loss could also occur for the next hole, and the next, and the next, and so on until Immix finds a hole that's big enough. In a mark-sweep collector with lazy sweeping, these free extents could instead be placed on freelists and used when needed, but in Immix there is no such facility (by design).

One mitigation for fragmentation risks is "overflow allocation": when allocating an object larger than a line (a medium object) and Immix can't find a hole before the end of the block, it allocates into a completely free block. So actually mutator threads allocate into two blocks at a time: one for small objects and medium objects if possible, and the other for medium objects when necessary.
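
As a toy illustration of that policy, here is a sketch in which each mutator holds two bump regions; the Hole and Mutator types and their fields are invented for the example, not Immix's actual data structures.

const LINE_SIZE: usize = 128;

struct Hole {
    cursor: usize,
    limit: usize,
}

impl Hole {
    fn bump(&mut self, size: usize) -> Option<usize> {
        let next = self.cursor.checked_add(size)?;
        if next > self.limit {
            return None;
        }
        let ptr = self.cursor;
        self.cursor = next;
        Some(ptr)
    }
}

struct Mutator {
    small_hole: Hole,    // current hole in the "small and medium" block
    overflow_hole: Hole, // position in the completely free overflow block
}

impl Mutator {
    fn allocate(&mut self, size: usize) -> Option<usize> {
        if let Some(ptr) = self.small_hole.bump(size) {
            return Some(ptr); // fits in the current hole
        }
        if size > LINE_SIZE {
            // Medium object that missed: use the overflow block rather than
            // giving up the rest of the current hole.
            return self.overflow_hole.bump(size);
        }
        None // caller would scan for the next hole or take a fresh block
    }
}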

Another mitigation is that large objects are allocated into their own space, so an Immix space will never be asked to hold objects larger than, say, 8 kB.

The other mitigation is that Immix can choose to evacuate instead of mark. How does this work? Is it worth it?

stw

This question about the practical tradeoffs involving evacuation is the one I wanted to pose when I started this article; I have gotten to the point of implementing this part of Immix and I have some doubts. But, this article is long enough, and my plane is about to land, so let's revisit this on my return flight. Until then, see you later, allocators!

by Andy Wingo at June 15, 2022 12:47 PM

June 11, 2022

Sebastian PölsterlUsing VS Code and Podman to Develop SYCL Applications With DPC++'s CUDA Backend

I recently wanted to create a development container for VS Code to develop applications using SYCL based on the CUDA backend of the oneAPI DPC++ (Data Parallel C++) compiler. As I’m running Fedora, it seemed natural to use Podman’s rootless containers instead of Docker for this. This turned out to be more challenging than expected, so I’m going to summarize my setup in this post. I’m using Fedora Linux 36 with Podman version 4.1.0.

Prerequisites

Since DPC++ is going to use CUDA behind the scenes, you will need an NVIDIA GPU and the corresponding kernel driver for it. I’ve been using the NVIDIA GPU driver from RPM Fusion. Note that you do not have to install CUDA; it is part of the development container alongside the DPC++ compiler.

Next, you require Podman, which on Fedora can be installed by executing

sudo dnf install -y podman

Finally, you require VS Code and the Remote - Containers extension. Just follow the instructions behind those links.

Installing and Configuring the NVIDIA Container Toolkit

The default configuration of the NVIDIA Container Toolkit does not work with Podman, so it needs to be adjusted. Most steps are based on this guide by Red Hat, which I will repeat below.

  1. Add the repository:

    curl -sL https://nvidia.github.io/nvidia-docker/rhel9.0/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    

    If you aren’t using Fedora 36, you might have to replace rhel9.0 with your distribution; see the instructions.

  2. Next, install the nvidia-container-toolkit package.

    sudo dnf install -y nvidia-container-toolkit
    
  3. Red Hat’s guide mentions configuring two settings in /etc/nvidia-container-runtime/config.toml. But when using Podman with --userns=keep-id to map the UID of the user running the container to the user running inside the container, you have to change a third setting. So open /etc/nvidia-container-runtime/config.toml with

    sudo -e /etc/nvidia-container-runtime/config.toml
    

    and change the following three lines:

    #no-cgroups = false
    no-cgroups = true
    
    #ldconfig = "@/sbin/ldconfig"
    ldconfig = "/sbin/ldconfig"
    
    #debug = "/var/log/nvidia-container-runtime.log"
    debug = "~/.local/nvidia-container-runtime.log"
    
  4. Next, you have to create a new SELinux policy to enable GPU access within the container:

    curl -sLO https://raw.githubusercontent.com/NVIDIA/dgx-selinux/master/bin/RHEL7/nvidia-container.pp
    sudo semodule -i nvidia-container.pp
    sudo nvidia-container-cli -k list | sudo restorecon -v -f -
    sudo restorecon -Rv /dev
    
  5. Finally, tell VS Code to use Podman instead of Docker by going to User Settings → Extensions → Remote - Containers, and change Remote › Containers: Docker Path to podman, as in the image below.

Docker Path setting

Replace docker with podman.

Using the Development Container

I created an example project that is based on a container providing the oneAPI DPC++ compiler with its CUDA backend.

You can install additional tools by editing the project’s Dockerfile.

To use the example project, clone it:

git clone https://github.com/sebp/vscode-sycl-dpcpp-cuda.git

Next, open the vscode-sycl-dpcpp-cuda directory with VS Code. At this point VS Code should recognize that the project contains a development container and suggest reopening the project in the container.

Reopen project in container

Click Reopen in Container.

Initially, this step will take some time because the container’s image is downloaded and VS Code will install additional tools inside the container. Subsequently, VS Code will reuse this container, and opening the project in the container will be quick.

Once the project has been opened within the container, you can open the example SYCL application in the file src/sycl-example.cpp. The project is configured to use the DPC++ compiler with the CUDA backend by default. Therefore, you just have to press Ctrl+Shift+B to compile the example file. Using the terminal, you can now execute the compiled program, which should print the GPU it is using and the numbers 0 to 31.

Alternatively, you can compile and directly run the program by executing the Test task by opening the Command Palette (F1, Ctrl+Shift+P) and searching for Run Test Task.

Conclusion

While the journey to use a rootless Podman container with access to the GPU with VS Code was rather cumbersome, I hope this guide will make it less painful for others. The example project should provide a good reference for a devcontainer.json to use rootless Podman containers with GPU access. If you aren’t interested in SYCL or DPC++, you can replace the existing Dockerfile. There are two steps that are essential for this to work:

  1. Create a vscode user inside the container.
  2. Make sure you create certain directories that VS Code (inside the container) will require access to.

Otherwise, you will encounter various permission denied errors.

June 11, 2022 12:33 PM

June 10, 2022

Christian SchallerHow to get your application to show up in GNOME Software

(Christian Schaller)

Adding Applications to the GNOME Software Center

Written by Richard Hughes and Christian F.K. Schaller

This blog post is based on a white-paper-style writeup Richard and I did a few years ago. Since I noticed this week that there wasn’t any other comprehensive writeup online on how to add the required metadata to get an application to appear in GNOME Software (or any other major open source app store), I decided to turn the writeup into a blog post, hopefully useful to the wider community. I tried to clean it up a bit as I converted it from the old white paper, so hopefully all information in here is valid as of this posting.

Abstract

Traditionally we have had little information about Linux applications before they have been installed. With the creation of a software center we require access to a rich set of metadata about an application before it is deployed so it can be displayed to the user and easily installed. This document is meant to be a guide for developers who wish to get their software appearing in the software stores in Fedora Workstation and other distributions. Without the metadata described in this document your application is likely to go undiscovered by many or most Linux users, but by reading this document you should be able to relatively quickly prepare your application.

Introduction

GNOME Software

Installing applications on Linux has traditionally involved copying binary and data files into a directory and just writing a single desktop file into a per-user or per-system directory so that it shows up in the desktop environment. In this document we refer to applications as graphical programs, rather than other system add-on components like drivers and codecs. This document will explain why the extra metadata is required and what is required for an application to be visible in the software center. We will try to document how to do this regardless of whether you choose to package your application as an RPM package or as a Flatpak bundle. The current rules are a combination of various standards that have evolved over the years, and we will try to summarize and explain them here, going from bottom to top.

System Architecture

Linux File Hierarchy

Traditionally applications on Linux are expected to install binary files to /usr/bin, architecture-independent data files to /usr/share/, and configuration files to /etc. If you want to package your application as a Flatpak the prefix used will be /app, so it is critical for applications to respect the prefix setting. Small temporary files can be stored in /tmp and much larger files in /var/tmp. Per-user configuration is either stored in the user's home directory (in ~/.config) or stored in a binary settings store such as dconf. As an application developer, never hardcode these paths; resolve them following the XDG standard so that they relocate correctly inside a Flatpak.
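
As a small illustration (in Rust, purely for the example; the app name is made up), resolving the per-user configuration directory the XDG way looks roughly like this; in a real application you would typically let a library such as GLib's g_get_user_config_dir() do it for you.

use std::env;
use std::path::PathBuf;

// Resolve the per-user config directory per the XDG Base Directory spec:
// use $XDG_CONFIG_HOME if set, otherwise fall back to $HOME/.config.
fn user_config_dir(app_name: &str) -> PathBuf {
    let base = env::var_os("XDG_CONFIG_HOME")
        .map(PathBuf::from)
        .unwrap_or_else(|| {
            let home = env::var_os("HOME").expect("HOME is not set");
            PathBuf::from(home).join(".config")
        });
    base.join(app_name)
}

fn main() {
    // "myapp" is a placeholder; a real application would use its own app ID.
    println!("{}", user_config_dir("myapp").display());
}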

Desktop files

Desktop files have been around for a long while now and are used by almost all Linux desktops to provide the basic description of a desktop application that your desktop environment will display, like a human-readable name and an icon.

So the creation of a desktop file on Linux allows a program to be visible to the graphical environment, e.g. KDE or GNOME Shell. If applications do not have a desktop file they must be manually launched using a terminal emulator. Desktop files must adhere to the Desktop File Specification and provide metadata in an ini-style format such as:

  • Binary type, typically ‘Application’
  • Program name (optionally localized)
  • Icon to use in the desktop shell
  • Program binary name to use for launching
  • Any mime types that can be opened by the applications (optional)
  • The standard categories the application should be included in (optional)
  • Keywords (optional, and optionally localized)
  • Short one-line summary (optional, and optionally localized)

The desktop file should be installed into /usr/share/applications for applications that are installed system wide. An example desktop file is provided below:


[Desktop Entry]
Type=Application
Name=OpenSCAD
Icon=openscad
Exec=openscad %f
MimeType=application/x-openscad;
Categories=Graphics;3DGraphics;Engineering;
Keywords=3d;solid;geometry;csg;model;stl;

The desktop files are used when creating the software center metadata, and so you should verify that you ship a .desktop file for each built application, and that these keys exist: Name, Comment, Icon, Categories, Keywords and Exec and that desktop-file-validate correctly validates the file. There should also be only one desktop file for each application.

The application icon should be in the PNG format with a transparent background and installed in
/usr/share/icons, /usr/share/icons/hicolor/<size>/apps/, or /usr/share/${app_name}/icons/*. The icon should be at least 128×128 in size (as this is the minimum size required by Flathub).

The file name of the desktop file is also very important, as this is the assigned ‘application ID’. New applications typically use a reverse-DNS style, e.g. org.gnome.Nautilus would be the app ID, and the .desktop entry file should thus be named org.gnome.Nautilus.desktop, but older programs may just use a short name, e.g. gimp.desktop. It is important to note that the file extension is also included as part of the desktop ID.

You can verify your desktop file using the command ‘desktop-file-validate’. You just run it like this:


desktop-file-validate myapp.desktop

This tool is available through the desktop-file-utils package, which you can install on Fedora Workstation using this command:


dnf install desktop-file-utils

You also need what is called a metainfo file (previously known as an AppData file). This is a file with the suffix .metainfo.xml (some applications still use the older .appdata.xml name), and it should be installed into /usr/share/metainfo with a name that matches the name of the .desktop file, e.g. gimp.desktop & gimp.metainfo.xml or org.gnome.Nautilus.desktop & org.gnome.Nautilus.metainfo.xml.

In the metainfo file you should include several 16:9 aspect screenshots along with a compelling translated description made up of multiple paragraphs.

In order to make it easier for you to do screenshots in 16:9 format we created a small GNOME Shell extension called ‘Screenshot Window Sizer’. You can install it from the GNOME Extensions site.

Once it is installed you can resize the window of your application to 16:9 format by focusing it and pressing ‘ctrl+alt+s’ (you can press the key combo multiple times to get the correct size). It should resize your application window to a perfect 16:9 aspect ratio and let you screenshot it.

Make sure you follow the style guide, which can be tested using the appstreamcli command line tool. appstreamcli is part of the ‘appstream’ package in Fedora Workstation:


appstreamcli validate foo.metainfo.xml

If you don’t already have the appstreamcli installed it can be installed using this command on Fedora Workstation:

dnf install appstream

What is allowed in a metainfo file is defined in the AppStream specification, but common items that typical applications add are:

  • License of the upstream project in SPDX identifier format [6], or ‘Proprietary’
  • A translated name and short description to show in the software center search results
  • A translated long description, consisting of multiple paragraphs, itemized and ordered lists.
  • A number of screenshots, with localized captions, typically in 16:9 aspect ratio
  • An optional list of releases with the update details and release information.
  • An optional list of kudos which tells the software center about the integration level of the application
  • A set of URLs that allow the software center to provide links to help or bug information
  • Content ratings and hardware compatibility
  • An optional gettext or QT translation domain which allows the AppStream generator to collect statistics on shipped application translations.

A typical (albeit somewhat truncated) metainfo file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<component type="desktop-application">
<id>org.gnome.Terminal</id>
<metadata_license>GPL-3.0+ or GFDL-1.3-only</metadata_license>
<project_license>GPL-3.0+</project_license>
<name>Terminal</name>
<name xml:lang="ar">الطرفية</name>
<name xml:lang="an">Terminal</name>
<summary>Use the command line</summary>
<summary xml:lang="ar">استعمل سطر الأوامر</summary>
<summary xml:lang="an">Emplega la linia de comandos</summary>
<description>
<p>GNOME Terminal is a terminal emulator application for accessing a UNIX shell environment which can be used to run programs available on your system.</p>
<p xml:lang="ar">يدعم تشكيلات مختلفة، و الألسنة و العديد من اختصارات لوحة المفاتيح.</p>
<p xml:lang="an">Suporta quantos perfils, quantas pestanyas y implementa quantos alcorces de teclau.</p>
</description>
<recommends>
<control>console</control>
<control>keyboard</control>
<control>pointing</control>
</recommends>
<screenshots>
<screenshot type="default">https://help.gnome.org/users/gnome-terminal/stable/figures/gnome-terminal.png</screenshot>
</screenshots>
<kudos>
<kudo>HiDpiIcon</kudo>
<kudo>HighContrast</kudo>
<kudo>ModernToolkit</kudo>
<kudo>SearchProvider</kudo>
<kudo>UserDocs</kudo>
</kudos>
<content_rating type="oars-1.1"/>
<url type="homepage">https://wiki.gnome.org/Apps/Terminal</url>
<project_group>GNOME</project_group>
<update_contact>https://wiki.gnome.org/Apps/Terminal/ReportingBugs</update_contact>
</component>

Some AppStream background

The AppStream specification is a mature and evolving standard that allows upstream applications to provide metadata such as localized descriptions, screenshots, extra keywords and content ratings for parental control. This introduction just touches the surface of what it provides, so I recommend reading the specification through once you understand the basics. The core concept is that the upstream project ships one extra metainfo XML file which is used to build a global application catalog. Thousands of open source projects now include metainfo files, and the software center shipped in Fedora, Ubuntu and OpenSuse is now an easy to use application filled with useful application metadata. Applications without metainfo files are no longer shown, which provides quite some incentive for upstream projects wanting visibility in popular desktop environments. AppStream was first introduced in 2008 and since then many people have contributed to the specification. It is being used primarily for application metadata but is now also used for drivers, firmware, input methods and fonts. There are multiple projects producing AppStream metadata and also a number of projects consuming the final XML metadata.

When applications are being built as packages by a distribution then the AppStream generation is done automatically, and you do not need to do anything other than installing a .desktop file and a metainfo.xml file in the upstream tarball or zip file. If the application is being built on your own machines or cloud instance then the distributor will need to generate the AppStream metadata manually. This would for example be the case when internal-only or closed source software is being either used or produced. This document assumes you are currently building RPM packages and exporting yum-style repository metadata for Fedora or RHEL, although the concepts are the same for rpm-on-OpenSuse or deb-on-Ubuntu.

NOTE: If you are building packages, make sure that there are not two applications installed with one single package. If this is currently the case, split up the package so that there are multiple subpackages or mark one of the .desktop files as NoDisplay=true. Make sure the application subpackages depend on any -common subpackage and deal with upgrades (perhaps using a metapackage) if you’ve shipped the application before.

Summary of Package building

So the steps outlined above explain the extra metadata you need to have your application show up in GNOME Software. This tutorial does not cover how to set up your build system to build these, but both for Meson and autotools you should be able to find a long range of examples online. And there are also major resources available to explain how to create a Fedora RPM or how to build a Flatpak. You probably also want to tie both the desktop file and the metainfo file into your i18n system so the metadata in them can be translated. It is worth noting here that while this document explains how you can do everything yourself, we do generally recommend relying on existing community infrastructure for hosting source code and packages if you can (for instance if your application is open source), as this will save you work and effort over time. For instance, putting your source code into the GNOME git will give you free access to the translator community in GNOME and thus significantly increase the chance your application is internationalized. And by building your package in Fedora you can get peer review of your package and free hosting of the resulting package. Or by putting your package up on Flathub you get wide cross-distribution availability.

Setting up hosting infrastructure for your package

We will here explain how you set up a Yum repository for RPM packages that provides the needed metadata. If you are making a Flatpak we recommend skipping ahead to the Flatpak section a bit further down.

Yum hosting and Metadata:

When GNOME Software checks for updates it downloads various metadata files from the server describing the packages available in the repository. GNOME Software can also download AppStream metadata at the same time, allowing add-on repositories to include applications that are visible in the software center. In most cases distributors are already building binary RPMs and then building metadata as an additional step by running something like the following to generate the repomd files on a directory of packages. The tool for creating the repository metadata is called createrepo_c and is part of the createrepo_c package in Fedora. You can install it by running the command:


dnf install createrepo_c

Once the tool is installed you can run these commands to generate your metadata:


$ createrepo_c --no-database --simple-md-filenames SRPMS/
$ createrepo_c --no-database --simple-md-filenames x86_64/

This creates the primary and filelist metadata required for updating on the command line. Next, to build the metadata required for the software center, we need to actually generate the AppStream XML. The tool you need for this is called appstream-builder. This works by decompressing .rpm files, merging together the .desktop file and the .metainfo.xml file, and preprocessing the icons. Remember, only applications installing AppData files will be included in the metadata.

You can install appstream builder in Fedora Workstation by using this command:

dnf install libappstream-glib-builder

Once it is installed you can run it by using the following syntax:

$ appstream-builder \
   --origin=yourcompanyname \
   --basename=appstream \
   --cache-dir=/tmp/asb-cache \
   --enable-hidpi \
   --max-threads=1 \
   --min-icon-size=32 \
   --output-dir=/tmp/asb-md \
   --packages-dir=x86_64/ \
   --temp-dir=/tmp/asb-icons

This takes a few minutes and generates some files to the output directory. Your output should look something like this:


Scanning packages...
Processing packages...
Merging applications...
Writing /tmp/asb-md/appstream.xml.gz...
Writing /tmp/asb-md/appstream-icons.tar.gz...
Writing /tmp/asb-md/appstream-screenshots.tar...Done!

The actual build output will depend on your compose server configuration. At this point you can also verify the application is visible in the yourcompanyname.xml.gz file.
We then have to take the generated XML and the tarball of icons and add it to the repomd.xml master document so that GNOME Software automatically downloads the content for searching.
This is as simple as doing:

modifyrepo_c \
    --no-compress \
    --simple-md-filenames \
    /tmp/asb-md/appstream.xml.gz \
    x86_64/repodata/
modifyrepo_c \
    --no-compress \
    --simple-md-filenames \
    /tmp/asb-md/appstream-icons.tar.gz \
    x86_64/repodata/

 

Deploying this metadata will allow GNOME Software to add the application metadata the next time the repository is refreshed, typically once per day.

Hosting your Yum repository on GitHub

GitHub isn’t really set up for hosting Yum repositories, but here is a method that currently works. So once you have created a local copy of your repository, create a new project on GitHub. Then use the following commands to import your repository into GitHub.


cd ~/src/myrepository
git init
git add -A
git commit -a -m "first commit"
git remote add origin git@github.com:yourgitaccount/myrepo.git
git push -u origin master

Once everything is imported, go into the GitHub web interface and drill down in the file tree until you find the file called ‘repomd.xml’ and click on it. You should now see a button in the GitHub interface called ‘Raw’. Once you click that you get the raw version of the XML file, and in the URL bar of your browser you should see a URL looking something like this:
https://raw.githubusercontent.com/cschalle/hubyum/master/noarch/repodata/repomd.xml
Copy that URL, as you will need the information from it to create your .repo file, which is what distributions and users want in order to reach your new repository. To create your .repo file, copy this example and edit it to match your data:


[remarkable]
name=Remarkable Markdown editor software and updates
baseurl=https://raw.githubusercontent.com/cschalle/hubyum/master/noarch
gpgcheck=0
enabled=1
enabled_metadata=1

So on top is your repo shortname inside the brackets, then a name field with a more extensive name. For the baseurl paste the URL you copied earlier and remove the last bits until you are left with either the ‘noarch’ directory or your platform directory, for instance x86_64. Once you have that file completed, put it into /etc/yum.repos.d on your computer and load up GNOME Software. Click on the ‘Updates’ button in GNOME Software and then on the refresh button in the top left corner to ensure your database is up to date. If everything works as expected you should then be able to do a search in GNOME Software and find your new application showing up.

Example of self hosted RPM

Flatpak hosting and Metadata

The flatpak-builder binary generates AppStream metadata automatically when building applications if the appstream-compose tool is installed on the Flatpak build machine. Flatpak remotes are exported with a separate ‘appstream’ branch which is automatically downloaded by GNOME Software, and no additional work is required when building your application or updating the remote. Adding the remote is enough to add the application to the software center, on the assumption the AppData file is valid.

Conclusions

AppStream files allow us to build a modern software center experience either using distro packages with yum-style metadata or with the new Flatpak application deployment framework. By including a desktop file and AppData file for your Linux binary build, your application can be easily found and installed by end users, greatly expanding its userbase.

by uraeus at June 10, 2022 02:12 PM

May 11, 2022

Christian SchallerWhy the open source driver release from NVIDIA is so important for Linux?

(Christian Schaller)

Background
Today NVIDIA announced that they are releasing an open source kernel driver for their GPUs, so I want to share with you some background information and how I think this will impact Linux graphics and compute going forward.

One thing many people are not aware of is that Red Hat is the only Linux OS company who has a strong presence in the Linux compute and graphics engineering space. There are of course a lot of other people working in the space too, like engineers working for Intel, AMD and NVIDIA or people working for consultancy companies like Collabora or individual community members, but Red Hat as an OS integration company has been very active on trying to ensure we have a maintainable and shared upstream open source stack. This engineering presence is also what has allowed us to move important technologies forward, like getting hiDPI support for Linux some years ago, or working with NVIDIA to get glvnd implemented to remove a pain point for our users since the original OpenGL design only allowed for one OpenGl implementation to be installed at a time. We see ourselves as the open source community’s partner here, fighting to keep the linux graphics stack coherent and maintainable and as a partner for the hardware OEMs to work with when they need help pushing major new initiatives around GPUs for Linux forward. And as the only linux vendor with a significant engineering footprint in GPUs we have been working closely with NVIDIA. People like Kevin Martin, the manager for our GPU technologies team, Ben Skeggs the maintainer of Nouveau and Dave Airlie, the upstream kernel maintainer for the graphics subsystem, Nouveau contributor Karol Herbst and our accelerator lead Tom Rix have all taken part in meetings, code reviews and discussions with NVIDIA. So let me talk a little about what this release means (and also what it doesn’t mean) and what we hope to see come out of this long term.

First of all, what is in this new driver?
What has been released is an out-of-tree source code kernel driver which has been tested to support CUDA usecases on datacenter GPUs. There is code in there to support display, but it is not complete or fully tested yet. Also, this is only the kernel part; a big part of a modern graphics driver is to be found in the firmware and userspace components, and those are still closed source. But it does mean we have an NVIDIA kernel driver now that will start being able to consume the GPL-only APIs in the Linux kernel, although this initial release doesn’t consume any APIs the old driver wasn’t already using. The driver also only supports NVIDIA Turing chip GPUs and newer, which means it is not targeting GPUs from before 2018. So for the average Linux desktop user, while this is a great first step and hopefully a sign of what is to come, it is not something you are going to start using tomorrow.

What does it mean for the NVidia binary driver?
Not too much immediately. This binary kernel driver will continue to be needed for older pre-Turing NVIDIA GPUs, and until the open source kernel module is fully tested and extended for display usecases you are likely to continue using it for your system even if you are on Turing or newer. Also, as mentioned above regarding the firmware and userspace bits, the binary driver is going to continue to be around even once the open source kernel driver is fully capable.

What does it mean for Nouveau?
Let me start with the obvious, this is actually great news for the Nouveau community and the Nouveau driver and NVIDIA has done a great favour to the open source graphics community with this release. And for those unfamiliar with Nouveau, Nouveau is the in-kernel graphics driver for NVIDIA GPUs today which was originally developed as a reverse engineered driver, but which over recent years actually have had active support from NVIDIA. It is fully functional, but is severely hampered by not having had the ability to for instance re-clock the NVIDIA card, meaning that it can’t give you full performance like the binary driver can. This was something we were working with NVIDIA trying to remedy, but this new release provides us with a better path forward. So what does this new driver mean for Nouveau? Less initially, but a lot in the long run. To give a little background first. The linux kernel does not allow multiple drivers for the same hardware, so in order for a new NVIDIA kernel driver to go in the current one will have to go out or at least be limited to a different set of hardware. The current one is Nouveau. And just like the binary driver a big chunk of Nouveau is not in the kernel, but are the userspace pieces found in Mesa and the Nouveau specific firmware that NVIDIA currently kindly makes available. So regardless of the long term effort to create a new open source in-tree kernel driver based on this new open source driver for NVIDIA hardware, Nouveau will very likely be staying around to support pre-turing hardware just like the NVIDIA binary kernel driver will.

The plan we are working towards from our side, but which is likely to take a few years to come to full fruition, is to come up with a way for the NVIDIA binary driver and Mesa to share a kernel driver. The details of how we will do that is something we are still working on and discussing with our friends at NVIDIA to address both the needs of the NVIDIA userspace and the needs of the Mesa userspace. Along with that evolution we hope to work with NVIDIA engineers to refactor the userspace bits of Mesa that are now targeting just Nouveau to be able to interact with this new kernel driver and also work so that the binary driver and Nouveau can share the same firmware. This has clear advantages for both the open source community and the NVIDIA. For the open source community it means that we will now have a kernel driver and firmware that allows things like changing the clocking of the GPU to provide the kind of performance people expect from the NVIDIA graphics card and it means that we will have an open source driver that will have access to the firmware and kernel updates from day one for new generations of NVIDIA hardware. For the ‘binary’ driver, and I put that in ” signs because it will now be less binary :), it means as stated above that it can start taking advantage of the GPL-only APIs in the kernel, distros can ship it and enable secure boot, and it gets an open source consumer of its kernel driver allowing it to go upstream.
If this new shared kernel driver will be known as Nouveau or something completely different is still an open question, and of course it happening at all depends on if we and the rest of the open source community and NVIDIA are able to find a path together to make it happen, but so far everyone seems to be of good will.

What does this release mean for linux distributions like Fedora and RHEL?

Over time it provides a pathway to radically simplify supporting NVIDIA hardware due to the opportunities discussed elsewhere in this document. Long term we hope to be able to provide a better user experience with NVIDIA hardware in terms of out-of-the-box functionality. That means day 1 support for new chipsets, a high performance open source Mesa driver for NVIDIA, and the ability to sign the NVIDIA driver alongside the rest of the kernel to enable things like Secure Boot support. Since this first release is targeting compute, one can expect that these options will first be available for compute users and then graphics at a later time.

What are the next steps
Well there is a lot of work to do here. NVIDIA need to continue the effort to make this new driver feature complete for both Compute and Graphics Display usecases, we’d like to work together to come up with a plan for what the future unified kernel driver can look like and a model around it that works for both the community and NVIDIA, we need to add things like a Mesa Vulkan driver. We at Red Hat will be playing an active part in this work as the only Linux vendor with the capacity to do so and we will also work to ensure that the wider open source community has a chance to participate fully like we do for all open source efforts we are part of.

If you want to hear more about this I did talk with Chris Fisher and Linux Action News about this topic. Note: I did state some timelines in that interview which I didn’t make clear was my guesstimates and not in any form official NVIDIA timelines, so apologize for the confusion.

by uraeus at May 11, 2022 07:52 PM

May 09, 2022

Robert McQueenEvolving a strategy for 2022 and beyond

(Robert McQueen)

As a board, we have been working on several initiatives to make the Foundation a better asset for the GNOME Project. We’re working on a number of threads in parallel, so I wanted to explain the “big picture” a bit more to try and connect together things like the new ED search and the bylaw changes.

We’re all here to see free and open source software succeed and thrive, so that people can be truly empowered with agency over their technology, rather than being passive consumers. We want to bring GNOME to as many people as possible so that they have computing devices that they can inspect, trust, share and learn from.

In previous years we’ve tried to boost the relevance of GNOME (or technologies such as GTK) or solicit donations from businesses and individuals with existing engagement in FOSS ideology and technology. The problem with this approach is that we’re mostly addressing people and organisations who are already supporting or contributing FOSS in some way. To truly scale our impact, we need to look to the outside world, build better awareness of GNOME outside of our current user base, and find opportunities to secure funding to invest back into the GNOME project.

The Foundation supports the GNOME project with infrastructure, arranging conferences, sponsoring hackfests and travel, design work, legal support, managing sponsorships, advisory board, being the fiscal sponsor of GNOME, GTK, Flathub… and we will keep doing all of these things. What we’re talking about here are additional ways for the Foundation to support the GNOME project – we want to go beyond these activities, and invest into GNOME to grow its adoption amongst people who need it. This has a cost, and that means in parallel with these initiatives, we need to find partners to fund this work.

Neil has previously talked about themes such as education, advocacy, privacy, but we’ve not previously translated these into clear specific initiatives that we would establish in addition to the Foundation’s existing work. This is all a work in progress and we welcome any feedback from the community about refining these ideas, but here are the current strategic initiatives the board is working on. We’ve been thinking about growing our community by encouraging and retaining diverse contributors, and addressing evolving computing needs which aren’t currently well served on the desktop.

Initiative 1. Welcoming newcomers. The community is already spending a lot of time welcoming newcomers and teaching them the best practices. Those activities are as time consuming as they are important, but currently a handful of individuals are running initiatives such as GSoC, Outreachy and outreach to Universities. These activities help bring diverse individuals and perspectives into the community, and helps them develop skills and experience of collaborating to create Open Source projects. We want to make those efforts more sustainable by finding sponsors for these activities. With funding, we can hire people to dedicate their time to operating these programs, including paid mentors and creating materials to support newcomers in future, such as developer documentation, examples and tutorials. This is the initiative that needs to be refined the most before we can turn it into something real.

Initiative 2: Diverse and sustainable Linux app ecosystem. I spoke at the Linux App Summit about the work that GNOME and Endless has been supporting in Flathub, but this is an example of something which has a great overlap between commercial, technical and mission-based advantages. The key goal here is to improve the financial sustainability of participating in our community, which in turn has an impact on the diversity of who we can expect to afford to enter and remain in our community. We believe the existence of this is critically important for individual developers and contributors to unlock earning potential from our ecosystem, through donations or app sales. In turn, a healthy app ecosystem also improves the usefulness of the Linux desktop as a whole for potential users. We believe that we can build a case for commercial vendors in the space to join an advisory board alongside with GNOME, KDE, etc to input into the governance and contribute to the costs of growing Flathub.

Initiative 3: Local-first applications for the GNOME desktop. This is what Thib has been starting to discuss on Discourse, in this thread. There are many different threats to free access to computing and information in today’s world. The GNOME desktop and apps need to give users convenient and reliable access to technology which works similarly to the tools they already use everyday, but keeps them and their data safe from surveillance, censorship, filtering or just being completely cut off from the Internet. We believe that we can seek both philanthropic and grant funding for this work. It will make GNOME a more appealing and comprehensive offering for the many people who want to protect their privacy.

The idea is that these initiatives all sit on the boundary between the GNOME community and the outside world. If the Foundation can grow and deliver these kinds of projects, we are reaching to new people, new contributors and new funding. These contributions and investments back into GNOME represent a true “win-win” for the newcomers and our existing community.

(Originally posted to GNOME Discourse, please feel free to join the discussion there.)

by ramcq at May 09, 2022 02:01 PM

May 02, 2022

GStreamerGStreamer 1.20.2 stable bug fix release

(GStreamer)

The GStreamer team is pleased to announce the second bug fix release in the stable 1.20 release series of your favourite cross-platform multimedia framework!

This release only contains bugfixes and it should be safe to update from 1.20.x.

Highlighted bugfixes:

  • avviddec: Remove vc1/wmv3 override and fix crashes on WMV files with FFMPEG 5.0+
  • macOS: fix plugin discovery for GStreamer installed via brew and fix loading of Rust plugins
  • rtpbasepayload: various header extension handling fixes
  • rtpopusdepay: fix regression in stereo input handling if sprop-stereo is not advertised
  • rtspclientsink: fix possible shutdown deadlock
  • mpegts: gracefully handle "empty" program maps and fix AC-4 detection
  • mxfdemux: Handle empty VANC packets and fix EOS handling
  • playbin3: various playbin3, uridecodebin3, and playsink fixes
  • ptpclock: fix initial sync-up with certain devices
  • gltransformation: let graphene alloc its structures memory aligned
  • webrtcbin fixes and webrtc sendrecv example improvements
  • video4linux2: various fixes including some fixes for Raspberry Pi users
  • videorate segment handling fixes and other fixes
  • nvh264dec, nvh265dec: Fix broken key-unit trick modes and reverse playback
  • wpe: Reintroduce persistent WebContext
  • cerbero: Make it easier to consume 1.20.1 macOS GStreamer .pkgs
  • build fixes and gobject annotation fixes
  • bug fixes, security fixes, memory leak fixes, and other stability and reliability improvements

See the GStreamer 1.20.2 release notes for more details.

Binaries for Android, iOS, macOS and Windows will be available shortly.

Release tarballs can be downloaded directly here:

May 02, 2022 11:30 PM

Sebastian DrögeInstantaneous RTP synchronization & retrieval of absolute sender clock times with GStreamer

(Sebastian Dröge)

Over the last few weeks, GStreamer’s RTP stack got a couple of new and quite useful features. As they are difficult to configure, mostly because there are so many different possible configurations, I decided to write about this a bit with some example code.

The features are RFC 6051-style rapid synchronization of RTP streams, which can be used for inter-stream (e.g. audio/video) synchronization as well as inter-device (i.e. network) synchronization, and the ability to easily retrieve absolute sender clock times per packet on the receiver side.

Note that each of these things was already possible before with GStreamer via different mechanisms with different trade-offs. Obviously, not being able to have working audio/video synchronization would simply not be acceptable, and I previously talked about how to do inter-device synchronization with GStreamer, for example at the GStreamer Conference 2015 in Düsseldorf.

The example code below will make use of the GStreamer RTSP Server library but can be applied to any kind of RTP workflow, including WebRTC, and is written in Rust, but the same can also be achieved in any other language. The full code can be found in this repository.

And for reference, the merge requests to enable all this are [1], [2] and [3]. You probably don’t want to backport those to an older version of GStreamer though as there are dependencies on various other changes elsewhere. All of the following needs at least GStreamer from the git main branch as of today, or the upcoming 1.22 release.

Baseline Sender / Receiver Code

The starting point of the example code can be found here in the baseline branch. All the important steps are commented so it should be relatively self-explanatory.

Sender

The sender is starting an RTSP server on the local machine on port 8554 and provides a media with H264 video and Opus audio on the mount point /test. It can be started with

$ cargo run -p rtp-rapid-sync-example-send

After starting the server it can be accessed via GStreamer with e.g. gst-play-1.0 rtsp://127.0.0.1:8554/test or similarly via VLC or any other software that supports RTSP.

This does not do anything special yet but lays the foundation for the following steps. It creates an RTSP server instance with a custom RTSP media factory, which in turn creates custom RTSP media instances. All this is not needed at this point yet but will allow for the necessary customization later.

One important aspect here is that the base time of the media’s pipeline is set to zero

pipeline.set_base_time(gst::ClockTime::ZERO);
pipeline.set_start_time(gst::ClockTime::NONE);

This allows the timeoverlay element that is placed in the video part of the pipeline to render the clock time over the video frames. We’re going to use this later to confirm on the receiver that the clock time on the sender and the one retrieved on the receiver are the same.

let video_overlay = gst::ElementFactory::make("timeoverlay", None)
    .context("Creating timeoverlay")?;
[...]
video_overlay.set_property_from_str("time-mode", "running-time");

It actually only supports rendering the running time of each buffer, but in a live pipeline with the base time set to zero the running time and pipeline clock time are the same. See the documentation for some more details about the time concepts in GStreamer.

Overall this creates the following RTSP stream producer bin, which will be used also in all the following steps:

Receiver

The receiver is a simple playbin pipeline that plays an RTSP URI given via command-line parameters and runs until the stream is finished or an error has happened.

It can be run with the following once the sender is started

$ cargo run -p rtp-rapid-sync-example-send -- "rtsp://192.168.1.101:8554/test"

Please don’t forget to replace the IP with the IP of the machine that is actually running the server.

All the code should be familiar to anyone who ever wrote a GStreamer application in Rust, except for one part that might need a bit more explanation

pipeline.connect_closure(
    "source-setup",
    false,
    glib::closure!(|_playbin: &gst::Pipeline, source: &gst::Element| {
        source.set_property("latency", 40u32);
    }),
);

playbin is going to create an rtspsrc, and at that point it will emit the source-setup signal so that the application can do any additional configuration of the source element. Here we’re connecting a signal handler to that signal to do exactly that.

By default rtspsrc introduces a latency of 2 seconds, which is a lot more than what is usually needed. For live, non-VOD RTSP streams this value should be around the network jitter and here we’re configuring that to 40 milliseconds.

Retrieval of absolute sender clock times

Now as the first step we’re going to retrieve the absolute sender clock times for each video frame on the receiver. They will be rendered by the receiver at the bottom of each video frame and will also be printed to stdout. The changes between the previous version of the code and this version can be seen here and the final code here in the sender-clock-time-retrieval branch.

When running the sender and receiver as before, the video from the receiver should look similar to the following

The upper time that is rendered on the video frames is rendered by the sender, the bottom time is rendered by the receiver and both should always be the same unless something is broken here. Both times are the pipeline clock time when the sender created/captured the video frame.

In this configuration the absolute clock times of the sender are provided to the receiver via the NTP / RTP timestamp mapping provided by the RTCP Sender Reports. That’s also the reason why it takes about 5s for the receiver to know the sender’s clock time as RTCP packets are not scheduled very often and only after about 5s by default. The RTCP interval can be configured on rtpbin together with many other things.

Sender

On the sender-side the configuration changes are rather small and not even absolutely necessary.

rtpbin.set_property_from_str("ntp-time-source", "clock-time");

By default the RTP NTP time used in the RTCP packets is based on the local machine’s walltime clock converted to the NTP epoch. While this works fine, this is not the clock that is used for synchronizing the media and as such there will be drift between the RTP timestamps of the media and the NTP time from the RTCP packets, which will be reset every time the receiver receives a new RTCP Sender Report from the sender.

Instead, we configure rtpbin here to use the pipeline clock as the source for the NTP timestamps used in the RTCP Sender Reports. This doesn’t give us (by default at least, see later) an actual NTP timestamp but it doesn’t have the drift problem mentioned before. Without further configuration, in this pipeline the used clock is the monotonic system clock.

rtpbin.set_property("rtcp-sync-send-time", false);

rtpbin normally uses the time when a packet is sent out for the NTP / RTP timestamp mapping in the RTCP Sender Reports. This is changed with this property to instead use the time when the video frame / audio sample was captured, i.e. it does not include all the latency introduced by encoding and other processing in the sender pipeline.

This doesn’t make any big difference in this scenario but usually one would be interested in the capture clock times and not the send clock times.

Receiver

On the receiver-side there are a few more changes. First of all we have to opt-in to rtpjitterbuffer putting a reference timestamp metadata on every received packet with the sender’s absolute clock time.

pipeline.connect_closure(
    "source-setup",
    false,
    glib::closure!(|_playbin: &gst::Pipeline, source: &gst::Element| {
        source.set_property("latency", 40u32);
        source.set_property("add-reference-timestamp-meta", true);
    }),
);

rtpjitterbuffer will start putting the metadata on packets once it knows the NTP / RTP timestamp mapping, i.e. after the first RTCP Sender Report is received in this case. Between the Sender Reports it is going to interpolate the clock times. The normal timestamps (PTS) on each packet are not affected by this and are still based on whatever clock is used locally by the receiver for synchronization.

To actually make use of the reference timestamp metadata we add a timeoverlay element as video-filter on the receiver:

let timeoverlay =
    gst::ElementFactory::make("timeoverlay", None).context("Creating timeoverlay")?;

timeoverlay.set_property_from_str("time-mode", "reference-timestamp");
timeoverlay.set_property_from_str("valignment", "bottom");

pipeline.set_property("video-filter", &timeoverlay);

This will then render the sender’s absolute clock times at the bottom of each video frame, as seen in the screenshot above.

And last we also add a pad probe on the sink pad of the timeoverlay element to retrieve the reference timestamp metadata of each video frame and then printing the sender’s clock time to stdout:

let sinkpad = timeoverlay
    .static_pad("video_sink")
    .expect("Failed to get timeoverlay sinkpad");
sinkpad
    .add_probe(gst::PadProbeType::BUFFER, |_pad, info| {
        if let Some(gst::PadProbeData::Buffer(ref buffer)) = info.data {
            if let Some(meta) = buffer.meta::<gst::ReferenceTimestampMeta>() {
                println!("Have sender clock time {}", meta.timestamp());
            } else {
                println!("Have no sender clock time");
            }
        }

        gst::PadProbeReturn::Ok
    })
    .expect("Failed to add pad probe");

Rapid synchronization via RTP header extensions

The main problem with the previous code is that the sender’s clock times are only known once the first RTCP Sender Report is received by the receiver. There are many ways to configure rtpbin to make this happen faster (e.g. by reducing the RTCP interval or by switching to the AVPF RTP profile) but in any case the information would be transmitted outside the actual media data flow and it can’t be guaranteed that it is actually known on the receiver from the very first received packet onwards. This is of course not a problem in every use-case, but for the cases where it is there is a solution for this problem.

RFC 6051 defines an RTP header extension that allows transmitting the NTP timestamp that corresponds to an RTP packet directly together with this very packet. And that’s what the next changes to the code are making use of.

The changes between the previous version of the code and this version can be seen here and the final code here in the rapid-synchronization branch.

Sender

To add the header extension on the sender-side it is only necessary to add an instance of the corresponding header extension implementation to the payloaders.

let hdr_ext = gst_rtp::RTPHeaderExtension::create_from_uri(
    "urn:ietf:params:rtp-hdrext:ntp-64",
    )
    .context("Creating NTP 64-bit RTP header extension")?;
hdr_ext.set_id(1);
video_pay.emit_by_name::<()>("add-extension", &[&hdr_ext]);

This first instantiates the header extension based on the uniquely defined URI for it, then sets its ID to 1 (see RFC 5285) and then adds it to the video payloader. The same is then done for the audio payloader.

By default this will add the header extension to every RTP packet that has a different RTP timestamp than the previous one. In other words: on the first packet that corresponds to an audio or video frame. Via properties on the header extension this can be configured but generally the default should be sufficient.

Receiver

On the receiver-side no changes would actually be necessary. The use of the header extension is signaled via the SDP (see RFC 5285) and it will be automatically made use of inside rtpbin as another source of NTP / RTP timestamp mappings in addition to the RTCP Sender Reports.

However, we configure one additional property on rtpbin

source.connect_closure(
    "new-manager",
    false,
    glib::closure!(|_rtspsrc: &gst::Element, rtpbin: &gst::Element| {
        rtpbin.set_property("min-ts-offset", gst::ClockTime::from_mseconds(1));
    }),
);

Inter-stream audio/video synchronization

The reason for configuring the min-ts-offset property on the rtpbin is that the NTP / RTP timestamp mapping is not only used for providing the reference timestamp metadata but it is also used for inter-stream synchronization by default. That is, for getting correct audio / video synchronization.

With RTP alone there is no mechanism to synchronize multiple streams against each other as the packet’s RTP timestamps of different streams have no correlation to each other. This is not too much of a problem as usually the packets for audio and video are received approximately at the same time but there’s still some inaccuracy in there.

One approach to fix this is to use the NTP / RTP timestamp mapping of each stream, either from the RTCP Sender Reports or from the RTP header extension, and that’s what is made use of here. And because the mapping is provided very often via the RTP header extension but the RTP timestamps are only accurate up to the clock rate (1/90000s for video and 1/48000s for audio in this case), we configure a threshold of 1ms for adjusting the inter-stream synchronization. Without this it would be adjusted almost continuously by a very small amount back and forth.

Other approaches for inter-stream synchronization are provided by RTSP itself before streaming starts (via the RTP-Info header), but due to a bug this is currently not made use of by GStreamer.

Yet another approach would be via the clock information provided by RFC 7273, about which I already wrote previously and which is also supported by GStreamer. This also allows inter-device, network synchronization and is used for that purpose as part of e.g. AES67, Ravenna, SMPTE 2022 / 2110 and many other protocols.

Inter-device network synchronization

Now for the last part, we’re going to add actual inter-device synchronization to this example. The changes between the previous version of the code and this version can be seen here and the final code here in the network-sync branch. This does not use the clock information provided via RFC 7273 (which would be another option) but uses the same NTP / RTP timestamp mapping that was discussed above.

When starting the receiver multiple times on different (or the same) machines, each of them should play back the media synchronized to each other and exactly 2 seconds after the corresponding audio / video frames are produced on the sender.

For this, both sender and all receivers are using an NTP clock (pool.ntp.org in this case) instead of the local monotonic system clock for media synchronization (i.e. as the pipeline clock). Instead of an NTP clock it would also be possible to use any other mechanism for network clock synchronization, e.g. PTP or the GStreamer netclock.
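
The clock itself is created before the pipelines are set up. A minimal sketch of that construction, assuming the gst_net crate's NtpClock API with pool.ntp.org as the server and the standard NTP port 123; the exact line is not part of this excerpt:

// Assumed sketch: an NTP clock following pool.ntp.org on port 123, starting from a base time of zero
let clock = gst_net::NtpClock::new(None, "pool.ntp.org", 123, gst::ClockTime::ZERO);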

println!("Syncing to NTP clock");
clock
    .wait_for_sync(gst::ClockTime::from_seconds(5))
    .context("Syncing NTP clock")?;
println!("Synced to NTP clock");

This code instantiates a GStreamer NTP clock and then synchronously waits up to 5 seconds for it to synchronize. If that fails then the application simply exits with an error.

Sender

On the sender side all that is needed is to configure the RTSP media factory, and as such the pipeline used inside it, to use the NTP clock

factory.set_clock(Some(&clock));

This causes all media inside the sender’s pipeline to be synchronized according to this NTP clock and to also use it for the NTP timestamps in the RTCP Sender Reports and the RTP header extension.

Receiver

On the receiver side the same has to happen

pipeline.use_clock(Some(&clock));

In addition, a couple more settings have to be configured on the receiver though. First of all, we configure a static latency of 2 seconds on the receiver’s pipeline.

pipeline.set_latency(gst::ClockTime::from_seconds(2));

This is necessary as GStreamer can’t know the latency of every receiver (e.g. different decoders might be used), and also because the sender latency can’t be automatically known. Each audio / video frame will be timestamped on the receiver with the NTP timestamp when it was captured / created, but since then all the latency of the sender, the network and the receiver pipeline has passed and for this some compensation must happen.

Which value to use here depends a lot on the overall setup, but 2 seconds is a (very) safe guess in this case. The value only has to be larger than the sum of sender, network and receiver latency and in the end has the effect that the receiver is showing the media exactly that much later than the sender has produced it.

And last we also have to tell rtpbin that

  1. the sender and receiver clocks are synchronized to each other, i.e. in this case both are using exactly the same NTP clock, and that no translation to the pipeline’s clock is necessary, and
  2. the outgoing timestamps on the receiver should be exactly the sender timestamps and that this conversion should happen based on the NTP / RTP timestamp mapping

source.set_property_from_str("buffer-mode", "synced");
source.set_property("ntp-sync", true);

And that’s it.

A careful reader will also have noticed that all of the above would also work without the RTP header extension, but then the receivers would only be synchronized once the first RTCP Sender Report is received. That’s what the test-netclock.c / test-netclock-client.c example from the GStreamer RTSP server is doing.

As usual with RTP, the above is by far not the only way of doing this and GStreamer also supports various other synchronization mechanisms. Which one is the correct one for a specific use-case depends on a lot of factors.

by slomo at May 02, 2022 01:00 PM

Víctor Jáquez: From gst-build to local-projects

Two years ago I wrote a blog post about using gst-build inside of WebKit SDK flatpak. Well, all that has changed. That’s the true upstream spirit.

There were two main reasons for the change:

  1. Since the switch to the GStreamer mono repository, gst-build has been deprecated. The mechanism in WebKit was added, basically, to allow working with GStreamer upstream, so keeping a gst-build directory just polluted the conceptual framework.
  2. By using gst-build one could override almost any other package in WebKit SDK. For example, for developing gamepad handling in WPE I added libmanette as a GStreamer subproject, to link a modified version of the library rather than the one in flatpak. But that approach added an unneeded level of conceptual depth to the tree.

In order to simplify these operations, by taking advantage of Meson’s subproject support directly, the gst-build handling was removed and a new mechanism was put in place: Local Dependencies. With local dependencies, you can add or override almost any dependency while flattening the tree layout, by placing GStreamer and any other library at the same level. Of course, in order to add dependencies, they must be built with Meson.

For example, to override libsoup and GStreamer, just clone both repositories under Tools/flatpak/local-projects/subprojects, and declare them in the WEBKIT_SDK_LOCAL_DEPS environment variable:

$ export WEBKIT_SDK_LOCAL_DEPS=libsoup,gstreamer-full
$ export WEBKIT_SDK_LOCAL_DEPS_OPTIONS="-Dgstreamer-full:introspection=disabled -Dgst-plugins-good:soup=disabled"
$ build-webkit --wpe

by vjaquez at May 02, 2022 11:11 AM

April 25, 2022

Sebastian Pölsterl: scikit-survival 0.17.2 released

I’m pleased to announce the release of scikit-survival 0.17.2. This release fixes several small issues with packaging scikit-survival and the documentation. For a full list of changes in scikit-survival 0.17.2, please see the release notes.

Most notably, binary wheels are now available for Linux, Windows, and macOS (Intel). This has been possible thanks to the cibuildwheel build tool, which makes it incredibly easy to use GitHub Actions for building those wheels for multiple versions of Python. Therefore, you can now use pip without building everything from source by simply running

pip install scikit-survival

As before, pre-built conda packages are available too, by running

conda install -c sebp scikit-survival

Support for Apple Silicon M1

Currently, there are no pre-built packages for Mac with Apple Silicon M1 hardware (also known as macos/arm64). There are two main reasons for that. The biggest problem is the lack of CI servers that run on Apple Silicon M1. This makes it difficult to systematically test scikit-survival on such hardware. Second, some of scikit-survival’s dependencies do not offer wheels for macos/arm64 yet, namely ecos and osqp.

However, conda-forge figured out a way to cross-compile packages, and does offer scikit-survival for macos/arm64. I tried to adapt their conda recipe, but have failed to achieve a working cross-compilation process so far. Cross-compiling with cibuildwheel would be an alternative, but currently doesn’t make much sense if the run-time dependencies ecos and osqp do not provide wheels for macos/arm64. I hope these issues can be resolved in the near future.

April 25, 2022 06:14 PM

April 20, 2022

Thomas Vander Stichele: Running Anthos inside Google

(Thomas Vander Stichele)

"With everyone and their dog shifting to containers, and away from virtual machines (VMs), we realized that running vendor-provided software on VMs at Google was slowing us down. So we moved."

Bikram co-authored this blog post last year about DASInfra's experience moving workloads from Corp to Anthos. The group I run at work is going down a similar path by migrating VMs to Anthos on bare metal for on-prem.

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at April 20, 2022 08:40 PM

April 18, 2022

Thomas Vander Stichele: What’s the next action?

(Thomas Vander Stichele)

"Without a next action, there remains a potentially infinite gap between current reality and what you need to do."

David Allen's Getting Things Done is the non-fiction book I've reread the most in my life. I reread it every couple of years and still pick up on new ideas that I missed before, or parts that resonate better now and I'm excited to implement. Before Google, I used to give this book to new employees as a welcome gift.

The book got an update in 2015, and I haven't read the new version yet, so I'm planning an extended GTD book club at work in Q2, spreading the book out over multiple sessions. (In fact, I did just that for the young adult version of the book with my 16-year-old godson back home in Belgium.) If you've run a GTD book club, drop me a line!

Find out more at Getting Things Done® - David Allen's GTD® Methodology

"Too many meetings end with a vague feeling among the players that something ought to happen, and the hope that it’s not their personal job to make it so. [...] ask “So what’s the next action on this?” at the end of each discussion point in your next staff meeting"

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at April 18, 2022 04:39 PM

April 16, 2022

Thomas Vander Stichele: Rebecca Solnit – Men Explain Things to Me

(Thomas Vander Stichele)

"Most women fight wars on two fronts, one for whatever the putative topic is and one simply for the right to speak, to have ideas, to be acknowledged to be in possession of facts and truths, to have value, to be a human being."

In honor of International Women's Day 2022 (this past March 8th), some quotes from the 2008 article that inspired the term "mansplaining": to comment on or explain something to a woman in a condescending, overconfident, and often inaccurate or oversimplified manner.

I've certainly been (and probably still am) guilty of this behavior, and this is a standing invitation to let me know when I'm doing it to you.

Read the original article with a new introduction at Men Explain Things to Me – Guernica

"None was more astonishing than the one from the Indianapolis man who wrote in to tell me that he had “never personally or professionally shortchanged a woman” and went on to berate me for not hanging out with “more regular guys or at least do a little homework first,” gave me some advice about how to run my life, and then commented on my “feelings of inferiority.” He thought that being patronized was an experience a woman chooses to, or could choose not to have–and so the fault was all mine. Life is short; I didn’t write back."

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at April 16, 2022 02:37 PM

April 14, 2022

Thomas Vander Stichele: Draft emails from Google Docs

(Thomas Vander Stichele)

In the ever more vertical company that Google is becoming, it is even more important to collaborate on some of your communication - more people want to contribute to the message and get it right, and more thought needs to be given to the ever wider audience you're sending mails to.

A while back I copied over Apps Script code from an internal Google project for sending meeting notes, to make a different tool which makes it easy to go from a Google Docs draft to a mail in Gmail and avoid embarrassing copy/paste errors. I'm happy to be able to retire that little side project in favor of a recently released built-in feature of Google Docs: Draft emails from Google Docs - Docs Editors Help

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at April 14, 2022 03:33 PM

April 07, 2022

Christian Schaller: Why is Kopper and Zink important? AKA the future of OpenGL

(Christian Schaller)

Since Kopper got merged today upstream I wanted to write a little about it as I think the value it brings can be unclear for the uninitiated.

Adam Jackson on our graphics team has been working for the last months, together with other community members like Mike Blumenkrantz, on implementing Kopper. For those unaware, Zink is an OpenGL implementation running on top of Vulkan, and Kopper is the layer that allows you to translate OpenGL and GLX window handling to Vulkan WSI handling. This means that you can get full OpenGL support even if your GPU only has a Vulkan driver available, and it also means you can for instance run GNOME on top of this stack thanks to the addition of Kopper to Zink.

During the lifecycle of the soon-to-be-released Fedora Workstation 36 we expect to allow you to turn on doing OpenGL through Kopper and Zink as an experimental feature, once we update Fedora 36 to Mesa 22.1.

So you might ask: why would I care about this as an end user? Well, initially you probably will not care much, but over time it is likely that GPU makers will stop developing native OpenGL drivers and just focus on their Vulkan drivers. At that point Zink and Kopper provide you with a backwards-compatibility solution for your OpenGL applications. And for Linux distributions it will also at some point help significantly reduce the amount of code we need to ship and maintain, as we can just rely on Zink and Kopper everywhere, which of course reduces the workload for maintainers.

This is not going to be an overnight transition though; Zink and Kopper will need some time to stabilize and further improve performance. At the moment performance is generally a bit slower than with the native drivers, although we have seen some examples of games that actually got better performance with specific driver combinations, and over time we expect to see the negative performance delta shrink. The delta is unlikely to ever fully go away due to the cost of translating between the two APIs, but on the other hand in a few years we are going to be in a situation where all current and new applications use Vulkan natively (or through Proton), and thus the stuff that relies on OpenGL will be older software, so combined with faster GPUs you should still get more than good enough performance. And at that point Zink will be a lifesaver for your old OpenGL-based applications and games.

by uraeus at April 07, 2022 06:14 PM

March 31, 2022

Thomas Vander Stichele: The COVID Cocoon

(Thomas Vander Stichele)

"The global COVID-19 pandemic has had countless impacts on society. One interesting effect is that it has created an environment in which many people have been able to explore their gender identity and, in many cases, undergo a gender transition. As organizations return to in-person work, be it full-time or hybrid, there is a greater chance of “out” transgender, non-binary, or gender non-conforming employees in the workforce." (From the "5 Ally Actions Newsletter - Mar 25, 2022")

March 31 is the Transgender Day of Visibility. The COVID Cocoon is a nickname given to the phenomenon of people discovering their gender diversity during the pandemic.

The full report is an interesting read; one recommendation that we can all contribute to is on Culture and Communication: Proactively communicating that gender diversity is accepted, asking staff for their input, and being open and ready to listen helps create a culture where employees can feel safe, welcome, and valued.

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at March 31, 2022 11:46 PM

March 28, 2022

Thomas Vander Stichele: Building a Second Brain

(Thomas Vander Stichele)

"Your Second Brain is for preserving raw information over time until it's ready to be used, because information is perishable. Your Second Brain is the brain that doesn't forget." - Tiago Forte

Personal Knowledge Management is going through a wave of innovation with new tools like Roam, Logseq, Obsidian, Notion, RemNote, and others gaining traction over Evernote, OneNote and the like. It's a great time to get curious or reacquaint yourself with the tools and processes that strengthen learning, processing, and expressing your knowledge work.

The expression "Second Brain" has been popularized by Tiago Forte, who's been running an online cohort-based class called Building a Second Brain. I took the class last year and found it a powerful distillation of an approach to PKM and note-taking. If you want to learn more, they just wrapped up the Second Brain Summit and posted all videos online: Second Brain Summit 2022 - Full Session Recordings - YouTube

The next class cohort is open for enrollment until March 30th midnight ET, at Building a Second Brain: Live 5-Week Online Course, and runs from April 12th to May 10th, 2022.

"Taking notes is the closest thing we have to time travel." - Kendrick Lamar

Taken from The Playlist - a curated perspective on the intersection of form and content (subscribe, discuss)

by Thomas at March 28, 2022 08:44 PM