Hardware/Software Co-Design for Efficient Microkernel Execution - Martin Děcký [FOSDEM 2019]

Microkernel multiserver systems are better than monolithic systems. This used to be a matter of opinion, but there is now growing scientific evidence that it is true. There is, however, a performance penalty.

An example of such evidence is a paper that studied a number of known vulnerabilities and checked whether they would still have been possible in a microkernel-based system.

The microkernel idea goes back to 1969.

OSes are typically written for CPUs after the CPU has been developed. Requirements for new ISAs usually follow the needs of the mainstream (monolithic) OSes running on previous ISAs. This creates a vicious circle that doesn't allow microkernels to escape the performance penalty. There has already been a lot of effort on the microkernel side to run better on existing hardware, but there hasn't been much work on designing a CPU for microkernels.

Martin has a number of rough ideas about what a CPU designed for microkernels should include.

In a monolithic kernel, control and data flow between subsystems as plain function calls, with arguments in registers or on the stack and data passed as direct pointers. In a multiserver microkernel, the same interaction goes through IPC: only a subset of registers can carry the payload, privilege level and address space must be switched, the scheduler may get involved, and bulk data requires either copying or memory sharing (at page granularity).
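To illustrate the difference, here is a minimal C sketch of the two call paths; all the names (vfs_read, ipc_call, FS_SERVER and friends) are hypothetical stand-ins, not the API of any particular system:

    #include <stddef.h>

    /* Hypothetical IPC primitives; stand-ins, not a real API. */
    int ipc_share_out(int server, void *buf, size_t len);
    int ipc_call(int server, int method, int arg, size_t len);
    int ipc_unshare(int server, void *buf, size_t len);
    int vfs_read(int inode, void *buf, size_t len);
    enum { FS_SERVER = 1, FS_READ = 1 };

    /* Monolithic kernel: a cross-subsystem request is an ordinary
     * function call; arguments travel in registers/stack and the
     * buffer is a direct pointer. */
    int read_monolithic(int inode, void *buf, size_t len)
    {
        return vfs_read(inode, buf, len);
    }

    /* Multiserver microkernel: the same request is an IPC round
     * trip: trap into the kernel, privilege and address-space
     * switch, possibly scheduling, and the buffer must either be
     * copied or shared at page granularity. */
    int read_multiserver(int inode, void *buf, size_t len)
    {
        ipc_share_out(FS_SERVER, buf, len);   /* map buffer pages */
        int rc = ipc_call(FS_SERVER, FS_READ, inode, len);
        ipc_unshare(FS_SERVER, buf, len);     /* unmap them again */
        return rc;
    }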

A richer jump instruction would be a good first step: a single instruction (or short sequence) that both jumps and switches the address space, without involving the kernel. The kernel would still set up the jump targets in advance by installing a "call gate" that holds the target address space and the entry point of the IPC handler, with an analogous gate for returns. A dedicated TLB-like structure could cache these gates. For asynchronous IPC, CPU cache lines could serve as message buffers, giving a transfer granularity between a register and a page.
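A rough C sketch of what such a gate descriptor and a cache-line message buffer might look like; the layout and field names are invented for illustration:

    #include <stdint.h>

    /* Hypothetical call-gate descriptor, set up by the kernel and
     * consulted by a combined "jump and switch address space"
     * instruction; a small dedicated TLB would cache hot gates. */
    struct call_gate {
        uint64_t target_asid;   /* address space to switch into */
        uint64_t entry_point;   /* IPC handler entry in that space */
        uint64_t return_gate;   /* analogous descriptor for the return */
        uint64_t flags;         /* valid bit, permitted callers, ... */
    };

    /* Asynchronous IPC: a message buffer sized and aligned to one
     * cache line, a granularity between a register and a page. */
    struct ipc_msg {
        uint64_t payload[8];    /* 64 bytes, one typical cache line */
    } __attribute__((aligned(64)));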

For bulk data, memory sharing is actually quite efficient; the overhead comes mainly from setting up and tearing down the shared pages, and from the page-size granularity. MMU support for sub-page granularity, e.g. at cache-line size, would address this. To make it work, a separate virtual-to-cache-line mapping table could be set up.
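A sketch of one entry of such a mapping table, assuming 64-byte cache lines; the structure is purely illustrative:

    #include <stdint.h>

    /* Hypothetical MMU extension: next to the regular page table, a
     * separate table maps individual cache-line-sized chunks, so two
     * address spaces can share a few cache lines without sharing a
     * whole page. All field names are invented. */
    struct subpage_map_entry {
        uint64_t virt_line;     /* virtual address >> 6 (64 B lines) */
        uint64_t phys_line;     /* physical cache line it maps to */
        uint8_t  writable;      /* sub-page access rights */
        uint8_t  valid;
    };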

A multiserver system has many more active threads than a monolithic one, so fast context switching is essential, yet current techniques mostly hide the latency rather than remove it. Today there are two mechanisms: hardware multithreading (which doesn't scale beyond a few threads per core) and OS-level context switching (which incurs a much longer latency, multiple microseconds). The proposed solution is hardware support for an unlimited number of execution contexts, with a TLB-like structure for efficient access and dedicated instructions to save and restore a context. The same mechanism could also improve userspace interrupt processing: interrupts could truly be handled in userspace, instead of needing a privileged handler that merely wakes up a thread.
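One way such instructions might be exposed to software, sketched as C stubs; ctx_save, ctx_restore and the context-ID type are hypothetical:

    #include <stdint.h>

    typedef uint64_t hw_ctx_id_t;  /* one of an unlimited number of
                                      hardware execution contexts */

    /* Stand-ins for hypothetical CTXSAVE/CTXRESTORE instructions;
     * context state would be cached through a TLB-like structure. */
    void ctx_save(hw_ctx_id_t ctx);
    void ctx_restore(hw_ctx_id_t ctx);

    /* A context switch without entering the kernel. */
    static inline void hw_ctx_switch(hw_ctx_id_t cur, hw_ctx_id_t next)
    {
        ctx_save(cur);
        ctx_restore(next);
    }

    /* Userspace interrupt handling: the hardware could resume a
     * registered context directly on an interrupt, instead of a
     * privileged handler that merely wakes up a thread. */
    void irq_bind_context(int irq, hw_ctx_id_t handler);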

Support for capabilities could also be improved, for example with an unforgeable object identifier embedded in a reference (pointer), which would make capabilities easier to manage. A 128-bit RISC-V (RV128) pointer would have plenty of bits for storing the capability reference.
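A sketch of how such a reference might be packed on RV128; the bit split is invented for illustration:

    #include <stdint.h>

    /* Hypothetical 128-bit capability reference on RV128: the low
     * 64 bits stay an ordinary virtual address, the high 64 bits
     * carry an unforgeable object identifier plus access rights,
     * checked by hardware on every dereference. */
    struct cap_ref {
        uint64_t address;         /* plain virtual address */
        uint64_t object_id : 48;  /* unforgeable object identifier */
        uint64_t rights    : 15;  /* read / write / grant / ... */
        uint64_t valid     : 1;   /* cleared by non-capability writes */
    };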

For most of the ideas presented, some prior art exists (cf. the slides for references). For cross-address-space IPC, for example, the VT-x virtualization extensions already allow something similar, but limited to 512 entries.

This is more of an opinion piece, intended to start a discussion.

Martin has worked on HelenOS as an enthusiast since 2004.