Linux system calls

For a long time I felt there was something unique about Linux. Something was different, but I couldn't quite figure out what it was. Only when I learned about Linux's system call interface did I finally understand.

This article contains everything I've learned.

Why Linux system calls?

Programs usually interface with the kernel through libraries provided by the operating system, most commonly libc. Through these libraries, they have access to system functions. POSIX-compliant operating systems provide read, write and numerous others. Windows has the Win32 API and its many DLLs and functions.

These operating systems consist of strongly connected kernel and user space components, developed and distributed as one unit. The user space libraries are the only supported means of using the system. User space programs are not meant to talk to the kernel directly; they're meant to use the provided system libraries. This forces them to depend on and link against such libraries.

While it is often possible to interface with the kernel directly, the kernel interface is unstable and subject to change. Software that insists on using that interface might simply stop working if, or rather when, it changes.

Then there's the whole "reinventing libc" insanity, even on macOS (where no such ABI stability was guaranteed, but they did it anyway, and that ended up with a macOS update breaking all Go apps). On Windows they can't get away with that, so they use Cgo instead

marcan_42, Hacker News, Sept 4, 2021

As I understand it, Go currently has its own syscall wrappers for Darwin. This is explicitly against what Apple recommends, precisely because they're not willing to commit to a particular syscall ABI. This leads to issues like #16570, and although we've been lucky in that things have generally been backward-compatible so far, there's no guarantee that it'll continue to happen. It doesn't seem inconceivable to me that we'd at some point end up having to specify "build for macOS 10.13+" vs. "build for 10.12 and below", for example.

copumpkin, golang/go GitHub issue #17490, Oct 17, 2016

Sometimes it's not even possible to use system calls at all. OpenBSD has implemented system call origin verification, a security mechanism that only allows system calls originating from the system's libc. So not only is the kernel ABI unstable, normal programs are not even allowed to interface with the kernel at all.

The eventual goal would be to disallow system calls from anywhere but the region mapped for libc, but the Go language port for OpenBSD currently makes system calls directly.

Switching Go to use the libc wrappers (as is already done on Solaris and macOS) is something that De Raadt would like to see. It may allow dropping the main program text segment from the list of valid regions in the future:

It is an ABI break for the operating system, but that is no big deal for OpenBSD. De Raadt said: we here at OpenBSD are the kings of ABI-instability. He suggested that relying on the OpenBSD ABI is fraught with peril:
Program to the API rather than the ABI. When we see benefits, we change the ABI more often than the API. I have altered the ABI. Pray I do not alter it further.

Jake Edge, OpenBSD system-call-origin verification, LWN, December 11, 2019

Linux is different

One of the things that make the Linux kernel interesting is the fact it has a stable kernel-userspace interface. Unlike virtually every other kernel and operating system, Linux guarantees stability at the binary interface level.

This interface matches much of the POSIX interface and is based on it and other Unix based interfaces. It will only be added to over time, and not have things removed from it.

torvalds/linux, Documentation/ABI/stable/syscalls, 2006-06-21

This is rooted in the fact Linux is a kernel, not a complete operating system as traditionally defined. As an independent component, it must have a stable interface to user space software if anything is to be built upon it.

While many people argue that Linux is not an operating system, there's no question that Linux is a platform and that it is possible to safely build directly upon it. There is no actual need to depend on anything else. Not even libc.

How Linux system calls work

Processor instruction set architectures contain special instructions for calling the kernel. These instructions cause the processor to switch to kernel mode and execute code at a predefined location in the kernel.

At least one parameter must be provided: the system call number, often referred to as the NR. Linux uses this number as an index into a table of function pointers to find the function being called. Any other arguments are passed to this function.
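The table lookup can be pictured with a small user space model. This is only a sketch of the idea and not kernel code: the table, the handler names and the numbers 0 and 1 are all made up for illustration. The one detail taken from Linux is that an unknown number yields -ENOSYS.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type: up to six register-sized arguments. */
typedef long (*demo_handler)(long, long, long, long, long, long);

/* Illustrative handlers; real kernel functions do actual work. */
static long demo_add(long a, long b, long c, long d, long e, long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return a + b;
}

static long demo_negate(long a, long b, long c, long d, long e, long f)
{
	(void)b; (void)c; (void)d; (void)e; (void)f;
	return -a;
}

/* The number is nothing more than an index into this table. */
static const demo_handler demo_table[] = {
	demo_add,    /* number 0 (made up) */
	demo_negate, /* number 1 (made up) */
};

long demo_dispatch(long number,
                   long a, long b, long c, long d, long e, long f)
{
	size_t count = sizeof(demo_table) / sizeof(demo_table[0]);

	/* Numbers outside the table are rejected with -ENOSYS. */
	if (number < 0 || (size_t)number >= count)
		return -ENOSYS;

	return demo_table[number](a, b, c, d, e, f);
}
```

Calling demo_dispatch(0, 2, 3, 0, 0, 0, 0) selects demo_add and passes the remaining arguments straight through, which is all the number is used for.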

These parameters are passed to the kernel in registers. The kernel also returns a result value in a register. Which registers are used for which parameters and which register contains the return value defines the Linux system call calling convention.

This calling convention is stable, allowing user space programs to use it without fear of breakage. It is defined at the instruction set level and so it is also programming language agnostic. All user space programs written in any language may make use of it. Typically, programs call libc functions which implement this calling convention. However, that is not actually a requirement. It's perfectly possible for a compiler to directly emit code following that calling convention: it could have support for a system_call keyword. A JIT compiler could generate code for this at runtime.

The journalists at LWN have written detailed articles about the implementation of Linux system calls. They are definitely worth reading.

The calling convention is documented here:

Implementing a system call function

In order to make a system call, the parameters must be placed in the appropriate registers, the system call instruction must be executed and the return value must be collected from the return register. System calls support a maximum of six arguments.

Since the registers and system call instruction vary by architecture, separate functions are needed for each architecture. Despite this, it is simple to write a C function that can make any system call.

For example, a system call function for the x86_64 architecture:

long
linux_system_call_x86_64(long number,
                         long _1, long _2, long _3,
                         long _4, long _5, long _6)
{
	register long rax __asm__("rax") = number; /* system call number in, return value out */
	register long rdi __asm__("rdi") = _1;     /* argument 1 */
	register long rsi __asm__("rsi") = _2;     /* argument 2 */
	register long rdx __asm__("rdx") = _3;     /* argument 3 */
	register long r10 __asm__("r10") = _4;     /* argument 4 */
	register long r8  __asm__("r8")  = _5;     /* argument 5 */
	register long r9  __asm__("r9")  = _6;     /* argument 6 */

	__asm__ volatile
	("syscall"
		: "+r" (rax),
		  "+r" (r8), "+r" (r9), "+r" (r10)
		: "r" (rdi), "r" (rsi), "r" (rdx)
		: "rcx", "r11", "cc", "memory"); /* syscall itself overwrites rcx and r11 */

	return rax;
}

All parameters and the return value are of type long. This really just means "register": all values passed to the kernel must fit in registers and typically long is register sized. This means all arguments must either be simple values or pointers to more complex structures.
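As a sketch of what passing "simple values or pointers" looks like in practice, here is a small wrapper around the write system call. The function is repeated so the block compiles on its own. The name demo_write and the NR_write macro are mine, not part of any library; 1 is the number of write on x86_64.

```c
/* Same function as in the article, repeated so this block stands alone. */
static long
linux_system_call_x86_64(long number,
                         long _1, long _2, long _3,
                         long _4, long _5, long _6)
{
	register long rax __asm__("rax") = number;
	register long rdi __asm__("rdi") = _1;
	register long rsi __asm__("rsi") = _2;
	register long rdx __asm__("rdx") = _3;
	register long r10 __asm__("r10") = _4;
	register long r8  __asm__("r8")  = _5;
	register long r9  __asm__("r9")  = _6;

	__asm__ volatile
	("syscall"
		: "+r" (rax),
		  "+r" (r8), "+r" (r9), "+r" (r10)
		: "r" (rdi), "r" (rsi), "r" (rdx)
		: "rcx", "r11", "cc", "memory");

	return rax;
}

/* Number of the write system call on x86_64. */
#define NR_write 1

/* Hypothetical helper: the descriptor and length are plain values,
 * the buffer is a pointer converted to a register-sized integer. */
long demo_write(int fd, const void *buffer, unsigned long length)
{
	return linux_system_call_x86_64(NR_write,
	                                fd, (long)buffer, (long)length,
	                                0, 0, 0);
}
```

demo_write(1, "hello\n", 6) writes to standard output and returns the number of bytes written. Unused argument slots are simply filled with zeros; the kernel ignores registers the system call does not use.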

The function ensures all arguments are placed in the appropriate registers by assigning them to local variables annotated with an inline assembly directive which tells the compiler which register to choose. The register keyword is not decoration here: GCC's local register variable extension requires it, and the register assignment is only guaranteed when the variable is used as an operand of an inline assembly statement, as it is in this function.

The x86_64 architecture contains the aptly named syscall instruction which switches to kernel mode and enters the kernel entry point. Other architectures have different instructions. For example, aarch64 uses svc #0 instead.
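For comparison, this is roughly what the same function looks like on aarch64. Treat it as a sketch: the function name is my own, and the whole block is guarded so it only compiles on that architecture. On aarch64 the system call number goes in x8, the arguments in x0 through x5, and the result comes back in x0.

```c
#if defined(__aarch64__)
long
linux_system_call_aarch64(long number,
                          long _1, long _2, long _3,
                          long _4, long _5, long _6)
{
	register long x8 __asm__("x8") = number; /* system call number */
	register long x0 __asm__("x0") = _1;     /* argument 1 in, return value out */
	register long x1 __asm__("x1") = _2;
	register long x2 __asm__("x2") = _3;
	register long x3 __asm__("x3") = _4;
	register long x4 __asm__("x4") = _5;
	register long x5 __asm__("x5") = _6;

	__asm__ volatile
	("svc #0"
		: "+r" (x0)
		: "r" (x8),
		  "r" (x1), "r" (x2), "r" (x3), "r" (x4), "r" (x5)
		: "cc", "memory");

	return x0;
}
#endif
```

The structure is identical to the x86_64 version. Only the register names and the instruction change, which is why one small function per architecture is enough.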

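The compiler is informed via the extended inline assembly construct that this instruction takes 7 input values, produces 1 result and clobbers some registers, the condition codes and memory. The 7 inputs are the previously assigned system call number and parameter registers. Of these, rax, r8, r9 and r10 are declared with "+r", which tells the compiler to treat them as both read and written, while rdi, rsi and rdx are input only. The result is the return value, which is placed in rax, overwriting the system call number. rcx and r11 are listed as clobbered because the syscall instruction itself overwrites them with the return address and the saved flags.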

After the system call has been made, all that's left to do is to return the result. It may be a valid value or a negated errno constant. The various libcs normalize those error values and place them in a global or thread local errno variable. That's not necessary when using Linux system calls directly!
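Telling the two cases apart takes one comparison. The helper below is a minimal sketch with a name of my own choosing. It assumes the convention the libcs rely on, as far as I know: results from -4095 to -1 are negated errno values and everything else is a valid result.

```c
/* Largest errno value the kernel encodes in a return value
 * (assumption: mirrors the kernel's MAX_ERRNO). */
#define DEMO_MAX_ERRNO 4095

/* Hypothetical helper: returns the errno value carried by a raw
 * system call result, or 0 if the result is not an error. */
int demo_error_of(long result)
{
	if (result < 0 && result >= -DEMO_MAX_ERRNO)
		return (int)-result;

	return 0;
}
```

The range check matters because some system calls, mmap for example, return pointers, and a valid pointer converted to long could be negative on some architectures. Checking only for result < 0 would misreport those as errors.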