elektito programming & stuff

LRFS Part 4: Early Userspace: initrd and initramfs

This is part of a series of articles. You can find the first part here.

Although init is considered the beginning of the Linux userspace, this is not technically the case. There are other facilities that are part of the userspace and run even before init. Using these is not mandatory, and as it happens we are not going to use them in our distro (not now, at least), but I thought it might be educational to examine them here.

Why?

So why do we need something to run before init? Here are a few cases in which this might be necessary:

  • Mounting the root file system might require drivers that are not built into the kernel image, but are instead built as loadable kernel modules and reside on the very file system we are going to mount. It is, of course, possible to build these into the kernel, but we might want to keep the kernel from becoming too large. This was especially true in the past, when RAM was more limited than it is now.

  • The root file system might be encrypted, in which case it needs to be decrypted before it can be mounted.

  • The system might be in hibernation and need special treatment before waking up.

The kernel provides two facilities for running a small userspace before the actual init: these are called initrd and initramfs, the latter being a more recent addition to the kernel, although both have been there for quite some time now.

The names initrd and initramfs, although referring to distinct facilities, are frequently used interchangeably.

  • initrd is a file system image that is mounted as root. It usually contains an executable named /linuxrc that is run after the image is mounted. After performing any necessary preparations and mounting the real root file system in a temporary location, this program then uses the pivot_root system call to switch to the new root and then unmounts initrd.

  • initramfs is a file archive that becomes the root file system. An executable called /init on this archive is then run by the kernel, effectively becoming init (that is, PID 1). This can continue as an init, or later mount a new root and exec the real init.

In the next two sections, we will examine each of these carefully and see some examples.

initrd

We will be using the following C program as the “early init”:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int
main(int argc, char *argv[])
{
  printf("Early Init\n");
  printf("----------\n");
  printf("PID=%d UID=%d\n", getpid(), getuid());
  printf("----------\n");
  for(;;);

  return 0;
}

Save this as earlyinit.c and compile it statically:

gcc earlyinit.c -static -Wl,-s -o linuxrc

Now create a file system image and copy linuxrc to it:

dd if=/dev/zero of=image.img bs=20M count=1
mke2fs image.img
sudo mount image.img /mnt
sudo cp linuxrc /mnt
sudo umount /mnt

You can also compress this file:

gzip -9 image.img

You probably won’t be able to use your distribution’s kernel for this, as the support needed for an old-style initrd (in particular, the RAM disk driver) is likely not built in. Build a new kernel from source and make sure the CONFIG_BLK_DEV_INITRD and CONFIG_BLK_DEV_RAM options are set to y in the .config file.
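
The relevant lines in .config should end up looking like this:

CONFIG_BLK_DEV_INITRD=y
CONFIG_BLK_DEV_RAM=y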

Like in previous sections, we will be using qemu to run the kernel and initrd. qemu has a -initrd option that we can use:

qemu-system-x86_64 -kernel /path/to/bzImage \
                   -initrd image.img.gz \
                   -enable-kvm \
                   -append "console=ttyS0" \
                   -nographic

Take a look at the output. Notice the PID reported in the output. It is not 1. An initrd is not an init.

A real initrd would prepare things so that the real root file system can be mounted, for example by loading the necessary kernel modules. When linuxrc returns, the kernel mounts the real root (the one named by the root= kernel parameter) and proceeds to run init from it (/sbin/init, for example).

Let’s try this. Update the C program above like this:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <string.h>

int
main(int argc, char *argv[])
{
  printf("Init\n");
  printf("----------\n");
  printf("PID=%d UID=%d argv[0]=%s\n", getpid(), getuid(), argv[0]);
  printf("----------\n");
  if (strcmp(argv[0], "linuxrc") == 0) {
    /* running as initrd */
    return 0;
  } else {
    for(;;);
  }
}

We are going to use this as both linuxrc and init. Recompile like before, and update the image like this:

gunzip image.img.gz
sudo mount image.img /mnt
sudo cp linuxrc /mnt
sudo mkdir /mnt/sbin
sudo cp linuxrc /mnt/sbin/init
sudo umount /mnt

We won’t be compressing the image this time, since qemu does not accept a compressed image as an argument to -hda. Run qemu like this:

qemu-system-x86_64 -kernel /path/to/bzImage \
                   -initrd image.img \
                   -enable-kvm \
                   -hda image.img \
                   -append "console=ttyS0 root=/dev/sda" \
                   -nographic

Again, we are not actually mounting the real root file system in linuxrc here. So in this case, when linuxrc returns, the kernel mounts /dev/sda (the very same image) as root and runs /sbin/init from it. In the output, you will see two invocations of our program: one as linuxrc, the other as /sbin/init.
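
Of course, our toy linuxrc does no real work. Purely for illustration, here is a rough, untested sketch of what an early init that mounts the real root itself and switches to it with pivot_root could look like. The device name, file system type and paths are all assumptions, and most error handling is left out:

#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>

int
main(int argc, char *argv[])
{
  /* ...load any modules needed for the root device here... */

  /* mount the real root in a temporary location */
  mkdir("/new-root", 0755);
  if (mount("/dev/sda", "/new-root", "ext4", 0, NULL) < 0) {
    perror("mount");
    return 1;
  }

  /* switch to the new root; pivot_root has no glibc wrapper,
     so it is called through syscall(2) */
  mkdir("/new-root/initrd", 0755);
  chdir("/new-root");
  if (syscall(SYS_pivot_root, ".", "initrd") < 0) {
    perror("pivot_root");
    return 1;
  }
  chdir("/");

  /* the old root (the initrd) is now visible at /initrd; detach it */
  umount2("/initrd", MNT_DETACH);

  /* hand over to the real init */
  execl("/sbin/init", "/sbin/init", (char *)NULL);
  perror("execl");
  return 1;
}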

initramfs

Originally, initramfs was supposed to be an archive embedded into the Linux kernel itself. This archive is unpacked into the root file system, and the /init file inside it is executed as init (i.e. with PID 1).
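
As an aside, this embedding is controlled by the CONFIG_INITRAMFS_SOURCE kernel option which, as far as I can tell, can point to an existing cpio archive or to a directory whose contents get packed into one at build time; something like:

CONFIG_INITRAMFS_SOURCE="/path/to/initramfs.cpio"

We won’t be doing it that way here, though.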

What this incarnation of early init does is slightly different from initrd. First of all, an initramfs cannot be unmounted. Instead of pivoting away from it, the init program usually deletes all of the initramfs contents at the end, chroots into the real root file system, and invokes the real init using one of the exec system calls. pivot_root cannot be used in an initramfs. klibc and busybox each have a utility (called run-init and switch_root respectively) that helps initramfs writers with the usual tasks (deleting the old files, chrooting and exec'ing, among other things).
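
Stripped of error handling and of the deletion of the old files, the core of what run-init and switch_root do could be sketched roughly like this (the /newroot mount point and the path of the real init are assumptions; this is not the actual source of either tool):

#include <unistd.h>
#include <sys/mount.h>

int
main(int argc, char *argv[])
{
  /* the real root is assumed to be mounted at /newroot already */
  chdir("/newroot");

  /* move that mount on top of / ... */
  mount(".", "/", NULL, MS_MOVE, NULL);

  /* ...and make it our root */
  chroot(".");
  chdir("/");

  /* become the real init; since /init is PID 1,
     the exec'd program keeps PID 1 */
  execl("/sbin/init", "/sbin/init", (char *)NULL);
  return 1;
}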

In order to try this, we’ll revert to the original version of our simple early init:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int
main(int argc, char *argv[])
{
  printf("Early Init\n");
  printf("----------\n");
  printf("PID=%d UID=%d\n", getpid(), getuid());
  printf("----------\n");
  for(;;);

  return 0;
}

Compile this as init:

gcc earlyinit.c -o init -static -Wl,-s

Now we have to create an archive, instead of a file system image. This is a cpio archive. The concept is very similar to the more widely used tar archive. Let’s create an archive with init as its only content:

echo init | cpio -o -H newc | gzip -9 >initramfs

The -H flag specifies the cpio format variant (newc) that the kernel expects. Now, as we said before, initramfs is an archive embedded inside the kernel, and indeed when building a kernel you can configure an initramfs to be built into it. However, there is a simpler way: when the kernel is given a cpio archive instead of a file system image as its initrd, it uses the archive as if it were a built-in initramfs.

So in effect, you can pass the initramfs archive like an initrd to qemu. Let’s try it:

qemu-system-x86_64 -kernel /path/to/bzImage \
                   -initrd initramfs \
                   -enable-kvm \
                   -append "console=ttyS0" \
                   -nographic

You will see that this time our program is run with PID 1. It can simply do its work and exec the real init in the end.

Tools for initramfs writers

If the initramfs is written in C, an alternative C standard library is in many cases used in place of glibc, which is feature-rich but very large. musl is one popular implementation of the C standard library that is used when size is important. klibc is another; although it does not implement the full extent of the standard library, it has been written specifically for early-init programs. Both provide wrapper scripts for building against them. For musl, you can use the musl-gcc script:

musl-gcc earlyinit.c -static -Wl,-s -o linuxrc

while for klibc you can use klcc:

klcc earlyinit.c -static -Wl,-s -o linuxrc

Both produce much smaller executables than linking against glibc.

In many cases, the early init program is in fact a shell script, so a shell and a number of utilities need to be included. You can use bash and GNU coreutils for this, but again these are quite large, and most of their features are probably unnecessary for a small initramfs script.

busybox is one alternative; it includes a shell and a large number of utilities, including the previously mentioned switch_root.

klibc also comes with a number of utilities, which are more limited, but also smaller, than the ones that come with busybox. It also includes the run-init utility, which helps with wrapping up the work in the initramfs.

A note on Ubuntu’s initramfs

If you try taking a peek at the initramfs on an Ubuntu system with cpio, a few files are extracted and then you’ll receive an error message. At least, that’s how it is on Ubuntu 16.04 and 18.04 where I tried this. This is because the Ubuntu initramfs is actually two cpio archives put one after the other in a single file.

Ubuntu 18.04 comes with an unmkinitramfs utility (installed with the initramfs-tools-core package) capable of extracting the contents of this initramfs. It’s a shell script so you can take a look at it and see how it actually works.

Wrapping up

supermin contains a small initramfs program that can be very educational to look at. Just get the source code and open the init/init.c file.

As I said in the beginning, we are not going to use either initrd or initramfs in our distro-to-be. So we’ll just carry on.

LRFS Part 3: Init

This is part of a series of articles. You can find the first part here.

In the first part of this series, we built a kernel and ran it with a very minimal (and useless) init program. We then built bash and used that as init. Let’s go back to our init program and see how we can make a more proper init system.

What is init?

The first process to be started by the kernel is called init. This process always has the Process ID (PID) of 1 and has a number of special properties:

  • It should keep running up until the system shuts down. If init is terminated, the kernel will panic.

  • All orphan processes are re-parented to init*. These are processes whose parents have terminated before them. When an orphan itself terminates, it becomes a zombie. Init is tasked with “reaping” these zombies, so that their resources can be freed.

  • Signals without a signal handler do not have any default behavior for init. As an example, a process that does not handle SIGTERM is terminated by default when it receives that signal. If init receives SIGTERM and has no signal handler for it, however, the signal is simply ignored.

Init is the process that starts the Linux userland. Everything from the login prompt to your shell and your desktop environment is directly or indirectly started by init.

Note, however, that technically speaking, the only thing init “has to” do is reap zombie processes. Today’s init systems, though, do a lot more; starting and managing services is among the most important of their tasks.
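
To get a feeling for how small that minimum really is, here is a rough sketch of a bare-bones reaper. This is not hello; as we will see below, hello uses a SIGCHLD handler instead:

#include <sys/wait.h>
#include <unistd.h>
#include <errno.h>

int
main(void)
{
  for (;;) {
    /* block until any child terminates, and reap it */
    if (wait(NULL) == -1 && errno == ECHILD)
      sleep(1);   /* no children at the moment; check again later */
  }
}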

Our init system, called hello, does little though. For now, it is going to:

  • Run a startup script.
  • Start a shell.
  • Keep running and reap zombie processes.

Let’s do it then.

The code

Here is our expanded init system, in all its glory:

#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

static void
handle_sigchld(int sig) {
  int saved_errno = errno;

  /* reap orphaned children, passed away. rip. */
  while (waitpid((pid_t)(-1), 0, WNOHANG) > 0) {}

  errno = saved_errno;
}

static int
run_program(const char *path)
{
  pid_t child;
  int ret;
  siginfo_t info;

  child = fork();
  if (child) {
    waitid(P_PID, child, &info, WEXITED);
    return info.si_status;
  } else {
    execl(path, path, NULL);
    printf("Could not run: %s\n", path);
    printf("    %s\n", strerror(errno));
    exit(255);
  }
}

static void
launch_login(void)
{
  if (!fork()) {
    execl("/bin/bash", "/bin/bash", NULL);
  }
}

int
main(int argc, char *argv[])
{
  struct sigaction sa;

  /* register sigchld handler */
  sa.sa_handler = &handle_sigchld;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
  if (sigaction(SIGCHLD, &sa, 0) == -1) {
    printf("Could not install signal handler. Aborting.\n");
    return 1;
  }

  printf("Hello, World!\n");
  printf("This is hello, your friendly init system!\n");

  printf("Attempting to run your rc.local...\n");
  run_program("/etc/rc.local");

  printf("Launching your shell...\n");
  launch_login();

  for (;;) {
    usleep(600 * 1000000);
  }

  return 0;
}

As you can see, we start by adding a SIGCHLD handler. SIGCHLD is sent to a process whenever something interesting happens to one of its children. Here we explicitly ask not to be notified when a child merely stops or continues, but only when one of them terminates (that is what SA_NOCLDSTOP does).

We then display a friendly startup message and run the script located at /etc/rc.local. After that, we launch bash as the shell and go to sleep. From here on, the only thing init does is handle SIGCHLD and wait on its child processes, so that their resources can be freed by the kernel.

Try it

As before, rebuild the package and add the contents to the image file and then run the result in qemu:

qemu-system-x86_64 -kernel /path/to/bzImage \
              -append "root=/dev/sda init=/bin/bash console=ttyS0" \
              -hda /path/to/image.img \
              -enable-kvm \
              -nographic \
              -serial mon:stdio

You can use the tools in the /tools directory for building the package and creating the rootfs.

The source

You can find the source code for hello and all the other tools and packages talked about in this series here on Github.

* Technically, that is not always correct. In more recent kernel versions, a process can become a “sub-reaper”, and orphans may be re-parented to it instead. In the absence of sub-reapers, though, orphans are re-parented to init.
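
For the curious, a process volunteers to become such a sub-reaper with the prctl system call (available since Linux 3.4). A minimal sketch, not something hello uses:

#include <sys/prctl.h>

static void
become_subreaper(void)
{
  /* mark the calling process as a "child sub-reaper"; orphaned
     descendants are then re-parented to it instead of to init */
  prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
}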

LRFS Part 2: Adding a shell and packages

This is part of a series of articles. You can find the first part here.

In the first part of this series, we built a kernel and ran it with a very minimal (and useless) init program. That’s not very useful. Let’s add a shell.

But before that, let’s backtrack a bit. Remember we briefly talked about repeatability in the last post. Let’s get back to that and later see what it has to do with us wanting to add a shell to our distro.

Repeatability

So what’s repeatability and why is it important? Repeatability is the quality of a process that assures us we will arrive at the same results every time we follow it, whether we do it now or ten years from now.

How are we supposed to do that? By keeping a record of every little thing that has a meaningful impact on our results. This includes configuration, build options, patches, environment, etc.

I am going to put everything in a git repository. Every piece (which we are going to call a package) will be in its own sub-directory. In that sub-directory, we’ll have a file named pkg.json which describes the package, and a Makefile that contains build and installation instructions. The pkg.json file will look something like this:

{
    "version": "1.0.0",
    "source": {
        "type": "local"
    }
}

This tells the build script what the current package version number is and how to obtain the source code (local in this case, since the code is included right there in the package directory).

The Makefile should contain at least the following targets: all, which will be used for building the package, and install, which will install the package files to the path given by the INSTDIR environment variable.

But this is not all that is needed for repeatable builds. Another factor that might affect the build over longer periods of time is the set of tools in use. It might happen, for example, that a warning added in a new version of gcc breaks a build in which all warnings are treated as errors.

However, it sounds a bit impractical to me to have tools like compilers and linkers as part of the package. A better approach could be to use the same version of the tools for all the packages at every point in time. In order to do so, we’ll simply describe the build environment for all packages in the root of the pkgs directory. We’ll use a build.json file like this:

{
    "env": {
        "name": "ubuntu",
        "version": "18.04"
    }
}

Given that there are generally no breaking changes in development tools in a single version of Ubuntu, this should work for now.

The directory tree looks like this at the moment:

/
/pkgs/
/pkgs/build.json
/pkgs/kernel/
/pkgs/kernel/pkg.json
/pkgs/kernel/Makefile
/pkgs/kernel/config
/pkgs/hello/
/pkgs/hello/pkg.json
/pkgs/hello/Makefile
/pkgs/hello/hello.c
/tools/
/tools/build
/tools/build-rootfs
/tools/run
/README.md

As you can see, we have two packages right now, kernel and hello, both of which are in the pkgs directory. We’ll also have a top-level tools directory, which contains a script for building the packages, a script for building a root file system from a list of packages, and a script to run everything in qemu.

The build script

The build script (aptly named build), located in the /tools directory, builds one or more packages, according to the instructions in the pkg.json file and the Makefile. Apart from “local” source code, it also supports downloading a source tarball or obtaining it from a git repository.

The build script also supports applying one or more patches to the code before building it. For each package, a .tar.xz file is created which contains all the files needed for installation.

The script needs a number of tools to run:

  • jq: A versatile, command-line JSON parser.
  • awk: For text processing.
  • lsb_release: For getting information about the build environment.
  • fakeroot: Runs another program in an environment with fake root privileges for file manipulation. This is needed because install scripts sometimes need to change a file’s group, something that only root can do; but we do not want our build system to run as root, so we use fakeroot when creating the package. Extracting the packages, however, will obviously need real root permissions.

In order to build the hello package, for example, go to the tools directory and run ./build hello. When the script is done running, you should have a hello-1.0.0.tar.xz package in your current directory.

And a shell

Here we are at last. We are going to build bash as our shell. This is the pkg.json file to use:

{
    "version": "4.4.23",
    "source": {
        "type": "dl",
        "location": "https://ftp.gnu.org/gnu/bash/bash-4.4.tar.gz",
        "inner_dir": "bash-4.4"
    },
    "patches": {
        "options": "-p0",
        "apply_dir": ".",
        "files": [
            "bash44-001",
            "bash44-002",
            "bash44-003",
            "bash44-004",
            "bash44-005",
            "bash44-006",
            "bash44-007",
            "bash44-008",
            "bash44-009",
            "bash44-010",
            "bash44-011",
            "bash44-012",
            "bash44-013",
            "bash44-014",
            "bash44-015",
            "bash44-016",
            "bash44-017",
            "bash44-018",
            "bash44-019",
            "bash44-020",
            "bash44-021",
            "bash44-022",
            "bash44-023"
        ]
    }
}

This is a bit more complicated than the one we saw before. Let’s see what it does. First, we are saying this is the 4.4.23 release of bash, which happens to be the latest release at the moment. You can find all bash releases here.

After that, we determine how the source code is to be obtained. The value dl for type means that a tarball is going to be downloaded. The location field determines the download address and the inner_dir field tells the build system where in the tarball the source code resides.

We then have a list of patches, twenty-three of them, that need to be applied to the 4.4 release, so that we arrive at 4.4.23 (all of these are downloaded from the bash release page mentioned before).

Then there’s the Makefile:

all:
    cd ../_src_ && ./configure --prefix=/usr --exec-prefix= --enable-static-link && $(MAKE)

install:
    $(MAKE) -C ../_src_ install DESTDIR=${INSTDIR}

.PHONY: all install

(Yes, my code coloring tool messes up the colors for this make snippet. Ugly, but can’t do nuffing about that right now!)

As you can see the all target configures and builds the code, while the install target actually installs the files. There are a few points to explain:

  • For building every package, a temporary directory is created and the package directory is copied there and renamed to pkg. The source code, if obtained from an external source, is fetched and put in an _src_ directory inside the temporary directory.

  • We configure bash so that it is linked statically. We still don’t have glibc or any of the other shared libraries needed.

  • We use the $(MAKE) special variable instead of running make directly. This way, some of the properties of the parent make are communicated to the sub-makes (like the -j argument, so that the right number of parallel jobs are used).

In order to actually run the built shell, you need to add the contents of the package file (the one built by the build script) to the image file we created before. After that, run the VM like this:

qemu-system-x86_64 -kernel /path/to/bzImage \
              -append "root=/dev/sda init=/bin/bash console=ttyS0" \
              -hda /path/to/image.img \
              -enable-kvm \
              -nographic \
              -serial mon:stdio

Note the init parameter passed to the kernel. Here, we are telling the kernel to use bash as the init program.

The source

Everything I’ve talked about in this article can be found in this repository on Github.

Linux Really from Scratch: Part 1

Linux from Scratch has been one of the projects that I’ve always been interested in…and never gotten around to actually going through with it! I think I downloaded the book a few years ago and actually read a couple of chapters, but never went past that.

So for one reason or another, during the last few days I’ve been looking at how Linux is actually booted and how the user space is launched, and I thought maybe I can do it all by myself. Without even going through the LFS book.

So in this, hopefully, series of articles, I’m going to document the process of building a very basic Linux system, all the pieces built from scratch. This will be an iterative process in which we do things step-by-step, in each step adding a bit more complexity.

This is what I’m trying to achieve with this series:

  • We’ll take a look at how Linux actually boots.

  • We’ll see the building blocks of the user space.

  • We’ll try to see what it is the distros do for us. Hopefully we’re going to get a lot more respect for the folks who do all that hard work for us!

  • I’ll try to keep everything deterministic and repeatable.

  • My main focus will be creating an image that is run in a VM and accessed with SSH. So no graphics, at least, not for some time. SSH will also take a while to arrive, but I’ll try to get there as soon as possible, since I really hate working in a console without a proper terminal.

Where to begin?

We will start with two pieces: the Linux kernel and a tiny init program. This will be our init:

#include <stdio.h>

int
main(int argc, char *argv[])
{
    printf("Hello, World!\n");
    printf("This is your friendly init system.\n");
    printf("Just hanging here...\n");
    for (;;);

    return 0;
}

This will simply print out a message and then loop indefinitely, since an init is not supposed to ever exit. We will need to compile this statically. We don’t have glibc or other dynamic libraries right now. Compile the program by running:

gcc hello.c -static -o hello
strip hello

The resulting executable, hello, will be our primitive init.

Then we need to build the kernel. Get the source code for the stable branch of the kernel:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

The latest stable version is v4.19.8 right now. Check out that version:

git checkout v4.19.8

I decided to simply use the config for the current Ubuntu 18.04 kernel as the basis. (Thinking about repeatability? You’re right. More on that later.)

Configure and build the kernel:

cd linux
cp /boot/config-$(uname -r) ./.config
make oldconfig
make

Running what we have

We’re going to use qemu to run what we have. But first, let’s create the root image.

Create an image file and mount it:

qemu-img create image.img 2G
mke2fs image.img
sudo mount image.img /mnt

Now copy the hello executable to the mount directory and rename it to init. Then unmount.

sudo cp hello /mnt/init
sudo umount /mnt

All right. We’re all set. Launch it all by running:

qemu-system-x86_64 -kernel /path/to/bzImage \
                   -append "root=/dev/sda init=/init console=ttyS0" \
                   -hda /path/to/image.img \
                   -enable-kvm \
                   -nographic \
                   -serial mon:stdio

You need to fix the path to the kernel and the disk image. The kernel should be in the arch/x86_64/boot directory after the build is complete.

If everything is okay, you’ll see the boot messages and at the end you’ll get the “Hello, World!” message from our “init system.”

What’s in this command?

  • -kernel /path/to/bzImage: specifies the kernel image to use. Using this option, we won’t have to create a bootable disk, we just directly provide the kernel to boot.

  • -append "root=/dev/sda init=/init console=ttyS0": adds a few options to the kernel command-line. root is the root file-system that is mounted on /, init is the path to the init program to use, and console specifies the output console device (needed in combination with the -serial option).

  • -enable-kvm: use KVM for virtualization.

  • -nographic: do not show the SDL window that is used as VM display by default.

  • -serial mon:stdio: redirect the serial port to stdio. This, in combination with the console parameter passed to the kernel, causes kernel output to be displayed on the current terminal.

It’s just the beginning

I had actually prepared a lot more material, especially about repeatability and build automation, but since the article was getting too long, I’ll leave those for another article. Stay tuned.

Moved!

I’ve recently moved to the small but beautiful town of Enschede in The Netherlands. I’ll just leave you with a view from my desk at work for now. Pretty colorful, isn’t it?!

Office View