Towards Linux 2.6

A look into the workings of the next new kernel

Anand Santhanam (asanthan@in.ibm.com), Software Engineer
IBM Global Services

23 Sep 2003

The impending release of a new stable kernel promises greater adoption for Linux, as it becomes more reliable and scalable over a larger variety of processors. Here we'll highlight some of the changes, both big and small, with some code samples.

Linux kernel development has come a long way from the humble v0.1 release in 1991 by Linus Torvalds, which contained a basic scheduler, IPC, and memory management algorithms. It's now a viable alternative operating system that poses a serious challenge in the marketplace. More and more governments and IT giants are moving to Linux. Linux now powers everything from the smallest of embedded devices to S/390s, and from wristwatches to huge enterprise servers.

Linux 2.6 is the next major release of the Linux kernel development cycle, and it contains powerful features aimed at improved performance on high-end enterprise servers as well as support for a plethora of embedded devices (also see the link to Joseph Pranevich's Wonderful World of Linux in the Resources section, for more detailed analyses of Linux 2.6 support for large, small, and multiple processors).

This article analyzes some of the key features of Linux 2.6 for the avid Linux user, and discusses various changes that might be of interest to driver developers.

Linux 2.6 highlights

Linux 2.6 is a big step for Linux on enterprise servers as well as for embedded systems. For high-end machines, new features target performance improvements, scalability, throughput, and NUMA support for SMP machines. For the embedded world, new architectures and processor types have been added -- including support for MMU-less systems that do not have a hardware-controlled memory management scheme. And, as usual, a whole new set of drivers for audio and multimedia have been added to cater to the desktop crowd.

We analyze some of the most notable features of Linux 2.6 in this article, but there are many worthy changes we could not cover here, including enhanced kernel core dumping, fast mutex support, an improved I/O subsystem, and much more. Some of these are summarized in the sidebar, and we have included links to others in the Resources section where possible.

New scheduler

The 2.6 Linux kernel uses a new scheduler algorithm, developed by Ingo Molnar. Known as the O(1) scheduler, it is designed to perform well under heavy load, and it scales well as the number of processors grows.

In the 2.4 scheduler, the timeslice recalculation algorithm requires that all processes exhaust their timeslices before new ones can be computed. So on a system with a large number of processors, most of the processors become idle as processes finish their timeslices and wait for the recalculation; this hurts SMP efficiency. Moreover, an idle processor will start executing a waiting process whose timeslice has not yet been exhausted if that process's own processor is busy, so processes start "bouncing" between processors. When a bouncing process happens to be a high-priority or interactive process, overall system performance suffers.

The new scheduler addresses these issues by distributing timeslices on a per-CPU basis and eliminating the global synchronization and recalculation loop. The scheduler uses two priority arrays per CPU, the active and expired arrays, which are accessed through pointers. The active array contains all tasks that are affine to that CPU and have timeslice left; the expired array contains a sorted list of all tasks on that CPU whose timeslices have expired. When the active array is empty, the two pointers are simply swapped: the expired array (which holds the tasks that are ready to run again) becomes the active array, and the empty active array becomes the new expired array. A bitmap records which priority levels contain runnable tasks, so finding the highest-priority task is trivial.

The new scheduler does not have a big runqueue_lock anymore. It maintains a per-processor run queue/lock mechanism so that two processes on two different processors can sleep, wake up, and context-switch completely in parallel. The recalculation loop (which recomputes the time slices for the processes) and the goodness loop have been eliminated, and O(1) algorithms are used for wakeup() and schedule().
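The following user-space sketch models the core idea: a bitmap of non-empty priority levels makes picking the next task a constant-time operation, no matter how many tasks are runnable. It is an illustration only; the names, sizes, and layout are not the kernel's actual implementation.

    /* Simplified model of the O(1) design: a bitmap of non-empty priority
     * levels plus per-priority counters standing in for the kernel's
     * per-priority run lists. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_PRIO 140    /* 0-99 real-time, 100-139 normal; lower = higher priority */

    struct prio_array {
        unsigned long long bitmap[3];   /* 140 bits fit in three 64-bit words */
        int nr_tasks[MAX_PRIO];         /* stand-in for per-priority task lists */
    };

    static void enqueue(struct prio_array *a, int prio)
    {
        a->nr_tasks[prio]++;
        a->bitmap[prio / 64] |= 1ULL << (prio % 64);
    }

    /* Finding the highest-priority runnable task costs a handful of word
     * tests regardless of how many tasks are queued -- hence "O(1)". */
    static int find_highest(const struct prio_array *a)
    {
        int w;
        for (w = 0; w < 3; w++)
            if (a->bitmap[w])
                return w * 64 + __builtin_ctzll(a->bitmap[w]);
        return -1;                      /* no runnable tasks */
    }

    int main(void)
    {
        struct prio_array active;
        memset(&active, 0, sizeof(active));

        enqueue(&active, 120);          /* an ordinary time-sharing task */
        enqueue(&active, 15);           /* a real-time task */

        printf("next priority level to run: %d\n", find_highest(&active));
        return 0;
    }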

The main benefits of the new scheduler include improved SMP scalability and CPU affinity, constant-time (O(1)) wakeup and scheduling decisions, and better interactive response under heavy load.

Kernel preemption

The kernel preemption patch has been merged into the 2.5 series and subsequently will make it into 2.6. This should significantly lower latencies for user interactive applications, multimedia applications, and the like. This feature is especially good for real-time systems and embedded devices.

The 2.5 kernel preemption work was done by Robert Love. In previous versions of the kernel (including the 2.4 kernels), it was not possible to preempt a task executing in kernel mode (including user tasks that had entered into kernel mode via system calls) until or unless the task voluntarily relinquished the CPU.

As of kernel 2.6, the kernel is preemptible: a kernel task can be preempted so that an important user application can continue to run. The main advantage is a noticeable boost in the interactivity of the system; keystrokes and mouse clicks feel more responsive because they no longer have to wait for unrelated kernel work to finish.

Of course, not all sections of the kernel code can be preempted. Certain critical sections of kernel code are locked against preemption; locking ensures that per-CPU data structures and CPU state are always protected from preemption.

The following code-snippet demonstrates the per-CPU data structure problem (in an SMP system):

Listing 1. Code with kernel preemption problem
          int arr[NR_CPUS];

          arr[smp_processor_id()] = i;
          /* kernel preemption could happen here */
          j = arr[smp_processor_id()];  /* i and j may not be equal, because
                                           smp_processor_id() may have changed */

In this situation, if kernel preemption had happened at the specified point, the task would have been assigned to some other processor upon re-schedule -- in which case smp_processor_id() would have returned a different value.

This situation should be prevented by locking.

FPU mode is another case where the state of the CPU should be protected from preemption. When the kernel is executing floating point instructions, the FPU state is not saved. If preemption happens here, then upon reschedule, the FPU state is completely different from what was there before preemption. So FPU code must always be locked against kernel preemption.
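On x86, for example, the kernel offers kernel_fpu_begin() and kernel_fpu_end() helpers for this purpose. The sketch below is illustrative only; my_fp_routine() is a hypothetical name.

    #include <asm/i387.h>       /* x86: kernel_fpu_begin()/kernel_fpu_end() */

    static void my_fp_routine(void)
    {
        kernel_fpu_begin();     /* saves any live FPU state and disables preemption */
        /* ... floating point / MMX / SSE work goes here ... */
        kernel_fpu_end();       /* re-enables preemption */
    }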

Locking can be done by disabling preemption for the critical section and re-enabling it afterwards. The 2.6 kernel provides the following #defines to disable and enable preemption:

preempt_disable() -- increments the preempt counter
preempt_enable() -- decrements the preempt counter and, when it reaches zero, allows preemption again
get_cpu() -- calls preempt_disable() and returns the current processor number
put_cpu() -- re-enables preemption

Using these defines, Listing 1 could be rewritten as:

Listing 2. Code with locking against preemption
        int cpu, arr[NR_CPUS];

        cpu = get_cpu();      /* disable preemption and get the current CPU */
        arr[cpu] = i;
        j = arr[cpu];         /* same CPU, so i and j are equal */
        /* do some critical stuff here */
        put_cpu();            /* re-enable preemption */

Note that preempt_disable()/preempt_enable() calls can be nested. That is, preempt_disable() can be called n times, and preemption will only be re-enabled when the nth preempt_enable() is encountered.

Preemption is implicitly disabled if any spin locks are held. For instance, a call to spin_lock_irqsave() implicitly prevents preemption by calling preempt_disable(); a call to spin_unlock_irqrestore() re-enables preemption by calling preempt_enable().
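As a brief illustration of both points, the sketch below shows the explicit and the implicit form side by side; my_lock and my_percpu_update() are hypothetical names used only for this example.

    #include <linux/spinlock.h>
    #include <linux/preempt.h>

    static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;   /* hypothetical lock */

    static void my_percpu_update(void)
    {
        unsigned long flags;

        /* Explicit form: the calls nest via a per-task preempt counter. */
        preempt_disable();          /* count 0 -> 1 */
        preempt_disable();          /* count 1 -> 2 */
        /* ... touch per-CPU data ... */
        preempt_enable();           /* count 2 -> 1: still not preemptible */
        preempt_enable();           /* count 1 -> 0: preemption possible again */

        /* Implicit form: holding a spinlock disables preemption for you. */
        spin_lock_irqsave(&my_lock, flags);       /* preempt_disable() inside */
        /* ... critical section ... */
        spin_unlock_irqrestore(&my_lock, flags);  /* preempt_enable() inside */
    }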


Other notable changes in the 2.6 kernel

Improved threading model and support for NPTL

A lot of work on improving threading performance went into the 2.5 kernel. The improved threading model in 2.6 is also the work of Ingo Molnar. Based on a 1:1 threading model (one kernel thread for one user thread), it includes in-kernel support for the new Native POSIX Thread Library (NPTL), which was developed by Molnar in cooperation with Ulrich Drepper.

Threading operations will enjoy an increase in speed; the 2.6 kernel can now handle an arbitrary number of threads and up to 2 billion PIDs (on IA32).

Another change is the introduction of TLS (Thread Local Storage) system calls, which allow for the allocation of one or more GDT (Global Descriptor Table) entries that can be used as thread registers. The GDT is per-CPU and the entries are per-thread. This enables a 1:1 threading model without limitations on the number of threads being created (since a new kernel thread is created for every user thread). The 2.4 kernel allowed only a maximum of 8,192 threads per processor.

The clone system call has been extended to optimize thread creation. The kernel stores the thread ID at a given memory location if the CLONE_PARENT_SETTID flag is set, and clears that location upon thread termination if CLONE_CHILD_CLEARTID is set; this helps user-level memory management recognize unused memory blocks. Support for the signal-safe loading of thread registers has also been incorporated. pthread_join is implemented with a futex (fast user space mutex) wait on the thread ID, which the kernel wakes when the thread exits (for more on futexes, please see Resources).

POSIX signal handling is done in kernel space. A signal is delivered to one of the available threads in the process; fatal signals terminate the entire process. Stop and continue signals also affect the entire process, and this enables job control for multithreaded processes.

A new variant of the exit system call has been introduced. Called exit_group(), this system call terminates the entire process and its threads. Moreover, exit handling has been improved with the introduction of the O(1) algorithm so that it takes but two seconds to terminate a process with a hundred thousand threads (compare with 15 minutes to do the same thing in the 2.4 kernel).

The proc filesystem has been modified to report only the original thread, rather than all threads. This avoids the slowing down of /proc reporting. The kernel ensures that the original thread remains until all other threads are terminated.

VM changes

On the VM side, Rik van Riel's r-map (reverse mapping) has been incorporated, which should show significant improvements in VM behaviour under certain loads.

To understand the reverse mapping technique, let's quickly go through some fundamentals of the Linux virtual memory system.

The Linux kernel operates in a virtual memory mode: processes see virtual pages, and address translation between virtual and physical pages is done through hardware page tables. For a particular virtual page, a page table entry either gives the corresponding physical page or notes that the page is not present (indicating a page fault when the page is accessed). This virtual-to-physical mapping is not one-to-one: multiple virtual pages (pages shared by different processes) can point to the same physical page, in which case each sharing process has its own page table entry mapping to that physical page. Things become complicated when the kernel wants to free a particular physical page, because it has to find and clear every page table entry that references that page before it can be released. The kernel has no way of knowing which processes actually reference a given page, so it must traverse the page tables of all processes, a long and arduous task that slows the VM down considerably under high loads.

The reverse mapping patch addresses this situation by introducing a data structure called pte_chain in struct page (the physical page structure). The pte_chain is a simple linked list that points to the PTEs of the page, and can return a list of PTEs where a particular page is referenced. Page freeing is suddenly very easy.
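Conceptually, the arrangement looks something like the sketch below; the field names and layout here are illustrative, not the kernel's exact definitions.

    typedef unsigned long pte_t;            /* stand-in for the real pte_t */

    struct pte_chain {
        struct pte_chain *next;
        pte_t *ptep;                        /* one PTE that maps this physical page */
    };

    struct page {
        /* ... flags, usage count, address_space mapping, LRU links ... */
        struct pte_chain *pte_chain;        /* every PTE referencing this page */
    };

To unmap the page, the kernel simply walks this chain and clears each referenced PTE instead of scanning every process's page tables.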

There is, however, a memory overhead to this scheme: every struct page in the system carries an extra pte_chain field. On a 256 MB system with 4 KB pages there are about 64K (65,536) physical pages, so roughly 64K * sizeof(struct pte_chain) bytes of memory must be set aside for this purpose alone -- a significant amount.

Several techniques are being used to address this issue, including removal of the wait_queue_head_t field from struct page (the field is used for exclusive access to the page). Because this wait queue is used sparsely, the rmap patch implements a much smaller wait queue that uses hashing to find the correct queue.

Despite this overhead, the performance of the rmap patch, especially on high-end systems and under heavy loads, is significantly better than that of the 2.4 kernel's VM system.

Driver porting to Linux 2.6

The 2.6 kernel has brought along with it a pretty significant list of changes for driver developers. This section highlights some of the important aspects of device driver porting from the 2.4 to 2.6 kernel.

First, the kernel build system has been improved to produce quicker builds than 2.4. Improved graphical configuration tools have been added: make xconfig (which requires the Qt libraries) and make gconfig (which requires the GTK+ libraries).

The following are some of the highlights of the 2.6 build system:

The kernel module loader has also been completely reimplemented in 2.5, which means that the module-building mechanism is quite different from 2.4's. A new set of module utilities is needed for module loading and unloading (please see Resources for a download link), and old 2.4 makefiles no longer work as of 2.6.

The new kernel module loader is the work of Rusty Russell. It makes use of the kernel build mechanism and produces a .ko (kernel object) module object instead of an .o object. The kernel build system compiles the module first and then links it with vermagic.o. This creates a special section in the object module that contains information such as the compiler version used, the kernel version, whether kernel preemption is enabled, and so on.

Now let's take an example and analyze how the new kernel build system should be used to compile and load a simple module. The module concerned is a "hello world" module and is similar to 2.4 module code except that module_init and module_exit need to be used instead of init_module and cleanup_module (kernel 2.4.10 modules onwards use this mechanism). The module is named hello.c and the Makefile is as follows:

Listing 3. Example driver makefile
           KERNEL_SRC := /usr/src/linux
           SUBDIRS := $(KERNEL_SRC)/drivers/char/hello/

           # obj-m names the module; module-objs lists the objects it is built from
           obj-m := module.o
           module-objs := hello.o

           EXTRA_CFLAGS += -DDEBUG=1

           all: modules

           modules:
                $(MAKE) -C $(KERNEL_SRC) SUBDIRS=$(SUBDIRS) modules

This makefile uses the kernel build mechanism to build the module. The compiled module will be named module.ko and is obtained by compiling hello.c and then linking with vermagic.o. KERNEL_SRC indicates the kernel source directory, SUBDIRS indicates the directory where the module source is located, and EXTRA_CFLAGS holds any extra compile-time flags.
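For reference, the hello.c that goes with this makefile might look like the following minimal sketch of the module_init/module_exit style:

    /* hello.c -- minimal "hello world" module for the 2.6 build system */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>

    MODULE_LICENSE("GPL");

    static int __init hello_init(void)
    {
        printk(KERN_INFO "hello: module loaded\n");
        return 0;
    }

    static void __exit hello_exit(void)
    {
        printk(KERN_INFO "hello: module unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);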

Once the new module (module.ko) is created, it can be loaded and unloaded using the new module utilities; the older utilities used with 2.4 cannot load or unload 2.6 kernel modules. The new module loading utilities try to minimize race conditions in which a device is opened just as its module is being unloaded, after the module has already concluded that nobody is using it. One of the reasons for these race conditions is that, in 2.4, the module usage count is manipulated within the module code itself (via MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT).

In 2.6, the module no longer needs to increase and decrease its own reference count; that bookkeeping is now done outside of module code. Any code that needs a reference to a module must call try_module_get() on it, and may access the module only if the call succeeds; the call fails if the module is being unloaded. Correspondingly, the reference is released with module_put().
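For example (my_use_module() and other_module are hypothetical names used only for this sketch):

    #include <linux/module.h>
    #include <linux/errno.h>

    static int my_use_module(struct module *other_module)
    {
        if (!try_module_get(other_module))
            return -ENODEV;             /* module is being unloaded */

        /* ... safe to call into the module here; it cannot go away ... */

        module_put(other_module);       /* release the reference */
        return 0;
    }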

Memory management changes

Memory pools were added during 2.5 development to satisfy memory allocations without sleeping. The concept involves pre-allocating a pool of memory and reserving it until it is actually needed. A mempool is created by using the mempool_create() call (linux/mempool.h should be included):

mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);

where min_nr is the minimum number of pre-allocated objects to keep in reserve, and alloc_fn and free_fn are pointers to the object allocation and de-allocation routines used by the mempool. They are of type:

typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);

pool_data is the pointer used by allocation/de-allocation functions and gfp_mask is the allocation flag. The allocation function will not sleep unless the __GFP_WAIT flag is specified.

Allocating and de-allocating objects within the pool is done by:

void *mempool_alloc(mempool_t *pool, int gfp_mask);
void mempool_free(void *element, mempool_t *pool);

mempool_alloc() allocates an object; if the underlying allocator fails to provide memory, the pre-allocated reserve is used.

The mempool is returned to the system using mempool_destroy().
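Putting these pieces together, a mempool of fixed-size buffers backed by kmalloc()/kfree() might look like the following sketch; my_alloc, my_free, my_pool, MY_BUF_SIZE, and the surrounding functions are illustrative names, not kernel APIs.

    #include <linux/mempool.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    #define MY_BUF_SIZE 512
    #define MY_MIN_NR   4                   /* objects kept in reserve */

    static mempool_t *my_pool;

    static void *my_alloc(int gfp_mask, void *pool_data)
    {
        return kmalloc(MY_BUF_SIZE, gfp_mask);
    }

    static void my_free(void *element, void *pool_data)
    {
        kfree(element);
    }

    static int my_pool_init(void)
    {
        my_pool = mempool_create(MY_MIN_NR, my_alloc, my_free, NULL);
        if (!my_pool)
            return -ENOMEM;
        return 0;
    }

    static void my_pool_use(void)
    {
        void *buf = mempool_alloc(my_pool, GFP_ATOMIC);  /* will not sleep */
        if (buf)
            mempool_free(buf, my_pool);     /* return the object to the pool */
    }

    static void my_pool_exit(void)
    {
        mempool_destroy(my_pool);           /* return the pool to the system */
    }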

In addition to introducing the mempool feature, the 2.5 kernel also introduced three new GFP flags for regular memory allocations:

__GFP_REPEAT -- ask the allocator to try hard, repeating failed attempts, although the allocation may still fail
__GFP_NOFAIL -- the allocation must not fail; the allocator retries indefinitely (use with caution)
__GFP_NORETRY -- fail immediately if the memory is not available, without retrying

Apart from the memory allocation changes, the remap_page_range() call, which is used to map pages to user space, has been modified slightly. It now takes an extra parameter compared to 2.4: the virtual memory area (VMA) pointer is passed as the first parameter, followed by the usual four parameters (the starting virtual address, the physical address, the size, and the protection flags).
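A driver's mmap method under 2.6 might therefore look like the following sketch; my_dev_mmap() and MY_PHYS_ADDR are illustrative names, not part of any real driver.

    #include <linux/mm.h>
    #include <linux/fs.h>
    #include <linux/errno.h>

    #define MY_PHYS_ADDR 0xd0000000UL       /* hypothetical device memory */

    static int my_dev_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* 2.6: the VMA pointer is now the first argument */
        if (remap_page_range(vma, vma->vm_start, MY_PHYS_ADDR,
                             size, vma->vm_page_prot))
            return -EAGAIN;

        return 0;
    }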

Workqueue interface

The workqueue interface was introduced during 2.5 development to replace the task queue interface (which was used to schedule kernel tasks). Each workqueue has dedicated worker threads associated with it, and all tasks queued on a workqueue run in process context (so they are allowed to sleep). Drivers can create and use their own workqueues or use the one that comes with the kernel. Workqueues are created using:

struct workqueue_struct *create_workqueue(const char *name);

where name is the name of the workqueue.

Workqueue tasks can be initialized at compile time or at run time. A task needs to be packaged into a work_struct structure. A workqueue task is initialized at compile time using:

DECLARE_WORK(name, void (*function)(void *), void *data);

where name is the name of the work_struct, function is the function to be invoked when the task is scheduled, and data is the pointer passed to that function.

A workqueue task is initialized at run time by using:

INIT_WORK(struct work_struct *work, void (*function)(void *), void *data);

Queuing a job (a work_struct) onto a workqueue is done with the following function calls:

int queue_work(struct workqueue_struct *queue, struct work_struct *work);
int queue_delayed_work(struct workqueue_struct *queue, struct work_struct
*work, unsigned long delay);

The delay argument to queue_delayed_work() ensures that at least that many jiffies pass before the workqueue entry actually starts executing.

Entries in the workqueue are executed by the associated worker thread at an unspecified time (depending on load, interrupts, and so on) or after the given delay. A delayed entry that is no longer needed and has not yet started to run can be cancelled with:

int cancel_delayed_work(struct work_struct *work);

If the entry is actually executing when the cancellation call returns, the entry continues to execute but will not be added to the queue again. The workqueue is flushed of entries with:

void flush_workqueue(struct workqueue_struct *queue);

and is destroyed with:

void destroy_workqueue(struct workqueue_struct *queue);
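Putting these calls together, a driver's use of a private workqueue might look like the following sketch; my_wq, my_work, my_work_fn, my_init, and my_exit are illustrative names.

    #include <linux/workqueue.h>
    #include <linux/errno.h>

    static struct workqueue_struct *my_wq;

    static void my_work_fn(void *data)
    {
        /* runs in process context via the workqueue's worker thread,
         * so it is allowed to sleep */
    }

    static DECLARE_WORK(my_work, my_work_fn, NULL);

    static int my_init(void)
    {
        my_wq = create_workqueue("mydrv");
        if (!my_wq)
            return -ENOMEM;

        queue_work(my_wq, &my_work);            /* run as soon as possible */
        /* or: queue_delayed_work(my_wq, &my_work, HZ);   run after ~1 second */
        return 0;
    }

    static void my_exit(void)
    {
        cancel_delayed_work(&my_work);  /* harmless if the work already ran */
        flush_workqueue(my_wq);         /* wait for anything still queued */
        destroy_workqueue(my_wq);
    }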

It is not necessary for all drivers to have their own custom workqueue. Drivers can use the default workqueue provided by the kernel. Since this workqueue is shared by many drivers, the entries may take a long time to execute. To alleviate this, delays in worker functions should be kept to a minimum or avoided altogether.

An important note is that the default queue is available to all drivers, while custom-defined workqueues can be used only by GPL-licensed drivers. Jobs are placed on the default queue with:

int schedule_work(struct work_struct *work);
int schedule_delayed_work(struct work_struct *work, unsigned long delay);

There is also a flush_scheduled_work() function, which waits for everything in the default queue to be executed; modules that use the default queue should call it when unloading.

Interrupt routine changes

The 2.5 interrupt handler internals have undergone a lot of changes, but most of them do not affect the average driver developer. However, there are some important changes that affect device drivers.

The interrupt handler function now has a return code of type irqreturn_t. This change, introduced by Linus, means that interrupt handlers tell the generic IRQ layer whether an interrupt was really meant for them. It is done to catch spurious interrupts (especially on shared PCI lines), where interrupts keep arriving (because a driver has accidentally enabled the interrupt bits or the hardware has simply gone bad) and none of the drivers can do anything about it. In 2.6, a handler returns IRQ_HANDLED if the interrupt came from its device, or IRQ_NONE if it had nothing to do with it. This helps the kernel IRQ layer clearly identify which driver is handling a particular interrupt; if an interrupt keeps arriving and no registered handler claims it (that is, all handlers on that line return IRQ_NONE), the kernel can block interrupts from that line. A handler should return IRQ_HANDLED whenever it has actually serviced the interrupt; returning IRQ_NONE in that case is a bug. The new interrupt handler might look like:

Listing 4. 2.6 Interrupt handler pseudocode

             irqreturn_t irq_handler(...)
             {
                         ...
                         if (!my_interrupt)
                                     return IRQ_NONE;   /* not our interrupt */
                         ...
                         return IRQ_HANDLED;  /* we handled the interrupt */
             }

Note that cli(), sti(), save_flags(), and restore_flags() are all deprecated. Instead, the local_irq_disable(), local_irq_enable(), local_irq_save(), and local_irq_restore() calls should be used; these disable and enable interrupts only on the local processor. It is no longer possible to disable interrupts across all processors at once.
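For example, a short critical section that touches data shared with a device's interrupt handler might be protected like this; my_count and my_update() are hypothetical names.

    #include <asm/system.h>     /* x86: local_irq_save()/local_irq_restore() */

    static int my_count;                /* hypothetical data shared with the ISR */

    static void my_update(void)
    {
        unsigned long flags;

        local_irq_save(flags);          /* disable interrupts on this CPU only */
        my_count++;                     /* keep the critical section short */
        local_irq_restore(flags);       /* restore the previous interrupt state */
    }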

Unified device model

One of the most notable changes in the 2.5 development is the creation of a unified device model. The device model represents the overall device architecture and topology of the system by maintaining a number of data structures. Benefits include improved power management control over devices and easier management of device-related tasks, including tracking of:

which devices are present in the system
what power state each device is in
what bus a device is attached to, and the overall bus topology of the system
which driver is responsible for a device

There are many other driver-related developments in the 2.5 kernel, from changes in device registration interfaces to the new sysfs filesystem that exposes the device model, that are beyond the scope of this article.

Conclusion

Since there are so many changes to Linux 2.6 as compared with 2.4, there is talk in the kernel world that the new release may actually be named 3.0. Linus will have the final say in this, and the release may be official as soon as November 2003. Whatever its version number, the new kernel release promises to give us faster, more scalable, and more reliable performance across more platforms and architectures than 2.4.

Linus has asked testers around the world to hunt bugs and report problems, and has asked distribution maintainers to provide 2.6 versions for download. If you would like to take part, you can find download and install links in the Resources below.

Resources

About the author

Anand K. Santhanam has a Bachelor of Engineering degree in Computer Science from Madras University, India. He has been in IBM Global Services (Software Labs), India since July 1999. He is a member of the Linux Group at IBM, where he has worked with ARM-Linux, character/X-based device drivers, power management in embedded systems, PCI device drivers, and multithreaded programming in Linux. Other areas of interest are OS internals and networking. You can contact him at asanthan@in.ibm.com.

Copyright 2003