Interview: Ingo Molnar

Submitted by Jeremy

KernelTrap

December 03, 2002

Ingo Molnar has been contributing to Linux kernel development since 1995 with an impressive list of accomplishments. Most recently his O(1) scheduler was merged into the 2.5 development kernel, as well as much work to enhance the handling of threads. Other highly visible contributions include software-RAID support and the in-kernel Tux web and FTP servers.

In this interview, Ingo explores how he started working on the Linux kernel, noting, "It might sound a bit strange but i installed my first Linux box for the sole purpose of looking at the kernel source." He goes on to explain the concepts behind his new O(1) scheduler, and to describe many of his other kernel efforts. This interview was conducted over several months, and covers a wide range of interesting topics...

Jeremy Andrews: When did you get started with Linux?

Ingo Molnar: i think i first heard about Linux around 1993, but i truly got hooked on kernel development in 1995 when i bought the German edition of the 'Linux Kernel Internals' book. It might sound a bit strange but i installed my first Linux box for the sole purpose of looking at the kernel source - which i found (and still find) fascinating. So i guess i'm one of the few people who started out as a kernel developer, later on learned how to be a Linux admin and then finally learned their way around as a Linux user ;-)

JA: What was your first contribution to the kernel?

Ingo Molnar: my very first contribution was a trivial #ifdef bugfix to the networking code, which was reviewed and merged by Alan Cox. At that point i had already been lurking on the kernel mailing list for a couple of months. My first bigger patch was to arch/i386/kernel/time.c: i implemented timestamp-counter-based gettimeofday() on Pentiums (which sped up the gettimeofday() syscall by a factor of ~4) - that code is still alive in current kernels. This patch too was first reviewed by Alan Cox.

I strongly believe that a positive 'first contact' between kernel newbies and kernel oldbies is perhaps the single most important factor in attracting new developers to Linux. Besides having the ability to code, kernel developers also need the ability to talk and listen to other developers.

JA: Do you participate in other mailing lists beyond the lkml? Or is this the only place where newbies and oldbies alike will find you?

Ingo Molnar: i'm subscribed to many mailing lists, but for kernel development it's the vger list(s) where most of the stuff happens.

JA: Your most recent contribution to the Linux kernel was the O(1) scheduler, merged into the 2.5 development tree in early January. When did you first start working on this project and what was the inspiration?

Ingo Molnar: one of the core ideas used within the O(1) scheduler - the use of two sets of priority arrays to achieve fairness - in fact originates from around 1998; i even wrote a preliminary patch back then. I couldn't solve some O(N) problems at the time so i stopped working on it. I started working on the current O(1) scheduler late last year (2001), sometime in December. The inspiration was what the name suggests - to create a good scheduler for Linux that is O(1) in its entirety: wakeup, context-switching and timer interrupt overhead.
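
[ Editor's sketch: a minimal, compilable user-space model of the two-priority-array idea described above. The names and values are simplified stand-ins, not the actual 2.5 sched.c code - the real scheduler keeps a linked list per priority level plus a bitmap searched with find-first-bit. ]

    /* toy model of the O(1) scheduler's active/expired priority arrays */
    #include <stdio.h>
    #include <string.h>

    #define MAX_PRIO 140    /* fixed priority range, as in 2.5 */

    struct prio_array {
        unsigned int nr_active;
        unsigned int count[MAX_PRIO];   /* stand-in for per-priority task lists */
    };

    struct runqueue {
        struct prio_array *active, *expired;
        struct prio_array arrays[2];
    };

    /* O(1): the cost is bounded by the constant MAX_PRIO, never by the
     * number of tasks (the real code replaces this loop with a bitmap
     * and find-first-bit). */
    static int pick_next_prio(struct runqueue *rq)
    {
        int prio;

        if (!rq->active->nr_active) {
            /* active array empty: swap the two array pointers - this
             * one swap replaces the old scheduler's O(N) timeslice
             * recalculation loop and provides long-term fairness */
            struct prio_array *tmp = rq->active;
            rq->active = rq->expired;
            rq->expired = tmp;
        }
        for (prio = 0; prio < MAX_PRIO; prio++)
            if (rq->active->count[prio])
                return prio;
        return -1;  /* runqueue idle */
    }

    int main(void)
    {
        struct runqueue rq;

        memset(&rq, 0, sizeof(rq));
        rq.active = &rq.arrays[0];
        rq.expired = &rq.arrays[1];
        rq.active->count[120] = 2;      /* two default-priority tasks */
        rq.active->nr_active = 2;
        printf("next priority to run: %d\n", pick_next_prio(&rq));
        return 0;
    }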

JA: Did you base the design on any existing scheduler implementations or research papers?

Ingo Molnar: this might sound a bit arrogant, but i have only read (most of the) research papers after writing the scheduler. This i found to be a good approach in the area of Linux - knowing about too many well-researched details can often confuse the real direction we have to take. I like writing new code, and i prefer to approach things from the physics side: take a few elementary rules and build up the 'one correct' solution, no compromises. This might not be as effective as first reading all the available material and then cherry-picking a few ideas and thinking up the remaining things, but it sure gives me lots of fun :-)

[ One thing i always try to ensure: i take a look at all existing kernel patches that were announced on the linux-kernel mailing list in the same area, to make sure there's no duplication of effort or NIH syndrome. Since such kernel mailing-list postings are progress reports of active research, it can be said that i read a lot of bleeding-edge research. ]

JA: Can you explain how your O(1) scheduler improves upon the previous Linux scheduler?

Ingo Molnar: there are three main areas of improvements.

firstly, as the name suggests, it behaves pretty well independently of how many tasks there are in the system. A number of server workloads (eg. JVMs) actually triggered this inefficiency in the old scheduler.

secondly, scheduling on SMP got improved significantly: both performance and scalability are much better. Also, the scheduler decisions are much more robust these days, because the core design is SMP-aware.

the third improvement is in the way interactive tasks are handled. This is actually the change that should be the most noticeable for ordinary users. Interactive tasks are now detected via a separate, usage-statistics-driven mechanism, which is decoupled from other scheduler mechanisms such as timeslice management. The end result is: interactive tasks are still snappy under heavy load, and CPU-intensive tasks are isolated from interactive tasks much better, so that they cannot monopolize CPU resources. This part is still being tweaked - the important thing is that it's decoupled; the old scheduler had lots of behavioral details integrated into it implicitly, which made tweaking harder. There's even a patch that makes all scheduler-internal constants (timeslice length, various deadlines and interactivity rules) runtime-configurable.
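
[ Editor's sketch: a toy version of a usage-statistics-driven interactivity test, to illustrate the decoupled detection idea described above. The field names and the threshold are invented for this example; they are not the actual 2.5 sched.c heuristics. ]

    /* a task that mostly sleeps (editor, shell, X) is considered
     * interactive; one that mostly burns CPU (compiler, SETI) is not */
    struct task_stats {
        unsigned long sleep_time;   /* time spent blocked, waiting for events */
        unsigned long run_time;     /* time spent executing on a CPU */
    };

    static int task_is_interactive(const struct task_stats *t)
    {
        return t->sleep_time > 2 * t->run_time;   /* invented threshold */
    }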

the scheduler also enabled the implementation of a new scheduling policy: batch-scheduling of CPU-intensive tasks. This is a correct implementation of the SCHED_IDLE patches that are floating around - the end result is that batch-scheduled tasks do not disturb other tasks in the system in any way: if other tasks are running, then batch-scheduled tasks take up zero CPU time. This can be used for things like SETI calculations, or long numeric calculations in university setups.
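
[ Editor's sketch: how a user-space program might request batch scheduling, assuming the patch exports a SCHED_BATCH policy constant to applications. The constant and its value below are assumptions about the patch, not part of the stock kernel at the time of this interview. ]

    #include <sched.h>
    #include <stdio.h>

    #ifndef SCHED_BATCH
    #define SCHED_BATCH 3   /* assumed value provided by the patched kernel */
    #endif

    int main(void)
    {
        struct sched_param param = { .sched_priority = 0 };

        /* pid 0 means the calling process */
        if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {
            perror("sched_setscheduler");
            return 1;
        }
        /* ... run the SETI-style long calculation here ... */
        return 0;
    }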

JA: Is this batch-scheduling in the queue of patches waiting inclusion into the 2.5 kernel?

Ingo Molnar: well, batch scheduling is a feature that is welcomed by a number of users, but is largely irrelevant to others. Right now it's a separate patch to the stock scheduler. There are also some conflicting requests about SCHED_BATCH semantics: some people would like priority levels to cause RT-like separation of execution times, while the current SCHED_BATCH patch uses priority levels [ie. nice values] to determine the percentage of CPU time shared between SCHED_BATCH tasks. Until such issues are decided (by actual use), it's not good to codify them by moving the patch into the stock kernel.

ie. in the first 'RT-like priorities' model, if a 'nice level 15' SCHED_BATCH task is running at the same time as a 'nice level 10' task, the nice-10 task will get all the CPU time - always, until it exits. In the second priority model the nice-10 task will get more CPU time than the nice-15 task, but both of them will get CPU time.

another property of SCHED_BATCH scheduling is the use of much longer timeslices. Eg. right now it's 3 seconds for a default-priority SCHED_BATCH task - while normal tasks have 150 msec timeslices. For things like numeric calculations it's good to have timeslices as long as possible, to minimize the effect of cache thrashing. Eg. on a sufficiently powerful CPU with a 2 MB L2 cache, the 'population time' of the cache can be as high as 10 milliseconds. So if there are two numeric calculation tasks that both fully utilize the L2 cache (in nonobvious patterns), and which context-switch every 150 milliseconds, then they will waste 10 milliseconds on cache-rebuilding in the first 6% of their timeslice. This shows up as a direct 6% slowdown of the numeric calculation jobs. Now, if SCHED_BATCH is used, and each task has a 3000-millisecond timeslice, then the cache-rebuild overhead can be at most 0.3% - a far more acceptable number.

this is also one of the reasons why the default timeslice length got almost doubled over the 2.4 scheduler's timeslice length (there it was 80 msecs). We cannot do the same in the 2.4 scheduler because it has weaker interactivity detection code, which would make things like an X desktop appear 'sluggish' while eg. a compilation job is running.

I think in the longer term we want to have a more abstract timeslice management solution (something like the fairsched patch), which is more than possible with the O(1) scheduler.

JA: How do JVMs trigger an inefficiency in the old scheduler?

Ingo Molnar: the Java programming model prefers the use of many 'threads' - which is a valid and popular application programming model. So JVMs under Linux tend to be amongst the applications that use the most processes/threads, which interact in complex ways. Schedulers usually have the most work to do when there are many tasks in the system, so JVMs tend to trigger scheduler inefficiencies sooner than perhaps any other Linux application.

JA: Are you aware of any areas where your O(1) scheduler doesn't perform as well as the 2.4 scheduler?

Ingo Molnar: not really, i tried to make sure we preserve all the good things from the 2.4 scheduler. If anyone manages to identify such an area then please mail me about it! :-)

well, there's one area where a difference can be felt - nice levels (priorities) are taken far more seriously in the new scheduler. So if one wants maximum X performance even under heavy X load, which makes X a "CPU hog", the X server should be reniced to nice level -10 or -15:

renice -15 -u root
the above command renices all currently existing root-owned processes to -15; on most distributions this includes admin shells and X as well. Some distributions that added the O(1) scheduler to their kernel also set X's priority to -10 or -15 in X's startup scripts - an obviously more robust solution.

similarly, audio playback code that uses up lots of CPU time (Ogg decoders for example) should use nice level -20 to get the best audio latencies and no 'skipping' of soundtracks. Most of the audio playback applications already support the use of RT priorities for playback - nice level -20 under the O(1) scheduler is a far more secure solution. (if a task with RT priority locks up then that can cause a lockup of the system.) Obviously all of these operations are privileged, and can only be done as root.
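
[ Editor's sketch: how a playback application might request nice level -20, as suggested above (it must run as root, since raising priority is privileged). A minimal example, not taken from any particular player. ]

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* PRIO_PROCESS with who == 0: renice the calling process */
        if (setpriority(PRIO_PROCESS, 0, -20) == -1) {
            perror("setpriority (root privileges required)");
            return 1;
        }
        /* ... open the audio device and start decoding/playback ... */
        return 0;
    }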

JA: What are the scalability limitations of the O(1) scheduler?

Ingo Molnar: i'm afraid there are none currently - the runqueues are perfectly isolated. The load-balancer is the only piece of code that has to look at the 'global' scheduling picture, but even that code first tries to figure out whether it has work to do in a 'lock-less' way; only then does it take the runqueue locks. And the load-balancer runs at a much lower frequency than other pieces of the scheduler - the 'big' load balancer runs every 250 msecs - which, on the kernel's timescale, is an eternity. The 'idle rebalancer' runs every 1 millisecond on every idle CPU - but since it uses up idle time, its cost is essentially zero.
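
[ Editor's sketch: a toy model of the 'lock-less check first, take locks only if needed' pattern described above, using pthread mutexes as stand-ins for runqueue spinlocks. Illustrative only - not the actual load-balancer code. ]

    #include <pthread.h>

    struct runqueue {
        pthread_mutex_t lock;
        volatile int nr_running;   /* read without the lock for the cheap check */
    };

    static void load_balance(struct runqueue *this_rq, struct runqueue *other)
    {
        struct runqueue *a, *b;

        /* cheap, lock-less imbalance check: the common case returns
         * here without touching any lock */
        if (other->nr_running <= this_rq->nr_running + 1)
            return;

        /* only now take both locks, in a fixed (address) order to
         * avoid AB-BA deadlocks */
        a = this_rq < other ? this_rq : other;
        b = this_rq < other ? other : this_rq;
        pthread_mutex_lock(&a->lock);
        pthread_mutex_lock(&b->lock);

        /* re-check under the locks, then migrate one task's worth of load */
        if (other->nr_running > this_rq->nr_running + 1) {
            other->nr_running--;
            this_rq->nr_running++;
        }

        pthread_mutex_unlock(&b->lock);
        pthread_mutex_unlock(&a->lock);
    }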

this property of the load-balancer also enables us to easily add more complex things like support for NUMA cache hierarchies - it's not a performance- or scalability-critical piece of code. It can support everything from '16 isolated groups of 4 CPUs' to '32 CPUs on a single chip', or any other future cache hierarchy. This is the beauty of the new multiprocessor scheduler: the handling of the cache hierarchy [ie. SMP or NUMA or CCNUMA, etc.] is decoupled from the actual per-CPU scheduling, so (hopefully) there's no radical redesign needed in the future.

JA: Are the improvements of the O(1) scheduler mainly felt on large servers with multiple CPUs? For example, my home server is an aging PIII 650, and there's definitely a finite limit to the number of processes it can handle at one time.

Ingo Molnar: the improvements are noticeable for basically every workload where the CPU is 'overloaded': interactive tasks are actively detected and preferred, and the scheduler itself does not add to the load no matter how high the load is. While it's typically servers that are overloaded (mainly because server admins can size their own needs better than desktop users, and because desktop use is largely dependent on the reaction speed of humans, which is quite lacking), it's still quite common for desktop systems to get into various sorts of CPU overload situations - so overload handling is important to both categories of use.

JA: What sort of tuning is left to be done?

Ingo Molnar: there are things left like support for non-SMP and non-UP cache hierarchies, like NUMA or SMT (HyperThreading), but the basic design makes the scheduler well-suited for such purposes as well. In most cases an alternative load-balancing algorithm solves the problems. Plus the tweaking of parameters will perhaps never end.

JA: Do you intend to personally work on supporting the NUMA architecture?

Ingo Molnar: i think the patches from Erich Focht & NUMA crew are looking good, and i'm quite sure we will merge them once things have settled down.

JA: Do you aim to have the O(1) scheduler eventually merged into the mainline 2.4 kernel?

Ingo Molnar: this largely depends on Marcelo. I'm (trying to) do periodic backports of the scheduler to 2.4, and feedback has been positive so far. Most distributions include the O(1) scheduler in their kernel tree, so the code gets a fair amount of testing. (in fact only Debian does not include it, this is due to a generic policy of shipping the default kernel as shipped by Linus.) If 2.6 is released soon enough then it might not be worth putting the O(1) scheduler into 2.4 - with so much stuff being backported to 2.4 i think 2.6 should have some new features by the time it's released! :-)

JA: Whether or not it actually happens, how much more testing needs to be done before you personally would be comfortable with the O(1) scheduler being merged into the stable kernel?

Ingo Molnar: it depends on what rule Marcelo uses to include stuff in 2.4. If the rule is to 'include stuff that lots of people use and which works just fine' then the O(1) scheduler is ready. If the rule is to 'include nonintrusive or must-have fixes only' then the O(1) scheduler should not be included, since the 2.4 scheduler works just fine for most workloads. The 2.4 scheduler is still actively maintained and has no major problems, so it's not like we are in a hurry.

JA: How up-to-date is the O(1) scheduler that's part of Alan Cox's 2.4-ac tree?

Ingo Molnar: it's in essence the same scheduler that we have in 2.5. It has one minor tweak missing. In the 2.5 scheduler we also have bits of code from other kernel features which are not present in 2.4 (and probably never will be): Rusty Russell's hotplug CPU framework, Robert Love's preemptible kernel code, Dipankar Sarma's RCU code and Andrew Morton's autoremove wakeups. So the 2.5 sched.c cannot be directly compared to 2.4's sched.c.

JA: You've also recently been experimenting with making the O(1) scheduler aware of Hyper-Threading (aka symmetric multi-threading) capable CPUs. You explained in an email to the Linux kernel mailing list how you implemented this by introducing the concept of a shared runqueue. With future tuning, how much of a performance gain do you think you can get by adding this support?

Ingo Molnar: this patch makes a measurable impact when the HT-capable system is not fully utilized. Eg. if a 2-CPU HT system (4 logical CPUs) has 2 tasks running. In this case the correct scheduling decision is to move the two tasks to two different physical CPUs. This alone can result in up to a 30% performance increase for the two tasks - but for HT systems that are out on the market now it could have a bigger impact as well. It all depends on the tasks, how much cache they use and how well the SMT hardware switches between logical CPUs.
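
[ Editor's sketch: the physical-vs-logical CPU preference described above, reduced to a toy decision function. The topology constants are invented for the example; this is not the shared-runqueue patch itself. ]

    #define NR_PACKAGES 2   /* physical CPUs, each with 2 logical siblings */

    /* nr_running[p] = tasks currently running on physical package p;
     * preferring the least-loaded package spreads 2 tasks across 2
     * physical CPUs instead of 2 logical siblings of the same one,
     * which would compete for a single set of caches and units */
    static int pick_package(const int nr_running[NR_PACKAGES])
    {
        int p, best = 0;

        for (p = 1; p < NR_PACKAGES; p++)
            if (nr_running[p] < nr_running[best])
                best = p;
        return best;
    }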

when the HT system is fully utilized then the 'stock' scheduler gets pretty close to the 'HT-aware' scheduler's performance, due to an existing feature of the scheduler, the so-called "cache-decay based affinity".

JA: Have you had much feedback regarding this patch?

Ingo Molnar: Intel is obviously interested, and so were a number of kernel developers, and users as well. But i do not expect the kind of feedback the O(1) scheduler itself produced - HT systems are fresh on the market, and the stock O(1) scheduler already handles them reasonably well.

JA: What processors currently support Hyper-Threading?

Ingo Molnar: only Intel AFAIK - Hyper-Threading is an Intel trademark iirc. I think there are some non-x86 CPUs that have SMT concepts included, perhaps PowerPCs or Alpha?

JA: You're also the author of the original kernel preemption patch. How did your patch differ from the more recent work Robert Love has done in this area?

Ingo Molnar: it was a small concept-patch from early 2000 that just showed that a preemptible kernel can indeed be done by using SMP spinlocks. The patch, while it booted and appeared to work to a certain degree, had bugs and did not handle the many cases that need special care, which Robert's patches and the current 2.5 kernel handle correctly.

otherwise the base approach is IMO very similar, it has things like:

+               preempt_on();
                clear_highpage(page);
+               preempt_off();
and:
+               atomic_inc_local(&current->may_preempt);

which is quite similar to what we have in 2.5 today, with the difference that Robert and the kernel developer community actually did the other 95% of the work :-)

JA: Are you also actively working on 2.5 preemptible kernel development?

Ingo Molnar: The maintainer is Robert - i do tend to send smaller preempt-related patches (and even a larger one, the 'IRQ lock removal' patch centered around the use of the preemption count). I'm obviously interested in the topic, and i'm happy that all the seemingly conflicting concepts such as low-latency and preemption are now properly merged into 2.5 and that we have really good kernel latencies. Other pressing topics like the scheduler and the threading code still keep me busy most of the time.

JA: Your IRQ rewrite and Robert's preemptible kernel work have resulted in a unified per-task atomic count (the preempt_count) and a lot of code being cleaned up. Do you have plans to do more work in this area?

Ingo Molnar: not at the moment - right now i think that the IRQ code could hardly be any cleaner than it is today :-)

JA: What other kernel projects are you currently working on?

Ingo Molnar: mainly the scheduler, plus these days i'm working on enhancing the handling of 'threads' under Linux, utilized by the NPTL project done by glibc maintainer Ulrich Drepper. A good number of its components are in the 2.5 kernel already.

JA: Can you further describe the components that have already been merged into the 2.5 kernel?

Ingo Molnar: TLS stands for 'Thread Local Storage'. You can find the first announcement of the patch at:

http://lwn.net/Articles/5851/
a number of followup patches were posted, and it all got eventually merged
into 2.5.31.
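
[ Editor's illustration: what the TLS support enables at the application level - a per-thread variable declared with gcc's __thread keyword, each thread transparently getting its own copy. Compile with -lpthread; a minimal example, not taken from the patches themselves. ]

    #include <pthread.h>
    #include <stdio.h>

    static __thread int counter;   /* one independent instance per thread */

    static void *worker(void *arg)
    {
        counter++;   /* touches only this thread's copy */
        printf("thread %lu: counter == %d\n",
               (unsigned long)pthread_self(), counter);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;   /* both threads print counter == 1 */
    }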

Plus there were a few other things related to threading:
http://lwn.net/Articles/8131/

http://lwn.net/Articles/8034/

http://lwn.net/Articles/7618/

http://lwn.net/Articles/7617/

http://lwn.net/Articles/7603/

http://lwn.net/Articles/7411/

http://lwn.net/Articles/7408/
(note that most of the above patches got reworked significantly before they
got into the 2.5 kernel, but the concepts were all preserved.)

JA: You conducted a test to start hundreds of thousands of threads at one time... Can you describe how you did this, and what were the results?

Ingo Molnar: the first test had slightly less than 100,000 threads. My goal was to create an easy tool to trigger inefficiencies in kernel algorithms that somehow still depend on the number of threads. I wrote some simple C code that started up 100,000 parallel threads with a small userspace stack. This simple test alone triggered 4-5 inefficiencies in the kernel, which took a number of days to fix. One of the inefficiencies was the PID allocator, which got discussed on linux-kernel quite extensively and which triggered some emotional responses as well, but finally the patch was merged and now we have a constant-time PID allocator. Another thing the code triggered was the fact that procfs crashed upon creating the 65536th thread - so my box was definitely the first box on the planet that ran more than 64K kernel threads :-) Another deficiency was the O(N) property of the exit()/wait4() syscalls. And while we were at it, a number of new syscalls were introduced to reduce the overhead of thread operations as much as possible.
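
[ Editor's sketch: a test in the spirit of the one described above - start a large number of threads, each with a small user-space stack. The thread count and stack size are guesses for illustration, not Ingo's actual test code. ]

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NR_THREADS 100000
    #define SMALL_STACK (16 * 1024)   /* 16 KB stack per thread */

    static void *idle_thread(void *arg)
    {
        pause();   /* just exist, to stress task-related kernel code */
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid;
        int i;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, SMALL_STACK);

        for (i = 0; i < NR_THREADS; i++) {
            if (pthread_create(&tid, &attr, idle_thread, NULL) != 0) {
                fprintf(stderr, "thread creation failed after %d threads\n", i);
                exit(1);
            }
        }
        printf("started %d threads\n", NR_THREADS);
        pause();
        return 0;
    }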

The IRQ-stacks patch, written by Ben LaHaise and Dave Hansen, roughly doubles the maximum number of threads possible on a 1 GB RAM x86 box, ie. to slightly less than 200K threads.

JA: What other Linux kernel related projects have you worked on in the past?

Ingo Molnar: here's a probably incomplete list of the bigger pieces that made it into the kernel: software-RAID support, 3-level paging on x86 (and highmem), the recent IRQ handling rewrite in 2.5 (which also removed the 'big IRQ lock'), the timer scalability patch, kernel workqueues, the CPU affinity syscalls, the initial SMP pagecache scalability code in 2.3, and i also wrote the original 'writeback pagecache' patch for 2.3, wrote various fixes and enhancements to the 'old' scheduler, wrote the 'wake one' support patch for 2.4, wrote the original zoned allocator, bootmem and mempool subsystems. Ie. all across the spectrum.

One project that is not in the 2.5 kernel is the Tux webserver (and now FTP server as well). If you want to see a Tux/FTP server that can serve 10,000 users then do:
ftp ftp://ftp.rpmfind.net/
some smaller but interesting patches: the NMI watchdog, the ability of the 2.4 kernel to create more than ~4000 processes on x86 (ie. the removal of per-thread TSS), netconsole/netdump, 'big reader locks', and one older patch from 2.2 times i'm particularly proud of: i wrote the original 'current task pointer' implementation, which uses the stack pointer to get to the 'current task pointer' on SMP systems. I also wrote the 'memleak' and 'ktrace' debugging helper tools, which have been picked up by other projects.

JA: Your list of contributions is staggering!

Ingo Molnar: well, it's just that i've been around long enough, and that i'm interested in many different areas. So a colorful mix of contributions piled up.

JA: Are you still working on the Tux webserver?

Ingo Molnar: occasionally yes, but other things take precedence currently. But life has not stopped: eg. Anton Blanchard has ported Tux to 2.5, and Arjan van de Ven keeps the 2.4 patch up to date.

JA: How complete an implementation is the Tux FTP server?

Ingo Molnar: it started out as a proof of concept that a fully in-kernel FTP server was possible. Eg. it implements/accelerates some of the functionality of 'ls' within the kernel as well. It does not handle all aspects of the FTP protocol yet (and perhaps never will support the full range); its main feature currently is absolute security [eg. attackers cannot upload trojans through it, because it, well, has no upload support at all :-) ] and download-only FTP serving - that's clearly the main area where FTP server performance is the biggest problem.

JA: In your opinion, are the Tux webserver and FTP server ready to be merged into the 2.5 kernel?

Ingo Molnar: Anton Blanchard has done merging work in that area, but i think we missed the 2.5 feature freeze deadline. The patch needed for the generic kernel (ie. not the Tux code itself) is fairly small - most areas are part of the kernel already.

JA: What still needs to be modified in the generic kernel?

Ingo Molnar: it's mainly two VFS changes, an exit()-time cleanup function and one new TCP event callback. All the 'big' features that were induced by TUX - zerocopy and the scalability work - are in the 2.5 kernel already, so TUX for 2.5 is a really unintrusive patch.

JA: Do these remaining patches add any overhead to the kernel for users that do not need TUX?

Ingo Molnar: nope, not really.

JA: What future plans do you have for Tux?

Ingo Molnar: well, to get it into the stock kernel :-)

JA: Has Linus offered any opinion regarding TUX, and the possibility of merging it into his kernel tree?

Ingo Molnar: yes, there was an attempt to merge TUX early during 2.5. There were no big technical problems, just suggestions to do certain things differently - this is what happens when any bigger piece of code is merged. It got delayed by the scheduler and then by the threading work and now by the feature freeze :-)

JA: Of all these many impressive accomplishments, which are you the most proud?

Ingo Molnar: well, perhaps the scheduler, it manages to solve a few really hard conceptual problems in a pretty critical piece of code that already got called a couple of thousand times while eg. reading this article on a Linux box! :-)

JA: What is your background in programming prior to getting involved with Linux?

Ingo Molnar: well, like many others, i grew up programming all possible (and even some impossible) aspects of Commodore micro-computers, since age 11. Completely knowing a greatly simplified but fully functional computer architecture helped a lot in kernel development.

I think kids today have a harder time, since hardware vendors are much more tightlipped about computer internals, and the complexity of computer systems skyrocketed as well. Linux perhaps helps here too, as a central 'documentation' and reference implementation for "all computer internals that matter".

JA: Much of your work seems to be focused on improving the performance and scalability of the 2.5 kernel. Is this the result of Red Hat's product requirements, or your own interests?

Ingo Molnar: well, i'm in the fortunate position that the two are a perfect match.

JA: Can you describe your development environment, including the hardware and software tools you typically use?

Ingo Molnar: i use all the normal text-based kernel development tools: vim, gcc/make/etc. i use a serial line to a test-system to debug kernels, and that's all. I like it simple when reading kernel code: i use text consoles (on an LCD screen) to do most of my development work. Occasionally i drop into X for tools that make sense only there, such as ethereal or some of the BK tools.

JA: Is it safe to assume you are working on a Pentium 4 Xeon, based on your recent Hyper-Threading patch?

Ingo Molnar: well, the HT box is one of my test-systems. These days i'm working on a 'boring' system, on which i almost never boot experimental kernel stuff. While not quite mannish, i prefer this solution - having a safe system creates a certain peace of mind.

JA: What's your impression of how the kernel has changed over the past seven years that you've been involved?

Ingo Molnar: there's roughly one order of magnitude more code in the kernel - while the 1.2-ish kernels were just 300 thousand lines of code, the 2.5.48 kernel is more than 5.5 million lines of code. While some of this size increase is due to new architectures and new drivers (which do not directly complicate kernel coding), even the "core code" has increased roughly 5 times in size, which is considerable.

The 1.2.0 kernel supported 4 architectures, the 2.5.48 kernel supports 20 different CPU architectures. The 1.2.0 kernel supported 11 filesystems, the 2.5.48 kernel has native support for _50_ filesystems. So the kernel got considerably more complex - but it also got more logical in many respects. More 'refined' would be the right word i think. I really hope we are successful in keeping it simple (and well commented) enough for new developers to understand.

compared to the situation years ago, it's roughly the same amount of work to get a patch into the official kernel, which i think is very good - it's a tribute to Linus. (his patch-integration and patch-steering work has increased an order of magnitude as well. So during the years Linus not only had to care about the scalability of Linux as a technology, but he also had to scale and form his own workload.)

at this point i think it's fair to mention BitKeeper, which, not being open-source code, ruffled some feathers (mine included). While as an open-source purist one can see the disadvantages of BK, i also have to note the kind of improvements it brought. Patches are now getting into the Linux kernel in a more predictable way, and also in a faster way (than say a year ago) - and this is clearly due to BK giving Linus more flexibility. BK also gives a number of very useful tools when searching for bugs or integrating code - eg. i can see which line was last modified by which person, and i can navigate the changes in a quick and logical way. I have worked with a number of source-control packages before (even with some of the 'big' closed-code ones), but BK definitely tops them all. While from the 'big' source control packages i had the impression that they are "the manager's best friend", BK is definitely the "developer's best friend". Which, for a project like Linux, is the single most important factor.

Eg. look at the following graphs:

http://kernelnewbies.org/status/Linux_Kernel_2_5_Progress.png
http://kernelnewbies.org/status/Linux_Kernel_2_5_Compounded_Progress.png

( the links are from:

   http://kt.zork.net/kernel-traffic/kt20021111_191.html#11

)

these show that Linux is a healthy software project: it has a quick and steady merging rate, and only a low number of features are kept in limbo.


JA: You mention originally having issues with BitKeeper. Have these concerns been addressed by Larry and the people at BitMover?

Ingo Molnar: they have been addressed mostly, yes - via existing features of BK. Eg. there's now a commit mailing list for Linus' tree, which is important to keep all the BK related metadata open.

JA: Are you now using BitKeeper yourself?

Ingo Molnar: actually, i was one of the first kernel developers to do a BK merge with Linus (eg. when the scheduler was still in flux i had a scheduler BK-tree from which Linus merged), but it's really the kind of activity Linus does that fits BK most. So i'm mostly using BK to look at changesets and to generate various kernel trees automatically. Another, purely technical, problem is that my development box(es) are detached from the internet, so BK openlogging does not work.

JA: Do you expect that eventually BitKeeper will be replaced by an open source tool?

Ingo Molnar: it's definitely lots of work, and BK is really complex and well-refined. Currently nothing comparable exists.

JA: Have you worked with any other open source kernels?

Ingo Molnar: not really. I occasionally take a look at FreeBSD - some things they do right, some things they don't; in the areas i'm most interested in, the Linux kernel is currently ahead both design-wise and implementation-wise. Finally we caught up in the VM subsystem as well, with Andrea's big and important 2.4 rewrite, Rik's great rmap code and Andrew's fantastic integration work. But what other answer would one expect from a Linux kernel developer? :-)

JA: FreeBSD 5.0 is due to be released around December of this year, with some significant changes to the kernel. Have you followed this development?

Ingo Molnar: not really. What i sometimes do is look at their code. Also, when i search for past discussions regarding some specific topic, sometimes there's a FreeBSD hit and then i read it. That's all i can tell. But i do wish their kernel gets better just as much as the Linux kernel gets better; there needs to be competition to drive both projects forwards. (the Windows kernel is closed up enough that it does not create any development stimulus for Linux (and vice versa). Rarely do any Windows features get discussed.)

JA: What areas of the Linux kernel do you think still lag behind FreeBSD?

Ingo Molnar: there were two areas where i think we used to lag, the VM and the block IO subsystem - both have been significantly reworked in 2.5. Whether the VM got better than FreeBSD's remains to be seen (via actual use), but the Linux VM already has features that FreeBSD does not have, eg. support for more than 4 GB RAM on x86 (here i guess i'm biased, since i wrote much of that code). But FreeBSD's core VM logic itself, ie. the state machine that decides what to throw out under memory pressure, how to swap and how to do IO, is top-notch. I think with Andrew Morton's and Jens Axboe's latest VM and IO work we are top-notch as well (with a few extras perhaps).

There's also an interesting VM project in the making, Arjan van de Ven's O(1) VM code. [without doubt i do appear to have a soft spot for O(1) code :-) ] Rik van Riel merged Arjan's code a couple of days ago. The code converts every important VM algorithm (laundering, aging) to an O(1) algorithm while still keeping the fundamentals - this is quite nontrivial for things like page aging. It's in essence the VM overhead reduction work that Andrea Arcangeli started in 2.4.10, brought to the extreme. I have run Arjan's O(1) VM under high memory pressure, and it's really impressive - kswapd (the central VM housekeeping kernel thread), which used to eat up lots of CPU time under VM load, has almost vanished from the CPU usage chart.

I do have the impression that the Linux VM is close to a conceptual breakthrough - with all the dots connected we now have something that is the next level of quality. The 2.5 VM has merged all the seemingly conflicting VM branches that fought it out in 2.4, and the many complex subsystems involved have suddenly started playing in concert, producing something really nice.

JA: A much earlier version of the rmap code was originally in the 2.4 kernel, but got ripped out. Do you feel it has improved enough that this won't happen again?

Ingo Molnar: this most definitely won't happen. We already rely on rmap for some other features, so it's not just a matter of undoing one patch. Rmap is essential to the new VM; without rmap the VM would be like a Ferrari with an old diesel motor - looks good but is pretty unusable.

the problem with rmap in 2.4 was simply its complexity, its relative youth as a project and the relatively low number of people who tested it. So in 2.4 it would have been quite a stretch to keep it in. But it was fair game for 2.5, and with Andrew's simplification/robustization/speedup of Rik's rmap code it was very manageable.

JA: Would it be safe to say that 2.5 will outperform even a heavily performance tuned 2.4?

Ingo Molnar: i'd expect it to - if it does not at least give comparable performance for any given workload (with 2.5 tuned to that workload as well) then we have not done a good job.

JA: What other major improvements have gone into 2.5, beyond the scheduler and VM rewrites?

Ingo Molnar: the block IO rewrite, lots of VFS changes, a rework of the module code and (plug) the new threading implementation. The block IO rewrite was long overdue and that's the one i'm most happy about.

JA: Do you feel the changes are significant enough to call the next major kernel 3.0 instead of 2.6?

Ingo Molnar: well, i do think they are significant enough to be called 3.0 - on the other hand it might not matter much whether it's called 2.6 or 3.0; after all, what ordinary people know about is this new shiny Linux 9.0 release, right? ;)

JA: Looking into the future, what do you see in store for the next development kernel, version 2.7?

Ingo Molnar: no idea, really. i don't think trying to look into the future bears much fruit; the kernel needs to handle what is available here and today. Sometimes we are lucky and create stuff that happens to work for years :-) Perhaps something like OpenMosix would be nice to have in the kernel. Plus even better (native) support for User Mode Linux. Things like this.

JA: Where do you intend to focus your attention after you're content with how the O(1) scheduler is tuned?

Ingo Molnar: i have no idea. Threading and scalability i suspect is going to remain an area of interest.

JA: Do you have any advice to offer those aspiring to become productive kernel developers?

Ingo Molnar: only the old mantra: to read the source and the mailing lists. And take it easy - do what you like doing most.

JA: Thank you for taking time away from your coding to talk with me. I am awed by all your accomplishments, and look forward to seeing where your kernel development interests lead you in the future.

(c)2002 KernelTrap