Kernel comparison: Improved memory management in the 2.6 kernel

From large pages to reverse mapping: more reliability and speed

Paul Larson
Software Engineer, Linux Technology Center, IBM

03 Mar 2004

The 2.6 Linux kernel employs a number of techniques to improve the use of large amounts of memory, making Linux more enterprise-ready than ever before. This article outlines a few of the more important changes, including reverse mapping, the use of larger memory pages, storage of page-table entries in high memory, and greater stability of the memory manager.

As the Linux kernel has grown and matured, more users are looking to Linux to run very large systems that handle scientific analysis applications or even enormous databases. These enterprise-class applications often demand large amounts of memory to perform well. The 2.4 Linux kernel could address fairly large amounts of memory, but many changes were made in the 2.5 kernel to let it handle larger amounts of memory more efficiently.

Reverse mappings

In the Linux memory manager, page tables keep track of the physical pages of memory that are used by a process, and they map the virtual pages to the physical pages. Some of these pages might not be used for long periods of time, making them good candidates for swapping out. However, before they can be swapped out, every single process mapping that page must be found so that the page-table entry for the page in that process can be updated. In the Linux 2.4 kernel, this can be a daunting task as the page tables for every process must be traversed in order to determine whether or not the page is mapped by that process. As the number of processes running on the system grows, so does the work involved in swapping out one of these pages.

Reverse mapping, or RMAP, was implemented in the 2.5 kernel to solve this problem. Reverse mapping provides a mechanism for discovering which processes are using a given physical page of memory. Instead of traversing the page tables for every process, the memory manager now has, for each physical page, a linked list containing pointers to the page-table entries (PTEs) of every process currently mapping that page. This linked list is called a PTE chain. The PTE chain greatly increases the speed of finding those processes that are mapping a page, as shown in Figure 1.

Figure 1. Reverse-mapping in 2.6
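
To make the idea concrete, Listing 1 models a PTE chain with a small, self-contained C program. This is only a toy illustration, not kernel code: the names used here (toy_page, toy_pte, add_rmap, unmap_everywhere) are invented for the example, and the real 2.6 implementation is considerably more involved.

Listing 1. A toy model of a PTE chain

#include <stdio.h>
#include <stdlib.h>

/* Toy model of a page-table entry: records which process maps the page. */
struct toy_pte {
    int pid;        /* owning process */
    int present;    /* 1 if the page is currently mapped in */
};

/* One link in a PTE chain: a pointer to a PTE plus a pointer to the next link. */
struct pte_chain_entry {
    struct toy_pte *pte;            /* 4 bytes on 32-bit hardware */
    struct pte_chain_entry *next;   /* 4 more bytes per entry     */
};

/* Toy model of a physical page: the reverse mapping hangs off the page itself. */
struct toy_page {
    struct pte_chain_entry *chain;
};

/* Record a new mapping by pushing a link onto the page's PTE chain. */
static void add_rmap(struct toy_page *page, struct toy_pte *pte)
{
    struct pte_chain_entry *e = malloc(sizeof(*e));
    e->pte = pte;
    e->next = page->chain;
    page->chain = e;
}

/*
 * "Unmap" the page everywhere.  With reverse mapping this is a walk of one
 * short list; in 2.4 the kernel instead had to scan the page tables of every
 * process on the system to find these same entries.
 */
static void unmap_everywhere(struct toy_page *page)
{
    struct pte_chain_entry *e;

    for (e = page->chain; e != NULL; e = e->next) {
        e->pte->present = 0;
        printf("cleared the mapping belonging to pid %d\n", e->pte->pid);
    }
}

int main(void)
{
    struct toy_page page = { .chain = NULL };
    struct toy_pte pte_a = { .pid = 100, .present = 1 };
    struct toy_pte pte_b = { .pid = 200, .present = 1 };

    add_rmap(&page, &pte_a);
    add_rmap(&page, &pte_b);
    unmap_everywhere(&page);    /* two steps, no matter how many processes exist */
    return 0;
}

The point of the toy is the loop in unmap_everywhere(): its cost depends only on how many mappings the page actually has, not on how many processes are running on the system.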

Nothing is free, of course: the performance gained from reverse mapping comes at a price, the most notable and obvious cost being memory overhead. Some memory has to be used to keep track of all those reverse mappings. Each entry in the PTE chain uses 4 bytes to store a pointer to the page-table entry and an additional 4 bytes to store the pointer to the next entry in the chain. This memory must also come from low memory, which on 32-bit hardware is somewhat limited. When a page is mapped by only one process, the list can be optimized away entirely: in this page-direct approach, a single pointer, called "direct," is stored in place of the linked list, and a flag tells the memory manager that the optimization is in effect for that page. If the page is later mapped by a second process, the direct pointer has to be converted back into a PTE chain.
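
Listing 2 sketches this single-mapping optimization in the same toy style. Again, the flag and field names (PAGE_DIRECT, direct, chain) are invented for illustration and do not match the kernel's own identifiers.

Listing 2. A toy sketch of the page-direct optimization

#include <stdlib.h>

struct toy_pte {
    int pid;
};

struct pte_chain_entry {
    struct toy_pte *pte;
    struct pte_chain_entry *next;
};

#define PAGE_DIRECT 0x1   /* flag: the union holds a single PTE pointer, not a chain */

struct toy_page {
    unsigned long flags;
    union {
        struct toy_pte *direct;          /* one mapping: no list needed      */
        struct pte_chain_entry *chain;   /* several mappings: the full chain */
    } rmap;
};

/* Add a reverse mapping, converting from "direct" to a chain when needed. */
static void add_rmap(struct toy_page *page, struct toy_pte *pte)
{
    if (page->flags & PAGE_DIRECT) {
        /* Second mapping: convert the direct pointer into a two-entry chain. */
        struct pte_chain_entry *first = malloc(sizeof(*first));
        struct pte_chain_entry *second = malloc(sizeof(*second));

        first->pte = page->rmap.direct;   /* preserve the original mapping */
        first->next = second;
        second->pte = pte;
        second->next = NULL;

        page->flags &= ~PAGE_DIRECT;
        page->rmap.chain = first;
    } else if (page->rmap.chain != NULL) {
        /* Already a chain: push a new entry onto the front. */
        struct pte_chain_entry *e = malloc(sizeof(*e));
        e->pte = pte;
        e->next = page->rmap.chain;
        page->rmap.chain = e;
    } else {
        /* First mapping: store a single pointer and mark the page "direct". */
        page->rmap.direct = pte;
        page->flags |= PAGE_DIRECT;
    }
}

int main(void)
{
    struct toy_page page = { .flags = 0 };
    struct toy_pte a = { 100 }, b = { 200 }, c = { 300 };

    add_rmap(&page, &a);   /* stored directly; no chain entry allocated  */
    add_rmap(&page, &b);   /* direct pointer converted into a PTE chain  */
    add_rmap(&page, &c);   /* new entry pushed onto the existing chain   */
    return 0;
}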

Reverse mapping brings a few other complexities as well. Whenever a process maps pages, reverse mappings must be established for all of those pages; likewise, when a process unmaps pages, the corresponding reverse mappings must be torn down, which happens in bulk at exit time. All of these operations must be performed under locks, so for applications that fork and exit frequently, the added overhead can be significant.

Despite these tradeoffs, reverse mapping has proven to be a valuable modification to the Linux memory manager. What was a serious bottleneck, locating every process that maps a page, is reduced to a simple walk of a short list. Reverse mapping helps the system continue to perform and scale well when large applications place huge memory demands on the kernel and many processes share memory. Further enhancements to reverse mapping are being researched for possible inclusion in future versions of the Linux kernel.

Large pages

Typically, the memory manager deals with memory in 4 KB pages on x86 systems. The actual page size is architecture dependent. For most uses, pages of this size are the most efficient way for the memory manager to deal with memory. Some applications, however, make use of extremely large amounts of memory. Large databases are a common example of this. For every page mapped by each process, page-table entries must also be created to map the virtual address to the physical address. If you have a process that maps 1 GB of memory with 4 KB pages, it would take 262,144 page-table entries to keep track of those pages. If each page-table entry consumes 8 bytes, then that would be 2 MB of overhead for every 1 GB of memory mapped. This is quite a bit of overhead by itself, but the problem becomes even worse if you have multiple processes sharing that memory. In such a situation, every process mapping that same 1 GB of memory would consume its own 2 MB worth of page-table entries. With enough processes, the memory wasted on overhead might exceed the amount of memory the application requested for use.
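
That arithmetic is easy to sanity-check. Listing 3 works through the article's example in plain C; the 512-process count at the end is an arbitrary illustration of the sharing problem, not a measurement from any particular workload.

Listing 3. Page-table overhead for a shared 1 GB mapping

#include <stdio.h>

int main(void)
{
    unsigned long mapping   = 1024UL * 1024 * 1024;  /* 1 GB mapped by each process  */
    unsigned long page_size = 4096;                  /* 4 KB pages                   */
    unsigned long pte_size  = 8;                     /* bytes per page-table entry   */
    unsigned long processes = 512;                   /* processes sharing the region */

    unsigned long ptes        = mapping / page_size; /* 262,144 entries              */
    unsigned long per_process = ptes * pte_size;     /* 2 MB of page tables          */

    printf("PTEs per process:        %lu\n", ptes);
    printf("Overhead per process:    %lu KB\n", per_process / 1024);
    printf("Overhead for %lu procs:  %lu MB\n", processes,
           per_process * processes / (1024 * 1024)); /* 1 GB: as large as the mapping itself */
    return 0;
}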

One way to help alleviate this situation is to use a larger page size. Most modern processors support at least a small and a large page size, and some support even more than that. On x86, a large page is 4 MB, or 2 MB on systems with physical address extension (PAE) turned on. Using a 4 MB page size in the example above, that same 1 GB of memory could be mapped with only 256 page-table entries instead of 262,144, which translates to only 2,048 bytes of overhead instead of 2 MB.
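
One common way for applications to get at large pages on a 2.6 system is through the hugetlbfs filesystem. Listing 4 is a minimal sketch of that interface; it assumes that huge pages have already been reserved (for example, by writing to /proc/sys/vm/nr_hugepages) and that hugetlbfs is mounted at /mnt/huge. The file name and mapping size are placeholders, and error handling is kept to a minimum.

Listing 4. Mapping memory backed by large pages through hugetlbfs

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define HUGE_FILE "/mnt/huge/example"        /* assumes hugetlbfs is mounted here         */
#define LENGTH    (256UL * 1024 * 1024)      /* 256 MB: a multiple of the large page size */

int main(void)
{
    int fd = open(HUGE_FILE, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Pages backing this mapping come from the kernel's large page pool, so far
     * fewer page-table entries are needed to describe it. */
    char *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    addr[0] = 1;            /* touch the mapping so a large page is actually faulted in */

    munmap(addr, LENGTH);
    close(fd);
    unlink(HUGE_FILE);
    return 0;
}

Shared memory segments created with the SHM_HUGETLB flag provide another route to the same large page pool.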

The use of large pages can also improve performance by reducing the number of translation lookaside buffer (TLB) misses. The TLB is a sort of cache for the page tables that allows virtual-to-physical address translation to be performed much more quickly for pages listed in it. Of course, the TLB can hold only a limited number of translations. Because each large page covers far more memory than a small one, each TLB entry maps more memory, and the TLB as a whole can reference much more of the address space: for example, a TLB with 64 entries covers only 256 KB of memory with 4 KB pages, but 256 MB with 4 MB pages.

Storing page-table entries in high memory

On 32-bit machines, page tables can normally be stored only in low memory. Low memory is limited to the first 896 MB of physical memory, and much of the rest of the kernel needs it as well. When applications use a large number of processes and map a lot of memory, low memory can quickly become scarce.

A configuration option in the 2.6 kernel, called Highmem PTE (CONFIG_HIGHPTE), now allows page-table entries to be placed in high memory, freeing more of the low memory area for the kernel data structures that do have to live there. In exchange, using these page-table entries becomes somewhat slower, because each one must be temporarily mapped before the kernel can touch it. However, on systems running a large number of processes, enabling this option can squeeze more usable memory out of the low memory area.

Figure 2. Memory regions

Stability

Better stability is another important improvement of the 2.6 memory manager. When the 2.4 kernel was released, users started having memory management-related stability problems almost immediately. Given the systemwide impact of memory management, stability is of utmost importance. The problems were mostly resolved, but the solution entailed essentially gutting the memory manager and replacing it with a much simpler rewrite. This left a lot of room for Linux distributors to improve on the memory manager for their own particular distribution of Linux. The flip side of those improvements, however, is that the memory management features of a 2.4 kernel can be quite different depending on which distribution is used. In order to prevent such a situation from happening again, memory management was one of the most scrutinized areas of kernel development in 2.6. The new memory management code has been tested and optimized on everything from very low-end desktop systems to large, enterprise-class, multi-processor systems.

Conclusion

The memory management improvements in the Linux 2.6 kernel go far beyond the features mentioned in this article. Many of the changes are subtle but equally important. These changes all work together to produce a memory manager in the 2.6 kernel designed for better performance, efficiency, and stability. Some changes, like Highmem PTE and large pages, reduce the overhead imposed by memory management itself. Others, like reverse mapping, speed up certain critical operations. These examples were chosen because they show how the Linux 2.6 kernel has been tuned and enhanced to better handle enterprise-class hardware and applications.

About the author

Paul Larson works on the Linux Test team in the Linux Technology Center at IBM. Some of the projects he has been working on over the past year include the Linux Test Project, 2.5/2.6 kernel stabilization, and kernel code coverage analysis. He can be reached at pl@us.ibm.com.

Copyright 2004