15.1. Memory Management in Linux
Rather than describing the theory of memory management in operating systems, this section tries to pinpoint the main features of the Linux implementation. Although you do not need to be a Linux virtual memory guru to implement mmap, a basic overview of how things work is useful. What follows is a fairly lengthy description of the data structures used by the kernel to manage memory. Once the necessary background has been covered, we can get into working with these structures.
15.1.1. Address Types
Linux is, of course, a virtual memory system, meaning that the addresses seen by user programs do not directly correspond to the physical addresses used by the hardware. Virtual memory introduces a layer of indirection that allows a number of nice things. With virtual memory, programs running on the system can allocate far more memory than is physically available; indeed, even a single process can have a virtual address space larger than the system's physical memory. Virtual memory also allows the program to play a number of tricks with the process's address space, including mapping the program's memory to device memory.
Thus far, we have talked about virtual and physical addresses, but a number of the details have been glossed over. The Linux system deals with several types of addresses, each with its own semantics. Unfortunately, the kernel code is not always very clear on exactly which type of address is being used in each situation, so the programmer must be careful.
The following is a list of address types used in Linux. Figure 15-1 shows how these address types relate to physical memory.
User virtual addresses
These are the regular addresses seen by user-space programs. User addresses are either 32 or 64 bits in length, depending on the underlying hardware architecture, and each process has its own virtual address space.
Physical addresses
The addresses used between the processor and the system's memory. Physical addresses are 32- or 64-bit quantities; even 32-bit systems can use larger physical addresses in some situations.
Bus addresses
The addresses used between peripheral buses and memory. Often, they are the same as the physical addresses used by the processor, but that is not necessarily the case. Some architectures can provide an I/O memory management unit (IOMMU) that remaps addresses between a bus and main memory. An IOMMU can make life easier in a number of ways (making a buffer scattered in memory appear contiguous to the device, for example), but programming the IOMMU is an extra step that must be performed when setting up DMA operations. Bus addresses are highly architecture dependent, of course.
Kernel logical addresses
These make up the normal address space of the kernel. These addresses map some portion (perhaps all) of main memory and are often treated as if they were physical addresses. On most architectures, logical addresses and their associated physical addresses differ only by a constant offset. Logical addresses use the hardware's native pointer size and, therefore, may be unable to address all of physical memory on heavily equipped 32-bit systems. Logical addresses are usually stored in variables of type unsigned long or void *. Memory returned from kmalloc has a kernel logical address.
Kernel virtual addresses
Kernel virtual addresses are similar to logical addresses in that they are a mapping from a kernel-space address to a physical address. Kernel virtual addresses do not necessarily have the linear, one-to-one mapping to physical addresses that characterizes the logical address space, however. All logical addresses are kernel virtual addresses, but many kernel virtual addresses are not logical addresses. For example, memory allocated by vmalloc has a virtual address (but no direct physical mapping). The kmap function (described later in this chapter) also returns virtual addresses. Virtual addresses are usually stored in pointer variables.
Figure 15-1. Address types used in Linux

If you have a logical address, the macro __pa() (defined in <asm/page.h>) returns its associated physical address. Physical addresses can be mapped back to logical addresses with __va(), but only for low-memory pages.
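As a quick, purely illustrative example (the function name is our own invention), the following sketch allocates a buffer with kmalloc, converts its logical address to a physical address with __pa(), and converts it back with __va():

#include <linux/kernel.h>
#include <linux/slab.h>
#include <asm/page.h>

/* A minimal sketch: logical -> physical and back for a kmalloc buffer. */
static void address_demo(void)
{
    void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);   /* kernel logical address */
    unsigned long phys;

    if (!buf)
        return;
    phys = __pa(buf);                             /* logical -> physical */
    printk(KERN_INFO "logical %p -> physical %#lx\n", buf, phys);
    /* __va() works here because kmalloc memory is always low memory */
    printk(KERN_INFO "physical %#lx -> logical %p\n", phys, __va(phys));
    kfree(buf);
}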
Different kernel functions require different types of addresses. It would be nice if there were different C types defined, so that the required address types were explicit, but we have no such luck. In this chapter, we try to be clear on which types of addresses are used where.
15.1.2. Physical Addresses and Pages
Physical memory is divided into discrete units called pages. Much of the system's internal handling of memory is done on a per-page basis. Page size varies from one architecture to the next, although most systems currently use 4096-byte pages. The constant PAGE_SIZE (defined in <asm/page.h>) gives the page size on any given architecture.
If you look at a memory address, virtual or physical, it is divisible into a page number and an offset within the page. If 4096-byte pages are being used, for example, the 12 least-significant bits are the offset, and the remaining, higher bits indicate the page number. If you discard the offset and shift the rest of the address to the right, the result is called a page frame number (PFN). Shifting bits to convert between page frame numbers and addresses is a fairly common operation; the macro PAGE_SHIFT tells how many bits must be shifted to make this conversion.
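In code, the conversion reduces to a shift and a mask; here is a small sketch (the function name is ours):

#include <linux/kernel.h>
#include <asm/page.h>

/* Split an address into a page frame number and an in-page offset. */
static void show_pfn(unsigned long addr)
{
    unsigned long pfn    = addr >> PAGE_SHIFT;        /* page frame number */
    unsigned long offset = addr & (PAGE_SIZE - 1);    /* offset within page */

    printk(KERN_INFO "addr %#lx = pfn %#lx, offset %#lx\n", addr, pfn, offset);
    /* (pfn << PAGE_SHIFT) + offset reconstructs the original address */
}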
15.1.4. High and Low Memory
The difference between logical and kernel virtual addresses is highlighted on 32-bit systems that are equipped with large amounts of memory. With 32 bits, it is possible to address 4 GB of memory. Linux on 32-bit systems has, until recently, been limited to substantially less memory than that, however, because of the way it sets up the virtual address space.
The kernel (on the x86 architecture, in the default configuration) splits the 4-GB virtual address space between user-space and the kernel; the same set of mappings is used in both contexts. A typical split dedicates 3 GB to user space, and 1 GB for kernel space.[1] The kernel's code and data structures must fit into that space, but the biggest consumer of kernel address space is virtual mappings for physical memory. The kernel cannot directly manipulate memory that is not mapped into the kernel's address space. The kernel, in other words, needs its own virtual address for any memory it must touch directly. Thus, for many years, the maximum amount of physical memory that could be handled by the kernel was the amount that could be mapped into the kernel's portion of the virtual address space, minus the space needed for the kernel code itself. As a result, x86-based Linux systems could work with a maximum of a little under 1 GB of physical memory.
[1] Many non-x86 architectures are able to efficiently do without the kernel/user-space split described here, so they can work with up to a 4-GB kernel address space on 32-bit systems. The constraints described in this section still apply to such systems when more than 4 GB of memory are installed, however.
In response to commercial pressure to support more memory while not breaking 32-bit application compatibility, the processor manufacturers have added "address extension" features to their products. The result is that, in many cases, even 32-bit processors can address more than 4 GB of physical memory. The limitation on how much memory can be directly mapped with logical addresses remains, however. Only the lowest portion of memory (up to 1 or 2 GB, depending on the hardware and the kernel configuration) has logical addresses;[2] the rest (high memory) does not. Before accessing a specific high-memory page, the kernel must set up an explicit virtual mapping to make that page available in the kernel's address space. Thus, many kernel data structures must be placed in low memory; high memory tends to be reserved for user-space process pages.
[2] The 2.6 kernel (with an added patch) can support a "4G/4G" mode on x86 hardware, which enables larger kernel and user virtual address spaces at a mild performance cost.
The term "high memory" can be confusing to some, especially since it has other meanings in the PC world. So, to make things clear, we'll define the terms here:
Low memory
Memory for which logical addresses exist in kernel space. On almost every system you will likely encounter, all memory is low memory.
High memory
Memory for which logical addresses do not exist, because it is beyond the address range set aside for kernel virtual addresses.
On i386 systems, the boundary between low and high memory is usually set at just under 1 GB, although that boundary can be changed at kernel configuration time. This boundary is not related in any way to the old 640 KB limit found on the original PC, and its placement is not dictated by the hardware. It is, instead, a limit set by the kernel itself as it splits the 32-bit address space between kernel and user space.
We will point out limitations on the use of high memory as we come to them in this chapter.
15.1.4. The Memory Map and Struct Page
Historically, the kernel has used logical addresses to refer to pages of physical memory. The addition of high-memory support, however, has exposed an obvious problem with that approach: logical addresses are not available for high memory. Therefore, kernel functions that deal with memory are increasingly using pointers to struct page (defined in <linux/mm.h>) instead. This data structure is used to keep track of just about everything the kernel needs to know about physical memory; there is one struct page for each physical page on the system. Some of the fields of this structure include the following:
atomic_t count;
The number of references there are to this page. When the count drops to 0, the page is returned to the free list.
void *virtual;
The kernel virtual address of the page, if it is mapped; NULL, otherwise. Low-memory pages are always mapped; high-memory pages usually are not. This field does not appear on all architectures; it generally is compiled only where the kernel virtual address of a page cannot be easily calculated. If you want to look at this field, the proper method is to use the page_address macro, described below.
unsigned long flags;
A set of bit flags describing the status of the page. These include PG_locked, which indicates that the page has been locked in memory, and PG_reserved, which prevents the memory management system from working with the page at all.
There is much more information within struct page, but it is part of the deeper black magic of memory management and is not of concern to driver writers.
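For illustration only (the function name is ours), driver code normally reads these fields through accessor macros rather than by touching struct page directly:

#include <linux/kernel.h>
#include <linux/mm.h>

/* A sketch: query the count and flags fields through their accessors. */
static void inspect_page(struct page *page)
{
    printk(KERN_INFO "reference count: %d\n", page_count(page));
    if (PageReserved(page))
        printk(KERN_INFO "page is reserved\n");
    if (PageLocked(page))
        printk(KERN_INFO "page is locked in memory\n");
}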
The kernel maintains one or more arrays of struct page entries that track all of the physical memory on the system. On some systems, there is a single array called mem_map. On some systems, however, the situation is more complicated. Nonuniform memory access (NUMA) systems and those with widely discontiguous physical memory may have more than one memory map array, so code that is meant to be portable should avoid direct access to the array whenever possible. Fortunately, it is usually quite easy to just work with struct page pointers without worrying about where they come from.
Some functions and macros are defined for translating between struct page pointers and virtual addresses:
struct page *virt_to_page(void *kaddr);
This macro, defined in <asm/page.h>, takes a kernel logical address and returns its associated struct page pointer. Since it requires a logical address, it does not work with memory from vmalloc or high memory.
struct page *pfn_to_page(int pfn);
Returns the struct page pointer for the given page frame number. If necessary, check a page frame number for validity with pfn_valid before passing it to pfn_to_page.
void *page_address(struct page *page);
Returns the kernel virtual address of this page, if such an address exists. For high memory, that address exists only if the page has been mapped. This function is defined in <linux/mm.h>. In most situations, you want to use a version of kmap rather than page_address.
#include <linux/highmem.h>
void *kmap(struct page *page);
void kunmap(struct page *page);
kmap returns a kernel virtual address for any page in the system. For low-memory pages, it just returns the logical address of the page; for high-memory pages, kmap creates a special mapping in a dedicated part of the kernel address space. Mappings created with kmap should always be freed with kunmap; a limited number of such mappings is available, so it is better not to hold on to them for too long. kmap calls maintain a counter, so if two or more functions both call kmap on the same page, the right thing happens. Note also that kmap can sleep if no mappings are available.
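A typical kmap/kunmap pairing looks like the following sketch (the function name is ours); since kmap can sleep, it must be called from process context:

#include <linux/highmem.h>
#include <linux/string.h>

/* Zero one page, which may live in high memory. */
static void clear_one_page(struct page *page)
{
    void *vaddr = kmap(page);     /* map (or simply locate) the page */

    memset(vaddr, 0, PAGE_SIZE);
    kunmap(page);                 /* release the mapping promptly */
}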
#include <linux/highmem.h>
#include <asm/kmap_types.h>
void *kmap_atomic(struct page *page, enum km_type type);
void kunmap_atomic(void *addr, enum km_type type);
kmap_atomic is a high-performance form of kmap. Each architecture maintains a small list of slots (dedicated page table entries) for atomic kmaps; a caller of kmap_atomic must tell the system which of those slots to use in the type argument. The only slots that make sense for drivers are KM_USER0 and KM_USER1 (for code running directly from a call from user space), and KM_IRQ0 and KM_IRQ1 (for interrupt handlers). Note that atomic kmaps must be handled atomically; your code cannot sleep while holding one. Note also that nothing in the kernel keeps two functions from trying to use the same slot and interfering with each other (although there is a unique set of slots for each CPU). In practice, contention for atomic kmap slots seems to not be a problem.
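Here is a minimal sketch (the function name is ours) of an atomic mapping used from process context, hence the KM_USER0 slot; nothing between the map and unmap calls may sleep:

#include <linux/highmem.h>
#include <linux/string.h>
#include <asm/kmap_types.h>

/* Copy data into a (possibly high-memory) page without sleeping. */
static void fill_page_atomic(struct page *page, const void *src, size_t len)
{
    void *vaddr = kmap_atomic(page, KM_USER0);  /* must not sleep until unmapped */

    memcpy(vaddr, src, len);
    kunmap_atomic(vaddr, KM_USER0);
}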
We see some uses of these functions when we get into the example code, later in this chapter and in subsequent chapters.
15.1.5. Page Tables
On any modern system, the processor must have a mechanism for translating virtual addresses into the corresponding physical addresses. This mechanism is called a page table; it is essentially a multilevel tree-structured array containing virtual-to-physical mappings and a few associated flags. The Linux kernel maintains a set of page tables even on architectures that do not use such tables directly.
A number of operations commonly performed by device drivers can involve manipulating page tables. Fortunately for the driver author, the 2.6 kernel has eliminated any need to work with page tables directly. As a result, we do not describe them in any detail; curious readers may want to have a look at Understanding The Linux Kernel by Daniel P. Bovet and Marco Cesati (O'Reilly) for the full story.
15.1.6. Virtual Memory Areas
The virtual memory area (VMA) is the kernel data structure used to manage distinct regions of a process's address space. A VMA represents a homogeneous region in the virtual memory of a process: a contiguous range of virtual addresses that have the same permission flags and are backed up by the same object (a file, say, or swap space). It corresponds loosely to the concept of a "segment," although it is better described as "a memory object with its own properties." The memory map of a process is made up of (at least) the following areas:
•An area for the program's executable code (often called text)
•Multiple areas for data, including initialized data (that which has an explicitly assigned value at the beginning of execution), uninitialized data (BSS),[3] and the program stack
[3] The name BSS is a historical relic from an old assembly operator meaning "block started by symbol." The BSS segment of executable files isn't stored on disk, and the kernel maps the zero page to the BSS address range.
•One area for each active memory mapping
The memory areas of a process can be seen by looking in /proc/<pid>/maps (in which pid, of course, is replaced by a process ID). /proc/self is a special case of /proc/pid, because it always refers to the current process. As an example, here are a couple of memory maps (to which we have added short comments in italics):
# cat /proc/1/maps look at init
08048000-0804e000 r-xp 00000000 03:01 64652 /sbin/init text
0804e000-0804f000 rw-p 00006000 03:01 64652 /sbin/init data
0804f000-08053000 rwxp 00000000 00:00 0 zero-mapped BSS
40000000-40015000 r-xp 00000000 03:01 96278 /lib/ld-2.3.2.so text
40015000-40016000 rw-p 00014000 03:01 96278 /lib/ld-2.3.2.so data
40016000-40017000 rw-p 00000000 00:00 0 BSS for ld.so
42000000-4212e000 r-xp 00000000 03:01 80290 /lib/tls/libc-2.3.2.so text
4212e000-42131000 rw-p 0012e000 03:01 80290 /lib/tls/libc-2.3.2.so data
42131000-42133000 rw-p 00000000 00:00 0 BSS for libc
bffff000-c0000000 rwxp 00000000 00:00 0 Stack segment
ffffe000-fffff000 ---p 00000000 00:00 0 vsyscall page
# rsh wolf cat /proc/self/maps #### x86-64 (trimmed)
00400000-00405000 r-xp 00000000 03:01 1596291 /bin/cat text
00504000-00505000 rw-p 00004000 03:01 1596291 /bin/cat data
00505000-00526000 rwxp 00505000 00:00 0 bss
3252300000-3252314000 r-xp 00000000 03:01 1237890 /lib64/ld-2.3.3.so
3252300000-3252301000 r--p 00100000 03:01 1237890 /lib64/ld-2.3.3.so
3252301000-3252302000 rw-p 00101000 03:01 1237890 /lib64/ld-2.3.3.so
7fbfffe000-7fc0000000 rw-p 7fbfffe000 00:00 0 stack
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 vsyscall
The fields in each line are:
start-end perm offset major:minor inode image
Each field in /proc/*/maps (except the image name) corresponds to a field in struct vm_area_struct:
start
end
The beginning and ending virtual addresses for this memory area.
perm
A bit mask with the memory area's read, write, and execute permissions. This field describes what the process is allowed to do with pages belonging to the area. The last character in the field is either p for "private" or s for "shared."
offset
Where the memory area begins in the file that it is mapped to. An offset of 0 means that the beginning of the memory area corresponds to the beginning of the file.
major
minor
The major and minor numbers of the device holding the file that has been mapped. Confusingly, for device mappings, the major and minor numbers refer to the disk partition holding the device special file that was opened by the user, and not the device itself.
inode
The inode number of the mapped file.
image
The name of the file (usually an executable image) that has been mapped.
15.1.6.1 The vm_area_struct structure
When a user-space process calls mmap to map device memory into its address space, the system responds by creating a new VMA to represent that mapping. A driver that supports mmap (and, thus, that implements the mmap method) needs to help that process by completing the initialization of that VMA. The driver writer should, therefore, have at least a minimal understanding of VMAs in order to support mmap.
Let's look at the most important fields in struct vm_area_struct (defined in <linux/mm.h>). These fields may be used by device drivers in their mmap implementation. Note that the kernel maintains lists and trees of VMAs to optimize area lookup, and several fields of vm_area_struct are used to maintain this organization. Therefore, VMAs can't be created at will by a driver, or the structures break. The main fields of VMAs are as follows (note the similarity between these fields and the /proc output we just saw; a short sketch that uses several of them appears after the list):
unsigned long vm_start;
unsigned long vm_end;
The virtual address range covered by this VMA. These fields are the first two fields shown in /proc/*/maps.
struct file *vm_file;
A pointer to the struct file structure associated with this area (if any).
unsigned long vm_pgoff;
The offset of the area in the file, in pages. When a file or device is mapped, this is the file position of the first page mapped in this area.
unsigned long vm_flags;
A set of flags describing this area. The flags of the most interest to device driver writers are VM_IO and VM_RESERVED. VM_IO marks a VMA as being a memory-mapped I/O region. Among other things, the VM_IO flag prevents the region from being included in process core dumps. VM_RESERVED tells the memory management system not to attempt to swap out this VMA; it should be set in most device mappings.
struct vm_operations_struct *vm_ops;
A set of functions that the kernel may invoke to operate on this memory area. Its presence indicates that the memory area is a kernel "object," like the struct file we have been using throughout the book.
void *vm_private_data;
A field that may be used by the driver to store its own information.
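To make these fields concrete, here is a skeletal mmap method (the function name is ours, and this is only a sketch); the call that builds the actual mapping, remap_pfn_range, is covered later in this chapter:

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Sketch of an mmap method: consult and set VMA fields, nothing more. */
static int sketch_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;   /* bytes requested */

    printk(KERN_INFO "mapping %lu bytes at page offset %lu\n",
           size, vma->vm_pgoff);
    vma->vm_flags |= VM_IO | VM_RESERVED;       /* device memory: no core dump, no swap */
    vma->vm_private_data = filp->private_data;  /* our own bookkeeping, if any */
    return 0;   /* real code would build the page mappings here */
}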
Like struct vm_area_struct, the vm_operations_struct is defined in <linux/mm.h>; it includes the operations listed below. These operations are the only ones needed to handle the process's memory needs, and they are listed in the order they are declared. Later in this chapter, some of these functions are implemented; a minimal sketch also follows the list.
void (*open)(struct vm_area_struct *vma);
The open method is called by the kernel to allow the subsystem implementing the VMA to initialize the area. This method is invoked any time a new reference to the VMA is made (when a process forks, for example). The one exception happens when the VMA is first created by mmap; in this case, the driver's mmap method is called instead.
void (*close)(struct vm_area_struct *vma);
When an area is destroyed, the kernel calls its close operation. Note that there's no usage count associated with VMAs; the area is opened and closed exactly once by each process that uses it.
struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int *type);
When a process tries to access a page that belongs to a valid VMA, but that is currently not in memory, the nopage method is called (if it is defined) for the related area. The method returns the struct page pointer for the physical page after, perhaps, having read it in from secondary storage. If the nopage method isn't defined for the area, an empty page is allocated by the kernel.
int (*populate)(struct vm_area_struct *vm, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
This method allows the kernel to "prefault" pages into memory before they are accessed by user space. There is generally no need for drivers to implement the populate method.
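As promised, here is a minimal sketch of the open and close operations (the names are ours), similar in spirit to the example implemented later in this chapter:

#include <linux/kernel.h>
#include <linux/mm.h>

/* Log each new reference to the VMA (fork, for example). */
static void sketch_vma_open(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "VMA open, virt %lx, phys %lx\n",
           vma->vm_start, vma->vm_pgoff << PAGE_SHIFT);
}

/* Log the teardown of the VMA. */
static void sketch_vma_close(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "VMA close\n");
}

static struct vm_operations_struct sketch_vm_ops = {
    .open  = sketch_vma_open,
    .close = sketch_vma_close,
};

A driver's mmap method would set vma->vm_ops = &sketch_vm_ops and, since open is not invoked for the initial mmap call, typically call sketch_vma_open(vma) itself.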
15.1.7. The Process Memory Map
The final piece of the memory management puzzle is the process memory map structure, which holds all of the other data structures together. Each process in the system (with the exception of a few kernel-space helper threads) has a struct mm_struct (defined in <linux/sched.h>) that contains the process's list of virtual memory areas, page tables, and various other bits of memory management housekeeping information, along with a semaphore (mmap_sem) and a spinlock (page_table_lock). The pointer to this structure is found in the task structure; in the rare cases where a driver needs to access it, the usual way is to use current->mm. Note that the memory management structure can be shared between processes; the Linux implementation of threads works in this way, for example.
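As a small illustration (the function name is ours), the following sketch uses current->mm to walk the current process's list of VMAs, taking mmap_sem for reading while it does so:

#include <linux/sched.h>
#include <linux/mm.h>

/* Count the VMAs of the current process. */
static int count_vmas(void)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    int n = 0;

    if (!mm)                 /* kernel threads have no user address space */
        return 0;
    down_read(&mm->mmap_sem);           /* protect the VMA list while reading */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
        n++;
    up_read(&mm->mmap_sem);
    return n;
}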
That concludes our overview of Linux memory management data structures. With that out of the way, we can now proceed to the implementation of the mmap system call.