8.1. The Real Story of kmalloc

The kmalloc allocation engine is a powerful tool and easily learned because of its similarity to malloc. The function is fast (unless it blocks) and doesn't clear the memory it obtains; the allocated region still holds its previous content.[1] The allocated region is also contiguous in physical memory. In the next few sections, we talk in detail about kmalloc, so you can compare it with the memory allocation techniques that we discuss later.

[1] Among other things, this implies that you should explicitly clear any memory that might be exposed to user space or written to a device; otherwise, you risk disclosing information that should be kept private.
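As a minimal sketch of the advice in this footnote, the helper below clears a buffer right after allocating it. The name alloc_zeroed_buf is hypothetical, and the #else branch only supplies user-space stand-ins for kmalloc and GFP_KERNEL so that the fragment can be compiled and tried outside the kernel; those stand-ins are assumptions, not kernel code.

```c
#ifdef __KERNEL__
#include <linux/slab.h>     /* kmalloc, GFP_KERNEL */
#include <linux/string.h>   /* memset */
#else
/* User-space stand-ins (assumptions), only so the sketch builds outside the kernel. */
#include <stdlib.h>
#include <string.h>
#define GFP_KERNEL 0
static void *kmalloc(size_t size, int flags) { (void)flags; return malloc(size); }
#endif

/* Hypothetical helper: allocate a buffer and discard whatever it held before. */
static void *alloc_zeroed_buf(size_t len)
{
	void *buf = kmalloc(len, GFP_KERNEL);   /* may sleep: process context only */

	if (!buf)
		return NULL;            /* the allocation can fail; always check */
	memset(buf, 0, len);            /* the region still held its previous content */
	return buf;
}
```

Later kernels also offer kzalloc, which performs the allocation and the clearing in a single call; the explicit memset is shown here because it makes the extra step visible.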

8.1.1. The Flags Argument

Remember that the prototype for kmalloc is:

#include <linux/slab.h>
void *kmalloc(size_t size, int flags);

The first argument to kmalloc is the size of the block to be allocated. The second argument, the allocation flags, is much more interesting, because it controls the behavior of kmalloc in a number of ways.

The most commonly used flag, GFP_KERNEL, means that the allocation (internally performed by calling, eventually, _ _get_free_pages, which is the source of the GFP_ prefix) is performed on behalf of a process running in kernel space. In other words, this means that the calling function is executing a system call on behalf of a process. Using GFP_KERNEL means that kmalloc can put the current process to sleep waiting for a page when called in low-memory situations. A function that allocates memory using GFP_KERNEL must, therefore, be reentrant and cannot be running in atomic context. While the current process sleeps, the kernel takes proper action to locate some free memory, either by flushing buffers to disk or by swapping out memory from a user process.

GFP_KERNEL isn't always the right allocation flag to use; sometimes kmalloc is called from outside a process's context. This type of call can happen, for instance, in interrupt handlers, tasklets, and kernel timers. In this case, the current process should not be put to sleep, and the driver should use a flag of GFP_ATOMIC instead. The kernel normally tries to keep some free pages around in order to fulfill atomic allocation. When GFP_ATOMIC is used, kmalloc can use even the last free page. If that last page does not exist, however, the allocation fails.
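The choice between the two flags can be sketched as follows. Everything named my_* here is hypothetical, and the #else branch holds user-space stand-ins (with placeholder flag values) so that the fragment can be compiled outside the kernel. The sketch assumes the common arrangement in which the allocating helper takes the flags from its caller, because only the caller knows whether it is allowed to sleep.

```c
#ifdef __KERNEL__
#include <linux/slab.h>     /* kmalloc, GFP_KERNEL, GFP_ATOMIC */
#else
/* User-space stand-ins (assumptions); the flag values are placeholders. */
#include <stdlib.h>
#define GFP_KERNEL 1
#define GFP_ATOMIC 2
static void *kmalloc(size_t size, int flags) { (void)flags; return malloc(size); }
#endif

/* Hypothetical per-event record kept by a driver. */
struct my_event {
	int code;
};

/* The caller knows its context, so it supplies the allocation flags. */
static struct my_event *my_event_alloc(int code, int flags)
{
	struct my_event *ev = kmalloc(sizeof(*ev), flags);

	if (ev)
		ev->code = code;
	return ev;      /* NULL on failure; GFP_ATOMIC callers must expect this */
}

/* Reached from a system call: the current process may be put to sleep. */
static struct my_event *my_event_from_syscall(int code)
{
	return my_event_alloc(code, GFP_KERNEL);
}

/* Reached from an interrupt handler, tasklet, or kernel timer: must not sleep. */
static struct my_event *my_event_from_irq(int code)
{
	return my_event_alloc(code, GFP_ATOMIC);
}
```

Because an atomic allocation fails rather than waits, the interrupt-side caller needs a sensible fallback, such as dropping the event and counting the loss.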

Other flags can be used in place of or in addition to GFP_KERNEL and GFP_ATOMIC, although those two cover most of the needs of device drivers. All the flags are defined in <linux/gfp.h>, and individual flags are prefixed with a double underscore, such as _ _GFP_DMA. In addition, there are symbols that represent frequently used combinations of flags; these lack the prefix and are sometimes called allocation priorities. The latter include:

GFP_ATOMIC

Used to allocate memory from interrupt handlers and other code outside of a process context. Never sleeps.

GFP_KERNEL

Normal allocation of kernel memory. May sleep.

GFP_USER

Used to allocate memory for user-space pages; it may sleep.

GFP_HIGHUSER

Like GFP_USER, but allocates from high memory, if any. High memory is described in the next subsection.

GFP_NOIO

GFP_NOFS

These flags function like GFP_KERNEL, but they add restrictions on what the kernel can do to satisfy the request. A GFP_NOFS allocation is not allowed to perform any filesystem calls, while GFP_NOIO disallows the initiation of any I/O at all. They are used primarily in the filesystem and virtual memory code where an allocation may be allowed to sleep, but recursive filesystem calls would be a bad idea.

The allocation flags listed above can be augmented by ORing in any of the following flags, which change how the allocation is carried out:

_ _GFP_DMA

This flag requests allocation to happen in the DMA-capable memory zone. The exact meaning is platform-dependent and is explained in the following section.

_ _GFP_HIGHMEM

This flag indicates that the allocated memory may be located in high memory.

_ _GFP_COLD

Normally, the memory allocator tries to return "cache warm" pages, that is, pages that are likely to be found in the processor cache. Instead, this flag requests a "cold" page, which has not been used in some time. It is useful for allocating pages for DMA reads, where presence in the processor cache is not useful. See Chapter 15 for a full discussion of how to allocate DMA buffers.

_ _GFP_NOWARN

This rarely used flag prevents the kernel from issuing warnings (with printk) when an allocation cannot be satisfied.

_ _GFP_HIGH

This flag marks a high-priority request, which is allowed to consume even the last pages of memory set aside by the kernel for emergencies.

_ _GFP_REPEAT

_ _GFP_NOFAIL

_ _GFP_NORETRY

These flags modify how the allocator behaves when it has difficulty satisfying an allocation. _ _GFP_REPEAT means "try a little harder" by repeating the attempt, but the allocation can still fail. The _ _GFP_NOFAIL flag tells the allocator never to fail; it works as hard as needed to satisfy the request. Use of _ _GFP_NOFAIL is very strongly discouraged; there will probably never be a valid reason to use it in a device driver. Finally, _ _GFP_NORETRY tells the allocator to give up immediately if the requested memory is not available.

8.1.1.1 Memory zones

Both _ _GFP_DMA and _ _GFP_HIGHMEM have a platform-dependent role, although their use is valid for all platforms.

The Linux kernel knows about a minimum of three memory zones: DMA-capable memory, normal memory, and high memory. While allocation normally happens in the normal zone, setting either of the bits just mentioned requires memory to be allocated from a different zone. The idea is that every computer platform that must know about special memory ranges (instead of considering all RAM equivalents) will fall into this abstraction.

DMA-capable memory is memory that lives in a preferential address range, where peripherals can perform DMA access. On most sane platforms, all memory lives in this zone. On the x86, the DMA zone is used for the first 16 MB of RAM, where legacy ISA devices can perform DMA; PCI devices have no such limit.

High memory is a mechanism used to allow access to (relatively) large amounts of memory on 32-bit platforms. This memory cannot be directly accessed from the kernel without first setting up a special mapping and is generally harder to work with. If your driver uses large amounts of memory, however, it will work better on large systems if it can use high memory. See Section 1.8 in Chapter 15 for a detailed description of how high memory works and how to use it.

Whenever a new page is allocated to fulfill a memory allocation request, the kernel builds a list of zones that can be used in the search. If _ _GFP_DMA is specified, only the DMA zone is searched: if no memory is available at low addresses, allocation fails. If no special flag is present, both normal and DMA memory are searched; if _ _GFP_HIGHMEM is set, all three zones are searched for a free page. (Note, however, that kmalloc cannot allocate high memory.)
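The zone selection just described can be modeled in a few lines of user-space C. This is only a toy model: the enum, the MODEL_ flag bits, and build_zone_list are hypothetical names that do not exist in the kernel, and the real allocator in mm/page_alloc.c is considerably more involved. The order in which the zones are tried (most general first, DMA last) is an assumption of the model, chosen so that scarce DMA-capable memory is consumed only when nothing else is free.

```c
#include <stddef.h>

/* Modeled zones; not the kernel's own definitions. */
enum model_zone { ZONE_MODEL_DMA, ZONE_MODEL_NORMAL, ZONE_MODEL_HIGH };

/* Modeled zone modifiers; the bit values are placeholders. */
#define MODEL_GFP_DMA      0x1u
#define MODEL_GFP_HIGHMEM  0x2u

/*
 * Fill 'out' (room for three entries) with the zones to try, in order,
 * and return how many entries were written.
 */
static size_t build_zone_list(unsigned int flags, enum model_zone out[3])
{
	size_t n = 0;

	if (flags & MODEL_GFP_DMA) {
		out[n++] = ZONE_MODEL_DMA;      /* DMA only: no fallback zone */
		return n;
	}
	if (flags & MODEL_GFP_HIGHMEM)
		out[n++] = ZONE_MODEL_HIGH;     /* high memory is tried first */
	out[n++] = ZONE_MODEL_NORMAL;
	out[n++] = ZONE_MODEL_DMA;              /* last resort for ordinary requests */
	return n;
}
```

With no modifier set, the model yields the normal zone followed by the DMA zone; with the high-memory modifier, all three zones appear in the list.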

The situation is more complicated on nonuniform memory access (NUMA) systems. As a general rule, the allocator attempts to locate memory local to the processor performing the allocation, although there are ways of changing that behavior.

The mechanism behind memory zones is implemented in mm/page_alloc.c, while initialization of the zone resides in platform-specific files, usually in mm/init.c within the arch tree. We'll revisit these topics in Chapter 15.

8.1.2. The Size Argument

The kernel manages the system's physical memory, which is available only in page-sized chunks. As a result, kmalloc looks rather different from a typical user-space malloc implementation. A simple, heap-oriented allocation technique would quickly run into trouble; it would have a hard time working around the page boundaries. Thus, the kernel uses a special page-oriented allocation technique to get the best use from the system's RAM.

Linux handles memory allocation by creating a set of pools of memory objects of fixed sizes. Allocation requests are handled by going to a pool that holds sufficiently large objects and handing an entire memory chunk back to the requester. The memory management scheme is quite complex, and the details of it are not normally all that interesting to device driver writers.

The one thing driver developers should keep in mind, though, is that the kernel can allocate only certain predefined, fixed-size byte arrays. If you ask for an arbitrary amount of memory, you're likely to get slightly more than you asked for, up to twice as much. Also, programmers should remember that the smallest allocation that kmalloc can handle is as big as 32 or 64 bytes, depending on the page size used by the system's architecture.

There is an upper limit to the size of memory chunks that can be allocated by kmalloc. That limit varies depending on architecture and kernel configuration options. If your code is to be completely portable, it cannot count on being able to allocate anything larger than 128 KB. If you need more than a few kilobytes, however, there are better ways than kmalloc to obtain memory, which we describe later in this chapter.
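To make the rounding and the portable upper limit concrete, here is a toy user-space model. It assumes that the pools hold power-of-two sizes starting at a 32-byte minimum and stopping at 128 KB; both the set of sizes and the name model_kmalloc_size are illustrative assumptions, since the real sizes depend on the kernel version, architecture, and configuration.

```c
#include <stddef.h>

#define MODEL_MIN_SIZE 32u              /* assumed smallest pool; 64 on some systems */
#define MODEL_MAX_SIZE (128u * 1024u)   /* portable upper limit named in the text */

/*
 * Return the size of the modeled pool that would serve 'request',
 * or 0 if the request exceeds the portable limit.
 */
static size_t model_kmalloc_size(size_t request)
{
	size_t size = MODEL_MIN_SIZE;

	if (request > MODEL_MAX_SIZE)
		return 0;               /* too large to count on kmalloc */
	while (size < request)
		size *= 2;              /* move to the next larger pool */
	return size;
}
```

Under these assumptions, a 100-byte request is served from the 128-byte pool, while a 129-byte request consumes 256 bytes, which is nearly the "twice as much" worst case. Sizing driver structures just under a pool boundary, rather than just over it, avoids that waste.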