9.1. I/O Ports and I/O Memory

Every peripheral device is controlled by writing and reading its registers. Most of the time a device has several registers, and they are accessed at consecutive addresses, either in the memory address space or in the I/O address space.

At the hardware level, there is no conceptual difference between memory regions and I/O regions: both of them are accessed by asserting electrical signals on the address bus and control bus (i.e., the read and write signals)[1] and by reading from or writing to the data bus.

[1] Not all computer platforms use a read and a write signal; some have different means to address external circuits. The difference is irrelevant at the software level, however, and we'll assume all have read and write to simplify the discussion.

While some CPU manufacturers implement a single address space in their chips, others decided that peripheral devices are different from memory and, therefore, deserve a separate address space. Some processors (most notably the x86 family) have separate read and write electrical lines for I/O ports and special CPU instructions to access ports.

Because peripheral devices are built to fit a peripheral bus, and the most popular I/O buses are modeled on the personal computer, even processors that do not have a separate address space for I/O ports must fake reading and writing I/O ports when accessing some peripheral devices, usually by means of external chipsets or extra circuitry in the CPU core. The latter solution is common within tiny processors meant for embedded use.

For the same reason, Linux implements the concept of I/O ports on all computer platforms it runs on, even on platforms where the CPU implements a single address space. The implementation of port access sometimes depends on the specific make and model of the host computer (because different models use different chipsets to map bus transactions into memory address space).

Even if the peripheral bus has a separate address space for I/O ports, not all devices map their registers to I/O ports. While use of I/O ports is common for ISA peripheral boards, most PCI devices map registers into a memory address region. This I/O memory approach is generally preferred, because it doesn't require the use of special-purpose processor instructions; CPU cores access memory much more efficiently, and the compiler has much more freedom in register allocation and addressing-mode selection when accessing memory.

9.1.1. I/O Registers and Conventional Memory

Despite the strong similarity between hardware registers and memory, a programmer accessing I/O registers must be careful to avoid being tricked by CPU (or compiler) optimizations that can modify the expected I/O behavior.

The main difference between I/O registers and RAM is that I/O operations have side effects, while memory operations have none: the only effect of a memory write is storing a value to a location, and a memory read returns the last value written there. Because memory access speed is so critical to CPU performance, the no-side-effects case has been optimized in several ways: values are cached and read/write instructions are reordered.

The compiler can cache data values into CPU registers without writing them to memory, and even if it stores them, both write and read operations can operate on cache memory without ever reaching physical RAM. Reordering can also happen both at the compiler level and at the hardware level: often a sequence of instructions can be executed more quickly if it is run in an order different from that which appears in the program text, for example, to prevent interlocks in the RISC pipeline. On CISC processors, operations that take a significant amount of time can be executed concurrently with other, quicker ones.

These optimizations are transparent and benign when applied to conventional memory (at least on uniprocessor systems), but they can be fatal to correct I/O operations, because they interfere with those "side effects" that are the main reason why a driver accesses I/O registers. The processor cannot anticipate a situation in which some other process (running on a separate processor, or something happening inside an I/O controller) depends on the order of memory access. The compiler or the CPU may just try to outsmart you and reorder the operations you request; the result can be strange errors that are very difficult to debug. Therefore, a driver must ensure that no caching is performed and no read or write reordering takes place when accessing registers.
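To make the hazard concrete, here is a hypothetical polling loop (the dev_done flag and the function are invented for illustration); with no barrier in sight, an optimizing compiler is free to read the flag once and spin on a cached copy forever:

static int dev_done;   /* hypothetical flag, set elsewhere when the
                          device finishes (e.g., by an interrupt handler) */

void wait_for_device_broken(void)
{
    /* The compiler may load dev_done into a register once and
       never reread it, turning this into an infinite loop. */
    while (dev_done == 0)
        ;
}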

The problem with hardware caching is the easiest to face: the underlying hardware is already configured (either automatically or by Linux initialization code) to disable any hardware cache when accessing I/O regions (whether they are memory or port regions).

The solution to compiler optimization and hardware reordering is to place a memory barrier between operations that must be visible to the hardware (or to another processor) in a particular order. Linux provides four macros to cover all possible ordering needs:

 

#include <linux/kernel.h>

 

void barrier(void)

This function tells the compiler to insert a memory barrier but has no effect on the hardware. Compiled code stores to memory all values that are currently modified and resident in CPU registers, and rereads them later when they are needed. A call to barrier prevents compiler optimizations across the barrier but leaves the hardware free to do its own reordering.
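A minimal sketch of how barrier repairs the broken loop shown earlier (again using the invented dev_done flag): the call in the loop body forces the compiler to refetch the value on every pass.

static int dev_done;   /* hypothetical completion flag */

void wait_for_device(void)
{
    while (dev_done == 0)
        barrier();     /* compiler must reload dev_done each iteration */
}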

 

#include <asm/system.h>

 

void rmb(void);

 

void read_barrier_depends(void);

 

void wmb(void);

 

void mb(void);

These functions insert hardware memory barriers in the compiled instruction flow; their actual instantiation is platform dependent. An rmb (read memory barrier) guarantees that any reads appearing before the barrier are completed prior to the execution of any subsequent read. wmb guarantees ordering in write operations, and the mb instruction guarantees both. Each of these functions is a superset of barrier.

read_barrier_depends is a special, weaker form of read barrier. Whereas rmb prevents the reordering of all reads across the barrier, read_barrier_depends blocks only the reordering of reads that depend on data from other reads. The distinction is subtle, and it does not exist on all architectures. Unless you understand exactly what is going on, and you have a reason to believe that a full read barrier is exacting an excessive performance cost, you should probably stick to using rmb.
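For illustration only (the packet structure and pointer are invented), a dependent-read sequence looks like this; the second read cannot be performed until the first has produced the pointer:

struct packet {
    int  length;
    char data[128];
};

static struct packet *current_packet;   /* published by a writer elsewhere */

int packet_length(void)
{
    struct packet *p = current_packet;  /* first read */

    read_barrier_depends();             /* second read depends on p */
    return p->length;                   /* dependent read */
}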

 

void smp_rmb(void);

 

void smp_read_barrier_depends(void);

 

void smp_wmb(void);

 

void smp_mb(void);

These versions of the barrier macros insert hardware barriers only when the kernel is compiled for SMP systems; otherwise, they all expand to a simple barrier call.
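As a hedged sketch of where the SMP variants fit (the data and flag are invented for illustration), a producer/consumer pair running on two processors might pair smp_wmb with smp_rmb; on a uniprocessor build, both degrade to plain barrier calls:

static int shared_data;
static int data_ready;

void producer(void)          /* runs on one CPU */
{
    shared_data = 42;
    smp_wmb();               /* data must be visible before the flag */
    data_ready = 1;
}

int consumer(void)           /* runs on another CPU */
{
    while (data_ready == 0)
        barrier();           /* refetch the flag each pass */
    smp_rmb();               /* flag must be read before the data */
    return shared_data;
}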

A typical usage of memory barriers in a device driver may have this sort of form:

writel(dev->registers.addr, io_destination_address);
writel(dev->registers.size, io_size);
writel(dev->registers.operation, DEV_READ);
wmb(  );
writel(dev->registers.control, DEV_GO);

 

In this case, it is important to be sure that all of the device registers controlling a particular operation have been properly set prior to telling it to begin. The memory barrier enforces the completion of the writes in the necessary order.
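The read side of such a protocol is symmetric. As a sketch (the status and result registers are invented for illustration), a driver can use rmb to be sure the status is examined before the result is fetched:

unsigned int status, result;

status = readl(io_status_address);   /* hypothetical status register */
rmb();                               /* status read completes first */
result = readl(io_result_address);   /* then fetch the result */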

Because memory barriers affect performance, they should be used only where they are really needed. The different types of barriers can also have different performance characteristics, so it is worthwhile to use the most specific type possible. For example, on the x86 architecture, wmb( ) currently does nothing, since writes outside the processor are not reordered. Reads are reordered, however, so mb( ) is slower than wmb( ).

It is worth noting that most of the other kernel primitives dealing with synchronization, such as spinlock and atomic_t operations, also function as memory barriers. Also worthy of note is that some peripheral buses (such as the PCI bus) have caching issues of their own; we discuss those when we get to them in later chapters.

Some architectures allow the efficient combination of an assignment and a memory barrier. The kernel provides a few macros that perform this combination; in the default case, they are defined as follows:

#define set_mb(var, value)  do {var = value; mb(  );}  while (0)
#define set_wmb(var, value) do {var = value; wmb(  );} while (0)
#define set_rmb(var, value) do {var = value; rmb(  );} while (0)

 

Where appropriate, <asm/system.h> defines these macros to use architecture-specific instructions that accomplish the task more quickly. Note that set_rmb is defined only by a small number of architectures. (The use of a do...while construct is a standard C idiom that causes the expanded macro to work as a normal C statement in all contexts.)
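A usage sketch (the flag and function are invented for illustration): set_mb combines the store and the barrier in one step.

static int job_pending;

void submit_job(void)
{
    set_mb(job_pending, 1);   /* expands to: job_pending = 1; mb(  ); */
}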
