6.2. Blocking I/O

Back in Chapter 3, we looked at how to implement the read and write driver methods. At that point, however, we skipped over one important issue: how does a driver respond if it cannot immediately satisfy the request? A call to read may come when no data is available, but more is expected in the future. Or a process could attempt to write, but your device is not ready to accept the data, because your output buffer is full. The calling process usually does not care about such issues; the programmer simply expects to call read or write and have the call return after the necessary work has been done. So, in such cases, your driver should (by default) block the process, putting it to sleep until the request can proceed.

This section shows how to put a process to sleep and wake it up again later on. As usual, however, we have to explain a few concepts first.

6.2.1. Introduction to Sleeping

What does it mean for a process to "sleep"? When a process is put to sleep, it is marked as being in a special state and removed from the scheduler's run queue. Until something comes along to change that state, the process will not be scheduled on any CPU and, therefore, will not run. A sleeping process has been shunted off to the side of the system, waiting for some future event to happen.

Causing a process to sleep is an easy thing for a Linux device driver to do. There are, however, a couple of rules that you must keep in mind to be able to code sleeps in a safe manner.

The first of these rules is: never sleep when you are running in an atomic context. An atomic context is simply a state where multiple steps must be performed without any sort of concurrent access. What that means, with regard to sleeping, is that your driver cannot sleep while holding a spinlock, seqlock, or RCU lock. You also cannot sleep if you have disabled interrupts. It is legal to sleep while holding a semaphore, but you should look very carefully at any code that does so. If code sleeps while holding a semaphore, any other thread waiting for that semaphore also sleeps. So any sleeps that happen while holding semaphores should be short, and you should convince yourself that, by holding the semaphore, you are not blocking the process that will eventually wake you up.

Another thing to remember with sleeping is that, when you wake up, you never know how long your process may have been out of the CPU or what may have changed in the mean time. You also do not usually know if another process may have been sleeping for the same event; that process may wake before you and grab whatever resource you were waiting for. The end result is that you can make no assumptions about the state of the system after you wake up, and you must check to ensure that the condition you were waiting for is, indeed, true.

One other relevant point, of course, is that your process cannot sleep unless it is assured that somebody else, somewhere, will wake it up. The code doing the awakening must also be able to find your process to be able to do its job. Making sure that a wakeup happens is a matter of thinking through your code and knowing, for each sleep, exactly what series of events will bring that sleep to an end. Making it possible for your sleeping process to be found is, instead, accomplished through a data structure called a wait queue. A wait queue is just what it sounds like: a list of processes, all waiting for a specific event.

In Linux, a wait queue is managed by means of a "wait queue head," a structure of type wait_queue_head_t, which is defined in <linux/wait.h>. A wait queue head can be defined and initialized statically with:

DECLARE_WAIT_QUEUE_HEAD(name);

 

or dynamically as follows:

wait_queue_head_t my_queue;
init_waitqueue_head(&my_queue);

 

We will return to the structure of wait queues shortly, but we know enough now to take a first look at sleeping and waking up.

6.2.2. Simple Sleeping

When a process sleeps, it does so in expectation that some condition will become true in the future. As we noted before, any process that sleeps must check to be sure that the condition it was waiting for is really true when it wakes up again. The simplest way of sleeping in the Linux kernel is a macro called wait_event (with a few variants); it combines handling the details of sleeping with a check on the condition a process is waiting for. The forms of wait_event are:

wait_event(queue, condition)
wait_event_interruptible(queue, condition)
wait_event_timeout(queue, condition, timeout)
wait_event_interruptible_timeout(queue, condition, timeout)

 

In all of the above forms, queue is the wait queue head to use. Note that it is passed "by value." The condition is an arbitrary boolean expression that is evaluated by the macro before and after sleeping; until condition evaluates to a true value, the process continues to sleep. Note that condition may be evaluated an arbitrary number of times, so it should not have any side effects.

If you use wait_event, your process is put into an uninterruptible sleep which, as we have mentioned before, is usually not what you want. The preferred alternative is wait_event_interruptible, which can be interrupted by signals. This version returns an integer value that you should check; a nonzero value means your sleep was interrupted by some sort of signal, and your driver should probably return -ERESTARTSYS. The final versions (wait_event_timeout and wait_event_interruptible_timeout) wait for a limited time; after that time period (expressed in jiffies, which we will discuss in Chapter 7) expires, the macros return with a value of 0 regardless of how condition evaluates.

The other half of the picture, of course, is waking up. Some other thread of execution (a different process, or an interrupt handler, perhaps) has to perform the wakeup for you, since your process is, of course, asleep. The basic function that wakes up sleeping processes is called wake_up. It comes in several forms (but we look at only two of them now):

void wake_up(wait_queue_head_t *queue);
void wake_up_interruptible(wait_queue_head_t *queue);

 

wake_up wakes up all processes waiting on the given queue (though the situation is a little more complicated than that, as we will see later). The other form (wake_up_interruptible) restricts itself to processes performing an interruptible sleep. In general, the two are indistinguishable (if you are using interruptible sleeps); in practice, the convention is to use wake_up if you are using wait_event and wake_up_interruptible if you use wait_event_interruptible.

We now know enough to look at a simple example of sleeping and waking up. In the sample source, you can find a module called sleepy. It implements a device with simple behavior: any process that attempts to read from the device is put to sleep. Whenever a process writes to the device, all sleeping processes are awakened. This behavior is implemented with the following read and write methods:

static DECLARE_WAIT_QUEUE_HEAD(wq);
static int flag = 0;
ssize_t sleepy_read (struct file *filp, char __user *buf, size_t count, loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
            current->pid, current->comm);
    wait_event_interruptible(wq, flag != 0);
    flag = 0;
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
    return 0; /* EOF */
}
ssize_t sleepy_write (struct file *filp, const char __user *buf, size_t count,
        loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
            current->pid, current->comm);
    flag = 1;
    wake_up_interruptible(&wq);
    return count; /* succeed, to avoid retrial */
}

 

Note the use of the flag variable in this example. Since wait_event_interruptible checks for a condition that must become true, we use flag to create that condition.

It is interesting to consider what happens if two processes are waiting when sleepy_write is called. Since sleepy_read resets flag to 0 once it wakes up, you might think that the second process to wake up would immediately go back to sleep. On a single-processor system, that is almost always what happens. But it is important to understand why you cannot count on that behavior. The wake_up_interruptible call will cause both sleeping processes to wake up. It is entirely possible that they will both note that flag is nonzero before either has the opportunity to reset it. For this trivial module, this race condition is unimportant. In a real driver, this kind of race can create rare crashes that are difficult to diagnose. If correct operation required that exactly one process see the nonzero value, it would have to be tested in an atomic manner. We will see how a real driver handles such situations shortly. But first we have to cover one other topic.
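As a hypothetical userspace sketch (using C11 atomics rather than the kernel's own atomic primitives), the following shows why an atomic test-and-clear lets exactly one of several awakened waiters claim the event:

```c
#include <stdatomic.h>

/* Hypothetical userspace sketch, not kernel code. atomic_exchange
 * reads the old value and stores the new one in a single indivisible
 * step, so if several awakened waiters race to test-and-clear the
 * flag, exactly one of them observes the nonzero value. */
static atomic_int event_flag = 1;

int try_consume_event(void)
{
    /* Returns the previous flag value: 1 for one winner, 0 for the rest. */
    return atomic_exchange(&event_flag, 0);
}
```

Contrast this with the separate "test, then clear" steps in sleepy_read, where both waiters can see flag nonzero before either resets it.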

6.2.3. Blocking and Nonblocking Operations

One last point we need to touch on before we look at the implementation of full-featured read and write methods is deciding when to put a process to sleep. There are times when implementing proper Unix semantics requires that an operation not block, even if it cannot be completely carried out.

There are also times when the calling process informs you that it does not want to block, whether or not its I/O can make any progress at all. Explicitly nonblocking I/O is indicated by the O_NONBLOCK flag in filp->f_flags. The flag is defined in <linux/fcntl.h>, which is automatically included by <linux/fs.h>. The flag gets its name from "open-nonblock," because it can be specified at open time (and originally could be specified only there). If you browse the source code, you find some references to an O_NDELAY flag; this is an alternate name for O_NONBLOCK, accepted for compatibility with System V code. The flag is cleared by default, because the normal behavior of a process waiting for data is just to sleep. In the case of a blocking operation, which is the default, the following behavior should be implemented in order to adhere to the standard semantics:

If a process calls read but no data is (yet) available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument to the method.

If a process calls write and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the hardware device, and space becomes free in the output buffer, the process is awakened and the write call succeeds, although the data may be only partially written if there isn't room in the buffer for the count bytes that were requested.

Both these statements assume that there are both input and output buffers; in practice, almost every device driver has them. The input buffer is required to avoid losing data that arrives when nobody is reading. In contrast, data can't be lost on write, because if the system call doesn't accept data bytes, they remain in the user-space buffer. Even so, the output buffer is almost always useful for squeezing more performance out of the hardware.

The performance gain of implementing an output buffer in the driver results from the reduced number of context switches and user-level/kernel-level transitions. Without an output buffer (assuming a slow device), only one or a few characters are accepted by each system call, and while one process sleeps in write, another process runs (that's one context switch). When the first process is awakened, it resumes (another context switch), write returns (kernel/user transition), and the process reiterates the system call to write more data (user/kernel transition); the call blocks and the loop continues. The addition of an output buffer allows the driver to accept larger chunks of data with each write call, with a corresponding increase in performance. If that buffer is big enough, the write call succeeds on the first attempt—the buffered data will be pushed out to the device later—without control needing to go back to user space for a second or third write call. The choice of a suitable size for the output buffer is clearly device-specific.

We don't use an input buffer in scull, because data is already available when read is issued. Similarly, no output buffer is used, because data is simply copied to the memory area associated with the device. Essentially, the device is a buffer, so the implementation of additional buffers would be superfluous. We'll see the use of buffers in Chapter 10.

The behavior of read and write is different if O_NONBLOCK is specified. In this case, the calls simply return -EAGAIN ("try it again") if a process calls read when no data is available or if it calls write when there's no space in the buffer.

As you might expect, nonblocking operations return immediately, allowing the application to poll for data. Applications must be careful when using the stdio functions while dealing with nonblocking files, because they can easily mistake a nonblocking return for EOF. They always have to check errno.
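This behavior can be observed from user space; as a sketch (using an ordinary pipe rather than a driver), a read from an empty descriptor with O_NONBLOCK set fails immediately with EAGAIN:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: observe O_NONBLOCK semantics from user space with an
 * ordinary pipe, no driver required. Returns 1 if reading the empty
 * pipe failed immediately with errno set to EAGAIN, 0 otherwise. */
int nonblocking_read_gives_eagain(void)
{
    int fds[2];
    char c;
    ssize_t n;
    int ok;

    if (pipe(fds) < 0)
        return 0;
    /* Turn on O_NONBLOCK after the fact, via fcntl */
    fcntl(fds[0], F_SETFL, fcntl(fds[0], F_GETFL) | O_NONBLOCK);
    n = read(fds[0], &c, 1);   /* nothing has been written yet */
    ok = (n == -1 && errno == EAGAIN);
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

Without the fcntl call, the same read would block until a writer supplied data.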

Naturally, O_NONBLOCK is meaningful in the open method also. This happens when the call can actually block for a long time; for example, when opening (for read access) a FIFO that has no writers (yet), or accessing a disk file with a pending lock. Usually, opening a device either succeeds or fails, without the need to wait for external events. Sometimes, however, opening the device requires a long initialization, and you may choose to support O_NONBLOCK in your open method by returning immediately with -EAGAIN if the flag is set, after starting the device initialization process. The driver may also implement a blocking open to support access policies in a way similar to file locks. We'll see one such implementation in Section 6.6.3 later in this chapter.

Some drivers may also implement special semantics for O_NONBLOCK; for example, an open of a tape device usually blocks until a tape has been inserted. If the tape drive is opened with O_NONBLOCK, the open succeeds immediately regardless of whether the media is present or not.

Only the read, write, and open file operations are affected by the nonblocking flag.

6.2.4. A Blocking I/O Example

Finally, we get to an example of a real driver method that implements blocking I/O. This example is taken from the scullpipe driver; it is a special form of scull that implements a pipe-like device.

Within a driver, a process blocked in a read call is awakened when data arrives; usually the hardware issues an interrupt to signal such an event, and the driver awakens waiting processes as part of handling the interrupt. The scullpipe driver works differently, so that it can be run without requiring any particular hardware or an interrupt handler. We chose to use another process to generate the data and wake the reading process; similarly, reading processes are used to wake writer processes that are waiting for buffer space to become available.

The device driver uses a device structure that contains two wait queues and a buffer. The size of the buffer is configurable in the usual ways (at compile time, load time, or runtime).

struct scull_pipe {
        wait_queue_head_t inq, outq;       /* read and write queues */
        char *buffer, *end;                /* begin of buf, end of buf */
        int buffersize;                    /* used in pointer arithmetic */
        char *rp, *wp;                     /* where to read, where to write */
        int nreaders, nwriters;            /* number of openings for r/w */
        struct fasync_struct *async_queue; /* asynchronous readers */
        struct semaphore sem;              /* mutual exclusion semaphore */
        struct cdev cdev;                  /* Char device structure */
};

 

The read implementation manages both blocking and nonblocking input and looks like this:

static ssize_t scull_p_read (struct file *filp, char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;
    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    while (dev->rp == dev->wp) { /* nothing to read */
        up(&dev->sem); /* release the lock */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
        if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        /* otherwise loop, but first reacquire the lock */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    /* ok, data is there, return something */
    if (dev->wp > dev->rp)
        count = min(count, (size_t)(dev->wp - dev->rp));
    else /* the write pointer has wrapped, return data up to dev->end */
        count = min(count, (size_t)(dev->end - dev->rp));
    if (copy_to_user(buf, dev->rp, count)) {
        up (&dev->sem);
        return -EFAULT;
    }
    dev->rp += count;
    if (dev->rp == dev->end)
        dev->rp = dev->buffer; /* wrapped */
    up (&dev->sem);
    /* finally, awake any writers and return */
    wake_up_interruptible(&dev->outq);
    PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
    return count;
}

 

As you can see, we left some PDEBUG statements in the code. When you compile the driver, you can enable messaging to make it easier to follow the interaction of different processes.

Let us look carefully at how scull_p_read handles waiting for data. The while loop tests the buffer with the device semaphore held. If there is data there, we know we can return it to the user immediately without sleeping, so the entire body of the loop is skipped. If, instead, the buffer is empty, we must sleep. Before we can do that, however, we must drop the device semaphore; if we were to sleep holding it, no writer would ever have the opportunity to wake us up. Once the semaphore has been dropped, we make a quick check to see if the user has requested non-blocking I/O, and return if so. Otherwise, it is time to call wait_event_interruptible.

Once we get past that call, something has woken us up, but we do not know what. One possibility is that the process received a signal. The if statement that contains the wait_event_interruptible call checks for this case. This statement ensures the proper and expected reaction to signals, which could have been responsible for waking up the process (since we were in an interruptible sleep). If a signal has arrived and it has not been blocked by the process, the proper behavior is to let upper layers of the kernel handle the event. To this end, the driver returns -ERESTARTSYS to the caller; this value is used internally by the virtual filesystem (VFS) layer, which either restarts the system call or returns -EINTR to user space. We use the same type of check to deal with signal handling for every read and write implementation.

However, even in the absence of a signal, we do not yet know for sure that there is data there for the taking. Somebody else could have been waiting for data as well, and they might win the race and get the data first. So we must acquire the device semaphore again; only then can we test the read buffer again (in the while loop) and truly know that we can return the data in the buffer to the user. The end result of all this code is that, when we exit from the while loop, we know that the semaphore is held and the buffer contains data that we can use.

Just for completeness, let us note that scull_p_read can sleep in another spot after we take the device semaphore: the call to copy_to_user. If scull sleeps while copying data between kernel and user space, it sleeps with the device semaphore held. Holding the semaphore in this case is justified since it does not deadlock the system (we know that the kernel will perform the copy to user space and wakes us up without trying to lock the same semaphore in the process), and since it is important that the device memory array not change while the driver sleeps.

6.2.5. Advanced Sleeping

Many drivers are able to meet their sleeping requirements with the functions we have covered so far. There are situations, however, that call for a deeper understanding of how the Linux wait queue mechanism works. Complex locking or performance requirements can force a driver to use lower-level functions to effect a sleep. In this section, we look at the lower level to get an understanding of what is really going on when a process sleeps.

6.2.5.1 How a process sleeps

If you look inside <linux/wait.h>, you see that the data structure behind the wait_queue_head_t type is quite simple; it consists of a spinlock and a linked list. What goes on that list is a wait queue entry, which is declared with the type wait_queue_t. This structure contains information about the sleeping process and exactly how it would like to be woken up.

The first step in putting a process to sleep is usually the allocation and initialization of a wait_queue_t structure, followed by its addition to the proper wait queue. When everything is in place, whoever is charged with doing the wakeup will be able to find the right processes.

The next step is to set the state of the process to mark it as being asleep. There are several task states defined in <linux/sched.h>. TASK_RUNNING means that the process is able to run, although it is not necessarily executing in the processor at any specific moment. There are two states that indicate that a process is asleep: TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE; they correspond, of course, to the two types of sleep. The other states are not normally of concern to driver writers.

In the 2.6 kernel, it is not normally necessary for driver code to manipulate the process state directly. However, should you need to do so, the call to use is:

void set_current_state(int new_state);

 

In older code, you often see something like this instead:

current->state = TASK_INTERRUPTIBLE;

 

But changing current directly in that manner is discouraged; such code breaks easily when data structures change. The above code does show, however, that changing the current state of a process does not, by itself, put it to sleep. By changing the current state, you have changed the way the scheduler treats a process, but you have not yet yielded the processor.

Giving up the processor is the final step, but there is one thing to do first: you must check the condition you are sleeping for first. Failure to do this check invites a race condition; what happens if the condition came true while you were engaged in the above process, and some other thread has just tried to wake you up? You could miss the wakeup altogether and sleep longer than you had intended. Consequently, down inside code that sleeps, you typically see something such as:

if (!condition)
    schedule();

 

By checking our condition after setting the process state, we are covered against all possible sequences of events. If the condition we are waiting for had come about before setting the process state, we notice in this check and do not actually sleep. If the wakeup happens thereafter, the process is made runnable whether or not we have actually gone to sleep yet.

The call to schedule is, of course, the way to invoke the scheduler and yield the CPU. Whenever you call this function, you are telling the kernel to consider which process should be running and to switch control to that process if necessary. So you never know how long it will be before schedule returns to your code.

After the if test and possible call to (and return from) schedule, there is some cleanup to be done. Since the code no longer intends to sleep, it must ensure that the task state is reset to TASK_RUNNING. If the code just returned from schedule, this step is unnecessary; that function does not return until the process is in a runnable state. But if the call to schedule was skipped because it was no longer necessary to sleep, the process state will be incorrect. It is also necessary to remove the process from the wait queue, or it may be awakened more than once.
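The same discipline has a rough userspace analogy; as a sketch using POSIX condition variables (not the kernel API), the canonical predicate loop checks the condition before sleeping and rechecks it after every wakeup, so a wakeup that arrives before the sleep cannot be lost:

```c
#include <pthread.h>

/* Userspace analogy only (POSIX threads, not the kernel wait queue
 * API): the condition is always tested with the lock held, the wait
 * happens only while the condition is false, and it is retested
 * after every wakeup. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int condition = 0;

static void *waker(void *arg)
{
    pthread_mutex_lock(&lock);
    condition = 1;                  /* make the condition true... */
    pthread_cond_signal(&cond);     /* ...then wake any sleeper */
    pthread_mutex_unlock(&lock);
    return arg;
}

int sleep_until_condition(void)
{
    pthread_t t;

    pthread_create(&t, NULL, waker, NULL);
    pthread_mutex_lock(&lock);
    while (!condition)              /* recheck after every wakeup */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return condition;
}
```

If the waker runs first, the condition check simply skips the wait; either interleaving terminates correctly, which is exactly the property the set-state-then-check sequence buys the kernel code.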

6.2.5.2 Manual sleeps

In previous versions of the Linux kernel, nontrivial sleeps required the programmer to handle all of the above steps manually. It was a tedious process involving a fair amount of error-prone boilerplate code. Programmers can still code a manual sleep in that manner if they want to; <linux/sched.h> contains all the requisite definitions, and the kernel source abounds with examples. There is an easier way, however.

The first step is the creation and initialization of a wait queue entry. That is usually done with this macro:

DEFINE_WAIT(my_wait);

 

in which my_wait is the name of the wait queue entry variable. You can also do things in two steps:

wait_queue_t my_wait;
init_wait(&my_wait);

 

But it is usually easier to put a DEFINE_WAIT line at the top of the loop that implements your sleep.

The next step is to add your wait queue entry to the queue, and set the process state. Both of those tasks are handled by this function:

void prepare_to_wait(wait_queue_head_t *queue,
                     wait_queue_t *wait,
                     int state);

 

Here, queue and wait are the wait queue head and the process entry, respectively. state is the new state for the process; it should be either TASK_INTERRUPTIBLE (for interruptible sleeps, which is usually what you want) or TASK_UNINTERRUPTIBLE (for uninterruptible sleeps).

After calling prepare_to_wait, and after it has checked to be sure it still needs to wait, the process can call schedule. Once schedule returns, it is cleanup time. That task, too, is handled by a special function:

void finish_wait(wait_queue_head_t *queue, wait_queue_t *wait);

 

Thereafter, your code can test its state and see if it needs to wait again.

We are far past due for an example. Previously we looked at the read method for scullpipe, which uses wait_event. The write method in the same driver does its waiting with prepare_to_wait and finish_wait, instead. Normally you would not mix methods within a single driver in this way, but we did so in order to be able to show both ways of handling sleeps.

First, for completeness, let's look at the write method itself:

/* How much space is free? */
static int spacefree(struct scull_pipe *dev)
{
    if (dev->rp == dev->wp)
        return dev->buffersize - 1;
    return ((dev->rp + dev->buffersize - dev->wp) % dev->buffersize) - 1;
}
static ssize_t scull_p_write(struct file *filp, const char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;
    int result;
    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    /* Make sure there's space to write */
    result = scull_getwritespace(dev, filp);
    if (result)
        return result; /* scull_getwritespace called up(&dev->sem) */
    /* ok, space is there, accept something */
    count = min(count, (size_t)spacefree(dev));
    if (dev->wp >= dev->rp)
        count = min(count, (size_t)(dev->end - dev->wp)); /* to end-of-buf */
    else /* the write pointer has wrapped, fill up to rp-1 */
        count = min(count, (size_t)(dev->rp - dev->wp - 1));
    PDEBUG("Going to accept %li bytes to %p from %p\n", (long)count, dev->wp, buf);
    if (copy_from_user(dev->wp, buf, count)) {
        up (&dev->sem);
        return -EFAULT;
    }
    dev->wp += count;
    if (dev->wp == dev->end)
        dev->wp = dev->buffer; /* wrapped */
    up(&dev->sem);
    /* finally, awake any reader */
    wake_up_interruptible(&dev->inq);  /* blocked in read() and select() */
    /* and signal asynchronous readers, explained late in chapter 5 */
    if (dev->async_queue)
        kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
    PDEBUG("\"%s\" did write %li bytes\n",current->comm, (long)count);
    return count;
}
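The spacefree arithmetic can be checked in isolation. This userspace sketch re-expresses it with integer offsets (rp, wp) in place of pointers; the key design choice is that one byte of the buffer is always left unused, so that rp == wp unambiguously means "empty" rather than "full":

```c
/* Userspace sketch of spacefree's circular-buffer arithmetic, using
 * integer offsets instead of pointers. One slot is sacrificed so
 * that rp == wp can only mean "empty"; at most buffersize - 1 bytes
 * are ever stored, and "full" is wp sitting one slot behind rp. */
int ring_spacefree(int rp, int wp, int buffersize)
{
    if (rp == wp)
        return buffersize - 1;   /* empty */
    return ((rp + buffersize - wp) % buffersize) - 1;
}
```

For an 8-byte buffer, ring_spacefree(0, 0, 8) is 7 (empty), while ring_spacefree(3, 2, 8) is 0 (full, since wp is one slot behind rp).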

 

This code looks similar to the read method, except that we have pushed the code that sleeps into a separate function called scull_getwritespace. Its job is to ensure that there is space in the buffer for new data, sleeping if need be until that space comes available. Once the space is there, scull_p_write can simply copy the user's data there, adjust the pointers, and wake up any processes that may have been waiting to read data.

The code that handles the actual sleep is:

/* Wait for space for writing; caller must hold device semaphore.  On
 * error the semaphore will be released before returning. */
static int scull_getwritespace(struct scull_pipe *dev, struct file *filp)
{
    while (spacefree(dev) == 0) { /* full */
        DEFINE_WAIT(wait);
        up(&dev->sem);
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" writing: going to sleep\n",current->comm);
        prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
        if (spacefree(dev) == 0)
            schedule();
        finish_wait(&dev->outq, &wait);
        if (signal_pending(current))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    return 0;
}

 

Note once again the containing while loop. If space is available without sleeping, this function simply returns. Otherwise, it must drop the device semaphore and wait. The code uses DEFINE_WAIT to set up a wait queue entry and prepare_to_wait to get ready for the actual sleep. Then comes the obligatory check on the buffer; we must handle the case in which space becomes available in the buffer after we have entered the while loop (and dropped the semaphore) but before we put ourselves onto the wait queue. Without that check, if the reader processes were able to completely empty the buffer in that time, we could miss the only wakeup we would ever get and sleep forever. Having satisfied ourselves that we must sleep, we can call schedule.

It is worth looking again at this case: what happens if the wakeup happens between the test in the if statement and the call to schedule? In that case, all is well. The wakeup resets the process state to TASK_RUNNING and schedule returns, although not necessarily right away. As long as the test happens after the process has put itself on the wait queue and changed its state, things will work.

To finish up, we call finish_wait. The call to signal_pending tells us whether we were awakened by a signal; if so, we need to return to the user and let them try again later. Otherwise, we reacquire the semaphore, and test again for free space as usual.

6.2.5.3 Exclusive waits

We have seen that when a process calls wake_up on a wait queue, all processes waiting on that queue are made runnable. In many cases, that is the correct behavior. In others, however, it is possible to know ahead of time that only one of the processes being awakened will succeed in obtaining the desired resource, and the rest will simply have to sleep again. Each one of those processes, however, has to obtain the processor, contend for the resource (and any governing locks), and explicitly go back to sleep. If the number of processes in the wait queue is large, this "thundering herd" behavior can seriously degrade the performance of the system.

In response to real-world thundering herd problems, the kernel developers added an "exclusive wait" option to the kernel. An exclusive wait acts very much like a normal sleep, with two important differences:

When a wait queue entry has the WQ_FLAG_EXCLUSIVE flag set, it is added to the end of the wait queue. Entries without that flag are, instead, added to the beginning.

When wake_up is called on a wait queue, it stops after waking the first process that has the WQ_FLAG_EXCLUSIVE flag set.

The end result is that processes performing exclusive waits are awakened one at a time, in an orderly manner, and do not create thundering herds. The kernel still wakes up all nonexclusive waiters every time, however.

Employing exclusive waits within a driver is worth considering if two conditions are met: you expect significant contention for a resource, and waking a single process is sufficient to completely consume the resource when it becomes available. Exclusive waits work well for the Apache web server, for example; when a new connection comes in, exactly one of the (often many) Apache processes on the system should wake up to deal with it. We did not use exclusive waits in the scullpipe driver, however; it is rare to see readers contending for data (or writers for buffer space), and we cannot know that one reader, once awakened, will consume all of the available data.

Putting a process into an exclusive wait is a simple matter of calling prepare_to_wait_exclusive:

void prepare_to_wait_exclusive(wait_queue_head_t *queue,
                               wait_queue_t *wait,
                               int state);

 

This call, when used in place of prepare_to_wait, sets the "exclusive" flag in the wait queue entry and adds the process to the end of the wait queue. Note that there is no way to perform exclusive waits with wait_event and its variants.
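As a minimal sketch, the manual sleep loop used in scullpipe-style code becomes exclusive just by substituting this call; dev and spacefree() here are assumptions standing in for a driver's own device structure and condition test, and this fragment compiles only inside a driver:

```
DEFINE_WAIT(wait);

/* Same pattern as an ordinary sleep, but queued as an exclusive waiter:
 * wake_up will rouse at most one process sleeping this way. */
prepare_to_wait_exclusive(&dev->outq, &wait, TASK_INTERRUPTIBLE);
if (spacefree(dev) == 0)
    schedule();
finish_wait(&dev->outq, &wait);
```

Everything else about the sleep (re-testing the condition, checking signal_pending, reacquiring locks) is unchanged.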

6.2.5.4 The details of waking up

The view we have presented of the wakeup process is simpler than what really happens inside the kernel. The actual behavior that results when a process is awakened is controlled by a function in the wait queue entry. The default wakeup function[3] sets the process into a runnable state and, possibly, performs a context switch to that process if it has a higher priority. Device drivers should never need to supply a different wake function; should yours prove to be the exception, see <linux/wait.h> for information on how to do it.

[3] It has the imaginative name default_wake_function.

We have not yet seen all the variations of wake_up. Most driver writers never need the others, but, for completeness, here is the full set:

 

wake_up(wait_queue_head_t *queue);

 

wake_up_interruptible(wait_queue_head_t *queue);

wake_up awakens every process on the queue that is not in an exclusive wait, and exactly one exclusive waiter, if any exist. wake_up_interruptible does the same, with the exception that it skips over processes in an uninterruptible sleep. These functions can, before returning, cause one or more of the processes awakened to be scheduled (although this does not happen if they are called from an atomic context).

 

wake_up_nr(wait_queue_head_t *queue, int nr);

 

wake_up_interruptible_nr(wait_queue_head_t *queue, int nr);

These functions perform similarly to wake_up, except they can awaken up to nr exclusive waiters, instead of just one. Note that passing 0 is interpreted as asking for all of the exclusive waiters to be awakened, rather than none of them.

 

wake_up_all(wait_queue_head_t *queue);

 

wake_up_interruptible_all(wait_queue_head_t *queue);

This form of wake_up awakens all processes whether they are performing an exclusive wait or not (though the interruptible form still skips processes doing uninterruptible waits).

 

wake_up_interruptible_sync(wait_queue_head_t *queue);

Normally, a process that is awakened may preempt the current process and be scheduled into the processor before wake_up returns. In other words, a call to wake_up may not be atomic. If the process calling wake_up is running in an atomic context (it holds a spinlock, for example, or is an interrupt handler), this rescheduling does not happen. Normally, that protection is adequate. If, however, you need to explicitly ask to not be scheduled out of the processor at this time, you can use the "sync" variant of wake_up_interruptible. This function is most often used when the caller is about to reschedule anyway, and it is more efficient to simply finish what little work remains first.

If all of the above is not entirely clear on a first reading, don't worry. Very few drivers ever need to call anything except wake_up_interruptible.

6.2.5.5 Ancient history: sleep_on

If you spend any time digging through the kernel source, you will likely encounter two functions that we have neglected to discuss so far:

void sleep_on(wait_queue_head_t *queue);
void interruptible_sleep_on(wait_queue_head_t *queue);

 

As you might expect, these functions unconditionally put the current process to sleep on the given queue. These functions are strongly deprecated, however, and you should never use them. The problem is obvious if you think about it: sleep_on offers no way to protect against race conditions. There is always a window between when your code decides it must sleep and when sleep_on actually effects that sleep. A wakeup that arrives during that window is missed. For this reason, code that calls sleep_on is never entirely safe.

Current plans call for sleep_on and its variants (there are a couple of time-out forms we haven't shown) to be removed from the kernel in the not-too-distant future.

6.2.6. Testing the Scullpipe Driver

We have seen how the scullpipe driver implements blocking I/O. If you wish to try it out, the source to this driver can be found with the rest of the book examples. Blocking I/O in action can be seen by opening two windows. The first can run a command such as cat /dev/scullpipe. If you then, in another window, copy a file to /dev/scullpipe, you should see that file's contents appear in the first window.

Testing nonblocking activity is trickier, because the conventional programs available to a shell don't perform nonblocking operations. The misc-progs source directory contains the following simple program, called nbtest, for testing nonblocking operations. All it does is copy its input to its output, using nonblocking I/O and delaying between retries. The delay time is passed on the command line and is one second by default.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

char buffer[4096];

int main(int argc, char **argv)
{
    int delay = 1, n, m = 0;
    if (argc > 1)
        delay = atoi(argv[1]);
    fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); /* stdin */
    fcntl(1, F_SETFL, fcntl(1, F_GETFL) | O_NONBLOCK); /* stdout */
    while (1) {
        n = read(0, buffer, 4096);
        if (n >= 0)
            m = write(1, buffer, n);
        if ((n < 0 || m < 0) && (errno != EAGAIN))
            break;
        sleep(delay);
    }
    perror(n < 0 ? "stdin" : "stdout");
    exit(1);
}

 

If you run this program under a process tracing utility such as strace, you can see the success or failure of each operation, depending on whether data is available when the operation is tried.
