6.3. poll and select


Applications that use nonblocking I/O often use the poll, select, and epoll system calls as well. poll, select, and epoll have essentially the same functionality: each allows a process to determine whether it can read from or write to one or more open files without blocking. These calls can also block a process until any of a given set of file descriptors becomes available for reading or writing. Therefore, they are often used in applications that must use multiple input or output streams without getting stuck on any one of them. The same functionality is offered by multiple functions, because two were implemented in Unix almost at the same time by two different groups: select was introduced in BSD Unix, whereas poll was the System V solution. The epoll call[4] was added in 2.5.45 as a way of making the polling function scale to thousands of file descriptors.

[4] Actually, epoll is a set of three calls that together can be used to achieve the polling functionality. For our purposes, though, we can think of it as a single call.
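Seen from user space, the "block until a descriptor becomes available" behavior might be used as in the following minimal sketch. The helper name read_with_timeout and its ETIMEDOUT convention are our own assumptions for illustration; only poll and read are real system calls here.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for fd to become readable, then
 * read from it. Returns what read returns, or -1 with errno set to
 * ETIMEDOUT if nothing arrived in time.
 * (read_with_timeout is an illustrative name, not a standard call.)
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    assert(fd >= 0);            /* poll silently ignores negative fds */
    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;              /* errno was set by poll */
    if (n == 0) {
        errno = ETIMEDOUT;      /* our own convention for "timed out" */
        return -1;
    }
    /* Some event was reported, so this read does not sleep. */
    return read(fd, buf, len);
}
```

The process sleeps inside poll rather than inside read, which is what lets the same pattern be extended to several descriptors at once.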

Support for any of these calls requires support from the device driver. This support (for all three calls) is provided through the driver's poll method. This method has the following prototype:

unsigned int (*poll) (struct file *filp, poll_table *wait);

 

The driver method is called whenever the user-space program performs a poll, select, or epoll system call involving a file descriptor associated with the driver. The device method is in charge of these two steps:

1. Call poll_wait on one or more wait queues that could indicate a change in the poll status. If no file descriptors are currently available for I/O, the kernel causes the process to wait on the wait queues for all file descriptors passed to the system call.

2. Return a bit mask describing the operations (if any) that could be immediately performed without blocking.

Both of these operations are usually straightforward and tend to look very similar from one driver to the next. They rely, however, on information that only the driver can provide and, therefore, must be implemented individually by each driver.

The poll_table structure, the second argument to the poll method, is used within the kernel to implement the poll, select, and epoll calls; it is declared in <linux/poll.h>, which must be included by the driver source. Driver writers do not need to know anything about its internals and must use it as an opaque object; it is passed to the driver method so that the driver can load it with every wait queue that could wake up the process and change the status of the poll operation. The driver adds a wait queue to the poll_table structure by calling the function poll_wait:

 void poll_wait (struct file *, wait_queue_head_t *, poll_table *);

 

The second task performed by the poll method is returning the bit mask describing which operations could be completed immediately; this is also straightforward. For example, if the device has data available, a read would complete without sleeping; the poll method should indicate this state of affairs. Several flags (defined via <linux/poll.h>) are used to indicate the possible operations:

 

POLLIN

This bit must be set if the device can be read without blocking.

 

POLLRDNORM

This bit must be set if "normal" data is available for reading. A readable device returns (POLLIN | POLLRDNORM).

 

POLLRDBAND

This bit indicates that out-of-band data is available for reading from the device. It is currently used only in one place in the Linux kernel (the DECnet code) and is not generally applicable to device drivers.

 

POLLPRI

High-priority data (out-of-band) can be read without blocking. This bit causes select to report that an exception condition occurred on the file, because select reports out-of-band data as an exception condition.

 

POLLHUP

When a process reading this device sees end-of-file, the driver must set POLLHUP (hang-up). A process calling select is told that the device is readable, as dictated by the select functionality.

 

POLLERR

An error condition has occurred on the device. When poll is invoked, the device is reported as both readable and writable, since both read and write return an error code without blocking.

 

POLLOUT

This bit is set in the return value if the device can be written to without blocking.

 

POLLWRNORM

This bit has the same meaning as POLLOUT, and sometimes it actually is the same number. A writable device returns (POLLOUT | POLLWRNORM).

 

POLLWRBAND

Like POLLRDBAND, this bit means that data with nonzero priority can be written to the device. Only the datagram implementation of poll uses this bit, since a datagram can transmit out-of-band data.

It's worth repeating that POLLRDBAND and POLLWRBAND are meaningful only with file descriptors associated with sockets: device drivers won't normally use these flags.
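When testing a driver's poll method from user space, it can help to print the returned bits by name. The following is a small sketch of such a helper; describe_revents and the poll_flags table are hypothetical names of our own, and the flag values come from the user-space <poll.h> header rather than from the kernel's.

```c
#define _GNU_SOURCE     /* some C libraries hide POLLRDNORM and friends otherwise */
#include <assert.h>
#include <poll.h>
#include <string.h>

/* The flags discussed above, in the order they were presented. */
static const struct { short bit; const char *name; } poll_flags[] = {
    { POLLIN,     "POLLIN"     }, { POLLRDNORM, "POLLRDNORM" },
    { POLLRDBAND, "POLLRDBAND" }, { POLLPRI,    "POLLPRI"    },
    { POLLHUP,    "POLLHUP"    }, { POLLERR,    "POLLERR"    },
    { POLLOUT,    "POLLOUT"    }, { POLLWRNORM, "POLLWRNORM" },
    { POLLWRBAND, "POLLWRBAND" },
};

/*
 * Write the names of the bits set in revents into buf, separated
 * by '|', and return buf. Names that do not fit are dropped.
 */
const char *describe_revents(short revents, char *buf, size_t len)
{
    size_t used = 0, i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof poll_flags / sizeof poll_flags[0]; i++) {
        size_t name_len = strlen(poll_flags[i].name);

        if (!(revents & poll_flags[i].bit))
            continue;
        if (used + name_len + 2 > len)
            break;              /* no room for "|NAME" plus the NUL */
        if (used > 0)
            buf[used++] = '|';
        memcpy(buf + used, poll_flags[i].name, name_len + 1);
        used += name_len;
    }
    return buf;
}
```

On architectures where POLLWRNORM and POLLOUT share a value, both names appear whenever that bit is set.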

The description of poll takes up a lot of space for something that is relatively simple to use in practice. Consider the scullpipe implementation of the poll method:

static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;
    /*
     * The buffer is circular; it is considered full
     * if "wp" is right behind "rp" and empty if the
     * two are equal.
     */
    down(&dev->sem);
    poll_wait(filp, &dev->inq,  wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM;    /* readable */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM;   /* writable */
    up(&dev->sem);
    return mask;
}

 

This code simply adds the two scullpipe wait queues to the poll_table, then sets the appropriate mask bits depending on whether data can be read or written.
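The mask computation can be tried outside the kernel with a simplified model of the circular buffer. This is only a sketch: struct ring, ring_spacefree, and ring_poll_mask are hypothetical names, and the model uses array indexes where scullpipe uses pointers into its buffer.

```c
#define _GNU_SOURCE     /* some C libraries hide POLLRDNORM and friends otherwise */
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* User-space stand-in for the read/write positions of a circular buffer. */
struct ring {
    size_t size;    /* total number of slots */
    size_t rp;      /* index of the next byte to read */
    size_t wp;      /* index of the next byte to write */
};

/*
 * Free slots. One slot always stays unused, so rp == wp can only mean
 * "empty" and wp right behind rp means "full".
 */
static size_t ring_spacefree(const struct ring *r)
{
    assert(r->size > 1 && r->rp < r->size && r->wp < r->size);
    return (r->rp + r->size - r->wp - 1) % r->size;
}

/* Same decisions as the mask-setting part of the poll method above. */
short ring_poll_mask(const struct ring *r)
{
    short mask = 0;

    if (r->rp != r->wp)
        mask |= POLLIN | POLLRDNORM;    /* readable */
    if (ring_spacefree(r) > 0)
        mask |= POLLOUT | POLLWRNORM;   /* writable */
    return mask;
}
```

With four slots, an empty ring is only writable, a ring holding one byte is both readable and writable, and a ring holding three bytes is only readable.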

The poll code as shown is missing end-of-file support, because scullpipe does not support an end-of-file condition. For most real devices, the poll method should return POLLHUP if no more data is (or will become) available. If the caller used the select system call, the file is reported as readable. Regardless of whether poll or select is used, the application knows that it can call read without waiting forever, and the read method returns 0 to signal end-of-file.

With real FIFOs, for example, the reader sees an end-of-file when all the writers close the file, whereas in scullpipe the reader never sees end-of-file. The behavior is different because a FIFO is intended to be a communication channel between two processes, while scullpipe is a trash can where everyone can put data as long as there's at least one reader. Moreover, it makes no sense to reimplement what is already available in the kernel, so we chose to implement a different behavior in our example.
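The FIFO behavior can be observed from user space with an ordinary pipe. The sketch below assumes Linux pipe semantics, where the read end reports POLLHUP once every writer has closed; at_end_of_file is a hypothetical helper name.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Returns 1 if fd has reached end-of-file, 0 if data is still pending
 * or a writer may still produce some, and -1 on error.
 * (at_end_of_file is an illustrative name, not a standard call.)
 */
int at_end_of_file(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char byte;

    assert(fd >= 0);
    if (poll(&pfd, 1, 0) < 0)           /* timeout 0: never sleep */
        return -1;
    if (pfd.revents & POLLIN)
        return 0;                       /* buffered data comes before EOF */
    if (!(pfd.revents & POLLHUP))
        return 0;                       /* writers are still around */
    /* Hang-up with nothing buffered: read returns 0 at once. */
    return read(fd, &byte, 1) == 0 ? 1 : -1;
}
```

Data written before the last writer closed is still delivered first; only after it has been drained does read return 0.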

Implementing end-of-file in the same way as FIFOs do would mean checking dev->nwriters, both in read and in poll, and reporting end-of-file (as just described) if no process has the device opened for writing. Unfortunately, though, with this implementation, if a reader opened the scullpipe device before the writer, it would see end-of-file without having a chance to wait for data. The best way to fix this problem would be to implement blocking within open like real FIFOs do; this task is left as an exercise for the reader.

6.3.1. Interaction with read and write

The purpose of the poll and select calls is to determine in advance if an I/O operation will block. In that respect, they complement read and write. More important, poll and select are useful, because they let the application wait simultaneously for several data streams, although we are not exploiting this feature in the scull examples.
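Waiting on several data streams at once might look like the following user-space sketch built on select. The function name read_from_any, its parameters, and the ETIMEDOUT convention are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Sleep until one of the nfds descriptors in fds is readable, then read
 * from the first ready one and store its index in *which. A negative
 * timeout_ms waits forever. Returns what read returns, or -1 with errno
 * set to ETIMEDOUT if no descriptor became readable in time.
 */
ssize_t read_from_any(const int *fds, int nfds, void *buf, size_t len,
                      int timeout_ms, int *which)
{
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    fd_set readable;
    int i, n, maxfd = -1;

    FD_ZERO(&readable);
    for (i = 0; i < nfds; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);  /* select's hard limit */
        FD_SET(fds[i], &readable);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    n = select(maxfd + 1, &readable, NULL, NULL, timeout_ms < 0 ? NULL : &tv);
    if (n < 0)
        return -1;
    if (n == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    for (i = 0; i < nfds; i++) {
        if (FD_ISSET(fds[i], &readable)) {
            *which = i;
            return read(fds[i], buf, len);   /* reported readable: no sleep */
        }
    }
    errno = EIO;                             /* should not be reached */
    return -1;
}
```

Each descriptor passed in ends up in a call to its driver's poll method, which is why a correct poll implementation matters even for drivers whose own examples never multiplex.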

A correct implementation of the three calls is essential to make applications work correctly: although the following rules have more or less already been stated, we summarize them here.

6.3.1.1 Reading data from the device

If there is data in the input buffer, the read call should return immediately, with no noticeable delay, even if less data is available than the application requested, and the driver is sure the remaining data will arrive soon. You can always return less data than you're asked for if this is convenient for any reason (we did it in scull), provided you return at least one byte. In this case, poll should return POLLIN|POLLRDNORM.

If there is no data in the input buffer, by default read must block until at least one byte is there. If O_NONBLOCK is set, on the other hand, read returns immediately with a return value of -EAGAIN (although some old versions of System V return 0 in this case). In these cases, poll must report that the device is unreadable until at least one byte arrives. As soon as there is some data in the buffer, we fall back to the previous case.

If we are at end-of-file, read should return immediately with a return value of 0, independent of O_NONBLOCK. poll should report POLLHUP in this case.
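An application sees these three cases as three distinct outcomes of a nonblocking read. The sketch below classifies them; the enum and both function names are hypothetical. Note that a driver's -EAGAIN return value reaches user space as a return of -1 with errno set to EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum read_state { READ_DATA, READ_WOULD_BLOCK, READ_EOF, READ_ERROR };

/* Turn on O_NONBLOCK for fd, keeping its other status flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Perform one read on a nonblocking descriptor and classify the result.
 * *got receives the raw return value of read.
 */
enum read_state read_nonblocking(int fd, void *buf, size_t len, ssize_t *got)
{
    assert(got != NULL);
    *got = read(fd, buf, len);
    if (*got > 0)
        return READ_DATA;           /* possibly fewer bytes than requested */
    if (*got == 0)
        return READ_EOF;            /* end-of-file, O_NONBLOCK or not */
    return errno == EAGAIN ? READ_WOULD_BLOCK : READ_ERROR;
}
```

A short read (three bytes delivered when sixteen were requested, say) is READ_DATA, not an error; the caller simply asks again when it wants more.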

6.3.1.2 Writing to the device

If there is space in the output buffer, write should return without delay. It can accept less data than the call requested, but it must accept at least one byte. In this case, poll reports that the device is writable by returning POLLOUT|POLLWRNORM.

If the output buffer is full, by default write blocks until some space is freed. If O_NONBLOCK is set, write returns immediately with a return value of -EAGAIN (older System V Unices returned 0). In these cases, poll should report that the file is not writable. If, on the other hand, the device is not able to accept any more data, write returns -ENOSPC ("No space left on device"), independently of the setting of O_NONBLOCK.
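The agreement between write and poll on a full buffer can be checked from user space. The following sketch uses a pipe as a stand-in for a device with an output buffer and assumes Linux pipe behavior; fill_until_full and is_writable are hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Put fd in nonblocking mode and write zero bytes to it until the kernel
 * refuses more with EAGAIN. Returns the number of bytes accepted, or -1
 * on any other failure.
 */
long fill_until_full(int fd)
{
    static const char chunk[4096];      /* zero-filled payload */
    long total = 0;
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    for (;;) {
        ssize_t n = write(fd, chunk, sizeof chunk);

        if (n > 0)
            total += n;                 /* a partial write is still progress */
        else if (n < 0 && errno == EAGAIN)
            return total;               /* buffer full: write would block */
        else
            return -1;
    }
}

/* Returns 1 if poll reports fd writable right now, 0 if not, -1 on error. */
int is_writable(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    assert(fd >= 0);
    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return (pfd.revents & POLLOUT) ? 1 : 0;
}
```

Once write has failed with EAGAIN, poll stops reporting the descriptor as writable, and it reports it again after a reader has drained the buffer.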

Never make a write call wait for data transmission before returning, even if O_NONBLOCK is clear. This is because many applications use select to find out whether a write will block. If the device is reported as writable, the call must not block. If the program using the device wants to ensure that the data it enqueues in the output buffer is actually transmitted, the driver must provide an fsync method. For instance, a removable device should have an fsync entry point.

Although this is a good set of general rules, one should also recognize that each device is unique and that sometimes the rules must be bent slightly. For example, record-oriented devices (such as tape drives) cannot execute partial writes.

6.3.1.3 Flushing pending output

We've seen how the write method by itself doesn't account for all data output needs. The fsync function, invoked by the system call of the same name, fills the gap. This method's prototype is

 int (*fsync) (struct file *file, struct dentry *dentry, int datasync);

 

If some application ever needs to be assured that data has been sent to the device, the fsync method must be implemented regardless of whether O_NONBLOCK is set. A call to fsync should return only when the device has been completely flushed (i.e., the output buffer is empty), even if that takes some time. The datasync argument is used to distinguish between the fsync and fdatasync system calls; as such, it is only of interest to filesystem code and can be ignored by drivers.

The fsync method has no unusual features. The call isn't time critical, so every device driver can implement it to the author's taste. Most of the time, char drivers just have a NULL pointer in their fops. Block devices, on the other hand, always implement the method with the general-purpose block_fsync, which, in turn, flushes all the blocks of the device, waiting for I/O to complete.
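On the application side, a program can find out whether the file behind a descriptor offers an fsync method at all. The sketch below assumes Linux behavior, where a missing method makes the fsync system call fail with EINVAL; flush_to_device is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Ask the kernel to flush everything written to fd.
 * Returns 0 once the flush has completed, 1 if the file behind fd does
 * not support fsync, and -1 on any other error.
 */
int flush_to_device(int fd)
{
    assert(fd >= 0);
    if (fsync(fd) == 0)
        return 0;               /* output buffer has been flushed */
    if (errno == EINVAL)
        return 1;               /* no fsync support behind this descriptor */
    return -1;
}
```

A pipe is a convenient example of the second case, since it provides no fsync support.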

6.3.2. The Underlying Data Structure

The actual implementation of the poll and select system calls is reasonably simple, for those who are interested in how it works; epoll is a bit more complex but is built on the same mechanism. Whenever a user application calls poll, select, or epoll_ctl,[5] the kernel invokes the poll method of all files referenced by the system call, passing the same poll_table to each of them. The poll_table structure is just a wrapper around a function that builds the actual data structure. That structure, for poll and select, is a linked list of memory pages containing poll_table_entry structures. Each poll_table_entry holds the struct file and wait_queue_head_t pointers passed to poll_wait, along with an associated wait queue entry. The call to poll_wait sometimes also adds the process to the given wait queue. The whole structure must be maintained by the kernel so that the process can be removed from all of those queues before poll or select returns.

[5] This is the function that sets up the internal data structure for future calls to epoll_wait.

If none of the drivers being polled indicates that I/O can occur without blocking, the poll call simply sleeps until one of the (perhaps many) wait queues it is on wakes it up.

What's interesting in the implementation of poll is that the driver's poll method may be called with a NULL pointer as a poll_table argument. This situation can come about for a couple of reasons. If the application calling poll has provided a timeout value of 0 (indicating that no wait should be done), there is no reason to accumulate wait queues, and the system simply does not do it. The poll_table pointer is also set to NULL immediately after any driver being polled indicates that I/O is possible. Since the kernel knows at that point that no wait will occur, it does not build up a list of wait queues.

When the poll call completes, the poll_table structure is deallocated, and all wait queue entries previously added to the poll table (if any) are removed from the table and their wait queues.

We tried to show the data structures involved in polling in Figure 6-1; the figure is a simplified representation of the real data structures, because it ignores the multipage nature of a poll table and disregards the file pointer that is part of each poll_table_entry. The reader interested in the actual implementation is urged to look in <linux/poll.h> and fs/select.c.

Figure 6-1. The data structures behind poll


 

At this point, it is possible to understand the motivation behind the new epoll system call. In a typical case, a call to poll or select involves only a handful of file descriptors, so the cost of setting up the data structure is small. There are applications out there, however, that work with thousands of file descriptors. At that point, setting up and tearing down this data structure between every I/O operation becomes prohibitively expensive. The epoll system call family allows this sort of application to set up the internal kernel data structure exactly once and to use it many times.
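The "set up once, use many times" pattern might look like this from user space. This is a sketch under assumptions: watch_for_input and next_readable are hypothetical names, and epoll_create1 is a later variant of the original epoll_create call, so kernels contemporary with this text may offer only the latter.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Register every descriptor in fds for input events, once.
 * Returns the epoll descriptor, or -1 on error.
 */
int watch_for_input(const int *fds, int nfds)
{
    int i, epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    for (i = 0; i < nfds; i++) {
        struct epoll_event ev = { 0 };

        assert(fds[i] >= 0);
        ev.events = EPOLLIN;
        ev.data.fd = fds[i];        /* handed back to us by epoll_wait */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}

/*
 * Wait up to timeout_ms milliseconds for any registered descriptor.
 * Returns a readable descriptor, or -1 on timeout or error.
 */
int next_readable(int epfd, int timeout_ms)
{
    struct epoll_event ev;

    return epoll_wait(epfd, &ev, 1, timeout_ms) == 1 ? ev.data.fd : -1;
}
```

The registration loop runs a single time; every later call to next_readable reuses the kernel-side structure, instead of rebuilding it as each poll or select call must.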
