16.3. Request Processing


The core of every block driver is its request function. This function is where the real work gets done—or at least started; all the rest is overhead. Consequently, we spend a fair amount of time looking at request processing in block drivers.

A disk driver's performance can be a critical part of the performance of the system as a whole. Therefore, the kernel's block subsystem has been written with performance very much in mind; it does everything possible to enable your driver to get the most out of the devices it controls. This is a good thing, in that it enables blindingly fast I/O. On the other hand, the block subsystem necessarily exposes a great deal of complexity in the driver API. It is possible to write a very simple request function (we will see one shortly), but if your driver must perform at a high level on complex hardware, it will be anything but simple.

16.3.1. Introduction to the request Method

The block driver request method has the following prototype:

void request(request_queue_t *queue);

 

This function is called whenever the kernel believes it is time for your driver to process some reads, writes, or other operations on the device. The request function does not need to actually complete all of the requests on the queue before it returns; indeed, it probably does not complete any of them for most real devices. It must, however, make a start on those requests and ensure that they are all, eventually, processed by the driver.

Every device has a request queue. This is because actual transfers to and from a disk can take place far away from the time the kernel requests them, and because the kernel needs the flexibility to schedule each transfer at the most propitious moment (grouping together, for instance, requests that affect sectors close together on the disk). And the request function, you may remember, is associated with a request queue when that queue is created. Let us look back at how sbull makes its queue:

dev->queue = blk_init_queue(sbull_request, &dev->lock);

 

Thus, when the queue is created, the request function is associated with it. We also provided a spinlock as part of the queue creation process. Whenever our request function is called, that lock is held by the kernel. As a result, the request function is running in an atomic context; it must follow all of the usual rules for atomic code discussed in Chapter 5.

The queue lock also prevents the kernel from queuing any other requests for your device while your request function holds the lock. Under some conditions, you may want to consider dropping that lock while the request function runs. If you do so, however, you must be sure not to access the request queue, or any other data structure protected by the lock, while the lock is not held. You must also reacquire the lock before the request function returns.
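
As an example, a request function that drops the lock around slow per-request processing might be structured as in the following sketch. This is not sbull code; handle_one_request is a hypothetical helper, and blkdev_dequeue_request (described later in this chapter) takes the request off the queue so that it is not handed out again while the lock is dropped:

static void my_request(request_queue_t *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        blkdev_dequeue_request(req);   /* remove it before dropping the lock */
        spin_unlock(q->queue_lock);    /* the lock is not held during slow work */
        handle_one_request(req);       /* hypothetical; must not touch the queue */
        spin_lock(q->queue_lock);      /* reacquire before looping or returning */
    }
}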

Finally, the invocation of the request function is (usually) entirely asynchronous with respect to the actions of any user-space process. You cannot assume that the kernel is running in the context of the process that initiated the current request. You do not know if the I/O buffer provided by the request is in kernel or user space. So any sort of operation that explicitly accesses user space is in error and will certainly lead to trouble. As you will see, everything your driver needs to know about the request is contained within the structures passed to you via the request queue.

16.3.2. A Simple request Method

The sbull example driver provides a few different methods for request processing. By default, sbull uses a method called sbull_request, which is meant to be an example of the simplest possible request method. Without further ado, here it is:

static void sbull_request(request_queue_t *q)
{
    struct request *req;
    while ((req = elv_next_request(q)) != NULL) {
        struct sbull_dev *dev = req->rq_disk->private_data;
        if (! blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sbull_transfer(dev, req->sector, req->current_nr_sectors,
                req->buffer, rq_data_dir(req));
        end_request(req, 1);
    }
}

 

This function introduces the struct request structure. We will examine struct request in great detail later on; for now, suffice it to say that it represents a block I/O request for us to execute.

The kernel provides the function elv_next_request to obtain the first incomplete request on the queue; that function returns NULL when there are no requests to be processed. Note that elv_next_request does not remove the request from the queue. If you call it twice with no intervening operations, it returns the same request structure both times. In this simple mode of operation, requests are taken off the queue only when they are complete.

A block request queue can contain requests that do not actually move blocks to and from a disk. Such requests can include vendor-specific, low-level diagnostics operations or instructions relating to specialized device modes, such as the packet writing mode for recordable media. Most block drivers do not know how to handle such requests and simply fail them; sbull works in this way as well. The call to blk_fs_request tells us whether we are looking at a filesystem request—one that moves blocks of data. If a request is not a filesystem request, we pass it to end_request:

void end_request(struct request *req, int succeeded);

 

When we dispose of nonfilesystem requests, we pass succeeded as 0 to indicate that we did not successfully complete the request. Otherwise, we call sbull_transfer to actually move the data, using a set of fields provided in the request structure:

 

sector_t sector;

The index of the beginning sector on our device. Remember that this sector number, like all such numbers passed between the kernel and the driver, is expressed in 512-byte sectors. If your hardware uses a different sector size, you need to scale sector accordingly. For example, if the hardware uses 2048-byte sectors, you need to divide the beginning sector number by four before putting it into a request for the hardware.

 

unsigned long nr_sectors;

The number of (512-byte) sectors to be transferred.

 

char *buffer;

A pointer to the buffer to or from which the data should be transferred. This pointer is a kernel virtual address and can be dereferenced directly by the driver if need be.

 

rq_data_dir(struct request *req);

This macro extracts the direction of the transfer from the request; a zero return value denotes a read from the device, and a nonzero return value denotes a write to the device.

Given this information, the sbull driver can implement the actual data transfer with a simple memcpy call—our data is already in memory, after all. The function that performs this copy operation (sbull_transfer) also handles the scaling of sector sizes and ensures that we do not try to copy beyond the end of our virtual device:

static void sbull_transfer(struct sbull_dev *dev, unsigned long sector,
        unsigned long nsect, char *buffer, int write)
{
    unsigned long offset = sector*KERNEL_SECTOR_SIZE;
    unsigned long nbytes = nsect*KERNEL_SECTOR_SIZE;
    if ((offset + nbytes) > dev->size) {
        printk (KERN_NOTICE "Beyond-end write (%ld %ld)\n", offset, nbytes);
        return;
    }
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
}

 

With this code, sbull implements a complete, simple RAM-based disk device. It is not, however, a realistic driver for many types of devices, for a couple of reasons.

The first of those reasons is that sbull executes requests synchronously, one at a time. High-performance disk devices are capable of having numerous requests outstanding at the same time; the disk's onboard controller can then choose to execute them in the optimal order (one hopes). As long as we process only the first request in the queue, we can never have multiple requests being fulfilled at a given time. Being able to work with more than one request requires a deeper understanding of request queues and the request structure; the next few sections help build that understanding.

There is another issue to consider, however. The best performance is obtained from disk devices when the system performs large transfers involving multiple sectors that are located together on the disk. The highest cost in a disk operation is always the positioning of the read and write heads; once that is done, the time required to actually read or write the data is almost insignificant. The developers who design and implement filesystems and virtual memory subsystems understand this, so they do their best to locate related data contiguously on the disk and to transfer as many sectors as possible in a single request. The block subsystem also helps in this regard; request queues contain a great deal of logic aimed at finding adjacent requests and coalescing them into larger operations.

The sbull driver, however, takes all that work and simply ignores it. Only one buffer is transferred at a time, meaning that the largest single transfer is almost never going to exceed the size of a single page. A block driver can do much better than that, but it requires a deeper understanding of request structures and the bio structures from which requests are built.

The next few sections delve more deeply into how the block layer does its job and the data structures that result from that work.

16.3.3. Request Queues

In the simplest sense, a block request queue is exactly that: a queue of block I/O requests. If you look under the hood, a request queue turns out to be a surprisingly complex data structure. Fortunately, drivers need not worry about most of that complexity.

Request queues keep track of outstanding block I/O requests. But they also play a crucial role in the creation of those requests. The request queue stores parameters that describe what kinds of requests the device is able to service: their maximum size, how many separate segments may go into a request, the hardware sector size, alignment requirements, etc. If your request queue is properly configured, it should never present you with a request that your device cannot handle.

Request queues also implement a plug-in interface that allows multiple I/O schedulers (or elevators) to be used. An I/O scheduler's job is to present I/O requests to your driver in a way that maximizes performance. To this end, most I/O schedulers accumulate a batch of requests, sort them into increasing (or decreasing) block index order, and present the requests to the driver in that order. The disk head, when given a sorted list of requests, works its way from one end of the disk to the other, much like a full elevator moves in a single direction until all of its "requests" (people waiting to get off) have been satisfied. The 2.6 kernel includes a "deadline scheduler," which makes an effort to ensure that every request is satisfied within a preset maximum time, and an "anticipatory scheduler," which actually stalls a device briefly after a read request in anticipation that another, adjacent read will arrive almost immediately. As of this writing, the default scheduler is the anticipatory scheduler, which seems to give the best interactive system performance.

The I/O scheduler is also charged with merging adjacent requests. When a new I/O request is handed to the scheduler, it searches the queue for requests involving adjacent sectors; if one is found and if the resulting request would not be too large, the two requests are merged.

Request queues have a type of struct request_queue or request_queue_t. This type, and the many functions that operate on it, are defined in <linux/blkdev.h>. If you are interested in the implementation of request queues, you can find most of the code in drivers/block/ll_rw_blk.c and elevator.c.

16.3.3.1 Queue creation and deletion

As we saw in our example code, a request queue is a dynamic data structure that must be created by the block I/O subsystem. The function to create and initialize a request queue is:

request_queue_t *blk_init_queue(request_fn_proc *request, spinlock_t *lock);

 

The arguments are, of course, the request function for this queue and a spinlock that controls access to the queue. This function allocates memory (quite a bit of memory, actually) and can fail because of this; you should always check the return value before attempting to use the queue.

As part of the initialization of a request queue, you can set the field queuedata (which is a void * pointer) to any value you like. This field is the request queue's equivalent to the private_data we have seen in other structures.
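
Putting the pieces together, queue setup code modeled on sbull might look like the following sketch; the sbull_dev structure and the out_cleanup label are assumptions for illustration:

dev->queue = blk_init_queue(sbull_request, &dev->lock);
if (dev->queue == NULL)
    goto out_cleanup;            /* allocation failed; hypothetical error path */
dev->queue->queuedata = dev;     /* so the request function can find our device */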

To return a request queue to the system (at module unload time, generally), call blk_cleanup_queue :

void blk_cleanup_queue(request_queue_t *);

 

After this call, your driver sees no more requests from the given queue and should not reference it again.

16.3.3.2 Queueing functions

There is a very small set of functions for the manipulation of requests on queues—at least, as far as drivers are concerned. You must hold the queue lock before you call these functions.

The function that returns the next request to process is elv_next_request :

struct request *elv_next_request(request_queue_t *queue);

 

We have already seen this function in the simple sbull example. It returns a pointer to the next request to process (as determined by the I/O scheduler) or NULL if no more requests remain to be processed. elv_next_request leaves the request on the queue but marks it as being active; this mark prevents the I/O scheduler from attempting to merge other requests with this one once you start to execute it.

To actually remove a request from a queue, use blkdev_dequeue_request :

void blkdev_dequeue_request(struct request *req);

 

If your driver operates on multiple requests from the same queue simultaneously, it must dequeue them in this manner.

Should you need to put a dequeued request back on the queue for some reason, you can call:

void elv_requeue_request(request_queue_t *queue, struct request *req);
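
As a sketch of how these queueing functions combine, a driver that starts requests asynchronously (completing them later, perhaps from an interrupt handler) might contain a loop like the following; my_device_start is a hypothetical function that returns a negative value when the hardware cannot accept more work:

while ((req = elv_next_request(queue)) != NULL) {
    blkdev_dequeue_request(req);          /* we will complete it out of band */
    if (my_device_start(dev, req) < 0) {  /* hypothetical: hardware is full */
        elv_requeue_request(queue, req);  /* put it back for a later attempt */
        break;
    }
}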

 

16.3.3.3 Queue control functions

The block layer exports a set of functions that can be used by a driver to control how a request queue operates; a brief configuration sketch follows the list. These functions include:

 

void blk_stop_queue(request_queue_t *queue);

 

void blk_start_queue(request_queue_t *queue);

If your device has reached a state where it can handle no more outstanding commands, you can call blk_stop_queue to tell the block layer. After this call, your request function will not be called until you call blk_start_queue. Needless to say, you should not forget to restart the queue when your device can handle more requests. The queue lock must be held when calling either of these functions.

 

void blk_queue_bounce_limit(request_queue_t *queue, u64 dma_addr);

Function that tells the kernel the highest physical address to which your device can perform DMA. If a request comes in containing a reference to memory above the limit, a bounce buffer will be used for the operation; this is, of course, an expensive way to perform block I/O and should be avoided whenever possible. You can provide any reasonable physical address in this argument, or make use of the predefined symbols BLK_BOUNCE_HIGH (use bounce buffers for high-memory pages), BLK_BOUNCE_ISA (the driver can DMA only into the 16-MB ISA zone), or BLK_BOUNCE_ANY (the driver can perform DMA to any address). The default value is BLK_BOUNCE_HIGH.

void blk_queue_max_sectors(request_queue_t *queue, unsigned short max);

 

void blk_queue_max_phys_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_hw_segments(request_queue_t *queue, unsigned short max);

 

void blk_queue_max_segment_size(request_queue_t *queue, unsigned int max);

Functions that set parameters describing the requests that can be satisfied by this device. blk_queue_max_sectors can be used to set the maximum size of any request in (512-byte) sectors; the default is 255. blk_queue_max_phys_segments and blk_queue_max_hw_segments both control how many physical segments (nonadjacent areas in system memory) may be contained within a single request. Use blk_queue_max_phys_segments to say how many segments your driver is prepared to cope with; this may be the size of a statically allocated scatterlist, for example. blk_queue_max_hw_segments, in contrast, is the maximum number of segments that the device itself can handle. Both of these parameters default to 128. Finally, blk_queue_max_segment_size tells the kernel how large any individual segment of a request can be in bytes; the default is 65,536 bytes.

 

blk_queue_segment_boundary(request_queue_t *queue, unsigned long mask);

Some devices cannot handle requests that cross a particular size memory boundary; if your device is one of those, use this function to tell the kernel about that boundary. For example, if your device has trouble with requests that cross a 4-MB boundary, pass in a mask of 0x3fffff. The default mask is 0xffffffff.

 

void blk_queue_dma_alignment(request_queue_t *queue, int mask);

Function that tells the kernel about the memory alignment constraints your device imposes on DMA transfers. All requests are created with the given alignment, and the length of the request also matches the alignment. The default mask is 0x1ff, which causes all requests to be aligned on 512-byte boundaries.

 

void blk_queue_hardsect_size(request_queue_t *queue, unsigned short max);

Tells the kernel about your device's hardware sector size. All requests generated by the kernel are a multiple of this size and are properly aligned. All communication between the block layer and the driver continues to be expressed in 512-byte sectors, however.
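
To illustrate, a driver's initialization code might configure its queue along the lines of the following sketch. The values shown are arbitrary assumptions chosen for a hypothetical device, not recommendations:

blk_queue_bounce_limit(dev->queue, BLK_BOUNCE_HIGH); /* no DMA from high memory */
blk_queue_max_sectors(dev->queue, 64);               /* at most 32 KB per request */
blk_queue_max_phys_segments(dev->queue, 16);         /* matches our scatterlist size */
blk_queue_hardsect_size(dev->queue, 2048);           /* 2048-byte hardware sectors */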

16.3.4. The Anatomy of a Request

In our simple example, we encountered the request structure. However, we have barely scratched the surface of that complicated data structure. In this section, we look, in some detail, at how block I/O requests are represented in the Linux kernel.

Each request structure represents one block I/O request, although it may have been formed through a merger of several independent requests at a higher level. The sectors to be transferred for any particular request may be distributed throughout main memory, although they always correspond to a set of consecutive sectors on the block device. The request is represented as a set of segments, each of which corresponds to one in-memory buffer. The kernel may join multiple requests that involve adjacent sectors on the disk, but it never combines read and write operations within a single request structure. The kernel also makes sure not to combine requests if the result would violate any of the request queue limits described in the previous section.

A request structure is implemented, essentially, as a linked list of bio structures combined with some housekeeping information to enable the driver to keep track of its position as it works through the request. The bio structure is a low-level description of a portion of a block I/O request; we take a look at it now.

16.3.4.1 The bio structure

When the kernel, in the form of a filesystem, the virtual memory subsystem, or a system call, decides that a set of blocks must be transferred to or from a block I/O device, it puts together a bio structure to describe that operation. That structure is then handed to the block I/O code, which merges it into an existing request structure or, if need be, creates a new one. The bio structure contains everything that a block driver needs to carry out the request without reference to the user-space process that caused that request to be initiated.

The bio structure, which is defined in <linux/bio.h>, contains a number of fields that may be of use to driver authors:

 

sector_t bi_sector;

The first (512-byte) sector to be transferred for this bio.

 

unsigned int bi_size;

The size of the data to be transferred, in bytes. Instead, it is often easier to use bio_sectors(bio), a macro that gives the size in sectors.

 

unsigned long bi_flags;

A set of flags describing the bio; the least significant bit is set if this is a write request (although the macro bio_data_dir(bio) should be used instead of looking at the flags directly).

 

unsigned short bi_phys_segments;

 

unsigned short bi_hw_segments;

The number of physical segments contained within this BIO and the number of segments seen by the hardware after DMA mapping is done, respectively.

The core of a bio, however, is an array called bi_io_vec , which is made up of the following structure:

struct bio_vec {
        struct page     *bv_page;
        unsigned int    bv_len;
        unsigned int    bv_offset;
};

 

Figure 16-1 shows how these structures all tie together. As you can see, by the time a block I/O request is turned into a bio structure, it has been broken down into individual pages of physical memory. All a driver needs to do is to step through this array of structures (there are bi_vcnt of them) and transfer data within each page (but only len bytes starting at offset).

Figure 16-1. The bio structure


 

Working directly with the bi_io_vec array is discouraged in the interest of kernel developers being able to change the bio structure in the future without breaking things. To that end, a set of macros has been provided to ease the process of working with the bio structure. The place to start is with bio_for_each_segment, which simply loops through every unprocessed entry in the bi_io_vec array. This macro should be used as follows:

int segno;
struct bio_vec *bvec;
bio_for_each_segment(bvec, bio, segno) {
    /* Do something with this segment */
}

 

Within this loop, bvec points to the current bio_vec entry, and segno is the current segment number. These values can be used to set up DMA transfers (an alternative way using blk_rq_map_sg is described in Section 16.3.5.2). If you need to access the pages directly, you should first ensure that a proper kernel virtual address exists; to that end, you can use:

char *__bio_kmap_atomic(struct bio *bio, int i, enum km_type type);
void __bio_kunmap_atomic(char *buffer, enum km_type type);

 

This low-level function allows you to directly map the buffer found in a given bio_vec, as indicated by the index i. An atomic kmap is created; the caller must provide the appropriate slot to use (as described in Section 15.1.4).

The block layer also maintains a set of pointers within the bio structure to keep track of the current state of request processing. Several macros exist to provide access to that state:

struct page *bio_page(struct bio *bio);

Returns a pointer to the page structure representing the page to be transferred next.

 

int bio_offset(struct bio *bio);

Returns the offset within the page for the data to be transferred.

 

int bio_cur_sectors(struct bio *bio);

Returns the number of sectors to be transferred out of the current page.

 

char *bio_data(struct bio *bio);

Returns a kernel logical address pointing to the data to be transferred. Note that this address is available only if the page in question is not located in high memory; calling it in other situations is a bug. By default, the block subsystem does not pass high-memory buffers to your driver, but if you have changed that setting with blk_queue_bounce_limit, you probably should not be using bio_data.

 

char *bio_kmap_irq(struct bio *bio, unsigned long *flags);

 

void bio_kunmap_irq(char *buffer, unsigned long *flags);

bio_kmap_irq returns a kernel virtual address for any buffer, regardless of whether it resides in high or low memory. An atomic kmap is used, so your driver cannot sleep while this mapping is active. Use bio_kunmap_irq to unmap the buffer. Note that the flags argument is passed by pointer here. Note also that since an atomic kmap is used, you cannot map more than one segment at a time.

All of the functions just described access the "current" buffer—the first buffer that, as far as the kernel knows, has not been transferred. Drivers often want to work through several buffers in the bio before signaling completion on any of them (with end_that_request_first, to be described shortly), so these functions are often not useful. Several other macros exist for working with the internals of the bio structure (see <linux/bio.h> for details).
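
When the current-buffer interface does fit your driver, usage follows a pattern like this sketch; my_dest is a hypothetical driver buffer, and KERNEL_SECTOR_SIZE is the 512-byte unit used throughout this chapter:

unsigned long flags;
char *buffer = bio_kmap_irq(bio, &flags);   /* atomic kmap; do not sleep here */
memcpy(my_dest, buffer, bio_cur_sectors(bio) * KERNEL_SECTOR_SIZE);
bio_kunmap_irq(buffer, &flags);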

16.3.4.2 Request structure fields

Now that we have an idea of how the bio structure works, we can get deep into struct request and see how request processing works. The fields of this structure include:

 

sector_t hard_sector;

 

unsigned long hard_nr_sectors;

 

unsigned int hard_cur_sectors;

Fields that track the sectors that the driver has yet to complete. The first sector that has not been transferred is stored in hard_sector, the total number of sectors yet to transfer is in hard_nr_sectors, and the number of sectors remaining in the current bio is hard_cur_sectors. These fields are intended for use only within the block subsystem; drivers should not make use of them.

 

struct bio *bio;

bio is the linked list of bio structures for this request. You should not access this field directly; use rq_for_each_bio (described later) instead.

 

char *buffer;

The simple driver example earlier in this chapter used this field to find the buffer for the transfer. With our deeper understanding, we can now see that this field is simply the result of calling bio_data on the current bio.

 

unsigned short nr_phys_segments;

The number of distinct segments occupied by this request in physical memory after adjacent pages have been merged.

 

struct list_head queuelist;

The linked-list structure (as described in Section 11.5) that links the request into the request queue. If (and only if) you remove the request from the queue with blkdev_dequeue_request, you may use this list head to track the request in an internal list maintained by your driver.

Figure 16-2 shows how the request structure and its component bio structures fit together. In the figure, the request has been partially satisfied; the bio and buffer fields point to the first bio that has not yet been transferred.

Figure 16-2. A request queue with a partially processed request


 

There are many other fields inside the request structure, but the list in this section should be enough for most driver writers.

16.3.4.3 Barrier requests

The block layer reorders requests before your driver sees them to improve I/O performance. Your driver, too, can reorder requests if there is a reason to do so. Often, this reordering happens by passing multiple requests to the drive and letting the hardware figure out the optimal ordering. There is a problem with unrestricted reordering of requests, however: some applications require guarantees that certain operations will complete before others are started. Relational database managers, for example, must be absolutely sure that their journaling information has been flushed to the drive before executing a transaction on the database contents. Journaling filesystems, which are now in use on most Linux systems, have very similar ordering constraints. If the wrong operations are reordered, the result can be severe, undetected data corruption.

The 2.6 block layer addresses this problem with the concept of a barrier request. If a request is marked with the REQ_HARDBARRIER flag, it must be written to the drive before any following request is initiated. By "written to the drive," we mean that the data must actually reside and be persistent on the physical media. Many drives perform caching of write requests; this caching improves performance, but it can defeat the purpose of barrier requests. If a power failure occurs when the critical data is still sitting in the drive's cache, that data is still lost even if the drive has reported completion. So a driver that implements barrier requests must take steps to force the drive to actually write the data to the media.

If your driver honors barrier requests, the first step is to inform the block layer of this fact. Barrier handling is another of the request queue parameters; it is set with:

void blk_queue_ordered(request_queue_t *queue, int flag);

 

To indicate that your driver implements barrier requests, set the flag parameter to a nonzero value.

The actual implementation of barrier requests is simply a matter of testing for the associated flag in the request structure. A macro has been provided to perform this test:

int blk_barrier_rq(struct request *req);

 

If this macro returns a nonzero value, the request is a barrier request. Depending on how your hardware works, you may have to stop taking requests from the queue until the barrier request has been completed. Other drives can understand barrier requests themselves; in this case, all your driver has to do is to issue the proper operations for those drives.
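
A request function that honors barriers might, as a sketch, handle them by draining the device before issuing the barrier request; my_wait_for_drive_idle and my_flush_drive_cache are hypothetical helpers standing in for whatever your hardware requires:

while ((req = elv_next_request(q)) != NULL) {
    if (blk_barrier_rq(req)) {
        my_wait_for_drive_idle(dev);   /* hypothetical: complete outstanding I/O */
        my_flush_drive_cache(dev);     /* hypothetical: force data to the media */
    }
    /* ... start the request as usual ... */
}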

16.3.4.4 Nonretryable requests

Block drivers often attempt to retry requests that fail the first time. This behavior can lead to a more reliable system and help to avoid data loss. The kernel, however, sometimes marks requests as not being retryable. Such requests should simply fail as quickly as possible if they cannot be executed on the first try.

If your driver is considering retrying a failed request, it should first make a call to:

int blk_noretry_request(struct request *req);

 

If this macro returns a nonzero value, your driver should simply abort the request with an error code instead of retrying it.
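
In a driver's error path, the test might be used as in this sketch; my_start_transfer and my_schedule_retry are hypothetical helpers:

if (my_start_transfer(dev, req) < 0) {    /* the transfer failed */
    if (blk_noretry_request(req))
        end_request(req, 0);              /* fail it immediately, no retry */
    else
        my_schedule_retry(dev, req);      /* hypothetical retry logic */
}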

16.3.5. Request Completion Functions

There are, as we will see, several different ways of working through a request structure. All of them make use of a couple of common functions, however, which handle the completion of an I/O request or parts of a request. Both of these functions are atomic and can be safely called from an atomic context.

When your device has completed transferring some or all of the sectors in an I/O request, it must inform the block subsystem with:

int end_that_request_first(struct request *req, int success, int count);

 

This function tells the block code that your driver has finished with the transfer of count sectors starting where you last left off. If the I/O was successful, pass success as 1; otherwise pass 0. Note that you must signal completion in order from the first sector to the last; if your driver and device somehow conspire to complete requests out of order, you have to store the out-of-order completion status until the intervening sectors have been transferred.

The return value from end_that_request_first is an indication of whether all sectors in this request have been transferred or not. A return value of 0 means that all sectors have been transferred and that the request is complete. At that point, you must dequeue the request with blkdev_dequeue_request (if you have not already done so) and pass it to:

void end_that_request_last(struct request *req);

 

end_that_request_last informs whoever is waiting for the request that it has completed and recycles the request structure; it must be called with the queue lock held.

In our simple sbull example, we didn't use any of the above functions. That example, instead, simply calls end_request. To show the effects of this call, here is the entire end_request function as seen in the 2.6.10 kernel:

void end_request(struct request *req, int uptodate)
{
    if (!end_that_request_first(req, uptodate, req->hard_cur_sectors)) {
        add_disk_randomness(req->rq_disk);
        blkdev_dequeue_request(req);
        end_that_request_last(req);
    }
}

 

The function add_disk_randomness uses the timing of block I/O requests to contribute entropy to the system's random number pool; it should be called only if the disk's timing is truly random. That is true for most mechanical devices, but it is not true for a memory-based virtual device, such as sbull. For this reason, the more complicated version of sbull shown in the next section does not call add_disk_randomness.

16.3.5.1 Working with bios

You now know enough to write a block driver that works directly with the bio structures that make up a request. An example might help, however. If the sbull driver is loaded with the request_mode parameter set to 1, it registers a bio-aware request function instead of the simple function we saw above. That function looks like this:

static void sbull_full_request(request_queue_t *q)
{
    struct request *req;
    int sectors_xferred;
    struct sbull_dev *dev = q->queuedata;
    while ((req = elv_next_request(q)) != NULL) {
        if (! blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sectors_xferred = sbull_xfer_request(dev, req);
        if (! end_that_request_first(req, 1, sectors_xferred)) {
            blkdev_dequeue_request(req);
            end_that_request_last(req);
        }
    }
}

 

This function simply takes each request, passes it to sbull_xfer_request, then completes it with end_that_request_first and, if necessary, end_that_request_last. Thus, this function is handling the high-level queue and request management parts of the problem. The job of actually executing a request, however, falls to sbull_xfer_request:

static int sbull_xfer_request(struct sbull_dev *dev, struct request *req)
{
    struct bio *bio;
    int nsect = 0;
    rq_for_each_bio(bio, req) {
        sbull_xfer_bio(dev, bio);
        nsect += bio->bi_size/KERNEL_SECTOR_SIZE;
    }
    return nsect;
}

 

Here we introduce another macro: rq_for_each_bio. As you might expect, this macro simply steps through each bio structure in the request, giving us a pointer that we can pass to sbull_xfer_bio for the transfer. That function looks like:

static int sbull_xfer_bio(struct sbull_dev *dev, struct bio *bio)
{
    int i;
    struct bio_vec *bvec;
    sector_t sector = bio->bi_sector;
    /* Do each segment independently. */
    bio_for_each_segment(bvec, bio, i) {
        char *buffer = __bio_kmap_atomic(bio, i, KM_USER0);
        sbull_transfer(dev, sector, bio_cur_sectors(bio),
                buffer, bio_data_dir(bio) == WRITE);
        sector += bio_cur_sectors(bio);
        __bio_kunmap_atomic(bio, KM_USER0);
    }
    }
    return 0; /* Always "succeed" */
}

 

This function simply steps through each segment in the bio structure, gets a kernel virtual address to access the buffer, then calls the same sbull_transfer function we saw earlier to copy the data over.

Each device has its own needs, but, as a general rule, the code just shown should serve as a model for many situations where digging through the bio structures is needed.

16.3.5.2 Block requests and DMA

If you are working on a high-performance block driver, chances are you will be using DMA for the actual data transfers. A block driver can certainly step through the bio structures, as described above, create a DMA mapping for each one, and pass the result to the device. There is an easier way, however, if your device can do scatter/gather I/O. The function:

int blk_rq_map_sg(request_queue_t *queue, struct request *req,
                  struct scatterlist *list);

 

fills in the given list with the full set of segments from the given request. Segments that are adjacent in memory are coalesced prior to insertion into the scatterlist, so you need not try to detect them yourself. The return value is the number of entries in the list. The function also passes back, in its third argument, a scatterlist suitable for passing to dma_map_sg. (See Section 15.4.4.7 for more information on dma_map_sg.)

Your driver must allocate the storage for the scatterlist before calling blk_rq_map_sg. The list must be able to hold at least as many entries as the request has physical segments; the struct request field nr_phys_segments holds that count, which will not exceed the maximum number of physical segments specified with blk_queue_max_phys_segments.
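
The sequence, sketched under the assumption of a PCI device (pdev) and a statically sized scatterlist matching a blk_queue_max_phys_segments(queue, 16) limit set at initialization, looks something like:

struct scatterlist sgl[16];    /* room for up to nr_phys_segments entries */
int nseg;

nseg = blk_rq_map_sg(queue, req, sgl);      /* fill in the scatterlist */
nseg = dma_map_sg(&pdev->dev, sgl, nseg,    /* then map it for DMA */
                  rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);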

If you do not want blk_rq_map_sg to coalesce adjacent segments, you can change the default behavior with a call such as:

clear_bit(QUEUE_FLAG_CLUSTER, &queue->queue_flags);

 

Some SCSI disk drivers mark their request queue in this way, since they do not benefit from the coalescing of requests.

16.3.5.3 Doing without a request queue

Previously, we have discussed the work the kernel does to optimize the order of requests in the queue; this work involves sorting requests and, perhaps, even stalling the queue to allow an anticipated request to arrive. These techniques help the system's performance when dealing with a real, spinning disk drive. They are completely wasted, however, with a device like sbull. Many block-oriented devices, such as flash memory arrays, readers for media cards used in digital cameras, and RAM disks have truly random-access performance and do not benefit from advanced-request queueing logic. Other devices, such as software RAID arrays or virtual disks created by logical volume managers, do not have the performance characteristics for which the block layer's request queues are optimized. For this kind of device, it would be better to accept requests directly from the block layer and not bother with the request queue at all.

For these situations, the block layer supports a "no queue" mode of operation. To make use of this mode, your driver must provide a "make request" function, rather than a request function. The make_request function has this prototype:

typedef int (make_request_fn) (request_queue_t *q, struct bio *bio);

 

Note that a request queue is still present, even though it will never actually hold any requests. The make_request function takes as its main parameter a bio structure, which represents one or more buffers to be transferred. The make_request function can do one of two things: it can either perform the transfer directly, or it can redirect the request to another device.

Performing the transfer directly is just a matter of working through the bio with the accessor methods we described earlier. Since there is no request structure to work with, however, your function should signal completion directly to the creator of the bio structure with a call to bio_endio:

void bio_endio(struct bio *bio, unsigned int bytes, int error);

 

Here, bytes is the number of bytes you have transferred so far. It can be less than the number of bytes represented by the bio as a whole; in this way, you can signal partial completion, and update the internal "current buffer" pointers within the bio. You should either call bio_endio again as your device makes further progress, or signal an error if you are unable to complete the request. Errors are indicated by providing a nonzero value for the error parameter; this value is normally an error code such as -EIO. The make_request function should return 0, regardless of whether the I/O is successful.

If sbull is loaded with request_mode=2, it operates with a make_request function. Since sbull already has a function that can transfer a single bio, the make_request function is simple:

static int sbull_make_request(request_queue_t *q, struct bio *bio)
{
    struct sbull_dev *dev = q->queuedata;
    int status;
    status = sbull_xfer_bio(dev, bio);
    bio_endio(bio, bio->bi_size, status);
    return 0;
}

 

Please note that you should never call bio_endio from a regular request function; that job is handled by end_that_request_first instead.

Some block drivers, such as those implementing volume managers and software RAID arrays, really need to redirect the request to another device that handles the actual I/O. Writing such a driver is beyond the scope of this book. We note, however, that if the make_request function returns a nonzero value, the bio is submitted again. A "stacking" driver can, therefore, modify the bi_bdev field to point to a different device, change the starting sector value, then return; the block system then passes the bio to the new device. There is also a bio_split call that can be used to split a bio into multiple chunks for submission to more than one device, although if the queue parameters are set up correctly, splitting a bio in this way should almost never be necessary.
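
As a sketch of the redirection idea only (a real stacking driver involves much more), the remapping just described could look like this; struct my_volume and its fields are assumptions for illustration:

static int stacking_make_request(request_queue_t *q, struct bio *bio)
{
    struct my_volume *vol = q->queuedata;   /* hypothetical per-volume data */

    bio->bi_bdev = vol->backing_bdev;       /* point the bio at the real device */
    bio->bi_sector += vol->sector_offset;   /* remap the starting sector */
    return 1;  /* nonzero: the block layer resubmits the bio to the new device */
}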

Either way, you must tell the block subsystem that your driver is using a custom make_request function. To do so, you must allocate a request queue with:

request_queue_t *blk_alloc_queue(int flags);

 

This function differs from blk_init_queue in that it does not actually set up the queue to hold requests. The flags argument is a set of allocation flags to be used in allocating memory for the queue; usually the right value is GFP_KERNEL. Once you have a queue, pass it and your make_request function to blk_queue_make_request:

void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);

 

The sbull code to set up the make_request function looks like:

dev->queue = blk_alloc_queue(GFP_KERNEL);
if (dev->queue == NULL)
    goto out_vfree;
blk_queue_make_request(dev->queue, sbull_make_request);

 

For the curious, some time spent digging through drivers/block/ll_rw_blk.c shows that all queues have a make_request function. The default version, generic_make_request, handles the incorporation of the bio into a request structure. By providing a make_request function of its own, a driver is really just overriding a specific request queue method and short-circuiting much of the work.
