3.7. read and write
The read and write methods both perform a similar task, that is, copying data from and to application code. Therefore, their prototypes are pretty similar, and it's worth introducing them at the same time:
ssize_t read(struct file *filp, char __user *buff,
    size_t count, loff_t *offp);
ssize_t write(struct file *filp, const char __user *buff,
    size_t count, loff_t *offp);
For both methods, filp is the file pointer and count is the size of the requested data transfer. The buff argument points to the user buffer holding the data to be written or the empty buffer where the newly read data should be placed. Finally, offp is a pointer to a "long offset type" object that indicates the file position the user is accessing. The return value is a "signed size type"; its use is discussed later.
Let us repeat that the buff argument to the read and write methods is a user-space pointer. Therefore, it cannot be directly dereferenced by kernel code. There are a few reasons for this restriction:
•Depending on which architecture your driver is running on, and how the kernel was configured, the user-space pointer may not be valid while running in kernel mode at all. There may be no mapping for that address, or it could point to some other, random data.
•Even if the pointer does mean the same thing in kernel space, user-space memory is paged, and the memory in question might not be resident in RAM when the system call is made. Attempting to reference the user-space memory directly could generate a page fault, which is something that kernel code is not allowed to do. The result would be an "oops," which would kill the process that made the system call.
•The pointer in question has been supplied by a user program, which could be buggy or malicious. If your driver ever blindly dereferences a user-supplied pointer, it provides an open doorway allowing a user-space program to access or overwrite memory anywhere in the system. If you do not wish to be responsible for compromising the security of your users' systems, you cannot ever dereference a user-space pointer directly.
Obviously, your driver must be able to access the user-space buffer in order to get its job done. To be safe, however, this access must always be performed by special, kernel-supplied functions. We introduce some of those functions (which are defined in <asm/uaccess.h>) here, and the rest in Section 6.1.4; they use some special, architecture-dependent magic to ensure that data transfers between kernel and user space happen in a safe and correct way.
The code for read and write in scull needs to copy a whole segment of data to or from the user address space. This capability is offered by the following kernel functions, which copy an arbitrary array of bytes and sit at the heart of most read and write implementations:
unsigned long copy_to_user(void __user *to,
                           const void *from,
                           unsigned long count);
unsigned long copy_from_user(void *to,
                             const void __user *from,
                             unsigned long count);
Although these functions behave like normal memcpy functions, a little extra care must be used when accessing user space from kernel code. The user pages being addressed might not be currently present in memory, and the virtual memory subsystem can put the process to sleep while the page is being transferred into place. This happens, for example, when the page must be retrieved from swap space. The net result for the driver writer is that any function that accesses user space must be reentrant, must be able to execute concurrently with other driver functions, and, in particular, must be in a position where it can legally sleep. We return to this subject in Chapter 5.
The role of the two functions is not limited to copying data to and from user space: they also check whether the user-space pointer is valid. If the pointer is invalid, no copy is performed; if an invalid address is encountered during the copy, on the other hand, only part of the data is copied. In both cases, the return value is the amount of memory still to be copied. The scull code looks for this error return, and returns -EFAULT to the user if it's not 0.
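The "bytes still to be copied" convention can be tried out away from the kernel. What follows is only a user-space sketch: sim_copy_to_user, its valid_len parameter, and sim_read_step are hypothetical stand-ins invented for illustration. The real copy_to_user detects bad addresses through the memory-management hardware, not through an extra argument.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical user-space stand-in for copy_to_user. valid_len models how
 * many bytes of the destination are actually accessible; the kernel learns
 * this from faults, not from a parameter.
 * Returns the number of bytes that could NOT be copied (0 on full success).
 */
unsigned long sim_copy_to_user(void *to, unsigned long valid_len,
                               const void *from, unsigned long count)
{
    unsigned long n = count < valid_len ? count : valid_len;

    memcpy(to, from, n);
    return count - n;
}

/* The pattern scull follows: any nonzero remainder becomes -EFAULT. */
ssize_t sim_read_step(char *ubuf, unsigned long valid_len,
                      const char *kdata, size_t count)
{
    assert(ubuf != NULL && kdata != NULL);
    if (sim_copy_to_user(ubuf, valid_len, kdata, count))
        return -EFAULT;
    return (ssize_t)count;
}
```

Note that the caller does not try to salvage a partial copy; like scull, it treats any remainder as a failed transfer.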
The topic of user-space access and invalid user-space pointers is somewhat advanced and is discussed in Chapter 6. However, it's worth noting that if you don't need to check the user-space pointer you can invoke __copy_to_user and __copy_from_user instead. This is useful, for example, if you know you already checked the argument. Be careful, however; if, in fact, you do not check a user-space pointer that you pass to these functions, then you can create kernel crashes and/or security holes.
As far as the actual device methods are concerned, the task of the read method is to copy data from the device to user space (using copy_to_user), while the write method must copy data from user space to the device (using copy_from_user). Each read or write system call requests transfer of a specific number of bytes, but the driver is free to transfer less data; the exact rules are slightly different for reading and writing and are described later in this chapter.
Whatever the amount of data the methods transfer, they should generally update the file position at *offp to represent the current file position after successful completion of the system call. The kernel then propagates the file position change back into the file structure when appropriate. The pread and pwrite system calls have different semantics, however; they operate from a given file offset and do not change the file position as seen by any other system calls. These calls pass in a pointer to the user-supplied position, and discard the changes that your driver makes.
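The difference between read and pread is visible from user space. The sketch below uses an ordinary temporary file, not a scull device; position_after_read is a hypothetical helper written for this illustration.

```c
#define _XOPEN_SOURCE 700   /* for pread and fileno under strict compilers */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Writes 10 bytes to a temporary file, rewinds, then reads 4 bytes either
 * with read (use_pread == 0) or with pread at offset 6 (use_pread == 1).
 * Returns the file position afterwards, or -1 on any failure.
 */
long position_after_read(int use_pread)
{
    char buf[4];
    FILE *tmp;
    long pos = -1;
    int fd;
    ssize_t n;

    assert(use_pread == 0 || use_pread == 1);
    tmp = tmpfile();
    if (tmp == NULL)
        return -1;
    fd = fileno(tmp);
    if (write(fd, "0123456789", 10) == 10 && lseek(fd, 0, SEEK_SET) == 0) {
        n = use_pread ? pread(fd, buf, sizeof(buf), 6)
                      : read(fd, buf, sizeof(buf));
        if (n == (ssize_t)sizeof(buf))
            pos = (long)lseek(fd, 0, SEEK_CUR);
    }
    fclose(tmp);
    return pos;
}
```

With read, the position advances by the four bytes transferred; with pread, it stays at 0 even though four bytes were read from offset 6.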
Figure 3-2 represents how a typical read implementation uses its arguments.
Figure 3-2. The arguments to read

Both the read and write methods return a negative value if an error occurs. A return value greater than or equal to 0, instead, tells the calling program how many bytes have been successfully transferred. If some data is transferred correctly and then an error happens, the return value must be the count of bytes successfully transferred, and the error does not get reported until the next time the function is called. Implementing this convention requires, of course, that your driver remember that the error has occurred so that it can return the error status in the future.
Although kernel functions return a negative number to signal an error, and the value of the number indicates the kind of error that occurred (as introduced in Chapter 2), programs that run in user space always see -1 as the error return value. They need to access the errno variable to find out what happened. The user-space behavior is dictated by the POSIX standard, but that standard does not make requirements on how the kernel operates internally.
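One way to picture the "report the error on the next call" convention is with a small simulation. Everything here is hypothetical and runs in user space: struct sim_dev, SIM_CAPACITY, and sim_write are invented for illustration, and running out of room stands in for whatever error a real device might hit partway through a transfer.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define SIM_CAPACITY 8   /* hypothetical device size, in bytes */

/* Hypothetical device state; pending_err holds a negative errno or 0. */
struct sim_dev {
    char data[SIM_CAPACITY];
    size_t used;
    int pending_err;
};

/*
 * Accepts as many bytes as fit. If the request is cut short, the bytes
 * already stored are reported now and -ENOSPC is saved for the next call.
 */
ssize_t sim_write(struct sim_dev *dev, const char *buf, size_t count)
{
    size_t room, n;

    assert(dev != NULL && dev->used <= SIM_CAPACITY);

    if (dev->pending_err) {            /* report the deferred error */
        int err = dev->pending_err;

        dev->pending_err = 0;
        return err;
    }
    room = SIM_CAPACITY - dev->used;
    n = count < room ? count : room;
    memcpy(dev->data + dev->used, buf, n);
    dev->used += n;
    if (n < count) {
        if (n == 0)
            return -ENOSPC;            /* nothing transferred: fail now */
        dev->pending_err = -ENOSPC;    /* partial: fail on the next call */
    }
    return (ssize_t)n;
}
```

The caller first sees a short count and only learns why on its following call, which is the behavior the convention asks for.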
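The user-space side of this convention is easy to observe. In the sketch below, errno_from_bad_read is a hypothetical helper; it calls read on a descriptor that cannot be open, a case where the kernel's negative error code reaches the program as -1 plus errno.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Calls read on a descriptor that is not open. Stores the raw return value
 * in *ret and returns the errno value left behind by the failed call.
 */
int errno_from_bad_read(ssize_t *ret)
{
    char byte;

    assert(ret != NULL);
    errno = 0;
    *ret = read(-1, &byte, 1);   /* -1 is never a valid descriptor */
    return errno;
}
```

The return value is -1 and errno is EBADF; the program never sees a negative error code directly.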
3.7.1. The read Method
The return value for read is interpreted by the calling application program:
•If the value equals the count argument passed to the read system call, the requested number of bytes has been transferred. This is the optimal case.
•If the value is positive, but smaller than count, only part of the data has been transferred. This may happen for a number of reasons, depending on the device. Most often, the application program retries the read. For instance, if you read using the fread function, the library function reissues the system call until completion of the requested data transfer.
•If the value is 0, end-of-file was reached (and no data was read).
•A negative value means there was an error. The value specifies what the error was, according to <linux/errno.h>. Typical values returned on error include -EINTR (interrupted system call) or -EFAULT (bad address).
What is missing from the preceding list is the case of "there is no data, but it may arrive later." In this case, the read system call should block. We'll deal with blocking input in Chapter 6.
The scull code takes advantage of these rules. In particular, it takes advantage of the partial-read rule. Each invocation of scull_read deals only with a single data quantum, without implementing a loop to gather all the data; this makes the code shorter and easier to read. If the reading program really wants more data, it reiterates the call. If the standard I/O library (i.e., fread) is used to read the device, the application won't even notice the quantization of the data transfer.
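An application that calls read directly can apply these rules with a short loop. The following user-space sketch is one possible way to do it; read_full is a hypothetical helper name, not a library function.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Reads until count bytes have arrived, end-of-file is reached, or an error
 * other than EINTR occurs. Returns the number of bytes read (possibly fewer
 * than count), or -1 with errno set if an error occurred before any data
 * was read.
 */
ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL || count == 0);
    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);

        if (n > 0)
            done += (size_t)n;        /* partial read: ask for the rest */
        else if (n == 0)
            break;                    /* end-of-file */
        else if (errno != EINTR)
            return done ? (ssize_t)done : -1;
    }
    return (ssize_t)done;
}
```

Each branch corresponds to one of the rules above: a positive value advances the buffer, 0 ends the loop, and a negative result is an error unless the call was merely interrupted.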
If the current read position is greater than the device size, the read method of scull returns 0 to signal that there's no data available (in other words, we're at end-of-file). This situation can happen if process A is reading the device while process B opens it for writing, thus truncating the device to a length of 0. Process A suddenly finds itself past end-of-file, and the next read call returns 0.
Here is the code for read (ignore the calls to down_interruptible and up for now; we will get to them in the next chapter):
ssize_t scull_read(struct file *filp, char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_dev *dev = filp->private_data;
    struct scull_qset *dptr; /* the first listitem */
    int quantum = dev->quantum, qset = dev->qset;
    int itemsize = quantum * qset; /* how many bytes in the listitem */
    int item, s_pos, q_pos, rest;
    ssize_t retval = 0;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    if (*f_pos >= dev->size)
        goto out;
    if (*f_pos + count > dev->size)
        count = dev->size - *f_pos;

    /* find listitem, qset index, and offset in the quantum */
    item = (long)*f_pos / itemsize;
    rest = (long)*f_pos % itemsize;
    s_pos = rest / quantum; q_pos = rest % quantum;

    /* follow the list up to the right position (defined elsewhere) */
    dptr = scull_follow(dev, item);

    if (dptr == NULL || !dptr->data || !dptr->data[s_pos])
        goto out; /* don't fill holes */

    /* read only up to the end of this quantum */
    if (count > quantum - q_pos)
        count = quantum - q_pos;

    if (copy_to_user(buf, dptr->data[s_pos] + q_pos, count)) {
        retval = -EFAULT;
        goto out;
    }
    *f_pos += count;
    retval = count;

  out:
    up(&dev->sem);
    return retval;
}
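The position arithmetic in scull_read can be checked in isolation. The sketch below lifts it into user-space helpers; struct scull_loc, scull_locate, and scull_clamp_count are hypothetical names introduced only for this illustration, and the quantum and qset values are passed in as plain integers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: where a file position lands inside scull. */
struct scull_loc {
    int item;    /* which list item (quantum set) */
    int s_pos;   /* index of the quantum within that set */
    int q_pos;   /* byte offset within that quantum */
};

/* Same arithmetic as scull_read, lifted into a standalone function. */
struct scull_loc scull_locate(long f_pos, int quantum, int qset)
{
    struct scull_loc loc;
    int itemsize, rest;

    assert(f_pos >= 0 && quantum > 0 && qset > 0);
    itemsize = quantum * qset;
    loc.item = (int)(f_pos / itemsize);
    rest = (int)(f_pos % itemsize);
    loc.s_pos = rest / quantum;
    loc.q_pos = rest % quantum;
    return loc;
}

/* A single call never crosses a quantum boundary, so count is clamped. */
size_t scull_clamp_count(size_t count, int quantum, int q_pos)
{
    size_t room = (size_t)(quantum - q_pos);

    return count > room ? room : count;
}
```

For example, assuming a quantum of 4000 bytes and 1000 quanta per set, file position 4004100 falls in list item 1, quantum 1, at byte 100; a request for 5000 bytes from there is trimmed to the 3900 bytes left in that quantum.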
3.7.2. The write Method
write, like read, can transfer less data than was requested, according to the following rules for the return value:
•If the value equals count, the requested number of bytes has been transferred.
•If the value is positive, but smaller than count, only part of the data has been transferred. The program will most likely retry writing the rest of the data.
•If the value is 0, nothing was written. This result is not an error, and there is no reason to return an error code. Once again, the standard library retries the call to write. We'll examine the exact meaning of this case in Chapter 6, where blocking write is introduced.
•A negative value means an error occurred; as for read, valid error values are those defined in <linux/errno.h>.
Unfortunately, there may still be misbehaving programs that issue an error message and abort when a partial transfer is performed. This happens because some programmers are accustomed to seeing write calls that either fail or succeed completely, which is actually what happens most of the time and should be supported by devices as well. This limitation in the scull implementation could be fixed, but we didn't want to complicate the code more than necessary.
The scull code for write deals with a single quantum at a time, as the read method does:
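A well-behaved program copes with partial writes by looping until everything has been accepted. The following user-space sketch shows one way to do that; write_all is a hypothetical helper name, not a standard function.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Keeps calling write until all count bytes have been accepted.
 * Returns count on success, or -1 with errno set on an error other than
 * EINTR. A return of 0 from write is treated as "retry", matching the
 * rules above; a device that returns 0 forever would make this loop spin.
 */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || count == 0);
    while (done < count) {
        ssize_t n = write(fd, p + done, count - done);

        if (n >= 0)
            done += (size_t)n;        /* partial write: send the rest */
        else if (errno != EINTR)
            return -1;
    }
    return (ssize_t)done;
}
```

Against a device like scull, which accepts at most one quantum per call, such a loop simply issues one write per quantum touched.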
ssize_t scull_write(struct file *filp, const char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_dev *dev = filp->private_data;
    struct scull_qset *dptr;
    int quantum = dev->quantum, qset = dev->qset;
    int itemsize = quantum * qset;
    int item, s_pos, q_pos, rest;
    ssize_t retval = -ENOMEM; /* value used in "goto out" statements */

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    /* find listitem, qset index and offset in the quantum */
    item = (long)*f_pos / itemsize;
    rest = (long)*f_pos % itemsize;
    s_pos = rest / quantum; q_pos = rest % quantum;

    /* follow the list up to the right position */
    dptr = scull_follow(dev, item);
    if (dptr == NULL)
        goto out;
    if (!dptr->data) {
        dptr->data = kmalloc(qset * sizeof(char *), GFP_KERNEL);
        if (!dptr->data)
            goto out;
        memset(dptr->data, 0, qset * sizeof(char *));
    }
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
            goto out;
    }
    /* write only up to the end of this quantum */
    if (count > quantum - q_pos)
        count = quantum - q_pos;

    if (copy_from_user(dptr->data[s_pos] + q_pos, buf, count)) {
        retval = -EFAULT;
        goto out;
    }
    *f_pos += count;
    retval = count;

    /* update the size */
    if (dev->size < *f_pos)
        dev->size = *f_pos;

  out:
    up(&dev->sem);
    return retval;
}
3.7.3. readv and writev
Unix systems have long supported two system calls named readv and writev. These "vector" versions of read and write take an array of structures, each of which contains a pointer to a buffer and a length value. A readv call would then be expected to read the indicated amount into each buffer in turn. writev, instead, would gather together the contents of each buffer and put them out as a single write operation.
If your driver does not supply methods to handle the vector operations, readv and writev are implemented with multiple calls to your read and write methods. In many situations, however, greater efficiency is achieved by implementing readv and writev directly.
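From the application's side, a gathering write looks like the following user-space sketch. It uses the standard writev call and struct iovec from <sys/uio.h>; write_two is a hypothetical wrapper written for this illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends two separate strings with one writev call; returns its result. */
ssize_t write_two(int fd, const char *first, const char *second)
{
    struct iovec iov[2];

    assert(first != NULL && second != NULL);
    iov[0].iov_base = (void *)first;     /* iov_base is not const-qualified */
    iov[0].iov_len = strlen(first);
    iov[1].iov_base = (void *)second;
    iov[1].iov_len = strlen(second);
    return writev(fd, iov, 2);
}
```

The two buffers need not be adjacent in memory; the receiver sees their contents back to back.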
The prototypes for the vector operations are:
ssize_t (*readv) (struct file *filp, const struct iovec *iov,
    unsigned long count, loff_t *ppos);
ssize_t (*writev) (struct file *filp, const struct iovec *iov,
    unsigned long count, loff_t *ppos);
Here, the filp and ppos arguments are the same as for read and write. The iovec structure, defined in <linux/uio.h>, looks like:
struct iovec
{
    void __user *iov_base;
    __kernel_size_t iov_len;
};
Each iovec describes one chunk of data to be transferred; it starts at iov_base (in user space) and is iov_len bytes long. The count parameter tells the method how many iovec structures there are. These structures are created by the application, but the kernel copies them into kernel space before calling the driver.
The simplest implementation of the vectored operations would be a straightforward loop that just passes the address and length out of each iovec to the driver's read or write function. Often, however, efficient and correct behavior requires that the driver do something smarter. For example, a writev on a tape drive should write the contents of all the iovec structures as a single record on the tape.
Many drivers, however, gain no benefit from implementing these methods themselves. Therefore, scull omits them. The kernel emulates them with read and write, and the end result is the same.
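The "straightforward loop" can be sketched in user space as follows. This is not the kernel's emulation code: write_fn, emulate_writev, struct mem_sink, and mem_sink_write are all hypothetical names, and the sink is an in-memory buffer standing in for a driver's write method.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical single-buffer write method, standing in for a driver's. */
typedef ssize_t (*write_fn)(void *ctx, const void *buf, size_t len);

/*
 * Feeds each segment to write_one in order and stops at the first short or
 * failed write. Returns the total bytes written, or the error code if
 * nothing was written at all.
 */
ssize_t emulate_writev(void *ctx, write_fn write_one,
                       const struct iovec *iov, unsigned long count)
{
    ssize_t total = 0;
    unsigned long i;

    assert(write_one != NULL && (iov != NULL || count == 0));
    for (i = 0; i < count; i++) {
        ssize_t n = write_one(ctx, iov[i].iov_base, iov[i].iov_len);

        if (n < 0)
            return total ? total : n;
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;                       /* short write: stop here */
    }
    return total;
}

/* Example sink: appends to a fixed buffer, accepting only what fits. */
struct mem_sink {
    char data[16];
    size_t used;
};

ssize_t mem_sink_write(void *ctx, const void *buf, size_t len)
{
    struct mem_sink *s = ctx;
    size_t room = sizeof(s->data) - s->used;
    size_t n = len < room ? len : room;

    memcpy(s->data + s->used, buf, n);
    s->used += n;
    return (ssize_t)n;
}
```

A loop like this issues one write per segment, which is exactly why a device with record semantics, such as the tape drive mentioned above, would want its own implementation.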