4.3. Debugging by Querying

The previous section described how printk works and how it can be used. What it didn't talk about are its disadvantages.

A massive use of printk can slow down the system noticeably, even if you lower console_loglevel to avoid loading the console device, because syslogd keeps syncing its output files; thus, every line that is printed causes a disk operation. This is the right implementation from syslogd's perspective. It tries to write everything to disk in case the system crashes right after printing the message; however, you don't want to slow down your system just for the sake of debugging messages. This problem can be solved by prefixing the name of your log file as it appears in /etc/syslog.conf with a hyphen.[2] The problem with changing the configuration file is that the modification will likely remain there after you are done debugging, even though during normal system operation you do want messages to be flushed to disk as soon as possible. An alternative to such a permanent change is running a program other than klogd (such as cat /proc/kmsg, as suggested earlier), but this may not provide a suitable environment for normal system operation.

[2] The hyphen, or minus sign, is a "magic" marker to prevent syslogd from flushing the file to disk at every new message, documented in syslog.conf(5), a manpage worth reading.

More often than not, the best way to get relevant information is to query the system when you need the information, instead of continually producing data. In fact, every Unix system provides many tools for obtaining system information: ps, netstat, vmstat, and so on.

A few techniques are available to driver developers for querying the system: creating a file in the /proc filesystem, using the ioctl driver method, and exporting attributes via sysfs. The use of sysfs requires quite some background on the driver model; it is discussed in Chapter 14.

4.3.1. Using the /proc Filesystem

The /proc filesystem is a special, software-created filesystem that is used by the kernel to export information to the world. Each file under /proc is tied to a kernel function that generates the file's "contents" on the fly when the file is read. We have already seen some of these files in action; /proc/modules, for example, always returns a list of the currently loaded modules.

/proc is heavily used in the Linux system. Many utilities on a modern Linux distribution, such as ps, top, and uptime, get their information from /proc. Some device drivers also export information via /proc, and yours can do so as well. The /proc filesystem is dynamic, so your module can add or remove entries at any time.
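From user space, querying /proc is just a matter of reading a file. The following sketch is ordinary user-space C, and the helper name slurp_proc_file is hypothetical. Because /proc files are generated on the fly, stat typically reports a size of 0 for them, so the helper reads until end-of-file instead of trusting the file size.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read a whole /proc file into a malloc'ed,
 * NUL-terminated buffer. Returns NULL on failure; the caller frees. */
static char *slurp_proc_file(const char *path, size_t *lenp)
{
    FILE *f = fopen(path, "r");
    size_t cap = 1024, len = 0;
    char *buf;

    if (!f)
        return NULL;
    buf = malloc(cap);
    while (buf) {
        size_t n = fread(buf + len, 1, cap - len - 1, f);

        len += n;
        if (n == 0)
            break;                   /* end of file or read error */
        if (len + 1 == cap) {        /* buffer full: grow before next read */
            char *bigger = realloc(buf, cap * 2);

            if (!bigger) {
                free(buf);
                buf = NULL;
                break;
            }
            buf = bigger;
            cap *= 2;
        }
    }
    if (buf && ferror(f)) {          /* do not return partial data on error */
        free(buf);
        buf = NULL;
    }
    fclose(f);
    if (buf) {
        buf[len] = '\0';
        if (lenp)
            *lenp = len;
    }
    return buf;
}
```

For example, slurp_proc_file("/proc/modules", &len) returns the same list of loaded modules that lsmod prints.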

Fully featured /proc entries can be complicated beasts; among other things, they can be written to as well as read from. Most of the time, however, /proc entries are read-only files. This section concerns itself with the simple read-only case. Those who are interested in implementing something more complicated can look here for the basics; the kernel source may then be consulted for the full picture.

Before we continue, however, we should mention that adding files under /proc is discouraged. The /proc filesystem is seen by the kernel developers as a bit of an uncontrolled mess that has gone far beyond its original purpose (which was to provide information about the processes running in the system). The recommended way of making information available in new code is via sysfs. As suggested, working with sysfs requires an understanding of the Linux device model, however, and we do not get to that until Chapter 14. Meanwhile, files under /proc are slightly easier to create, and they are entirely suitable for debugging purposes, so we cover them here.

4.3.1.1 Implementing files in /proc

All modules that work with /proc should include <linux/proc_fs.h> to define the proper functions.

To create a read-only /proc file, your driver must implement a function to produce the data when the file is read. When some process reads the file (using the read system call), the request reaches your module by means of this function. We'll look at this function first and get to the registration interface later in this section.

When a process reads from your /proc file, the kernel allocates a page of memory (i.e., PAGE_SIZE bytes) where the driver can write data to be returned to user space. That buffer is passed to your function, which is a method called read_proc:

int (*read_proc)(char *page, char **start, off_t offset, int count,
                 int *eof, void *data);

The page pointer is the buffer where you'll write your data; start is used by the function to say where the interesting data has been written in page (more on this later); offset and count have the same meaning as for the read method. The eof argument points to an integer that must be set by the driver to signal that it has no more data to return, while data is a driver-specific data pointer you can use for internal bookkeeping.

This function should return the number of bytes of data actually placed in the page buffer, just like the read method does for other files. Other output values are *eof and *start. eof is a simple flag, but the use of the start value is somewhat more complicated; its purpose is to help with the implementation of large (greater than one page) /proc files.

The start parameter has a somewhat unconventional use. Its purpose is to indicate where (within page) the data to be returned to the user is found. When your read_proc method is called, *start will be NULL. If you leave it NULL, the kernel assumes that the data has been put into page as if offset were 0; in other words, it assumes a simple-minded version of read_proc, which places the entire contents of the virtual file in page without paying attention to the offset parameter. If, instead, you set *start to a non-NULL value, the kernel assumes that the data pointed to by *start takes offset into account and is ready to be returned directly to the user. In general, simple read_proc methods that return tiny amounts of data just ignore start. More complex methods set *start to page and only place data beginning at the requested offset there.
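The following user-space model (not kernel code) illustrates the offset-aware style. demo_read_proc is a hypothetical read_proc-style function that honors offset and sets *start to page; model_read_all is a simplified stand-in for the kernel side, which advances the file position by the number of bytes returned. The real caller, in fs/proc/generic.c, also handles the NULL-start and record-number conventions, which this sketch leaves out.

```c
#include <string.h>
#include <sys/types.h>   /* off_t */

#define MODEL_PAGE_SIZE 4096   /* stand-in for PAGE_SIZE */

/* Hypothetical contents of the virtual file. */
static const char demo_text[] = "line 0\nline 1\nline 2\nline 3\n";

/* Offset-aware read_proc-style function: copy at most count bytes of
 * the file, starting at offset, to the beginning of page, and point
 * *start at page so the caller returns them as-is. */
static int demo_read_proc(char *page, char **start, off_t offset,
                          int count, int *eof, void *data)
{
    off_t total = (off_t)(sizeof demo_text - 1);
    int len;

    (void)data;                      /* no private data in this model */
    if (offset >= total) {
        *eof = 1;
        return 0;
    }
    len = (int)(total - offset);
    if (len > count)
        len = count;
    memcpy(page, demo_text + offset, (size_t)len);
    *start = page;
    if (offset + len >= total)
        *eof = 1;
    return len;
}

/* Simplified model of the caller: request chunk bytes at a time and
 * advance the position by the number of bytes returned. chunk must
 * not exceed MODEL_PAGE_SIZE. */
static size_t model_read_all(char *out, size_t outsz, int chunk)
{
    char page[MODEL_PAGE_SIZE];
    off_t pos = 0;
    size_t used = 0;

    for (;;) {
        char *start = NULL;
        int eof = 0;
        int n = demo_read_proc(page, &start, pos, chunk, &eof, NULL);

        if (n <= 0 || used + (size_t)n > outsz)
            break;
        memcpy(out + used, start, (size_t)n);
        used += (size_t)n;
        pos += n;
        if (eof)
            break;
    }
    return used;
}
```

Reading in 5-byte chunks still reassembles the whole file, because each call starts exactly where the previous one stopped.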

There has long been another major issue with /proc files, which start is meant to solve as well. Sometimes the ASCII representation of kernel data structures changes between successive calls to read, so the reader process could find inconsistent data from one call to the next. If *start is set to a small integer value, the caller uses it to increment filp->f_pos independently of the amount of data you return, thus making f_pos an internal record number of your read_proc procedure. If, for example, your read_proc function is returning information from a big array of structures, and five of those structures were returned in the first call, *start could be set to 5. The next call provides that same value as the offset; the driver then knows to start returning data from the sixth structure in the array. This is acknowledged as a "hack" by its authors and can be seen in fs/proc/generic.c.
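The record-number convention can be modeled the same way. In this user-space sketch, rec_read_proc (hypothetical) treats offset as an index into an array of records, returns at most three records per call, and stores the number of records consumed in *start; the model caller then adds that count, rather than the byte count, to its position and copies the data from the start of page.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical table of records exported through the virtual file. */
static const int demo_records[] = { 10, 20, 30, 40, 50, 60, 70 };
#define DEMO_NR ((off_t)(sizeof demo_records / sizeof demo_records[0]))
#define RECS_PER_CALL 3   /* pretend only three records fit per call */

/* read_proc-style function that treats offset as a record index and
 * uses the *start hack to report how many records it consumed. */
static int rec_read_proc(char *page, char **start, off_t offset,
                         int count, int *eof, void *data)
{
    int len = 0;
    off_t i;

    (void)data;
    for (i = offset; i < DEMO_NR && i - offset < RECS_PER_CALL; i++) {
        int n = snprintf(page + len, count - len, "%d\n", demo_records[i]);

        if (n < 0 || n >= count - len)
            break;                   /* record would not fit */
        len += n;
    }
    /* The hack: a small integer, not a pointer into page. */
    *start = (char *)(uintptr_t)(i - offset);
    if (i >= DEMO_NR)
        *eof = 1;
    return len;
}

/* Simplified model of the caller: copy from page and advance the
 * position by the record count stored in start. */
static size_t rec_read_all(char *out, size_t outsz)
{
    char page[256];
    off_t pos = 0;                   /* a record index, not a byte offset */
    size_t used = 0;

    for (;;) {
        char *start = NULL;
        int eof = 0;
        int n = rec_read_proc(page, &start, pos, (int)sizeof page, &eof, NULL);

        if (n <= 0 || used + (size_t)n > outsz)
            break;
        memcpy(out + used, page, (size_t)n);
        used += (size_t)n;
        pos += (off_t)(uintptr_t)start;
        if (eof)
            break;
    }
    return used;
}
```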

Note that there is a better way to implement large /proc files; it's called seq_file, and we'll discuss it shortly. First, though, it is time for an example. Here is a simple (if somewhat ugly) read_proc implementation for the scull device:

int scull_read_procmem(char *buf, char **start, off_t offset,
                   int count, int *eof, void *data)
{
    int i, j, len = 0;
    int limit = count - 80; /* Don't print more than this */

    for (i = 0; i < scull_nr_devs && len <= limit; i++) {
        struct scull_dev *d = &scull_devices[i];
        struct scull_qset *qs = d->data;
        if (down_interruptible(&d->sem))
            return -ERESTARTSYS;
        len += sprintf(buf + len, "\nDevice %i: qset %i, q %i, sz %li\n",
                i, d->qset, d->quantum, d->size);
        for (; qs && len <= limit; qs = qs->next) { /* scan the list */
            len += sprintf(buf + len, "  item at %p, qset at %p\n",
                    qs, qs->data);
            if (qs->data && !qs->next) /* dump only the last item */
                for (j = 0; j < d->qset; j++) {
                    if (qs->data[j])
                        len += sprintf(buf + len,
                                "    % 4i: %8p\n",
                                j, qs->data[j]);
                }
        }
        up(&scull_devices[i].sem);
    }
    *eof = 1;
    return len;
}

This is a fairly typical read_proc implementation. It assumes that there will never be a need to generate more than one page of data and so ignores the start and offset values. It is, however, careful not to overrun its buffer, just in case.
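The 80-byte slack protects the page only as long as the text written between two checks of limit stays under 80 bytes. A different approach, sketched here as user-space C so it can be compiled and tested outside the kernel, is to pass the remaining space to every formatting call and clamp the result. append_bounded is a hypothetical helper, similar in spirit to the scnprintf function found in newer kernels.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: format into buf + len without writing past
 * size bytes. Returns the number of characters actually stored, which
 * is never more than the space left, so len can be advanced safely. */
static int append_bounded(char *buf, int size, int len, const char *fmt, ...)
{
    va_list ap;
    int room = size - len;   /* bytes left, including the final NUL */
    int n;

    if (len < 0 || room <= 1)
        return 0;
    va_start(ap, fmt);
    n = vsnprintf(buf + len, (size_t)room, fmt, ap);
    va_end(ap);
    if (n < 0)
        return 0;
    if (n >= room)
        n = room - 1;        /* output was truncated to fit */
    return n;
}
```

With such a helper, each len += sprintf(buf + len, ...) in scull_read_procmem would become len += append_bounded(buf, count, len, ...), and the limit checks would only serve to stop early, not to prevent overruns.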

4.3.1.2 An older interface

If you read through the kernel source, you may encounter code implementing /proc files with an older interface:

int (*get_info)(char *page, char **start, off_t offset, int count);

All of the arguments have the same meaning as they do for read_proc, but the eof and data arguments are missing. This interface is still supported, but it could go away in the future; new code should use the read_proc interface instead.

4.3.1.3 Creating your /proc file

Once you have a read_proc function defined, you need to connect it to an entry in the /proc hierarchy. This is done with a call to create_proc_read_entry:

struct proc_dir_entry *create_proc_read_entry(const char *name,
                              mode_t mode, struct proc_dir_entry *base,
                              read_proc_t *read_proc, void *data);

Here, name is the name of the file to create, mode is the protection mask for the file (it can be passed as 0 for a system-wide default), base indicates the directory in which the file should be created (if base is NULL, the file is created in the /proc root), read_proc is the read_proc function that implements the file, and data is ignored by the kernel (but passed to read_proc). Here is the call used by scull to make its /proc function available as /proc/scullmem:

create_proc_read_entry("scullmem", 0 /* default mode */,
NULL /* parent dir */, scull_read_procmem,
NULL /* client data */);

Here, we create a file called scullmem directly under /proc, with the default, world-readable protections.

The directory entry pointer can be used to create entire directory hierarchies under /proc. Note, however, that an entry may be more easily placed in a subdirectory of /proc simply by giving the directory name as part of the name of the entry, as long as the directory itself already exists. For example, an (often ignored) convention says that /proc entries associated with device drivers should go in the subdirectory driver/; scull could place its entry there simply by giving its name as driver/scullmem.

Entries in /proc, of course, should be removed when the module is unloaded. remove_proc_entry is the function that undoes what create_proc_read_entry already did:

remove_proc_entry("scullmem", NULL /* parent dir */);

Failure to remove entries can result in calls at unwanted times, or, if your module has been unloaded, kernel crashes.

When using /proc files as shown, you must remember a few nuisances of the implementation; it is no surprise that its use is discouraged nowadays.

The most important problem is with removal of /proc entries. Such removal may well happen while the file is in use, as there is no owner associated with /proc entries, so using them doesn't act on the module's reference count. This problem is easily triggered by running sleep 100 < /proc/myfile just before removing the module, for example.

Another issue is about registering two entries with the same name. The kernel trusts the driver and doesn't check if the name is already registered, so if you are not careful you might end up with two or more entries with the same name. This is a problem known to happen in classrooms, and such entries are indistinguishable both when you access them and when you call remove_proc_entry.

4.3.1.4 The seq_file interface

As we noted above, the implementation of large files under /proc is a little awkward. Over time, /proc methods have become notorious for buggy implementations when the amount of output grows large. As a way of cleaning up the /proc code and making life easier for kernel programmers, the seq_file interface was added. This interface provides a simple set of functions for the implementation of large kernel virtual files.

The seq_file interface assumes that you are creating a virtual file that steps through a sequence of items that must be returned to user space. To use seq_file, you must create a simple "iterator" object that can establish a position within the sequence, step forward, and output one item in the sequence. It may sound complicated, but, in fact, the process is quite simple. We'll step through the creation of a /proc file in the scull driver to show how it is done.

The first step, inevitably, is the inclusion of <linux/seq_file.h>. Then you must create four iterator methods, called start, next, stop, and show.

The start method is always called first. The prototype for this function is:

void *start(struct seq_file *sfile, loff_t *pos);

The sfile argument can almost always be ignored. pos is an integer position indicating where the reading should start. The interpretation of the position is entirely up to the implementation; it need not be a byte position in the resulting file. Since seq_file implementations typically step through a sequence of interesting items, the position is often interpreted as a cursor pointing to the next item in the sequence. The scull driver interprets each device as one item in the sequence, so the incoming pos is simply an index into the scull_devices array. Thus, the start method used in scull is:

static void *scull_seq_start(struct seq_file *s, loff_t *pos)
{
    if (*pos >= scull_nr_devs)
        return NULL;   /* No more to read */
    return scull_devices + *pos;
}

The return value, if non-NULL, is a private value that can be used by the iterator implementation.

The next function should move the iterator to the next position, returning NULL if there is nothing left in the sequence. This method's prototype is:

void *next(struct seq_file *sfile, void *v, loff_t *pos);

Here, v is the iterator as returned from the previous call to start or next, and pos is the current position in the file. next should increment the value pointed to by pos; depending on how your iterator works, you might (though probably won't) want to increment pos by more than one. Here's what scull does:

static void *scull_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
    (*pos)++;
    if (*pos >= scull_nr_devs)
        return NULL;
    return scull_devices + *pos;
}

When the kernel is done with the iterator, it calls stop to clean up:

void stop(struct seq_file *sfile, void *v);

The scull implementation has no cleanup work to do, so its stop method is empty.

It is worth noting that the seq_file code, by design, does not sleep or perform other nonatomic tasks between the calls to start and stop. You are also guaranteed to see one stop call sometime shortly after a call to start. Therefore, it is safe for your start method to acquire semaphores or spinlocks. As long as your other seq_file methods are atomic, the whole sequence of calls is atomic. (If this paragraph does not make sense to you, come back to it after you've read the next chapter.)

In between these calls, the kernel calls the show method to actually output something interesting to user space. This method's prototype is:

int show(struct seq_file *sfile, void *v);

This method should create output for the item in the sequence indicated by the iterator v. It should not use printk, however; instead, there is a special set of functions for seq_file output:

int seq_printf(struct seq_file *sfile, const char *fmt, ...);

This is the printf equivalent for seq_file implementations; it takes the usual format string and additional value arguments. You must also pass it the seq_file structure given to the show function, however. If seq_printf returns a nonzero value, it means that the buffer has filled, and output is being discarded. Most implementations ignore the return value, however.
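To see how the four iterator methods cooperate, here is a user-space model of the calling sequence. Everything in it is hypothetical: struct model_seq stands in for struct seq_file, model_printf mimics the seq_printf behavior just described (a nonzero return once the buffer is full), and demo_devices plays the role of scull_devices. The real seq_read is more elaborate; this sketch only shows the order of calls.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct seq_file: a fixed output buffer. */
struct model_seq {
    char buf[256];
    size_t count;
    int overflow;
};

/* Like seq_printf: append formatted text, or flag overflow, discard
 * the partial output, and return nonzero if it does not fit. */
static int model_printf(struct model_seq *m, const char *fmt, ...)
{
    size_t room = sizeof m->buf - m->count;
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(m->buf + m->count, room, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= room) {
        m->buf[m->count] = '\0';
        m->overflow = 1;
        return -1;
    }
    m->count += (size_t)n;
    return 0;
}

struct model_seq_ops {
    void *(*start)(struct model_seq *m, long long *pos);
    void *(*next)(struct model_seq *m, void *v, long long *pos);
    void (*stop)(struct model_seq *m, void *v);
    int (*show)(struct model_seq *m, void *v);
};

/* Hypothetical device table standing in for scull_devices[]. */
struct demo_dev { int qset; int quantum; };
static struct demo_dev demo_devices[] = { {1000, 4000}, {1000, 4000}, {500, 2000} };
#define DEMO_NR_DEVS 3

static void *demo_start(struct model_seq *m, long long *pos)
{
    (void)m;
    if (*pos >= DEMO_NR_DEVS)
        return NULL;                 /* no more to read */
    return demo_devices + *pos;
}

static void *demo_next(struct model_seq *m, void *v, long long *pos)
{
    (void)m; (void)v;
    (*pos)++;
    if (*pos >= DEMO_NR_DEVS)
        return NULL;
    return demo_devices + *pos;
}

static void demo_stop(struct model_seq *m, void *v)
{
    (void)m; (void)v;                /* nothing to clean up */
}

static int demo_show(struct model_seq *m, void *v)
{
    struct demo_dev *d = v;

    return model_printf(m, "Device %d: qset %d, q %d\n",
                        (int)(d - demo_devices), d->qset, d->quantum);
}

static const struct model_seq_ops demo_ops = {
    .start = demo_start, .next = demo_next,
    .stop = demo_stop, .show = demo_show,
};

/* Simplified driver loop: start, then show/next until NULL, then stop.
 * The real seq_read also retries with a bigger buffer on overflow and
 * restarts the iterator for each read call; this model does not. */
static void model_seq_run(struct model_seq *m, const struct model_seq_ops *ops)
{
    long long pos = 0;
    void *v = ops->start(m, &pos);

    while (v && !m->overflow) {
        ops->show(m, v);
        v = ops->next(m, v, &pos);
    }
    ops->stop(m, v);
}
```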

int seq_putc(struct seq_file *sfile, char c);

int seq_puts(struct seq_file *sfile, const char *s);

These are the equivalents of the user-space putc and puts functions.

int seq_escape(struct seq_file *m, const char *s, const char *esc);

This function is equivalent to seq_puts with the exception that any character in s that is also found in esc is printed in octal format. A common value for esc is " \t\n\\", which keeps embedded white space from messing up the output and possibly confusing shell scripts.
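The escaping rule is easy to model in user space. model_escape is a hypothetical helper, not the kernel function: it copies s and replaces each character that appears in esc with a backslash followed by three octal digits, which is the form seq_escape emits (a space becomes \040, a tab \011).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of seq_escape's escaping rule: copy s
 * into out (outsz bytes, at least 1), writing every character that
 * appears in esc as a backslash plus three octal digits. Returns the
 * length of the result; output is truncated if out is too small. */
static size_t model_escape(char *out, size_t outsz, const char *s,
                           const char *esc)
{
    size_t used = 0;

    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        size_t need = strchr(esc, c) ? 4 : 1;

        if (used + need >= outsz)
            break;                   /* keep room for the final NUL */
        if (need == 4) {
            out[used++] = '\\';
            out[used++] = (char)('0' + ((c >> 6) & 07));
            out[used++] = (char)('0' + ((c >> 3) & 07));
            out[used++] = (char)('0' + (c & 07));
        } else {
            out[used++] = (char)c;
        }
    }
    out[used] = '\0';
    return used;
}
```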

int seq_path(struct seq_file *sfile, struct vfsmount *m, struct dentry *dentry,
             char *esc);

This function can be used for outputting the file name associated with a given directory entry. It is unlikely to be useful in device drivers; we have included it here for completeness.

Getting back to our example, the show method used in scull is:

static int scull_seq_show(struct seq_file *s, void *v)
{
    struct scull_dev *dev = (struct scull_dev *) v;
    struct scull_qset *d;
    int i;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    seq_printf(s, "\nDevice %i: qset %i, q %i, sz %li\n",
            (int) (dev - scull_devices), dev->qset,
            dev->quantum, dev->size);
    for (d = dev->data; d; d = d->next) { /* scan the list */
        seq_printf(s, "  item at %p, qset at %p\n", d, d->data);
        if (d->data && !d->next) /* dump only the last item */
            for (i = 0; i < dev->qset; i++) {
                if (d->data[i])
                    seq_printf(s, "    % 4i: %8p\n",
                            i, d->data[i]);
            }
    }
    up(&dev->sem);
    return 0;
}

Here, we finally interpret our "iterator" value, which is simply a pointer to a scull_dev structure.

Now that it has a full set of iterator operations, scull must package them up and connect them to a file in /proc. The first step is done by filling in a seq_operations structure:

static struct seq_operations scull_seq_ops = {
    .start = scull_seq_start,
    .next  = scull_seq_next,
    .stop  = scull_seq_stop,
    .show  = scull_seq_show
};

With that structure in place, we must create a file implementation that the kernel understands. We do not use the read_proc method described previously; when using seq_file, it is best to connect in to /proc at a slightly lower level. That means creating a file_operations structure (yes, the same structure used for char drivers) implementing all of the operations needed by the kernel to handle reads and seeks on the file. Fortunately, this task is straightforward. The first step is to create an open method that connects the file to the seq_file operations:

static int scull_proc_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &scull_seq_ops);
}

The call to seq_open connects the file structure with our sequence operations defined above. As it turns out, open is the only file operation we must implement ourselves, so we can now set up our file_operations structure:

static struct file_operations scull_proc_ops = {
    .owner   = THIS_MODULE,
    .open    = scull_proc_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = seq_release
};

Here we specify our own open method, but use the canned methods seq_read, seq_lseek, and seq_release for everything else.

The final step is to create the actual file in /proc:

struct proc_dir_entry *entry;

entry = create_proc_entry("scullseq", 0, NULL);
if (entry)
    entry->proc_fops = &scull_proc_ops;

Rather than using create_proc_read_entry, we call the lower-level create_proc_entry, which has this prototype:

struct proc_dir_entry *create_proc_entry(const char *name,
mode_t mode,
struct proc_dir_entry *parent);

The arguments are the same as their equivalents in create_proc_read_entry: the name of the file, its protections, and the parent directory.

With the above code, scull has a new /proc entry that looks much like the previous one. It is superior, however, because it works regardless of how large its output becomes, it handles seeks properly, and it is generally easier to read and maintain. We recommend the use of seq_file for the implementation of files that contain more than a very small number of lines of output.

4.3.2. The ioctl Method

ioctl, which we show you how to use in Chapter 6, is a system call that acts on a file descriptor; it receives a number that identifies a command to be performed and (optionally) another argument, usually a pointer. As an alternative to using the /proc filesystem, you can implement a few ioctl commands tailored for debugging. These commands can copy relevant data structures from the driver to user space where you can examine them.

Using ioctl this way to get information is somewhat more difficult than using /proc, because you need another program to issue the ioctl and display the results. This program must be written, compiled, and kept in sync with the module you're testing. On the other hand, the driver-side code can be easier than what is needed to implement a /proc file.
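The user-side program can still be small. In this sketch, struct scull_debug_info, the magic character 'k', the command number 99, and the helper query_scull_debug are all hypothetical; the driver would need a matching case in its ioctl method that fills the structure and copies it to user space. The _IOR macro encodes the direction, type, number, and argument size into the command value, as Chapter 6 explains.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical snapshot of driver state for debugging. */
struct scull_debug_info {
    int quantum;
    int qset;
    long size;
};

/* Hypothetical command: 'k' as magic, 99 as command number. */
#define SCULL_DBG_MAGIC 'k'
#define SCULL_IOCGDEBUG _IOR(SCULL_DBG_MAGIC, 99, struct scull_debug_info)

/* Open the device and ask the driver to copy its state into *info.
 * Returns 0 on success, or -1 with errno set on failure. */
static int query_scull_debug(const char *path, struct scull_debug_info *info)
{
    int fd = open(path, O_RDONLY);
    int ret, saved_errno;

    if (fd < 0)
        return -1;
    ret = ioctl(fd, SCULL_IOCGDEBUG, info);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;             /* report the ioctl error, not close's */
    return ret < 0 ? -1 : 0;
}
```

A debugging tool would call query_scull_debug("/dev/scull0", &info) and print the fields; because the data arrives in binary form, no text parsing is needed.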

There are times when ioctl is the best way to get information, because it runs faster than reading /proc. If some work must be performed on the data before it's written to the screen, retrieving the data in binary form is more efficient than reading a text file. In addition, ioctl doesn't require splitting data into fragments smaller than a page.

Another interesting advantage of the ioctl approach is that information-retrieval commands can be left in the driver even when debugging would otherwise be disabled. Unlike a /proc file, which is visible to anyone who looks in the directory (and too many people are likely to wonder "what that strange file is"), undocumented ioctl commands are likely to remain unnoticed. In addition, they will still be there should something weird happen to the driver. The only drawback is that the module will be slightly bigger.