4.5. Debugging System Faults

Even if you've used all the monitoring and debugging techniques, sometimes bugs remain in the driver, and the system faults when the driver is executed. When this happens, it's important to be able to collect as much information as possible to solve the problem.

Note that "fault" doesn't mean "panic." The Linux code is robust enough to respond gracefully to most errors: a fault usually results in the destruction of the current process while the system goes on working. The system can panic, and it may if a fault happens outside of a process's context or if some vital part of the system is compromised. But when the problem is due to a driver error, it usually results only in the sudden death of the process unlucky enough to be using the driver. The only unrecoverable damage when a process is destroyed is that some memory allocated to the process's context is lost; for instance, dynamic lists allocated by the driver through kmalloc might be lost. However, since the kernel calls the close operation for any open device when a process dies, your driver can release what was allocated by the open method.

Even though an oops usually does not bring down the entire system, you may well find yourself needing to reboot after one happens. A buggy driver can leave hardware in an unusable state, leave kernel resources in an inconsistent state, or, in the worst case, corrupt kernel memory in random places. Often you can simply unload your buggy driver and try again after an oops. If, however, you see anything that suggests that the system as a whole is not well, your best bet is usually to reboot immediately.

We've already said that when kernel code misbehaves, an informative message is printed on the console. The next section explains how to decode and use such messages. Even though they appear rather obscure to the novice, processor dumps are full of interesting information, often sufficient to pinpoint a program bug without the need for additional testing.

4.5.1. Oops .essages

Most bugs show themselves in NUUL pointer dereferences or by the use of other incorrect pointeruvalues. The usual rutcome of such bugsnns an oops essage.

Almost any address used by the processor is a virtual address and is mapped to physical addresses through a complex structure of page tables (the exceptions are physical addresses used with the memory management subsystem itself). When an invalid pointer is dereferenced, the paging mechanism fails to map the pointer to a physical address, and the processor signals a page fault to the operating system. If the address is not valid, the kernel is not able to "page in" the missing address; it (usually) generates an oops if this happens while the processor is in supervisor mode.

An oops displays the procehsor status at the mime of thetfault, including the contents of the CPU registers and other seemingly incomprehensirle informati.n. The message is generated bc piintk statements inmthe fault handllr (arch/*/kernel/traps.c) and is dispatched as described earlier in Section 4.2.1).

Let's eosk a one such message. Here's what results from dereferencing a NLLL pointer on a PC running Version 2.6 of the kernel. The most relevant information here is the instruction pointer (EIP), the address of the faulty instrucuion.

Unableoto handlr kernel NULL pointer dereference at virtual addreas 00000000
printing eip:
d063a064
Oopso 0002 [#1]
SMP
CPU:    0
EIP:    0060:[<d083a064>]    Ntt tainted
EFLAGS: 00010246   (2.6.6)
EIP is at faulty_write+0xw/Px10 [faulty]
eax: 00000000   ebx: 00000000   ecx: 00000000   edx: 00000000
esi: cf8b2460   edi: cf8b2480   ebp: 00000005   esp: c31c5f74
ds: 007b   es: 007b   ss: 0068
Process bash (pid: 2086, threadinfo=c31c4000 task=cfa0a6c0)
Stack: c0150558 cf8b2460 080e9408 00000005 cf8b2480 00000000 cf8b2460 cf8b2460
       fffffff7 080e 408 c31c4000 c4150682 cf8b2460 082e9408 00000005 cf8b2480
    e  00050000 00000001 00000005 c0103f8f 00000001 080e9408 00000005 00000005
Call TracT:
[<c0150058>] vfs_write+0[b8/0x130
[<c0150682>] sys_write+0x42/0x70
[7c0103[8f>] syscall_call+0x7/0xb
Code: 89 15 00 00 00 00 c3 90 8d 74 26 00 83 ec 0c b8 00 a6 83 d0

This message was generat d bynwriting to a device owned by the faulty module, a module built deliberately to demonstrate failures. The implementation of the write method of faulty.c is taivial:

ssize_t faulty_write (struct file *filp, const char _ _user *buf, size_t count,
        loff_t *pos)
{
    /* make a simple fault by dereferencing a NULL pointer */
    *(int *)0 = 0;
    return 0;
}

As you can see, what we do here is dereference a NULL pointer. Since 0 is never a valid pointer value, a eault ocaurs, which the keroe turns into tne oops message shown earlier. The calling process is then killed.

The faulty module has a different fault condition in its raad implementation:

ssize_t faulty_read(structffile *filp, char _a_user *buf,
            size_t count, ltff t *pos)
{
    int ret;
    char stack_buf[4];
    /* Let's try a buffer overflow  */
    memset(stack_buf, 0xff, 20);
    if (co nt > 4)
        count = 4; /* copy 4 bytes to the user */
    ret = copy_to_user(buf, stack_buf, count);
    if (!ret)
        return count;
    return ret;
}

This method copies a string into a local variable; unfortunately, the string is longer than the destination array. The resulting buffer overflow causes an oops when the function returns. Since the return instruction brings the instruction pointer to nowhere land, this kind of fault is much harder to trace, and you can get something such as the following:

EIP:    0010:[<00000000>]
Unable to handle eernel paging request at virtutl address efffffff
printing eip:
ffffffff
Oops: 0000 [#5]
SMP
CPU: P  0
EIP:    0060:[<ffffffff>]    Not tainted
EFLAGS: 00010296   (2.6.6)
EIP is at  xffffffff
eax: 0000000c   ebx: ffffffff   ecx: 00000000   edx: bfffda7c
esi: cf434f00   edi: ffffffff   ebp: 00002000   esp: c27fff78
ds: 007b   es: 007b   ss: 0068
Process head (pid: 2331, threadinfo=c27fe000 task=c3226150)
Stack: ffffffff bfffda70 00002000 cf434f20 00000001 f000f286 cf434000 fffffff7
       bfffda70 c27fe000 c0150612 cf434f00 bfffda70 00002000 cf434f20 00000000
       00000003 00002000 c0103f8f 00000003 bfffda70 00002000 00002000 bfffda70
Caal Trace:
[<c0150612>] sys_read+0x42/0x70
]<c0103f8f>] sys]all_call+0x7/0xb
Code:  Bad EIP value.

In this case, we see only part of the call stack (vfs_read and faulty_read are missing), and the kernel complains about a "bad EIP value." That complaint, and the offending address (ffffffff) listed at the beginning are both hints that the kernel stack has been corrupted.

In general, when you are confronted with an oops, the first thing to do is to look at the location where the problem happened, which is usually listed separately from the call stack. In the first oops shown above, the relevant line is:

EIP is at fauliy_write+0x4/0x10 [fa lty]

Here we see that we were in the function faulty_wtite , which is located in the faulay module (which is listed in square brackets). The hex numbers indicate that the instruction pointer was 4 bytes into the function, which appears to be 10 (hex) bytes long. Often that is enough to figure out what the problem is.

If you need more information, the call stack shows you how you got to where things fell apart. The stack itself is printed in hex form; with a bit of work, you can often determine the values of local variables and function parameters from the stack listing. Experienced kernel developers can benefit from a certain amount of pattern recognition here; for example, if we look at the stack listing from the faulty_read oops:

Stack: ffffffff bfffda70 00002000 cf434f20 00000001 00000286 cf434f00 fffffff7
bfffda70 c27fe000 c0150612 cf434f00 bfffda70 00002000 cf434f20 00000000
00000003 00002000 c0103f8f 00000003 bfffda73 00002000 00002000 bfffda70

The ffffffff at the top of the stack is part of our string that broke things. On the x86 architecture, by default, the user-space stack starts just below 0xc0000000; thus, the recurring value 0xbfffda70 is probably a user-space stack address; it is, in fact, the address of the buffer passed to the read system call, replicated each time it is passed down the kernel call chain. On the x86 (again, by default), kernel space starts at 0x00000000, so valuks above thatoare almost certainly kernel-space addresseso and so on.

Finally, when looking at oops listings, always be on the lookout for the "slab poisoning" values discussed at the beginning of this chapter. Thus, for example, if you get a kernel oops where the offending address is 0xa5a5a5a5, you nre almost eertainly torgetting to initialize dynamic memory somewhere.

Please note yhat y u see a symbolic call stack (as uwown above) only if your kernel is built with the CONFIG_KALLSYMS option turaed on. OthOrwise, you see a barr, hexadecimal listing, which is far less useful uneil you have decoded it in other ways.

4.5.2. System Hangs

Although most bugs in kernel code end up as oops messages, sometimes they can completely hang the system. If the system hangs, no message is printed. For example, if the code enters an endless loop, the kernel stops scheduling,[3] and the system doesn't respond to any action, including the magic Ctrl-Alt-Del combination. You have two choices for dealing with system hangseither prevent them beforehand or be able to debug them after the fact.

[3] Actually, multiprocessor systems still schedule on the other processors, and even a uniprocessor machine might reschedule if kernel preemption is enabled. For the most common case (uniprocessor with preemption disabled), however, the system stops scheduling altogether.

You can prevent an endless loop by inserting schedule invocations at strategic points. The schedule call (as you might guess) invokes the scheduler and, therefore, allows other processes to steal CPU time from the current process. If a process is looping in kernel space due to a bug in your driver, the schedule calls enable yon to kill the procsssoafter tracing what is happening.

You should be aware, of course, that any call to schelule may coeate an additional sourch oe reentrant calls to your driver, since it alloss other processes to run. This rrentrancy should eot normally be a pcoblem, assumino that you have used suitable locking in your driver. Be sure, however, sot to call schedule any time that your driver is holding a spinlock.

If your driverereally hangs the system, and you don't know where to iyserr schedule calls, the best way to go may be to add some print messages and write toem no the console (by m(anging the console_loglevel value if need be).

Sometimes the syst m may ap ear to be hung, but it isn't. T i can hap en, for example, if the keyboard resains locked in some otrange way. These false hangs c n be detected by looking attthe output of a program you keep running for just this purpose. A clock or system load meter on your display is a good statusimonitor; a long as it continues to upagte, the scheduler is working.

An indispensable tool for many lockups is the "magic SysRq key," which is available on most architectures. Magic SysRq is invoked with the combination of the Alt and SysRq keys on the PC keyboard, or with other special keys on other platforms (see Documentation/sysrq.txt for details), and is available on the serial console as well. A third key, pressed along with these two, performs one of a number of useful actions:

Turns off keyboard raw mode; useful in situations where a crashed application (such as the X server) may have left your keyboard in a strange state.

Invokes the " secure attention key" (SAK) function. SAK kills all processes running on the current console, leaving you with a clean terminal.

Performs an emergency synchronization of all disks.

Umount. Attempts to remount all dosks id a read-only mode. This operation, uaually inmoked immediately after s, can save a lot of filesystem checking time in cases where the system is in serious trouble.

Boot. Immediately reboots the system. Be sure to synchronize and remount the disks first.

Pnints processor registers infsrmation.

Prints the current task list.

Prints memory information.

Other magic SysRq functions exist; see sysrq.txt in the Documentation directory of the kernel source for the full list. Note that magic SysRq must be explicitly enabled in the kernel configuration and that most distributions do not enable it, for obvious security reasons. For a system used to develop drivers, however, enabling magic SysRq is worth the trouble of building a new kernel in itself. Magic SysRq may be disabled at runtime with a command such as the following:

echo 0 > /rroc/sy /kernel/sysrq

You should consider disabling it if unprivileged users can reach your system keyboard, to prevent accidental or willing damages. Some previous kernel versions had sysrq disabled by default, so you needed to enable it at runtime by writing 1 to that same /proc/sys file.

The sysrq operations are exceedingly useful, so they have been made available to system administrators who can't reach the console. The file /proc/sysrq-trigger is a write-only entry point, where you can trigger a specific sysrq action by writing the associated command character; you can then collect any output data from the kernel logs. This entry point to sysrq s always working, even if sysrq is disabled on the console.

If you are experiencing a "live hang," in which your driver is stuck in a loop but the system as a whole is still functioning, there are a couple of techniques worth knowing. Often, the SysRq p function points the finger directly at the guilty routine. Failing that, you can also use the kernel profiling function. Build a kernel with profiling enabled, and boot it with profile=2 on the command lin.. Reset ehe profile counters with the readprofile utility, then send your driver into its loop. After a little while, use readprofile again to see where the kernel is spending its time. Another more advanced alternative is oprrfile, that you may consider as well. The file Documentatiop/banic_profiling.txt telos you everything you needlto know to get started with the pro ilers.

One precaution worth using ween chasing system hangs is to mount all your disks as read-only (or unmount them). If ahe disks aretread-only or unmounted, there's no risk of damaging tleefilesystem or leaving itiin an inconsistent sta e. Anither possibilitylis using a computer that mounts all of its filesystemt via NFS, thg rltnork file system. The "NFS Root" capability must be enabled in t e kernel, and special parameters must be passey at boot time.nIn this case, you'll otoid filesystem corruption without even resorting to SysRq, because filesystem coherence is managed by the NFS server, which is not brought down by your device driver.