NMI Watchdogs and NMI Panics

nmi_watchdog

The NMI watchdog monitors system interrupts and initiates a reboot if the system appears to have hung. On a normal system, hundreds of device and timer interrupts are received per second. If no interrupts arrive in a 5 second interval, the NMI watchdog assumes that the system has hung and initiates a system reboot.

To understand how the NMI watchdog works, it is first necessary to understand the APIC. The APIC, or Advanced Programmable Interrupt Controller, has been built into all x86 CPUs since the Pentium Pro. This built-in APIC is known as the Local APIC. The APIC is used primarily to issue interrupts to other CPUs in a multi-processor system, but it also has uses in single-processor systems, for example with the NMI watchdog function.

The IO-APIC is another APIC, included on certain motherboards. The IO-APIC collects interrupts from various I/O devices and sends them to the Local APIC built into the processor. It replaces the legacy 8259 Programmable Interrupt Controllers (PIC), which have been in use since the original PC-AT architecture. The IO-APIC is a major improvement in PC architecture, but it is usually only included on higher-end motherboards.

To use the NMI watchdog, APIC support must be enabled in the kernel. For SMP kernels, this is enabled automatically. For uniprocessor (UP) kernels, CONFIG_X86_UP_APIC or CONFIG_X86_UP_IOAPIC must be enabled. (The IO-APIC is more desirable than the Local APIC.)

[Note: certain kernel debugging options, such as Kernel Stack Meter or Kernel Tracer, may implicitly disable the NMI watchdog.]

The NMI watchdog is enabled by adding nmi_watchdog=n to the command line used to boot the kernel, where "n" is either 1 or 2:

    - For all SMP systems, and for UP systems with an IO-APIC, use nmi_watchdog=1.
    - For UP systems without an IO-APIC, use nmi_watchdog=2.

This is not guaranteed to work, however.
If there is doubt, test each setting as shown below.

Here is an example from /etc/grub.conf for systems which use the GRUB boot loader:

    title Test Kernel (2.4.9-10smp)
            root (hd0,0)
            # This is the kernel's command line.
            kernel /vmlinuz-2.4.9-10smp ro root=/dev/hda2 nmi_watchdog=1

Here is an example from /etc/lilo.conf for systems which use the LILO boot loader:

    image=/boot/vmlinuz-2.4.9-10smp
            label=linux
            read-only
            root=/dev/hda2
            append="nmi_watchdog=1"

To determine whether the NMI watchdog was activated, check /proc/interrupts. The NMI line should display a non-zero value. If NMI displays zero, try nmi_watchdog=2. If it still displays zero, the processor is not supported by the NMI watchdog code. When the watchdog is functioning correctly, the output should look similar to the following:

               CPU0
      0:    5623100          XT-PIC  timer
      1:         13          XT-PIC  keyboard
      2:          0          XT-PIC  cascade
      7:          0          XT-PIC  usb-ohci
      8:          1          XT-PIC  rtc
      9:     794332          XT-PIC  aic7xxx, aic7xxx
     10:     569498          XT-PIC  eth0
     12:         24          XT-PIC  PS/2 Mouse
     14:          0          XT-PIC  ide0
    NMI:    5620998
    LOC:    5623358
    ERR:          0
    MIS:          0

unknown_nmi_panic

Kernel 2.6.9 introduced a feature that makes it easier to diagnose system hangs on certain hardware. This feature, called unknown_nmi_panic, uses the computer's NMI (Non-Maskable Interrupt) switch, if it is equipped with one, to force a kernel panic on a hung system. unknown_nmi_panic was also backported to Red Hat Enterprise Linux 3 Update 3.

Because the NMI switch generates an undefined NMI interrupt, this feature cannot be used on systems that also use the NMI watchdog or oprofile, as both of these make use of the undefined NMI interrupt. If unknown_nmi_panic is activated with one of these features present, it will not work.

Note that this is a user-initiated interrupt, and it is most useful for diagnosing a system that is experiencing hangs for unknown reasons.
To enable this feature, set the following system control parameter:

    kernel.unknown_nmi_panic = 1

This can be done either from the command line using the "sysctl -w" command, or by adding the above line to the /etc/sysctl.conf file. Once this is done (and the system rebooted, if not using the command line), a panic can be forced by pushing the system's NMI switch.

Systems that do not have an NMI switch should still use the NMI watchdog feature, which will automatically generate an NMI if the system hangs.
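
The checks described in this document (a non-zero NMI count in /proc/interrupts, and not combining unknown_nmi_panic with the NMI watchdog) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not part of the kernel or of any distribution tooling. The function names (parse_nmi_counts, watchdog_setting, panic_switch_conflict, report) are hypothetical, and it assumes that /proc/interrupts, /proc/cmdline and /proc/sys/kernel/unknown_nmi_panic are readable on the running system. Treating nmi_watchdog=0 as "watchdog not requested" is also an assumption of the sketch.

```python
import re


def parse_nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts text.

    Returns an empty list if the text has no NMI row.
    """
    for line in interrupts_text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if sep and label == "NMI":
            counts = []
            for token in rest.split():
                if not token.isdigit():
                    break  # stop at any trailing description text
                counts.append(int(token))
            return counts
    return []


def watchdog_setting(cmdline_text):
    """Return the value given to nmi_watchdog= on the kernel command line, or None."""
    match = re.search(r"(?:^|\s)nmi_watchdog=(\S+)", cmdline_text)
    return match.group(1) if match else None


def panic_switch_conflict(cmdline_text, unknown_nmi_panic_value):
    """True if unknown_nmi_panic is set while the NMI watchdog was also requested.

    Assumption: nmi_watchdog=0 (or no nmi_watchdog option) means "not requested".
    """
    setting = watchdog_setting(cmdline_text)
    watchdog_requested = setting is not None and setting != "0"
    return watchdog_requested and unknown_nmi_panic_value.strip() == "1"


def _read(path):
    with open(path) as handle:
        return handle.read()


def report():
    """Print a short status summary for the running system (hypothetical helper)."""
    cmdline = _read("/proc/cmdline")
    counts = parse_nmi_counts(_read("/proc/interrupts"))
    print("nmi_watchdog option:", watchdog_setting(cmdline))
    print("NMI counts per CPU:", counts)
    if not any(counts):
        print("NMI count is zero; the watchdog does not appear to be active")
    try:
        panic_value = _read("/proc/sys/kernel/unknown_nmi_panic")
    except OSError:
        print("kernel.unknown_nmi_panic is not available on this kernel")
        return
    if panic_switch_conflict(cmdline, panic_value):
        print("unknown_nmi_panic is set together with nmi_watchdog; it will not work")
```

Calling report() on the system under test prints the nmi_watchdog option it booted with, the NMI counts, and a warning if both features are enabled at once. The parsing functions take plain text, so they can also be run against a saved copy of /proc/interrupts from a machine that has since hung.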