Watchdog timers reboot the system if something goes wrong. When a watchdog works properly, a failover pair continues functioning because the backup system takes over while the primary system is rebooted.
Types of Watchdogs
CPU reset | This is essentially a "power cycling" of the CPU. It's a hardware operation, so the state of the software doesn't matter. It will always succeed. |
reboot | Unlike a CPU reset, a reboot may not succeed, for example if the operating system hangs. |
panic | A reboot that is caused by a kernel exception or by a Non-maskable Interrupt (NMI). This default behavior can be changed so that the system enters the kernel debugger instead of rebooting. |
Non-maskable Interrupt (NMI) | The operating system stops normal operation when it receives this signal. For FortiADC systems, an NMI causes a panic. |
Watchdog Timers
Timer Type | Description |
---|---|
ichlpcib0 | This is a PCI hardware watchdog timer. It is generally available immediately after boot; however, it may be unusable because of hardware limitations. In that case, the kernel prints a message to the boot log and the software watchdog is loaded instead. This watchdog has a single timer and causes a CPU reset when the watchdog has not been tickled for <timer> seconds. |
ipmi0 | This is an Intelligent Platform Management Interface (IPMI) hardware watchdog timer. It is not available until several seconds after the boot process is complete. There are two timers for this watchdog: an NMI timer and a reset timer. An NMI is generated when the watchdog has not been tickled for <NMI timer> seconds. A CPU reset is generated when the watchdog has not been tickled for <reset timer> seconds. The idea is that you can configure your system to panic (using an NMI watchdog) and then several seconds later reboot (using a CPU reset). The following rules apply: |
User Options
There are three hidden eqcli options to control the behavior:
NMI timer: Sets the NMI timer of the IPMI (ipmi0) watchdog timer. The default value is 0.
require_hw_wd: Requires a hardware watchdog before the system begins processing traffic. This covers both the PCI (ichlpcib0) and IPMI (ipmi0) watchdog timers, although the PCI watchdog should always be loaded on the first try if it is available and usable in the system. The default value is 0.
debugger_on_panic: Sets the ddb.onpanic sysctl. The default is 0. If set to 1, the system enters the debugger on panic or on NMI. Therefore, if this option is set and the IPMI watchdog NMI timer is active, the system enters the debugger when the watchdog expires instead of panicking.
Note - The "reset" timer is controlled by the already-existing "hidden watchdog <seconds>" command. |
When the system boots, a message (or two messages) will be output to the O/S log:
Jan 27 19:00:08 FADC600E-PROTO root: Starting watchdog timer swwdog0 with interval 30. ...
Jan 27 19:00:47 FADC600E-PROTO root: Starting watchdog timer ipmi0 with interval 30.
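To check which watchdog a unit actually started, the boot messages can be scanned for the timer name and interval. The following is a minimal sketch in Python, assuming the log lines keep the exact wording of the two samples above; the function name `started_watchdogs` and the regular expression are illustrative and not part of FortiADC.

```python
import re

# Pattern inferred from the two sample log lines above; other releases may
# word the message differently (assumption).
WATCHDOG_START = re.compile(
    r"Starting watchdog timer (?P<name>\S+) with interval (?P<interval>\d+)\."
)


def started_watchdogs(log_lines):
    """Return (timer name, interval in seconds) for each watchdog start message."""
    found = []
    for line in log_lines:
        match = WATCHDOG_START.search(line)
        if match:
            found.append((match.group("name"), int(match.group("interval"))))
    return found


sample = [
    "Jan 27 19:00:08 FADC600E-PROTO root: Starting watchdog timer swwdog0 with interval 30. ...",
    "Jan 27 19:00:47 FADC600E-PROTO root: Starting watchdog timer ipmi0 with interval 30.",
]

print(started_watchdogs(sample))  # [('swwdog0', 30), ('ipmi0', 30)]
```

In this sample the software watchdog (swwdog0) starts first and the IPMI watchdog (ipmi0) follows about 39 seconds later, which matches the description of ipmi0 not being available until after boot completes.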
All of these options are stored in the configuration file and are synched between failover peers.
A typical example is as follows:
The default behavior is:
NMI timer = 0
debugger_on_panic = 0
require_hw_wd = 0
reset timer = 30
This means that if FortiADC "locks up", the IPMI watchdog will CPU-reset after 30 seconds. No debugging information will be available.
A customer that experiences such a problem may set up their system as follows:
NMI timer = 30
debugger_on_panic = 1
reset_timer = 29
The system will then drop to the debugger after 30 seconds, and remain there until physically rebooted/power cycled.
Note - The reset_timer cannot be 0 in this configuration, or the watchdog timer will not be armed. However, setting it to less than the NMI timer keeps the system from doing a CPU reset. |
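The settings discussed above can be summarized as a small lookup. This is only a sketch of the cases this document describes, not the actual firmware logic: the function name `lockup_outcome` is hypothetical, and treating a reset timer of 0 as "not armed" in every configuration is an assumption generalized from the note. Combinations the text does not cover are reported as such rather than guessed.

```python
def lockup_outcome(nmi_timer, reset_timer, debugger_on_panic):
    """Describe what this document says happens after a lockup.

    Only the cases described in the text are modeled (assumption: the
    IPMI watchdog is the active watchdog).
    """
    if reset_timer == 0:
        # Per the note: with a reset timer of 0 the watchdog is not armed.
        return "watchdog not armed"
    if nmi_timer == 0:
        # Default behavior: no NMI, CPU reset once the reset timer expires.
        return f"CPU reset after {reset_timer} seconds"
    if debugger_on_panic == 1 and reset_timer < nmi_timer:
        # Customer example: reset timer below the NMI timer, so no CPU reset.
        return f"enter debugger after {nmi_timer} seconds, no CPU reset"
    if debugger_on_panic == 0 and reset_timer > nmi_timer:
        # Panic on NMI first, then a CPU reset several seconds later.
        return f"panic after {nmi_timer} seconds, CPU reset after {reset_timer} seconds"
    return "not described in this document"


print(lockup_outcome(0, 30, 0))   # CPU reset after 30 seconds
print(lockup_outcome(30, 29, 1))  # enter debugger after 30 seconds, no CPU reset
```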
If FortiADC has lockup issues during boot, it is possible that the system will begin processing traffic and then lock up. If the software watchdog is active at this time, the system will never reboot, which means that the standby failover peer will never take over as the primary unit. Setting the require_hw_wd option on these systems prevents the system from processing traffic until the IPMI watchdog is available. As a result, if the system locks up while the software watchdog is in use, it is not yet processing traffic. If it locks up after it begins processing traffic, the IPMI watchdog is in use, so a CPU reset will work.
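The gating effect of require_hw_wd can be sketched as a simple predicate. This is an illustration under stated assumptions, not the product's implementation: the function name `may_process_traffic` is hypothetical, the timer names come from the table and the sample boot log in this document, and treating either hardware watchdog (PCI or IPMI) as sufficient is an assumption.

```python
# Timer names as they appear in this document.
HARDWARE_WATCHDOGS = {"ichlpcib0", "ipmi0"}
SOFTWARE_WATCHDOG = "swwdog0"


def may_process_traffic(active_watchdog, require_hw_wd):
    """Return True if the system would be allowed to process traffic.

    Assumption: with require_hw_wd set, traffic is held back until a
    hardware watchdog is the active one.
    """
    if not require_hw_wd:
        # Default (require_hw_wd = 0): traffic is never held back.
        return True
    return active_watchdog in HARDWARE_WATCHDOGS


print(may_process_traffic(SOFTWARE_WATCHDOG, require_hw_wd=True))   # False
print(may_process_traffic("ipmi0", require_hw_wd=True))             # True
print(may_process_traffic(SOFTWARE_WATCHDOG, require_hw_wd=False))  # True
```

The trade-off is that a unit with the option set starts handling traffic later, in exchange for never handling traffic while only the software watchdog is protecting it.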