Using Watchdog Timers

Watchdog timers reboot the system if something goes wrong. When the watchdog works properly, a failover pair continues functioning because the backup system takes over while the primary system reboots.

Types of Watchdogs

   
CPU reset: This is essentially a "power cycling" of the CPU. It is a hardware operation, so the state of the software does not matter. It will always succeed.
reboot: Unlike a CPU reset, a reboot may not succeed, for example if the operating system hangs.
panic: A reboot that is caused by a kernel exception or by a non-maskable interrupt (NMI). This default behavior can be changed so that the system enters the kernel debugger instead of rebooting.
Non-maskable Interrupt (NMI): The operating system stops normal operation when it receives this signal. On FortiADC systems, an NMI causes a panic.

Watchdog Timers

Timer Type Description
ichlpcib0

This is a PCI hardware watchdog timer. It is generally available immediately after boot, but it may be unusable because of hardware limitations. In that case, the kernel prints a message to the boot log and the software watchdog is loaded instead. This watchdog has a single timer and causes a CPU reset when it has not been tickled for <timer> seconds.

ipmi0

This is an Intelligent Platform Management Interface (IPMI) hardware watchdog timer. It is not available until several seconds after the boot process is complete. There are two timers for this watchdog: an NMI timer and a reset timer.

An NMI is generated when the watchdog has not been tickled for <NMI timer> seconds.

A CPU reset is generated when the watchdog has not been tickled for <reset timer> seconds. The idea is that you can configure your system to panic (using an NMI watchdog) and then several seconds later reboot (using a CPU reset). The following rules apply:

reset = 0, nmi >= 0: The watchdog is not armed (that is, if the reset timer is 0, the nmi value does not matter).


reset > 0, nmi = 0: The system will reset the CPU <reset> seconds after the timer stops being tickled. (This is the default behavior, reset = 30).


reset > 0, nmi > 0, reset < nmi: The system will generate an NMI <nmi> seconds after the timer stops being tickled. No CPU reset is ever asserted.


reset > 0, nmi > 0, reset > nmi: The system will generate an NMI <nmi> seconds after the timer stops being tickled. It will reset the CPU <reset> seconds after the timer stops being tickled (note: not <reset> seconds after the NMI).

The NMI timer is controlled via the sysctl hw.ipmi0.pretimeout.
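The four rules above can be read as a small decision function. The following Python sketch is illustrative only: the function name and the returned strings are hypothetical and not part of FortiADC. The rules do not say what happens when reset equals nmi, so the sketch rejects that case rather than guess.

```python
def ipmi_watchdog_behavior(reset: int, nmi: int) -> str:
    """Describe what the ipmi0 watchdog does for a timer pair (in seconds).

    Hypothetical helper that mirrors the documented rules only.
    """
    if reset < 0 or nmi < 0:
        raise ValueError("timers must be non-negative")
    if reset == 0:
        # The nmi value does not matter when the reset timer is 0.
        return "not armed"
    if nmi == 0:
        return f"CPU reset after {reset}s"
    if reset < nmi:
        return f"NMI after {nmi}s, no CPU reset"
    if reset > nmi:
        # Both timers count from the last tickle, not from each other.
        return f"NMI after {nmi}s, CPU reset after {reset}s"
    # reset == nmi is not covered by the documented rules.
    raise ValueError("behavior for reset == nmi is not documented")


if __name__ == "__main__":
    for reset, nmi in [(0, 15), (30, 0), (29, 30), (40, 30)]:
        print(f"reset={reset}, nmi={nmi}: {ipmi_watchdog_behavior(reset, nmi)}")
```

For example, the default pair (reset = 30, nmi = 0) falls into the second rule and yields a CPU reset after 30 seconds.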

swwdog0

This is a NetBSD software watchdog. It is always available, but it requires the kernel to be operational. If the kernel is "locked up", the watchdog may not fire. It has a single timer and causes a panic when it has not been tickled for <timer> seconds.

User Options

There are three hidden eqcli options to control the behavior:

  1. debug > configd ipmi_nmi <seconds>

    This sets the IPMI NMI timer. It has no effect on systems that do not have an IPMI (ipmi0) watchdog timer. The default value is 0.
  2. debug > configd require_hw_wd <0 or 1>

    Defers loading the network subsystem until after a hardware watchdog has been armed. This option has no effect if the watchdog has been completely disabled (reset timer is 0), and it takes effect only after the system is rebooted. The user will see "Waiting for system processes..." along with this message in the eq log: 20140127T190024| configd|w|04007413:

    Deferring the system startup until the hardware watchdog timer becomes available applies to both the PCI (ichlpcib0) and IPMI (ipmi0) watchdog timers, although the PCI watchdog should always be loaded on the first try if it is available and usable in the system. The default value is 0.
  3. debug > configd debugger_on_panic <0 or 1>

    Sets the ddb.onpanic sysctl. The default is 0. If set to 1, the system enters the debugger on panic or on NMI. Therefore, if this option is set and the IPMI watchdog NMI timer is active, the system will enter the debugger when the watchdog expires instead of panicking.

Note - The "reset" timer is controlled by the already-existing "hidden watchdog <seconds>" command.

When the system boots, one or two messages are written to the O/S log:

Jan 27 19:00:08 FADC600E-PROTO root: Starting watchdog timer swwdog0 with interval 30. ...

Jan 27 19:00:47 FADC600E-PROTO root: Starting watchdog timer ipmi0 with interval 30.
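To check which watchdog a unit armed at boot, these startup messages can be picked out of a saved copy of the log. The following is a minimal Python sketch, assuming the messages keep the wording shown above. The function name and the pattern are illustrative and are not a FortiADC tool.

```python
import re

# Matches the message text shown in the log excerpt above.
WATCHDOG_START = re.compile(
    r"Starting watchdog timer (?P<name>\w+) with interval (?P<interval>\d+)"
)


def parse_watchdog_start(line: str):
    """Return (timer_name, interval_seconds), or None if the line does not match."""
    match = WATCHDOG_START.search(line)
    if match is None:
        return None
    return match.group("name"), int(match.group("interval"))


if __name__ == "__main__":
    sample = (
        "Jan 27 19:00:47 FADC600E-PROTO root: "
        "Starting watchdog timer ipmi0 with interval 30."
    )
    print(parse_watchdog_start(sample))
```

A log that shows only the swwdog0 message suggests that no hardware watchdog has been armed yet.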

All of these options are stored in the configuration file and are synched between failover peers.

A typical example is as follows:

The default behavior is:

NMI timer = 0
debugger_on_panic = 0
require_hw_wd = 0
reset timer = 30

This means that if FortiADC "locks up", the IPMI watchdog will CPU-reset after 30 seconds. No debugging information will be available.

A customer that experiences such a problem may set up their system as follows:

NMI timer = 30
debugger_on_panic = 1
reset_timer = 29

The system will then drop to the debugger after 30 seconds, and remain there until physically rebooted/power cycled.

Note - The reset_timer cannot be 0 in this configuration, or the watchdog timer will not be armed. However, setting it to less than the NMI timer keeps the system from doing a CPU reset.
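The constraints in this example can be checked before a configuration is applied. The Python sketch below is a hypothetical helper, not a FortiADC command. It encodes only the conditions stated in this section: the reset timer is nonzero so the watchdog is armed, the NMI timer is nonzero, the reset timer is less than the NMI timer so no CPU reset occurs, and debugger_on_panic is 1.

```python
def stays_in_debugger(nmi_timer: int, reset_timer: int, debugger_on_panic: int) -> bool:
    """Return True if a lockup should leave the system sitting in the debugger.

    Hypothetical check based on the rules described in this section.
    """
    armed = reset_timer > 0               # reset timer 0 disarms the watchdog
    nmi_fires = nmi_timer > 0             # NMI timer 0 means no NMI is generated
    no_cpu_reset = reset_timer < nmi_timer
    return armed and nmi_fires and no_cpu_reset and debugger_on_panic == 1


if __name__ == "__main__":
    print(stays_in_debugger(nmi_timer=30, reset_timer=29, debugger_on_panic=1))
    print(stays_in_debugger(nmi_timer=0, reset_timer=30, debugger_on_panic=0))
```

The first call uses the customer configuration from the example, and the second uses the default values.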

If FortiADC has lockup issues during boot, it is possible that the system will begin processing traffic and then lock up. If only the software watchdog is active at that time, the system will never reboot, which means that the standby failover peer will never take over as the primary unit. Setting the require_hw_wd option on these systems prevents the system from processing traffic until after the IPMI watchdog is available. As a result, if the system locks up while the software watchdog is in use, it is not yet processing traffic. If it locks up after it begins processing traffic, the IPMI watchdog is in use, so a CPU reset will work.
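The effect of require_hw_wd can be summarized as a small truth table. The Python sketch below is a simplified model under stated assumptions: the function names are hypothetical, the boot sequence is reduced to whether a hardware watchdog is armed, and the only behavior modeled is what this section describes.

```python
def traffic_enabled(hw_watchdog_armed: bool, require_hw_wd: bool, reset_timer: int = 30) -> bool:
    """Model of whether the unit is allowed to process traffic."""
    # require_hw_wd has no effect when the watchdog is disabled (reset timer 0).
    gate_active = require_hw_wd and reset_timer > 0
    return hw_watchdog_armed or not gate_active


def unprotected_traffic(hw_watchdog_armed: bool, require_hw_wd: bool, reset_timer: int = 30) -> bool:
    """True when the unit carries traffic while no hardware watchdog is armed.

    This is the window in which a lockup could leave the primary stuck.
    """
    return traffic_enabled(hw_watchdog_armed, require_hw_wd, reset_timer) and not hw_watchdog_armed


if __name__ == "__main__":
    for armed in (False, True):
        for required in (False, True):
            print(f"hw armed={armed}, require_hw_wd={required}: "
                  f"unprotected={unprotected_traffic(armed, required)}")
```

In this model, the only risky combination is an unarmed hardware watchdog with require_hw_wd left at 0.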