Failover scenario 6: Network connection between primary and secondary units fails (remote service monitoring detects a failure)

Depending on your network configuration, the network connection between the primary and secondary units can fail for a number of reasons. In the network configuration shown in Figure 15, the connection between port1 of the primary unit (P1) and port1 of the secondary unit (S2) can fail if a network cable is disconnected or if the switch between P1 and S2 fails.

A more complex network configuration could include a number of network devices between the non-heartbeat network interfaces of the primary and secondary units. In any configuration, remote service monitoring can only detect that communication has failed. It cannot determine where the failure occurred or what caused it.

In this scenario, remote service monitoring has been configured to make sure that S2 can connect to P1. The On failure setting, located in the HA main configuration section, is set to wait for recovery then restore slave role. For information on the On failure setting, see “On failure”. For information about remote service monitoring, see “Configuring service-based failover”.
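The idea of declaring a failure only after repeated failed checks can be illustrated with a minimal Python sketch. It is not FortiMail code: the function name, the `retries` parameter, and the rule that failures must be consecutive are assumptions made for illustration. The sketch takes a sequence of check results and reports at which check a failure would be declared.

```python
from typing import Iterable, Optional


def checks_until_failure_declared(
    results: Iterable[bool], retries: int
) -> Optional[int]:
    """Return the 1-based number of the check at which failure is declared.

    `results` yields True when a check reached the peer and False otherwise.
    Assumption for this sketch: failure is declared after `retries`
    consecutive failed checks. Returns None if that never happens.
    """
    consecutive_failures = 0
    for check_number, reachable in enumerate(results, start=1):
        # A successful check resets the counter; a failed one increments it.
        consecutive_failures = 0 if reachable else consecutive_failures + 1
        if consecutive_failures >= retries:
            return check_number
    return None


# One good check, then the switch loses power and every check fails.
print(checks_until_failure_declared([True, False, False, False], retries=3))
```

Note that the result only says that the peer could not be reached. As with remote service monitoring itself, nothing in the check reveals whether a cable, a switch, or the peer unit was at fault.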

The failure occurs when power to the switch that connects the P1 and S2 port1 interfaces is disconnected. Remote service monitoring detects the failure of the network connection between the primary and secondary units. Because of the On failure setting, P1 changes its effective HA operating mode to failed.

When the failure is corrected, P1 detects the correction because, while operating in failed mode, P1 has been attempting to connect to S2 using the port1 interface. When P1 can connect to S2, the effective HA operating mode of P1 changes to slave and the mail data on P1 is synchronized to S2. S2 can now deliver this mail. The HA group continues to operate in this manner until an administrator resets the effective HA operating modes of the FortiMail units.
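The mode changes described in this scenario can be summarized as a small state machine. The following Python sketch is illustrative only: the event labels (`remote_service_lost`, `peer_requests_takeover`, `peer_reachable`) are hypothetical names chosen for this example and do not correspond to FortiMail log or CLI identifiers.

```python
from enum import Enum
from typing import List, Tuple


class Mode(Enum):
    MASTER = "master"
    SLAVE = "slave"
    FAILED = "failed"


# (current mode, event) -> new mode. Event names are hypothetical labels.
TRANSITIONS = {
    (Mode.SLAVE, "remote_service_lost"): Mode.MASTER,
    (Mode.MASTER, "peer_requests_takeover"): Mode.FAILED,
    (Mode.FAILED, "peer_reachable"): Mode.SLAVE,
}


def next_mode(mode: Mode, event: str) -> Mode:
    """Return the new effective mode, or the same mode if the event does not apply."""
    return TRANSITIONS.get((mode, event), mode)


def run_scenario() -> List[Tuple[Mode, Mode]]:
    """Replay the scenario and return the (P1, S2) modes after each change."""
    p1, s2 = Mode.MASTER, Mode.SLAVE
    history = [(p1, s2)]

    # S2 cannot reach P1 and signals P1 over the heartbeat link.
    p1 = next_mode(p1, "peer_requests_takeover")
    history.append((p1, s2))
    s2 = next_mode(s2, "remote_service_lost")
    history.append((p1, s2))

    # The switch is powered again and P1 can connect to S2.
    p1 = next_mode(p1, "peer_reachable")
    history.append((p1, s2))
    return history


for p1_mode, s2_mode in run_scenario():
    print(f"P1={p1_mode.value:<6} S2={s2_mode.value}")
```

The sketch ends with P1 as slave and S2 as master, which matches the state the HA group remains in until an administrator restores the configured operating modes.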

1. The FortiMail HA group is operating normally.

2. The power cable for the switch between P1 and S2 is accidentally disconnected.

3. S2’s remote service monitoring cannot connect to the primary unit.

How soon this happens depends on the remote service monitoring configuration of S2.

4. Through the HA heartbeat link, S2 signals P1 to stop operating as the primary unit.

5. The effective HA operating mode of P1 changes to failed.

6. The effective HA operating mode of S2 changes to master.

7. S2 sends an alert email similar to the following, indicating that S2 has determined that P1 has failed and that S2 is switching its effective HA operating mode to master.

This is the HA machine at 172.16.5.11.

The following event has occurred
‘MASTER remote service disappeared’
The state changed from ‘SLAVE’ to ‘MASTER’

8. S2 logs the event (among others) indicating that S2 has determined that P1 has failed and that S2 is switching its effective HA operating mode to master.

9. P1 sends an alert email similar to the following, indicating that P1 has stopped operating in HA mode.

This is the HA machine at 172.16.5.10.

The following event has occurred
'SLAVE asks us to switch roles (user requested takeover)'

The state changed from 'MASTER' to 'FAILED'

10. P1 records log messages (among others) indicating that P1 is switching to failed mode.

Recovering from a network connection failure

Because the network connection failure was not caused by failure of either FortiMail unit, you may want to return both FortiMail units to operating in their configured modes when rejoining the failed primary unit to the HA group.

To return to normal operation after the network connection fails

1. Reconnect power to the switch.

Because the effective HA operating mode of P1 is failed, P1 is using remote service monitoring to attempt to connect to S2 through the switch.

2. When the switch resumes operating, P1 successfully connects to S2.

P1 has determined that S2 can connect to the network and process email.

3. The effective HA operating mode of P1 switches to slave.

4. P1 logs the event.

5. P1 sends an alert email similar to the following, indicating that P1 is switching its effective HA operating mode to slave.

This is the HA machine at 172.16.5.10.

The following event has occurred
'SLAVE asks us to switch roles (user requested takeover)'

The state changed from 'FAILED' to 'SLAVE'

6. P1 synchronizes the content of its MTA queue directories to S2. S2 can now deliver all email in these directories.

The HA group can continue to operate with S2 as the primary unit and P1 as the secondary unit. However, you can use the following steps to restore each unit to its configured HA mode of operation.

7. Connect to the web‑based manager of P1 and go to System > High Availability > Status.

8. Check for synchronization messages.

Do not proceed to the next step until P1 has synchronized with S2.

9. Connect to the web‑based manager of S2, go to System > High Availability > Status and select click HERE to restore configured operating mode.

10. Connect to the web‑based manager of P1, go to System > High Availability > Status and select click HERE to restore configured operating mode.

P1 should return to operating as the primary unit and S2 should return to operating as the secondary unit.

11. P1 and S2 synchronize their MTA queue directories again. P1 can now deliver all email in these directories.
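If you collect the HA alert email messages shown in this scenario, a short script can extract the state change from each one. The following Python sketch assumes the message body follows the layout of the examples above, with either straight or curly quotation marks. The `HaAlert` type and `parse_ha_alert` function are hypothetical helpers written for this example and are not part of FortiMail.

```python
import re
from typing import NamedTuple, Optional

# Accept straight and curly single quotes, since both appear in the examples.
QUOTE = "'\u2018\u2019"


class HaAlert(NamedTuple):
    unit_ip: str
    event: str
    old_state: str
    new_state: str


_IP = re.compile(r"This is the HA machine at (\d{1,3}(?:\.\d{1,3}){3})")
_EVENT = re.compile(
    rf"The following event has occurred\s*[{QUOTE}](.+?)[{QUOTE}]\s*$",
    re.MULTILINE,
)
_STATE = re.compile(
    rf"The state changed from\s+[{QUOTE}](\w+)[{QUOTE}]"
    rf"\s+to\s+[{QUOTE}](\w+)[{QUOTE}]"
)


def parse_ha_alert(body: str) -> Optional[HaAlert]:
    """Parse an alert body; return None if it does not match the assumed layout."""
    ip = _IP.search(body)
    event = _EVENT.search(body)
    state = _STATE.search(body)
    if not (ip and event and state):
        return None
    return HaAlert(ip.group(1), event.group(1), state.group(1), state.group(2))


sample = (
    "This is the HA machine at 172.16.5.10.\n\n"
    "The following event has occurred\n"
    "'SLAVE asks us to switch roles (user requested takeover)'\n\n"
    "The state changed from 'FAILED' to 'SLAVE'\n"
)
print(parse_ha_alert(sample))
```

Returning `None` for unrecognized messages, rather than raising an error, lets the same function be run over a mailbox that also contains unrelated alerts.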