Resiliency - Troubleshooting

When the resiliency services (MOVEit DMZ Database Resiliency and MOVEit DMZ Web Resiliency) notice a problem, these services will take corrective action, usually after first retrying the failed operation. If the problem is not resolved right away, the services will send email to the "Send Errors To" email address specified in the MOVEit DMZ Configuration Utility. If the problem is subsequently resolved, the services will send a follow-up email.

In some cases, the problems cannot be resolved automatically by the Resiliency system. In these cases, an operator will need to take corrective action, usually via the Resiliency / Advanced window of the MOVEit DMZ Configuration Utility.

General Troubleshooting Procedure

Most MOVEit DMZ Resiliency problems result from one or more of the conditions listed below.

Different Time on Different Nodes - Manually check the clocks on each node, or look for error messages containing the words "time" or "old" in the MOVEit DMZ Config's Resiliency status pane. You can also use the methods described in "Advanced Topics - Time Synchronization" to test your time server connections.
Lost Connection to NAS - Try uploading or downloading a file through each node, or look for error messages containing the word "available" in the MOVEit DMZ Config's Resiliency status pane. Also check network connectivity (ping, etc.) over the cluster network, and look for signs of folder permission or user account changes on the NAS. NAS IP address changes or NetBIOS/DNS name resolution problems may also cause NAS service interruptions.
Lost Connection to Database - This will be evident in the MOVEit DMZ Config's Resiliency status pane; look for error messages regarding database connectivity. Check network connectivity (ping, etc.) over the cluster network, and confirm that MySQL is actually running on the two MOVEit DMZ Database nodes. Node IP address changes or MySQL permission changes may also cause database service interruptions.
Database Synchronization Problems - This will be evident in the MOVEit DMZ Config's Resiliency status pane; look for error messages regarding "duplicate IDs" or other replication problems. In most cases you will need to perform "Common Procedures - Database Resynchronization" to correct the situation.
Deadlocked Who-Is-Primary Decision - This will be evident in the MOVEit DMZ Config's Resiliency status pane; look for conflicting messages about who should be "Master".
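
The connectivity checks above (NAS and database) can be partly scripted. The following is a minimal Python sketch, not part of MOVEit DMZ, that tests whether a TCP port answers from the node where it is run. The host names "dbnode1", "dbnode2" and "nas" are hypothetical placeholders; port 3306 assumes MySQL is listening on its default port and port 445 assumes the NAS is reached through Windows file sharing over TCP. Adjust all of these to match your cluster.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Hypothetical node names; replace with the names or cluster-network
# IP addresses of your own database nodes and NAS.
CHECKS = [
    ("dbnode1", 3306),  # assumed MySQL default port
    ("dbnode2", 3306),
    ("nas", 445),       # assumed SMB over TCP
]


def run_checks(checks=CHECKS):
    """Map each (host, port) pair to True (reachable) or False."""
    return {(host, port): port_is_reachable(host, port) for host, port in checks}


if __name__ == "__main__":
    for (host, port), ok in run_checks().items():
        print(f"{host}:{port} {'reachable' if ok else 'NOT reachable'}")
```

A port that answers only shows that the network path and the listening service are up. It does not rule out permission, account or replication problems, which still need to be checked in the Resiliency status pane.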

Specific Error Messages

The database slave node complains that replication is not working, with a message like "Duplicate entry '1046' for key".

The database on the slave database node is not in sync with the master database. This likely means that updates have been made directly to the slave database, rather than via replication. This creates inconsistent databases, with some data in the master and other data in the slave, which is a dangerous situation. If you are confident that the master database is correct and is not missing data, see "Common Procedures - Database Resynchronization" to correct the situation.

Although all nodes are running, the status display indicates old timestamps for one of the nodes, and/or Database Resiliency waits several minutes at startup, complaining that the other node is not running.

The clocks on the nodes are not in sync. MOVEit DMZ reacts poorly to drifting clocks, especially if failover time has been set to a low value like 30 seconds. If you know that network issues are not a concern, see "Common Procedures - Change Time Server" to correct the situation by switching to a functional time server.
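
To judge whether clock drift is large enough to matter, it can help to compare the drift with the configured failover time. The following Python sketch is only an illustration: it assumes you have noted each node's clock reading at the same moment, and the 0.5 fraction used as a warning threshold is an arbitrary assumption, not a documented MOVEit DMZ value.

```python
from datetime import datetime


def max_skew_seconds(node_times):
    """Largest difference, in seconds, between any two node clock readings.

    node_times maps a node name to the datetime read from that node's
    clock; all readings are assumed to be taken at the same moment.
    """
    readings = list(node_times.values())
    return (max(readings) - min(readings)).total_seconds()


def skew_is_risky(skew_seconds, failover_seconds, fraction=0.5):
    """Flag skew that reaches a chosen fraction of the failover time.

    The default fraction is an illustrative threshold only.
    """
    return skew_seconds >= failover_seconds * fraction


if __name__ == "__main__":
    # Hypothetical readings from two nodes.
    readings = {
        "node1": datetime(2020, 1, 1, 12, 0, 0),
        "node2": datetime(2020, 1, 1, 12, 0, 20),
    }
    skew = max_skew_seconds(readings)
    print(f"skew: {skew:.0f} s, risky at 30 s failover: {skew_is_risky(skew, 30)}")
```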

Any node complains that "The specified network name is no longer available"

This message indicates that a Windows network connection to the NAS has been broken. The network connection between that node and the NAS may not be reliable or the NAS may have shut down unexpectedly.

You get a "Can't get log file position" message after clicking "Record replication" on the Master node.

In this case, the "Master" node could be node 1 or node 2, depending on the current failover state. This message indicates a serious problem communicating with the local MySQL server, but the root cause could be any of a number of things. If the MySQL service is started and accepting connections from the console (try typing "c:\mysql\bin\mysql.exe" and look for a healthy "Access denied for user" message), then there is likely a problem with this machine's "MY.INI" configuration file. In one case, a Resiliency-capable "MY.INI" was mysteriously copied to the wrong folder ("C:\Documents and Settings\Administrator\Windows" instead of "C:\Windows") during the MOVEit DMZ Resiliency installation. However, most cases involve garbled or old settings that were put into or on top of a good "MY.INI" file by an uninstall, a partial restore or other means.
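
A quick way to spot a misplaced "MY.INI" is to list every candidate folder that contains one. The Python sketch below checks only the two folders named above; the function name is hypothetical, and the script does not inspect the contents of the file, so garbled or outdated settings still have to be reviewed by hand.

```python
from pathlib import Path

# The expected folder and the wrong folder mentioned in the text above.
CANDIDATE_DIRS = [
    r"C:\Windows",
    r"C:\Documents and Settings\Administrator\Windows",
]


def find_ini_files(dirs=CANDIDATE_DIRS, name="MY.INI"):
    """Return the paths of every file called name found directly in dirs."""
    found = []
    for folder in dirs:
        path = Path(folder) / name
        if path.is_file():
            found.append(path)
    return found


if __name__ == "__main__":
    matches = find_ini_files()
    if not matches:
        print("No MY.INI found in the candidate folders.")
    for path in matches:
        print(f"Found: {path}")
```

If a copy turns up outside "C:\Windows", or more than one copy is found, compare the files before moving or deleting anything.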

The "DBResil" and "SQLResil" services cannot start and a "could not start due to a logon failure" error is listed in the corresponding Windows Security Event Log.

This error happens when the Local Security Policy does not grant the domain user account used to access the NAS the "Act as part of the operating system" ("Act as OS") or "Log on as a service" user rights. These rights are normally assigned by the Resiliency installation package, but they can also easily be reset by hand using normal access to the Local Security Policy. To retest, simply try to restart the services through the MOVEit DMZ Config utility.

Additional Help

For additional help, you may want to consult the Knowledge Base on our support site at https://moveitsupport.ipswitch.com.