Node Discovery using Shared Storage for HA Clusters

The EFT Enterprise HA Resiliency project replaces MSMQ UDP Multicast with iterative MSMQ TCP Unicast (that is, each message is sent to each of the nodes individually over MSMQ TCP). In addition to the improved reliability of TCP, removing the dependence on UDP Multicast simplifies deployment on cloud infrastructures. However, a new method of discovering nodes is required. This can be accomplished using the shared storage (with file locking semantics) that EFT already requires.

ActiveNodes.json File

For EFT HA clusters, node discovery is accomplished through an ActiveNodes.json file on the existing shared storage. All access to ActiveNodes.json is performed in exclusive access mode (so no other process can change or read the file while it is being accessed). The ActiveNodes.json file is the “one source of truth” for which nodes are connected when operating in any HA mode. To determine which nodes to send messages to, ActiveNodes.json must be consulted. Nodes are responsible for adding themselves to and removing themselves from the list. The "Event Rule Master" (that is, the node that is holding the lock on Master.lck) is responsible for periodically removing nodes that failed without removing themselves.

ActiveNodes.json file example:

{
  "Nodes": [
    {
      "Name": "EFT_NODE_1",
      "IP": ["192.168.1.101"],
      "ActivationTime": "2017-03-30T15:20:30-06:00"
    },
    {
      "Name": "EFT_NODE_2",
      "IP": ["192.168.1.102"],
      "ActivationTime": "2017-03-30T15:20:35-06:00"
    }
  ]
}
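
As an illustration of the file layout above, the following minimal Python sketch parses the example into a list of node records. The function name parse_active_nodes and the lower-case keys of the returned records are hypothetical, and treating an empty file as an empty node list is an assumption (a file freshly created by an OPEN_ALWAYS open would have no content yet).

```python
import json
from datetime import datetime


def parse_active_nodes(text):
    """Parse ActiveNodes.json text into a list of node records.

    Assumption: an empty file means "no active nodes", because a file that
    was just created has no JSON content yet.
    """
    if not text.strip():
        return []
    nodes = []
    for entry in json.loads(text)["Nodes"]:
        nodes.append({
            "name": entry["Name"],
            "ips": list(entry["IP"]),
            # ISO 8601 timestamp with a UTC offset, as in the example file
            "activated": datetime.fromisoformat(entry["ActivationTime"]),
        })
    return nodes


SAMPLE = """
{
  "Nodes": [
    {"Name": "EFT_NODE_1", "IP": ["192.168.1.101"],
     "ActivationTime": "2017-03-30T15:20:30-06:00"},
    {"Name": "EFT_NODE_2", "IP": ["192.168.1.102"],
     "ActivationTime": "2017-03-30T15:20:35-06:00"}
  ]
}
"""

for node in parse_active_nodes(SAMPLE):
    print(node["name"], node["ips"], node["activated"].isoformat())
```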

Sending “Multicast” Messages

To determine which nodes to send the “Multicast” messages to, ActiveNodes.json is consulted.

  • If ActiveNodes.json has been modified (check both the last-modified date/time and the file size)

    • Open ActiveNodes.json in “OPEN_ALWAYS” and “Exclusive Access” mode (locks file from access by others)

      • Retry up to 10 times, waiting 100ms between each, if file is locked by another process

        • If the file cannot be opened after all retries, EFT shall write an error to EFT.log and shall indicate that cached values will be used

    • Read file contents

    • Close ActiveNodes.json

    • Parse JSON file contents (doing this after the file is closed is an optimization, because the file is locked for less time)

    • Store the cached values for future use

  • Send the message to each node listed in the JSON (or in the cached values if the file could not be opened)
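
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not EFT's implementation: the class ActiveNodesCache, the logger name "EFT", and the opener hook are hypothetical. The opener is assumed to raise PermissionError while another process holds the file, which stands in for a failed exclusive-access open. The built-in open used as the default does not take an exclusive lock.

```python
import json
import logging
import os
import time

log = logging.getLogger("EFT")  # hypothetical stand-in for EFT.log


class ActiveNodesCache:
    def __init__(self, path, opener=open, retries=10, delay=0.1):
        self.path = path
        self.opener = opener    # assumed to raise PermissionError when locked
        self.retries = retries
        self.delay = delay
        self.stamp = ()         # never equals a real stamp: first call reads
        self.nodes = []         # cached "Nodes" list

    def _stamp(self):
        """Return (last-modified time, size), or None if the file is missing."""
        try:
            st = os.stat(self.path)
        except FileNotFoundError:
            return None
        return (st.st_mtime_ns, st.st_size)

    def _read(self):
        """Read the whole file, retrying while it is locked."""
        for attempt in range(self.retries):
            try:
                # "a+" creates the file if it is missing, like OPEN_ALWAYS
                with self.opener(self.path, "a+", encoding="utf-8") as f:
                    f.seek(0)
                    return f.read()
            except PermissionError:
                if attempt + 1 < self.retries:
                    time.sleep(self.delay)
        return None

    def targets(self):
        """Return the IP addresses to send a message to."""
        stamp = self._stamp()
        if stamp != self.stamp:
            text = self._read()
            if text is None:
                log.error("ActiveNodes.json is locked; using cached values")
            else:
                # Parsing happens here, after the file has been closed
                self.nodes = json.loads(text)["Nodes"] if text.strip() else []
                self.stamp = stamp
        return [ip for node in self.nodes for ip in node["IP"]]
```

A sender would create one ActiveNodesCache for the shared path and call targets() before each send. Because the stored stamp is only updated after a successful read, a failed attempt is retried on the next call.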

Node Adding Self

When a node starts up it updates the ActiveNodes.json file, adding itself.

  • Open ActiveNodes.json in “OPEN_ALWAYS” and “Exclusive Access” mode (locks file from access by others)

    • Retry up to 300 times, waiting 100ms between each, if file is locked by another process

      • If the file cannot be opened after all retries, EFT shall write an error to EFT.log and the Windows Event Log, and shall treat this as an abject failure leading to restart

      • Service should exit in FAILURE mode

  • Read file contents

  • Parse JSON file contents

  • If this node is not already in the list (normally it shouldn’t be, but it may be if the current node had failed and restarted)

    • Add entry for this node to the JSON

    • Write Updated JSON File Contents (truncating file as needed)

  • Close ActiveNodes.json

  • Store the cached values for future use
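
The registration steps above can be sketched as follows. The names add_self, open_always, and RegistrationError are hypothetical, and matching nodes by their "Name" field is an assumption. The open_always helper creates the file when it is missing but does not take an exclusive lock. A real opener is assumed to raise PermissionError while another process holds the file.

```python
import json
import os
import time


class RegistrationError(RuntimeError):
    """Raised when ActiveNodes.json stays locked for every retry."""


def open_always(path):
    # Creates the file if missing, like OPEN_ALWAYS. Stand-in only: this
    # does NOT provide the exclusive access the procedure requires.
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    return os.fdopen(fd, "r+", encoding="utf-8")


def add_self(path, entry, opener=open_always, retries=300, delay=0.1):
    """Add `entry` to ActiveNodes.json if absent; return the node list."""
    f = None
    for attempt in range(retries):
        try:
            f = opener(path)
            break
        except PermissionError:
            if attempt + 1 < retries:
                time.sleep(delay)
    if f is None:
        # The caller is expected to log the error and exit in failure mode
        raise RegistrationError("could not open ActiveNodes.json")
    with f:
        text = f.read()
        doc = json.loads(text) if text.strip() else {"Nodes": []}
        # Assumption: a node is identified by its "Name" field
        if all(node["Name"] != entry["Name"] for node in doc["Nodes"]):
            doc["Nodes"].append(entry)
            f.seek(0)
            f.write(json.dumps(doc, indent=2))
            f.truncate()  # drop any bytes left over from the old contents
    return doc["Nodes"]  # the caller stores this as the cached value
```

Calling add_self twice with the same entry leaves a single entry in the file, which covers the case of a node that failed and restarted before being removed.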

Node Removing Self

When a node shuts down it updates the ActiveNodes.json file, removing itself.

  • Open ActiveNodes.json in “OPEN_ALWAYS” and “Exclusive Access” mode (locks file from access by others)

    • Retry up to 3 times, waiting 100ms between each, if file is locked by another process

      • If the file cannot be opened after all retries, EFT shall write an error to EFT.log and Windows Event Log

  • Read file contents

  • Parse JSON file contents

  • If this node is in list (it should be)

    • Remove entry for this node

    • Write Updated JSON File Contents (truncating file as needed)

  • Close ActiveNodes.json
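
The read, remove, and rewrite portion of the steps above can be sketched as follows, assuming the file has already been opened for reading and writing in exclusive mode. The function name remove_self is hypothetical, matching on the "Name" field is an assumption, and an in-memory io.StringIO stands in for the open file. The sketch shows why truncation matters here: the rewritten JSON is shorter than the original, so stale trailing bytes would otherwise remain.

```python
import io
import json


def remove_self(f, name):
    """Remove the node called `name` from an already-open ActiveNodes.json.

    `f` is a read/write text file object positioned at the start.
    Returns True if the file was rewritten, False if the node was absent.
    """
    text = f.read()
    doc = json.loads(text) if text.strip() else {"Nodes": []}
    remaining = [node for node in doc["Nodes"] if node["Name"] != name]
    if len(remaining) == len(doc["Nodes"]):
        return False  # node not listed: nothing to write
    doc["Nodes"] = remaining
    f.seek(0)
    f.write(json.dumps(doc, indent=2))
    f.truncate()  # new contents are shorter: cut off the old tail
    return True


# In-memory stand-in for the exclusively opened file
f = io.StringIO(json.dumps({"Nodes": [
    {"Name": "EFT_NODE_1", "IP": ["192.168.1.101"],
     "ActivationTime": "2017-03-30T15:20:30-06:00"},
    {"Name": "EFT_NODE_2", "IP": ["192.168.1.102"],
     "ActivationTime": "2017-03-30T15:20:35-06:00"},
]}))
remove_self(f, "EFT_NODE_2")
print(f.getvalue())
```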

Event Rule Master Removing “Expired” Nodes

Periodically the Event Rule Master updates the ActiveNodes.json file, removing any expired nodes.

  • Open ActiveNodes.json in “OPEN_ALWAYS” and “Exclusive Access” mode (locks file from access by others)

    • Retry up to 4 times, waiting 100ms between each, if file is locked by another process

      • If the file cannot be opened after all retries, EFT shall write a warning to EFT.log

  • Read file contents

  • Parse JSON file contents

  • Remove expired nodes from the list (that is, nodes that meet the following condition)

    • Use the “decay” mechanism that exists today in EFT (2x missed heartbeat from a node means the node is down)

  • If any nodes removed, write Updated JSON File Contents (truncating file as needed)

  • Close ActiveNodes.json
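
The expiry decision above can be sketched as a pure function over the parsed JSON. The details of EFT's existing “decay” mechanism are not described here, so this sketch makes several assumptions: "2x missed heartbeat" is read as "no heartbeat for at least two heartbeat intervals", the last heartbeat times are supplied in a hypothetical last_heartbeat dictionary keyed by node name, and a node with no recorded heartbeat is measured from its ActivationTime. The function name prune_expired is also hypothetical.

```python
from datetime import datetime


def prune_expired(doc, last_heartbeat, now, heartbeat_interval):
    """Split nodes into kept and expired.

    doc                parsed ActiveNodes.json ({"Nodes": [...]})
    last_heartbeat     node name -> time of last heartbeat (epoch seconds)
    now                current time (epoch seconds)
    heartbeat_interval expected seconds between heartbeats

    Returns (new_doc, removed_names). The caller writes new_doc back to
    the file only when removed_names is not empty.
    """
    limit = 2 * heartbeat_interval  # two missed heartbeats
    kept, removed = [], []
    for node in doc["Nodes"]:
        # Assumption: fall back to the activation time if no heartbeat
        # has been recorded for this node
        activated = datetime.fromisoformat(node["ActivationTime"]).timestamp()
        last_seen = last_heartbeat.get(node["Name"], activated)
        if now - last_seen >= limit:
            removed.append(node["Name"])
        else:
            kept.append(node)
    return {"Nodes": kept}, removed
```

Returning the removed names separately lets the caller skip the write entirely when nothing has expired.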