DAG Failure

  • DAG Failure

    This morning a client had a DAG failure which caused mailflow to cease.

    The DAG network on the primary server was showing as unavailable/offline, even though both cluster nodes and the FSW were still reachable over the network.

    The fix was to open Failover Cluster Manager, select the unavailable cluster network, and untick 'Allow Clients to Connect Through This Network'. The tick box re-enabled itself immediately, but the DAG came online at once and mailflow returned.

    I am very concerned that this is a single point of failure in the setup below:

    Site A - 1 DAG Member, 1 FSW
    Site B - 1 DAG Member

    How can I ensure that a failure of the cluster network on the server in Site A does not bring the whole DAG down? What steps should be taken to increase resiliency?

    Had I introduced a second DAG member in Site A, could this outage, caused by the cluster network failing on the active server, have been avoided?
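
    As a rough way to reason about the second-member question, the quorum votes can be counted by hand. The sketch below is only a vote-counting model under stated assumptions: a two-member DAG uses Node and File Share Majority (2 members + FSW = 3 votes), and a three-member DAG uses Node Majority, where the FSW does not vote. The function name `has_quorum` is hypothetical, and the model does not reproduce the specific fault above, where the cluster network was marked unavailable while the nodes could still communicate.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A cluster partition keeps quorum only with a strict majority of votes."""
    return reachable_votes > total_votes // 2


# Current layout: 2 DAG members + 1 FSW = 3 votes (Node and File Share Majority).
current_total = 3
# Site A member drops out; Site B member can still reach the FSW: 2 of 3 votes.
print(has_quorum(2, current_total))   # True
# Site B member is cut off from both the Site A member and the FSW: 1 of 3 votes.
print(has_quorum(1, current_total))   # False

# Hypothetical layout: 3 DAG members (2 in Site A, 1 in Site B) = 3 votes.
# Assumption: with an odd member count the FSW does not vote (Node Majority).
proposed_total = 3
# One Site A member drops out; the other two members still see each other.
print(has_quorum(2, proposed_total))  # True
# The link between sites fails: the Site B member alone holds 1 of 3 votes.
print(has_quorum(1, proposed_total))  # False
```

    On vote count alone, both layouts survive the loss of a single voter, so a second member in Site A mainly removes the dependency on the Site B member reaching the FSW across the site link.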

  • #2
    Re: DAG Failure

    On each server, at around 10PM each day, I am seeing the following event:

    Cluster node 'MYSERVER' was removed from the active failover cluster membership. The Cluster service on this node may have stopped.

    I have collated the following changes which I will make to the environment. If anyone has any further comments/suggestions to make this process more resilient I would appreciate it.

    Run Performance Monitor to see whether VMware is dropping network packets, as per article. I have set this performance counter up on both servers and will check it over the next couple of days to see if packets are being dropped through the VMXNET interface.

    Temporarily disabled Sophos AV on both servers to rule out antivirus as the cause.

    Move the File Share Witness which is on an old DC to the new 2012 DC:
    Set-DatabaseAvailabilityGroup -Identity "MYDAG1" -WitnessServer "MYSERVER" -WitnessDirectory "C:\File Share Witness\"

    Update both Exchange servers, which are currently on SP3 but in need of Update Rollup 6.

    Increase the cluster heartbeat delays and failure thresholds so that the cluster tolerates latency or a brief loss of heartbeats between the nodes:

    cluster /prop SameSubnetDelay=2000
    cluster /prop SameSubnetThreshold=10
    cluster /prop CrossSubnetDelay=4000
    cluster /prop CrossSubnetThreshold=10
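
    To put those four values in context, the time a node can go unheard before it is removed from the cluster is roughly the heartbeat delay multiplied by the missed-heartbeat threshold. The sketch below works that out for the proposed settings. The comparison values of 1000 ms and 5 missed heartbeats are an assumption (the defaults commonly documented for Windows Server 2008 R2) and should be checked against the actual cluster; `heartbeat_tolerance_seconds` is a hypothetical helper name.

```python
def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time without heartbeats before a node is declared down."""
    return delay_ms * threshold / 1000


# Proposed settings from the commands above.
print(heartbeat_tolerance_seconds(2000, 10))  # same subnet: 20.0
print(heartbeat_tolerance_seconds(4000, 10))  # cross subnet: 40.0

# Assumed defaults (1000 ms delay, threshold of 5), for comparison only.
print(heartbeat_tolerance_seconds(1000, 5))   # 5.0
```

    A longer tolerance hides short network blips, but it also delays failover when a node really has gone down.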

    Disable TCP Chimney Offload and Receive Side Scaling on the network adapters:
    netsh int tcp set global chimney=disabled
    netsh int tcp set global rss=disabled