Delay 10 seconds

  • Delay 10 seconds

    I have two servers, each running Windows 2003 Enterprise, configured as a cluster. The servers work in active-passive mode.
    I have a service that is run and managed by the cluster.
    If I simulate a failure on SRV1, it takes about 10 seconds for the service to come back to life on SRV2.
    Is it possible to reduce this time to 0 seconds? That is, clients connected to the cluster should not "feel" that one of the servers is down.

    Maybe the solution is to configure my servers to work in Active-Active mode instead of Active-Passive?
    Can this configuration be done only by reinstalling the OS, or can it be set up on the existing OS?
    Or are there other solutions to this problem?

  • #2
    Re: Delay 10 seconds

    Realize there are several time components here. A polling interval where the health of each resource is tested; a polling (heartbeat) interval where the health of the active node is tested by the passive members; some amount of time to demote / promote a new active node; and the inherent startup time of all the resources in the group.

    So 10 seconds doesn't seem all that bad.

    If the group is off line, how long does it take to bring it on line? That's about the best you could hope for w/ infinite polling and fault detection speed.
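
    As a rough illustration of how those time components add up, here is a minimal Python sketch. Every number in it is a made-up placeholder, not a measured or default value for Windows 2003 clustering, and the `missed_heartbeats` threshold and the simple additive model are my own assumptions:

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    """Hypothetical time budget for one failover; all values in seconds."""
    resource_poll_s: float    # interval between resource health checks
    heartbeat_s: float        # interval between node heartbeats
    missed_heartbeats: int    # assumed: heartbeats missed before a node is declared dead
    promote_s: float          # time to demote / promote the new active node
    group_startup_s: float    # time for every resource in the group to start

    def floor(self) -> float:
        # Best case with infinitely fast detection: only the offline-to-online
        # startup time of the group remains.
        return self.group_startup_s

    def worst_case_node_failure(self) -> float:
        # Detection waits for the assumed number of missed heartbeats.
        detection = self.heartbeat_s * self.missed_heartbeats
        return detection + self.promote_s + self.group_startup_s

    def worst_case_resource_failure(self) -> float:
        # Detection waits for up to one full resource polling interval.
        return self.resource_poll_s + self.promote_s + self.group_startup_s


# Placeholder numbers chosen only to make the arithmetic easy to follow.
budget = FailoverBudget(
    resource_poll_s=5.0,
    heartbeat_s=1.0,
    missed_heartbeats=3,
    promote_s=1.0,
    group_startup_s=4.0,
)

print("floor:", budget.floor())
print("node failure:", budget.worst_case_node_failure())
print("resource failure:", budget.worst_case_resource_failure())
```

    With numbers like these, faster polling only shrinks the detection term; the group startup time stays as the floor.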

    Maybe you can lighten what's in the group so it starts up faster, and increase the heartbeat frequency, but you're never going to get this down to sub-second response. Even if you do, a client connected to the active node will be disconnected at failover. In web applications, where clients are mostly detached between requests, this may not be much of an issue, but I don't know what your service does.

    Active/active may help if the service supports it. Even then, if the nodes don't share session state data with each other, you'll still sense a disconnect on a failure.

    Remember, this is a "high availability" cluster, not a "fault tolerant" one.

    True fault tolerance is often difficult and expensive to achieve. Do weigh the cost / benefit before chasing after this "technical rabbit". Say you experience 1 failover per week (a very high rate indeed). At 10 seconds of downtime per failover, that's still 99.998% uptime. And at 1 failover per 4 weeks, you're now better than 99.999%. Both values are unheard of on M$ boxes (regardless of claims) because of critical updates alone.
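
    For anyone who wants to check the uptime arithmetic or plug in their own failover rate, here is a small Python sketch. The function name is just something I made up for illustration, and it assumes the 10-second failover is the only downtime in the period:

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def availability_percent(outage_seconds: float, weeks_between_failovers: float) -> float:
    """Percent uptime if one outage of the given length occurs per period."""
    period_seconds = weeks_between_failovers * SECONDS_PER_WEEK
    return 100.0 * (1.0 - outage_seconds / period_seconds)


weekly = availability_percent(10, 1)       # one 10 s failover every week
four_weekly = availability_percent(10, 4)  # one 10 s failover every 4 weeks

print(f"1 failover per week:    {weekly:.4f}%")
print(f"1 failover per 4 weeks: {four_weekly:.4f}%")
```

    The first figure comes out a little above 99.998% and the second above 99.999%, which matches the numbers quoted above.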
    Last edited by rvalstar; 3rd July 2007, 15:37.


    ** Remember to give credit where credit is due and leave reputation points where appropriate **

    2006-2099 R Valstar. This post is offered "as is" for discussion purposes only with no express or implied warranty of any kind including, but not limited to, correctness or fitness for use. Nothing herein shall be construed as advice. Attempting any activity based on information in this post is done at your own risk.