The digest of current topics on Continuous Processing Architectures. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CPA tells you how to avoid the effects of downtime.
In this issue:
Become a Published Author
We welcome our second guest author in this issue of the Availability Digest. Phil Kloot, with help from his original development team, relates how Wells Fargo built one of the earliest active/active systems in the late 1980s to control its ATM network. In our June issue, Damian Ward described the new VocaLink system for providing fast payment services to the U.K. financial community.
Others have contributed heavily to Digest articles, which would not have appeared without their input – Wolfgang Breidbach on Bank-Verlag’s pioneering active/active system, Colin Butcher on the new data centers for the U.K. National Health Service, Keith Parris on best practices for improving availability of nonredundant systems, Ron LaPedis on business continuity, and Jim Johnson on the Megaplex. We make sure that we acknowledge all of our contributors.
Now it’s your turn. Beef up your resume with a published article on any topic related to availability – systems you worked on, never-again experiences that rattled you, best practices, new analytic insights for The Geek Corner, reviews of books you find useful – even reviews of availability-related products (just no marketing hype, please).
We are always looking for fresh and interesting material for the Digest, and we will even help you write your article if you wish. Just let us know at firstname.lastname@example.org.
Dr. Bill Highleyman, Managing Editor
Wells Fargo has always been a pioneer. Remember the Pony Express that delivered mail by horseback to the American Wild West in the mid- to late 1800s? That service was run by Wells Fargo, a company formed in 1852 to provide banking and express services to California.
As in the stagecoach days, Wells Fargo has been a pioneer in bringing banking convenience to its customers. It was an early adopter of ATM technology and worked tirelessly to make ATM services highly available. To ensure availability, the bank implemented its ATM network with a fault-tolerant Tandem system located in its San Francisco data center.
But what if the network failed or the Tandem system went down? The bank reasoned that if another operating ATM was right next to the failed ATM, the customer could easily use it instead and not be inconvenienced. To achieve this, Wells Fargo built a second Tandem data center hundreds of miles from its original data center and distributed the ATMs at each location between the two data centers. The data centers each had a copy of the application database. The databases were synchronized using asynchronous replication with lock coordination to eliminate data collisions.
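The idea of splitting each location's ATMs between the two data centers can be sketched as follows. This is only an illustration under our own assumptions: the data-center names, location names, ATM identifiers, and the simple alternating rule are hypothetical and are not taken from the Wells Fargo design.

```python
# Hypothetical data-center names; the real sites are not named here.
DATA_CENTERS = ("dc_a", "dc_b")


def assign_atms(atms_by_location):
    """Alternate each location's ATMs between the two data centers."""
    assignment = {}
    for atms in atms_by_location.values():
        for i, atm in enumerate(sorted(atms)):
            assignment[atm] = DATA_CENTERS[i % 2]
    return assignment


def usable_atms(atms_by_location, assignment, failed_dc):
    """List, per location, the ATMs still served when one data center is down."""
    return {
        location: [a for a in sorted(atms) if assignment[a] != failed_dc]
        for location, atms in atms_by_location.items()
    }


# Hypothetical locations, each with two ATMs side by side.
locations = {"Market St": ["atm1", "atm2"], "Mission St": ["atm3", "atm4"]}
assignment = assign_atms(locations)
print(usable_atms(locations, assignment, failed_dc="dc_a"))
```

With one data center down, every location in this example still has a working ATM next to the failed one, which is the property the bank was after.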
It’s one thing to have a major system go down for four hours while it is being recovered. It is another thing for a three-billion-dollar retailer to lose its entire website for eight days. That is what happened to American Eagle Outfitters in late July.
American Eagle had done everything right. It had backups of backups. It had a disaster-recovery site. It had detailed business-continuity and disaster-recovery plans. So what went wrong? Testing and verification.
American Eagle had outsourced its website management to IBM. Both the primary and standby servers crashed, losing all data – a highly unlikely event. Attempts to restore the database from magnetic tape failed – a restoration rate of only one gigabyte per hour could be achieved. At this rate, it would have taken over two weeks to restore the 400-gigabyte database. Subsequent failover to the disaster-recovery site also failed when it was unexpectedly found that the site was not yet operational.
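The restore-time estimate is simple arithmetic. The short check below assumes a sustained rate of one gigabyte per hour, the rate consistent with the "over two weeks" figure for a 400-gigabyte database; the function and constant names are ours.

```python
DATABASE_SIZE_GB = 400
RESTORE_RATE_GB_PER_HOUR = 1  # assumed sustained tape-restore rate


def restore_time_days(size_gb, rate_gb_per_hour):
    """Return the time to restore a database, in days."""
    hours = size_gb / rate_gb_per_hour
    return hours / 24


days = restore_time_days(DATABASE_SIZE_GB, RESTORE_RATE_GB_PER_HOUR)
print(f"{days:.1f} days")  # 16.7 days
```

A full-scale restore test would have exposed this number long before it mattered.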
Clearly, neither the tape-restoration procedure nor the disaster-recovery site was ever tested by IBM or by American Eagle. The result was eight days of lost revenue and an untold loss of customer loyalty.
A major impediment to moving to an active/active architecture for some applications is the problem of data collisions when using asynchronous replication. A data collision occurs when two application instances in different nodes update the same data item at the same time. Each change is replicated to the other system and overwrites the original change made at that system. Now both databases are different, and both are wrong.
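A minimal simulation makes the collision concrete. It assumes two nodes that each hold a copy of one account balance and a replication step that simply overwrites the target's copy; the node names and amounts are hypothetical.

```python
def replicate(target_db, key, value):
    """Apply a replicated change by overwriting the target's copy."""
    target_db[key] = value


def simulate_collision():
    node_a = {"balance": 100}
    node_b = {"balance": 100}

    # Both nodes update the same item before either change is replicated.
    node_a["balance"] = 80    # withdrawal of 20 applied at node A
    node_b["balance"] = 150   # deposit of 50 applied at node B

    # Capture each node's change, then deliver it to the other node.
    change_from_a = node_a["balance"]
    change_from_b = node_b["balance"]
    replicate(node_b, "balance", change_from_a)
    replicate(node_a, "balance", change_from_b)
    return node_a, node_b


print(simulate_collision())  # ({'balance': 150}, {'balance': 80})
```

The two copies now disagree, and neither holds 130, the balance that reflects both transactions.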
There are standard techniques for configuring systems to avoid data collisions or, when collisions cannot be avoided, to resolve them reliably and uniformly across the application network.
In this article, we review these techniques for avoiding, detecting, and resolving data collisions. The key is to structure an application to minimize the number of data collisions that must be resolved manually.
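One way to picture detection and uniform resolution is sketched below. It is our own illustration, not necessarily a technique covered in the article: each replicated change is assumed to carry the value its source node saw before updating (a before-image), and a hypothetical fixed node precedence decides the winner when a collision is detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str
    before: int   # value the source node saw before its update
    after: int    # value the source node wrote
    source: str   # node that made the change


# Hypothetical resolution rule: the node listed first wins a collision.
NODE_PRECEDENCE = ["A", "B"]


def apply_change(db, local_node, change):
    """Apply a replicated change; return True if a collision was detected."""
    if db[change.key] == change.before:
        db[change.key] = change.after   # no collision, apply normally
        return False
    # The local copy no longer matches the before-image: a collision.
    # Every node applies the same precedence rule, so all copies converge.
    if NODE_PRECEDENCE.index(change.source) < NODE_PRECEDENCE.index(local_node):
        db[change.key] = change.after
    return True


db_a, db_b = {"balance": 100}, {"balance": 100}
db_a["balance"], db_b["balance"] = 80, 150   # simultaneous local updates
apply_change(db_b, "B", Change("balance", 100, 80, "A"))
apply_change(db_a, "A", Change("balance", 100, 150, "B"))
print(db_a, db_b)  # {'balance': 80} {'balance': 80}
```

Both copies end up identical, but node B's update is discarded. A rule like this restores consistency, not correctness, which is why it pays to structure the application so that few collisions are left for manual review.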
In our May 2010 Never Again article entitled Fire Suppression Suppresses WestHost for Days, we related how WestHost, a major web hosting and dedicated server provider, lost its data center for six days when a fire-suppression system test went terribly wrong. At that time, it was not determined why the accidental release of suppressant gas caused multiple hard disks to fail. The best guess was the sudden increase in pressure caused by the gas discharge.
Over the last two years, other reports of server damage following inadvertent fire-suppression system activations have surfaced. New tests by Siemens, one of the leading providers of fire-suppression systems, have now shown that a sudden increase in gas pressure was probably not to blame. Interestingly, it is more likely that the noise caused by the activation of the fire-suppression system caused the problem.
We first review the WestHost story and then look at the testing procedures that point to this surprising conclusion. We next review the recommendations made by Siemens for avoiding equipment damage due to inadvertent activations of fire-suppression systems.
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman email@example.com.
© 2010 Sombers Associates, Inc., and W. H. Highleyman