BCP tells you how to recover from the effects of downtime.
CPA tells you how to avoid the effects of downtime.
In this issue:
Browse through our Useful Links.
Check our article archive for complete articles.
Sign up for your free subscription.
Join us on our Continuous Availability Forum.
Google Finds That Business Continuity Takes More Than Planning
A good business-continuity plan is meaningless if it has not been tested and if staff has not been trained in its execution. Google recently found this out the hard way, as told in this issue’s Never Again story.
Google’s infrastructure ensures that its services will always be available. Should a server crash, failover to another server in the same data center is automatic and transparent. Google even protects against the failure of an entire data center by failing over applications to servers in a backup data center. Or so it thought.
In February, Google suffered the unthinkable – an entire data center went down. However, the failover to the backup data center failed, converting what should have been a routine ten-minute outage to a failover fault lasting almost three hours. What happened? It turns out that the documentation for the failover procedure was faulty. Clearly, the procedure had never been tested; nor had staff been trained.
To Google’s credit, it posted a detailed post-mortem explaining exactly what happened and what it plans to do to avoid such a failure in the future. Describing experiences like this helps us all to plan better. If you have a Never Again story, let the Digest share it with our readers so that we can all learn to improve our own business-continuity procedures. We’ll keep it anonymous if you would like.
Dr. Bill Highleyman, Managing Editor
What should have been a ten-minute outage at a major Google data center hosting the Google App Engine turned into a two-and-a-half-hour ordeal simply because of faulty failover documentation. Of course, the fact that the failover procedures were incorrectly documented also implies that they were never tested and that the staff was never trained.
Kudos to Google for its transparency during and after the outage. It published a detailed post-mortem study explaining minute-by-minute exactly what happened, what the underlying causes of the outage were, and what it plans to do to avoid this situation in the future.
The incident is an excellent example of a failure chain. Many failures occur because of a sequence of events. If any one event does not happen, the failure chain is broken; and the failure does not occur. In this case, the failure chain included a power failure, a backup power fault, recent failover enhancements, faulty failover documentation of the new procedures, and the unavailability of the knowledgeable technical people who could have untangled the documentation. If any one of these events had not happened, this major outage would have been only a minor inconvenience.
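The failure-chain idea can be sketched in a few lines of Python. This is only an illustration: the probabilities below are hypothetical placeholders (the post-mortem gives no such figures), and multiplying them assumes the events are independent, which may not hold in practice.

```python
from math import prod

# Hypothetical per-event probabilities for the links named above.
# These numbers are illustrative assumptions, not data from the incident.
CHAIN = {
    "power failure": 0.01,
    "backup power fault": 0.05,
    "recent failover enhancements": 0.5,
    "faulty failover documentation": 0.2,
    "knowledgeable staff unavailable": 0.3,
}


def major_outage(events_happened):
    """A major outage requires every link in the chain to occur."""
    return all(events_happened.values())


def chain_probability(event_probs):
    """Probability that all events occur, assuming they are independent."""
    return prod(event_probs.values())


# Every link occurs: the chain completes.
everything = {name: True for name in CHAIN}
print(major_outage(everything))

# Break one link (the documentation is correct): the chain is broken.
one_link_broken = dict(everything, **{"faulty failover documentation": False})
print(major_outage(one_link_broken))

print(chain_probability(CHAIN))
```

Because the outage needs every link, removing any single one is enough to prevent it, and under the independence assumption each additional link that must fail makes the full chain less likely.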
It can take many minutes for an earthquake to be scientifically reported, but social networks can reduce this time to seconds. The U.S. Geological Survey (USGS), which has responsibility in the U.S. for earthquake detection and reporting, is building a prototype to take advantage of social Internet technology to speed up early reporting of earthquake activity. It is parsing tweets sent by Twitter users to find out about earthquakes in the seconds after the tremors begin. It calls this new system the Twitter Earthquake Detector (TED).
The USGS has realized that social networking can produce more than simply “short bursts of inconsequential information.” Immediate responses by observers over a social network can be mined to provide early reports of many disasters, from fires and accidents to terrorist attacks. All it takes is a little software (it’s probably more accurate to say “a lot of software”).
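To give a feel for the kind of software involved, here is a minimal Python sketch of one possible approach: flag a possible event when enough keyword-matching messages arrive within a short time window. It is not the USGS's actual TED implementation. The keyword list, window length, threshold, and the `SpikeDetector` class are all assumptions made for illustration, and the sketch reads messages from plain Python values rather than from any real Twitter API.

```python
from collections import deque

# Hypothetical keyword list; a real system would need many languages and filters.
KEYWORDS = ("earthquake", "quake", "tremor")


def mentions_earthquake(text):
    """Return True if the message contains any of the assumed keywords."""
    lowered = text.lower()
    return any(word in lowered for word in KEYWORDS)


class SpikeDetector:
    """Flags a possible event when matching messages cluster in time.

    Assumes messages are observed in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds=60, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def observe(self, timestamp, text):
        """Record one message; return True if the window holds enough matches."""
        if mentions_earthquake(text):
            self.timestamps.append(timestamp)
        # Discard matches that have fallen out of the time window.
        while self.timestamps and timestamp - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold


detector = SpikeDetector(window_seconds=10, threshold=2)
for ts, message in [(0, "Lunch time"), (1, "Earthquake!"), (3, "Felt a tremor just now")]:
    if detector.observe(ts, message):
        print(f"Possible earthquake reported at t={ts}s")
```

A simple count like this would produce false alarms (people discussing yesterday's quake, for example), which is one reason "a lot of software" is probably the more accurate description.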
Does the absence of a detectable fault prove the absence of a design defect?
Is electromagnetic interference (EMI) with automobile engine computers the cause of so many sudden, unintended acceleration (SUA) incidents? That is the controversy now raging in the public domain as auto manufacturers scramble to reassure nervous customers about the safety of their vehicles. The engine computers (electronic control units, or ECUs) control throttle settings, fuel/air mixture ratios, and transmissions in order to satisfy increasingly stringent fuel economy and emission standards. Sometimes, unfortunately, these computers misbehave. When that happens, the computers often leave no evidence trail. How can an ECU design defect be corrected if you can’t detect the fault?
SUA incidents have surged throughout the last decade since the widespread introduction of engine computers in automobiles. We in the IT community are painfully aware that gremlins hide in computer systems. Whether the problem is EMI, software bugs, operator error, or HAL trying to take over Discovery One (remember the movie “2001: A Space Odyssey”?), we must find failsafe methods to identify and correct safety-critical computer faults. Of course, the ultimate failsafe method will depend upon no electronics or mechanical linkages that themselves can fail.
If you think sudden, unintended acceleration is a serious problem, just wait until the introduction of steer-by-wire technology in automobiles.
Windows Server Failover Clustering (WSFC) is the successor to Microsoft Cluster Server (MSCS). WSFC and its predecessor, MSCS, offer high availability for critical applications such as email, databases, and line-of-business applications by implementing a redundant cluster of Windows servers that provide a single-system image to the users.
MSCS has been Microsoft’s solution for building high-availability clusters of Windows servers since it was first introduced with Windows NT Server 4.0. With the release of Windows Server 2008, MSCS was significantly enhanced, simplified, and renamed WSFC. WSFC for Windows Server 2008 R2 brings further enhancements to Windows clustering.
The WSFC cluster service monitors cluster health and automatically moves applications from a failed node to surviving nodes, bringing high availability to critical applications. WSFC also provides high availability for Microsoft’s Hyper-V virtualization services.
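The general idea of moving applications from a failed node to surviving nodes can be sketched in Python as follows. This is a toy model, not WSFC's algorithm or API: the `fail_over` function, the node and application names, and the least-loaded placement policy are all assumptions chosen for illustration.

```python
def fail_over(placement, healthy_nodes):
    """Return a new placement with applications moved off failed nodes.

    placement: dict mapping application name -> node name
    healthy_nodes: set of node names that passed the health check
    Policy (an assumption): move each orphaned application to the
    surviving node currently hosting the fewest applications.
    """
    survivors = sorted(healthy_nodes)
    if not survivors:
        raise RuntimeError("no surviving nodes; cluster is down")

    # Count the applications already running on each surviving node.
    load = {node: 0 for node in survivors}
    for node in placement.values():
        if node in load:
            load[node] += 1

    new_placement = {}
    for app, node in placement.items():
        if node in load:
            new_placement[app] = node  # node is healthy; leave the app alone
        else:
            target = min(survivors, key=lambda n: load[n])
            load[target] += 1
            new_placement[app] = target
    return new_placement


# Hypothetical three-node cluster in which node1 has just failed.
before = {"email": "node1", "database": "node2", "line-of-business": "node1"}
after = fail_over(before, healthy_nodes={"node2", "node3"})
print(after)
```

Applications on healthy nodes are left untouched, so users of those applications see no interruption; only the applications that were running on the failed node are restarted elsewhere.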
WSFC brings many enhancements to stalwart MSCS clustering services. With WSFC, up to sixteen Windows servers can be organized into a multisite, geographically dispersed cluster with cluster sites separated by hundreds of miles. A convenient GUI administrator tool supported by several wizards removes the need for a cluster specialist to configure and manage the cluster.
WSFC makes cluster technology even more attractive to small businesses and large enterprises alike.
Sign up for your free subscription at https://availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman email@example.com.
© 2010 Sombers Associates, Inc., and W. H. Highleyman