The digest of current topics on Continuous Availability. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CA tells you how to avoid the effects of downtime.
In this issue:
Browse through our Useful Links.
Check our article archive for complete articles.
Sign up for your free subscription.
Join us on our Continuous Availability Forum.
Are Seven 9s Achievable?
A recent post entitled Uptime/Downtime to the Yahoo! Tandem Computers Group raised the question of achieving seven 9s. This question intrigued me, so I posted a similar thread entitled Can Seven 9s Be Achieved in Practice? on our LinkedIn Continuous Availability Forum. The response has been overwhelming.
Can a seven-9s system that drops connections when it occasionally fails be called continuously available? How do we include the availability of external systems critical to the application? What is the impact of the dynamic nature of software? Do development tools have an impact on the level of availability that can be achieved? Can migration from fault-tolerant systems to commodity systems preserve availability? Is RTO (recovery time objective) a better measure for highly available systems than 9s? These and many other discussions wend their way through this popular thread.
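To put the number in perspective, the arithmetic behind "9s" can be sketched in a few lines of Python. This is only an illustration: it assumes a 365-day year and treats availability as a simple fraction of elapsed time, and the function name is made up for this example. Under those assumptions, seven 9s allows roughly three seconds of downtime per year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Expected yearly downtime for an availability of `nines` 9s.

    For example, nines=3 means 99.9% availability, an unavailability of 10**-3.
    """
    unavailability = 10.0 ** (-nines)
    return unavailability * SECONDS_PER_YEAR


# Print the downtime budget for three through seven 9s.
for n in range(3, 8):
    print(f"{n} nines: {downtime_seconds_per_year(n):,.2f} seconds/year")
```

A budget this small is why a single dropped connection or a slow failover matters so much to the question raised in the thread.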
As one who provides consulting and seminars for high availability, this kind of feedback is extremely important to me. It’s how I learn. Check out the thread, and add your insight. Thanks.
Dr. Bill Highleyman, Managing Editor
RIM’s BlackBerry Messenger service went dark across most of the world for several days in October, 2011. A major switch in its U.K. Network Operating Center failed, and the backup switch did not come up.
This outage paralleled earlier BlackBerry outages. In April, 2007, a deficient software upgrade took down the Canadian NOC, and the backup system could not be brought into service. Less than a year later, the BlackBerry service again went down for several hours because of an upgrade gone wrong. In December, 2009, BlackBerry service was up and down for a week as RIM tried to correct another bad upgrade.
It seems that these failures have a common thread – testing. Improper upgrades that take down a system can only be attributed to inadequate testing of the upgrades. Failover faults, as in this most recent outage, are also often a function of inadequate testing. If upgrade testing and periodic failover testing are not thorough, the organization is depending upon faith and hope that the system will come into service when needed. These are not the attributes on which one should bet the enterprise.
The technology exists today to achieve arbitrarily fast recovery times following a system failure with little if any loss of data. The key to this technology is data replication.
Data replication comes in several forms – asynchronous or synchronous, unidirectional or bidirectional. Each combination supports different ranges of recovery times (RTOs) and data loss (RPOs). By understanding the costs of downtime and data loss for each application and the costs of achieving various levels of high availability and continuous availability, IT management can make informed decisions concerning the availability approach that is right for each application.
In the first three parts of this series, we explored various data-replication techniques and the highly available architectures that can be implemented with them. In this final part, we look at the considerations that will lead you to the choice of the proper architecture to meet your business needs.
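The cost comparison described above might be sketched as follows. This is a rough illustration, not the method used in the series: every architecture name, RTO, RPO, and dollar figure below is a hypothetical placeholder, and the cost model (annual solution cost plus expected losses from downtime and lost data) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One candidate availability architecture (all values hypothetical)."""
    name: str
    rto_hours: float    # time to recover after a failure
    rpo_hours: float    # window of data that may be lost
    annual_cost: float  # yearly cost of running this architecture


def expected_annual_cost(opt, failures_per_year,
                         downtime_cost_per_hour, data_loss_cost_per_hour):
    """Solution cost plus expected yearly loss from outages."""
    loss_per_failure = (opt.rto_hours * downtime_cost_per_hour
                        + opt.rpo_hours * data_loss_cost_per_hour)
    return opt.annual_cost + failures_per_year * loss_per_failure


def cheapest(options, **app):
    """Pick the option with the lowest expected annual cost for one application."""
    return min(options, key=lambda o: expected_annual_cost(o, **app))


# Illustrative options only; real RTOs, RPOs, and costs vary widely.
OPTIONS = [
    Option("tape backup", rto_hours=48, rpo_hours=24, annual_cost=50_000),
    Option("asynchronous replication", rto_hours=2, rpo_hours=0.01, annual_cost=200_000),
    Option("synchronous active/active", rto_hours=0.01, rpo_hours=0.0, annual_cost=600_000),
]

critical_app = dict(failures_per_year=0.5,
                    downtime_cost_per_hour=100_000,
                    data_loss_cost_per_hour=10_000)
print(cheapest(OPTIONS, **critical_app).name)
```

With these made-up numbers, the mid-range option wins for the costly application, while an application with trivial downtime costs would favor the cheapest option. The point is only that the answer differs per application.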
Nothing strikes fear in the hearts of management so much as losing the entire corporate IT infrastructure. To make sure this never happens, companies invest heavily in their data centers with technologies ranging from fault-tolerant systems to redundant data centers.
Nevertheless, should such a failure occur, it might be days before IT services can resume. The company might well be out of business by then. Therefore, it is common to provide one or more backup data centers so that operations can continue within a few hours of a data-center failure.
Despite all the precautions, there exists a disturbing incidence of significant data-center outages. These incidents show that it is not enough to try to protect against any event – fire, flood, power outage, network failure, and so on – that might take down a data center. An event that was not even envisioned is going to happen sometime, and it will take down a data center somewhere.
In this seven-part series, we review from the archives of the Availability Digest some Never Again horror stories that highlight unlikely events that have taken down entire data centers. In this first part, we look at unusual power outage stories.
In many active/backup architectures, preventing failover faults requires the backup system’s software versions to correspond to those on the production system. Equally important, testing failover is made much more complex if version errors must be tracked down and corrected in order to successfully pass a test.
Tools are available to compare the production system’s software modules to those on the backup system in order to detect version errors. If such errors are found, operations staff must take steps to correct them. A more advanced solution is a facility that not only detects version errors on the backup system but also corrects them automatically. Such a facility is FileSync from TANDsoft, Inc. FileSync synchronizes Enscribe files. Coupled with TANDsoft’s Command Stream Replicator, which synchronizes the effects of operator-entered commands, FileSync relieves the operations staff of having to continually monitor and correct backup software versions.
In Part 1 of this series, we describe TANDsoft’s FileSync utility. In Part 2, we review Command Stream Replicator.
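The detection step, comparing production software modules against those on the backup, can be illustrated with a generic sketch. This is not FileSync and does not reflect how TANDsoft's products work; it assumes both systems expose their modules as ordinary directory trees, and the function names are invented for this example.

```python
import hashlib
from pathlib import Path


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def version_errors(production: dict[str, str],
                   backup: dict[str, str]) -> dict[str, list[str]]:
    """Report files that differ between the two manifests."""
    shared = production.keys() & backup.keys()
    return {
        "missing_on_backup": sorted(production.keys() - backup.keys()),
        "extra_on_backup": sorted(backup.keys() - production.keys()),
        "mismatched": sorted(n for n in shared if production[n] != backup[n]),
    }
```

A report with three empty lists suggests the backup matches production. Anything else is a version error to correct before the next failover test.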
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman email@example.com.
© 2011 Sombers Associates, Inc., and W. H. Highleyman