The digest of current topics on Continuous Availability. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CA tells you how to avoid the effects of downtime.
In this issue:
A Presentation Marred By A Failover Fault
I recently presented a talk on data-center failures and the lessons we can learn from the experiences of others. Unexpectedly, I became my own best example.
As I was setting up for the talk, my PC failed. With a room full of people, what was I to do? Actually, this was equivalent to the production-site failures about which I was about to speak. Those who recovered from system failures followed a set of best practices. How did I do? Here is the best practices list:
All in all, I’d give myself a B- for availability. My failover failed, but I was able to continue providing services.
Check out the details in this issue’s article entitled “A Personal Failover Fault.”
Dr. Bill Highleyman, Managing Editor
Availability is all about providing a service, no matter what. The “no matter what” struck me during my presentation at the Connect OpenVMS Boot Camp, held recently in Bedford, Massachusetts. As editor of the Availability Digest, I was to give a talk entitled “Help! My Data Center Is Down!” It describes incidents taken from the Digest’s Never Again series of horror stories, incidents that have incapacitated entire data centers for hours and even days.
As I was booting up the PC on which my slides were stored, I experienced my own horror story. My PC was taken over by a malicious virus (or so it seemed to me) and became unusable. I was about to become one of the incidents of which I was to speak.
My talk ends with lessons learned for keeping the business going in the face of such incidents. Did I follow my own advice? Almost. It is refreshing to see that the principles of achieving high availability apply even to simple systems and that I got at least a passing grade in applying them. I wasn’t perfect, but the service survived at an acceptable level.
Appearing before the Senate Select Committee on Intelligence in early March, 2013, James R. Clapper, Director of National Intelligence, testified that cyber threats have now surpassed terrorism as the top security threat facing the United States. This conclusion is documented in the United States Intelligence Community’s assessment of threats to U.S. national security.
The assessment noted that the world’s threat environment is changing rapidly and radically. Attacks involving cyber weapons can be deniable by the perpetrators and unattributable to any source.
We can see from our Never Again stories the growing predominance of cyber threats. If we are to provide continuous availability of our IT services, we must begin to extend our focus from hardware, software, human, and environmental faults to external attacks from malicious players.
Some of these attacks are intrusions intended to spy on us or steal our data. Others are intended to take down our systems. Today, these latter attacks are generally meant to exact retribution for some imagined or real grievance. Tomorrow, they may be intended to do significant harm to us for competitive or national security reasons.
There are many published statistics that characterize the causes of downtime. A troublesome aspect of these studies is that their results vary widely. Though they generally focus on the same vulnerabilities (hardware, software, network, human, and environment), the contributions of these faults do not seem to converge to any meaningful numbers.
To add our own input, we analyzed over 250 outages reported in our Never Again series. The outage reports were all drawn randomly from the press over the last seven years and so should represent a reasonably accurate cross section of downtime triggers.
Based on our results, it seems that a reasonable rule of thumb for the causes of outages is that software, networks, and environmental factors each account for about 20% of all outages. Hardware faults are responsible for about 15% of outages, and people and miscellaneous factors account for 10% to 15% each. Miscellaneous factors are about evenly split between capacity overloads and cyber crime.
However, even though the human factor directly caused only 10% to 15% of outages, people were a contributing factor to 60% of all outages. Clearly, the weak link in data-center availability is people.
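As a quick sanity check on the rule of thumb above, the shares can be tallied in a few lines of Python. This is only an illustrative sketch: the percentages come from the text, while the dictionary layout and the function names are assumptions made for the example, not part of the original analysis.

```python
# Rule-of-thumb share of outages by direct cause, as (low, high) percentages.
# The figures are those quoted in the text; the structure is illustrative.
DIRECT_CAUSE_SHARE = {
    "software": (20.0, 20.0),
    "network": (20.0, 20.0),
    "environment": (20.0, 20.0),
    "hardware": (15.0, 15.0),
    "people": (10.0, 15.0),
    "miscellaneous": (10.0, 15.0),
}

# Share of all outages in which people were a contributing factor (from the text).
PEOPLE_CONTRIBUTING_SHARE = 60.0


def total_range(shares):
    """Return the (low, high) sum of all direct-cause shares."""
    low = sum(lo for lo, _ in shares.values())
    high = sum(hi for _, hi in shares.values())
    return low, high


def split_miscellaneous(shares):
    """Split the miscellaneous share evenly between its two components."""
    lo, hi = shares["miscellaneous"]
    return {
        "capacity overload": (lo / 2, hi / 2),
        "cyber crime": (lo / 2, hi / 2),
    }


low, high = total_range(DIRECT_CAUSE_SHARE)
print(f"Direct causes sum to between {low:.0f}% and {high:.0f}%")
print(f"People contributed to {PEOPLE_CONTRIBUTING_SHARE:.0f}% of outages")
```

The direct-cause shares sum to roughly 100%, as a rule of thumb should. The 60% figure is not part of that sum: it counts outages in which people played any role, whatever the direct cause.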
Bank-Verlag’s two data centers in Germany utilize a broad range of systems from multiple vendors. Included in these systems are several HP NonStop servers. Bank-Verlag wanted to have a common monitoring facility to manage all of its systems and settled on the open-source Nagios monitoring application. Though Nagios supports the company’s wide range of Windows, Unix, and Linux systems, it is not supported on NonStop.
In order to integrate its NonStop servers into the Nagios monitoring facility, Bank-Verlag created its own monitoring subsystem for NonStop and calls it “BVmonitoring.” BVmonitoring has the functionality of a Nagios agent along with significant infrastructure to gather the events and statistics that Nagios needs in order to perform its monitoring functions. As a consequence, Bank-Verlag is able to monitor and manage its NonStop systems with the same open-source management facility that it uses to manage the other systems in its data centers. NonStop servers are now fully integrated into Bank-Verlag’s IT infrastructure.
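For readers unfamiliar with how a check reports status to Nagios, the sketch below follows the general Nagios plugin convention: a numeric state (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and a one-line status message, optionally followed by performance data after a "|" character. It is not BVmonitoring's implementation, which is not described here. The metric name `cpu_busy_pct`, the thresholds, and the `read_metric` placeholder are all hypothetical.

```python
# Nagios plugin return codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(label, value, warn, crit):
    """Map a reading to a Nagios state and a one-line status message.

    Returns (state_code, status_line). Higher readings are treated as worse.
    """
    if value is None:
        return UNKNOWN, f"{label} UNKNOWN - no reading available"
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    # Performance data follows the "|" as label=value;warn;crit
    perfdata = f"{label}={value};{warn};{crit}"
    return state, f"{label} {STATE_NAMES[state]} - value is {value} | {perfdata}"


def read_metric():
    """Hypothetical placeholder: a real agent would query the monitored system."""
    return 42.0


def main():
    """Print the status line and return the state for use as the exit code."""
    state, line = check_threshold("cpu_busy_pct", read_metric(), warn=80.0, crit=95.0)
    print(line)
    return state
```

A script invoked by Nagios would end by calling `sys.exit(main())` so that the state becomes the process exit code. An agent on a platform that Nagios does not support natively would need to gather the readings itself and deliver equivalent results to the Nagios server.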
Bank-Verlag is not interested in selling BVmonitoring as a product. However, it is willing to license the software to other users with maintenance and support for a monthly or yearly fee.
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman firstname.lastname@example.org.
© 2013 Sombers Associates, Inc., and W. H. Highleyman