The digest of current topics on Continuous Availability. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CA tells you how to avoid the effects of downtime.

www.availabilitydigest.com

Thanks to This Month's Availability Digest Sponsor

Join Opsol at HP Discover June 4th - 7th as Wells Fargo describes replacing Base24 with OmniPayments.

Opsol’s OmniPayments offers a full complement of credit/debit card authorization and support services.

50% cost savings in your payments processing. 450 million transactions per month.

VISA, MASTERCARD, PROSA certified. Proven Base24 Shutdown reference customer.

In this issue:

Never Again

More Never Agains VI

Commonwealth Bank - a Correction

Best Practices

OpenStack - The Open Cloud

Availability Topics

Help! My Data Center is Down! - Lessons

Browse through our Useful Links.

Check our article archive for complete articles.

Join us on our Continuous Availability Forum.

Check out our seminars.

Check out our technical writing services.

The Continuous Availability Forum has Reached 400 Members

Our LinkedIn Continuous Availability Forum has been continuously available for two years now. In this short time, we have attracted over 400 members. Congratulations to our 400th member, Mike Legato. Mike is Manager of Applications at Stratus Technologies (www.stratus.com) and joins several of his colleagues from Stratus.

The Forum has been quite active, both in the number of discussions started and in members’ participation in those conversations. Our debates cover a wide range of topics concerning system availability. Recent questions have included, “Why are Fallback Procedures So Often Overlooked,” “What Does HP's Project Odyssey Portend for HP-UX, NonStop, and OpenVMS,” “What are the best Guidelines to Decide on How Much Uptime can be Committed to the Customer,” “Is Amazon a Victim of its Own Success,” and “Can Seven 9s Be Achieved in Practice.”

These dialogues can be every bit as educational as our articles in the Availability Digest. Check out the Continuous Availability Forum, and post your own questions for comment.

Dr. Bill Highleyman, Managing Editor

Never Again

More Never Agains VI

The first quarter of 2012 has had its share of catastrophic outages. We have already reported on the discovery of Oracle’s ticking time bomb, on the continuing string of outages suffered by Australia’s “big four” banks, and on the multiday failure of Microsoft’s Azure public cloud. We summarize in this article some of the others that have made headlines in the past few months.

Software bugs seem to be the big problem in this series of outages. Oracle had its ticking-time-bomb bug. Azure had a leap-year bug. BATS went down with an infinite-loop bug.

Second in frequency were recovery faults. At the Tokyo Stock Exchange, a backup data-distribution server failed to take over. At Ninefold, host servers failed following the recovery of an NFS server.

These outages share a common thread – testing failed to catch the faults. It seems that no matter how much system testing we do, some problems always remain. How much testing is justified? That, of course, depends upon the cost associated with an outage.

Commonwealth Bank of Australia – A Correction

In a recent issue, we reviewed a series of outages being experienced by Australia’s largest banks as they engage in multi-year replacements of their aging legacy systems (Australia’s Painful Banking Outages, March 2012). The “big four” Australian banks - National Australia Bank (NAB), Commonwealth Bank of Australia (CBA), the Australia and New Zealand Banking Group (ANZ), and Westpac - have all had their share of outages affecting ATMs, retailers’ POS devices, and online banking.

In response to the article, one of our subscribers informed us that we had made an error in our reporting of one of the Commonwealth Bank’s outages, an error that we correct in this article. Our error reflected a statement, published in the press, that oversimplified what actually happened. We apologize to Commonwealth Bank for this error and are happy to relate more accurately what, in fact, occurred.

Best Practices

OpenStack – The Open Cloud

As expressed by noted IT publisher Tim O’Reilly, “If cloud computing is the future, then understanding how to make that future open is one of the great technology challenges of our day.” Moving an application to a cloud today means certain lock-in. Clouds are simply incompatible with each other.

There is clearly an advantage to having common cloud standards that allow portability between clouds. OpenStack is a major initiative to achieve this goal. OpenStack allows service providers, enterprises, and government agencies to build massively scalable public and private clouds using freely available Apache-licensed software.

In the early days of Edison, companies depended upon their own electrical generators to power their factories. These generators were the equivalent of today’s data centers. Then local standardization allowed communities to share a common power system for lighting, manufacturing, and other uses. This is where we are presently in our cloud technology. It wasn’t until national and international standards were developed that nationwide power grids could serve all communities. Thus was created the true electric utility.

Similarly, a true compute utility is awaiting the development of accepted standards. Only then can a company plug into the public cloud of its choice for its computing needs. This is the goal of OpenStack.

Availability Topics

Help! My Data Center is Down! – Part 7: Lessons Learned

In the first six parts of this series, we described several spectacular data-center failures that were caused by a variety of factors – power outages, storage crashes, Internet and intranet failures, upgrades gone wrong, and the actions of IT staff.

Interestingly, most of these failures could not have been prevented by a better hardware/software infrastructure. In none of the outages was a server failure the root cause. A few outages were caused by dual storage-system crashes. The impact of intranet failures could have been mitigated with better internal network monitoring. The predominant cause of the failures was the direct action – or lack of action – by IT staff. Studies have shown that about 70% of all data-center outages involve, at least to some extent, the actions of human beings.

When a data center fails, two questions must be answered – how long will it take to recover IT services, and what can be done in the future to prevent a repeat of the failure? In the final part of this series, we review some of the lessons that address these two questions based on what we have learned from the data-center failures that we have discussed.

Sign up for your free subscription at https://availabilitydigest.com/signups.htm

Would You Like to Sign Up for the Free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.

Managing Editor - Dr. Bill Highleyman, editor@availabilitydigest.com