Read the Digest in PDF. You need the free Adobe Reader.

The digest of current topics on Continuous Availability. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CA tells you how to avoid the effects of downtime.

www.availabilitydigest.com

Thanks to This Month's Availability Digest Sponsor

ETI-NET develops software and hardware solutions that allow HP NonStop servers to access modern, multi-vendor storage technologies. Its products offer cost-effective management of backup and archiving operations. It supports major data centers globally and consolidates their NonStop systems into shared storage infrastructures that provide common services regardless of platform type.

In this issue:

Never Again

Azure Cloud Succumbs to Leap Year

Australia's Painful Banking Outages

Best Practices

HP's Project Odyssey - Mission-Critical x86

Availability Topics

Help! My Data Center is Down! - People

Browse through our Useful Links.

Check our article archive for complete articles.

Join us on our Continuous Availability Forum.

Check out our seminars.

Check out our technical writing services.

Where are the Fallback Plans?

Just after completing this issue’s article on the continuing failures impacting Australia’s four major banks, I found myself the victim of a U.S. banking outage. I had to complete a critical transaction prior to the weekend, so I went on Friday directly to my branch of PNC Bank to solicit assistance. “No problem,” the bank officer said as he turned to his terminal and entered the pertinent information.

Then, silence. Hitting the Enter key did nothing. After two or three more tries, he threw up his hands in disgust. He explained that PNC had just acquired the U.S. branches of Royal Bank of Canada and was in the process of consolidating the IT services of the two banks. Something must have gone wrong. Come back later, and try again.

The Australian banking problems are being caused by the modernization of their aging infrastructures. The PNC problem was (presumably) caused by a major migration of RBC applications to the PNC environment.

In my seminars on availability, I talk extensively about the risks involved with major upgrades. The ultimate protection is a solid fallback plan so that services can be restored if the upgrade goes bad. It seems that many enterprises do not want to attend to this important defense. Please don’t fall into the same trap.

Postscript: My transaction was late - by a week! The system was down that long.

Dr. Bill Highleyman, Managing Editor

Never Again

Windows Azure Cloud Succumbs to Leap Year

Shades of Y2K! Microsoft’s Windows Azure Cloud went down for over a day on Wednesday, February 29, 2012. Starting around midnight as the clock ticked to Leap Day, various subsystems of the Azure Cloud started to fail one by one. Soon, applications for many customers became unresponsive. By 8 AM Thursday, thirty-two hours later, Microsoft reported that recovery efforts were complete but that "a small number of customers may face long delays during service management operations."

Smells like a Leap-Year bug.

It is troubling that, after the Y2K hysteria, we should once again be experiencing a calendar-related failure. A raft of date-simulation products was developed back then to allow systems to simulate dates without changing the system clock, thereby permitting the Y2K transition to be tested while the system remained in production. Many of these products are still around today. If the Azure Cloud had been tested for the Leap-Year problem to the extent that most systems were checked for the Y2K problem, Microsoft might have avoided this disaster.
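As a rough illustration of the date-simulation idea, here is a minimal Python sketch. It assumes the code under test accepts the current date as a parameter instead of reading the system clock, so a test can feed it Leap Day while the real clock is untouched. The function names are hypothetical and nothing here is drawn from Azure's actual code.

```python
from datetime import date

def one_year_later_naive(today: date) -> date:
    # Hypothetical routine: bumps the year only.
    # Raises ValueError when today is February 29.
    return today.replace(year=today.year + 1)

def one_year_later_safe(today: date) -> date:
    # Hypothetical fix: falls back to February 28 when the
    # target year has no February 29.
    try:
        return today.replace(year=today.year + 1)
    except ValueError:
        return today.replace(year=today.year + 1, day=28)

def survives(func, simulated_today: date) -> bool:
    # The date is passed in rather than read from the system clock,
    # so Leap Day can be simulated on any day of the year.
    try:
        func(simulated_today)
        return True
    except ValueError:
        return False
```

Calling `survives(one_year_later_naive, date(2012, 2, 29))` exposes the failure long before the calendar does, which is the kind of check the date-simulation products made possible for Y2K.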

--more--

Australia’s Painful Banking Outages

A recent online banking outage suffered by the National Australia Bank continued a series of such outages at Australia’s four largest banks over the last two years. The National Australia Bank (NAB), Commonwealth Bank, the Australia and New Zealand Bank (ANZ), and Westpac have all had their share of outages affecting ATMs, retailers’ POS devices, and online banking. The outages have occurred as these historic banks engage in multi-year replacements of their aging core legacy systems, some dating back to the 1980s. Apparently, these systems have become quite fragile in their old age.

Some have suggested that Australians can expect regular outages of key banking services for the next decade as the banks replace their legacy systems. However, in today’s high-technology world, there is an expectation of high availability and high resilience for critical services such as banking. Institutions can no longer cover up IT failures. There is no place to hide from Twitter and Facebook.

In this article, we look at the string of online banking failures and the response of Australia’s financial regulatory authorities to the consequent loss of confidence by Australians in their rickety banking system.

--more--

Best Practices

HP’s Project Odyssey – Migrating Mission Critical to x86

HP’s Enterprise Servers, Storage and Networking (ESSN) Business Unit markets two lines of servers – ProLiant servers (acquired from Compaq) and Integrity servers. ProLiant servers are based on the Intel x86 Xeon processor and support Windows and Linux operating systems. Integrity servers are Itanium-based and support HP mission-critical operating systems – HP-UX, NonStop, and OpenVMS.

On November 22, 2011, HP announced a major new initiative dubbed “Project Odyssey.” It is intended to extend the mission-critical features of HP-UX from Itanium blades to Windows and Linux x86 blades over the next two years. Project Odyssey raises many questions for those involved with HP’s current, highly available operating systems – HP-UX, NonStop, and OpenVMS. This article explores those questions.

If HP customers embrace the move to highly reliable standard operating systems, HP-UX may be the first to go since migrating Unix applications to Linux is a reasonable task. But achieving the fault tolerance provided by NonStop systems and OpenVMS Split-Site Clusters is probably not in the cards. Sadly, if the reliability provided by hardened Linux and Windows systems is good enough, the market may see a declining need for great, continuously available systems. Let’s hope that great triumphs over good enough!

--more--

Availability Topics

Help! My Data Center is Down! – Part 6: The Human Factor

In many respects, a company’s data center is part of its lifeblood. Significant investments are made to ensure that corporate data centers never fail. Unfortunately, they do.

Industry studies have shown that the human factor plays a role in about 70% of data-center failures. In some cases, it is a careless error on the part of an operator. In others, it is out-and-out malfeasance. Not only can staff errors directly cause outages, but even worse, they also can escalate a controllable problem into a major crisis. One would think that staff problems are the one area that we can effectively control. Evidently, this is not the case.

In our previous articles on data-center failures, we focused on failures due to power, storage subsystems, network faults, and upgrades gone wrong. In this article, we look at some human contributions to data-center outages.

--more--

Sign up for your free subscription at https://availabilitydigest.com/signups.htm

Would You Like to Sign Up for the Free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.

Managing Editor - Dr. Bill Highleyman, editor@availabilitydigest.com