The digest of current topics on Continuous Availability. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CA tells you how to avoid the effects of downtime.
Thanks to This Month's Availability Digest Sponsor
In this issue:
Could A Bug Take Down Your Entire Data Center?
Is it possible that a lowly software bug could take down an entire Tier 4 data center? After all, these data centers use redundant everything to guarantee in excess of four 9s availability.
The answer, unfortunately, is “Yes!” An extremely calamitous software flaw has been discovered in the Oracle database by, of all people, the journalists at InfoWorld. Described in this issue’s Never Again article, Oracle’s Ticking Time Bomb, the problem has to do with an internal clock designed to last centuries. However, the clock can run out in just a few years due to a bug in an unrelated Oracle utility. When it does, databases are considered unusable, applications crash, and the data center is brought to its knees. There is no easy recovery. This defect has existed for years despite, I’m sure, extensive Oracle product testing and millions of hours of field experience.
Fortunately, no data center has yet suffered the consequences of this fault, and Oracle has issued a corrective patch. However, this nasty surprise will not be the last. Make sure that you have a well-tested Business Continuity Plan to carry you through disasters that you cannot even imagine today.
Dr. Bill Highleyman, Managing Editor
A potentially catastrophic bug that has been around for years has been discovered in the Oracle database. Left unfixed, the bug could crash all of the interconnected databases in a large enterprise. Recovery would take days or even weeks. Such a disaster could occur during normal operation, or it could be triggered deliberately by a malicious attacker.
The problem stems from a mechanism deep within the Oracle database – one with which Oracle DBAs seldom deal. It is the System Change Number (SCN).
Oracle’s SCN flaw has a low likelihood of impacting most companies except those that are running hundreds of Oracle instances. However, if it does happen, the results are catastrophic. A company’s data centers might be down for weeks.
Oracle has now released a patch to correct the flaw, and companies should install it immediately. Unfortunately, older versions of Oracle cannot be patched and will continue to exhibit the SCN flaw. DBAs must make sure that patched databases do not link to unpatched databases. Any older instance that must take part in a linked configuration should first be upgraded to a patchable version and then patched.
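The rule that patched databases must not link to unpatched ones can be checked mechanically once a DBA has an inventory of instances and links. The following is only a minimal sketch in Python. The instance names, the `patched` inventory, and the `links` list are hypothetical illustrations, and nothing here queries Oracle itself.

```python
from typing import Dict, List, Tuple

def mixed_links(patched: Dict[str, bool],
                links: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return every link that joins a patched instance to an unpatched one."""
    return [(a, b) for a, b in links if patched[a] != patched[b]]

# Hypothetical inventory: instance name -> has the SCN patch been applied?
patched = {"orders": True, "billing": True, "legacy_hr": False}

# Hypothetical database links between instances.
links = [("orders", "billing"), ("billing", "legacy_hr")]

for a, b in mixed_links(patched, links):
    print(f"Review link {a} <-> {b}: one side is unpatched")
```

The sketch flags only mixed links, which is the case the rule names. Links between two unpatched instances would need a separate review.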
The HP CloudSystem allows companies to convert their current IT assets into a private cloud. It is not a prepackaged system. Rather, HP CloudSystem focuses on hardware, software, and consulting services to provide an efficient path to cloud computing. It combines servers, storage, networking, and security to automate the lifecycle of applications and infrastructure from provisioning through management to termination.
Based on HP BladeSystem technology, an HP CloudSystem can support a wide range of heterogeneous server, storage, networking, operating system, and hypervisor resources that can be managed as a unified environment. Once a degree of comfort has been achieved with its private cloud, a company can extend it into a hybrid cloud to take advantage of additional capacity and services in one or more public clouds. An HP CloudSystem also can be configured as a public cloud to support service providers who wish to move to a cloud offering.
HP provides a wide range of software facilities and services to ease a company’s entry into the world of clouds. This includes HP CloudStart, in which HP will design and deliver an initial cloud system ready for deployment by a company in thirty days.
In our December 2011 issue, we described Stratus’ $50,000 wager that its servers will not fail. This is the second time in as many years that Stratus has stuck its neck out with such an offer. Unfortunately, when our article was published, Stratus’ latest offer was about to expire at the end of 2011.
Good news for those considering fault-tolerant industry-standard servers. Stratus has extended its latest offer for a year. Virtualized Stratus 4500 or 6310 ftServers running VMware’s vSphere that are ordered anytime in 2012 will be warranted to be failure-free for the first six months of production, or Stratus will pay you $50,000. So far, Stratus has not been obligated to make any payments under either of its wagers.
Continuous availability is no longer a technological problem. It is an exercise in balancing system cost with downtime cost. Stratus’ ftServer is an affordable starting point to achieve extreme availabilities. Stratus says so – with its wallet.
Data centers are extraordinarily complex. They include hundreds or thousands of servers and storage subsystems with their applications and operating systems, all interconnected by vast internal networks. A failure in any one of these components can bring some if not all data-center functions to their knees.
However, major failures are not always caused by hardware or software faults. A disturbing number are caused by upgrades that go wrong. An upgrade to any data-center component is a complex operation. It should be properly planned, and all cognizant personnel should be available during the upgrade. No matter how much effort goes into the upgrade plan, there is always a real chance that something will go wrong.
If an upgrade should go awry, such a failure is typically not a problem if a fallback plan has been put into place. However, in too many cases, data centers have undertaken an upgrade with no planned backout procedure. If the upgrade fails, major applications will be down – for hours and sometimes for days.
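The fallback idea can be expressed as a simple guard: never start an upgrade unless a backout procedure exists, and invoke it whenever the upgrade or its verification fails. The Python sketch below only illustrates that discipline. The function `run_upgrade` and the three callables passed to it are hypothetical names, not part of any product.

```python
from typing import Callable, Optional

def run_upgrade(apply: Callable[[], None],
                verify: Callable[[], bool],
                back_out: Optional[Callable[[], None]]) -> bool:
    """Apply an upgrade. Return True if it was kept, False if it was backed out."""
    if back_out is None:
        raise ValueError("refusing to upgrade without a planned backout procedure")
    try:
        apply()
        if verify():
            return True
    except Exception as exc:
        print(f"upgrade step failed: {exc}")
    back_out()  # restore the prior state after any failure
    return False

# Hypothetical stand-in for a component being upgraded from version 1 to 2.
state = {"version": 1}

def apply_upgrade() -> None:
    state["version"] = 2

def health_check() -> bool:
    return False  # simulate an upgrade that fails verification

def restore() -> None:
    state["version"] = 1

kept = run_upgrade(apply_upgrade, health_check, restore)
print(kept, state)
```

In this simulated run the health check fails, so the backout runs and the component returns to its prior version. A real backout procedure would also need to be rehearsed before the upgrade begins.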
In our previous articles on data-center failures, we focused on failures due to power, storage subsystems, and network faults. In this article, we look at some major data-center outages due to faulty upgrades.
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman firstname.lastname@example.org.
© 2011 Sombers Associates, Inc., and W. H. Highleyman