Read the Digest in PDF. You need the free Adobe Reader.

The digest of current topics on Continuous Availability. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CA tells you how to avoid the effects of downtime.

www.availabilitydigest.com

Do you want to improve the availability of your critical applications?

Do you want your staff to be well-versed in availability fundamentals?

Make use of our one-day or multiday seminars on High Availability: Concepts and Practices.

Given at your site or online and tailored to your needs. www.availabilitydigest.com/seminars.htm

In this issue:

Never Again

Hurricane Sandy

Northern Virginia 911 Service Down for Days

Best Practices

Load Shedding

The Geek Corner

SAP on VMware High Availability Analysis

Browse through our Useful Links.

Check our article archive for complete articles.

Join us on our Continuous Availability Forum.

Check out our seminars.

Check out our technical writing services.

It’s Not The Cause of the Outage – It’s the Outcome That Counts!

Who would have thought that lower Manhattan could be flooded with several feet of sea water? Probably no one, judging by the catastrophic effects of Hurricane Sandy as related in one of our Never Again articles in this Availability Digest. When Con Edison cut power to lower Manhattan because of power substation flooding, many data centers found that they could not continue operations on their diesel generators, which were now under several feet of water in the basements that housed them. Tens of thousands of web sites worldwide went dark for up to two weeks when their New York City-based hosting providers were taken out of service. How did these companies continue in business?

As related in another Never Again article in this issue, a monster storm took out a major Verizon communication hub in Northern Virginia. It cut off 911 service to millions of residents for up to four days. How do emergency services continue when emergency operators can’t receive calls?

These stories emphasize that disaster recovery is not so much about restoring IT systems. It is about how to solve the problems caused by the failure of IT systems. This is a topic that we cover in some detail in our seminars on continuous availability.

Dr. Bill Highleyman, Managing Editor

Never Again

Hurricane Sandy

Hurricane Sandy was the largest Atlantic storm in recorded history. It spanned an area broader than Texas. The hurricane-force winds extended 1,100 miles from its center and affected 24 states in the United States. Sandy’s storm surge moved homes off their foundations on the New Jersey shore and filled New York City tunnels, subways, power substations, and basements with salt water. In all, 8.5 million people in dozens of states lost power.

Pumping out tunnels, subways, and basements in New York City took days. Even more extensive was the effort to restore power. Many customers, including large areas of New York City, were without power for two weeks or more.

About 150 data centers were in Sandy’s path as it moved through New Jersey and New York. These data centers faced devastating consequences from power outages and flooding. As it turned out, power outages caused only minor inconveniences. Flooding caused catastrophic damage. The primary flooding outages were in data centers located in lower Manhattan. Many had backup generators or fuel tanks in basements that became flooded. When power was cut off to lower Manhattan, these data centers had no way to continue operations.

Northern Virginia’s 911 Service Down for Four Days

Late in the evening of Friday, June 29, 2012, a quick and violent storm with 90 mph winds swept through Northern Virginia in the U.S., leaving millions of customers without power. Landline and wireless phone services were cut. Disastrously, the storm left 2.3 million residents in Northern Virginia without access to 911 emergency services. 911 services were not fully restored until four days later on July 3rd.

The problem turned out to be a generator failure at Verizon’s Arlington communications hub. Air got into its fuel lines, and the facility went down early on the morning of June 30th when the batteries ran out. Arlington serves as the 911 hub for routing emergency calls to the proper jurisdictions. Without this capability, no 911 calls could be completed.

The most likely time for a 911 system failure is during a disaster. The most likely time that 911 services will be needed is also during a disaster. This puts an especially important onus on the reliability of 911 systems and on their ability to survive natural and manmade disasters. Equally important is the disaster response plan to provide emergency services should a 911 system fail. The public has to feel comfortable that it can get emergency aid, no matter what.

Best Practices

Load Shedding

A recent thread in our LinkedIn Continuous Availability Forum covered a very important topic – load shedding. What do you do if your system approaches full capacity? What do you do in an active/active system if you lose one node and the surviving nodes must carry the full load? What do you do following a failover if your backup system is smaller than your production system?

If you want to maintain a reasonable level of service, you may have to shed some of the load that is being carried by the system. But which load?

Paul Green of Stratus Technologies posed the following question in the thread entitled “What is the appropriate load-shedding policy when a continuously-available system becomes overloaded?” In part he asked, “What policy should the system follow when it is overloaded? Should it simply let the requests queue up externally? Should it deny some requests and accept others?”

We had dozens of comments on the subject, so we organized these comments into a meaningful discussion for this article.
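To make the question concrete, here is a minimal sketch of one possible policy, in which requests are admitted in priority order until the available capacity is used up and the rest are denied. It is an illustration only, not the policy recommended in the forum thread. The `Request` class, the `shed_load` function, the priority numbering, and the load units are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    priority: int  # hypothetical scale: 0 = most critical, larger = shed sooner
    cost: float    # hypothetical load units consumed if the request is admitted


def shed_load(requests, capacity):
    """Admit requests in priority order while they fit; deny the rest.

    Returns a tuple (admitted, shed). A request that does not fit is shed,
    but a later, smaller request may still be admitted if it fits.
    """
    admitted, shed = [], []
    used = 0.0
    # sorted() is stable, so requests of equal priority keep arrival order.
    for req in sorted(requests, key=lambda r: r.priority):
        if used + req.cost <= capacity:
            admitted.append(req)
            used += req.cost
        else:
            shed.append(req)
    return admitted, shed


# Example: a surviving node that can carry only 80 load units.
incoming = [
    Request("monthly-report", priority=2, cost=40),
    Request("payment", priority=0, cost=50),
    Request("balance-lookup", priority=1, cost=30),
]
admitted, shed = shed_load(incoming, capacity=80)
print("admitted:", [r.name for r in admitted])
print("shed:", [r.name for r in shed])
```

Denying low-priority work outright is only one of the choices raised in the question. Letting requests queue externally instead trades rejected requests for longer response times.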

The Geek Corner

SAP on VMware High Availability Analysis

Vas Mitra is a SAP Virtualization Architect for VMware. He has analyzed the availability of SAP applications running on a VMware ESXi cluster using the concepts that we have published in the Availability Digest’s Geek Corner, and he has given us permission to publish his analysis.

The paper describes how to calculate the theoretical availability of SAP deployed in virtual machines on a cluster of x86 servers running the VMware hypervisor (referred to as ESXi hosts). The content directly leverages probability and mathematical/algebraic analyses from the white papers of the Availability Digest.

The mathematical model can help determine the availability of a virtual SAP solution expressed as a fraction/percentage. The calculations are first presented for a 5-node cluster of ESXi hosts with one spare, and it is also shown how the equations can be generalized for n nodes with s spares.
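As a rough illustration of this kind of calculation, the sketch below computes the availability of a cluster of n nodes that can tolerate the loss of up to s of them. It is not taken from the paper. It assumes that node failures are independent, that every node has the same availability a, that n counts all nodes including the spares, and that the cluster is up whenever no more than s nodes are down. The function names and the 0.99 node availability are hypothetical.

```python
from math import comb


def cluster_availability(n: int, s: int, a: float) -> float:
    """Probability that at most s of n independent nodes are down.

    n: total number of nodes (assumed here to include the spares)
    s: number of node failures the cluster can tolerate
    a: availability of a single node, between 0 and 1
    """
    f = 1.0 - a  # probability that a single node is down
    return sum(comb(n, k) * f**k * a**(n - k) for k in range(s + 1))


def leading_term_availability(n: int, s: int, a: float) -> float:
    """First-order approximation, reasonable when 1 - a is small.

    The cluster fails when s + 1 nodes are down; there are
    C(n, s + 1) ways to choose which nodes those are.
    """
    return 1.0 - comb(n, s + 1) * (1.0 - a) ** (s + 1)


# Example: 5 nodes, one of them a spare, hypothetical node availability 0.99.
exact = cluster_availability(5, 1, 0.99)
approx = leading_term_availability(5, 1, 0.99)
print(f"exact:  {exact:.6f}")
print(f"approx: {approx:.6f}")
```

The same two functions cover the general case of n nodes with s spares by changing the arguments. A real analysis would also have to account for failover time and for failure modes that affect several nodes at once, which this sketch ignores.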

Sign up for your free subscription at https://availabilitydigest.com/signups.htm

Would You Like to Sign Up for the Free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.

Managing Editor - Dr. Bill Highleyman editor@availabilitydigest.com.