Read the Digest in PDF. You need the free Adobe Reader.

The digest of current topics on Continuous Processing Architectures. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CPA tells you how to avoid the effects of downtime.

www.availabilitydigest.com

In this issue:

Never Again

Haiti's Cellular Network Failure Cost Lives

Best Practices

Synchronous Replication Recovery Strategies

Availability Topics

What Is the Availability Barrier?

The Geek Corner

Calculating RPO

Check our article archive for complete articles.

Join us on our Continuous Availability Forum

Let Us Post Your Links

In the next month or so, the Availability Digest is planning to open a new Links page pointing to resources of all types for your critical application needs. If you have one or more products or services for which you would like links posted, let us know; and we will be happy to include them in our list.

The Links page will be organized by category. Current categories include Replication, Middleware, Active/Active Systems, Fault-Tolerant Systems, Clusters, Business Continuity, Security, Application Modernization, Time Synchronization, Virtual Tape, Copy/Load Utilities, Groups, Publications, Education, Conferences, Consultants, and Blogs. If you want to add a link in a new category, let us know; and we’ll add the category.

Each link includes a URL to the page of your choice and a description of up to 100 characters, including spaces. Just send your information to us at admin@availabilitydigest.com, and we’ll see that it gets posted.

We hope that our Links page will become an important reference site for all of you who are looking to improve the availability and quality of your critical applications.

Dr. Bill Highleyman, Managing Editor

Never Again

Haiti’s Cellular Network Failure Cost Lives

The cost of system downtime can be measured in dollars, customer dissatisfaction, customer loss, regulatory actions, or bad press. At the extreme, the failure of a safety-critical system can result in loss of life. Sad to say, the Haitian cell-phone system proved to be a safety-critical system; and it failed.

Most Haitians have cell phones. They are inexpensive, and there is not much in the way of land lines in Haiti. Following the devastating magnitude 7.0 earthquake that struck on the afternoon of Tuesday, January 12th, many Haitians found themselves trapped in the rubble of fallen buildings. Some were able to call on their cell phones and report their position, and they were saved. How many tried but could not get cell-phone service and perished, we will never know.

This is the story of one of the worst technological disasters in modern times.

--more --

Best Practices

Synchronous Replication Recovery Strategies

Recovering the database of a failed node in an active/active system using synchronous replication is somewhat more involved than in a similar system using asynchronous replication.

Synchronous-replication recovery first requires that the node to be recovered have a copy of the application database that is current as of the outage or at some time later. Changes that have accumulated during the outage must then be applied while at the same time keeping the recovering database up-to-date with new changes coming in. Only when the recovering database is fully synchronized with the operational database can it be returned to service and begin processing transactions.
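The catch-up step described above can be illustrated with a little arithmetic. The sketch below is not taken from the article; it assumes a simple model in which the recovering node applies queued changes at a steady rate while new changes keep arriving at a steady rate. The function name and all three parameters are hypothetical.

```python
def catch_up_seconds(backlog_changes, apply_rate, arrival_rate):
    """Estimate how long a recovering node needs to drain its backlog.

    Hypothetical illustration inputs (not from the article):
      backlog_changes - changes queued for the node during the outage
      apply_rate      - changes per second the recovering node can apply
      arrival_rate    - changes per second still arriving from the live node
    """
    if apply_rate <= arrival_rate:
        # The backlog never shrinks, so the node can never resynchronize.
        return float("inf")
    # The backlog shrinks only by the surplus of apply rate over arrival rate.
    return backlog_changes / (apply_rate - arrival_rate)


if __name__ == "__main__":
    # A 10-minute outage at 100 changes per second queues 60,000 changes.
    backlog = 600 * 100
    print(catch_up_seconds(backlog, apply_rate=500, arrival_rate=100))
```

Under these assumptions, the node can rejoin only if it applies changes faster than they arrive, and the margin between the two rates sets the recovery time.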

Recovery methods include synchronous online copying that imposes a load on a surviving system during recovery, asynchronous online copying that requires the application to be briefly paused before returning the recovered node to service, and a mixed online copy that avoids these problems but that depends upon a specific synchronous-replication architecture.

In any event, the impact of a node failure on a synchronously replicated active/active system is no different from that of a node failure on an asynchronously replicated active/active system. The system reverts to asynchronous replication to the failed node until the node is returned to service.

--more --

Availability Topics

What Is the Availability Barrier?

Since the dawn of commercial computing in the 1950s, the quest of computer engineers has been to reduce the price/performance ratio of computing systems. We have surpassed early expectations manyfold. Today’s $400 desktop has thousands of times the processing power of the multimillion-dollar mainframes of 1960.

With today’s demands for 24x7 global services, we must now shift our attention to another goal – availability. No longer is it acceptable to be down for hours due to a processor failure or for a necessary upgrade.

For decades, we sought to improve availability by making components more reliable. However, there will always be component failures. To achieve continuous availability, we must look to recovering rapidly from the failures that we know are bound to occur. The technology is now here to provide failover times measured in seconds or even subseconds. We break the availability barrier by focusing on the system recovery time rather than on the failure interval.

Let it fail, but fix it fast. If we can recover so fast that no one notices that there has been an outage, we in effect have achieved continuous availability.
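The effect of recovery time on availability can be shown with the standard steady-state relation, availability = MTBF / (MTBF + MTR), where MTBF is the mean time between failures and MTR is the mean time to recover. The following is a minimal sketch; the failure interval and recovery times are hypothetical figures chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mtbf_hours, mtr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mtr_hours)


def downtime_minutes_per_year(mtbf_hours, mtr_hours):
    """Expected downtime per year implied by the availability."""
    return (1 - availability(mtbf_hours, mtr_hours)) * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    mtbf = 4000  # hypothetical: one failure roughly every 4,000 hours

    # Same failure interval, two different recovery times.
    for label, mtr in (("4-hour recovery", 4), ("3-second failover", 3 / 3600)):
        print(label, availability(mtbf, mtr), downtime_minutes_per_year(mtbf, mtr))
```

With the failure interval held fixed, cutting recovery from hours to seconds moves this hypothetical system from roughly three 9s of availability to better than six 9s.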

--more --

The Geek Corner

Calculating RPO

In a redundant data-processing system, RPO, the Recovery Point Objective, is the amount of data loss that is acceptable following a node failure. Of course, it is unrealistic to think that there is an absolute limit to data loss. Some obscure event may result in a significant loss of data even though the expected loss might be quite small. Therefore, the RPO should be expressed as a probability that no more than a certain amount of data will be lost. For instance, the RPO may state that:

“99% of node failures will result in no more than 300 milliseconds of lost data at 100 transactions per second.”

By using convenient curves provided in this article, you can determine whether you can meet an RPO specification based on straightforward measurements of replication latency. If your current system does not meet the specification, you can determine the specification that it can meet.

Alternatively, you can determine how much you have to speed up the replication channel in order to meet the specification. A cost/benefit decision can then be made as to whether to spend the money to speed up replication or to accept a less stringent RPO specification.
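As a rough illustration of checking an RPO against latency measurements, the sketch below uses a simple empirical-percentile test. This is an assumed stand-in, not the curves provided in the article: it treats each measured replication latency as the window of data that would be lost if the node failed at that moment. All function names and sample values are hypothetical.

```python
def transactions_at_risk(latency_ms, tps):
    """Transactions still in the replication pipeline for a given latency."""
    return latency_ms * tps / 1000


def meets_rpo(latency_samples_ms, max_latency_ms, required_fraction):
    """True if enough measured latencies fall within the RPO window.

    Assumes latency_samples_ms is a non-empty list of measurements.
    """
    within = sum(1 for s in latency_samples_ms if s <= max_latency_ms)
    return within / len(latency_samples_ms) >= required_fraction


def achievable_latency_ms(latency_samples_ms, required_fraction):
    """Smallest measured latency that covers the required fraction of samples."""
    ordered = sorted(latency_samples_ms)
    for count, value in enumerate(ordered, start=1):
        if count / len(ordered) >= required_fraction:
            return value
    return ordered[-1]


if __name__ == "__main__":
    # Hypothetical measurements: mostly 100 ms, with two slow outliers.
    samples = [100] * 98 + [250, 900]

    print(transactions_at_risk(300, 100))         # data window as a transaction count
    print(meets_rpo(samples, 300, 0.99))          # does the system meet 99% within 300 ms?
    print(achievable_latency_ms(samples, 0.99))   # the window it can meet at 99%
```

The last function corresponds to the fallback described above: if the measured system misses the specification, it reports the latency window that the system can meet at the same probability.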

--more --

Sign up for your free subscription at https://availabilitydigest.com/signups.htm

Would You Like to Sign Up for the Free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

The Availability Digest may be distributed freely. Please pass it on to an associate.

To be a reporter, visit https://availabilitydigest.com/reporter.htm.

Managing Editor - Dr. Bill Highleyman, editor@availabilitydigest.com