Welcome to our second issue of the Availability Digest and its focus on the quest for 100% uptime. In this issue, we feature a description of an active/active product from HP - its Home Location Register. The HLR is the brains of a cellular network.

We also tell the tale of how a simple coffee pot took down an active/active node. We discuss the importance of documentation and the help that the Unified Modeling Language, UML, can provide. There is lots more, including a trip back to 1984 to look at the state of fault tolerance at that time.

The monthly Availability Digest is distributed freely, though access to the detailed article content is available by subscription only. You can tell which articles require a subscription - their links are --more for subscribers--. Links to freely accessible articles are --more for free--. Visit www.availabilitydigest.com to sign up for your free Digest or to subscribe to its detailed content.

We are always looking for good content. Be an AD reporter, and earn free subscription months by contributing a story - especially a Case Study or a Never Again horror story. See our web site for details.

Many of you are getting the Digest from our unsolicited mailings. However, to avoid spamming you, we will be ending this practice in the near future. To guarantee your continued receipt of the free Digest, be sure to sign up for it on our web site.

Dr. Bill Highleyman, Managing Editor

Case Studies

HP’s Active/Active Home Location Register

The Home Location Register, or HLR, is the brains of a cellular network. A memory-resident, database-driven system, it keeps track of the location of all of its mobile subscribers and provides the list of services authorized for each subscriber to the cellular switching network.

The availability of HLR services is of paramount importance to a cellular network. If the HLR is down, the network is down (except for 911 emergency services in the U.S., which are federally mandated). The system must have zero downtime. Even planned outages for upgrades and repair must be avoided.

HP’s HLR system provides just such availability. It is configured as an active/active NonStop mated pair. Its memory-resident subscriber database is asynchronously replicated to a companion node so that service can be continued by a surviving node in the event of a node failure.

HP HLR systems are now deployed by 35 service providers on five continents and serve 200 million subscribers.

--more for subscribers--

Never Again

Active/Active Save #1 – Coffee Pot Takes Down Node

It often takes a chain of events to cause a system failure. One such chain started with purposefully miswiring power to a newly installed node to avoid an upgrade delay due to a wrong power connector. The UPS system was temporarily bypassed with house power until a proper power connector could be installed.

Unfortunately, the installation crew forgot to correct the situation; and the node continued to run on house power. Months later, someone plugged in a coffee pot and blew the circuit breaker. It was only because the company was running an active/active configuration that the users did not lose service.

--more for subscribers--

Best Practices

Document Your System

Proper documentation of software systems and procedures is often viewed as a necessary evil. Too often, the “evil” wins out over the “necessary;” and documentation is skimped or ignored.

The lack of documentation can create significant problems. Not only can software become unmaintainable over time, but serious operator errors due to a lack of documentation can bring down a system.

This problem is now greatly alleviated by the Unified Modeling Language, an international standard established a decade ago. Powerful tools exist to make the generation of UML diagrams relatively painless.

With the self-documenting capabilities of today’s modern languages such as Java, it is no longer necessary to face the daunting task of documenting code. However, UML can be used to easily document everything from software structures to operational procedures; and this is a Best Practice that data centers need.

A major company has decided to use UML to document its systems. Its efforts demonstrate that the UML approach is now both valid and valuable.

--more for subscribers--

Active/Active Topics

Asynchronous Replication Engines

Active/active systems require that geographically-dispersed database copies be kept in synchronization. There are several techniques for doing this, but the most common is the use of asynchronous data replication engines.

An asynchronous data replication engine has several advantages. It is fast and generally requires no application changes. It runs “under the covers” so that there is no performance impact on the application.

However, there are issues that must be considered. The primary ones are that data could be lost following a node failure and that data collisions can occur. Both of these are minimized by choosing a replication engine that has a very short replication latency – the time between when a change is made to the source database and when that change is applied to the target database.
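To make the role of replication latency concrete, the following Python sketch models a toy asynchronous replication engine. It is an illustration only and not any vendor's implementation: the class name AsyncReplicator, its methods, the helper lost_changes, and the fixed per-change latency are all assumptions introduced here. Changes commit at the source immediately and reach the target one latency interval later; whatever is still queued when the source node fails is lost.

```python
from collections import deque


class AsyncReplicator:
    """Hypothetical toy model of an asynchronous replication engine."""

    def __init__(self, latency):
        self.latency = latency   # assumed fixed delay per change, in seconds
        self.source = {}         # source database copy
        self.target = {}         # target database copy
        self.queue = deque()     # pending changes: (apply_time, key, value)

    def write(self, now, key, value):
        # The application sees the change committed at the source at once;
        # replication to the target happens later, "under the covers".
        self.source[key] = value
        self.queue.append((now + self.latency, key, value))

    def advance(self, now):
        # Apply every queued change whose replication delay has elapsed.
        while self.queue and self.queue[0][0] <= now:
            _, key, value = self.queue.popleft()
            self.target[key] = value

    def lost_if_source_fails(self, now):
        # Changes still in the queue at failure time never reach the target.
        self.advance(now)
        return len(self.queue)


def lost_changes(latency, writes=10, failure_time=10.0):
    """Issue one write per second, then fail the source node."""
    engine = AsyncReplicator(latency)
    for t in range(writes):
        engine.write(float(t), f"subscriber-{t}", f"location-{t}")
    return engine.lost_if_source_fails(failure_time)


if __name__ == "__main__":
    for latency in (0.5, 3.0):
        print(f"latency {latency} s: {lost_changes(latency)} change(s) lost")
```

In this scenario, a three-second latency strands the last two changes in the queue, while a half-second latency strands none. A shorter latency likewise narrows the window in which the same item can be changed at both nodes before either sees the other's update, which is when a data collision occurs.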

There are many off-the-shelf asynchronous replication engines available today, and they cover virtually every platform. Many of these engines are heterogeneous in that they can replicate between disparate platforms.

Asynchronous replication engines are described in detail in the book entitled Breaking the Availability Barrier: Survivable Systems for Enterprise Computing.

--more for subscribers--

The History of Fault Tolerance

Come with us on a fact-filled and fun-filled trip back to 1984 for a look at the emerging field of commercially-available fault-tolerant systems. This is a reprint of an article published by our Managing Editor, Dr. Bill Highleyman, in the September, 1984, issue of Computerworld. Led by granddaddy Tandem and close #2 Stratus, systems that had been announced included those from Synapse, Auragen, Tolerant, Sequoia, NoHalt, Parallel, August, and Syntrex. It appeared that DEC was ready to pounce, but Big Blue showed no interest.

The fault-tolerant marketplace was heating up so fast that Dr. Bill predicted “As with memory and languages and operating systems, fault-tolerance will become a subconscious requirement. We just wouldn’t think of building a system without it. How long? Anybody’s guess. But based on past experience, it’s got to be less than 10 years.” Oops!

--more for free-- (Caution: This is a scanned document and is about 2 MB long.)

Product Reviews

Virtual Tape – Getting Rid of a Troublesome Medium

Magnetic tape has been the backup medium of choice for decades. Accompanying it, however, is a lot of baggage. In large shops, there can be a floor-full of magnetic tape units and myriad operators. Tapes must be moved to off-site storage and then retrieved in the event of a system failure or data loss. Perhaps magnetic tape’s most serious limitation is the hours, days, or weeks that it may take to restore a failed system or to bring up a cold backup.

A recent solution to these problems is virtual tape. A virtual tape system virtualizes magnetic tape cartridges as disk files. Virtual tape cartridges are fast and space-efficient. They may be electronically replicated to off-site storage and to backup sites. Their contents can be written to physical cartridges at the remote storage site if required.

HP offers a virtual-tape solution via its HP Virtual TapeServer. The provision by HP of over 800 virtual tape drives worldwide demonstrates the acceptance of this important new technology.

--more for free--

The Geek Corner

Calculating Availability – Repair Strategies

In our previous article, Calculating Availability - Redundant Systems, we developed the basic availability relationship for an active/active system. However, that relationship did not concern itself with repair strategies.

There are two repair strategies of interest when a multiple node failure takes down an active/active system. One is parallel repair, in which a service technician (or team) is dispatched to each of the failed nodes so that they are all repaired in parallel. The other is sequential repair, in which only one service technician (or one team) is sent to repair a failed node. When that node is repaired, repair activity begins on the next node.

When compared to sequential repair, parallel repair reduces system failure probability by a factor of (s+1)!, reduces the downtime of the system by a factor of (s+1), and increases its uptime by a factor of s!, where s is the number of spare nodes in the system.
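The factors quoted above depend only on s and can be tabulated with a few lines of Python. This is a sketch for illustration; the function name parallel_repair_gains and the dictionary keys are hypothetical, and the formulas are simply those stated in the preceding paragraph.

```python
from math import factorial


def parallel_repair_gains(spares):
    """Improvement factors of parallel over sequential repair for s spares."""
    if spares < 0:
        raise ValueError("number of spare nodes must be non-negative")
    return {
        # System failure probability is reduced by (s+1)!
        "failure_probability_reduction": factorial(spares + 1),
        # System downtime is reduced by (s+1)
        "downtime_reduction": spares + 1,
        # System uptime is increased by s!
        "uptime_increase": factorial(spares),
    }


if __name__ == "__main__":
    for s in (1, 2, 3):
        print(s, parallel_repair_gains(s))
```

Note that the first factor is the product of the other two, since (s+1) x s! = (s+1)!. With a single spare node, parallel repair halves both downtime and failure probability; with two spares, failure probability drops by a factor of six.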

In addition, reducing the time that it takes to repair a node has an amplified effect on system availability. If the nodal repair time is reduced by a factor of k, the probability of system failure is reduced by a factor of k^(s+1).
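As a rough check on this amplification, the Python sketch below assumes the common approximation that system failure probability is proportional to (mtr/mtbf)^(s+1), where mtr is the nodal repair time and mtbf is the nodal mean time between failures. That approximation, the omitted proportionality constant, the sample values, and the function names are assumptions made here for illustration; they are not taken from the article itself.

```python
import math


def relative_failure_probability(mtr_hours, mtbf_hours, spares):
    """Assumed model: proportional to (mtr/mtbf)^(s+1), constant omitted."""
    return (mtr_hours / mtbf_hours) ** (spares + 1)


def repair_speedup_gain(k, spares):
    """Factor by which failure probability falls when repair is k times faster."""
    return k ** (spares + 1)


if __name__ == "__main__":
    # Hypothetical node: 4,000-hour mtbf, one spare, repair time cut from 4 h to 2 h.
    before = relative_failure_probability(4.0, 4000.0, spares=1)
    after = relative_failure_probability(2.0, 4000.0, spares=1)
    print("measured ratio:", before / after)
    print("predicted k^(s+1):", repair_speedup_gain(2, 1))
    assert math.isclose(before / after, repair_speedup_gain(2, 1))
```

Under this assumed model, halving the repair time with one spare node cuts the failure probability by a factor of four, and cutting it to a third with two spares yields a factor of 27.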

This subject is treated in much more detail in the books Breaking the Availability Barrier: Survivable Systems for Enterprise Computing and Breaking the Availability Barrier: Achieving Century Uptimes with Active/Active Systems.

--more for subscribers--

Would you like to Sign Up for the free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

The free Digest, published monthly, provides abbreviated articles for your review.

Access to full article content is by subscription only at

https://availabilitydigest.com/subscribe.htm.

Name:

Email Address:

Company:

Title:

Telephone No.

Address:

____________________________________

The Availability Digest may be distributed freely. Please pass it on to an associate.

Access to most detailed article content requires a subscription.

To sign up for the free Availability Digest or to subscribe, visit https://availabilitydigest.com/subscribe.htm.

To be a reporter (free subscription), visit https://availabilitydigest.com/reporter.htm.

Managing Editor - Dr. Bill Highleyman editor@availabilitydigest.com.

Case Studies

HP’s Active/Active Home Location Register

Never Again

Active/Active Save #1 – Coffee Pot Takes Down Node

Best Practices

Document Your System

Active/Active Topics

Asynchronous Replication Engines

The History of Fault Tolerance

Recommended Reading

The Unified Modeling Language User Guide

Product Reviews

Virtual Tape – Getting Rid of a Troublesome Medium

The Geek Corner

Calculating Availability – Repair Strategies