
The digest of current topics on Continuous Processing Architectures. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CPA tells you how to avoid the effects of downtime.

 

In this issue:

 

   Case Studies

      HP’s Home Location Register

   Never Again

      Coffee Pot Takes Down Node

   Best Practices

      Document Your System

   Active/Active Topics

      Asynchronous Replication Engines

      The History of Fault Tolerance

   Recommended Reading

      The UML User Guide

   Product Reviews

      Virtual Tape

   The Geek Corner

      Repair Strategies

 

Complete articles may be found at http://www.availabilitydigest.com/.

Dear Reader:

 

Welcome to our second issue of the Availability Digest and its focus on the quest for 100% uptime. In this issue, we feature a description of an active/active product from HP - its Home Location Register. The HLR is the brains of a cellular network.

 

We also tell the tale of how a simple coffee pot took down an active/active node. We discuss the importance of documentation and the help that the Unified Modeling Language, UML, can provide. There is lots more, including a trip to the past to look at the state of fault tolerance twenty-two years ago.

 

The monthly Availability Digest is distributed freely, though access to the detailed article content is available by subscription only. You can tell which articles require a subscription - their links are --more for subscribers--. Links to freely accessible articles are --more for free--. Visit www.availabilitydigest.com to sign up for your free Digest or to subscribe to its detailed content.

 

We are always looking for good content. Be an AD reporter, and earn free subscription months by contributing a story - especially a Case Study or a Never Again horror story. See our web site for details.

 

Many of you are getting the Digest from our unsolicited mailings. However, to avoid spamming you, we will be ending this practice in the near future. To guarantee your continued receipt of the free Digest, be sure to sign up for it on our web site.

 

Dr. Bill Highleyman, Managing Editor


 

Case Studies 

 

HP’s Active/Active Home Location Register

 

The Home Location Register, or HLR, is the brains of a cellular network. A memory-resident, database-driven system, it keeps track of the location of all of its mobile subscribers and provides the list of services authorized for each subscriber to the cellular switching network.

 

The availability of HLR services is of paramount importance to a cellular network. If the HLR is down, the network is down (except for 911 emergency services in the U.S., which are federally mandated). The system must have zero downtime. Even planned outages for upgrades and repair must be avoided.

 

HP’s HLR system provides just such availability. It is configured as an active/active NonStop mated pair. Its memory-resident subscriber database is asynchronously replicated to a companion node so that service can be continued by a surviving node in the event of a node failure.

 

HP HLR systems are now deployed by 35 service providers on five continents and serve 200 million subscribers.

 

--more for subscribers--

 


 

  Never Again 

 

Active/Active Save #1 – Coffee Pot Takes Down Node

 

It often takes a chain of events to cause a system failure. One such chain started with purposefully miswiring power to a newly installed node to avoid an upgrade delay due to a wrong power connector. The UPS system was temporarily bypassed with house power until a proper power connector could be installed.

 

Unfortunately, the installation crew forgot to correct the situation; and the node continued to run on house power. Months later, someone plugged in a coffee pot and blew the circuit breaker. It was only because the company was running an active/active configuration that the users did not lose service.

 

--more for subscribers--

 


 

Best Practices

 

Document Your System

 

 

Proper documentation of software systems and procedures is often viewed as a necessary evil. Too often, the “evil” wins out over the “necessary;” and documentation is skimped or ignored.

 

The lack of documentation can create significant problems. Not only can software become unmaintainable over time, but serious operator errors due to a lack of documentation can bring down a system.

 

This problem is now greatly alleviated by the Unified Modeling Language, an international standard established a decade ago. Powerful tools exist to make the generation of UML diagrams relatively painless.

 

With the self-documenting capabilities of modern languages such as Java, it is no longer necessary to face the daunting task of documenting code. Beyond code, UML can be used to document everything from software structures to operational procedures; and this is a Best Practice that data centers need.

 

A major company has decided to use UML to document its systems. Its efforts demonstrate that UML is now a valid and valuable approach.

 

--more for subscribers--

 


 

Active/Active Topics

 

Asynchronous Replication Engines

 

Active/active systems require that geographically-dispersed database copies be kept in synchronization. There are several techniques for doing this, but the most common is the use of asynchronous data replication engines.

 

An asynchronous data replication engine has several advantages. It is fast and generally requires no application changes. It runs “under the covers” so that there is little or no performance impact on the application.

 

However, there are issues that must be considered. The primary issues are that data could be lost following a node failure and that data collisions can occur. Both of these are minimized by choosing a replication engine that has a very short replication latency – the time from when a change is made to the source database to when that change is applied to the target database.
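
As a rough illustration of why replication latency matters for data loss, the following Python sketch models an asynchronous engine as a queue of changes that reach the target only after a fixed delay. The class name, the time units, and the failure scenario are assumptions made for illustration; they do not describe any particular replication product, and data collisions are not modeled.

```python
from collections import deque


class AsyncReplicator:
    """Toy model of an asynchronous replication engine (hypothetical)."""

    def __init__(self, latency):
        self.latency = latency      # replication latency, in arbitrary time units
        self.source = {}            # source database
        self.target = {}            # target database copy
        self.queue = deque()        # changes awaiting replication: (time, key, value)

    def write(self, now, key, value):
        # The application updates the source and continues immediately;
        # the change is queued for later delivery to the target.
        self.source[key] = value
        self.queue.append((now, key, value))

    def replicate(self, now):
        # Apply every queued change whose latency has elapsed.
        while self.queue and now - self.queue[0][0] >= self.latency:
            _, key, value = self.queue.popleft()
            self.target[key] = value

    def lost_on_source_failure(self):
        # Changes still queued never reach the target if the source node fails.
        return [(key, value) for _, key, value in self.queue]


def simulate(latency):
    """Issue one write per time unit for t = 0..9, then fail the source node."""
    rep = AsyncReplicator(latency)
    for t in range(10):
        rep.write(t, f"row{t}", t)
        rep.replicate(t)
    return len(rep.lost_on_source_failure())
```

Under these assumptions, simulate(1) leaves one change unreplicated when the source fails and simulate(5) leaves five, so the amount of data at risk grows with the replication latency.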

 

There are many off-the-shelf asynchronous replication engines available today, and they cover virtually every platform. Many of these engines are heterogeneous in that they can replicate between disparate platforms.

 

Asynchronous replication engines are described in detail in the book entitled Breaking the Availability Barrier: Survivable Systems for Enterprise Computing.

 

--more for subscribers--

 

The History of Fault Tolerance

 

Come with us on a fact-filled and fun-filled trip back to 1984 for a look at the emerging field of commercially-available fault-tolerant systems. This is a reprint of an article published by our Managing Editor, Dr. Bill Highleyman, in the September, 1984, issue of Computerworld. Led by granddaddy Tandem and close #2 Stratus, systems that had been announced included those from Synapse, Auragen, Tolerant, Sequoia, NoHalt, Parallel, August, and Syntrex. It appeared that DEC was ready to pounce, but Big Blue showed no interest.

 

The fault-tolerant marketplace was heating up so fast that Dr. Bill predicted “As with memory and languages and operating systems, fault-tolerance will become a subconscious requirement. We just wouldn’t think of building a system without it. How long? Anybody’s guess. But based on past experience, it’s got to be less than 10 years.” Oops!

 

--more for free-- (Caution: This is a scanned document and is about 2 MB long.)

 


 

Recommended Reading

 

The Unified Modeling Language User Guide

 

 

System documentation has long been a thorn in the side of data centers. We can’t live without it, but we never have the time or the money to document properly.

 

The proper documentation of system structure and operational procedures is a much less daunting task today now that the Unified Modeling Language has been accepted as the de facto standard for documentation. The UML allows one to define easily the system structures, their behavior, and their relationships and to express various useful views of the system via standard diagrams. The adoption of a standard for documentation carries with it several advantages, including the availability of off-the-shelf tools and a ready understanding of the results by all concerned.

 

UML was adopted by the Object Management Group (OMG) as a documentation standard a decade ago, and OMG now says that this is its most-used standard. The UML User Guide is an easy-to-read and reasonably complete UML tutorial written by the original developers of the language.

 

--more for subscribers--

 


 

Product Reviews

 

Virtual Tape – Getting Rid of a Troublesome Medium

 

Magnetic tape has been the backup medium of choice for decades. Accompanying it, however, is a lot of baggage. In large shops, there can be a floor-full of magnetic tape units and myriad operators. Tapes must be moved to off-site storage and then retrieved in the event of a system failure or data loss. Perhaps magnetic tape’s most serious limitation is the hours, days, or weeks that it may take to restore a failed system or to bring up a cold backup.

 

A recent solution to these problems is virtual tape. A virtual tape system virtualizes magnetic tape cartridges as disk files. Virtual tape cartridges are fast and space-efficient. They may be electronically replicated to off-site storage and to backup sites. Their contents can be written to physical cartridges at the remote storage site if required.

 

HP offers a virtual-tape solution via its HP Virtual TapeServer. The provision by HP of over 800 virtual tape drives worldwide demonstrates the acceptance of this important new technology.

 

--more for free--

 


 

The Geek Corner

 

Calculating Availability – Repair Strategies

 

In our previous article, Calculating Availability - Redundant Systems, we developed the basic availability relationship for an active/active system. However, that relationship did not concern itself with repair strategies.

 

There are two repair strategies of interest when a multiple node failure takes down an active/active system. One is parallel repair, in which multiple service technicians are dispatched to each of the failed nodes to repair them in parallel. The other is sequential repair, in which only one service technician (or one team) is sent to repair a failed node. When that node is repaired, repair activity begins on the next node.

 

When compared to sequential repair, parallel repair reduces system failure probability by a factor of (s+1)!, reduces the downtime of the system by a factor of (s+1), and increases its uptime by a factor of s!, where s is the number of spare nodes in the system.
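
The factors quoted above can be tabulated for a few values of s. The short Python sketch below simply evaluates (s+1)!, (s+1), and s!; the function name and dictionary keys are labels chosen here for illustration, not terms from the article.

```python
from math import factorial


def parallel_repair_gains(s):
    """Improvement factors of parallel over sequential repair for s spare nodes."""
    return {
        "failure_probability_reduced_by": factorial(s + 1),
        "downtime_reduced_by": s + 1,
        "uptime_increased_by": factorial(s),
    }


for s in (1, 2, 3):
    print(s, parallel_repair_gains(s))
```

With one spare (s = 1), parallel repair halves both the failure probability and the downtime and leaves uptime unchanged; with two spares, the factors are 6, 3, and 2. The three factors are mutually consistent, since (s+1)! = (s+1) × s!.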

 

In addition, reducing the time that it takes to repair a node has an amplified effect on system availability. If the nodal repair time is reduced by a factor of k, the probability of system failure is reduced by a factor of k^(s+1).
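
To see the amplification numerically, the sketch below evaluates the factor k^(s+1), where s is the number of spare nodes. It assumes, as is common in this kind of analysis but not spelled out in this summary, that system failure probability is proportional to the nodal repair time raised to the power s+1; the function name is hypothetical.

```python
def failure_reduction_factor(k, s):
    """Factor by which system failure probability falls when the nodal
    repair time is cut by a factor of k in a system with s spare nodes."""
    return k ** (s + 1)
```

Halving the repair time (k = 2) in a single-spare system (s = 1) cuts the failure probability by a factor of 4; with two spares, the same improvement cuts it by a factor of 8.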

 

This subject is treated in much more detail in the books Breaking the Availability Barrier: Survivable Systems for Enterprise Computing and Breaking the Availability Barrier: Achieving Century Uptimes with Active/Active Systems.

 

--more for subscribers--

 


 

Would you like to Sign Up for the free Digest by Fax?

 

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

 

The free Digest, published monthly, provides abbreviated articles for your review.

Access to full article content is by subscription only at

http://www.availabilitydigest.com/subscribe.htm.

 

 

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

____________________________________

____________________________________

____________________________________

____________________________________

____________________________________

____________________________________

____________________________________

The Availability Digest may be distributed freely. Please pass it on to an associate.

Access to most detailed article content requires a subscription.

To sign up for the free Availability Digest or to subscribe, visit http://www.availabilitydigest.com/subscribe.htm.

To be a reporter (free subscription), visit http://www.availabilitydigest.com/reporter.htm.

Managing Editor - Dr. Bill Highleyman editor@availabilitydigest.com.

© 2006 Sombers Associates, Inc., and W. H. Highleyman