Read the Digest in PDF. You need the free Adobe Reader.

The digest of current topics on Continuous Availability. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CA tells you how to avoid the effects of downtime.




Thanks to This Month's Availability Digest Sponsor

Connect is sponsoring the "Security on the High Seas" conference, a cruise to Cozumel, Mexico.

The conference will be held on a Royal Caribbean cruise ship from February 5 to 9, 2015.

The conference includes informative sessions from security experts and CCSK certification.

Connect has negotiated a very attractive pricing structure for the conference and certification.


In this issue:


   Never Again

      911 Service Down Six Hours - Software Bug

   Best Practices

     The Smarts Behind Smart Cards: Part 1

     Build to Fail

   The Geek Corner

     Random Events Have No Memory


      The Twitter Feed of Outages



Browse through our useful links.

See our article archive for complete articles.

Sign up for your free subscription.

Visit our Continuous Availability Forum.

Check out our seminars.

Check out our writing services.

Security on the High Seas Conference


Connect, the Independent HP Business Technology Community, is trying something new. It is sponsoring a three-day Security on the High Seas conference aboard a beautiful Royal Caribbean cruise ship. The ship will leave Fort Lauderdale, Florida, on February 5, 2015, and will return to port on February 9th.


The conference will include one full day of informative sessions from security experts and one full day of training for the Certificate of Cloud Security Knowledge (CCSK) from the Cloud Security Alliance. The conference price includes an exam token from the Cloud Security Alliance to take the CCSK exam and to receive CCSK certification. I will be speaking on the impact that DDoS attacks can have on your online services. February 7th will be spent touring Cozumel, Mexico.


Connect has arranged for very attractive pricing for the conference. Prices start at U.S. $1,132 for one attendee and U.S. $1,604 for two attendees (less if the second attendee is not going to participate in the conference). The fee includes room and all meals and the price of the CCSK certification. More information on the Security on the High Seas conference can be found on the Connect web site at


- Bill Highleyman, Managing Editor



  Never Again 


911 Service Down for Six Hours Due to a Software Bug


For six hours on April 10, 2014, the entire state of Washington lost 911 service for all of its residents.


At about 1 AM PDT on April 10th, call dispatchers in 911 call centers around the state of Washington began to notice that no 911 calls were coming through. Call dispatchers attempted to reroute incoming 911 calls to neighboring call centers but found that these jurisdictions had no service either.


Frantic calls to CenturyLink, the 911 service provider for the state of Washington, provided no relief. It took six hours to restore service. During that time, 4,500 calls to 911 in Washington state went unanswered, and only 770 were completed. Even worse, a partial outage extended to six other states. All in all, 6,600 calls to 911 failed. Fortunately, no one died as a result of the outage.


An FCC (the U.S. Federal Communications Commission) report analyzing the outage concluded that the 911 outage was the result of a preventable software error in a routing center. The outage could have been immediately corrected by rerouting 911 calls that could not be completed to another routing center.
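The rerouting remedy mentioned above can be illustrated with a minimal sketch. It is not the CenturyLink design or anything taken from the FCC report. The names `route_call`, `failed_center`, and `healthy_center` are hypothetical, and each routing center is modeled as a plain function that reports whether it completed the call.

```python
from typing import Callable, Optional, Sequence

# Hypothetical model: a routing center is a function that takes a
# call ID and returns True if it delivered the call to a dispatcher.
RoutingCenter = Callable[[str], bool]


def route_call(call_id: str, centers: Sequence[RoutingCenter]) -> Optional[int]:
    """Try each routing center in order. Return the index of the first
    center that completes the call, or None if every center fails."""
    for index, center in enumerate(centers):
        try:
            if center(call_id):
                return index
        except Exception:
            # Treat a crash in one center like a refusal and move on.
            continue
    return None


def failed_center(call_id: str) -> bool:
    """Stand-in for a routing center that cannot complete any call."""
    return False


def healthy_center(call_id: str) -> bool:
    """Stand-in for a routing center that completes every call."""
    return True


# The call falls through the failed center to the healthy one.
chosen = route_call("call-001", [failed_center, healthy_center])
print(chosen)
```

The point of the sketch is only that a call that cannot be completed is handed to the next center instead of being dropped.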




Best Practices


The Smarts Behind EMV Smart Cards: Part 1 - Online Processing


During the last months of 2013, Target, the third largest retailer in the U.S., suffered a card-skimming attack in which hackers were able to obtain the magnetic-stripe data from cards used in Target stores. The personal data from 110 million payment cards was stolen. Thousands of fraudulent transactions followed. Is there a defense against these data breaches?


The answer is smart cards. A smart card, also called a chip card or an integrated-circuit card (ICC), includes an embedded computer chip that employs cryptographic and risk-management features. In conjunction with a smart-card POS or ATM terminal, these features are designed to thwart skimming, card-cloning, card-counterfeiting, and other fraudulent attacks.


Smart cards have been in use all around the world except in the U.S., and they are now coming to the U.S. In this article, split over two issues of the Availability Digest, we describe how smart cards add significant security to payment-card transactions. Part 1 covers the methods for authorizing smart-card transactions online with the issuer. In Part 2, we will discuss the procedures for securely authorizing smart-card transactions offline without direct issuer involvement.





Build to Fail


What do build to fail and chaos monkey have to do with continuous availability? Plenty, as Netflix has demonstrated. Netflix survived a massive Amazon Web Services reboot that affected several Netflix virtual machines with hardly a hiccup. Netflix attributed this success to its policy of building applications to run continuously on systems that can fail (build to fail) and to test these applications periodically with random system failures (chaos monkey).


The first step to achieving continuous availability is to design for availability. To do this, Netflix utilizes consistent reliability design patterns to tie micro-services into applications that are distributed across many nodes.


The second step is to verify that the application designs are reliable and that they will recover from unexpected failures. Netflix uses its Simian Army to periodically inject random failures into its systems to ensure that they can tolerate them. One member of the Simian Army is Chaos Monkey, whose task is to randomly disable production Amazon virtual machines.


The result of this effort was a near-perfect survival of the Amazon reboot of ten percent of the Netflix database nodes.
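The failure-injection idea described above can be sketched in a few lines. This is a toy simulation, not Netflix's Chaos Monkey. The fleet, the function names `kill_random_instances` and `service_available`, and the rule that six of ten nodes must be running are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def kill_random_instances(instances: Dict[str, bool], count: int,
                          rng: random.Random) -> List[str]:
    """Mark `count` randomly chosen running instances as down and
    return their names. `instances` maps name -> is_running."""
    running = [name for name, up in instances.items() if up]
    victims = rng.sample(running, min(count, len(running)))
    for name in victims:
        instances[name] = False
    return victims


def service_available(instances: Dict[str, bool], minimum: int) -> bool:
    """The service is up if at least `minimum` instances are running."""
    return sum(instances.values()) >= minimum


# Hypothetical fleet of ten nodes that needs six to serve traffic.
fleet = {f"node-{i}": True for i in range(10)}
victims = kill_random_instances(fleet, 1, random.Random(42))
print(victims, service_available(fleet, minimum=6))
```

Running a check like this repeatedly, with failures chosen at random, is what turns "we designed for failure" into something that is verified.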




The Geek Corner


Random Events Have No Memory


In availability, we talk about mean time between failures (MTBF) and mean time to repair (MTR), where mean means average. MTBF is the average time between system failures. MTR is the average time it takes to return a failed system to service.
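These two averages are commonly combined into a steady-state availability figure, the fraction of time the system is up. The sketch below uses the standard relation availability = MTBF / (MTBF + MTR). The function name and the sample values are illustrative and are not taken from the article.

```python
def availability(mtbf_hours: float, mtr_hours: float) -> float:
    """Fraction of time the system is in service, given the average
    time between failures and the average time to repair."""
    return mtbf_hours / (mtbf_hours + mtr_hours)


# Illustrative values: fails every 999 hours on average and
# takes one hour on average to repair.
print(availability(999.0, 1.0))
```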


The time between failures and the time to repair are random variables, and MTBF and MTR are their averages. That is, the actual times can take on any number of values. The system may fail in three months, and then it may not fail again for two years. It may take one hour to repair the system the first time and 30 minutes to repair it the next time.


A key concept in availability theory is that failures and repairs are memoryless. That is, whether the system will go down or will be repaired in the next minute is independent of the past history of events; the timing of past events has no impact on the timing of future events.


In this article, we show that for memoryless random events such as failures and repairs, the amount of time to the next event is given by the exponential distribution, and the probability of a specified number of events happening in a given time interval is given by the Poisson distribution.
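The two distributions named above can be checked numerically. The sketch below uses their standard textbook forms, not formulas copied from the article, and assumes a constant event rate (for failures, rate = 1 / MTBF). The function names are illustrative.

```python
import math


def prob_no_event(rate: float, t: float) -> float:
    """Exponential distribution: probability that the next event
    takes longer than t to arrive."""
    return math.exp(-rate * t)


def prob_no_event_given_waited(rate: float, waited: float, t: float) -> float:
    """Probability of no event in the next t, given that none has
    occurred during the `waited` time already elapsed."""
    return prob_no_event(rate, waited + t) / prob_no_event(rate, waited)


def poisson_pmf(k: int, rate: float, t: float) -> float:
    """Poisson distribution: probability of exactly k events in an
    interval of length t."""
    mean = rate * t
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Memoryless check with an assumed rate of 0.5 events per year:
# having already waited 3 years does not change the next 2 years.
rate = 0.5
print(prob_no_event(rate, 2.0))
print(prob_no_event_given_waited(rate, 3.0, 2.0))
```

The two printed probabilities agree, which is the "no memory" property in numeric form. The Poisson probability of zero events in an interval also equals the exponential probability of waiting longer than that interval.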





@availabilitydig - The Twitter Feed of Outages


A challenge every issue for the Availability Digest is to determine which of the many availability topics out there win coveted status as Digest articles. We always regret not focusing our attention on the topics we bypass.


Now with our Twitter presence, we do not have to feel guilty. This article highlights some of the @availabilitydig tweets that made headlines in recent days.







Sign up for your free subscription at


Would You Like to Sign Up for the Free Digest by Fax?


Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543




Email Address:



Telephone No.:










The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.

Managing Editor - Dr. Bill Highleyman

© 2014 Sombers Associates, Inc., and W. H. Highleyman