Read the Digest in PDF. You need the free Adobe Reader.

The digest of current topics on Continuous Availability. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CA tells you how to avoid the effects of downtime.

www.availabilitydigest.com

Follow us

@availabilitydig

Let’s be realistic. Availability costs money. So does system downtime. For every application and every system, there is an affordable balance. Determining that balance is the role of risk assessment and system architecture, and it’s what we at the Availability Digest do best. Visit our Consulting page for more information. We’ll be happy to provide you with a quotation.

In this issue:

Case Studies

Bank Chooses Shadowbase for BASE24

Never Again

123-Reg Deletes Hundreds of Websites

Software Bugs Take Down Google GCE

Availability Topics

The Dawn of Fault-Tolerant Computing

Tweets

The Twitter Feed of Outages

Browse through our useful links.

See our article archive for complete articles.

Visit our Continuous Availability Forum.

Check out our seminars.

Check out our writing services.

Check out our consulting services.

We Write for Others as Well as for Ourselves

The articles you read in the Availability Digest result from years of experience in researching and writing a variety of technical documents and marketing content. It’s what we do best, and we provide our services to others. Not a month goes by without the publishing of one or more articles that were ghostwritten by us under a customer’s byline.

Our involvement in a writing assignment follows a structured format:
● You send us whatever information is available so that we can begin the research process.
● You identify the people we need to interview, and we will conduct those conversations either onsite at your location or via another communications method preferable to you.
● We complete the writing assignment within the stipulated timeline and will send it to you for review, revisions, and approval.
● You or your company owns the resultant work. We hold no copyright on it.

Check out the Writing page on our website, then let’s get started. Contact us at editor@availabilitydigest.com.

Dr. Bill Highleyman, Managing Editor

Case Studies

Bank Chooses “Sizzling-Hot-Takeover” Data Replication for its BASE24™ Business Continuity Solution

For the past eight years, a Tier 1 regional bank serving a major resort island had been using an ACI BASE24 Classic financial transaction switch to manage its network ATMs and POS terminals. For business continuity, its BASE24 system ran in an active/passive mode on a pair of HPE NonStop S-Series servers. Early in 2015, the bank determined that it needed to upgrade these servers, which, along with the operating system and application software, were nearing their end-of-support lives. The bank decided to migrate its BASE24 system to a pair of NonStop NS-Series servers, again running as an active/passive pair.

With an eye to cost issues and in order to optimize its business continuity failover time for system outages, both scheduled and unscheduled, the bank replaced its existing data-replication product with HPE Shadowbase. The replacement decision allowed the bank to take advantage of Shadowbase’s sizzling-hot-takeover (SZT) facility, which can reduce failover time to a few seconds.

--more--

Never Again

123-Reg Deletes Hundreds of its Hosted Websites

123-Reg is the U.K.’s largest domain registrar and one of its biggest website hosting providers. It has issued 3.5 million domain names and hosts 1.7 million websites.

Within its global data centers, 123-Reg operates 115,000 servers. Running on every physical server are several virtual private servers (VPSs), each dedicated to one customer. Every VPS hosts several virtual machines (VMs), each of which hosts a website for the customer owning the VPS. In effect, multiple customers share a single server that appears to be a system dedicated to just one customer.

Most companies hosting websites on 123-Reg are e-commerce businesses relying on their websites for sales.

On April 16, 2016, 123-Reg accidentally wiped out hundreds of its customers’ websites when it ran a maintenance script that contained a software bug. The hosting service provider operates an “unmanaged” hosting service and does not provide backups for its customers. Though it encourages customers to maintain backup copies of their websites, most do not. For those customers, their websites may be irretrievably lost.

--more--

Cascading Software Bugs Take Down Google Compute Engine

On April 11, 2016, Google’s Compute Engine (GCE) experienced a massive outage that affected all of Google’s regions worldwide. The outage was caused by a series of software bugs that fed on each other while Google engineers were busy upgrading the Google network. The first software bug corrupted the network upgrade. The second software bug sent the corrupted upgrade to the network rather than cancelling it. The third software bug failed to inform the network management software that a corrupted network upgrade was being propagated.

The result was that inbound Internet traffic to Google was not routed correctly. Connections were dropped, and users could not reconnect. Services dependent upon the network, such as VPNs and Layer 3 load balancers, began to fail. Google users worldwide were unable to connect to the Google Compute Engine. Outbound Internet traffic was not affected.

The asia-east1 region was unreachable for over an hour. The entire Google worldwide GCE network was down for eighteen minutes.

Google announced service refunds to its clients, and the refunds exceeded the requirements of Google’s SLAs.

--more--

Availability Topics

The Dawn of Fault-Tolerant Computing

In 1980, I published a four-part Computerworld series entitled “Survivable Systems.” The articles described the state-of-the-art fault-tolerant systems at the time. The need for systems that never (at least, hardly ever) failed was just being recognized. Several companies jumped in with their own versions of fault-tolerant systems, including Tandem, Stratus, Synapse, Auragen, August, NoHalt, Parallel Computers, and Tolerant Systems.

A lot has changed over the last 36 years. Systems have become more “open,” with Linux-like operating systems and x86-based hardware architectures. However, what hasn’t changed is the need for systems that never fail. Applications that were hardly in use in the 1980s now are becoming mission-critical. The use of email that came into being in 1993 is a perfect example. With the advent of social media, systems promoted as 24x7 can’t risk a failure. As soon as a system is under distress, the Twitter universe explodes with complaints and comments, often causing irreparable harm to a company’s reputation for reliability.

Some early products are still in use, for instance, Tandem and Stratus systems. Others have been incorporated into newer products. Still others simply have disappeared. In this article, we visit the dawn of fault-tolerant computers and the various architectures that were being promoted as such at the time.

--more--

Tweets

@availabilitydig - The Twitter Feed of Outages

A challenge every issue for the Availability Digest is to determine which of the many availability topics out there win coveted status as Digest articles. We always regret not focusing our attention on the topics we bypass.

Now with our Twitter presence, we don’t have to feel guilty. This article highlights some of the @availabilitydig tweets that made headlines in recent days.

--more--

Sign up for your free subscription at https://availabilitydigest.com/signups.htm

Would You Like to Sign Up for the Free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.

Managing Editor - Dr. Bill Highleyman editor@availabilitydigest.com.