The digest of current topics on Continuous Processing Architectures. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CPA tells you how to avoid the effects of downtime.

In this issue:

Never Again

PayPal Services Downgrade with Upgrade

Availability Topics

Fault Tolerance for Virtualized Environments – Part 3

Product Reviews

OpenVMS Active/Active Split-Site Clusters

The Geek Corner

Calculating Availability – Heterogeneous Systems Part 3

Complete articles may be found at https://availabilitydigest.com/articles.

NEVER update without a fallback plan

This seems like such an obvious statement. Why, then, is it so often ignored? Our Never Again story describes one recent experience with disastrous results. PayPal attempted an upgrade without a fallback plan and compromised millions of online stores for weeks before it finally resolved the problem. No one has yet estimated how many small merchants may have been put out of business.

PayPal is not unique. See our Never Again articles on BlackBerry for other examples of not following this best practice.

And a reminder. Come join me for a full-day seminar on active/active systems at the 2008 HP Technology Forum, to be held in Las Vegas, Nevada, USA, this month. The seminar will be held on Monday, June 16, starting at 8:30 AM and will cover the theory and implementation of active/active systems, with several case studies of successful implementations now in production.

Dr. Bill Highleyman, Managing Editor

Never Again

PayPal Services Downgrade with Upgrade

PayPal provides payment-processing services for online merchants, auction sites, and others. Now owned by auction giant eBay, PayPal processes more than US$50 billion per year and serves 200 million accounts. It is used in 190 countries and supports 19 currencies.

Clearly, a great deal of today’s ecommerce flows through PayPal. Its services must be extremely reliable as billions of dollars of revenue for millions of small online merchants depend upon it. An extended outage could put many small merchants out of business.

And that is what happened last month. Critical PayPal services went down – not for hours but for weeks. The problem occurred when PayPal attempted to upgrade its Instant Payment Notification system with no rollback plan in case the upgrade failed. During the many days that it took to correct the problems, many online merchants were unable to fill customer orders or could do so only at a financial loss.

Availability Topics

Fault Tolerance for Virtualized Environments – Part 3

Virtualization significantly increases the utilization of a server by creating several independent virtual machines (VMs) on a single physical server. To its copy of the operating system (a guest operating system), each virtual machine appears as if it were a dedicated physical server.

In Parts 1 and 2 of this series, we described the reasons for the burgeoning interest in virtualization and how virtualization is implemented. Though virtualization can significantly reduce data center costs, the loss of a virtualized server can mean the loss of many applications. In Part 3, we address the very important problem of achieving continuous availability in virtualized environments.

Virtualization products provide a broad range of failover capabilities, but all can result in long failover times as applications are migrated and as corrupt databases are repaired. This problem can be alleviated by using fault-tolerant servers to host virtualized environments. Fault-tolerant servers can withstand any single fault and many multiple faults with no impact on the user, and they can reduce the incidence of costly failovers by one or two orders of magnitude.

We conclude with some brief reviews of virtualization products and fault-tolerant servers that provide the features needed to achieve the high availability required in virtual environments.

Product Reviews

OpenVMS Active/Active Split-Site Clusters

HP OpenVMS clusters offer a stark contrast to contemporary cluster technology. Nodes in an OpenVMS cluster run in an active/active mode in which multiple nodes across multiple sites can cooperate in a common application against a common database.

Contemporary clusters do not run in a true active/active mode because a disk volume can be mounted on only one node at a time (unless Oracle RAC is used), and only that node can participate in the application. Consequently, when a node fails, the application must be started on another node, the volume remounted and repaired, and the users switched. This leads to failover times measured in minutes or more.

OpenVMS clusters are an interesting mix of cluster and active/active technologies. Although they are structured as clusters with multiple nodes accessing a logically local file system, the nodes can be geographically distributed and can all be executing common applications as an active/active network.

OpenVMS clusters recover in seconds because once a failure is detected, all that must be done to continue operation is to switch users to a surviving node in any site. Furthermore, no data is lost following a failure because the application database copies are updated synchronously.

The Geek Corner

Calculating Availability – Heterogeneous Systems Part 3

In Parts 1 and 2 of this series, we reviewed some elementary concepts of probability theory and applied them to the analysis of the availability of active/active systems and active/standby systems. We considered not only user downtime due to dual-system failures but also user downtime due to failover times and failover faults. We extended these results to include the case in which the nodes in the system are heterogeneous and have different availabilities.

In Part 3, we show how to calculate the availability of a complex system comprising parallel and serial-node configurations. This is a step-wise analysis in which parallel subsystems and serial subsystems are iteratively reduced to single nodes until only one – the system node – remains.
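The step-wise reduction described above can be illustrated with a minimal sketch. It assumes the standard textbook rules for independent node failures: a serial group is up only if every node is up, and a parallel group is up if at least one node is up. It ignores the failover times and failover faults that the series also covers. The function names and the availability figures are hypothetical and are not taken from the article.

```python
from math import prod


def serial(*availabilities):
    """Serial subsystem: every node must be up, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """Parallel subsystem: down only if every node is down at once
    (assumes independent failures and that any one node can carry the load)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical example: two heterogeneous nodes in parallel,
# followed in series by a shared network.
node_a = 0.999     # assumed availability of the first node
node_b = 0.995     # assumed availability of the second node
network = 0.9999   # assumed availability of the shared network

# Step 1: reduce the parallel pair to a single equivalent node.
pair = parallel(node_a, node_b)

# Step 2: reduce the serial chain (equivalent node + network) to the system node.
system = serial(pair, network)

print(f"parallel pair availability: {pair:.6f}")
print(f"system availability:        {system:.6f}")
```

Each call replaces a subsystem with one equivalent node, so a more complex configuration is handled by nesting the calls until only the system node remains.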

In our final Part 4, we will use these techniques to analyze a complex system comprising an active/active system backed up by a single high-availability standby system.

Would You Like to Sign Up for the Free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

The Availability Digest may be distributed freely. Please pass it on to an associate.

To be a reporter, visit https://availabilitydigest.com/reporter.htm.

Managing Editor - Dr. Bill Highleyman, editor@availabilitydigest.com