
The digest of current topics on Continuous Processing Architectures. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CPA tells you how to avoid the effects of downtime.

In this issue:

Case Studies

Active/Active Payment Processing at Swedbank

Never Again

So You Think Your System is Reliable?

Best Practices

Reliable Multicasting

Recommended Reading

Distributed Systems: Principles and Paradigms

Complete articles may be found at https://availabilitydigest.com/articles.

The raft of crashes of systems that were thought to be reliable overwhelms our Never Again column. In this issue we summarize almost two dozen major crashes that happened just in the last six months and that inconvenienced millions of users, cost businesses big money, and even jeopardized life and property.

Interestingly, fully a third of these system failures were caused by power problems of one sort or another. Power glitches continue to be a major cause of system outages.

Another pattern seen in these examples is that many of the outages affected airline terminals, train stations, and motor vehicle bureaus. Evidently, in many cases our disaster recovery planning does not extend beyond the data center.

We have a long way to go to approach in practice the availabilities that are achievable with today’s technology. We hope that our message in the Availability Digest will help lead us eventually to the Nirvana of continuous availability.

If you have similar stories to share or would like to talk about your own systems, drop us a note at editor@availabilitydigest.com. We’d love to hear from you.

Dr. Bill Highleyman, Managing Editor

Case Studies

Active/Active Payment Processing at Swedbank

Based in Stockholm, Sweden, Swedbank (www.swedbank.com) is one of the largest retail banks in the Nordic region. It is the leading bank in Sweden, Estonia, Latvia, and Lithuania.

Among its other services, Swedbank processes electronic payment requests for a number of Swedish and overseas banks as well as ATM payments and payment requests from Swedbank’s own customers. It uses ACI’s Base24 application to provide these services, authorizing and authenticating over one billion transactions per year.

Because of the growing importance of the payment authorization and authentication function for ATM and POS transactions, Swedbank decided that it had to offer true 24x7 service for these critical functions to its customers. Therefore, Swedbank recently moved these functions to a NonStop active/active configuration.

The new configuration has proven that it can fulfill the 24x7 need. Swedbank has been operating its dual-node active/active system since 2006 with no outages or service issues.


Never Again

So You Think Your System is Reliable?

The abundance of system failures causing major impacts overwhelms our Never Again column. Earlier this year, we published a summary of several major outages to try to catch up with the news. In this article, we continue that tradition and give brief summaries of many significant system outages that occurred during the second half of 2007.

There are some interesting lessons to learn from these experiences. Of the twenty-one instances reported, seven were due to power problems. We see this as a recurring theme. Another seven had to do with satellite branches – train stations, airports, and motor vehicle bureaus. It seems that we often don’t extend our notions of uptime beyond the central facility.

In one case, a failure cost billions of dollars when Samsung’s chip line went down. Then there was the VoIP provider Skype, which went down for days when its network overloaded trying to handle millions of simultaneous end-user software upgrades. What about Cisco? Even an acknowledged leader in network redundancy can go down. And there was the lowly PC that kept thousands of arriving international passengers on planes for hours at the Los Angeles airport.


Best Practices

Reliable Multicasting

Many applications exist in which messages must be sent to a group of receivers. Replicating database changes to the database copies in an active/active network is one such example with which we deal frequently in the Availability Digest. However, there are many other applications that require this capability. The distribution of actionable events, of stock market activity, and of news streams are other examples. This is called multicasting.

Multicasting in its simplest form is unreliable. Messages are sent via a one-way, best-effort protocol such as UDP. Missing messages may not be detectable and are not recoverable.
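One common way to make lost messages at least detectable is for the sender to stamp each message with a sequence number. The following Python sketch illustrates the idea only; it is not taken from any particular product. The names GapDetector and on_message are hypothetical, and the arrivals are simulated rather than read from a UDP socket.

```python
class GapDetector:
    """Tracks sequence numbers from one sender and reports missing ones."""

    def __init__(self):
        self.next_expected = 0   # lowest sequence number not yet seen in order
        self.missing = set()     # sequence numbers currently known to be missing

    def on_message(self, seq):
        """Record an arriving sequence number; return newly detected gaps."""
        if seq in self.missing:
            # A late message fills a gap that was reported earlier.
            self.missing.discard(seq)
            return []
        if seq < self.next_expected:
            return []            # duplicate of a message already seen
        new_gaps = list(range(self.next_expected, seq))
        self.missing.update(new_gaps)
        self.next_expected = seq + 1
        return new_gaps


detector = GapDetector()
arrivals = [0, 1, 4, 2, 5]       # simulated: message 3 is lost, 2 arrives late
gaps = [detector.on_message(seq) for seq in arrivals]
```

Note that a gap is noticed only when a later message arrives. If the last message in a stream is lost, this scheme alone cannot detect it, and detection by itself does nothing to recover the missing data.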

In many applications, it must be guaranteed that every receiver in a multicast group receives every message. This is known as reliable multicasting. Building reliable multicast networks that can scale to a large number of receivers is a difficult problem. No single best solution exists, and each solution introduces new problems.
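One family of approaches has receivers send a negative acknowledgment (NAK) when they notice a gap in sequence numbers, and has the sender retransmit from a history buffer. The Python sketch below simulates that idea under simplifying assumptions: the Sender and Receiver classes are hypothetical, the network is replaced by direct method calls, the history buffer is never trimmed, and a retransmission is assumed always to succeed.

```python
class Sender:
    """Multicast sender that retains sent messages for retransmission."""

    def __init__(self):
        self.history = {}        # seq -> payload, kept to answer NAKs
        self.next_seq = 0

    def send(self, payload):
        seq = self.next_seq
        self.history[seq] = payload
        self.next_seq += 1
        return seq, payload

    def retransmit(self, seq):
        return seq, self.history[seq]


class Receiver:
    """Delivers messages in sequence order, requesting any that are missing."""

    def __init__(self, sender):
        self.sender = sender
        self.pending = {}        # received but not yet delivered
        self.next_expected = 0
        self.delivered = []

    def on_message(self, seq, payload):
        if seq < self.next_expected or seq in self.pending:
            return               # duplicate; ignore
        self.pending[seq] = payload
        # NAK step: request every earlier message that has not arrived.
        for missing in range(self.next_expected, seq):
            if missing not in self.pending:
                _, lost = self.sender.retransmit(missing)
                self.pending[missing] = lost
        # Deliver everything that is now contiguous, in order.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1


sender = Sender()
receivers = [Receiver(sender), Receiver(sender)]
dropped = {(1, 1), (1, 2)}       # simulated losses: (receiver index, seq)

for text in ["a", "b", "c", "d"]:
    seq, payload = sender.send(text)
    for index, receiver in enumerate(receivers):
        if (index, seq) not in dropped:
            receiver.on_message(seq, payload)
```

In this simulation both receivers end up delivering a, b, c, d in order, even though the second receiver never saw the original transmissions of b and c. In a real network, many receivers sending NAKs to one sender at the same moment can overwhelm it, which is one reason scaling to large groups is hard.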

The problem of reliable multicasting has different solutions for local area networks and wide area networks. The problem is further complicated if there are multiple senders and if proper message ordering is required.


Recommended Reading

Distributed Systems: Principles and Paradigms

Distributed Systems: Principles and Paradigms is a thorough description of the theory and practice behind the technology that goes into building effective distributed systems. Authored by Andrew Tanenbaum and Maarten van Steen, professors of computer science at the Vrije Universiteit in Amsterdam, the Netherlands, this book deals with the myriad issues that must be faced when implementing distributed systems.

The authors define a distributed system as a collection of independent computers that appears to its users as a single coherent system. As a consequence, the book focuses on the transparency of data and services in distributed systems. Topics include distributed architectures, processes, communication, naming, system synchronization, data replication, fault tolerance, and security.

Though the book borders on the erudite rather than the practical (the authors often lapse into notational descriptions that fortunately can be ignored by the theoretically challenged), it makes extensive use of well-known systems as examples to demonstrate the principles described. It is a must-have reference for any serious practitioner of distributed systems.


Would You Like to Sign Up for the Free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

The Availability Digest may be distributed freely. Please pass it on to an associate.

To be a reporter, visit https://availabilitydigest.com/reporter.htm.

Managing Editor - Dr. Bill Highleyman editor@availabilitydigest.com.