Read the Digest in PDF. You need the free Adobe Reader.

The digest of current topics on Continuous Availability. More than Business Continuity Planning.

BCP tells you how to recover from the effects of downtime.

CA tells you how to avoid the effects of downtime.

www.availabilitydigest.com

Thanks to This Month's Availability Digest Sponsor

Join Opsol at HP Discover June 4th - 7th as Wells Fargo describes replacing Base24 with OmniPayments.

Opsol’s OmniPayments offers a full complement of credit/debit card authorization and support services.

50% cost savings in your payments processing. 450 million transactions per month.

VISA, MASTERCARD, PROSA certified. Proven Base24 Shutdown reference customer.

In this issue:

Never Again

Will You Have Internet Access After July 9th?

Availability Topics

It’s Official! Leap Day Caused the Windows Azure Outage

Recommended Reading

Beyond Redundancy - Georedundancy

Product Reviews

Critical Date Testing - Leap Day and More

Browse through our Useful Links.

Check our article archive for complete articles.

Join us on our Continuous Availability Forum.

Check out our seminars.

Check out our technical writing services.

Availability Digest to Present at HP Discover 2012

Once again, the Availability Digest will present topics of interest to the high availability IT community. This time, it’s at HP Discover 2012, to be held in Las Vegas from June 5th to June 7th. Dr. Bill Highleyman, Managing Editor of the Availability Digest, will present a session entitled “Help! My Data Center is Down!” He will discuss several spectacular data-center failures experienced by major companies due to a variety of factors, including human errors, failed upgrades, failover faults, network outages, and even some that were unthinkable.

Many important lessons can be learned from these failures, and they apply to data centers everywhere. If it can happen to Google, to Microsoft, or to the U.S. Internal Revenue Service, it can happen to you.

This topic is a favorite in the continuous availability and high availability seminars that we provide to our clients. Join us on Tuesday, June 5th, at 11:45 AM to gain important insights into preventing a major failure in your own data center.

Dr. Bill Highleyman, Managing Editor

Never Again

Will You Have Internet Access After July 9, 2012?

The U.S. Federal Bureau of Investigation (FBI) predicts that up to 300,000 people around the world – many in the U.S. – will lose Internet service on July 9, 2012. If one of them is you, there may not be much that you can do about it except to rebuild your operating system.

The story begins with a well-known class of malware called “DNSChanger.” Simply put, DNSChanger infects a PC and changes the IP address of the PC’s DNS (Domain Name System) server to a rogue DNS server. During a two-year investigation, the FBI, in concert with other international law enforcement agencies, uncovered a network of rogue DNS servers that were being used in an advertising scam. The hackers infected millions of PCs with DNSChanger malware and directed them to their own fraudulent web sites.

After seizing the rogue servers, the FBI faced a dilemma. If it simply disabled the rogue DNS servers, millions of PCs would suddenly be left without Internet access. Instead, it set up temporary legitimate DNS servers to replace the rogue servers for infected users.

This move was court-approved but only until July 9, 2012. Users who have not taken corrective action by then will lose Internet access.
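
Because DNSChanger works by pointing a PC at a rogue DNS server, checking for it comes down to comparing the DNS server addresses configured on the PC against the known rogue address ranges. The Python sketch below illustrates only that comparison. The names ROGUE_RANGES, is_rogue_resolver, and flag_rogue_resolvers are hypothetical, and the ranges shown are documentation-reserved placeholders, not the actual rogue addresses, which would have to be taken from an official advisory.

```python
import ipaddress

# ASSUMPTION: placeholder ranges from the blocks reserved for documentation.
# They are NOT the real DNSChanger ranges; substitute the ranges published
# in an official advisory before relying on this check.
ROGUE_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_rogue_resolver(address, rogue_ranges=ROGUE_RANGES):
    """Return True if a DNS server address falls inside any rogue range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in rogue_ranges)


def flag_rogue_resolvers(resolvers, rogue_ranges=ROGUE_RANGES):
    """Return the configured DNS servers that point into a rogue range."""
    return [r for r in resolvers if is_rogue_resolver(r, rogue_ranges)]
```

The list of configured DNS servers is passed in by the caller, since how it is read differs from one operating system to another. An empty result from flag_rogue_resolvers means that none of the supplied addresses matched the supplied ranges, nothing more.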

--more--

Availability Topics

It’s Official! Leap Day Caused the Windows Azure Outage

In our March 2012 Never Again article entitled “Windows Azure Cloud Succumbs to Leap Year,” we related how Microsoft’s Windows Azure Platform as a Service (PaaS) cloud went down for a day and a half as the result of what appeared to be a Leap Day software bug. At the time, the conjecture was that validity dates for SSL (Secure Sockets Layer) certificates were calculated erroneously. As it turns out, the conjecture was pretty close.

Following in the path of Google and Amazon, which have been very transparent in describing publicly what happened during major outages, Microsoft has released a detailed timeline of exactly what went wrong in this major outage and the sometimes frantic efforts to restore service to its customers.

In this article, we summarize the events related by Microsoft. The bottom line is that a calculation intended to set a security certificate for new virtual machines to expire in one year yielded the result “February 29, 2013,” an invalid date. This error led to the erroneous conclusion that physical servers were failing, and the fault cascaded rapidly throughout the entire Azure cloud.
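
This kind of error is easy to reproduce. The Python sketch below is not Microsoft's code; it only illustrates the general mistake of computing "one year later" by adding one to the year, along with one possible correction. The function names are hypothetical, and falling back to February 28 rather than March 1 is an assumed policy choice, not something stated by Microsoft.

```python
from datetime import date


def naive_one_year_later(d):
    """Add one to the year and keep the month and day unchanged.

    For February 29 this asks for a date that does not exist in the
    following year, so Python raises ValueError.
    """
    return d.replace(year=d.year + 1)


def safe_one_year_later(d):
    """Add one year, falling back to February 28 when the day is invalid."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only February 29 can fail here. ASSUMPTION: expiring one day
        # early (Feb 28) is acceptable; March 1 is the other common choice.
        return d.replace(year=d.year + 1, day=28)
```

Python rejects the invalid date outright. A system that builds the date by some other means, for example by assembling a string, may instead pass the invalid value along to another component, where the failure is harder to trace.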

--more--

Recommended Reading

Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability of Computer-Based Systems

The book Beyond Redundancy provides an in-depth analysis of various approaches to geographical redundancy of IT systems to improve service availability. Among several recommendations, it concludes that the superior approach is the use of active/active systems with client-initiated failure detection.

Enterprises commonly make significant investments duplicating critical systems in geographically dispersed sites to improve service availability. Beyond Redundancy focuses on the theoretical and practical aspects of the benefits of georedundancy on service availability and reliability.

The book describes a variety of georedundant architectures. It uses Markov modeling to calculate the availability of various approaches. The book is not a casual read. It is intended for the serious student who needs to understand the various aspects of georedundant systems from an analytical as well as a practical viewpoint. As such, it serves two purposes:

It is an excellent treatise on the concepts behind georedundancy and the issues that must be considered in designing such systems. The complexity added by the Markov models can be ignored for this purpose.

For those who want to be able to specifically analyze various configurations, it provides the analytical tools to do so via its Markov models.

--more--

Product Reviews

Critical Date Testing – Leap Day and More

As we described in It’s Official! Leap Day Caused the Windows Azure Outage, our companion article in this issue, Microsoft explained in a very transparent blog post why the Windows Azure Cloud went down for a day and a half at the stroke of midnight GMT, as the clock ticked to February 29, 2012. It was, in fact, a Leap Day bug.

After the immense and largely successful effort expended to protect against the Y2K problem, one would think that date/time bugs would be a thing of the past. Evidently, not so. Azure was not the only fatality of Leap Day. Systems around the world felt its impact.

Even with all our Y2K experience, there are still many failures due to date and time bugs. These failures are avoidable. Products are available to thoroughly test applications to ensure that they handle critical dates and times successfully. These products are by and large noninvasive and require no program modifications. They can be used to offset times for testing or to simulate multiple time zones for system consolidation.

Date/time simulation products exist for most operating system environments, including those for IBM mainframes, HP NonStop servers, Linux, Windows, and most UNIX platforms.
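
The idea behind critical date testing can be sketched in a few lines of Python. This is not how the commercial products work: they are described as noninvasive, whereas the sketch below assumes that the date logic under test accepts the date as a parameter. The names critical_dates and find_failures, and the particular choice of dates, are hypothetical.

```python
import calendar
from datetime import date


def critical_dates(year):
    """Return an illustrative set of boundary dates for the given year."""
    dates = [
        date(year, 1, 1),    # first day of the year
        date(year, 2, 28),   # day before Leap Day or March 1
        date(year, 3, 1),    # day after the end of February
        date(year, 12, 31),  # last day of the year (day 366 in leap years)
    ]
    if calendar.isleap(year):
        dates.append(date(year, 2, 29))  # Leap Day itself
    return sorted(dates)


def find_failures(func, dates):
    """Run func on every date and collect the dates on which it raises."""
    failures = []
    for d in dates:
        try:
            func(d)
        except Exception as exc:  # record any failure rather than stopping
            failures.append((d, repr(exc)))
    return failures
```

For example, find_failures(lambda d: d.replace(year=d.year + 1), critical_dates(2012)) reports a single failure, on February 29, 2012. The same check run against critical_dates(2013) reports none, which is why tests confined to the current calendar year can miss the bug.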

--more--

Sign up for your free subscription at https://availabilitydigest.com/signups.htm

Would You Like to Sign Up for the Free Digest by Fax?

Simply print out the following form, fill it in, and fax it to:

Availability Digest

+1 908 459 5543

Name:

Email Address:

Company:

Title:

Telephone No.:

Address:

____________________________________

The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.

Managing Editor - Dr. Bill Highleyman, editor@availabilitydigest.com