The digest of current topics on Continuous Availability. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CA tells you how to avoid the effects of downtime.
In this issue:
Help! My Data Center is Down!
I am currently publishing a series of articles in The Connection magazine entitled Help! My Data Center is Down. The series deals with spectacular data-center failures that have occurred recently. The articles focus on different failure causes, including power outages, storage subsystem failures, Internet and intranet problems, upgrades gone wrong, and human errors. They are taken from our Never Again articles and provide many lessons from which we can all learn.
One of the Digest subscribers who also reads The Connection recently told me that the articles have had an impact on his data center. As each article is published, this data-center manager gathers his IT staff to discuss the lessons to be learned from the failures described and to determine operational improvements to avoid becoming a Never Again story.
This led us to conclude that these articles should get a wider circulation. Starting with the next issue of the Digest, we will publish these articles with kind permission from The Connection. Look for them, and use them to strengthen your data-center defenses.
Also, consider our availability seminars, which delve deeply into these and other issues to improve the availability of your IT resources.
Dr. Bill Highleyman, Managing Editor
A power failure on the evening of Sunday, August 7, 2011, took down an Availability Zone in Amazon’s Dublin, Ireland, data center, which houses Amazon’s European region for its Elastic Compute Cloud (EC2). Thousands of users in dozens of European countries found that they had access neither to their applications nor to their data. It was days before service was restored. The power utility initially reported that the power loss was caused by a lightning strike that made a massive transformer in an electrical substation outside of Dublin explode.
Why should a power failure cause days of havoc? Where were the backup generators? As it turns out, several factors led to a failure chain totally unanticipated by Amazon. Hardware faults, software bugs, and operator errors were all involved. Lightning was not.
As is Amazon’s practice, it was very forthcoming with updates on the status of the outage via its Service Health Dashboard. However, the complexity of the failure chain was evident in some of the confusion exhibited by Amazon as it tried to provide a running commentary on the situation.
Part 3 – High Availability Architectures
In the previous two parts of this series, we defined a variety of availability metrics and pointed out that data replication is fundamental to providing the data redundancy needed to achieve high and continuous availability. We explored various data-replication technologies and described their strengths and weaknesses.
Of particular interest are unidirectional and bidirectional replication engines, both asynchronous and synchronous. In Part 3, we look at a variety of highly available system architectures that use these data-replication technologies to achieve a wide range of availability characteristics.
After a look at magnetic-tape and virtual-tape backup systems, which provide disaster recovery but not high availability, we explain the high and continuous availability that can be achieved with unidirectional replication and with active/active systems. These architectures can also achieve zero data loss (RPO = 0) if synchronous replication is used.
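To see why synchronous replication can yield RPO = 0 while asynchronous replication cannot, consider the following minimal sketch. It is a toy model, not the behavior of any particular replication product: the names ReplicatedPair, commit, drain, and lost_on_source_failure are hypothetical, and the network, failure detection, and commit protocol are all reduced to in-memory list operations.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical database node; `log` holds applied changes in order."""
    name: str
    log: list = field(default_factory=list)


class ReplicatedPair:
    """Toy production/backup pair. `synchronous` selects the replication mode."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.source = Node("production")
        self.target = Node("backup")
        self.queue = []  # changes committed at the source but not yet replicated

    def commit(self, change: str) -> None:
        if self.synchronous:
            # Synchronous: the backup applies the change before the commit completes.
            self.target.log.append(change)
            self.source.log.append(change)
        else:
            # Asynchronous: commit locally now, replicate some time later.
            self.source.log.append(change)
            self.queue.append(change)

    def drain(self) -> None:
        """Deliver all queued changes to the backup (the asynchronous send step)."""
        self.target.log.extend(self.queue)
        self.queue.clear()

    def lost_on_source_failure(self) -> list:
        """Changes that would be lost if the production node failed right now."""
        # The backup log is always a prefix of the production log in this model.
        return self.source.log[len(self.target.log):]


def demo(synchronous: bool) -> list:
    pair = ReplicatedPair(synchronous)
    pair.commit("tx1")
    pair.commit("tx2")
    pair.drain()        # replication catches up (no effect in synchronous mode)
    pair.commit("tx3")  # production fails before the next drain
    return pair.lost_on_source_failure()


if __name__ == "__main__":
    print("asynchronous loss:", demo(synchronous=False))
    print("synchronous loss: ", demo(synchronous=True))
```

In this model, the asynchronous pair loses whatever is still in the replication queue when the production node fails, whereas the synchronous pair has nothing in flight to lose. The price, which the sketch does not model, is that every synchronous commit must wait for the backup.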
The historic evolution of the electric grid has striking similarities to the current evolution of the compute grid. Nicholas Carr, in his book The Big Switch: Rewiring the World, from Edison to Google, traces the parallel evolution of these two great technologies that have transformed and are transforming the world in which we live.
Carr points out the fascinating parallelism between the electric grid and what may become the compute grid. Each started with locally supported technologies – the private power generator and the mainframe data center – and evolved into distributed environments – the electric grid and the compute cloud, which may itself evolve into the compute grid, the World Wide Computer.
The widespread availability of electricity caused many changes in our society. Likewise, the widespread availability of computing power is having significant social and economic impacts, both good and bad. As the author concludes:
“It’s clear that two of the hopes most dear to the Internet optimists – that the Web will create a more bountiful culture and that it will promote greater harmony and understanding – should be treated with skepticism. Cultural impoverishment and social fragmentation seem equally likely outcomes.”
The key component for achieving high availability and continuous availability is data replication. It is the replication engine that maintains a remote copy of the production database so that data is protected against any disaster that might befall the production database.
By many measures, the Oracle database is the most widely used database today for critical corporate applications. Oracle has a rich set of data-replication capabilities to support a range of availability and data-protection levels. In this article, we look at these Oracle products. They include Oracle Data Guard, Oracle GoldenGate, and Oracle Streams.
Oracle is making a major shift in its data-replication strategy, moving from Oracle Streams to its newly acquired GoldenGate replication engine. Industry conjecture is that GoldenGate’s heterogeneous capabilities, which Streams lacks, give Oracle a powerful way to integrate its database into environments that do not currently use Oracle. GoldenGate will provide the mechanism for Oracle to become the database of record in these environments, accepting data from other databases and distributing Oracle’s centralized data to those databases.
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman email@example.com.
© 2011 Sombers Associates, Inc., and W. H. Highleyman