The digest of current topics on Continuous Availability. More than Business Continuity Planning.
BCP tells you how to recover from the effects of downtime.
CA tells you how to avoid the effects of downtime.
In this issue:
Join Us At BITUG’s Big SIG
The British Isles HP NonStop User Group (BITUG) is hosting its Big SIG on December 7th and 8th (www.bitug.com). Promoted as the largest dedicated NonStop event in the world in 2011, the Big SIG will be held in London at the historic Trinity House next to the Tower of London. Wednesday, December 7th, is dedicated to an education day. Thursday, December 8th, is the meeting proper with two dozen technical presentations.
I am honored to be giving Thursday’s keynote address, entitled “Help! My Data Center is Down!” In this presentation, I describe several spectacular data-center failures that were caused by unimaginable events. These experiences show that no matter what steps you take to protect your data center, something out there is lurking to take you down. Even your critical applications running on NonStop servers are not immune.
I encourage you to attend this major NonStop event and see what lessons you can take back to your data center based on the disaster stories we will discuss.
Dr. Bill Highleyman, Managing Editor
Data Center Monitoring with Open-Source Nagios
A primary requirement for achieving high availability is to be able to act proactively, not reactively, to problems as they arise. Problems should be detected at the earliest possible moment so that automated or manual actions can be taken to correct the situation. In order to accomplish this, a monitoring system that integrates all systems into a single data center-wide view must be in place.
BV Zahlungssysteme, or BV Payment Systems in English, provides services for card-based payment transactions and electronic banking for German banks. Its credit-card, debit-card, and online banking services must always be available, as their failure can bring German retail commerce to a halt.
Four HP NonStop servers comprise the heart of the company’s financial-service processing architecture. Supporting the NonStop servers are many Unix, Linux, and Windows servers. To keep this complex operational, it is imperative to be able to monitor all of the servers and the other data-center components with a single system monitor.
Unfortunately, currently available monitors that keep tabs on commodity servers do not support NonStop servers. BV Payment Systems undertook a project to extend the open-source Nagios monitor to NonStop servers so that the company can monitor its two data centers via a “single pane of glass.”
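To make the idea of extending Nagios concrete, here is a minimal sketch of the kind of check a Nagios plugin performs. It is not BV Payment Systems' implementation. It relies only on the standard Nagios plugin convention: a plugin prints one status line (optionally followed by "|" and performance data) and exits with 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. The metric (CPU busy percentage), the thresholds, and the function name evaluate are assumptions made for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(cpu_busy_pct, warn=80.0, crit=95.0):
    """Map a measurement to a Nagios exit code and a one-line status message.

    cpu_busy_pct is a hypothetical metric; how it would be collected from a
    NonStop server is outside the scope of this sketch.
    """
    if cpu_busy_pct is None:
        # No data could be collected, so the state is unknown.
        return UNKNOWN, "CPU UNKNOWN - no measurement available"
    if cpu_busy_pct >= crit:
        status = CRITICAL
    elif cpu_busy_pct >= warn:
        status = WARNING
    else:
        status = OK
    # Text after the "|" is performance data: label=value;warn;crit
    message = "CPU %s - busy %.1f%% | busy=%.1f%%;%.1f;%.1f" % (
        LABELS[status], cpu_busy_pct, cpu_busy_pct, warn, crit)
    return status, message
```

A wrapper script would print the message and pass the status to sys.exit, which is how Nagios learns the state of the monitored component.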
Help! My Data Center is Down! Part 2: Storage Outages
Increasingly, the data center has become part of the lifeblood of a company. If the data center goes down, so do many of the services that a company provides to its customers, vendors, and employees.
In our previous article in this series, we discussed several unimaginable, power-related events that took out data centers, with outages lasting hours and even days. These ranged from a truck driver’s heart attack and a battery-room explosion to the simple act of plugging in a coffee pot. The failure to keep a tree trimmed triggered the great Northeast Blackout of 2003.
In this article, we look at some spectacular storage-system failures. Corporate data is one of the most prized assets of a company. Companies do everything they can to protect the integrity of their data, from maintaining real-time remote backups to long-term offsite storage. Unfortunately, as we shall see, the media is replete with horror stories of companies that have lost their data for long periods of time or forever.
A major step forward in achieving high availability in the cloud is Amazon’s Availability Zones. Availability Zones allow a company to run multiple instances of its critical applications in different data centers so that the applications can survive even a data-center failure.
There have been several spectacular cloud failures recently, with outages lasting from hours to days, due to a wide variety of causes: power, storage, networks, and people. These outages cut across all cloud-service providers, large and small; Amazon and Google have both contributed their share. A lesson to be learned from such outages is that the root cause of the next cloud failure is probably unimaginable.
Amazon’s Availability Zones provide a powerful approach to guarantee survivability of critical applications even if an entire Availability Zone should fail. Each Availability Zone is an independent data center that is fault-isolated from other Availability Zones. Application instances can be run in two or more Availability Zones either as multiple operational instances or as active/backup pairs. Should an Availability Zone fail, an instance in another Availability Zone can take over the processing of the application instance in the failed Availability Zone.
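The takeover logic described above can be sketched as a small simulation. This is not Amazon's API and makes no AWS calls; the Instance class, the route and fail_zone functions, and the zone names are hypothetical and exist only to illustrate how an instance in a surviving zone picks up the work of a failed one.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str             # Availability Zone hosting this instance (hypothetical name)
    healthy: bool = True


def route(instances, preferred_zone):
    """Pick the instance that should serve requests.

    Prefer a healthy instance in the preferred zone; otherwise fail over
    to the first healthy instance in any other zone.
    """
    healthy = [i for i in instances if i.healthy]
    if not healthy:
        raise RuntimeError("no healthy instance in any Availability Zone")
    for inst in healthy:
        if inst.zone == preferred_zone:
            return inst
    return healthy[0]


def fail_zone(instances, zone):
    """Simulate the loss of an entire zone by marking its instances unhealthy."""
    for inst in instances:
        if inst.zone == zone:
            inst.healthy = False
```

With one instance in each of two zones, route returns the instance in the preferred zone until fail_zone is called for that zone, after which it returns the instance in the other zone. The same function covers both the active/backup case and the case of multiple operational instances, since it only needs to know which instances are still healthy.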
FileSync and CSR Synchronize NonStop Systems: Part 2 – Command Stream Replicator
Failover to a backup system often fails because the backup system’s software configuration is different from that being run by the production system. We call this configuration drift.
For HP NonStop systems, NonStop RDF and third-party data replication engines synchronize database contents. FileSync from TANDsoft synchronizes files. What remains unsynchronized are the configuration changes that operators enter via a variety of utilities.
Command Stream Replicator (CSR) from TANDsoft fills in the last piece of the configuration-synchronization puzzle. CSR replicates specified operator commands entered on the production system to the backup system or to other target systems in order to keep the system configurations synchronized.
Command Stream Replicator replicates everything that other replicators don’t. It requires no application modifications, nor does it require access to the utility source code. It simply intercepts commands as they are entered and sends them to a target system for execution on that system.
CSR improves failover reliability to a backup system by ensuring that the production and backup systems are uniformly configured. It supports the replication of configuration changes to all systems in an active/active configuration. The results are reliable failovers and a significant simplification of NonStop system administration procedures in a multisystem environment.
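The intercept-and-forward idea behind command replication can be illustrated with a toy model. This is not TANDsoft's implementation or interface. The CommandReplicator class, the prefix-based filter, the in-memory queues standing in for target systems, and the sample commands are all assumptions made for the sketch.

```python
class CommandReplicator:
    """Toy model: run a command locally, then forward it if it matches a filter."""

    def __init__(self, prefixes, targets):
        # Commands beginning with one of these prefixes are replicated.
        self.prefixes = tuple(p.upper() for p in prefixes)
        # Maps a target system name to a list acting as that system's command queue.
        self.targets = targets

    def should_replicate(self, command):
        return command.strip().upper().startswith(self.prefixes)

    def intercept(self, command, execute_locally):
        # Execute on the production system first. If this raises,
        # nothing is forwarded to the targets.
        result = execute_locally(command)
        if self.should_replicate(command):
            for queue in self.targets.values():
                queue.append(command)
        return result
```

For example, with prefixes "SET " and "ALTER " and a single target named "backup", a hypothetical command such as "set limit 10" is executed locally and queued for the backup, while "SHOW STATUS" is executed locally only. Filtering on the command text is one simple way to model replicating only specified commands; a real product would decide this differently.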
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman firstname.lastname@example.org.
© 2011 Sombers Associates, Inc., and W. H. Highleyman