Outage Information is Too Valuable to Hide
We all have data-center outages. However, though the media reports on them, we seldom hear what caused the outages and how they were resolved. Imagine how much more reliable we could make our own data-center operations if we could leverage the experience of others.
In this issue’s article entitled “Iowa’s Data Center Taken Down by Fire,” we describe the steps that the U.S. State of Iowa took to recover from a fire. This is an example of an organization that freely disclosed its experience for the benefit of us all.
Also, in this issue’s article entitled “Let’s Share Outage Information for the Benefit of All,” our guest author, Andrew Gallo, discusses how outage information is shared freely in many other industries but not in the IT world. He lays out the information that should be freely disseminated about any outage.
True, security, competition, and legal liability may constrain what we can disclose. However, Amazon and Google are excellent examples of organizations that have found their way around these limitations, as is evidenced in many of our Never Again stories. If you would like to provide such written open disclosure to the industry but are constrained by staff time, look to our technical writing services to help you.
Dr. Bill Highleyman, Managing Editor
Heartbleed is a flaw in the OpenSSL cryptographic software library, which provides communication security over the Internet. Heartbleed allows attackers to read memory data from both client and server devices to obtain private keys, passwords, and user names. Attackers can then use this information to decrypt communications to and from these devices, to attack user accounts on other web sites, and to impersonate the infiltrated web site.
Heartbleed was introduced in a released version of OpenSSL in March, 2012, and was not discovered until April, 2014, two years later. Heartbleed is relatively easy to exploit and leaves no trace.
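To make the flaw concrete, here is a toy Python model of a memory over-read. It is not OpenSSL code. It assumes the widely documented mechanism behind Heartbleed: a heartbeat request states its own payload length, and the flawed code echoed back that many bytes without checking the claim against what was actually sent. The function names and the contents of "memory" are hypothetical.

```python
# Toy model of a heartbeat over-read. NOT OpenSSL code; all names are hypothetical.
# "memory" stands in for process memory in which the request payload happens
# to sit next to unrelated secret data.

def build_memory(payload: bytes, secret: bytes) -> bytes:
    """Place the received payload directly in front of some secret bytes."""
    return payload + secret

def heartbeat_vulnerable(memory: bytes, claimed_len: int) -> bytes:
    """Echo back as many bytes as the peer CLAIMS to have sent (no check)."""
    return memory[:claimed_len]

def heartbeat_fixed(memory: bytes, actual_len: int, claimed_len: int) -> bytes:
    """Discard any request whose claimed length exceeds what was really sent."""
    if claimed_len > actual_len:
        return b""
    return memory[:claimed_len]

payload = b"ping"                              # 4 bytes actually sent
memory = build_memory(payload, b"PRIVATE-KEY") # secret sits after the payload

leaked = heartbeat_vulnerable(memory, 15)      # peer claims 15 bytes
safe = heartbeat_fixed(memory, len(payload), 15)

print(leaked)  # b'pingPRIVATE-KEY'
print(safe)    # b''
```

Because the over-read looks like an ordinary reply to an ordinary request, nothing in this model would distinguish an attack from legitimate traffic, which is consistent with the observation that exploitation leaves no trace.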
It is estimated that 17% of all secure web sites use the flawed version of OpenSSL. The rush is on to upgrade these systems with the corrected version of the software.
The good news is that hackers do not appear to have discovered the Heartbleed vulnerability during its two years of existence, any more than security specialists did.
However, this may not be true of government agencies, which are more focused on finding security vulnerabilities for intelligence gathering. These agencies often employ large groups of security specialists for just such a purpose.
A fire took down Iowa’s primary data center for the entire U.S. state. An orderly progression to restore service is discussed in this article, and the article concludes with several lessons learned.
An important effort that is often overlooked is communication. An incident such as this requires communication with a wide range of parties: management, IT restoration teams, affected agencies and users, the press, and local government, fire, and police officials. In all of these cases, the Iowa data center excelled at keeping people informed.
A difficult decision that had to be made by Iowa’s data-center management team was whether to fail over to the backup data center or to try to restore the primary data center. In this case, the team decided to restore the primary data center. If the alternate decision had been made, would the failover have been successful? Only periodic testing of failover procedures can answer this question.
Publishing the procedures, successes, failures, and lessons learned in an incident such as this, as was done by the State of Iowa, is a benefit to us all. We wish more organizations would adopt this practice.
Outages happen. Much of what we do as engineers is to design systems to avoid or minimize the chance and impact of outages. It is important to share our lessons learned after an analysis of an outage. The tough lessons learned in these situations are too valuable to others to be kept secret.
A good report should contain a detailed description of the event, how it was recognized, the scope and severity of the outage, and what was done to restore service. The personnel involved and the key decisions (along with their justifications and whether they proved to be the right decisions) should be included.
The summary should contain a direct cause analysis, contributing cause analysis, and a root cause analysis:
• Direct cause – what led directly or immediately to the occurrence
• Contributing cause – factors that by themselves would not have caused a problem but that, when present, worsened the problem
• Root cause – the underlying conditions that led to the outage. Had the root causes not been present, the outage would not have happened.
Additionally, the summary should include remediation of these causes so that the outage will not be repeated in the future (at least by the same causes).
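As one possible way to capture the report contents listed above in a consistent form, the following is a minimal Python sketch. The class name, field names, and the completeness check are assumptions made for illustration; they are not a format proposed by the article.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OutageReport:
    """Hypothetical record mirroring the report contents described above."""
    description: str                 # detailed description of the event
    how_recognized: str              # how the outage was detected
    scope_and_severity: str          # who and what was affected, and how badly
    restoration_steps: List[str]     # what was done to restore service
    personnel: List[str]             # people or teams involved
    key_decisions: List[str]         # decisions, justifications, and outcomes
    direct_cause: str                # what led immediately to the occurrence
    contributing_causes: List[str]   # factors that worsened the problem
    root_causes: List[str]           # underlying conditions behind the outage
    remediation: List[str] = field(default_factory=list)  # fixes for the causes

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]

# Usage with placeholder text; the remediation section is left empty on purpose.
report = OutageReport(
    description="example event",
    how_recognized="monitoring alarm",
    scope_and_severity="all services unavailable",
    restoration_steps=["step 1", "step 2"],
    personnel=["operations team"],
    key_decisions=["restore primary rather than fail over"],
    direct_cause="example direct cause",
    contributing_causes=["example contributing cause"],
    root_causes=["example root cause"],
)
print(report.missing_sections())  # ['remediation']
```

A check such as `missing_sections` gives a reviewer a quick way to see that a report is not ready to publish until the remediation has been written down.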
Bitcoin is a digital currency that made its debut in 2009. We described in some detail how bitcoins work in last month’s article entitled “Mt. Gox, Largest Bitcoin Exchange, Goes Belly Up.”
Bitcoin mining is the way in which new bitcoins are minted (digitally, that is). Mining involves packaging bitcoin transactions into blocks and appending them to the bitcoin block chain that records every bitcoin transaction. For each block that a miner adds to the block chain, the miner is rewarded with 25 bitcoins. At today’s bitcoin price of about $600 USD, this amounts to $15,000. Sounds like a fast way to make a lot of money.
The backup service iDrive decided to try its hand at bitcoin mining. It put 600 of its servers to work mining bitcoins. After a bit of experience, it calculated that it would earn around 0.4 of a bitcoin per year – about $240!!
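The arithmetic behind these figures can be checked with a few lines of Python. The numbers are the ones quoted above; the per-server and years-per-reward values are simple derivations added here for illustration and assume the price and earning rate stay constant.

```python
BTC_PRICE_USD = 600        # approximate price quoted above
BLOCK_REWARD_BTC = 25      # reward per block quoted above
IDRIVE_SERVERS = 600       # servers iDrive put to work
IDRIVE_BTC_PER_YEAR = 0.4  # iDrive's estimated yearly earnings

# Value of one block reward in dollars.
block_reward_usd = BLOCK_REWARD_BTC * BTC_PRICE_USD

# iDrive's estimated yearly earnings in dollars, in total and per server.
idrive_usd_per_year = round(IDRIVE_BTC_PER_YEAR * BTC_PRICE_USD, 2)
usd_per_server_per_year = round(idrive_usd_per_year / IDRIVE_SERVERS, 2)

# Years iDrive's farm would need to earn the equivalent of one block reward.
years_per_block_reward = round(BLOCK_REWARD_BTC / IDRIVE_BTC_PER_YEAR, 1)

print(block_reward_usd)         # 15000
print(idrive_usd_per_year)      # 240.0
print(usd_per_server_per_year)  # 0.4
print(years_per_block_reward)   # 62.5
```

In other words, each server would earn about 40 cents per year, and the whole farm would need more than six decades to match the reward for a single block.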
How can mining for bitcoins be so difficult? The answer is that creating a legitimate block of transactions requires an enormous amount of computation, and the difficulty has increased over time as more mining power has joined the network. In this article, we explore what is involved in bitcoin mining.
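The flavor of the computation can be seen in a toy proof-of-work search, sketched below in Python. It is a simplification under stated assumptions: Bitcoin does hash with double SHA-256 and compare the result against a target, but a real block header has a fixed 80-byte layout and a far lower target. The function names, the 8-byte nonce, and the `difficulty_bits` parameter are hypothetical.

```python
import hashlib

def block_hash(block_data: bytes, nonce: int) -> bytes:
    """Double SHA-256 over the block data plus a nonce (toy layout)."""
    payload = block_data + nonce.to_bytes(8, "little")
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()

def is_valid(block_data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """A hash is acceptable if, read as a number, it falls below the target."""
    target = 1 << (256 - difficulty_bits)
    return int.from_bytes(block_hash(block_data, nonce), "big") < target

def mine(block_data: bytes, difficulty_bits: int, max_nonce: int = 2_000_000):
    """Try nonces in order until one yields an acceptable hash."""
    for nonce in range(max_nonce):
        if is_valid(block_data, nonce, difficulty_bits):
            return nonce, block_hash(block_data, nonce).hex()
    return None  # no solution found within max_nonce attempts

# Each extra difficulty bit doubles the expected number of attempts:
# 12 bits needs about 4,096 tries on average, 20 bits about a million.
result = mine(b"toy block of transactions", difficulty_bits=12)
print(result)
```

Finding a nonce takes many attempts, but checking one takes a single hash. That asymmetry is what lets the rest of the network verify a miner's work cheaply, and raising the number of difficulty bits is the toy equivalent of the rising difficulty that made iDrive's 600 servers so unproductive.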
A challenge every issue for the Availability Digest is to determine which of the many availability topics out there win coveted status as Digest articles. We always regret not focusing our attention on the topics we bypass.
With our new Twitter presence, we don’t have to feel guilty. This article highlights some of the @availabilitydig tweets that made headlines in recent days.
Sign up for your free subscription at http://www.availabilitydigest.com/signups.htm
Would You Like to Sign Up for the Free Digest by Fax?
Simply print out the following form, fill it in, and fax it to:
+1 908 459 5543
The Availability Digest is published monthly. It may be distributed freely. Please pass it on to an associate.
Managing Editor - Dr. Bill Highleyman email@example.com.
© 2014 Sombers Associates, Inc., and W. H. Highleyman