Large-Scale Disk Drive Failure Rate Research
Although a huge and growing share of new information, by some estimates up to 90%, is stored primarily or exclusively on hard disk drives, large-scale research into why, how and when disk drives fail had not been published until now. Google has just published a research paper, "Failure Trends in a Large Disk Drive Population", written by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, three scientists working for Google, Inc.
The paper reports research involving more than one hundred thousand disk drives, part of Google's own infrastructure. Most previous disk drive failure research has been published by drive manufacturers and is based either on warranty failure reports (hence covering only a short period following deployment) or on accelerated, extrapolated tests (which cannot predict actual failure rates with any practical reliability); this research instead monitored the drives in a real end-user scenario. Google monitors its drives every few minutes, collecting real-time information such as the parameters reported by the built-in SMART system, environmental factors such as operating temperature, and activity levels. The results of these measurements are stored in a huge database, and data mining techniques are subsequently used to derive correlations, or demonstrate the lack of correlations, between failure events and the measurements. The researchers defined a 'failure' as an event leading to the replacement of a drive.
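The monitoring approach described above can be sketched in a few lines of Python. This is an illustration only: the field names (drive_id, smart_attrs, temperature_c) and the structure are assumptions for the sake of the example, not Google's actual schema, and a real deployment would use a database rather than in-memory dictionaries.

```python
# Minimal sketch of a drive-monitoring pipeline: periodic SMART snapshots
# per drive, plus a failure record driven by drive replacements.
# All field names are illustrative assumptions, not Google's actual schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DriveReading:
    drive_id: str
    timestamp: datetime
    temperature_c: float   # environmental factor
    activity_level: float  # utilisation metric
    smart_attrs: dict      # raw SMART counters, e.g. {"scan_errors": 0}

class DriveMonitor:
    """Collects periodic readings per drive; a drive is marked failed
    when it is replaced (the paper's definition of 'failure')."""
    def __init__(self):
        self.readings = {}   # drive_id -> list of DriveReading
        self.failed = set()  # drive_ids that were replaced

    def record(self, reading: DriveReading):
        self.readings.setdefault(reading.drive_id, []).append(reading)

    def mark_failed(self, drive_id: str):
        self.failed.add(drive_id)

    def failure_rate(self, predicate) -> float:
        """Failure rate among drives with at least one reading
        satisfying `predicate` (a simple correlation probe)."""
        group = [d for d, rs in self.readings.items()
                 if any(predicate(r) for r in rs)]
        if not group:
            return 0.0
        return sum(1 for d in group if d in self.failed) / len(group)
```

A query such as `monitor.failure_rate(lambda r: r.smart_attrs.get("scan_errors", 0) > 0)` then compares the failure rate of drives showing a given symptom against the rest of the population.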
The findings of this effort are very interesting. The researchers found no evidence of higher failure rates for drives running at higher operating temperatures or for drives that are used more heavily, undermining two widely held beliefs. The research could not prove that no such correlations exist, but the absence of a strong correlation indicates that other factors are more important in influencing failure rates in a data centre environment.
The research also indicated that SMART data are not reliable indicators of the future failure of individual drives: many of the drives under examination failed without displaying any SMART warnings beforehand. Some SMART warnings, however, were found to correlate with failures. In particular, drives were 39 times more likely to fail within 60 days of their first scan error being detected than drives that had never recorded one. Likewise, drives that had reported reallocation counts, offline reallocation counts or probational counts were more likely to fail than drives showing none of these symptoms.
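A figure like "39 times more likely" is a risk ratio: the probability of failing within the window given the symptom, divided by the probability of failing without it. A minimal sketch, with invented counts in the usage example:

```python
# Risk ratio of failure given a SMART symptom (e.g. a first scan error)
# versus no symptom, over the same observation window.

def relative_risk(failed_with, total_with, failed_without, total_without):
    """P(fail | symptom) / P(fail | no symptom)."""
    p_with = failed_with / total_with
    p_without = failed_without / total_without
    return p_with / p_without
```

For example, if 39 of 1,000 drives with a scan error failed within 60 days against 1 of 1,000 without (hypothetical numbers), `relative_risk(39, 1000, 1, 1000)` gives a risk ratio of 39.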
Thus, whereas SMART data can provide good aggregate indicators across a large drive population (e.g. for use in supply chain management), they are not on their own sufficient to predict possible failures of individual drives.
Google's research into this issue is ongoing, and future papers will hopefully cast more light on these questions. Another important area of missing research is a large-scale examination of how long data can reliably be stored on writable media such as CDs and DVDs. IBM has carried out some research in this field but never published the results. (Niels Bjergstrom)