RAID-Z, scrub and probabilities of reconstruction failure

RAID-Z, scrub and probabilities of reconstruction failure

Post by grahamperrin » Thu Oct 18, 2012 9:34 am

At viewtopic.php?p=1009#p1009

si-ghan-bi wrote: You have a 10x2TB array in RAIDZ1?
You may read this:
http://www.zdnet.com/blog/storage/why-r ... n-2009/162


Also: RAID 5 and Uncorrectable Read Errors, and so on … but nothing relating to ZFS or scrub.

With a bunch of disks used for RAID-Z1 plus suitably frequent scrubbing, I imagine that:

  1. the probability of URE affecting a second disk during RAID-Z1 reconstruction is lower than
  2. the probability for a comparable bunch with RAID-5 without ZFS (with neither prior scrub nor prior read verification routines).

True? Or just my imagination?

Meantime I'm reading The RAID Reliability Anthology – Part 1 – The Primer (2010-08-30), which does include the magic word scrub …

Postscripts

How ZFS handles online replacement in a RAID-Z (theoretical) – interesting, but doesn't answer my probability question.

Calculators

Whilst bookmarking pages such as Free RAID Calculator - Caclulate RAID Array Capacity and Fault Tolerance., RAID Calculator - International Computer Concepts (ICC) and RAID 5 and Uncorrectable Read Errors I wished for a calculator to include ZFS-related options such as:

  • RAID-Z1
  • RAID-Z2
  • RAID-Z3
  • frequency of scrub

Such calculators are open to criticism but still, there's the wish.
grahamperrin Offline

Posts: 1596
Joined: Fri Sep 14, 2012 10:21 pm
Location: Brighton and Hove, United Kingdom

Re: RAID-Z, scrub and probabilities of reconstruction failure

Post by si-ghan-bi » Thu Oct 18, 2012 11:17 am

I don't think that previous history (i.e. scrubs) makes any difference when you calculate that statistic: every time you read, you face the same probability.
With about 1.4E14 bits to read and a URE rate of one per 1E14 bits (so an error is more likely than not within the first 1E14 bits alone), you are definitely gambling.
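
For anyone who wants to check the numbers, here is a minimal Python sketch of the simple model behind this kind of estimate: every bit read fails independently with a fixed probability. The rate of one URE per 1E14 bits is the commonly quoted consumer-drive specification and is an assumption here, as are the function name and the reading of "10x2TB in RAIDZ1" as nine surviving 2 TB disks read in full during the rebuild.

```python
import math

def p_at_least_one_ure(bits_read, ure_per_bit=1e-14):
    """Probability of at least one URE while reading `bits_read` bits.

    Assumes independent errors at a fixed per-bit rate (a modelling
    assumption, not a measured property of any particular drive).
    """
    # 1 - (1 - p)^n, computed via log1p/expm1 so tiny p does not lose precision
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# Nine surviving 2 TB disks (decimal TB), 8 bits per byte
bits_rebuild = 9 * 2e12 * 8

print(f"bits read during rebuild: {bits_rebuild:.3g}")
print(f"P(URE) over 1E14 bits:    {p_at_least_one_ure(1e14):.1%}")
print(f"P(URE) over the rebuild:  {p_at_least_one_ure(bits_rebuild):.1%}")
```

Under these assumptions the result is about 63% for 1E14 bits and about 76% for the full rebuild. The model knows nothing about scrubs; whether a recent scrub changes the per-bit rate is exactly the open question of this thread.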
si-ghan-bi Offline


 
Posts: 145
Joined: Sat Sep 15, 2012 5:55 am

Re: RAID-Z, scrub and probabilities of reconstruction failure

Post by grahamperrin » Thu Oct 18, 2012 11:53 am

Thanks. I'll need to read more (later) to get my head around all of this!

Cross references

Post by grahamperrin » Sun Dec 09, 2012 7:20 pm

wished for a calculator to include ZFS-related options such as:

  • RAID-Z1
  • RAID-Z2
  • RAID-Z3
  • frequency of scrub

Such calculators are open to criticism but still, there's the wish.


Calculation of RAID reliability (2012-12-09)

RAID-Z Calculator (2013-01-13)
Last edited by grahamperrin on Mon Jan 14, 2013 1:06 am, edited 1 time in total.

Re: RAID-Z, scrub and probabilities of reconstruction failure

Post by si-ghan-bi » Mon Dec 10, 2012 11:42 am

The formula used by http://www.raidtips.com/raid5-ure.aspx, which is limited to URE issues (that is, it applies during rebuild), is correct and is explained well.
The calculation from http://www.servethehome.com/raid-calculator/raid-reliability-calculator-simple-mttdl-model/ gives probabilities that are too low, which suggests that it models mechanical failures (those are NOT UREs) rather than UREs.
Neither offers a reasonable formula or explanation for RAID6/Z2/Z3, where multiple parity disks are read.

My attempt: the probability of a URE on the first disk AND a URE on the second disk has to be the square of the single-parity (one-disk) probability. A 3 TB RAID1 with two-disk redundancy gives 21% * 21% = 4.5%.
However, we are not reading the WHOLE second disk, only the sectors affected by the URE. I think we have to multiply the probability of a URE on the first disk by the probability of having one in the same 4 KB sector of the second, where p is the per-bit URE probability: 21% * [1 - (1-p)^(4096*8)] = about zero.
In other words, two-disk parity already protects far beyond what UREs require.
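
The two estimates above can be reproduced with a short Python sketch. It assumes p = 1E-14 per bit (the post does not state p, but this value gives the 21% figure for 3 TB), decimal terabytes, and independent errors; the names are purely illustrative.

```python
import math

URE_PER_BIT = 1e-14  # assumed per-bit URE probability (not stated in the post)

def p_ure(bits, p=URE_PER_BIT):
    """P(at least one URE in `bits` bits), assuming independent bit errors."""
    return -math.expm1(bits * math.log1p(-p))

disk_bits = 3e12 * 8      # one 3 TB disk, decimal TB
sector_bits = 4096 * 8    # one 4 KB sector

p_disk = p_ure(disk_bits)                     # URE somewhere on one whole disk
p_both_naive = p_disk ** 2                    # first estimate: square of the above
p_same_sector = p_disk * p_ure(sector_bits)   # second estimate: same-sector hit

print(f"single disk:          {p_disk:.1%}")
print(f"both disks (naive):   {p_both_naive:.1%}")
print(f"same 4 KB sector:     {p_same_sector:.1e}")
```

With these assumptions the same-sector figure is on the order of 1E-10, which is the "about zero" of the estimate above.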
The real problem is mechanical failure, which takes out the whole disk! After such a failure you no longer have multiple redundancy. These failures, however, depend on MTBF. Using 1E6 hours as MTBF, I think the probability of failure per hour is 1E-6. If the drive (3 TB) is being rebuilt at a speed of 10 MB/s, the rebuild takes somewhat over 80 hours, so, regardless of the probability of the drive having survived to that point in time, the additional risk during the rebuild is about 9E-5 = 0.009%. However, I think the part about MTBF is more complicated than this, because I expect a young disk to withstand the rebuild better than an older disk. I think it depends on the width of the Gaussian distribution whose center is 1E6 hours! But we don't have that value, and I don't remember statistics too well (except the basics, see URE probability).
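
The rebuild-window arithmetic can be sketched in Python as follows. It assumes a constant failure rate of 1/MTBF per hour (an exponential lifetime model, which deliberately ignores the age effect raised above) and decimal units for TB and MB; the variable names are illustrative only.

```python
import math

MTBF_HOURS = 1e6          # figure used in the post
DISK_BYTES = 3e12         # 3 TB, decimal (assumption)
REBUILD_RATE = 10e6       # 10 MB/s, decimal (assumption)

# Time needed to rewrite the whole replacement disk
rebuild_hours = DISK_BYTES / REBUILD_RATE / 3600

# Constant-hazard model: P(failure within t) = 1 - exp(-t / MTBF)
p_fail_during_rebuild = -math.expm1(-rebuild_hours / MTBF_HOURS)

print(f"rebuild time:           {rebuild_hours:.1f} h")
print(f"P(one given disk fails): {p_fail_during_rebuild:.1e}")
```

With decimal units this comes to roughly 83 hours and about 8.3E-5 for one given disk, the same order of magnitude as the 9E-5 quoted above. For an array, the chance that any one of the remaining disks fails during the rebuild is roughly this value times the number of remaining disks.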

Maybe a question in the math section of StackExchange would be the best solution.

