Scrub: interpreting the counts of errors and data errors


During and after a scrub

Post by grahamperrin » Thu Mar 14, 2013 2:05 am

For reference only, this morning …

During a scrub

Code: Select all
macbookpro08-centrim:~ gjp22$ sudo zpool status -v twoz
Password:
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub in progress since Wed Mar 13 23:08:56 2013
    331Gi scanned out of 347Gi at 15.4Mi/s, 0h18m to go
    2.76Mi repaired, 95.25% done
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     3
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0    76  at disk3s4  (repairing)

errors: Permanent errors have been detected in the following files:

        twoz:/macbookpro08-centrim.sparsebundle/bands/606e
        twoz:/macbookpro08-centrim.sparsebundle/bands/5f6e
        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252
        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/4aca
        twoz@2013-03-12-224158:/macbookpro08-centrim.sparsebundle/bands/606e
macbookpro08-centrim:~ gjp22$ clear


Code: Select all
macbookpro08-centrim:~ gjp22$ zpool status twoz
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub in progress since Wed Mar 13 23:08:56 2013
    331Gi scanned out of 347Gi at 15.4Mi/s, 0h18m to go
    2.76Mi repaired, 95.30% done
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     3
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0    76  at disk3s4  (repairing)

errors: 7 data errors, use '-v' for a list
macbookpro08-centrim:~ gjp22$ clear


Code: Select all
macbookpro08-centrim:~ gjp22$ zpool get capacity twoz
NAME  PROPERTY  VALUE   SOURCE
twoz  capacity  44%     -
macbookpro08-centrim:~ gjp22$ zfs get copies twoz
NAME  PROPERTY  VALUE   SOURCE
twoz  copies    3       local
macbookpro08-centrim:~ gjp22$ clear


Worth noting: during the scrub, the Time Machine .sparsebundle within this pool received writes from at least one backup.

After the scrub

Code: Select all
macbookpro08-centrim:~ gjp22$ date
Thu 14 Mar 2013 06:35:47 GMT
macbookpro08-centrim:~ gjp22$ zpool status twoz
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 3.49Mi in 6h33m with 3 errors on Thu Mar 14 05:42:44 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     3
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0    87  at disk3s4

errors: 3 data errors, use '-v' for a list
macbookpro08-centrim:~ gjp22$ clear


Code: Select all
macbookpro08-centrim:~ gjp22$ sudo zpool status -v twoz
Password:
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 3.49Mi in 6h33m with 3 errors on Thu Mar 14 05:42:44 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     3
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0    87  at disk3s4

errors: Permanent errors have been detected in the following files:

        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252
        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/4aca
        twoz@2013-03-12-224158:/macbookpro08-centrim.sparsebundle/bands/606e
macbookpro08-centrim:~ gjp22$
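
For anyone comparing these outputs by eye: the device-level CKSUM counter and the data-error count move independently in the transcripts above (76 and 7 during the scrub, 87 and 3 afterwards). Below is a minimal sketch for pulling both numbers out of saved `zpool status` text. The function name `parse_zpool_status` is hypothetical, and it assumes the plain layout shown in this thread (whole-number counters, one `errors:` summary line); it is not part of ZEVO or any ZFS tool.

```python
import re


def parse_zpool_status(text):
    """Extract per-name READ/WRITE/CKSUM counters and the data-error count.

    Assumes the layout seen in this thread. Counters that are not plain
    integers (for example abbreviated values) are skipped.
    """
    counters = {}
    data_errors = None
    in_config = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("NAME") and "CKSUM" in stripped:
            in_config = True  # the rows that follow are the config table
            continue
        if in_config:
            fields = stripped.split()
            # A config row looks like: NAME STATE READ WRITE CKSUM [notes...]
            if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
                counters[fields[0]] = {
                    "read": int(fields[2]),
                    "write": int(fields[3]),
                    "cksum": int(fields[4]),
                }
            elif stripped.startswith("errors:"):
                in_config = False
        match = re.match(r"errors:\s+(\d+) data errors", stripped)
        if match:
            data_errors = int(match.group(1))
    return counters, data_errors


# Text copied from the "after the scrub" output in this post.
sample = """\
  pool: twoz
 state: ONLINE
 scan: scrub repaired 3.49Mi in 6h33m with 3 errors on Thu Mar 14 05:42:44 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     3
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0    87  at disk3s4

errors: 3 data errors, use '-v' for a list
"""

counters, data_errors = parse_zpool_status(sample)
print(counters["twoz"]["cksum"], data_errors)
```

With the `-v` form of the command the summary line is replaced by a file list, so `data_errors` comes back as `None` and the listed paths would need to be counted instead.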
grahamperrin Offline

Posts: 1596
Joined: Fri Sep 14, 2012 10:21 pm
Location: Brighton and Hove, United Kingdom

Scrubbing after destroying any snapshot with an error

Post by grahamperrin » Thu Mar 14, 2013 2:15 am

ilovezfs wrote:… Once you have deleted all of the affected snapshots, and run a fresh scrub with all corrupt snapshots deleted, no more errors should be reported either as filenames or object numbers …


As I recall, a few weeks or months ago scrubs did continue to report errors as object numbers.

Let's test. Now with automatic backups off in Time Machine, and with the pool used for no other purpose …

Code: Select all
macbookpro08-centrim:~ gjp22$ date
Thu 14 Mar 2013 07:10:16 GMT
macbookpro08-centrim:~ gjp22$ uptime
 7:10  up 10:19, 6 users, load averages: 1.11 1.76 2.72
macbookpro08-centrim:~ gjp22$ zpool clear twoz
macbookpro08-centrim:~ gjp22$ sudo zpool status -v twoz
  pool: twoz
 state: ONLINE
 scan: scrub canceled on Thu Mar 14 07:09:55 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     0
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0     0  at disk3s4

errors: No known data errors
macbookpro08-centrim:~ gjp22$ zfs destroy twoz@2013-01-02-201022
macbookpro08-centrim:~ gjp22$ zfs destroy twoz@2013-03-12-224158
macbookpro08-centrim:~ gjp22$ zpool scrub twoz
macbookpro08-centrim:~ gjp22$


… I'll probably have the end result of this scrub tomorrow morning.

Re: Scrub: interpreting the counts of errors and data errors

Post by ilovezfs » Thu Mar 14, 2013 9:28 am

Looking forward to the results.
ilovezfs Online


 
Posts: 249
Joined: Sun Feb 10, 2013 9:02 am

Interim results

Post by grahamperrin » Fri Mar 15, 2013 3:11 am

Nothing conclusive at this point … 

Code: Select all
sh-3.2$ zpool status twoz
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 520Ki in 21h23m with 3 errors on Fri Mar 15 04:35:24 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     3
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0    20  at disk12s4

errors: 3 data errors, use '-v' for a list
sh-3.2$ date
Fri 15 Mar 2013 05:06:09 GMT
sh-3.2$ clear


Code: Select all
sh-3.2$ uptime
 5:06  up 10:17, 5 users, load averages: 0.67 1.05 1.38
sh-3.2$ sw_vers
ProductName:   Mac OS X
ProductVersion:   10.8.3
BuildVersion:   12D78
sh-3.2$ uname -a
Darwin macbookpro08-centrim.home 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan  6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64
sh-3.2$ sudo zpool status -v twoz
Password:
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 520Ki in 21h23m with 3 errors on Fri Mar 15 04:35:24 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     3
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0    20  at disk12s4

errors: Permanent errors have been detected in the following files:

        twoz@2013-03-13-003952:/macbookpro08-centrim.sparsebundle/bands/606e
        twoz@2013-01-02-211021:/macbookpro08-centrim.sparsebundle/bands/3252
        twoz@2013-01-02-211021:/macbookpro08-centrim.sparsebundle/bands/4aca
sh-3.2$ clear


… for test purposes I opted to destroy more snapshots …

Code: Select all
sh-3.2$ date
Fri 15 Mar 2013 08:05:47 GMT
sh-3.2$ zfs destroy twoz@2013-01-02-211021
sh-3.2$ zfs destroy twoz@2013-03-13-003952
sh-3.2$


… but I'm not rushing to another scrub.

Instead, for a while, I'll switch on Time Machine and use this error-prone hard disk drive, without scrubbing, to help shape my thoughts around another topic.

Re: Scrub: interpreting the counts of errors and data errors

Post by ilovezfs » Fri Mar 15, 2013 5:38 am

It is nice to see that it didn't leave you hanging with object numbers and that it's not reporting any new errors in the current, non-snapshot data.

Interim results

Post by grahamperrin » Sun Mar 17, 2013 10:21 am

Again, nothing conclusive at this point …

Code: Select all
sh-3.2$ uptime
14:44  up  3:33, 4 users, load averages: 2.66 4.04 3.48
sh-3.2$ sudo zpool status -v twoz
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 520Ki in 21h23m with 3 errors on Fri Mar 15 04:35:24 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0    15
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0   114  at disk4s4

errors: Permanent errors have been detected in the following files:

        twoz:/macbookpro08-centrim.sparsebundle/bands/7244
        <0x136d>:<0x924e>
        <0xda8>:<0x4400>
        <0xda8>:<0x5d2d>
sh-3.2$ date
Sun 17 Mar 2013 14:44:25 GMT
sh-3.2$ sw_vers
ProductName:   Mac OS X
ProductVersion:   10.8.3
BuildVersion:   12D78
sh-3.2$ uname -a
Darwin macbookpro08-centrim.home 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan  6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64
sh-3.2$


Then, after two relatively minor Time Machine backups (around 4 MB copied) to the disk image, five more checksum errors at the pool level (CKSUM 15 to 20):

Code: Select all
sh-3.2$ date
Sun 17 Mar 2013 15:16:01 GMT
sh-3.2$ sudo zpool status -v twoz
Password:
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 520Ki in 21h23m with 3 errors on Fri Mar 15 04:35:24 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0    20
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0   144  at disk4s4

errors: Permanent errors have been detected in the following files:

        twoz:/macbookpro08-centrim.sparsebundle/bands/7244
        <0x136d>:<0x924e>
        <0xda8>:<0x4400>
        <0xda8>:<0x5d2d>
sh-3.2$
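
To make the comparison between the two readings mechanical rather than by eye, here is a minimal sketch that subtracts one set of CKSUM counters from another. The helper name `counter_delta` is hypothetical; the numbers are copied from the two status outputs above. It assumes the counters were not reset in between (the earlier `zpool clear` in this thread shows them going back to 0).

```python
def counter_delta(before, after):
    """Return the increase in each CKSUM counter present in both readings."""
    return {name: after[name] - before[name] for name in before if name in after}


# CKSUM values from the 14:44 and 15:16 GMT outputs above.
before = {"twoz": 15, "GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42": 114}
after = {"twoz": 20, "GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42": 144}

for name, increase in counter_delta(before, after).items():
    print(f"{name}: +{increase}")
```

On these figures the device counter rose by 30 while the pool counter rose by 5, and the list of files with permanent errors did not change.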


Refraining from a scrub, over now to viewtopic.php?p=4287#p4287 to see whether fsck_hfs finds any issue with the disk image …

Re: Scrub: interpreting the counts of errors and data errors

Post by ilovezfs » Sun Mar 17, 2013 1:28 pm

So I believe
<0x136d>:<0x924e> = twoz@2013-03-13-003952:/macbookpro08-centrim.sparsebundle/bands/606e
<0xda8>:<0x4400> = twoz@2013-01-02-211021:/macbookpro08-centrim.sparsebundle/bands/3252
<0xda8>:<0x5d2d> = twoz@2013-01-02-211021:/macbookpro08-centrim.sparsebundle/bands/4aca

<0x136d> = twoz@2013-03-13-003952
<0xda8> = twoz@2013-01-02-211021
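
The mapping above treats each unresolved entry as a pair of hexadecimal numbers, the first identifying the dataset (here, a destroyed snapshot) and the second the object within it. A minimal sketch of splitting and grouping such entries follows; the function names are hypothetical and the reading of the two fields is the one assumed in this post, not something taken from ZEVO documentation.

```python
import re

# Matches entries such as "<0xda8>:<0x4400>" from `zpool status -v`.
ENTRY = re.compile(r"^<0x([0-9a-fA-F]+)>:<0x([0-9a-fA-F]+)>$")


def parse_unresolved_entry(entry):
    """Split '<0xDATASET>:<0xOBJECT>' into two integers, or return None."""
    match = ENTRY.match(entry.strip())
    if match is None:
        return None  # a normal path such as pool:/file, not a hex pair
    return int(match.group(1), 16), int(match.group(2), 16)


def group_by_dataset(entries):
    """Group object numbers under the dataset number they belong to."""
    groups = {}
    for entry in entries:
        parsed = parse_unresolved_entry(entry)
        if parsed is not None:
            dataset_id, object_id = parsed
            groups.setdefault(dataset_id, []).append(object_id)
    return groups


entries = [
    "twoz:/macbookpro08-centrim.sparsebundle/bands/7244",
    "<0x136d>:<0x924e>",
    "<0xda8>:<0x4400>",
    "<0xda8>:<0x5d2d>",
]
print(group_by_dataset(entries))
```

Two of the three entries share the dataset number 0xda8, which fits the guess that they are the two bands from the same destroyed snapshot.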

And now there seems to be the uninvited guest, twoz:/macbookpro08-centrim.sparsebundle/bands/7244. I'm wondering if you can get the creation date on that sucker to see whether it's a brand new file: GetFileInfo -d macbookpro08-centrim.sparsebundle/bands/7244

a new band in the .sparsebundle

Post by grahamperrin » Sun Mar 17, 2013 11:21 pm

Code: Select all
macbookpro08-centrim:~ gjp22$ sudo GetFileInfo -d /Volumes/twoz/macbookpro08-centrim.sparsebundle/bands/7244
03/17/2013 07:14:24


Considering the amount of data written (see the linked topic), I expect that there are many more new files.

Re: Scrub: interpreting the counts of errors and data errors

Post by ilovezfs » Mon Mar 18, 2013 4:42 am

I think the fact that you have copies=3 is the main reason that you're escaping more file corruption. Since the drive is not reporting I/O errors, you might also want to test your computer's RAM, unless other pools on other drives are free of checksum errors. However, I think you have a sick drive that has no idea it is sick, so it is silently returning garbage data. I think ZFS is intervening: it uses good copies of the data, elsewhere on the drive, that do match the checksums, and writes new copies to replace the garbage copies along the way. It makes sense that fsck_hfs has no idea that there are problems, because I believe it only looks at its own b-trees, as opposed to all of the data.
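
To illustrate the idea (and only the idea), here is a toy model of reading a block that has several copies, checking each against a stored checksum, and overwriting the copies that fail. This is a sketch under loud assumptions: the names `checksum` and `read_with_self_heal` are made up, SHA-256 stands in for whatever checksum the pool uses, and the real ZFS read path is more involved than a loop over a list.

```python
import hashlib


def checksum(data):
    """Stand-in checksum for the toy model (SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


def read_with_self_heal(copies, expected_checksum):
    """Return (good_data, repaired_indexes) for one block.

    `copies` is a mutable list of bytes objects standing in for the on-disk
    copies of the block. Copies that fail verification are overwritten with
    a good copy. Raises OSError when no copy verifies, which corresponds to
    a permanent error in this model.
    """
    good = None
    bad_indexes = []
    for index, data in enumerate(copies):
        if checksum(data) == expected_checksum:
            if good is None:
                good = data
        else:
            bad_indexes.append(index)
    if good is None:
        raise OSError("no copy matches the checksum: permanent error")
    for index in bad_indexes:
        copies[index] = good  # rewrite the garbage copy from the good one
    return good, bad_indexes


block = b"contents of one sparsebundle band block"
on_disk = [b"garbage returned by the drive", block, block]  # copies=3

data, repaired = read_with_self_heal(on_disk, checksum(block))
print(data == block, repaired)
```

In this model a file only shows up as a permanent error when every copy of some block fails the check, which would be consistent with checksum counters climbing much faster than the list of damaged files.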

If you want to continue to hunt for corruption at the HFS+ level, I would play around with regenerating the b-trees, perhaps one at a time. fsck_hfs -dfRace does all three at once, but I might do fsck_hfs -dfRa, then -dfRc, and then -dfRe, and see if anything unusual comes up in the debugging output. Since ZFS seems to be masking the problems using the good copies, I doubt anything significant will turn up, except for the usual HFS+ oddities that come up as a matter of course in the debugging output of fsck_hfs -d on healthy drives and healthy filesystems.

Also, it would be interesting to know what DiskWarrior makes of it, because I believe its method is to reverse engineer the b-trees by looking at the data itself. Again, I doubt it's going to complain much, except for noting the typical irregularities it always barks about that don't usually matter.

The other thing I was thinking about was whether your scrubs would have any different findings if you got a second opinion and used OpenIndiana's zpool scrub, and perhaps a third opinion from ZFS on Linux's zpool scrub. I doubt it, but who knows?

You could also try backing up to the sparsebundle over AFP/netatalk, hosting the zpool on an OpenIndiana virtual machine or an Ubuntu virtual machine with ZFS on Linux, to see whether you get the same numerous checksum errors and random data errors. If you don't, that might imply some sort of compatibility issue between your drive and ZEVO, or ZEVO bugs, which would be interesting.

Re: Scrub: interpreting the counts of errors and data errors

Post by raattgift » Mon Mar 18, 2013 12:14 pm

I think playing around with a known flaky drive in a vdev without replication is pretty wasteful, but everyone needs hobbies.

Without thinking too hard about it, it looks like your drive is silently failing writes, perhaps due to physical damage to the mox layer itself, or to the debris that such damage would introduce inside the drive.

You can probably save time by doing a zfs send of the affected snapshot (or clone the snapshot and cat the reported-errored file) to /dev/null, rather than scrubbing.
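
A minimal sketch of the `zfs send` variant, for anyone who wants to script it. It only builds and runs `zfs send <snapshot>` with the stream discarded, which is the same as redirecting to /dev/null in a shell. The function names are hypothetical, the snapshot name is just one that appeared earlier in this thread, and the assumption (untested here) is that an unreadable block makes the command exit with a non-zero status. Root privileges may be needed.

```python
import subprocess


def send_to_null_command(snapshot):
    """Build the argv for reading a whole snapshot via `zfs send`."""
    if "@" not in snapshot:
        raise ValueError("expected a snapshot name of the form pool@snapname")
    return ["zfs", "send", snapshot]


def read_snapshot(snapshot):
    """Run `zfs send`, discard the stream, and return (ok, stderr_text).

    `ok` is True when the command exits with status 0. Not called in this
    sketch, because it needs a real pool.
    """
    result = subprocess.run(
        send_to_null_command(snapshot),
        stdout=subprocess.DEVNULL,  # equivalent to "> /dev/null"
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0, result.stderr.decode(errors="replace")


# Show the equivalent shell command without running anything.
print(" ".join(send_to_null_command("twoz@2013-01-02-201022")) + " > /dev/null")
```

This reads only the blocks belonging to that one snapshot, which is why it should finish much sooner than a scrub of the whole pool.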

You can get a much better feel for what's going on with the disk in question by attaching a known working drive as a mirror of the known flaky one.
zpool attach [-f] pool bad_device good_device
and then wait for the resilver to finish.

You can then torture the now-mirrored vdev and watch the bad_device being repaired from the good_device on the fly. It'll be faster and more reliable than relying on the automatic duplication of the metadata and the duplicate blocks you get from copies > 1.

Hopefully you have already made a backup of any of the data you want to keep.

You should really toss away the drive, its power supply, and its USB cable so that you don't accidentally introduce important data to them. Bad drives are pretty common. The recovery code in ZEVO is descended from Sun code vintage 2008ish, a time when the ZFS team did literally thousands of regression tests against a wide variety of known-flaky hardware. Likely the only thing special or unique about your flaky disk is that you have physical control over it. :-)

There is likely to be some meat about this here:

https://www.google.co.uk/search?client= ... 8&oe=UTF-8
raattgift Offline


 
Posts: 98
Joined: Mon Sep 24, 2012 11:18 pm
