scrub not reliably finding data corruption (?)

Postby zfsosxuser » Wed Apr 03, 2019 9:25 am

I have been using ZFS for many years, though only the last year or two on OS X. Recently, I have run into a lot of data error issues. I will try to be brief in this post and skip a lot of the details.

First, here is an overview of my setup:
* Three machines with separate pools, one is source, two are backups
* Source machine: MacBook Pro, 16GB non-ECC RAM, internal SSD, Thunderbolt-2 attached external SSD, no redundancy, replicated to backups daily
* Backup 1: 2008 Mac Pro with internal disks, raidz2, 16GB ECC RAM
* Backup 2: newish iMac with Thunderbolt-2 attached raidz2
* Scrub source about once every 2 months, scrub backups about once a month
* Occasional checksum errors are automatically repaired

Brief background: data errors recently showed up on the external SSD on the source machine. I wiped the SSD and started a file-level restore from Backup 1 to an APFS volume. During the restore, zpool status started reporting data corruption, and the restore tool (Carbon Copy Cloner) reported several files with I/O errors. I confirmed this by running "cat FILE > /dev/null" on the backup and getting an I/O error.

I switched to a second backup pool on Backup 1 and the same thing happened. I then switched to Backup 2, and again previously unseen data errors came up. The data errors were in the same set of files in each backup dataset.

I cloned and mounted earlier snapshots and found that the same files had I/O errors going back several months. The oldest snapshot on Backup 2 did not have I/O errors, but every other snapshot I checked did. zpool history shows that I ran several scrubs after those snapshots were taken and before the corruption showed up. I routinely monitor zpool status in a dashboard and do not recall seeing any data errors reported during that time.
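The per-file read check mentioned above ("cat FILE > /dev/null") can be extended to a whole dataset or snapshot mount. The following is only a minimal sketch: the function name find_unreadable is made up for illustration, it assumes file names contain no newlines, and it only reports files whose read fails at the time it runs.

```shell
#!/bin/sh
# find_unreadable DIR (hypothetical helper, not part of ZFS):
# read every regular file under DIR and print the path of each file
# that cannot be read completely, for example because of an I/O error.
find_unreadable() {
    dir=$1
    find "$dir" -type f -print | while IFS= read -r path; do
        # cat exits non-zero if opening or reading the file fails
        if ! cat "$path" > /dev/null 2>&1; then
            printf '%s\n' "$path"
        fi
    done
}
```

Running it as, say, `find_unreadable /Volumes/tank/FS > unreadable.txt` (the mount point here is an assumed example) against each backup and each cloned snapshot would give lists that can be compared with diff.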

Today, I found what I think is the clearest indication that there may be a defect in the scrub logic, rather than just something I forgot or did wrong. Either that, or I am badly misunderstanding something.

For the last week or so, after the above happened, I have been seeing Backup 2's "zpool status" report 10 data errors, though only 3 were listed with "zpool status -v". I ran a zpool scrub on the pool after the data errors were found, and now, after the scrub is done, the data errors are gone, though the status message about the data errors remains. I did not do anything to correct the data errors.

According to http://zfsonlinux.org/msg/ZFS-8000-8A, the data errors are not automatically repairable and must be repaired manually.

I think this is a clear indication that scrub is NOT finding all instances of data errors, otherwise these errors should not have disappeared.

Should I file a bug report, and if so, where do those go? Or, is there something I am missing about this?

Code:
$ zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Apr 1 21:31:58 2019
        8.38T scanned out of 11.0T at 170M/s, 4h27m to go
        64K repaired, 76.32% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            media-740DBE4A-F1E4-405D-A33D-FA4C8FCFF998  ONLINE       0     0     0
            media-B520CC13-2D30-458D-AE60-A53B9D27770B  ONLINE       0     0     0
            media-95B75780-A764-4A0B-8660-5F1130DC5AFD  ONLINE       0     0     2
            media-DA078B2D-D0C8-48F3-9019-691ED2C690A3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x230e>:<0x0>
        <0x2318>:<0x0>
        tank/FS.broken:<0x0>

$ zpool scrub tank

...

$ zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 64K in 22h43m with 0 errors on Tue Apr 2 20:15:34 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            media-740DBE4A-F1E4-405D-A33D-FA4C8FCFF998  ONLINE       0     0     0
            media-B520CC13-2D30-458D-AE60-A53B9D27770B  ONLINE       0     0     0
            media-95B75780-A764-4A0B-8660-5F1130DC5AFD  ONLINE       0     0     2
            media-DA078B2D-D0C8-48F3-9019-691ED2C690A3  ONLINE       0     0     0

errors: No known data errors
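
To track how many entries "zpool status -v" actually lists over time, a small helper along these lines could be run before and after each scrub. This is a rough sketch, not an official tool: count_listed_errors is a made-up name, and the parsing is a heuristic based only on the output layout shown above (an "errors: Permanent errors ..." line followed by one entry per line).

```shell
#!/bin/sh
# count_listed_errors (hypothetical helper): read "zpool status -v"
# output on stdin and print the number of entries listed under
# "errors: Permanent errors have been detected in the following files:".
count_listed_errors() {
    awk '
        /^ *pool: /                 { in_list = 0 }           # a new pool section ends the list
        /^errors: Permanent errors/ { in_list = 1; next }     # the list starts after this line
        in_list && NF > 0           { n++ }                   # count each non-blank entry
        END                         { print n + 0 }           # prints 0 when nothing was listed
    '
}
```

Usage would be something like `zpool status -v tank | count_listed_errors`, with the result logged next to the date so that entries disappearing after a scrub show up in the history.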


Thanks
zfsosxuser
 

Re: scrub not reliably finding data corruption (?)

Postby kingneutron » Wed Apr 03, 2019 5:11 pm

It wouldn't hurt to check your RAM. I had to rebuild a couple of pools on a 2011 iMac after upgrading it with 2x8GB sticks that turned out to be bad.

Reboot and hold down D for diagnostics

REF: https://support.apple.com/en-us/HT202731

Holding down Option+D at boot should download a RAM tester over the Internet instead.
kingneutron
 

