Scrub: interpreting the counts of errors and data errors

Post by grahamperrin » Mon Mar 04, 2013 1:43 pm

The original subject line for this topic was:

  • A scrub with three data errors counted as two errors

Without verbosity:

Code:
sh-3.2$ zpool status twoz
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 1.02Mi in 10h14m with 2 errors on Sun Mar  3 21:28:19 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     0
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0     0  at disk5s4

errors: 3 data errors, use '-v' for a list


With verbosity:

Code:
sh-3.2$ sudo zpool status -v twoz
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 1.02Mi in 10h14m with 2 errors on Sun Mar  3 21:28:19 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     0
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0     0  at disk5s4

errors: Permanent errors have been detected in the following files:

        twoz:/macbookpro08-centrim.sparsebundle/bands/3252
        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252
        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/4aca
Last edited by grahamperrin on Thu Mar 07, 2013 2:27 pm, edited 1 time in total.
grahamperrin Offline

User avatar
 
Posts: 1596
Joined: Fri Sep 14, 2012 10:21 pm
Location: Brighton and Hove, United Kingdom

Re: A scrub with three data errors counted as two errors

Post by scasady » Mon Mar 04, 2013 1:47 pm

Three errors at the file level but only two at the block level?
scasady Offline


 
Posts: 45
Joined: Sat Sep 15, 2012 8:00 am

Re: A scrub with three data errors counted as two errors

Post by grahamperrin » Mon Mar 04, 2013 2:19 pm

I wonder. I find it difficult to visualise.

It's remarkable that two of the three files are bands (of a sparse bundle disk image) that share the same name.

Re: A scrub with three data errors counted as two errors

Post by ilovezfs » Wed Mar 06, 2013 2:13 am

I would assume that the file /macbookpro08-centrim.sparsebundle/bands/4aca has changed since 2013-01-02-201022, whereas the file 3252 has not. Hence, twoz:/macbookpro08-centrim.sparsebundle/bands/3252 and twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252 would in fact be the same file and the same error. To confirm, you could perhaps compare MD5 sums and modification dates. You can check twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252 by creating a clone of the snapshot.
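
A rough way to see the "same file, same error" idea from the shell is to strip the dataset or snapshot prefix from each entry in the error list and count the paths that remain. This is only a sketch: the helper name distinct_paths is made up for illustration, the sample list is copied from the zpool status -v output earlier in this thread, and a shared path does not by itself prove that the two entries share the same blocks.

```shell
# Sample: the three entries from the `zpool status -v` error list in this thread.
error_list='twoz:/macbookpro08-centrim.sparsebundle/bands/3252
twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252
twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/4aca'

# distinct_paths (hypothetical helper): trim leading whitespace, drop the
# "dataset:" or "dataset@snapshot:" prefix, then list each path once.
distinct_paths() {
    sed -e 's/^[[:space:]]*//' -e 's/^[^:]*://' | sort -u
}

printf '%s\n' "$error_list" | distinct_paths
```

For the list in this thread it prints two paths (bands/3252 and bands/4aca), which at least fits the count of two scrub errors.
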
ilovezfs Online


 
Posts: 249
Joined: Sun Feb 10, 2013 9:02 am

During a scrub: five data errors, three files listed

Post by grahamperrin » Thu Mar 07, 2013 3:15 pm

Uptime: six minutes. Scrub began some time earlier. Five data errors:

Code:
sh-3.2$ date
Tue  5 Mar 2013 07:28:44 GMT
sh-3.2$ uptime
 7:28  up 6 mins, 2 users, load averages: 2.40 2.05 1.11
sh-3.2$ zpool status twoz
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub in progress since Tue Mar  5 01:44:33 2013
    229Gi scanned out of 321Gi at 12.8Mi/s, 2h3m to go
    0 repaired, 71.25% done
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     0
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0     0  at disk3s4

errors: 5 data errors, use '-v' for a list


Only three files, so I assume that the five errors are spread across the three files:

Code:
sh-3.2$ sudo zpool status -v twoz
Password:
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub in progress since Tue Mar  5 01:44:33 2013
    229Gi scanned out of 321Gi at 12.8Mi/s, 2h2m to go
    0 repaired, 71.31% done
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     0
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0     0  at disk3s4

errors: Permanent errors have been detected in the following files:

        twoz:/macbookpro08-centrim.sparsebundle/bands/3252
        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252
        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/4aca


A few hours later, after the scrub:

Code:
macbookpro08-centrim:~ gjp22$ date
Wed  6 Mar 2013 03:13:32 GMT
macbookpro08-centrim:~ gjp22$ uptime
 3:13  up 19:51, 5 users, load averages: 24.33 19.41 13.83
macbookpro08-centrim:~ gjp22$ sudo zpool status -v twoz
Password:
  pool: twoz
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
 scan: scrub repaired 2.49Mi in 17h28m with 2 errors on Tue Mar  5 19:13:06 2013
config:

   NAME                                         STATE     READ WRITE CKSUM
   twoz                                         ONLINE       0     0     0
     GPTE_34E7E852-7E88-4FAD-B162-2AEF6D300D42  ONLINE       0     0     8  at disk3s4

errors: Permanent errors have been detected in the following files:

        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252
        twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/4aca


Side note: for this pool a few days ago I set copies=3.

Re: Scrub: interpreting the counts of errors and data errors

Post by ilovezfs » Thu Mar 07, 2013 3:59 pm

If you deleted the snapshot, I wonder whether ZFS would think twoz:/macbookpro08-centrim.sparsebundle/bands/3252 had a permanent error after a scrub completes.

Re: Scrub: interpreting the counts of errors and data errors

Post by grahamperrin » Fri Mar 08, 2013 12:06 am


Re: Scrub: interpreting the counts of errors and data errors

Post by ilovezfs » Fri Mar 08, 2013 2:16 am

So I took a look at Oracle's documentation. http://docs.oracle.com/cd/E19082-01/817 ... index.html "Each error would indicate only that an error occurred at a given point in time. Each error is not necessarily still present on the system." Note that "A complete scrub of the pool is guaranteed to examine every active block in the pool, so the error log is reset whenever a scrub finishes."

http://docs.oracle.com/cd/E19082-01/817 ... index.html
"The zpool status command also shows whether any known errors are associated with the pool. These errors might have been found during disk scrubbing or during normal operation."

Regarding scrubbing status errors: "The third section of the zpool status output describes the current status of any explicit scrubs. This information is distinct from whether any errors are detected on the system, though this information can be used to determine the accuracy of the data corruption error reporting. If the last scrub ended recently, most likely, any known data corruption has been discovered."

Also Oracle discusses the case where the files are not identifiable: http://docs.oracle.com/cd/E19082-01/817 ... index.html "If the object number to a file path cannot be successfully translated, either due to an error or because the object doesn't have a real file path associated with it, as is the case for a dnode_t, then the dataset name followed by the object's number is displayed."

In your first post, you noted 3 data errors, with only 2 scrub errors. I'm wondering if that third error occurred after the completion of the scrub. How long had it been since the scrub's completion? Did you run any commands in the interim?

Your conjecture, "Only three files, so I assume that the five errors are spread across the three files," sounds possible, but not necessarily true, given Oracle's remark "Each error would indicate only that an error occurred at a given point in time. Each error is not necessarily still present on the system." It could be the case that other, perhaps unrelated, errors occurred which are no longer present.

At the end of your scrub that completed Tue Mar 5 19:13:06 2013 the number of scrub errors (2) matched the number of files with permanent errors (2). I'm wondering if sudo zpool status twoz, without verbosity, would have reported exactly 2 data errors right at the end of the scrub, given that the log is supposed to reset at the end of a scrub. That would confirm scrub errors = non-verbose data errors at end of scrub = number of files with permanent errors at end of scrub, and confirm the idea that some additional third error occurred after your original scrub on Sun Mar 3 21:28:19 2013.
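
To make that comparison less error-prone, the two counts can be pulled out of saved non-verbose zpool status output and printed side by side. This is a sketch under assumptions: the helper names scrub_errors and data_errors are invented here, the sample text is the scan line and errors line from the first post, and the patterns only match the output format shown in this thread (the verbose form has no "N data errors" line).

```shell
# Sample: the scan line and the errors line from the first post.
status_output=" scan: scrub repaired 1.02Mi in 10h14m with 2 errors on Sun Mar  3 21:28:19 2013
errors: 3 data errors, use '-v' for a list"

# scrub_errors (hypothetical helper): error count reported by the last scrub.
scrub_errors() {
    sed -n 's/.*scrub repaired .* with \([0-9][0-9]*\) errors.*/\1/p'
}

# data_errors (hypothetical helper): count from the non-verbose "errors:" line.
data_errors() {
    sed -n 's/^errors: \([0-9][0-9]*\) data errors.*/\1/p'
}

scrub_count=$(printf '%s\n' "$status_output" | scrub_errors)
data_count=$(printf '%s\n' "$status_output" | data_errors)

echo "scrub errors: $scrub_count, logged data errors: $data_count"
if [ "$scrub_count" -ne "$data_count" ]; then
    echo "counts differ; the log may hold entries the scrub did not count"
fi
```

Running the same thing against output captured right after a scrub finishes would show whether the two numbers agree at that moment.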

And is it in fact the case that the scrub actually fixed twoz:/macbookpro08-centrim.sparsebundle/bands/3252? That would be good!

Re: Scrub: interpreting the counts of errors and data errors

Post by grahamperrin » Fri Mar 08, 2013 1:04 pm

ilovezfs wrote:… is it in fact the case that the scrub actually fixed twoz:/macbookpro08-centrim.sparsebundle/bands/3252? …


I'm not sure.

I imagine the file 3252 being overwritten with good data (during a Time Machine backup to the disk image) but that thought is probably inconsistent with the copy-on-write nature of ZFS.

If it helps:

Code:
sh-3.2$ zfs list -t snapshot | grep twoz@2013-03-
twoz@2013-03-03-203000                 3.73Gi       -   255Gi  -
twoz@2013-03-05-082248                 4.29Gi       -   285Gi  -


----

Thanks for the other information. I'll probably try to digest these things in very small chunks; I'm easily confused in this area!

Re: Scrub: interpreting the counts of errors and data errors

Post by ilovezfs » Fri Mar 08, 2013 4:58 pm

You're welcome. Regarding the idea that the "thought is probably inconsistent with the copy-on-write nature of ZFS": to the contrary. Once ZFS has finished writing the new data during a Time Machine backup (assuming the data in that band has been updated due to some change in the file it backs), the file twoz:/macbookpro08-centrim.sparsebundle/bands/3252 will point to the new data, not the old, corrupt data. Because of copy-on-write, ZFS will not update the old data in place. If there are snapshots that point to the old data, ZFS will leave the old data alone so that the snapshots can continue to point to it, even though the main, non-snapshot version of the dataset no longer contains it. If no snapshots point to the old data, ZFS can reuse the space that the old data occupied whenever it chooses. Either way, ZFS will not use the space occupied by the old data while it is writing the new data. That's what copy-on-write means.

In your example in the other thread, where you ran zfs destroy twoz@2013-01-02-191022, I think you must have had other snapshots of that filesystem (for example, the snapshot twoz@2013-01-02-201022, which you referenced in this thread). My guess is that ZFS was not immediately able to identify twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252 and twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/4aca as the names of the affected files because whatever mechanism associates data shared by two snapshots with the filenames in each of them does not run immediately.

In other words, both twoz@2013-01-02-191022 and twoz@2013-01-02-201022 pointed to the same data, but right after twoz@2013-01-02-191022 is deleted, ZFS does not yet know that the old logged error in twoz@2013-01-02-191022:/macbookpro08-centrim.sparsebundle/bands/3252 is now associated with the filename twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252. I think if you had done a scrub right then, it would have figured that out, since it would be rediscovering the error afresh as opposed to reading from a log. The log probably records both the object number and the filename associated with the error when it occurs, so if the filename in the log no longer exists, it will have only the object number to report until it encounters that error again, either during a scrub or in the course of normal operation.

So at this point it seems that only your snapshots are broken. Every time you delete one of the corrupt snapshots, ZFS will at first just report an object number, and then, if you scrub, it will report the name of another snapshot that still points to the old corrupt data. Once you have deleted all of the affected snapshots and run a fresh scrub, no more errors should be reported, either as filenames or as object numbers, given that ZFS no longer sees an error in the current dataset. At that point, it should be happy to overwrite that data and report no errors.
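
The cycle described above (find the snapshots still named in the error list, destroy them, scrub, check again) might be sketched like this. It is deliberately a dry run: the helper name affected_snapshots is made up for illustration, the sample list is the error list from the scrub that finished on Tue Mar 5, and the commands are only printed, never executed. Whether destroying the snapshots is acceptable is a separate decision.

```shell
# Sample: the error list reported after the scrub that finished on Tue Mar 5.
error_list='twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/3252
twoz@2013-01-02-201022:/macbookpro08-centrim.sparsebundle/bands/4aca'

# affected_snapshots (hypothetical helper): keep only entries of the form
# dataset@snapshot:/path and print each snapshot name once.
affected_snapshots() {
    sed -e 's/^[[:space:]]*//' | grep '^[^:]*@[^:]*:' | cut -d: -f1 | sort -u
}

# Dry run: print the commands for review instead of running them.
printf '%s\n' "$error_list" | affected_snapshots | while IFS= read -r snap; do
    echo "zfs destroy $snap"
done
echo "zpool scrub twoz"
echo "zpool status -v twoz"
```

If the status output after the new scrub names another snapshot, the same steps would be repeated for that one.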