Unexpected resilver with neither redundancy nor replacement

This forum is to find answers to problems you may be having with ZEVO Community Edition.

Moderators: jhartley, MSR734, nola


Post by grahamperrin » Sun Mar 31, 2013 5:01 am

 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
   continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sun Mar 31 00:42:07 2013
    3.09Gi scanned out of 1.98Ti at 95.8Mi/s, 6h0m to go
    688Ki resilvered, 0.15% done
config:

   NAME                                         STATE     READ WRITE CKSUM
   tall                                         ONLINE       0     0     0
     GPTE_78301A52-4AFF-4D96-8DE9-E76ABC14909C  ONLINE       0     0     0  at disk4s2
     GPTE_99056308-F5E2-4314-852C-4DA04732A2D0  ONLINE       0     0     0  at disk5s2  (resilvering)
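The scan line can be checked by hand. The Python sketch below reproduces both figures from the three quantities printed. It is a minimal sketch that assumes two things: the Ki/Mi/Gi/Ti suffixes are binary (powers of 1024), and the rate stays constant. zpool's own estimate may be computed differently. `to_bytes` and `scan_progress` are hypothetical helper names.

```python
# Suffixes printed by zpool status, assumed binary (powers of 1024).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(value: str) -> float:
    """Convert a size such as '3.09Gi' into bytes."""
    for suffix, factor in UNITS.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)  # no suffix: already bytes


def scan_progress(scanned: str, total: str, rate_per_second: str):
    """Return (percent done, seconds remaining), assuming a constant rate."""
    done = to_bytes(scanned)
    whole = to_bytes(total)
    rate = to_bytes(rate_per_second)
    return 100.0 * done / whole, (whole - done) / rate


percent, seconds = scan_progress("3.09Gi", "1.98Ti", "95.8Mi")
hours, rem = divmod(int(seconds), 3600)
print(f"{percent:.2f}% done, {hours}h{rem // 60}m to go")
```

With these inputs it prints `0.15% done, 6h0m to go`, which matches the status output.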


That's abbreviated. I can add details of the errors and much more, but first there's a basic question:

  • is it sane for a resilver to occur in a two-disk pool where there's no redundancy?

Prior to what's above, the two disks – 2 TB and 3 TB – formed a simple 5 TB pool.

I see articles such as Aaron Toponce : ZFS Administration, Part VI- Scrub and Resilver (2012-12-11) but I haven't read the resilver content in detail, because I didn't anticipate it happening with my current configuration …
grahamperrin Offline

Posts: 1596
Joined: Fri Sep 14, 2012 10:21 pm
Location: Brighton and Hove, United Kingdom

Re: Unexpected resilver with neither redundancy nor replacem

Post by raattgift » Sun Mar 31, 2013 8:26 am

Resilvering happens whenever the transaction group (txg) IDs differ among members of a vdev or pool, so of course it is sane. ZFS does what it can to take advantage of whatever redundancy there is. Even on a pool badly configured as two single-device vdevs, the pool's metadata *typically* has at least one copy on each device. ZFS can therefore walk the metadata tree and verify it, because each node contains checksum data for its children.
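The checksum walk described above resembles a Merkle tree. Each parent stores the checksums of its children, so a walk that starts from a trusted root can pinpoint which block is damaged. The Python sketch below is a simplified model of that idea, not ZFS's on-disk block-pointer format. `Block`, `seal` and `find_bad_blocks` are hypothetical names, and SHA-256 is used only for convenience.

```python
import hashlib
from dataclasses import dataclass, field


def checksum(data: bytes) -> bytes:
    # SHA-256 is a stand-in; a real pool's checksum algorithm may differ.
    return hashlib.sha256(data).digest()


@dataclass
class Block:
    data: bytes
    children: list["Block"] = field(default_factory=list)
    child_sums: list[bytes] = field(default_factory=list)


def seal(block: Block) -> bytes:
    """Store each child's checksum in its parent; return this block's checksum."""
    block.child_sums = [seal(child) for child in block.children]
    return checksum(block.data + b"".join(block.child_sums))


def find_bad_blocks(block: Block, expected: bytes, path: str = "root") -> list[str]:
    """Walk down from the root, checking each block against its parent's record."""
    if checksum(block.data + b"".join(block.child_sums)) != expected:
        return [path]  # this block (or its stored child checksums) is damaged
    bad = []
    for i, (child, want) in enumerate(zip(block.children, block.child_sums)):
        bad += find_bad_blocks(child, want, f"{path}/{i}")
    return bad


leaf_a, leaf_b = Block(b"file A"), Block(b"file B")
root = Block(b"root metadata", [Block(b"directory", [leaf_a, leaf_b])])
root_sum = seal(root)            # kept somewhere trusted, loosely like an uberblock
leaf_b.data = b"file B, rotted"  # simulate silent corruption of one block
print(find_bad_blocks(root, root_sum))  # ['root/0/1']
```

If another copy of the damaged block exists on the other device, a real pool could read that copy instead. This sketch only detects the damage.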

(Think of the situation where the devices in question are themselves redundant, like a hardware RAID array. You would still want to check and repair the metadata, and might still find errors because of things like the RAID-5 write hole.)
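The RAID-5 write hole can be shown with XOR parity. Suppose power fails after a data block is rewritten but before the matching parity update lands. The stripe is then silently inconsistent, and a later rebuild of a different disk returns wrong data. The Python sketch below assumes one stripe of three data blocks plus parity, with no real controller involved. `parity` and `rebuild` are hypothetical helpers.

```python
from functools import reduce


def parity(blocks: list[bytes]) -> bytes:
    """XOR the blocks together byte by byte, as RAID 5 parity does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(survivors: list[bytes], parity_block: bytes) -> bytes:
    """Recover a missing data block from the surviving blocks plus parity."""
    return parity(survivors + [parity_block])


stripe = [b"AAAA", b"BBBB", b"CCCC"]  # three data disks, one stripe
p = parity(stripe)                    # parity disk, consistent for now

stripe[0] = b"ZZZZ"  # rewrite disk 0, then lose power before the parity write
# p is now stale, and nothing on the array records that fact.

# Later disk 2 fails and is rebuilt from disks 0 and 1 plus the stale parity.
rebuilt = rebuild([stripe[0], stripe[1]], p)
print(rebuilt, rebuilt == b"CCCC")  # b'XXXX' False
```

The array itself cannot notice the stale parity. A checksum held in parent metadata, as described in the post, would let the filesystem detect the wrong data when it is read.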
raattgift Offline


 
Posts: 98
Joined: Mon Sep 24, 2012 11:18 pm

Re: Unexpected resilver with neither redundancy nor replacem

Post by grahamperrin » Mon Apr 01, 2013 4:50 am

Thanks.

I can visualise copies of metadata across the two devices in this pool of two single-device vdevs.

Still, I'm surprised that a resilver began automatically, without any replacement and without my running a command.

For this (poorly configured) pool, might such an automatic resilver (without replacement) be normal after a state such as DEGRADED or FAULTED is encountered?

I'll prepare to reconnect the disks. Would it be prudent to then unmount the file systems (so as not to write anything new to the pool) until the resilver completes?

Re: Unexpected resilver with neither redundancy nor replacem

Post by raattgift » Mon Apr 01, 2013 1:56 pm

Resilvering happens whenever there is an unexpected mismatch in the txg value in the device labels.

At the minimum, it makes the txg values consistent across all the labels in the pool and in each vdev.

It may also do other repairs.
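The txg comparison described above can be modelled in a few lines of Python. In this simplified sketch, each device label records one txg. A device whose txg lags the newest value is treated as needing a resilver, and finishing the resilver brings every label up to date. `DeviceLabel`, `stale_devices` and `finish_resilver` are hypothetical names, and the txg numbers are made up. A real pool tracks much more state than this.

```python
from dataclasses import dataclass


@dataclass
class DeviceLabel:
    device: str
    txg: int  # last transaction group recorded in this device's label


def stale_devices(labels: list[DeviceLabel]) -> list[str]:
    """Devices whose label txg lags the newest txg seen anywhere in the pool."""
    newest = max(label.txg for label in labels)
    return [label.device for label in labels if label.txg < newest]


def finish_resilver(labels: list[DeviceLabel]) -> None:
    """After repairs, bring every label up to the newest txg."""
    newest = max(label.txg for label in labels)
    for label in labels:
        label.txg = newest


# Hypothetical txg values; real ones come from the on-disk labels.
labels = [DeviceLabel("disk4s2", 5120), DeviceLabel("disk5s2", 5087)]
print(stale_devices(labels))  # ['disk5s2']
finish_resilver(labels)
print(stale_devices(labels))  # []
```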

It is prudent to wait until a resilver completes before exporting a pool, but not necessary; resilvering will resume from where it left off when the pool is imported. (You can predict some of this by examining the txg and resilvering fields in the ZFS labels.)
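Resuming after an export and import requires that scan progress be persisted somewhere. The Python sketch below imitates that with a JSON bookmark file. The file, the `resilver` function and its `stop_after` parameter are assumptions for illustration, not how ZFS records its scan position. `stop_after` stands in for an interruption such as `zpool export`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def resilver(blocks: list[str], state_file: Path,
             stop_after: Optional[int] = None) -> list[str]:
    """Repair blocks in order, saving a bookmark after each one.

    If state_file holds a bookmark, start from it instead of block 0.
    stop_after simulates an interruption (such as an export) after N blocks.
    """
    start = json.loads(state_file.read_text())["next"] if state_file.exists() else 0
    repaired = []
    for i in range(start, len(blocks)):
        if stop_after is not None and len(repaired) == stop_after:
            return repaired  # interrupted: the bookmark stays on disk
        repaired.append(blocks[i])  # stand-in for the actual repair work
        state_file.write_text(json.dumps({"next": i + 1}))
    state_file.unlink(missing_ok=True)  # finished: clear the bookmark
    return repaired


state = Path(tempfile.mkdtemp()) / "scan_state.json"
blocks = [f"block{n}" for n in range(10)]
first = resilver(blocks, state, stop_after=3)  # interrupted early
second = resilver(blocks, state)               # later run resumes
print(first, second[0], len(second))  # ['block0', 'block1', 'block2'] block3 7
```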

You do not have to wait for a resilver to finish before beginning aggressive use of the pool, although resilvering necessarily generates fairly random I/O on the devices, which in turn will affect the pool's performance.

Re: Unexpected resilver with neither redundancy nor replacem

Post by grahamperrin » Tue Apr 02, 2013 12:43 am

Thanks again. Now I'm reassured.

I had exported the pool almost immediately after I realised that the resilver had begun, at around the 0.15% done mark.

I'll be away from the disks for a while now. I'll reconnect them this evening and won't allow the Mac or the hard disks to sleep (see Sleep).

Over the long weekend I followed some other users' chats in IRC and noted this command:

sudo zdb -ul device

I will probably run it on slice 2 of both devices soon after the resilver resumes.
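A small Python wrapper can run that command on both slices and pull out the txg values for a rough comparison. This is a sketch built on two assumptions. First, the device paths come from the `zpool status` output at the top of the thread and may differ on another system. Second, the regular expression expects txg values printed as `txg: N` or `txg = N`, which may not hold for every zdb build. `zdb_labels`, `txgs` and `report` are hypothetical names.

```python
import re
import subprocess

# Slice 2 of each device, as named in the zpool status output above.
# These paths are an assumption; check them on your own system.
DEVICES = ["/dev/disk4s2", "/dev/disk5s2"]

# Assumption: txg values appear as "txg: N" (label) or "txg = N" (uberblock).
# The leading \b stops fields such as "create_txg" from matching.
TXG_PATTERN = re.compile(r"\btxg\s*[:=]\s*(\d+)")


def zdb_labels(device: str) -> str:
    """Run `sudo zdb -ul <device>` and return its standard output."""
    result = subprocess.run(
        ["sudo", "zdb", "-ul", device],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def txgs(output: str) -> list[int]:
    """Every txg value found in a zdb dump, in order of appearance."""
    return [int(value) for value in TXG_PATTERN.findall(output)]


def report(devices: list[str]) -> None:
    """Print the highest txg seen on each device."""
    for device in devices:
        found = txgs(zdb_labels(device))
        print(device, "highest txg:", max(found) if found else "none found")
```

Calling `report(DEVICES)` prints the highest txg found on each slice. If one slice lags the other, that would be consistent with the txg mismatch that triggers a resilver, as explained earlier in the thread.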

In retrospect, the main reasons for me worrying unnecessarily were:

  1. over the past year or so, it seemed to me that most (possibly all) of the discussions I read about resilvering involved the planned replacement of a disk in a pool whose data (not metadata alone) was redundant, with copies on two or more component devices; nothing like my situation
  2. unless I missed something, there was no notification through Growl or HardwareGrowler.

Considering point 2, I'll leave this topic under Troubleshooting.

(Long ago, with different pools and probably with Silver Edition, I did see at least one notification about pool health (probably degraded) … and the quick start guide for Community Edition 1.1 does include resilvering amongst the four categories of notification … so maybe when this most recent incident occurred I was over-focused on the Terminal windows. Or maybe I simply had notifications suppressed.)

A plain English paragraph or two about resilvering would be good to have in a future guide …

Re: Unexpected resilver with neither redundancy nor replacem

Post by grahamperrin » Sat Apr 06, 2013 8:07 am

There's now a separate overview topic for the pool where I first encountered resilvering:


Additional notes for this topic

As mentioned above, the first resilver was interrupted gracefully by me (zpool export) very soon after I realised that the resilver had begun.

A later resilver, possibly not the first, was probably interrupted ungracefully by a kernel panic.

A later scrub of the pool concluded with no errors found, as shown in the overview topic.

I do have more details, but I'll refrain from putting things on a timeline until after things have settled.

