What to do after ZIL ssd failure?

Postby galego » Tue Nov 15, 2016 8:00 pm

I get the following error on import:

Code: Select all
sudo zpool import Media
The devices below are missing, use '-m' to import the pool anyway:
       18351213345093273837 [log]

cannot import 'Media': one or more devices is currently unavailable


When I import with the -m option, I get a kernel panic.

OS X 10.11.6
spl.kext_version: 1.5.2-1
zfs.kext_version: 1.5.2-1

Any ideas?? A little panicked here.
Last edited by galego on Sun Nov 20, 2016 6:22 pm, edited 1 time in total.
galego
 
Posts: 7
Joined: Tue May 13, 2014 8:17 am

Re: Can't import raidz2 pool after ZIL ssd failure

Postby galego » Tue Nov 15, 2016 8:06 pm

Finally got it to import using the -m option after a cold boot. How do I remove the ZIL drive so I don't keep getting the error?

Neither of the following works. The commands are accepted without error, but the log drive remains and the pool status is unchanged:
sudo zpool remove 18351213345093273837
sudo zpool remove Media media-5683211F-E975-4ED9-986F-679D4CF70475

Code: Select all
pool: Media
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
   the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://zfsonlinux.org/msg/ZFS-8000-2Q
  scan: scrub repaired 116K in 12h49m with 0 errors on Tue Sep 13 01:24:17 2016
config:

   NAME                                            STATE     READ WRITE CKSUM
   Media                                           DEGRADED     0     0     0
     raidz2-0                                      ONLINE       0     0     0
       media-88504425-F423-3C48-81FB-18E2E32D888C  ONLINE       0     0     0
       media-9EFC3899-745E-2941-AF1F-3FFF5E6A6827  ONLINE       0     0     0
       media-2130F752-DC18-B348-9480-870CFCBF9D54  ONLINE       0     0     0
       media-1A778E13-FCFE-6749-B5CA-34AF59B32131  ONLINE       0     0     0
       media-969D69BE-7158-0F40-A048-D2A4DA7967B3  ONLINE       0     0     0
       media-0B08C1BC-5278-664D-A780-B8D54A4C8D30  ONLINE       0     0     0
       media-1E502FEC-53DA-C64B-A345-95FAF3DDE599  ONLINE       0     0     0
       media-89F5FB77-7442-CC44-A101-0A53AEBC3340  ONLINE       0     0     0
   logs
     18351213345093273837                          UNAVAIL      0     0     0  was /private/var/run/disk/by-id/media-5683211F-E975-4ED9-986F-679D4CF70475

errors: No known data errors
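
When scripting around a status listing like the one above, the GUID of the missing log device can be pulled out of the text instead of being retyped by hand. This is a minimal sketch that assumes the layout shown in this post; the function name `unavail_log_guids` is made up for illustration, and it only reads text from stdin rather than calling zpool itself:

```shell
# Print the GUIDs of UNAVAIL devices listed under "logs" in `zpool status` text.
# Reads the status text from stdin; it never runs zpool.
unavail_log_guids() {
  awk '
    NF == 1 && $1 == "logs" { in_logs = 1; next }    # start of the logs section
    NF == 0 || NF == 1 || $1 ~ /:$/ { in_logs = 0 }  # blank line, another section, or "errors:" ends it
    in_logs && $2 == "UNAVAIL" && $1 ~ /^[0-9]+$/ { print $1 }
  '
}
```

Possible usage, assuming the pool is imported: `sudo zpool status Media | unavail_log_guids`.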

Re: Can't import raidz2 pool after ZIL ssd failure

Postby leeb » Wed Nov 16, 2016 1:38 pm

galego wrote:Neither of the following work. The commands are accepted without error but the log drive remains and the pool status is unchanged:
sudo zpool remove 18351213345093273837
sudo zpool remove Media media-5683211F-E975-4ED9-986F-679D4CF70475

Assuming you posted those exactly as you entered them, the first one is incorrect: you forgot to include the pool name. It should be
Code: Select all
zpool remove Media 18351213345093273837

If that's all you need, then great. If instead you made a typo while posting and actually entered the command correctly, then you could try clearing the errors first. So
Code: Select all
zpool import -m Media
zpool clear Media
zpool remove Media 18351213345093273837
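
The sequence above can be wrapped in a small dry-run helper that refuses to continue unless both the pool name and the device are given, which guards against the missing-pool-name slip. This is only a sketch: the function name `print_log_removal_steps` is hypothetical, and it prints the commands rather than running them.

```shell
# Dry-run helper: prints the commands instead of running them.
# The pool name and device GUID come from the caller; nothing here touches a pool.
print_log_removal_steps() {
  if [ "$#" -ne 2 ]; then
    # zpool remove takes the pool name before the device, so insist on both.
    echo "usage: print_log_removal_steps POOL DEVICE_GUID" >&2
    return 2
  fi
  echo "zpool import -m $1"   # import despite the missing log device
  echo "zpool clear $1"       # clear the recorded device errors
  echo "zpool remove $1 $2"   # then remove the log device
}
```

Calling `print_log_removal_steps Media 18351213345093273837` prints the three commands for review before any of them is run with sudo.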

Don't forget that having an unmirrored SLOG device does open up a small additional window for potential data loss, should it fail at the same time the system loses power, before the state held in RAM can be written to the main pool. It's not like the bad old days at all though, and if you've got a UPS it's probably ignorable entirely.
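
For anyone who does want to close that window, a log vdev can be added as a mirror. A hedged sketch follows: the function name and the two device arguments are placeholders chosen for illustration, and the helper only prints the command so it can be checked first.

```shell
# Prints (does not run) a command that adds a mirrored log vdev to a pool.
# The two device names are hypothetical and supplied by the caller.
print_add_mirrored_log() {
  if [ "$#" -ne 3 ]; then
    echo "usage: print_add_mirrored_log POOL DEVICE_A DEVICE_B" >&2
    return 2
  fi
  echo "zpool add $1 log mirror $2 $3"
}
```

For example, `print_add_mirrored_log Media disk10 disk11` prints `zpool add Media log mirror disk10 disk11`, where `disk10` and `disk11` stand in for whatever the two SSDs are actually called.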
leeb
 
Posts: 43
Joined: Thu May 15, 2014 12:10 pm

Re: Can't import raidz2 pool after ZIL ssd failure

Postby galego » Wed Nov 16, 2016 4:30 pm

lee,
Thanks so much for your help. I'm pretty sure I typed it in wrong only in the message.

Before your response, I tried to fix it and possibly made it worse. I replaced the SSD:

Code: Select all
 sudo zpool replace -f Media 8154475211869065760 /dev/disk1


This resulted in a kernel panic and boot loop until I physically disconnected the SSD again.

I now get this:

Code: Select all
Media-Server:~ ogomez$ zpool status
  pool: Media
 state: DEGRADED
status: Some supported features are not enabled on the pool. The pool can
   still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
   the pool may no longer be accessible by software that does not support
   the features. See zpool-features(5) for details.
  scan: resilvered 0 in 0h35m with 0 errors on Wed Nov 16 14:53:36 2016
config:

   NAME                                            STATE     READ WRITE CKSUM
   Media                                           DEGRADED     0     0     0
     raidz2-0                                      ONLINE       0     0     0
       media-88504425-F423-3C48-81FB-18E2E32D888C  ONLINE       0     0     0
       media-9EFC3899-745E-2941-AF1F-3FFF5E6A6827  ONLINE       0     0     0
       media-2130F752-DC18-B348-9480-870CFCBF9D54  ONLINE       0     0     0
       media-1A778E13-FCFE-6749-B5CA-34AF59B32131  ONLINE       0     0     0
       media-969D69BE-7158-0F40-A048-D2A4DA7967B3  ONLINE       0     0     0
       media-0B08C1BC-5278-664D-A780-B8D54A4C8D30  ONLINE       0     0     0
       media-1E502FEC-53DA-C64B-A345-95FAF3DDE599  ONLINE       0     0     0
       media-89F5FB77-7442-CC44-A101-0A53AEBC3340  ONLINE       0     0     0
   logs
     replacing-1                                   UNAVAIL      0     0     0  insufficient replicas
       18351213345093273837                        UNAVAIL      0     0     0  was /private/var/run/disk/by-id/media-5683211F-E975-4ED9-986F-679D4CF70475
       8154475211869065760                         UNAVAIL      0     0     0  was /private/var/run/disk/by-id/media-F1C7AA67-14C3-1544-B174-F3E55E50FED3

errors: No known data errors


Attempting to remove the drive after clearing the errors as you suggested results in:
Code: Select all
Media-Server:~ ogomez$ sudo zpool remove Media 8154475211869065760
cannot remove 8154475211869065760: operation not supported on this type of pool
Media-Server:~ ogomez$ sudo zpool remove Media 18351213345093273837
cannot remove 18351213345093273837: operation not supported on this type of pool
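
The `replacing-1` entry in the status output above shows that the earlier `zpool replace` is still pending, which may be why `zpool remove` is refused. One avenue the thread does not try is cancelling the replacement by detaching the new device, then removing the original log device. Whether this works on this pool is an untested assumption, so the sketch below (with a hypothetical function name) only prints the commands:

```shell
# Dry run: prints the commands for cancelling a pending replace on a log device.
# Arguments: pool name, GUID of the original device, GUID of the replacement.
print_cancel_replace_steps() {
  if [ "$#" -ne 3 ]; then
    echo "usage: print_cancel_replace_steps POOL OLD_GUID NEW_GUID" >&2
    return 2
  fi
  echo "zpool detach $1 $3"   # detach the replacement to cancel the replace
  echo "zpool remove $1 $2"   # then try removing the original log device
}
```

With the GUIDs from the status output, that would be `print_cancel_replace_steps Media 18351213345093273837 8154475211869065760`.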


Puzzling (to me at least)

Re: Can't import raidz2 pool after ZIL ssd failure

Postby galego » Fri Nov 18, 2016 8:39 pm

Would love love love to figure this out.

It's tough to expect much when I only interact in this forum when something is amiss. Although I suppose that paucity is a testament to the stability of OpenZFS on OS X.

This makes unattended reboots a pain, but can't really complain if no one has any other ideas.

Re: Can't import raidz2 pool after ZIL ssd failure

Postby galego » Sun Nov 20, 2016 1:15 pm

Scoured the forums for a solution and no luck. Unless one of the OpenZFS gurus out there can chime in with a solution for me to try, I'll take this to be a bug.

Cheers,
Oliver

Re: Can't import raidz2 pool after ZIL ssd failure

Postby leeb » Mon Nov 21, 2016 2:23 pm

galego wrote:Would love love love to figure this out.

I'm sorry it wasn't straightforward to clear the errors here. You don't say what system you're running this on, but if it's a tower where it's trivial to swap drives (or if you want to temporarily partition your boot drive), or if you have another system available (including FreeBSD/Linux), you could try importing the pool elsewhere and clearing the errors there.
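
Moving the pool to another machine might look roughly like the following. This is a sketch under assumptions: the function name is made up, it prints the commands rather than running them, and it assumes the other system's ZFS supports the features already enabled on the pool.

```shell
# Dry run: prints the steps for trying the pool on a second machine.
print_move_pool_steps() {
  if [ "$#" -ne 1 ]; then
    echo "usage: print_move_pool_steps POOL" >&2
    return 2
  fi
  echo "zpool export $1"      # on the original machine, if the pool is imported
  echo "zpool import"         # on the other machine: list pools available for import
  echo "zpool import -m $1"   # import despite the missing log device
  echo "zpool clear $1"       # attempt to clear the errors there
}
```

If the pool could not be exported cleanly first, the import on the second machine may also need `-f`.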

If it's truly annoying though and it's data that's important (not sure if "media server" means it's just convenience storage for network movie access and the like, or if it's used for important project media data), you could also treat this as an excuse to test your near-line replica with a nuke&pave onto a fresh pool. I recognize that could be a significant time sink (I've only got around 10TB but that's still the better part of a day to replicate), and it's galling to use an inelegant brute-force approach for what should be a simple quickie. I've just gone through a similar set of feelings with an obscure network hardware bug that I'd love to track down rather than just RMA the equipment. Still, it's not like it's a useless exercise on its own anyway. The old saw of "if you've never tested your backup you don't really have one" isn't without merit.

galego wrote:It's tough to expect much when I only interact in this forum when something is amiss. Although I suppose that paucity is a testament to the stability of OpenZFS on OS X.

That's likely a lot of it, but it probably also has to do with there being a great deal of time between stable releases, or even test releases for that matter (beyond "compile your own"), which slows down discussion a bit. At some point enough time goes by and you see enough fixes for various issues appearing in commits and developer discussion that it becomes unclear whether an issue is worth sinking a lot of time into diagnosis/testing because it may already be dealt with, and it's tempting to just wait for the next major and then see if the issue still exists.

O3X may also just lack critical mass right now, and Apple's regrettable decisions on the desktop computer front may have further reduced the most natural potential user base. I'd love to throw together a test pool to try to reproduce your problem, for example, but I'm just not really in a position to do it right now in terms of available systems, and I'm waiting on the next release (maybe by the end of the year?) for two newer MPs so that I can try it with 10.12 as well.

I do also think that O3X has made massive strides this year alone in terms of general production readiness. Features like Spotlight support are significant core issues for use under OS X (well, macOS now I guess; shucks, I hadn't even considered how their rebrand messes with the lovely O3X acronym), so getting that sorted out was key. It sounds like this upcoming one will have native crypto (or at least make adding crypto easier) and a significant number of QOL improvements, at which point I think it may well have 100% of the foundational stuff. From that other Sierra Disk Utility thread it sounds like there will even be an opportunity to start on some simple GUI integration, which would be pretty neat. Hopefully the future is reasonably bright.

Re: What to do after ZIL ssd failure?

Postby JasonBelec » Mon Feb 06, 2017 7:52 am

Well, SSDs fail spectacularly, just like HDDs, so backups are a good thing, and snapshots are a good thing.

That said, have you tried fixing the SSD with the manufacturer's tools? Since you don't mention which SSD it is, I can't be more specific. I have a couple of dead ones here from one particular manufacturer, OCZ. I also have some that have been running since SSDs came out. Relying on others sucks.

Have you tried cloning the SSD (forensic level) to another and swapping that new one into the pool? Sometimes you have to get rough.
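
A block-level clone of that kind is commonly done with `dd`, and it only makes sense if the failed SSD can still be read at all. A hedged sketch: the function name is hypothetical, the source and destination paths are supplied by the caller, and the helper prints the command instead of running it, since swapping the two paths would overwrite the wrong disk.

```shell
# Prints (does not run) a dd command for a raw device-to-device copy.
print_clone_cmd() {
  if [ "$#" -ne 2 ]; then
    echo "usage: print_clone_cmd SOURCE_DEVICE DEST_DEVICE" >&2
    return 2
  fi
  # conv=noerror,sync keeps going past read errors and pads failed or short
  # reads with zeros, so offsets on the copy stay aligned with the source.
  echo "dd if=$1 of=$2 bs=64k conv=noerror,sync"
}
```

For example, `print_clone_cmd /dev/rdisk5 /dev/rdisk6`, where both device paths are placeholders that should be checked against `diskutil list` first.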
JasonBelec
 
Posts: 32
Joined: Mon Oct 26, 2015 1:07 pm

