Crashing problems with 2.1.6 on Monterey

Post by jwiegley » Mon Jan 09, 2023 9:36 am

Hello,

This weekend I upgraded my Monterey machine from 2.1.0 to 2.1.6. I also upgraded my pool, and changed the checksum from `sha512` to `blake3`. In hindsight this was too many changes at once, but I've had such good luck with OpenZFS on OS X over the past decade that I proceeded without fear.

However, within 48 hours of the upgrade, two things happened:

1. ZFS now thinks one of my drives is busted and marks it as `REMOVED` from its mirror.
2. When trying to copy files from a filesystem in the pool (even when imported with `readonly=true`), the kernel panics after just a few minutes of `rsync`.

I can't find any logs on my system relating to these crashes, but most of the panics reported a "corrupted heap" and one mentioned an "NMI timeout", or something along those lines.

So I have a few questions:

a. Can I somehow undo the upgrade of my pool and revert to 2.1.0? It was rock stable.
b. Is there any debug version of 2.1.6 I can run to provide better feedback in order to help track this problem down?

Thanks,
John
jwiegley
 
Posts: 7
Joined: Fri May 14, 2021 9:06 am

Post by jwiegley » Mon Jan 09, 2023 5:52 pm

I tried installing 2.1.6 on another Monterey machine to use `zfs send` for backing up some of the filesystems, but that also crashed after only about a minute of copying. The error was "kernel heap corruption detected".

Post by Sharko » Mon Jan 09, 2023 9:32 pm

Hi jwiegley, the recover tunable might keep your pool functional long enough to get the data off it. I've also had luck importing a pool as read-only when I've had problems in the past. Here is a post with the details:

viewtopic.php?f=24&t=3728

Kurt
Sharko
 
Posts: 230
Joined: Thu May 12, 2016 12:19 pm

Post by jwiegley » Wed Jan 11, 2023 11:22 am

Thank you, Kurt. I was able to find an arrangement that lets me copy more data off:

1. Installed 2.1.6 on my Monterey laptop and imported the crashing pool there with `-N -o readonly=on`.
2. Set `recover=1` as suggested.
3. Downgraded to 2.1.0 on my Monterey desktop and created a new pool there with new disks.
4. Started copying with `sudo zfs send -cR tank@snapshot | ssh desktop sudo zfs recv -F tank`.

This has succeeded at copying 1.3 TB so far, which is far better than anything prior.
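
For anyone following along, the steps above can be sketched roughly as the script below. It is a dry-run sketch, not a tested recovery procedure: it only prints the commands unless `RUN=1` is set. The pool name `tank`, the snapshot name `snapshot`, and the ssh host `desktop` are taken from the command in step 4. The helper names `run` and `recover_pool` are made up for illustration. The exact name of the recover tunable is not given in this thread, so that step is left as a comment.

```shell
#!/bin/sh
# Dry-run sketch: prints each command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        sh -c "$1"
    else
        printf '%s\n' "$1"
    fi
}

# recover_pool POOL SNAPSHOT HOST  (hypothetical helper)
recover_pool() {
    pool=$1
    snap=$2
    host=$3
    # Import read-only and without mounting any datasets (-N).
    run "sudo zpool import -N -o readonly=on $pool"
    # Step 2 (recover=1) goes here; see the linked post for how to set it.
    # Send the snapshot as a compressed (-c) replication stream (-R) and
    # receive it on the other machine, overwriting the target (-F).
    run "sudo zfs send -cR $pool@$snap | ssh $host sudo zfs recv -F $pool"
}

recover_pool tank snapshot desktop
```

Keeping the import read-only and unmounted means nothing touches the damaged pool except the send itself.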

John

Post by jwiegley » Wed Jan 11, 2023 1:49 pm

Just a bit more information: I'm not sure whether this is due to using `blake3` or not, but it seems that it's one of my filesystems on the pool in particular that's become "radioactive". Any attempt to copy files from it, or use `zfs send`, will crash the system: either with a checksum failure, or an input/output error. This is the only filesystem in the pool where I added new files after changing the checksum scheme to `blake3`.

Post by jwiegley » Wed Jan 11, 2023 1:51 pm

For example, when the 2.1.6 pool started reporting errors, I asked to see them with `zpool status -v` and was presented with this list:

errors: Permanent errors have been detected in the following files:

        tank/Media:/macOS/Software/Vivaldi.5.5.2805.48.universal.dmg
        tank/Media:/macOS/Software/TorBrowser-12.0.1-macos_ALL.dmg
        tank/Media:/macOS/Software/OpenZFS_on_OS_X_2.1.6.dmg
        tank/Media:/macOS/Software/Notion-2.1.4.dmg
        tank/Media:/macOS/Software/dbvis_macos-x64_14_0_2.dmg


This happens to be exactly the set of files added in the day since I enabled `blake3`.
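
To compare a list like this against recently added files, the error section can be pulled out of the `zpool status -v` output with a small filter. This is only a sketch: the function name `list_errors` is made up, and it assumes each entry has the `dataset:/path` form shown above.

```shell
#!/bin/sh
# Read `zpool status -v` output on stdin and print "dataset<TAB>path"
# for each entry listed under the "Permanent errors" heading.
list_errors() {
    awk '
        /Permanent errors have been detected/ { in_errors = 1; next }
        in_errors && NF {
            line = $0
            sub(/^[ \t]+/, "", line)      # strip leading indentation
            i = index(line, ":")          # split at the first colon
            if (i > 0) print substr(line, 1, i - 1) "\t" substr(line, i + 1)
        }
    '
}

# Usage (pool name assumed to be "tank"):
#   sudo zpool status -v tank | list_errors
```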

Post by cgiard » Wed Jan 11, 2023 6:20 pm

Sounds related to https://openzfsonosx.org/forum/viewtopic.php?f=26&t=3742. Makes me wonder if using blake3 somehow corrupts the zpool itself.
cgiard
 
Posts: 22
Joined: Sat Dec 20, 2014 8:10 am

Post by jwiegley » Fri Jan 13, 2023 12:19 am

After moving the new pool's contents back to a 2.1.0-created pool using sha512 (I was able to recover every file but one), I then tried this:

1. Destroy the old pool that was causing kernel panics.
2. Create a brand new 2.1.6 pool, with the checksum set to blake3 from the start.
3. Use `zfs send | zfs recv` to copy the old pool (NOT upgraded this time) to the new pool, so that all the files would be rewritten with the new blake3 checksum.

It got maybe 2 minutes into this copy before panicking the kernel.
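
For anyone who wants to try reproducing this on a scratch pool, steps 2 and 3 might look roughly like the dry-run sketch below. It prints the commands unless `RUN=1` is set. All names are hypothetical: pools `oldtank` and `newtank`, dataset `data`, snapshot `migrate`, and devices `disk2` and `disk3`. It uses a plain (non-replication) send, on the assumption that the received dataset then inherits `checksum=blake3` from the new pool's root dataset. Given the panics described above, only run it against disks you can afford to lose.

```shell
#!/bin/sh
# Dry-run sketch: prints each command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        sh -c "$1"
    else
        printf '%s\n' "$1"
    fi
}

# repro_blake3 OLDPOOL NEWPOOL DEVICE...  (hypothetical helper)
repro_blake3() {
    old=$1
    new=$2
    shift 2
    # Create the new mirror with blake3 set on the root dataset from the start.
    run "sudo zpool create -O checksum=blake3 $new mirror $*"
    # Snapshot one dataset on the old pool and copy it across.
    run "sudo zfs snapshot $old/data@migrate"
    run "sudo zfs send $old/data@migrate | sudo zfs recv $new/data"
    # Check which checksum the received dataset ended up with.
    run "zfs get -H -o value checksum $new/data"
}

repro_blake3 oldtank newtank disk2 disk3
```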

So... I'm happy to debug further if anyone has steps for me to follow, but it seems that blake3 turns my pools radioactive at this point.

John