Crashing problems with 2.1.6 on Monterey

Posted: Mon Jan 09, 2023 9:36 am
by jwiegley
Hello,

This weekend I upgraded my Monterey machine from 2.1.0 to 2.1.6. I also upgraded my pool, and changed the checksum from `sha512` to `blake3`. In hindsight this was too many changes at once, but I've had such good luck with OpenZFS on OS X over the past decade that I proceeded without fear.

However, within 48 hours of the upgrade, two things happened:

1. ZFS now thinks one of my drives is busted and marks it as `REMOVED` from its mirror.
2. When trying to copy files from a filesystem in the pool (even when imported with `readonly=true`), the kernel panics after just a few minutes of `rsync`.

I can't find any logs on my system relating to these crashes, but most of the panic messages mentioned a "corrupted heap", and one mentioned an "NMI timeout" or something along those lines.

So I have a few questions:

a. Can I somehow undo the upgrade of my pool and revert to 2.1.0? It was rock stable.
b. Is there any debug version of 2.1.6 I can run to provide better feedback in order to help track this problem down?

Thanks,
John

Re: Crashing problems with 2.1.6 on Monterey

Posted: Mon Jan 09, 2023 5:52 pm
by jwiegley
I tried installing 2.1.6 on another Monterey machine to use `zfs send` for backing up some of the filesystems, but that also crashed after only about a minute of copying. The error was "kernel heap corruption detected".

Re: Crashing problems with 2.1.6 on Monterey

Posted: Mon Jan 09, 2023 9:32 pm
by Sharko
Hi jwiegley, you might find that the recover tunable gets your pool functional long enough to get the data off it. I've also had luck importing a pool as read-only when I've had problems in the past. Here is a posting that has details:

viewtopic.php?f=24&t=3728

Kurt

Re: Crashing problems with 2.1.6 on Monterey

Posted: Wed Jan 11, 2023 11:22 am
by jwiegley
Thank you, Kurt. I was able to find an arrangement that lets me copy more data off:

1. Installed 2.1.6 on my Monterey laptop and imported the crashing pool there with `-N -o readonly=on`.
2. Set `recover=1` as suggested.
3. Downgraded to 2.1.0 on my Monterey desktop and created a new pool there with new disks.
4. Started copying with `sudo zfs send -cR tank@snapshot | ssh desktop sudo zfs recv -F tank`.

This has succeeded at copying 1.3 TB so far, which is far better than anything prior.
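
For anyone retracing these steps, here is a minimal sketch of the import and transfer as a dry-run script. The pool, snapshot, and host names are the ones from the commands above; `DRY_RUN`, `run`, `import_readonly`, and `replicate` are hypothetical helper names. Step 2 (the recover tunable) is left out because its exact name is not given in this thread.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of OpenZFS.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: import the pool without mounting anything (-N) and read-only.
import_readonly() {
    run sudo zpool import -N -o readonly=on "$1"
}

# Step 4: send a recursive (-R), compressed (-c) stream to another host.
# Arguments: source snapshot, destination host, destination pool.
replicate() {
    src=$1 host=$2 dst=$3
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "sudo zfs send -cR $src | ssh $host sudo zfs recv -F $dst"
    else
        sudo zfs send -cR "$src" | ssh "$host" sudo zfs recv -F "$dst"
    fi
}
```

With `DRY_RUN=1`, calling `import_readonly tank` and then `replicate tank@snapshot desktop tank` only prints the two commands; setting `DRY_RUN=0` would actually run them.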

John

Re: Crashing problems with 2.1.6 on Monterey

Posted: Wed Jan 11, 2023 1:49 pm
by jwiegley
Just a bit more information: I'm not sure whether this is due to `blake3`, but one filesystem on the pool in particular seems to have become "radioactive". Any attempt to copy files from it, or to run `zfs send` on it, crashes the system, with either a checksum failure or an input/output error. This is the only filesystem in the pool where I added new files after changing the checksum scheme to `blake3`.

Re: Crashing problems with 2.1.6 on Monterey

Posted: Wed Jan 11, 2023 1:51 pm
by jwiegley
For example, when the 2.1.6 pool started reporting errors, I asked to see them with `zpool status -v` and was presented with this list:

errors: Permanent errors have been detected in the following files:

        tank/Media:/macOS/Software/Vivaldi.5.5.2805.48.universal.dmg
        tank/Media:/macOS/Software/TorBrowser-12.0.1-macos_ALL.dmg
        tank/Media:/macOS/Software/OpenZFS_on_OS_X_2.1.6.dmg
        tank/Media:/macOS/Software/Notion-2.1.4.dmg
        tank/Media:/macOS/Software/dbvis_macos-x64_14_0_2.dmg


This happens to be exactly the set of files I added in the day since I enabled `blake3`.
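
To compare such a list against recently added files, the entries can be split into dataset and path. This is a minimal sketch, assuming the `zpool status -v` layout shown above (`dataset:/path` lines following the "Permanent errors" header); `damaged_files` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads `zpool status -v` output on stdin and prints one
# "dataset<TAB>path" line per entry in the permanent-errors list.
damaged_files() {
    awk '
        # Everything after this header line is treated as the error list.
        /Permanent errors have been detected/ { in_list = 1; next }
        in_list && NF == 0 { next }            # skip blank lines
        in_list {
            line = $0
            sub(/^[ \t]+/, "", line)           # drop the indentation
            i = index(line, ":")               # first colon separates dataset and path
            if (i > 0)
                print substr(line, 1, i - 1) "\t" substr(line, i + 1)
        }
    '
}
```

Usage would be along the lines of `sudo zpool status -v tank | damaged_files`.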

Re: Crashing problems with 2.1.6 on Monterey

Posted: Wed Jan 11, 2023 6:20 pm
by cgiard
Sounds related to https://openzfsonosx.org/forum/viewtopic.php?f=26&t=3742. Makes me wonder if using blake3 somehow corrupts the zpool itself.

Re: Crashing problems with 2.1.6 on Monterey

Posted: Fri Jan 13, 2023 12:19 am
by jwiegley
After moving the new pool's contents back to a 2.1.0-created pool using sha512 (I was able to recover every file but one), I then tried this:

1. Destroy the old pool that was causing kernel panics.
2. Create a brand new 2.1.6 pool, with the checksum set to `blake3` from the start.
3. Use `zfs send | zfs recv` to copy the old pool (NOT upgraded this time) to the new pool, so that all the files would be rewritten with the new `blake3` checksum.

It got maybe 2 minutes into this copy before panicking the kernel.
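
For anyone who wants to reproduce this, steps 2 and 3 can be sketched roughly as below. The exact commands used are not given above, so this is an assumption: the pool, dataset, snapshot, and disk names are hypothetical, `plan_blake3_copy` is a made-up helper, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Prints (does not execute) the commands for steps 2 and 3.
# Arguments: source snapshot, new pool name, destination dataset,
# followed by the devices for the new mirror. All names are hypothetical.
plan_blake3_copy() {
    src=$1 new=$2 dst=$3
    shift 3
    printf '%s\n' \
        "sudo zpool create -O checksum=blake3 $new mirror $*" \
        "sudo zfs send $src | sudo zfs recv $dst"
}

# Example: plan_blake3_copy oldtank/Media@migrate newtank newtank/Media disk2 disk3
```

The `-O checksum=blake3` option sets the property on the new pool's root dataset at creation time, so datasets received without their own checksum setting inherit it and their blocks are written with `blake3`.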

So... I'm happy to debug further if anyone has steps for me to follow, but at this point it seems that `blake3` turns my pools radioactive.

John