I've been using ZFS for years on FreeNAS. A recent home lab project opened my eyes to how performant ZFS replication is relative to filesystem backups, so I've decided to try it on my desktop.
I have a ThunderBay 4 (TB2) that's been a SoftRAID stripe (raid0) serving as a 'local cache' of my photo library, which normally lives on a FreeNAS raidz2. To try out O3X, I rebuilt the ThunderBay as an OpenZFS stripe. Before we go any further: yes, I understand the risks of a stripe, so there's no point shouting at me about that. That's why I describe it as a 'local cache'. It's there for performance; the source of truth is the NAS, and that's backed up to the *other* NAS, also a raidz2.
That said, I've been getting weird crashes. When I say weird... the machine just locks up and eventually reboots. No kernel panic, no crash dump, and no "You restarted your computer..." dialog on reboot. Just a long (10s?) freeze and a reboot.
I've done some basic troubleshooting, and it appears that this is related to errors on the stripe. I shouldn't be too surprised that ZFS reported errors that SoftRAID didn't, but it seems that after I hit a crop of read/checksum errors, IO stops, then a bit later the entire machine freezes, and then apparently the watchdog reboots it. I replaced the drive that was throwing errors and cooked up a quick and dirty qualification test: fill the array from /dev/urandom, then run a scrub.
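A minimal sketch of that kind of fill-then-scrub test, for anyone who wants to reproduce it. The pool name `tank`, the mountpoint `/Volumes/tank`, the file name `fill.bin`, and the `DRY_RUN` switch are all placeholders for illustration, not the names from this setup. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a fill-then-scrub qualification test.
# POOL and MOUNT are placeholders (assumptions), not the actual names.
POOL="${POOL:-tank}"
MOUNT="${MOUNT:-/Volumes/$POOL}"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

qualify_pool() {
    # Fill the pool with incompressible data. dd exits non-zero once
    # the pool is full, which is expected here, hence the "|| true".
    run dd if=/dev/urandom of="$MOUNT/fill.bin" bs=1048576 || true
    # Kick off a scrub so every block is read back and checksummed.
    run zpool scrub "$POOL"
    # Show scrub progress and per-device READ/WRITE/CKSUM counters.
    run zpool status -v "$POOL"
}

qualify_pool
```

Set `DRY_RUN=0` to actually execute it. The scrub runs in the background, so `zpool status -v` has to be re-run to watch progress and the error counters.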
Another disk started throwing errors when the scrub hit about 90% complete. I verified that the errors follow the drive when I move the disks around in the enclosure. I've ordered a replacement for that drive as well, but in the current environment it'll be Monday before it gets here. But that's an opportunity:
I now have a zpool that will reliably crash my machine within a few minutes of import, because it has an active scrub that will trigger all of this. How can I help debug this? What logs/debug info should I already have, and what additional debugging can I turn on to help pin this down?
This would be a great solution for my use case, but I'd really like to avoid a read error bringing down the entire host.
System Details: iMac15,1, 4 GHz quad-core i7, 32 GB RAM, macOS Catalina 10.15.4 (19E264b) (yes, that's a beta, I know)
$ zfs version
zfs-1.9.3-0
zfs-kmod-1.9.3-0
ThunderBay 4:
Vendor Name: Other World Computing
Device Name: ThunderBay 4
Vendor ID: 0x5A
Device ID: 0xDE08
Device Revision: 0x1
UID: 0x005ADE0815308E80
Route String: 3
Firmware Version: 24.2
This is the drive that's currently throwing errors; the other one was the same revision:
WDC WD40EFRX-68WT0N0:
Capacity: 4 TB (4,000,787,030,016 bytes)
Model: WDC WD40EFRX-68WT0N0
Revision: 82.00A82
Serial Number: WD-WCC4E1LH10U9
Native Command Queuing: Yes
Queue Depth: 32
Removable Media: No
Detachable Drive: No
BSD Name: disk18
Rotational Rate: 5400
Medium Type: Rotational
Partition Map Type: GPT (GUID Partition Table)
S.M.A.R.T. status: Verified