Read errors on stripe cause hang/crash.

Postby zandr » Thu Mar 19, 2020 11:23 am

I've been using ZFS for years on FreeNAS. A recent home lab project opened my eyes to how performant ZFS replication is relative to filesystem backups, so I've decided to try it on my desktop.

I have a ThunderBay 4 (TB2) that's been a SoftRAID stripe (raid0) 'local cache' of my photo library, which normally lives on a FreeNAS raidz2. To try out O3X, I rebuilt the ThunderBay as an OpenZFS stripe. Before we go any further: yes, I understand the risks of a stripe, so there's no point shouting at me about that. That's why I describe it as a 'local cache'. It's there for performance, the source of truth is the NAS, and that's backed up to the *other* NAS, also a raidz2.

With that said, I've been getting weird crashes. By weird I mean the machine just locks up and eventually reboots. No kernel panic, no crash dump, no "You restarted your computer..." dialog on reboot. Just a long (10s?) freeze and a reboot.

I've done some basic troubleshooting, and it appears that this is related to errors on the stripe. I shouldn't be too surprised that ZFS reported errors that SoftRAID didn't, but it seems that after I hit a crop of read/checksum errors, IO will stop, then a bit later the entire machine will freeze, and then apparently the watchdog kicks in and reboots it. I replaced the drive that was throwing errors and cooked up a quick and dirty qualification test: fill the array from /dev/urandom, then run a scrub.
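For anyone who wants to try the same qualification test, here is a rough sketch of it as a script. The pool name `tank`, the mount point under /Volumes, and the file name `fill.bin` are placeholders and not details from this setup; the `run` wrapper and the `DRY_RUN` switch are also just conveniences of this sketch. It defaults to printing the commands rather than running them, since filling a pool is slow and destructive to free space.

```shell
#!/bin/sh
# Hypothetical pool name and mount point; adjust to match your own setup.
POOL="${POOL:-tank}"
MOUNT="${MOUNT:-/Volumes/$POOL}"
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Fill the pool with incompressible data.
#    dd exits non-zero once the pool is full, which is expected here.
#    (bs=1m is the BSD/macOS spelling; GNU dd wants bs=1M.)
run dd if=/dev/urandom of="$MOUNT/fill.bin" bs=1m || true

# 2. Start a scrub so every allocated block is read back and checksummed.
run zpool scrub "$POOL"

# 3. Show scrub progress and the per-device READ/WRITE/CKSUM counters.
run zpool status -v "$POOL"
```

Run it once as-is to check the commands it prints, then with `DRY_RUN=0 POOL=yourpool` to actually run the test. Re-running `zpool status -v` while the scrub is in progress shows which device is accumulating errors.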

Another disk started throwing errors when the scrub hit about 90% complete. I verified that it follows the drive if I move the disks around in the enclosure. I've ordered a replacement for that drive as well, but it'll be Monday before it gets here in the current environment. But that's an opportunity:

I now have a zpool that will reliably crash my machine within a few minutes of import, because it has an active scrub that will trigger all of this. How can I help debug this? What logs/debug info should I already have, and what additional debug can I turn on to help pin this down?

This would be a great solution for my use case, but I'd really like to avoid a read error bringing down the entire host. :)

System Details: iMac15,1 4 GHz Quad-core i7, 32GB, Catalina macOS 10.15.4 (19E264b) (Yes, that's beta, I know)

$ zfs version
zfs-1.9.3-0
zfs-kmod-1.9.3-0

ThunderBay 4:
Vendor Name: Other World Computing
Device Name: ThunderBay 4
Vendor ID: 0x5A
Device ID: 0xDE08
Device Revision: 0x1
UID: 0x005ADE0815308E80
Route String: 3
Firmware Version: 24.2

This is the drive that's currently throwing errors, the other one was the same revision:
WDC WD40EFRX-68WT0N0:

Capacity: 4 TB (4,000,787,030,016 bytes)
Model: WDC WD40EFRX-68WT0N0
Revision: 82.00A82
Serial Number: WD-WCC4E1LH10U9
Native Command Queuing: Yes
Queue Depth: 32
Removable Media: No
Detachable Drive: No
BSD Name: disk18
Rotational Rate: 5400
Medium Type: Rotational
Partition Map Type: GPT (GUID Partition Table)
S.M.A.R.T. status: Verified
zandr
 
Posts: 7
Joined: Thu Mar 19, 2020 6:19 am

Re: Read errors on stripe cause hang/crash.

Postby lundman » Fri Mar 20, 2020 1:54 am

If it is panicking it should write report files to /Library/Logs/DiagnosticReports
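A quick way to check that directory after a crash might look like the sketch below. The `*.panic` file name pattern and the one-day window are assumptions, and `recent_panics` is just a helper name made up for this example; adjust as needed.

```shell
#!/bin/sh
# Directory named above; override REPORT_DIR to look somewhere else.
REPORT_DIR="${REPORT_DIR:-/Library/Logs/DiagnosticReports}"

# Print panic reports modified within the last day.
# The '*.panic' pattern is an assumption about how the reports are named.
recent_panics() {
    [ -d "$1" ] || return 0
    find "$1" -type f -name '*.panic' -mtime -1 2>/dev/null
}

# Of those, list the ones that mention zfs anywhere in the report.
# -i: case-insensitive, -l: print only the names of matching files.
recent_panics "$REPORT_DIR" | while IFS= read -r f; do
    grep -il 'zfs' "$f" || true
done
```

If this prints nothing, either no panic report was written in the last day or none of them mention ZFS.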
lundman
 
Posts: 755
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Read errors on stripe cause hang/crash.

Postby zandr » Fri Mar 20, 2020 1:40 pm

No useful logs there, I just ran two cycles and got two hangs. There's an lsd crash in that directory now, but nothing that points to ZFS.

It isn't a panic, this looks like a hard crash that results in a watchdog.

Can I reload the kext or run import with some extra debug flags that would help?
zandr
 
Posts: 7
Joined: Thu Mar 19, 2020 6:19 am

Re: Read errors on stripe cause hang/crash.

Postby 4ever6 » Sat Mar 21, 2020 8:42 pm

Hey, nice job. Now I know I probably have some silently failing HDDs in my enclosures...I miss my checksumming.

I started battling this problem some time ago, and it ultimately is what made me migrate off, and back onto SoftRAID, which is disappointing. I would love to get this fixed, and knowing you can reproduce it so easily is great.

I can't search for my own posts--I recall starting some conversations about this, but, life ultimately dictated my ability to move forward. I had a full debug OS on my Mac while I was hunting this one down, and was trying to use the NMI trigger/power button to get a remote debug during the hang. I was never successful. The NMI trigger worked great, though, in other situations. I was able to reproduce it pretty easily, interestingly also using OWC/TB2 enclosures (2x4 3.5 [HDD], 1x4 2.5 [SSD]).

Now you are going to make me interested in this, again, and I don't even know why I stopped in here randomly. Just wanted to see how things were going.

If it's narrowed to a read-failure code path, that is a better place to start than I had at the time. I'll start poking around in the code and also see if I can get it reproducible again. I need to set up my whole debug environment again, too.
4ever6
 
Posts: 11
Joined: Fri Apr 07, 2017 7:24 pm

Re: Read errors on stripe cause hang/crash.

Postby 4ever6 » Sun Mar 22, 2020 12:24 am

Ok. I'm at a loss as to which versions of macOS I can actually use w/ a published KDK. 10.15.3 doesn't seem to have one published, and the latest beta 4 isn't up there either (can you get old betas?). Three years of rust. I'm typically debugging linux during the day. What is everyone using for dev?
4ever6
 
Posts: 11
Joined: Fri Apr 07, 2017 7:24 pm

Re: Read errors on stripe cause hang/crash.

Postby zandr » Sun Mar 22, 2020 7:33 am

Ah, glad to hear it isn't just me.

A few thoughts:
I'm going to put *both* bad drives in the enclosure and see if I can provoke this with other layouts. It seems like this might only happen on stripes, since that's a less common use case.
I'm testing now with compression off, though I don't think a scrub cares if the blocks are compressed.
I was seeing some panics running SoftRAID. I'll see if I can provoke one of those and post it here. It may be unrelated (IIRC, it was in some Apple AHCI component), but it's worth a look.
zandr
 
Posts: 7
Joined: Thu Mar 19, 2020 6:19 am

Re: Read errors on stripe cause hang/crash.

Postby zandr » Mon Mar 23, 2020 9:06 am

Same hardware, 4-wide mirror, I now get Machine Check Exceptions. This is repeatable. Two logs attached. I've also upgraded to 1.9.4 just to see if that fixed anything. It didn't.

Kernel hacker I am not... if there are things I need to do to make this more useful, let me know. My replacement drive arrived, but I can mess with this a bit more before I get on with that project.
Attachments
Kernel_panics.zip
(5.69 KiB) Downloaded 92 times
zandr
 
Posts: 7
Joined: Thu Mar 19, 2020 6:19 am

Re: Read errors on stripe cause hang/crash.

Postby 4ever6 » Mon Mar 23, 2020 8:46 pm

zandr wrote:Same hardware, 4-wide mirror, I now get Machine Check Exceptions. This is repeatable. Two logs attached. I've also upgraded to 1.9.4 just to see if that fixed anything. It didn't.

Kernel hacker I am not... if there are things I need to do to make this more useful, let me know. My replacement drive arrived, but I can mess with this a bit more before I get on with that project.


Hmmm. Have you done a thorough hardware verification? Memory test, etc.
4ever6
 
Posts: 11
Joined: Fri Apr 07, 2017 7:24 pm

Re: Read errors on stripe cause hang/crash.

Postby zandr » Mon Mar 23, 2020 8:58 pm

Nothing more thorough than the Apple Diagnostics. (which pass)

I guess Macs are ISA enough that memtest86 will work, I'll try running that tomorrow.
zandr
 
Posts: 7
Joined: Thu Mar 19, 2020 6:19 am

Re: Read errors on stripe cause hang/crash.

Postby 4ever6 » Tue Mar 24, 2020 5:53 am

zandr wrote:Nothing more thorough than the Apple Diagnostics. (which pass)

I guess Macs are ISA enough that memtest86 will work, I'll try running that tomorrow.


Thanks. I went to storage and grabbed my larger external JBOD. Need to populate it with disks and continue figuring out the best dev environment to get this going again. Appreciate the feedback.
4ever6
 
Posts: 11
Joined: Fri Apr 07, 2017 7:24 pm
