Frequent hangs since 1.7.2; how to narrow down?


Post by poolparty » Mon Apr 30, 2018 12:10 pm

Hi all,


I’ve been running OoO 1.6.1 on my MBP for half a year with great joy and success. I’ve since upgraded to Sierra, and later to a newer MBP, with everything working flawlessly.

A few weeks ago, I upgraded to OoO 1.7.2. Shortly after that, my system started hanging at seemingly random times, forcing me to power off. I’ve logged exactly ten such hangs in the course of two weeks.

To rule out pure coincidence, I reverted to 1.6.1 last week and found my system hasn’t crashed since. I’m perfectly fine sticking to 1.6.1 for now; still, I’d love to sort it out so I can upgrade again.

I’ve been unable to reproduce those hangs reliably, and couldn’t find any hints in my log files either. The only lead I have is that they only seem to happen when there is some disk activity (e.g. in the middle of rsyncing one unrelated HFS+ volume to another).

Whenever a hang occurs, it always seems to follow the same script: suddenly, every app beachballs, apparently as soon as it tries to access the disk. The only exceptions are open Terminal windows where a shell is idling; I can click into those and type commands until I press Enter, then the shell will hang, too. Eventually, there’s nothing left to do but force the system to power off.

My only ZFS dataset (not counting its parents) is my home mount; the rest of the root filesystem is formatted HFS+. Each of the two partitions is a separate encrypted CoreStorage stack; I don’t use ZFS’s encryption.

What are possible ways to narrow down the issue?


Thanks a lot in advance and kind regards!
Claudia

Re: Frequent hangs since 1.7.2; how to narrow down?

Post by lundman » Mon Apr 30, 2018 9:31 pm

Sounds like a deadlock, most likely ZFS deadlocking whilst holding some lock in VFS, which means all future IO will also grind to a halt. So the shell will respond, but if you enter a command (and it has to load the command from disk) it will stop.

As a developer, I would NMI the machine in this state, then connect with lldb from another machine to dump all the stacks and processes. This would confirm the deadlock, but it does require two macs, and a little developer knowledge.

Although, I did fix one deadlock in zil.c for 1.7.3, so if you are on High Sierra (HS) you could try that version. There is a beta release available.

Re: Frequent hangs since 1.7.2; how to narrow down?

Post by poolparty » Sat May 12, 2018 8:23 am

Thanks a lot for your pointers @lundman!

I’ve decided to stick with Sierra on all my Macs because I want to make sure OoO 1.7.2+ works fine before I upgrade to HS. Until then, I won’t be able to try out the 1.7.3 beta I guess.

Kernel debugging is something I’ve always kind of wanted to get into, so this is a perfect excuse to start! In preparation, I’ve read a few articles, and set up a second Mac for two-machine debugging, both on macOS 10.12.6. I’m planning to use the second Mac as deadlock bait, and to use my main Mac to run the debugger. It would feel a little out of place to debug over Wi-Fi, so I’m using an Ethernet link with fixed IP addresses. I’ve also downloaded the Kernel Debug Kit for the main Mac.

On my bait Mac, I’ve set the DB_NMI (0x0004), DB_ARP (0x0040), DB_LOG_PI_SCRN (0x0100), DB_KERN_DUMP_ON_NMI (0x0800), and DB_DBG_POST_CORE (0x1000) bits in the debug boot parameter; I’ve also set _panicd_ip to the fixed IPv4 address of my main Mac, which runs the kernel debugging daemon.

To test the waters, I’ve also set DB_KERN_DUMP_ON_PANIC (0x0400) on the bait Mac. I then deployed a small Hello-World kext, which works flawlessly.
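As a quick sanity check on the arithmetic, the flag values listed above can be OR-ed together in a shell to get the single number that goes into the debug parameter. This is only a sketch: the helper name combine_flags is made up for illustration, and the values are simply the ones quoted in this post.

```shell
#!/bin/sh
# Flag values as quoted in the post above.
DB_NMI=0x0004
DB_ARP=0x0040
DB_LOG_PI_SCRN=0x0100
DB_KERN_DUMP_ON_PANIC=0x0400
DB_KERN_DUMP_ON_NMI=0x0800
DB_DBG_POST_CORE=0x1000

# combine_flags (hypothetical helper): OR together any number of
# flag values and print the resulting mask in hex.
combine_flags() {
    mask=0
    for flag in "$@"; do
        mask=$((mask | $flag))
    done
    printf '0x%x\n' "$mask"
}

# The five bits set for NMI debugging:
combine_flags "$DB_NMI" "$DB_ARP" "$DB_LOG_PI_SCRN" \
    "$DB_KERN_DUMP_ON_NMI" "$DB_DBG_POST_CORE"      # prints 0x1944

# The same, plus DB_KERN_DUMP_ON_PANIC for the panic test:
combine_flags "$DB_NMI" "$DB_ARP" "$DB_LOG_PI_SCRN" \
    "$DB_KERN_DUMP_ON_PANIC" "$DB_KERN_DUMP_ON_NMI" \
    "$DB_DBG_POST_CORE"                             # prints 0x1d44
```

So with these bits the boot parameter would read debug=0x1944, or debug=0x1d44 while the panic dump bit is also set.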

I’ve decided to take small steps in kernel country so my next task would be to test remote debugging first before I start fishing for the OoO deadlock. I’ve found a kext online called InstantPanic; however, macOS won’t let me load that kext because it’s 32-bit. It also seems to link against outdated kernel APIs, according to kextutil. Welp, I’ll have to write a small kext myself to induce a panic, I guess!

I’ll be without my bait Mac for the next few weeks but I’m already looking forward to picking up where I left off!

Re: Frequent hangs since 1.7.2; how to narrow down?

Post by lundman » Wed May 16, 2018 4:35 pm

I believe I settled on these boot-args:

Code:
sudo nvram boot-args="-v keepsyms=1 debug=0x144 kcsuffix=development"


You don't have to change the kernel to development, I just do it so that printf can print pointers.
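For reference, the debug=0x144 value can be decoded back into flag names with a few lines of shell. This is only a sketch: decode_debug is a made-up helper, and its table contains just the flag values quoted earlier in this thread, so any other bit is reported as unknown.

```shell
#!/bin/sh
# decode_debug (hypothetical helper): print the name of every known
# flag that is set in a debug= mask. The table holds only the flags
# named in this thread; remaining bits are reported as "unknown".
decode_debug() {
    mask=$(($1))
    names=""
    for entry in DB_NMI:0x0004 DB_ARP:0x0040 DB_LOG_PI_SCRN:0x0100 \
                 DB_KERN_DUMP_ON_PANIC:0x0400 DB_KERN_DUMP_ON_NMI:0x0800 \
                 DB_DBG_POST_CORE:0x1000; do
        name=${entry%%:*}     # text before the colon
        value=${entry##*:}    # text after the colon
        if [ $((mask & $value)) -ne 0 ]; then
            names="$names $name"
            mask=$((mask & ~$value))   # clear the bit we just named
        fi
    done
    if [ "$mask" -ne 0 ]; then
        names="$names unknown:$(printf '0x%x' "$mask")"
    fi
    echo "${names# }"
}

decode_debug 0x144    # prints: DB_NMI DB_ARP DB_LOG_PI_SCRN
```

In other words, 0x144 enables the NMI trigger, ARP for the debugger connection, and on-screen panic logging, without any of the core dump bits.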

The ZFS kext has an easy panic too, if you just want to try it:

Code:
sysctl kstat.zfs.darwin.tunable.vnop_debug=9119   # warning: this will panic


but you can just as easily use dtrace for it as well:

Code:
sudo dtrace -w -n "BEGIN{ panic();}"   # warning: this will panic


I used to have kernel remote dump set, so a remote machine could receive it, but it takes like 40 minutes on my small VM, so I quickly stopped that.

Once I trigger a panic (or NMI the host):

Code:
# lldb /Library/Developer/KDKs/KDK_10.13.3_17D47.kdk/System/Library/Kernels/kernel.development


(Change the path to the version of your kernel, which needs to be the same on both machines, and change to the normal kernel if you aren't running .development.) The kernel file comes from the KernelDebugKit download.

Then in lldb:

Code:
(lldb) target create --no-dependents --arch x86_64 ../spl/module/spl/spl.kext/Contents/MacOS/spl
(lldb) target create --no-dependents --arch x86_64 module/zfs/zfs
(lldb) kdp-remote 172.16.248.131


This loads the SPL and ZFS kexts and then connects to the remote machine. Change the paths and the IP of the remote machine to suit.

Then, in case of a panic, use "paniclog", "systemlog" and "bt". In case of an NMI, use "showallstacks", find a thread you are interested in, and run "switchtoact $threadaddress".

