From OpenZFS on OS X
Revision as of 03:09, 17 November 2014 by Lundman (Talk | contribs)

Jump to: navigation, search

You should also familiarize yourself with the project roadmap so that you can put the technical details here in context.


Debugging with GDB

Dealing with panics.

Apple's documentation:

Boot target VM with

$ sudo nvram boot-args="-v keepsyms=y debug=0x144"

Make it panic.

On your development machine, you will need the Kernel Debug Kit. Download it from Apple here.

$ gdb /Volumes/Kernelit/mach_kernel
(gdb) source /Volumes/KernelDebugKit/kgmacros
(gdb) target remote-kdp
(gdb) kdp-reattach   # obviously use the IP of your target / crashed VM
(gdb) showallkmods

Find the addresses for ZFS and SPL modules.

^Z to suspend gdb, or, use another terminal

$ sudo kextutil -s /tmp -n \
-k /Volumes/KernelDebugKit/mach_kernel \
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ \

Then resume gdb, or go back to gdb terminal.

$ fg
(gdb) set kext-symbol-file-path /tmp
(gdb) add-kext /tmp/spl.kext 
(gdb) add-kext /tmp/zfs.kext
(gdb) bt

Debugging with LLDB

$ echo "settings set target.load-script-from-symbol-file true" >> ~/.lldbinit
$ lldb /Volumes/KernelDebugKit/mach_kernel
(lldb) kdp-remote
(lldb) showallkmods
(lldb) addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000

Then follow the guide for GDB above.


If you prefer to work in GDB, you can always panic a kernel with

$ sudo dtrace -w -n "BEGIN{ panic();}"

But this was revealing:

$ sudo /usr/libexec/stackshot -i -f /tmp/stackshot.log 
$ sudo symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt
$ less /tmp/trace.txt

Note that my hang is here:

PID: 156
    Process: zpool
    Thread ID: 0x4e2
    Thread state: 0x9 == TH_WAIT |TH_UNINT 
    Thread wait_event: 0xffffff8006608a6c
    Kernel stack: 
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)
                0xffffff7f80e52a76 (0xffffff7f80e52a76)
                  0xffffff7f80e53fae (0xffffff7f80e53fae)
                    0xffffff7f80e54173 (0xffffff7f80e54173)
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)
                              0xffffff7f80f042ee (0xffffff7f80f042ee)
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)

It is a shame that it only shows the kernel symbols, and not inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbols files. Fix this Apple.)

$ sudo kextstat #grab the addresses of SPL and ZFS again
$ sudo kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel \
-e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ 
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym
              0xffffff800056a2e4 (0xffffff800056a2e4)
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)
$ sudo symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym
                    0xffffff7f80e54173 (0xffffff7f80e54173)
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)


Memory leaks

In some cases, you may suspect memory issues, for instance if you saw the following panic:

panic(cpu 1 caller 0xffffff80002438d8): "zalloc: \"kalloc.1024\" (100535 elements) retry fail 3, kfree_nop_count: 0"@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826

To debug this, you can attach GDB and use the zprint command:

(gdb) zprint
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX

In this case, kalloc.256 is suspect.

Reboot kernel with zlog=kalloc.256 on the command line, then we can use

(gdb) findoldest                                                                
oldest record is at log index 393:
--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------
0xffffff800024352e <zalloc_canblock+78>:        mov    %eax,-0xcc(%rbp)
0xffffff80002245bd <get_zone_search+23>:        jmpq   0xffffff80002246d8 <KALLOC_ZINFO_SALLOC+35>
0xffffff8000224c39 <OSMalloc+89>:       mov    %rax,-0x18(%rbp)
0xffffff7f80e847df <zfs_kmem_alloc+15>: mov    %rax,%r15
0xffffff7f80e90649 <arc_buf_alloc+41>:  mov    %rax,-0x28(%rbp)
and indeed, list any index
(gdb) zstack 394
--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------
0xffffff800024352e <zalloc_canblock+78>:        mov    %eax,-0xcc(%rbp)
0xffffff80002245bd <get_zone_search+23>:        jmpq   0xffffff80002246d8 <KALLOC_ZINFO_SALLOC+35>
0xffffff8000224c39 <OSMalloc+89>:       mov    %rax,-0x18(%rbp)
0xffffff7f80e847df <zfs_kmem_alloc+15>: mov    %rax,%r15
0xffffff7f80e90649 <arc_buf_alloc+41>:  mov    %rax,-0x28(%rbp)
How many times was zfs_kmem_alloc involved in the leaked allocs?
(gdb) countpcs 0xffffff7f80e847df
occurred 3999 times in log (100% of records)

At least we know it is our fault.

How many times is it arc_buf_alloc?

(gdb) countpcs 0xffffff7f80e90649
occurred 2390 times in log (59% of records)


Huge thanks to BrendanGregg for so much of the dtrace magic.

dtrace the kernel while running a command:

$ sudo dtrace -x stackframes=100 -n 'profile-997 /arg0/ {
    @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks

It will run for 60 seconds.

Convert it to a flamegraph:

$ ./ out.stacks > out.folded
$ ./ out.folded > out.svg

This is rsync -a /usr/ /BOOM/deletea/ running:

rsync flamegraph

Or running Bonnie++ in various stages:

Create files in sequential order

IOzone flamegraph
IOzone flamegraph (untrimmed)


Quick peek at how they compare, just to see how much we should improve it by.

HFS+ and ZFS were created on the same virtual disk in VMware. Of course, this is not ideal testing specs, but should serve as an indicator.

The pool was created with

$ sudo zpool create -f -o ashift=12 \
-O atime=off \
-O casesensitivity=insensitive \
-O normalization=formD \
BOOM /dev/disk1

and the HFS+ file system was created with the standard OS X Disk, with everything default (journaled, case-insensitive).

Iozone was run with standard automode:

sudo iozone -a -b outfile.xls
HFS+ read
HFS+ write
ZFS read
ZFS write

As a guess, writes need to double, and reads need to triple.



File-based zpools for testing

  • create 2 files (each 100 MB) to be used as block devices:
$ dd if=/dev/zero bs=1m count=100 of=vdisk1
$ dd if=/dev/zero bs=1m count=100 of=vdisk2
  • attach files as raw disk images:
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk1
$ hdiutil attach -imagekey diskimage-class=CRawDiskImage -nomount vdisk2
  • create mirrored zpool:
$ sudo zpool create -f -o ashift=12 -O casesensitivity=insensitive -O normalization=formD tank mirror disk2 disk3
  • show zpool:
$ sudo zpool status
  pool: tank
 state: ONLINE
  scan: none requested
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    disk2   ONLINE       0     0     0
	    disk3   ONLINE       0     0     0
errors: No known data errors
  • test ZFS features, find bugs, ...
  • export zpool:
$ sudo zpool export tank
  • detach raw images:
$ hdiutil detach disk2
"disk2" unmounted.
"disk2" ejected.
$ hdiutil detach disk3
"disk3" unmounted.
"disk3" ejected.

Platform differences

This section is an attempt to outline the differences from ZFS versions of other platforms, as compared to OSX. To assist developers new to the Apple platform, who wishes to assist, or understand, development of the O3X version.


One of the biggest hassles with OSX is the VFS layer's handling of reclaim. First it is worth noting that "struct vnode" is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode. (Of course, we could craft a mirror struct of vnode and tailor it to each OSX version where vnode changes. But that is rather hacky.)

Following that, the only place where you can set the vtype (VREG, VDIR), vdata (user pointer to hold the ZFS znode), vfsops (list of filesystem calls "vnops") etc, is only in calling vnode_create(). So there is no way to "allocate an empty vnode, and set its values later". The FreeBSD method of pre-allocating vnodes, to avoid reclaim, can not be done. ZFS will start a new dmu_tx, then call zfs_mknode which will eventually call vnode_create, so we can not do anything with dmu_tx in those vnops.

The problem is, if vnode_create decides to reclaim, it will do so directly, as the same thread. It will end up in vclean() which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. The first three of these calls, we can use the API call vnode_isrecycled() to detect if these vnops are called "the normal way", or from vclean. If we come from vclean, and the vnode is doomed, we will do as little as possible. We can not open a new TX, and we can not use mutex locks (panic: locking against ourselves).

Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you find the lovely XNU code of

2205         if (VNOP_RECLAIM(vp, ctx))
2206                 panic("vclean: cannot reclaim");

in vfs_subr.c

As for vnop_reclaim, it requires a little more work. We need to detect if curthread is coming via the inside of vnode_create. Currently, the call to vnode_create looks like: = curthread;
list_insert_head(&zfsvfs->z_reentry_threads, &reentry);
vnode_create(VNCREATE_FLAVOR, VCREATESIZE, &vfsp, vpp);
list_remove(&zfsvfs->z_reentry_threads, &reentry);

That is to say, add a node (carrying only curthread id) to the list z_reentry_threads in zfsvfs. Then at the start of vnop_reclaim, we can test curthread against this list to see if the current thread is inside a call to vnode_create. If we are not inside vnode_create (zp->z_reclaim_reentry = FALSE) we can handle the reclaim as upstream, either calling zfs_znode_free or zfs_zinactive depending on the value of z_sa_hdl. However, if we are inside vnode_create (zp->z_reclaim_reentry = TRUE) we call zfs_zinactive, but IF zp->unlinked is set, and we end up in zfs_rmnode(), this function will return early, before it does the "final TX". At the end of vnop_reclaim, we then add the znode (zp) to a zfsvfs->z_reclaim_list. Note that the znode has not finished reclaim, but its z_vnode is now set to NULL (since vnop_reclaim was called, we have no choice but to release it) nor has it called zfs_znode_free().

The reclaim_thread (started from zfs_vfsops creation of zfsvfs) will wake up, see the znode on the reclaim_list, and do the final TX, lifted from zfs_rmnode. It then finally calls zfs_znode_free(). As it is run as a separate thread, dmu_tx and dmu_tx_commit will succeed.

Fastpath vs Recycle

Another interesting aspect is that IllumOS has a delete fastpath. In zfs_remove, if it is detected that the znode can be "deleted_now", it marks the vnode as free and directly calls zfs_znode_delete(), if it can not, adds it to zfs_unlinked_add().

In OSX, there is no way to directly release a vnode. Ie, XNU always has full control of the vnodes. Even if you call vnode_recycle(), the vnode is not released until vnop_reclaim is called. The vnode can just be marked for later reclaim, but remain active (especially if you are racing against other threads using the same vnode). So in zfs_remove, we attempt to call vnode_recycle(), and only if this returns "1" do we know that vnop_reclaim was called, and we can directly call zfs_znode_delete(). Note that the O3X vnop_reclaim handler then has special code to not do anything with the vnode (zp->z_fastpath) but to only clear out the z_vnode and return.

There is also a little special lock-handling in zfs_zinactive, since we can call it from inside of a vnode_create() which is called by ZFS with locks held. If this is the case, we do not attempt to acquire locks in zfs_zinactive.