Development

From OpenZFS on OS X

Kernel

Debugging with GDB

Dealing with panics.

Apple's documentation: https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KEXTConcept/KEXTConceptDebugger/debug_tutorial.html

Boot the target VM with:

sudo nvram boot-args="-v keepsyms=y debug=0x144"
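
For reference, assuming the standard xnu debug flag values, debug=0x144 decodes as:

/* xnu debug flags (values as documented by Apple for two-machine debugging) */
#define DB_NMI          0x004   /* drop into the debugger on NMI */
#define DB_ARP          0x040   /* allow ARP so remote KDP works across the network */
#define DB_LOG_PI_SCRN  0x100   /* skip the graphical panic dialog */
/* DB_NMI | DB_ARP | DB_LOG_PI_SCRN == 0x144 */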

Make it panic.

On your development machine, you will need the Kernel Debug Kit matching the target's OS X version. Download it from Apple's developer downloads site.

gdb /Volumes/KernelDebugKit/mach_kernel
source /Volumes/KernelDebugKit/kgmacros
target remote-kdp
 
kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM
 
showallkmods
(find the load "address" of the zfs and spl modules)
^Z   # suspend gdb, or, use another terminal
 
kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel -e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/
 
fg # resume gdb, or go back to gdb terminal
set kext-symbol-file-path /tmp
 
add-kext /tmp/spl.kext 
add-kext /tmp/zfs.kext
 
bt

Debugging with LLDB

echo "settings set target.load-script-from-symbol-file true" >> ~/.lldbinit
lldb /Volumes/KernelDebugKit/mach_kernel
kdp-remote  192.168.30.146
showallkmods
addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)
addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000

Then follow the guide for GDB above.

Non-panic

If you prefer to work in gdb, you can always panic a kernel with

sudo dtrace -w -n "BEGIN{ panic();}"

But the following stackshot approach, which needs no panic at all, was revealing:

  1. /usr/libexec/stackshot -i -f /tmp/stackshot.log
  2. symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt
  3. less /tmp/trace.txt

Note that my hang is here:

PID: 156

   Process: zpool
   Thread ID: 0x4e2
   Thread state: 0x9 == TH_WAIT |TH_UNINT 
   Thread wait_event: 0xffffff8006608a6c
   Kernel stack: 
   machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)
     0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)
       thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)
         lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)
           0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)
             msleep (in mach_kernel) + 116 (0xffffff800056a2e4)
               0xffffff7f80e52a76 (0xffffff7f80e52a76)
                 0xffffff7f80e53fae (0xffffff7f80e53fae)
                   0xffffff7f80e54173 (0xffffff7f80e54173)
                     0xffffff7f80f1a870 (0xffffff7f80f1a870)
                       0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)
                         0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)
                           0xffffff7f80f1b65f (0xffffff7f80f1b65f)
                             0xffffff7f80f042ee (0xffffff7f80f042ee)
                               0xffffff7f80f45c5b (0xffffff7f80f45c5b)
                                 0xffffff7f80f4ce92 (0xffffff7f80f4ce92)
                                   spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)
                                     VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)

It is a shame that it only shows the kernel symbols, and not those inside SPL and ZFS, but we can ask it to load another sym file. (Alas, it cannot handle multiple symbol files. Fix this, Apple.)

  1. kextstat
  2. Grab the addresses of spl and zfs again
  3. kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel -e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/
  4. symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym
             0xffffff800056a2e4 (0xffffff800056a2e4)
               spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)
                 taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)
                   taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)
                     0xffffff7f80f1a870 (0xffffff7f80f1a870)
  5. symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym
                   0xffffff7f80e54173 (0xffffff7f80e54173)
                     vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)
                       vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)
                         vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)
                           vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)
                             spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)

Voila!

Memory leaks

If you suspect memory issues, for example from a panic like:

panic(cpu 1 caller 0xffffff80002438d8): "zalloc: \"kalloc.1024\" (100535 elements) retry fail 3, kfree_nop_count: 0"@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826

Attach gdb and use the zprint command:

(gdb) zprint
ZONE                 COUNT    TOT_SZ    MAX_SZ    ELT_SZ  ALLOC_SZ  TOT_ALLOC  TOT_FREE   NAME
0xffffff8002a89250   1620133  18c1000   22a3599   16      1000      125203838  123583705  kalloc.16 CX
0xffffff8006306c50   110335   35f000    4ce300    32      1000      13634985   13524650   kalloc.32 CX
0xffffff8006306a00   133584   82a000    e6a900    64      1000      26510120   26376536   kalloc.64 CX
0xffffff80063067b0   610090   4a84000   614f4c0   128     1000      50524515   49914425   kalloc.128 CX
0xffffff8006306560   1070398  121a2000  1b5e4d60  256     1000      72534632   71464234   kalloc.256 CX
0xffffff8006306310   399302   d423000   daf26b0   512     1000      39231204   38831902   kalloc.512 CX
0xffffff80063060c0   100404   6231000   c29e980   1024    1000      22949693   22849289   kalloc.1024 CX
0xffffff8006305e70   292      9a000     200000    2048    1000      77633725   77633433   kalloc.2048 CX

In this case, kalloc.256 is suspect.

Reboot the kernel with zlog=kalloc.256 on the command line.
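
For example, extending the boot-args used earlier:

sudo nvram boot-args="-v keepsyms=y debug=0x144 zlog=kalloc.256"

After rebooting and re-attaching gdb, we can then use: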

(gdb) findoldest
oldest record is at log index 393:


ALLOC 0xffffff803276ec00 : index 393  : ztime 21643824 -------------

0xffffff800024352e <zalloc_canblock+78>:    mov  %eax,-0xcc(%rbp)
0xffffff80002245bd <get_zone_search+23>:    jmpq 0xffffff80002246d8 <KALLOC_ZINFO_SALLOC+35>
0xffffff8000224c39 <OSMalloc+89>:           mov  %rax,-0x18(%rbp)
0xffffff7f80e847df <zfs_kmem_alloc+15>:     mov  %rax,%r15
0xffffff7f80e90649 <arc_buf_alloc+41>:      mov  %rax,-0x28(%rbp)

and indeed, we can list any index:

(gdb) zstack 394


ALLOC 0xffffff8032d60700 : index 394  : ztime 21648810 -------------

0xffffff800024352e <zalloc_canblock+78>:    mov  %eax,-0xcc(%rbp)
0xffffff80002245bd <get_zone_search+23>:    jmpq 0xffffff80002246d8 <KALLOC_ZINFO_SALLOC+35>
0xffffff8000224c39 <OSMalloc+89>:           mov  %rax,-0x18(%rbp)
0xffffff7f80e847df <zfs_kmem_alloc+15>:     mov  %rax,%r15
0xffffff7f80e90649 <arc_buf_alloc+41>:      mov  %rax,-0x28(%rbp)

How many times was zfs_kmem_alloc involved in the leaked allocs?

(gdb) countpcs 0xffffff7f80e847df
occurred 3999 times in log (100% of records)

At least we know it is our fault.

How many times is it arc_buf_alloc?

(gdb) countpcs 0xffffff7f80e90649
occurred 2390 times in log (59% of records)

Flamegraphs

Huge thanks to Brendan Gregg for so much of the dtrace magic.

dtrace the kernel while running a command:

dtrace -x stackframes=100 -n 'profile-997 /arg0/ {
   @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks

It will run for 60 seconds.

Convert the output to a flamegraph:

./stackcollapse.pl out.stacks > out.folded
./flamegraph.pl out.folded > out.svg
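
The stackcollapse.pl and flamegraph.pl scripts come from Brendan Gregg's FlameGraph repository, e.g.:

git clone https://github.com/brendangregg/FlameGraph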


This is rsync -ar /usr/ /BOOM/deletea/ running:

rsync flamegraph


Or running bonnie++ in various stages:


ZVOL block size

At the moment, we can only handle block sizes of 512 and 4096 in ZFS, and 512 is handled poorly. To write a single 512-byte block, the IOKit layer will read in 8 blocks (to make up a PAGE_SIZE read), modify the buffer, then write the 8 blocks back. This makes ZFS think we wrote 8 blocks, and all stats are updated as such. This is undesirable since the compression ratio etc. cannot be reported correctly.

This limitation is in specfs, which is applied to any block device created in /dev. For usage with Apple and the GUI, there is not much we can do. But we are planning to create secondary block/character nodes (maybe in /var/run/zfs/dsk/$POOL/$name or similar, for compatibility) which will have our own implementation attached as vnops. This will let us handle any block size required.
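
As a rough illustration only (the dataset name and the node path below are hypothetical; the final location may differ), the planned secondary nodes might be used like this:

# zfs create -V 10g BOOM/vol1
# ls -l /var/run/zfs/dsk/BOOM/vol1
# dd if=/dev/zero of=/var/run/zfs/dsk/BOOM/vol1 bs=512 count=1

Since the secondary node would have our vnops attached, the 512-byte write above would not go through the specfs PAGE_SIZE read-modify-write path.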


vnode_create thread

Currently, we have to protect the call to vnode_create() because it can itself call several vnops (fsync, pageout, reclaim), and we have a reclaim thread to deal with that. One issue is that reclaim can be called both as a separate thread (periodic reclaims) and as the calling thread of vnode_create(). This makes locking tricky.

One idea is to create a vnode_create thread (with each dataset). Then in zfs_zget and zfs_znode_alloc, which call vnode_create, we simply place the newly allocated zp on the vnode_create thread's request list and resume execution. Once we have passed the "unlock" part of those functions, we wait for the vnode_create thread to complete the request, so we do not resume execution without the vp attached.

In the vnode_create thread, we pop items off the list, call vnode_create() (now guaranteed to be a separate thread), and once it completes, mark the node done and signal the process that might be waiting.
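
A minimal sketch of that producer/consumer pattern, assuming SPL/illumos-style primitives (kmutex_t, kcondvar_t, list_t); the request structure and the zfs_attach_vnode() wrapper around vnode_create() are hypothetical names for illustration, not the actual branch code:

typedef struct vnode_create_req {
	list_node_t	vcr_node;
	struct znode	*vcr_zp;	/* zp that still needs its vnode */
	boolean_t	vcr_done;
} vnode_create_req_t;

static list_t		vc_list;	/* pending requests, one list per dataset */
static kmutex_t		vc_lock;
static kcondvar_t	vc_cv;

/*
 * Called from zfs_zget()/zfs_znode_alloc(): queue the zp and return
 * immediately so the caller can finish its locked section.
 */
static void
zfs_znode_queue_vnode(vnode_create_req_t *req, struct znode *zp)
{
	req->vcr_zp = zp;
	req->vcr_done = B_FALSE;
	mutex_enter(&vc_lock);
	list_insert_tail(&vc_list, req);
	cv_signal(&vc_cv);
	mutex_exit(&vc_lock);
}

/*
 * Called once all locks are released (and zil_commit() has run):
 * block until the worker has attached the vnode.
 */
static void
zfs_znode_wait_vnode(vnode_create_req_t *req)
{
	mutex_enter(&vc_lock);
	while (!req->vcr_done)
		cv_wait(&vc_cv, &vc_lock);
	mutex_exit(&vc_lock);
}

/* The per-dataset worker thread. */
static void
vnode_create_thread(void *arg)
{
	for (;;) {
		vnode_create_req_t *req;

		mutex_enter(&vc_lock);
		while ((req = list_remove_head(&vc_list)) == NULL)
			cv_wait(&vc_cv, &vc_lock);
		mutex_exit(&vc_lock);

		/*
		 * vnode_create() may itself call fsync/pageout/reclaim
		 * vnops; that is fine here because this thread holds no
		 * ZFS locks.  zfs_attach_vnode() is a hypothetical wrapper
		 * that calls vnode_create() and sets zp->z_vnode.
		 */
		zfs_attach_vnode(req->vcr_zp);

		mutex_enter(&vc_lock);
		req->vcr_done = B_TRUE;
		cv_broadcast(&vc_cv);
		mutex_exit(&vc_lock);
	}
}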

In theory this should let us handle reclaim, fsync, and pageout as upstream ZFS does, with no special cases required. It should also alleviate the current situation where the reclaim_list grows to very large numbers (230,000 nodes observed).

It might mean we need to be careful in any function which might end up in zfs_znode_alloc, to make sure we have a vp attached before we resume; for example, zfs_lookup and zfs_create.


The branch vnode_thread is exactly this idea: it creates a vnode_create_thread per dataset; when we need to call vnode_create(), we simply add the zp to the list of requests and signal the thread. The thread calls vnode_create() and, upon completion, sets zp->z_vnode and signals back. The requester for the zp sits in zfs_znode_wait_vnode() waiting for that signal.

This means the ZFS code base is littered with calls to zfs_znode_wait_vnode() (46, to be exact), placed at the correct locations, i.e. after all the locks are released and zil_commit() has been called. It is possible that this number could be decreased, as the calls from zfs_zget() appear not to suffer from the zil_commit() issue and can probably just block at the end of zfs_zget(); however, the calls from zfs_mknode() are what cause the issue.

The sysctl zfs.vnode_create_list tracks the number of zp nodes on the list waiting for vnode_create() to complete. It is typically 0 or 1, rarely higher.

This branch appears to deadlock from time to time.


The second branch, vnode_threadX, takes a slightly different approach. Instead of a permanent vnode_create_thread, it simply spawns a thread when zfs_znode_getvnode() is called. This new thread calls _zfs_znode_getvnode(), which works as above: call vnode_create(), then signal back. The same zfs_znode_wait_vnode() blockers exist.

Here, sysctl zfs.vnode_create_list tracks the number of vnode_create threads we have started. Interestingly, this count also remains 0 or 1, rarely higher.

This branch has not yet deadlocked.


Conclusions:

  • It is undesirable that we have zfs_znode_wait_vnode() calls placed all over the source, and care needs to be taken with each one. It does not hurt to call it in excess, though, as no wait happens if zp->z_vnode is already set.
  • It is unknown whether it is OK to resume ZFS execution while z_vnode is still NULL and only block (to wait for it to be filled in) once we are close to leaving the VNOP.


  • However, the fact that vnop_reclaim calls are direct and can be cleaned up immediately is very desirable. We no longer need to check for the zp-without-vp case in zfs_zget().
  • We no longer need to lock-protect vnop_fsync and vnop_pageout in case they are called from vnode_create().
  • We no longer have to throttle the reclaim thread due to the list growing massive (populating the list is much faster than cleaning up a zp node; up to 250,000 nodes on the list have been observed).


Create files in sequential order


IOzone flamegraph
IOzone flamegraph (untrimmed)




Iozone

A quick peek at how HFS+ and ZFS compare, just to see how much we should improve by.

HFS+ and ZFS were created on the same virtual disk in VMware. This is not an ideal test setup, but it should serve as an indicator. The pool was created with

# zpool create -f -O atime=off -o ashift=12 -O casesensitivity=insensitive -O normalization=formD BOOM /dev/disk1

and HFS+ was created with the standard OS X Disk Utility, with everything default (Journaled).

IOzone was run in standard auto mode, i.e.:

# iozone -a -b outfile.xls
HFS+ read
HFS+ write
ZFS read
ZFS write

As a guess, writes need to double, and reads need to triple.