Development

From OpenZFS on OS X
Revision as of 07:08, 23 March 2014 by Lundman (Talk | contribs)

Jump to: navigation, search

Development

Flamegraphs

Huge thanks to BrendanGregg for so much of the dtrace magic.

dtrace the kernel while running a command:

dtrace -x stackframes=100 -n 'profile-997 /arg0/ {
   @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks

it will run for 60 seconds.

Convert it to a flamegraph

./stackcollapse.pl out.stacks > out.folded
./flamegraph.pl out.folded > out.svg


This is rsync -ar /usr/ /BOOM/deletea/ running;

rsync flamegraph


Or running bonnie++ in various stages;


ZVOL block size

At the moment, we can only handle block size of 512 and 4096 in ZFS. And 512 is handled poorly. To write a single 512 block, iokit layer will read in 8 blocks (to make up a PAGE_SIZE read) modify the buffer, then write 8 blocks. This makes ZFS think we wrote 8 blocks, and all stats are updated as such. This is undesirable since compression ratio etc can not be reported correctly.

This limitation is in specfs, which is applied to any BLK device created in /dev. For usage with Apple and the GUI, there is not much we can do. But we are planning to create a secondary blk/chr nodes (maybe in /var/run/zfs/dsk/$POOL/$name or similar for compatibility) which will have our implementation attached as vnops. This will let us handle any block size required.


vnode_create thread

Currently, we have to protect the call to vnode_create() due to the possibility that it calls several vnops (fsync, pageout, reclaim) and have a reclaim thread to deal with that. One issue is reclaim can both be called as a separate thread (periodic reclaims) and as the calling thread of vnode_create. This makes locking tricky.

One idea is we create a vnode_create thread (with each dataset). The in zfs_zget and zfs_znode_alloc, which calls vnode_create, we simply place the newly allocated zp on the vnode_create thread's request list, and resume execution. Once we have passed the "unlock" part of the functions, we can wait for the vnode_create thread to complete the request so we do not resume execution without the vp attached.

In the vnode_create thread, we pop items off the list, call vnode_create (guaranteed as a separate thread now) and once completed, mark the node done, and signal the process which might be waiting.

In theory this should let us handle reclaim, fsync, pageout as normal upstream ZFS. no special cases required. This should alleviate the current situation where the reclaim_list grows to very large numbers (230,000 nodes observed).

It might mean we need to be careful in any function which might end up in zfs_znode_alloc, to make sure we have a vp attached before we resume. For example, zfs_lookup and zfs_create.