Development

Flamegraphs

Huge thanks to Brendan Gregg for so much of the dtrace magic.

Use dtrace to profile the kernel while running a command:

dtrace -x stackframes=100 -n 'profile-997 /arg0/ {
   @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks

It will run for 60 seconds.

Convert the output to a flamegraph:

./stackcollapse.pl out.stacks > out.folded
./flamegraph.pl out.folded > out.svg

This is rsync -ar /usr/ /BOOM/deletea/ running:

rsync flamegraph

Or running bonnie++ in various stages:

Create files in sequential order
Stat files in sequential order
Delete files in sequential order

ZVOL block size

At the moment, we can only handle block sizes of 512 and 4096 in ZFS, and 512 is handled poorly. To write a single 512-byte block, the IOKit layer reads in 8 blocks (to make up a PAGE_SIZE read), modifies the buffer, then writes 8 blocks. This makes ZFS think we wrote 8 blocks, and all stats are updated accordingly. This is undesirable, since the compression ratio and similar statistics cannot be reported correctly.
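The read-modify-write arithmetic above can be sketched in a few lines of C. This is only an illustration: the name blocks_per_page_write and the constant SKETCH_PAGE_SIZE are hypothetical, and the 4096-byte page size is an assumption taken from the "8 blocks" figure in the text.

```c
/* Assumption: PAGE_SIZE is 4096 bytes, which matches 8 x 512 in the text. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: how many volume blocks get read and rewritten
 * when one block is updated, if the I/O layer always moves whole pages.
 */
static unsigned
blocks_per_page_write(unsigned volblocksize)
{
	/* Blocks of a page or larger need no read-modify-write padding. */
	if (volblocksize == 0 || volblocksize >= SKETCH_PAGE_SIZE)
		return (1);
	return (SKETCH_PAGE_SIZE / volblocksize);
}
```

With these assumptions, blocks_per_page_write(512) gives 8 and blocks_per_page_write(4096) gives 1, which is why only the 512 case inflates the write statistics.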

This limitation is in specfs, which is applied to any BLK device created in /dev. For usage with Apple and the GUI, there is not much we can do. However, we are planning to create secondary blk/chr nodes (maybe in /var/run/zfs/dsk/$POOL/$name or similar for compatibility) which will have our implementation attached as vnops. This will let us handle any block size required.

vnode_create thread

Currently, we have to protect the call to vnode_create() because it may call several vnops (fsync, pageout, reclaim), and we have a reclaim thread to deal with that. One issue is that reclaim can be called both from a separate thread (periodic reclaims) and from the thread calling vnode_create(). This makes locking tricky.

One idea is to create a vnode_create thread (one per dataset). Then in zfs_zget and zfs_znode_alloc, which call vnode_create, we simply place the newly allocated zp on the vnode_create thread's request list and resume execution. Once we have passed the "unlock" part of the functions, we can wait for the vnode_create thread to complete the request, so that we do not continue without the vp attached.

In the vnode_create thread, we pop items off the list, call vnode_create (now guaranteed to run in a separate thread) and, once it has completed, mark the node done and signal the process which might be waiting.

In theory this should let us handle reclaim, fsync and pageout as in normal upstream ZFS, with no special cases required. This should alleviate the current situation where the reclaim_list grows to very large numbers (230,000 nodes observed).

It might mean we need to be careful in any function which might end up in zfs_znode_alloc, to make sure we have a vp attached before we resume. For example, zfs_lookup and zfs_create.

The branch vnode_thread is just this idea: it creates a vnode_create_thread per dataset. When we need to call vnode_create(), it simply adds the zp to the list of requests, then signals the thread. The thread calls vnode_create() and, upon completion, sets zp->z_vnode and signals back. The requester for the zp sits in zfs_znode_wait_vnode() waiting for the signal back.
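The request list and signalling described for the vnode_thread branch could look roughly like the user-space sketch below. It is not the code from the branch: it uses POSIX threads instead of kernel primitives, every vct_ name is hypothetical, and vct_vnode_create() is a stand-in for the real vnode_create().

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a znode: only the fields the sketch needs. */
typedef struct vct_znode {
	void			*z_vnode;	/* NULL until the worker attaches a vnode */
	struct vct_znode	*next;		/* link in the request list */
} vct_znode_t;

/* Hypothetical per-dataset state for the vnode_create thread. */
typedef struct vct_dataset {
	pthread_mutex_t	lock;
	pthread_cond_t	request_cv;	/* worker sleeps here */
	pthread_cond_t	done_cv;	/* requesters sleep here */
	vct_znode_t	*head, *tail;	/* FIFO list of pending requests */
	int		shutdown;
} vct_dataset_t;

/* Stand-in for vnode_create(); the real call may re-enter vnops. */
static void *vct_vnode_create(vct_znode_t *zp) { return (zp); }

static void
vct_dataset_init(vct_dataset_t *ds)
{
	pthread_mutex_init(&ds->lock, NULL);
	pthread_cond_init(&ds->request_cv, NULL);
	pthread_cond_init(&ds->done_cv, NULL);
	ds->head = ds->tail = NULL;
	ds->shutdown = 0;
}

/* Worker: pop requests, create the vnode with no locks held, signal back. */
static void *
vct_thread(void *arg)
{
	vct_dataset_t *ds = arg;

	pthread_mutex_lock(&ds->lock);
	for (;;) {
		while (ds->head == NULL && !ds->shutdown)
			pthread_cond_wait(&ds->request_cv, &ds->lock);
		if (ds->head == NULL)
			break;			/* shutdown requested, list drained */
		vct_znode_t *zp = ds->head;
		ds->head = zp->next;
		if (ds->head == NULL)
			ds->tail = NULL;
		pthread_mutex_unlock(&ds->lock);
		void *vp = vct_vnode_create(zp);
		pthread_mutex_lock(&ds->lock);
		zp->z_vnode = vp;
		pthread_cond_broadcast(&ds->done_cv);
	}
	pthread_mutex_unlock(&ds->lock);
	return (NULL);
}

/* Requester side: queue the zp and carry on. */
static void
vct_request_vnode(vct_dataset_t *ds, vct_znode_t *zp)
{
	pthread_mutex_lock(&ds->lock);
	zp->next = NULL;
	if (ds->tail != NULL)
		ds->tail->next = zp;
	else
		ds->head = zp;
	ds->tail = zp;
	pthread_cond_signal(&ds->request_cv);
	pthread_mutex_unlock(&ds->lock);
}

/* Plays the role of zfs_znode_wait_vnode(): returns at once if already set. */
static void
vct_wait_vnode(vct_dataset_t *ds, vct_znode_t *zp)
{
	pthread_mutex_lock(&ds->lock);
	while (zp->z_vnode == NULL)
		pthread_cond_wait(&ds->done_cv, &ds->lock);
	pthread_mutex_unlock(&ds->lock);
}

/* Ask the worker to exit once the list is empty. */
static void
vct_dataset_stop(vct_dataset_t *ds)
{
	pthread_mutex_lock(&ds->lock);
	ds->shutdown = 1;
	pthread_cond_signal(&ds->request_cv);
	pthread_mutex_unlock(&ds->lock);
}
```

A caller would start vct_thread once per dataset with pthread_create(), call vct_request_vnode() where vnode_create() used to be called, and call vct_wait_vnode() only after its own locks are dropped. The sketch shares one done_cv across all requesters for simplicity; a per-znode condition variable would avoid waking unrelated waiters.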

This means the ZFS code base is littered with calls to zfs_znode_wait_vnode() (46 to be exact), each placed at the correct location, i.e. after all the locks are released and zil_commit() has been called. It is possible that this number could be decreased, as the calls to zfs_zget() appear not to suffer from the zil_commit() issue, and can probably just block at the end of zfs_zget(). However, the calls to zfs_mknode() are what cause the issue.

sysctl zfs.vnode_create_list tracks the number of zp nodes in the list waiting for vnode_create() to complete. It is typically 0 or 1, rarely higher.

This branch appears to deadlock from time to time.

The second branch, vnode_threadX, takes a slightly different approach. Instead of a permanent vnode_create_thread, it simply spawns a thread when zfs_znode_getvnode() is called. This new thread calls _zfs_znode_getvnode(), which functions as above: it calls vnode_create() then signals back. The same zfs_znode_wait_vnode() blockers exist.
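The thread-per-request variant could be sketched as follows, again in user space with POSIX threads rather than the kernel primitives the branch would use. All vtx_ names are hypothetical, vtx_vnode_create() is a stand-in for vnode_create(), and keeping the lock and condition variable inside each znode is an assumption of this sketch, not something the text states.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a znode with its own lock and condvar. */
typedef struct vtx_znode {
	pthread_mutex_t	z_lock;
	pthread_cond_t	z_cv;
	void		*z_vnode;	/* NULL until the spawned thread sets it */
} vtx_znode_t;

/* Stand-in for vnode_create(); the real call may re-enter vnops. */
static void *vtx_vnode_create(vtx_znode_t *zp) { return (zp); }

static void
vtx_znode_init(vtx_znode_t *zp)
{
	pthread_mutex_init(&zp->z_lock, NULL);
	pthread_cond_init(&zp->z_cv, NULL);
	zp->z_vnode = NULL;
}

/* Thread body, in the role of _zfs_znode_getvnode(). */
static void *
vtx_getvnode_thread(void *arg)
{
	vtx_znode_t *zp = arg;
	void *vp = vtx_vnode_create(zp);	/* runs on its own thread */

	pthread_mutex_lock(&zp->z_lock);
	zp->z_vnode = vp;
	pthread_cond_broadcast(&zp->z_cv);	/* signal back to any waiter */
	pthread_mutex_unlock(&zp->z_lock);
	return (NULL);
}

/* In the role of zfs_znode_getvnode(): spawn a short-lived thread. */
static int
vtx_znode_getvnode(vtx_znode_t *zp)
{
	pthread_t tid;
	int err = pthread_create(&tid, NULL, vtx_getvnode_thread, zp);

	if (err == 0)
		pthread_detach(tid);	/* nobody joins; completion is via z_cv */
	return (err);
}

/* In the role of zfs_znode_wait_vnode(): no wait if z_vnode is already set. */
static void
vtx_znode_wait_vnode(vtx_znode_t *zp)
{
	pthread_mutex_lock(&zp->z_lock);
	while (zp->z_vnode == NULL)
		pthread_cond_wait(&zp->z_cv, &zp->z_lock);
	pthread_mutex_unlock(&zp->z_lock);
}
```

Compared with the permanent worker, there is no shared request list to lock, at the cost of one thread creation per vnode.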

sysctl zfs.vnode_create_list tracks the number of vnode_create threads we have started. Interestingly, this remains 0 or 1, rarely higher.

This branch has not yet deadlocked.

Conclusions:

It is undesirable that we have zfs_znode_wait_vnode() placed all over the source, and care needs to be taken with each call, although it does not hurt to call it in excess, as no wait will happen if zp->z_vnode is already set.
It is unknown whether it is OK to resume ZFS execution while z_vnode is still NULL, and only block (to wait for it to be filled in) once we are close to leaving the VNOP.

However, it is very desirable that vnop_reclaim calls are direct and can be cleaned up immediately. We no longer need to check for the zp without vp case in zfs_zget().
We no longer need to lock-protect vnop_fsync and vnop_pageout in case they are called from vnode_create().
We don't have to throttle the reclaim thread due to the list being massive (populating the list is much faster than cleaning up a zp node; up to 250,000 nodes in the list have been observed).

Create files in sequential order

VX create.svg
