Development

<syntaxhighlight lang="bash">
$ sudo nvram boot-args="-v keepsyms=1 debug=0x144"
</syntaxhighlight>
  
 
(lldb) addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000
</syntaxhighlight>
 
The addkext command seems to be broken at present; instead use:
 
 
(lldb) target modules add ../spl/module/spl/spl.kext/Contents/MacOS/spl
(lldb) target modules load --file spl --slide 0xffffff7f91e63000
 
  
 
Then follow the guide for GDB above.
 
Voilà!
  
=== Memory leaks ===

(Note that this section is only relevant to the old O3X implementation that used the zones allocator; we now use our own kmem allocator.)
  
 
In some cases, you may suspect memory issues, for instance if you saw the following panic:
 
Our strategy was to determine how much of the Illumos allocator could be implemented on OS X. After a series of experiments in which we implemented significant portions of the kmem code from Illumos on top of bmalloc, we had learned enough to take the final step of essentially copying the entire kmem/vmem allocator stack from Illumos. Some portions of the kmem code, such as logging and hot-swap CPU support, have been disabled due to architectural differences between OS X and Illumos.
  
By default kmem/vmem require a certain level of performance from the OS page allocator, and it is easy to overwhelm the OS X page allocator. We tuned vmem to use a KMEM_QUANTUM of 512KB chunks of memory from the page allocator, rather than the smaller allocations that vmem prefers. This is less than ideal, as it reduces the ability of vmem to smoothly release memory to the page allocator when the machine is under pressure. While we have an adequately performing solution now, there will always be a tension between our allocator and OS X itself. OS X provides only minimal mechanisms to observe and respond to memory pressure in the machine, so we are somewhat limited in what can be achieved in this regard.
 
As of 1.5.2 we switched the KMEM_QUANTUM to 128KB based on feedback from a user. It was believed at the time that some tuning in the allocator had enabled this improvement. Surprisingly, this has led to reduced performance and some stuttering/beachballing on various machines. There is no apparent predictability as to which class of machine will suffer from this; newer, faster machines are apparently as susceptible as the reference machine (a Mac mini) around which the 128KB opinion was formed. It also seems that allowing wired memory to become very large can (does?) result in performance problems.

There has been further investigation into exactly why we need to obtain large blocks of memory from the page allocator when the kernel's own level-2 allocator does not. It turns out that vmem does not, in general, return memory to the page allocator on Illumos, as it is the system-wide allocator. In our case we do have to release memory back to the OS under pressure. To achieve this we need to configure vmem to act more like libumem does in user space, that is, to know that it has an upstream allocator that must be cooperated with. Furthermore, it turns out that the "quantum caches" in the heap vmem arena were not active, because the vmem arena chaining was not working at all (this is a bug). While this bug remains, the size of KMEM_QUANTUM is a proxy for the frequency of memory allocations/frees via the kernel page allocator. High frequency is not good: the page allocator is slow and heavily impacts operation of the machine (TLB shootdowns etc.).
  
 
References:

Jeff Bonwick's paper, whose design kmem and vmem implement: https://www.usenix.org/legacy/event/usenix01/full_papers/bonwick/bonwick_html/
 
=== Detecting memory handling errors ===
 
 
The kmem allocator has an internal diagnostic mode. In diagnostic mode the allocator instruments heap memory with various features and markers as it is allocated and released by application code. These markers are checked as the program runs, and can determine when an application has exhibited one or more of a set of common memory handling errors. The debugging mode is disabled by default as it carries a significant performance penalty.
 
 
The memory handling errors that can be detected include:
 
* Modify after free
* Write past end of buffer
* Free of memory not managed by kmem
* Double free of memory
* Various other corruptions
* Freed size != allocated size
* Freed address != allocated address
 
 
Debug mode is enabled by compiling with the preprocessor symbol DEBUG defined. At a minimum spl-kmem.c and spl-osx.c need to see this define for the debugging features to be completely enabled.
 
 
In debugging mode you must choose whether kmem will log the fault and then panic, or just log. If you elect to panic, there is a very high chance that the full log message will not be stored in system.log before the OS halts, and you will have to connect to the machine with lldb and use the "systemlog" command to view the diagnostic message. If you elect to not panic, the program will continue to run despite the memory corruption, with undefined consequences. In spl-kmem.c set kmem_panic=0 to log, kmem_panic=1 to log+panic.
 
 
Example:
 
 
I modified spl_start() to include the following:
 
 
  {
      ...
      int *p;
      for (int i = 0; i < 20; i++) {
          p = (int *)spl_kmem_alloc(1024);
          spl_kmem_free(p);
          *p = 0;   /* deliberate modify-after-free */
      }
  }
 
 
With the debug mode enabled the following was logged:
 
 
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: kernel memory allocator: buffer modified after being freed
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0xdeadbeef00000000)
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: buffer=0xffffff887a87d980  bufctl=0xffffff887a7ad840  cache: kmem_alloc_1152
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: previous transaction on buffer 0xffffff887a87d980:
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: thread=0  time=T-0.000001383  slab=0xffffff887a5ffe68  cache: kmem_alloc_1152
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free_debug + 0x227
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _kmem_cache_free + 0x173
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _zfs_kmem_free + 0x2c4
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: net.lundman.spl : _spl_start + 0x2bb
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext5startEb + 0x40b
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0xdd
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext4loadEhhP7OSArray + 0x3e1
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN6OSKext22loadKextWithIdentifierEP8OSStringbbhhP7OSArray + 0xf2
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZNK11IOCatalogue14isModuleLoadedEP12OSDictionary + 0xe0
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService15probeCandidatesEP12OSOrderedSet + 0x2c4
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN9IOService14doServiceMatchEj + 0x22a
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : __ZN15_IOConfigThread4mainEPvi + 0x13c
  14/08/2015 5:09:47.000 PM kernel[0]: SPL: mach_kernel : _call_continuation + 0x17
 
 
You can clearly see the kind of memory corruption, the actual corrupted data, which kmem cache was involved, the relative time at which the last action occurred, and the stack trace for the last action (which was a call to zfs_kmem_free()), indicating that spl_start() was implicated in the fault. This event would have been logged the next time the modified-after-free buffer was allocated.
 
 
=== Compiling to lower OSX versions ===
 
 
If you wish to compile O3X for a specific OS X version (in this example, targeting 10.9 while building on 10.10):
 
 
SPL:
 
./configure --with-kernel-headers=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9
 
 
ZFS:
 
./configure --with-kernelsrc=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/System/Library/Frameworks/Kernel.framework/ CFLAGS=-mmacosx-version-min=10.9
 
  
 
== Flamegraphs ==
  
 
------
 
== Unit Test ==
 
 
We have created an initial port of the standard ZFS test suite. It consists of a collection of scripts and miscellaneous utility programs that exercise the complete breadth and depth of the ZFS filesystem.
 
 
The tests are best run in a virtual machine with a baseline configured setup that has been captured in a snapshot. The tests should be run on the VM, and then due to the destructive nature of the tests, the VM should be reverted to the snapshot in preparation for future test runs.  The tests take 2-4 hours to run depending on hardware setup.
 
 
=== Setup ===
 
 
The user zfs-tests needs to be able to run sudo without being prompted for a password. Add the following to sudoers:
 
 
  zfs-tests ALL=(ALL) NOPASSWD: ALL
 
 
The sudo root environment must be configured to pass certain environment variables from zfs-tests through to the root environment. Add the following to sudoers:
 
 
  Defaults env_keep += "__ZFS_MAIN_MOUNTPOINT_DIR"
 
 
Modify /etc/bashrc to contain
 
 
  export __ZFS_MAIN_MOUNTPOINT_DIR="/"
 
 
If your development directory is ~you/Developer, clone zfs, spl and zfs-test into that directory:
 
 
  # cd ~you/Developer
 
  # git clone git@github.com:openzfsonosx/zfs-test.git
 
  # git clone git@github.com:openzfsonosx/zfs.git
 
  # git clone git@github.com:openzfsonosx/spl.git
 
 
Build ZFS using the building-from-source instructions.
 
 
Ensure that /var/tmp has approximately 100GB of free space.
 
 
Create three virtual hard drives of 10-20GB capacity each.
 
 
=== Run Test Suite ===
 
 
Set up the tests to run:
 
 
  # cd ~you/Developer/zfs-test
 
  # ./autogen.sh
 
  # ./configure CC=clang CXX=clang++
 
 
Edit the generated Makefile, change the recipe for the test_hw target such that your three virtual disks are listed in the DISKS environment variable.
 
 
  test_hw: test_verify test/zfs-tests/cmd
          @KEEP="`zpool list -H -oname`" \
            STF_TOOLS=$(abs_top_srcdir)/test/test-runner/stf \
            STF_SUITE=$(abs_top_srcdir)/test/zfs-tests \
            DISKS="/dev/disk3 /dev/disk1 /dev/disk2" \
            su zfs-tests -c "ksh $(abs_top_srcdir)/test/zfs-tests/cmd/scripts/zfstest.ksh $$RUNFILE"
 
 
Run the test suite
 
 
  sudo make test_hw
 
 
=== Results ===
 
 
The test suite writes summary pass/fail information to the console as it runs. On completion of the run, summary statistics are written to the console.
 
 
Test log files are stored in /var/tmp/<testrun> (where <testrun> is a unique-looking number). In that directory there is a log file, and a directory per test. Within each test directory is detailed log information regarding that specific test.
 
  
 
== Iozone ==
 
This section outlines the ways in which the OS X port differs from the ZFS versions on other platforms, to assist developers who are new to the Apple platform and who wish to contribute to, or understand, the development of O3X.
  
=== VFS nolocks ===

To avoid deadlocking complications, we do not call into VFS while holding any locks, at any time. This means we differ a little from the original Illumos code. In particular, in the calls that create a znode (zfs_mknode and zfs_znode_alloc) we do not attach the vnode, as we are inside a dmu_tx. For example, the VNOPs zfs_create, zfs_mkdir, zfs_symlink and zfs_make_xattrdir have been patched to call zfs_znode_getvnode() to attach the vnode '''after''' the dmu_tx has been completed. This means there is a small window in which another thread can call zget() on the same object while it does not yet have a vnode. This is detected in zget(), which delays until the vnode is attached. We should look into a better delay, perhaps a condvar with wakeup.

There is further work in zil_lwb_commit() to ensure we call zfs_get_data() without locks, and with the vnode attached; this area also differs somewhat from Illumos. The zget() call has been extended to accept flags specifying whether to allow zget on unlinked files (zil), and whether to zget without attaching the vnode (for delayed attachment after locks are released).

=== Reclaim ===

One of the biggest hassles with OS X is the VFS layer's handling of reclaim. First, it is worth noting that ''struct vnode'' is an opaque type, so we are not allowed to see, nor modify, the contents of a vnode. (Of course, we could craft a mirror struct of vnode and tailor it to each OS X version where vnode changes, but that is rather hacky.)

Following that, the '''only''' place where you can set the '''vtype''' (VREG, VDIR), '''vdata''' (user pointer to hold the ZFS znode), '''vfsops''' (list of filesystem calls "vnops") etc. is in the call to '''vnode_create()'''. So there is no way to "allocate an empty vnode, and set its values later". The FreeBSD method of pre-allocating vnodes, to avoid reclaim, can not be used. ZFS will start a new dmu_tx, then call zfs_mknode, which will eventually call vnode_create, so we can not do anything with dmu_tx in those vnops.

The problem is that if vnode_create decides to reclaim, it will do so directly, in the same thread. It will end up in vclean(), which can call vnop_fsync, vnop_pageout, vnop_inactive and vnop_reclaim. In the first three of these calls we can use the API call vnode_isrecycled() to detect whether the vnop was called "the normal way" or from vclean(). If we come from vclean(), and the vnode is doomed, we do as little as possible: we can not open a new TX, and we can not take mutex locks (panic: locking against ourselves).

Nor is there any way to defer, or delay, a doomed vnode. If vnop_reclaim returns anything but 0, you find the lovely XNU code of

  2205        if (VNOP_RECLAIM(vp, ctx))
  2206                panic("vclean: cannot reclaim");

in vfs_subr.c.

So, at the moment, there is some extra logic in '''zfs_vnop_reclaim''' to handle that we might be re-entrant in the '''vnode_create''' thread:

    exception = ((zp->z_sa_hdl != NULL) &&
        zp->z_unlinked) ? B_TRUE : B_FALSE;
    fastpath = zp->z_fastpath;

If both exception and fastpath are B_FALSE, we can call direct reclaim right there, as in those cases no final dmu_tx is needed. Following the zfs_rmnode->zfs_purgedir->zget and similar paths, exception is set to B_TRUE.

If exception is B_TRUE, we add the zp to the reclaim_list, and the separate reclaim_thread will call zfs_rmnode(zp); as a separate thread, it can handle calling dmu_tx.

If fastpath is B_TRUE, we do nothing more in zfs_vnop_reclaim. See below.
  
 
=== Fastpath vs Recycle ===
  
 
There are two variants of vn_rdwr() in OS X's SPL. The '''spl_vn_rdwr()''' call needs to be used when zfs_onexit is in use, for example in dmu_send.c (zfs send/recv) and zfs_ioc_diff (zfs diff). The XNU implementation of zfs_onexit (as in calls to '''getf''' and '''releasef''') needs to place the internal XNU ''struct fileproc'' in the wrapper ''struct spl_fileproc'', so that '''spl_vn_rdwr()''' can use it to do IO. This is the only way to do IO on a non-file-based vnode (i.e. a pipe or socket). Other places that call vn_rdwr(), for example vdev_file.c, need to call the regular vn_rdwr().
  
 
=== getattr ===
  
 
XNU has a whole set of items that it can ask for in vnop_getattr, including VA_NAME, which is used heavily by Finder (especially in the vfs_vget path). Care is needed here to return the correct name, including for hard link targets. VNOP_LOOKUP records the name that was used in the lookup, so that a following stat call (vnop_getattr) on the vnode can return the correct name if VA_NAME is requested; otherwise we fall back to using zap_value_search. The name is also set on the vnode in vfs_vget() using the vnode_update_identity() call, which mds/Spotlight expects in order to work correctly.
 
+
=== hardlinks ===

There are further complications with hardlinks as well. In POSIX, hardlinks all share va_fileid, and the z_links reference counter is incremented for each target. Finder on OS X requires a new va_linkid to be returned from vnop_getattr. It also demands that va_linkid be unique for each link target (they all share the same va_fileid, but each one has its own va_linkid). This va_linkid is used in calls to vfs_vget() to map the va_linkid to the actual znode/fileid in use, and to update both the name and the parent id.

For hardlinks, we therefore build two AVL trees, zfsvfs->z_hardlinks and zfsvfs->z_hardlinks_linkid, in vnop_getattr. The first is indexed by (parent id, file id, name) and the latter by (linkid). This allows us to create a new, unique va_linkid for each hardlink as we come across it (starting at 0x80000000 due to 32-bitness, in a weak attempt to avoid collisions). vfs_vget() then checks the AVL tree for the va_linkid and, if found, can zget the correct va_fileid and set the hardlink name and parent id. If the AVL tree does not contain the va_linkid, it falls back to a regular va_fileid lookup, so even if there are collisions it should be able to cope.

Care is also needed in vnop_remove, to remove the hardlink's node from both AVL trees, and in vnop_rename, to update the mapping to the new (parent id, file id, name). The AVL trees are unloaded at unmount.
== Merging with OpenZFS ==

Add the upstream OpenZFS repo as a remote source:

  [remote "upstream"]
        url = git@github.com:openzfs/openzfs.git
        fetch = +refs/heads/*:refs/remotes/upstream/*

Set the rename limit really high, so that git can match up our file locations:

  # git config merge.renameLimit 999999

  [merge]
        renameLimit = 999999

Make sure it is up to date:

  # git fetch upstream

Check what's new:

  # git log --stat upstream/master

For each new commit, bring it in; for example f4a6fedc42535abef5f0584fa0c6cb2af46b9ddf:

  # git cherry-pick f4a6fedc42535abef5f0584fa0c6cb2af46b9ddf

Fix any clashes, and make sure it compiles with no new errors or warnings.

Always update the commit message:

  # git commit --amend

and delete any lines like

  Closes #324

since they do not match our issue numbers.

=== Merging PRs ===

Check out the PR branch, for example PR 124:

  # git fetch upstream pull/124/head:pr124
  # git checkout pr124

And view the commits that you want.