lundman wrote: You have a thread near the bottom which is heavy in dedup code, calling ddt_get_dedup_histogram()
In particular in spindump3-switchingfaces-... see:
Thread 0x123b 1000 samples (1-1000) priority 81 (base 81) cpu time 9.965
which is spending a lot of time in deduplication code: the get_histogram call lundman mentions, and also ddt_sync(), so you are writing to a dedup != off dataset or zvol.
It's also one of only two kernel_task threads with cpu time > 1 second, and one of only five threads on your whole system that racked up > 1 cpu second during your spindump run (and two of those are directly related to spindump).
The other busy kernel thread is doing txg_sync and waiting on the deduplication thread, so essentially your deduplication activity is stalling all I/O on that pool while it waits for synchronous reads (which are effectively random I/O, and so spend their time waiting on your disks to seek).
Unlike lundman, I think this is the fundamental source of your problem. The changes pulled in from upstream between 1.5 and 1.6 do better scheduling of writes in the normal case, so they let you issue more write() calls to a dedup != off dataset than before, which worsens the head-of-line blocking whenever a deduplication table entry needs to be read in (synchronously, during txg_sync).

Also, under 1.5's spl you were probably memory constrained in how much dirty data you could accumulate, but now you can use the whole kstat.zfs.darwin.tunable.zfs_dirty_data_max. If that's about 1.5G, then with 128k records you're looking at ~12000 DDT consultations per txg, and when your ARC is cold that means more than 12000 random I/Os issued for the DDT entries and the metadata pointing to them, all done while syncing the transaction group (txg) out to disk. This will head-of-line block async reads (probably causing Photos.app to pizza-wheel), and leave you with only up to dirty_data_max worth of async writes before all writes to that pool block.
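A quick back-of-the-envelope check of that figure (the 1.5G is an assumption; read your actual value first):

    sysctl kstat.zfs.darwin.tunable.zfs_dirty_data_max    # max dirty bytes, per pool
    # 1.5G of dirty data divided into 128k records:
    echo $((1610612736 / 131072))                         # => 12288 blocks, one DDT lookup each

Every one of those lookups that misses in ARC becomes a synchronous random read during txg_sync.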
Are you saving photos to a dataset where dedup is anything but "off"? Or is another dataset on the same pool doing concurrent writes?
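You can check every dataset on the pool in one go ("tank" here is a stand-in for your pool name):

    zfs get -r -t filesystem,volume dedup tank

Anything reporting a value other than "off" is feeding the DDT on every write.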
Photos won't deduplicate well, and you have a spinning-rust pool, so this seems like a recipe for bad behaviour.
In fact, friends don't let friends do deduplication *at all*, and certainly not on anything that doesn't have >> 2.0x deduplication (zpool get dedupratio), and even then not on anything where random LBA-to-LBA seek times are more than a microsecond or so (spinning disks are ~ 10 milliseconds).
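If you want to see what the pain is buying you, both of these are cheap to run (again, "tank" is a placeholder):

    zpool get dedupratio tank    # realized deduplication ratio for the whole pool
    zpool status -D tank         # adds the DDT histogram: entry counts, on-disk and in-core sizes

The status -D output also gives you a rough idea how much RAM a fully-cached DDT needs, which matters for the import and scrub behaviour described below.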
Moreover, you have lots of free space according to your zpool iostat -v output, so you should consider turning deduplication off and doing a local zfs send/recv (with the -e and -c flags) from each deduplicated dataset to a non-deduplicated replacement. This will shrink your DDT, which you will want, because there are situations in which the entire DDT has to be traversed at import time, when your ARC is cold, and all sync tasks on all pools will block during a long import that has to do that. (Additionally, scrubs traverse the DDT first, and those random I/Os will interfere with normal I/O on the pool for the first hours of the scrub.)
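A minimal sketch of that migration, assuming a dataset named tank/photos (all names hypothetical; verify the copy before destroying anything):

    zfs set dedup=off tank/photos                             # stop creating new DDT entries now
    zfs snapshot tank/photos@migrate
    zfs send -ec tank/photos@migrate | zfs recv tank/photos-new
    zfs get dedup tank/photos-new                             # confirm the replacement isn't deduplicating
    # only after you have verified the new copy:
    zfs destroy -r tank/photos                                # freeing the old blocks is what shrinks the DDT
    zfs rename tank/photos-new tank/photos

A plain send stream doesn't carry the dedup property, so the new dataset inherits it from its parent; if the parent has dedup on, set dedup=off on the new dataset explicitly before writing to it.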
ETA: you can have more dirty data in 1.6 by design. (Defensively, you could try setting the tunable very low, but that will hurt performance on all pools and datasets; it's a global value, even though what it controls is the maximum amount of dirty data *per* pool.)
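If you do want to experiment with that, something along these lines should work; the 256M value is purely illustrative, and a sysctl set this way won't survive a reboot:

    sudo sysctl -w kstat.zfs.darwin.tunable.zfs_dirty_data_max=268435456    # 256M, in bytes

Expect write throughput on every pool to drop while it's set that low; you're trading throughput for shorter txg_sync stalls.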