lundman wrote: You have a thread near the bottom which is heavy in dedup code, calling ddt_get_dedup_histogram()
In particular in spindump3-switchingfaces-... see:
Thread 0x123b 1000 samples (1-1000) priority 81 (base 81) cpu time 9.965
which is spending a lot of time in deduplication code: the get_histogram call lundman mentions, and also ddt_sync(), so you are writing to a dedup != off dataset or zvol.
It's also one of only two kernel_task threads with cpu time > 1 second, and one of only five threads on your whole system that racked up > 1 cpu second during your spindump run (and two of those are directly related to spindump).
The other busy kernel thread is doing txg_sync and waiting on the deduplication thread, so essentially your deduplication activity is stalling all I/O on that pool while it waits for synchronous reads (which are effectively random I/O, and so spend their time waiting on your disks to seek).
Unlike lundman, I think this is the fundamental source of your problem. The changes pulled in from upstream between 1.5 and 1.6 do better scheduling of writes in the normal case, so they let you issue more write() calls to a dedup != off dataset than before, which worsens the head-of-line blocking whenever a deduplication table entry needs to be read in (synchronously, during txg_sync).

Also, under 1.5's spl you were probably memory constrained in how much dirty data you could accumulate, but now you can use the whole kstat.zfs.darwin.tunable.zfs_dirty_data_max. If that's about 1.5G, then with 128k records you're looking at ~12000 DDT consultations per txg, and when your ARC is cold that means more than 12000 random I/Os issued for the DDT entries and the metadata pointing to them, all done while syncing the transaction group (txg) out to disk. This will head-of-line block async reads (probably causing Photos.app to pizza-wheel), and leave you with only up to dirty_data_max worth of async writes before all writes to that pool block.
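A quick back-of-the-envelope check of that figure (the 1.5G is an assumption; read your actual value first):

    sysctl kstat.zfs.darwin.tunable.zfs_dirty_data_max    # max dirty bytes, per pool
    # 1.5G of dirty data divided into 128k records:
    echo $((1610612736 / 131072))                         # => 12288 blocks, one DDT lookup each

Every one of those lookups that misses in ARC becomes a synchronous random read during txg_sync.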
Are you saving photos to a dataset where dedup is anything but "off"? Or is another dataset on the same pool doing concurrent writes?
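You can check every dataset on the pool in one go ("tank" here is a stand-in for your pool name):

    zfs get -r -t filesystem,volume dedup tank

Anything reporting a value other than "off" is feeding the DDT on every write.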
Photos won't deduplicate well, and you have a spinning-rust pool, so this seems like a recipe for bad behaviour.
In fact, friends don't let friends do deduplication *at all*, and certainly not on anything that doesn't have >> 2.0x deduplication (zpool get dedupratio), and even then not on anything where random LBA-to-LBA seek times are more than a microsecond or so (spinning disks are ~ 10 milliseconds).
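If you want to see what the pain is buying you, both of these are cheap to run (again, "tank" is a placeholder):

    zpool get dedupratio tank    # realized deduplication ratio for the whole pool
    zpool status -D tank         # adds the DDT histogram: entry counts, on-disk and in-core sizes

The status -D output also gives you a rough idea how much RAM a fully-cached DDT needs, which matters for the import and scrub behaviour described below.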
Moreover, you have lots of free space according to your zpool iostat -v output, so you should consider turning deduplication off and doing a local zfs send/recv (with the -e and -c flags) from each deduplicated dataset to a non-deduplicated replacement. This will shrink your DDT, which you will want, because there are situations in which the entire DDT has to be traversed at import time, when your ARC is cold, and all sync tasks on all pools will block during a long import that has to do that. (Additionally, scrubs traverse the DDT first, and those random I/Os will interfere with normal I/O on the pool for the first hours of the scrub.)
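A minimal sketch of that migration, assuming a dataset named tank/photos (all names hypothetical; verify the copy before destroying anything):

    zfs set dedup=off tank/photos                             # stop creating new DDT entries now
    zfs snapshot tank/photos@migrate
    zfs send -ec tank/photos@migrate | zfs recv tank/photos-new
    zfs get dedup tank/photos-new                             # confirm the replacement isn't deduplicating
    # only after you have verified the new copy:
    zfs destroy -r tank/photos                                # freeing the old blocks is what shrinks the DDT
    zfs rename tank/photos-new tank/photos

A plain send stream doesn't carry the dedup property, so the new dataset inherits it from its parent; if the parent has dedup on, set dedup=off on the new dataset explicitly before writing to it.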
ETA: you can have more dirty data in 1.6 by design. (Defensively, you could try setting the tunable very low, but that will hurt performance on all pools and datasets; it's a global value, even though what it controls is the maximum amount of dirty data *per* pool.)
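If you do want to experiment with that, something along these lines should work; the 256M value is purely illustrative, and a sysctl set this way won't survive a reboot:

    sudo sysctl -w kstat.zfs.darwin.tunable.zfs_dirty_data_max=268435456    # 256M, in bytes

Expect write throughput on every pool to drop while it's set that low; you're trading throughput for shorter txg_sync stalls.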