+1 on the quoted bits.
Also:
grahamperrin wrote:For what it's worth: with my own setup, which is not optimised for performance, I suspect that occasional slowness – during scrub of a particular pool – is to be expected when 'going over' points in time where compression was greatest for blocks for a large compressible file (or for a succession of relatively small compressible files such as bands of a .sparsebundle).
That's a reasonable suspicion, but wrong.
The write pipeline is (for snv_149 and beyond) compression, encryption, checksumming and deduplication, in that order.
The normal read pipeline is: check whether the block is in the ARC, check whether it is in the L2ARC, check the dedup table, retrieve a properly checksumming copy of the block from disk, decrypt, decompress.
ZEVO CE 1.1.1 obviously has not implemented encryption or enabled deduplication, but it is extremely unlikely to have otherwise departed from these pipelines.
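To put that ordering in one place, here is a toy sketch of the two pipelines. It is not ZEVO/ZFS source; every name in it is an illustrative stand-in for the stages described above.

```python
# Toy sketch only: the stage ordering described above, not real ZFS/ZEVO code.
WRITE_STAGES = ["compress", "encrypt", "checksum", "dedup"]        # snv_149 and later
READ_STAGES  = ["arc_lookup", "l2arc_lookup", "ddt_lookup",
                "read_and_verify_checksum_from_disk", "decrypt", "decompress"]

def normal_read(block_pointer):
    """A normal read stops at the first stage that can satisfy it; only a
    miss reaches the disk, and only then are decrypt/decompress needed."""
    for cache in ("arc", "l2arc", "ddt"):
        data = cache_lookup(cache, block_pointer)      # placeholder probes
        if data is not None:
            return data
    data = read_first_properly_checksumming_copy(block_pointer)
    return decompress(decrypt(data))

# Placeholders so the sketch runs; they model behaviour, not implementation.
def cache_lookup(cache, bp): return None
def read_first_properly_checksumming_copy(bp): return b""
def decrypt(data): return data          # ZEVO CE 1.1.1: no encryption anyway
def decompress(data): return data
```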
Scrubbing and resilvering do not use the normal read pipeline; neither can stop when the first properly checksumming block is retrieved from disk, and neither has to decrypt or decompress. Both processes are also metadata-driven. In general, they grab the current transaction group (txg) from the uberblock and stash it in the resilver or scrub name-value pair, then descend mostly breadth-first from all the uberblocks (and the configuration metadata) to the pool root dataset and snapshot layer metadata, to all the descendant DSL metadata, and then back to the data, starting from the now-known-to-be-correct pool root, examining data blocks in order of birth time, then each block's children's data blocks, and so forth. The primary constraints are that no child is considered clean until its parents are considered clean, and that no child is considered clean unless all its copies and/or parity are also clean.
Once the whole tree of the stashed txg is clean, if the most recently committed txg has a more recent birth date, that tree will be descended through, except that subtrees below blocks with a birth date older than the clean txg aren't examined. This takes account of COW, the layout of the zfs metadata tree, and the transactional nature of zfs.
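If it helps to see the shape of that traversal, here is a loose sketch over a toy in-memory block tree. The dict layout and function names are my own invention, not the real scrub/resilver code:

```python
# A loose sketch of the traversal just described, over a toy block tree
# (dicts with "birth_txg", "copies", "children"); not the actual scrub code.
from collections import deque

def scrub(root, last_clean_txg=0):
    """Parents before children, roughly breadth-first; a subtree whose root
    block was born at or before a previously completed scrub's txg is skipped,
    because COW guarantees its children cannot be newer than it is."""
    queue = deque([root])
    while queue:
        block = queue.popleft()
        if block["birth_txg"] <= last_clean_txg:
            continue                                  # already-clean subtree: prune it
        for copy in block["copies"]:
            verify(copy)                              # every copy / parity must check out
        queue.extend(block.get("children", []))       # children only after the parent is clean

def verify(copy):
    """Placeholder for reading one copy from disk and comparing its checksum."""
    pass
```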
Data which is highly compressed, written in one sequential bulk pass and then left alone is likely to be faster to scrub than other data, since the blocks will all have closely related parent metadata and very similar birth dates. That favours locality on disk.
Scrubbing is typically IOPS-limited, so locality on disk is the main determinant of speed.
When you have an object which is regularly updated, by appending or by writes in the middle, you generate a lot of new blocks via COW. Each COW'd block has a newer birth date, and is tied through its metadata tree back up to a newer txg.
Therefore, for an 8MB band file, when you rewrite a 4096-byte dmg block contained in it, you reduce that block's locality on disk relative to the other blocks in that band file. Time Machine does this quite a bit, particularly in the blocks holding JHFS+ metadata. So when you are scrubbing away, it's not the compression that slows you down but rather the previous rewriting, since blocks with later birth times that share a common DSL parent are likely to be scattered around quite a bit, possibly into widely separated metaslabs on the same vdev (or, optimistically, onto the pool's other storage vdevs, which can thereby produce a concurrency gain).
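To put rough numbers on that, here is a back-of-envelope sketch. The 128 KiB recordsize is an assumption on my part; adjust for whatever your dataset actually uses:

```python
# Back-of-envelope only, assuming a 128 KiB recordsize for the dataset holding
# the band files; your dataset's recordsize may differ.
BAND_SIZE   = 8 * 1024 * 1024     # one .sparsebundle band file
RECORD_SIZE = 128 * 1024          # assumed ZFS recordsize
HFS_BLOCK   = 4096                # the JHFS+ block Time Machine actually rewrites

records_per_band = BAND_SIZE // RECORD_SIZE          # 64 records share one band
amplification    = RECORD_SIZE // HFS_BLOCK          # 32x: one 4 KiB change COWs a whole record

print(f"One {HFS_BLOCK}-byte rewrite relocates a full {RECORD_SIZE}-byte record "
      f"({amplification}x), giving it a newer birth txg than its "
      f"{records_per_band - 1} sibling records in the same band.")
```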
On the other hand, the DSL parent may have been updated for a variety of reasons (differently aged blocks held in different snapshots, and the POSIX mtime and atime attribute updates), which means less jumping around to scrub the blocks of a file all at once, and more descending through new metadata (sub)trees. Except in very full vdevs, that recovers locality on disk extremely well. (Indeed, in the Time Machine case, if you have multiple storage vdevs, the scrub will *tend* to be checking the updated blocks for different Time Machine runs concurrently.)
Scrub slowdowns are usually because of bursts of POSIX-layer data changes -- updates to directories, creating new files, and rewrites of previously existing POSIX files -- such as when making and installing a project like gcc, retrieving a bazillion new Mail messages from an IMAP server, or pushing all those changes into a Time Machine Backups volume (whether SPARSEBUNDLE or SPARSE dmg).
This is usually referred to as zfs's file fragmentation problem. It might not be a problem for some workloads; it should not present a problem for your Time Machine activity, which should benefit from zfs's large record size, its sequentialization of writes, and the inherent localities of reference within the backup task itself. However, the usual fragmentation workaround is simply to copy the fragmented objects; zfs send/recv can do this because it works with DMU objects rather than blocks. For SPARSEBUNDLE and SPARSE dmgs, you can use your favourite system tool to make a new copy all at once, or you can dig into the bundle and copy suspect bands, which you can probably find using "ls -lat" over time; lower-numbered bands which accumulate recent modification times are hot and likely fragmented. (Beware that "file defragmentation by copying" using POSIX-layer tools is complicated by zfs snapshots.) However, that seems like a lot of work to shave seconds or even minutes off occasional scrubs.
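If you would rather script the band inspection than eyeball "ls -lat", something along these lines would do. The bundle path is a made-up example; point it at your own backup bundle:

```python
# A hypothetical stand-in for the "ls -lat" suggestion above: list the most
# recently modified bands first so that hot, likely-fragmented bands stand out.
from datetime import datetime
from pathlib import Path

bands_dir = Path("/Volumes/Backups/example.sparsebundle/bands")   # example path only

bands = sorted(bands_dir.iterdir(),
               key=lambda p: p.stat().st_mtime, reverse=True)
for band in bands[:20]:
    mtime = datetime.fromtimestamp(band.stat().st_mtime)
    print(f"{mtime:%Y-%m-%d %H:%M}  {band.name}")    # band names are hex indices
```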