The problem here is that USB is slowwwwwwww. Let this script run for a bit; it times the function boundaries of zio_write_phys.
This is during heavy traffic mainly to an SSD-based pool, with some light (~1 MB/s) background I/O on a 3-way mirror of FW800 disks. Your numbers will certainly differ for your "tall" pool and the like, but any distribution skewed towards longer times is bad. Large numbers of phys writes lasting longer than 5 seconds are likely causing throttling, and anything approaching 30 seconds certainly is.
Dirtied blocks are locked in memory for at least this amount of time.
Let it run for a minute or two, then interrupt it for the graph.
- Code: Select all
$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry { self->s = timestamp; }
arc_buf_add_ref:return /self->s/ {
@["ns"] = quantize(timestamp - self->s); self->s = 0; }'
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry ' matched 2 probes
^C
ns
value ------------- Distribution ------------- count
2097152 | 0
4194304 |@@ 1
8388608 |@@@@@@@@@@@ 6
16777216 |@@@@@@@@@ 5
33554432 |@@@@@ 3
67108864 | 0
134217728 | 0
268435456 |@@ 1
536870912 | 0
1073741824 | 0
2147483648 |@@ 1
4294967296 |@@@@@@@@@ 5
8589934592 | 0
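If you just want a headline number rather than the graph, the same one-liner can be narrowed with a predicate. Something along these lines (a sketch built from the same probe pairing as above; 5000000000 ns is the 5-second mark mentioned earlier) simply counts the writes that cross it:
- Code: Select all
$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_write_phys:entry { self->s = timestamp; }
arc_buf_add_ref:return /self->s && (timestamp - self->s) > 5000000000/ {
@["writes over 5 s"] = count(); }
arc_buf_add_ref:return /self->s/ { self->s = 0; }'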
You could also trace zio_read_phys using the same one-liner (just change "write" to "read"). A skew towards slow reads indicates bad performance, which you'll also "feel". Reads are less critical, though, since fewer resources are locked down for them. Moreover, they generally pile up waiting for writes to complete, although the scheduling algorithm will opportunistically do a read that is sufficiently near a write (approximately the same track).
- Code: Select all
$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_read_phys:entry { self->s = timestamp; }
arc_buf_add_ref:return /self->s/ {
@["ns"] = quantize(timestamp - self->s); self->s = 0; }'
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:zio_read_phys:entry ' matched 2 probes
^C
ns
value ------------- Distribution ------------- count
32768 | 0
65536 | 1
131072 |@@@ 462
262144 |@@@@ 483
524288 |@@@@@@@@@@@@@@@@@@@@@ 2769
1048576 |@@@@@@ 784
2097152 |@@ 259
4194304 |@ 186
8388608 |@ 128
16777216 | 34
33554432 | 7
67108864 | 26
134217728 |@ 79
268435456 | 7
536870912 | 36
1073741824 | 28
2147483648 | 34
4294967296 | 26
8589934592 | 13
17179869184 | 10
34359738368 | 3
68719476736 | 0
You can see who is doing the writing:
- Code: Select all
$ sudo dtrace -n 'syscall::write:entry /fds[arg0].fi_fs == "zfs"/ { @[execname] = count(); }'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C
SafariForWebKit 1
Terminal 1
BetterTouchTool 2
XXX 3
dd 128
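If you care about how much each one is writing rather than how often, swap count() for sum(arg2) (a sketch along the same lines; arg2 of write is the requested size in bytes, so this totals bytes asked for, not bytes actually written):
- Code: Select all
$ sudo dtrace -n 'syscall::write:entry /fds[arg0].fi_fs == "zfs"/ { @[execname] = sum(arg2); }'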
It's also probably useful to figure out what's hitting your ARC the most in general:
- Code: Select all
$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:arc_buf_add_ref:entry { @[execname] = count() }'
dtrace: description 'fbt:com.getgreenbytes.filesystem.zfs:arc_buf_add_ref:entry ' matched 1 probe
^C
Dropbox 2
storeagent 3
ubd 3
lssave 4
BetterTouchTool 5
EyeTV Helper 10
squid 11
SystemUIServer 13
com.apple.iClou 16
XXXX 18
log_file_daemon 18
XXXX 18
Finder 21
getopt 29
taskgated 34
Skype 40
Terminal 40
WebProcess 47
fseventsd 87
dbfseventsd 100
pbs 102
cfprefsd 130
CalendarAgent 176
mds 199
mdworker 537
XXX 605
DragThing 790
kernel_task 1137
SafariForWebKit 2017
zfs 28546
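If you'd rather see where in the kernel those ARC references come from than which process triggered them, aggregate on kernel stacks instead of execname (same probe, just a different aggregation key; another sketch):
- Code: Select all
$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:arc_buf_add_ref:entry { @[stack()] = count(); }'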
And finally, while not really diagnostic, this traces reads and writes to zfs-backed files; after interruption it will show a summary of the top ten for the duration of the trace.
- Code: Select all
$ sudo dtrace -qn 'syscall::read:entry,syscall::write:entry /fds[arg0].fi_fs == "zfs"/ { @[probefunc, fds[arg0].fi_pathname] = count(); printf("%Y %s %s\n", walltimestamp, stringof(probefunc), stringof(fds[arg0].fi_pathname)); } END { trunc(@,10); }'
Naturally you can change "zfs" to "hfs" etc. in that one-liner.
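For example, for the HFS side of things it becomes:
- Code: Select all
$ sudo dtrace -qn 'syscall::read:entry,syscall::write:entry /fds[arg0].fi_fs == "hfs"/ { @[probefunc, fds[arg0].fi_pathname] = count(); printf("%Y %s %s\n", walltimestamp, stringof(probefunc), stringof(fds[arg0].fi_pathname)); } END { trunc(@,10); }'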
Again, though, *THE* key thing is the amount of time it takes to complete writes out to the devices. If that's slow, you will be unhappy.
It's going to be slow for any nontrivial amount of writing to a USB2 device, even a device that is remarkably, atypically, unusually, and astonishingly well behaved, which few USB2 mass storage devices really are.
Edit: zio_flush is also interesting to quantize; that is a write issued with DKIOCFLUSHWRITECACHE, which happens for synchronous writes and, at the very least, for the labels at each TXG. This will trace only writes, naturally. Shorter times and short-skewed distributions are better; the timing of this is critical in gating the overall IOPS of any pool. Note that this traces all pools with traffic; fiddling around with a walk up the structures to the spa_name so that one can trace only one pool is an exercise for the reader.

- Code: Select all
$ sudo dtrace -n 'fbt:com.getgreenbytes.filesystem.zfs:zio_flush:entry { self->s = timestamp; }
arc_buf_add_ref:return /self->s/ {
@["ns"] = quantize(timestamp - self->s); self->s = 0; }'
value ------------- Distribution ------------- count
524288 | 0
1048576 |@@@@ 1
2097152 |@@@@ 1
4194304 |@@@@ 1
8388608 |@@@@@@@@@@@@@@@ 4
16777216 |@@@@ 1
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 |@@@@ 1
2147483648 | 0
4294967296 |@@@@@@@ 2
8589934592 | 0
...
274877906944 | 1
Note the extreme outlier: that's a flush to a pool made up of a set of external drives that go into idle power saving mode. ZEVO delays waking up external drives.