I've run into an odd problem when updating my main ZFS pool's offline backup.
The main pool is composed of a pair of LaCie 2big Thunderbolts, each a ZFS mirror vdev in the same pool. The offline backup is a cheap-ish USB 3 enclosure with 4 drive slots, configured as a raidz. The main pool's disks are also faster than those in the offline pool, so the backup process quickly becomes bottlenecked on the backup enclosure.
The problem is that some minutes after the backup program[*] finishes scanning the two pools and starts copying files, system performance tanks badly. I get near-constant beach balls, menu clicks take many seconds to respond, and even typing in Terminal can lag badly. I've taken to running such backups only when I'm away from the computer, and so don't care if the system is nearly unusable. Although interactive performance is bad during this time, the backup does proceed.
Normally my offline backup isn't too far out of date, but this time it's been longer than usual since the last backup, and there's been more churn in the pool in the meantime, so it's looking like it will take a few days to copy everything.
Last night before bed, I restarted the backup, and this morning I woke to find my machine nearly unresponsive, as expected. The thing is, after I told the backup program to quit, it took about a minute to get the backup process fully killed off, and system performance took about that long to return to normal. Yet the lights on the external disk enclosure kept flashing continuously for over half an hour afterward.
What was ZFS doing for all that time?
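One way to see what the pool is actually doing during that window is to watch per-vdev write activity while the lights are still flashing. A sketch of what I could run, assuming the backup pool is named `offline` (a placeholder name, not my real one) and `zpool` is on the PATH:

```shell
#!/bin/sh
# Watch per-vdev write activity on the backup pool, one sample per
# second for five samples. If the enclosure's lights are flashing,
# nonzero write bandwidth should show up here even after the backup
# process itself is dead.
POOL=offline   # placeholder -- substitute the real pool name

if command -v zpool >/dev/null 2>&1; then
    zpool iostat -v "$POOL" 1 5
else
    echo "zpool not found; is O3X installed and on the PATH?"
fi
```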
I just exported and reimported the offline backup pool, then attempted to benchmark it, but got this utterly bogus result:
About two minutes after I stopped the benchmark, the write needle flew off the zero peg into the hundreds of MB/sec, then fell back in stages over many minutes to about 51 MB/sec, and continued to flutter around the dial for minutes more. Clearly the benchmark was badly confused by ZFS. It looks like some kind of horrible latency, as if the offline pool were out in space and the program had to wait out the round-trip radio delay before updating its UI.
If we take 51 MB/sec as the enclosure's actual sustained speed (pitiful for USB 3, but then, USB disk performance has always been a disappointment to me), then over the half hour I watched the enclosure's lights flash, ZFS should have been able to write about 90 gigabytes. My system "only" has 32 GiB of RAM, so ZFS couldn't simply have been flushing buffered data from RAM to disk.
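For the record, the arithmetic behind that 90 GB figure:

```shell
#!/bin/sh
# Back-of-envelope: a sustained 51 MB/sec over the half hour the
# enclosure's lights kept flashing.
RATE_MB_PER_SEC=51
WINDOW_SEC=$((30 * 60))                      # half an hour
TOTAL_MB=$((RATE_MB_PER_SEC * WINDOW_SEC))
echo "${TOTAL_MB} MB"                        # 91800 MB, i.e. ~90 GB
```

Either way, it's roughly three times the machine's total RAM, so a simple cache flush doesn't explain it.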
Thinking that this discrepancy might be telling me that ZFS is pushing the system into swapping, I restarted the backup with Activity Monitor running, and the memory pressure graph remained in the green throughout. The graph shape wasn't completely level; I'd have to describe it as a bland landscape silhouette.
System performance took about 5 minutes to start tanking again in this test, and about 10-15 minutes to become nearly unusable. The progression wasn't even; performance simply dropped sharply at one point. These times count from when the backup program started copying files again, not from when it started scanning both targets to find where to resume. Even after I killed the backup again, the memory pressure graph didn't change much.
On stopping the backup this second time, the lights on the disk enclosure repeated their performance from this morning, flashing continuously for many minutes afterward. I've been composing this message in parallel, so I'm not certain, but I think they stopped sooner this time. Still, I come back to the same question: why is ZFS still transferring data in the background so many minutes after the data source was killed off?
All of this feels like O3X is misinterpreting the latencies involved and getting behind on processing write requests. Is there a ZFS write queue depth I can monitor to test that hypothesis?
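From what I've read, newer OpenZFS builds have a `zpool iostat -q` flag that prints per-queue depths (sync/async, read/write), but I don't know whether my 1.5.2-era O3X supports it. Failing that, O3X exposes its kstats as sysctls under `kstat.zfs`, and fishing in there for dirty-data counters seems worth a try. Both commands below are guesses I haven't verified against 1.5.2:

```shell
#!/bin/sh
# Two stabs at inspecting ZFS's write backlog. Neither is verified
# against O3X 1.5.2: "-q" arrived in newer OpenZFS releases, and the
# exact kstat sysctl names may differ between O3X versions.
POOL=offline   # placeholder pool name

# 1. Per-queue depths, if this build supports the -q flag:
zpool iostat -q "$POOL" 1 5 2>/dev/null || \
    echo "'zpool iostat -q' not supported by this ZFS build"

# 2. Grep O3X's sysctl-exported kstats for dirty-data counters:
sysctl kstat.zfs 2>/dev/null | grep -i dirty || \
    echo "no kstat.zfs sysctls found (not on O3X?)"
```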
O3X doesn't run into problems in another bottlenecked situation here: my online backup, a NAS array accessed over GigE via SMB. I can run that backup continuously in the background while using the system normally.
All of this testing is with O3X 1.5.2 on OS X 10.11.6.
[*] I usually use ChronoSync as my backup program, but for this series of tests I've fallen back to rsync to gain some control over the process. If you're wondering why I don't use ZFS send/receive, it's just a bit too advanced for me at this point. I'll figure that out someday, but for now, rsync and ChronoSync suffice for my purposes.
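One knob I'm experimenting with is rsync's `--bwlimit`, on the theory that throttling the transfer below what the enclosure can sustain keeps ZFS's write backlog from piling up in the first place. A sketch, with placeholder mount points (`/Volumes/main` and `/Volumes/offline` are not my actual paths):

```shell
#!/bin/sh
# Throttled rsync sketch. --bwlimit is in KBytes/sec, so 40000 is
# roughly 40 MB/sec, a bit under the ~51 MB/sec the enclosure seems
# able to sustain, which should keep writes from outrunning the disks.
SRC=/Volumes/main/      # placeholder: the main pool's mount point
DST=/Volumes/offline/   # placeholder: the backup pool's mount point

if [ -d "$SRC" ] && [ -d "$DST" ]; then
    rsync -a --partial --bwlimit=40000 "$SRC" "$DST"
else
    echo "adjust SRC/DST to your pools' mount points"
fi
```

Whether that actually prevents the system-wide stalls, I can't yet say.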