I've run into an odd problem when updating my main ZFS pool's offline backup.
The main pool is composed of a pair of LaCie 2big Thunderbolts, each a ZFS mirror vdev in the same pool. The offline backup is a cheap-ish USB 3 enclosure with 4 drive slots, configured as a raidz. The main pool's disks are also faster than those in the offline pool, so the backup process quickly becomes bottlenecked on the backup enclosure.
The problem is that some minutes after the backup program[*] finishes scanning the two pools and starts copying files, system performance tanks badly. I get near-constant beach balls, menu clicks take many seconds to respond, and even typing in Terminal can lag badly. I've taken to running such backups only when I'm away from the computer, and so don't care if the system is nearly unusable. Although interactive performance is bad during this time, the backup does proceed.
Normally my offline backup isn't too far out of date, but this time it's been longer than usual since the last backup, and there's been more churn in the pool in the meantime, so it's looking like it will take a few days to copy everything.
Last night before bed, I restarted the backup, and this morning I woke to find my machine nearly unresponsive, as expected. The thing is, after I told the backup program to quit, it took about a minute to get the backup process fully killed off, and system performance took about that long to return to normal. Yet the lights on the external disk enclosure kept flashing continuously for over half an hour afterward.
What was ZFS doing for all that time?
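One way to see what the pool is actually doing during that window is to watch per-vdev write activity while the lights are still flashing. A sketch of what I could run, assuming the backup pool is named `offline` (a placeholder name, not my real one) and `zpool` is on the PATH:

```shell
#!/bin/sh
# Watch per-vdev write activity on the backup pool, one sample per
# second for five samples. If the enclosure's lights are flashing,
# nonzero write bandwidth should show up here even after the backup
# process itself is dead.
POOL=offline   # placeholder -- substitute the real pool name

if command -v zpool >/dev/null 2>&1; then
    zpool iostat -v "$POOL" 1 5
else
    echo "zpool not found; is O3X installed and on the PATH?"
fi
```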
I just exported and reimported the offline backup pool, then attempted to benchmark it, but got this utterly bogus result:
About two minutes after I stopped the benchmark, the write needle flew off the zero peg into the hundreds of MB/sec, then fell back in stages over many minutes to about 51 MB/sec, and continued to flutter around the dial for minutes more. Clearly the benchmark was badly confused by ZFS. It looks like some kind of horrible latency, as if the offline pool were out in space and the program had to wait out the round-trip radio delay before updating its UI.
If we take 51 MB/sec as the enclosure's actual sustained speed (pitiful for USB 3, but then, USB disk performance has always been a disappointment to me), then over the half hour I watched the enclosure's lights flash, ZFS should have been able to write about 90 gigabytes. My system "only" has 32 GiB of RAM, so ZFS couldn't simply have been flushing buffered data from RAM to disk.
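For the record, the arithmetic behind that 90 GB figure:

```shell
#!/bin/sh
# Back-of-envelope: a sustained 51 MB/sec over the half hour the
# enclosure's lights kept flashing.
RATE_MB_PER_SEC=51
WINDOW_SEC=$((30 * 60))                      # half an hour
TOTAL_MB=$((RATE_MB_PER_SEC * WINDOW_SEC))
echo "${TOTAL_MB} MB"                        # 91800 MB, i.e. ~90 GB
```

Either way, it's roughly three times the machine's total RAM, so a simple cache flush doesn't explain it.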
Thinking that this discrepancy might be telling me that ZFS is pushing the system into swapping, I restarted the backup with Activity Monitor running, and the memory pressure graph remained in the green throughout. The graph shape wasn't completely level; I'd have to describe it as a bland landscape silhouette.
System performance took about 5 minutes to start tanking again in this test, and about 10-15 minutes to become nearly unusable. The progression wasn't even; performance simply dropped sharply at one point. These times count from when the backup program started copying files again, not from when it started scanning both targets to find where to resume. Even after I killed the backup again, the memory pressure graph didn't change much.
On stopping the backup this second time, the lights on the disk enclosure repeated their performance from this morning, flashing continuously for many minutes afterward. I've been composing this message in parallel, so I'm not certain, but I think they stopped sooner this time. Still, I come back to the same question: why is ZFS still transferring data in the background so many minutes after the data source was killed off?
All of this feels like O3X is misinterpreting the latencies involved and getting behind on processing write requests. Is there a ZFS write queue depth I can monitor to test that hypothesis?
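From what I've read, newer OpenZFS builds have a `zpool iostat -q` flag that prints per-queue depths (sync/async, read/write), but I don't know whether my 1.5.2-era O3X supports it. Failing that, O3X exposes its kstats as sysctls under `kstat.zfs`, and fishing in there for dirty-data counters seems worth a try. Both commands below are guesses I haven't verified against 1.5.2:

```shell
#!/bin/sh
# Two stabs at inspecting ZFS's write backlog. Neither is verified
# against O3X 1.5.2: "-q" arrived in newer OpenZFS releases, and the
# exact kstat sysctl names may differ between O3X versions.
POOL=offline   # placeholder pool name

# 1. Per-queue depths, if this build supports the -q flag:
zpool iostat -q "$POOL" 1 5 2>/dev/null || \
    echo "'zpool iostat -q' not supported by this ZFS build"

# 2. Grep O3X's sysctl-exported kstats for dirty-data counters:
sysctl kstat.zfs 2>/dev/null | grep -i dirty || \
    echo "no kstat.zfs sysctls found (not on O3X?)"
```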
O3X doesn't run into problems in another bottlenecked situation here: my online backup, a NAS array accessed over GigE via SMB. I can run that backup continuously in the background while using the system normally.
All of this testing is with O3X 1.5.2 on OS X 10.11.6.
[*] I usually use ChronoSync as my backup program, but for this series of tests I've fallen back to rsync to gain some control over the process. If you're wondering why I don't use ZFS send/receive, it's just a bit too advanced for me at this point. I'll figure that out someday, but for now, rsync and ChronoSync suffice for my purposes.
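One knob I'm experimenting with is rsync's `--bwlimit`, on the theory that throttling the transfer below what the enclosure can sustain keeps ZFS's write backlog from piling up in the first place. A sketch, with placeholder mount points (`/Volumes/main` and `/Volumes/offline` are not my actual paths):

```shell
#!/bin/sh
# Throttled rsync sketch. --bwlimit is in KBytes/sec, so 40000 is
# roughly 40 MB/sec, a bit under the ~51 MB/sec the enclosure seems
# able to sustain, which should keep writes from outrunning the disks.
SRC=/Volumes/main/      # placeholder: the main pool's mount point
DST=/Volumes/offline/   # placeholder: the backup pool's mount point

if [ -d "$SRC" ] && [ -d "$DST" ]; then
    rsync -a --partial --bwlimit=40000 "$SRC" "$DST"
else
    echo "adjust SRC/DST to your pools' mount points"
fi
```

Whether that actually prevents the system-wide stalls, I can't yet say.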