Extreme Performance Issues with v2.1.6 (and v2.2.3)

All your general support questions for OpenZFS on OS X.

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby Arne » Wed Mar 27, 2024 7:54 am

I have a Mac Mini 2009 with 8 GB of RAM, El Capitan 10.11.6 and ZFS 2.2.2.

Starting an app like Firefox or Obsidian from a ZFS dataset takes so long that it is almost impossible; after 30 seconds I stopped waiting.
I remembered that it used to work some time ago (maybe with 1.9.4?).
One reason was the recordsize: it was set to 1M. Setting it to 128K let the apps start, but they took twice as long as when started from HFS.
When I set primarycache to "all", the apps start only a few seconds slower than from HFS, even with recordsize set to 1M.
The problem is that with primarycache set to "all" the apps start faster and even the history display in Firefox is more responsive, but the performance of copying data is bad: almost half of what I get when primarycache is set to "metadata".
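
In case anyone wants to try the same comparison, both of these are per-dataset properties; a minimal sketch, assuming a placeholder dataset name like tank/apps:

Code: Select all
# recordsize and primarycache are set per dataset (tank/apps is a placeholder)
zfs set recordsize=128K tank/apps
zfs set primarycache=all tank/apps
# confirm the values that are actually in effect
zfs get recordsize,primarycache tank/apps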

I searched for the tunable that manages the ARC metadata balance (and found it), but setting it to any value other than 500 (0, -1, 5000) didn't have any effect.

Code: Select all
sysctl -w kstat.zfs.darwin.tunable.zfs_arc.meta_balance=500   # 500 is the default


Along the way I also found the setting for ARC compression.
Changing it showed an effect on performance even when primarycache is set to "all".
But it uses more memory: for me it was more than 6 GB used, with more than 4 GB reserved.

Code: Select all
sysctl -w kstat.zfs.darwin.tunable.zfs.compressed_arc_enabled=0


Fiddling with all of these and other settings, the performance varied, sometimes a lot, even with the same settings (I don't know why); setting compressed_arc_enabled to 0 had the best overall effect on performance.
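
To check that a tunable change actually took hold, the current values can be read back by name, for example:

Code: Select all
# print the current values of the tunables discussed above
sysctl kstat.zfs.darwin.tunable.zfs_arc.meta_balance
sysctl kstat.zfs.darwin.tunable.zfs.compressed_arc_enabled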

Please try it yourself.
If it works for you too, setting this tunable might at least be a good compromise for getting better performance while using primarycache, until the ARC problem is solved.
My system: Mac Mini 2009 (early) with El Capitan 10.11.6
Arne
 
Posts: 30
Joined: Mon Oct 29, 2018 7:59 am

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby armdn » Thu Mar 28, 2024 10:55 am

Arne wrote:Fiddling with all of these and other settings, the performance varied, sometimes a lot, even with the same settings (I don't know why); setting compressed_arc_enabled to 0 had the best overall effect on performance.

Please try it yourself.
If it works for you too, setting this tunable might at least be a good compromise for getting better performance while using primarycache, until the ARC problem is solved.


Nice find. I won’t be able to check anymore, since I migrated all my pools back to 2.1.0. But at least now there is some understanding of the problem and some temporary solutions.
armdn
 
Posts: 16
Joined: Mon Mar 24, 2014 9:05 am

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby Haravikk » Thu Mar 28, 2024 11:02 am

armdn wrote:Nice find. I won’t be able to check anymore, since I migrated all my pools back to 2.1.0. But at least now there is some understanding of the problem and some temporary solutions.

With your pools downgraded to v2.1.0 compatibility you can still install a newer version for testing; just make sure you don't run zpool upgrade, to avoid enabling features you can't downgrade from. If you see no improvement you can simply reinstall v2.1.0 (I usually need to restart when I do this).
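
To double-check that a pool hasn't picked up any newer feature flags, something like this should do (a sketch; "tank" is a placeholder pool name):

Code: Select all
# list pools whose on-disk format could be upgraded -- do NOT actually run "zpool upgrade <pool>"
zpool upgrade
# show the state (disabled/enabled/active) of every feature flag on the pool
zpool get all tank | grep feature@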

I'm planning to do this at the weekend (probably Sunday) to see if I can confirm this finding on my own setup, so feel free to wait if you'd prefer not to do it right now; I'll report back on whether this change works on my worst-affected system.
Last edited by Haravikk on Fri Mar 29, 2024 3:52 am, edited 1 time in total.
Haravikk
 
Posts: 82
Joined: Tue Mar 17, 2015 4:52 am

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby lundman » Thu Mar 28, 2024 5:30 pm

This could be interesting?
https://github.com/openzfs/zfs/pull/16040
lundman
 
Posts: 1337
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby Haravikk » Fri Mar 29, 2024 2:11 am

lundman wrote:This could be interesting?
https://github.com/openzfs/zfs/pull/16040

Sounds beneficial, but also L2ARC specific?

While I never tried fully removing my L2ARC device, I still had poor performance with primarycache=all and secondarycache=none, and armdn also discounted L2ARC as the cause in their case. I'm going to try disabling ARC compression tomorrow or Sunday to see if that fixes performance for me; if not, I'll try detaching my L2ARC device to see what difference that makes.
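
If detaching does become necessary, cache devices can be removed without data loss; a rough sketch with placeholder pool and device names:

Code: Select all
# find the device listed under the "cache" heading
zpool status zdata
# remove the cache (L2ARC) device; this is non-destructive and it can be re-added later
zpool remove zdata disk3s2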
Haravikk
 
Posts: 82
Joined: Tue Mar 17, 2015 4:52 am

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby Haravikk » Sun Mar 31, 2024 5:30 am

So I ran my intended tests and updated the GitHub issue, but I'll paste my comment here for reference:

Unfortunately, while disabling compressed ARC did improve performance overall, it didn't solve this problem. The system was much more responsive with fewer entries in ARC and relatively low write activity, but as soon as I started writing large quantities of data and ARC reached around 4 GB, performance took a nosedive as usual.

For anyone else looking to test with compressed ARC disabled, make sure to disable it before importing your pool(s), or export and then re-import them afterwards; in my case the difference was only noticeable after primary ARC was emptied.
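
In other words, the order of operations matters; a sketch with a placeholder pool name:

Code: Select all
sysctl -w kstat.zfs.darwin.tunable.zfs.compressed_arc_enabled=0
# export and re-import so the ARC is dropped and refilled uncompressed
zpool export tank
zpool import tank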

I also tested removing my L2ARC device from my main working pool, but this made no discernible difference. The only thing that seems to help is bypassing primary ARC using primarycache=none on all active datasets, but this is of course not acceptable for a working pool, as the lack of caching also hurts performance.

The issue is definitely proportional to the amount of primary ARC being utilised: when primary ARC was under 4 GB my system was still generally responsive, but as this amount climbed it became more and more unusable, and at around the 10 GB mark windowserver locked up and presumably crashed, causing all of my user accounts to be logged out. I'm not convinced that limiting the ARC maximum is a viable option, as even when my ARC usage was around 1 GB I was seeing some misbehaving processes (opendirectoryd was the main one, despite having no Active Directory users).

Before windowserver crashed I was able to capture another spindump while copying (via rsync) between two datasets; I thought it might be useful to have a spindump with ARC compression disabled for comparison.
spindump.v2.2.3rc4.compresse_arc_disabled.zip
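
(For anyone who wants to capture their own for comparison, something like the following should work; the exact options vary between macOS versions, so check man spindump first:)

Code: Select all
# sample all processes and write the report to a file
sudo spindump -o spindump.txt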

My conclusions from all of this are:

  • The problem is at its most severe with larger amounts of write activity passing through primary ARC.
  • While it's possible read activity is also affected, the issue is far less severe.
  • Disabling compressed ARC improved performance but didn't fix the problem. My theory is that decompressing compressed ARC entries somehow has a multiplicative effect on the performance problem (slow operations became even slower).
  • L2ARC appears to have no particular effect, insofar as it relies on data stored in primary ARC; removing the L2ARC device(s) made no difference.
  • The problem gets worse the larger the primary ARC gets; it's hard to tell whether this is because of increased cache hits or because the issue relates to ARC structures that have grown larger. By the time I get to this point it becomes very difficult to run useful tests without causing windowserver to crash.


Basically the only thing that seems to work is bypassing ARC with primarycache=none, but this just isn't viable for working datasets.
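
For reference, the workaround amounts to this; setting it on the pool's top-level dataset lets the children inherit it, and zfs inherit reverts it later ("tank" is a placeholder):

Code: Select all
# bypass primary ARC for the whole pool (child datasets inherit the setting)
zfs set primarycache=none tank
# restore the default behaviour once done testing
zfs inherit primarycache tank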

Update: I've also cross-posted the issue to the main OpenZFS GitHub; I'm hoping to get more eyes on the problem from anyone familiar with the ARC internals, and to find out whether this is an issue that also affects Linux (just not as severely).
Update 2: Or not; apparently I'm fighting rincebrain over whether it's actually a bug, but if something broke upstream after v2.1.0 I don't see what else it can be.

lundman, have you had any luck reproducing this or finding what the cause might be? Is there anything else we can do to help? Have you had a chance to look at the latest spindump? It seems to have captured a lot of ZFS-related calls, but I'm still not really sure what I'm looking for, and I'm not sure what else to try at this point. I don't want to be stuck on v2.1.0 forever, as I really need to upgrade to a newer version of macOS.

Update 3: Going through the OpenZFS changelogs from v2.1.1 to v2.1.6 I've identified two possible new areas to look into:

  • There were some changes relating to prefetching into ARC; it might be worth trying with prefetching disabled (kstat.zfs.darwin.tunable.zfs_prefetch_disable=1)?
  • According to issue #11997, xattr=sa wasn't implemented for FreeBSD until v2.1.1. Is it possible to confirm whether xattr=sa is working in v2.1.0? That doesn't explain why it should be slower if it's the culprit, or why downgrading a version is seemingly fine (or why xattrs work without it?); see the sketch below for checking the property.
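
A quick way to check the second point on a running system (a sketch; the dataset and file paths are placeholders):

Code: Select all
# show how extended attributes are being stored for the dataset (on, sa, or off)
zfs get xattr tank/home
# write an xattr to a test file and read it back with the macOS xattr tool
touch /Volumes/tank/home/xattr-test
xattr -w com.example.test hello /Volumes/tank/home/xattr-test
xattr -l /Volumes/tank/home/xattr-test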
Haravikk
 
Posts: 82
Joined: Tue Mar 17, 2015 4:52 am

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby Haravikk » Sat Apr 06, 2024 5:36 am

So I tried some more things today:

  • Tried disabling prefetch with sysctl kstat.zfs.darwin.tunable.zfs_prefetch.disable=1. Like disabling ARC compression this did seem to improve system performance, but it didn't solve the problem as such.
  • Running with both prefetch and ARC compression disabled gave even better results, but again did not resolve the core problem, which still became worse over time.

So again, neither of these is the cause; they just seem to relieve some of the performance impact. Since this is yet another couple of tests, I've gathered (and am attaching) another pair of spindumps: the first with prefetch disabled, the second with both prefetch and ARC compression disabled (after flushing ARC). In both cases I was running an rsync copy from a zvol into a dataset (one of the many reasons I want to upgrade is that I'm using zvols to avoid xattr issues that have since been fixed).

You can find the stack traces for the receiving side of these rsync processes by searching for "rsync [7820]" (spindump.txt) and "rsync [14497]" (spindump-2.txt).
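
If it saves anyone some scrolling, the relevant sections can be pulled out with grep (the PIDs are the ones quoted above):

Code: Select all
# print the rsync entries plus some following context from each report
grep -n -F -A 60 'rsync [7820]' spindump.txt
grep -n -F -A 60 'rsync [14497]' spindump-2.txt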

Hopefully these might give some new clues as to where the problem lies, but I really feel like I'm running out of things I can test. Lundman, can you please take a look at least at these latest two spindumps and see if there's anything that stands out as problematic, or give us something else to test? There's definitely some kind of major regression since v2.1.0, and we desperately need to find a fix for it.

There's an OpenZFS GitHub issue regarding prefetch that seems to suggest a prefetching problem emerged in all ZFS versions from v2.1.4 onwards (#15214), but since disabling prefetch doesn't fully resolve the issue on macOS I can't be sure it's actually related.
Attachments
spindumps-v2.2.3rc4-2.zip
(3.37 MiB) Downloaded 35 times
Haravikk
 
Posts: 82
Joined: Tue Mar 17, 2015 4:52 am

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby jawbroken » Mon Apr 15, 2024 6:33 am

It would probably be useful if the two or three people in this thread experiencing this issue posted more details about their hardware, OS, and ZFS configurations, to try to work out what you have in common. It seems like you might be using fairly old hardware, which presumably implies older operating systems too? Are you using encryption? L2ARC? etc.

If it helps, I don't have this problem on a 2022 Mac Studio, running macOS Sonoma 14.4.1. I have OpenZFS 2.1.6-1 installed. I don't have dedup or encryption enabled, compression is on, I've disabled atime, and I don't have L2ARC.

Edit: I should also probably say that I'm not using mimic HFS or APFS, and I'm not booting from a ZFS volume, etc.
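
If it helps with gathering that information, something like the following covers most of it (a sketch; "tank" is a placeholder pool name):

Code: Select all
sw_vers                       # macOS version
sysctl hw.model               # Mac model identifier
zpool version                 # installed OpenZFS version
zpool status                  # pool layout, including any cache (L2ARC) devices
zfs get -r compression,encryption,atime,primarycache,secondarycache,recordsize tank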
jawbroken
 
Posts: 64
Joined: Wed Apr 01, 2015 4:46 am

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby Haravikk » Mon Apr 15, 2024 7:06 am

jawbroken wrote:Would probably be useful if the 2 or 3 people in this thread experiencing this issue posted more details about their hardware, OS, and zfs configurations to try to work out what you have in common. It seems like you might be using fairly old hardware, which presumably implies older operating systems also? Are you using encryption? L2ARC? etc.

My main system is a 2018 Mac Mini, so older but not that old, and it should still very much be compatible. I do have other, older systems (a 2009 Mac Mini and a 2010 iMac) running ZFS, but they're not affected (then again, they're only running ZFS for a single zvol to back up into via Time Machine, so they are much simpler setups).

My main machine is a hexa-core i7 2018 Mac Mini with 64 GB of RAM and a 256 GB internal SSD.

My ZFS setup consists of two pools: zdata, which hosts my standard (non-administrator) user accounts and consists of two mirrored pairs (four disks total); the second, zbackup, is a mirrored pair of larger (slower) disks for sending backups to. All datasets are AES-GCM encrypted with ZSTD compression enabled, but I've verified that neither of these features is the problem: duplicating one of my user accounts with compression and encryption disabled, then restarting with none of the compressed/encrypted datasets mounted, still triggers this issue (as activity on the dataset(s) grows, the problem just gets worse and worse).

Since I'm hosting user accounts I need to have mimic HFS enabled, and I have atime enabled with relatime. I do use a chunk of the internal SSD as an L2ARC for zdata, but disabling that makes no difference. As noted earlier in the thread, the only thing that seems to eliminate the problem is bypassing the primary ARC (setting all datasets to primarycache=none and secondarycache=none), which makes an immediate difference.
Haravikk
 
Posts: 82
Joined: Tue Mar 17, 2015 4:52 am

Re: Extreme Performance Issues with v2.1.6 (and v2.2.3)

Postby jawbroken » Thu Apr 18, 2024 5:30 am

The person who opened this issue, which you already commented on, seems to have eventually tracked it down to their enclosure. So I should probably also say that my ZFS drives are in a few of these Thunderbolt enclosures.

Edit: And regarding your earlier posts about the Blackmagic Disk Speed Test: it reports that it can read and write at about 1 GB/s to my 9-disk ZFS pool, with no notable impact on system responsiveness.
jawbroken
 
Posts: 64
Joined: Wed Apr 01, 2015 4:46 am
