Extreme Performance Issues with v2.1.6 (and v2.2.3)


Postby Haravikk » Sun Dec 18, 2022 4:22 am

I tried to upgrade to v2.1.6 today, but I had to roll back to v2.1.0 almost immediately, as my system was barely usable with v2.1.6 installed.

I couldn't find any obvious cause other than persistently high kernel_task CPU usage averaging 30-140%, with various apps showing similarly high CPU usage and no obvious pattern. In my case the main culprits were securityd, corespotlightd, Plex Media Scanner and Mail.app, but I don't believe any of these were actually to blame: killing them would ease the problem only temporarily before it returned. The only thing that "solved" the problem was exporting all pools.
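By "exporting all pools" I mean literally just:

Code: Select all
# export every imported pool; performance went back to normal right after
sudo zpool export -a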

I often see high-ish kernel_task CPU usage on v2.1.0 as well, but that's usually under heavier load, tops out around 30%, and tends to ease off as the ARC fills up. Under v2.1.6 I was seeing that kind of CPU usage under extremely light load, even after letting the ARC fill up (to around 11.5GB). That included just trying to listen to some music, which was impossible, as coreaudiod (handling my external audio driver) kept locking up. I see no reason why ZFS's CPU usage should be so absurdly high; indeed, one of my main reasons for wanting to upgrade was the encryption performance improvements, which are supposed to eliminate or at least reduce the high CPU usage (and seem to work as expected on Linux).
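For reference, this is how I've been watching the ARC (kstat names are from my v2.1.0 install; I'm assuming they're the same across versions):

Code: Select all
# current ARC size and its target maximum, via the kstats exposed through sysctl
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max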

My setup includes a large number of encrypted datasets across two pools (one is main storage, the other purely for backups), and I run all user accounts from datasets on main storage, except for my administrator account (which is on the internal APFS startup disk). I'm running macOS 10.15.7, though I'm planning to upgrade macOS soon. Also worth noting: I did not run zpool upgrade on either of my pools, in case that meant I couldn't downgrade.


Sorry I can't provide more in the way of useful information, but when I say my system was barely usable I mean it: Activity Monitor barely ran, I could only get CPU usage values by running top in Terminal, and even that was unbearably slow.
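For anyone wanting to reproduce, plain top sorted by CPU is enough to see it, e.g.:

Code: Select all
# sort by CPU usage, sampling every 5 seconds
top -o cpu -s 5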

Here's a representative sample of the settings on one of my datasets:

Code: Select all
NAME            PROPERTY                VALUE                   SOURCE
zdata/haravikk  type                    filesystem              -
zdata/haravikk  creation                Mon Jan 17 13:07 2022   -
zdata/haravikk  used                    38.0G                   -
zdata/haravikk  available               182G                    -
zdata/haravikk  referenced              37.8G                   -
zdata/haravikk  compressratio           1.05x                   -
zdata/haravikk  mounted                 no                      -
zdata/haravikk  quota                   none                    default
zdata/haravikk  reservation             none                    default
zdata/haravikk  recordsize              128K                    default
zdata/haravikk  mountpoint              /Users/haravikk         local
zdata/haravikk  sharenfs                off                     default
zdata/haravikk  checksum                on                      default
zdata/haravikk  compression             zstd                    inherited from zdata
zdata/haravikk  atime                   on                      default
zdata/haravikk  devices                 on                      default
zdata/haravikk  exec                    on                      default
zdata/haravikk  setuid                  on                      default
zdata/haravikk  readonly                off                     local
zdata/haravikk  zoned                   off                     default
zdata/haravikk  snapdir                 hidden                  default
zdata/haravikk  aclmode                 discard                 default
zdata/haravikk  aclinherit              restricted              default
zdata/haravikk  createtxg               242                     -
zdata/haravikk  canmount                on                      local
zdata/haravikk  xattr                   sa                      inherited from zdata
zdata/haravikk  copies                  2                       inherited from zdata
zdata/haravikk  version                 5                       -
zdata/haravikk  utf8only                on                      -
zdata/haravikk  normalization           formD                   -
zdata/haravikk  casesensitivity         insensitive             -
zdata/haravikk  vscan                   off                     default
zdata/haravikk  nbmand                  off                     default
zdata/haravikk  sharesmb                off                     default
zdata/haravikk  refquota                none                    default
zdata/haravikk  refreservation          none                    default
zdata/haravikk  guid                    13847935460132862844    -
zdata/haravikk  primarycache            all                     default
zdata/haravikk  secondarycache          all                     default
zdata/haravikk  usedbysnapshots         0B                      -
zdata/haravikk  usedbydataset           37.8G                   -
zdata/haravikk  usedbychildren          205M                    -
zdata/haravikk  usedbyrefreservation    0B                      -
zdata/haravikk  logbias                 latency                 default
zdata/haravikk  objsetid                271                     -
zdata/haravikk  dedup                   off                     default
zdata/haravikk  mlslabel                none                    default
zdata/haravikk  sync                    standard                default
zdata/haravikk  dnodesize               auto                    inherited from zdata
zdata/haravikk  refcompressratio        1.05x                   -
zdata/haravikk  written                 37.8G                   -
zdata/haravikk  logicalused             19.6G                   -
zdata/haravikk  logicalreferenced       19.4G                   -
zdata/haravikk  volmode                 default                 default
zdata/haravikk  filesystem_limit        none                    default
zdata/haravikk  snapshot_limit          none                    default
zdata/haravikk  filesystem_count        none                    default
zdata/haravikk  snapshot_count          none                    default
zdata/haravikk  snapdev                 hidden                  default
zdata/haravikk  acltype                 nfsv4                   default
zdata/haravikk  context                 none                    default
zdata/haravikk  fscontext               none                    default
zdata/haravikk  defcontext              none                    default
zdata/haravikk  rootcontext             none                    default
zdata/haravikk  relatime                on                      inherited from zdata
zdata/haravikk  redundant_metadata      all                     default
zdata/haravikk  overlay                 on                      default
zdata/haravikk  encryption              aes-256-gcm             -
zdata/haravikk  keylocation             prompt                  local
zdata/haravikk  keyformat               passphrase              -
zdata/haravikk  pbkdf2iters             350000                  -
zdata/haravikk  encryptionroot          zdata/haravikk          -
zdata/haravikk  keystatus               unavailable             -
zdata/haravikk  special_small_blocks    0                       default
zdata/haravikk  com.apple.browse        on                      inherited from zdata
zdata/haravikk  com.apple.ignoreowner   off                     inherited from zdata
zdata/haravikk  com.apple.mimic         hfs                     inherited from zdata
zdata/haravikk  com.apple.devdisk       on                      inherited from zdata


The pool has no physical redundancy (terrible, I know, but it's a stop-gap until I get a new enclosure and can add more disks), which is why most of the datasets have copies=2, so self-healing is still possible (the media datasets don't, as it would require too much storage). My backup pool, however, does have physical redundancy, and is just used as a target for raw zfs send/receive, discarding the extra copies. I don't usually notice any increase in CPU usage during sends to zbackup, presumably since the stream is raw (so nothing should be getting decrypted/encrypted); it never goes above what I'd see for a similar amount of reading from zdata alone.
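For what it's worth, the backups are raw replication streams along these lines (the snapshot name here is just illustrative):

Code: Select all
# -w sends blocks raw, i.e. still encrypted, so zbackup never needs the keys loaded
sudo zfs send -w zdata/haravikk@backup | sudo zfs receive zbackup/haravikk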

Re: Extreme Performance Issues with v2.1.6

Postby lundman » Tue Dec 20, 2022 4:19 pm

During a heavy-load/slow period you can run "sudo spindump" and share the result with us. It will show which processes are in ZFS, where, and how much time is spent there.
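Something like this, if I remember the flags right (the report lands in /tmp/spindump.txt by default):

Code: Select all
# sample every process for 30 seconds, writing the report somewhere easy to find
sudo spindump 30 -file ~/Desktop/spindump.txt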

Re: Extreme Performance Issues with v2.1.6

Postby Sharko » Tue Dec 20, 2022 10:13 pm

10.15 is Catalina, correct? I wonder if the Catalina build of v2.1.6 is somehow missing the AES-NI hardware capability. I had this problem a few weeks ago with the Monterey build, and Mr. Lundman was able to tweak the build (with tunables?) to bring back hardware encryption. You might want to review this thread:

https://openzfsonosx.org/forum/viewtopic.php?f=24&t=3734

Re: Extreme Performance Issues with v2.1.6

Postby Haravikk » Wed Dec 21, 2022 7:30 am

Sharko wrote:10.15 is Catalina, correct? I wonder if the Catalina build of v2.1.6 is somehow missing the AES-NI hardware capability. I had this problem a few weeks ago with the Monterey build, and Mr. Lundman was able to tweak the build (with tunables?) to bring back hardware encryption. You might want to review this thread:

https://openzfsonosx.org/forum/viewtopic.php?f=24&t=3734

Under v2.1.0 I see the following from sysctl -a:

Code: Select all
kstat.zfs.darwin.tunable.icp_aes_impl: cycle [fastest] generic x86_64


Would this not imply I'm already limited to software encryption/decryption under v2.1.0? The performance I was seeing in v2.1.6 was substantially worse, but I didn't grab any of the sysctl values while I was trying it. I'm not sure when I'll get a chance to try it again, as it requires multiple restarts to install, and as I say the system was practically unusable; a spindump could take hours to complete with the performance I was seeing.
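At minimum I'm planning to grab something like the following next time, before and after the slowdown starts:

Code: Select all
# dump every ICP implementation tunable (AES, GCM etc.) in one go
sysctl -a | grep icp_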

Re: Extreme Performance Issues with v2.1.6

Postby lundman » Mon Dec 26, 2022 4:34 pm

kstat.zfs.darwin.tunable.icp_aes_impl: cycle [fastest] generic x86_64

Odd, my VM has:

kstat.zfs.darwin.tunable.icp_aes_impl: cycle [fastest] generic x86_64 aesni

It'd be interesting to see what CPUID it pulls out of your machine:

"log show --source --predicate 'sender == "zfs"' --style compact --last 5m"
(if you booted in the last 5 mins)

<zfs`spl_cpuid_features (spl-processor.c:109)> SPL: CPUID 0x1f8bfbfffffa320b and leaf7 0xbc000400009c27ab


_AES_BIT (1U << 25) = 0x0000000002000000
CPUID features      = 0x1f8bfbfffffa320b
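i.e. you can AND the two together straight from the shell; a non-zero result means the CPU is reporting AES-NI:

Code: Select all
# AES-NI is CPUID feature bit 25
printf '0x%x\n' $(( 0x1f8bfbfffffa320b & (1 << 25) ))
# prints 0x2000000 here, so the bit is set in this example value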

Re: Extreme Performance Issues with v2.1.6

Postby Haravikk » Tue Dec 27, 2022 4:15 am

lundman wrote:kstat.zfs.darwin.tunable.icp_aes_impl: cycle [fastest] generic x86_64

Odd, my VM has:

kstat.zfs.darwin.tunable.icp_aes_impl: cycle [fastest] generic x86_64 aesni

Is that for v2.1.0? I didn't think full hardware decryption support was added until a later version; that was part of why I wanted to upgrade. I won't be able to grab my sysctl values for v2.1.6 until I get another chance to switch versions, probably not for a week or so as I have stuff I need to run at the moment. I'll make a list of commands to run and try to get as much data as I can on what might be going on.

My hardware is a 2018 Mac Mini with the hexa-core i7 upgrade (i7-8700B), so it should definitely have hardware decryption support, but I'm pretty sure v2.1.0 never used it (there was a lot of buzz about a later version gaining all of that, IIRC, along with a load of assembly code to boost performance).
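When I next get the chance I'll also double-check what macOS itself reports for the CPU; on Intel something like this should list AES if the instructions are there:

Code: Select all
# machdep.cpu.features includes "AES" on CPUs with AES-NI
sysctl -n machdep.cpu.features | tr ' ' '\n' | grep -i '^aes'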

Re: Extreme Performance Issues with v2.1.6 (and v2.2.2)

Postby Haravikk » Tue Dec 19, 2023 4:45 am

So I finally got around to trying an upgrade again, and sadly this issue still exists in v2.2.2: installing it renders my system slow to the point of being unusable so long as there is any activity on any of my ZFS datasets (which there always is, as I use them to store user folders). The only way to restore performance is to unmount them all.

I was able to confirm my aes_impl value is as expected:

Code: Select all
kstat.zfs.darwin.tunable.icp_aes_impl: cycle [fastest] generic x86_64 aesni


But my performance is still absolutely terrible; after only around an hour and a half running v2.2.2, kernel_task had accumulated nearly 34 hours' worth of CPU time, so there is clearly a huge CPU bottleneck somewhere. What confuses me is why nobody else is complaining about this. It means v2.1.0 is the newest usable version of ZFS I can run, and the performance of even that isn't great (it still shows high kernel_task CPU usage at times, but at least the system is responsive).

I'm attaching the two spindumps I was able to generate. The first (spindump-normal.txt) is under "normal" usage with three users logged in, two of them running from their own ZFS datasets. The second (spindump-disk-speed.txt) was generated while running the Blackmagic Disk Speed Test, so it might give more focused results. I couldn't get the disk speed test to complete though; it started out okay (120MB/sec or so write speed) but the speed quickly declined until the system became completely unresponsive. Not just slow: it hung entirely, and loginwindow crashed, kicking me back to a fresh login window with no users logged in.

I should also add that I haven't run zpool upgrade, because again I didn't want to risk being unable to downgrade back to v2.1.0. I see no reason why this should affect performance, though, as I would expect v2.2.2 to perform no worse than v2.1.0 even if hardware acceleration is being ignored. Or could it be the opposite problem: could the hardware-accelerated code somehow be giving me worse performance? Is there any way I can forcibly disable it to find out?
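(I'm guessing something like the following might pin it to the non-accelerated code, if that tunable is writable at runtime?)

Code: Select all
# force the generic (pure C) AES implementation instead of "fastest"
sudo sysctl kstat.zfs.darwin.tunable.icp_aes_impl=generic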

What the hell could be the cause of this? I really, really want to upgrade to a newer version of ZFS; I don't want to be stuck in limbo on v2.1.0 forever, especially because v2.1.0 only has releases up to Big Sur.
Attachments
spindumps.zip (1.71 MiB)

Re: Extreme Performance Issues with v2.1.6 (and v2.2.2)

Postby nodarkthings » Tue Dec 19, 2023 3:40 pm

If I read that correctly, your users are on a dataset? Does the problem happen when you log into a user account, or already when you're at the login window?
I would try with a user on the macOS boot disk. If you moved the entire /Users folder, you can try logging in as the root user. If that works OK, it's a good clue that the problem is related to the user folder.

Re: Extreme Performance Issues with v2.1.6 (and v2.2.2)

Postby Haravikk » Tue Dec 19, 2023 4:01 pm

nodarkthings wrote:If I read that correctly, your users are on a dataset? Does the problem happen when you log into a user account, or already when you're at the login window?
I would try with a user on the macOS boot disk. If you moved the entire /Users folder, you can try logging in as the root user. If that works OK, it's a good clue that the problem is related to the user folder.

Basically my system is set up with an APFS system disk on the internal SSD that handles macOS, apps and an administrator user; I then have a ZFS main storage pool (zdata) on which my actual working (standard) user accounts are stored, each with its own dataset (plus some nested datasets to optimise things like recordsize, copies=N etc.). These datasets live on a pool made from regular spinning disks (for the capacity).

When I start up the machine I log in to the admin account only temporarily, to unlock FileVault (for the APFS disk) and trigger the unlocking and mounting of the ZFS datasets (using passwords stored in the keychain), so I can then log in to my other accounts. These are set up as standard accounts for various purposes: one for general use (browsing etc.), one for work (Xcode etc.), and one that stores all my media files (music, TV shows etc.).
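The unlock step is essentially just the following for each dataset (the keychain item name here is made up for the example):

Code: Select all
# pull the passphrase from the keychain and pipe it straight into load-key
security find-generic-password -a zdata-haravikk -w | sudo zfs load-key zdata/haravikk
sudo zfs mount zdata/haravikk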

I don't actually need to log in to any of the ZFS accounts to experience the poor performance; the spindump-disk-speed.txt file was taken from the admin account with no other users logged in, I just used one of the datasets as the target for the Blackmagic Disk Speed Test. The entire system's performance was crippled until loginwindow crashed and interrupted it, and performance remained sluggish until I exported the pool, at which point it went back to normal.

I've since reinstalled v2.1.0 and performance is normal again, so whatever the problem is, it's specific to the newer versions of ZFS on macOS; I originally experienced this (and posted the thread) when trying to upgrade to v2.1.6, and v2.2.2 is showing the exact same problem.

Re: Extreme Performance Issues with v2.1.6 (and v2.2.2)

Postby lundman » Tue Dec 19, 2023 4:27 pm

Heh, you only got 14 samples into spindump-disk-speed? Peculiar; I guess it really is busy.

So, of your busiest threads, you have:
Code: Select all
 1 Thread 0x2a4a    Thread name "z_wr_iss_1"    14 samples (1-14)    priority 76 (base 76)    cpu time 2.975s (11.6G cycles, 1961.3M instructions, 5.93c/i)
 2 Spindump 797.6M
 3 mdsync [5649]
 4 com.apple.spindump.stackshot_parsing"(28)    1 sample (2)    priority 31 (base 31)    cpu time 0.061s (255.4M cycles, 319.4M instructions, 0.80c/i)
 5 fseventsd [150]
 6 Activity Monitor [6079]


Since #1 and #5 (ZFS writing) are ending up in vm_map_find_space(), your system is completely out of RAM and desperate to find more. I think you should tell ZFS
to use less (keep halving) and see if it improves. There could also be a memory leak that you're triggering; the full output of sysctl for ZFS should show that, once it has started to slow down.
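e.g. something along these lines; the tunable name is from memory, so check "sysctl -a | grep arc_max" on your build, and add the line to /etc/zfs/zsysctl.conf if you want it to survive a reboot:

Code: Select all
# cap the ARC at 4GB (the default is usually half of RAM); keep halving from there
sudo sysctl kstat.zfs.darwin.tunable.zfs_arc_max=4294967296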
