ZFS freezes userspace on write operations

All your general support questions for OpenZFS on OS X.

ZFS freezes userspace on write operations

Postby ranvel » Sat Sep 30, 2023 4:11 pm

Hello there!

I'm running into an issue with my ZFS pool (O3X 2.1.6) on Mojave 10.14.6 (18G9323).

Here is my setup:

Code:
   NAME                                            STATE     READ WRITE CKSUM
   ranvol                                          ONLINE       0     0     0
     raidz1-0                                      ONLINE       0     0     0
       media-CB90139A-5957-11EC-8988-3C07547EECD7  ONLINE       0     0     0
       media-47247846-0AD3-4348-8B7E-4B50DA8341B5  ONLINE       0     0     0
       media-6754CB3A-ADE7-BD4E-A34F-FBEBB3166868  ONLINE       0     0     0


I recently replaced all of these drives and expanded the pool, and ever since, large write operations make userspace completely freeze. The operations complete successfully and I get no panics, but I can't interact with the OS using the keyboard or mouse, and in-progress operations on the screen freeze. I'm pasting in my arcstat log, where you can see that at one point it skips a few minutes.

I replaced the disks one at a time, once per week over the course of a month, and did a scrub before I expanded the pool. Everything was functioning normally right up until I expanded the pool, but afterwards this behavior started.
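For reference, the replace-and-expand sequence looks roughly like this (the disk identifiers below are placeholders, not my actual GUIDs):

```shell
# Replace each drive one at a time; wait for the resilver to finish
# before moving on to the next disk.
sudo zpool replace ranvol media-OLD-GUID /dev/disk5   # placeholder identifiers
sudo zpool status ranvol                              # watch until resilver completes

# After all three disks are swapped, scrub, then let the pool grow
# into the larger disks:
sudo zpool scrub ranvol
sudo zpool set autoexpand=on ranvol
sudo zpool online -e ranvol media-NEW-GUID            # expand to new disk size
```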

I've run into this whenever I've used ZFS in the past, but that was on laptops without a lot of memory to go around; this machine has 128 GB, so I figured that should be enough.

Code:
ranvel@grey-lodge %> zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
ranvol        4.51T  6.27T  128K  none
ranvol/data   2.33T  6.27T  2.33T  /var/data/rdata
ranvol/media  2.18T  6.27T  2.18T  /var/data/media
ranvel@grey-lodge %> zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
ranvol  16.4T  6.76T  9.59T        -         -     1%    41%  1.00x    ONLINE  -


Also, there seems to be a disparity between the 16.4T reported by `zpool list` and the roughly 11T (USED plus AVAIL) reported by `zfs list`.
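If I understand raidz1 accounting right, `zpool list` reports raw capacity including parity, while `zfs list` reports usable space after one disk's worth of parity is subtracted. A quick sanity check, assuming three equal disks:

```python
# Rough raidz1 accounting check: zpool list shows raw space (parity included),
# zfs list shows usable space (one disk's worth of parity removed).
raw_tib = 16.4          # SIZE from `zpool list`
ndisks = 3              # three-disk raidz1

usable = raw_tib * (ndisks - 1) / ndisks   # one disk's worth goes to parity
print(f"expected usable: {usable:.1f} TiB")

# `zfs list` shows USED + AVAIL = 4.51T + 6.27T = 10.78T, close to the
# estimate above (the remainder goes to metadata, reservations, rounding).
print(f"zfs list total:  {4.51 + 6.27:.2f} TiB")
```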

In any case, I don't know if there is a tunable for this, or if it's a known issue and I just skipped a step somewhere.

Thank you!
Attachments
iostat.log
`zpool iostat 1` logs during write operation
(3.78 KiB) Downloaded 112 times
arc-stat.log
arcstat logs during write
(4.22 KiB) Downloaded 109 times
ranvel
 
Posts: 3
Joined: Sun Sep 10, 2023 7:31 am

Re: ZFS freezes userspace on write operations

Postby ranvel » Mon Oct 09, 2023 2:41 pm

I saw a few other people on the forum saying that they downgraded to 2.1.0 from 2.1.6. This completely fixed the problem for me, as well.

`arcstat` stopped working, but that was because 'darwin' wasn't recognized as a platform, so you can just throw in the following after line 186 if you want to get that working, too:

Code:
elif sys.platform.startswith('darwin'):
    import subprocess

    def kstat_update():
        global kstat

        # On O3X the ARC statistics are exposed through sysctl rather
        # than /proc, so shell out and parse the key/value output.
        process = subprocess.Popen(['sysctl', 'kstat.zfs.misc.arcstats'],
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        kstats = process.communicate()[0].decode('ascii').splitlines()

        kstat = {}
        for line in kstats:
            name, value = line.split(':', 1)
            # Drop the 'kstat.zfs.misc.arcstats.' prefix (24 characters)
            # so the keys match what arcstat expects, e.g. 'hits'.
            kstat[name[24:]] = int(value.strip())


You can also just use Suspicious Package to extract arcstat from the 2.1.6 pkg; it runs without modification.
ranvel
 
Posts: 3
Joined: Sun Sep 10, 2023 7:31 am

Re: ZFS freezes userspace on write operations

Postby Haravikk » Tue Mar 26, 2024 1:41 am

This sounds like the same issue that a few other users and I have been struggling with for a while; you can see my thread about it here. I've also posted a GitHub issue to try to narrow it down to just what we know.

In that thread, armdn discovered that setting primarycache=none and secondarycache=none "solves" this problem (though obviously you take a different performance hit for running without caching). So the issue definitely lies somewhere in the primary ARC, since the L2ARC doesn't seem to matter (though having only secondarycache enabled also triggers the issue, probably because data still passes through the primary cache on its way to L2ARC). It doesn't even seem to matter when you set these values: even if you've been running with caching for a while and then set them to none, you'll see an improvement within a few minutes.
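For anyone who wants to try the workaround, the properties are set per-dataset and inherited by children, so setting them on the pool root covers everything (pool name here assumes ranvel's setup from the first post):

```shell
# Disable both ARC and L2ARC caching for the whole pool:
sudo zfs set primarycache=none ranvol
sudo zfs set secondarycache=none ranvol

# Confirm the current values:
zfs get primarycache,secondarycache ranvol

# To restore the defaults later:
sudo zfs inherit primarycache ranvol
sudo zfs inherit secondarycache ranvol
```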

If you want to try this out, it's safe to upgrade ZFS temporarily so long as you don't run zpool upgrade: you can test the settings, then downgrade back to v2.1.0 and re-enable your caching. Also, please feel free to add any info you think is missing from the GitHub issue!

My own experience hasn't been specific to write-intensive workloads; it's happened on datasets that are barely writing anything, but it's possible any amount of writing is enough.
Haravikk
 
Posts: 82
Joined: Tue Mar 17, 2015 4:52 am

