Monterey, checksum errors with blake3

All your general support questions for OpenZFS on OS X.

Monterey, checksum errors with blake3

Postby kowalczt » Sat Nov 26, 2022 5:27 am

I have set up my pool on Monterey, on a 3rd-gen Intel Ivy Bridge machine.
Code:
zpool status -v
  pool: pub
 state: ONLINE
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Sat Nov 26 13:44:45 2022
config:

   NAME                                          STATE     READ WRITE CKSUM
   pub                                           ONLINE       0     0     0
     media-B85D1393-9BEB-A14E-A2C4-518A96EF598C  ONLINE       0     0     0
   logs
     media-FC049095-6CB3-11ED-9C00-50E54938D3F1  ONLINE       0     0     0
   cache
     media-02C7F916-6CB4-11ED-9C00-50E54938D3F1  ONLINE       0     0     0

Code:
zfs get all pub
NAME  PROPERTY               VALUE                  SOURCE
pub   type                   filesystem             -
pub   creation               Fri Nov 25 11:05 2022  -
pub   used                   31.8G                  -
pub   available              1.73T                  -
pub   referenced             1.88M                  -
pub   compressratio          1.07x                  -
pub   mounted                yes                    -
pub   quota                  none                   default
pub   reservation            none                   default
pub   recordsize             1M                     local
pub   mountpoint             /Volumes/pub           default
pub   sharenfs               off                    default
pub   checksum               skein                  local
pub   compression            zstd                   local
pub   atime                  off                    local
pub   devices                on                     default
pub   exec                   on                     default
pub   setuid                 on                     default
pub   readonly               off                    default
pub   zoned                  off                    default
pub   snapdir                hidden                 default
pub   aclmode                discard                default
pub   aclinherit             restricted             default
pub   createtxg              1                      -
pub   canmount               on                     local
pub   xattr                  on                     temporary
pub   copies                 1                      default
pub   version                5                      -
pub   utf8only               on                     -
pub   normalization          formD                  -
pub   casesensitivity        insensitive            -
pub   vscan                  off                    default
pub   nbmand                 off                    default
pub   sharesmb               off                    default
pub   refquota               none                   default
pub   refreservation         none                   default
pub   guid                   13352087833388962919   -
pub   primarycache           all                    default
pub   secondarycache         all                    default
pub   usedbysnapshots        0B                     -
pub   usedbydataset          1.88M                  -
pub   usedbychildren         31.8G                  -
pub   usedbyrefreservation   0B                     -
pub   logbias                latency                default
pub   objsetid               54                     -
pub   dedup                  off                    default
pub   mlslabel               none                   default
pub   sync                   standard               default
pub   dnodesize              legacy                 default
pub   refcompressratio       1.54x                  -
pub   written                1.88M                  -
pub   logicalused            34.2G                  -
pub   logicalreferenced      2.75M                  -
pub   volmode                default                default
pub   filesystem_limit       none                   default
pub   snapshot_limit         none                   default
pub   filesystem_count       none                   default
pub   snapshot_count         none                   default
pub   snapdev                hidden                 default
pub   acltype                off                    local
pub   context                none                   default
pub   fscontext              none                   default
pub   defcontext             none                   default
pub   rootcontext            none                   default
pub   relatime               on                     default
pub   redundant_metadata     all                    default
pub   overlay                on                     default
pub   encryption             off                    default
pub   keylocation            none                   default
pub   keyformat              none                   default
pub   pbkdf2iters            0                      default
pub   special_small_blocks   0                      default
pub   com.apple.browse       on                     default
pub   com.apple.ignoreowner  off                    default
pub   com.apple.mimic        apfs                   local
pub   com.apple.devdisk      poolonly               default

Now I am testing checksum=blake3.
When I copied a 20G file from an APFS disk to the zpool, it did not produce any errors.
But then a scrub showed errors after completion.
Code:
sudo zpool status -v
  pool: pub
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:19 with 5 errors on Sat Nov 26 13:34:26 2022
config:

   NAME                                          STATE     READ WRITE CKSUM
   pub                                           ONLINE       0     0     0
     media-B85D1393-9BEB-A14E-A2C4-518A96EF598C  ONLINE       0     0    10
   logs
     media-FC049095-6CB3-11ED-9C00-50E54938D3F1  ONLINE       0     0     0
   cache
     media-02C7F916-6CB4-11ED-9C00-50E54938D3F1  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /Volumes/pub/torrent/.VolumeIcon.icns
        /Volumes/pub/VM/Debian 11.vmwarevm/Virtual Disk-flat.vmdk

When I tried to copy more files, the machine occasionally rebooted, but I can't really get any useful information out of it.

Setting checksum=skein again fixes the errors.
Installed: OpenZFSonOsX-2.1.6rc7-Catalina-10.15.pkg
PS. This issue isn't really affecting me much; I just wanted to test what's going on with the new checksum algorithms.
kowalczt
 
Posts: 5
Joined: Fri Jan 14, 2022 10:51 am

Re: Monterey, checksum errors with blake3

Postby lundman » Sat Nov 26, 2022 4:41 pm

Ah OK, the blake3 assembler code is brand new, so that's entirely possible.
lundman
 
Posts: 1192
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Monterey, checksum errors with blake3

Postby lundman » Sat Nov 26, 2022 7:23 pm

Also, it would be interesting to check whether one implementation specifically is broken. You can swap the implementation with this sysctl:

kstat.zfs.darwin.tunable.zfs.blake3_impl: cycle [fastest] generic sse2 sse41

So sysctl kstat.zfs.darwin.tunable.zfs.blake3_impl=generic should work, since that is the plain C implementation. You'll have to write the files again for them to use the newly selected implementation.

1) change blake3_impl
2) copy files
3) scrub
4) repeat
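The loop can be sketched as a small script. This is only a sketch: the pool name "pub" is from this thread, the source file path is a placeholder, and by default it is a dry run that just prints the commands (set RUN="" to actually execute them).

```shell
# Dry-run sketch of the test loop above; "pub" and the file paths are
# assumptions to adapt. RUN=echo (the default) only prints the commands.
blake3_impl_loop() {
    run="${RUN:-echo}"
    for impl in generic sse2 sse41; do
        # 1) change blake3_impl
        $run sudo sysctl kstat.zfs.darwin.tunable.zfs.blake3_impl="$impl"
        # 2) copy files -- only newly written records get the new checksum
        $run cp /path/to/testfile "/Volumes/pub/testfile-$impl"
        # 3) scrub, then 4) check the CKSUM column before repeating
        $run sudo zpool scrub pub
        $run sudo zpool status -v pub
    done
}
blake3_impl_loop
```

In a real run you would wait for each scrub to finish (zpool status shows its progress) before reading the CKSUM column.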

I'll check my VM to see if I can reproduce.

Re: Monterey, checksum errors with blake3

Postby kowalczt » Sun Nov 27, 2022 2:01 pm

Thanks for the reply and the ideas.
Unfortunately, changing the algorithm to generic didn't change anything.
I will test the other implementations later to see if anything changes.

PS. I have checked fastest (default), generic, sse2, and sse41 - all of them produce checksum errors on scrub.
My CPU is really old (an i7-3770), but it still has all of these SSE instructions.
I'm going to stick with skein/default for now.

PS2. I will test this on FreeBSD 14-CURRENT later this week on the same PC, to check whether this is even close to working on this rig.

Re: Monterey, checksum errors with blake3

Postby rottegift » Mon Nov 28, 2022 7:49 am

kowalczt wrote:PS. I have checked fastest (default), generic, sse2, and sse41 - all of them produce checksum errors on scrub.


Just to double-check, did you rewrite all the data previously written by blake3 each time you changed these values?

The checksum for a record (in your case, judging from your zfs properties listing, up to 1M in size) is calculated and stored at write time, and recalculated and compared at read time. However, reads do not change the previously stored checksum.

It was still useful to know that you get the same errors when you read back previously written data, but changing implementations (even changing to FreeBSD) can't recover a blake3 checksum that was bad when written out to your pool's primary storage vdev.
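The write-time/read-time point can be illustrated with a few lines of shell, using the POSIX cksum CRC as a stand-in for blake3 (the mechanics are the point here, not the algorithm):

```shell
# "write-time": write a record and store its digest alongside it
record=/tmp/zfs-cksum-demo
printf 'hello zfs' > "$record"
cksum "$record" | awk '{print $1}' > "$record.cksum"

# "read-time" / scrub: recompute and compare against the stored digest.
# The stored value is only ever compared, never rewritten, by a read.
stored=$(cat "$record.cksum")
now=$(cksum "$record" | awk '{print $1}')
if [ "$stored" = "$now" ]; then verdict=OK; else verdict="CKSUM error"; fi
echo "$verdict"   # prints OK unless the record changed after the write
```

If the write-time digest was produced by a buggy implementation, every later read compares against that bad stored value, which is why the data must be rewritten after swapping implementations.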

Also, did you check your devices' SMART values to try to rule out hardware problems? Things like pending or remapped blocks, or UDMA CRC Error counts are hopefully all zero. (I unhesitatingly recommend https://binaryfruit.com/drivedx to Mac users).

kowalczt wrote:PS2. I will test this on FreeBSD 14-current later this week on this same PC to check if this is even close to working on this rig.


This is a very good idea. Using FreeBSD to populate a dataset on your pool with fresh data (with blake3 used as checksum) and then reading back on both FreeBSD and macOS would be helpful, if you have the time.

kowalczt wrote:Im going to stick to skein/default for now.


That's an even better idea. Skein is presently more portable across implementations and versions, it's older and therefore better tested, and it's unlikely that your choice of checksum is a bottleneck (although you can instrument this on your particular hardware with e.g. https://github.com/axboe/fio ("brew install fio")).
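A hypothetical fio run for that kind of measurement might look like this; you would run it once per checksum setting (e.g. after sudo zfs set checksum=skein pub, and again after checksum=blake3) and compare the reported write bandwidth. The flags and target directory are assumptions to adapt, and by default this just prints the command (set RUN="" to execute).

```shell
# Dry-run sketch of a sequential-write benchmark per checksum setting;
# the directory and size are assumptions. RUN=echo (default) only prints.
fio_cksum_bench() {
    run="${RUN:-echo}"
    $run fio --name=cksum-bench --directory=/Volumes/pub \
        --rw=write --bs=1M --size=4g --ioengine=posixaio --end_fsync=1
}
fio_cksum_bench
```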

The fletcher4 checksum (the default) is reasonable for most systems. The cryptographic checksums are slower (especially on older hardware) and are really only useful for:

- deduplication (DON'T DO THIS);
- "nopwrite", where one frequently overwrites files with exactly the same data at the same offset;
- using the "origin" property on a zfs send|recv in such a way that "nopwrite" is likely to produce a real gain (this is also quite rare);
- when you are exposed to a MITM on a zfs send|recv (use ssh?);
- when you have untrusted "root" users on your system (or with access to your physical devices) who might be tempted to change the contents of the pool in a way that zpool scrub won't detect (there are easier ways to mess you up with that kind of access...).

Mostly, the stronger checksums just make your scrubs take more energy.

I notice that you are using recordsize=1M. Performance with large blocks has substantial tradeoffs due to the read-modify-write burden, and your choice of compression (zstd) does not have a good mechanism to stop trying to compress incompressible or only slightly compressible data (compression=lz4 currently has the best mechanism for that). There is also additional in-core memory-management overhead. Unless you have good reasons for doing otherwise, I'd recommend recordsize=128k (the default). 1M doesn't hurt if you are writing only highly compressible unencrypted data sequentially (at several megabytes per second), you do not plan to modify that data after it is written, you plan to read it back almost entirely sequentially (if ever), and your pool has ample free space. Otherwise it is likely to hurt rather than help performance (especially as your pool fills up).
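A per-dataset compromise along those lines might look like the sketch below, with hypothetical child datasets: 1M records only where the data is large and written/read sequentially, and the 128k default everywhere else. It is a dry run by default (set RUN="" to execute), and note that recordsize only affects files written after the change.

```shell
# Dry-run sketch; the dataset names are hypothetical. RUN=echo only prints.
per_dataset_tuning() {
    run="${RUN:-echo}"
    $run sudo zfs create -o recordsize=1M pub/video   # large sequential media
    $run sudo zfs create pub/general                  # inherits the 128K default
    $run zfs get recordsize pub/video pub/general
}
per_dataset_tuning
```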
rottegift
 
Posts: 26
Joined: Fri Apr 25, 2014 12:00 am

Re: Monterey, checksum errors with blake3

Postby kowalczt » Thu Dec 01, 2022 9:51 am

rottegift wrote:
Just to double-check, did you rewrite all the data previously written by blake3 each time you changed these values? [...]

Also, did you check your devices' SMART values to try to rule out hardware problems?


Yes, I did this properly. For every test run I cleared the ZFS dataset, changed the checksum algorithm setting, wrote the data, and ran a scrub. Every run produced scrub errors.
The disks are all fine, with no SMART errors.

rottegift wrote:This is a very good idea. Using FreeBSD to populate a dataset on your pool with fresh data (with blake3 used as checksum) and then reading back on both FreeBSD and macOS would be helpful, if you have the time.

Unfortunately, running a scrub on FreeBSD 14 gives me a kernel panic; I'm not going to dig into what's wrong there.

rottegift wrote:That's an even better idea. Skein is presently more portable across implementations and versions, it's older and therefore better tested, and it's unlikely that your choice of checksum is a bottleneck. [...]

I notice that you are using recordsize=1M. [...] Unless you have good reasons for doing otherwise, I'd recommend recordsize=128k (the default). [...] Otherwise it is likely to hurt rather than help performance (especially as your pool fills up).


I didn't really notice any significant change in write speed when copying data from an NVMe disk to the ZFS pool after switching from zstd to lz4 compression. The data was mostly large video files, so there isn't much to compress anyway.
Anyway, I'm going to set recordsize=1M only for my videos dataset, and leave 128k for the rest.
Thanks for the suggestions.

