Monterey, checksum errors with blake3


Postby kowalczt » Sat Nov 26, 2022 5:27 am

I have set up my pool on Monterey, on an Intel 3rd-generation Ivy Bridge machine.
Code: Select all
zpool status -v
  pool: pub
 state: ONLINE
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Sat Nov 26 13:44:45 2022
config:

   NAME                                          STATE     READ WRITE CKSUM
   pub                                           ONLINE       0     0     0
     media-B85D1393-9BEB-A14E-A2C4-518A96EF598C  ONLINE       0     0     0
   logs
     media-FC049095-6CB3-11ED-9C00-50E54938D3F1  ONLINE       0     0     0
   cache
     media-02C7F916-6CB4-11ED-9C00-50E54938D3F1  ONLINE       0     0     0

Code: Select all
zfs get all pub
NAME  PROPERTY               VALUE                  SOURCE
pub   type                   filesystem             -
pub   creation               Fri Nov 25 11:05 2022  -
pub   used                   31.8G                  -
pub   available              1.73T                  -
pub   referenced             1.88M                  -
pub   compressratio          1.07x                  -
pub   mounted                yes                    -
pub   quota                  none                   default
pub   reservation            none                   default
pub   recordsize             1M                     local
pub   mountpoint             /Volumes/pub           default
pub   sharenfs               off                    default
pub   checksum               skein                  local
pub   compression            zstd                   local
pub   atime                  off                    local
pub   devices                on                     default
pub   exec                   on                     default
pub   setuid                 on                     default
pub   readonly               off                    default
pub   zoned                  off                    default
pub   snapdir                hidden                 default
pub   aclmode                discard                default
pub   aclinherit             restricted             default
pub   createtxg              1                      -
pub   canmount               on                     local
pub   xattr                  on                     temporary
pub   copies                 1                      default
pub   version                5                      -
pub   utf8only               on                     -
pub   normalization          formD                  -
pub   casesensitivity        insensitive            -
pub   vscan                  off                    default
pub   nbmand                 off                    default
pub   sharesmb               off                    default
pub   refquota               none                   default
pub   refreservation         none                   default
pub   guid                   13352087833388962919   -
pub   primarycache           all                    default
pub   secondarycache         all                    default
pub   usedbysnapshots        0B                     -
pub   usedbydataset          1.88M                  -
pub   usedbychildren         31.8G                  -
pub   usedbyrefreservation   0B                     -
pub   logbias                latency                default
pub   objsetid               54                     -
pub   dedup                  off                    default
pub   mlslabel               none                   default
pub   sync                   standard               default
pub   dnodesize              legacy                 default
pub   refcompressratio       1.54x                  -
pub   written                1.88M                  -
pub   logicalused            34.2G                  -
pub   logicalreferenced      2.75M                  -
pub   volmode                default                default
pub   filesystem_limit       none                   default
pub   snapshot_limit         none                   default
pub   filesystem_count       none                   default
pub   snapshot_count         none                   default
pub   snapdev                hidden                 default
pub   acltype                off                    local
pub   context                none                   default
pub   fscontext              none                   default
pub   defcontext             none                   default
pub   rootcontext            none                   default
pub   relatime               on                     default
pub   redundant_metadata     all                    default
pub   overlay                on                     default
pub   encryption             off                    default
pub   keylocation            none                   default
pub   keyformat              none                   default
pub   pbkdf2iters            0                      default
pub   special_small_blocks   0                      default
pub   com.apple.browse       on                     default
pub   com.apple.ignoreowner  off                    default
pub   com.apple.mimic        apfs                   local
pub   com.apple.devdisk      poolonly               default

Now testing checksum=blake3.
Copying a 20G file from an APFS disk to the zpool did not produce any errors, but a scrub afterwards did.
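Roughly the steps (the source path is just an example):
Code: Select all
sudo zfs set checksum=blake3 pub
cp "/Volumes/SomeAPFSVolume/bigfile.bin" /Volumes/pub/   # ~20G test file
sudo zpool scrub pub

The scrub result: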
Code: Select all
sudo zpool status -v
  pool: pub
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:19 with 5 errors on Sat Nov 26 13:34:26 2022
config:

   NAME                                          STATE     READ WRITE CKSUM
   pub                                           ONLINE       0     0     0
     media-B85D1393-9BEB-A14E-A2C4-518A96EF598C  ONLINE       0     0    10
   logs
     media-FC049095-6CB3-11ED-9C00-50E54938D3F1  ONLINE       0     0     0
   cache
     media-02C7F916-6CB4-11ED-9C00-50E54938D3F1  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /Volumes/pub/torrent/.VolumeIcon.icns
        /Volumes/pub/VM/Debian 11.vmwarevm/Virtual Disk-flat.vmdk

When I tried to copy more files, the machine occasionally rebooted, but I can't really get any useful information out of that.

Setting checksum back to skein fixes the errors.
Installed: OpenZFSonOsX-2.1.6rc7-Catalina-10.15.pkg
PS. This issue isn't really affecting me much; I just wanted to test what's going on with the new checksum algorithms.
kowalczt
 
Posts: 5
Joined: Fri Jan 14, 2022 10:51 am

Re: Monterey, checksum errors with blake3

Postby lundman » Sat Nov 26, 2022 4:41 pm

Ah, OK, the blake3 assembler is brand new, so that's entirely possible.
lundman
 
Posts: 1335
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Monterey, checksum errors with blake3

Postby lundman » Sat Nov 26, 2022 7:23 pm

Also, it would be interesting to check whether one implementation specifically is broken. You can swap the implementation with this sysctl:

kstat.zfs.darwin.tunable.zfs.blake3_impl: cycle [fastest] generic sse2 sse41

So sysctl kstat.zfs.darwin.tunable.zfs.blake3_impl=generic should work, since that is the plain C implementation. You'll have to write files again for them to use the new checksum:

1) change blake3_impl
2) copy files
3) scrub
4) repeat
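
Something like this, as a rough sketch (the dataset name and test file path are just placeholders):
Code: Select all
# cycle through the blake3 implementations and scrub after each write
for impl in generic sse2 sse41; do
    sudo sysctl kstat.zfs.darwin.tunable.zfs.blake3_impl=$impl
    sudo zfs destroy -r pub/blake3test 2>/dev/null
    sudo zfs create -o checksum=blake3 pub/blake3test
    cp /path/to/testfile /Volumes/pub/blake3test/
    sudo zpool scrub pub
    sleep 60                      # give the scrub time to finish
    zpool status -v pub           # check the CKSUM column and error list
done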

I'll check my VM to see if I can reproduce.
lundman
 
Posts: 1335
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: Monterey, checksum errors with blake3

Postby kowalczt » Sun Nov 27, 2022 2:01 pm

Thanks for the reply and the ideas.
Unfortunately, changing the implementation to generic didn't change anything.
I will test the other implementations later to see if anything changes.

PS. I have checked fastest (default), generic, sse2, and sse41; all produce checksum errors on scrub.
My CPU is really old (an i7-3770), but it still has all of these SSE instructions.
I'm going to stick with skein/default for now.

PS2. I will test this on FreeBSD 14-CURRENT later this week on the same PC, to check whether this even comes close to working on this rig.
kowalczt
 
Posts: 5
Joined: Fri Jan 14, 2022 10:51 am

Re: Monterey, checksum errors with blake3

Postby rottegift » Mon Nov 28, 2022 7:49 am

kowalczt wrote:PS. I have checked fastest (default), generic, sse2, and sse41; all produce checksum errors on scrub.


Just to double-check, did you rewrite all the data previously written by blake3 each time you changed these values?

The checksum for a record (in your case, per your zfs properties listing, up to 1M in size) is calculated and stored at write time, and calculated and compared at read time. However, reads do not change the previously stored checksum.

It was still useful to know that you get the same errors when you read back previously written data, but changing implementations (even changing to FreeBSD) can't recover a blake3 checksum that was bad when written out to your pool's primary storage vdev.

Also, did you check your devices' SMART values to try to rule out hardware problems? Things like pending or remapped blocks, or UDMA CRC Error counts are hopefully all zero. (I unhesitatingly recommend https://binaryfruit.com/drivedx to Mac users).
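
If you prefer the command line, something like this (smartmontools via "brew install smartmontools"; the disk identifier is a placeholder) will show the attributes of interest:
Code: Select all
# find the identifier of the data disk with "diskutil list" first
smartctl -a /dev/disk2 | grep -Ei 'pending|realloc|crc'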

kowalczt wrote:PS2. I will test this on FreeBSD 14-CURRENT later this week on the same PC, to check whether this even comes close to working on this rig.


This is a very good idea. Using FreeBSD to populate a dataset on your pool with fresh data (with blake3 used as checksum) and then reading back on both FreeBSD and macOS would be helpful, if you have the time.

kowalczt wrote:I'm going to stick with skein/default for now.


That's an even better idea. Skein is presently more portable across implementations and versions, it's older and therefore better tested, and it's unlikely that your choice of checksum is a bottleneck (although you can instrument this on your particular hardware with e.g. https://github.com/axboe/fio ("brew install fio")).
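
For example, something along these lines (the target directory is a placeholder) gives a rough sequential-write number you can compare across checksum settings:
Code: Select all
# sequential 1M writes into the mounted dataset
fio --name=cksum-test --directory=/Volumes/pub --rw=write --bs=1M \
    --size=4g --ioengine=posixaio --end_fsync=1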

The fletcher4 checksum (the default) is reasonable for most systems. The cryptographic checksums are slower (especially on older hardware) and are really only useful in a few cases: deduplication (DON'T DO THIS); "nopwrite", where one is frequently overwriting files with exactly the same data (and at the same offset); using the "origin" property on a zfs send|recv in such a way that "nopwrite" is likely to produce a real gain (this is also quite rare); when you are exposed to MITM on a zfs send|recv (use ssh?); or where you have untrusted "root" users on your system (or with access to your physical devices) who are tempted to change the contents of the pool in a way that zpool scrub won't detect (there are easier ways to mess you up with that kind of access...). Mostly the stronger checksums make your scrubs take more energy.

I notice that you are using recordsize=1M. Performance with large blocks has substantial tradeoffs due to read-modify-write burden, and your choice of compression (zstd) does not have a good mechanism to stop trying to compress incompressible or only slightly compressible data (compression=lz4 has the best mechanism for that, at present). There is also additional in-core memory-management overhead. Unless you have good reasons for doing otherwise, I'd recommend recordsize=128k (the default). 1M doesn't hurt if you are writing only highly compressible unencrypted data sequentially (and at several megabytes per second), you do not plan to modify that data after it is written, you plan to read it back almost entirely sequentially (if ever), and your pool has ample free space. Otherwise it is likely to hurt rather than help performance (especially as your pool fills up).
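
As a sketch (the dataset names here are only examples):
Code: Select all
# default-sized records plus lz4, which bails out early on incompressible data
sudo zfs set recordsize=128k compression=lz4 pub
# keep 1M records only where it clearly helps, e.g. large media files written
# once, sequentially, and rarely modified afterwards
sudo zfs create -o recordsize=1M -o compression=lz4 pub/video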
rottegift
 
Posts: 26
Joined: Fri Apr 25, 2014 12:00 am

Re: Monterey, checksum errors with blake3

Postby kowalczt » Thu Dec 01, 2022 9:51 am

rottegift wrote:
Just to double-check, did you rewrite all the data previously written by blake3 each time you changed these values?

Also, did you check your devices' SMART values to try to rule out hardware problems?


Yes, I did this properly. For every test run I cleared the ZFS dataset, changed the checksum/implementation settings, wrote the data, and ran a scrub. Every scrub reported errors.
The disks are all fine, with no SMART errors.

rottegift wrote:This is a very good idea. Using FreeBSD to populate a dataset on your pool with fresh data (with blake3 used as checksum) and then reading back on both FreeBSD and macOS would be helpful, if you have the time.

Unfortunately, running a scrub under FreeBSD 14 gives me a kernel panic; I'm not going to dig into what's wrong there.

rottegift wrote:I notice that you are using recordsize=1M. Performance with large blocks has substantial tradeoffs due to read-modify-write burden, and your choice of compression (zstd) does not have a good mechanism to stop trying to compress incompressible or only slightly compressible data. [...] Unless you have good reasons for doing otherwise, I'd recommend recordsize=128k (the default).


I didn't really notice any significant write-speed change when copying data from an NVMe disk to the ZFS pool after switching from zstd to lz4 compression. The data was mostly large video files, so there wasn't much to compress anyway.
I'm going to set recordsize=1M only for my videos dataset and leave 128k for the rest.
Thanks for the suggestions.
kowalczt
 
Posts: 5
Joined: Fri Jan 14, 2022 10:51 am

