Understanding ZFS ashift and recordsize parameters
Posted: Sat Jan 25, 2020 12:07 am
I am about to move my photo library onto a new SSD-backed ZFS mirror, and would like to understand which performance parameters to use when creating the new pool.
The central question I have is:
What is the relationship of a ZFS pool's ashift value, a ZFS dataset's recordsize parameter, and the ARC caching?
And how does encryption and compression fit into this picture?
My current understanding is as follows:
(I am concentrating on file content data, ignoring ZFS metadata for the moment, and leaving ZVOLs aside.)
- ZFS pool's ashift
minimum physical I/O size; set per vdev at creation time (in practice usually uniform across a pool); immutable once the vdev exists;
ideally aligns with the I/O granularity of the storage medium, so e.g. for HDDs with Advanced Format (4096-byte sectors) it should be ashift=12 (2^12 = 4096)
question: which ashift value is suggested for SSDs? Is ashift=12 still ok?
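For reference, here is how I imagine setting and verifying ashift would look (a sketch: pool and device names are placeholders, and the zdb output format may vary by OpenZFS version):

```
# Hypothetical pool creation; /dev/sda and /dev/sdb are placeholder devices.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# Inspect the ashift actually in use (reported per vdev):
zdb -C tank | grep ashift
```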
- ZFS dataset's recordsize parameter
basically a "logical block size" used by a ZFS dataset
unit of transfer(?) for a dataset, though for smaller files ZFS can allocate/request less data
unit of checksumming and hence unit of Copy-on-Write per dataset?
unit of compression and encryption too?
set per dataset, but can be changed; changes only affect newly written files, so in effect it is per file?
ZFS default seems to be a recordsize of 128K...
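As a side note on how I understand the tuning would work in practice (a sketch; the dataset name is a placeholder, and recordsize=1M assumes a recent OpenZFS with large_blocks support):

```
# Hypothetical dataset; larger records are often suggested for large,
# sequentially read files such as photos.
zfs create -o recordsize=1M tank/photos
zfs get recordsize tank/photos

# Changing it later only affects files written afterwards:
zfs set recordsize=128K tank/photos
```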
- Adaptive Replacement Cache (ARC)
seems to buffer at the level of dataset records, i.e. mainly holds buffers with the same granularity as the dataset's recordsize,
though since ZFS can also store smaller files (such as .DS_Store, small config or icon files), some buffers can be smaller.
But in principle it would align with the default recordsize of the datasets, so out-of-the-box mostly 128K buffer entries?
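On Linux, I assume the ARC's current state could be observed through the kstat interface, something like this (a sketch; the path is specific to OpenZFS on Linux):

```
# Current ARC size ("size") and target size ("c"), in bytes:
awk '$1 == "size" || $1 == "c" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
```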
If I understand it correctly: given a 4 MB image file on a ZFS dataset with the default recordsize of 128K, on a pool with ashift=12, the image is internally stored as 32 ZFS records (each with its own checksum). Is each record then read from disk in 32 I/O requests of 4K, i.e. 1024 I/O requests in total? Or is it 32 I/O requests, each 128K long?
And would each dataset record read (default: 128K) be cached in the ARC as one entry?
So the ARC would hold 32 entries in its 128K buffer list after reading the 4 MB image file?
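To make the arithmetic above concrete, here is the calculation spelled out (assumed values only, not real ZFS output: a 4 MiB file, the 128 KiB default recordsize, and ashift=12):

```shell
FILE_BYTES=$((4 * 1024 * 1024))        # 4 MiB image file
RECORDSIZE=$((128 * 1024))             # dataset recordsize=128K (default)
SECTOR_BYTES=$((1 << 12))              # ashift=12 -> 4096-byte sectors

RECORDS=$((FILE_BYTES / RECORDSIZE))               # logical records, each checksummed
SECTORS_PER_RECORD=$((RECORDSIZE / SECTOR_BYTES))  # device-sector units per record
echo "$RECORDS records of $((RECORDSIZE / 1024))K, $SECTORS_PER_RECORD sectors each"
```

This prints `32 records of 128K, 32 sectors each`, which is where my 32-records figure comes from; the open question is whether each record costs 32 separate 4K I/Os or one 128K I/O.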
What happens if I switch on compression? For images probably nothing, as my understanding is that ZFS stores the data uncompressed if it cannot compress it. But for 4 MB of text or uncompressed PDF files, it could make quite a difference. But where in the pipeline does compression happen?
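For context, this is how I would enable compression and check its effect afterwards (a sketch; the dataset name is a placeholder):

```
zfs set compression=lz4 tank/photos

# After writing data, check how well it actually compressed:
zfs get compression,compressratio tank/photos
```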
Where is encryption done, and does it change the block or record size, or does it produce the same amount of output data?
And finally: when I send an existing dataset to a new pool, which ashift and recordsize values are used when writing the received data? Is it the ashift and recordsize of the receiver, which would mean send/receive could be used to change the storage layout, or are the parameters of the sending dataset kept?
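To frame that last question: on recent OpenZFS, `zfs receive` accepts `-o` to override properties on the receiving side, so I imagine the migration would look something like this (names are placeholders; whether the data actually gets re-chunked to the new recordsize is exactly what I am unsure about):

```
zfs snapshot tank/photos@migrate
zfs send tank/photos@migrate | zfs receive -o recordsize=1M newpool/photos
```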
Sorry for so many questions, but basically everything in ZFS is heavily inter-related...