Understanding ZFS ashift and recordsize parameters
Posted: Sat Jan 25, 2020 12:07 am
I am about to move my photo library onto a new SSD-backed ZFS mirror, and would like to understand which performance parameters to use when creating the new pool.
The central question I have is:
What is the relationship of a ZFS pool's ashift value, a ZFS dataset's recordsize parameter, and the ARC caching?
And how does encryption and compression fit into this picture?
My current understanding is as follows:
(I am concentrating on file content data, ignoring ZFS metadata for the moment, and leaving ZVOLs aside.)
- ZFS pool's ashift
minimum physical I/O size; set per vdev at creation time (in practice usually uniform across a pool); immutable once the vdev exists;
ideally aligns with the I/O granularity of the storage medium, so e.g. for HDDs with Advanced Format (4096-byte sectors) it should be ashift=12 (2^12 = 4096)
question: which ashift value is suggested for SSDs? Is ashift=12 still ok?
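For reference, here is how I imagine setting and verifying ashift would look (a sketch: pool and device names are placeholders, and the zdb output format may vary by OpenZFS version):

```
# Hypothetical pool creation; /dev/sda and /dev/sdb are placeholder devices.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# Inspect the ashift actually in use (reported per vdev):
zdb -C tank | grep ashift
```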
- ZFS dataset's recordsize parameter
basically a "logical block size" used by a ZFS dataset
unit of transfer(?) for a dataset, though for smaller files ZFS can allocate/request less data
unit of checksumming and hence unit of Copy-on-Write per dataset?
unit of compression and encryption too?
set per dataset, but can be changed; changes only affect newly written files, so in effect it is per file?
ZFS default seems to be a recordsize of 128K...
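As a side note on how I understand the tuning would work in practice (a sketch; the dataset name is a placeholder, and recordsize=1M assumes a recent OpenZFS with large_blocks support):

```
# Hypothetical dataset; larger records are often suggested for large,
# sequentially read files such as photos.
zfs create -o recordsize=1M tank/photos
zfs get recordsize tank/photos

# Changing it later only affects files written afterwards:
zfs set recordsize=128K tank/photos
```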
- Adaptive Replacement Cache (ARC)
seems to buffer at the level of dataset records, i.e. mainly holds buffers with the same granularity as the dataset's recordsize,
though since ZFS can also store smaller files (such as .DS_Store, small config or icon files), some buffers can be smaller.
But in principle it would align with the default recordsize of the datasets, so out-of-the-box mostly 128K buffer entries?
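On Linux, I assume the ARC's current state could be observed through the kstat interface, something like this (a sketch; the path is specific to OpenZFS on Linux):

```
# Current ARC size ("size") and target size ("c"), in bytes:
awk '$1 == "size" || $1 == "c" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
```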
If I understand it correctly: given a 4 MB image file on a ZFS dataset with the default recordsize of 128K, on a pool with ashift=12, the image is internally stored as 32 ZFS records (each with its own checksum). Is each record then read from disk in 32 I/O requests of 4K, i.e. 1024 I/O requests in total? Or is it 32 I/O requests, each 128K long?
And would each dataset record read (default: 128K) be cached in the ARC as one entry?
So the ARC would hold 32 entries in its 128K buffer list after reading the 4 MB image file?
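To make the arithmetic above concrete, here is the calculation spelled out (assumed values only, not real ZFS output: a 4 MiB file, the 128 KiB default recordsize, and ashift=12):

```shell
FILE_BYTES=$((4 * 1024 * 1024))        # 4 MiB image file
RECORDSIZE=$((128 * 1024))             # dataset recordsize=128K (default)
SECTOR_BYTES=$((1 << 12))              # ashift=12 -> 4096-byte sectors

RECORDS=$((FILE_BYTES / RECORDSIZE))               # logical records, each checksummed
SECTORS_PER_RECORD=$((RECORDSIZE / SECTOR_BYTES))  # device-sector units per record
echo "$RECORDS records of $((RECORDSIZE / 1024))K, $SECTORS_PER_RECORD sectors each"
```

This prints `32 records of 128K, 32 sectors each`, which is where my 32-records figure comes from; the open question is whether each record costs 32 separate 4K I/Os or one 128K I/O.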
What happens if I switch on compression? For images probably nothing, as my understanding is that ZFS stores the data uncompressed if it cannot compress it. But for 4 MB of text or uncompressed PDF files, it could make quite a difference. But where in the pipeline does compression happen?
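For context, this is how I would enable compression and check its effect afterwards (a sketch; the dataset name is a placeholder):

```
zfs set compression=lz4 tank/photos

# After writing data, check how well it actually compressed:
zfs get compression,compressratio tank/photos
```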
Where is encryption done, and does it change the block or record size, or does it produce the same amount of output data?
And finally: when I send an existing dataset to a new pool, which ashift and recordsize values are used when writing the received data? Is it the ashift and recordsize of the receiver, which would mean send/receive could be used to change the storage layout, or are the parameters of the sending dataset kept?
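To frame that last question: on recent OpenZFS, `zfs receive` accepts `-o` to override properties on the receiving side, so I imagine the migration would look something like this (names are placeholders; whether the data actually gets re-chunked to the new recordsize is exactly what I am unsure about):

```
zfs snapshot tank/photos@migrate
zfs send tank/photos@migrate | zfs receive -o recordsize=1M newpool/photos
```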
Sorry for so many questions, but basically everything in ZFS is heavily inter-related...