ashift property values


ashift property values

Post by grahamperrin » Sun Mar 24, 2013 3:44 am

(This topic was originally entitled ashift=13 for a cache vdev …)

In irc://irc.freenode.net/#zfsonlinux there's discussion of
    ashift=13

I asked:

How do I specify an ashift value when adding a cache vdev to an existing pool?


Considerations

With ZEVO Community Edition 1.1.1, the zpool(8) man page lacks a Cache Devices section.

I hesitate to experiment with the combination of zpool add and -o because there's no file system on a cache vdev.
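
Presumably the command, if it were supported, would look something like this (the pool name and device here are made up, and I don't know whether ZEVO 1.1.1 accepts -o ashift on zpool add at all):

    zpool add -o ashift=13 tank cache /dev/disk3s2
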
Last edited by grahamperrin on Wed Mar 27, 2013 1:39 am, edited 1 time in total.

Add "-o ashift" to zpool add

Post by grahamperrin » Sun Mar 24, 2013 5:28 am

Found amongst the issues for ZFS on Linux:


dasjoe wrote:… one can't force a specific ashift when adding vdevs to an existing pool …


There, it's a closed issue.

Here, it seems my wish is premature :)

Re: ashift=13 for a cache vdev

Post by raattgift » Mon Mar 25, 2013 1:55 pm

Most of the stuff that changes with a change to ashift (e.g. the layout of uberblocks, and the location and number of metaslabs; there is simply more metadata overhead when ashift is larger) isn't applicable to cache vdevs, no matter what the cache vdev is.

The one that is relevant to a cache vdev is the minimum I/O size to the vdev (and, implicitly, the alignment of the first byte of each ashift-sized "block"). That is important for devices which do a read-modify-write (RMW) only for writes significantly smaller than their physical block, and where, through misalignment, the edges of even multi-kilobyte writes will entail an RMW.

The RMW problem is not relevant to cache vdevs, since they are filled slowly by a separate thread that does not block the main operation of the pool (or the ARC); misaligned or too-small writes are immaterial for any reasonable choice of underlying hardware. Furthermore, all writes are *purely* sequential until the vdev fills to 100%, at which point new data is written from the start of the vdev and continues sequentially, overwriting what was there before.

Therefore, from a write-performance perspective it does not really matter what the ashift of a cache vdev is. However, because larger ashifts are more wasteful of space, a smaller ashift is usually appropriate for cache vdevs so that more blocks can fit into the L2ARC; in general one would *prefer* the minimum ashift of 9 over any other ashift.

Reads from cache vdevs are expected to be highly random, so it is important that they have low seek time on reads. SSDs typically have that property, even when there is a big mismatch between the size of the read unit and the size of the underlying hardware erase block. Very fast disks (say, short-stroked ones) may have acceptable seek times, but they are unlikely to benefit much from a larger ashift rather than from a larger dataset recordsize, since the ARC stores records and the thread that copies from ARC to L2ARC works with records. The pool's ashift property thus puts a floor on recordsize for the whole pool; the cache vdev's or hardware's ashifts aren't relevant to that, as long as they are the same size *or smaller* (if they are larger, you are guaranteed to waste space). Larger recordsizes are what help make L2ARC reads more sequential.
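
As an aside, a quick way to see both values on a given system is something like the following; the pool and dataset names are hypothetical, and zdb -C may behave differently on ZEVO than on other platforms:

    zfs get recordsize tank/data     # per-dataset record size, 128K by default
    zdb -C tank | grep ashift        # ashift recorded for each top-level storage vdev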

ZFS's general philosophy, at least under Bonwick in particular, was to expose only necessary tunables in the CLI; the idea is that sound engineering should mean that tuning only ever makes things worse. This continues to be true *even* for *many* devices which emulate 512-byte LBAs but use 4096-byte physical blocks. That seems counterintuitive, but there is evidence even in the ZFS on Linux discussion you link to: in practice the RMW hit hurts more when there is misalignment (which the Mac OS X handling of GPT avoids!) than when the pool ashift is 9 (i.e. a logical blocksize of 2^9) and the hardware physical block size is 2^12.
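
To check how this applies to a given disk on Mac OS X, something like the following should show the reported block size and the partition alignment (the disk identifier is a placeholder, and the exact field names vary a little between OS X releases):

    diskutil info disk2 | grep -i "block size"    # logical block size the device reports to the OS
    sudo gpt -r show disk2                        # partition start sectors; starts divisible by 8 are 4k-aligned on 512b LBAs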

Some of this is applicable to log devices too. They may be pounded by sequentialized writes, so an underlying RMW cycle should be avoided. However, if they accept synchronous writes faster than a ZIL spread across the pool's storage vdevs would, the log vdev is still a win.

Log vdevs are ONLY read at pool import time, when they are typically empty (assuming a clean export).

Unless someone can show, quantitatively and with results repeatable across a variety of platforms, that the ashift of cache or log vdevs makes a substantial performance difference, the issue is not going to gain much attention. Additionally, even if (and it's a big if!) a substantial difference in a common deployment can be shown, the fix is not necessarily to expose a tunable.

Personally, I would prefer to remove the ashift tunable altogether, assume that modern hardware should work acceptably with ashift=12 in all cases, and simply not report an ashift for non-storage vdevs (cache, log). A future version could bump that to 13 if that becomes widely desirable.

Indeed, that would also make the most frequent problem associated with ashift go away: devices that report a native LBA size greater than the pool's ashift cannot be added to the pool. That usually happens when a new disk is meant to replace an old one in an old pool, or when someone wants to add a new storage vdev to an old pool. If ashift is always sufficiently large to start with, this never causes problems, since you can add a device with a smaller native LBA to a pool with a larger ashift.

Re: ashift=13 for a cache vdev

Post by raattgift » Mon Mar 25, 2013 2:34 pm

See also

http://wiki.illumos.org/display/illumos ... rmat+disks

[Edited to make the link work. I'd chopped "wiki.i". Oops.]
Last edited by raattgift on Tue Mar 26, 2013 3:28 pm, edited 3 times in total.

Re: ashift=13 for a cache vdev

Post by raattgift » Mon Mar 25, 2013 2:42 pm

Also, a small correction to the posting before last: zdb -l does NOT report an ashift for a cache vdev.
(It does report an ashift for log vdevs, however.)

cache vdev

--------------------------------------------
LABEL 0
--------------------------------------------
version: 28...
state: 4
guid: 12
pool_guid: 59...


log vdev

--------------------------------------------
LABEL 0
--------------------------------------------
version: 28
name: ''
state: 0
txg: 37206664
pool_guid: 15...
hostname: '...'
top_guid: 77...
guid: 77...
is_log: 1
vdev_children: 2
vdev_tree:
type: 'disk'
id: 1
guid: 77..
path: '/dev/dsk/GPTE_B6...'
whole_disk: 0
metaslab_array: 38
metaslab_shift: 23
ashift: 9
asize: 995098624
is_log: 1
DTL: 1022
create_txg: 4673744

(Most of my log vdevs report ashift: 9; I have a handful (all LVs in a CoreStorage LVG whose only PV is an SSD) that report ashift: 12; I have zdb -l ashift mismatches between storage vdev devices and log vdev devices in both directions)
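
(For reference, the labels above come from running zdb -l against the member device; the path below is only a placeholder, since ZEVO exposes pool members under /dev/dsk/GPTE_* names.)

    sudo zdb -l /dev/dsk/GPTE_XXXXXXXX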

Links

Post by grahamperrin » Mon Mar 25, 2013 7:59 pm

Thanks – George Wilson's talk is also recommended under the topic RAID-Z and 4K sector size (Advanced Format). See https://diigo.com/0xvdx for highlights from the illumos page.

Your additional info about non-storage vdevs is greatly appreciated!
Last edited by grahamperrin on Wed Mar 27, 2013 2:11 am, edited 1 time in total.

Re: ashift=13 for a cache vdev

Post by raattgift » Tue Mar 26, 2013 4:31 pm

Note that these comments are generic with respect to zfs. There is practically nothing ZEVO-specific below.

I would not use a raidz with more than three large disks, advanced format or not, for any workload. Raidz2 and raidz3 do not have the mind-the-gap problem of raidz and will help you avoid being bitten by "needing to resilver in the middle of a resilver" types of problems, which will fault a raidz. Resilvers take a long time for big, full disks. The impact of pool ashift on random-I/O performance is much less important than keeping data actually available, at least for me. Additionally, for heavy random I/O one really wants to use mirrors anyway, to maximize IOPS, so the concern about raidz (which will ALWAYS offer lower random IOPS than mirrors for the same number of spindles) is probably academic.

Most home users are probably not hitting IO performance walls except when doing maintenance (backups, moving files from one dataset to another, and so on).

Large ashift almost certainly *improves* performance for I/O that is highly sequential. Most workloads are highly sequential. ZFS is awesomesauce at scheduling read-dominated sequential workloads, thanks to (among other things) dmu_zfetch, which also sequentializes spanned, striped, and even backwards I/O. Likewise, ZFS is very good at scheduling sequential writes. For highly sequential workloads, even when there are multiple users doing sequential activity, no reasonable pool layout will have more than a marginal impact on performance. That includes the choice of ashift. (It also includes L2ARC, which will only rarely see sequential traffic copied into it, and log vdevs likewise).

Pool layout can have an enormous impact on the performance of highly random I/O. Ashift is one aspect of pool layout, but so are things like numbers of top level vdevs, numbers of spindles per vdev, and so on. ZFS also has ashift-independent mechanisms for dealing with highly random I/O, namely the cache and log vdevs.

If your random-IO workload is mainly reads (which is the most common case) and fits in ARC and any L2ARC (and in ZEVO's case any UBC or application-managed caching), then you don't care about the performance impact of the pool ashift once the cache is reasonably warm. Capturing the working set in media that has no track-to-track or rotational seek penalty is vital, so that any negative impact in the layout of the pool's storage vdevs vanishes.

If it's mainly small writes with many barriers (e.g. fsync calls), then it may actually be useful to consider adding a write-optimized SSD as a log vdev and setting the dataset's "sync=always" property, to dump everything into the log for safety and to give ZFS a chance to schedule all the asynchronous writes of data already safely copied on the log vdev. (This also removes write pressure from the on-storage-vdev ZIL.) For mainly small writes with far fewer barriers, it may still be worth instrumenting performance with and without this approach; however, ZFS is already good at scheduling asynchronous writes.
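
A minimal sketch of that setup, with hypothetical pool, dataset, and device names (ZEVO's device naming may differ):

    zpool add tank log /dev/disk4s2    # dedicate a write-optimized SSD (or a slice of one) as the log vdev
    zfs set sync=always tank/mail      # treat every write as synchronous, so it all lands on the log vdev first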

If it's mainly large writes with few barriers, even random ones, then zfs's transactional COW system is especially clever about scheduling large writes to take advantage of movements among metaslabs (which will move the write heads) anyway, and the difference in pool ashift is going to be marginal.

New writes (appends) -- which is the case for some databases which manage their own COW within the database files -- require less activity than rewrites. For rewrites, setting the dataset recordsize to the most common write quantum *may* improve performance.
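
For example, for a database that rewrites fixed-size pages one might try the following; the dataset name and page size are hypothetical, and the new recordsize only affects files written after the change:

    zfs set recordsize=16K tank/db    # match the record size to the database's write quantum, e.g. a 16K page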

All of this is independent of the question of the pool ashift. If one is still IOPS-limited at this point, then thinking about the marginal improvement from ashift is (a) too late and (b) not going to matter as much as adding more top-level vdevs to the pool (more vdevs = more IOPS) and perhaps spreading the load among multiple systems. Validating the *may* parts of "may improve performance" can produce different and counterintuitive results as ashift varies, with or without varying the native block size of the underlying hardware.

There are also space implications (rather than performance ones) when varying ashift.

Raidz has a deficiency in that it has to leave empty blocks ("gaps") at the end of writes that are not a multiple of the pool ashift times the number of physical devices in the raidz vdev; those gaps are ashift-sized. Such writes are fairly common, so wasted raidz space grows as the ashift grows (and grows much more in the presence of many small files). Mirrors, raidz2 and raidz3 do not have this problem.

Raidz, raidz2 and raidz3 require (respectively) 1, 2, and 3 ashift-sized blocks of parity per subrecord (many people call this a "stripe", but because of dynamic striping in raidz[123] it's really any multiple of ashift that is equal to or less than the number of devices in the vdev minus the replication level). This only matters when records (or objects, aka "files" in the POSIX layer) are very small (e.g. less than or equal to ashift).

ZFS is far from the only file system which wastes space in the presence of small files; it's just that raidz[123] increases the waste, especially as ashift grows.

There are also some space disadvantages for *any* type of vdev as one increases ashift:

ZFS metadata is aligned in 4k chunks, which are then compressed. There is no saving from compressing a 4k block of metadata if the underlying ashift is 4k.

Compression can only squeeze things into a smaller number of ashift-sized blocks. If you write 8k of data that compresses to 6k, you save four ashift=9 blocks and zero ashift=12 blocks.

On the other hand, real 4k devices are big and tend to be cheap, so the space wastage is only surprising, not a disaster.

There is also likely a performance improvement, as advanced format drives tend to return one 4k block faster than non-advanced ones return 8 x 512b blocks.

Advanced format drives really do return far fewer block checksum errors and delay-retry non-errors when connected in the same way as non-advanced format ones.

So even though greater compression typically means a record gets into memory or onto disk faster, using the larger ashift for advanced format drives might improve the tradeoff. Also, not all data is compressible in the first place.

Finally, as one increases ashift by one, one halves the number of txgs available in the circular rollback buffer. For ashift=12, that means the "rollback history" is 32 * n, where n is typically between 5 and 30 seconds. One should never really have to rely on a 128-entry rollback history for anything, including avoiding restores from backup, but some people argue that it's "nice to have".
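
For what it's worth, the arithmetic behind those numbers, as I understand the on-disk format: each vdev label reserves 128 KiB for the uberblock ring, and each slot in the ring is the larger of 1 KiB and one ashift-sized block, so:

    128 KiB / 1 KiB = 128 uberblocks (ashift 9 or 10)
    128 KiB / 4 KiB =  32 uberblocks (ashift=12)
    128 KiB / 8 KiB =  16 uberblocks (ashift=13)

Multiply by the txg interval n to get the approximate rollback window.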

Broader discussion

Post by grahamperrin » Wed Mar 27, 2013 1:40 am

Again, big thanks!

(I have changed the subject of the opening post to reflect the broader discussion.)

