by raattgift » Tue Mar 26, 2013 4:31 pm
Note that these comments are generic with respect to zfs. There is practically nothing ZEVO-specific below.
I would not use a raidz with more than three large disks, advanced format or no, for any workload. Raidz2 and raidz3 do not have the mind-the-gap problem of raidz (while a raidz is resilvering there is no redundancy left) and will help you avoid being bitten by "needing to resilver in the middle of a resilver" types of problems, which will fault a raidz. Resilvers are lonnnnnng for big full disks. The impact of pool ashift size on random-IO performance is much less important than keeping data actually available, at least for me. Additionally, for heavy random-IO one really wants to use mirrors anyway, to maximize IOPS, so the concern about raidz (which will ALWAYS offer lower random IOPS than mirrors for the same number of spindles) is probably academic.
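To make the mirrors-for-IOPS point concrete, a minimal sketch (pool and device names here are hypothetical; substitute your platform's device identifiers):

    # Two 2-way mirror vdevs: zfs stripes across the top-level vdevs, and a
    # random read can be satisfied by either side of a mirror, so four
    # spindles give roughly four spindles' worth of read IOPS.
    zpool create tank mirror disk0 disk1 mirror disk2 disk3

    # Contrast: a single raidz vdev over the same four disks behaves more
    # like one spindle for random IOPS, since every record touches every disk.
    zpool create tank raidz disk0 disk1 disk2 disk3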
Most home users are probably not hitting IO performance walls except when doing maintenance (backups, moving files from one dataset to another, and so on).
Large ashift almost certainly *improves* performance for I/O that is highly sequential, and most workloads are highly sequential. ZFS is awesomesauce at scheduling read-dominated sequential workloads, thanks to (among other things) dmu_zfetch, which also sequentializes spanned, striped, and even backwards I/O. Likewise, ZFS is very good at scheduling sequential writes. For highly sequential workloads, even when there are multiple users doing sequential activity, no reasonable pool layout will have more than a marginal impact on performance. That includes the choice of ashift. (It also includes L2ARC, which will only rarely see sequential traffic copied into it, and likewise log vdevs.)
Pool layout can have an enormous impact on the performance of highly random I/O. Ashift is one aspect of pool layout, but so are things like numbers of top level vdevs, numbers of spindles per vdev, and so on. ZFS also has ashift-independent mechanisms for dealing with highly random I/O, namely the cache and log vdevs.
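For example, adding a cache vdev is a one-liner (device name hypothetical; log vdevs are sketched a bit further down):

    # Add an SSD as a cache vdev (L2ARC): random reads that miss ARC can be
    # served from flash instead of seeking on the storage vdevs.
    zpool add tank cache ssd0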
If your random-IO workload is mainly reads (which is the most common case) and fits in ARC and any L2ARC (and, in ZEVO's case, any UBC or application-managed caching), then you don't care about the performance impact of the pool ashift once the cache is reasonably warm. Capturing the working set in media that has no track-to-track or rotational seek penalty is vital; once it is captured, any negative impact from the layout of the pool's storage vdevs vanishes.
If it's mainly small writes with many barriers (e.g. fsync calls), then it may actually be useful to consider a write-optimized SSD as a log vdev and to set the dataset's "sync=always" property: everything gets dumped into the log vdev for safety, and zfs gets the chance to schedule the asynchronous writes of data already safely copied onto the log vdev. (This will remove write pressure from the on-storage-vdev ZIL.) For mainly small writes with far fewer barriers, it may still be worth instrumenting performance with and without this approach; however, zfs is good at scheduling asynchronous writes.
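A sketch of that arrangement, with hypothetical device and dataset names:

    # Dedicate a (preferably mirrored) pair of write-optimized SSDs as a log vdev.
    zpool add tank log mirror slog0 slog1

    # Treat every write to this dataset as synchronous, so it is committed to
    # the fast log vdev first; zfs then flushes the data to the storage vdevs
    # asynchronously, on its own schedule.
    zfs set sync=always tank/db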
If it's mainly large writes with few barriers, even random ones, then zfs's transactional COW system is especially clever about scheduling large writes to take advantage of movements among metaslabs (which will move the write heads) anyway, and the difference in pool ashift is going to be marginal.
New writes (appends) -- the pattern for some databases that manage their own COW within their database files -- require less activity than rewrites. For rewrites, setting the dataset recordsize to the most common write quantum *may* improve performance.
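For instance, for a hypothetical database that rewrites in 16k pages:

    # Match recordsize to the write quantum before loading the data;
    # the property only affects files created after it is set.
    zfs set recordsize=16K tank/db
    zfs get recordsize tank/db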
All of this is independent of the question of the pool ashift. If one is still IOPS-limited at this point, then thinking about the marginal improvement from ashift is (a) too late and (b) not going to matter as much as adding more top-level vdevs to the pool (more vdevs = more IOPS) and perhaps spreading the load among multiple systems. Validating the *may* parts of "may improve performance" can produce different and counterintuitive results as ashift varies, with or without varying the native block size of the underlying hardware.
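Widening the pool is straightforward (hypothetical devices again):

    # Add another top-level mirror vdev; new writes are spread across all
    # top-level vdevs, so IOPS scale with the number of vdevs.
    zpool add tank mirror disk4 disk5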
There are also space implications (rather than performance ones) when varying ashift.
raidz has a deficiency in that it has to leave empty blocks ("gaps") at the end of writes that are not a multiple of the pool ashift times the number of physical devices in the raidz vdev. Those gaps are ashift-sized. This situation is fairly common, so wasted raidz space grows as the ashift grows (and grows much more in the presence of mannnnny small files). Mirrors, raidz2 and raidz3 do not have this problem.
raidz, raidz2 and raidz3 all require (respectively) 1, 2, and 3 x ashift blocks of parity per subrecord (many people call this a "stripe", but because of dynamic striping in raidz[123], it's really any multiple of ashift that is equal to or less than the number of devices in the vdev minus the replication level). This only matters when records (or objects, aka "files" in the posix layer) are very small (e.g., less than or equal to ashift). For example, a 2k file on an ashift=12 raidz2 consumes one 4k data block plus two 4k parity blocks: 12k of raw space for 2k of data.
zfs is far from the only file system which wastes space in the presence of small files; it's just that raidz[123] increases the waste, especially as ashift grows.
There are also some space disadvantages for *any* type of vdev as one increases ashift:
zfs metadata is aligned in 4k chunks, which are then compressed. There is no saving in compressing a 4k block of metadata if the underlying ashift is 4k.
Compression can only squeeze things into a smaller number of ashift blocks. If you write 8k of data that compresses to 6k, you save 4 ashift=9 blocks (16 x 512b shrinks to 12 x 512b) and 0 ashift=12 blocks (6k still rounds up to 2 x 4k).
On the other hand, real 4k devices are big and tend to be cheap, so the space wastage is only surprising, not a disaster.
There is also likely a performance improvement, as advanced format drives tend to return a 4k block faster than non-advanced ones return 8 x 512b blocks.
Advanced format drives really do return far fewer block checksum errors and delay-retry non-errors than non-advanced format drives connected the same way.
So even though greater compression typically means a record gets into memory or onto disk faster, using the larger ashift for advanced format drives might improve the tradeoff. Also, not all data is compressible in the first place.
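If you want to see what compression is actually buying you on a given dataset, something like this will tell you (dataset name hypothetical):

    # Enable the default compression algorithm and check the achieved ratio;
    # only data written after the property is set gets compressed.
    zfs set compression=on tank/data
    zfs get compressratio tank/data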
Finally, as one increases ashift by one, one halves the number of txgs available in the circular rollback buffer. For ashift=12, that means the "rollback history" is 32 * n, where n is typically between 5 and 30 seconds per txg -- roughly three to sixteen minutes. One should never really have to rely on a 128-entry rollback history for anything, including staving off restores-from-backup, but some people argue that it's "nice to have".