WRT (a)-(d) in your IRC log
http://echelog.com/logs/browse/illumos/1350424800+
(e) horrible layering, putting something that migrates blocks around *under* ZFS
As you know, this can be done now with Core Storage: because CS LVs end up being assigned /dev/diskNN device nodes, and because ZEVO is not bright enough to look deeper, it will just treat an LV like any other block device and put a GPT label on it when you use it in a command like zpool create.
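For instance (pool name and disk number invented for illustration; the LV just shows up as an ordinary disk device):

  # /dev/disk14 is really a Core Storage LV, but ZEVO neither knows nor cares
  zpool create tank /dev/disk14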
As you also know, this confuses Core Storage such that it's hard to modify the LV in any way. You may not know that you have to write several blocks of zeros into the start of the LV's /dev/diskNN device and then newfs_hfs it before doing a CS operation like deleteVolume, resizeVolume (which is "unsupported"), or resizeStack (which is even more "unsupported", but hugely useful when layering mutually-ignorant disk block management systems).
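A sketch of that dance, with invented device and size (the zeroing just has to clobber the labels ZFS left at the front of the LV):

  dd if=/dev/zero of=/dev/disk14 bs=128k count=100   # scribble over the start of the LV
  newfs_hfs /dev/disk14                              # give CS something it recognises
  diskutil cs resizeVolume <lvUUID> 200g             # now the CS verb will cooperate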
If Apple decides to support arbitrary GPT labels in LVs, you could just grow an LV and ZFS will Do The Right Thing if the pool's autoexpand property is "on", or otherwise when "zpool online -e" is run against the grown device afterwards.
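A minimal sketch of that happy path, assuming ZEVO's zpool honours autoexpand and online -e the way upstream does (names are placeholders):

  zpool set autoexpand=on pool
  # grow the LV underneath, then, if the pool does not expand on its own:
  zpool online -e pool GPTE_xxxxxxxx   # make ZFS re-read the enlarged device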
I have (in household "production") a pool which is constructed like this.
I am not entirely happy that the boot volume is not mirrored. I may do something about that at some point, as I have done with my favourite household workstation. (That mirroring uses SoftRAID's excellent software, but SoftRAID does not do data integrity or the highly flexible online dataset/mountpoint management that zfs does.)
LVG 1
  PVs:
    disk0 (ssd, 250 GB)
    disk2 (rotating rust, 1.5 TB)
  LVs:
    LV A: JHFS+ (boot volume, 400 GB)
    LV B: GPT label, ZFS in partition 2 (128.8 GB)
    LV C: JHFS+ (mostly empty, 600 GB)
Crucially, this is 10.8.3 on a platform that supports Fusion Drives (a late 2012 Mac Mini stuffed with memory).
It therefore reports:
localhost kernel[0]: thr 0xffffff802d41caa0 Composite Disk alg="bloomclock" unit_nbytes=131072
making this a not-totally-silly LVG. Not all kernels support Fusion Drive: most older hardware won't boot kernels built for newer hardware, most newer hardware won't boot kernels built for older hardware, and the kernel itself has a key that unlocks the CPDK MigrCS algorithms. This is mostly, but not only, market-segmentation driven; there is checksummed metadata in CPDKs that will eat older processors for breakfast. :/
LVG 2
  PV:
    disk1 (ssd, 256 GB)
  LVs:
    LV Z: GPT label, ZFS in partition 2 (128.8 GB)
    LV Y: GPT label, ZFS in partition 2 (200 MB), a SLOG for a rotating rust pool
    LV X: similar to LV Y
    LV W: similar to LV Y
    LV V: similar to LV Y
    LV U: 64 GB (a cache vdev, definitely oversized, due for shrinking to ca. 10 GB)
    LV T: 50 GB (a cache vdev, unused, because the IOKit volume UUID does not persist properly and so it is always UNAVAILABLE at boot; it is also oversized, due for deletion)
I have a pool with one storage vdev; it is mainly used for my $HOME. It looks like this:
  mirror-0
    GPTE_ ... at diskLVBs2
    GPTE_ ... at diskLVZs2
  logs
    GPTE_ ... at diskLVWs2
  cache
    GPTE_ ... at diskAnotherSSD
I will probably ditch the L2ARC or give it a small slice. The mere existence of big L2ARC cache vdevs can lead to having too little space in ARC (every block cached in L2ARC needs a header kept in ARC), which devastates performance. Additionally, I've seen zero read slowdowns on the pool; the occasional scrub and the COW nature, coupled with the use patterns, mean pretty much all reads on that side of the mirror are serviced from the ssd PV. The working set of the boot volume also largely fits in the ssd PV, so the rotating rust PV tends to be in power-saving mode almost always. If a block on the composite-disk side of the mirror is on the rotating rust, then the read is usually serviced from the other side of the mirror first (this is the way vdev_mirror.c works), and the 128k chunk the read was waiting on will in due course be migrated onto the ssd PV. It's neato.
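To put a very rough number on the L2ARC-eats-ARC point (the per-block header size varies by ZFS vintage; I am assuming something like 200 bytes here): a 64 GB cache vdev full of 128 KiB blocks is about half a million headers, roughly 100 MB of ARC, which is tolerable; full of 8 KiB blocks it is over eight million headers, around 1.6 GB of ARC, which is a painful amount of memory to spend on bookkeeping.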
I am undecided about the SLOG for this pool.
Operationally, I have grown the mirror vdevs, and will do so again after changing LVU and LVT to free up space on the non-composite LVG.
It'll involve:
1. Verify that I have a bootable backup and usable backups for all data.
2. zpool offline pool LVU; zpool remove pool LVU
3. dd if=/dev/zero bs=128k count=100 of=/dev/diskLVU
4. same as 3 for LVT
5. newfs_hfs /dev/diskLVU; newfs_hfs /dev/diskLVT
6. diskutil cs deleteVolume LVT
7. diskutil cs resizeVolume LVU 10g
8. zpool add pool cache /dev/diskLVU
9. zpool detach pool LVZ
10. dd if=/dev/zero ... of=LVZ; newfs_hfs LVZ
11. diskutil cs resizeVolume LVZ 228g
12. zpool attach pool LVB LVZ; zpool set autoexpand=on pool
13. zpool detach pool LVB
14. dd if=/dev/zero ... of=LVB; newfs_hfs LVB
15. diskutil cs resizeVolume LVB 228g
16. zpool attach pool LVZ LVB
17. zpool list pool
I could of course make a third side to the mirror, but for this small an amount of data step (1) is enough insurance: restoring, if something goes wrong and makes both the pool and the boot volume unusable, won't be a 72-hour process.
If LVG 2 were also a composite disk, I could just create a pair of new, larger LVs, one on each LVG. I could then zpool replace pool LVB LVa, wait for the resilver, then zpool replace pool LVZ LVz, and see the pool expand when that finishes. That would lead to a lot of block migrations, though, so it would be a utility rather than a performance win. Resizing the existing LVs does not disturb the bloom filter / clock scores much, and won't touch many "cold" blocks that are on the spinning rust PV.
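That hypothetical path would look something like this (LV names as above; whether ZEVO wants the /dev/diskNN nodes or the GPTE_ UUIDs on the command line is an open question):

  zpool set autoexpand=on pool
  zpool replace pool LVB LVa    # first resilver; wait for it to finish
  zpool replace pool LVZ LVz    # second resilver; the pool grows when this one completes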
Also, if the pool on 2 large composite CPDKs were sufficiently large, there would inevitably be times when both sides of the mirror vdev would have to wait on the slow spinning rust devices.
One feature requiring block pointer rewrite is shrinking pools. With two CS CPDKs, I could create a new smaller pool with new smaller LVs on each side of the mirror. zfs send/zfs receive to migrate the data. Destroy the old pool. Destroy the old LVs. All online, with hot blocks migrating to the SSD in due course, no messy wiring or plugging, and no new hardware. BPRW would avoid the short unavailability of data needed to adjust mountpoints.
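A hedged sketch of that shuffle, assuming ZEVO's send/receive behaves like upstream's (pool and snapshot names invented):

  zpool create newpool mirror /dev/diskNEW1 /dev/diskNEW2
  zfs snapshot -r pool@migrate
  zfs send -R pool@migrate | zfs receive -d -F newpool
  # adjust mountpoints during the short unavailability window, then:
  zpool destroy pool
  # and finally delete the old LVs with diskutil cs deleteVolume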
Another feature requiring BPRW is changing the replication level. With three CS CPDKs I could create a new pool with three LVs for a raidz1. Same zfs send/receive migration required, same fiddling with mountpoints. Going from raidz1 to a mirror or mirrors also follows the same approach: build a new pool with new LVs in the existing CPDK LVGs (for 2 mirrors, add a fourth CS LVG), migrate the data accordingly. All the same disadvantages compared to real online BPRW.
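The raidz1 variant would differ only in the first step (again with invented device names, one new LV per CPDK):

  zpool create newpool raidz1 /dev/diskLV1 /dev/diskLV2 /dev/diskLV3

followed by the same send/receive migration as above.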
One big gotcha of course is that "Fusion Drive" block migration only happens when an appropriate SSD is one of the PVs and any other PV is a non-SSD. This means each CPDK needs an SSD. Also, you will get unpredictable performance fall-offs if ZFS data gets migrated to the slower disk. Working set management could be a little hairy.
However, if the working set fits comfortably within the set of SSDs in the "Fusion" LVGs, then zfs fragmentation wouldn't be relevant, as seek times vanish. Likewise, the BigMigrCS (and to an extent the MigrCS) has the side effect of improving on-disk locality of reference for blocks accessed or created at about the same time. This effectively "defragments" zfs data as it becomes old enough to be moved off the SSD PV. (JHFS+ LVs gain from this too).
As in Ahrens's explanation, there is a giant table of mappings between LV block addresses and LVG block addresses, maintained by CS. ZFS is unaware of it; there is at present no way to expose the mapping to ZFS, and nothing in ZFS that would take advantage of it anyway.
Unfortunately, the extra physical devices and software layers tend to increase vdev fragility (while adding cost), and may actually worsen performance. Additionally, this arrangement is unlikely to be supported by anyone.
However, it is a step towards (e), and can be done in 10.8.3 on many Macs.
( One could also do (e) with, for example, iSCSI or XSan. A number of university computer centres have done precisely that over the years, although there are critical gotchas with that approach as well. An example:
http://utcc.utoronto.ca/~cks/space/blog ... rverDesign )