by RJVB » Wed Jan 02, 2019 9:25 am
My previous post notwithstanding:
I now have a working ZFS version of my afsctool fork on GitHub: RJVB/afsctool : zfsctool (built but not installed automatically). I've removed a number of HFS-specific and less useful options (filetype filtering, for instance) but have retained the multithreaded implementation. It is not fast once you select a compression that is significantly more space-efficient than lz4, but it does the trick (and because it is not fast it doesn't burn a lot of CPU either).
Beta quality, so feedback is welcome; below is a short introduction:
Given that libzfs doesn't currently provide an API to query or set dataset properties (or do a pool sync), I invoke the zfs (or zpool) driver application (through a tailor-made popen implementation) and parse its output where required. The upside is that there are no library dependencies on ZFS whatsoever.
Unlike with afsctool, it is not feasible to restore a file after a failed first rewrite; even the optional backup file will evidently have the new compression. A second rewrite attempt is still made, however, and I have kept the verification step after the rewrite (disable it with the -n option).
Since it is not currently possible to set the compression property on a per-file basis, I simply set the property at the dataset level, read the specified files, and write them back to disk. By default the property is set in JIT fashion: switch to the requested compression just before it is needed and reset it when no other threads are rewriting files. This reduces the likelihood that unrelated file writes also use the new compression, but since this approach carries a lot of overhead, it is possible to defer the reset to the end of the run instead (-q option).
The same goes for gathering (re)compression gain statistics: this requires syncing, and syncing is done only in verbose mode (-v option, but see below). NB: syncing only has the intended effect reliably with ZoL, which has `zpool sync`.
Performance stats are always gathered and have negligible cost; they can be printed by activating a crippled verbose mode via the VERBOSE env. variable (1 corresponds to -v, 2 to -vv, etc.). I'm not certain how meaningful they are (since I use child processes) but they might be useful in benchmarking.
The desired compression can be any of the ones supported by zfs and is specified with the -T option. By default, zfsctool will not recompress to the dataset's current compression type unless the -F option is given; `off` is an exception and is always accepted. zfsctool will also verify whether there is enough free space (the estimated uncompressed size in disk blocks) to write the file; this requires syncing the pool after each rewrite, so uncompressing always has considerable overhead. If the dataset does fill up, it is blocked from further rewrites to limit data loss.
NB: zfsctool will happily redo the same compression time and time again, because there is no way to know which compression a file was last written with (unless we store it in a dedicated extended attribute?).
There is an additional compression type: `test`. With this type no actual changes are made (but the -simulated- before/after comparison is still performed); on-disk sizes thus remain valid, and the statistics print-out represents the actual current size data. (This can be eye-opening because it shows how typical short files like scripts, plists, etc. are often (much) larger on disk than their size in bytes.)