offline (re)compression?

Here you can discuss every aspect of OpenZFS on OS X. Note: not for support requests!

offline (re)compression?

Postby RJVB » Fri Jun 01, 2018 5:58 am

I was comparing equivalent build directories on my Mac (HFS+) and on Linux (ZFS), and noticed that while the transparent LZ4 compression on the latter is certainly nice, it doesn't come close to the space gain I get with offline HFS+ compression (zlib level 8) on the Mac.

The compression setting of a ZFS dataset applies to blocks as they are written; IIUC, if you change the dataset setting, existing files only change compression when they are rewritten. (What happens when you change the setting in the middle of a large file write?)

Apple's HFS compression, as applied with a utility like afsctool, is an interesting complement to transparent/online compression because it allows you to optimise disk usage at a convenient "offline" moment. If my assumption above is correct, it should be feasible to write a utility, or add a command to the `zfs` driver, that rewrites given files (or the files in given directories) with another compression setting. You'd then have the best of both worlds: transparent compression to keep disk usage down (and provide speed-ups on slow media) at negligible cost, plus targeted offline space optimisation.

In its simplest implementation this would just change the compression parameter temporarily, rewrite the selected files one way or another, and then restore the parameter. More fine-grained control of the parameter would be useful (if it doesn't already exist), so that other files being (re)written continue to use the regular dataset compression type. Alternatively, a `zfs recompress` command could be provided that rewrites all dataset files not yet stored with the current compression setting.
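As a rough shell sketch of that simplest implementation (the dataset name, compression type, helper names and the rewrite mechanism are all placeholders of mine, not an actual proposal for the `zfs` CLI):

```shell
#!/bin/sh
# Sketch only: temporarily switch the dataset's compression, rewrite the
# selected files so their blocks pass through the new setting, then
# restore the old value. All names here are placeholders.

# Rewrite a file in place; the copy is written with whatever compression
# the dataset currently has.
rewrite_file() {
    cp -p "$1" "$1.tmp~" && mv "$1.tmp~" "$1"
}

recompress() {  # usage: recompress <dataset> <compression> <file>...
    ds=$1; comp=$2; shift 2
    old=$(zfs get -H -o value compression "$ds")
    zfs set compression="$comp" "$ds"
    for f in "$@"; do
        rewrite_file "$f"
    done
    # Caveat: unrelated writes between set and restore also get $comp.
    zfs set compression="$old" "$ds"
}
```

E.g. `recompress pool/data gzip-9 big.log`. Note that restoring with `zfs set` leaves the property locally set even if it was inherited before; in that case `zfs inherit compression` would be the correct restore.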

Thoughts?
RJVB
 
Posts: 29
Joined: Tue May 23, 2017 12:32 pm

Re: offline (re)compression?

Postby lundman » Sun Jun 03, 2018 7:39 pm

I believe the following procedure should work:

Code: Select all
zfs snapshot pool/stabledata@now
zfs create -o compression=gzip-9 pool/longterm
zfs send pool/stabledata@now | zfs recv pool/longterm/stabledata


or a combination of such. I.e., set gzip-9 on a new dataset, then copy the files over with send/recv. Once we get the `zfs recv -o compression=` option, we can skip setting the inherited value on the parent.
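For reference, with a `zfs recv` that supports receive-time properties (ZoL has grown `zfs recv -o property=value`; availability and spelling on other platforms may differ), the same idea collapses to:

```shell
zfs snapshot pool/stabledata@now
zfs send pool/stabledata@now | zfs recv -o compression=gzip-9 pool/longterm/stabledata
```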
lundman
 
Posts: 530
Joined: Thu Mar 06, 2014 2:05 pm
Location: Tokyo, Japan

Re: offline (re)compression?

Postby RJVB » Sun Jun 03, 2018 11:48 pm

Yes, that should probably work, with the caveat that you recompress an entire dataset. From the looks of it, your proposal is the send/recv equivalent of

```
> zfs create -o compression=gzip-9 pool/longterm
> rsync -aAXH <stabledata_mp>/. <longterm_mp>
```

That's not in-place compression, and it assumes you have enough free space to hold an additional copy of the entire dataset.
RJVB
 
Posts: 29
Joined: Tue May 23, 2017 12:32 pm

Re: offline (re)compression?

Postby RJVB » Mon Aug 06, 2018 3:12 am

RJVB
 
Posts: 29
Joined: Tue May 23, 2017 12:32 pm

Re: offline (re)compression?

Postby RJVB » Wed Jan 02, 2019 9:25 am

My previous post notwithstanding:

I now have a working ZFS version of my afsctool fork on GitHub (RJVB/afsctool): zfsctool (built but not installed automatically). I've removed a number of HFS-specific and less useful options (filetype filtering, for instance) but have retained the multithreaded implementation. It's not fast once you select a compression that is significantly more space-efficient than lz4, but it does the trick (and because it's not fast, it doesn't burn a lot of CPU either).

It's beta quality, so feedback is welcome; below is a little introduction:

Given that libzfs doesn't currently provide an API to query or set dataset properties (or do a pool sync), I invoke the `zfs` (or `zpool`) command (through a tailor-made popen implementation) and parse its output where required. The upside is that there are no library dependencies on ZFS whatsoever.
Unlike with afsctool, it is not feasible to restore a file after a failed first rewrite; even the optional backup file will evidently have the new compression. A second rewrite attempt is still made, however, and I have kept the verification step after the rewrite (disable it with the -n option).

Since it is not currently possible to set the compression property on a per-file basis, I simply set the property at the dataset level, read the specified files, and write them back to disk. By default, property setting is done in JIT fashion: switch to the requested compression just before it's needed, and reset it when no other threads are rewriting files. This reduces the likelihood that unrelated file writes also use the new compression, but since there is a lot of overhead to this approach, it is possible to do the reset only at the end (-q option).

The same goes for gathering (re)compression gain statistics: this requires syncing, and syncing is done only in verbose mode (-v option, but see below). NB: syncing only has the intended effect reliably on ZoL, which has `zpool sync`.

Performance stats are always gathered and have negligible cost; they can be printed by activating a crippled verbose mode via the VERBOSE environment variable (1 corresponds to -v, 2 to -vv, etc.). I'm not certain how meaningful they are (since I use child processes), but they might be useful in benchmarking.

The desired compression can be any of the types supported by zfs and is specified with the -T option. By default, zfsctool will not recompress to the dataset's current compression type unless the -F option is given; the `off` compression is an exception and is always accepted. zfsctool will also verify that there is enough free space (the estimated uncompressed size in disk blocks) to write the file; this requires syncing the pool after each rewrite, so uncompressing always has considerable overhead. If the dataset does fill up, it is blocked from further rewrites to limit data loss.
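The free-space check can be sketched like this (a hypothetical reconstruction of mine; the helper names and the 512-byte block rounding are assumptions, not zfsctool's actual code):

```shell
#!/bin/sh
# Sketch of a pre-rewrite free-space check (hypothetical; helper names
# and the 512-byte block rounding are assumptions, not zfsctool's code).

# Apparent (uncompressed) size of a file, rounded up to 512-byte blocks.
apparent_blocks() {
    size=$(wc -c < "$1")
    echo $(( (size + 511) / 512 ))
}

# Free space of the dataset in 512-byte blocks; `zfs get -Hp` prints a
# raw byte count suitable for scripting.
free_blocks() {
    avail=$(zfs get -Hp -o value available "$1")
    echo $(( avail / 512 ))
}

# Only rewrite when the estimated uncompressed size fits.
can_rewrite() {  # $1 = file, $2 = dataset
    [ "$(apparent_blocks "$1")" -le "$(free_blocks "$2")" ]
}
```

The `-p` flag makes `zfs get` print exact (parseable) values rather than human-readable ones, which is what makes this kind of scripting reliable.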
NB: zfsctool will happily redo the same compression time and time again, because there is no way to know what compression a file was last written with (unless we store it in a dedicated extended attribute?).
There is an additional compression type: `test`. With this type no actual changes are made (but the simulated before/after comparison is still done); on-disk sizes are thus valid, and the statistics print-out represents the actual current size data. (This can be eye-opening, because it shows how typical short files like scripts, plists, etc. are often (much) larger on disk than their size in bytes.)
RJVB
 
Posts: 29
Joined: Tue May 23, 2017 12:32 pm

Re: offline (re)compression?

Postby RJVB » Sat Jan 05, 2019 2:25 am

RJVB wrote:NB: zfsctool will happily redo the same compression time and time again, because there is no way to know what compression a file was last written with (unless we store it in a dedicated extended attribute?).


No longer: the (re)compression type and a timestamp (the file's mod. time) are now stored in a dedicated xattr. It's not entirely foolproof, but it should be reliable enough.
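On Linux, storing such a marker could look roughly like this (the attribute name and value layout here are my inventions; zfsctool's actual scheme may differ, and macOS would use `xattr` instead of the attr tools):

```shell
#!/bin/sh
# Hypothetical sketch: record the compression type and the file's mtime
# in a user xattr, so a later run can skip files that are already done.
# The attribute name and format are assumptions, not zfsctool's actual ones.

mark_recompressed() {  # $1 = file, $2 = compression type
    mtime=$(stat -c %Y "$1")   # GNU stat; use `stat -f %m` on macOS/BSD
    setfattr -n user.recompress -v "$2@$mtime" "$1"
}

needs_recompress() {  # $1 = file, $2 = desired compression type
    cur="$(getfattr --only-values -n user.recompress "$1" 2>/dev/null)"
    [ "$cur" != "$2@$(stat -c %Y "$1")" ]
}
```

Including the mtime in the value is what makes the marker "reliable enough": if the file is modified after being marked, the stored timestamp no longer matches and the file becomes a recompression candidate again.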
RJVB
 
Posts: 29
Joined: Tue May 23, 2017 12:32 pm

