rattlehead wrote: For the 6TB of data, the recommended minimum RAM was like 24GB or so, counting only the memory dedicated to ZFS. I do not have that much, so performance would be worse; I knew that. But the host system became practically useless: GUI feedback for moving the mouse took seconds, etc.
Wow, I knew it recommended quite a bit of RAM, but I didn't expect it to be that bad. This is why I was asking about caching, as I'm also only on 16GB like you, but I haven't been able to find any good articles on whether a cache disk/partition/file can be configured to offload the de-duplication table onto. If it's something I can enable later then that's good to know; I'll just leave it until I can find out more, or run a test once I've thinned down my cable forest enough to hook up a spare disk for testing.
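For reference, from the zpool and zfs man pages it looks like a cache (L2ARC) device can be added to an existing pool at any point, and there's a property to limit it to metadata only (which, as I understand it, is where the dedup table lives). I haven't tried this, and the pool and device names below are just placeholders:

    # add a spare SSD or partition to the pool as an L2ARC cache device
    zpool add tank cache /dev/disk2s2

    # optionally restrict that cache to metadata, which should include the dedup table
    zfs set secondarycache=metadata tank

Whether that actually keeps the host responsive with only 16GB of RAM is exactly what I'd want to test on that spare disk first.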
I did find reference to a feature that can improve deduplication performance, but it seems OpenZFS doesn't support it: the ability to specify a less demanding checksum as part of the algorithm,verify notation. It looks like OpenZFS accepts the notation, but only supports SHA-256 as the checksum at the moment. Shame, as several examples using dedup=fletcher4,verify seem to show much lower memory requirements due to the smaller checksum size, and it remains safe because collisions are verified byte-for-byte before the blocks are linked.
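For anyone following along, the two forms I mean look like this (the dataset name is made up; the fletcher4 variant is the one OpenZFS apparently rejects):

    # accepted by OpenZFS: SHA-256 checksums, with matches verified before blocks are linked
    zfs set dedup=sha256,verify tank/backups

    # the lighter-weight form from the Solaris docs, which OpenZFS doesn't seem to accept
    zfs set dedup=fletcher4,verify tank/backups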
rattlehead wrote: compression for me also made things worse.
That's surprising; which compression algorithm were you using? I wouldn't have thought that ZFS compression could result in larger sizes at all, as surely that would break the block size? Most compression algorithms I know of only store the compressed data if it's smaller than the original; otherwise the original is written as-is. For example, I had to port the LZMA algorithm a while back, and it structures data into discrete units with a tiny header (around 10 bytes) at the start that describes the mode of compression (if any) and the size of the uncompressed data (which, in the case of ZFS, should always be the same, e.g. 4KB), so at most an incompressible piece of data would only add a few bytes of overhead for this header. This overhead might add up eventually, but at roughly 10 bytes per 4KB block it works out to a couple of gigabytes per terabyte at most.
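It's easy enough to sanity-check that overhead claim outside of ZFS, if you have an LZMA tool such as xz installed (via MacPorts or similar); the filenames here are arbitrary:

    # 10MB of incompressible (random) data and 10MB of trivially compressible data
    dd if=/dev/urandom of=random.bin bs=1m count=10
    dd if=/dev/zero of=zeroes.bin bs=1m count=10

    # compress both, keeping the originals, and compare sizes
    xz -k random.bin zeroes.bin
    ls -l random.bin random.bin.xz zeroes.bin zeroes.bin.xz

If LZMA behaves the way I remember, random.bin.xz should come out only marginally larger than random.bin, while zeroes.bin.xz collapses to almost nothing.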
Of course you're right that not all data will benefit, and I'm currently undecided, as de-duplication is the kind of "compression" that would likely benefit me the most: I'll just be using OS X's Time Machine for backup (onto an unencrypted disk image), and as you may know Time Machine only operates at the file level, so it often copies a whole file even if only a few bytes have changed. That seems like the kind of thing de-duplication would handle well (since the unchanged blocks would be linked together), assuming I can avoid or diminish the huge RAM requirements somehow. But yeah, I'm not sure what portion of my backups would benefit from compression; I might have a hunt around and see if there are any compression tools that could do a "dry run" on my data and tell me how much space I could theoretically save.
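In the meantime, the standard tools can already do a crude dry run: stream a representative folder through a compressor and compare the byte count against what's on disk. It won't match ZFS's per-block behaviour exactly, and the path here is just an example, but it should give a rough ratio:

    # size on disk of a representative sample
    du -sh ~/Backups/sample

    # bytes produced by streaming the same data through gzip, without writing anything out
    tar -cf - ~/Backups/sample | gzip -c | wc -c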