tl;dr: if you think you need dedup now, try to wait it out.
Brendon wrote:"Friends don't let friends Dedup"
I largely agree with that (and have typed it myself from time to time, too).
I have five main uses for dedup (on o3x):
1. zfs send/recv backups from multiple hosts where there is enormous overlap of data
(for example, managed software trees for things like MacPorts or Homebrew, or ~/Library/Mail),
where I'm using ZFS DAS where one might equally use network storage.
2. big blobs of data whose structure is not visible to Mac OS X (e.g. storage for VMs)
3. inconveniently different layouts of largely similar data
4. squeezing the maximum space out of extremely high random-IOPS solid-state devices.
5. "oops"es where rolling backups have become decoupled such that the sending side
and receiving side lack a common snapshot. Rarer since bookmarks, but there's still a history of that committed on long term storage.
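For reference, the bookmark workflow that makes case 5 rarer looks roughly like this (dataset and host names are made up):

    # keep a bookmark of the last snapshot that was sent; the snapshot
    # itself can then be destroyed on the sending side
    zfs bookmark tank/data@sent-1 tank/data#sent-1
    zfs destroy tank/data@sent-1
    # later, an incremental can still be generated from the bookmark
    zfs send -i tank/data#sent-1 tank/data@sent-2 | \
        ssh backuphost zfs recv backup/data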
The fifth is no longer relevant, thanks to the illumos 6393 change (which is now in test branches). Nopwrite with the new fast checksums (also in test branches; edonr is amazing, and skein is really good too) fixes this problem at least as space-efficiently as deduplication, and without all the extra pain of randomly accessing and updating deduplication tables.
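For reference, the knobs involved, as I understand the write policy: nopwrite needs a nopwrite-capable checksum, compression on, and dedup off (dataset name made up):

    # a strong checksum (sha256, or edonr/skein on test branches)
    # is required for nopwrite to trust a match
    zfs set checksum=edonr tank/backups
    # nopwrite only engages when compression is on and dedup is off
    zfs set compression=lz4 tank/backups
    zfs set dedup=off tank/backups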
The second is a tradeoff. On spinny storage, just trust nopwrite, and give it a hand with a 6393 + zfs promote tree-folding as needed (sketched below). On non-spinny storage, dedup will work better with some data (and the systems that write it), and nopwrite will work better with others. But dedup on solid-state storage is generally less painful.
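A minimal sketch of that tree-folding, assuming the receive -o origin= option that came in with 6393 (names invented):

    # receive a full stream as a clone of an existing snapshot (6393);
    # blocks identical to the origin get elided by nopwrite
    zfs send tank/vm@today | zfs recv -o origin=tank/vm-old@base tank/vm-new
    # fold the tree so the new dataset becomes the parent
    zfs promote tank/vm-new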
The third case is the one where I'll still use dedup: the pain (and it *is* pain, even on my SSD mirror that gets a million 4k IOPS) of deduplication is outweighed by the space constraint.
The first case goes to a slow dedicated pool anyway, but if I were to start over now, I'd add more mirror vdevs to the pool rather than do deduplication. Disk space is much, much cheaper than IOPS.
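That is, instead of turning on dedup, grow the pool (pool and device names hypothetical):

    # add another mirror vdev; capacity and IOPS both go up,
    # with none of dedup's memory use and write amplification
    zpool add tank mirror /dev/disk4 /dev/disk5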
Finally, persistent l2arc helps a bunch. The whole DDT ends up on L2 eventually, and stays there; when it's there at pool import time, it's a massive advantage in processing the first backups (and snapshot deletions and so on). However, the L2 headers use ARC memory of their own, and persistence doesn't do away with all the extra memory needed to update the DDT. A busily changing DDT might not propagate via l2arc_feed_thread to the L2 device right away. And worst of all, import on a pool with a DDT can take a long time (for instance, if there is an async_destroy in progress), and the L2 won't be useful for shortening that time.
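You can see how big the table you're asking L2 to hold actually is; zpool status -D prints a one-line summary and zdb -DD a full histogram (pool name assumed):

    # rough in-core/on-disk size of the dedup table
    zpool status -D tank
    # detailed DDT histogram
    zdb -DD tank
    # and the cache device that (persistent) l2arc feeds
    zpool add tank cache /dev/disk6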
Early next year, OpenZFS will see some things integrated that will help, but in the meantime space will get cheaper, and the cost of backing out of deduplication won't.
Just as an example: five days ago I destroyed several deduplicated datasets that had accumulated over time on a spinny-disk pool. Even with persistent l2arc and a big ARC with a big arc_meta_limit, the freeing property is presently 219G on that pool. FIVE DAYS. Why? Because it's mainly 8k-recordsize data, and even when the whole DDT is in ARC, the number of *writes* needed to update the on-disk structures remains very high.
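You can watch that backlog drain with the freeing property (pool name assumed):

    # space still to be reclaimed by the background async_destroy
    zpool get freeing tank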