OpenZFS on OS X

by **haer22** » Tue Nov 17, 2015 11:00 am

Before I enabled dedup, my pool (ca 12 TB) took ca 12 hrs to scrub (ca 470 MB/s). Now it takes a looong time, usually 3-4 days.

It seems that scrubbing dedups all blocks before it does its thing. Is that correct? If so, why???

If the dedupped block checked out fine the first time it was read, it ought to be correct the next times the same block is read.

by **rottegift** » Mon Nov 23, 2015 5:44 pm

Not quite.

See the comment block above dsl_scan_ddt(). Quoting a part of that:

" * To prevent excess scrubbing, the scrub begins by walking the DDT
* to find all blocks with refcnt > 1, and scrubs each of these once.
* Since there are two replication classes which contain blocks with
* refcnt > 1, we scrub the highest replication class (DDT_CLASS_DITTO) first.
* Finally the top-down scrub begins, only visiting blocks with refcnt == 1."

The rest of that comment block deals with your other points.

In general, the time it takes to do the DDT walking phases of a scan (that's a scrub, or *importantly* a full resilver) is proportional to the slowest track-to-track seek time in your pool (if you have spinny disks, that's probably around 10 ms) times the number of DDT entries reported in zpool status -vD poolname. If you are using a single raidz, raidz2 or raidz3 vdev, these phases will take even longer, because scrub-reading has to be synchronized among all the devices in each vdev.
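A back-of-the-envelope sketch of that proportionality, in Python. The function name, the 10 ms seek time and the entry count are illustrative assumptions, not values read from any real pool; the real entry count would come from zpool status -vD poolname. It ignores the extra raidz synchronization cost mentioned above, so treat it as a lower bound.

```python
def ddt_walk_seconds(ddt_entries: int, seek_time_s: float = 0.010) -> float:
    """Rough lower bound for the DDT phases of a scan.

    Assumes one worst-case seek per DDT entry (an assumption for
    illustration, not a measured figure).
    """
    return ddt_entries * seek_time_s


# Hypothetical pool: 20 million DDT entries on disks with ~10 ms seeks.
seconds = ddt_walk_seconds(20_000_000)
print(f"{seconds / 3600:.1f} hours spent walking the DDT")
```

With those made-up numbers the DDT phases alone come to roughly 55 hours, which is the same order of magnitude as a scrub stretching from half a day to several days.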

If you have a large number and proportion of refcnt=1 entries in the DDT, you may want to rethink using dedup at all. On non-solid-state disks, the cost of dedup in future random I/O is huge compared to the space savings, unless you are consciously doing something that deliberately leverages the way ZFS does deduplication (in which case there will be a huge difference between the allocated and referenced DSIZE totals of zpool status -vD, and a large value in the zpool list -o name,dedupratio poolname output).

Penultimately, there are artificial slowdowns designed to make your pool usable during a scan. If your use is light during the scrub, you can probably get a slight boost by doing this:

/usr/sbin/sysctl -w kstat.zfs.darwin.tunable.scrub_max_active=6 kstat.zfs.darwin.tunable.zfs_resilver_delay=0 kstat.zfs.darwin.tunable.zfs_scrub_delay=0

But this will noticeably reduce performance if you want to use the pool for anything else during the scrub, so you will want to make note of the defaults in case you want to switch back.
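One way to make note of the defaults is to capture the current values before changing them and turn them into a ready-made restore command. The sketch below is a Python helper that only parses text in the "name: value" form that sysctl prints on OS X; the helper name is hypothetical, and the values in the sample are placeholders, not the real defaults.

```python
# Tunable names taken from the sysctl command quoted above.
TUNABLES = (
    "kstat.zfs.darwin.tunable.scrub_max_active",
    "kstat.zfs.darwin.tunable.zfs_resilver_delay",
    "kstat.zfs.darwin.tunable.zfs_scrub_delay",
)


def restore_command(sysctl_output: str) -> str:
    """Build a 'sysctl -w' command from saved 'name: value' lines.

    Lines for other sysctl names are ignored. Returns an empty string
    if none of the three tunables is present.
    """
    pairs = []
    for line in sysctl_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in TUNABLES:
            pairs.append(f"{name.strip()}={value.strip()}")
    if not pairs:
        return ""
    return "/usr/sbin/sysctl -w " + " ".join(pairs)


# Placeholder values for illustration only; save your own output of
# /usr/sbin/sysctl <names> before changing anything.
saved = """kstat.zfs.darwin.tunable.scrub_max_active: 1
kstat.zfs.darwin.tunable.zfs_resilver_delay: 2
kstat.zfs.darwin.tunable.zfs_scrub_delay: 3
"""
print(restore_command(saved))
```

Keeping the saved output in a file next to the pool notes means the switch back after the scrub is a single paste.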

Finally, I'll magnify my point above: you need to expect roughly double this amount of slowness to happen if you have to replace or otherwise do a full resilver on any device in your raidz2, and plan accordingly.

by **rottegift** » Mon Nov 23, 2015 6:06 pm

Other things to watch out for when you do dedup:

* destroying any deduplicated data is very slow, again, proportional to the number of blocks times the track-to-track seek time of the slowest-seeking device in your pool. That affects destroying snapshots, destroying datasets or zvols, overwriting or unlinking non-snapshotted data, and so forth.

** however, async_destroy pushes the slowness out of immediate visibility. "zpool list -o name,freeing" will show some value for freeing that can help you guess how long it will take. The time is nonlinear with the total size of the pool's DDT as reported by zpool status -vD, and the size of your arc.
Bigger arc will be faster (once it is heated); bigger DDT will be (much) slower.

*** async_destroy has the annoying problem that when you do an import of a pool that had a nonzero freeing value when it was last in use, it will take a long time to complete the import. Long time can be on the order of hours, depending on the size of the DDT, the value of the zfs_free_max_blocks variable, and the IOPS you can get from the pool's slowest device. For a substantial DDT, the expectation will be roughly slowest_seek_time_in_seconds * zfs_free_max_blocks.
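That rule of thumb is simple enough to write down as a small Python sketch. The function name is made up for illustration, and both input values below are hypothetical; check the actual zfs_free_max_blocks setting and your disks' seek time on your own system.

```python
def import_stall_seconds(slowest_seek_time_in_seconds: float,
                         zfs_free_max_blocks: int) -> float:
    """Expected import delay for a pool with a pending async destroy.

    Direct transcription of the estimate in the post:
    slowest_seek_time_in_seconds * zfs_free_max_blocks.
    """
    return slowest_seek_time_in_seconds * zfs_free_max_blocks


# Hypothetical values: 10 ms seeks, one million blocks freed per pass.
stall = import_stall_seconds(0.010, 1_000_000)
print(f"{stall:.0f} s, about {stall / 3600:.1f} hours")
```

With these assumed inputs the estimate lands at a little under three hours, consistent with "on the order of hours".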

* writes to a deduplicated dataset or zvol will take more time for a larger DDT, and *much* more time when the full DDT is not in zero-seek-time media (i.e., in neither arc nor fast l2arc).

** this is also nonlinear; you increase the amount of DDT data with each brand-new (i.e., different checksum) block written, but also the metadata pointing to the DDT at somewhere around log(DDT_size).

* there is a fundamental difficulty in aligning the DMU records with one another to achieve higher deduplication if the objects are ever subject to extension, replacement, or rewriting, or are smaller than the recordsize or volblocksize. This difficulty worsens with the size of recordsize or volblocksize.

** but, the smaller the blocks you are deduplicating (capped by volblocksize or recordsize), the more space the DDT will require on disk and in memory, and the more space will be occupied by l2 headers if you use an l2arc.

*** small blocks and sha256 play together very badly; this affects even non-deduplicated datasets and zvols which use checksum=sha256. Deduplication uses sha256 for each block.

**** future checksums play much better with small blocks and CPU, and you can switch the dedup property to one of them (skein, for example), but any block rewritten identically on either side of a change in the dedup property will look like a new block, and thus add a refcnt=1 entry to the DDT.

***** changing the checksum doesn't help with any of the I/O per second problems of deduplication; it will probably mostly matter to people who do deduplication on speedy SSDs, where random IOPS are thousands of times faster than you get from spinny disks.
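To see how strongly block size drives DDT size, here is a rough Python sketch. The 320 bytes per in-memory entry is a commonly quoted rule of thumb and is an assumption here, not a figure from this thread; the function name and the 1 TiB of unique data are likewise illustrative.

```python
def ddt_footprint(unique_data_bytes: int, block_size: int,
                  bytes_per_entry: int = 320) -> tuple[int, int]:
    """Return (DDT entries, approximate bytes of memory for them).

    Assumes every block is exactly block_size and unique, and that each
    entry costs bytes_per_entry in memory (assumed rule of thumb).
    """
    entries = -(-unique_data_bytes // block_size)  # ceiling division
    return entries, entries * bytes_per_entry


TIB = 1024 ** 4
GIB = 1024 ** 3
for block_size in (8 * 1024, 128 * 1024):
    entries, ram = ddt_footprint(1 * TIB, block_size)
    print(f"{block_size // 1024:>3}k blocks: {entries:,} entries, "
          f"~{ram / GIB:.1f} GiB of DDT")
```

Under these assumptions, 1 TiB of unique data needs about 40 GiB of DDT at 8k blocks but only about 2.5 GiB at 128k, which is the tension between the alignment problem and the table-size problem described in the bullets above.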

Finally, deduplication gets harder and harder to back out of over time, but also gets more and more costly over time. If you aren't likely to gain a really critical space savings out of deduplication (e.g., your primary storage for the pool is extremely fast and expensive SSD, where every byte counts and where every seek is "free"), stop now, and don't start again until you're certain.

by **Brendon** » Tue Nov 24, 2015 2:12 pm

"Friends don't let friends Dedup"

- Brendon

by **rottegift** » Thu Nov 26, 2015 12:27 am

tl;dr: if you think you need dedup now, try to wait it out.

Brendon wrote: "Friends don't let friends Dedup"

I largely agree with that (and have typed it myself from time to time, too).

I have five main (on o3x) uses for dedup.

1. zfs send/recv backups from multiple hosts where there is enormous overlap of data
(for example, managed software trees for things like macports or homebrew or ~/Library/Mail)
where I'm using zfs DAS where one might equally use network storage.

2. big blobs of data whose structure is not visible to mac os x (e.g. storage for VMs)

3. inconveniently different layouts of largely similar data

4. squeezing the maximal space out of extremely high random IOPS solid state devices.

5. "oops"es where rolling backups have become decoupled such that the sending side
and receiving side lack a common snapshot. Rarer since bookmarks, but there's still a history of that committed on long term storage.

The fifth is no longer relevant, thanks to the illumos 6393 change (which is now in test branches). Nopwrite using the new fast checksums (also in test branches; edonr is amazing, skein is really good too) fixes this problem at least as space-efficiently as deduplication, and without all the extra pain associated with randomly-accessing-and-updating deduplication tables.

The second is a tradeoff. On spinny storage, just trust nopwrite, and give it a hand with a 6393+zfs promote tree-folding as needed. On non-spinny storage, dedup will work better with some data (and systems that write it) and nopwrite will work better with others. But dedup on solid state storage is generally less painful.

The third case is where I'll still use dedup, when the pain (and it *is* pain, even on my ssd mirror that gets a million 4k iops) of deduplication is outweighed by the space constraint.

The first case is to a slow dedicated pool anyway, but if I were to start over now, I'd add more mirror vdevs to the pool rather than do deduplication. Disk space is much much much much cheaper than IOPS.

Finally, persistent l2arc helps a bunch. The whole DDT ends up on L2 eventually, and stays there. When it's there at pool import time, it's a massive advantage in processing the first backups (and snapshot deletions and so on). However, it uses memory in the arc on its own, and doesn't do away with all the extra memory needed to update ddt. A busily changing ddt might not propagate via l2arc_feed_thread to the L2 device right away. And worst of all, import on a pool with a ddt can take a long time (for instance, if there is an async_destroy going) and the l2 won't be useful for shortening that time.

Early next year openzfs will see some things that will help integrated into it, but in the meantime, space will get cheaper, and the cost of backing out of deduplication won't.

Just as an example, five days ago I destroyed several deduplicated datasets that had accumulated over time on a spinny-disk pool. Even with persistent l2arc and a big arc with a big arc_meta_limit, the freeing property is presently 219G on that pool. FIVE DAYS. Why? Because it's mainly 8k recordsize data, and even when the whole DDT is in ARC, the number of *writes* to update the on-disk structures remains very high.
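A rough Python sketch of why a 219G backlog is still a lot of work. It assumes 219G means GiB, that every remaining block is 8k, and that each freed block costs one 10 ms random I/O; all three are simplifying assumptions, and the function name is made up.

```python
def freeing_backlog(freeing_bytes: int, block_size: int,
                    seconds_per_block: float) -> tuple[int, float]:
    """Return (blocks still to free, estimated days to free them).

    Assumes one random I/O of seconds_per_block for every block, which
    understates the real work if each free needs several DDT updates.
    """
    blocks = freeing_bytes // block_size
    return blocks, blocks * seconds_per_block / 86400


GIB = 1024 ** 3
blocks, days = freeing_backlog(219 * GIB, 8 * 1024, 0.010)
print(f"{blocks:,} blocks left, roughly {days:.1f} more days")
```

Under these assumptions there are about 28.7 million blocks still to free, or a bit over three more days at one seek each, on top of the five days already spent.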

by **haer22** » Sat Nov 28, 2015 2:48 pm

Thanks a tremendous lot for the very illuminating and extensive comment on dedup!
My dedup-factor is currently 1.22 and it is still rising as old non-dedup data gets replaced and "dedupped".
I'll let it stew some more weeks and see if it levels out at some point and then give it some thinking.

again, thanks for the info!

by **rottegift** » Sat Nov 28, 2015 3:19 pm

Ok, another, terser way of putting it: all that cumulative time you are writing into your dedup=on datasets will have to be spent unwinding the deduplication when you destroy the datasets, times an amount that goes up as the number of blocks counted by zpool status -vD grows.

IOW, have an exit strategy that involves copying the data to a brand new pool.

Otherwise, if you spend a week or a month constantly writing, you'll take a week or a month to do a zfs destroy.

If you spend a long time writing the equivalent of a week or a month into datasets and snapshots, over the course of a few years, it'll still take you a week or a month to do the zfs destroys (and more, because the destroy is not actually time-reversal but rather random-walk through a sparse table whose elements are scattered across your disks).

1.22 is an AWFUL ratio for deduplication. You can do better than that by using a decent compression algorithm.

by **haer22** » Sun Nov 29, 2015 12:33 pm

I use compression as well. Gives 1.17. Combined with dedup I get ca 1.42 total.
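As a quick sanity check on those numbers, a small Python sketch. It assumes the two ratios simply multiply, which is an approximation, and the function names are made up for illustration.

```python
def combined_ratio(compress_ratio: float, dedup_ratio: float) -> float:
    """Combined space ratio, assuming the two effects multiply."""
    return compress_ratio * dedup_ratio


def space_saved_fraction(ratio: float) -> float:
    """Fraction of raw space saved at a given ratio (1.0 means none)."""
    return 1.0 - 1.0 / ratio


total = combined_ratio(1.17, 1.22)
print(f"combined ratio: {total:.2f}")
print(f"saved by dedup alone:       {space_saved_fraction(1.22):.0%}")
print(f"saved by compression alone: {space_saved_fraction(1.17):.0%}")
print(f"saved by both:              {space_saved_fraction(total):.0%}")
```

The product comes out at about 1.43, in line with the ca 1.42 reported, and a dedup ratio of 1.22 on its own corresponds to roughly 18% of the space saved.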

I have two big pools (each 24TB max, ca 15TB data) where the "front-end" is 2 vdev 4*4 TB (raidz1) with no compression and no dedup. Optimized for speed (somewhat :-). Maybe I should enable compression as I have cpu to spare.

The back-end (8*4TB raidz2) is where I play with dedup. So when I am fed up waiting for better dedup numbers, I just scrap the pool and start some multi-TB send/receive. It will take 2-3 days if I remember correctly from the last reset. During that transfer I can only survive 1-2 disk failures whereas I normally can survive 3-4 disk failures.

scrubbing is sloooow
