How Quickly Does Deduplication Memory Shrink?

All your general support questions for OpenZFS on OS X.

How Quickly Does Deduplication Memory Shrink?

Postby Haravikk » Fri Apr 22, 2022 11:46 am

When I set up my ZFS storage I initially dismissed deduplication, since the bulk of my data (in terms of capacity used) is unique, so there would be little or no savings from deduplication.

However, I do have a working directory where I use tools that have a tendency to copy a lot of data with only small modifications (usually in header blocks), so I'm thinking of enabling deduplication only for this directory (which I'll put in its own new dataset).

What I'm wondering, though, is what the expected memory impact will be, accounting for the fact that files are only in this directory temporarily (while being worked on; once complete they're copied elsewhere).

For example, let's say I have a 10GB file and it's copied five times. With deduplication enabled this should only result in 10GB of data in the dataset (plus change) with, if I'm working it out correctly, a deduplication table of maybe 25MB (320 bytes per dedup entry multiplied by the number of unique records, which is around 80,000 at the default 128K recordsize?). Now let's say I move the file out of the working dataset and into its final location; will that 25MB of RAM be freed from the deduplication table (assuming no snapshots etc.), since the records no longer exist in that dataset?
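For what it's worth, the back-of-the-envelope maths above can be sketched in shell. Both inputs are assumptions: the default 128K recordsize and the commonly quoted ~320 bytes per in-core DDT entry; check your own dataset's recordsize before trusting the figure.

```shell
#!/bin/sh
# Rough in-core DDT estimate for a single unique 10GB file (assumed values).
FILE_BYTES=$((10 * 1024 * 1024 * 1024))   # 10GB of unique data
RECORDSIZE=$((128 * 1024))                # assumes the default 128K recordsize
ENTRY_BYTES=320                           # commonly quoted per-entry RAM cost
RECORDS=$((FILE_BYTES / RECORDSIZE))
DDT_MB=$((RECORDS * ENTRY_BYTES / 1024 / 1024))
echo "${RECORDS} records, ~${DDT_MB}MB of DDT in RAM"
```

Copies of the same file add no new unique records, so in theory they shouldn't grow the table beyond this.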

My hope is that I can essentially have a deduplication setup in which memory usage rises while active, then drops back to near zero when done. If that's not how it works, are there alternatives? For example, are there any drawbacks to enabling deduplication only temporarily, i.e. turning it on when I start a task, then turning it off again once I've completed it?
Haravikk
 
Posts: 42
Joined: Tue Mar 17, 2015 4:52 am

Re: How Quickly Does Deduplication Memory Shrink?

Postby tangles » Sat May 07, 2022 4:22 am

from https://linuxhint.com/zfs-deduplication/#a2

Whether deduplication saves disk space or not, ZFS still has to keep track of all the data blocks of your ZFS pool/filesystem in the deduplication table (DDT).

So, if you have a big ZFS pool/filesystem, ZFS will have to use a lot of memory to store the DDT. If deduplication is not saving you much disk space, all of that memory is wasted. This is a big problem with deduplication.

Another problem is high CPU utilization. If the DDT is too big, ZFS may also have to do a lot of comparison operations, which may increase the CPU utilization of your computer.
tangles
 
Posts: 191
Joined: Tue Jun 17, 2014 6:54 am

Re: How Quickly Does Deduplication Memory Shrink?

Postby Haravikk » Sat May 07, 2022 2:18 pm

I don't think this answers my question though. My understanding is that the deduplication table is basically a table of checksums for every block (record) already in the dataset, so that if the same block is written again it can simply be referenced rather than written a second time. That means the size of the DDT should be proportional to the amount of data in the dataset(s) for which deduplication is enabled.

What I'm asking is how quickly deleted blocks are removed from the DDT, and memory freed as a result. If a dataset contains almost nothing, the DDT should in theory be very small; as you add data the DDT will grow, but does it shrink, and if so, how quickly?

I may just need to devise some tests of my own, but I'm really not sure how you're supposed to measure the current size of the DDT; there doesn't seem to be a simple measure for it. It looks like you need to work out how many unique and deduplicated blocks are in the DDT at a given moment, then multiply by the known entry sizes (or just fudge it; I think both entry types are roughly 160 bytes on average).
Haravikk
 
Posts: 42
Joined: Tue Mar 17, 2015 4:52 am

Re: How Quickly Does Deduplication Memory Shrink?

Postby tangles » Mon May 09, 2022 5:38 am

Create your test pool environment.

Grep the following line from the output of zpool status -D <poolname>:

Code:
DDT entries a, size b on disk, c in core


and use awk to pull the third and eighth items on that line (being a and c) into variables:

a x c / (1024^2) = ~MB of RAM used for your DDT
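A minimal sketch of that pipeline, run here against a sample line rather than a live pool. The field positions are an assumption: depending on your zpool version the line may carry a leading "dedup:" (making the entry count field 4 and the in-core size field 9) and the sizes may be printed with unit suffixes, so check your actual output first.

```shell
#!/bin/sh
# Sample line standing in for the real command's output; on a live system:
#   LINE=$(zpool status -D <poolname> | grep 'DDT entries')
LINE="dedup: DDT entries 81920, size 660 on disk, 320 in core"
# With the leading "dedup:", the entry count is $4 and the per-entry in-core
# size in bytes is $9; awk coerces the trailing comma in "81920," away.
echo "$LINE" | awk '{ printf "~%.0f MB of DDT in RAM\n", $4 * $9 / (1024 * 1024) }'
```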

Create a menu entry that refreshes every 5 seconds or so using xbar:

https://github.com/matryer/xbar

You should soon see when your table increases and then decreases as you copy data over and delete it in the Finder.
tangles
 
Posts: 191
Joined: Tue Jun 17, 2014 6:54 am

