On 12/1/2011 6:44 PM, Ragnar Sundblad wrote:
I'm pretty sure it's NOT 1:1, but I'd have to go look at the code. In
any case, it's not a very big number, so you're still looking at the
same O(n) as the number of DDT entries (n).
Thanks for your answers!
On 2 dec 2011, at 02:54, Erik Trimble wrote:
On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
I am sorry if these are dumb questions. If there are explanations
available somewhere for those questions that I just haven't found, please
let me know! :-)
1. It has been said that when the DDT entries, some 376 bytes or so, are
rolled out on L2ARC, there are still some 170 bytes in the ARC to reference
them (or rather the ZAP objects I believe). In some places it sounds like
those 170 bytes refer to ZAP objects that contain several DDT entries.
In other cases it sounds like for each DDT entry in the L2ARC there must
be one 170 byte reference in the ARC. What is the story here really?
Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC
requires a pointer record in the ARC, so the DDT entries held in L2ARC also
consume ARC space. It's a bad situation.
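
As a rough illustration of the overhead described above, here is a small
sketch that estimates how much ARC goes to reference records when the DDT
sits in L2ARC. The 376-byte and 170-byte figures are the approximate ones
quoted in this thread. The entries_per_ref parameter is an assumption made
for illustration, since the thread does not settle whether one reference
covers a single DDT entry or several.

```python
DDT_ENTRY_BYTES = 376  # approximate DDT entry size quoted in this thread
ARC_REF_BYTES = 170    # approximate ARC reference size quoted in this thread


def arc_overhead_bytes(ddt_entries, entries_per_ref=1):
    """Estimate ARC bytes needed to reference a DDT held entirely in L2ARC.

    entries_per_ref is hypothetical: 1 models a 1:1 mapping, larger values
    model one ARC reference covering several DDT entries.
    """
    if entries_per_ref < 1:
        raise ValueError("entries_per_ref must be at least 1")
    refs = -(-ddt_entries // entries_per_ref)  # ceiling division
    return refs * ARC_REF_BYTES


def l2arc_ddt_bytes(ddt_entries):
    """Estimate L2ARC bytes occupied by the DDT entries themselves."""
    return ddt_entries * DDT_ENTRY_BYTES


# Example: 1 TiB of unique data in 128 KiB blocks (an assumed block size).
entries = (1 << 40) // (128 << 10)
print(entries, arc_overhead_bytes(entries), l2arc_ddt_bytes(entries))
```

With a 1:1 mapping the ARC cost grows linearly with the number of DDT
entries; a larger entries_per_ref only divides it by a constant.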
Yes, it is a bad situation. But how many DDT entries can there be in each ZAP
object? Some have suggested a 1:1 relationship, others have suggested that it
2. Deletion with dedup enabled is a lot heavier for some reason that I don't
understand. It is said that the DDT entries have to be updated for each
deleted reference to that block. Since zfs already has a mechanism for sharing
blocks (for example with snapshots), I don't understand why the DDT has to
contain any more block references at all, or why deletion should be much harder
just because there are checksums (DDT entries) tied to those blocks, and even
if they have to, why it would be much harder than the other block reference
mechanism. If anyone could explain this (or give me a pointer to an
explanation), I'd be very happy!
Remember that, when using Dedup, each block can potentially be part of a very
large number of files. So, when you delete a file, you have to go look at the
DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates.
It's essentially the same problem that erasing snapshots has - for each block
you delete, you have to find and update the metadata for all the other files
that share that block usage. Dedup and snapshot deletion share the same
problem, it's just usually worse for dedup, since there's a much larger number
of blocks that have to be updated.
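
To make the per-block cost concrete, here is a toy model of a dedup table
as a map from block checksum to reference count. It is only a sketch: the
class and function names are made up for illustration, and it is not the
ZFS data structure or on-disk format. It shows that deleting a file touches
the table once for every block in that file, whether or not the block ends
up being freed.

```python
import hashlib


class ToyDDT:
    """Toy dedup table: checksum -> reference count (not the ZFS format)."""

    def __init__(self):
        self.refcount = {}
        self.lookups = 0  # each lookup stands in for a potential random read

    def write_block(self, data):
        """Record one more reference to the block with this content."""
        key = hashlib.sha256(data).digest()
        self.lookups += 1
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

    def free_block(self, key):
        """Drop one reference; return True if the block is now unused."""
        self.lookups += 1
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            return True
        return False


def delete_file(ddt, block_keys):
    """Delete a file: one table update per block; returns blocks freed."""
    return sum(ddt.free_block(key) for key in block_keys)


ddt = ToyDDT()
file_a = [ddt.write_block(b) for b in (b"x", b"y")]
file_b = [ddt.write_block(b) for b in (b"x", b"z")]
ddt.lookups = 0
freed = delete_file(ddt, file_a)
print(freed, ddt.lookups)
```

In this example the block shared with file_b survives, but the delete still
had to visit its table entry.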
What is it that must be updated in the DDT entries - a ref count?
And how does that differ from the snapshot case, which seems like
a very similar mechanism?
It is similar to the snapshot case, in that the block itself has a
reference count in its structure (for use in both dedup and snapshots)
that would get updated upon "delete", but you also have to consider that
the DDT entry itself, which is a separate structure from the block
structure, also has to be updated. Fetching that additional structure
costs a whole extra I/O. So, more or less, a dedup delete has to do two
operations for every one that a snapshot delete does. Plus,
ZFS currently treats all metadata (which includes DDT entries) and data
slabs the same when it comes to choosing to migrate them from ARC to
L2ARC, so the most-frequently-accessed info is in the ARC (regardless of
what that info is), and everything else sits in the L2ARC. But, ALL
entries in the L2ARC require an ARC reference pointer.
The problem is that you really need to have the entire DDT in some form of
high-speed random-access memory in order for things to be efficient. If you
have to search the entire hard drive to get the proper DDT entry every time you
delete a block, then your IOPS limits are going to get hammered hard.
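
A back-of-the-envelope sketch of why this hurts: if every block of a
deleted file costs at least one random read to fetch its DDT entry, the
delete time is bounded by the random IOPS of wherever the DDT lives. The
function name and the example figures (100 GiB file, 128 KiB blocks, 100
IOPS for a hard drive, 20000 IOPS for an SSD) are assumptions picked for
illustration, not measurements.

```python
def delete_seconds(file_bytes, block_bytes, random_iops, ddt_reads_per_block=1):
    """Rough time to delete a file when each DDT lookup costs random reads.

    ddt_reads_per_block is a hypothetical knob for lookups that need more
    than one read.
    """
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return blocks * ddt_reads_per_block / random_iops


file_bytes = 100 * 2**30   # assumed 100 GiB file
block_bytes = 128 * 2**10  # assumed 128 KiB blocks

print(delete_seconds(file_bytes, block_bytes, random_iops=100))    # HDD-like
print(delete_seconds(file_bytes, block_bytes, random_iops=20000))  # SSD-like
```

The estimate ignores caching and write costs; it only shows how the
per-block lookup scales with file size and device IOPS.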
3. I, as many others, would of course like to be able to have very large
datasets deduped without having to have enormous amounts of RAM.
Since the DDT is an AVL tree, couldn't just that entire tree be cached on
for example a SSD and be searched there without necessarily having to store
anything of it in RAM? That would probably require some changes to the DDT
lookup code, and some mechanism to gather the tree to be able to lift it
over to the SSD cache, and some other stuff, but still that sounds - with
my very basic (non-)understanding of zfs - like a not too overwhelming change.
L2ARC typically sits on an SSD, and the DDT is usually held there, if the L2ARC
Well, it rather seems to be ZAP objects, referenced from the ARC, which
happen to contain DDT entries, that are in the L2ARC.
I mean that you could just move the entire AVL tree onto the SSD, completely
outside of zfs if you will, and have it searched there, not dependent
on what is in RAM at all.
Every DDT lookup would take up to [tree depth] number of reads, but that could
be OK if you have an SSD which is fast on reading (which many are).
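
Taking the AVL-tree premise above at face value, the worst-case number of
reads per lookup can be bounded with the standard recurrence for the
sparsest possible AVL tree. This is a sketch under the assumption that each
tree level costs one SSD read; the function name is made up for
illustration.

```python
def avl_max_levels(n):
    """Largest number of levels an AVL tree with n nodes can have.

    Uses the recurrence for the sparsest AVL tree with d levels:
    min_nodes(d) = min_nodes(d - 1) + min_nodes(d - 2) + 1,
    with min_nodes(0) = 0 and min_nodes(1) = 1.
    """
    if n < 1:
        return 0
    prev, cur, levels = 0, 1, 1
    while True:
        nxt = cur + prev + 1  # fewest nodes needed for one more level
        if nxt > n:
            return levels
        prev, cur = cur, nxt
        levels += 1


# Example: a DDT with 2**23 entries (1 TiB of unique 128 KiB blocks).
print(avl_max_levels(2**23))
```

The depth grows only logarithmically with the number of entries, so the
cost per lookup is a few dozen reads at most under this assumption.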
Under normal operation, you really should have an L2ARC device capable
of holding the entire DDT, to get the random IOPS benefit from that.
However, using the current design, that still consumes a rather large
amount of ARC space to hold the L2ARC reference pointers. A redesign
effort should definitely reconsider how this is done - probably the most
efficient way would be to delete L2ARC ref pointers completely in ARC,
and just force a search of L2ARC if the data isn't found in the ARC.
But, that's just a guess at a new implementation; I'm sure there's
gotchas around that, and, like I said, I suspect that the only way to
save dedup is to kill dedup (then redo it from scratch).
Not that I know of, and there hasn't been any talk on any of these lists
There does need to be serious work on changing how the DDT in the L2ARC is
referenced, however; the ARC memory requirements for DDT-in-L2ARC definitely
need to be removed (which requires a non-trivial rearchitecting of dedup).
There are some other changes that have to happen for Dedup to be really usable.
Unfortunately, I can't see anyone around willing to do those changes, and my
understanding of the code says that it is much more likely that we will simply
remove and replace the entire dedup feature rather than trying to fix the
Yes, replacing it is certainly one possibility.
Is there any work going on for a replacement mechanism?
4. Now and then people mention that the problem with bp_rewrite has been
explained, on this very mailing list I believe, but I haven't found that
explanation. Could someone please give me a pointer to that description
(or perhaps explain it again :-) )?
Thanks for any enlightenment!
bp_rewrite is a feature which stands for the (as yet unimplemented) system call
of the same name, which does Block Pointer re-writing. That is, it would allow
ZFS to change the physical location on media of an existing ZFS data slab. In
other words, bp_rewrite is necessary to allow ZFS to change the Physical layout of data
on media, without changing the Conceptual arrangement of such data.
It's been the #1 most-wanted feature of ZFS since I can remember, probably for
10 years now.
Yes, I got that much. :-)
But what is the problem really?
Being naive/ignorant (and completely ignoring any possible dependencies between
the different layers in the zfs stack), it doesn't seem that magic or esoteric
when compared to the rest of the stuff in there.
Conceptually, it's not *that* bad. From an implementation point of
view, it's a major feature add, which touches a big chunk of the code.
As always, the Devil is in the details. One area of problem is how to
guarantee the move has taken place - that is, when I say I'm going to
move Slab A from disk location X to location Y, how can I atomically
guarantee this? While I'm doing other I/O. When there might be a power
loss (or other pool loss). Plus lots of other non-best-case events
The major problem with "active" (vs off-line) deduplication is that no
matter what strategy you use, you MUST keep a *complete* copy of all
blocks currently in the pool, with their checksums. So, for something
like ZFS, you need a structure that holds the physical block location, a
256-bit checksum, and a reference count, at the minimum, for each and
every block in the entire pool. If you want good performance, this
lookup table has to be on something that has very good random I/O
zfs-discuss mailing list