On 12/12/2011 12:23 PM, Richard Elling wrote:
On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:
Not exactly. What is dedup'ed is the stream only, which is in fact not very
efficient. Real dedup-aware replication takes the necessary steps to
avoid sending a block that already exists on the other storage system.
These exist outside of ZFS (eg rsync) and scale poorly.
Given that dedup is done at the pool level and ZFS send/receive is done at
the dataset level, how would you propose implementing a dedup-aware
ZFS send command?
I'm with Richard.
There is no practical "optimally efficient" way to dedup a stream from
one system to another. The only way to do so would be to have total
information about the pool composition on BOTH the receiver and sender
side. That would involve sending the checksums for the complete pool
blocks between the receiver and sender, which is a non-trivial overhead,
and, indeed, would usually be far worse than simply doing what 'zfs send
-D' does now (dedup the sending stream itself). The only possible way
that such a scheme would work would be if the receiver and sender were
the same machine (note: not VMs or Zones on the same machine, but the
same OS instance, since you would need the DDT to be shared). And,
that's not a use case that 'zfs send' is generally optimized for - that
is, while it's entirely possible, it's not the primary use case for 'zfs
send'.
Given the overhead of network communications, there's no way that
sending block checksums between hosts can ever be more efficient than
just sending the self-deduped whole stream (except in pedantic cases).
Let's look at possible implementations (all assume that the local
sending machine does its own dedup - that is, the stream-to-be-sent is
already deduped within itself):
(1) when constructing the stream, every time a block is read from a
fileset (or volume), its checksum is sent to the receiving machine. The
receiving machine then looks up that checksum in its DDT, and sends back
a "needed" or "not-needed" reply to the sender. While this lookup is
being done, the sender must hold the original block in RAM, and cannot
write it out to the to-be-sent-stream.
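A rough sketch of scenario #1 (all names here are hypothetical - the real
DDT is an in-kernel structure and is not exposed like this, and the
'receiver_ddt' lookup stands in for what would actually be a full network
round trip per block):

```python
import hashlib

def send_stream_per_block(blocks, receiver_ddt, wire):
    """Scenario #1 sketch: one checksum round trip per block.

    'blocks' is an iterable of raw block payloads; 'receiver_ddt' stands
    in for the remote DDT lookup (in reality a network round trip);
    'wire' collects the blocks actually shipped.
    """
    for block in blocks:
        csum = hashlib.sha256(block).hexdigest()
        # Round trip: ask the receiver whether it already has this block.
        # The block must sit in RAM while we wait for the reply.
        if csum not in receiver_ddt:
            wire.append(block)   # "needed" -> ship the block
        # else: "not-needed" -> drop it from the stream

# Toy usage: the receiver already holds block b"bbb".
receiver = {hashlib.sha256(b"bbb").hexdigest()}
sent = []
send_stream_per_block([b"aaa", b"bbb", b"ccc"], receiver, sent)
print(len(sent))  # 2 blocks cross the wire, at the cost of 3 round trips
```

The point of the sketch is the shape of the protocol, not the code: every
block pays a round-trip latency tax whether or not it ends up being sent.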
(2) The sending machine reads all the to-be-sent blocks, creates a
stream, AND creates a checksum table (a mini-DDT, if you will). The
sender communicates to the receiver this mini-DDT. The receiver diffs
this against its own master pool DDT, and then sends back an edited
mini-DDT containing only the checksums that match blocks which aren't on
the receiver. The original sending machine must then go back and
re-construct the stream (either as a whole, or parse the stream as it is
being sent) to leave out the unneeded blocks.
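Scenario #2 could be sketched like this (again, hypothetical names; a real
mini-DDT would carry checksums only, not the blocks themselves - the dict
here keeps the payloads purely so the second pass is visible):

```python
import hashlib

def build_mini_ddt(blocks):
    """Sender, first pass: checksum -> block for the stream-to-be-sent."""
    return {hashlib.sha256(b).hexdigest(): b for b in blocks}

def diff_against_pool(mini_ddt, pool_ddt):
    """Receiver side: return only the checksums it does NOT already hold."""
    return {c for c in mini_ddt if c not in pool_ddt}

def reconstruct_stream(mini_ddt, needed):
    """Sender, second pass: emit only the blocks the receiver asked for."""
    return [b for c, b in mini_ddt.items() if c in needed]

blocks = [b"aaa", b"bbb", b"ccc"]
mini = build_mini_ddt(blocks)
pool = {hashlib.sha256(b"bbb").hexdigest()}  # receiver already has b"bbb"
needed = diff_against_pool(mini, pool)       # one bulk exchange of checksums
stream = reconstruct_stream(mini, needed)
print(len(stream))
```

Note the two passes on the sending side: the stream cannot go out until the
mini-DDT has made its round trip and come back edited, which is exactly the
latency cost described below.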
(3) some combo of #1 and #2 where several checksums are stuffed into a
packet, and sent over the wire to be checked at the destination, with
the receiver sending back only those to be included in the stream.
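Scenario #3 is just #1 with the round trips amortized, which a sketch makes
plain (hypothetical names, as before):

```python
import hashlib

def _flush(batch, receiver_ddt):
    # One round trip per batch of checksums instead of one per block.
    sums = [hashlib.sha256(b).hexdigest() for b in batch]
    needed = {c for c in sums if c not in receiver_ddt}  # receiver's reply
    return [b for b, c in zip(batch, sums) if c in needed]

def batched_filter(blocks, receiver_ddt, batch_size=2):
    """Scenario #3 sketch: ship checksums in batches, keep only needed blocks."""
    out, batch = [], []
    for block in blocks:
        batch.append(block)
        if len(batch) == batch_size:
            out += _flush(batch, receiver_ddt)
            batch = []
    if batch:
        out += _flush(batch, receiver_ddt)
    return out

receiver = {hashlib.sha256(b"bbb").hexdigest()}
result = batched_filter([b"aaa", b"bbb", b"ccc", b"ddd"], receiver)
print(len(result))
```

Batching cuts the packet count, but every batch of blocks still has to be
held until its reply arrives - the fundamental wait is unchanged.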
In the first scenario, you produce a huge amount of small-packet network
traffic, which trashes network throughput, with no real expectation that
the reduction in the send stream will be worth it. In the second case,
you induce a huge amount of latency into the construction of the sending
stream - that is, the "sender" has to wait around and then spend a
non-trivial amount of processing power on essentially double processing
the send stream, when, in the current implementation, it just sends out
stuff as soon as it gets it. The third scenario is only an optimization
of #1 and #2, and doesn't avoid the pitfalls of either.
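A back-of-envelope calculation makes the tradeoff concrete (illustrative
numbers, not measurements: 128 KiB recordsize, a 1 Gb/s link, 0.5 ms RTT):

```python
# Illustrative assumptions: 128 KiB blocks, 1 Gb/s link, 0.5 ms round trip.
block_bytes = 128 * 1024
link_bytes_per_s = 1e9 / 8
rtt_s = 0.5e-3

stream_time = block_bytes / link_bytes_per_s  # time to just send the block
lookup_time = rtt_s                           # time to ask "do you have it?"

print(f"send block: {stream_time * 1e3:.3f} ms")
print(f"RTT lookup: {lookup_time * 1e3:.3f} ms")
# With these numbers, every lookup costs about half the time of simply
# sending the block outright - so unless roughly half the blocks turn out
# to be duplicates on the receiver, the lookups are a net loss.
```

Change the assumptions (WAN latency, smaller blocks) and the lookups get
relatively more expensive, not less, which is why only pedantic cases win.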
That is, even if ZFS did pool-level sends, you're still trapped by the
need to share the DDT, which induces an overhead that can't be
reasonably made up vs simply sending an internally-deduped source stream
in the first place. I'm sure I can construct an instance where such DDT
sharing would be better than the current 'zfs send' implementation; I'm
just as sure that such an instance would be the small minority of usage,
and that such a required implementation would radically alter the
"typical" use case's performance to the negative.
In any case, as 'zfs send' works on filesets and volumes, and ZFS
maintains DDT information on a pool-level, there's no way to share an
existing whole DDT between two systems (and, given the potential size of
a pool-level DDT, that's a bad idea anyway).
I see no ability to optimize the 'zfs send/receive' concept beyond what
is currently done.
zfs-discuss mailing list