On 12/12/2011 12:23 PM, Richard Elling wrote:
On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:
Not exactly. What is dedup'ed is the stream only, which is in fact not very
efficient. Real dedup-aware replication takes the necessary steps to
avoid sending a block that already exists on the other storage system.
These exist outside of ZFS (eg rsync) and scale poorly.
Given that dedup is done at the pool level and ZFS send/receive is done at
the dataset level, how would you propose implementing a dedup-aware
ZFS send command?
I'm with Richard.
There is no practical "optimally efficient" way to dedup a stream from
one system to another. The only way to do so would be to have total
information about the pool composition on BOTH the receiver and sender
side. That would involve sending the checksums for the complete pool
blocks between the receiver and sender, which is a non-trivial overhead,
and, indeed, would usually be far worse than simply doing what 'zfs send
-D' does now (dedup the sending stream itself). The only possible way
that such a scheme would work would be if the receiver and sender were
the same machine (note: not VMs or Zones on the same machine, but the
same OS instance, since you would need the DDT to be shared). And,
that's not a use case that 'zfs send' is generally optimized for - that
is, while it's entirely possible, it's not the primary use case for 'zfs
send'.
Given the overhead of network communications, there's no way that
sending block checksums between hosts can ever be more efficient than
just sending the self-deduped whole stream (except in pedantic cases).
Let's look at possible implementations (all assume that the local
sending machine does its own dedup - that is, the stream-to-be-sent is
already deduped within itself):
(1) when constructing the stream, every time a block is read from a
fileset (or volume), its checksum is sent to the receiving machine. The
receiving machine then looks up that checksum in its DDT, and sends back
a "needed" or "not-needed" reply to the sender. While this lookup is
being done, the sender must hold the original block in RAM, and cannot
write it out to the to-be-sent-stream.
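A rough sketch of scenario #1 (all names here are hypothetical - the real
DDT is an in-kernel structure and is not exposed like this, and the
'receiver_ddt' lookup stands in for what would actually be a full network
round trip per block):

```python
import hashlib

def send_stream_per_block(blocks, receiver_ddt, wire):
    """Scenario #1 sketch: one checksum round trip per block.

    'blocks' is an iterable of raw block payloads; 'receiver_ddt' stands
    in for the remote DDT lookup (in reality a network round trip);
    'wire' collects the blocks actually shipped.
    """
    for block in blocks:
        csum = hashlib.sha256(block).hexdigest()
        # Round trip: ask the receiver whether it already has this block.
        # The block must sit in RAM while we wait for the reply.
        if csum not in receiver_ddt:
            wire.append(block)   # "needed" -> ship the block
        # else: "not-needed" -> drop it from the stream

# Toy usage: the receiver already holds block b"bbb".
receiver = {hashlib.sha256(b"bbb").hexdigest()}
sent = []
send_stream_per_block([b"aaa", b"bbb", b"ccc"], receiver, sent)
print(len(sent))  # 2 blocks cross the wire, at the cost of 3 round trips
```

The point of the sketch is the shape of the protocol, not the code: every
block pays a round-trip latency tax whether or not it ends up being sent.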
(2) The sending machine reads all the to-be-sent blocks, creates a
stream, AND creates a checksum table (a mini-DDT, if you will). The
sender communicates to the receiver this mini-DDT. The receiver diffs
this against its own master pool DDT, and then sends back an edited
mini-DDT containing only the checksums that match blocks which aren't on
the receiver. The original sending machine must then go back and
re-construct the stream (either as a whole, or parse the stream as it is
being sent) to leave out the unneeded blocks.
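Scenario #2 could be sketched like this (again, hypothetical names; a real
mini-DDT would carry checksums only, not the blocks themselves - the dict
here keeps the payloads purely so the second pass is visible):

```python
import hashlib

def build_mini_ddt(blocks):
    """Sender, first pass: checksum -> block for the stream-to-be-sent."""
    return {hashlib.sha256(b).hexdigest(): b for b in blocks}

def diff_against_pool(mini_ddt, pool_ddt):
    """Receiver side: return only the checksums it does NOT already hold."""
    return {c for c in mini_ddt if c not in pool_ddt}

def reconstruct_stream(mini_ddt, needed):
    """Sender, second pass: emit only the blocks the receiver asked for."""
    return [b for c, b in mini_ddt.items() if c in needed]

blocks = [b"aaa", b"bbb", b"ccc"]
mini = build_mini_ddt(blocks)
pool = {hashlib.sha256(b"bbb").hexdigest()}  # receiver already has b"bbb"
needed = diff_against_pool(mini, pool)       # one bulk exchange of checksums
stream = reconstruct_stream(mini, needed)
print(len(stream))
```

Note the two passes on the sending side: the stream cannot go out until the
mini-DDT has made its round trip and come back edited, which is exactly the
latency cost described below.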
(3) some combo of #1 and #2 where several checksums are stuffed into a
packet, and sent over the wire to be checked at the destination, with
the receiver sending back only those to be included in the stream.
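Scenario #3 is just #1 with the round trips amortized, which a sketch makes
plain (hypothetical names, as before):

```python
import hashlib

def _flush(batch, receiver_ddt):
    # One round trip per batch of checksums instead of one per block.
    sums = [hashlib.sha256(b).hexdigest() for b in batch]
    needed = {c for c in sums if c not in receiver_ddt}  # receiver's reply
    return [b for b, c in zip(batch, sums) if c in needed]

def batched_filter(blocks, receiver_ddt, batch_size=2):
    """Scenario #3 sketch: ship checksums in batches, keep only needed blocks."""
    out, batch = [], []
    for block in blocks:
        batch.append(block)
        if len(batch) == batch_size:
            out += _flush(batch, receiver_ddt)
            batch = []
    if batch:
        out += _flush(batch, receiver_ddt)
    return out

receiver = {hashlib.sha256(b"bbb").hexdigest()}
result = batched_filter([b"aaa", b"bbb", b"ccc", b"ddd"], receiver)
print(len(result))
```

Batching cuts the packet count, but every batch of blocks still has to be
held until its reply arrives - the fundamental wait is unchanged.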
In the first scenario, you produce a huge amount of small-packet network
traffic, which trashes network throughput, with no real expectation that
the reduction in the send stream will be worth it. In the second case,
you induce a huge amount of latency into the construction of the sending
stream - that is, the "sender" has to wait around and then spend a
non-trivial amount of processing power on essentially double processing
the send stream, when, in the current implementation, it just sends out
stuff as soon as it gets it. The third scenario is only an optimization
of #1 and #2, and doesn't avoid the pitfalls of either.
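A back-of-envelope calculation makes the tradeoff concrete (illustrative
numbers, not measurements: 128 KiB recordsize, a 1 Gb/s link, 0.5 ms RTT):

```python
# Illustrative assumptions: 128 KiB blocks, 1 Gb/s link, 0.5 ms round trip.
block_bytes = 128 * 1024
link_bytes_per_s = 1e9 / 8
rtt_s = 0.5e-3

stream_time = block_bytes / link_bytes_per_s  # time to just send the block
lookup_time = rtt_s                           # time to ask "do you have it?"

print(f"send block: {stream_time * 1e3:.3f} ms")
print(f"RTT lookup: {lookup_time * 1e3:.3f} ms")
# With these numbers, every lookup costs about half the time of simply
# sending the block outright - so unless roughly half the blocks turn out
# to be duplicates on the receiver, the lookups are a net loss.
```

Change the assumptions (WAN latency, smaller blocks) and the lookups get
relatively more expensive, not less, which is why only pedantic cases win.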
That is, even if ZFS did pool-level sends, you're still trapped by the
need to share the DDT, which induces an overhead that can't be
reasonably made up vs simply sending an internally-deduped source stream
in the first place. I'm sure I can construct an instance where such DDT
sharing would be better than the current 'zfs send' implementation; I'm
just as sure that such an instance would be the small minority of usage,
and that such a required implementation would radically alter the
"typical" use case's performance to the negative.
In any case, as 'zfs send' works on filesets and volumes, and ZFS
maintains DDT information on a pool-level, there's no way to share an
existing whole DDT between two systems (and, given the potential size of
a pool-level DDT, that's a bad idea anyway).
I see no ability to optimize the 'zfs send/receive' concept beyond what
is currently done.
zfs-discuss mailing list