Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens  wrote:
> On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble  wrote:
>> (1) when constructing the stream, every time a block is read from a fileset
>> (or volume), its checksum is sent to the receiving machine. The receiving
>> machine then looks up that checksum in its DDT, and sends back a "needed" or
>> "not-needed" reply to the sender. While this lookup is being done, the
>> sender must hold the original block in RAM, and cannot write it out to the
>> to-be-sent-stream.
> ...
>> you produce a huge amount of small network packet
>> traffic, which trashes network throughput
>
> This seems like a valid approach to me.  When constructing the stream,
> the sender need not read the actual data, just the checksum in the
> indirect block.  So there is nothing that the sender "must hold in
> RAM".  There is no need to create small (or synchronous) network
> packets, because sender need not wait for the receiver to determine if
> it needs the block or not.  There can be multiple asynchronous
> communication streams:  one where the sender sends all the checksums
> to the receiver; another where the receiver requests blocks that it
> does not have from the sender; and another where the sender sends
> requested blocks back to the receiver.  Implementing this may not be
> trivial, and in some cases it will not improve on the current
> implementation.  But in others it would be a considerable improvement.

Right, you'd want to let the socket/transport buffer and flow-control
the writes of "I have this new block checksum" messages from the zfs
sender and "I need the block with this checksum" messages from the zfs
receiver.

I like this.

A separate channel for bulk data definitely comes recommended for flow
control reasons, but if you do that then securing the transport gets
complicated: you couldn't just zfs send .. | ssh ... zfs receive.  You
could use SSH channel multiplexing, but that will net you lousy
performance (well, no lousier than one already gets with SSH
anyway)[*].  (And SunSSH lacks this feature anyway.)  It'd then begin
to pay to have a bona fide zfs send network protocol, and now
we're talking about significantly more work.  Another option would be
to have send/receive options to create the two separate channels, so
one would do something like:

% zfs send --dedup-control-channel ... | ssh-or-netcat-or... \
    zfs receive --dedup-control-channel ... &
% zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... \
    zfs receive --dedup-bulk-channel
% wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it has flow-controlled channels
layered over a flow- and congestion-controlled transport (TCP), and
there's not enough information flowing from TCP to SSHv2 to make this
work well.  Also, an SSHv2 channel's window cannot shrink except by the
sender consuming it, which makes it impossible to mix high-bandwidth
bulk data and small control messages over a congested link.  This means
that in practice SSHv2 channels have to have relatively small windows,
which then forces the protocol to work very synchronously (i.e., with
effectively synchronous ACKs of bulk data).  I now believe the idea of
mixing bulk and non-bulk data over a single TCP connection in SSHv2 is a
failure.  SSHv2 over SCTP, or over multiple TCP connections, would be
much better.

Nico
--
___
zfs-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Matthew Ahrens
On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble  wrote:
> On 12/12/2011 12:23 PM, Richard Elling wrote:
>>
>> On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:
>>
>>> Not exactly. What is dedup'ed is the stream only, which is in fact not
>>> very efficient. Real dedup aware replication is taking the necessary
>>> steps to avoid sending a block that exists on the other storage system.

As with all dedup-related performance, the efficiency of various
methods of implementing "zfs send -D" will vary widely, depending on
the dedup-ability of the data, and what is being sent.  However,
sending no blocks that already exist on the target system does seem
like a good goal, since it addresses some use cases that intra-stream
dedup does not.

> (1) when constructing the stream, every time a block is read from a fileset
> (or volume), its checksum is sent to the receiving machine. The receiving
> machine then looks up that checksum in its DDT, and sends back a "needed" or
> "not-needed" reply to the sender. While this lookup is being done, the
> sender must hold the original block in RAM, and cannot write it out to the
> to-be-sent-stream.
...
> you produce a huge amount of small network packet
> traffic, which trashes network throughput

This seems like a valid approach to me.  When constructing the stream,
the sender need not read the actual data, just the checksum in the
indirect block.  So there is nothing that the sender "must hold in
RAM".  There is no need to create small (or synchronous) network
packets, because sender need not wait for the receiver to determine if
it needs the block or not.  There can be multiple asynchronous
communication streams:  one where the sender sends all the checksums
to the receiver; another where the receiver requests blocks that it
does not have from the sender; and another where the sender sends
requested blocks back to the receiver.  Implementing this may not be
trivial, and in some cases it will not improve on the current
implementation.  But in others it would be a considerable improvement.
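
The three streams described above can be sketched roughly as follows. This is only a toy model under stated assumptions, not how zfs send works today: the names `checksum` and `replicate` are made up for illustration, SHA-256 stands in for the block checksum, and the three queues are drained one after another here, whereas a real implementation would run them concurrently.

```python
import hashlib
from collections import deque


def checksum(block: bytes) -> str:
    """Stand-in for the per-block checksum (assumption: SHA-256)."""
    return hashlib.sha256(block).hexdigest()


def replicate(sender_blocks, receiver_store):
    """Toy model of the three streams; returns the checksums actually sent.

    receiver_store maps checksum -> block and plays the role of the
    receiver's dedup table. It is updated in place.
    """
    # Sender-side index. A real sender would take the checksums from block
    # pointers without reading the data; here they are computed for brevity.
    sender_index = {checksum(b): b for b in sender_blocks}

    offers = deque(checksum(b) for b in sender_blocks)  # stream 1: checksums
    requests = deque()                                  # stream 2: "I need this"
    transferred = []                                    # stream 3: bulk data log

    # Receiver side: look up each offered checksum, request only unknown ones.
    already_requested = set()
    while offers:
        c = offers.popleft()
        if c not in receiver_store and c not in already_requested:
            already_requested.add(c)
            requests.append(c)

    # Sender side: answer the requests with the actual blocks.
    while requests:
        c = requests.popleft()
        receiver_store[c] = sender_index[c]
        transferred.append(c)
    return transferred


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"C" * 4096]
store = {checksum(b"B" * 4096): b"B" * 4096}   # receiver already has "B"
sent = replicate(blocks, store)
print(f"offered {len(blocks)} checksums, transferred {len(sent)} blocks")
```

Nothing in this model makes the sender wait for a reply before offering the next checksum, which is the property the reply above relies on.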

--matt


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 9:53 AM, Brad Diggs  wrote:
> Jim,
>
> You are spot on.  I was hoping that the writes would be close enough to
> identical that there would be a high ratio of duplicate data since I use
> the same record size, page size, compression algorithm, … etc.  However,
> that was not the case.  The main thing that I wanted to prove though was
> that if the data was the same the L1 ARC only caches the data that was
> actually written to storage.  That is a really cool thing!  I am sure
> there will be future study on this topic as it applies to other scenarios.
>
> With regards to directory engineering investing any energy into optimizing
> ODSEE DS to more effectively leverage this caching potential, that won't
> happen.  OUD far outperforms ODSEE.  That said OUD may get some focus in
> this area.  However, time will tell on that one.

Databases are not as likely to benefit from dedup as virtual machines;
indeed, DBs are likely to not benefit at all from dedup.  The VM use
case benefits from dedup for the obvious reason that many VMs will
have the same exact software installed most of the time, using the
same filesystems, and the same patch/update installation order, so if
you keep data out of their root filesystems then you can expect
enormous deduplicatiousness.  But databases, not so much.  The unit of
deduplicable data in a VM use case is the guest's preferred block
size, while in a DB the unit of deduplicable data might be a
variable-sized table row, or even smaller: a single row/column value
-- and you have no way to ensure alignment of individual deduplicable
units nor ordering of sets of deduplicable units into larger ones.
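
The alignment point can be illustrated with a small experiment. This is a sketch under assumptions, not a model of ZFS itself: blocks are fixed at 4096 bytes, SHA-256 stands in for the dedup checksum, and `sample_data`, `block_hashes` and `shared_fraction` are names invented for the example.

```python
import hashlib


def sample_data(n_bytes: int) -> bytes:
    """Deterministic, non-repeating filler data (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n_bytes])


def block_hashes(data: bytes, block_size: int) -> set:
    """Checksums of the fixed-size blocks a block-level dedup would see."""
    return {hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)}


def shared_fraction(a: bytes, b: bytes, block_size: int) -> float:
    """Fraction of a's blocks that also occur, aligned, in b."""
    ha, hb = block_hashes(a, block_size), block_hashes(b, block_size)
    return len(ha & hb) / len(ha)


image = sample_data(64 * 1024)
clone = bytes(image)       # VM-like case: same content at the same offsets
shifted = b"x" + image     # DB-like case: same content, moved by one byte

print("clone:  ", shared_fraction(image, clone, 4096))
print("shifted:", shared_fraction(image, shifted, 4096))
```

The clone shares every block, while the copy shifted by a single byte shares none, even though all but one byte of its content is identical.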

When it comes to databases your best bets will be: a) database-level
compression or dedup features (e.g., Oracle's column-level compression
feature) or b) ZFS compression.

(Dedup makes VM management easier, because the alternative is to patch
one master guest VM [per-guest type] then re-clone and re-configure
all instances of that guest type, in the process possibly losing any
customizations in those guests.  But even before dedup, the ability to
snapshot and clone datasets was an impressive dedup-like tool for the
VM use-case, just not as convenient as dedup.)

Nico
--


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Robert Milkowski
 

Citing yourself:

"The average block size for a given data block should be used as the metric
to map all other datablock sizes to. For example, the ZFS recordsize is
128kb by default. If the average block (or page) size of a directory server
is 2k, then the mismatch in size will result in degraded throughput for both
read and write operations. One of the benefits of ZFS is that you can change
the recordsize of all write operations from the time you set the new value
going forward."

And the above is not even entirely correct: if a file is already bigger than
the current value of the recordsize property, reducing recordsize won't
change the block size for that file (it will continue to use the previous
size, for example 128K). This is why you need to set recordsize to the
desired value for large files *before* you create them (or you will have to
copy them later on).

From the performance point of view it really depends on the workload, but as
you described in your blog, the default recordsize of 128K with an average
write/read of 2K will negatively impact performance for many workloads, and
lowering recordsize can potentially improve it.

Nevertheless, I was referring to dedup efficiency: lower recordsize values
should improve dedup ratios (although they will require more memory for the
DDT).
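
To make the trade-off concrete, here is a small simulation. It is a sketch under assumptions: two 1 MiB copies of synthetic data that differ in four single bytes, SHA-256 as the checksum, and the number of unique blocks used as a rough proxy for DDT entries. The helper names are made up for the example.

```python
import hashlib


def sample_data(n_bytes: int) -> bytearray:
    """Deterministic, non-repeating filler data (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n_bytes]


def dedup_stats(datasets, recordsize):
    """Return (dedup ratio, unique blocks) for fixed-size records."""
    total = 0
    unique = set()
    for data in datasets:
        for i in range(0, len(data), recordsize):
            total += 1
            unique.add(hashlib.sha256(data[i:i + recordsize]).digest())
    return total / len(unique), len(unique)


base = sample_data(1024 * 1024)
changed = bytearray(base)
for offset in (0, 300_000, 600_000, 900_000):   # four small, scattered changes
    changed[offset] ^= 0xFF

for recordsize in (128 * 1024, 8 * 1024):
    ratio, entries = dedup_stats([base, changed], recordsize)
    print(f"recordsize={recordsize // 1024}K "
          f"ratio={ratio:.2f} unique_blocks={entries}")
```

In this synthetic setup each small change invalidates a whole 128K record but only one 8K record, so the smaller recordsize gives the better ratio while tracking many more unique blocks. Real data will behave differently depending on how changes are distributed.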

 

 

 

From: [email protected]
[mailto:[email protected]] On Behalf Of Brad Diggs
Sent: 29 December 2011 15:55
To: Robert Milkowski
Cc: 'zfs-discuss discussion list'
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

 

Reducing the record size would negatively impact performance.  For the
rationale, see the section titled "Match Average I/O Block Sizes" in my
blog post on filesystem caching:

http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

 

Brad


Brad Diggs | Principal Sales Consultant | 972.814.3698

eMail: [email protected]

Tech Blog: http://TheZoneManager.com

LinkedIn: http://www.linkedin.com/in/braddiggs

 

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:





 

Try reducing recordsize to 8K or even less *before* you put any data.

This can potentially improve your dedup ratio and keep it higher after you
start modifying data.

 

 

From: [email protected]
[mailto:[email protected]] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

 

As promised, here are the findings from my testing.  I created 6 directory
server instances where the first instance has roughly 8.5GB of data.  Then
I initialized the remaining 5 instances from a binary backup of the first
instance.  Then, I rebooted the server to start off with an empty ZFS
cache.  The following table shows the increased L1ARC size, increased
search rate performance, and increased CPU% busy as I started and applied
load to each successive directory server instance.  The L1ARC cache grew a
little bit with each additional instance but largely stayed the same size.
Likewise, the ZFS dedup ratio remained the same because no data on the
directory server instances was changing.

 



However, once I started modifying the data of the replicated directory
server topology, the caching efficiency quickly diminished.  The following
table shows that the delta for each instance increased by roughly 2GB after
only 300k of changes.

 



I suspect the divergence in data as seen by ZFS deduplication most likely
occurs because deduplication occurs at the block level rather than at the
byte level.  When a write is sent to one directory server instance, the
exact same write is propagated to the other 5 instances and therefore
should be considered a duplicate.  However this was not the case.  There
could be other reasons for the divergence as well.

 

The two key takeaways from this exercise were as follows.  There is
tremendous caching potential through the use of ZFS deduplication.
However, the current block level deduplication does not benefit directory
as much as it perhaps could if deduplication occurred at the byte level
rather than the block level.  It very well could be that even byte level
deduplication doesn't work well either.  Until that option is available,
we won't know for sure.

 

Regards,

 

Brad



Brad Diggs | Principal Sales Consultant

Tech Blog: http://TheZoneManager.com

LinkedIn: http://www.linkedin.com/in/braddiggs

 

On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote:






Thanks everyone for your input on this thread.  It sounds like there is
sufficient weight behind the affirmative that I will include this
methodology into my performance analysis test plan.  If the performance
goes well, I will share some of the results when we conclude in
January/February timeframe.

Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Brad Diggs
Reducing the record size would negatively impact performance.  For the
rationale, see the section titled "Match Average I/O Block Sizes" in my
blog post on filesystem caching:

http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: [email protected]
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:

> Try reducing recordsize to 8K or even less *before* you put any data.
> This can potentially improve your dedup ratio and keep it higher after
> you start modifying data.
>
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Brad Diggs
> Sent: 28 December 2011 21:15
> To: zfs-discuss discussion list
> Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
>
> As promised, here are the findings from my testing.  I created 6
> directory server instances where the first instance has roughly 8.5GB of
> data.  Then I initialized the remaining 5 instances from a binary backup
> of the first instance.  Then, I rebooted the server to start off with an
> empty ZFS cache.  The following table shows the increased L1ARC size,
> increased search rate performance, and increased CPU% busy as I started
> and applied load to each successive directory server instance.  The
> L1ARC cache grew a little bit with each additional instance but largely
> stayed the same size.  Likewise, the ZFS dedup ratio remained the same
> because no data on the directory server instances was changing.
>
> However, once I started modifying the data of the replicated directory
> server topology, the caching efficiency quickly diminished.  The
> following table shows that the delta for each instance increased by
> roughly 2GB after only 300k of changes.
>
> I suspect the divergence in data as seen by ZFS deduplication most
> likely occurs because deduplication occurs at the block level rather
> than at the byte level.  When a write is sent to one directory server
> instance, the exact same write is propagated to the other 5 instances
> and therefore should be considered a duplicate.  However this was not
> the case.  There could be other reasons for the divergence as well.
>
> The two key takeaways from this exercise were as follows.  There is
> tremendous caching potential through the use of ZFS deduplication.
> However, the current block level deduplication does not benefit
> directory as much as it perhaps could if deduplication occurred at the
> byte level rather than the block level.  It very well could be that even
> byte level deduplication doesn't work well either.  Until that option is
> available, we won't know for sure.
>
> Regards,
>
> Brad
>
> Brad Diggs | Principal Sales Consultant
> Tech Blog: http://TheZoneManager.com
> LinkedIn: http://www.linkedin.com/in/braddiggs
>
> On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote:
>
>> Thanks everyone for your input on this thread.  It sounds like there is
>> sufficient weight behind the affirmative that I will include this
>> methodology into my performance analysis test plan.  If the performance
>> goes well, I will share some of the results when we conclude in
>> January/February timeframe.
>>
>> Regarding the great dd use case provided earlier in this thread, the L1
>> and L2 ARC detect and prevent streaming reads such as from dd from
>> populating the cache.  See my previous blog post at the web site link
>> below for a way around this protective caching control of ZFS.
>>
>> http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html
>>
>> Thanks again!
>>
>> Brad
>>
>> Brad Diggs | Principal Sales Consultant
>> Tech Blog: http://TheZoneManager.com
>> LinkedIn: http://www.linkedin.com/in/braddiggs
>>
>> On Dec 8, 2011, at 4:22 PM, Mark Musante wrote:
>>
>>> You can see the original ARC case here:
>>>
>>> http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt
>>>
>>> On 8 Dec 2011, at 16:41, Ian Collins wrote:
>>>
>>>> On 12/ 9/11 12:39 AM, Darren J Moffat wrote:
>>>>> On 12/07/11 20:48, Mertol Ozyoney wrote:
>>>>>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup
>>>>>> aware.
>>>>>>
>>>>>> The only vendor i know that can do this is Netapp
>>>>>>
>>>>>> In fact, most of our functions, like replication, are not dedup
>>>>>> aware. For example, technically it's possible to optimize our
>>>>>> replication so that it does not send data chunks if a data chunk
>>>>>> with the same checksum exists in target, without enabling dedup on
>>>>>> target and source.
>>>>>
>>>>> We already do that with 'zfs send -D':
>>>>>
>>>>>   -D  Perform dedup processing on the stream. Deduplicated
>>>>>       streams cannot be received on systems that do not
>>>>>       support the stream deduplication feature.
>>>>
>>>> Is there any more published information on how this feature works?
>>>>
>>>> --
>>>> Ian.

Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Brad Diggs
S11 FCS

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: [email protected]
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:11 AM, Robert Milkowski wrote:

> And these results are from S11 FCS I assume.
> On older builds or Illumos based distros I would expect L1 arc to grow
> much bigger.


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Brad Diggs
Jim,

You are spot on.  I was hoping that the writes would be close enough to
identical that there would be a high ratio of duplicate data since I use
the same record size, page size, compression algorithm, … etc.  However,
that was not the case.  The main thing that I wanted to prove though was
that if the data was the same the L1 ARC only caches the data that was
actually written to storage.  That is a really cool thing!  I am sure
there will be future study on this topic as it applies to other scenarios.

With regards to directory engineering investing any energy into optimizing
ODSEE DS to more effectively leverage this caching potential, that won't
happen.  OUD far outperforms ODSEE.  That said OUD may get some focus in
this area.  However, time will tell on that one.

For now, I hope everyone benefits from the little that I did validate.

Have a great day!

Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs


On Dec 29, 2011, at 4:45 AM, Jim Klimov wrote:

> Thanks for running and publishing the tests :)
> A comment on your testing technique follows, though.


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Jim Klimov

Thanks for running and publishing the tests :)

A comment on your testing technique follows, though.

2011-12-29 1:14, Brad Diggs wrote:

As promised, here are the findings from my testing. I created 6
directory server instances ...

However, once I started modifying the data of the replicated directory
server topology, the caching efficiency
quickly diminished. The following table shows that the delta for each
instance increased by roughly 2GB
after only 300k of changes.

I suspect the divergence in data as seen by ZFS deduplication most
likely occurs because deduplication occurs at the block level rather
than at the byte level. When a write is sent to one directory server
instance, the exact same write is propagated to the other 5 instances
and therefore should be considered a duplicate. However this was not
the case. There could be other reasons for the divergence as well.


Hello, Brad,

If you tested with Sun DSEE (and I have no reason to
believe other descendants of iPlanet Directory server
would work differently under the hood), then there are
two factors hindering your block-dedup gains:

1) The data is stored in the backend BerkeleyDB binary
file. In Sun DSEE7 and/or in ZFS this could also be
compressed data. Since for ZFS you dedup unique blocks,
including same data at same offsets, it is quite unlikely
you'd get the same data often enough. For example, each
database might position same userdata blocks at different
offsets due to garbage collection or whatever other
optimisation the DB might think of, making on-disk
blocks different and undedupable.

You might check whether it is possible to tune the database
to write in sector-sized to minimum-block-sized (512b/4096b)
records and to consistently use the same DSEE compression
(or lack thereof); in that case you might get more identical
blocks and win with dedup. But you'll likely lose with
compression, especially of the empty sparse structure
which a database initially is.

2) During replication each database actually becomes
unique. There are hidden records with "ns" prefix which
mark when the record was created and replicated, who
initiated it, etc. Timestamps in the data already
warrant uniqueness ;)

This might be an RFE for the DSEE team though - to keep
such volatile metadata separately from userdata. Then
your DS instances would more likely dedup well after
replication, and unique metadata would be stored
separately and stay unique. You might even keep it in
a different dataset with no dedup, then... :)

---


So, at the moment, this expectation does not hold true:
  "When a write is sent to one directory server instance,
  the exact same write is propagated to the other five
  instances and therefore should be considered a duplicate."
These writes are not exact.
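
The second factor above, and the suggested RFE, can be sketched with a toy store. Everything here is hypothetical and only illustrates the principle: the record layout, the "ns-meta" label, the 4096-byte block size and the function names are invented for the example and do not describe DSEE's real on-disk format.

```python
import hashlib

BLOCK = 4096


def block_hashes(data: bytes) -> set:
    """Checksums of the fixed-size blocks a block-level dedup would see."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}


def entry(n: int) -> bytes:
    """Hypothetical user data for entry n, identical on every replica."""
    return f"uid=user{n},ou=people".encode().ljust(4000, b".")


def meta(replica_id: int, stamp: int) -> bytes:
    """Hypothetical per-replica bookkeeping (who replicated what, and when)."""
    return f"ns-meta:{replica_id}:{stamp}".encode()


def inline_store(replica_id: int, stamp: int, count: int) -> bytes:
    """One block per entry, with the volatile metadata stored in the block."""
    return b"".join((entry(n) + meta(replica_id, stamp + n)).ljust(BLOCK, b"\0")
                    for n in range(count))


def split_store(replica_id: int, stamp: int, count: int):
    """The same entries, with the metadata kept in a separate area."""
    user = b"".join(entry(n).ljust(BLOCK, b"\0") for n in range(count))
    side = b"".join(meta(replica_id, stamp + n).ljust(64, b"\0")
                    for n in range(count))
    return user, side


a, b = inline_store(1, 1000, 8), inline_store(2, 5000, 8)
print("inline, shared blocks:", len(block_hashes(a) & block_hashes(b)))

(user_a, meta_a), (user_b, meta_b) = split_store(1, 1000, 8), split_store(2, 5000, 8)
print("split, shared blocks: ", len(block_hashes(user_a) & block_hashes(user_b)))
```

With the metadata inline, no block is shared between the two replicas. With it split out, every user-data block is shared and only the small metadata area stays unique.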

HTH,
//Jim Klimov




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-28 Thread Nico Williams
On Wed, Dec 28, 2011 at 3:14 PM, Brad Diggs  wrote:
>
> The two key takeaways from this exercise were as follows.  There is
> tremendous caching potential through the use of ZFS deduplication.
> However, the current block level deduplication does not benefit
> directory as much as it perhaps could if deduplication occurred at the
> byte level rather than the block level.  It very well could be that even
> byte level deduplication doesn't work well either.  Until that option is
> available, we won't know for sure.

How would byte-level dedup even work?  My best idea would be to apply
the rsync algorithm and then start searching for little chunks of data
with matching rsync CRCs, rolling the rsync CRC over the data until a
match is found for some chunk (which then has to be read and
compared), and so on.  The result would be incredibly slow on write
and would have huge storage overhead.  On the read side you could have
many more I/Os too, so read would get much slower as well.  I suspect
any other byte-level dedup solutions would be similarly lousy.
There'd be no real savings to be had, making the idea not worthwhile.
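
For readers unfamiliar with the rolling idea, here is a rough sketch of what such a search could look like. It is an assumption-laden toy, not a proposal: the weak checksum is a simple two-part rolling sum in the style of rsync's, SHA-256 stands in for the "read and compare" step, and the function names are invented. It also makes the cost visible: one checksum update per byte of input, plus a strong comparison on every weak match.

```python
import hashlib

M = 1 << 16


def weak_checksum(chunk: bytes):
    """Two-part rolling sum over a window (rsync-style weak checksum)."""
    a = sum(chunk) % M
    b = sum((len(chunk) - i) * x for i, x in enumerate(chunk)) % M
    return a, b


def roll(a, b, out_byte, in_byte, size):
    """Slide the window one byte: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - size * out_byte + a) % M
    return a, b


def find_chunk(haystack: bytes, chunk: bytes):
    """Return the first byte offset of chunk in haystack, or None."""
    size = len(chunk)
    if size == 0 or len(haystack) < size:
        return None
    target = weak_checksum(chunk)
    strong = hashlib.sha256(chunk).digest()
    a, b = weak_checksum(haystack[:size])
    pos = 0
    while True:
        # Cheap test first; confirm with the strong hash only on a weak match.
        if (a, b) == target and \
                hashlib.sha256(haystack[pos:pos + size]).digest() == strong:
            return pos
        if pos + size >= len(haystack):
            return None
        a, b = roll(a, b, haystack[pos], haystack[pos + size], size)
        pos += 1


chunk = bytes(range(256)) * 2
haystack = b"\x07" * 1237 + chunk + b"\x09" * 500
print(find_chunk(haystack, chunk))
```

The chunk is found at an arbitrary, unaligned offset, which is exactly what block-level dedup cannot do, but the search walks the data byte by byte.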

Dedup is for very specific use cases.  If your use case doesn't
benefit from block-level dedup, then don't bother with dedup.  (The
same applies to compression, but compression is much more likely to be
useful in general, which is why it should generally be on.)

Nico
--


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-16 Thread Robert Milkowski


> -----Original Message-----
> From: [email protected] [mailto:zfs-discuss-
> [email protected]] On Behalf Of Pawel Jakub Dawidek
> Sent: 10 December 2011 14:05
> To: Mertol Ozyoney
> Cc: [email protected]
> Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
> 
> On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
> > Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
> >
> > The only vendor I know that can do this is NetApp
> 
> And you really work at Oracle?:)
> 
> The answer is definitely yes. ARC caches on-disk blocks and dedup just
> references those blocks. When you read, dedup code is not involved at all.
> Let me show it to you with a simple test:
> 
> Create a file (dedup is on):
> 
>   # dd if=/dev/random of=/foo/a bs=1m count=1024
> 
> Copy this file so that it is deduped:
> 
>   # dd if=/foo/a of=/foo/b bs=1m
> 
> Export the pool so all cache is removed and reimport it:
> 
>   # zpool export foo
>   # zpool import foo
> 
> Now let's read one file:
> 
>   # dd if=/foo/a of=/dev/null bs=1m
>   1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)
> 
> We read file 'a' and all its blocks are in cache now. The 'b' file shares
> all the same blocks, so if ARC caches blocks only once, reading 'b' should
> be much faster:
> 
>   # dd if=/foo/b of=/dev/null bs=1m
>   1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)
> 
> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
> activity. Magic?:)


Yep. However, before Solaris 11 GA (and in Illumos) you would end up with two
copies of the blocks in the ARC, while in S11 GA the ARC keeps only one copy
of each block. This can make a big difference if more than two files are
deduped against each other and you need ARC memory to cache other data as well.
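
A toy Python model of the memory accounting, to show why it matters as the number of deduped copies grows. This is not how the ARC is implemented in any release; the 128K block size, the function name, and the idea of keying the cache either per file or by block checksum are assumptions made purely for illustration.

```python
BLOCK = 128 * 1024  # assumed block size for the toy model


def cached_bytes(files: dict, share_identical_blocks: bool) -> int:
    """Return bytes of cache used after reading every file once.

    files maps a file name to the list of checksums of its blocks.
    With share_identical_blocks=False each file's block gets its own
    cache entry; with True, identical blocks are cached only once.
    """
    cache = set()
    for name, checksums in files.items():
        for index, checksum in enumerate(checksums):
            key = checksum if share_identical_blocks else (name, index)
            cache.add(key)
    return len(cache) * BLOCK


# Three files that dedup to the same two blocks.
files = {"a": ["c0", "c1"], "b": ["c0", "c1"], "c": ["c0", "c1"]}
per_file = cached_bytes(files, share_identical_blocks=False)
shared = cached_bytes(files, share_identical_blocks=True)
```

In this model the per-file cache grows with the number of copies, while the shared cache stays at the size of the unique data.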

-- 
Robert Milkowski





Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-13 Thread Nico Williams
On Dec 11, 2011 5:12 AM, "Nathan Kroenert"  wrote:
>
>  On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:
>>
>> On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
>>>
>>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
>>>
>>> The only vendor I know that can do this is NetApp
>>
>> And you really work at Oracle?:)
>>
>> The answer is definitely yes. ARC caches on-disk blocks and dedup just
>> references those blocks. When you read, dedup code is not involved at all.
>> Let me show it to you with a simple test:
>>
>> Create a file (dedup is on):
>>
>>   # dd if=/dev/random of=/foo/a bs=1m count=1024
>>
>> Copy this file so that it is deduped:
>>
>>   # dd if=/foo/a of=/foo/b bs=1m
>>
>> Export the pool so all cache is removed and reimport it:
>>
>>   # zpool export foo
>>   # zpool import foo
>>
>> Now let's read one file:
>>
>>   # dd if=/foo/a of=/dev/null bs=1m
>>   1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)
>>
>> We read file 'a' and all its blocks are in cache now. The 'b' file
>> shares all the same blocks, so if ARC caches blocks only once, reading
>> 'b' should be much faster:
>>
>>   # dd if=/foo/b of=/dev/null bs=1m
>>   1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)
>>
>> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
>> activity. Magic?:)
>>
>
> Hey all,
>
> That reminds me of something I have been wondering about... Why only 12x
> faster? If we are effectively reading from memory - as compared to a disk
> reading at approximately 100MB/s (which is about an average PC HDD reading
> sequentially), I'd have thought it should be a lot faster than 12x.
>
> Can we really only pull stuff from cache at only a little over one
> gigabyte per second if it's dedup data?

The second file may have the same data, but not the same metadata (the
inode number, at least, must be different), so the znode for it must be read
in, and that will slow reading the copy down a bit.

Nico
--


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-13 Thread Pawel Jakub Dawidek
On Mon, Dec 12, 2011 at 08:30:56PM +0400, Jim Klimov wrote:
> 2011-12-12 19:03, Pawel Jakub Dawidek wrote:
> > As I said, ZFS reading path involves no dedup code. None at all.
> 
> I am not sure if we contradicted each other ;)
> 
> What I meant was that the ZFS reading path involves reading
> logical data blocks at some point, consulting the cache(s)
> if the block is already cached (and up-to-date). These blocks
> are addressed by some unique ID, and now with dedup there are
> several pointers to the same block.
> 
> So, basically, reading a file involves reading ZFS metadata,
> determining data block IDs, and fetching them from disk or cache.
> 
> Indeed, this does not need to be dedup-aware; but if the other
> chain of metadata blocks points to the same data or metadata blocks
> which were already cached (for whatever reason, not limited to
> dedup) - this is where the read-speed boost appears.
> Likewise, if some blocks are not cached, such as metadata
> needed to determine the second file's block IDs, this incurs
> disk IOs and may decrease overall speed.

Ok, you are right, although in this test, I believe metadata of the
other file was already prefetched. I'm using this box for something else
now, so can't retest, but the procedure is so easy that everyone is
welcome to try it:)

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-12 Thread Erik Trimble

On 12/12/2011 12:23 PM, Richard Elling wrote:
> On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:
>
>> Not exactly. What is dedup'ed is the stream only, which is in fact not very
>> efficient. Real dedup aware replication is taking the necessary steps to
>> avoid sending a block that exists on the other storage system.
>
> These exist outside of ZFS (eg rsync) and scale poorly.
>
> Given that dedup is done at the pool level and ZFS send/receive is done at
> the dataset level, how would you propose implementing a dedup-aware
> ZFS send command?
>   -- richard


I'm with Richard.

There is no practical "optimally efficient" way to dedup a stream from 
one system to another.  The only way to do so would be to have total 
information about the pool composition on BOTH the receiver and sender 
side.  That would involve sending the checksums for the complete pool 
blocks between the receiver and sender, which is a non-trivial overhead, 
and, indeed, would usually be far worse than simply doing what 'zfs send 
-D' does now (dedup the sending stream itself).  The only possible way 
that such a scheme would work would be if the receiver and sender were 
the same machine (note: not VMs or Zones on the same machine, but the 
same OS instance, since you would need the DDT to be shared).  And, 
that's not a use case that 'zfs send' is generally optimized for - that
is, while it's entirely possible, it's not the primary use case for 'zfs
send'.


Given the overhead of network communications, there's no way that 
sending block checksums between hosts can ever be more efficient than 
just sending the self-deduped whole stream (except in pedantic cases).  
Let's look at  possible implementations (all assume that the local 
sending machine does its own dedup - that is, the stream-to-be-sent is 
already deduped within itself):


(1) when constructing the stream, every time a block is read from a 
fileset (or volume), its checksum is sent to the receiving machine. The 
receiving machine then looks up that checksum in its DDT, and sends back 
a "needed" or "not-needed" reply to the sender. While this lookup is 
being done, the sender must hold the original block in RAM, and cannot 
write it out to the to-be-sent-stream.


(2) The sending machine reads all the to-be-sent blocks, creates a 
stream, AND creates a checksum table (a mini-DDT, if you will).  The 
sender communicates to the receiver this mini-DDT.  The receiver diffs 
this against its own master pool DDT, and then sends back an edited 
mini-DDT containing only the checksums that match blocks which aren't on 
the receiver.  The original sending machine must then go back and 
re-construct the stream (either as a whole, or parse the stream as it is 
being sent) to leave out the unneeded blocks.


(3) some combo of #1 and #2 where several checksums are stuffed into a 
packet, and sent over the wire to be checked at the destination, with 
the receiver sending back only those to be included in the stream.



In the first scenario, you produce a huge amount of small network packet 
traffic, which trashes network throughput, with no real expectation that 
the reduction in the send stream will be worth it.  In the second case, 
you induce a huge amount of latency into the construction of the sending 
stream - that is, the "sender" has to wait around and then spend a 
non-trivial amount of processing power on essentially double processing 
the send stream, when, in the current implementation, it just sends out 
stuff as soon as it gets it.  The third scenario is only an optimization 
of #1 and #2, and doesn't avoid the pitfalls of either.
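
For reference, scheme #3 boils down to a batched set difference. The Python sketch below is only an illustration under assumptions: `hashlib.sha256` stands in for the pool checksum, the receiver's DDT is modelled as a plain set of checksums, and every function name and the batch size are hypothetical. It shows the bookkeeping only and says nothing about the network round trips, which are the actual objection.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Stand-in for the block checksum (an assumption for this sketch)."""
    return hashlib.sha256(block).hexdigest()


def needed_checksums(offered: list, receiver_ddt: set) -> list:
    """Receiver side: of a batch of offered checksums, return the missing ones."""
    return [c for c in offered if c not in receiver_ddt]


def build_stream(blocks: list, receiver_ddt: set, batch_size: int = 2) -> list:
    """Sender side: dedup locally, offer checksums in batches, send what is asked for."""
    mini_ddt = {}  # checksum -> block, the sender's self-deduped view
    for block in blocks:
        mini_ddt.setdefault(checksum(block), block)

    offered = list(mini_ddt)
    stream = []
    for start in range(0, len(offered), batch_size):
        batch = offered[start:start + batch_size]
        # One round trip per batch in a real system; a function call here.
        for wanted in needed_checksums(batch, receiver_ddt):
            stream.append(mini_ddt[wanted])
    return stream
```

With blocks A, B, A, C on the sender and B already on the receiver, the stream carries only A and C.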


That is, even if ZFS did pool-level sends, you're still trapped by the 
need to share the DDT, which induces an overhead that can't be 
reasonably made up vs simply sending an internally-deduped source stream 
in the first place.  I'm sure I can construct an instance where such DDT 
sharing would be better than the current 'zfs send' implementation; I'm 
just as sure that such an instance would be the small minority of usage, 
and that such a required implementation would radically alter the 
"typical" use case's performance to the negative.


In any case, as 'zfs send' works on filesets and volumes, and ZFS 
maintains DDT information on a pool-level, there's no way to share an 
existing whole DDT between two systems (and, given the potential size of 
a pool-level DDT, that's a bad idea anyway).


I see no ability to optimize the 'zfs send/receive' concept beyond what 
is currently done.


-Erik


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-12 Thread Richard Elling
On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:

> Not exactly. What is dedup'ed is the stream only, which is in fact not very
> efficient. Real dedup aware replication is taking the necessary steps to
> avoid sending a block that exists on the other storage system.

These exist outside of ZFS (eg rsync) and scale poorly.

Given that dedup is done at the pool level and ZFS send/receive is done at
the dataset level, how would you propose implementing a dedup-aware
ZFS send command?
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com








Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-12 Thread Jim Klimov

2011-12-12 19:03, Pawel Jakub Dawidek wrote:
> On Sun, Dec 11, 2011 at 04:04:37PM +0400, Jim Klimov wrote:
>> I would not be surprised to see that there is some disk IO
>> adding delays for the second case (read of a deduped file
>> "clone"), because you still have to determine references
>> to this second file's blocks, and another path of on-disk
>> blocks might lead to it from a separate inode in a separate
>> dataset (or I might be wrong). Reading this second path of
>> pointers to the same cached data blocks might decrease speed
>> a little.
>
> As I said, ZFS reading path involves no dedup code. None at all.


I am not sure if we contradicted each other ;)

What I meant was that the ZFS reading path involves reading
logical data blocks at some point, consulting the cache(s)
if the block is already cached (and up-to-date). These blocks
are addressed by some unique ID, and now with dedup there are
several pointers to the same block.

So, basically, reading a file involves reading ZFS metadata,
determining data block IDs, and fetching them from disk or cache.

Indeed, this does not need to be dedup-aware; but if the other
chain of metadata blocks points to the same data or metadata blocks
which were already cached (for whatever reason, not limited to
dedup) - this is where the read-speed boost appears.
Likewise, if some blocks are not cached, such as metadata
needed to determine the second file's block IDs, this incurs
disk IOs and may decrease overall speed.

That's why I proposed redoing the test with re-reading both
files - now all relevant data and metadata would be cached
and we might see a bit faster read speed.

Just for kicks ;)

//Jim




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-12 Thread Brad Diggs
Thanks everyone for your input on this thread.  It sounds like there is
sufficient weight behind the affirmative that I will include this
methodology in my performance analysis test plan.  If the performance goes
well, I will share some of the results when we conclude in the
January/February timeframe.

Regarding the great dd use case provided earlier in this thread, the L1 and
L2 ARC detect and prevent streaming reads such as from dd from populating
the cache.  See my previous blog post at the web site link below for a way
around this protective caching control of ZFS.

http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html

Thanks again!

Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 8, 2011, at 4:22 PM, Mark Musante wrote:
> You can see the original ARC case here:
>
> http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt
>
> On 8 Dec 2011, at 16:41, Ian Collins wrote:
>> On 12/ 9/11 12:39 AM, Darren J Moffat wrote:
>>> On 12/07/11 20:48, Mertol Ozyoney wrote:
>>>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
>>>>
>>>> The only vendor I know that can do this is NetApp
>>>>
>>>> In fact, most of our functions, like replication, are not dedup aware.
>>>> For example, technically it's possible to optimize our replication so that
>>>> it does not send data chunks if a data chunk with the same checksum
>>>> exists in target, without enabling dedup on target and source.
>>> We already do that with 'zfs send -D':
>>>
>>>   -D
>>>
>>>   Perform dedup processing on the stream. Deduplicated
>>>   streams  cannot  be  received on systems that do not
>>>   support the stream deduplication feature.
>>>
>> Is there any more published information on how this feature works?
>>
>> --
>> Ian.


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-12 Thread Mertol Ozyoney
I am almost sure that blocks in the cache are still hydrated (that is, not
deduplicated). There is an outstanding RFE for this and, while I am not
sure, I think this feature will be implemented sooner or later. In theory
the benefit will be small, as most dedup'ed shares are used for archive
purposes...

PS: NetApp systems do have significantly bigger problems in the caching
department, like having virtually no L1 cache. However, it's also my duty to
know where they have an advantage...

Br
Mertol 
 
 
Mertol Özyöney | Storage Sales
Mobile: +90 533 931 0752
Email: [email protected]







On 12/10/11 4:05 PM, "Pawel Jakub Dawidek"  wrote:

>On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
>> 
>> The only vendor I know that can do this is NetApp
>
>And you really work at Oracle?:)
>
>The answer is definitely yes. ARC caches on-disk blocks and dedup just
>references those blocks. When you read, dedup code is not involved at all.
>Let me show it to you with a simple test:
>
>Create a file (dedup is on):
>
>   # dd if=/dev/random of=/foo/a bs=1m count=1024
>
>Copy this file so that it is deduped:
>
>   # dd if=/foo/a of=/foo/b bs=1m
>
>Export the pool so all cache is removed and reimport it:
>
>   # zpool export foo
>   # zpool import foo
>
>Now let's read one file:
>
>   # dd if=/foo/a of=/dev/null bs=1m
>   1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)
>
>We read file 'a' and all its blocks are in cache now. The 'b' file
>shares all the same blocks, so if ARC caches blocks only once, reading
>'b' should be much faster:
>
>   # dd if=/foo/b of=/dev/null bs=1m
>   1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)
>
>Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
>activity. Magic?:)
>
>-- 
>Pawel Jakub Dawidek   http://www.wheelsystems.com
>FreeBSD committer http://www.FreeBSD.org
>Am I Evil? Yes, I Am! http://yomoli.com


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-12 Thread Mertol Ozyoney
Not exactly. What is dedup'ed is the stream only, which is in fact not very
efficient. Real dedup-aware replication takes the necessary steps to
avoid sending a block that already exists on the other storage system.


 
 
Mertol Özyöney | Storage Sales
Mobile: +90 533 931 0752
Email: [email protected]







On 12/8/11 1:39 PM, "Darren J Moffat"  wrote:

>On 12/07/11 20:48, Mertol Ozyoney wrote:
>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
>>
>> The only vendor I know that can do this is NetApp
>>
>> In fact, most of our functions, like replication, are not dedup aware.
>
>> For example, technically it's possible to optimize our replication so that
>> it does not send data chunks if a data chunk with the same checksum
>> exists in target, without enabling dedup on target and source.
>
>We already do that with 'zfs send -D':
>
>  -D
>
>  Perform dedup processing on the stream. Deduplicated
>  streams  cannot  be  received on systems that do not
>  support the stream deduplication feature.
>
>
>
>
>-- 
>Darren J Moffat




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-12 Thread Pawel Jakub Dawidek
On Sun, Dec 11, 2011 at 04:04:37PM +0400, Jim Klimov wrote:
> I would not be surprised to see that there is some disk IO
> adding delays for the second case (read of a deduped file
> "clone"), because you still have to determine references
> to this second file's blocks, and another path of on-disk
> blocks might lead to it from a separate inode in a separate
> dataset (or I might be wrong). Reading this second path of
> pointers to the same cached data blocks might decrease speed
> a little.

As I said, ZFS reading path involves no dedup code. None at all.
The proof would be being able to boot from ZFS with dedup turned on
even though ZFS boot code has 0 dedup code in it. Another proof would be
ZFS source code.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Gary Driggs
What kind of drives are we talking about? Even SATA drives are
available according to application type (desktop, enterprise server,
home PVR, surveillance PVR, etc). Then there are drives with SAS and
Fibre Channel interfaces. Then you've got Winchester platters vs SSD
vs hybrids. But even before considering that and all the other system
factors, throughput for direct-attached storage can vary greatly, not
only with interface type and storage technology; even small differences
in on-drive controller firmware could potentially introduce variance.
That's why server manufacturers like HP, Dell, et al. prefer that you
replace failed drives with one of theirs instead of something off the
shelf: theirs usually have firmware that's been fine-tuned in
house or in conjunction with the manufacturer.


On Dec 11, 2011, at 8:25 AM, Edward Ned Harvey wrote:

>> From: [email protected] [mailto:zfs-discuss-
>> [email protected]] On Behalf Of Nathan Kroenert
>>
>> That reminds me of something I have been wondering about... Why only 12x
>> faster? If we are effectively reading from memory - as compared to a
>> disk reading at approximately 100MB/s (which is about an average PC HDD
>> reading sequentially), I'd have thought it should be a lot faster than
>> 12x.
>>
>> Can we really only pull stuff from cache at only a little over one
>> gigabyte per second if it's dedup data?
>
> Actually, CPUs and memory aren't as fast as you might think.  In a system
> with 12 disks, I've had to write my own "dd" replacement, because "dd
> if=/dev/zero bs=1024k" wasn't fast enough to keep the disks busy.  Later, I
> wanted to do something similar, using unique data, and it was simply
> impossible to generate random data fast enough.  I had to tweak my "dd"
> replacement to write serial numbers, which still wasn't fast enough, so I
> had to tweak my "dd" replacement to write a big block of static data,
> followed by a serial number, followed by another big block (always smaller
> than the disk block, so it would be treated as unique when hitting the
> pool...)
>
> 1 typical disk sustains 1Gbit/sec.  In theory, 12 should be able to sustain
> 12 Gbit/sec.  According to Nathan's email, the memory bandwidth might be 25
> Gbit, of which, you probably need to both read & write, thus making it
> effectively 12.5 Gbit...  I'm sure the actual bandwidth available varies by
> system and memory type.
>


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Edward Ned Harvey
> From: [email protected] [mailto:zfs-discuss-
> [email protected]] On Behalf Of Nathan Kroenert
> 
> That reminds me of something I have been wondering about... Why only 12x
> faster? If we are effectively reading from memory - as compared to a
> disk reading at approximately 100MB/s (which is about an average PC HDD
> reading sequentially), I'd have thought it should be a lot faster than
> 12x.
> 
> Can we really only pull stuff from cache at only a little over one
> gigabyte per second if it's dedup data?

Actually, CPUs and memory aren't as fast as you might think.  In a system
with 12 disks, I've had to write my own "dd" replacement, because "dd
if=/dev/zero bs=1024k" wasn't fast enough to keep the disks busy.  Later, I
wanted to do something similar, using unique data, and it was simply
impossible to generate random data fast enough.  I had to tweak my "dd"
replacement to write serial numbers, which still wasn't fast enough, so I
had to tweak my "dd" replacement to write a big block of static data,
followed by a serial number, followed by another big block (always smaller
than the disk block, so it would be treated as unique when hitting the
pool...)

1 typical disk sustains 1Gbit/sec.  In theory, 12 should be able to sustain
12 Gbit/sec.  According to Nathan's email, the memory bandwidth might be 25
Gbit, of which, you probably need to both read & write, thus making it
effectively 12.5 Gbit...  I'm sure the actual bandwidth available varies by
system and memory type.
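
The back-of-envelope numbers above, spelled out as a small Python sketch. The figures are the rough ones from this message (1 Gbit/sec per disk, 25 Gbit of memory bandwidth halved for read plus write), not measurements, and the variable names are illustrative.

```python
def gbit_to_mbyte_per_sec(gbit_per_sec: float) -> float:
    """Convert Gbit/sec to MByte/sec using decimal units (1 Gbit = 1000 Mbit)."""
    return gbit_per_sec * 1000 / 8


disks = 12
per_disk_gbit = 1.0      # rough sustained rate of one typical disk
memory_gbit = 25.0       # rough memory bandwidth figure quoted in the thread

aggregate_disk_gbit = disks * per_disk_gbit   # what 12 disks could sustain
effective_memory_gbit = memory_gbit / 2       # data is both read and written

headroom_gbit = effective_memory_gbit - aggregate_disk_gbit
```

On these assumptions the memory path has only about half a Gbit/sec of headroom over what the 12 disks could absorb, which is why generating data fast enough becomes the hard part.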



Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Jim Klimov

2011-12-11 15:10, Nathan Kroenert wrote:
> Hey all,
>
> That reminds me of something I have been wondering about... Why only 12x
> faster? If we are effectively reading from memory - as compared to a
> disk reading at approximately 100MB/s (which is about an average PC HDD
> reading sequentially), I'd have thought it should be a lot faster than 12x.
>
> Can we really only pull stuff from cache at only a little over one
> gigabyte per second if it's dedup data?



I believe there are a couple of things in play.

One is that you'd rarely get 100MB/s from a single HDD
due to fragmentation, which is especially inherent to ZFS. But you do
mention "sequential reading", so that's covered.

Besides, from Pawel's dd examples we see that he first read
at 98MB/s on average, and then at 1233MB/s.

Another aspect is the RAM bandwidth, and we don't know the
specs of Pawel's test rig. For example, 100MHz DDR2 would
peak out at 3200MB/s. That would include walking the
(cached) DDT tree for each block involved, determining which
(cached) data blocks correspond to it, and fetching them
from RAM or disk.

I would not be surprised to see that there is some disk IO
adding delays for the second case (read of a deduped file
"clone"), because you still have to determine references
to this second file's blocks, and another path of on-disk
blocks might lead to it from a separate inode in a separate
dataset (or I might be wrong). Reading this second path of
pointers to the same cached data blocks might decrease speed
a little.

It would be interesting to see Pawel's test updated with
second reads of both files (now that data and metadata are
all cached in RAM). It's possible that NOW reads would be
closer to RAM speeds with no disk IO. And I would be very
surprised if speeds would be noticeably different ;)

//Jim


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Nathan Kroenert

On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:
> On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
>>
>> The only vendor I know that can do this is NetApp
>
> And you really work at Oracle?:)
>
> The answer is definitely yes. ARC caches on-disk blocks and dedup just
> references those blocks. When you read, dedup code is not involved at all.
> Let me show it to you with a simple test:
>
> Create a file (dedup is on):
>
>   # dd if=/dev/random of=/foo/a bs=1m count=1024
>
> Copy this file so that it is deduped:
>
>   # dd if=/foo/a of=/foo/b bs=1m
>
> Export the pool so all cache is removed and reimport it:
>
>   # zpool export foo
>   # zpool import foo
>
> Now let's read one file:
>
>   # dd if=/foo/a of=/dev/null bs=1m
>   1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)
>
> We read file 'a' and all its blocks are in cache now. The 'b' file
> shares all the same blocks, so if ARC caches blocks only once, reading
> 'b' should be much faster:
>
>   # dd if=/foo/b of=/dev/null bs=1m
>   1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)
>
> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
> activity. Magic?:)



Hey all,

That reminds me of something I have been wondering about... Why only 12x 
faster? If we are effectively reading from memory - as compared to a 
disk reading at approximately 100MB/s (which is about an average PC HDD 
reading sequentially), I'd have thought it should be a lot faster than 12x.


Can we really only pull stuff from cache at only a little over one 
gigabyte per second if it's dedup data?


Cheers!

Nathan.




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-10 Thread Pawel Jakub Dawidek
On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
> 
> The only vendor I know that can do this is NetApp

And you really work at Oracle?:)

The answer is definitely yes. ARC caches on-disk blocks and dedup just
references those blocks. When you read, dedup code is not involved at all.
Let me show it to you with a simple test:

Create a file (dedup is on):

# dd if=/dev/random of=/foo/a bs=1m count=1024

Copy this file so that it is deduped:

# dd if=/foo/a of=/foo/b bs=1m

Export the pool so all cache is removed and reimport it:

# zpool export foo
# zpool import foo

Now let's read one file:

# dd if=/foo/a of=/dev/null bs=1m
1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)

We read file 'a' and all its blocks are in cache now. The 'b' file
shares all the same blocks, so if ARC caches blocks only once, reading
'b' should be much faster:

# dd if=/foo/b of=/dev/null bs=1m
1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)

Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
activity. Magic?:)
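
The 12.5x figure follows directly from the two dd lines. A short Python check, using only the byte count and elapsed times printed above (variable names are illustrative):

```python
size_bytes = 1073741824      # bytes transferred, from both dd runs
secs_a = 10.855750           # cold read of file 'a'
secs_b = 0.870501            # read of file 'b' with shared blocks cached

rate_a = size_bytes / secs_a     # bytes/sec for 'a', close to what dd printed
rate_b = size_bytes / secs_b     # bytes/sec for 'b', close to what dd printed
speedup = secs_a / secs_b        # how many times faster 'b' was read

print(f"a: {rate_a / 1e6:.0f} MB/s, b: {rate_b / 1e6:.0f} MB/s, "
      f"speedup: {speedup:.1f}x")
```

The computed rates differ from dd's by a few bytes per second at most, because dd prints the elapsed time rounded to microseconds.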

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-08 Thread Mark Musante

You can see the original ARC case here:

http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt

On 8 Dec 2011, at 16:41, Ian Collins wrote:

> On 12/ 9/11 12:39 AM, Darren J Moffat wrote:
>> On 12/07/11 20:48, Mertol Ozyoney wrote:
>>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
>>> 
>>> The only vendor I know that can do this is NetApp
>>> 
>>> In fact, most of our functions, like replication, are not dedup aware.
>>> For example, technically it's possible to optimize our replication so that
>>> it does not send data chunks if a data chunk with the same checksum
>>> exists in target, without enabling dedup on target and source.
>> We already do that with 'zfs send -D':
>> 
>>   -D
>> 
>>   Perform dedup processing on the stream. Deduplicated
>>   streams  cannot  be  received on systems that do not
>>   support the stream deduplication feature.
>> 
>> 
>> 
>> 
> Is there any more published information on how this feature works?
> 
> -- 
> Ian.
> 


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-08 Thread Ian Collins

On 12/ 9/11 12:39 AM, Darren J Moffat wrote:
> On 12/07/11 20:48, Mertol Ozyoney wrote:
>> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.
>>
>> The only vendor I know that can do this is NetApp
>>
>> In fact, most of our functions, like replication, are not dedup aware.
>> For example, technically it's possible to optimize our replication so that
>> it does not send data chunks if a data chunk with the same checksum
>> exists in target, without enabling dedup on target and source.
>
> We already do that with 'zfs send -D':
>
>    -D
>
>    Perform dedup processing on the stream. Deduplicated
>    streams  cannot  be  received on systems that do not
>    support the stream deduplication feature.





Is there any more published information on how this feature works?

--
Ian.



Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-08 Thread Edward Ned Harvey
> From: [email protected] [mailto:zfs-discuss-
> [email protected]] On Behalf Of Mertol Ozyoney
> Sent: Wednesday, December 07, 2011 3:49 PM
> To: Brad Diggs
> Cc: [email protected]
> Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
> 
> Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware.

I haven't read the code, but I can reference experimental results that seem to
defy that statement...

If you time writing a large data stream of completely duplicated data to disk
without dedup...
And time reading it back...  It takes the same amount of time.

But,
If you enable dedup and repeat the same test, it goes much faster.  Depending
on a lot of variables, it might be 2x-12x faster.
To me, "significantly faster than disk speed" can only mean it's benefiting
from cache.
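
For anyone wanting to repeat this kind of measurement, here is a minimal
sketch of the timing test described above. It is not the script used for the
numbers quoted; the file name, block size and block count are assumptions
chosen for illustration. It writes the same block over and over (fully
duplicated data), forces it to disk, then times reading it back. Run it once
on a dataset with dedup off and once with dedup on, and compare. Note that a
file smaller than RAM may be read back from cache in both cases, so a
meaningful run needs a file much larger than memory, or an export/import of
the pool between the write and the read.

```python
import os
import tempfile
import time


def time_write_read(path, block, count):
    """Write `count` copies of `block` to `path`, then read the file back.

    Returns (write_seconds, read_seconds).
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the application and OS buffers
    write_seconds = time.perf_counter() - start

    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(len(block))
            if not chunk:
                break
            total += len(chunk)
    read_seconds = time.perf_counter() - start

    assert total == len(block) * count  # sanity check: everything was read back
    return write_seconds, read_seconds


if __name__ == "__main__":
    # Hypothetical location: point this at a directory on the dataset under test.
    target_dir = tempfile.gettempdir()
    path = os.path.join(target_dir, "dedup_timing.bin")
    # One 128 KiB block of random bytes, repeated: every block is a duplicate.
    block = os.urandom(128 * 1024)
    try:
        w, r = time_write_read(path, block, count=256)  # 32 MiB; scale up for a real test
        print(f"write {w:.3f}s  read {r:.3f}s")
    finally:
        if os.path.exists(path):
            os.remove(path)
```

The 128 KiB block size is chosen to line up with the default ZFS recordsize,
since dedup only matches whole, identically aligned blocks.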



Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-08 Thread Darren J Moffat

On 12/07/11 20:48, Mertol Ozyoney wrote:
> Unfortunately the answer is no. Neither L1 nor L2 cache is dedup aware.
>
> The only vendor I know that can do this is NetApp.
>
> In fact, most of our functions, like replication, are not dedup aware.
>
> For example, technically it's possible to optimize our replication so that
> it does not send data chunks if a data chunk with the same checksum
> exists in the target, without enabling dedup on target and source.

We already do that with 'zfs send -D':

 -D

 Perform dedup processing on the stream. Deduplicated
 streams  cannot  be  received on systems that do not
 support the stream deduplication feature.
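
To make the idea of a deduplicated stream concrete, here is a toy sketch. It
is not the actual 'zfs send' stream format; the record layout ("data" and
"ref" tuples) and the function names are invented for illustration, and
SHA-256 is an assumed checksum. The sender emits a block's payload only the
first time it appears in the stream and a short reference afterwards. The
receiver has to understand reference records to rebuild the data, which is
one way to read the man page's warning about systems that lack the feature.

```python
import hashlib


def dedup_stream(blocks):
    """Yield ("data", digest, block) the first time a block is seen,
    and ("ref", digest) for every later occurrence in the same stream."""
    seen = set()
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest in seen:
            yield ("ref", digest)
        else:
            seen.add(digest)
            yield ("data", digest, block)


def rebuild(records):
    """Receiver side: expand references back into full blocks, in order."""
    table = {}
    for record in records:
        if record[0] == "data":
            _, digest, block = record
            table[digest] = block
            yield block
        else:
            yield table[record[1]]


# Four logical blocks, only two of them unique.
blocks = [b"A" * 8, b"B" * 8, b"A" * 8, b"A" * 8]
records = list(dedup_stream(blocks))
payload_sent = sum(len(r[2]) for r in records if r[0] == "data")
```

In this example only 16 of the 32 logical bytes travel as payload. Note that
this deduplicates within one stream; it does not consult what the target
already holds.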




--
Darren J Moffat


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-07 Thread Jim Klimov

It was my understanding that both dedup and caching work at the
block level. So if you have identical on-disk blocks (same
original data after the same compression and encryption), they
turn into one(*) on-disk block with several references from
the DDT. And that one block is only cached once, saving ARC space.

* (Technically, for very often referenced blocks there are a
number of copies, controlled by the ditto attribute).
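
The reasoning above can be sketched as a toy model. This is not the ARC
implementation and has not been checked against the ZFS source; the class,
the file names and the block sizes are all hypothetical. The only point it
makes is that a cache keyed by on-disk block address, rather than by file
and offset, holds one copy of a block no matter how many files point at it.

```python
class BlockCache:
    """Toy cache keyed by on-disk block address, not by (file, offset)."""

    def __init__(self):
        self.cached = {}      # block address -> block data
        self.disk_reads = 0   # number of cache misses that went to disk

    def read(self, disk, address):
        if address not in self.cached:
            self.disk_reads += 1
            self.cached[address] = disk[address]
        return self.cached[address]


# Hypothetical pool contents after dedup: 4 unique 16-byte blocks.
disk = {addr: bytes([addr]) * 16 for addr in range(4)}

# Ten files whose block pointers all reference the same 4 deduplicated blocks.
files = {f"copy{i}": [0, 1, 2, 3] for i in range(10)}

cache = BlockCache()
for name, pointers in files.items():
    for address in pointers:
        cache.read(disk, address)

cached_bytes = sum(len(data) for data in cache.cached.values())
logical_bytes = sum(len(disk[a]) for ptrs in files.values() for a in ptrs)
```

Under this model the ten readers touch 640 logical bytes, but the cache ends
up holding 64 bytes after 4 disk reads. Whether the real ARC behaves this
way is exactly what is being debated in this thread.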

HTH,
//Jim Klimov


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-07 Thread Mertol Ozyoney
Unfortunately the answer is no. Neither L1 nor L2 cache is dedup aware.

The only vendor I know that can do this is NetApp.

In fact, most of our functions, like replication, are not dedup aware.

However, we have a significant advantage in that ZFS keeps checksums regardless
of whether dedup is on or off. So, in the future we can perhaps make functions
more dedup friendly regardless of dedup being enabled or not.

For example, technically it's possible to optimize our replication so that it
does not send data chunks if a data chunk with the same checksum exists in the
target, without enabling dedup on target and source.
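
A minimal sketch of that replication idea, under stated assumptions: it is
not how 'zfs send' works, the function names are hypothetical, and SHA-256
stands in for whatever block checksum the pool uses (a weaker checksum would
need a byte-for-byte verify step before trusting a match). The sender
compares each block's checksum against the set of checksums the target
reports and ships only the blocks the target lacks.

```python
import hashlib


def checksum(block):
    """Assumed block checksum for this sketch."""
    return hashlib.sha256(block).digest()


def plan_send(source_blocks, target_checksums):
    """Return (to_send, skipped): the (digest, block) pairs the target lacks,
    and how many blocks were skipped because the target already has them."""
    known = set(target_checksums)
    to_send, skipped = [], 0
    for block in source_blocks:
        digest = checksum(block)
        if digest in known:
            skipped += 1
        else:
            to_send.append((digest, block))
            known.add(digest)  # the target will hold it once this one arrives
    return to_send, skipped


# The target already holds b"old"; the source wants to replicate three blocks.
target = {checksum(b"old")}
to_send, skipped = plan_send([b"old", b"new", b"new"], target)
```

Here one block is sent and two are skipped. A real implementation would also
have to tell the receiver where each skipped block belongs, which this sketch
leaves out.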

Best regards
Mertol

Sent from a mobile device 

Mertol Ozyoney

On 07 Dec 2011, at 20:46, Brad Diggs  wrote:

> Hello,
> 
> I have a hypothetical question regarding ZFS deduplication.  Does the L1ARC
> cache benefit from deduplication
> in the sense that the L1ARC will only need to cache one copy of the
> deduplicated data versus many copies?
> Here is an example:
> 
> Imagine that I have a server with 2TB of RAM and a PB of disk storage.  On
> this server I create a single 1TB
> data file that is full of unique data.  Then I make 9 copies of that file,
> giving each file a unique name and
> location within the same ZFS zpool.  If I start up 10 application instances
> where each application reads all of
> its own unique copy of the data, will the L1ARC contain only the deduplicated
> data or will it cache separate
> copies of the data from each file?  In simpler terms, will the L1ARC require
> 10TB of RAM or just 1TB of RAM to
> cache all 10 1TB files' worth of data?
> 
> My hope is that since the data only physically occupies 1TB of storage via 
> deduplication that the L1ARC
> will also only require 1TB of RAM for the data.
> 
> Note that I know the deduplication table will use the L1ARC as well.  
> However, the focus of my question
> is on how the L1ARC would benefit from a data caching standpoint.
> 
> Thanks in advance!
> 
> Brad
> 
> Brad Diggs | Principal Sales Consultant
> Tech Blog: http://TheZoneManager.com
> LinkedIn: http://www.linkedin.com/in/braddiggs
> 