Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Brad Diggs
Reducing the record size would negatively impact performance. For the rationale, see the section titled "Match Average I/O Block Sizes" in my blog post on filesystem caching: http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad
Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: brad.di...@oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:

Try reducing recordsize to 8K or even less *before* you put any data.
This can potentially improve your dedup ratio and keep it higher after you
start modifying data.

[...]

On Dec 8, 2011, at 4:22 PM, Mark Musante wrote:

You can see the original ARC case here:
http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt

On 8 Dec 2011, at 16:41, Ian Collins wrote:

On 12/ 9/11 12:39 AM, Darren J Moffat wrote:

On 12/07/11 20:48, Mertol Ozyoney wrote:

Unfortunately the answer is no.  Neither L1 nor L2 cache is dedup aware.
The only vendor I know that can do this is Netapp.  In fact, most of our
functions, like replication, are not dedup aware.  For example, technically
it's possible to optimize our replication so that it does not send data
chunks if a data chunk with the same checksum exists on the target, without
enabling dedup on target and source.

We already do that with 'zfs send -D':

-D  Perform dedup processing on the stream.  Deduplicated streams cannot be
received on systems that do not support the stream deduplication feature.

Is there any more published information on how this feature works?
--
Ian.

Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Brad Diggs
Jim,

You are spot on. I was hoping that the writes would be close enough to
identical that there would be a high ratio of duplicate data since I use the
same record size, page size, compression algorithm, … etc. However, that was
not the case. The main thing that I wanted to prove, though, was that if the
data was the same, the L1 ARC only caches the data that was actually written
to storage. That is a really cool thing! I am sure there will be future study
on this topic as it applies to other scenarios.

With regards to directory engineering investing any energy into optimizing
ODSEE DS to more effectively leverage this caching potential, that won't
happen. OUD far outperforms ODSEE. That said, OUD may get some focus in this
area. However, time will tell on that one.

For now, I hope everyone benefits from the little that I did validate.

Have a great day!

Brad
Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs


On Dec 29, 2011, at 4:45 AM, Jim Klimov wrote:

[...]

So, at the moment, this expectation does not hold true: "When a write is
sent to one directory server instance, the exact same write is propagated to
the other five instances and therefore should be considered a duplicate."
These writes are not exact.

HTH,
//Jim Klimov
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Robert Milkowski
 

Citing yourself:

 

The average block size for a given data block should be used as the metric
to map all other datablock sizes to. For example, the ZFS recordsize is
128kb by default. If the average block (or page) size of a directory server
is 2k, then the mismatch in size will result in degraded throughput for both
read and write operations. One of the benefits of ZFS is that you can change
the recordsize of all write operations from the time you set the new value
going forward.



 

And the above is not even entirely correct: if a file is already bigger than
the current value of the recordsize property, reducing recordsize won't
change the block size used for that file (it will continue to use the
previous size, for example 128K). This is why you need to set recordsize to
the desired value for large files *before* you create them (or you will have
to copy them later on).
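
As a concrete illustration (the pool and dataset names here are just
hypothetical examples):

  # set the desired recordsize on the (still empty) dataset first...
  zfs create -o recordsize=8K tank/ldap
  # ...then copy the data in, so the new files are written with 8K blocks
  cp -rp /backup/ldap/* /tank/ldap/
  # files created before the property change keep their old block size
  zfs get recordsize tank/ldap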

 

From the performance point of view it really depends on the workload, but as
you described in your blog, the default recordsize of 128K with an average
read/write of 2K will negatively impact performance for many workloads, and
lowering recordsize can potentially improve it.

 

Nevertheless, I was referring to dedup efficiency: lower recordsize values
should improve dedup ratios (although they will require more memory for the
DDT).
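
If you want to gauge that trade-off before enabling it, zdb can simulate
dedup on an existing pool and print the projected DDT histogram and ratio
(pool name hypothetical):

  zdb -S tank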

 

 

 

From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 29 December 2011 15:55
To: Robert Milkowski
Cc: 'zfs-discuss discussion list'
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

 

Reducing the record size would negatively impact performance.  For the
rationale, see the section titled "Match Average I/O Block Sizes" in my blog
post on filesystem caching:

http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

 

Brad


Brad Diggs | Principal Sales Consultant | 972.814.3698

eMail: brad.di...@oracle.com

Tech Blog: http://TheZoneManager.com

LinkedIn: http://www.linkedin.com/in/braddiggs

 

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:





 

Try reducing recordsize to 8K or even less *before* you put any data.

This can potentially improve your dedup ratio and keep it higher after you
start modifying data.

 

 

From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

 

As promised, here are the findings from my testing.  I created 6 directory
server instances where the first instance has roughly 8.5GB of data.  Then I
initialized the remaining 5 instances from a binary backup of the first
instance.  Then, I rebooted the server to start off with an empty ZFS cache.
The following table shows the increased L1ARC size, increased search rate
performance, and increased CPU% busy as load was started and applied to each
successive directory server instance.  The L1ARC cache grew a little bit with
each additional instance but largely stayed the same size.  Likewise, the ZFS
dedup ratio remained the same because no data on the directory server
instances was changing.
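
(For what it's worth, the dedup ratio and the L1ARC footprint in a run like
this can be watched with standard tools; the pool name is hypothetical:)

  zpool get dedupratio tank
  kstat -p zfs:0:arcstats:size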

 

image001.png

However, once I started modifying the data of the replicated directory
server topology, the caching efficiency quickly diminished.  The following
table shows that the delta for each instance increased by roughly 2GB after
only 300k of changes.

 

image002.png

I suspect the divergence in data as seen by ZFS deduplication most likely
occurs because deduplication occurs at the block level rather than at the
byte level.  When a write is sent to one directory server instance, the exact
same write is propagated to the other 5 instances and therefore should be
considered a duplicate.  However, this was not the case.  There could be
other reasons for the divergence as well.

 

The two key takeaways from this exercise were as follows.  There is
tremendous caching potential through the use of ZFS deduplication.  However,
the current block-level deduplication does not benefit the directory server
as much as it perhaps could if deduplication occurred at the byte level
rather than the block level.  It could very well be that byte-level
deduplication wouldn't work much better.  Until that option is available, we
won't know for sure.

 

Regards,

 

Brad

image003.png

Brad Diggs | Principal Sales Consultant

Tech Blog: http://TheZoneManager.com

LinkedIn: http://www.linkedin.com/in/braddiggs

 

On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote:






Thanks everyone for your input on this thread.  It sounds like there is
sufficient weight behind the affirmative that I will include this
methodology into my performance analysis test plan.  If the performance goes
well, I will share some of the results when we conclude in the
January/February timeframe.

 

Regarding the great dd use case provided earlier in this thread, the L1 and
L2 ARC detect and prevent streaming reads such as from dd from populating
the cache.  See my previous blog post at the link below for a way around
this protective caching control of ZFS.

http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html

Thanks again!

Brad

Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Jim Klimov

Thanks for running and publishing the tests :)

A comment on your testing technique follows, though.

2011-12-29 1:14, Brad Diggs wrote:

As promised, here are the findings from my testing. I created 6
directory server instances ...

However, once I started modifying the data of the replicated directory
server topology, the caching efficiency
quickly diminished. The following table shows that the delta for each
instance increased by roughly 2GB
after only 300k of changes.

I suspect the divergence in data as seen by ZFS deduplication most
likely occurs because deduplication
occurs at the block level rather than at the byte level. When a write is
sent to one directory server instance,
the exact same write is propagated to the other 5 instances and
therefore should be considered a duplicate.
However this was not the case. There could be other reasons for the
divergence as well.


Hello, Brad,

If you tested with Sun DSEE (and I have no reason to
believe other descendants of iPlanet Directory server
would work differently under the hood), then there are
two factors hindering your block-dedup gains:

1) The data is stored in the backend BerkeleyDB binary
file. In Sun DSEE7 and/or in ZFS this could also be
compressed data. Since ZFS dedups only whole blocks, the same
userdata has to land at the same offsets in identically sized blocks,
and it is quite unlikely you'd get that often enough. For example, each
database might position same userdata blocks at different
offsets due to garbage collection or whatever other
optimisation the DB might think of, making on-disk
blocks different and undedupable.

You might look at whether it is possible to tune the database
to write in sector-sized to minimum-block-sized (512b/4096b)
records and to consistently use the same DSEE compression
(or lack thereof) - in this case you might get more same
blocks and win with dedup. But you'll likely lose with
compression, especially of the empty sparse structure
which a database initially is.
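
On the ZFS side that kind of alignment experiment might look like this
(dataset name and page size are hypothetical; the DSEE/BDB page size would
have to be tuned to match):

  # match the dataset recordsize to the DB page size and keep compression
  # settings constant, so identical pages can produce identical blocks
  zfs create -o recordsize=4K -o compression=off tank/dsee/db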

2) During replication each database actually becomes
unique. There are hidden records with an "ns" prefix which
mark when the record was created and replicated, who
initiated it, etc. Timestamps in the data already
warrant uniqueness ;)

This might be an RFE for the DSEE team though - to keep
such volatile metadata separately from userdata. Then
your DS instances would more likely dedup well after
replication, and unique metadata would be stored
separately and stay unique. You might even keep it in
a different dataset with no dedup, then... :)
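
If such a separation existed, the layout could be as simple as
(hypothetical dataset names):

  zfs create -o dedup=on  tank/dsee/userdata
  zfs create -o dedup=off tank/dsee/replmeta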

---


So, at the moment, this expectation does not hold true:
  When a write is sent to one directory server instance,
  the exact same write is propagated to the other five
  instances and therefore should be considered a duplicate.
These writes are not exact.

HTH,
//Jim Klimov


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Brad Diggs
S11 FCS

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: brad.di...@oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:11 AM, Robert Milkowski wrote:

And these results are from S11 FCS I assume.
On older builds or Illumos-based distros I would expect the L1 ARC to grow
much bigger.

[...]
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 9:53 AM, Brad Diggs brad.di...@oracle.com wrote:
 Jim,

 You are spot on.  I was hoping that the writes would be close enough to 
 identical that
 there would be a high ratio of duplicate data since I use the same record 
 size, page size,
 compression algorithm, … etc.  However, that was not the case.  The main 
 thing that I
 wanted to prove though was that if the data was the same the L1 ARC only 
 caches the
 data that was actually written to storage.  That is a really cool thing!  I 
 am sure there will
 be future study on this topic as it applies to other scenarios.

 With regards to directory engineering investing any energy into optimizing 
 ODSEE DS
 to more effectively leverage this caching potential, that won't happen.  OUD 
 far out
 performs ODSEE.  That said OUD may get some focus in this area.  However, 
 time will
 tell on that one.

Databases are not as likely to benefit from dedup as virtual machines;
indeed, DBs are likely not to benefit at all from dedup.  The VM use
case benefits from dedup for the obvious reason that many VMs will
have the same exact software installed most of the time, using the
same filesystems, and the same patch/update installation order, so if
you keep data out of their root filesystems then you can expect
enormous deduplicatiousness.  But databases, not so much.  The unit of
deduplicable data in a VM use case is the guest's preferred block
size, while in a DB the unit of deduplicable data might be a
variable-sized table row, or even smaller: a single row/column value
-- and you have no way to ensure alignment of individual deduplicable
units nor ordering of sets of deduplicable units into larger ones.

When it comes to databases your best bets will be: a) database-level
compression or dedup features (e.g., Oracle's column-level compression
feature) or b) ZFS compression.
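
The ZFS side of option (b) is just a dataset property; a quick sketch with a
hypothetical dataset name:

  zfs set compression=on tank/oradata
  # check what the compression actually buys you
  zfs get compressratio tank/oradata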

(Dedup makes VM management easier, because the alternative is to patch
one master guest VM [per-guest type] then re-clone and re-configure
all instances of that guest type, in the process possibly losing any
customizations in those guests.  But even before dedup, the ability to
snapshot and clone datasets was an impressive dedup-like tool for the
VM use-case, just not as convenient as dedup.)
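
A minimal sketch of that snapshot/clone workflow, with hypothetical dataset
names:

  zfs snapshot tank/vm/golden@patched
  zfs clone tank/vm/golden@patched tank/vm/guest01
  zfs clone tank/vm/golden@patched tank/vm/guest02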

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-29 Thread sol
Richard Elling wrote: 

 many of the former Sun ZFS team 
 regularly contribute to ZFS through the illumos developer community.  


Does this mean that if they provide a bug fix via illumos then the fix won't 
make it into the Oracle code?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 2:06 PM, sol a...@yahoo.com wrote:
 Richard Elling wrote:
 many of the former Sun ZFS team
 regularly contribute to ZFS through the illumos developer community.

 Does this mean that if they provide a bug fix via illumos then the fix won't
 make it into the Oracle code?

If you're an Oracle customer you should report any ZFS bugs you find
to Oracle if you want fixes in Solaris.  You may want to (and I
encourage you to) report such bugs to Illumos if at all possible
(i.e., unless your agreement with Oracle or your employer's policies
somehow prevent you from doing so).

The following is complete speculation.  Take it with salt.

With reference to your question, it may mean that Oracle's ZFS team
would have to come up with their own fixes to the same bugs.  Oracle's
legal department would almost certainly have to clear the copying of
any non-trivial/obvious fix from Illumos into Oracle's ON tree.  And
if taking a fix from Illumos were to require opening the affected
files (because they are under CDDL in Illumos) then executive
management approval would also be required.  But the most likely case
is that the issue simply wouldn't come up in the first place because
Oracle's ZFS team would almost certainly ignore the Illumos repository
(perhaps not the Illumos bug tracker, but probably that too) as that's
simply the easiest way for them to avoid legal messes.  Think about
it.  Besides, I suspect that from Oracle's point of view what matters
are bug reports by Oracle customers to Oracle, so if a bug fixed in
Illumos is never reported to Oracle by a customer, it would likely
never get fixed in Solaris either except by accident, as a result of
another change.

Also, the Oracle ZFS team is not exactly devoid of clue, even with the
departures from it to date.  I suspect they will be able to fix bugs
in Oracle's ZFS and completely independently of the open ZFS
community, even if it means duplicating effort.

That said, Illumos is a fork of OpenSolaris, and as such it and
Solaris will necessarily diverge as at least one of the two (and
probably both, for a while) gets plenty of bug fixes and enhancements.
 This is a good thing, not a bad thing, at least for now.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-29 Thread Richard Elling
On Dec 29, 2011, at 1:29 PM, Nico Williams wrote:

 On Thu, Dec 29, 2011 at 2:06 PM, sol a...@yahoo.com wrote:
 Richard Elling wrote:
  many of the former Sun ZFS team
 regularly contribute to ZFS through the illumos developer community.
 
 Does this mean that if they provide a bug fix via illumos then the fix won't
 make it into the Oracle code?

I can't speak for Oracle, but I think the entire ZFS community benefits when
bugs are fixed. Squeaky wheels get the grease, so squeak often.

 If you're an Oracle customer you should report any ZFS bugs you find
 to Oracle if you want fixes in Solaris.  You may want to (and I
 encourage you to) report such bugs to Illumos if at all possible
 (i.e., unless your agreement with Oracle or your employer's policies
 somehow prevent you from doing so).

+1

 The following is complete speculation.  Take it with salt.
 
 With reference to your question, it may mean that Oracle's ZFS team
 would have to come up with their own fixes to the same bugs.  Oracle's
 legal department would almost certainly have to clear the copying of
 any non-trivial/obvious fix from Illumos into Oracle's ON tree.  And
 if taking a fix from Illumos were to require opening the affected
 files (because they are under CDDL in Illumos) then executive
 management approval would also be required.  But the most likely case
 is that the issue simply wouldn't come up in the first place because
 Oracle's ZFS team would almost certainly ignore the Illumos repository
 (perhaps not the Illumos bug tracker, but probably that too) as that's
 simply the easiest way for them to avoid legal messes.  Think about
 it.  Besides, I suspect that from Oracle's point of view what matters
 are bug reports by Oracle customers to Oracle, so if a bug fixed in
 Illumos is never reported to Oracle by a customer, it would likely
 never get fixed in Solaris either except by accident, as a result of
 another change.
 
 Also, the Oracle ZFS team is not exactly devoid of clue, even with the
 departures from it to date.  I suspect they will be able to fix bugs
 in Oracle's ZFS and completely independently of the open ZFS
 community, even if it means duplicating effort.

Yes, Oracle continues to develop and sustain its Solaris products. This
can only be viewed as a good thing.

 That said, Illumos is a fork of OpenSolaris, and as such it and
 Solaris will necessarily diverge as at least one of the two (and
 probably both, for a while) gets plenty of bug fixes and enhancements.
 This is a good thing, not a bad thing, at least for now.

Evolution continues, despite the rhetoric from the pulpit :-)
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Matthew Ahrens
On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble tr...@netdemons.com wrote:
 On 12/12/2011 12:23 PM, Richard Elling wrote:

 On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:

 Not exactly. What is dedup'ed is the stream only, which is in fact not
 very
 efficient. Real dedup aware replication is taking the necessary steps to
 avoid sending a block that exists on the other storage system.

As with all dedup-related performance, the efficiency of various
methods of implementing zfs send -D will vary widely, depending on
the dedup-ability of the data, and what is being sent.  However,
sending no blocks that already exist on the target system does seem
like a good goal, since it addresses some use cases that intra-stream
dedup does not.

 (1) when constructing the stream, every time a block is read from a fileset
 (or volume), its checksum is sent to the receiving machine. The receiving
 machine then looks up that checksum in its DDT, and sends back a needed or
 not-needed reply to the sender. While this lookup is being done, the
 sender must hold the original block in RAM, and cannot write it out to the
 to-be-sent-stream.
...
 you produce a huge amount of small network packet
 traffic, which trashes network throughput

This seems like a valid approach to me.  When constructing the stream,
the sender need not read the actual data, just the checksum in the
indirect block.  So there is nothing that the sender must hold in
RAM.  There is no need to create small (or synchronous) network
packets, because sender need not wait for the receiver to determine if
it needs the block or not.  There can be multiple asynchronous
communication streams:  one where the sender sends all the checksums
to the receiver; another where the receiver requests blocks that it
does not have from the sender; and another where the sender sends
requested blocks back to the receiver.  Implementing this may not be
trivial, and in some cases it will not improve on the current
implementation.  But in others it would be a considerable improvement.

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens mahr...@delphix.com wrote:
 On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble tr...@netdemons.com wrote:
 (1) when constructing the stream, every time a block is read from a fileset
 (or volume), its checksum is sent to the receiving machine. The receiving
 machine then looks up that checksum in its DDT, and sends back a needed or
 not-needed reply to the sender. While this lookup is being done, the
 sender must hold the original block in RAM, and cannot write it out to the
 to-be-sent-stream.
 ...
 you produce a huge amount of small network packet
 traffic, which trashes network throughput

 This seems like a valid approach to me.  When constructing the stream,
 the sender need not read the actual data, just the checksum in the
 indirect block.  So there is nothing that the sender must hold in
 RAM.  There is no need to create small (or synchronous) network
 packets, because sender need not wait for the receiver to determine if
 it needs the block or not.  There can be multiple asynchronous
 communication streams:  one where the sender sends all the checksums
 to the receiver; another where the receiver requests blocks that it
 does not have from the sender; and another where the sender sends
 requested blocks back to the receiver.  Implementing this may not be
 trivial, and in some cases it will not improve on the current
 implementation.  But in others it would be a considerable improvement.

Right, you'd want to let the socket/transport buffer/flow-control the
writes of "I have this new block checksum" messages from the zfs
sender and "I need the block with this checksum" messages from the zfs
receiver.

I like this.

A separate channel for bulk data definitely comes recommended for flow
control reasons, but if you do that then securing the transport gets
complicated: you couldn't just zfs send .. | ssh ... zfs receive.  You
could use SSH channel multiplexing, but that will net you lousy
performance (well, no lousier than one already gets with SSH
anyways)[*].  (And SunSSH lacks this feature anyways)  It'd then begin
to pay to have a bona fide zfs send network protocol, and now
we're talking about significantly more work.  Another option would be
to have send/receive options to create the two separate channels, so
one would do something like:

% zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs
receive --dedup-control-channel ... 
% zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive
--dedup-bulk-channel
% wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it has flow controlled channels
layered over a flow controlled congestion channel (TCP), and there's
not enough information flowing from TCP to SSHv2 to make this work
well, but also, the SSHv2 channels cannot have their window shrink
except by the sender consuming it, which makes it impossible to mix
high-bandwidth bulk and small control data over a congested link.
This means that in practice SSHv2 channels have to have relatively
small windows, which then forces the protocol to work very
synchronously (i.e., with effectively synchronous ACKs of bulk data).
I now believe the idea of mixing bulk and non-bulk data over a single
TCP connection in SSHv2 is a failure.  SSHv2 over SCTP, or over
multiple TCP connections, would be much better.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-29 Thread Ray Van Dolson
Hi all;

We have a dev box running NexentaStor Community Edition 3.1.1 with 24GB of
RAM (we don't run dedupe on production boxes -- and we do pay for Nexenta
licenses on prd as well) and an 8.5TB pool with deduplication enabled
(1.9TB or so in use).  The dedupe ratio is only 1.26x.

The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC.

The box has been performing fairly poorly lately, and we're thinking
it's due to deduplication:

  # echo ::arc | mdb -k | grep arc_meta
  arc_meta_used =  5884 MB
  arc_meta_limit=  5885 MB
  arc_meta_max  =  5888 MB

  # zpool status -D
  ...
  DDT entries 24529444, size 331 on disk, 185 in core

So, not only are we using up all of our metadata cache, but the DDT
table is taking up a pretty significant chunk of that (over 70%).

ARC sizing is as follows:

  p = 15331 MB
  c = 16354 MB
  c_min =  2942 MB
  c_max = 23542 MB
  size  = 16353 MB

I'm not really sure how to determine how many blocks are on this zpool
(is it the same as the # of DDT entries? -- deduplication has been on
since pool creation).  If I use a 64KB block size average, I get about
31 million blocks, but DDT entries are 24 million 

zdb -DD and zdb -bb | grep 'bp count' both do not complete (zdb says
I/O error).  Probably because the pool is in use and is quite busy.

Without the block count I'm having a hard time determining how much
memory we _should_ have.  I can only speculate that it's more at this
point. :)

If I assume 24 million blocks is about accurate (from zpool status -D
output above), then at 320 bytes per block we're looking at about 7.1GB
for DDT table size.  We do have L2ARC, though I'm not sure how ZFS
decides what portion of the DDT stays in memory and what can go to
L2ARC -- if all of it went to L2ARC, then the references to this
information in arc_meta would be (at 176 bytes * 24million blocks)
around 4GB -- which again is a good chunk of arc_meta_max.
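
The back-of-the-envelope math, using the entry count from the zpool status -D
output above and the commonly quoted per-entry sizes (so treat the results as
rough estimates):

  echo '24529444 * 320 / 1024^3' | bc -l    # ~7.3GB for the DDT itself
  echo '24529444 * 176 / 1024^3' | bc -l    # ~4.0GB of ARC metadata referencing it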

Given that our dedupe ratio on this pool is fairly low anyway, I'm
looking for strategies to back out.  Should we just disable
deduplication and then maybe bump up the size of the arc_meta_max?
Maybe also increase the size of arc.size as well (8GB left for the
system seems higher than we need)?
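
If we did go that route, I assume it would look roughly like this (the
arc_meta value is only an example and takes effect after a reboot):

  # stop deduplicating new writes; existing DDT entries stay until blocks are rewritten
  zfs set dedup=off tank
  # in /etc/system, raise the metadata cap (example: 8GB), then reboot
  set zfs:zfs_arc_meta_limit=0x200000000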

Is there a non-disruptive way to undeduplicate everything and expunge
the DDT?  zfs send/recv and then back perhaps (we have the extra
space)?

Thanks,
Ray

[1] http://markmail.org/message/db55j6zetifn4jkd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-29 Thread Fajar A. Nugraha
On Fri, Dec 30, 2011 at 1:31 PM, Ray Van Dolson rvandol...@esri.com wrote:
 Is there a non-disruptive way to undeduplicate everything and expunge
 the DDT?

AFAIK, no

  zfs send/recv and then back perhaps (we have the extra
 space)?

That should work, but it's disruptive :D

Others might provide better answer though.

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-29 Thread Ray Van Dolson
On Thu, Dec 29, 2011 at 10:59:04PM -0800, Fajar A. Nugraha wrote:
 On Fri, Dec 30, 2011 at 1:31 PM, Ray Van Dolson rvandol...@esri.com wrote:
  Is there a non-disruptive way to undeduplicate everything and expunge
  the DDT?
 
 AFAIK, no
 
   zfs send/recv and then back perhaps (we have the extra
  space)?
 
 That should work, but it's disruptive :D
 
 Others might provide better answer though.

Well, slightly _less_ disruptive perhaps.  We can zfs send to another
file system on the same system, but different set of disks.  We then
disable NFS shares on the original, do a final zfs send to sync, then
share out the new undeduplicated file system with the same name.
Hopefully the window here is short enough that NFS clients are able to
recover gracefully.

We'd then wipe out the old zpool, recreate and do the reverse to get
data back onto it..
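
Roughly, I'm picturing something like this (pool/dataset names hypothetical;
the target dataset is left with dedup off, so the received copy is
undeduplicated):

  zfs snapshot tank/share@m1
  zfs send tank/share@m1 | zfs receive pool2/share
  # quiesce NFS, then send the final increment and flip the share over
  zfs set sharenfs=off tank/share
  zfs snapshot tank/share@m2
  zfs send -i @m1 tank/share@m2 | zfs receive -F pool2/share
  zfs set sharenfs=on pool2/share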

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss