Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Reducing the record size would negatively impact performance. For the rationale, see the section titled "Match Average I/O Block Sizes" in my blog post on filesystem caching:
http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: brad.di...@oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:

Try reducing recordsize to 8K or even less *before* you put any data. This can potentially improve your dedup ratio and keep it higher after you start modifying data.

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

As promised, here are the findings from my testing. I created 6 directory server instances, where the first instance has roughly 8.5GB of data. Then I initialized the remaining 5 instances from a binary backup of the first instance. Then I rebooted the server so as to start with an empty ZFS cache. The following table shows the increase in L1ARC size, search rate performance, and CPU% busy as load was started and applied to each successive directory server instance. The L1ARC cache grew a little with each additional instance but largely stayed the same size. Likewise, the ZFS dedup ratio remained the same because no data on the directory server instances was changing.

[image001.png: table of L1ARC size, search rate, and CPU% busy per instance]

However, once I started modifying the data of the replicated directory server topology, the caching efficiency quickly diminished. The following table shows that the delta for each instance increased by roughly 2GB after only 300k changes.

[image002.png: table of per-instance cache deltas after modifications]

I suspect the divergence in data as seen by ZFS deduplication most likely occurs because deduplication operates at the block level rather than at the byte level. When a write is sent to one directory server instance, the exact same write is propagated to the other 5 instances and therefore should be considered a duplicate. However, this was not the case. There could be other reasons for the divergence as well.

The two key takeaways from this exercise were as follows. There is tremendous caching potential through the use of ZFS deduplication. However, the current block-level deduplication does not benefit directory as much as it perhaps could if deduplication occurred at the byte level rather than the block level. It could very well be that byte-level deduplication doesn't work much better either; until that option is available, we won't know for sure.

Regards,
Brad

[image003.png]

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote:

Thanks everyone for your input on this thread. It sounds like there is sufficient weight behind the affirmative that I will include this methodology in my performance analysis test plan. If the performance goes well, I will share some of the results when we conclude in the January/February timeframe.

Regarding the great dd use case provided earlier in this thread: the L1 and L2 ARC detect streaming reads such as those from dd and prevent them from populating the cache. See my previous blog post at the link below for a way around this protective caching control of ZFS.
http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html

Thanks again!
Brad

[PastedGraphic-2.tiff]

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 8, 2011, at 4:22 PM, Mark Musante wrote:

You can see the original ARC case here:
http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt

On 8 Dec 2011, at 16:41, Ian Collins wrote:

On 12/9/11 12:39 AM, Darren J Moffat wrote:

On 12/07/11 20:48, Mertol Ozyoney wrote:

Unfortunately the answer is no. Neither L1 nor L2 cache is dedup aware. The only vendor I know that can do this is NetApp. In fact, most of our functions, like replication, are not dedup aware. For example, technically it's possible to optimize our replication so that it does not send data chunks if a data chunk with the same checksum exists on the target, without enabling dedup on target and source.

[Darren J Moffat:] We already do that with 'zfs send -D':

    -D    Perform dedup processing on the stream. Deduplicated
          streams cannot be received on systems that do not
          support the stream deduplication feature.

[Ian Collins:] Is there any more published information on how this feature works?
--
Ian.
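[As a footnote to the 'zfs send -D' point above: a minimal sketch of using the stream-dedup feature. The snapshot, pool, and host names are hypothetical, and the receiving system must support deduplicated streams.]

    # Send a deduplicated replication stream; blocks that repeat within
    # the stream are transmitted only once.
    zfs snapshot tank/ds@snap1
    zfs send -D tank/ds@snap1 | ssh remotehost zfs receive backup/ds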
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Jim,

You are spot on. I was hoping that the writes would be close enough to identical that there would be a high ratio of duplicate data, since I use the same record size, page size, compression algorithm, etc. However, that was not the case. The main thing that I wanted to prove, though, was that if the data is the same, the L1 ARC only caches the data that was actually written to storage. That is a really cool thing! I am sure there will be future study on this topic as it applies to other scenarios.

With regards to directory engineering investing any energy into optimizing ODSEE DS to more effectively leverage this caching potential, that won't happen. OUD far outperforms ODSEE. That said, OUD may get some focus in this area. However, time will tell on that one.

For now, I hope everyone benefits from the little that I did validate.

Have a great day!
Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 4:45 AM, Jim Klimov wrote:

Thanks for running and publishing the tests :) A comment on your testing technique follows, though. ...
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Citing yourself: "The average block size for a given data block should be used as the metric to map all other data block sizes to. For example, the ZFS recordsize is 128kb by default. If the average block (or page) size of a directory server is 2k, then the mismatch in size will result in degraded throughput for both read and write operations. One of the benefits of ZFS is that you can change the recordsize of all write operations from the time you set the new value going forward."

And the above is not entirely correct: if a file is bigger than the current value of the recordsize property, reducing recordsize won't change the block size for that file (it will continue to use the previous size, for example 128K). This is why you need to set recordsize to the desired value for large files *before* you create them (or you will have to copy them later on).

From the performance point of view it really depends on the workload, but as you described in your blog, the default recordsize of 128K with an average write/read of 2K will negatively impact performance for many workloads, and lowering the recordsize can potentially improve it. Nevertheless, I was referring to dedup efficiency, which should improve (higher dedup ratios) with lower recordsize values, although it will require more memory for the DDT.

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 29 December 2011 15:55
To: Robert Milkowski
Cc: 'zfs-discuss discussion list'
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

Reducing the record size would negatively impact performance. ...
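[To make the recordsize point concrete, a minimal sketch with hypothetical pool and dataset names:]

    # Set recordsize at creation time, before any data is written.
    zfs create -o recordsize=8k tank/ds8k

    # Changing the property on an existing dataset only affects blocks
    # written afterwards; a large existing file keeps its old block size
    # until it is rewritten in full:
    zfs set recordsize=8k tank/existing
    cp /tank/existing/db.file /tank/existing/db.file.new   # rewritten at 8k
    mv /tank/existing/db.file.new /tank/existing/db.file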
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Thanks for running and publishing the tests :) A comment on your testing technique follows, though.

2011-12-29 1:14, Brad Diggs wrote:

As promised, here are the findings from my testing. I created 6 directory server instances ... However, once I started modifying the data of the replicated directory server topology, the caching efficiency quickly diminished. The following table shows that the delta for each instance increased by roughly 2GB after only 300k changes. I suspect the divergence in data as seen by ZFS deduplication most likely occurs because deduplication occurs at the block level rather than at the byte level. When a write is sent to one directory server instance, the exact same write is propagated to the other 5 instances and therefore should be considered a duplicate. However this was not the case. There could be other reasons for the divergence as well.

Hello, Brad,

If you tested with Sun DSEE (and I have no reason to believe other descendants of the iPlanet Directory Server would work differently under the hood), then there are two factors hindering your block-dedup gains:

1) The data is stored in the backend BerkeleyDB binary file. In Sun DSEE7 and/or in ZFS this could also be compressed data. Since ZFS dedups unique blocks, including same data at same offsets, it is quite unlikely you'd get the same data often enough. For example, each database might position the same userdata blocks at different offsets due to garbage collection or whatever other optimisation the DB might think of, making the on-disk blocks different and undedupable.

You might look into whether it is possible to tune the database to write in sector-sized to minimum-block-sized (512b/4096b) records and consistently use the same DSEE compression (or lack thereof); in this case you might get more identical blocks and win with dedup. But you'll likely lose with compression, especially of the empty sparse structure which a database initially is.

2) During replication each database actually becomes unique. There are hidden records with an "ns" prefix which mark when the record was created and replicated, who initiated it, etc. Timestamps in the data already warrant uniqueness ;)

This might be an RFE for the DSEE team, though: keep such volatile metadata separately from userdata. Then your DS instances would more likely dedup well after replication, and the unique metadata would be stored separately and stay unique. You might even keep it in a different dataset with no dedup, then... :)

---

So, at the moment, this expectation does not hold true: "When a write is sent to one directory server instance, the exact same write is propagated to the other five instances and therefore should be considered a duplicate." These writes are not exact.

HTH,
//Jim Klimov
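[Jim's offset-alignment point is easy to demonstrate. A minimal sketch, assuming a scratch Solaris-style box where a throwaway file-backed pool is acceptable; all names are hypothetical:]

    # Scratch file-backed pool with dedup on (destroy it afterwards).
    mkfile 256m /var/tmp/d0
    zpool create -O dedup=on -O recordsize=8k ddtest /var/tmp/d0

    # Identical, record-aligned content dedups: two copies, one set of blocks.
    dd if=/dev/urandom of=/ddtest/a bs=8k count=1024
    cp /ddtest/a /ddtest/b
    sync; zpool list -o name,dedupratio ddtest    # about 2.00x

    # The same bytes shifted by one no longer fall on the same record
    # boundaries, so almost none of them match existing blocks.
    { printf x; cat /ddtest/a; } > /ddtest/c
    sync; zpool list -o name,dedupratio ddtest    # falls to about 1.5x

    zpool destroy ddtest; rm /var/tmp/d0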
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
S11 FCS.

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: brad.di...@oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:11 AM, Robert Milkowski wrote:

And these results are from S11 FCS, I assume. On older builds or illumos-based distros I would expect the L1 ARC to grow much bigger.

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

As promised, here are the findings from my testing. I created 6 directory server instances ...
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Thu, Dec 29, 2011 at 9:53 AM, Brad Diggs <brad.di...@oracle.com> wrote:

Jim, You are spot on. I was hoping that the writes would be close enough to identical that there would be a high ratio of duplicate data ...

Databases are not as likely to benefit from dedup as virtual machines; indeed, DBs are likely to not benefit at all from dedup. The VM use case benefits from dedup for the obvious reason that many VMs will have the same exact software installed most of the time, using the same filesystems and the same patch/update installation order, so if you keep data out of their root filesystems then you can expect enormous deduplicatiousness. But databases, not so much.

The unit of deduplicable data in a VM use case is the guest's preferred block size, while in a DB the unit of deduplicable data might be a variable-sized table row, or even smaller: a single row/column value -- and you have no way to ensure alignment of individual deduplicable units, nor ordering of sets of deduplicable units into larger ones.

When it comes to databases your best bets will be: a) database-level compression or dedup features (e.g., Oracle's column-level compression feature) or b) ZFS compression.

(Dedup makes VM management easier, because the alternative is to patch one master guest VM [per guest type], then re-clone and re-configure all instances of that guest type, in the process possibly losing any customizations in those guests. But even before dedup, the ability to snapshot and clone datasets was an impressive dedup-like tool for the VM use case, just not as convenient as dedup.)

Nico
--
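[An illustration of the snapshot/clone approach mentioned above; a minimal sketch with hypothetical dataset names:]

    # Freeze a master guest image once it is patched and configured...
    zfs snapshot tank/vm/master@gold
    # ...then stamp out near-zero-cost copies for each guest.
    zfs clone tank/vm/master@gold tank/vm/guest01
    zfs clone tank/vm/master@gold tank/vm/guest02
    # Clones share every unmodified block with the snapshot; only blocks
    # a guest rewrites consume new space.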
Re: [zfs-discuss] S11 vs illumos zfs compatibility
Richard Elling wrote: "many of the former Sun ZFS team regularly contribute to ZFS through the illumos developer community."

Does this mean that if they provide a bug fix via illumos, then the fix won't make it into the Oracle code?
Re: [zfs-discuss] S11 vs illumos zfs compatibility
On Thu, Dec 29, 2011 at 2:06 PM, sol <a...@yahoo.com> wrote:

Does this mean that if they provide a bug fix via illumos then the fix won't make it into the Oracle code?

If you're an Oracle customer, you should report any ZFS bugs you find to Oracle if you want fixes in Solaris. You may want to (and I encourage you to) report such bugs to Illumos if at all possible (i.e., unless your agreement with Oracle or your employer's policies somehow prevent you from doing so).

The following is complete speculation. Take it with salt.

With reference to your question, it may mean that Oracle's ZFS team would have to come up with their own fixes to the same bugs. Oracle's legal department would almost certainly have to clear the copying of any non-trivial/obvious fix from Illumos into Oracle's ON tree. And if taking a fix from Illumos were to require opening the affected files (because they are under CDDL in Illumos), then executive management approval would also be required. But the most likely case is that the issue simply wouldn't come up in the first place, because Oracle's ZFS team would almost certainly ignore the Illumos repository (perhaps not the Illumos bug tracker, but probably that too), as that's simply the easiest way for them to avoid legal messes. Think about it.

Besides, I suspect that from Oracle's point of view what matters are bug reports by Oracle customers to Oracle, so if a bug fixed in Illumos is never reported to Oracle by a customer, it would likely never get fixed in Solaris either, except by accident as a result of another change. Also, the Oracle ZFS team is not exactly devoid of clue, even with the departures from it to date. I suspect they will be able to fix bugs in Oracle's ZFS completely independently of the open ZFS community, even if it means duplicating effort.

That said, Illumos is a fork of OpenSolaris, and as such it and Solaris will necessarily diverge as at least one of the two (and probably both, for a while) gets plenty of bug fixes and enhancements. This is a good thing, not a bad thing, at least for now.

Nico
--
Re: [zfs-discuss] S11 vs illumos zfs compatibility
On Dec 29, 2011, at 1:29 PM, Nico Williams wrote, quoting sol's question: "Does this mean that if they provide a bug fix via illumos then the fix won't make it into the Oracle code?"

I can't speak for Oracle, but I think the entire ZFS community benefits when bugs are fixed. Squeaky wheels get the grease, so squeak often.

Nico: "If you're an Oracle customer you should report any ZFS bugs you find to Oracle if you want fixes in Solaris. You may want to (and I encourage you to) report such bugs to Illumos if at all possible ..."

+1

Nico: "... Also, the Oracle ZFS team is not exactly devoid of clue, even with the departures from it to date. I suspect they will be able to fix bugs in Oracle's ZFS completely independently of the open ZFS community, even if it means duplicating effort."

Yes, Oracle continues to develop and sustain its Solaris products. This can only be viewed as a good thing.

Nico: "That said, Illumos is a fork of OpenSolaris, and as such it and Solaris will necessarily diverge as at least one of the two (and probably both, for a while) gets plenty of bug fixes and enhancements. This is a good thing, not a bad thing, at least for now."

Evolution continues, despite the rhetoric from the pulpit :-)

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble <tr...@netdemons.com> wrote:

On 12/12/2011 12:23 PM, Richard Elling wrote:

On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:

Not exactly. What is dedup'ed is the stream only, which is in fact not very efficient. Real dedup-aware replication takes the necessary steps to avoid sending a block that exists on the other storage system.

[Richard Elling:] As with all dedup-related performance, the efficiency of various methods of implementing zfs send -D will vary widely, depending on the dedup-ability of the data and what is being sent. However, sending no blocks that already exist on the target system does seem like a good goal, since it addresses some use cases that intra-stream dedup does not.

[Erik Trimble:] (1) When constructing the stream, every time a block is read from a fileset (or volume), its checksum is sent to the receiving machine. The receiving machine then looks up that checksum in its DDT and sends back a "needed" or "not-needed" reply to the sender. While this lookup is being done, the sender must hold the original block in RAM and cannot write it out to the to-be-sent stream. ... You produce a huge amount of small network packet traffic, which trashes network throughput.

This seems like a valid approach to me. When constructing the stream, the sender need not read the actual data, just the checksum in the indirect block. So there is nothing that the sender must hold in RAM. There is no need to create small (or synchronous) network packets, because the sender need not wait for the receiver to determine whether it needs the block or not. There can be multiple asynchronous communication streams: one where the sender sends all the checksums to the receiver; another where the receiver requests blocks that it does not have from the sender; and another where the sender sends the requested blocks back to the receiver.

Implementing this may not be trivial, and in some cases it will not improve on the current implementation. But in others it would be a considerable improvement.

--matt
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens <mahr...@delphix.com> wrote:

There can be multiple asynchronous communication streams: one where the sender sends all the checksums to the receiver; another where the receiver requests blocks that it does not have from the sender; and another where the sender sends the requested blocks back to the receiver. Implementing this may not be trivial, and in some cases it will not improve on the current implementation. But in others it would be a considerable improvement.

Right, you'd want to let the socket/transport buffer/flow-control writes of "I have this new block checksum" messages from the zfs sender and "I need the block with this checksum" messages from the zfs receiver. I like this.

A separate channel for bulk data definitely comes recommended for flow-control reasons, but if you do that then securing the transport gets complicated: you couldn't just zfs send .. | ssh ... zfs receive. You could use SSH channel multiplexing, but that will net you lousy performance (well, no lousier than one already gets with SSH anyways)[*]. (And SunSSH lacks this feature anyways.) It'd then begin to pay to have a bona fide zfs send network protocol, and now we're talking about significantly more work.

Another option would be to have send/receive options to create the two separate channels, so one would do something like:

    % zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs receive --dedup-control-channel ...
    % zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive --dedup-bulk-channel
    % wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it has flow-controlled channels layered over a flow-controlled congestion channel (TCP), and there's not enough information flowing from TCP to SSHv2 to make this work well. But also, the SSHv2 channels cannot have their window shrink except by the sender consuming it, which makes it impossible to mix high-bandwidth bulk and small control data over a congested link. This means that in practice SSHv2 channels have to have relatively small windows, which then forces the protocol to work very synchronously (i.e., with effectively synchronous ACKs of bulk data). I now believe the idea of mixing bulk and non-bulk data over a single TCP connection in SSHv2 is a failure. SSHv2 over SCTP, or over multiple TCP connections, would be much better.

Nico
--
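[For concreteness: OpenSSH exposes SSHv2 channel multiplexing through ControlMaster, so running both of Nico's hypothetical streams over one multiplexed TCP connection might look like the sketch below. The --dedup-* flags are his hypothetical options, the "..." elisions stand in for real arguments, and, as noted above, SunSSH lacks this feature and the shared connection would be the bottleneck.]

    # Master connection; later sessions multiplex over its TCP connection.
    ssh -M -S /tmp/zfs-mux -fN remotehost
    zfs send --dedup-control-channel ... | ssh -S /tmp/zfs-mux remotehost 'zfs receive ...' &
    zfs send --dedup-bulk-channel ... | ssh -S /tmp/zfs-mux remotehost 'zfs receive ...' &
    wait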
[zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)
Hi all;

We have a dev box running NexentaStor Community Edition 3.1.1 with 24GB RAM (we don't run dedupe on production boxes -- and we do pay for Nexenta licenses on prd as well) and an 8.5TB pool with deduplication enabled (1.9TB or so in use). The dedupe ratio is only 1.26x. The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC.

The box has been performing fairly poorly lately, and we're thinking it's due to deduplication:

    # echo ::arc | mdb -k | grep arc_meta
    arc_meta_used  = 5884 MB
    arc_meta_limit = 5885 MB
    arc_meta_max   = 5888 MB

    # zpool status -D
    ...
    DDT entries 24529444, size 331 on disk, 185 in core

So, not only are we using up all of our metadata cache, but the DDT table is taking up a pretty significant chunk of that (over 70%). ARC sizing is as follows:

    p     = 15331 MB
    c     = 16354 MB
    c_min =  2942 MB
    c_max = 23542 MB
    size  = 16353 MB

I'm not really sure how to determine how many blocks are on this zpool (is it the same as the number of DDT entries? -- deduplication has been on since pool creation). If I use a 64KB average block size, I get about 31 million blocks, but there are only 24 million DDT entries. Both zdb -DD and zdb -bb | grep 'bp count' do not complete (zdb says I/O error), probably because the pool is in use and quite busy. Without the block count I'm having a hard time determining how much memory we _should_ have. I can only speculate that it's more at this point. :)

If I assume 24 million blocks is about accurate (from the zpool status -D output above), then at 320 bytes per block we're looking at about 7.1GB for the DDT table size. We do have L2ARC, though I'm not sure how ZFS decides what portion of the DDT stays in memory and what can go to L2ARC -- if all of it went to L2ARC, then the references to this information in arc_meta would be (at 176 bytes * 24 million blocks) around 4GB, which again is a good chunk of arc_meta_max.

Given that our dedupe ratio on this pool is fairly low anyways, I am looking for strategies to back out. Should we just disable deduplication and then maybe bump up arc_meta_max? Maybe also increase arc.size as well (8GB left for the system seems higher than we need)? Is there a non-disruptive way to undeduplicate everything and expunge the DDT? zfs send/recv and then back perhaps (we have the extra space)?

Thanks,
Ray

[1] http://markmail.org/message/db55j6zetifn4jkd
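[For what it's worth, the per-entry sizes that zpool status -D reports above already give a back-of-envelope footprint. A quick sketch using those reported figures rather than the 320/176-byte rules of thumb:]

    # DDT footprint from the reported figures: 24529444 entries,
    # 331 bytes/entry on disk, 185 bytes/entry in core.
    # Divide first to stay within 32-bit shell arithmetic.
    entries=24529444
    echo "in-core: $((entries / 1024 * 185 / 1024)) MB"   # ~4327 MB
    echo "on-disk: $((entries / 1024 * 331 / 1024)) MB"   # ~7743 MB

[An in-core DDT of roughly 4.3GB accounts for about 73% of the 5884MB arc_meta_used shown above, consistent with the over-70% estimate.]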
Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)
On Fri, Dec 30, 2011 at 1:31 PM, Ray Van Dolson <rvandol...@esri.com> wrote:

Is there a non-disruptive way to undeduplicate everything and expunge the DDT?

AFAIK, no.

zfs send/recv and then back perhaps (we have the extra space)?

That should work, but it's disruptive :D Others might provide better answers though.

--
Fajar
Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)
On Thu, Dec 29, 2011 at 10:59:04PM -0800, Fajar A. Nugraha wrote:

That should work, but it's disruptive :D Others might provide better answers though.

Well, slightly _less_ disruptive perhaps. We can zfs send to another file system on the same system, but a different set of disks. We then disable NFS shares on the original, do a final zfs send to sync, then share out the new undeduplicated file system with the same name. Hopefully the window here is short enough that NFS clients are able to recover gracefully.

We'd then wipe out the old zpool, recreate it, and do the reverse to get data back onto it.

Thanks,
Ray
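[A minimal sketch of that two-pass migration, with hypothetical pool/dataset names; the new pool is created without dedup, and -F lets the incremental receive roll back any stray changes:]

    # Pass 1: bulk copy while the original stays shared.
    zfs snapshot tank/data@mig1
    zfs send tank/data@mig1 | zfs receive newpool/data   # dedup is off by default on newpool

    # Pass 2: unshare/quiesce the original, then send the small final delta.
    zfs unshare tank/data
    zfs snapshot tank/data@mig2
    zfs send -i @mig1 tank/data@mig2 | zfs receive -F newpool/data
    zfs set sharenfs=on newpool/data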