Re: [zfs-discuss] questions about the DDT and other things
more below…

On Dec 1, 2011, at 8:21 PM, Erik Trimble wrote:

> On 12/1/2011 6:44 PM, Ragnar Sundblad wrote:
>> Thanks for your answers!
>>
>> On 2 dec 2011, at 02:54, Erik Trimble wrote:
>>
>>> On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
>>>> I am sorry if these are dumb questions. If there are explanations available somewhere for those questions that I just haven't found, please let me know! :-)
>>>>
>>>> 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out to the L2ARC, there are still some 170 bytes in the ARC to reference them (or rather the ZAP objects, I believe). In some places it sounds like those 170 bytes refer to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170-byte reference in the ARC. What is the story here, really?
>>>
>>> Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC also consume ARC space. It's a bad situation.
>>
>> Yes, it is a bad situation. But how many DDT entries can there be in each ZAP object? Some have suggested a 1:1 relationship; others have suggested that it isn't.
>
> I'm pretty sure it's NOT 1:1, but I'd have to go look at the code. In any case, it's not a very big number, so you're still looking at the same O(n) as the number of DDT entries (n).

It is not a "bad thing"; it is what it is. Almost all non-trivial caches have a directory (sometimes called tags, in the case of CPU caches). Trivial caches do trivial manipulation of the address to find the data in the cache, a technique that would not work well for more sophisticated data management systems, like databases or file systems. So, to implement the cache, we need to put the cache directory somewhere. Again, in the case of CPU caches, the size of the tags is not counted as the size of the cache, but it can be quite substantial.

The DDT is stored in an AVL tree. It is unlikely that each ZAP object will contain only one DDT entry.
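To put rough numbers on the directory overhead, here is a back-of-the-envelope sketch in C. It uses the per-entry sizes quoted in this thread (~376 bytes per DDT entry, ~170 bytes per ARC reference to an L2ARC buffer); the unique-block count is hypothetical, so treat it as an illustration of the scaling, not as actual ZFS accounting:

    /* Rough cost model for holding the DDT in L2ARC.  Per-entry sizes
     * are the figures quoted in this thread; the block count is a
     * hypothetical example (~100M unique 128K blocks is roughly a
     * 12 TB pool with no dedup hits). */
    #include <stdio.h>

    #define DDT_ENTRY_BYTES 376   /* approx. DDT entry size           */
    #define ARC_REF_BYTES   170   /* approx. ARC header per L2ARC buf */

    int main(void)
    {
        long long unique_blocks = 100LL * 1000 * 1000;
        double gib = 1024.0 * 1024.0 * 1024.0;

        printf("DDT entries in L2ARC: %.1f GiB on SSD\n",
            unique_blocks * DDT_ENTRY_BYTES / gib);
        printf("ARC directory for them: %.1f GiB of RAM\n",
            unique_blocks * ARC_REF_BYTES / gib);
        return (0);
    }

Evicting DDT entries to L2ARC shrinks the per-entry RAM cost from roughly 376 to roughly 170 bytes, but the in-RAM directory still grows linearly with the number of unique blocks.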
>>>> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already has a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy!
>>>
>>> Remember that, when using Dedup, each block can potentially be part of a very large number of files. So, when you delete a file, you have to go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates. It's essentially the same problem that erasing snapshots has - for each block you delete, you have to find and update the metadata for all the other files that share that block's usage. Dedup and snapshot deletion share the same problem; it's just usually worse for dedup, since there's a much larger number of blocks that have to be updated.
>>
>> What is it that must be updated in the DDT entries - a ref count? And how does that differ from the snapshot case, which seems like a very similar mechanism?
>
> It is similar to the snapshot case, in that the block itself has a reference count in its structure (for use in both dedup and snapshots) that would get updated upon "delete", but you also have to consider that the DDT entry itself, which is a separate structure from the block structure, also has to be updated. That costs a whole new I/O to get at that additional structure. So, more or less, a dedup delete has to do two operations for every one that a snapshot delete does. Plus,

A snapshot does not modify blocks. Each block pointer has a birth txg entry. The txg number is guaranteed to be monotonically increasing, so we can tell the age of a block by its birth txg. When you delete a snapshot, the blocks that belong exclusively to that snapshot are returned to the free list.

>>> The problem is that you really need to have the entire DDT in some form of high-speed random-access memory in order for things to be efficient. If you have to search the entire hard drive to get the proper DDT entry every time you delete a block, then your IOPS limits are going to get hammered hard.
>>
>> Indeed!
>>
>>>> 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is an AVL tree, couldn't just that entire tree be cached on, for example, a
Re: [zfs-discuss] questions about the DDT and other things
On 12/1/2011 6:44 PM, Ragnar Sundblad wrote:
> Thanks for your answers!
>
> On 2 dec 2011, at 02:54, Erik Trimble wrote:
>
>> On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
>>> I am sorry if these are dumb questions. If there are explanations available somewhere for those questions that I just haven't found, please let me know! :-)
>>>
>>> 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out to the L2ARC, there are still some 170 bytes in the ARC to reference them (or rather the ZAP objects, I believe). In some places it sounds like those 170 bytes refer to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170-byte reference in the ARC. What is the story here, really?
>>
>> Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC also consume ARC space. It's a bad situation.
>
> Yes, it is a bad situation. But how many DDT entries can there be in each ZAP object? Some have suggested a 1:1 relationship; others have suggested that it isn't.

I'm pretty sure it's NOT 1:1, but I'd have to go look at the code. In any case, it's not a very big number, so you're still looking at the same O(n) as the number of DDT entries (n).

>>> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already has a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy!
>>
>> Remember that, when using Dedup, each block can potentially be part of a very large number of files. So, when you delete a file, you have to go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates. It's essentially the same problem that erasing snapshots has - for each block you delete, you have to find and update the metadata for all the other files that share that block's usage. Dedup and snapshot deletion share the same problem; it's just usually worse for dedup, since there's a much larger number of blocks that have to be updated.
>
> What is it that must be updated in the DDT entries - a ref count? And how does that differ from the snapshot case, which seems like a very similar mechanism?

It is similar to the snapshot case, in that the block itself has a reference count in its structure (for use in both dedup and snapshots) that would get updated upon "delete", but you also have to consider that the DDT entry itself, which is a separate structure from the block structure, also has to be updated. That costs a whole new I/O to get at that additional structure. So, more or less, a dedup delete has to do two operations for every one that a snapshot delete does. Plus,

>> The problem is that you really need to have the entire DDT in some form of high-speed random-access memory in order for things to be efficient. If you have to search the entire hard drive to get the proper DDT entry every time you delete a block, then your IOPS limits are going to get hammered hard.
>
> Indeed!
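To make the "two operations per freed block" point concrete, here is a toy C sketch of a dedup-aware free path. The structures and names are invented for illustration (the real DDT is an on-disk ZAP object reached through the ARC, not an in-memory array); the point is only that the reference count lives in a separate, checksum-indexed structure that must be looked up and written back for every block freed:

    #include <stdio.h>
    #include <string.h>

    /* Toy stand-in for a DDT entry: refcounts are kept here, indexed
     * by block checksum, not in the block or the file metadata. */
    struct ddt_entry {
        char      cksum[16];   /* toy checksum key                  */
        long long refcnt;      /* how many block pointers share it  */
    };

    static struct ddt_entry ddt[] = {   /* toy in-memory "DDT" */
        { "aa11bb22", 3 },
        { "cc33dd44", 1 },
    };

    /* Operation 1: random-access lookup of the entry by checksum.
     * On a real pool this is an extra I/O unless the entry is cached. */
    static struct ddt_entry *ddt_lookup(const char *cksum)
    {
        for (size_t i = 0; i < sizeof (ddt) / sizeof (ddt[0]); i++)
            if (strcmp(ddt[i].cksum, cksum) == 0)
                return (&ddt[i]);
        return (NULL);
    }

    static void free_deduped_block(const char *cksum)
    {
        struct ddt_entry *dde = ddt_lookup(cksum);

        if (dde == NULL)
            return;
        /* Operation 2: update (or remove) the entry itself. */
        if (--dde->refcnt == 0)
            printf("%s: last reference gone, data block freed\n", cksum);
        else
            printf("%s: %lld references remain, data stays\n",
                cksum, dde->refcnt);
    }

    int main(void)
    {
        free_deduped_block("cc33dd44");  /* frees the data block    */
        free_deduped_block("aa11bb22");  /* only drops the refcount */
        return (0);
    }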
>>> 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is an AVL tree, couldn't just that entire tree be cached on, for example, an SSD and be searched there without necessarily having to store any of it in RAM? That would probably require some changes to the DDT lookup code, and some mechanism to gather the tree to be able to lift it over to the SSD cache, and some other stuff, but still that sounds - with my very basic (non-)understanding of zfs - like a not too overwhelming change.
>>
>> L2ARC typically sits on an SSD, and the DDT is usually held there, if the L2ARC device exists.
>
> Well, it rather seems to be ZAP objects, referenced from the ARC, which happen to contain DDT entries, that are in the L2ARC. I mean that you could just move the entire AVL tree onto the SSD, completely outside of zfs if you will, and have it searched there, not dependent on what is in RAM at all. Every DDT lookup would take up to [tree depth] reads, but that could be OK if you have an SSD which is fast on reading (which many are).

ZFS currently treats all metadata (of which DDT entries are a part) and data slabs the same when it comes to choosing to migrate them from ARC to L2ARC, so the most-frequently-accessed info is in the ARC (regardless of what that info is), and everything else sits in the L2ARC. But ALL entries in the L2ARC require an ARC reference pointer. Under normal ope
Re: [zfs-discuss] questions about the DDT and other things
On Fri, Dec 02, 2011 at 01:59:37AM +0100, Ragnar Sundblad wrote:
> I am sorry if these are dumb questions. If there are explanations available somewhere for those questions that I just haven't found, please let me know! :-)

I'll give you a brief summary.

> 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out to the L2ARC, there are still some 170 bytes in the ARC to reference them (or rather the ZAP objects, I believe). In some places it sounds like those 170 bytes refer to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170-byte reference in the ARC. What is the story here, really?

Currently, every object (not just DDT entries) stored in the L2ARC is tracked in memory. This metadata identifies the object and where on the L2ARC it is stored. The L2ARC on disk doesn't contain this metadata and is not self-describing. This is one reason why the L2ARC starts out empty/cold after every reboot, and why the usable size of the L2ARC is limited by memory.

DDT entries in core are used directly. If the relevant DDT node is not in core, it must be fetched from the pool, which may in turn be assisted by an L2ARC. It's my understanding that, yes, several DDT entries are stored in each on-disk "block", though I'm not certain of the number. The on-disk size of the DDT entry is different, too.

> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already has a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy!

DDT entries are reference-counted. Unlike other things that look like multiple references, these are truly block-level independent. Everything else is either tree-structured or highly aggregated (metaslab free-space tracking).

Snapshots, for example, are references to a certain internal node (the root of a filesystem tree at a certain txg), and that counts as a reference to the entire subtree underneath. Note that any changes to this subtree later (via writes into the live filesystem) diverge completely via CoW; an update produces a new CoW block tree all the way back to the root, above the snapshot node.

When a snapshot is created, it starts out owning (almost) nothing. As data is overwritten, the ownership of the data that might otherwise be freed is transferred to the snapshot. When the oldest snapshot is freed, any data blocks it owns can be freed. When an intermediate snapshot is freed, data blocks it owns are either transferred to the previous, older snapshot, because they were shared with it (birth txg < that snapshot's txg), or they're unique to this snapshot and can be freed.

Either way, these decisions are tree-based and can potentially free large swathes of space with a single decision, whereas the DDT needs refcount updates individually for each block (in random order, as per below). (This is not the same as the ZPL directory tree used for naming, however; don't get those confused - it's flatter than that.)
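The birth-txg comparison described above can be sketched in a few lines of C. This is toy code, not the real ZFS implementation (which is considerably more involved); it only illustrates why snapshot deletion needs no per-block reference count:

    #include <stdio.h>

    typedef unsigned long long txg_t;

    struct blkptr {
        txg_t birth_txg;   /* txg in which the block was written */
    };

    /* Deciding the fate of a block owned by a snapshot being deleted.
     * prev_txg is the txg of the previous (older) snapshot, or 0 if
     * there is none. */
    static void snapshot_free_block(const struct blkptr *bp, txg_t prev_txg)
    {
        if (bp->birth_txg <= prev_txg)
            printf("born txg %llu: shared, transfer to older snapshot\n",
                bp->birth_txg);
        else
            printf("born txg %llu: unique to this snapshot, free it\n",
                bp->birth_txg);
    }

    int main(void)
    {
        struct blkptr shared = { 100 };   /* written before prev snap */
        struct blkptr unique = { 250 };   /* written after it         */

        /* deleting an intermediate snapshot; previous one at txg 200 */
        snapshot_free_block(&shared, 200);
        snapshot_free_block(&unique, 200);
        return (0);
    }

Because the test is an ordered comparison on birth txgs, whole subtrees can be kept or freed with one decision; the DDT, being indexed by checksum, has no such ordering and must be updated one block at a time.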
> 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is an AVL tree, couldn't just that entire tree be cached on, for example, an SSD and be searched there without necessarily having to store any of it in RAM? That would probably require some changes to the DDT lookup code, and some mechanism to gather the tree to be able to lift it over to the SSD cache, and some other stuff, but still that sounds - with my very basic (non-)understanding of zfs - like a not too overwhelming change.

Think of this the other way round. One could do this, and could require a dedicated device (SSD) in order to use dedup at all. Now, every DDT lookup requires IO to bring the DDT entry into memory. This would be slow, so we could add an in-memory cache for the DDT... and we're back to square one.

The major issue with the DDT is that, being content-hash indexed, it is random-access, even for sequential-access data. There's no getting around that; it's in its job description.

> 4. Now and then people mention that the problem with bp_rewrite has been explained, on this very mailing list I believe, but I haven't found that explanation. Could someone please give me a pointer to that description (or perhaps explain it again :-) )?

This relates to the answer for 2; all the pointers in the tree discussed there are block pointers to device virtual addresses. If you're go
Re: [zfs-discuss] questions about the DDT and other things
Thanks for your answers!

On 2 dec 2011, at 02:54, Erik Trimble wrote:

> On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
>> I am sorry if these are dumb questions. If there are explanations available somewhere for those questions that I just haven't found, please let me know! :-)
>>
>> 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out to the L2ARC, there are still some 170 bytes in the ARC to reference them (or rather the ZAP objects, I believe). In some places it sounds like those 170 bytes refer to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170-byte reference in the ARC. What is the story here, really?
>
> Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC also consume ARC space. It's a bad situation.

Yes, it is a bad situation. But how many DDT entries can there be in each ZAP object? Some have suggested a 1:1 relationship; others have suggested that it isn't.

>> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already has a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy!
>
> Remember that, when using Dedup, each block can potentially be part of a very large number of files. So, when you delete a file, you have to go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates. It's essentially the same problem that erasing snapshots has - for each block you delete, you have to find and update the metadata for all the other files that share that block's usage. Dedup and snapshot deletion share the same problem; it's just usually worse for dedup, since there's a much larger number of blocks that have to be updated.

What is it that must be updated in the DDT entries - a ref count? And how does that differ from the snapshot case, which seems like a very similar mechanism?

> The problem is that you really need to have the entire DDT in some form of high-speed random-access memory in order for things to be efficient. If you have to search the entire hard drive to get the proper DDT entry every time you delete a block, then your IOPS limits are going to get hammered hard.

Indeed!

>> 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is an AVL tree, couldn't just that entire tree be cached on, for example, an SSD and be searched there without necessarily having to store any of it in RAM? That would probably require some changes to the DDT lookup code, and some mechanism to gather the tree to be able to lift it over to the SSD cache, and some other stuff, but still that sounds - with my very basic (non-)understanding of zfs - like a not too overwhelming change.
>
> L2ARC typically sits on an SSD, and the DDT is usually held there, if the L2ARC device exists.

Well, it rather seems to be ZAP objects, referenced from the ARC, which happen to contain DDT entries, that are in the L2ARC. I mean that you could just move the entire AVL tree onto the SSD, completely outside of zfs if you will, and have it searched there, not dependent on what is in RAM at all. Every DDT lookup would take up to [tree depth] reads, but that could be OK if you have an SSD which is fast on reading (which many are).
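For a feel of what "[tree depth] reads" would mean, a rough calculation (the entry count and SSD read latency are hypothetical; an AVL tree's worst-case height is about 1.44 * log2(n)):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = 100e6;               /* hypothetical DDT entry count */
        double height = 1.44 * log2(n); /* AVL worst-case height        */
        double read_us = 100.0;         /* hypothetical SSD read, usec  */

        printf("~%.0f node reads per lookup, ~%.1f ms per uncached "
            "DDT lookup\n", height, height * read_us / 1000.0);
        return (0);
    }

In practice entries would be packed many to a node (much as on-disk ZAP blocks already hold several entries), which cuts the number of device reads sharply - but every uncached lookup still pays at least one random read, which is the "back to square one" point made in another reply in this thread.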
> There does need to be serious work on changing how the DDT in the L2ARC is referenced, however; the ARC memory requirements for DDT-in-L2ARC definitely need to be removed (which requires a non-trivial rearchitecting of dedup). There are some other changes that have to happen for Dedup to be really usable. Unfortunately, I can't see anyone around willing to do those changes, and my understanding of the code says that it is much more likely that we will simply remove and replace the entire dedup feature rather than trying to fix the existing design.

Yes, replacing it is certainly one possibility. Is there any work going on for a replacement mechanism?

>> 4. Now and then people mention that the problem with bp_rewrite has been explained, on this very mailing list I believe, but I haven't found that explanation. Could someone please give me a pointer to that description (or perhaps explain it again :-) )?
>>
>> Thanks
Re: [zfs-discuss] questions about the DDT and other things
On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
> I am sorry if these are dumb questions. If there are explanations available somewhere for those questions that I just haven't found, please let me know! :-)
>
> 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out to the L2ARC, there are still some 170 bytes in the ARC to reference them (or rather the ZAP objects, I believe). In some places it sounds like those 170 bytes refer to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170-byte reference in the ARC. What is the story here, really?

Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC also consume ARC space. It's a bad situation.

> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already has a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy!

Remember that, when using Dedup, each block can potentially be part of a very large number of files. So, when you delete a file, you have to go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates. It's essentially the same problem that erasing snapshots has - for each block you delete, you have to find and update the metadata for all the other files that share that block's usage. Dedup and snapshot deletion share the same problem; it's just usually worse for dedup, since there's a much larger number of blocks that have to be updated.

The problem is that you really need to have the entire DDT in some form of high-speed random-access memory in order for things to be efficient. If you have to search the entire hard drive to get the proper DDT entry every time you delete a block, then your IOPS limits are going to get hammered hard.
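The IOPS point is easy to quantify. A sketch with hypothetical numbers - a 1 TiB deduped file stored as 128 KiB records, freed with the DDT entirely uncached on a single spinning disk:

    #include <stdio.h>

    int main(void)
    {
        long long file_bytes = 1LL << 40;   /* 1 TiB file         */
        long long block_size = 128LL << 10; /* 128 KiB recordsize */
        long long blocks = file_bytes / block_size;
        double disk_iops = 100.0;           /* one 7200 rpm disk  */

        printf("%lld DDT lookups to free the file: ~%.1f hours at "
            "%.0f random IOPS\n",
            blocks, blocks / disk_iops / 3600.0, disk_iops);
        return (0);
    }

Hence the usual rule of thumb that dedup is only workable when the whole DDT fits in the ARC, or at least in an L2ARC with SSD-class random-read latency.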
> 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is an AVL tree, couldn't just that entire tree be cached on, for example, an SSD and be searched there without necessarily having to store any of it in RAM? That would probably require some changes to the DDT lookup code, and some mechanism to gather the tree to be able to lift it over to the SSD cache, and some other stuff, but still that sounds - with my very basic (non-)understanding of zfs - like a not too overwhelming change.

L2ARC typically sits on an SSD, and the DDT is usually held there, if the L2ARC device exists. There does need to be serious work on changing how the DDT in the L2ARC is referenced, however; the ARC memory requirements for DDT-in-L2ARC definitely need to be removed (which requires a non-trivial rearchitecting of dedup). There are some other changes that have to happen for Dedup to be really usable. Unfortunately, I can't see anyone around willing to do those changes, and my understanding of the code says that it is much more likely that we will simply remove and replace the entire dedup feature rather than trying to fix the existing design.

> 4. Now and then people mention that the problem with bp_rewrite has been explained, on this very mailing list I believe, but I haven't found that explanation. Could someone please give me a pointer to that description (or perhaps explain it again :-) )?
>
> Thanks for any enlightenment!
>
> /ragge

bp_rewrite is a feature named for the (as yet unimplemented) system call of the same name, which does Block Pointer re-writing. It would allow ZFS to change the physical location on media of an existing ZFS data slab - that is, bp_rewrite is necessary to allow ZFS to change the physical layout of data on media without changing the conceptual arrangement of that data. It's been the #1 most-wanted feature of ZFS since I can remember, probably for 10 years now.

-Erik
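A rough picture of why bp_rewrite is hard: a block's address (DVA) is stored inside its parent's block pointer, the parent block is itself checksummed and pointed to by its own parent, and so on up to the uberblock. The sketch below uses illustrative fields only; the real blkptr_t is a 128-byte structure holding up to three DVAs:

    #include <stdio.h>

    struct dva {                       /* Device Virtual Address      */
        unsigned int       vdev;       /* which top-level vdev        */
        unsigned long long offset;     /* location within that vdev   */
    };

    /* Illustrative block pointer: the parent physically contains the
     * child's address and checksum.  Moving the child changes this
     * structure, which (under CoW) relocates the parent, which changes
     * the grandparent's pointer, and so on up to the uberblock -- and
     * snapshots, clones, and dedup mean many trees may hold pointers
     * to the same block. */
    struct blkptr {
        struct dva         dva;        /* where the child block lives */
        unsigned long long birth_txg;  /* txg when it was written     */
        unsigned char      cksum[32];  /* checksum of the child       */
    };

    int main(void)
    {
        struct blkptr bp = { { 0, 0x4000000ULL }, 1234, { 0 } };

        printf("child at vdev %u, offset 0x%llx, born in txg %llu\n",
            bp.dva.vdev, bp.dva.offset, bp.birth_txg);
        return (0);
    }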
[zfs-discuss] questions about the DDT and other things
I am sorry if these are dumb questions. If there are explanations available somewhere for those questions that I just haven't found, please let me know! :-)

1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out to the L2ARC, there are still some 170 bytes in the ARC to reference them (or rather the ZAP objects, I believe). In some places it sounds like those 170 bytes refer to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170-byte reference in the ARC. What is the story here, really?

2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already has a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy!

3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is an AVL tree, couldn't just that entire tree be cached on, for example, an SSD and be searched there without necessarily having to store any of it in RAM? That would probably require some changes to the DDT lookup code, and some mechanism to gather the tree to be able to lift it over to the SSD cache, and some other stuff, but still that sounds - with my very basic (non-)understanding of zfs - like a not too overwhelming change.

4. Now and then people mention that the problem with bp_rewrite has been explained, on this very mailing list I believe, but I haven't found that explanation. Could someone please give me a pointer to that description (or perhaps explain it again :-) )?

Thanks for any enlightenment!

/ragge