Re: [zfs-discuss] ZFS file disk usage
On Mon, 21 Sep 2009 18:20:53 -0400
Richard Elling richard.ell...@gmail.com wrote:

> On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:
> > On Mon, 21 Sep 2009 17:13:26 -0400
> > Richard Elling richard.ell...@gmail.com wrote:
> > > You don't know the max overhead for the file before it is
> > > allocated. You could guess at a max of 3x size + at least three
> > > blocks. Since you can't control this, it seems like the worst
> > > case is when copies=3.
> > Is that max with copies=3? Assume copies=1; what is it then?
> 1x size + 1 block.

That seems to differ quite a bit from what I've seen; perhaps I am
misunderstanding... is the "+ 1 block" of a different size than the
recordsize? With recordsize=1k:

$ ls -ls foo
2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo

1024k vs 1130k

--
Andrew Deason
adea...@sinenomine.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS file disk usage
On Sep 22, 2009, at 8:07 AM, Andrew Deason wrote:
> On Mon, 21 Sep 2009 18:20:53 -0400
> Richard Elling richard.ell...@gmail.com wrote:
> > On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:
> > > On Mon, 21 Sep 2009 17:13:26 -0400
> > > Richard Elling richard.ell...@gmail.com wrote:
> > > > You don't know the max overhead for the file before it is
> > > > allocated. You could guess at a max of 3x size + at least three
> > > > blocks. Since you can't control this, it seems like the worst
> > > > case is when copies=3.
> > > Is that max with copies=3? Assume copies=1; what is it then?
> > 1x size + 1 block.
> That seems to differ quite a bit from what I've seen; perhaps I am
> misunderstanding... is the "+ 1 block" of a different size than the
> recordsize? With recordsize=1k:
>
> $ ls -ls foo
> 2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo

Well, there it is. I suggest suitable guard bands.
 -- richard

> 1024k vs 1130k
>
> --
> Andrew Deason
> adea...@sinenomine.net
Re: [zfs-discuss] ZFS file disk usage
On Tue, 22 Sep 2009 13:26:59 -0400
Richard Elling richard.ell...@gmail.com wrote:

> > That seems to differ quite a bit from what I've seen; perhaps I am
> > misunderstanding... is the "+ 1 block" of a different size than the
> > recordsize? With recordsize=1k:
> >
> > $ ls -ls foo
> > 2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo
> Well, there it is. I suggest suitable guard bands.

So, you would say it's reasonable to assume the overhead will always be
less than about 100k or 10%? And to be sure... if we're to be rounding
up to the next recordsize boundary, are we guaranteed to be able to get
that from the blocksize reported by statvfs?

--
Andrew Deason
adea...@sinenomine.net
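Richard's "guard bands" suggestion could be sketched as a small helper:
round the logical size up to the next recordsize boundary, then pad it.
The helper name and the size of the pad are my own assumptions, not
anything ZFS guarantees; note that the overhead observed above (~1130k
on disk for a 1024k file at recordsize=1k) is slightly over 10%, so
this sketch pads by 12.5% plus one record.

```c
#include <stdint.h>

/*
 * Hypothetical upper-bound estimate following the "guard band" idea:
 * round the logical size up to a recordsize boundary, then pad.  The
 * 12.5%-plus-one-record pad is an assumption chosen to stay above the
 * ~10.4% overhead observed above with recordsize=1k; ZFS guarantees
 * nothing of the sort.
 */
static uint64_t
estimate_max_disk_usage(uint64_t logical_size, uint64_t recordsize)
{
    /* round up to the next recordsize boundary */
    uint64_t rounded =
        (logical_size + recordsize - 1) / recordsize * recordsize;

    /* guard band: 1/8th for metadata, plus one extra record of slack */
    return rounded + rounded / 8 + recordsize;
}
```

For the 1M/recordsize=1k file above this charges about 1153k against
the cache, safely above the ~1130k actually used, at the cost of some
wasted headroom in the common case.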
Re: [zfs-discuss] ZFS file disk usage
On Sun, 20 Sep 2009 20:31:57 -0400
Richard Elling richard.ell...@gmail.com wrote:

> If you are just building a cache, why not just make a file system and
> put a reservation on it? Turn off auto snapshots and set other
> features as per best practices for your workload? In other words,
> treat it like we treat dump space.
>
> I think that we are getting caught up in trying to answer the
> question you ask rather than solving the problem you have... perhaps
> because we don't understand the problem.

Yes, possibly... some of these suggestions don't quite make sense to
me.

We can't just make a filesystem and put a reservation on it; we are
just an application the administrator puts on a machine for it to
access AFS. So I'm not sure when you are imagining we do that; when the
client starts up? Or as part of the installation procedure? Requiring a
separate filesystem seems unnecessarily restrictive.

And I still don't see how that helps. Making an fs with a reservation
would definitely limit us to the specified space, but we still can't
get an accurate picture of the current disk usage. I already mentioned
why using statvfs is not usable with that commit delay.

But solving the general problem for me isn't necessary. If I could just
get a ballpark estimate of the max overhead for a file, I would be
fine. I haven't paid attention to it before, so I don't even have an
intuitive feel for what it is.

--
Andrew Deason
adea...@sinenomine.net
Re: [zfs-discuss] ZFS file disk usage
On Sep 21, 2009, at 7:11 AM, Andrew Deason wrote:
> On Sun, 20 Sep 2009 20:31:57 -0400
> Richard Elling richard.ell...@gmail.com wrote:
> > If you are just building a cache, why not just make a file system
> > and put a reservation on it? Turn off auto snapshots and set other
> > features as per best practices for your workload? In other words,
> > treat it like we treat dump space.
> >
> > I think that we are getting caught up in trying to answer the
> > question you ask rather than solving the problem you have...
> > perhaps because we don't understand the problem.
> Yes, possibly... some of these suggestions don't quite make sense to
> me.
>
> We can't just make a filesystem and put a reservation on it; we are
> just an application the administrator puts on a machine for it to
> access AFS. So I'm not sure when you are imagining we do that; when
> the client starts up? Or as part of the installation procedure?
> Requiring a separate filesystem seems unnecessarily restrictive.
>
> And I still don't see how that helps. Making an fs with a reservation
> would definitely limit us to the specified space, but we still can't
> get an accurate picture of the current disk usage. I already
> mentioned why using statvfs is not usable with that commit delay.

OK, so the problem you are trying to solve is "how much stuff can I
place in the remaining free space?"  I don't think this is knowable for
a dynamic file system like ZFS where metadata is dynamically allocated.

> But solving the general problem for me isn't necessary. If I could
> just get a ballpark estimate of the max overhead for a file, I would
> be fine. I haven't paid attention to it before, so I don't even have
> an intuitive feel for what it is.

You don't know the max overhead for the file before it is allocated.
You could guess at a max of 3x size + at least three blocks. Since you
can't control this, it seems like the worst case is when copies=3.
 -- richard
Re: [zfs-discuss] ZFS file disk usage
On Mon, 21 Sep 2009 17:13:26 -0400
Richard Elling richard.ell...@gmail.com wrote:

> OK, so the problem you are trying to solve is "how much stuff can I
> place in the remaining free space?"  I don't think this is knowable
> for a dynamic file system like ZFS where metadata is dynamically
> allocated.

Yes. And I acknowledge that we can't know that precisely; I'm trying
for an estimate on the bound.

> You don't know the max overhead for the file before it is allocated.
> You could guess at a max of 3x size + at least three blocks. Since
> you can't control this, it seems like the worst case is when
> copies=3.

Is that max with copies=3? Assume copies=1; what is it then?

--
Andrew Deason
adea...@sinenomine.net
Re: [zfs-discuss] ZFS file disk usage
On Fri, 18 Sep 2009 17:54:41 -0400 Robert Milkowski mi...@task.gda.pl
wrote:

> There will be a delay of up-to 30s currently. But how much data do
> you expect to be pushed within 30s? Let's say it would be even 10GB
> into lots of small files, and you would calculate the total size by
> only summing up a logical size of data. Would you really expect that
> an error would be greater than 5%, which would be 500MB? Does it
> matter in practice?

Well, that wasn't the problem I was thinking of. I meant, if we have to
wait 30 seconds after the write to measure the disk usage... what do I
do, just sleep 30s after the write before polling for disk usage? We
could just ask for disk usage when we write, knowing that it doesn't
take into account the write we are performing... but we're changing
what we're measuring, then. If we are removing things from the cache in
order to free up space, how do we know when to stop?

To illustrate: normally when the cache is 98% full, we remove items
until we are 95% full before we allow a write to happen again. If we
relied on statvfs information for our disk usage information, we would
start removing items at 98%, and have no idea when we hit 95% unless we
wait 30 seconds.

If you are simply saying that the difference between logical size and
used disk blocks on ZFS is small enough not to make a difference...
well, that's what I've been asking. I have asked what the maximum
difference is between logical size rounded up to recordsize and size
taken up on disk, and haven't received an answer yet. If the answer is
small enough that you don't care, then fantastic.

> What if a user enables compression like lzjb or even gzip? How would
> you like to take it into account before doing writes? What if a user
> creates a snapshot? How would you take it into account?

Then it will be wrong; we do not take them into account. I do not care
about those cases. It is already impossible to ensure that the cache
tracking data is 100% correct all of the time.

Imagine we somehow had a way to account for all of those cases you
listed, one that would make me happy. Say the directory the user uses
for the cache data is /usr/vice/cache (one standard path to put it).
The OpenAFS client will put cache data in e.g. /usr/vice/cache/D0/V1
and a bunch of other files. If the user puts their own file in
/usr/vice/cache/reallybigfile, our cache tracking information will
always be off, in all current implementations. We have no control over
it, and we do not try to solve that problem.

I am treating the cases of "what if the user creates a snapshot" and
the like as a similar situation. If someone does that and runs out of
space, it is pretty easy to troubleshoot their system and say "you have
a snapshot of the cache dataset; do not do that". Right now, if someone
runs an OpenAFS client cache on zfs and runs out of space, the only
thing I can tell them is "don't use zfs", which I don't want to do. If
it works for _a_ configuration -- the default one -- that is all I am
asking for.

> I suspect that you are looking too closely for no real benefit.
> Especially if you don't want to dedicate a dataset to the cache, you
> would expect other applications in a system to write to the same file
> system but different locations, which you have no control over or
> ability to predict how much data will be written at all. Be it Linux,
> Solaris, BSD, ... the issue will be there.

It is certainly possible for other applications to fill up the disk. We
just need to ensure that we don't fill up the disk and block other
applications. You may think this is fruitless, and just from that
description alone, it may be. But you must understand that without an
accurate bound on the cache, well... we can eat up the disk a lot
faster than other applications without the user realizing it.

--
Andrew Deason
adea...@sinenomine.net
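The 98%/95% hysteresis described above can be sketched as a pure
function over the client's own usage counter, so it does not depend on
statvfs catching up after a commit. The name `bytes_to_evict` and the
water marks as constants are my own framing, not the actual OpenAFS
code.

```c
#include <stdint.h>

/* High/low water marks from the example above: start evicting at 98%
 * full, and evict down to 95% before allowing writes again. */
#define EVICT_START_PCT 98
#define EVICT_STOP_PCT  95

/*
 * Hypothetical helper, not the actual OpenAFS implementation: given
 * the internally tracked usage and the configured cache limit, return
 * how many bytes must be evicted before a new write may proceed
 * (0 if we are still below the high-water mark).
 */
static uint64_t
bytes_to_evict(uint64_t cache_usage, uint64_t cache_limit)
{
    uint64_t high = cache_limit * EVICT_START_PCT / 100;
    uint64_t low  = cache_limit * EVICT_STOP_PCT / 100;

    if (cache_usage < high)
        return 0;
    return cache_usage - low;
}
```

Because the counter is updated before each write from the client's own
accounting, the "when do we stop evicting" question never has to wait
on the filesystem reporting committed space.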
Re: [zfs-discuss] ZFS file disk usage
If you are just building a cache, why not just make a file system and
put a reservation on it? Turn off auto snapshots and set other features
as per best practices for your workload? In other words, treat it like
we treat dump space.

I think that we are getting caught up in trying to answer the question
you ask rather than solving the problem you have... perhaps because we
don't understand the problem.
 -- richard

On Sep 20, 2009, at 2:17 PM, Andrew Deason wrote:
> On Fri, 18 Sep 2009 17:54:41 -0400 Robert Milkowski
> mi...@task.gda.pl wrote:
> > There will be a delay of up-to 30s currently. But how much data do
> > you expect to be pushed within 30s? Let's say it would be even 10GB
> > into lots of small files, and you would calculate the total size by
> > only summing up a logical size of data. Would you really expect
> > that an error would be greater than 5%, which would be 500MB? Does
> > it matter in practice?
> Well, that wasn't the problem I was thinking of. I meant, if we have
> to wait 30 seconds after the write to measure the disk usage... what
> do I do, just sleep 30s after the write before polling for disk
> usage? We could just ask for disk usage when we write, knowing that
> it doesn't take into account the write we are performing... but we're
> changing what we're measuring, then. If we are removing things from
> the cache in order to free up space, how do we know when to stop?
>
> To illustrate: normally when the cache is 98% full, we remove items
> until we are 95% full before we allow a write to happen again. If we
> relied on statvfs information for our disk usage information, we
> would start removing items at 98%, and have no idea when we hit 95%
> unless we wait 30 seconds.
>
> If you are simply saying that the difference between logical size and
> used disk blocks on ZFS is small enough not to make a difference...
> well, that's what I've been asking. I have asked what the maximum
> difference is between logical size rounded up to recordsize and size
> taken up on disk, and haven't received an answer yet. If the answer
> is small enough that you don't care, then fantastic.
>
> > What if a user enables compression like lzjb or even gzip? How
> > would you like to take it into account before doing writes? What if
> > a user creates a snapshot? How would you take it into account?
> Then it will be wrong; we do not take them into account. I do not
> care about those cases. It is already impossible to ensure that the
> cache tracking data is 100% correct all of the time.
>
> Imagine we somehow had a way to account for all of those cases you
> listed, one that would make me happy. Say the directory the user uses
> for the cache data is /usr/vice/cache (one standard path to put it).
> The OpenAFS client will put cache data in e.g. /usr/vice/cache/D0/V1
> and a bunch of other files. If the user puts their own file in
> /usr/vice/cache/reallybigfile, our cache tracking information will
> always be off, in all current implementations. We have no control
> over it, and we do not try to solve that problem.
>
> I am treating the cases of "what if the user creates a snapshot" and
> the like as a similar situation. If someone does that and runs out of
> space, it is pretty easy to troubleshoot their system and say "you
> have a snapshot of the cache dataset; do not do that". Right now, if
> someone runs an OpenAFS client cache on zfs and runs out of space,
> the only thing I can tell them is "don't use zfs", which I don't want
> to do. If it works for _a_ configuration -- the default one -- that
> is all I am asking for.
>
> > I suspect that you are looking too closely for no real benefit.
> > Especially if you don't want to dedicate a dataset to the cache,
> > you would expect other applications in a system to write to the
> > same file system but different locations, which you have no control
> > over or ability to predict how much data will be written at all. Be
> > it Linux, Solaris, BSD, ... the issue will be there.
> It is certainly possible for other applications to fill up the disk.
> We just need to ensure that we don't fill up the disk and block other
> applications. You may think this is fruitless, and just from that
> description alone, it may be. But you must understand that without an
> accurate bound on the cache, well... we can eat up the disk a lot
> faster than other applications without the user realizing it.
>
> --
> Andrew Deason
> adea...@sinenomine.net
Re: [zfs-discuss] ZFS file disk usage
On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski mi...@task.gda.pl
wrote:

> if you would create a dedicated dataset for your cache and set quota
> on it then instead of tracking a disk space usage for each file you
> could easily check how much disk space is being used in the dataset.
> Would it suffice for you?

No. We need to be able to tell how close to full we are, for
determining when to start/stop removing things from the cache before we
can add new items to the cache again. I'd also _like_ not to require a
dedicated dataset for it, but it's not like it's difficult for users to
create one.

> Setting recordsize to 1k if you have lots of files (I assume) larger
> than that doesn't really make sense. The problem with metadata is
> that by default it is also compressed, so there is no easy way to
> tell how much disk space it occupies for a specified file using a
> standard API.

We do not know in advance what file sizes we'll be seeing in general.
We could of course tell people to tune the cache dataset according to
their usage pattern, but I don't think users are generally going to
know what their cache usage pattern looks like. I can say that at least
right now, usually each file will be at most 1M long (1M is the max
unless the user specifically changes it). But between the range 1k-1M,
I don't know what the distribution looks like.

I can't get an /estimate/ on the data+metadata disk usage? What about
in the hypothetical case of the metadata compression ratio being
effectively the same as without compression; what would it be then?

--
Andrew Deason
adea...@sinenomine.net
Re: [zfs-discuss] ZFS file disk usage
On Fri, 18 Sep 2009 12:48:34 -0400
Richard Elling richard.ell...@gmail.com wrote:

> The transactional nature of ZFS may work against you here. Until the
> data is committed to disk, it is unclear how much space it will
> consume. Compression clouds the crystal ball further.

...but not impossible. I'm just looking for a reasonable upper bound.
For example, if I always rounded up to the next 128k mark, and added an
additional 128k, that would always give me an upper bound (for files
<= 1M), as far as I can tell. But that is not a very tight bound; can
you suggest anything better?

> > I'd also _like_ not to require a dedicated dataset for it, but it's
> > not like it's difficult for users to create one.
> Use delegation. Users can create their own datasets, set parameters,
> etc. For this case, you could consider changing recordsize, if you
> really are so worried about 1k. IMHO, it is easier and less expensive
> in process and pain to just buy more disk when needed.

Users of OpenAFS, not unprivileged users. All users I am talking about
are the administrators for their machines. I would just like to reduce
the number of filesystem-specific steps needed to set up the cache. You
don't need to do anything special for a tmpfs cache, for instance, or
ext2/3 caches on Linux.

--
Andrew Deason
adea...@sinenomine.net
Re: [zfs-discuss] ZFS file disk usage
On Sep 18, 2009, at 7:36 AM, Andrew Deason wrote:
> On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski
> mi...@task.gda.pl wrote:
> > if you would create a dedicated dataset for your cache and set
> > quota on it then instead of tracking a disk space usage for each
> > file you could easily check how much disk space is being used in
> > the dataset. Would it suffice for you?
> No. We need to be able to tell how close to full we are, for
> determining when to start/stop removing things from the cache before
> we can add new items to the cache again.

The transactional nature of ZFS may work against you here. Until the
data is committed to disk, it is unclear how much space it will
consume. Compression clouds the crystal ball further.

> I'd also _like_ not to require a dedicated dataset for it, but it's
> not like it's difficult for users to create one.

Use delegation. Users can create their own datasets, set parameters,
etc. For this case, you could consider changing recordsize, if you
really are so worried about 1k. IMHO, it is easier and less expensive
in process and pain to just buy more disk when needed.
 -- richard
Re: [zfs-discuss] ZFS file disk usage
Andrew Deason wrote:
> On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski
> mi...@task.gda.pl wrote:
> > if you would create a dedicated dataset for your cache and set
> > quota on it then instead of tracking a disk space usage for each
> > file you could easily check how much disk space is being used in
> > the dataset. Would it suffice for you?
> No. We need to be able to tell how close to full we are, for
> determining when to start/stop removing things from the cache before
> we can add new items to the cache again.

But having a dedicated dataset will let you answer such a question
immediately, as then you get information from zfs for the dataset on
how much space is used (everything: data + metadata) and how much is
left.

> I'd also _like_ not to require a dedicated dataset for it, but it's
> not like it's difficult for users to create one.

No, it is not.

> > Setting recordsize to 1k if you have lots of files (I assume)
> > larger than that doesn't really make sense. The problem with
> > metadata is that by default it is also compressed, so there is no
> > easy way to tell how much disk space it occupies for a specified
> > file using a standard API.
> We do not know in advance what file sizes we'll be seeing in general.
> We could of course tell people to tune the cache dataset according to
> their usage pattern, but I don't think users are generally going to
> know what their cache usage pattern looks like. I can say that at
> least right now, usually each file will be at most 1M long (1M is the
> max unless the user specifically changes it). But between the range
> 1k-1M, I don't know what the distribution looks like.

What I meant was that I believe the default recordsize of 128k should
be fine for you (files smaller than 128k will use a smaller recordsize,
larger ones will use a recordsize of 128k). The only problem will be
with files truncated to 0 and growing again, as they will be stuck with
an old recordsize. But in most cases it won't be a practical problem
anyway.
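The dataset-level check Robert describes could be probed with plain
statvfs(): with a dedicated dataset, used vs. total space covers data
and metadata together. A sketch, with the caveat raised elsewhere in
the thread that the reported numbers can lag uncommitted writes by up
to ~30 seconds; the helper name is my own.

```c
#include <sys/statvfs.h>

/*
 * Sketch of the dataset-level check: statvfs on the cache dataset's
 * mountpoint reports used vs. total space (data + metadata).  Returns
 * 0 on success with *pct set to the percentage of space used, -1 on
 * error.  Note the thread's caveat: these numbers can lag uncommitted
 * writes by up to ~30 seconds.
 */
static int
dataset_pct_used(const char *path, double *pct)
{
    struct statvfs vs;

    if (statvfs(path, &vs) != 0)
        return -1;

    /* f_blocks and f_bfree are counted in f_frsize units */
    *pct = 100.0 * (double)(vs.f_blocks - vs.f_bfree)
                 / (double)vs.f_blocks;
    return 0;
}
```

This answers "how full is the dataset" in one call, but as Andrew notes
below, it cannot tell you when an eviction pass has actually reached
the low-water mark until the next commit.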
Re: [zfs-discuss] ZFS file disk usage
On Fri, 18 Sep 2009 16:38:28 -0400 Robert Milkowski mi...@task.gda.pl
wrote:

> > No. We need to be able to tell how close to full we are, for
> > determining when to start/stop removing things from the cache
> > before we can add new items to the cache again.
> But having a dedicated dataset will let you answer such a question
> immediately, as then you get information from zfs for the dataset on
> how much space is used (everything: data + metadata) and how much is
> left.

Immediately? There isn't a delay between the write and the next commit
when the space is recorded? (Do you mean a statvfs equivalent, or some
zfs-specific call?) And the current code is structured such that we
record usage changes before a write; it would be a huge pain to rely on
the write to calculate the usage (for that and other reasons).

> > > Setting recordsize to 1k if you have lots of files (I assume)
> > > larger than that doesn't really make sense. The problem with
> > > metadata is that by default it is also compressed, so there is no
> > > easy way to tell how much disk space it occupies for a specified
> > > file using a standard API.
> > We do not know in advance what file sizes we'll be seeing in
> > general. We could of course tell people to tune the cache dataset
> > according to their usage pattern, but I don't think users are
> > generally going to know what their cache usage pattern looks like.
> > I can say that at least right now, usually each file will be at
> > most 1M long (1M is the max unless the user specifically changes
> > it). But between the range 1k-1M, I don't know what the
> > distribution looks like.
> What I meant was that I believe the default recordsize of 128k should
> be fine for you (files smaller than 128k will use a smaller
> recordsize, larger ones will use a recordsize of 128k). The only
> problem will be with files truncated to 0 and growing again, as they
> will be stuck with an old recordsize. But in most cases it won't be a
> practical problem anyway.

Well, it may or may not be 'fine'; we may have a lot of little files in
the cache, and rounding up to 128k for each one reduces our disk
efficiency somewhat. Files are truncated to 0 and grow again quite
often in busy clients. But that's an efficiency issue; we'd still be
able to stay within the configured limit that way.

But anyway, 128k may be fine for me, but what about if someone sets
their recordsize to something different? That's why I was wondering
about the overhead if someone sets the recordsize to 1k; is there no
way to account for it even if I know the recordsize is 1k?

--
Andrew Deason
adea...@sinenomine.net
Re: [zfs-discuss] ZFS file disk usage
Andrew Deason wrote:
> On Fri, 18 Sep 2009 16:38:28 -0400 Robert Milkowski
> mi...@task.gda.pl wrote:
> > But having a dedicated dataset will let you answer such a question
> > immediately, as then you get information from zfs for the dataset
> > on how much space is used (everything: data + metadata) and how
> > much is left.
> Immediately? There isn't a delay between the write and the next
> commit when the space is recorded? (Do you mean a statvfs equivalent,
> or some zfs-specific call?) And the current code is structured such
> that we record usage changes before a write; it would be a huge pain
> to rely on the write to calculate the usage (for that and other
> reasons).

There will be a delay of up-to 30s currently. But how much data do you
expect to be pushed within 30s? Let's say it would be even 10GB into
lots of small files, and you would calculate the total size by only
summing up a logical size of data. Would you really expect that an
error would be greater than 5%, which would be 500MB? Does it matter in
practice?

> > What I meant was that I believe the default recordsize of 128k
> > should be fine for you (files smaller than 128k will use a smaller
> > recordsize, larger ones will use a recordsize of 128k). The only
> > problem will be with files truncated to 0 and growing again, as
> > they will be stuck with an old recordsize. But in most cases it
> > won't be a practical problem anyway.
> Well, it may or may not be 'fine'; we may have a lot of little files
> in the cache, and rounding up to 128k for each one reduces our disk
> efficiency somewhat. Files are truncated to 0 and grow again quite
> often in busy clients. But that's an efficiency issue; we'd still be
> able to stay within the configured limit that way.
>
> But anyway, 128k may be fine for me, but what about if someone sets
> their recordsize to something different? That's why I was wondering
> about the overhead if someone sets the recordsize to 1k; is there no
> way to account for it even if I know the recordsize is 1k?

What if a user enables compression like lzjb or even gzip? How would
you like to take it into account before doing writes? What if a user
creates a snapshot? How would you take it into account?

I suspect that you are looking too closely for no real benefit.
Especially if you don't want to dedicate a dataset to the cache, you
would expect other applications in a system to write to the same file
system but different locations, which you have no control over or
ability to predict how much data will be written at all. Be it Linux,
Solaris, BSD, ... the issue will be there.

IMHO a dedicated dataset and statvfs() on it should be good enough,
possibly with an estimate before writing your data (as a total logical
file size from the application's point of view) -- however, due to
compression or dedup enabled by the user, that estimate could be
totally wrong, so it probably doesn't actually make sense.

--
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] ZFS file disk usage
Andrew Deason wrote: As I'm sure you're all aware, filesize in ZFS can differ greatly from actual disk usage, depending on access patterns. e.g. truncating a 1M file down to 1 byte still uses up about 130k on disk when recordsize=128k. I'm aware that this is a result of ZFS's rather different internals, and that it works well for normal usage, but this can make things difficult for applications that wish to restrain their own disk usage. The particular application I'm working on that has such a problem is the OpenAFS http://www.openafs.org/ client, when it uses ZFS as the disk cache partition. The disk cache is constrained to a user-configurable size, and the amount of cache used is tracked by counters internal to the OpenAFS client. Normally cache usage is tracked by just taking the file length of a particular file in the cache, and rounding it up to the next frsize boundary of the cache filesystem. This is obviously wrong when ZFS is used, and so our cache usage tracking can get very incorrect. So, I have two questions which would help us fix this: 1. Is there any interface to ZFS (or a configuration knob or something) that we can use from a kernel module to explicitly return a file to the more predictable size? In the above example, truncating a 1M file (call it 'A') to 1b mkes it take up 130k, but if we create a new file (call it 'B') with that 1b in it, it only takes up about 1k. Is there any operation we can perform on file 'A' to make it take up less space without having to create a new file 'B'? The cache files are often truncated and overwritten with new data, which is why this can become a problem. If there was some way to explicitly signal to ZFS that we want a particular file to be put in a smaller block or something, that would be helpful. (I am mostly ignorant on ZFS internals; if there's somewhere that would have told me this information, let me know) 2. Lacking 1., can anyone give an equation relating file length, max size on disk, and recordsize? 
(and any additional parameters needed). If we just have a way of knowing in advance how much disk space we're going to take up by writing a certain amount of data, we should be okay. Or, if anyone has any other ideas on how to overcome this, it would be welcomed. When creating a new file zfs will set its block size to be no larger than current value of recordsize. If there is at least recordsize of data to be written then the blocksize will equal to recordsize. From now on the file blocksize is frozen - that's why when you truncate it it keeps its original blocksize size. It also means that if file was smaller than recordsize (so its blocksize was smaller too) when you truncate it to 1B it will keep its smaller blocksize. IMHO you won't be able to lower a file blocksize other than by creating a new file. For example: mi...@r600:~/progs$ mkfile 10m file1 mi...@r600:~/progs$ ./stat file1 size: 10485760blksize: 131072 mi...@r600:~/progs$ truncate -s 1 file1 mi...@r600:~/progs$ ./stat file1 size: 1blksize: 131072 mi...@r600:~/progs$ mi...@r600:~/progs$ rm file1 mi...@r600:~/progs$ mi...@r600:~/progs$ mkfile 1 file1 mi...@r600:~/progs$ ./stat file1 size: 1blksize: 10240 mi...@r600:~/progs$ truncate -s 1 file1 mi...@r600:~/progs$ ./stat file1 size: 1blksize: 10240 mi...@r600:~/progs$ If you are not worried with this extra overhead and you are mostly concerned with proper accounting of used disk space than instead of relaying on a file size alone you should take intro account its blocksize and round file size up-to blocksize (actual file size on disk (not counting metadata) is N*blocksize). However IIRC there is an open bug/rfe asking for a special treatment of a file's tail block so it can be smaller than the file blocksize. Once it's integrated your math could be wrong again. Please also note that relaying on a logical file size could be even more misleading if compression is enabled in zfs (or dedup in the future). 
Relying on blocksize will give you more accurate estimates. You can get a file's blocksize by using stat() and reading buf.st_blksize, or you can get a good estimate of used disk space from buf.st_blocks*512.

mi...@r600:~/progs$ cat stat.c
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat buf;

    if (!stat(argv[1], &buf)) {
        printf("size: %ld\tblksize: %ld\n",
            (long)buf.st_size, (long)buf.st_blksize);
    } else {
        printf("ERROR: stat(), errno: %d\n", errno);
        exit(1);
    }
    return 0;
}
mi...@r600:~/progs$

--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS file disk usage
On Thu, 17 Sep 2009 22:55:38 +0100 Robert Milkowski mi...@task.gda.pl wrote:

> IMHO you won't be able to lower a file's blocksize other than by creating a new file. For example:

Okay, thank you.

> If you are not worried about this extra overhead and are mostly concerned with proper accounting of used disk space, then instead of relying on the file size alone you should take into account its blocksize and round the file size up to blocksize (actual file size on disk, not counting metadata, is N*blocksize).

Metadata can be nontrivial for small blocksizes, though, can't it? I tried similar tests with varying recordsizes, and with recordsize=1k a file with 1M bytes written to it took up significantly more than 1024 1k blocks. Is there a reliable way to account for this? Through experimenting with various recordsizes and file sizes I can see enough of a pattern to try to come up with an equation for the total disk usage, but that doesn't mean such a relation would be correct... if someone could give me something a bit more authoritative, it would be nice.

> However, IIRC there is an open bug/rfe asking for special treatment of a file's tail block so it can be smaller than the file's blocksize. Once that's integrated, your math could be wrong again. Please also note that relying on a logical file size could be even more misleading if compression is enabled in zfs (or dedup in the future). Relying on blocksize will give you more accurate estimates.

I was a bit unclear. We're not so concerned about the math being wrong in general; we just need to make sure we are not significantly underestimating the usage. If we overestimate within reason, that's fine, but getting the tightest bound is obviously more desirable. So I'm not worried about compression, dedup, or the tail block being treated in such a way.
> You can get a file's blocksize by using stat() and reading buf.st_blksize, or you can get a good estimate of used disk space from buf.st_blocks*512.

Hmm, I thought I had tried this, but st_blocks didn't seem to be updated accurately until some time after a write. I'd also like to avoid having to stat the file after every write or truncate in order to get the file size. The way the code is currently structured, the space calculations are intended to be made /before/ the write is done. It may be possible to change that, but I'd rather not, if possible (and I'd have to make sure there's no significant speed hit in doing so).

--
Andrew Deason
adea...@sinenomine.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS file disk usage
If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking disk space usage for each file you can easily check how much disk space is used in the dataset as a whole. Would that suffice for you? Setting recordsize to 1k, when (I assume) you have lots of files larger than that, doesn't really make sense. The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.

--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss