Re: [zfs-discuss] ZFS file disk usage

2009-09-22 Thread Andrew Deason
On Mon, 21 Sep 2009 18:20:53 -0400
Richard Elling richard.ell...@gmail.com wrote:

 On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:
 
  On Mon, 21 Sep 2009 17:13:26 -0400
  Richard Elling richard.ell...@gmail.com wrote:
 
  You don't know the max overhead for the file before it is
  allocated. You could guess at a max of 3x size + at least three
  blocks.  Since you can't control this, it seems like the worst
  case is when copies=3.
 
  Is that max with copies=3? Assume copies=1; what is it then?
 
 1x size + 1 block.

That seems to differ quite a bit from what I've seen; perhaps I am
misunderstanding... is the "+ 1 block" of a different size than the
recordsize? With recordsize=1k:

$ ls -ls foo
2261 -rw-r--r--   1 root root 1048576 Sep 22 10:59 foo

1024k vs 1130k
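(That is, 2261 512-byte blocks = 1157632 bytes on disk for 1048576
bytes of logical data.)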

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-22 Thread Richard Elling

On Sep 22, 2009, at 8:07 AM, Andrew Deason wrote:
> That seems to differ quite a bit from what I've seen; perhaps I am
> misunderstanding... is the "+ 1 block" of a different size than the
> recordsize? With recordsize=1k:
>
> $ ls -ls foo
> 2261 -rw-r--r--   1 root root 1048576 Sep 22 10:59 foo
>
> 1024k vs 1130k

Well, there it is.  I suggest suitable guard bands.
 -- richard


Re: [zfs-discuss] ZFS file disk usage

2009-09-22 Thread Andrew Deason
On Tue, 22 Sep 2009 13:26:59 -0400
Richard Elling richard.ell...@gmail.com wrote:

  That seems to differ quite a bit from what I've seen; perhaps I am
  misunderstanding... is the "+ 1 block" of a different size than the
  recordsize? With recordsize=1k:
 
  $ ls -ls foo
  2261 -rw-r--r--   1 root root 1048576 Sep 22 10:59 foo
 
 Well, there it is.  I suggest suitable guard bands.

So, you would say it's reasonable to assume the overhead will always be
less than about 100k or 10%?

And to be sure... if we're to be rounding up to the next recordsize
boundary, are we guaranteed to be able to get that from the blocksize
reported by statvfs?
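
For concreteness, the rounding I have in mind looks something like
this; just a sketch, and it assumes the f_frsize reported by statvfs()
really does reflect the dataset's recordsize, which is exactly the part
I'm unsure about:

#include <stdio.h>
#include <sys/statvfs.h>

/* Round a logical file length up to the next f_frsize boundary and add
 * one block of slop.  Assumes f_frsize tracks the recordsize, which is
 * the open question above. */
static unsigned long long
estimate_usage(const char *path, unsigned long long length)
{
    struct statvfs vfs;
    unsigned long long bsize;

    if (statvfs(path, &vfs) != 0)
        return 0;  /* real code would handle the error */

    bsize = vfs.f_frsize ? vfs.f_frsize : vfs.f_bsize;
    return ((length + bsize - 1) / bsize) * bsize + bsize;
}

int main(void)
{
    /* e.g. a 1M cache file under the cache directory */
    printf("%llu\n", estimate_usage("/usr/vice/cache", 1048576ULL));
    return 0;
}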

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-21 Thread Andrew Deason
On Sun, 20 Sep 2009 20:31:57 -0400
Richard Elling richard.ell...@gmail.com wrote:

 If you are just building a cache, why not just make a file system and
 put a reservation on it? Turn off auto snapshots and set other
 features as per best practices for your workload? In other words,
 treat it like we
 treat dump space.
 
 I think that we are getting caught up in trying to answer the question
 you ask rather than solving the problem you have... perhaps because
 we don't understand the problem.

Yes, possibly... some of these suggestions don't quite make a lot of
sense to me. We can't just make a filesystem and put a reservation on
it; we are just an application the administrator puts on a machine for
it to access AFS. So I'm not sure when you are imagining we do that;
when the client starts up? Or part of the installation procedure?
Requiring a separate filesystem seems unnecessarily restrictive.

And I still don't see how that helps. Making an fs with a reservation
would definitely limit us to the specified space, but we still can't get
an accurate picture of the current disk usage. I already mentioned why
using statvfs is not usable with that commit delay.

But solving the general problem for me isn't necessary. If I could just
get a ballpark estimate of the max overhead for a file, I would be fine.
I haven't paid attention to it before, so I don't even have an
intuitive feel for what it is.

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-21 Thread Richard Elling

On Sep 21, 2009, at 7:11 AM, Andrew Deason wrote:
> Yes, possibly... some of these suggestions don't quite make a lot of
> sense to me. We can't just make a filesystem and put a reservation on
> it; we are just an application the administrator puts on a machine for
> it to access AFS. So I'm not sure when you are imagining we do that;
> when the client starts up? Or part of the installation procedure?
> Requiring a separate filesystem seems unnecessarily restrictive.
>
> And I still don't see how that helps. Making an fs with a reservation
> would definitely limit us to the specified space, but we still can't get
> an accurate picture of the current disk usage. I already mentioned why
> using statvfs is not usable with that commit delay.

OK, so the problem you are trying to solve is how much stuff can I
place in the remaining free space?  I don't think this is knowable
for a dynamic file system like ZFS where metadata is dynamically
allocated.

> But solving the general problem for me isn't necessary. If I could just
> get a ballpark estimate of the max overhead for a file, I would be fine.
> I haven't paid attention to it before, so I don't even have an
> intuitive feel for what it is.

You don't know the max overhead for the file before it is allocated.
You could guess at a max of 3x size + at least three blocks.  Since
you can't control this, it seems like the worst case is when copies=3.
 -- richard



Re: [zfs-discuss] ZFS file disk usage

2009-09-21 Thread Andrew Deason
On Mon, 21 Sep 2009 17:13:26 -0400
Richard Elling richard.ell...@gmail.com wrote:

 OK, so the problem you are trying to solve is how much stuff can I
 place in the remaining free space?  I don't think this is knowable
 for a dynamic file system like ZFS where metadata is dynamically
 allocated.

Yes. And I acknowledge that we can't know that precisely; I'm trying for
an estimate on the bound.

 You don't know the max overhead for the file before it is allocated.
 You could guess at a max of 3x size + at least three blocks.  Since
 you can't control this, it seems like the worst case is when copies=3.

Is that max with copies=3? Assume copies=1; what is it then?
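
(For my own notes, this is how I would write that guess down; it is
only my reading of your estimate, and the rounding to recordsize is my
assumption:)

/* My reading of the guess above, nothing authoritative: with N copies,
 * worst-case usage <= N * (size rounded up to recordsize) + N blocks.
 * copies=3 gives the "3x size + three blocks" case; copies=1 would
 * give "1x size + 1 block". */
static unsigned long long
max_usage_guess(unsigned long long size, unsigned long long recordsize,
                unsigned int copies)
{
    unsigned long long rounded =
        ((size + recordsize - 1) / recordsize) * recordsize;
    return copies * (rounded + recordsize);
}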

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-20 Thread Andrew Deason
On Fri, 18 Sep 2009 17:54:41 -0400
Robert Milkowski mi...@task.gda.pl wrote:

 There will be a delay of up to 30s currently.
 
 But how much data do you expect to be pushed within 30s?
 Let's say it would be even 10G to lots of small files and you would
 calculate the total size by only summing up a logical size of data.
 Would you really expect that an error would be greater than 5%, which
 would be 500MB? Does it matter in practice?

Well, that wasn't the problem I was thinking of. I meant, if we have to
wait 30 seconds after the write to measure the disk usage... what do I
do, just sleep 30s after the write before polling for disk usage?

We could just ask for disk usage when we write, knowing that it doesn't
take into account the write we are performing... but we're changing what
we're measuring, then. If we are removing things from the cache in order
to free up space, how do we know when to stop?

To illustrate: normally when the cache is 98% full, we remove items
until we are 95% full before we allow a write to happen again. If we
relied on statvfs information for our disk usage information, we would
start removing items at 98%, and have no idea when we hit 95% unless we
wait 30 seconds.
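
In rough pseudo-C, the loop looks like this; a simplified sketch of the
idea, not the actual OpenAFS code, and the thresholds and the
evict_one_item() helper are only for illustration:

#define HIGH_WATER 98  /* start evicting at 98% full */
#define LOW_WATER  95  /* stop evicting at 95% full  */

extern unsigned long long evict_one_item(void);  /* returns bytes freed */

static void
ensure_space(unsigned long long used, unsigned long long limit)
{
    if (used * 100 >= limit * HIGH_WATER) {
        /* Evict until we drop below the low watermark; only then do we
         * allow new writes.  This requires 'used' to be accurate now,
         * not 30 seconds from now. */
        while (used * 100 > limit * LOW_WATER)
            used -= evict_one_item();
    }
}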

If you are simply saying that the logical size and the used disk blocks
on ZFS are similar enough not to make a difference... well, that's what
I've been asking. I have asked what the maximum difference is between
the logical size rounded up to recordsize and the size taken up on
disk, and haven't received an answer yet. If the answer is "small
enough that you don't care", then fantastic.

 What if user enables compression like lzjb or even gzip?
 How would you like to take it into account before doing writes?
 
 What if user creates a snapshot? How would you take it into account?

Then it will be wrong; we do not take them into account. I do not care
about those cases. It is already impossible to enforce that the cache
tracking data is 100% correct all of the time.

Imagine we somehow had a way to account for all of those cases you
listed, one that would make me happy. Say the directory the user uses
for the
cache data is /usr/vice/cache (one standard path to put it). The OpenAFS
client will put cache data in e.g. /usr/vice/cache/D0/V1 and a bunch of
other files.  If the user puts their own file in
/usr/vice/cache/reallybigfile, our cache tracking information will
always be off, in all current implementations.  We have no control over
it, and we do not try to solve that problem.

I am treating the cases of "what if the user creates a snapshot" and
the like as a similar situation. If someone does that and runs out of
space, it is pretty easy to troubleshoot their system and say "you have
a snapshot of the cache dataset; do not do that". Right now, if someone
runs an OpenAFS client cache on zfs and runs out of space, the only
thing I can tell them is "don't use zfs", which I don't want to do.

If it works for _a_ configuration -- the default one -- that is all I am
asking for.

 I suspect that you are looking too closely for no real benefit.
 Especially if you don't want to dedicate a dataset to the cache, you
 would expect other applications in the system to write to the same file
 system but to different locations, over which you have no control and no
 way to predict how much data will be written at all. Be it Linux,
 Solaris, BSD, ... the issue will be there.

It is certainly possible for other applications to fill up the disk. We
just need to ensure that we don't fill up the disk to the point of blocking other
applications. You may think this is fruitless, and just from that
description alone, it may be. But you must understand that without an
accurate bound on the cache, well... we can eat up the disk a lot faster
than other applications without the user realizing it.

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-20 Thread Richard Elling

If you are just building a cache, why not just make a file system and
put a reservation on it? Turn off auto snapshots and set other features
as per best practices for your workload? In other words, treat it like
we treat dump space.

I think that we are getting caught up in trying to answer the question
you ask rather than solving the problem you have... perhaps because
we don't understand the problem.
 -- richard



Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Andrew Deason
On Thu, 17 Sep 2009 18:40:49 -0400
Robert Milkowski mi...@task.gda.pl wrote:

 if you would create a dedicated dataset for your cache and set quota
 on it then instead of tracking disk space usage for each file you
 could easily check how much disk space is being used in the dataset.
 Would it suffice for you?

No. We need to be able to tell how close to full we are, for determining
when to start/stop removing things from the cache before we can add new
items to the cache again.

I'd also _like_ not to require a dedicated dataset for it, but it's not
like it's difficult for users to create one.

 Setting recordsize to 1k if you have lots of files (I assume) larger 
 than that doesn't really make sense.
 The problem with metadata is that by default it is also compressed so 
 there is no easy way to tell how much disk space it occupies for a 
 specified file using standard API.

We do not know in advance what file sizes we'll be seeing in general. We
could of course tell people to tune the cache dataset according to their
usage pattern, but I don't think users are generally going to know what
their cache usage pattern looks like.

I can say that at least right now, usually each file will be at most 1M
long (1M is the max unless the user specifically changes it). But
within the range 1k-1M, I don't know what the distribution looks like.

I can't get an /estimate/ of the data+metadata disk usage? What about
the hypothetical case where the metadata compression ratio is
effectively the same as without compression; what would it be then?

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Andrew Deason
On Fri, 18 Sep 2009 12:48:34 -0400
Richard Elling richard.ell...@gmail.com wrote:

 The transactional nature of ZFS may work against you here.
 Until the data is committed to disk, it is unclear how much space
 it will consume. Compression clouds the crystal ball further.

...but not impossible. I'm just looking for a reasonable upper bound.
For example, if I always rounded up to the next 128k mark, and added an
additional 128k, that would always give me an upper bound (for files <=
1M), as far as I can tell. But that is not a very tight bound; can you
suggest anything better?
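
Concretely, I mean something like this; it is only a sketch of my own
guess, assuming recordsize=128k, and not anything derived from ZFS
itself:

#define RECSIZE (128ULL * 1024)  /* assuming recordsize=128k */

/* Round the logical length up to the next 128k and add one more 128k
 * of slop; my own rough upper bound for files <= 1M, nothing more. */
static unsigned long long
upper_bound(unsigned long long length)
{
    unsigned long long rounded =
        ((length + RECSIZE - 1) / RECSIZE) * RECSIZE;
    return rounded + RECSIZE;
}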

  I'd also _like_ not to require a dedicated dataset for it, but
  it's not
  like it's difficult for users to create one.
 
 Use delegation.  Users can create their own datasets, set parameters,
 etc. For this case, you could consider changing recordsize, if you
 really are so worried about 1k. IMHO, it is easier and less expensive
 in process and pain to just buy more disk when needed.

Users of OpenAFS, not unprivileged users. All users I am talking about
are the administrators for their machines. I would just like to reduce
the number of filesystem-specific steps needed to be taken to set up the
cache. You don't need to do anything special for a tmpfs cache, for
instance, or ext2/3 caches on Linux.

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Richard Elling

On Sep 18, 2009, at 7:36 AM, Andrew Deason wrote:
> No. We need to be able to tell how close to full we are, for determining
> when to start/stop removing things from the cache before we can add new
> items to the cache again.

The transactional nature of ZFS may work against you here.
Until the data is committed to disk, it is unclear how much space
it will consume. Compression clouds the crystal ball further.

> I'd also _like_ not to require a dedicated dataset for it, but it's not
> like it's difficult for users to create one.

Use delegation.  Users can create their own datasets, set parameters,
etc. For this case, you could consider changing recordsize, if you
really are so worried about 1k. IMHO, it is easier and less expensive
in process and pain to just buy more disk when needed.
 -- richard



Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Robert Milkowski

Andrew Deason wrote:
> No. We need to be able to tell how close to full we are, for determining
> when to start/stop removing things from the cache before we can add new
> items to the cache again.

but having a dedicated dataset will let you answer such a question
immediately, as then you get information from ZFS for the dataset on
how much space is used (everything: data + metadata) and how much is
left.

> I'd also _like_ not to require a dedicated dataset for it, but it's not
> like it's difficult for users to create one.

no, it is not.

> We do not know in advance what file sizes we'll be seeing in general. We
> could of course tell people to tune the cache dataset according to their
> usage pattern, but I don't think users are generally going to know what
> their cache usage pattern looks like.
>
> I can say that at least right now, usually each file will be at most 1M
> long (1M is the max unless the user specifically changes it). But
> within the range 1k-1M, I don't know what the distribution looks like.

What I meant was that I believe that default recordsize of 128k should
be fine for you (files smaller than 128k will use smaller recordsize,
larger ones will use a recordsize of 128k). The only problem will be
with files truncated to 0 and growing again, as they will be stuck with
an old recordsize. But in most cases it probably won't be a practical
problem anyway.






Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Andrew Deason
On Fri, 18 Sep 2009 16:38:28 -0400
Robert Milkowski mi...@task.gda.pl wrote:

  No. We need to be able to tell how close to full we are, for
  determining when to start/stop removing things from the cache
  before we can add new items to the cache again.

 
 but having a dedicated dataset will let you answer such a question
 immediately, as then you get information from ZFS for the dataset on
 how much space is used (everything: data + metadata) and how much is
 left.

Immediately? There isn't a delay between the write and the next commit
when the space is recorded? (Do you mean a statvfs equivalent, or some
zfs-specific call?)

And the current code is structured such that we record usage changes
before a write; it would be a huge pain to rely on the write to
calculate the usage (for that and other reasons).

  Setting recordsize to 1k if you have lots of files (I assume)
  larger than that doesn't really make sense.
  The problem with metadata is that by default it is also compressed
  so there is no easy way to tell how much disk space it occupies
  for a specified file using standard API.
  
 
  We do not know in advance what file sizes we'll be seeing in
  general. We could of course tell people to tune the cache dataset
  according to their usage pattern, but I don't think users are
  generally going to know what their cache usage pattern looks like.
 
  I can say that at least right now, usually each file will be at
  most 1M long (1M is the max unless the user specifically changes
  it). But between the range 1k-1M, I don't know what the
  distribution looks like.
 

 What I meant was that I believe that default recordsize of 128k
 should be fine for you (files smaller than 128k will use smaller
 recordsize, larger ones will use a recordsize of 128k). The only
 problem will be with files truncated to 0 and growing again, as they
 will be stuck with an old recordsize. But in most cases it probably
 won't be a practical problem anyway.

Well, it may or may not be 'fine'; we may have a lot of little files in
the cache, and rounding up to 128k for each one reduces our disk
efficiency somewhat. Files are truncated to 0 and grow again quite often
in busy clients. But that's an efficiency issue, we'd still be able to
stay within the configured limit that way.

But anyway, 128k may be fine for me, but what about if someone sets
their recordsize to something different? That's why I was wondering
about the overhead if someone sets the recordsize to 1k; is there no way
to account for it even if I know the recordsize is 1k?

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Robert Milkowski

Andrew Deason wrote:
> Immediately? There isn't a delay between the write and the next commit
> when the space is recorded? (Do you mean a statvfs equivalent, or some
> zfs-specific call?)
>
> And the current code is structured such that we record usage changes
> before a write; it would be a huge pain to rely on the write to
> calculate the usage (for that and other reasons).

There will be a delay of up to 30s currently.

But how much data do you expect to be pushed within 30s?
Let's say it would be even 10G to lots of small files and you would
calculate the total size by only summing up a logical size of data.
Would you really expect that an error would be greater than 5%, which
would be 500MB? Does it matter in practice?

> Well, it may or may not be 'fine'; we may have a lot of little files in
> the cache, and rounding up to 128k for each one reduces our disk
> efficiency somewhat. Files are truncated to 0 and grow again quite often
> in busy clients. But that's an efficiency issue, we'd still be able to
> stay within the configured limit that way.
>
> But anyway, 128k may be fine for me, but what about if someone sets
> their recordsize to something different? That's why I was wondering
> about the overhead if someone sets the recordsize to 1k; is there no way
> to account for it even if I know the recordsize is 1k?

What if user enables compression like lzjb or even gzip?
How would you like to take it into account before doing writes?

What if user creates a snapshot? How would you take it into account?

I suspect that you are looking too closely for no real benefit.
Especially if you don't want to dedicate a dataset to the cache, you
would expect other applications in the system to write to the same file
system but to different locations, over which you have no control and no
way to predict how much data will be written at all. Be it Linux,
Solaris, BSD, ... the issue will be there.

IMHO a dedicated dataset and statvfs() on it should be good enough,
eventually with an estimate before writing your data (as a total logical
file size from the application's point of view) - however, due to
compression or dedup enabled by the user that estimate could be totally
wrong, so it probably doesn't actually make sense.



--
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] ZFS file disk usage

2009-09-17 Thread Robert Milkowski

Andrew Deason wrote:

> As I'm sure you're all aware, filesize in ZFS can differ greatly from
> actual disk usage, depending on access patterns. e.g. truncating a 1M
> file down to 1 byte still uses up about 130k on disk when
> recordsize=128k. I'm aware that this is a result of ZFS's rather
> different internals, and that it works well for normal usage, but this
> can make things difficult for applications that wish to restrain their
> own disk usage.
>
> The particular application I'm working on that has such a problem is the
> OpenAFS (http://www.openafs.org/) client, when it uses ZFS as the disk
> cache partition. The disk cache is constrained to a user-configurable
> size, and the amount of cache used is tracked by counters internal to
> the OpenAFS client. Normally cache usage is tracked by just taking the
> file length of a particular file in the cache, and rounding it up to the
> next frsize boundary of the cache filesystem. This is obviously wrong
> when ZFS is used, and so our cache usage tracking can get very
> incorrect.  So, I have two questions which would help us fix this:
>
>   1. Is there any interface to ZFS (or a configuration knob or
>   something) that we can use from a kernel module to explicitly return a
>   file to the more predictable size? In the above example, truncating a
>   1M file (call it 'A') to 1b makes it take up 130k, but if we create a
>   new file (call it 'B') with that 1b in it, it only takes up about 1k.
>   Is there any operation we can perform on file 'A' to make it take up
>   less space without having to create a new file 'B'?
>
>   The cache files are often truncated and overwritten with new data,
>   which is why this can become a problem. If there was some way to
>   explicitly signal to ZFS that we want a particular file to be put in a
>   smaller block or something, that would be helpful. (I am mostly
>   ignorant on ZFS internals; if there's somewhere that would have told
>   me this information, let me know)
>
>   2. Lacking 1., can anyone give an equation relating file length, max
>   size on disk, and recordsize? (and any additional parameters needed).
>   If we just have a way of knowing in advance how much disk space we're
>   going to take up by writing a certain amount of data, we should be
>   okay.
>
> Or, if anyone has any other ideas on how to overcome this, it would be
> welcomed.


When creating a new file, ZFS will set its block size to be no larger
than the current value of recordsize. If there is at least recordsize of
data to be written, then the blocksize will equal the recordsize. From
then on the file's blocksize is frozen - that's why when you truncate it,
it keeps its original blocksize. It also means that if the file was
smaller than recordsize (so its blocksize was smaller too), when you
truncate it to 1B it will keep its smaller blocksize. IMHO you won't be
able to lower a file's blocksize other than by creating a new file. For
example:



mi...@r600:~/progs$ mkfile 10m file1
mi...@r600:~/progs$ ./stat file1
size: 10485760   blksize: 131072
mi...@r600:~/progs$ truncate -s 1 file1
mi...@r600:~/progs$ ./stat file1
size: 1   blksize: 131072
mi...@r600:~/progs$
mi...@r600:~/progs$ rm file1
mi...@r600:~/progs$
mi...@r600:~/progs$ mkfile 1 file1
mi...@r600:~/progs$ ./stat file1
size: 1   blksize: 10240
mi...@r600:~/progs$ truncate -s 1 file1
mi...@r600:~/progs$ ./stat file1
size: 1   blksize: 10240
mi...@r600:~/progs$


If you are not worried about this extra overhead and you are mostly
concerned with proper accounting of used disk space, then instead of
relying on the file size alone you should take into account its
blocksize and round the file size up to the blocksize (actual file size
on disk (not counting metadata) is N*blocksize).
However, IIRC there is an open bug/rfe asking for special treatment of
a file's tail block so it can be smaller than the file blocksize. Once
it's integrated your math could be wrong again.
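
In code the rounding I mean is roughly this (just a sketch; the tail
block change mentioned above would invalidate it):

#include <sys/stat.h>

/* Data blocks only, no metadata: round the logical size up to a whole
 * number of the file's blocks (N*blocksize). */
static long long
rounded_size(const struct stat *st)
{
    long long bs = st->st_blksize;
    return ((st->st_size + bs - 1) / bs) * bs;
}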


Please also note that relying on a logical file size could be even more
misleading if compression is enabled in zfs (or dedup in the future).
Relying on blocksize will give you more accurate estimates.


You can get a file's blocksize by using stat() and reading the value of
buf.st_blksize, or you can get a good estimate of used disk space from
buf.st_blocks*512.


mi...@r600:~/progs$ cat stat.c

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
  struct stat buf;

  if (!stat(argv[1], &buf))
  {
    printf("size: %lld\tblksize: %ld\n",
           (long long)buf.st_size, (long)buf.st_blksize);
  }
  else
  {
    printf("ERROR: stat(), errno: %d\n", errno);
    exit(1);
  }

  return 0;
}

mi...@r600:~/progs$



--
Robert Milkowski
http://milek.blogspot.com








Re: [zfs-discuss] ZFS file disk usage

2009-09-17 Thread Andrew Deason
On Thu, 17 Sep 2009 22:55:38 +0100
Robert Milkowski mi...@task.gda.pl wrote:

 IMHO you won't be able to lower a file blocksize other than by
 creating a new file. For example:

Okay, thank you.

 If you are not worried about this extra overhead and you are mostly
 concerned with proper accounting of used disk space, then instead of
 relying on the file size alone you should take into account its
 blocksize and round the file size up to the blocksize (actual file size
 on disk (not counting metadata) is N*blocksize).

Metadata can be nontrivial for small blocksizes, though, can't it? I
tried similar tests with varying recordsizes and with recordsize=1k, a
file with 1M bytes written to it took up significantly more than 1024 1k
blocks.
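
For reference, this is the sort of check I was doing -- a standalone
sketch rather than the actual OpenAFS code, with an arbitrary path on a
recordsize=1k dataset:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "/tank/rs1k/foo";  /* some recordsize=1k dataset */
    char buf[1024];
    struct stat st;
    int fd, i;

    memset(buf, 'x', sizeof(buf));
    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    for (i = 0; i < 1024; i++)   /* 1024 * 1k = 1M of data */
        write(fd, buf, sizeof(buf));
    close(fd);

    /* st_blocks may not settle until the next commit (~30s), which is
     * part of the problem described elsewhere in this thread */
    sleep(30);

    if (stat(path, &st))
        return 1;
    printf("logical: %lld  on-disk: %lld\n",
           (long long)st.st_size, (long long)st.st_blocks * 512);
    return 0;
}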

Is there a reliable way to account for this? Through experimenting with
various recordsizes and file sizes I can see enough of a pattern to try
and come up with an equation for the total disk usage, but that doesn't
mean such a relation would be correct... if someone could give me
something a bit more authoritative, it would be nice.

 However IIRC there is an open bug/rfe asking for a special treatment
 of a file's tail block so it can be smaller than the file blocksize.
 Once it's integrated your math could be wrong again.

 Please also note that relying on a logical file size could be even
 more misleading if compression is enabled in zfs (or dedup in the
 future). Relying on blocksize will give you more accurate estimates.

I was a bit unclear. We're not so concerned about the math being wrong
in general; we just need to make sure we are not significantly
underestimating the usage. If we overestimate within reason, that's
fine, but getting the tightest bound is obviously more desirable. So I'm
not worried about compression, dedup, or the tail block being treated in
such a way.

 You can get a file blocksize by using stat() and getting value of 
 buf.st_blksize
 or you can get a good estimate of used disk space by doing
 buf.st_blocks*512

Hmm, I thought I had tried this, but st_blocks didn't seem to be updated
accurately until some time after a write.

I'd also like to avoid having to stat the file each time after a write
or truncate in order to get the file size. The current way the code is
structured intends for the space calculations to be made /before/ the
write is done. It may be possible to change that, but I'd rather not, if
possible (and I'd have to make sure there's not a significant speed hit
in doing so).

-- 
Andrew Deason
adea...@sinenomine.net


Re: [zfs-discuss] ZFS file disk usage

2009-09-17 Thread Robert Milkowski


if you would create a dedicated dataset for your cache and set quota on 
it then instead of tracking disk space usage for each file you could 
easily check how much disk space is being used in the dataset.

Would it suffice for you?

Setting recordsize to 1k if you have lots of files (I assume) larger 
than that doesn't really make sense.
The problem with metadata is that by default it is also compressed so 
there is no easy way to tell how much disk space it occupies for a 
specified file using standard API.


--
Robert Milkowski
http://milek.blogspot.com

