Don, thanks for your comments; please see below:

Don Cragun wrote:
>>Date: Mon, 27 Aug 2007 16:04:42 -0500
>>From: Chris Kirby <chris.kirby at sun.com>
>
> The base point is that any time you lie to applications, some
> application software is going to make wrong decisions based on the lie.


Yes, and we certainly don't want to lie. But returning an
error when we can return valid (albeit less precise) info
will also cause applications to make wrong decisions.

In the case of the NetBeans installer, it died because
it thought there wasn't enough free space when, in fact,
there were several TB of space available.

I suspect that most apps that use f_bfree/f_bavail just
want to know if they have enough space to write their
data.
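
To illustrate (just a sketch; the function name and the notion of a
fixed "needed" size are mine), such an app really only wants:

    #include <sys/statvfs.h>

    /*
     * Return 1 if path has at least "needed" bytes available,
     * 0 if not, -1 on error.
     */
    int
    have_space(const char *path, unsigned long long needed)
    {
        struct statvfs vfs;

        if (statvfs(path, &vfs) != 0)
            return (-1);

        /* f_bavail is counted in units of f_frsize, not bytes. */
        unsigned long long avail =
            (unsigned long long)vfs.f_bavail * vfs.f_frsize;

        return (avail >= needed);
    }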


> 
>>For ZFS, we report f_frsize as 512 regardless of the size of
>>the fs.  ...
> 
> 
> Why?  Why shouldn't you always set f_frsize to the actual size of an
> allocation unit on the filesystem?  Is it still true that we don't
> support disks formatted with 1024 byte sectors?


For ZFS, we don't have a fixed allocation block size, so in general
there won't be one true f_frsize across an entire VFS.  So we return
SPA_MINBLOCKSIZE (512) for f_frsize.
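
The block-count fields are then just reported in those 512-byte
units.  Roughly (a sketch, not the actual zfs_statvfs() code):

    #include <stdint.h>

    #define MINBLOCKSIZE 512    /* SPA_MINBLOCKSIZE */

    /* Express byte counts in f_frsize (512-byte) units. */
    static void
    fill_block_fields(uint64_t total_bytes, uint64_t avail_bytes,
        uint64_t *blocks, uint64_t *bfree)
    {
        *blocks = total_bytes / MINBLOCKSIZE;
        *bfree = avail_bytes / MINBLOCKSIZE;
    }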


> When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
> correct values for these fields are larger; you are not returning valid
> information.


I think it's valid in the sense that you will be able to create at
least UINT32_MAX files.  Of course, once you've done so,
we might still report that you can create UINT32_MAX
additional files.  :-)
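
(Illustrative only, not the actual kernel code; the 32-bit fields
simply get clamped:)

    #include <stdint.h>

    /* Clamp a 64-bit count to what a 32-bit statvfs field can hold. */
    static uint32_t
    clamp32(uint64_t v)
    {
        return ((v > UINT32_MAX) ? UINT32_MAX : (uint32_t)v);
    }

with clamp32() applied to each of f_files, f_ffree, and f_favail.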

Any application making a decision on an available file count such
that UINT32_MAX is not enough but UINT32_MAX+1 would be OK
should be using the correct largefile syscalls like statvfs64 anyway.
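
e.g. (assuming the transitional largefile environment; "/tank" is
just an example mount point):

    #define _LARGEFILE64_SOURCE
    #include <sys/statvfs.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct statvfs64 vfs;

        if (statvfs64("/tank", &vfs) != 0) {
            perror("statvfs64");
            return (1);
        }

        /* f_favail is 64 bits wide here, so no capping occurs. */
        printf("files available: %llu\n",
            (unsigned long long)vfs.f_favail);
        return (0);
    }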

> 
> You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
> and f_bavail; but you aren't checking to see if that is true or not.
> (If shifting f_blocks, f_bfree, or f_bavail right throws away a bit
> that was not a zero bit; the scaled values being returned are not
> valid.)


You're right that we're discarding some low-order bits through the
scaling process.  However, any non-zero bits that are discarded
represent only partial f_frsize blocks.  For any filesystem large
enough to get into this situation, we're talking about a relatively
*very* small amount of rounding down.  (e.g., for a 1PB fs, f_frsize
is only 256K)
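
The scaling is morally equivalent to this (illustrative, not the
actual code):

    #include <stdint.h>

    /*
     * Double the reported f_frsize (and halve the block counts) until
     * everything fits in 32 bits.  All of the discarded bits together
     * amount to less than one of the final f_frsize blocks.
     */
    static void
    scale_to_32bit(uint64_t *frsize, uint64_t *blocks, uint64_t *bfree,
        uint64_t *bavail)
    {
        while (*blocks > UINT32_MAX || *bfree > UINT32_MAX ||
            *bavail > UINT32_MAX) {
            *frsize <<= 1;
            *blocks >>= 1;
            *bfree >>= 1;
            *bavail >>= 1;
        }
    }

For the 1PB case: a 10^15-byte fs has ~1.95e12 512-byte blocks, and
nine shifts bring that under UINT32_MAX, which is where the 256K
f_frsize above comes from.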

Remember that the fs code can be doing delayed writes, delayed
allocation, background delete processing, etc.  So the statvfs
values are just rumors anyway.  Most filesystems don't even bother
to grab a lock when reporting statvfs info.

> 
> Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
> relationship between f_bsize and f_frsize, applications may well have
> made their own assumptions.  Is there documentation somewhere that
> specifies how many bytes should be written at a time (on boundaries
> that is a multiple of that value) to get the most efficiency out of
> the underlying hardware?  I would hope that f_bsize would be that
> value.  If it is, it seems that f_bsize should be an integral multiple
> of f_frsize.

Aside from the comment in statvfs(2) about f_bsize being the
"preferred file system block size", I can't find any documentation
that talks about that.

For filesystems that support direct I/O, f_bsize has traditionally
provided the most efficient I/O size multiplier.
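
e.g. (illustrative; the helper is mine), an app could round its
transfer size up to an f_bsize multiple:

    #include <sys/statvfs.h>
    #include <stddef.h>

    /* Round a requested I/O size up to a whole number of f_bsize blocks. */
    size_t
    preferred_io_size(const char *path, size_t at_least)
    {
        struct statvfs vfs;

        if (statvfs(path, &vfs) != 0 || vfs.f_bsize == 0)
            return (at_least);

        return (((at_least + vfs.f_bsize - 1) / vfs.f_bsize) *
            vfs.f_bsize);
    }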

But the setting of f_bsize is up to the underlying fs.  And at least
for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
That's also why I don't rescale f_bsize.


-Chris
