[zfs-code] [ufs-discuss] statvfs change

Don Cragun Thu, 30 Aug 2007 16:21:34 -0700 (PDT)

Frank,
        Please find comments in-line below...

All,
        Note that I'll be off-line for some doctor's appointments part
of the day Friday and will be out of the country for a week starting
Saturday.  So, I may be very slow responding to issues raised here for
a while.
        
        Cheers,
        Don


>Date: Wed, 29 Aug 2007 19:41:53 +0100 (BST)
>From: Frank Hofmann <Frank.Hofmann at sun.com>
>Subject: Re: [ufs-discuss] [zfs-code] statvfs change
>To: Chris Kirby <chris.kirby at sun.com>
>Cc: Don Cragun <don.cragun at sun.com>, zfs-code at opensolaris.org,
>    ufs-discuss at opensolaris.org, johansen-osdev at sun.com
>
>On Wed, 29 Aug 2007, Chris Kirby wrote:
>
>[ ... ]
>> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
>> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
>> values are rumors at best anyway.
>
>That is not correct for UFS. This piece:
>
>         /*
>          * Adjust the numbers based on things waiting to be deleted.
>          * modifies f_bfree and f_ffree.  Afterwards, everything we
>          * come up with will be self-consistent.  By definition, this
>          * is a point-in-time snapshot, so the fact that the delete
>          * thread's probably already invalidated the results is not a
>          * problem.  Note that if the delete thread is ever extended to
>          * non-logging ufs, this adjustment must always be made.
>          */
>         if (TRANS_ISTRANS(ufsvfsp))
>                 ufs_delete_adjust_stats(ufsvfsp, sp);
>
>in ufs_statvfs() is grabbing locks. It was introduced because some
>"nitpickers" claimed that guaranteed point-in-time consistency of statvfs()
>is mandated by the standards, and arguing that "the values are out of date
>as soon as the syscall returns no matter what" didn't count in that
>context. For the gory details, see:
>
>5012326        delay between unlink/close and space becoming available may be arbitrarily long
>
>and then:
>
>6251659        statvfs taking 30 second to complete
>
>I still sometimes wonder whether it was the right thing to claim
>"statvfs must be 100% accurate".
>
>[ ... ]
>> I'm not suggesting that we just truncate the number of blocks.
>> We're also adjusting the unit size (f_frsize) so that
>> (f_blocks * f_frsize) is the same before and after the rescaling.
>>
>> Yes, there might be a rounding error of up to (f_frsize - 1) *bytes*.
>> So on a 1PB fs, we might be off by (256K - 1) bytes.  That's
>> 0.000000026 percent.  It doesn't seem like it would generate
>> many support calls.
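Chris's figures can be checked with a little arithmetic. The sketch below is a hypothetical helper (not part of the proposal) that picks the smallest power-of-two f_frsize, assuming a 512-byte floor, for which the block count fits in a 32-bit f_blocks, and reports the worst-case loss of (f_frsize - 1) bytes as a percentage of the filesystem size. Taking 1PB as 10^15 bytes, it reproduces the 256K and roughly 0.000000026 percent figures.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power-of-two fragment size (512 bytes is an assumed floor)
 * for which total_bytes / frsize fits in a 32-bit f_blocks field. */
uint64_t min_frsize_for(uint64_t total_bytes)
{
    uint64_t frsize = 512;

    while (total_bytes / frsize > UINT32_MAX)
        frsize <<= 1;
    return frsize;
}

/* Worst-case rounding loss, (frsize - 1) bytes, as a percentage of
 * the filesystem size. */
double max_loss_percent(uint64_t total_bytes)
{
    assert(total_bytes > 0);
    return 100.0 * (double)(min_frsize_for(total_bytes) - 1) /
           (double)total_bytes;
}
```

For a binary petabyte (2^50 bytes), 256K units would give exactly 2^32 blocks, one more than UINT32_MAX, so the sketch picks 512K instead.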
>
>Even if it did. As has been mentioned about ZFS behaviour a few times,
>ZFS will "compact" small files anyway (meaning that, even if the FS were
>full to that degree, extending _small_ files may be possible), and since
>it optionally also does on-disk compression, the amount of free space can
>only be estimated.
>The point can be made, particularly for ZFS, that the accuracy of all
>statvfs() return values is, by design, not down to the 'least significant
>bit'. Hence, as long as values like "space free" are a lower bound on what
>really is free, it's performing its task.
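A minimal sketch of the "lower bound" rule, assuming the filesystem already has a conservative byte estimate of its free space: rounding free space down and used space up to whole f_frsize units keeps the reported numbers on the safe side. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Free space in whole frsize units, rounded down, so the reported
 * value never exceeds the (already conservative) byte estimate. */
uint64_t free_blocks_lower_bound(uint64_t est_free_bytes, uint64_t frsize)
{
    assert(frsize > 0);
    return est_free_bytes / frsize;
}

/* Used space in whole frsize units, rounded up for the same reason. */
uint64_t used_blocks_upper_bound(uint64_t used_bytes, uint64_t frsize)
{
    assert(frsize > 0);
    return used_bytes / frsize + (used_bytes % frsize != 0);
}
```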
>
>Don's statement about the standard requiring "X free - not X+1" is, hmm,
>to put it politically correctly, "difficult to meet with a filesystem that
>does compression/tail-compaction".

The point I was complaining about with "X free - not X+1" was that the
suggested changes cap the values returned in f_files, f_ffree, and
f_favail at UINT32_MAX.  If a 32-bit statvfs() returns UINT32_MAX for
any of these values when a simultaneous 64-bit statvfs() or a
statvfs64() would set one or more of those values to a larger value (I
don't care if it is +1, *2, or *1000), I think we have a problem.

>The "X not X-1" is, on the other hand, to be taken very seriously. I 
>cannot claim to have free space if I haven't, and that's what the above 
>UFS bugfixes have been about.
>
>I agree with Don that we can't lie about having free space if we don't. 
>I also agree that statvfs() and statvfs64() shouldn't return different 
>values.

I'm not sure I even require this much.  As long as rescale_frsize()
doesn't drop non-zero bits in the values it shifts to the right, I
think I'm OK with statvfs() and statvfs64() returning different values
for f_frsize, f_blocks, f_bfree, and f_bavail as long as the values
f_frsize*f_blocks, f_frsize*f_bfree, and f_frsize*f_bavail are the same
(using types for the multiplication that don't overflow) in both
calls.
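The condition above can be sketched as a hypothetical lossless variant of the rescaling (this is not the rescale_frsize() from the proposal, whose code isn't shown here): double f_frsize and halve the block counts only while no set low-order bit would be shifted out, and fail with EOVERFLOW otherwise. Each step leaves f_frsize*f_blocks, f_frsize*f_bfree, and f_frsize*f_bavail unchanged. The two structs are simplified assumptions, not the real struct statvfs layouts.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified, hypothetical subsets of the 64-bit and 32-bit results. */
struct blk64 { uint64_t frsize, blocks, bfree, bavail; };
struct blk32 { uint32_t frsize, blocks, bfree, bavail; };

/* Returns 0 and fills *out, or EOVERFLOW if the values cannot be
 * represented in 32 bits without losing non-zero bits. */
int rescale_lossless(const struct blk64 *in, struct blk32 *out)
{
    uint64_t frsize = in->frsize, blocks = in->blocks;
    uint64_t bfree = in->bfree, bavail = in->bavail;

    assert(frsize > 0);
    if (frsize > UINT32_MAX)
        return EOVERFLOW;
    while (blocks > UINT32_MAX || bfree > UINT32_MAX || bavail > UINT32_MAX) {
        /* A set low-order bit would be dropped by the shift. */
        if ((blocks | bfree | bavail) & 1)
            return EOVERFLOW;
        blocks >>= 1;
        bfree >>= 1;
        bavail >>= 1;
        frsize <<= 1;           /* frsize * count is unchanged */
        if (frsize > UINT32_MAX)
            return EOVERFLOW;
    }
    out->frsize = (uint32_t)frsize;
    out->blocks = (uint32_t)blocks;
    out->bfree = (uint32_t)bfree;
    out->bavail = (uint32_t)bavail;
    return 0;
}
```

With f_frsize 512 and 2^33 blocks, the sketch returns f_frsize 2048 and 2^31 blocks; adding a single odd block to any count makes it return EOVERFLOW instead of silently rounding.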

If statvfs() returns EOVERFLOW whenever the values it would return
differ from those returned by statvfs64(), I'll be happy (at least
as long as we don't have statvfs64() lie to make this true ;-} ).
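For the file counts, the alternative to capping at UINT32_MAX is to fail the 32-bit call. A sketch, with hypothetical helper names, of narrowing f_files, f_ffree, and f_favail that returns EOVERFLOW rather than reporting a value different from what statvfs64() would report:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Narrow one 64-bit count for a 32-bit caller: fail rather than cap. */
int narrow_count(uint64_t v, uint32_t *out)
{
    assert(out);
    if (v > UINT32_MAX)
        return EOVERFLOW;
    *out = (uint32_t)v;
    return 0;
}

/* Fills out[0..2] with f_files, f_ffree, f_favail; the contents of
 * out are unspecified when EOVERFLOW is returned. */
int narrow_file_counts(uint64_t files, uint64_t ffree, uint64_t favail,
                       uint32_t out[3])
{
    if (narrow_count(files, &out[0]) != 0 ||
        narrow_count(ffree, &out[1]) != 0 ||
        narrow_count(favail, &out[2]) != 0)
        return EOVERFLOW;
    return 0;
}
```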

>
>But I don't see why we couldn't round up the used space / round (down) the 
>free space to some value/blocksize chosen by the filesystem itself. We 
>just have to make this consistent. If:
>
>       statvfs()
>       statvfs64()
>       64bit statvfs()
>
>all return the same values, and:
>
>       - freeing a unit fs_frsize adds '1' to fs_blkfree
>       - allocating a unit fs_frsize subtracts '1' from fs_blkfree
>       - an alloc attempt when fs_blkfree is zero must fail
>
>are met, we're standards compliant, aren't we?

Yes.
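Frank's three rules can be modelled as a counter. In this hypothetical sketch the counter is named after the statvfs field f_bfree (Frank's fs_blkfree): freeing one f_frsize unit increments it, allocating one decrements it, and an allocation with nothing free fails, here with ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: only the free-unit counter is tracked. */
struct fs_model { uint64_t f_bfree; };

int model_alloc_block(struct fs_model *fs)
{
    assert(fs);
    if (fs->f_bfree == 0)
        return ENOSPC;          /* an alloc with nothing free must fail */
    fs->f_bfree--;              /* allocating a unit subtracts 1 */
    return 0;
}

void model_free_block(struct fs_model *fs)
{
    assert(fs);
    fs->f_bfree++;              /* freeing a unit adds 1 */
}
```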

>
>The standard, for statvfs(), doesn't make any statement about what happens 
>on attempts to allocate/free amounts _less than_ fs_frsize bytes, except 
>that "if I release N bytes I must be able to allocate N bytes again - to 
>the same file".

Yes, but it is about extending by blocks, not by bytes.  (Even on a
filesystem that doesn't do any compression, you can grow files whose
size is not a multiple of the blocksize to fill up the last block
allocated to the file even if the filesystem is 100% full.)
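The "fill up the last block" case is simple arithmetic. A sketch assuming a filesystem with no compression or tail packing, where the last block of a non-empty file is fully allocated; the helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that can be appended to a file of file_size bytes before a
 * new block must be allocated, under the assumptions stated above. */
uint64_t tail_slack(uint64_t file_size, uint64_t bsize)
{
    assert(bsize > 0);
    return (bsize - file_size % bsize) % bsize;
}
```

With an 8192-byte block size, a 10000-byte file can grow by 6384 bytes on a 100% full filesystem; a file whose size is an exact multiple of the block size cannot grow at all.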

>
>As I read the proposal, the blocked behaviour is exactly what's being
>requested, isn't it?

I don't have any problem with the things being adjusted by
rescale_frsize() in the proposal.  Some applications may be surprised
because they have made assumptions based on poor (make that "limited")
documentation, but I'm not really concerned by that as long as a 32-bit
app and a 64-bit app using statvfs() see the same numbers.  But, the
proposed changes do not give consistent data to 32-bit and 64-bit
calls.  The resizing is only done in 32-bit calls.  And, the proposal
also caps f_files, f_ffree, and f_favail at UINT32_MAX in 32-bit
calls.

>
>We have to remember that statvfs returns _BLOCKED_ units. Space 
>(non)availability is judged in these units. And actions like:
>
>       - release fs_frsize/N from N different files ==> fs_blkfree++
>       - fs_blkfree == 1, alloc fs_frsize/N to N different files succeeds
>
>are _NOT_ covered by the standard as "must (not) fail".

Yes.  The standard says nothing about releasing bytes; only about
releasing blocks.  The standard also understands that (depending on
file size) you may be able to write (blocksize-1) bytes to the end of a
file without allocating any blocks and that if there are holes in a
file, writing one byte at each of ten different byte offsets in a file
may need to allocate 10 blocks without changing the file size at all.
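The sparse-file case can be demonstrated with POSIX calls: extend an empty file to ten blocks with ftruncate(), then write one byte at the start of each block with pwrite(). The file size doesn't change. Whether blocks are actually allocated (visible in st_blocks, usually counted in 512-byte units) depends on the filesystem; a compressing filesystem may allocate fewer. The helper name and the 4096-byte block size used below are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a 10-block sparse file, then write one byte at the start of
 * each block. Returns 0 on success, -1 on error. */
int sparse_demo(const char *path, off_t bsize,
                struct stat *before, struct stat *after)
{
    assert(bsize > 0);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* Extend to 10 blocks without writing any data (a hole). */
    if (ftruncate(fd, 10 * bsize) != 0 || fstat(fd, before) != 0)
        goto fail;
    for (off_t i = 0; i < 10; i++) {
        /* Each write lands inside the hole, so the size stays put. */
        if (pwrite(fd, "x", 1, i * bsize) != 1)
            goto fail;
    }
    if (fsync(fd) != 0 || fstat(fd, after) != 0)
        goto fail;
    return close(fd);
fail:
    close(fd);
    return -1;
}
```

Called as sparse_demo("sparse_demo.tmp", 4096, &before, &after), both before.st_size and after.st_size are 40960.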

Calls to posix_fallocate() may allocate blocks; calls to ftruncate()
may deallocate blocks.  Both of these can also change the file size.
(I don't remember if ftruncate allocates blocks or just changes the
file size and lets subsequent write()s allocate blocks when needed.)
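A sketch of the size side of this, assuming a platform that provides posix_fallocate() (which reports errors through its return value rather than errno): allocating 8192 bytes at offset 0 of an empty file extends it to 8192 bytes, and ftruncate() then shrinks it to 100 bytes. The helper name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Records the file size after posix_fallocate() in sizes[0] and after
 * ftruncate() in sizes[1]. Returns 0 on success, -1 on error. */
int fallocate_then_truncate(const char *path, off_t sizes[2])
{
    struct stat st;

    assert(path && sizes);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* Allocates storage for [0, 8192) and extends the file to match. */
    if (posix_fallocate(fd, 0, 8192) != 0 || fstat(fd, &st) != 0)
        goto fail;
    sizes[0] = st.st_size;
    /* Shrinking releases everything past byte 100. */
    if (ftruncate(fd, 100) != 0 || fstat(fd, &st) != 0)
        goto fail;
    sizes[1] = st.st_size;
    return close(fd);
fail:
    close(fd);
    return -1;
}
```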

>
>If I'm wrong about this, then please provide pointers. I'm very 
>interested.
>
>Thanks,
>FrankH.
>
>>
>>>
>>> For any particular call to statvfs(), the system won't know whether the
>>> discarded low order bits of f_blocks, f_bfree, and f_bavail and the
>>> high order bits of f_files, f_ffree, and f_favail are important or
>>> not.  The only safe thing to do is report the overflows and fix the
>>> applications that get the resulting EOVERFLOW errors.
>>
>> Even if an application decides that those bits were important (and
>> we're talking about the very last tiny bit of space on the fs),
>> there are currently no guarantees that those few bytes would be
>> available to a user anyway.  Growing a file by one byte could still
>> result in ENOSPC if the fs needed more than the remaining bytes for
>> its own data structures.
>>
>> I agree that broken apps should be fixed.  But that's not always
>> possible.  And I guess I still don't see where we would be breaking
>> any well-behaved applications.
>
>A 32bit application that uses statvfs() in a non-largefile compile
>environment - aka not getting statvfs64() - isn't "well behaved". The LF64
>interfaces were introduced well over a decade ago. That was a while
>before anyone used the term 'Java' or 'Netbeans Installer'.
>
>
>
>>
>> -Chris
>>
