Chris,

There are several issues here. Please find comments in-line below...
Cheers,
Don

>Date: Mon, 27 Aug 2007 16:04:42 -0500
>From: Chris Kirby <chris.kirby at sun.com>
>Subject: Re: [zfs-code] statvfs change
>To: johansen-osdev at sun.com
>Cc: ufs-discuss at opensolaris.org, don.cragun at sun.com,
>zfs-code at opensolaris.org
>
>johansen-osdev at sun.com wrote:
>> Can you explain in a bit more detail why we're doing this?  I probably
>> don't understand the issue in sufficient detail.  It seems like the
>> large file compilation environment, lfcompile(5), exists to solve this
>> exact problem.  Isn't it the application's responsibility to properly
>> handle EOVERFLOW or choose an interface that can handle file offsets
>> that are greater than 2Gbytes?  Is there something obvious here that
>> I'm missing?
>>
>
>It's not a large file issue, it's a large *filesystem* issue
>that revolves around f_frsize unit reporting via the cstatvfs32
>interface.  f_blocks, f_bfree, and f_bavail are all reported in
>units of f_frsize.

ZFS has large file and large filesystem issues.  But those of us who
participated in the Large File Summit (the vendors and consumers who
jointly produced the Large File Summit Specification and did the work to
get the non-transitional interfaces integrated into X/Open's (now The
Open Group's) X/Open Portability Guide, Issue 4, Version 2) remember the
discussions that led to the creation of the EOVERFLOW errno value.  The
base point is that any time you lie to applications, some application
software is going to make wrong decisions based on the lie.

>
>For ZFS, we report f_frsize as 512 regardless of the size of
>the fs. ...

Why?  Why shouldn't you always set f_frsize to the actual size of an
allocation unit on the filesystem?  Is it still true that we don't
support disks formatted with 1024 byte sectors?

> ...  This means we can only express vfs size up to
>UINT32_MAX * 512 bytes.  That's not a terribly large fs
>by today's standards.  Anything larger will result in EOVERFLOW
>from statvfs.
>
>You're entirely correct that it's the application's responsibility
>to deal with EOVERFLOW, perhaps by using statvfs64.  But if we can
>return valid information instead of an error, that seems like a
>good thing.

When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
correct values for these fields are larger, you are not returning valid
information.  You may be returning "valid" values for f_frsize,
f_blocks, f_bfree, and f_bavail; but you aren't checking to see whether
that is true or not.  (If shifting f_blocks, f_bfree, or f_bavail right
throws away a bit that was not a zero bit, the scaled values being
returned are not valid.)

Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
relationship between f_bsize and f_frsize, applications may well have
made their own assumptions.  Is there documentation somewhere that
specifies how many bytes should be written at a time (on boundaries
that are a multiple of that value) to get the most efficiency out of
the underlying hardware?  I would hope that f_bsize would be that
value.  If it is, it seems that f_bsize should be an integral multiple
of f_frsize.

>
>-Chris