bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On 12/13/11 09:16, Eric Blake wrote:
> Or maybe --count-links gains an optional argument, that says how to
> count links:
>
> --count-links=none -> POSIX behavior (if POSIX requires elision across
> command line arguments
> --count-links=per-directory -> traditional behavior, resetting hash
> between command line arguments
> --count-links == --count-links=all -> count every file on every encounter

Yes, that was the sort of thing that I had in mind. Though, like Kamil,
I hadn't thought that the per-directory argument was all that useful,
since one can simply invoke "du" multiple times to get that behavior.

Your earlier message suggests another option, which could be
--count-links=zero or something like that: it causes "du" to output
a "0 X" line for every entry that would otherwise be elided because
it's a duplicate. Or perhaps we could make that "-0 X" so that one
can distinguish duplicates from true zero entries.

Or perhaps it would be better to have a --count-links=mark option,
which causes du to output "+N X" for a duplicate link, where N is
the size, and the "+" marks it as a duplicate.

Just thinking out loud. At any rate it doesn't sound like it's high
priority.
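The "invoke du multiple times" approach mentioned above can be sketched
as a small shell function. This is only an illustration: the name
du_per_arg is made up here, and the sketch assumes that each separate du
process starts with no memory of previously seen files, so duplicates
are elided within one argument but not across arguments. It costs one
fork per argument.

```shell
# Hypothetical helper (not part of coreutils): run a separate du for each
# argument so that no duplicate-tracking state is carried between arguments.
du_per_arg() {
    for arg in "$@"; do
        du -s -- "$arg"
    done
}
```

Called as "du_per_arg . b", it prints one summary line per argument,
whether or not b was already counted as part of ".".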
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On 12/13/2011 10:37 AM, Kamil Dudka wrote:
> On Tuesday 13 December 2011 18:16:12 Eric Blake wrote:
>> I think the proposal is to add a new option that forces du to reset its
>> duplicate inode hash table for each command line argument, to make
>> behavior more like traditional du, even though it means -s can then
>> output a larger usage by summing the first column than what you would
>> get by the default behavior, when encountering command line arguments
>> that are a duplicate with an inode already traversed earlier in the
>> command line. --count-links isn't quite right, because you still want
>> to elide links within a single directory of the command-line argument.
>>
>> Or maybe --count-links gains an optional argument, that says how to
>> count links:
>>
>> --count-links=none -> POSIX behavior (if POSIX requires elision across
>> command line arguments
>> --count-links=per-directory -> traditional behavior, resetting hash
>> between command line arguments
>> --count-links == --count-links=all -> count every file on every encounter
>
> What would be the difference between the 'per-directory' variant and invoking
> du multiple times, giving it one argument at a time?

Fewer forks.

--
Eric Blake ebl...@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On Tuesday 13 December 2011 18:16:12 Eric Blake wrote:
> I think the proposal is to add a new option that forces du to reset its
> duplicate inode hash table for each command line argument, to make
> behavior more like traditional du, even though it means -s can then
> output a larger usage by summing the first column than what you would
> get by the default behavior, when encountering command line arguments
> that are a duplicate with an inode already traversed earlier in the
> command line. --count-links isn't quite right, because you still want
> to elide links within a single directory of the command-line argument.
>
> Or maybe --count-links gains an optional argument, that says how to
> count links:
>
> --count-links=none -> POSIX behavior (if POSIX requires elision across
> command line arguments
> --count-links=per-directory -> traditional behavior, resetting hash
> between command line arguments
> --count-links == --count-links=all -> count every file on every encounter

What would be the difference between the 'per-directory' variant and invoking
du multiple times, giving it one argument at a time?

Kamil
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On 12/13/2011 09:46 AM, Kamil Dudka wrote:
>> I agree that printing "0 X" for these seems inconsistent with the
>> elision mandated for the second and subsequent encounter of a file,
>> but I suppose command line arguments are intrinsically different
>> enough that handling them specially makes sense. Maybe even as
>> the default.
>>
>>> Perhaps 'du' needs a new option to control what to do with
>>> files that 'du' has already seen before. something that
>>> generalizes --count-links.
>>
>> That sounds like a good way to do it.
>> Anyone interested?
>
> Thank all of you for looking at the issue. If I understand it correctly, the
> old behavior was violating POSIX whereas the current default behavior is
> correct.

Not quite. The POSIX wording does not match historical practice, and
appears to be contradictory (or at least ambiguous), so we may need to
ask for clarification from the Austin Group.

The problem is that POSIX says that if an inode is encountered more than
once, it is only listed once (without reference to whether those
encounters were from recursion on a single command line argument,
recursion across multiple command line arguments, or even if the
duplication occurs on the command line itself); but it also says that
with '-s', listings are output for all command line arguments.

Historically, du implementations elided output for inode duplication
found within a single command line argument, but not across multiple
command line arguments. The coreutils behavior was changed to elide
duplicates across multiple command line arguments; particularly so that
in the -s case, you can sum the total usage and get an accurate feel, no
matter which order the command line arguments were listed in. But in
doing so, we elided duplicate command line arguments, which goes against
the POSIX wording that -s will list a summary for all arguments. Hence
our proposal of using '0' for a directory previously counted.

> I tried du --count-links with the original reproducer and it seemed
> to work fine. So what would be the point in adding a new option?

I think the proposal is to add a new option that forces du to reset its
duplicate inode hash table for each command line argument, to make
behavior more like traditional du, even though it means -s can then
output a larger usage by summing the first column than what you would
get by the default behavior, when encountering command line arguments
that are a duplicate with an inode already traversed earlier in the
command line. --count-links isn't quite right, because you still want
to elide links within a single directory of the command-line argument.

Or maybe --count-links gains an optional argument, that says how to
count links:

--count-links=none -> POSIX behavior (if POSIX requires elision across
command line arguments)
--count-links=per-directory -> traditional behavior, resetting hash
between command line arguments
--count-links == --count-links=all -> count every file on every encounter

--
Eric Blake ebl...@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On 12/13/11 08:46, Kamil Dudka wrote:
> If I understand it correctly, the
> old behavior was violating POSIX whereas the current default behavior is
> correct. I tried du --count-links with the original reproducer and it seemed
> to work fine. So what would be the point in adding a new option?

Hey, I like that answer! Less work.

Though arguably POSIX is incorrect here, as it seems to prohibit
common behavior in most other 'du's.
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On Tuesday 13 December 2011 08:09:08 Jim Meyering wrote:
> Paul Eggert wrote:
> > On 12/12/11 14:58, Eric Blake wrote:
> >> "Files with multiple links shall be counted and written for only one
> >> entry. The directory entry that is selected in the report is
> >> unspecified."
> >
> > Yes, that's partly what motivates the current GNU du behavior:
> > the idea is to implement this notion consistently (historical
> > 'du' implementations do not).
> >
> >> But even historically, command line arguments were always listed, even
> >> if they are otherwise multiple links.
> >
> > I suppose we could change GNU 'du' to output "0 X" for a command-line
> > argument X that's already been seen.
>
> This seems sensible.
>
> > This wouldn't address the problem
> > perceived by the original poster, though. And it's a glitch from the
> > point of view of consistency.
>
> I agree that printing "0 X" for these seems inconsistent with the
> elision mandated for the second and subsequent encounter of a file,
> but I suppose command line arguments are intrinsically different
> enough that handling them specially makes sense. Maybe even as
> the default.
>
> > Perhaps 'du' needs a new option to control what to do with
> > files that 'du' has already seen before. something that
> > generalizes --count-links.
>
> That sounds like a good way to do it.
> Anyone interested?

Thank all of you for looking at the issue. If I understand it correctly, the
old behavior was violating POSIX whereas the current default behavior is
correct. I tried du --count-links with the original reproducer and it seemed
to work fine. So what would be the point in adding a new option?

Kamil
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
Paul Eggert wrote:
> On 12/12/11 14:58, Eric Blake wrote:
>> "Files with multiple links shall be counted and written for only one
>> entry. The directory entry that is selected in the report is unspecified."
>
> Yes, that's partly what motivates the current GNU du behavior:
> the idea is to implement this notion consistently (historical
> 'du' implementations do not).
>
>> But even historically, command line arguments were always listed, even
>> if they are otherwise multiple links.
>
> I suppose we could change GNU 'du' to output "0 X" for a command-line
> argument X that's already been seen.

This seems sensible.

> This wouldn't address the problem
> perceived by the original poster, though. And it's a glitch from the
> point of view of consistency.

I agree that printing "0 X" for these seems inconsistent with the
elision mandated for the second and subsequent encounter of a file,
but I suppose command line arguments are intrinsically different
enough that handling them specially makes sense. Maybe even as
the default.

> Perhaps 'du' needs a new option to control what to do with
> files that 'du' has already seen before. something that
> generalizes --count-links.

That sounds like a good way to do it.
Anyone interested?
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On 12/12/11 14:58, Eric Blake wrote:
> "Files with multiple links shall be counted and written for only one
> entry. The directory entry that is selected in the report is unspecified."

Yes, that's partly what motivates the current GNU du behavior:
the idea is to implement this notion consistently (historical
'du' implementations do not).

> But even historically, command line arguments were always listed, even
> if they are otherwise multiple links.

I suppose we could change GNU 'du' to output "0 X" for a command-line
argument X that's already been seen. This wouldn't address the problem
perceived by the original poster, though. And it's a glitch from the
point of view of consistency.

Perhaps 'du' needs a new option to control what to do with
files that 'du' has already seen before. Something that
generalizes --count-links.
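The "0 X" idea above can be approximated from outside du with a wrapper.
This is a rough sketch, not a proposed implementation: the function name
du_s_zero is invented for illustration, it assumes file names contain no
newlines, it appends the zero lines at the end instead of printing them
in argument order, and it only notices arguments whose name is missing
from the output entirely.

```shell
# Hypothetical wrapper, not a du feature: after running "du -s", print
# "0<TAB>X" for each command-line argument X that du left out of its output.
du_s_zero() {
    out=$(du -s -- "$@")
    printf '%s\n' "$out"
    for arg in "$@"; do
        # du separates the size from the name with a tab, so field 2
        # onward is the name exactly as du printed it.
        if ! printf '%s\n' "$out" | cut -f2- | grep -Fxq -e "$arg"; then
            printf '0\t%s\n' "$arg"
        fi
    done
}
```

With a du that elides repeated arguments, "du_s_zero . b" would add a
zero line for b; with a du that already lists every argument, the
wrapper adds nothing.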
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On 12/12/2011 03:33 PM, Eric Blake wrote:
>> However, changing the numbers is one thing and missing lines in the output
>> of du is quite another thing.
>
> Yes, that's the bug I think we introduced - we are mistakenly eliding
> lines of output, rather than listing those directories with 0 attributed
> additional size.
>
> More importantly, POSIX says of -s:
>
> "−s Instead of the default output, report only the total sum for each of
> the specified files."
>
> But we fail that:
>
> $ mkdir -p /tmp/a/b
> $ cd /tmp/a
> $ du -s . b
> 8 .
> $ du -s b .
> 4 b
> 4 .
>
> We correctly deduced that only 8 units were occupied (that is, b was not
> double-counted in either approach), but we _failed_ to list b in the
> first approach. I think POSIX requires the output to have been:
>
> $ du -s . b
> 8 .
> 0 b

POSIX also says:

"Files with multiple links shall be counted and written for only one
entry. The directory entry that is selected in the report is unspecified."

But even historically, command line arguments were always listed, even
if they are otherwise multiple links. On Solaris 10, for example,

$ touch a
$ ln a b
$ /bin/du a b
1 a
1 b

instead of omitting one of the two entries. The omission only occurs
during recursion of a directory on the command line:

$ /bin/du -a .
1 ./b
4 .

> I think that a saner output would be:
>
> $ du . b
> 4 ./b
> 8 .
> 0 b

So this would be okay (even though we encountered b via two different
links, the second encounter was a command line, so it should not be
elided entirely, but listing 0 would make it obvious that there is no
further disk usage to count),

> $ du b .
> 4 b
> 0 ./b
> 4 .

whereas this proposed line of '0 ./b' is questionable (we could argue
that ./b should not be elided because no other link to b was printed
during recursion, or we could argue that elision should trump recursion
once the command line arguments have been printed).

--
Eric Blake ebl...@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
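For anyone wanting to compare their own du against the Solaris session
in the message above, a minimal sketch follows. It assumes mktemp -d is
available (it is not required by POSIX), and the variable names are
arbitrary. No expected output is given, since what "du a b" prints is
exactly the point on which implementations differ.

```shell
# Create two names for one inode, as in the session above, then list them
# both as explicit command-line arguments and via recursion.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch a
ln a b
args_out=$(du a b)      # both hard links named on the command line
recurse_out=$(du -a .)  # both hard links found while recursing into "."
printf 'du a b:\n%s\n\ndu -a .:\n%s\n' "$args_out" "$recurse_out"
```

The recursive listing should mention only one of the two names plus the
directory itself; the interesting comparison is whether the first
listing shows one line or two.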
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
On 12/12/2011 05:50 AM, Kamil Dudka wrote:
> Hi,
>
> the following upstream commit introduces a major change in behavior of du
> when multiple arguments are specified:
>
> http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=efe53cc
>
> ... and the issue has landed as a bug in our Bugzilla:
>
> https://bugzilla.redhat.com/747075#c3
>
> Was such a change in behavior intended?

A change in behavior was intended, but I think we ended up introducing a
bug in its place.

> The info
> documentation states:
>
> The FILE argument order affects which links are counted, and changing the
> argument order may change the numbers that `du' outputs.

And this is intended. The end goal is that if a directory appears both
on the command line and as a child of another directory on the command
line, that it gets counted only once.

>
> However, changing the numbers is one thing and missing lines in the output
> of du is quite another thing.

Yes, that's the bug I think we introduced - we are mistakenly eliding
lines of output, rather than listing those directories with 0 attributed
additional size.

More importantly, POSIX says of -s:

"−s Instead of the default output, report only the total sum for each of
the specified files."

But we fail that:

$ mkdir -p /tmp/a/b
$ cd /tmp/a
$ du -s . b
8 .
$ du -s b .
4 b
4 .

We correctly deduced that only 8 units were occupied (that is, b was not
double-counted in either approach), but we _failed_ to list b in the
first approach. I think POSIX requires the output to have been:

$ du -s . b
8 .
0 b

as an indication that we did visit b, but that there were no additional
contributions to the disk usage encountered during our visit there.

Meanwhile, without -s, I still think we elided too much data:

$ du . b
4 ./b
8 .
$ du b .
4 b
4 .

In the first case, we recursed into ./b, then back out to ., but elided
any notion that we ever directly visited b. In the second case, we
visited b, then recursed into ./b but had nothing to output, then back
out to '.'.

I think that a saner output would be:

$ du . b
4 ./b
8 .
0 b
$ du b .
4 b
0 ./b
4 .

to make it obvious that we pruned recursion at points where we
encountered duplicates, and that the sum of the first columns shows an
accurate disk usage.

--
Eric Blake ebl...@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
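The claim that the first column sums to an accurate total regardless of
argument order can be checked with a short script. This is a sketch
under assumptions: it uses a scratch directory from mktemp -d instead of
/tmp/a, and sum_first_column is a helper name invented here. With the
eliding behavior described in the message, all three sums would be
expected to match; with a du that forgets duplicates between arguments,
the first two would come out larger than the third.

```shell
# Recreate the a/b layout from the message in a scratch directory
# (mktemp -d is assumed to be available; POSIX does not require it).
top=$(mktemp -d)
mkdir -p "$top/a/b"
cd "$top/a" || exit 1

# Hypothetical helper: add up the first column of "du -s" for the given
# arguments. "s + 0" forces a numeric 0 when du prints no lines at all.
sum_first_column() {
    du -s -- "$@" | awk '{ s += $1 } END { print s + 0 }'
}

dot_then_b=$(sum_first_column . b)
b_then_dot=$(sum_first_column b .)
dot_only=$(sum_first_column .)
printf '. b: %s   b .: %s   . alone: %s\n' \
    "$dot_then_b" "$b_then_dot" "$dot_only"
```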
bug#10282: change in behavior of du with multiple arguments (commit efe53cc)
Hi,

the following upstream commit introduces a major change in behavior of du
when multiple arguments are specified:

http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=efe53cc

... and the issue has landed as a bug in our Bugzilla:

https://bugzilla.redhat.com/747075#c3

Was such a change in behavior intended? I am asking as I was not able to find
it documented anywhere. The up2date man page states:

    Summarize disk usage of each FILE, recursively for directories.

..., where FILE refers to a single argument given to du. The info
documentation states:

    The FILE argument order affects which links are counted, and changing the
    argument order may change the numbers that `du' outputs.

However, changing the numbers is one thing and missing lines in the output
of du is quite another thing. Could anybody please clarify the current
behavior of du?

Thanks in advance!

Kamil