bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-13 Thread Kamil Dudka
On Tuesday 13 December 2011 08:09:08 Jim Meyering wrote:
 Paul Eggert wrote:
  On 12/12/11 14:58, Eric Blake wrote:
  Files with multiple links shall be counted and written for only one
  entry. The directory entry that is selected in the report is
  unspecified.
 
  Yes, that's partly what motivates the current GNU du behavior:
  the idea is to implement this notion consistently (historical
  'du' implementations do not).
 
  But even historically, command line arguments were always listed, even
  if they are otherwise multiple links.
 
  I suppose we could change GNU 'du' to output 0 X for a command-line
  argument X that's already been seen.
 
 This seems sensible.
 
  This wouldn't address the problem
  perceived by the original poster, though.  And it's a glitch from the
  point of view of consistency.
 
 I agree that printing 0 X for these seems inconsistent with the
 elision mandated for the second and subsequent encounter of a file,
 but I suppose command line arguments are intrinsically different
 enough that handling them specially makes sense.  Maybe even as
 the default.
 
  Perhaps 'du' needs a new option to control what to do with
  files that 'du' has already seen before, something that
  generalizes --count-links.
 
 That sounds like a good way to do it.
 Anyone interested?

Thank all of you for looking at the issue.  If I understand it correctly, the 
old behavior was violating POSIX whereas the current default behavior is 
correct.  I tried du --count-links with the original reproducer and it seemed 
to work fine.  So what would be the point in adding a new option?

Kamil





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-13 Thread Paul Eggert
On 12/13/11 08:46, Kamil Dudka wrote:
 If I understand it correctly, the 
 old behavior was violating POSIX whereas the current default behavior is 
 correct.  I tried du --count-links with the original reproducer and it seemed 
 to work fine.  So what would be the point in adding a new option?

Hey, I like that answer!  Less work.

Though arguably POSIX is incorrect here, as it seems
to prohibit common behavior in most other 'du's.





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-13 Thread Eric Blake
On 12/13/2011 09:46 AM, Kamil Dudka wrote:
 I agree that printing 0 X for these seems inconsistent with the
 elision mandated for the second and subsequent encounter of a file,
 but I suppose command line arguments are intrinsically different
 enough that handling them specially makes sense.  Maybe even as
 the default.

 Perhaps 'du' needs a new option to control what to do with
 files that 'du' has already seen before, something that
 generalizes --count-links.

 That sounds like a good way to do it.
 Anyone interested?
 
 Thank all of you for looking at the issue.  If I understand it correctly, the 
 old behavior was violating POSIX whereas the current default behavior is 
 correct.

Not quite.  The POSIX wording does not match historical practice, and
appears to be contradictory (or at least ambiguous), so we may need to
ask for clarification from the Austin Group.  The problem is that POSIX
says that if an inode is encountered more than once, it is only listed
once (without reference to whether those encounters were from recursion
on a single command line argument, recursion across multiple command
line arguments, or even if the duplication occurs on the command line
itself); but it also says that with '-s', listings are output for all
command line arguments.  Historically, du implementations elided output
for inode duplication found within a single command line argument, but
not across multiple command line arguments.

The coreutils behavior was changed to elide duplicates across multiple
command line arguments; particularly so that in the -s case, you can sum
the total usage and get an accurate feel, no matter which order the
command line arguments were listed in.  But in doing so, we elided
duplicate command line arguments, which goes against the POSIX wording
that -s will list a summary for all arguments.  Hence our proposal of
using '0' for a directory previously counted.
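
To make the difference concrete, here is a minimal Python sketch of the
two strategies. It is a toy model, not coreutils code: Node, usage and
du_s are hypothetical names, and the 4-unit sizes are taken from the
/tmp/a/b reproducer elsewhere in this thread. The only thing that varies
is whether the set of seen inodes is shared across command line
arguments or reset for each one.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    inode: int
    size: int                                    # units used by the entry itself
    children: list = field(default_factory=list)

def usage(node, seen):
    """Sum sizes under node, counting each inode at most once via `seen`."""
    if node.inode in seen:
        return None                              # already counted earlier
    seen.add(node.inode)
    total = node.size
    for child in node.children:
        total += usage(child, seen) or 0         # duplicates add nothing
    return total

def du_s(args, reset_per_argument):
    """Return the (size, name) lines a toy 'du -s' would print."""
    lines, seen = [], set()
    for name, node in args:
        if reset_per_argument:
            seen = set()                         # traditional: forget earlier arguments
        total = usage(node, seen)
        if total is not None:                    # shared set: duplicate argument is elided
            lines.append((total, name))
    return lines

b = Node(inode=2, size=4)
dot = Node(inode=1, size=4, children=[b])

print(du_s([(".", dot), ("b", b)], reset_per_argument=False))  # [(8, '.')]
print(du_s([("b", b), (".", dot)], reset_per_argument=False))  # [(4, 'b'), (4, '.')]
print(du_s([(".", dot), ("b", b)], reset_per_argument=True))   # [(8, '.'), (4, 'b')]
```

With the shared set the first column sums to 8 in either order, but "b"
disappears from the first listing. With the per-argument reset every
argument is listed, but the first column sums to 12.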

  I tried du --count-links with the original reproducer and it seemed 
 to work fine.  So what would be the point in adding a new option?

I think the proposal is to add a new option that forces du to reset its
duplicate inode hash table for each command line argument, to make
behavior more like traditional du, even though it means -s can then
output a larger usage by summing the first column than what you would
get by the default behavior, when encountering command line arguments
that are a duplicate with an inode already traversed earlier in the
command line.  --count-links isn't quite right, because you still want
to elide links within a single directory of the command-line argument.
Or maybe --count-links gains an optional argument, that says how to
count links:

--count-links=none - POSIX behavior (if POSIX requires elision across
command line arguments)
--count-links=per-directory - traditional behavior, resetting hash
between command line arguments
--count-links == --count-links=all - count every file on every encounter
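
As a rough illustration of how the three values would differ, here is a
Python sketch. This is a toy model, not a patch: the mode names come
from the list above, while Node, usage, du_s and the example tree are
hypothetical. The tree has a file with two links inside d1 and a third
link inside d2, and a duplicate simply contributes 0 here rather than
being dropped from the listing.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    inode: int
    size: int                                    # units used by the entry itself
    children: list = field(default_factory=list)

def usage(node, seen, count_all):
    """Total for node; skip inodes already in `seen` unless count_all."""
    if not count_all:
        if node.inode in seen:
            return 0
        seen.add(node.inode)
    return node.size + sum(usage(c, seen, count_all) for c in node.children)

def du_s(args, mode):
    """mode is 'none', 'per-directory' or 'all', as in the proposal."""
    if mode not in ("none", "per-directory", "all"):
        raise ValueError(mode)
    seen, lines = set(), []
    for name, node in args:
        if mode == "per-directory":
            seen = set()                         # reset between command line arguments
        lines.append((usage(node, seen, mode == "all"), name))
    return lines

f = Node(inode=10, size=4)                       # one file with three links
d1 = Node(inode=1, size=4, children=[f, f])      # two links inside d1
d2 = Node(inode=2, size=4, children=[f])         # third link inside d2
args = [("d1", d1), ("d2", d2)]

print(du_s(args, "none"))           # [(8, 'd1'), (4, 'd2')]
print(du_s(args, "per-directory"))  # [(8, 'd1'), (8, 'd2')]
print(du_s(args, "all"))            # [(12, 'd1'), (8, 'd2')]
```

Only 'all' counts the second link inside d1, which is why plain
--count-links is not quite the traditional behavior.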

-- 
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-13 Thread Kamil Dudka
On Tuesday 13 December 2011 18:16:12 Eric Blake wrote:
 I think the proposal is to add a new option that forces du to reset its
 duplicate inode hash table for each command line argument, to make
 behavior more like traditional du, even though it means -s can then
 output a larger usage by summing the first column than what you would
 get by the default behavior, when encountering command line arguments
 that are a duplicate with an inode already traversed earlier in the
 command line.  --count-links isn't quite right, because you still want
 to elide links within a single directory of the command-line argument.
 Or maybe --count-links gains an optional argument, that says how to
 count links:
 
 --count-links=none - POSIX behavior (if POSIX requires elision across
 command line arguments)
 --count-links=per-directory - traditional behavior, resetting hash
 between command line arguments
 --count-links == --count-links=all - count every file on every encounter

What would be the difference between the 'per-directory' variant and invoking 
du multiple times, giving it one argument at a time?

Kamil





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-13 Thread Eric Blake
On 12/13/2011 10:37 AM, Kamil Dudka wrote:
 On Tuesday 13 December 2011 18:16:12 Eric Blake wrote:
 I think the proposal is to add a new option that forces du to reset its
 duplicate inode hash table for each command line argument, to make
 behavior more like traditional du, even though it means -s can then
 output a larger usage by summing the first column than what you would
 get by the default behavior, when encountering command line arguments
 that are a duplicate with an inode already traversed earlier in the
 command line.  --count-links isn't quite right, because you still want
 to elide links within a single directory of the command-line argument.
 Or maybe --count-links gains an optional argument, that says how to
 count links:

 --count-links=none - POSIX behavior (if POSIX requires elision across
 command line arguments)
 --count-links=per-directory - traditional behavior, resetting hash
 between command line arguments
 --count-links == --count-links=all - count every file on every encounter
 
 What would be the difference between the 'per-directory' variant and invoking 
 du multiple times, giving it one argument at a time?

Fewer forks.

-- 
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-13 Thread Paul Eggert
On 12/13/11 09:16, Eric Blake wrote:
 Or maybe --count-links gains an optional argument, that says how to
 count links:
 
 --count-links=none - POSIX behavior (if POSIX requires elision across
 command line arguments)
 --count-links=per-directory - traditional behavior, resetting hash
 between command line arguments
 --count-links == --count-links=all - count every file on every encounter

Yes, that was the sort of thing that I had in mind.  Though,
like Kamil, I hadn't thought that the per-directory argument was all that
useful, since one can simply invoke du multiple times to get that behavior.

Your earlier message suggests another option, which could be
--count-links=zero or something like that: it causes du to
output a 0 X line for every entry that would otherwise be
elided because it's a duplicate.  Or perhaps we could make that
-0 X so that one can distinguish duplicates from true zero
entries.

Or perhaps it would be better to have a --count-links=mark option,
which causes du to output +N X for a duplicate link, where N is
the size, and the + marks it as a duplicate.
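
For comparison, the three output styles floated above could look like
this in a small Python sketch. Everything here is hypothetical: the
report function, the style names (in particular "minus-zero", which is
just a label for the -0 variant), and the flat list of command line
entries. The two hard links a and b of size 1 are borrowed from the
Solaris example elsewhere in this thread.

```python
def report(entries, style):
    """entries: (name, inode, size) tuples in command line order."""
    seen, lines = set(), []
    for name, inode, size in entries:
        if inode not in seen:
            seen.add(inode)
            lines.append(f"{size}\t{name}")      # first encounter: normal line
        elif style == "zero":
            lines.append(f"0\t{name}")           # duplicate shown as plain 0
        elif style == "minus-zero":
            lines.append(f"-0\t{name}")          # distinguishable from a true 0
        elif style == "mark":
            lines.append(f"+{size}\t{name}")     # size kept, + marks the duplicate
        else:
            raise ValueError(style)
    return lines

links = [("a", 7, 1), ("b", 7, 1)]               # two links to the same inode

print(report(links, "zero"))        # ['1\ta', '0\tb']
print(report(links, "minus-zero"))  # ['1\ta', '-0\tb']
print(report(links, "mark"))        # ['1\ta', '+1\tb']
```

Note that with "mark" the first column no longer sums to the real
usage unless the reader skips the lines starting with +.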

Just thinking out loud.  At any rate it doesn't sound like
it's high priority.





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-12 Thread Kamil Dudka
Hi,

the following upstream commit introduces a major change in behavior of du
when multiple arguments are specified:

http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=efe53cc

... and the issue has landed as a bug in our Bugzilla:

https://bugzilla.redhat.com/747075#c3

Was such a change in behavior intended?  I am asking as I was not able to
find it documented anywhere.  The up-to-date man page states:

Summarize disk usage of each FILE, recursively for directories.

..., where FILE refers to a single argument given to du.  The info 
documentation states:

The FILE argument order affects which links are counted, and changing the
argument order may change the numbers that `du' outputs.

However, changing the numbers is one thing and missing lines in the output
of du is quite another thing.

Could anybody please clarify the current behavior of du?  Thanks in advance!

Kamil





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-12 Thread Eric Blake
On 12/12/2011 05:50 AM, Kamil Dudka wrote:
 Hi,
 
 the following upstream commit introduces a major change in behavior of du
 when multiple arguments are specified:
 
 http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=efe53cc
 
 ... and the issue has landed as a bug in our Bugzilla:
 
 https://bugzilla.redhat.com/747075#c3
 
 Was such a change in behavior intended?

A change in behavior was intended, but I think we ended up introducing a
bug in its place.

 The info 
 documentation states:
 
 The FILE argument order affects which links are counted, and changing the
 argument order may change the numbers that `du' outputs.

And this is intended.  The end goal is that if a directory appears both
on the command line and as a child of another directory on the command
line, that it gets counted only once.

 
 However, changing the numbers is one thing and missing lines in the output
 of du is quite another thing.

Yes, that's the bug I think we introduced - we are mistakenly eliding
lines of output, rather than listing those directories with 0 attributed
additional size.

More importantly, POSIX says of -s:

−s Instead of the default output, report only the total sum for each of
the specified files.

But we fail that:

$ mkdir -p /tmp/a/b
$ cd /tmp/a
$ du -s . b
8   .
$ du -s b .
4   b
4   .

We correctly deduced that only 8 units were occupied (that is, b was not
double-counted in either approach), but we _failed_ to list b in the
first approach.  I think POSIX requires the output to have been:

$ du -s . b
8   .
0   b

as an indication that we did visit b, but that there were no additional
contributions to the disk usage encountered during our visit there.

Meanwhile, without -s, I still think we elided too much data:

$ du . b
4   ./b
8   .
$ du b .
4   b
4   .

In the first case, we recursed into ./b, then back out to ., but elided
any notion that we ever directly visited b.  In the second case, we
visited b, then recursed into ./b but had nothing to output, then back
out to '.'.  I think that a saner output would be:

$ du . b
4   ./b
8   .
0   b
$ du b .
4   b
0   ./b
4   .

to make it obvious that we pruned recursion at points where we
encountered duplicates, and that the sum of the first columns shows an
accurate disk usage.
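
A minimal Python sketch of that proposed output, under the assumption
that every duplicate encounter (on the command line or during recursion)
prints a 0 line and prunes the walk. This is a toy model rather than du
itself: Dir, walk and du are hypothetical names, and the sizes are the
ones from the example above.

```python
from dataclasses import dataclass, field

@dataclass
class Dir:
    inode: int
    size: int                                    # units used by the directory itself
    children: list = field(default_factory=list) # (name, Dir) pairs

def walk(path, node, seen, lines):
    """Post-order walk that records (size, path) like du without -s."""
    if node.inode in seen:
        lines.append((0, path))                  # duplicate: list it, add nothing
        return 0
    seen.add(node.inode)
    total = node.size
    for name, child in node.children:
        total += walk(f"{path}/{name}", child, seen, lines)
    lines.append((total, path))
    return total

def du(args):
    seen, lines = set(), []                      # one set shared by all arguments
    for path, node in args:
        walk(path, node, seen, lines)
    return lines

b = Dir(inode=2, size=4)
dot = Dir(inode=1, size=4, children=[("b", b)])

print(du([(".", dot), ("b", b)]))  # [(4, './b'), (8, '.'), (0, 'b')]
print(du([("b", b), (".", dot)]))  # [(4, 'b'), (0, './b'), (4, '.')]
```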

-- 
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-12 Thread Eric Blake
On 12/12/2011 03:33 PM, Eric Blake wrote:
 However, changing the numbers is one thing and missing lines in the output
 of du is quite another thing.
 
 Yes, that's the bug I think we introduced - we are mistakenly eliding
 lines of output, rather than listing those directories with 0 attributed
 additional size.
 
 More importantly, POSIX says of -s:
 
 −s Instead of the default output, report only the total sum for each of
 the specified files.
 
 But we fail that:
 
 $ mkdir -p /tmp/a/b
 $ cd /tmp/a
 $ du -s . b
 8 .
 $ du -s b .
 4 b
 4 .
 
 We correctly deduced that only 8 units were occupied (that is, b was not
 double-counted in either approach), but we _failed_ to list b in the
 first approach.  I think POSIX requires the output to have been:
 
 $ du -s . b
 8   .
 0   b

POSIX also says:

Files with multiple links shall be counted and written for only one
entry. The directory entry that is selected in the report is unspecified.

But even historically, command line arguments were always listed, even
if they are otherwise multiple links.  On Solaris 10, for example,

$ touch a
$ ln a b
$ /bin/du a b
1   a
1   b

instead of omitting one of the two entries.  The omission only occurs
during recursion of a directory on the command line:

$ /bin/du -a .
1   ./b
4   .

 I think that a saner output would be:
 
 $ du . b
 4   ./b
 8   .
 0   b

So this would be okay (even though we encountered b via two different
links, the second encounter was a command line, so it should not be
elided entirely, but listing 0 would make it obvious that there is no
further disk usage to count),

 $ du b .
 4   b
 0   ./b
 4   .

whereas this proposed line of '0 ./b' is questionable (we could argue
that ./b should not be elided because no other link to b was printed
during recursion, or we could argue that elision should trump recursion
once the command line arguments have been printed).

-- 
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-12 Thread Paul Eggert
On 12/12/11 14:58, Eric Blake wrote:
 Files with multiple links shall be counted and written for only one
 entry. The directory entry that is selected in the report is unspecified.

Yes, that's partly what motivates the current GNU du behavior:
the idea is to implement this notion consistently (historical
'du' implementations do not).

 But even historically, command line arguments were always listed, even
 if they are otherwise multiple links.

I suppose we could change GNU 'du' to output 0 X for a command-line
argument X that's already been seen.  This wouldn't address the problem
perceived by the original poster, though.  And it's a glitch from the
point of view of consistency.

Perhaps 'du' needs a new option to control what to do with
files that 'du' has already seen before, something that
generalizes --count-links.





bug#10282: change in behavior of du with multiple arguments (commit efe53cc)

2011-12-12 Thread Jim Meyering
Paul Eggert wrote:

 On 12/12/11 14:58, Eric Blake wrote:
 Files with multiple links shall be counted and written for only one
 entry. The directory entry that is selected in the report is unspecified.

 Yes, that's partly what motivates the current GNU du behavior:
 the idea is to implement this notion consistently (historical
 'du' implementations do not).

 But even historically, command line arguments were always listed, even
 if they are otherwise multiple links.

 I suppose we could change GNU 'du' to output 0 X for a command-line
 argument X that's already been seen.

This seems sensible.

 This wouldn't address the problem
 perceived by the original poster, though.  And it's a glitch from the
 point of view of consistency.

I agree that printing 0 X for these seems inconsistent with the
elision mandated for the second and subsequent encounter of a file,
but I suppose command line arguments are intrinsically different
enough that handling them specially makes sense.  Maybe even as
the default.

 Perhaps 'du' needs a new option to control what to do with
 files that 'du' has already seen before, something that
 generalizes --count-links.

That sounds like a good way to do it.
Anyone interested?