bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-22 Thread Paul Eggert

On 1/22/19 2:40 PM, Bernhard Voelker wrote:

> This sounds to me as if you wanted 'du' to read() the content of each file
> to get the 'correct' statistics.  That is more in the domain of wc(1).


du already has an --apparent-size option that gives the same size that 
'read' would give. As I understand it, this part of the request was to 
change the (arguably confusing) name of this option to a different (and 
also arguably confusing :-) name. As the option name has been that way 
for quite some time and the proposed name is not that much less 
confusing than the old, I think we'll stand pat.







bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-22 Thread Bernhard Voelker
On 1/17/19 11:13 AM, René J.V. Bertin wrote:
> I realise that you cannot really call the content size observable "real size"
> when reporting from a disk-usage viewpoint, but "content size"
> (--content-size, -C) should be clear enough?

This sounds to me as if you wanted 'du' to read() the content of each file
to get the 'correct' statistics.  That is more in the domain of wc(1).

Have a nice day,
Berny





bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-18 Thread Assaf Gordon

Hello,

On 2019-01-18 2:56 a.m., René J.V. Bertin wrote:


> the code isn't the most welcoming to dive into I've ever seen ;)


Two online resources that might help in exploring the code:

  http://www.maizure.org/projects/decoded-gnu-coreutils/

  https://opengrok.housegordon.com/source/xref/coreutils/

regards,
 - assaf





bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-18 Thread René J . V . Bertin
On Thursday January 17 2019 23:43:39 Assaf Gordon wrote:
>The parameter name "--apparent-size" is not likely to be changed.

I was thinking of making it an alias for a more aptly named parameter (long) 
before (possibly) phasing out the current name. 

>It has been named so for about 16 years (since 'fileutils 4.5.8'

A lot has happened on the filesystem front since that time. Just saying :)

>Concrete patches are welcomed.

I bet. We'll see who finds the time for that (the code isn't the most welcoming 
to dive into I've ever seen ;))

Cheers,
R.





bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-17 Thread Assaf Gordon

severity 34110 wishlist
retitle 34110 du: add dual-column showing apparent-size and disk-size
stop

Hello,

On 2019-01-17 3:13 a.m., René J.V. Bertin wrote:

> On Wednesday January 16 2019 16:06:50 Assaf Gordon wrote:
>> I hope this helps to clarify "apparent-size".
>
> Yes and no :) I understand what "apparent-size" does [...]
> My whole point is that there might be a better name.


The parameter name "--apparent-size" is not likely to be changed.
It has been named so for about 16 years (since 'fileutils 4.5.8'
which is even before 'coreutils' was created as a unified package).

Changing it would break existing scripts and user expectations.


> I realise that you cannot really call the content size observable "real size"
> when reporting from a disk-usage viewpoint, but "content size"
> (--content-size, -C) should be clear enough?


Creating a second alias to "--apparent-size" is possible, but I'm not
sure it's warranted.

---

I think the discussion about "--apparent-size" is mostly concluded,
but the idea to have two-columns is an interesting feature request.

I'm marking this as a "wish list" item.
Concrete patches are welcomed.

regards,
 - assaf









bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-17 Thread René J . V . Bertin
On Wednesday January 16 2019 16:06:50 Assaf Gordon wrote:

Hello,

Yes, I used the exact same directory in all comparisons. It's a nodejs cache 
(or whatever) directory as you may have guessed; I picked it because it's a 
good example of the sort of directory found these days which can create 
considerable overhead. Small enough that it'd tend to get dismissed as insignificant,
but containing a large number of files (almost 8000 in my case), most of them 
tiny.

>I hope this helps to clarify "apparent-size".

Yes and no :) I understand what "apparent-size" does (and have dug through the 
code looking for ideas how to do similar things in one of my own apps).

My whole point is that there might be a better name. I know one should 
distinguish every-day language and technical terms but if the latter start to 
appear (pun intended) like the former (and lack a shorthand) then they'd best 
be chosen such that they don't require thinking about their interpretation.

Paul's comment about not being able to know what happens underneath only makes 
this argument stronger IMHO. On the one hand, du can only report how big an item
would appear to be on disk (based on what stat() reports). In addition, how 
would it handle knowledge about the number of disks that a given file is 
written to? On the other hand, the actual content size is a given that 
shouldn't change and that is not subject to any existential questions. (Though 
as my examples show, this isn't necessarily true when du'ing directories, and
esp. so for HFS+ with compression.)

I realise that you cannot really call the content size observable "real size" 
when reporting from a disk-usage viewpoint, but "content size" (--content-size, 
-C) should be clear enough? "Estimated on-disk size" would be good enough as a 
header for the other observable (an estimate can be 100% accurate after all).

Cheers,
R.





bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-16 Thread Paul Eggert

I like the idea of two columns at once.


> with "--apparent-size", du returns the actual file size; without, it returns
> how large the file appears to be (judging from its disk footprint).


The "apparent" size is the size that "ls -l" outputs, and is the size 
that traditional I/O operations like 'read' and 'write' deal with, 
regardless of the underlying implementation (where the size might be 
smaller or larger than the "apparent" size). In contrast the "disk 
usage" size is whatever the filesystem tells us it is. I wouldn't call 
either size the "actual" size these days, as even the disk usage (or 
"disk footprint") might be virtual blocks stored in a lower-level 
compressed device, and there's no way "du" can find out how much of the 
lower-level device is being used.
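
Both numbers can be read from a single stat() call. A minimal sketch, assuming GNU stat from coreutils (the function name show_sizes is hypothetical): %s is the apparent size in bytes, while %b times %B is the allocated size the filesystem reports.

```shell
# show_sizes FILE...: print "allocated<TAB>apparent<TAB>name", both in bytes.
# Assumes GNU stat; show_sizes is a hypothetical helper, not a coreutils tool.
show_sizes() {
  for f in "$@"; do
    # %s: size in bytes; %b: blocks allocated; %B: bytes per block counted by %b
    stat --format='%s %b %B' -- "$f" | {
      read -r apparent blocks blocksize
      printf '%s\t%s\t%s\n' "$(( blocks * blocksize ))" "$apparent" "$f"
    }
  done
}
```

The allocated figure is only what the filesystem chooses to report, so it is subject to the same caveat about lower-level compressed devices.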







bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-16 Thread Assaf Gordon

Hello,

I'll address only the "apparent-size" issue (not the two-columns, or 
compressed file-systems):


On 2019-01-16 1:13 p.m., René J.V. Bertin wrote:


> According to `du --help`, the apparent-size option reports a size that is not
> the actual disk usage. The numbers above seem to show the opposite.
> If anything, I find the concept of "apparent size" more appropriate to the size
> a file occupies on the storage medium because ultimately that storage device
> will not give you more than "struct stat : st_size" bytes for uncompressed
> filesystems.
> Another way to say it: with "--apparent-size", du returns the actual file size;
> without, it returns how large the file appears to be (judging from its disk
> footprint).


"apparent-size" shows how much content/data the file has.
Without "apparent-size", du shows the amount of storage consumed (or
"wasted"?) on the storage medium (accounting for sparse-file holes, though
I'm not sure about compression).


To illustrate, create three files with specific sizes:

  $ head --bytes=1700 /dev/zero > a
  $ head --bytes=4097 /dev/zero > b
  $ truncate --size=1050000 c   # will be a sparse file

These are their sizes, as in the amount of bytes they contain:

  $ ls -log
  total 12
  -rw-r--r-- 1    1700 Jan 16 15:36 a
  -rw-r--r-- 1    4097 Jan 16 15:36 b
  -rw-r--r-- 1 1050000 Jan 16 15:37 c


These are their "apparent-sizes", rounded up to the nearest
1K block:

  $ du --apparent-size a b c
  2 a
  5 b
  1026  c

e.g. file "a" is 1700 bytes, rounded-up to 2K, and "du --apparent-size"
shows "2".

Using "--apparent-size --block-size=1" (and its equivalent, "--bytes")
will show the exact sizes:

  $ du --apparent-size --block-size=1 a b c
  1700     a
  4097     b
  1050000  c

Without "--apparent-size", du shows how much storage space is actually 
used/wasted/consumed on the storage medium by the files:


  $ du a b c
  4  a
  8  b
  0  c

How are these numbers calculated?

The simplest case is file "c" - it is completely sparse - so despite
logically containing 1,050,000 zeros, on the actual storage medium it
consumes zero data blocks (ignoring inode blocks and the like).


File "a" has 1,700 bytes of data.
On my filesystem the basic block size is 4096, as shown by "stat -f":

  $ stat -f /
    File: "/"
      ID: 5a2cade519bada6a Namelen: 255     Type: ext2/ext3
  ->Block size: 4096       Fundamental block size: 4096<-
  Blocks: Total: 27559017   Free: 18845977   Available: 17435289
  Inodes: Total: 7036928    Free: 6496730

Therefore, any file from size 1 to size 4096 will consume exactly one
disk block. On most common filesystems, disk blocks cannot be shared
between files, meaning that this block is fully consumed.

That's why for file "a" du shows "4" - meaning 4K bytes (exactly one
block) is consumed on the storage medium by this file.

Similarly for file "b" - its size is 4097, which is 1 byte more than one
filesystem block. Hence, file "b" consumes 2 blocks, coming up to 8K.
du then shows "8" for file "b".
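
The rounding above can be reproduced with plain integer arithmetic. This is only a sketch of the arithmetic for fully allocated (non-sparse) files, not of what du does internally; the function name blocks_1k is hypothetical.

```shell
# blocks_1k SIZE UNIT: round SIZE bytes up to whole UNIT-byte units and
# print the result in 1K blocks. Hypothetical helper, not part of coreutils.
blocks_1k() {
  size=$1
  unit=$2
  # Integer ceiling division: number of UNIT-sized pieces needed.
  pieces=$(( (size + unit - 1) / unit ))
  # Convert to 1K blocks, rounding up again for units under 1024 bytes.
  echo $(( (pieces * unit + 1023) / 1024 ))
}
```

With a 4096-byte filesystem block, "blocks_1k 1700 4096" prints 4 and "blocks_1k 4097 4096" prints 8, matching plain du for files "a" and "b". With a 1024-byte unit it prints 2 and 5, matching "du --apparent-size". It does not model sparse files such as "c", where the allocated size is zero.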


Now to your examples:


> %> du -hcs /Volumes/nif64/tmp/.npm/ ; du -hcs --apparent-size /Volumes/nif64/tmp/.npm/
> 340M    /Volumes/nif64/tmp/.npm/
> 180M    /Volumes/nif64/tmp/.npm/
>
> Same folder on btrfs (mounted with compress=lzo):
> %> du -hcs /mnt/.npm/ ; du -hcs --apparent-size  /mnt/.npm
> 198M    /mnt/.npm/
> 181M    /mnt/.npm

In both cases, "du --apparent-size" shows about 180MB of actual data
(181MB in the second example). That is the amount of actual content
(number of total bytes in these files).

In the first case, these files consume 340MB of space on your disk.
In the second case, these files consume 198MB of space on your disk.
The reason they consume MORE than their actual data is explained above
with the file-system blocks.

This suggests to me that compression is not accounted for in these
values. If it were, then the consumed size (without "--apparent-size")
should've been less than the actual size (with "--apparent-size").

A quick online search shows that btrfs's default block size is 16K,
while ZFS's default record size is 128KB. That might explain
why a similar amount of data (and, I assume, a similar number of files and
sizes) consumes more disk space on ZFS (I could be wrong, though; comments
are welcome).


I hope this helps to clarify "apparent-size".

I'll leave it to others to comment on how compressed file systems
come into play with du.

regards,
 - assaf








bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)

2019-01-16 Thread René J . V . Bertin
Hi,

I hope feature requests are acceptable here.

Now that more and more filesystems have support for compression it becomes more
interesting to compare the actual file/directory (content) size and the
corresponding on-disk size. Currently you have to call du twice to do that, 
which quickly becomes cumbersome in practice (commandlines, parsing the output) 
and requires repeating the same IO operations twice.

The code obtains both size values at the same time so it would make sense to do 
both calculations at the same time, and provide an option to display the 
regular and "apparent-size" values in column output. My guess would be that the 
cost of calculating both output values at the same time is negligible w.r.t. 
the cost of the stat() call (and thus that there's no need to complexify the 
code with "calculate this and/or that" conditionals).

The option could be called --both, --columns (-C) or --two (-T).
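
Until such an option exists, a single directory walk can approximate both totals. A rough sketch, assuming GNU find (its -printf %s is the size in bytes and %b the allocation in 512-byte blocks) and any POSIX awk; the name du2 is hypothetical. Unlike du, it counts only regular files and counts hard-linked files once per link.

```shell
# du2 DIR: print "allocated<TAB>apparent<TAB>DIR" totals in bytes for the
# regular files under DIR, using one stat() per file.
# Hypothetical helper; assumes GNU find for -printf.
du2() {
  totals=$(find "$1" -type f -printf '%s %b\n' |
    awk '{ apparent += $1; disk += $2 * 512 }
         END { printf "%.0f\t%.0f", disk, apparent }')
  printf '%s\t%s\n' "$totals" "$1"
}
```

For example, "du2 /mnt/.npm" would print one line with both totals, which is easier to parse than two separate du runs.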

I'd also reconsider the "apparent-size" term as I think it is confusing and 
ambiguous. Consider this, taken from a ZFS dataset with gzip-9 compression (and 
copies=1; du v8.30):

%> du -hcs /Volumes/nif64/tmp/.npm/ ; du -hcs --apparent-size 
/Volumes/nif64/tmp/.npm/
340M    /Volumes/nif64/tmp/.npm/
180M    /Volumes/nif64/tmp/.npm/

Same folder on btrfs (mounted with compress=lzo):
%> du -hcs /mnt/.npm/ ; du -hcs --apparent-size  /mnt/.npm
198M    /mnt/.npm/
181M    /mnt/.npm

According to `du --help`, the apparent-size option reports a size that is not 
the actual disk usage. The numbers above seem to show the opposite.
If anything, I find the concept of "apparent size" more appropriate to the size 
a file occupies on the storage medium because ultimately that storage device 
will not give you more than "struct stat : st_size" bytes for uncompressed 
filesystems. 
Another way to say it: with "--apparent-size", du returns the actual file size; 
without, it returns how large the file appears to be (judging from its disk 
footprint).

For comparison, the same folder on Mac with HFS+:
%> du -hcs /Volumes/VMs/.npm ; du -hcs --apparent-size /Volumes/VMs/.npm
198M    /Volumes/VMs/.npm
181M    /Volumes/VMs/.npm

Idem, with HFS+ compression (zip-9)
%> du -hcs /Volumes/VMs/.npm ; du -hcs --apparent-size /Volumes/VMs/.npm
115M    /Volumes/VMs/.npm
148M    /Volumes/VMs/.npm

Thoughts?

Thanks,
R.