bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
On 1/22/19 2:40 PM, Bernhard Voelker wrote:
> This sounds to me as if you wanted 'du' to read() the content of each file
> to get the 'correct' statistics. That is more in the domain of wc(1).

du already has an --apparent-size option that gives the same size that 'read' would give. As I understand it, this part of the request was to change the (arguably confusing) name of this option to a different (and also arguably confusing :-) name. As the option name has been that way for quite some time and the proposed name is not that much less confusing than the old one, I think we'll stand pat.
bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
On 1/17/19 11:13 AM, René J.V. Bertin wrote:
> I realise that you cannot really call the content size observable "real size"
> when reporting from a disk-usage viewpoint, but "content size"
> (--content-size, -C) should be clear enough?

This sounds to me as if you wanted 'du' to read() the content of each file to get the 'correct' statistics. That is more in the domain of wc(1).

Have a nice day,
Berny
bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
Hello,

On 2019-01-18 2:56 a.m., René J.V. Bertin wrote:
> the code isn't the most welcoming to dive into I've ever seen ;)

Two online resources that might help in exploring the code:

  http://www.maizure.org/projects/decoded-gnu-coreutils/
  https://opengrok.housegordon.com/source/xref/coreutils/

regards,
 - assaf
bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
On Thursday January 17 2019 23:43:39 Assaf Gordon wrote:

>The parameter name "--apparent-size" is not likely to be changed.

I was thinking of making it an alias for a more aptly named parameter (long) before (possibly) phasing out the current name.

>It has been named so for about 16 years (since 'fileutils 4.5.8'

A lot has happened on the filesystem front since that time. Just saying :)

>Concrete patches are welcomed.

I bet. We'll see who finds the time for that (the code isn't the most welcoming to dive into I've ever seen ;))

Cheers,
R.
bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
severity 34110 wishlist
retitle 34110 du: add dual-column showing apparent-size and disk-size
stop

Hello,

On 2019-01-17 3:13 a.m., René J.V. Bertin wrote:
> On Wednesday January 16 2019 16:06:50 Assaf Gordon wrote:
>> I hope this helps to clarify "apparent-size".
>
> Yes and no :) I understand what "apparent-size" does [...]
> My whole point is that there might be a better name.

The parameter name "--apparent-size" is not likely to be changed. It has been named so for about 16 years (since 'fileutils 4.5.8', which is even before 'coreutils' was created as a unified package). Changing it would break existing scripts and user expectations.

> I realise that you cannot really call the content size observable "real size"
> when reporting from a disk-usage viewpoint, but "content size"
> (--content-size, -C) should be clear enough?

Creating a second alias to "--apparent-size" is possible, but I'm not sure it's warranted.

---

I think the discussion about "--apparent-size" is mostly concluded, but the idea of having two columns is an interesting feature request. I'm marking this as a "wish list" item.

Concrete patches are welcomed.

regards,
 - assaf
bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
On Wednesday January 16 2019 16:06:50 Assaf Gordon wrote:

Hello,

Yes, I used the exact same directory in all comparisons. It's a nodejs cache (or whatever) directory as you may have guessed; I picked it because it's a good example of the sort of directory found these days which can create considerable overhead. Small enough that it'd tend to get dismissed as insignificant, but containing a large number of files (almost 8000 in my case), most of them tiny.

>I hope this helps to clarify "apparent-size".

Yes and no :) I understand what "apparent-size" does (and have dug through the code looking for ideas how to do similar things in one of my own apps). My whole point is that there might be a better name. I know one should distinguish everyday language and technical terms, but if the latter start to appear (pun intended) like the former (and lack a shorthand) then they'd best be chosen such that they don't require thinking about their interpretation.

Paul's comment about not being able to know what happens underneath only makes this argument stronger IMHO. On the one hand, du can only report how big an item would appear to be on disk (based on what stat() reports). In addition, how would it handle knowledge about the number of disks that a given file is written to? On the other hand, the actual content size is a given that shouldn't change and that is not subject to any existential questions. (Though as my examples show, this isn't necessarily true when du'ing directories, and especially so for HFS+ with compression.)

I realise that you cannot really call the content size observable "real size" when reporting from a disk-usage viewpoint, but "content size" (--content-size, -C) should be clear enough? "Estimated on-disk size" would be good enough as a header for the other observable (an estimate can be 100% accurate after all).

Cheers,
R.
bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
I like the idea of two columns at once.

> with "--apparent-size", du returns the actual file size; without, it returns
> how large the file appears to be (judging from its disk footprint).

The "apparent" size is the size that "ls -l" outputs, and is the size that traditional I/O operations like 'read' and 'write' deal with, regardless of the underlying implementation (where the size might be smaller or larger than the "apparent" size). In contrast, the "disk usage" size is whatever the filesystem tells us it is.

I wouldn't call either size the "actual" size these days, as even the disk usage (or "disk footprint") might be virtual blocks stored in a lower-level compressed device, and there's no way "du" can find out how much of the lower-level device is being used.
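To make the two observables concrete, here is a minimal Python sketch (not du's implementation) that reads both from a single lstat() call. It assumes a POSIX-like system where Python exposes st_blocks, which the Python documentation describes as a count of 512-byte units; the helper names and the file size used in the demo are made up for illustration.

```python
import os
import tempfile


def apparent_and_disk_size(path):
    """Return (apparent_size, disk_usage) in bytes for one path."""
    st = os.lstat(path)
    # st_size: the byte count that 'ls -l' prints and read() would deliver.
    # st_blocks: allocated space as reported by the filesystem, in
    # 512-byte units (assumption: POSIX-like platform).
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, size):
    """Create a file of the given apparent size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "sparse")
        make_sparse_file(p, 1_050_000)  # arbitrary illustrative size
        apparent, on_disk = apparent_and_disk_size(p)
        print("apparent:", apparent, "on disk:", on_disk)
```

On filesystems that support holes the second number will typically be far smaller than the first, but as noted above it is only what the filesystem chooses to report.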
bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
Hello,

I'll address only the "apparent-size" issue (not the two-columns, or compressed file-systems):

On 2019-01-16 1:13 p.m., René J.V. Bertin wrote:
> According to `du --help`, the apparent-size option reports a size that is
> not the actual disk usage. The numbers above seem to show the opposite.
> If anything, I find the concept of "apparent size" more appropriate to the
> size a file occupies on the storage medium because ultimately that storage
> device will not give you more than "struct stat : st_size" bytes for
> uncompressed filesystems. Another way to say it: with "--apparent-size",
> du returns the actual file size; without, it returns how large the file
> appears to be (judging from its disk footprint).

"apparent-size" shows how much content/data the file has. Without "apparent-size", du shows the amount of storage consumed (or "wasted"?) on the storage medium (accounting for sparse file holes, though I'm not sure about compression).

To illustrate, create three files with specific sizes:

  $ head --bytes=1700 /dev/zero > a
  $ head --bytes=4097 /dev/zero > b
  $ truncate --size=1050000 c   # will be a sparse file

These are their sizes, as in the amount of bytes they contain:

  $ ls -log
  total 12
  -rw-r--r-- 1    1700 Jan 16 15:36 a
  -rw-r--r-- 1    4097 Jan 16 15:36 b
  -rw-r--r-- 1 1050000 Jan 16 15:37 c

These are their "apparent-sizes", rounded up to the nearest 1K block:

  $ du --apparent-size a b c
  2       a
  5       b
  1026    c

e.g. file "a" is 1700 bytes, rounded up to 2K, and "du --apparent-size" shows "2".

Using "--apparent-size --block-size=1" (and its equivalent, "--bytes") will show the exact sizes:

  $ du --apparent-size --block-size=1 a b c
  1700    a
  4097    b
  1050000 c

Without "--apparent-size", du shows how much storage space is actually used/wasted/consumed on the storage medium by the files:

  $ du a b c
  4       a
  8       b
  0       c

How are these numbers calculated?
The simplest case is file "c" - it is completely sparse - so despite logically containing 1,050,000 zeros, on the actual storage medium it consumes zero data blocks (ignoring inode blocks and somesuch).

File "a" has 1,700 bytes of data. On my filesystem the basic block size is 4096, as shown by "stat -f":

  $ stat -f /
    File: "/"
      ID: 5a2cade519bada6a Namelen: 255     Type: ext2/ext3
  ->Block size: 4096       Fundamental block size: 4096<-
  Blocks: Total: 27559017   Free: 18845977   Available: 17435289
  Inodes: Total: 7036928    Free: 6496730

Therefore, any file from size 1 to size 4096 will consume exactly one disk block. On most common filesystems, disk blocks cannot be shared between files, meaning that this block is fully consumed. That's why for file "a" du shows "4" - meaning 4K bytes (exactly one block) is consumed on the storage medium by this file.

Similarly for file "b" - its size is 4097, which is 1 byte more than one filesystem block. Hence, file "b" consumes 2 blocks, coming up to 8K. du then shows "8" for file "b".

Now to your examples:

> %> du -hcs /Volumes/nif64/tmp/.npm/ ; du -hcs --apparent-size /Volumes/nif64/tmp/.npm/
> 340M /Volumes/nif64/tmp/.npm/
> 180M /Volumes/nif64/tmp/.npm/
>
> Same folder on btrfs (mounted with compress=lzo):
> %> du -hcs /mnt/.npm/ ; du -hcs --apparent-size /mnt/.npm
> 198M /mnt/.npm/
> 181M /mnt/.npm

In both cases, "du --apparent-size" shows about 180MB of actual data (181MB in the second example). That is the amount of actual content (number of total bytes in these files).

In the first case, these files consume 340MB of space on your disk. In the second case, these files consume 198MB of space on your disk. The reason they consume MORE than their actual data is explained above with the file-system blocks.

This suggests to me that compression is not accounted for in these values. If it was, then the consumed size (without "--apparent-size") should've been less than the actual size (with "--apparent-size").
A quick on-line search shows that btrfs's default block size is 16K, while ZFS's default record-size is 128KB. That might explain why a similar amount of data (and I assume, a similar number of files and sizes) consumes more disk space on ZFS. (Could be wrong, though; comments are welcomed.)

I hope this helps to clarify "apparent-size". I'll leave it to others to comment on how compressed file systems come into play with du.

regards,
 - assaf
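The rounding described in this message can be checked with a few lines of arithmetic. The sketch below is only a model of the explanation, not of du's code: the function names are made up, and the 4096-byte block size is the one reported by "stat -f" in the example. It assumes fully allocated (non-sparse, uncompressed) files, so it does not apply to file "c".

```python
def round_up_units(size, unit):
    """Number of whole units needed to hold `size` bytes (ceiling division)."""
    return -(-size // unit)


def apparent_kib(size):
    """What 'du --apparent-size' would print with its default 1K unit."""
    return round_up_units(size, 1024)


def allocated_kib(size, fs_block=4096):
    """KiB consumed when each file occupies whole filesystem blocks."""
    return round_up_units(size, fs_block) * fs_block // 1024


if __name__ == "__main__":
    for name, size in (("a", 1700), ("b", 4097)):
        print(name, apparent_kib(size), allocated_kib(size))
    # prints:
    # a 2 4
    # b 5 8
```

The same ceiling division gives 1026 for the apparent size of the 1,050,000-byte sparse file, matching the "du --apparent-size" listing.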
bug#34110: feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)
Hi,

I hope feature requests are acceptable here.

Now that more and more filesystems have support for compression, it becomes more interesting to compare actual file/directory (content) size and the corresponding on-disk size. Currently you have to call du twice to do that, which quickly becomes cumbersome in practice (commandlines, parsing the output) and requires repeating the same IO operations twice.

The code obtains both size values at the same time, so it would make sense to do both calculations at the same time, and provide an option to display the regular and "apparent-size" values in column output. My guess would be that the cost of calculating both output values at the same time is negligible w.r.t. the cost of the stat() call (and thus that there's no need to complexify the code with "calculate this and/or that" conditionals).

The option could be called --both, --columns (-C) or --two (-T).

I'd also reconsider the "apparent-size" term as I think it is confusing and ambiguous. Consider this, taken from a ZFS dataset with gzip-9 compression (and copies=1; du v8.30):

  %> du -hcs /Volumes/nif64/tmp/.npm/ ; du -hcs --apparent-size /Volumes/nif64/tmp/.npm/
  340M /Volumes/nif64/tmp/.npm/
  180M /Volumes/nif64/tmp/.npm/

Same folder on btrfs (mounted with compress=lzo):

  %> du -hcs /mnt/.npm/ ; du -hcs --apparent-size /mnt/.npm
  198M /mnt/.npm/
  181M /mnt/.npm

According to `du --help`, the apparent-size option reports a size that is not the actual disk usage. The numbers above seem to show the opposite. If anything, I find the concept of "apparent size" more appropriate to the size a file occupies on the storage medium because ultimately that storage device will not give you more than "struct stat : st_size" bytes for uncompressed filesystems. Another way to say it: with "--apparent-size", du returns the actual file size; without, it returns how large the file appears to be (judging from its disk footprint).
For comparison: same folder, on Mac with HFS+

  %> du -hcs /Volumes/VMs/.npm ; du -hcs --apparent-size /Volumes/VMs/.npm
  198M /Volumes/VMs/.npm
  181M /Volumes/VMs/.npm

Same, with HFS+ compression (zip-9):

  %> du -hcs /Volumes/VMs/.npm ; du -hcs --apparent-size /Volumes/VMs/.npm
  115M /Volumes/VMs/.npm
  148M /Volumes/VMs/.npm

Thoughts?

Thanks,
R.
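As a rough illustration of the request (both totals from one traversal, printed side by side), here is a hedged Python sketch. It is not a patch for du and makes no claim to reproduce du's totals: du_both and format_row are hypothetical names, st_blocks is assumed to be in 512-byte units, directories' own st_size is included in the apparent total, and hard links are counted once.

```python
import os
import stat
import tempfile


def du_both(root):
    """Return (disk_usage_bytes, apparent_bytes) for a tree, one lstat per entry."""
    disk = 0
    apparent = 0
    seen = set()  # (st_dev, st_ino) of hard-linked files already counted

    def account(path):
        nonlocal disk, apparent
        st = os.lstat(path)
        if st.st_nlink > 1 and not stat.S_ISDIR(st.st_mode):
            key = (st.st_dev, st.st_ino)
            if key in seen:
                return
            seen.add(key)
        # Both observables come from the same stat result.
        disk += st.st_blocks * 512
        apparent += st.st_size

    account(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            account(os.path.join(dirpath, name))
    return disk, apparent


def format_row(disk, apparent, path):
    """One output line: on-disk size, apparent size, path (tab-separated)."""
    return f"{disk}\t{apparent}\t{path}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "a"), "wb") as f:
            f.write(b"\0" * 1700)
        print(format_row(*du_both(d), d))
```

Since each entry is stat'ed only once, the second column costs one extra addition per file, which is in line with the guess that the overhead is negligible next to the stat() call itself.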