How to print sizes of both files and directories in a directory?
Hi,

`du -h --max-depth=1` only prints directory sizes. Is there a way to print
the sizes of both directories and files in a directory? Thanks.

--
Regards,
Peng
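One possible approach, sketched here assuming GNU `du`: the `-a` flag reports individual files as well as directories, and combines with `--max-depth=1`. The scratch directory below is only for illustration.

```shell
# Build a scratch directory with one file and one subdirectory, then
# list the sizes of both: -a includes files, -h is human-readable,
# --max-depth=1 limits the listing to the top level.
tmp=$(mktemp -d)
mkdir "$tmp/subdir"
printf 'hello\n' > "$tmp/file.txt"
du -ah --max-depth=1 "$tmp"
rm -rf "$tmp"
```

Each entry at depth 1 (the file, the subdirectory, and the directory itself) gets its own size line.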
Re: How to sort and count efficiently?
On 2019-06-30 11:10 a.m., Peng Yu wrote:
> The problem with this kind of awk program is that everything will be
> loaded into memory.

Well, those are the two main options: store in memory, or resort to disk
I/O. Each has its own pros and cons.

> But bare `sort` uses external files to save memory.

Not exactly - the goal is not to "save" memory. Sort resorts to external
files to be able to complete the sort even when it runs out of the
allotted memory (which can be controlled with the "-S" parameter).

I'm not familiar with a program which implements hashing backed by
file storage, but perhaps such a program exists.

> When the hash in awk is too large, accessing it can become very slow
> (maybe due to cache misses, or the hash slowing down as a function of
> its size).

Nothing is "free", and using a hash incurs its own costs.

If you're using the simplified awk hashing program, try AWK
implementations other than GNU awk (e.g. I have had some performance
gains from switching to "mawk", the default awk in Debian):

$ printf "%s\n" a c b b b b b b c \
    | mawk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }'
a 1
b 6
c 2

Or, if your input is exceedingly large, perhaps consider pre-processing
it and splitting the input into smaller files - each one will have fewer
strings, and hashing them will consume less memory.

The following example splits the input file into 27 files, based on the
first letter of the string (and an "other" file for non-letters):

mawk '{ l = tolower(substr($0,1,1));
        if (l >= "a" && l <= "z") { print $0 > l }
        else { print $0 > "other" } }' INPUT

This is an O(N) operation that consumes almost no memory (just lots of
disk I/O) - and the resulting files will be much smaller, so they can
then be hashed with less memory. Of course this can be extended to split
into smaller-grained files.

-assaf
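A minimal end-to-end sketch of the split-then-hash approach described above. The sample input, the temporary directory, and the shard file names are all illustrative; `mawk` can be substituted for `awk` throughout.

```shell
# Split the input by first letter into per-letter shard files, then
# count each shard independently. The shards have disjoint key sets,
# so the per-shard counts can simply be concatenated.
dir=$(mktemp -d)
printf "%s\n" apple ant bee bee cat apple 42 > "$dir/input"
awk -v d="$dir" '{ l = tolower(substr($0,1,1));
  f = (l >= "a" && l <= "z") ? d "/" l : d "/other";
  print $0 > f }' "$dir/input"
rm "$dir/input"           # leave only the shard files
for f in "$dir"/*; do
  awk 'a[$0]++ {} END { for (i in a) print i, a[i] }' "$f"
done | sort -k1,1
rm -r "$dir"
```

Each shard is small enough that its hash fits comfortably in memory, which is the point of the pre-split.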
Re: How to sort and count efficiently?
The problem with this kind of awk program is that everything will be
loaded into memory. But bare `sort` uses external files to save memory.
When the hash in awk is too large, accessing it can become very slow
(maybe due to cache misses, or the hash slowing down as a function of
its size).

On Sun, Jun 30, 2019 at 11:52 AM Assaf Gordon wrote:
> Correcting myself:
>
> On Sun, Jun 30, 2019 at 10:08:46AM -0600, Assaf Gordon wrote:
> > On Sun, Jun 30, 2019 at 07:34:19AM -0500, Peng Yu wrote:
> > >
> > > I have a long list of strings (each string is on its own line). I
> > > need to count the number of appearances of each string.
> > >
> > > [...] Does anybody know any better way
> > > to make the sort and count run more efficiently?
> > >
> >
> > Or using gnu awk:
>
> Use 'asorti' instead of 'asort', with the two-parameter variant:
>
> $ printf "%s\n" a c b b b b b b c \
>     | awk 'a[$1]++ {}
>   END { n = asorti(a, b)
>         for (i = 1; i <= n; i++) {
>           print b[i], a[b[i]]
>         }
>   }'
> a 1
> b 6
> c 2
>
> For more details see:
> https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html#Array-Sorting-Functions
>
> -assaf

--
Regards,
Peng
Re: How to sort and count efficiently?
Correcting myself:

On Sun, Jun 30, 2019 at 10:08:46AM -0600, Assaf Gordon wrote:
> On Sun, Jun 30, 2019 at 07:34:19AM -0500, Peng Yu wrote:
> >
> > I have a long list of strings (each string is on its own line). I
> > need to count the number of appearances of each string.
> >
> > [...] Does anybody know any better way
> > to make the sort and count run more efficiently?
> >
>
> Or using gnu awk:

Use 'asorti' instead of 'asort', with the two-parameter variant:

$ printf "%s\n" a c b b b b b b c \
    | awk 'a[$1]++ {}
  END { n = asorti(a, b)
        for (i = 1; i <= n; i++) {
          print b[i], a[b[i]]
        }
  }'
a 1
b 6
c 2

For more details see:
https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html#Array-Sorting-Functions

-assaf
Re: How to sort and count efficiently?
Hello,

On Sun, Jun 30, 2019 at 07:34:19AM -0500, Peng Yu wrote:
> Hi,
>
> I have a long list of strings (each string is on its own line). I need
> to count the number of appearances of each string.
>
> I currently use `sort` to sort the list and then use another program
> to do the count. The second program doing the count needs only a small
> amount of memory, as its input is sorted.
>
> [...] Does anybody know any better way
> to make the sort and count run more efficiently?
>

Using awk:

awk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }' INPUT \
  | sort -k1,1

Or using gnu awk:

awk 'a[$1]++ {} END { n = asort(a); for (i = 1; i <= n; i++) { print i, a[i] } }'

regards,
 - assaf
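For reference, the first pipeline run on the sample input used elsewhere in the thread; the trailing `sort -k1,1` is needed because awk's `for (i in a)` iterates in unspecified order:

```shell
printf "%s\n" a c b b b b b b c \
  | awk 'a[$1]++ {} END { for (i in a) print i, a[i] }' \
  | sort -k1,1
# Output:
# a 1
# b 6
# c 2
```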
How to sort and count efficiently?
Hi,

I have a long list of strings (each string is on its own line). I need
to count the number of appearances of each string.

I currently use `sort` to sort the list and then use another program to
do the count. The second program doing the count needs only a small
amount of memory, as its input is sorted.

But `sort` writes a lot of temp files like `sortjISjDY`, which are very
large. Because I only need the count, ideally these temp files would
keep each original string just once along with its count, rather than
repeating the string many times. Does anybody know any better way to
make the sort and count run more efficiently?

--
Regards,
Peng
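The sort-then-count pipeline described above can be reproduced with standard tools; `uniq -c` is one such counting program, needing only constant memory because equal strings arrive adjacent in sorted input:

```shell
# sort brings equal strings together; uniq -c then emits each distinct
# string once, prefixed by its count.
printf "%s\n" a c b b b b b b c | sort | uniq -c
```

This produces one line per distinct string (count first, then the string), though `sort` itself still spills the full input to temp files when it exceeds its memory budget.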