How to sort and count efficiently?

2019-06-30 Thread Peng Yu
Hi,

I have a long list of strings (one string per line). I need to
count the number of occurrences of each string.

I currently use `sort` to sort the list and then use another program
to do the counting. The second program needs only a small amount of
memory, because its input is sorted.

But `sort` writes a lot of large temp files like `sortjISjDY`.
Because I only need the counts, ideally these temp files would keep
each distinct string just once along with its count, instead of
repeating the original string many times. Does anybody know a better
way to make the sort and count run more efficiently?
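
Concretely, the current pipeline is a sketch like the following, with
`uniq -c` standing in for the counting program:

```shell
# Sorting brings equal strings together, so the counting step only
# has to remember the current string and a running count.
sort INPUT | uniq -c
```

For example, feeding it the lines a c b b c prints each distinct string
with its count: 1 a, 2 b, 2 c.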

-- 
Regards,
Peng



Re: How to sort and count efficiently?

2019-06-30 Thread Peng Yu
The problem with this kind of awk program is that everything is loaded
into memory, whereas a bare `sort` uses external files to save memory.
When the hash in awk grows too large, accessing it can become very slow
(maybe due to cache misses, or the hash slowing down as a function of
its size).

On Sun, Jun 30, 2019 at 11:52 AM Assaf Gordon  wrote:

> Correcting myself:
>
> On Sun, Jun 30, 2019 at 10:08:46AM -0600, Assaf Gordon wrote:
> > On Sun, Jun 30, 2019 at 07:34:19AM -0500, Peng Yu wrote:
> > >
> > > I have a long list of strings (one string per line). I need to
> > > count the number of occurrences of each string.
> > >
> > > [...] Does anybody know a better way
> > > to make the sort and count run more efficiently?
> > >
> >
> > Or using gnu awk:
>
> use 'asorti' instead of 'asort', with the two-parameter variant:
>
>
>   $ printf "%s\n" a c b b b b b b c \
>     | awk 'a[$1]++ {}
>            END { n = asorti(a,b)
>                  for (i = 1; i <= n; i++) {
>                    print b[i], a[b[i]]
>                  }
>                }'
>   a 1
>   b 6
>   c 2
>
>
> For more details see:
>
> https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html#Array-Sorting-Functions
>
> -assaf
>
> --
Regards,
Peng


Re: How to sort and count efficiently?

2019-06-30 Thread Assaf Gordon
Correcting myself:

On Sun, Jun 30, 2019 at 10:08:46AM -0600, Assaf Gordon wrote:
> On Sun, Jun 30, 2019 at 07:34:19AM -0500, Peng Yu wrote:
> > 
> > I have a long list of strings (one string per line). I need to
> > count the number of occurrences of each string.
> > 
> > [...] Does anybody know a better way
> > to make the sort and count run more efficiently?
> > 
> 
> Or using gnu awk:

use 'asorti' instead of 'asort', with the two-parameter variant:


  $ printf "%s\n" a c b b b b b b c \
    | awk 'a[$1]++ {}
           END { n = asorti(a,b)
                 for (i = 1; i <= n; i++) {
                   print b[i], a[b[i]]
                 }
               }'
  a 1
  b 6
  c 2


For more details see:
https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html#Array-Sorting-Functions

-assaf




Re: How to sort and count efficiently?

2019-06-30 Thread Assaf Gordon



On 2019-06-30 11:10 a.m., Peng Yu wrote:
> The problem with this kind of awk program is that everything is loaded
> into memory.


Well, those are the two main options: store everything in memory, or
resort to disk I/O. Each has its own pros and cons.


> But a bare `sort` uses external files to save memory.

Not exactly - the goal is not to "save" memory. Sort resorts to
external files so that it can complete the sort even when it runs out
of its allotted memory (which can be controlled with the "-S"
parameter).
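
For example (a sketch; the buffer size and temp directory here are
arbitrary, and `uniq -c` stands in for the separate counting program):

```shell
# Let sort use up to 100 MB of memory; beyond that it spills sorted
# runs to temporary files under /tmp and merges them at the end.
sort -S 100M -T /tmp INPUT | uniq -c
```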

I'm not familiar with a program that implements hashing backed by
file storage, but perhaps such a program exists.

> When the hash in awk grows too large, accessing it can become very
> slow (maybe due to cache misses, or the hash slowing down as a
> function of its size).


Nothing is "free", and using a hash incurs its own costs.

If you're using the simple awk hashing program, try AWK
implementations other than GNU awk (e.g. I have had some performance
gains from switching to "mawk", the default awk in Debian).

  $ printf "%s\n" a c b b b b b b c \
    | mawk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }'
  a 1
  b 6
  c 2

Or, if your input is exceedingly large, perhaps consider pre-processing
it by splitting the input into smaller files - each one will contain
fewer strings, and hashing them will consume less memory.
The following example splits the input file into 27 files, based on
the first letter of each string (plus an "other" file for non-letters):

  mawk '{ l = tolower(substr($0,1,1)) ;
          if (l >= "a" && l <= "z") {
            print $0 > l
          } else {
            print $0 > "other"
          }
        }' INPUT

This is an O(N) operation that doesn't consume much memory (just lots
of disk I/O) - and the resulting files will be much smaller, so each
can be hashed with less memory.
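
The split files can then be counted one at a time, each with its own
small hash, and the outputs simply concatenated, since a given string
lands in exactly one file. A sketch (plain awk here; mawk works the
same):

```shell
# Count each split file independently; no cross-file merging of
# counts is needed because the split key is the first character.
# The final sort is only for presentation.
for f in [a-z] other; do
  [ -e "$f" ] || continue   # skip letters that produced no file
  awk '{ a[$0]++ } END { for (i in a) print i, a[i] }' "$f"
done | sort -k1,1
```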


Of course this can be extended to split into smaller-grained files.

-assaf



Re: How to sort and count efficiently?

2019-06-30 Thread Assaf Gordon
Hello,

On Sun, Jun 30, 2019 at 07:34:19AM -0500, Peng Yu wrote:
> Hi,
> 
> I have a long list of strings (one string per line). I need to
> count the number of occurrences of each string.
> 
> I currently use `sort` to sort the list and then use another program
> to do the counting. The second program needs only a small amount of
> memory, because its input is sorted.
> 
> [...] Does anybody know a better way
> to make the sort and count run more efficiently?
> 

Using awk:

  awk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }' INPUT \
| sort -k1,1

Or using gnu awk:

  awk 'a[$1]++ {}
       END { n = asort(a) ;
             for (i = 1; i <= n; i++) {
               print i, a[i]
             }
           }'


regards,
 - assaf