[Trisquel-users] Re : Unforeseen feature of the join command

2020-03-16 Thread lcerf
Thousands of lines are nothing.  GNU sort can process millions if not  
billions.  It automatically uses temporary files to not run short of RAM.


Your experience with grep is weird.  As far as I understand, grep's memory  
requirements do not depend on the number of lines (which could be infinite).   
It processes one single line, outputs it if and only if it matches the  
regular expression, then forgets about it to process the subsequent line and  
so on.
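
A quick way to check that behaviour, as a minimal sketch: feed grep a million lines generated on the fly and count the matches.  The input produced by 'seq' and the pattern are only illustrative assumptions; they do not come from your files.

```shell
# Generate one million lines and let grep scan them one after the other.
# No intermediate file is written and nothing is kept once a line is handled.
count=$(seq 1 1000000 | grep -c '000$')   # lines ending in "000"
echo "$count"                             # prints 1000
```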


[Trisquel-users] Re : Unforeseen feature of the join command

2020-03-15 Thread lcerf
It's great that sort can be "trained" to do the same kind of multi-stage  
sorting task that appears to be built into LibreOffice Calc.


It is built in, not trained.  And GNU sort is certainly much faster than what  
LibreOffice achieves on spreadsheets, which are also limited in the number of  
rows they can have.


[Trisquel-users] Re : Unforeseen feature of the join command

2020-03-15 Thread lcerf

I realized that LibreOffice Calc. cannot handle more than a million rows ...

Spreadsheets are only meant for computations on small amounts of data.  To  
store large amounts of data, use text files or a database management system.


I decided to sort the file, first on Column 3, then Column 1, and then on  
Column 2.
Accordingly, I rearranged the columns thusly: $3, $1, $2, $4 and sorted with:  
"sort -nrk 1,4" where "nr" puts the biggest numbers at the top of the column,  
but sort evidently did not reach to the third column, resulting in an  
ordering of only hostname and visits-per-domain.


'sort -k 1,4' uses the part of the line up to column 4 (one single string) to  
sort.  It is not what you want, and neither is 'sort -k 1,3'.  What you want  
("sort the file, first on Column 3, then Column 1, and then on Column 2") is  
achieved by giving option -k three times (the order matters): 'sort -k 3,3  
-k 1,1 -k 2,2'.
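
Here is a small sketch of that multi-key sort, with one -k per key.  The five lines of data are made up for the illustration (they are not from your file), and LC_ALL=C is only set so that the result does not depend on the locale.

```shell
export LC_ALL=C   # byte-wise comparisons, independent of the user's locale

# Hypothetical four-column data, unsorted.
data='b 2 x 9
a 2 y 1
a 1 x 5
b 1 x 7
a 2 x 3'

# Column 3 first, ties broken on column 1, remaining ties on column 2.
sorted=$(printf '%s\n' "$data" | sort -k 3,3 -k 1,1 -k 2,2)
printf '%s\n' "$sorted"
# a 1 x 5
# a 2 x 3
# b 1 x 7
# b 2 x 9
# a 2 y 1
```

Modifiers can be attached to each key separately, e.g. '-k 3,3nr' to compare only the third column numerically and in reverse.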


Again: read 'info sort'.  At the end of it, there are even well-explained  
examples with multiple -k options, starting with this one:

 Sort numerically on the second field and resolve ties by sorting
 alphabetically on the third and fourth characters of field five.
 Use ‘:’ as the field delimiter.

  sort -t : -k 2,2n -k 5.3,5.4

 Note that if you had written ‘-k 2n’ instead of ‘-k 2,2n’ ‘sort’
 would have used all characters beginning in the second field and
 extending to the end of the line as the primary _numeric_ key.  For
 the large majority of applications, treating keys spanning more
 than one field as numeric will not do what you expect.

 Also note that the ‘n’ modifier was applied to the field-end
 specifier for the first key.  It would have been equivalent to
 specify ‘-k 2n,2’ or ‘-k 2n,2n’.  All modifiers except ‘b’ apply to
 the associated _field_, regardless of whether the modifier
 
 character is attached to the field-start and/or the field-end part
 of the key specifier.
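
To see what the 'n' modifier changes in that example, here is a sketch on three made-up colon-delimited lines (the names u1, u2, u3 and the field contents are assumptions chosen for the illustration).

```shell
export LC_ALL=C   # byte-wise comparisons for the non-numeric keys

data='u1:10:x:x:abzz
u2:9:x:x:abaa
u3:10:x:x:abaa'

# Field 2 compared as a number, ties resolved on characters 3-4 of field 5.
numeric=$(printf '%s\n' "$data" | sort -t : -k 2,2n -k 5.3,5.4)
printf '%s\n' "$numeric"
# u2:9:x:x:abaa
# u3:10:x:x:abaa
# u1:10:x:x:abzz

# Same keys without 'n': field 2 is compared as text, so "10" comes before "9".
textual=$(printf '%s\n' "$data" | sort -t : -k 2,2 -k 5.3,5.4)
printf '%s\n' "$textual"
# u3:10:x:x:abaa
# u1:10:x:x:abzz
# u2:9:x:x:abaa
```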


[Trisquel-users] Re : Unforeseen feature of the join command

2020-03-14 Thread lcerf
I had been guilty of not reading (i.e., unaware of !) the "info sort"  
material


'man sort' specifies the expected argument of option -k, although in a much  
more arid way (as always for GNU commands: 'info' provides the full  
documentation):


   -k, --key=KEYDEF
          sort via a key; KEYDEF gives location and type
(...)
   KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position,
   where F is a field number and C a character position in the field;
   both are origin 1, and the stop position defaults to the line's end.


"cut" is not yet in my scripting vocabulary.

'cut' can select fields, specified after -f as you specify pages to print (so  
"-2" means "every field up to the second one"; I could have written "1,2" for  
the same effect), delimited by one single character (the tab is the default  
but a different one can be specified after -d).  See 'info cut' for the full  
documentation.
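
A minimal sketch of those two options, on made-up one-line inputs (the letters a to d are placeholders, not your data):

```shell
# Tab-delimited input (the default delimiter of cut).
line=$(printf 'a\tb\tc\td')

upto2=$(printf '%s\n' "$line" | cut -f -2)    # every field up to the second one
list12=$(printf '%s\n' "$line" | cut -f 1,2)  # same selection, written as a list

# A different single-character delimiter is given after -d.
colon=$(printf 'a:b:c\n' | cut -d : -f -2)
echo "$colon"                                 # prints a:b
```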


[Trisquel-users] Re : Unforeseen feature of the join command

2020-03-13 Thread lcerf

Each has three columns: IPv4, Count 1 or Count 2, Textfile name.

Above, I assumed ListA.txt had one single column (because the attached file  
is like that).  If there are more columns, then ++count[$0] must be replaced  
with ++count[$1].
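
For the record, a sketch of what counting on the first column looks like.  It is not the exact program discussed earlier in the thread: the three input lines and the final 'sort' (added only to get a stable output order) are assumptions made for the illustration.

```shell
export LC_ALL=C

# Hypothetical three-column lines: IPv4, count, text file name.
counts=$(printf '%s\n' \
  '1.2.3.4 5 fileA' \
  '5.6.7.8 2 fileA' \
  '1.2.3.4 1 fileB' |
  awk '{ ++count[$1] }                             # key on column 1 only
       END { for (ip in count) print ip, count[ip] }' |
  sort)
printf '%s\n' "$counts"
# 1.2.3.4 2
# 5.6.7.8 1
```

With ++count[$0] instead, the two lines starting with 1.2.3.4 would be counted separately, because the whole lines differ.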


For performance, use pipes instead of temporary files and avoid useless  
commands.  As far as I understand the beginning of your last post (very  
little), you do not need to reorder the input fields and can get rid of the  
last one in DataY-DVCts.txt (your awk program does not print it).  In the  
end, it looks like you could write:
$ join 


[Trisquel-users] Re : Unforeseen feature of the join command

2020-03-11 Thread lcerf
'sort -k 1' is the same as 'sort'.  Indeed, giving one single integer N (here  
1) to option -k means "*from* the Nth column to the last one".  Here is the  
relevant excerpt from 'info sort':


‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
 Specify a sort field that consists of the part of the line between
 POS1 and POS2 (or the end of the line, if POS2 is omitted),
 _inclusive_.

It is the root of your problem here.  Because of that, the join fields of  
ListB-w.txt and ListA-w-Counts.txt are not sorted in the same way.  They have  
to be for 'join' to work properly.
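
As a sketch of the principle (with made-up file names and contents, not your actual lists): each input is sorted on its own join field only, written N,N, and 'join' is then told which field that is in each file.

```shell
export LC_ALL=C   # sort and join must agree on the collating order
dir=$(mktemp -d)

# Hypothetical inputs: the join field is column 1 of a.txt and column 2 of b.txt.
printf '%s\n' '10.0.0.2 7' '10.0.0.1 3' | sort -k 1,1 > "$dir/a.txt"
printf '%s\n' 'y 10.0.0.2' 'x 10.0.0.1' | sort -k 2,2 > "$dir/b.txt"

# -2 2: use field 2 of the second file (field 1 of the first file is the default).
joined=$(join -2 2 "$dir/a.txt" "$dir/b.txt")
printf '%s\n' "$joined"
# 10.0.0.1 3 x
# 10.0.0.2 7 y

rm -r "$dir"
```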


Also simplifying the sequence of commands (with apparently useless steps and  
options), you get:
$ join -2 2