[Trisquel-users] Re : Unforeseen feature of the join command
Thousands of lines are nothing. GNU sort can process millions, if not billions. It automatically uses temporary files so as not to run short of RAM. Your experience with grep is weird. As far as I understand, grep's memory requirements do not depend on the number of lines (which could be infinite). It processes one single line, outputs it if and only if it matches the regular expression, then forgets about it and processes the next line, and so on.
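A small sketch of both points, on made-up input (the numbers from 'seq', the 1M buffer size and the variable names are only assumptions for the illustration). Option -S caps the memory buffer of GNU sort, so whatever does not fit is expected to go through temporary files, and grep only ever looks at the current line of the stream:

```shell
# Shuffle 200000 integers, then sort them numerically with a deliberately
# small buffer (-S 1M); GNU sort is expected to fall back on temporary files.
largest=$(seq 1 200000 | shuf | sort -n -S 1M | tail -n 1)

# grep reads the stream line by line; -c only counts the matching lines.
ending_in_7=$(seq 1 200000 | grep -c '7$')

echo "largest: $largest, lines ending in 7: $ending_in_7"
```

Raising the upper bound of 'seq' should only make the commands slower, not make them fail for lack of memory.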
[Trisquel-users] Re : Unforeseen feature of the join command
> It's great that sort can be "trained" to do the same kind of multi-stage sorting task that appears to be built into LibreOffice Calc.

It is built in, not trained. And GNU sort is certainly much faster than what LibreOffice achieves on spreadsheets, which are also limited in the number of rows they can have.
[Trisquel-users] Re : Unforeseen feature of the join command
> I realized that LibreOffice Calc. cannot handle more than a million rows ...

Spreadsheets are only meant to do computation on small amounts of data. To store a lot of data, use text files or a database management system.

> I decided to sort the file, first on Column 3, then Column 1, and then on Column 2. Accordingly, I rearranged the columns thusly: $3, $1, $2, $4 and sorted with: "sort -nrk 1,4" where "nr" puts the biggest numbers at the top of the column, but sort evidently did not reach to the third column, resulting in an ordering of only hostname and visits-per-domain.

'sort -k 1,4' uses the part of the line from column 1 up to column 4 (as one single string) to sort. It is not what you want, and neither is 'sort -k 1,3'. What you want ("sort the file, first on Column 3, then Column 1, and then on Column 2") is achieved by giving option -k three times (the order matters): 'sort -k 3,3 -k 1,1 -k 2,2'. Again: read 'info sort'. At the end of it, there are even well-explained examples with multiple -k options, starting with this one:

   Sort numerically on the second field and resolve ties by sorting alphabetically on the third and fourth characters of field five. Use ‘:’ as the field delimiter.

        sort -t : -k 2,2n -k 5.3,5.4

   Note that if you had written ‘-k 2n’ instead of ‘-k 2,2n’ ‘sort’ would have used all characters beginning in the second field and extending to the end of the line as the primary _numeric_ key. For the large majority of applications, treating keys spanning more than one field as numeric will not do what you expect.

   Also note that the ‘n’ modifier was applied to the field-end specifier for the first key. It would have been equivalent to specify ‘-k 2n,2’ or ‘-k 2n,2n’. All modifiers except ‘b’ apply to the associated _field_, regardless of whether the modifier character is attached to the field-start and/or the field-end part of the key specifier.
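To make the multi-key idea concrete, here is a minimal sketch on made-up data (the four lines, the single-space delimiter and the variable names are assumptions, not your actual file). It gives one -k per column, in order of priority, and attaches the 'n' modifier only to the key that is numeric:

```shell
# Three space-separated columns: name, count, group (all hypothetical).
data='b 10 x
a 9 y
a 9 x
b 2 x'

# Sort on column 3 first, break ties on column 1, then on column 2 as a number.
# LC_ALL=C makes the text comparisons plain byte comparisons.
sorted=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 3,3 -k 1,1 -k 2,2n)

printf '%s\n' "$sorted"
# a 9 x
# b 2 x
# b 10 x
# a 9 y
```

Without the 'n' on the last key, "b 10 x" would come before "b 2 x", because "10" sorts before "2" as text.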
[Trisquel-users] Re : Unforeseen feature of the join command
> I had been guilty of not reading (i.e., unaware of !) the "info sort" material

'man sort' specifies the expected argument of option -k, although in a much more arid way (as always for GNU commands, 'info' provides the full documentation):

   -k, --key=KEYDEF
          sort via a key; KEYDEF gives location and type
   (...)
   KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.

> "cut" is not yet in my scripting vocabulary.

'cut' can select fields, specified after -f the way you specify pages to print (so "-2" means "every field up to the second one"; I could have written "1,2" for the same effect), delimited by one single character (the tabulation is the default but a different one can be specified after -d). See 'info cut' for the full documentation.
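A minimal sketch of those 'cut' field lists, on made-up one-line inputs (the sample lines and variable names are assumptions for the illustration):

```shell
# One tab-separated line with three fields.
line=$(printf 'a\tb\tc')

# "-2" selects every field up to the second one ...
up_to_second=$(printf '%s\n' "$line" | cut -f -2)
# ... which, with at least two fields, is the same as listing "1,2".
first_and_second=$(printf '%s\n' "$line" | cut -f 1,2)

# A different one-character delimiter is given after -d; "2-" means
# "from the second field to the last one".
from_second=$(printf 'a:b:c\n' | cut -d : -f 2-)

printf '%s\n' "$up_to_second" "$first_and_second" "$from_second"
```

The first two commands print the same two tab-separated fields; the third prints "b:c".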
[Trisquel-users] Re : Unforeseen feature of the join command
> Each has three columns: IPv4, Count 1 or Count 2, Textfile name.

Above, I assumed ListA.txt had one single column (because the attached file is like that). If there are more columns, then ++count[$0] must be replaced with ++count[$1]. For performance, use pipes instead of temporary files and avoid useless commands. As far as I understand the beginning of your last post (very little), you do not need to reorder the input fields and can get rid of the last one in DataY-DVCts.txt (your awk program does not print it). In the end, it looks like you could write:

$ join
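On the ++count[$0] versus ++count[$1] point, here is a minimal sketch with made-up input (the three lines and the variable name are assumptions, not your files). Indexing the array by $1 counts occurrences of the first column only, whereas $0 would count identical whole lines; everything goes through a pipe, with no temporary file:

```shell
# Count how many times each value of the first column occurs.
# awk's "for (ip in count)" has no guaranteed order, hence the final sort.
counts=$(printf '%s\n' '10.0.0.1 a' '10.0.0.2 b' '10.0.0.1 c' |
  awk '{ ++count[$1] } END { for (ip in count) print ip, count[ip] }' |
  LC_ALL=C sort)

printf '%s\n' "$counts"
# 10.0.0.1 2
# 10.0.0.2 1
```

With ++count[$0] instead, each of the three lines would be counted once, because no two whole lines are identical.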
[Trisquel-users] Re : Unforeseen feature of the join command
'sort -k 1' is the same as 'sort'. Indeed, giving one single integer N (here 1) to option -k means "*from* the Nth column to the last one". Here is the relevant excerpt from 'info sort':

   ‘-k POS1[,POS2]’
   ‘--key=POS1[,POS2]’
        Specify a sort field that consists of the part of the line between POS1 and POS2 (or the end of the line, if POS2 is omitted), _inclusive_.

That is the root of your problem here. Because of it, the join fields of ListB-w.txt and ListA-w-Counts.txt are not sorted in the same way. They have to be for 'join' to work properly. Simplifying the sequence of commands as well (it apparently has useless steps and options), you get:

$ join -2 2
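To see the difference between '-k 1' and '-k 1,1', and why it matters for 'join', here is a minimal sketch on made-up files (the ':' delimiter, the file names a.txt and b.txt and their content are assumptions for the illustration, not your ListA/ListB files):

```shell
export LC_ALL=C            # plain byte order for both sort and join
tmp=$(mktemp -d)

printf '10:a\n1:z\n' > "$tmp/a.txt"    # join field is column 1
printf 'q:10\np:1\n' > "$tmp/b.txt"    # join field is column 2

# '-k 1' compares from field 1 to the end of the line, delimiter included:
# '0' sorts before ':', so "10:a" comes first.
whole_line=$(sort -t : -k 1 "$tmp/a.txt")
# '-k 1,1' compares field 1 only: "1" sorts before "10".
first_field=$(sort -t : -k 1,1 "$tmp/a.txt")

# Sort each file on its own join field only, then join column 1 of the
# first file with column 2 of the second (-2 2).
sort -t : -k 1,1 "$tmp/a.txt" > "$tmp/a.sorted"
sort -t : -k 2,2 "$tmp/b.txt" > "$tmp/b.sorted"
joined=$(join -t : -2 2 "$tmp/a.sorted" "$tmp/b.sorted")

printf '%s\n' "$joined"
rm -r "$tmp"
```

Here the two orderings of a.txt differ, and only the '-k 1,1' one matches the order in which 'join' expects to meet the join field.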