[Trisquel-users] Re : The join command is missing the IPv4 addresses in long mixed lists of strings

2019-07-05 Thread lcerf
I see that your suspicions may be well founded. The smaller output files (30  
to several hundred kB) are clean-looking, but the largest ones (1500 kB down  
to ~700 kB) have duplicated rows, nearly exclusively.


Those were not suspicions.  I was writing that you taught me something.  I
thought the shell would spit out an error.  Since it does not, it is probably
valid syntax that does what the user expects.  However, I suspect it
may be a "bashism", i.e., a syntax not all shells would accept.


join's output certainly has duplicates because the input files have  
duplicates (is that normal?).  Just add the option --unique (or simply -u) to  
the sort commands.
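
A minimal sketch of that advice, assuming two small sample files (the temporary directory and the file names "left" and "right" are hypothetical, not taken from the thread).  It compares the join of inputs sorted without and with -u:

```shell
#!/bin/sh
# Hypothetical sample data: each file contains one repeated line.
dir=$(mktemp -d)
printf '%s\n' 'b 2' 'a 1' 'a 1' > "$dir/left"
printf '%s\n' 'a x' 'b y' 'a x' > "$dir/right"

# Without -u, join pairs every duplicate with every duplicate.
sort -k 1b,1 "$dir/left" > "$dir/left.sorted"
sort -k 1b,1 "$dir/right" > "$dir/right.sorted"
join "$dir/left.sorted" "$dir/right.sorted" > "$dir/with-dups"

# With -u and -k 1b,1, sort keeps one line per value of the first field.
sort -u -k 1b,1 "$dir/left" > "$dir/left.unique"
sort -u -k 1b,1 "$dir/right" > "$dir/right.unique"
join "$dir/left.unique" "$dir/right.unique" > "$dir/clean"

# 5 lines (2 x 2 for key "a", 1 for key "b") versus 2 lines.
wc -l < "$dir/with-dups"
wc -l < "$dir/clean"
# When done: rm -r "$dir"
```

Note that, combined with -k, the option -u treats two lines as duplicates as soon as their key fields are equal, even if the rest of the lines differ.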


They'll have to have the duplicates removed during post-processing ... and be  
checked for errors.


Do not do that as a post-process.  As I have just written: just add the
option --unique (or simply -u) to the sort commands.  It actually makes them
run faster, and 'join' too (smaller inputs and output).


Before I start that processing, I'll see if I can try out your script; the  
extra steps won't be any drag on the joining, as the longest times for any  
joins were still in the blink-of-an-eye category (0.044 sec. system time).


Using two named pipes may actually be faster than using two subshells (which
is what you get when you put commands between parentheses)... by a constant
time you should not care about.  Only optimize at the end, if necessary,
after ensuring the whole process is correct and after identifying the
bottleneck (usually one single command).


I've been pairing up the most recent data with all of the prior data, one  
pair at a time, and that's getting tedious.


Use a shell loop (or two).  For instance, if what you call "prior data" and
"most recent data" are files in two separate directories and you want all  
pairs, then you can pass these two directories as the two arguments of a  
script like this one:

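#!/bin/sh
# Usage: pass the directory of prior data as $1, that of recent data as $2.
# Every file of the first directory is joined with every file of the second.
mkfifo old.sorted    # named pipe: the sorted "old" file never touches the disk
for old in "$1"/*
do
    for new in "$2"/*
    do
        out=joined-$(basename "$old")-$(basename "$new")
        # The background sort feeds join through the named pipe.
        sort -uk 1b,1 "$old" > old.sorted &
        sort -uk 1b,1 "$new" | join old.sorted - > "$out"
    done
done
rm old.sorted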

"info join" page says that one of the target fields (but not both !) can be  
read from standard input.


One of the two input files (not "target fields": there is no such thing),  
yes.  I did it above, to give you an example.


In these repetitive joins that I'm doing now, can one of the target fields be  
read from a file that lists the other target files ?


You can do such a thing in a shell script using 'while read line; do ...; done
< file'.  Wouldn't you prefer to organize the files in directories and specify
these directories, as I suggested above?
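
A minimal sketch of such a 'while read' loop, assuming a hypothetical file "list" that holds one path per line and a hypothetical file "recent" to join each listed file with (all names and sample data are made up for illustration):

```shell
#!/bin/sh
# Hypothetical sample data: one recent file, two prior files, and a
# list file naming the prior files, one path per line.
dir=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' > "$dir/recent"
printf '%s\n' 'a x' 'c z' > "$dir/prior1"
printf '%s\n' 'b y' 'a w' > "$dir/prior2"
printf '%s\n' "$dir/prior1" "$dir/prior2" > "$dir/list"

# Sort the recent file once, outside the loop.
sort -u -k 1b,1 "$dir/recent" > "$dir/recent.sorted"

# Read the list line by line; IFS= and -r keep each path intact.
while IFS= read -r prior
do
    out=$dir/joined-$(basename "$prior")
    sort -u -k 1b,1 "$prior" | join "$dir/recent.sorted" - > "$out"
done < "$dir/list"

cat "$dir/joined-prior1" "$dir/joined-prior2"
# When done: rm -r "$dir"
```

The loop body does not read the standard input itself (sort is given a file name and join reads the pipe), so it does not consume lines of the list.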


Remark: do you join files whose join fields are the whole lines (there are no  
additional fields)?  In other words, are you searching for equal lines in two  
files?  If so, then you actually want to use 'comm -12' instead of 'join'.   
'comm' is a simpler command, to compare the (sorted) lines of two files.
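
A minimal sketch of the 'comm -12' alternative, assuming two hypothetical files of single-field lines (the sample addresses and names are made up).  The locale is forced to C so that 'sort' and 'comm' agree on the order:

```shell
#!/bin/sh
# Same collation for sort and comm.
LC_ALL=C
export LC_ALL

# Hypothetical sample data: whole lines are the things to compare.
dir=$(mktemp -d)
printf '%s\n' '10.0.0.2' 'example.org' '10.0.0.1' > "$dir/first"
printf '%s\n' 'example.org' '10.0.0.3' '10.0.0.1' > "$dir/second"

sort -u "$dir/first" > "$dir/first.sorted"
sort -u "$dir/second" > "$dir/second.sorted"

# -1 hides lines only in the first file, -2 those only in the second:
# what remains are the lines common to both files.
comm -12 "$dir/first.sorted" "$dir/second.sorted" > "$dir/common"
cat "$dir/common"
# When done: rm -r "$dir"
```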


[Trisquel-users] Re : The join command is missing the IPv4 addresses in long mixed lists of strings

2019-07-05 Thread lcerf

About the end of your post:

You cannot both read from and write to the same file; your "two-step
solution" is OK.
I did not know it was OK to redirect the standard input twice; to avoid
touching the disk, I would have created named pipes, as in this short
(untested) script:

mkfifo file1.sorted file2.sorted
sort -k 1b,1 -o file1.sorted file1 &
sort -k 1b,1 -o file2.sorted file2 &
join file1.sorted file2.sorted > Joined-file0102.txt
rm file1.sorted file2.sorted



[Trisquel-users] Re : The join command is missing the IPv4 addresses in long mixed lists of strings

2019-07-04 Thread lcerf
'join' takes as input two (paths to) text files.  It does not matter what  
that text is.  It does not know what an IP address is, because it does not  
need to.  Attach the two text files you are joining, if you want us to take a  
look.


Forget about the size of the input data: 'join', like any GNU text-processing
command, can process arbitrarily long inputs.  The problem is not there.  The
problem is almost certainly that the files are not sorted (with the 'sort'
command and the same locale) w.r.t. their join fields.  Have you read 'info
join'?
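
A minimal sketch of the sorting issue, assuming two hypothetical sample files.  What 'join' does with unsorted input (a warning, missing lines, a non-zero exit status) depends on the version, so the first call is only there to be observed; the second part sorts both files with the same locale before joining:

```shell
#!/bin/sh
# Hypothetical sample data: "left" is not sorted on its first field.
dir=$(mktemp -d)
printf '%s\n' 'b 2' 'a 1' > "$dir/left"
printf '%s\n' 'a x' 'b y' > "$dir/right"

# Unsorted input: join may miss lines and may complain on stderr.
join "$dir/left" "$dir/right" > "$dir/bad" ||
    echo 'join reported a problem with the unsorted input'

# Same locale for both sorts and for join, keys sorted as 'info join' advises.
LC_ALL=C sort -k 1b,1 "$dir/left" > "$dir/left.sorted"
LC_ALL=C sort -k 1b,1 "$dir/right" > "$dir/right.sorted"
LC_ALL=C join "$dir/left.sorted" "$dir/right.sorted" > "$dir/good"
cat "$dir/good"
# When done: rm -r "$dir"
```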


Additional remarks:

Writing "-1 1 -2 1" is like writing nothing, because 1 is the default value  
for both options.
You certainly do *not* want to use --nocheck-order: without it, you are
warned if the files are not ordered w.r.t. their join fields.
You probably want to redirect only the standard output (with ">") instead of
both the standard and the error output (with "&>"), so that errors/warnings
(such as one about a misordered file) appear in the terminal.
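
A minimal sketch of the first and last remarks, assuming hypothetical sample files (including a deliberately missing one, "no-such-file", to provoke an error message):

```shell
#!/bin/sh
# Hypothetical sample data, already sorted on the first field.
dir=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' > "$dir/left"
printf '%s\n' 'a x' 'b y' > "$dir/right"

# "-1 1 -2 1" only restates the defaults: both outputs are identical.
join -1 1 -2 1 "$dir/left" "$dir/right" > "$dir/explicit"
join "$dir/left" "$dir/right" > "$dir/default"
cmp "$dir/explicit" "$dir/default" && echo 'same output'

# With ">" alone, only the standard output goes to the file: the error
# about the missing input shows in the terminal and "out" stays empty.
join "$dir/left" "$dir/no-such-file" > "$dir/out" ||
    echo 'join failed; its message is above, not in the output file'
# When done: rm -r "$dir"
```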