I see that your suspicions may be well founded. The smaller output files (30
to several hundred kB) look clean, but the largest ones (from ~1500 kB down
to ~700 kB) consist almost exclusively of duplicated rows.
Those were not suspicions. I was writing that you taught me something. I
thought the shell would spit out an error. Since it does not, it is probably
valid syntax, doing what the user expects it to do. However, I suspect it
may be a "bashism", i.e., syntax that not all shells would accept.
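A common bashism in exactly this area is process substitution, where 'join' reads from two commands as if they were files. This is only a guess at the kind of construct under discussion, and the sample files here are made up:

```shell
# Hypothetical sample data, purely for illustration.
printf 'b 2\na 1\n' > old.txt
printf 'a x\nb y\n' > new.txt

# Process substitution: each <( ... ) expands to the name of a pipe
# fed by the command inside. It is a bash/ksh/zsh feature, not POSIX
# sh, so it is run via bash explicitly here.
bash -c 'join <(sort -k 1b,1 old.txt) <(sort -k 1b,1 new.txt)'
```

A strictly POSIX /bin/sh would report a syntax error on '<(', which is exactly why such a script is not portable.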
join's output certainly has duplicates because the input files have
duplicates (is that normal?). Just add the option --unique (or simply -u) to
the sort commands.
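To illustrate with made-up data, sorting with -u removes the duplicate rows before 'join' ever sees them:

```shell
# Made-up input files containing duplicated rows.
printf 'a 1\na 1\nb 2\n' > left.txt
printf 'a x\nb y\nb y\n' > right.txt

# -u (--unique) keeps only the first of a run of lines with equal
# keys, so each key reaches join exactly once.
sort -u -k 1b,1 left.txt  > left.sorted
sort -u -k 1b,1 right.txt > right.sorted
join left.sorted right.sorted
```

Note that with a -k key, 'sort -u' judges uniqueness on the key fields only, which is what you want here (one output row per join key).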
They'll have to have the duplicates removed during post-processing ... and be
checked for errors.
Do not do that as a post-processing step. As I have just written: just add
the option --unique (or simply -u) to the sort commands. It actually makes
them run faster, and 'join' too (smaller inputs and output).
Before I start that processing, I'll see if I can try out your script; the
extra steps won't be any drag on the joining, as the longest times for any
joins were still in the blink-of-an-eye category (0.044 sec. system time).
Using two named pipes may actually be faster than using two subshells (which
is what you get when you put commands between parentheses)... by a constant
factor you should not care about. Only optimize at the end, if necessary,
after ensuring the whole process is correct and after identifying the
bottleneck (usually a single command).
I've been pairing up the most recent data with all of the prior data, one
pair at a time, and that's getting tedious.
Use a Shell loop (or two). For instance, if what you call "prior data" and
"most recent data" are files in two separate directories and you want all
pairs, then you can pass these two directories as the two arguments of a
script like this one:
#!/bin/sh
# Join every file in directory $1 ("prior data") with every file in
# directory $2 ("most recent data"), one output file per pair.
mkfifo old.sorted
for old in "$1"/*
do
    for new in "$2"/*
    do
        out=joined-$(basename "$old")-$(basename "$new")
        # The background sort blocks writing to the named pipe
        # until join opens it for reading.
        sort -uk 1b,1 "$old" > old.sorted &
        sort -uk 1b,1 "$new" | join old.sorted - > "$out"
    done
done
rm old.sorted
The "info join" page says that one of the target fields (but not both!) can
be read from standard input.
One of the two input files (not "target fields": there is no such thing),
yes. I did it above, to give you an example.
In these repetitive joins that I'm doing now, can one of the target fields
be read from a file that lists the other target files?
You can do such a thing in a Shell script using 'while read line; do ...;
done < file'. But wouldn't you prefer to organize the files in directories
and pass these directories as arguments, as I suggested above?
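A sketch of the 'while read' approach, with invented file names: one fixed, pre-sorted file is joined against every file named in a list file:

```shell
# Invented example: base.sorted is the fixed file, list.txt names the
# other input files, one per line.
printf 'a 1\nb 2\n' > base.sorted          # already sorted
printf 'a x\n' > new1.txt
printf 'b y\n' > new2.txt
printf 'new1.txt\nnew2.txt\n' > list.txt

# Read the file names one by one; sort each named file and join it
# against base.sorted, writing one output file per input file.
while read -r other
do
    sort -k 1b,1 "$other" | join base.sorted - > "joined-$other"
done < list.txt
```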
Remark: do you join files whose join fields are the whole lines (there are
no additional fields)? In other words, are you searching for equal lines in
two files? If so, then you actually want to use 'comm -12' instead of
'join'. 'comm' is a simpler command, for comparing the (sorted) lines of
two files.
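A minimal illustration of 'comm -12' with throwaway data:

```shell
# Two sorted files of whole lines.
printf 'apple\nbanana\ncherry\n' > f1.txt
printf 'banana\ncherry\ndate\n'  > f2.txt

# comm prints three columns: lines only in f1, lines only in f2, and
# lines in both; -12 suppresses the first two, leaving the common lines.
comm -12 f1.txt f2.txt
```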