Re: join bug

2008-03-05 Thread Bob Proulx
Martin,

Martin Schmeing wrote:
 Hi Bob,
 Join works fine with my test smaller files, giving an appropriate
 output.  When both files are 1000 (short) lines long, it outputs
 maybe one or two of the joined lines, but there should be hundreds
 output. The files are sorted, and there is no error message given.
 Here are my test files:

  pcmodel.list
  pcmodel1000.list
  radmodel.list
  radmodel1000.list

This one is tricky.  At first pass it would seem that everything is in
good shape for join.  For example, the input files to join must be
sorted, and unsorted input is a common problem.  But these are
obviously sorted.  The first thing I did was to check this.

  for f in *.list; do sort -c "$f"; done

No errors from sort.  All of the files were sorted.  So I tried
joining the larger files.

  join pcmodel1000.list radmodel1000.list
  992 16023 239 3915 2793 43472.2226562 257.2904053
  993 16023 240 4134 2889 44867.9531250 393.2121582

Two lines.  What are in these files?  The first 15 lines of the first
file show the problem.  But it is tricky.  In fact I missed it until
this point.

 1  16021 1  8346525
 2  16021 2 10056699
 3  16021 3 12966651
 4  16021 4 13806594
 5  16021 5 11886534
 6  16021 6 10446363
 7  16021 7  4986240
 8  16021 8  3576405
 9  16021 9  2705886
10  1602110  9575436
11  1602111 11226096
12  1602112 15065865
13  1602113 14076030
14  1602114 13835922
15  1602115 15336045

The first field is lined up with a variable number of spaces in the
first column.  That is the root of the issue here.  Sort by default
sorts the entire line using the character collating sequence specified
by the LC_COLLATE locale.  Join does the same but does so ignoring
blanks at the start of the field.  Because of the variable number of
blanks sort and join are seeing a different sort order for the first
field.
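The difference can be reproduced with two made-up lines (they are not taken from Martin's files).  This is only a minimal sketch, assuming a POSIX shell and the C locale so that the byte ordering is predictable:

```shell
# Force the C locale so that sort compares plain bytes.
export LC_ALL=C

# Two made-up lines whose first field is right-aligned with a leading blank.
data=$(printf ' 9 a\n10 b\n')

# Whole-line comparison: the leading space (0x20) sorts before '1' (0x31).
plain=$(printf '%s\n' "$data" | sort)

# Key comparison on field 1 with leading blanks skipped: "10" sorts before "9".
keyed=$(printf '%s\n' "$data" | sort -k 1b,1)

printf 'sort:\n%s\n\n' "$plain"
printf 'sort -k 1b,1:\n%s\n' "$keyed"
```

The two commands disagree about which line comes first, which is the same disagreement that plain sort and join have about the padded files.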

Just last month (Feb 19 2008) James Youngman added a new feature to
join that warns about this case.  With this very recent join the
following diagnostic is printed.  Over time this should make the
problem much easier to spot than it is with older versions of join.

  join: File 1 is not in sorted order
  join: File 2 is not in sorted order

Knowing this makes it obvious that I used the wrong sort check.  What
I should have done was use -b to skip blanks, to match what join is
doing.  Or more precisely 'sort -k 1b,1'.

  for f in *.list; do sort -c -k 1b,1 "$f"; done
  sort: pcmodel1000.list:10: disorder: 10 1602110  9575436
  sort: radmodel1000.list:116: disorder:   1001 44867.9531250 393.2121582

Now the problem is much more apparent.  The file needs to be sorted in
the same order that join expects.  Not numerically but
lexically using 'sort -k 1b,1'.

  sort -k 1b,1 -o pcmodel1000.list pcmodel1000.list
  sort -k 1b,1 -o radmodel1000.list radmodel1000.list

  head -n10
 1  16021 1  8346525
10  1602110  9575436
   100  16021   100 1764 714
  1000  16023   247 48333609
   101  16021   101 1932 588
   102  16021   102 2058 501
   103  16021   103 2418 399
   104  16021   104 2256 447
   105  16021   105 1644 849

Looks better for join even if it looks worse for humans.  That is the
ordering that is needed for character sorting.

  join pcmodel1000.list radmodel1000.list | wc -l
  115

That looks a little more reasonable.
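
The same steps can be tried end to end on a small scale.  The following is a sketch only: the two files are made-up stand-ins for the real lists, the names a.list and b.list are hypothetical, and the C locale is assumed for both sort and join.

```shell
export LC_ALL=C                 # same locale for sort and join
tmp=$(mktemp -d)                # scratch directory for the made-up files

# Made-up stand-ins for the two lists: right-aligned numeric first field.
printf '  9 x\n 10 y\n100 z\n' > "$tmp/a.list"
printf '  9 p\n 10 q\n100 r\n' > "$tmp/b.list"

# Re-sort both files on the join field, skipping leading blanks.
sort -k 1b,1 -o "$tmp/a.list" "$tmp/a.list"
sort -k 1b,1 -o "$tmp/b.list" "$tmp/b.list"

# Every key is now paired up, in lexical key order (10, 100, 9).
joined=$(join "$tmp/a.list" "$tmp/b.list")
printf '%s\n' "$joined"

rm -r "$tmp"
```

All three keys are joined, and the output comes out in the lexical order of the keys rather than in numeric order.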

Hope that explanation helped.
Bob


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


join bug

2008-02-29 Thread Martin Schmeing

Hello,
Is there a size limit for the input files for join? I want to use it on
large files, but even files of 1000 lines fail.

Thanks,
Martin




join bug?

2007-06-20 Thread Tiago Junqueira

Hi,

I was trying to join 2 files,
join -1 1 -2 1 Received_Packets Sent_Packets

The problem is that after packet nr 82 there are more numbers that
match, but the program only joins the files up to packet nr 82.

and file Received_Packets is:
4   25.109191
5   25.199239
6   25.289286
7   25.384085
8   25.474132
9   25.56418
10  25.654227
11  25.744274
12  25.834322
13  25.929103
14  26.01915
15  26.109197
16  26.199245
17  26.289292
18  26.384091
19  26.474138
20  26.564186
21  26.654233
22  26.74428
23  26.834327
24  26.929103
25  27.01915
45  27.109197
82  27.199245
120 27.289292
159 27.384091

and Sent_Packets is:
0   25.0
1   25.00240
2   25.00480
3   25.00720
4   25.00960
5   25.01200
6   25.01440
7   25.01680
8   25.01920
9   25.02160
10  25.02400
11  25.02640
12  25.02880
13  25.03120
14  25.03360
15  25.03600
16  25.03840
17  25.04080
18  25.04320
19  25.04560
20  25.04800
21  25.05040
22  25.05280
23  25.05520
24  25.05760
25  25.06000
26  25.06240
27  25.06480
28  25.06720
29  25.06960
30  25.07200
31  25.07440
32  25.07680
33  25.07920
34  25.08160
35  25.08400
36  25.08640
37  25.08880
38  25.09120
39  25.09360
40  25.09600
41  25.09840
42  25.10080
43  25.10320
44  25.10560
45  25.10800
46  25.11040
47  25.11280
48  25.11520
49  25.11760
50  25.12000
51  25.12240
52  25.12480
53  25.12720
54  25.12960
55  25.13200
56  25.13440
57  25.13680
58  25.13920
59  25.14160
60  25.14400
61  25.14640
62  25.14880
63  25.15120
64  25.15360
65  25.15600
66  25.15840
67  25.16080
68  25.16320
69  25.16560
70  25.16800
71  25.17040
72  25.17280
73  25.17520
74  25.17760
75  25.18000
76  25.18240
77  25.18480
78  25.18720
79  25.18960
80  25.19200
81  25.19440
82  25.19680
83  25.19920
84  25.20160
85  25.20400
86  25.20640
87  25.20880
88  25.21120
89  25.21360
90  25.21600
91  25.21840
92  25.22080
93  25.22320
94  25.22560
95  25.22800
96  25.23040
97  25.23280
98  25.23520
99  25.23760
100 25.24000
101 25.24240
102 25.24480
103 25.24720
104 25.24960
105 25.25200
106 25.25440
107 25.25680
108 25.25920
109 25.26160
110 25.26400
111 25.26640
112 25.26880
113 25.27120
114 25.27360
115 25.27600
116 25.27840
117 25.28080
118 25.28320
119 25.28560
120 25.28800
121 25.29040
122 25.29280
123 25.29520
124 25.29760
125 25.3
126 25.30240
127 25.30480
128 25.30720
129 25.30960

Why do I get this after running the command:
4 25.109191 25.00960
5 25.199239 25.01200
6 25.289286 25.01440
7 25.384085 25.01680
8 25.474132 25.01920
9 25.56418 25.02160
10 25.654227 25.02400
11 25.744274 25.02640
12 25.834322 25.02880
13 25.929103 25.03120
14 26.01915 25.03360
15 26.109197 25.03600
16 26.199245 25.03840
17 26.289292 25.04080
18 26.384091 25.04320
19 26.474138 25.04560
20 26.564186 25.04800
21 26.654233 25.05040
22 26.74428 25.05280
23 26.834327 25.05520
24 26.929103 25.05760
25 27.01915 25.06000
45 27.109197 25.10800
82 27.199245 25.19680
(no more)


Kindly, Tiago Junqueira




Re: join bug?

2007-06-20 Thread Andreas Schwab
Tiago Junqueira [EMAIL PROTECTED] writes:

 Why i get after the command:
 4 25.109191 25.00960

Your input is not sorted.

$ join --help
[...]
Important: FILE1 and FILE2 must be sorted on the join fields.
E.g., use `sort -k 1b,1' if `join' has no options.
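
As an illustration of that advice, here is a sketch using a few rows copied from the two files in the report.  It assumes the C locale; the temporary file names recv and sent are hypothetical, and the final 'sort -n' is only there to put the joined result back into numeric order for reading.

```shell
export LC_ALL=C
tmp=$(mktemp -d)

# A few rows copied from Received_Packets and Sent_Packets above.
printf '9   25.56418\n10  25.654227\n82  27.199245\n120 27.289292\n' > "$tmp/recv"
printf '9   25.02160\n10  25.02400\n82  25.19680\n100 25.24000\n120 25.28800\n' > "$tmp/sent"

# Numeric order is not the order join expects, so sort -c reports disorder.
if sort -c -k 1b,1 "$tmp/recv" 2>/dev/null; then
  recv_order=ok
else
  recv_order=disorder
fi

# Sort on the join field, join, then put the result back in numeric order.
sort -k 1b,1 -o "$tmp/recv" "$tmp/recv"
sort -k 1b,1 -o "$tmp/sent" "$tmp/sent"
result=$(join "$tmp/recv" "$tmp/sent" | sort -n)

printf '%s\n%s\n' "$recv_order" "$result"
rm -r "$tmp"
```

With both inputs sorted on the join field, packet 120 is joined as well instead of being dropped after packet 82.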

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.




join bug?

2006-09-20 Thread Erez PERELMAN
hey, i think there is a bug in join command.  i'm joining two sorted files 
and there is no joining for an expected mapping. here is the example:


f1:
79 53

f2:
791 834
79 82

join f1 f2 == blank

if i change f2 to:
1791 834
79 82

join f1 f2 == 79 53 82

thanks,
erez




Re: join bug?

2006-09-20 Thread Eric Blake

According to Erez PERELMAN on 9/19/2006 11:44 AM:
 hey, i think there is a bug in join command.  i'm joining two sorted
 files and there is no joining for an expected mapping. here is the example:
 
 f1:
 79 53
 
 f2:
 791 834
 79 82

Not a bug.  join requires its inputs to be sorted, otherwise you get
arbitrary behavior.  f2 was not sorted (at least, not in the C locale).
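
A quick way to confirm this is to let sort check the file itself.  The sketch below recreates f1 and f2 from the report in a scratch directory and assumes the C locale; the variable names are only for illustration.

```shell
export LC_ALL=C
tmp=$(mktemp -d)

# f1 and f2 exactly as given in the report.
printf '79 53\n' > "$tmp/f1"
printf '791 834\n79 82\n' > "$tmp/f2"

# sort -c exits non-zero when the file is not in the requested order.
if sort -c -k 1b,1 "$tmp/f2" 2>/dev/null; then
  f2_order=sorted
else
  f2_order=unsorted
fi

# After sorting f2 on the join field, the expected match appears.
sort -k 1b,1 -o "$tmp/f2" "$tmp/f2"
joined=$(join "$tmp/f1" "$tmp/f2")

printf '%s\n%s\n' "$f2_order" "$joined"
rm -r "$tmp"
```

In the C locale "79" sorts before "791", so f2 as posted is out of order, and once it is re-sorted the join produces the expected line.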

- --
Life is short - so eat dessert first!

Eric Blake [EMAIL PROTECTED]




Re: join bug?

2006-02-06 Thread german rigau
On 2/6/06, Paul Eggert [EMAIL PROTECTED] wrote:

 german rigau [EMAIL PROTECTED] writes:

  Obviously, the problem is in the sort command. With C locale
  runs perfectly. However, I use LANG=en_US.UTF-8 ...
  And then it seems that the sort command have different behaviour ...

 I don't see any bug in the examples that you gave.


Sorry for insisting.

If you look carefully at the last example I sent, we obtain two different
sortings with locale en_US.UTF-8 ... with sort kk2 we obtain icecream
before ice_cream and with sort -k 1,2 kk2 we obtain ice_cream before
icecream!

However, with locale C we obtain the same ordering from sort kk2 and
sort -k 1,2 kk2. ??

 Getting back to the original question, join must use the same
 collating convention that sort does.  If you sort in the
 en_US.UTF-8 locale, you must join in the same locale.  Otherwise, as
 you discovered, things won't work in general.


No. I use the same collating for sorting and joining. This is why I
detected the abnormality: join failed to locate the same elements
ordered by the default sorting ...

 Also, my advice is to stick with the C locale unless you know what
 you're doing.


I think I know perfectly what I am doing ... ;-)

 For example, if you're not sure what you want to do in
 the case of encoding error (or, if you don't know what an encoding
 error is :-), then you should stick with the C locale.

But this is only useful for encoding the English language ... and
there are many more around.

Best,

German


Re: join bug?

2006-02-06 Thread Paul Eggert
german rigau [EMAIL PROTECTED] writes:

 If you see carefully the last example I sent, we obtain two different
 sortings with locale en_US.UTF-8 ... with sort kk2 we obtain icecream
 before ice_cream and with sort -k 1,2 kk2 we obtain ice_cream before
 icecream!

That is because in that locale, _ is discarded in the first pass of
the collation comparison.  Plain sort therefore puts
icecream%1:13:00:: 07510835 1 0 before ice_cream%1:13:00:: 07510835
1 1 (because of the 0 versus the 1).  However, sort -k 1,2 sees
only the icecream%1:13:00:: 07510835 and ice_cream%1:13:00::
07510835 and reports the opposite order, because the second pass of
the collation comparison kicks in.

 I use the same collating for sorting and joining.

I didn't see where you did that.  You're not doing it in
http://lists.gnu.org/archive/html/bug-coreutils/2006-02/msg00011.html,
for example, because the join is only on the first field, but the
sort is sorting via the entire line.

So I still don't see any bug.
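
To make the point concrete, here is a sketch that sorts both files on the join field only, and runs sort and join in the same locale.  It uses the kk1 and kk2 contents posted earlier in this thread and assumes the C locale; the en_US.UTF-8 ordering is not reproduced here because it depends on the locale data installed on the system.

```shell
export LC_ALL=C                 # one locale for both sort and join
tmp=$(mktemp -d)

# kk1 and kk2 as posted earlier in the thread.
printf '%s\n' 'ice_cream%1:13:00::' 'life_style%1:07:00::' \
  'part-time%3:00:00::' > "$tmp/kk1"
printf '%s\n' 'icecream%1:13:00:: 07510835 1 0' \
  'ice_cream%1:13:00:: 07510835 1 1' \
  'life_style%1:07:00:: 04875322 1 2' \
  'part-time%3:00:00:: 01131371 1 21' > "$tmp/kk2"

# Sort on the first field only, which is the field join compares by default.
sort -k 1b,1 -o "$tmp/kk1" "$tmp/kk1"
sort -k 1b,1 -o "$tmp/kk2" "$tmp/kk2"

joined=$(join "$tmp/kk1" "$tmp/kk2")
printf '%s\n' "$joined"
rm -r "$tmp"
```

Sorted this way, ice_cream%1:13:00:: is joined along with the other two entries.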




Re: join bug?

2006-02-05 Thread Paul Eggert
german rigau [EMAIL PROTECTED] writes:

 Obviously, the problem is in the sort command. With C locale
 runs perfectly. However, I use LANG=en_US.UTF-8 ...
 And then it seems that the sort command have different behaviour ...

I don't see any bug in the examples that you gave.

Getting back to the original question, join must use the same
collating convention that sort does.  If you sort in the
en_US.UTF-8 locale, you must join in the same locale.  Otherwise, as
you discovered, things won't work in general.

Also, my advice is to stick with the C locale unless you know what
you're doing.  For example, if you're not sure what you want to do in
the case of encoding error (or, if you don't know what an encoding
error is :-), then you should stick with the C locale.




Re: join bug?

2006-02-03 Thread german rigau
Hello again,

Thanks a lot for your quick reply ...

But, BOTH files were sorted previously with default options.

[EMAIL PROTECTED] BaseConcepts]$ cat kk1
ice_cream%1:13:00::
life_style%1:07:00::
part-time%3:00:00::

[EMAIL PROTECTED] BaseConcepts]$ cat kk2
ice_cream%1:13:00:: 07510835 1 1
icecream%1:13:00:: 07510835 1 0
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21

[EMAIL PROTECTED] BaseConcepts]$ join <(sort kk1) <(sort kk2)
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21

[EMAIL PROTECTED] BaseConcepts]$ join -v 1 <(sort kk1) <(sort kk2)
ice_cream%1:13:00::

Then, the problem is in the sort command because:

[EMAIL PROTECTED] BaseConcepts]$ more kk2
ice_cream%1:13:00:: 07510835 1 1
icecream%1:13:00:: 07510835 1 0
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21
[EMAIL PROTECTED] BaseConcepts]$ sort kk2
icecream%1:13:00:: 07510835 1 0
ice_cream%1:13:00:: 07510835 1 1
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21

In fact, adding icecream%1:13:00:: to kk1 we obtain:

[EMAIL PROTECTED] BaseConcepts]$ sort kk1
ice_cream%1:13:00::
icecream%1:13:00::
life_style%1:07:00::
part-time%3:00:00::
[EMAIL PROTECTED] BaseConcepts]$ sort kk2
icecream%1:13:00:: 07510835 1 0
ice_cream%1:13:00:: 07510835 1 1
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21

What could be the reason for these two different orderings?

Thanks in advance,

German


Re: join bug?

2006-02-03 Thread german rigau
Hello again,

It seems to me that the problem is in the sort command ...

[EMAIL PROTECTED] BaseConcepts]$ sort kk2
icecream%1:13:00:: 07510835 1 0
ice_cream%1:13:00:: 07510835 1 1
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21
[EMAIL PROTECTED] BaseConcepts]$ sort -k 1,2 kk2
ice_cream%1:13:00:: 07510835 1 1
icecream%1:13:00:: 07510835 1 0
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21

which seems to apply different criteria depending on
the length of the lines ...

Best,

German


Re: join bug?

2006-02-03 Thread german rigau
On 2/3/06, Paul Eggert [EMAIL PROTECTED] wrote:

 german rigau [EMAIL PROTECTED] writes:

  [EMAIL PROTECTED] BaseConcepts]$ more kk2.sort
  icecream%1:13:00:: 07510835 1 0
  ice_cream%1:13:00:: 07510835 1 1
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21

 kk2.sort isn't sorted correctly, at least not for the default C
 locale.  The first two lines ought to be interchanged.  That probably
 explains your problem.  'join' requires sorted input.


Obviously, the problem is in the sort command. With the C locale
it runs perfectly. However, I use LANG=en_US.UTF-8 ...

And then it seems that the sort command has different behaviour ...

[EMAIL PROTECTED] BaseConcepts]$ export LANG=C
[EMAIL PROTECTED] BaseConcepts]$ sort kk2
ice_cream%1:13:00:: 07510835 1 1
icecream%1:13:00:: 07510835 1 0
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21
[EMAIL PROTECTED] BaseConcepts]$ sort -k 1,2 kk2
ice_cream%1:13:00:: 07510835 1 1
icecream%1:13:00:: 07510835 1 0
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21
[EMAIL PROTECTED] BaseConcepts]$ export LANG=en_US.UTF-8
[EMAIL PROTECTED] BaseConcepts]$ sort kk2
icecream%1:13:00:: 07510835 1 0
ice_cream%1:13:00:: 07510835 1 1
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21
[EMAIL PROTECTED] BaseConcepts]$ sort -k 1,2 kk2
ice_cream%1:13:00:: 07510835 1 1
icecream%1:13:00:: 07510835 1 0
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21

Best,

German


join bug?

2006-02-02 Thread german rigau
Just in case this is really a bug...

I'm using version 5.2.1 of coreutils on Fedora Core 4.

Consider joining these two files ordered with default options:

[EMAIL PROTECTED] BaseConcepts]$ more kk1.sort
ice_cream%1:13:00::
life_style%1:07:00::
part-time%3:00:00::
[EMAIL PROTECTED] BaseConcepts]$ more kk2.sort
icecream%1:13:00:: 07510835 1 0
ice_cream%1:13:00:: 07510835 1 1
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21
[EMAIL PROTECTED] BaseConcepts]$ join kk1.sort kk2.sort
life_style%1:07:00:: 04875322 1 2
part-time%3:00:00:: 01131371 1 21
[EMAIL PROTECTED] BaseConcepts]$ join -v 1 kk1.sort kk2.sort
ice_cream%1:13:00::

Why is ice_cream%1:13:00:: not joined? Is this corrected in
version 5.93 of coreutils?

Thanks,

German


Re: join bug?

2006-02-02 Thread Paul Eggert
german rigau [EMAIL PROTECTED] writes:

 [EMAIL PROTECTED] BaseConcepts]$ more kk2.sort
 icecream%1:13:00:: 07510835 1 0
 ice_cream%1:13:00:: 07510835 1 1
 life_style%1:07:00:: 04875322 1 2
 part-time%3:00:00:: 01131371 1 21

kk2.sort isn't sorted correctly, at least not for the default C
locale.  The first two lines ought to be interchanged.  That probably
explains your problem.  'join' requires sorted input.

