Re: join bug
Martin,

Martin Schmeing wrote:
> Hi Bob,
> Join works fine with my test smaller files, giving an appropriate output. When both files are 1000 (short) lines long, it outputs maybe one or two of the joined lines, but there should be hundreds output. The files are sorted, and there is no error message given. Here are my test files: pcmodel.list pcmodel1000.list radmodel.list radmodel1000.list

This one is tricky. At first pass it would seem that everything is in good shape for join. For example, the input files to join must be sorted, and not having them sorted is a common problem. But these are obviously sorted. The first thing I did was to check this.

  for f in *.list; do sort -c $f; done

No errors from sort. All of the files were sorted. So I tried joining the larger files.

  join pcmodel1000.list radmodel1000.list
  992 16023 239 3915 2793 43472.2226562 257.2904053
  993 16023 240 4134 2889 44867.9531250 393.2121582

Two lines. What is in these files? The first 15 lines of the first file show the problem. But it is tricky. In fact I missed it until this point.

   1 16021 1 8346525
   2 16021 2 10056699
   3 16021 3 12966651
   4 16021 4 13806594
   5 16021 5 11886534
   6 16021 6 10446363
   7 16021 7 4986240
   8 16021 8 3576405
   9 16021 9 2705886
  10 1602110 9575436
  11 1602111 11226096
  12 1602112 15065865
  13 1602113 14076030
  14 1602114 13835922
  15 1602115 15336045

The first field is lined up with a variable number of spaces in the first column. That is the root of the issue here. Sort by default sorts the entire line using the character collating sequence specified by the LC_COLLATE locale. Join compares its join field with the same collating sequence, but it ignores blanks at the start of the field. Because of the variable number of blanks, sort and join are seeing a different sort order for the first field.

Just last month (Feb 19 2008) James Youngman added a new feature to join that warns about this case. Using this very recent join the following diagnostic is printed.

  join: File 1 is not in sorted order
  join: File 2 is not in sorted order

Eventually this will make people aware of this problem much more easily than with older versions of join.

Knowing this makes it obvious that I used the wrong sort check. What I should have done was to use -b to skip blanks to match what join is doing. Or more precisely 'sort -k 1b,1'.

  for f in *.list; do sort -c -k 1b,1 $f; done
  sort: pcmodel1000.list:10: disorder: 10 1602110 957 5436
  sort: radmodel1000.list:116: disorder: 1001 44867.9531250 393.2121582

Now the problem is much more apparent. The file needs to be sorted in the same order that join would expect it. Not numerically but lexically using 'sort -k 1b,1'.

  sort -k 1b,1 -o pcmodel1000.list pcmodel1000.list
  sort -k 1b,1 -o radmodel1000.list radmodel1000.list
  head -n10
     1 16021 1 8346525
    10 1602110 9575436
   100 16021 100 1764 714
  1000 16023 247 48333609
   101 16021 101 1932 588
   102 16021 102 2058 501
   103 16021 103 2418 399
   104 16021 104 2256 447
   105 16021 105 1644 849

Looks better for join even if it looks worse for humans. That is the ordering that is needed for character sorting.

  join pcmodel1000.list radmodel1000.list | wc -l
  115

That looks a little more reasonable.

Hope that explanation helped.

Bob

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
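The diagnosis above can be reproduced on a small scale. The following is only a sketch under stated assumptions: the file names left.list and right.list, the keys 1..12 and the 4-character right-aligned key column are made up for illustration and are not Martin's data. It shows the whole-line check passing, the 'sort -k 1b,1' check reporting disorder, and join finding every pair once both files are re-sorted on the key.

```shell
#!/bin/sh
# Hypothetical miniature of the problem files: keys 1..12, right-aligned
# in a 4-character column so that field 1 carries leading blanks.
tmpdir=$(mktemp -d)
left="$tmpdir/left.list"
right="$tmpdir/right.list"

for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    printf '%4d a%d\n' "$i" "$i"
done > "$left"
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    printf '%4d b%d\n' "$i" "$i"
done > "$right"

# Whole-line check: the leading blanks take part in the comparison,
# so the file counts as sorted.
if LC_ALL=C sort -c "$left" 2>/dev/null; then
    whole_line_check=sorted
else
    whole_line_check=disorder
fi

# Key check as join sees it: field 1 with leading blanks skipped.
if LC_ALL=C sort -c -k 1b,1 "$left" 2>/dev/null; then
    key_check=sorted
else
    key_check=disorder
fi

# Re-sort both files in the order join expects, then join them.
LC_ALL=C sort -k 1b,1 -o "$left" "$left"
LC_ALL=C sort -k 1b,1 -o "$right" "$right"
joined=$(LC_ALL=C join "$left" "$right")
joined_count=$(printf '%s\n' "$joined" | wc -l | tr -d ' ')

printf 'whole line: %s, key: %s, joined lines: %s\n' \
    "$whole_line_check" "$key_check" "$joined_count"
rm -rf "$tmpdir"
```

LC_ALL=C is used on every command here only to keep the sketch independent of the caller's locale; the point that matters is that sort and join run with the same collation.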
join bug
Hello,

Is there a size limit for the input files for join? I want to use it with large files, but even files of 1000 lines fail.

Thanks,
Martin
join bug?
Hi,

I was trying to join two files:

  join -1 1 -2 1 Received_Packets Sent_Packets

The problem is that after packet nr 82 there are more numbers that match, but the program only joins the files up to packet nr 82.

The file Received_Packets is:

  4 25.109191
  5 25.199239
  6 25.289286
  7 25.384085
  8 25.474132
  9 25.56418
  10 25.654227
  11 25.744274
  12 25.834322
  13 25.929103
  14 26.01915
  15 26.109197
  16 26.199245
  17 26.289292
  18 26.384091
  19 26.474138
  20 26.564186
  21 26.654233
  22 26.74428
  23 26.834327
  24 26.929103
  25 27.01915
  45 27.109197
  82 27.199245
  120 27.289292
  159 27.384091

and Sent_Packets is:

  0 25.0
  1 25.00240
  2 25.00480
  3 25.00720
  4 25.00960
  5 25.01200
  6 25.01440
  7 25.01680
  8 25.01920
  9 25.02160
  10 25.02400
  11 25.02640
  12 25.02880
  13 25.03120
  14 25.03360
  15 25.03600
  16 25.03840
  17 25.04080
  18 25.04320
  19 25.04560
  20 25.04800
  21 25.05040
  22 25.05280
  23 25.05520
  24 25.05760
  25 25.06000
  26 25.06240
  27 25.06480
  28 25.06720
  29 25.06960
  30 25.07200
  31 25.07440
  32 25.07680
  33 25.07920
  34 25.08160
  35 25.08400
  36 25.08640
  37 25.08880
  38 25.09120
  39 25.09360
  40 25.09600
  41 25.09840
  42 25.10080
  43 25.10320
  44 25.10560
  45 25.10800
  46 25.11040
  47 25.11280
  48 25.11520
  49 25.11760
  50 25.12000
  51 25.12240
  52 25.12480
  53 25.12720
  54 25.12960
  55 25.13200
  56 25.13440
  57 25.13680
  58 25.13920
  59 25.14160
  60 25.14400
  61 25.14640
  62 25.14880
  63 25.15120
  64 25.15360
  65 25.15600
  66 25.15840
  67 25.16080
  68 25.16320
  69 25.16560
  70 25.16800
  71 25.17040
  72 25.17280
  73 25.17520
  74 25.17760
  75 25.18000
  76 25.18240
  77 25.18480
  78 25.18720
  79 25.18960
  80 25.19200
  81 25.19440
  82 25.19680
  83 25.19920
  84 25.20160
  85 25.20400
  86 25.20640
  87 25.20880
  88 25.21120
  89 25.21360
  90 25.21600
  91 25.21840
  92 25.22080
  93 25.22320
  94 25.22560
  95 25.22800
  96 25.23040
  97 25.23280
  98 25.23520
  99 25.23760
  100 25.24000
  101 25.24240
  102 25.24480
  103 25.24720
  104 25.24960
  105 25.25200
  106 25.25440
  107 25.25680
  108 25.25920
  109 25.26160
  110 25.26400
  111 25.26640
  112 25.26880
  113 25.27120
  114 25.27360
  115 25.27600
  116 25.27840
  117 25.28080
  118 25.28320
  119 25.28560
  120 25.28800
  121 25.29040
  122 25.29280
  123 25.29520
  124 25.29760
  125 25.3
  126 25.30240
  127 25.30480
  128 25.30720
  129 25.30960

Why do I get this after the command:

  4 25.109191 25.00960
  5 25.199239 25.01200
  6 25.289286 25.01440
  7 25.384085 25.01680
  8 25.474132 25.01920
  9 25.56418 25.02160
  10 25.654227 25.02400
  11 25.744274 25.02640
  12 25.834322 25.02880
  13 25.929103 25.03120
  14 26.01915 25.03360
  15 26.109197 25.03600
  16 26.199245 25.03840
  17 26.289292 25.04080
  18 26.384091 25.04320
  19 26.474138 25.04560
  20 26.564186 25.04800
  21 26.654233 25.05040
  22 26.74428 25.05280
  23 26.834327 25.05520
  24 26.929103 25.05760
  25 27.01915 25.06000
  45 27.109197 25.10800
  82 27.199245 25.19680
  (no more)

Kindly,
Tiago Junqueira
Re: join bug?
Tiago Junqueira writes:
> Why i get after the command:
> 4 25.109191 25.00960

Your input is not sorted.

  $ join --help
  [...]
  Important: FILE1 and FILE2 must be sorted on the join fields.
  E.g., use `sort -k 1b,1' if `join' has no options.

Andreas.

-- 
Andreas Schwab, SuSE Labs
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
And now for something completely different.
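For files like these, which are in numeric order, one way to follow that advice is to sort both inputs with 'sort -k 1b,1', join them, and then put the result back into numeric order for reading. A minimal sketch, assuming shortened stand-in data (the values r4, s4 and so on are placeholders, not the real timestamps):

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
received="$tmpdir/Received_Packets"
sent="$tmpdir/Sent_Packets"

# Hypothetical, shortened stand-ins for the two files, both in numeric order.
printf '%s\n' '4 r4' '9 r9' '10 r10' '82 r82' '120 r120' '159 r159' > "$received"
awk 'BEGIN { for (i = 0; i < 130; i++) printf "%d s%d\n", i, i }' > "$sent"

# Sort both inputs the way join compares keys (as character strings),
# join them, then restore numeric order in the output with sort -n.
LC_ALL=C sort -k 1b,1 "$received" > "$received.sorted"
LC_ALL=C sort -k 1b,1 "$sent" > "$sent.sorted"
joined=$(LC_ALL=C join "$received.sorted" "$sent.sorted" | LC_ALL=C sort -n)

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

Packet 120 is now paired as well; packet 159 is absent from the output simply because the stand-in Sent_Packets stops at 129.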
join bug?
hey,

i think there is a bug in join command. i'm joining two sorted files and there is no joining for an expected mapping. here is the example:

f1:
  79 53

f2:
  791 834
  79 82

join f1 f2 == blank

if i change f2 to:
  1791 834
  79 82

join f1 f2 ==
  79 53 82

thanks,
erez
Re: join bug?
According to Erez PERELMAN on 9/19/2006 11:44 AM:
> hey, i think there is a bug in join command. i'm joining two sorted files and there is no joining for an expected mapping. here is the example:
> f1:
>   79 53
> f2:
>   791 834
>   79 82

Not a bug. join requires its inputs to be sorted, otherwise you get arbitrary behavior. f2 was not sorted (at least, not in the C locale).

-- 
Life is short - so eat dessert first!
Eric Blake
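The f1/f2 case is small enough to replay directly. A sketch, assuming the C locale and scratch copies of the two files under a temporary directory:

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
printf '79 53\n' > "$tmpdir/f1"
printf '791 834\n79 82\n' > "$tmpdir/f2"

# Check f2 on the join field: "791" comes before "79", which is out of order.
if LC_ALL=C sort -c -k 1b,1 "$tmpdir/f2" 2>/dev/null; then
    f2_state=sorted
else
    f2_state=unsorted
fi

# Sort f2 on the join field, then join.
LC_ALL=C sort -k 1b,1 "$tmpdir/f2" > "$tmpdir/f2.sorted"
result=$(LC_ALL=C join "$tmpdir/f1" "$tmpdir/f2.sorted")

printf 'f2 was %s; join gives: %s\n' "$f2_state" "$result"
rm -rf "$tmpdir"
```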
Re: join bug?
On 2/6/06, Paul Eggert wrote:
> german rigau writes:
>> Obviously, the problem is in the sort command. With C locale runs perfectly. However, I use LANG=en_US.UTF-8 ... And then it seems that the sort command have different behaviour ...
>
> I don't see any bug in the examples that you gave.

Sorry for insisting. If you look carefully at the last example I sent, we obtain two different sortings with locale en_US.UTF-8 ... with sort kk2 we obtain icecream before ice_cream, and with sort -k 1,2 kk2 we obtain ice_cream before icecream! However, we obtain the same ordering with sort kk2 and sort -k 1,2 kk2 in locale C. ??

> Getting back to the original question, join must use the same collating convention that sort does. If you sort in the en_US.UTF-8 locale, you must join in the same locale. Otherwise, as you discovered, things won't work in general.

No. I use the same collating for sorting and joining. This is why I detected the abnormality: join failed to locate the same elements ordered by the default sorting ...

> Also, my advice is to stick with the C locale unless you know what you're doing.

I think I know perfectly what I am doing ... ;-)

> For example, if you're not sure what you want to do in the case of encoding error (or, if you don't know what an encoding error is :-), then you should stick with the C locale.

But this is only useful for encoding the English language ... and there are many more around.

Best,
German
Re: join bug?
german rigau writes:
> If you see carefully the last example I sent, we obtain two different sortings with locale en_US.UTF-8 ... with sort kk2 we obtain icecream before ice_cream and with sort -k 1,2 kk2 we obtain ice_cream before icecream!

That is because in that locale, _ is discarded in the first pass of the collation comparison. Plain sort therefore puts

  icecream%1:13:00:: 07510835 1 0

before

  ice_cream%1:13:00:: 07510835 1 1

(because of the 0 versus the 1). However, sort -k 1,2 sees only the

  icecream%1:13:00:: 07510835

and

  ice_cream%1:13:00:: 07510835

and reports the opposite order, because the second pass of the collation comparison kicks in.

> I use the same collating for sorting and joining.

I didn't see where you did that. You're not doing it in http://lists.gnu.org/archive/html/bug-coreutils/2006-02/msg00011.html, for example, because the join is only on the first field, but the sort is sorting via the entire line.

So I still don't see any bug.
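The two conditions mentioned here (sort on the same field that join uses, and in the same locale) can be sketched with the kk1/kk2 data from this thread. The sketch below assumes the C locale for both commands; any other locale should behave consistently too, as long as the same LC_ALL is given to sort and to join and the sort key is restricted to field 1.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
printf '%s\n' \
    'ice_cream%1:13:00::' \
    'life_style%1:07:00::' \
    'part-time%3:00:00::' > "$tmpdir/kk1"
printf '%s\n' \
    'ice_cream%1:13:00:: 07510835 1 1' \
    'icecream%1:13:00:: 07510835 1 0' \
    'life_style%1:07:00:: 04875322 1 2' \
    'part-time%3:00:00:: 01131371 1 21' > "$tmpdir/kk2"

# Sort on the join field only (field 1), in the same locale join will use.
LC_ALL=C sort -k 1b,1 "$tmpdir/kk1" > "$tmpdir/kk1.sorted"
LC_ALL=C sort -k 1b,1 "$tmpdir/kk2" > "$tmpdir/kk2.sorted"

# Paired lines, and lines of kk1 that found no partner.
joined=$(LC_ALL=C join "$tmpdir/kk1.sorted" "$tmpdir/kk2.sorted")
unpaired=$(LC_ALL=C join -v 1 "$tmpdir/kk1.sorted" "$tmpdir/kk2.sorted")

printf '%s\n' "$joined"
printf 'unpaired from kk1: [%s]\n' "$unpaired"
rm -rf "$tmpdir"
```

With both inputs sorted this way, ice_cream%1:13:00:: is paired and nothing from kk1 is left over.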
Re: join bug?
german rigau writes:
> Obviously, the problem is in the sort command. With C locale runs perfectly. However, I use LANG=en_US.UTF-8 ... And then it seems that the sort command have different behaviour ...

I don't see any bug in the examples that you gave.

Getting back to the original question, join must use the same collating convention that sort does. If you sort in the en_US.UTF-8 locale, you must join in the same locale. Otherwise, as you discovered, things won't work in general.

Also, my advice is to stick with the C locale unless you know what you're doing. For example, if you're not sure what you want to do in the case of encoding error (or, if you don't know what an encoding error is :-), then you should stick with the C locale.
Re: join bug?
Hello again,

Thanks a lot for your quick reply ...

But BOTH files were sorted previously with default options.

  $ cat kk1
  ice_cream%1:13:00::
  life_style%1:07:00::
  part-time%3:00:00::
  $ cat kk2
  ice_cream%1:13:00:: 07510835 1 1
  icecream%1:13:00:: 07510835 1 0
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ join <(sort kk1) <(sort kk2)
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ join -v 1 <(sort kk1) <(sort kk2)
  ice_cream%1:13:00::

So the problem is in the sort command, because:

  $ more kk2
  ice_cream%1:13:00:: 07510835 1 1
  icecream%1:13:00:: 07510835 1 0
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ sort kk2
  icecream%1:13:00:: 07510835 1 0
  ice_cream%1:13:00:: 07510835 1 1
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21

In fact, adding icecream%1:13:00:: to kk1 we obtain:

  $ sort kk1
  ice_cream%1:13:00::
  icecream%1:13:00::
  life_style%1:07:00::
  part-time%3:00:00::
  $ sort kk2
  icecream%1:13:00:: 07510835 1 0
  ice_cream%1:13:00:: 07510835 1 1
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21

What could be the reason for the two different orderings??

Thanks in advance,
German
Re: join bug?
Hello again,

It seems to me that the problem is in the sort command ...

  $ sort kk2
  icecream%1:13:00:: 07510835 1 0
  ice_cream%1:13:00:: 07510835 1 1
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ sort -k 1,2 kk2
  ice_cream%1:13:00:: 07510835 1 1
  icecream%1:13:00:: 07510835 1 0
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21

which seems to apply different criteria depending on the length of the lines ...

Best,
German
Re: join bug?
On 2/3/06, Paul Eggert wrote:
> german rigau writes:
>>   $ more kk2.sort
>>   icecream%1:13:00:: 07510835 1 0
>>   ice_cream%1:13:00:: 07510835 1 1
>>   life_style%1:07:00:: 04875322 1 2
>>   part-time%3:00:00:: 01131371 1 21
>
> kk2.sort isn't sorted correctly, at least not for the default C locale. The first two lines ought to be interchanged. That probably explains your problem. 'join' requires sorted input.

Obviously, the problem is in the sort command. With the C locale it runs perfectly. However, I use LANG=en_US.UTF-8 ... And then it seems that the sort command has different behaviour ...

  $ export LANG=C
  $ sort kk2
  ice_cream%1:13:00:: 07510835 1 1
  icecream%1:13:00:: 07510835 1 0
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ sort -k 1,2 kk2
  ice_cream%1:13:00:: 07510835 1 1
  icecream%1:13:00:: 07510835 1 0
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ export LANG=en_US.UTF-8
  $ sort kk2
  icecream%1:13:00:: 07510835 1 0
  ice_cream%1:13:00:: 07510835 1 1
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ sort -k 1,2 kk2
  ice_cream%1:13:00:: 07510835 1 1
  icecream%1:13:00:: 07510835 1 0
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21

Best,
German
join bug?
Just in case this is really a bug...

I'm using version 5.2.1 of coreutils on Fedora Core 4. Consider joining these two files, ordered with default options:

  $ more kk1.sort
  ice_cream%1:13:00::
  life_style%1:07:00::
  part-time%3:00:00::
  $ more kk2.sort
  icecream%1:13:00:: 07510835 1 0
  ice_cream%1:13:00:: 07510835 1 1
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ join kk1.sort kk2.sort
  life_style%1:07:00:: 04875322 1 2
  part-time%3:00:00:: 01131371 1 21
  $ join -v 1 kk1.sort kk2.sort
  ice_cream%1:13:00::

Why is ice_cream%1:13:00:: not joined? Is this corrected in version 5.93 of coreutils?

Thanks,
German
Re: join bug?
german rigau writes:
>   $ more kk2.sort
>   icecream%1:13:00:: 07510835 1 0
>   ice_cream%1:13:00:: 07510835 1 1
>   life_style%1:07:00:: 04875322 1 2
>   part-time%3:00:00:: 01131371 1 21

kk2.sort isn't sorted correctly, at least not for the default C locale. The first two lines ought to be interchanged. That probably explains your problem. 'join' requires sorted input.