Jacob,
If I run that code with an added row in file2.txt, e.g.,
$ cat file2.txt
q1 d1
q1 d2
q3 d3
q2 d4
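I get mistaken results: q2 is reported with 1 match even though none of
its d values appears in both files. The code takes the smaller of the two
per-q row counts, and q2 now has two rows in file1.txt and one in file2.txt: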
(q1,2)
(q2,1)
(q3,0)
I am new at this, so I apologize for the ponderous pace of the following.
It can no doubt be shortened, but it gets the correct results with
either data set.
set io.sort.mb 10; -- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
A = LOAD '../../../input/file1.txt' using PigStorage(' ') as (aa:chararray, ab:chararray);
B = LOAD '../../../input/file2.txt' using PigStorage(' ') as (ba:chararray, bb:chararray);
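C = UNION A, B;
D = COGROUP C by ($0, $1); -- one group per distinct (q, d) pair
F = FOREACH D GENERATE FLATTEN($0), COUNT($1); -- (q, d, number of rows with that pair)
G0 = FILTER F BY $2 > 1; -- any that match
G1 = FILTER F BY $2 < 2; -- any that don't match
H0 = GROUP G0 BY $0;
H1 = GROUP G1 BY $0;
J0 = FOREACH H0 GENERATE $0, COUNT($1); -- number of matching pairs per q
J1 = FOREACH H1 GENERATE $0, 0; -- every q that has at least one unmatched pair
K = UNION J0, J1; -- caveat: a q with both matched and unmatched pairs would appear twice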
DUMP K;
/*
(q2,0)
(q3,0)
(q1,2)
*/
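On the point about shortening: one possible variant is sketched below. It is
untested, and it assumes both files are space-delimited and sit in the working
directory. The field names q and d and the relation names pairs, flags, by_q
and counts are illustrative choices, not anything from the scripts above. The
idea is to cogroup on the full (q, d) pair, flag a pair as matched only when
it occurs in both inputs, and then sum the flags per q.

```pig
A = LOAD 'file1.txt' USING PigStorage(' ') AS (q:chararray, d:chararray);
B = LOAD 'file2.txt' USING PigStorage(' ') AS (q:chararray, d:chararray);

-- Cogroup on the whole (q, d) pair rather than on q alone.
pairs = COGROUP A BY (q, d), B BY (q, d);

-- A pair counts as matched only if neither input bag is empty.
flags = FOREACH pairs GENERATE
    group.$0 AS q,
    (IsEmpty(A) ? 0 : (IsEmpty(B) ? 0 : 1)) AS matched;

-- Sum the flags per q, so a q with no matching pair still shows up with 0.
by_q   = GROUP flags BY q;
counts = FOREACH by_q GENERATE group AS q, SUM(flags.matched) AS num_matches;
DUMP counts;
```

Because every q yields exactly one output row, a q that has both matched and
unmatched pairs should not be listed twice. A pair repeated within a single
file is also not counted as a match unless it appears in the other file too.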
Barclay Dunn
On 6/20/13 10:07 AM, Jacob Perkins wrote:
Hi,
This should just be a simple cogroup.
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);
counts = foreach (cogroup A by q, B by q) {
  num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
  generate
    group as q,
    num_matches as num_matches;
};
dump counts;
(q1,2)
(q2,0)
(q3,0)
--jacob
@thedatachef
On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:
Hi,
I have a problem where I need to compare two files and get the
count of matching attributes.
For example:
File 1: file1.txt
q1 d1
q1 d2
q2 d3
q2 d1
File 2: file2.txt
q1 d1
q1 d2
q3 d3
What I need, for each distinct q, is the count of matching d's.
For example, the output should be:
q1 2 (q1 d1 and q1 d2 match in both files, hence the count is 2)
q2 0 (has no matching d's)
q3 0
Any idea how this can be achieved?
Thanks in advance
-Sid