-------- Original message --------
From: Jacob Perkins <[email protected]>
Date: 20/06/2013 20:30 (GMT+00:00)
To: Barclay Dunn <[email protected]>
Cc: [email protected]
Subject: Re: comparing two files using pig

I did not read your original post clearly enough. I didn't realize both the d AND the q had to match. It's only slightly more complex; just add the d column to the cogroup statement and sum the number of matches:

A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by (q,d), B by (q,d)) {
           num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
           generate
             flatten(group) as (q,d),
             num_matches    as num_matches;
         };

all_matches = foreach (group counts by q) generate group as q, SUM(counts.num_matches) as total_matches;

dump all_matches;

(q1,2)
(q2,0)
(q3,0)

--jacob
@thedatachef

On 06/20/2013 02:06 PM, Barclay Dunn wrote:
> Jacob,
>
> If I run that code with an added row in file2.txt, e.g.,
>
> $ cat file2.txt
> q1 d1
> q1 d2
> q3 d3
> q2 d4
>
> This gives me mistaken results, i.e.,
>
> (q1,2)
> (q2,1)
> (q3,0)
>
>
> I am new at this so I apologize for the ponderous pace of the
> following. It can no doubt be shortened. But it gets the correct
> results with either data set.
>
> set io.sort.mb 10; -- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
>
> A = LOAD '../../../input/file1.txt' using PigStorage(' ') as
>     (aa:chararray, ab:chararray);
> B = LOAD '../../../input/file2.txt' using PigStorage(' ') as
>     (ba:chararray, bb:chararray);
>
> C = UNION A, B;
> D = COGROUP C by ($0, $1);
>
> F = FOREACH D GENERATE FLATTEN($0), COUNT($1);
>
> G0 = FILTER F BY $2 > 1; -- any that match
> G1 = FILTER F BY $2 < 2; -- any that don't match
>
> H0 = GROUP G0 BY $0;
> H1 = GROUP G1 BY $0;
>
>
> J0 = FOREACH H0 GENERATE $0, COUNT($1);
> J1 = FOREACH H1 GENERATE $0, 0;
>
> K = UNION J0, J1;
>
> DUMP K;
> /*
> (q2,0)
> (q3,0)
> (q1,2)
> */
>
>
> Barclay Dunn
>
>
> On 6/20/13 10:07 AM, Jacob Perkins wrote:
>> Hi,
>>
>> This should just be a simple cogroup.
>>
>> A = load 'file1.txt' as (q:chararray, d:chararray);
>> B = load 'file2.txt' as (q:chararray, d:chararray);
>>
>> counts = foreach (cogroup A by q, B by q) {
>>            num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>>            generate
>>              group       as q,
>>              num_matches as num_matches;
>>          };
>>
>> dump counts;
>>
>> (q1,2)
>> (q2,0)
>> (q3,0)
>>
>> --jacob
>> @thedatachef
>>
>> On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:
>>
>>> Hi,
>>>
>>> I have a problem statement wherein I have to compare two files and get the
>>> count of matching attributes.
>>>
>>> For ex:
>>> File 1: file1.txt
>>>
>>> q1 d1
>>> q1 d2
>>> q2 d3
>>> q2 d1
>>>
>>> File 2: file2.txt
>>> q1 d1
>>> q1 d2
>>> q3 d3
>>>
>>> Now what I need is, for each distinct q, the count of matching d's.
>>>
>>> For ex, the output should be
>>> q1 2 (q1 d1 and q1 d2 are matching in both the
>>> files hence count is 2)
>>> q2 0 (has no d's matching)
>>> q3 0
>>>
>>> Any idea how this can be achieved?
>>>
>>> Thanks in advance
>>>
>>> -Sid
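
For anyone who wants to sanity-check the expected output without a Pig installation, the logic discussed in this thread (group on the (q, d) pair, take the smaller of the two per-file counts, then sum per q) can be sketched in plain Python. This is only an illustrative cross-check, not part of any of the Pig scripts above; the function name `count_matches` and the in-memory lists standing in for file1.txt and file2.txt are assumptions made for the example.

```python
from collections import Counter

def count_matches(rows_a, rows_b):
    """For each q, count the (q, d) pairs that occur in both inputs.

    Mirrors the cogroup-on-(q, d) idea: a pair contributes
    min(occurrences in A, occurrences in B), summed per q.
    """
    a = Counter(rows_a)
    b = Counter(rows_b)
    totals = {}
    for q, d in set(a) | set(b):
        # A Counter returns 0 for a missing key, so unmatched pairs add 0.
        totals[q] = totals.get(q, 0) + min(a[(q, d)], b[(q, d)])
    return dict(sorted(totals.items()))

# Sample data from the original question (hypothetical in-memory stand-ins).
file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3")]
# Variant with the extra row that broke the q-only cogroup.
file2_extra = file2 + [("q2", "d4")]

result = count_matches(file1, file2)
result_extra = count_matches(file1, file2_extra)
print(result)
print(result_extra)
```

Both data sets should give q1 a count of 2 and q2 and q3 a count of 0, because the extra (q2, d4) row has no counterpart in file1.txt.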
