Re: comparing two files using pig

Barclay Dunn Fri, 21 Jun 2013 06:45:39 -0700

The introductory theory of this is awesome. "A+++++ would read again" ;)


On 6/21/13 9:38 AM, Jacob Perkins wrote:

Now here's where it gets fun :)

First, I do want to show you that (given sufficient coffee) there is a 
set-theoretic approach to your first question that allows you to solve it with 
just one map-reduce job (a single cogroup) and not two (a cogroup followed by a 
group). Consider two sets, A and B, where |A| is the number of elements in A 
and |B| is the number of elements in B.

Let |AUB| be the size of the set union of A and B. Note that Pig does not have 
a set union operator. The UNION operator in Pig is a misnomer (it concatenates 
relations without removing duplicates). Plus, you can't use it in a nested 
projection, which is frustrating...
Let |A^B| be the size of the set intersection of A and B (the number of 
elements that are in BOTH A and B).

What you're technically after is |A^B|. However, since Pig does not have a set 
intersection operator, and I'm assuming writing a UDF is out of the question 
for you, we can be a bit more clever. As it turns out, Pig has a built-in DIFF 
function. It takes two bags (basically sets, although duplicate elements are 
allowed) and returns all the elements that are in either bag but NOT in both. 
Notice:

|AUB| = |A^B| + |DIFF(A,B)| and
|AUB| = |A| + |B| - |A^B| therefore

|A^B| = 1/2*( |A| + |B| - |DIFF(A,B)| )

All of which we can compute with native Pig :)
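
As a quick sanity check of the arithmetic above, here is a minimal sketch in 
Python rather than Pig. It is not part of the original solution. Python sets 
stand in for Pig bags (so no duplicates), the symmetric difference stands in 
for DIFF, and the extra row (q1, d9) is a hypothetical addition so that the 
difference is non-empty.

```python
# Rows for q1 from the thread's sample files, plus one hypothetical extra
# row (q1, d9) in A so that the symmetric difference is not empty.
A = {("q1", "d1"), ("q1", "d2"), ("q1", "d9")}
B = {("q1", "d1"), ("q1", "d2")}

# Elements in either set but not in both (the role DIFF plays above).
diff = A ^ B

# |A^B| = 1/2 * (|A| + |B| - |DIFF(A,B)|)
match_size = (len(A) + len(B) - len(diff)) // 2

print(match_size)   # size obtained from the identity
print(len(A & B))   # size of the intersection computed directly
```

Because sets drop duplicates and bags do not, treat this only as a check of 
the identity, not as a statement about how DIFF behaves on bags with repeated 
tuples.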

So:

A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by q, B by q) {
            a_size     = COUNT(A);         -- |A|
            b_size     = COUNT(B);         -- |B|
            diff_size  = COUNT(DIFF(A,B)); -- |DIFF(A,B)|
            -- 1/2*(|A| + |B| - |DIFF(A,B)|) = |A intersect B|
            match_size = (a_size + b_size - diff_size)/2l;
            generate
              group as q,
              match_size;
          };

dump counts;



Alright, back to your other issue of adding the matching elements. Again, if 
you were up for it, you could simply write a set intersection UDF and be done 
with it. Otherwise, here's what I came up with:


A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by (q,d), B by (q,d)) {
             num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
             generate
               flatten(group) as (q,d),
               num_matches    as num_matches;
           };

all_matches = foreach (group counts by q) {
                 match_set = filter counts by num_matches > 0;
                 match_set = match_set.d;
                 generate
                   group as q,
                   SUM(counts.num_matches) as total_matches,
                   match_set as match_set;
               };

dump all_matches;

(q1,2,{(d1),(d2)})
(q2,0,{})
(q3,0,{})

The empty curly braces indicate bags that contain no tuples.
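
For anyone who wants to trace the two steps by hand, here is a rough Python 
model of the same logic (an illustration only, not Pig, and not something from 
the original script). The function name all_matches and the in-memory lists 
are assumptions made for the sketch. The rows are the sample data from the 
original question.

```python
from collections import Counter, defaultdict

# Sample data from the original question (file1.txt and file2.txt).
file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3")]


def all_matches(a_rows, b_rows):
    """Model of: cogroup by (q, d), then group the per-pair counts by q."""
    a_counts, b_counts = Counter(a_rows), Counter(b_rows)
    per_q = defaultdict(lambda: [0, set()])  # q -> [total_matches, match_set]
    for key in sorted(set(a_counts) | set(b_counts)):
        q, d = key
        # Same role as MIN(TOBAG(COUNT(A), COUNT(B))) for one (q, d) pair.
        num_matches = min(a_counts[key], b_counts[key])
        per_q[q][0] += num_matches
        if num_matches > 0:
            per_q[q][1].add(d)
    return {q: (total, sorted(ds)) for q, (total, ds) in per_q.items()}


result = all_matches(file1, file2)
for q in sorted(result):
    print(q, *result[q])
```

Every q that appears in either file gets an entry, which is why q2 and q3 show 
up with a count of 0 and an empty match set.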

--jacob
@thedatachef

On Jun 21, 2013, at 6:14 AM, Siddhi Borkar wrote:

Thanks a lot, the solution worked fine. Is it also possible to display the 
comma-separated matching d's?

For example:
(q1,2, {d1,d2})
(q2,0)
(q3,0)

-----Original Message-----
From: Chris Hokamp [mailto:[email protected]]
Sent: Friday, June 21, 2013 1:52 AM
To: [email protected]; Barclay Dunn
Subject: Re: comparing two files using pig

Z


Sent from Samsung Mobile

-------- Original message --------
From: Jacob Perkins <[email protected]>
Date: 20/06/2013  20:30  (GMT+00:00)
To: Barclay Dunn <[email protected]>
Cc: [email protected]
Subject: Re: comparing two files using pig

I did not read your original post closely enough. I didn't realize both the d 
AND the q had to match. It's only slightly more complex: just add the d column 
to the cogroup statement and sum the number of matches:

A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by (q,d), B by (q,d)) {
             num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
             generate
               flatten(group) as (q,d),
               num_matches    as num_matches;
           };

all_matches = foreach (group counts by q) generate group as q,
SUM(counts.num_matches) as total_matches;

dump all_matches;

(q1,2)
(q2,0)
(q3,0)

--jacob
@thedatachef

On 06/20/2013 02:06 PM, Barclay Dunn wrote:

Jacob,

If I run that code with an added row in file2.txt, e.g.,

   $ cat file2.txt
q1 d1
q1 d2
q3 d3
q2 d4

This gives me mistaken results, i.e.,

(q1,2)
(q2,1)
(q3,0)
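
To see where the stray (q2,1) comes from, here is a small Python model of the 
cogroup-by-q count (a sketch for illustration, not Pig; the function name 
counts_by_q is made up for this example). It takes the smaller of the two 
per-q row counts and never looks at d.

```python
from collections import Counter

# file1.txt from the original question, and file2.txt with the added row.
file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3"), ("q2", "d4")]


def counts_by_q(a_rows, b_rows):
    """Model of MIN(TOBAG(COUNT(A), COUNT(B))) after cogrouping on q only."""
    a = Counter(q for q, _ in a_rows)
    b = Counter(q for q, _ in b_rows)
    return {q: min(a[q], b[q]) for q in sorted(set(a) | set(b))}


by_q = counts_by_q(file1, file2)
print(by_q)
```

q2 has two rows in file1 and one row in file2, so the minimum is 1 even though 
none of the d values agree.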


I am new at this so I apologize for the ponderous pace of the
following. It can no doubt be shortened. But it gets the correct
results with either data set.

-- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;

A = LOAD '../../../input/file1.txt' using PigStorage(' ') as
    (aa:chararray, ab:chararray);
B = LOAD '../../../input/file2.txt' using PigStorage(' ') as
    (ba:chararray, bb:chararray);

C = UNION A, B;
D = COGROUP C by ($0, $1);

F = FOREACH D GENERATE FLATTEN($0), COUNT($1);

G0 = FILTER F BY $2 > 1;   -- any that match
G1 = FILTER F BY $2 < 2;   -- any that don't match

H0 = GROUP G0 BY $0;
H1 = GROUP G1 BY $0;


J0 = FOREACH H0 GENERATE $0, COUNT($1);
J1 = FOREACH H1 GENERATE $0, 0;

K = UNION J0, J1;

DUMP K;
/*
(q2,0)
(q3,0)
(q1,2)
*/


Barclay Dunn


On 6/20/13 10:07 AM, Jacob Perkins wrote:

Hi,

This should just be a simple cogroup.

A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by q, B by q) {
                   num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
                   generate
                     group       as q,
                     num_matches as num_matches;
                };

dump counts;

(q1,2)
(q2,0)
(q3,0)

--jacob
@thedatachef

On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:

Hi,

I have a problem where I have to compare two files and get the count of 
matching attributes.

For ex:
File 1:  file1.txt

q1           d1
q1           d2
q2           d3
q2           d1

File 2: file2.txt
q1           d1
q1           d2
q3           d3

Now what I need is, for each distinct q, the count of matching d's.

For example, the output should be
q1           2  (q1 d1 and q1 d2 match in both files, hence the count is 2)
q2           0 (has no d's matching)
q3           0

Any idea how this can be achieved?

Thnx in advance

-Sid



DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.


