Re: comparing two files using pig

Jacob Perkins Fri, 21 Jun 2013 06:40:02 -0700

Now here's where it gets fun :)

First, I do want to show you that (given sufficient coffee) there is a 
set-theoretic approach to your first question that lets you solve it with just 
one map-reduce job (a single cogroup) rather than two (a cogroup followed by a 
group). Consider two sets, A and B, where |A| is the number of elements in A and 
|B| is the number of elements in B.


Let |AUB| be the size of the set union of A and B. Note that Pig does not have a 
set union operator. The UNION operator in Pig is a misnomer: it concatenates 
relations and keeps duplicates. Plus, you can't use it in a nested projection, 
which is frustrating...
Let |A^B| be the size of the set intersection of A and B (the number of 
elements that are in BOTH A and B).

What you're technically after is |A^B|. However, since Pig does not have a set 
intersection operator, and I'm assuming writing a UDF is out of the question 
for you, we can be a bit more clever. As it turns out, Pig has a DIFF function. 
It takes two bags (basically sets, although duplicate elements are allowed) and 
returns all the elements that are in either bag but NOT in both. Notice:

|AUB| = |A^B| + |DIFF(A,B)| and
|AUB| = |A| + |B| - |A^B|, therefore

|A^B| = 1/2*( |A| + |B| - |DIFF(A,B)| )
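
As a quick sanity check, the identity can be worked by hand on the sample 
file1.txt and file2.txt from the original post, grouping on q. This assumes no 
tuple is repeated within a file, so each bag behaves like a set:

```text
q1: |A| = 2, |B| = 2, |DIFF(A,B)| = 0  ->  |A^B| = (2 + 2 - 0)/2 = 2
q2: |A| = 2, |B| = 0, |DIFF(A,B)| = 2  ->  |A^B| = (2 + 0 - 2)/2 = 0
q3: |A| = 0, |B| = 1, |DIFF(A,B)| = 1  ->  |A^B| = (0 + 1 - 1)/2 = 0
```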

All of which we can compute with native Pig :)

So:

A = load 'file1.txt' as (q:chararray, d:chararray); 
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by q, B by q) {
           a_size     = COUNT(A);         -- |A|
           b_size     = COUNT(B);         -- |B|
           diff_size  = COUNT(DIFF(A,B)); -- |DIFF(A,B)|
           -- 1/2*(|A| + |B| - |DIFF(A,B)|) = |A intersect B|
           match_size = (a_size + b_size - diff_size)/2l;
           generate
             group as q,
             match_size;
         };

dump counts;
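
If the numbers look off, it can help to look at the three sizes that feed 
match_size. A minimal sketch, reusing the same file names and schema as the 
script above and assuming tab-delimited input (the default for load); the 
relation name sizes is only illustrative:

```pig
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

-- one row per q with the raw sizes used in the intersection formula
sizes = foreach (cogroup A by q, B by q) generate
          group            as q,
          COUNT(A)         as a_size,     -- |A|
          COUNT(B)         as b_size,     -- |B|
          COUNT(DIFF(A,B)) as diff_size;  -- |DIFF(A,B)|

dump sizes;
```

With the sample files from the original post, the q2 row should show a_size 
and diff_size both equal to 2 and b_size equal to 0, which is why its match 
count comes out as 0.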



Alright, back to your other question about adding the matching elements to the 
output. Again, if you were up for it, you could simply write a set intersection 
UDF and be done with it. Otherwise, here's what I came up with:


A = load 'file1.txt' as (q:chararray, d:chararray); 
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by (q,d), B by (q,d)) {
            num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
            generate
              flatten(group) as (q,d),
              num_matches    as num_matches;
          };

all_matches = foreach (group counts by q) {
                match_set = filter counts by num_matches > 0;
                match_set = match_set.d;
                generate
                  group as q,
                  SUM(counts.num_matches) as total_matches,
                  match_set as match_set;
              };
              

dump all_matches;

(q1,2,{(d1),(d2)})
(q2,0,{})
(q3,0,{})

The empty curly braces indicate bags that contain no tuples.
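
If the matching d's are wanted as one comma separated string rather than a 
bag, one option is the BagToString builtin. This is only a sketch: it assumes 
a Pig version that ships BagToString (check the builtin list for your 
release), and the alias names matched and match_list are made up for 
illustration:

```pig
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by (q,d), B by (q,d)) {
            num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
            generate
              flatten(group) as (q,d),
              num_matches    as num_matches;
          };

all_matches = foreach (group counts by q) {
                matched   = filter counts by num_matches > 0;
                match_set = matched.d;
                generate
                  group as q,
                  SUM(counts.num_matches) as total_matches,
                  -- join the d values in the bag using ',' as the delimiter
                  BagToString(match_set, ',') as match_list;
              };

dump all_matches;
```

How BagToString renders an empty bag is worth checking on your version before 
relying on the q2 and q3 rows.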

--jacob
@thedatachef

On Jun 21, 2013, at 6:14 AM, Siddhi Borkar wrote:

> Thanks a lot the solution worked fine. Is it possible also to display the 
> comma separated matching d's?
> 
> For ex 
> (q1,2, {d1,d2})
> (q2,0)
> (q3,0)
> 
> -----Original Message-----
> From: Chris Hokamp [mailto:[email protected]] 
> Sent: Friday, June 21, 2013 1:52 AM
> To: [email protected]; Barclay Dunn
> Subject: Re: comparing two files using pig
> 
> Z
> 
> 
> Sent from Samsung Mobile
> 
> -------- Original message --------
> From: Jacob Perkins <[email protected]>
> Date: 20/06/2013  20:30  (GMT+00:00)
> To: Barclay Dunn <[email protected]>
> Cc: [email protected]
> Subject: Re: comparing two files using pig 
> 
> I did not read your original post clearly enough. I didn't realize both the d 
> AND the q had to match. It's only slightly more complex, just add the d 
> column to the cogroup statement and sum the number of matches:
> 
> A = load 'file1.txt' as (q:chararray, d:chararray);
> B = load 'file2.txt' as (q:chararray, d:chararray);
> 
> counts = foreach (cogroup A by (q,d), B by (q,d)) {
>             num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>             generate
>               flatten(group) as (q,d),
>               num_matches    as num_matches;
>           };
> 
> all_matches = foreach (group counts by q) generate group as q,
> SUM(counts.num_matches) as total_matches;
> 
> dump all_matches;
> 
> (q1,2)
> (q2,0)
> (q3,0)
> 
> --jacob
> @thedatachef
> 
> On 06/20/2013 02:06 PM, Barclay Dunn wrote:
>> Jacob,
>> 
>> If I run that code with an added row in file2.txt, e.g.,
>> 
>>   $ cat file2.txt
>> q1 d1
>> q1 d2
>> q3 d3
>> q2 d4
>> 
>> This gives me mistaken results, i.e.,
>> 
>> (q1,2)
>> (q2,1)
>> (q3,0)
>> 
>> 
>> I am new at this so I apologize for the ponderous pace of the 
>> following. It can no doubt be shortened. But it gets the correct 
>> results with either data set.
>> 
>> set io.sort.mb 10;         -- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
>> 
>> A = LOAD '../../../input/file1.txt' using PigStorage(' ') as (aa:chararray, ab:chararray);
>> B = LOAD '../../../input/file2.txt' using PigStorage(' ') as (ba:chararray, bb:chararray);
>> 
>> C = UNION A, B;
>> D = COGROUP C by ($0, $1);
>> 
>> F = FOREACH D GENERATE FLATTEN($0), COUNT($1);
>> 
>> G0 = FILTER F BY $2 > 1;   -- any that match
>> G1 = FILTER F BY $2 < 2;   -- any that don't match
>> 
>> H0 = GROUP G0 BY $0;
>> H1 = GROUP G1 BY $0;
>> 
>> 
>> J0 = FOREACH H0 GENERATE $0, COUNT($1);
>> J1 = FOREACH H1 GENERATE $0, 0;
>> 
>> K = UNION J0, J1;
>> 
>> DUMP K;
>> /*
>> (q2,0)
>> (q3,0)
>> (q1,2)
>> */
>> 
>> 
>> Barclay Dunn
>> 
>> 
>> On 6/20/13 10:07 AM, Jacob Perkins wrote:
>>> Hi,
>>> 
>>> This should just be a simple cogroup.
>>> 
>>> A = load 'file1.txt' as (q:chararray, d:chararray);
>>> B = load 'file2.txt' as (q:chararray, d:chararray);
>>> 
>>> counts = foreach (cogroup A by q, B by q) {
>>>                   num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>>>                   generate
>>>                     group       as q,
>>>                     num_matches as num_matches;
>>>                };
>>> 
>>> dump counts;
>>> 
>>> (q1,2)
>>> (q2,0)
>>> (q3,0)
>>> 
>>> --jacob
>>> @thedatachef
>>> 
>>> On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I have a problem statement wherein I have to compare two files and get 
>>>> the count of matching attributes.
>>>> 
>>>> For ex:
>>>> File 1:  file1.txt
>>>> 
>>>> q1           d1
>>>> q1           d2
>>>> q2           d3
>>>> q2           d1
>>>> 
>>>> File 2: file2.txt
>>>> q1           d1
>>>> q1           d2
>>>> q3           d3
>>>> 
>>>> Now what I need is for each distinct q  the count of matching d's
>>>> 
>>>> For ex, the output should be
>>>> q1           2  (q1     d1 and q1            d2 are matching in both 
>>>> the files hence count is 2)
>>>> q2           0 (has no d's matching)
>>>> q3           0
>>>> 
>>>> Any idea how this can be achieved?
>>>> 
>>>> Thnx in advance
>>>> 
>>>> -Sid
>>>> 
>>>> 
>>>> 
>>>> DISCLAIMER
>>>> ==========
>>>> This e-mail may contain privileged and confidential information which is 
>>>> the property of Persistent Systems Ltd. It is intended only for the use of 
>>>> the individual or entity to which it is addressed. If you are not the 
>>>> intended recipient, you are not authorized to read, retain, copy, print, 
>>>> distribute or use this message. If you have received this communication in 
>>>> error, please notify the sender and delete all copies of this message. 
>>>> Persistent Systems Ltd. does not accept any liability for virus infected 
>>>> mails.
>> 
> 
> 
