Now here's where it gets fun :)
First, I do want to show you that (given sufficient coffee) there is a
set-theoretic approach to your first question that lets you solve it with just
one map-reduce job (a single cogroup) and not two (a cogroup followed by a
group). Consider two sets, A and B, where |A| is the number of elements in A and
|B| is the number of elements in B.
Let |AUB| be the size of the set union of A and B. Note, Pig does not have a
set union operator. The UNION operator in Pig is a misnomer (it concatenates
two relations and keeps duplicates). Plus, you can't use it in a nested
projection, which is frustrating...
Let |A^B| be the size of the set intersection of A and B (the number of
elements that are in BOTH A and B).
What you're technically after is |A^B|. However, since Pig does not have a set
intersection operator, and I'm assuming writing a UDF is out of the question
for you, we can be a bit more clever. As it turns out, Pig has a built-in DIFF
function. It takes two bags (basically sets, although duplicate elements are
allowed) and returns all the elements that are in either bag but NOT in both.
Notice:
|AUB| = |A^B| + |DIFF(A,B)| and
|AUB| = |A| + |B| - |A^B|, therefore
|A^B| = 1/2*( |A| + |B| - |DIFF(A,B)| )
All of which we can compute with native Pig :)
So:
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);
counts = foreach (cogroup A by q, B by q) {
  a_size     = COUNT(A);          -- |A|
  b_size     = COUNT(B);          -- |B|
  diff_size  = COUNT(DIFF(A,B));  -- |DIFF(A,B)|
  -- 1/2*(|A| + |B| - |DIFF(A,B)|) = |A intersect B|
  match_size = (a_size + b_size - diff_size)/2L;
  generate
    group      as q,
    match_size as match_size;
};
dump counts;
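
If you want to convince yourself of the identity without firing up Pig, here's
a minimal Python sketch (not Pig, just a sanity check). It assumes each bag
holds distinct (q, d) tuples so that it behaves like a set, and it uses Python's
symmetric difference as a stand-in for DIFF. The sample rows are the ones from
the original question; the helper names cogroup_by_q and match_sizes are made up
for illustration.

```python
# Sanity check of |A intersect B| = (|A| + |B| - |DIFF(A,B)|) / 2 with Python sets.
# Assumption: each bag holds distinct (q, d) tuples, so it behaves like a set.
from collections import defaultdict

file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3")]


def cogroup_by_q(a_rows, b_rows):
    """Mimic `cogroup A by q, B by q`: map q -> (set of A rows, set of B rows)."""
    groups = defaultdict(lambda: (set(), set()))
    for row in a_rows:
        groups[row[0]][0].add(row)
    for row in b_rows:
        groups[row[0]][1].add(row)
    return groups


def match_sizes(a_rows, b_rows):
    """Return {q: size of the intersection}, computed via the DIFF identity."""
    result = {}
    for q, (a, b) in cogroup_by_q(a_rows, b_rows).items():
        diff_size = len(a ^ b)  # symmetric difference, standing in for DIFF(A,B)
        match_size = (len(a) + len(b) - diff_size) // 2
        # The identity should agree with a direct set intersection.
        assert match_size == len(a & b)
        result[q] = match_size
    return result


counts = match_sizes(file1, file2)
print(sorted(counts.items()))
```

Since DIFF compares whole tuples, the d column takes part in the comparison
even though the cogroup key is only q.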
Alright, back to your other issue of listing the matching elements. Again, if
you were up for it, you could simply write a set intersection UDF and be done
with it. Otherwise, here's what I came up with:
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);
counts = foreach (cogroup A by (q,d), B by (q,d)) {
  num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
  generate
    flatten(group) as (q,d),
    num_matches    as num_matches;
};
all_matches = foreach (group counts by q) {
  match_set = filter counts by num_matches > 0;
  match_set = match_set.d;
  generate
    group                   as q,
    SUM(counts.num_matches) as total_matches,
    match_set               as match_set;
};
dump all_matches;
(q1,2,{(d1),(d2)})
(q2,0,{})
(q3,0,{})
The empty curly braces indicate bags that contain no tuples.
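
For anyone who wants to trace the logic of the two steps by hand, here's a
rough Python sketch of the same idea (again not Pig, just an illustration).
Counter plays the role of COUNT on each side of the cogroup by (q,d), min()
stands in for MIN(TOBAG(...)), and the dictionary keyed by q mirrors the final
group. The function name all_matches_py is hypothetical, and the last line
joins the matching d's with commas purely as an illustration of the
comma-separated form you asked about.

```python
# Emulates: cogroup by (q, d) -> min of the two counts -> group by q.
from collections import Counter, defaultdict

file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3")]


def all_matches_py(a_rows, b_rows):
    """Return {q: (total_matches, [matching d values])}."""
    a_counts = Counter(a_rows)  # COUNT(A) per (q, d) key
    b_counts = Counter(b_rows)  # COUNT(B) per (q, d) key
    per_q = defaultdict(lambda: [0, []])
    # Every (q, d) key seen on either side, like the cogroup output.
    for q, d in sorted(set(a_counts) | set(b_counts)):
        # A Counter returns 0 for a missing key, like COUNT on an empty bag.
        num_matches = min(a_counts[(q, d)], b_counts[(q, d)])
        entry = per_q[q]
        entry[0] += num_matches      # SUM(counts.num_matches)
        if num_matches > 0:          # filter counts by num_matches > 0
            entry[1].append(d)
    return {q: (total, ds) for q, (total, ds) in per_q.items()}


matches = all_matches_py(file1, file2)
for q in sorted(matches):
    total, ds = matches[q]
    print(q, total, ",".join(ds))
```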
--jacob
@thedatachef
On Jun 21, 2013, at 6:14 AM, Siddhi Borkar wrote:
> Thanks a lot the solution worked fine. Is it possible also to display the
> comma separated matching d's?
>
> For ex
> (q1,2, {d1,d2})
> (q2,0)
> (q3,0)
>
> -----Original Message-----
> From: Chris Hokamp [mailto:[email protected]]
> Sent: Friday, June 21, 2013 1:52 AM
> To: [email protected]; Barclay Dunn
> Subject: Re: comparing two files using pig
>
> Z
>
>
> Sent from Samsung Mobile
>
> -------- Original message --------
> From: Jacob Perkins <[email protected]>
> Date: 20/06/2013 20:30 (GMT+00:00)
> To: Barclay Dunn <[email protected]>
> Cc: [email protected]
> Subject: Re: comparing two files using pig
>
> I did not read your original post clearly enough. I didn't realize both the d
> AND the q had to match. It's only slightly more complex, just add the d
> column to the cogroup statement and sum the number of matches:
>
> A = load 'file1.txt' as (q:chararray, d:chararray);
> B = load 'file2.txt' as (q:chararray, d:chararray);
>
> counts = foreach (cogroup A by (q,d), B by (q,d)) {
> num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
> generate
> flatten(group) as (q,d),
> num_matches as num_matches;
> };
>
> all_matches = foreach (group counts by q) generate group as q,
> SUM(counts.num_matches) as total_matches;
>
> dump all_matches;
>
> (q1,2)
> (q2,0)
> (q3,0)
>
> --jacob
> @thedatachef
>
> On 06/20/2013 02:06 PM, Barclay Dunn wrote:
>> Jacob,
>>
>> If I run that code with an added row in file2.txt, e.g.,
>>
>> $ cat file2.txt
>> q1 d1
>> q1 d2
>> q3 d3
>> q2 d4
>>
>> This gives me mistaken results, i.e.,
>>
>> (q1,2)
>> (q2,1)
>> (q3,0)
>>
>>
>> I am new at this so I apologize for the ponderous pace of the
>> following. It can no doubt be shortened. But it gets the correct
>> results with either data set.
>>
>> -- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
>> set io.sort.mb 10;
>>
>> A = LOAD '../../../input/file1.txt' using PigStorage(' ') as (aa:chararray, ab:chararray);
>> B = LOAD '../../../input/file2.txt' using PigStorage(' ') as (ba:chararray, bb:chararray);
>>
>> C = UNION A, B;
>> D = COGROUP C by ($0, $1);
>>
>> F = FOREACH D GENERATE FLATTEN($0), COUNT($1);
>>
>> G0 = FILTER F BY $2 > 1; -- any that match
>> G1 = FILTER F BY $2 < 2; -- any that don't match
>>
>> H0 = GROUP G0 BY $0;
>> H1 = GROUP G1 BY $0;
>>
>>
>> J0 = FOREACH H0 GENERATE $0, COUNT($1);
>> J1 = FOREACH H1 GENERATE $0, 0;
>>
>> K = UNION J0, J1;
>>
>> DUMP K;
>> /*
>> (q2,0)
>> (q3,0)
>> (q1,2)
>> */
>>
>>
>> Barclay Dunn
>>
>>
>> On 6/20/13 10:07 AM, Jacob Perkins wrote:
>>> Hi,
>>>
>>> This should just be a simple cogroup.
>>>
>>> A = load 'file1.txt' as (q:chararray, d:chararray);
>>> B = load 'file2.txt' as (q:chararray, d:chararray);
>>>
>>> counts = foreach (cogroup A by q, B by q) {
>>> num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>>> generate
>>> group as q,
>>> num_matches as num_matches;
>>> };
>>>
>>> dump counts;
>>>
>>> (q1,2)
>>> (q2,0)
>>> (q3,0)
>>>
>>> --jacob
>>> @thedatachef
>>>
>>> On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a problem statement wherein I have to compare two files and get
>>>> the count of matching attributes.
>>>>
>>>> For ex:
>>>> File 1: file1.txt
>>>>
>>>> q1 d1
>>>> q1 d2
>>>> q2 d3
>>>> q2 d1
>>>>
>>>> File 2: file2.txt
>>>> q1 d1
>>>> q1 d2
>>>> q3 d3
>>>>
>>>> Now what I need is for each distinct q the count of matching d's
>>>>
>>>> For ex, the output should be
>>>> q1 2 (q1 d1 and q1 d2 are matching in both
>>>> the files hence count is 2)
>>>> q2 0 (has no d's matching)
>>>> q3 0
>>>>
>>>> Any idea how this can be achieved?
>>>>
>>>> Thnx in advance
>>>>
>>>> -Sid
>>>>
>>>>
>>>>
>>>> DISCLAIMER
>>>> ==========
>>>> This e-mail may contain privileged and confidential information which is
>>>> the property of Persistent Systems Ltd. It is intended only for the use of
>>>> the individual or entity to which it is addressed. If you are not the
>>>> intended recipient, you are not authorized to read, retain, copy, print,
>>>> distribute or use this message. If you have received this communication in
>>>> error, please notify the sender and delete all copies of this message.
>>>> Persistent Systems Ltd. does not accept any liability for virus infected
>>>> mails.
>>