-------- Original message --------
From: Jacob Perkins <[email protected]> 
Date: 20/06/2013  20:30  (GMT+00:00) 
To: Barclay Dunn <[email protected]> 
Cc: [email protected] 
Subject: Re: comparing two files using pig 
 
I did not read your original post closely enough. I didn't realize that 
both the d AND the q had to match. It's only slightly more complex: just 
add the d column to the cogroup statement and sum the number of matches:

A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by (q,d), B by (q,d)) {
            num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
            generate
              flatten(group) as (q,d),
              num_matches    as num_matches;
          };

all_matches = foreach (group counts by q) generate group as q, 
SUM(counts.num_matches) as total_matches;

dump all_matches;

(q1,2)
(q2,0)
(q3,0)
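
As a sanity check outside Pig, here is a minimal Python sketch of the same logic: count each (q,d) pair in both files, take the smaller count per pair, and sum per q. The function name `match_counts` and the in-memory lists are assumptions made for illustration; they stand in for the two files from the thread.

```python
from collections import Counter

def match_counts(rows_a, rows_b):
    """Per q, sum over d of min(occurrences in A, occurrences in B)."""
    count_a = Counter(rows_a)
    count_b = Counter(rows_b)
    totals = {}
    # Visit every (q, d) pair seen in either input, like the cogroup does.
    for q, d in set(count_a) | set(count_b):
        # A Counter returns 0 for a missing key, so unmatched pairs add 0.
        totals[q] = totals.get(q, 0) + min(count_a[(q, d)], count_b[(q, d)])
    return totals

# Sample data copied from the thread (hypothetical in-memory stand-ins).
file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3")]
file2_extra = file2 + [("q2", "d4")]  # the extra row from Barclay's test

print(sorted(match_counts(file1, file2).items()))
print(sorted(match_counts(file1, file2_extra).items()))
```

Both calls should give q1 a count of 2 and q2 and q3 a count of 0, since the extra (q2,d4) row has no partner in file1.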

--jacob
@thedatachef

On 06/20/2013 02:06 PM, Barclay Dunn wrote:
> Jacob,
>
> If I run that code with an added row in file2.txt, e.g.,
>
>  $ cat file2.txt
> q1 d1
> q1 d2
> q3 d3
> q2 d4
>
> This gives me mistaken results, i.e.,
>
> (q1,2)
> (q2,1)
> (q3,0)
>
>
> I am new at this so I apologize for the ponderous pace of the 
> following. It can no doubt be shortened. But it gets the correct 
> results with either data set.
>
> -- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
> set io.sort.mb 10;
>
> A = LOAD '../../../input/file1.txt' using PigStorage(' ') as 
> (aa:chararray, ab:chararray);
> B = LOAD '../../../input/file2.txt' using PigStorage(' ') as 
> (ba:chararray, bb:chararray);
>
> C = UNION A, B;
> D = COGROUP C by ($0, $1);
>
> F = FOREACH D GENERATE FLATTEN($0), COUNT($1);
>
> G0 = FILTER F BY $2 > 1;   -- any that match
> G1 = FILTER F BY $2 < 2;   -- any that don't match
>
> H0 = GROUP G0 BY $0;
> H1 = GROUP G1 BY $0;
>
>
> J0 = FOREACH H0 GENERATE $0, COUNT($1);
> J1 = FOREACH H1 GENERATE $0, 0;
>
> K = UNION J0, J1;
>
> DUMP K;
> /*
> (q2,0)
> (q3,0)
> (q1,2)
> */
>
>
> Barclay Dunn
>
>
> On 6/20/13 10:07 AM, Jacob Perkins wrote:
>> Hi,
>>
>> This should just be a simple cogroup.
>>
>> A = load 'file1.txt' as (q:chararray, d:chararray);
>> B = load 'file2.txt' as (q:chararray, d:chararray);
>>
>> counts = foreach (cogroup A by q, B by q) {
>>                  num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>>                  generate
>>                    group       as q,
>>                    num_matches as num_matches;
>>               };
>>
>> dump counts;
>>
>> (q1,2)
>> (q2,0)
>> (q3,0)
>>
>> --jacob
>> @thedatachef
>>
>> On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:
>>
>>> Hi,
>>>
>>> I have a problem where I have to compare two files and get the 
>>> count of matching attributes.
>>>
>>> For ex:
>>> File 1:  file1.txt
>>>
>>> q1           d1
>>> q1           d2
>>> q2           d3
>>> q2           d1
>>>
>>> File 2: file2.txt
>>> q1           d1
>>> q1           d2
>>> q3           d3
>>>
>>> Now what I need is for each distinct q  the count of matching d's
>>>
>>> For ex, the output should be
>>> q1           2  (q1     d1 and q1            d2 are matching in both the 
>>> files hence count is 2)
>>> q2           0 (has no d's matching)
>>> q3           0
>>>
>>> Any idea how this can be achieved?
>>>
>>> Thanks in advance
>>>
>>> -Sid
>
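
For comparison, the union-and-count script quoted earlier in this thread can be mimicked in plain Python. This is only a sketch: the function name `union_count_rows` is hypothetical, and it assumes, as that script implicitly does, that no (q,d) pair is repeated within a single file, so a combined count above 1 means the pair occurs in both files.

```python
from collections import Counter

def union_count_rows(rows_a, rows_b):
    """Mimic the UNION / COUNT / FILTER / GROUP steps of the quoted script."""
    # C = UNION A, B; D and F = occurrences of each (q, d) in the union.
    pair_counts = Counter(rows_a + rows_b)
    # G0, H0, J0: per q, the number of pairs seen more than once.
    matched = Counter(q for (q, d), n in pair_counts.items() if n > 1)
    # G1, H1, J1: every q that has at least one pair seen only once gets 0.
    unmatched_qs = {q for (q, d), n in pair_counts.items() if n < 2}
    # K = UNION J0, J1 (no de-duplication across the two branches).
    return sorted(list(matched.items()) + [(q, 0) for q in unmatched_qs])

# Sample data copied from the thread (hypothetical in-memory stand-ins).
file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3"), ("q2", "d4")]

print(union_count_rows(file1, file2))
```

On the thread's data this agrees with the cogroup version. One thing worth checking on real data: a q that has both matching and non-matching pairs lands in both branches, so it shows up twice in the final union, once with its match count and once with 0.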
