Uh, no, I am wrong. They are on 20; 18 was 0.4. Yeah, Srikanth, you guys should just upgrade. 0.5 to 0.6 is relatively painless. The jump to 0.7-0.8 is harder, but worth it.
D

On Mon, Mar 14, 2011 at 5:37 PM, Dmitriy Ryaboy <[email protected]> wrote:

> If they are on 5 that means they have bigger problems. They are on Hadoop 18.
>
> D
>
> On Mon, Mar 14, 2011 at 5:29 PM, Thejas M Nair <[email protected]> wrote:
>
>> Fragment-replicate join will also produce an efficient query plan for this
>> use case -
>> http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Replicated+Joins .
>> It is available in 0.5 as well.
>> -Thejas
>>
>> On 3/14/11 3:20 PM, "Paltheru, Srikanth" <[email protected]> wrote:
>>
>> I am using Pig version 0.5. We don't have plans to upgrade it to a newer
>> version. But the problem I have is that the script runs for some files (both
>> larger and smaller than the ones mentioned) but not for this particular one.
>> I get a "GC overhead limit" error.
>> Thanks
>> Sri
>>
>> -----Original Message-----
>> From: Thejas M Nair [mailto:[email protected]]
>> Sent: Monday, March 14, 2011 4:18 PM
>> To: [email protected]; Paltheru, Srikanth
>> Subject: Re: Problems with Join in pig
>>
>> What version of pig are you using? There have been some memory
>> utilization fixes in 0.8. For this use case, you can also use the new
>> scalar feature in 0.8 -
>> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars .
>> That query plan will be more efficient.
>>
>> You might want to build a new version of pig from the svn 0.8 branch because
>> there have been some bug fixes after the release -
>>
>> svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.8
>> cd branch-0.8
>> ant
>>
>> -Thejas
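For reference, a minimal sketch of the fragment-replicate join Thejas suggests, written against the 0.8 syntax from the linked docs and reusing the relation names from Srikanth's script quoted below. This assumes MIN_HIT_DATA (a one-row relation) is small enough to fit in memory, which is what allows it to be replicated to every map task; the large relation must be listed first.

-- Replicated join: large relation first, small relation(s) after.
-- The small relations are loaded into memory on each map, so the join
-- happens map-side with no reduce-side shuffle of MAX_VISIT_TIME.
JOINED_MAX_VISIT_TIME_DATA = JOIN MAX_VISIT_TIME BY DUMMY_KEY,
                                  MIN_HIT_DATA BY DUMMY_KEY USING 'replicated';
MIN_MAX_VISIT_HIT_TIME = FOREACH JOINED_MAX_VISIT_TIME_DATA GENERATE
    MAX_VISIT_TIME::visid_high, MAX_VISIT_TIME::visid_low,
    MAX_VISIT_TIME::MAX_VISIT_START_TIME,
    MIN_HIT_DATA::MIN_HIT_TIME_GMT, MIN_HIT_DATA::MAX_HIT_TIME_GMT;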
>> On 3/14/11 1:40 PM, "Paltheru, Srikanth" <[email protected]> wrote:
>>
>> > The following pig script runs fine without the 2GB memory setting (see
>> > in yellow), but fails with the memory setting. I am not sure what's
>> > happening. It's a simple operation of joining one tuple (of 1 row) with
>> > the other tuple.
>> > Here is what I am trying to do:
>> >
>> > 1. grouping all SELECT_HIT_TIME_DATA into a single tuple by doing a GROUP ALL.
>> > 2. getting the min and max of that set and putting it into MIN_HIT_DATA.
>> >    This is a tuple with a single row.
>> > 3. then grouping SELECT_MAX_VISIT_TIME_DATA by visid,
>> > 4. then generating DUMMY_KEY for every row, along with MAX of start time.
>> > 5. then trying to join the single tuple in 2 with all tuples generated
>> >    in 4 to get a min time and a max time.
>> >
>> > Code:
>> > Shell prompt:
>> > ## setting heap size to 2 GB
>> > PIG_OPTS="$PIG_OPTS -Dmapred.child.java.opts=-Xmx2048m"
>> > export PIG_OPTS
>> >
>> > Pig/Grunt:
>> >
>> > RAW_DATA = LOAD '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgesp360/wdgesp360_2011-01-05*.tsv.gz' USING PigStorage('\t');
>> > FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY $6 <= 0;
>> > SELECT_CAST_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, (int)$0 AS hit_time_gmt, (long)$2 AS visid_high, (long)$3 AS visid_low, (chararray)$5 AS truncated_hit;
>> > SELECT_DATA = FILTER SELECT_CAST_DATA BY truncated_hit == 'N';
>> > --MIN AND MAX_HIT_TIME_GMT FOR THE FILE/SUITE
>> > SELECT_HIT_TIME_DATA = FOREACH SELECT_DATA GENERATE (int)hit_time_gmt;
>> > GROUPED_ALL_DATA = GROUP SELECT_HIT_TIME_DATA ALL PARALLEL 100;
>> > MIN_HIT_DATA = FOREACH GROUPED_ALL_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, MIN(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MIN_HIT_TIME_GMT, MAX(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MAX_HIT_TIME_GMT;
>> > ---MAX_VISIT_START_TIME BY VISITOR_ID
>> > SELECT_MAX_VISIT_TIME_DATA = FOREACH SELECT_DATA GENERATE visid_high, visid_low, visit_start_time_gmt;
>> > GROUP_BY_VISID_MAX_VISIT_TIME_DATA = GROUP SELECT_MAX_VISIT_TIME_DATA BY (visid_high, visid_low) PARALLEL 100;
>> > MAX_VISIT_TIME = FOREACH GROUP_BY_VISID_MAX_VISIT_TIME_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, FLATTEN(group.visid_high) AS visid_high, FLATTEN(group.visid_low) AS visid_low, MAX(SELECT_MAX_VISIT_TIME_DATA.visit_start_time_gmt) AS MAX_VISIT_START_TIME;
>> > JOINED_MAX_VISIT_TIME_DATA = COGROUP MAX_VISIT_TIME BY DUMMY_KEY OUTER, MIN_HIT_DATA BY DUMMY_KEY OUTER PARALLEL 100;
>> > MIN_MAX_VISIT_HIT_TIME = FOREACH JOINED_MAX_VISIT_TIME_DATA GENERATE FLATTEN(MAX_VISIT_TIME.visid_high), FLATTEN(MAX_VISIT_TIME.visid_low), FLATTEN(MAX_VISIT_TIME.MAX_VISIT_START_TIME), FLATTEN(MIN_HIT_DATA.MIN_HIT_TIME_GMT), FLATTEN(MIN_HIT_DATA.MAX_HIT_TIME_GMT);
>> > DUMP MIN_MAX_VISIT_HIT_TIME;
>> >
>> > Can anyone please guide me through this problem?
>> > Thanks
>> > Sri
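For reference, a minimal sketch of the 0.8 scalar rewrite Thejas points to (it assumes an upgrade to Pig 0.8 and reuses the relation names from the script above): because MIN_HIT_DATA has exactly one row, its fields can be projected directly as scalars inside another FOREACH, which removes the DUMMY_KEY COGROUP and the FLATTEN step entirely.

-- Pig 0.8+ "casting relations to scalars": MIN_HIT_DATA is a single-row
-- relation, so its fields can be referenced while iterating MAX_VISIT_TIME.
MIN_MAX_VISIT_HIT_TIME = FOREACH MAX_VISIT_TIME GENERATE
    visid_high, visid_low, MAX_VISIT_START_TIME,
    MIN_HIT_DATA.MIN_HIT_TIME_GMT AS MIN_HIT_TIME_GMT,
    MIN_HIT_DATA.MAX_HIT_TIME_GMT AS MAX_HIT_TIME_GMT;
DUMP MIN_MAX_VISIT_HIT_TIME;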
