Uh, no, I am wrong. They are on 20; 18 was 0.4. Yeah, Srikanth, you guys should just upgrade. 0.5 to 0.6 is relatively painless. The jump to 0.7-0.8 is harder, but worth it.
D

On Mon, Mar 14, 2011 at 5:37 PM, Dmitriy Ryaboy <[email protected]> wrote:

> If they are on 5 that means they have bigger problems. They are on Hadoop 18.
>
> D
>
> On Mon, Mar 14, 2011 at 5:29 PM, Thejas M Nair <[email protected]> wrote:
>
>> Fragment-replicate join will also produce an efficient query plan for this
>> use case -
>> http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Replicated+Joins .
>> It is available in 0.5 as well.
>> -Thejas
>>
>> On 3/14/11 3:20 PM, "Paltheru, Srikanth" <[email protected]> wrote:
>>
>> I am using Pig version 0.5. We don't have plans to upgrade it to a newer
>> version. But the problem I have is that the script runs for some files (both
>> larger and smaller than the ones mentioned) but not for this particular one.
>> I get a "GC overhead limit" error.
>> Thanks
>> Sri
>>
>> -----Original Message-----
>> From: Thejas M Nair [mailto:[email protected]]
>> Sent: Monday, March 14, 2011 4:18 PM
>> To: [email protected]; Paltheru, Srikanth
>> Subject: Re: Problems with Join in pig
>>
>> What version of pig are you using? There have been some memory
>> utilization fixes in 0.8. For this use case, you can also use the new
>> scalar feature in 0.8 -
>> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars .
>> That query plan will be more efficient.
>>
>> You might want to build a new version of pig from the svn 0.8 branch because
>> there have been some bug fixes after the release -
>>
>> svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.8
>> cd branch-0.8
>> ant
>>
>> -Thejas
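For reference, a minimal sketch of the fragment-replicate join Thejas suggests, written against the 0.8 syntax from the linked docs and reusing the relation names from Srikanth's script quoted below. This assumes MIN_HIT_DATA (a one-row relation) is small enough to fit in memory, which is what allows it to be replicated to every map task; the large relation must be listed first.

-- Replicated join: large relation first, small relation(s) after.
-- The small relations are loaded into memory on each map, so the join
-- happens map-side with no reduce-side shuffle of MAX_VISIT_TIME.
JOINED_MAX_VISIT_TIME_DATA = JOIN MAX_VISIT_TIME BY DUMMY_KEY,
                                  MIN_HIT_DATA BY DUMMY_KEY USING 'replicated';
MIN_MAX_VISIT_HIT_TIME = FOREACH JOINED_MAX_VISIT_TIME_DATA GENERATE
    MAX_VISIT_TIME::visid_high, MAX_VISIT_TIME::visid_low,
    MAX_VISIT_TIME::MAX_VISIT_START_TIME,
    MIN_HIT_DATA::MIN_HIT_TIME_GMT, MIN_HIT_DATA::MAX_HIT_TIME_GMT;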
>> On 3/14/11 1:40 PM, "Paltheru, Srikanth" <[email protected]> wrote:
>>
>> > The following pig script runs fine without the 2GB memory setting (see
>> > in yellow), but fails with the memory setting. I am not sure what's
>> > happening. It's a simple operation of joining one tuple (of 1 row) with
>> > the other tuple.
>> > Here is what I am trying to do:
>> >
>> > 1. grouping all SELECT_HIT_TIME_DATA into a single tuple by doing a GROUP ALL.
>> > 2. getting the min and max of that set and putting it into MIN_HIT_DATA.
>> >    This is a tuple with a single row.
>> > 3. then grouping SELECT_MAX_VISIT_TIME_DATA by visid,
>> > 4. then generating DUMMY_KEY for every row, along with MAX of start time.
>> > 5. then trying to join the single tuple in 2 with all tuples generated
>> >    in 4 to get a min time and a max time.
>> >
>> > Code:
>> > Shell prompt:
>> > ## setting heap size to 2 GB
>> > PIG_OPTS="$PIG_OPTS -Dmapred.child.java.opts=-Xmx2048m"
>> > export PIG_OPTS
>> >
>> > Pig/Grunt:
>> >
>> > RAW_DATA = LOAD '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgesp360/wdgesp360_2011-01-05*.tsv.gz' USING PigStorage('\t');
>> > FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY $6 <= 0;
>> > SELECT_CAST_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, (int)$0 AS hit_time_gmt, (long)$2 AS visid_high, (long)$3 AS visid_low, (chararray)$5 AS truncated_hit;
>> > SELECT_DATA = FILTER SELECT_CAST_DATA BY truncated_hit == 'N';
>> > --MIN AND MAX_HIT_TIME_GMT FOR THE FILE/SUITE
>> > SELECT_HIT_TIME_DATA = FOREACH SELECT_DATA GENERATE (int)hit_time_gmt;
>> > GROUPED_ALL_DATA = GROUP SELECT_HIT_TIME_DATA ALL PARALLEL 100;
>> > MIN_HIT_DATA = FOREACH GROUPED_ALL_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, MIN(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MIN_HIT_TIME_GMT, MAX(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MAX_HIT_TIME_GMT;
>> > ---MAX_VISIT_START_TIME BY VISITOR_ID
>> > SELECT_MAX_VISIT_TIME_DATA = FOREACH SELECT_DATA GENERATE visid_high, visid_low, visit_start_time_gmt;
>> > GROUP_BY_VISID_MAX_VISIT_TIME_DATA = GROUP SELECT_MAX_VISIT_TIME_DATA BY (visid_high, visid_low) PARALLEL 100;
>> > MAX_VISIT_TIME = FOREACH GROUP_BY_VISID_MAX_VISIT_TIME_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, FLATTEN(group.visid_high) AS visid_high, FLATTEN(group.visid_low) AS visid_low, MAX(SELECT_MAX_VISIT_TIME_DATA.visit_start_time_gmt) AS MAX_VISIT_START_TIME;
>> > JOINED_MAX_VISIT_TIME_DATA = COGROUP MAX_VISIT_TIME BY DUMMY_KEY OUTER, MIN_HIT_DATA BY DUMMY_KEY OUTER PARALLEL 100;
>> > MIN_MAX_VISIT_HIT_TIME = FOREACH JOINED_MAX_VISIT_TIME_DATA GENERATE FLATTEN(MAX_VISIT_TIME.visid_high), FLATTEN(MAX_VISIT_TIME.visid_low), FLATTEN(MAX_VISIT_TIME.MAX_VISIT_START_TIME), FLATTEN(MIN_HIT_DATA.MIN_HIT_TIME_GMT), FLATTEN(MIN_HIT_DATA.MAX_HIT_TIME_GMT);
>> > DUMP MIN_MAX_VISIT_HIT_TIME;
>> >
>> > Can anyone please guide me through this problem?
>> > Thanks
>> > Sri
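For reference, a minimal sketch of the 0.8 scalar rewrite Thejas points to (it assumes an upgrade to Pig 0.8 and reuses the relation names from the script above): because MIN_HIT_DATA has exactly one row, its fields can be projected directly as scalars inside another FOREACH, which removes the DUMMY_KEY COGROUP and the FLATTEN step entirely.

-- Pig 0.8+ "casting relations to scalars": MIN_HIT_DATA is a single-row
-- relation, so its fields can be referenced while iterating MAX_VISIT_TIME.
MIN_MAX_VISIT_HIT_TIME = FOREACH MAX_VISIT_TIME GENERATE
    visid_high, visid_low, MAX_VISIT_START_TIME,
    MIN_HIT_DATA.MIN_HIT_TIME_GMT AS MIN_HIT_TIME_GMT,
    MIN_HIT_DATA.MAX_HIT_TIME_GMT AS MAX_HIT_TIME_GMT;
DUMP MIN_MAX_VISIT_HIT_TIME;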
