The problem has been solved. It is related to bug PIG-2923 (PIG-2917,
PIG-2918). (Refer to [email protected])

Dmitriy actually fixed it 2 months ago. When I use pig-0.11, my problem
is gone, and the GC time falls from 80s to 0.5s.

Thanks for your effort, Dmitriy.


Xingbang.Wang

2012/11/7 W W <[email protected]>

> Thanks for your help Dmitriy!
>
> I've found the cause of the powerful machine being slower than the weak
> machine.
>
> The heap size is not the answer to the problem of the powerful machine being
> slower than the weak one.
>
> It's because the GC time on the powerful machine is more than twice that
> on the weak ones.
> In my case, the JVM by default assigns the powerful machine 18 GC threads
> (there are 24 cores on one node), while the weak machine gets only 4 GC
> threads (only 4 cores on the node). The memory is the same, so the overhead
> of GC on the powerful machine dominates.
>
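For what it's worth, the 18 vs. 4 figures above match the default
ParallelGCThreads heuristic that, as far as I know, HotSpot uses on x86: one
GC thread per core up to 8 cores, then 5/8 of a thread for each additional
core. The sketch below only reproduces that arithmetic; the class and method
names are made up for illustration and are not part of any JVM API.

```java
// Illustrative only: mirrors the assumed HotSpot default for ParallelGCThreads.
final class GcThreadDefaults {
    private GcThreadDefaults() {}

    /** Assumed default number of parallel GC threads for a given core count. */
    static int defaultParallelGcThreads(int cpus) {
        if (cpus <= 8) {
            return cpus; // one GC thread per core on small machines
        }
        // Beyond 8 cores, 5/8 of a thread is added per extra core (integer math).
        return 8 + (cpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        System.out.println("4 cores  -> " + defaultParallelGcThreads(4));   // 4
        System.out.println("24 cores -> " + defaultParallelGcThreads(24));  // 18
        int here = Runtime.getRuntime().availableProcessors();
        System.out.println(here + " cores (this host) -> " + defaultParallelGcThreads(here));
    }
}
```

On a real node, the value actually in effect can be checked with
java -XX:+PrintFlagsFinal -version and looking for ParallelGCThreads.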
> I think that's the main reason of my problem.
>
> Besides, I think the SurvivorRatio of the Java heap also contributes to that.
> My guess is that for Pig, most of the data in the flow will somehow be garbage
> collected, so if the survivor area is too big (given that the new generation
> in the JVM is constant), the Eden area is smaller and more GCs are needed.
> There should be a pivotal point for the SurvivorRatio.
>
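To put rough numbers on the Eden/survivor trade-off: in HotSpot,
-XX:SurvivorRatio=N means Eden is N times the size of one survivor space, and
the young generation holds Eden plus two survivor spaces. The sketch below
just works out the resulting Eden share. The class name is hypothetical, and
it assumes the collector honours the flag as given (adaptive sizing can
change the real layout).

```java
// Illustrative arithmetic for -XX:SurvivorRatio; not a JVM API.
final class SurvivorRatioMath {
    private SurvivorRatioMath() {}

    /** Fraction of the young generation occupied by Eden: N / (N + 2). */
    static double edenFraction(int survivorRatio) {
        return (double) survivorRatio / (survivorRatio + 2);
    }

    /** Fraction of the young generation occupied by ONE survivor space: 1 / (N + 2). */
    static double survivorFraction(int survivorRatio) {
        return 1.0 / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        for (int ratio : new int[] {8, 20}) {
            System.out.printf("SurvivorRatio=%d: Eden %.1f%%, each survivor %.1f%%%n",
                    ratio, 100 * edenFraction(ratio), 100 * survivorFraction(ratio));
        }
    }
}
```

So moving from a ratio of 8 to 20 grows Eden from 80% to about 91% of the
young generation, which fits the guess that a larger Eden means fewer
collections for short-lived data.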
>
> My solution is to add the following to mapred-site.xml:
>         <property>
>                 <name>mapred.child.java.opts</name>
>                 <value>-XX:ParallelGCThreads=4 -XX:SurvivorRatio=20</value>
>         </property>
>
>
> Thanks
> Regards
> Xingbang Wang
>
> 2012/11/3 Dmitriy Ryaboy <[email protected]>
>
>> mapred.child.java.opts should be in the gigabytes, 200M is way too low.
>> Check this Stack Overflow thread for comments on how to ensure your setting
>> actually takes effect -- it's possible you are not propagating it to the
>> job. If you change it in the Hadoop config files, you need to restart the
>> MR daemons (JT and TTs).
>> http://stackoverflow.com/questions/8464048/out-of-memory-error-in-hadoop
>>
>> I'll take a look at your script next time I have a few minutes, but try
>> this first -- 200M is definitely too low to get much done in Hadoop.
>>
>>
>> On Fri, Nov 2, 2012 at 3:17 AM, W W <[email protected]> wrote:
>>
>> > hi Dmitriy
>> > Thanks for your explanation!
>> > I think splitting on $2 is not easy, because what I am doing is actually
>> > rolling up a table, which means the results cannot be obtained by a join.
>> > Here is the whole script with the schema, although I omitted many FLATTENs.
>> >
>> > IDF_VALID = LOAD '/user/hadoop/idf.dat'
>> > USING PigStorage('^A') AS (
>> >
>> >   ast_id : int,
>> >   value : chararray,
>> >   pro_id : int,
>> >   pag_id : int,
>> >   bgr_id : int
>> >
>> > );
>> >
>> > grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
>> >
>> > rollup= FOREACH grouped_recs {
>> >
>> >         bombay_code= FILTER IDF_VALID BY $2 == 76 ;
>> >         singapore_code= FILTER IDF_VALID BY $2 == 90 ;
>> >
>> > GENERATE
>> >
>> >         FLATTEN(group) as nda_id,
>> >         FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS bombay_code,
>> >         FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS singapore_code;
>> >
>> > }
>> >
>> > STORE rollup INTO 'idf-out-full' USING PigStorage('^A');
>> >
>> >
>> >
>> > Besides, how can I "increase the amount of available heap"? I've changed
>> > mapred.child.java.opts from -Xmx200m to -Xmx1024m. It doesn't seem to
>> > help, and that threshold value is still the same.
>> > When I monitor the Java process with the top command, the setting of
>> > mapred.child.java.opts seems to have NO influence on either VIRT or RES,
>> > so it seems mapred.child.java.opts has been overridden by Pig.
>> > Do you have any idea about that?
>> >
>> > Thanks and Regards
>> > Xingbang
>> >
>> >
>> >
>> > 2012/11/2 Dmitriy Ryaboy <[email protected]>
>> >
>> > > Rather than increase memory, rewrite the script so it does not need so
>> > > much RAM to begin with.
>> > > You can split on $2, group and generate what you need, then join things
>> > > back.
>> > > Hard to tell what exactly you are going for without schemas and expected
>> > > inputs/outputs.
>> > >
>> > > If the Hadoop configs are the same, the fact that it's the powerful
>> > > machine that fails doesn't mean anything -- you are running out of RAM,
>> > > and you gave all machines the same amount of RAM for the reduce
>> > > processes. It just happens to be the one that a big group is hashing to.
>> > >
>> > > The threshold you are asking about is the threshold after which Pig will
>> > > try to spill what it can, since GC is imminent. It's defined as 70% of
>> > > the largest memory pool found on the JVM. This threshold itself is not
>> > > what you want to increase -- you want to increase the amount of
>> > > available heap if possible.
>> > >
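A rough way to see where such a number comes from on a given node is to ask
the JVM for its heap memory pools and take 70% of the largest one. The
sketch below does that with the standard java.lang.management API. It is not
Pig's actual code; the class name, method names and the constant are
illustrative, with the 0.7 fraction taken from the description above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Illustrative sketch only; not taken from Pig's SpillableMemoryManager.
final class SpillThresholdSketch {
    /** Fraction quoted in the message above (70% of the largest pool). */
    static final double THRESHOLD_FRACTION = 0.7;

    private SpillThresholdSketch() {}

    /** Threshold in bytes for a pool with the given maximum size. */
    static long spillThreshold(long poolMaxBytes) {
        return (long) (poolMaxBytes * THRESHOLD_FRACTION);
    }

    /** Largest max size among heap pools, or -1 if no pool reports a max. */
    static long largestHeapPoolMax() {
        long largest = -1;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage(); // null if the pool is invalid
            if (usage != null) {
                largest = Math.max(largest, usage.getMax()); // getMax() is -1 when undefined
            }
        }
        return largest;
    }

    public static void main(String[] args) {
        long max = largestHeapPoolMax();
        if (max < 0) {
            System.out.println("No heap pool reports a maximum size");
        } else {
            System.out.println("largest heap pool max = " + max
                    + " bytes, 70% threshold = " + spillThreshold(max) + " bytes");
        }
    }
}
```

Running it with different -Xmx values shows the pool maximum, and so the
threshold, following the heap size rather than any hardware parameter.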
>> > > You can set pig.spill.gc.activation.size (invoke GC if we managed to
>> > > spill at least this much) and pig.spill.size.threshold (how big a spill
>> > > must be before it makes sense to spill anything) if you want.
>> > >
>> > > D
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Nov 1, 2012 at 2:59 AM, W W <[email protected]> wrote:
>> > >
>> > > > hello
>> > > >
>> > > > I have just come across a problem with SpillableMemoryManager.
>> > > > I've searched lots of discussions containing this keyword, but they
>> > > > are all different from my problem.
>> > > >
>> > > > The problem is
>> > > >
>> > > > When I run a Pig script, it takes longer to finish the same task on
>> > > > the powerful machine. The log of the task node (the part that is not
>> > > > clear to me) is
>> > > >
>> > > > Weak Node:
>> > > >
>> > > > 2001-06-28 04:04:39,356 INFO
>> > > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
>> > > > call - Collection threshold init = 86048768(84032K) used =
>> > > > 86048752(84031K) committed = 125304832(122368K) max =
>> > > > 139853824(136576K)
>> > > > 2001-06-28 04:04:39,940 INFO
>> > > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
>> > > > call- Usage threshold init = 86048768(84032K) used = 98041880(95744K)
>> > > > committed = 125304832(122368K) max = 139853824(136576K)
>> > > > 2001-06-28 04:06:10,048 INFO org.apache.hadoop.mapred.Task:
>> > > > Task:attempt_201211010504_0007_r_000018_0 is done. And is in the
>> > > > process of commiting
>> > > >
>> > > >
>> > > > Powerful Node:
>> > > >
>> > > > 2012-11-01 06:12:56,801 INFO
>> > > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
>> > > > call- Usage threshold init = 139853824(136576K) used =
>> > > > 99240424(96914K) committed = 139853824(136576K) max =
>> > > > 139853824(136576K)
>> > > > 2012-11-01 06:13:22,733 INFO
>> > > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
>> > > > call - Collection threshold init = 139853824(136576K) used =
>> > > > 77466824(75651K) committed = 139853824(136576K) max =
>> > > > 139853824(136576K)
>> > > > 2012-11-01 06:15:41,178 INFO org.apache.hadoop.mapred.Task:
>> > > > Task:attempt_201211010504_0007_r_000014_0 is done. And is in the
>> > > > process of commiting
>> > > >
>> > > >
>> > > > My question is how to control the numbers that follow labels such as
>> > > > "Usage threshold init". It seems I can't set them in the config files.
>> > > > Do they default to some hardware parameters?
>> > > >
>> > > >
>> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
>> > > >
>> > > >
>> > > > The description of the cluster
>> > > >
>> > > > I have a heterogeneous cluster with
>> > > >  6 virtual machines with 4 cores and 8G of memory each.
>> > > >  4 physical machines with 24 cores and 32G of memory each.
>> > > >
>> > > > The Hadoop configs are the same for all nodes (I assigned the same
>> > > > M/R slots to the powerful machines even though that is a waste).
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > The Pig script that causes the problem:
>> > > >
>> > > > grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
>> > > >
>> > > > rollup= FOREACH grouped_recs {
>> > > >
>> > > >         bombay_code= FILTER IDF_VALID BY $2 == 76 ;
>> > > >         singapore_code= FILTER IDF_VALID BY $2 == 90 ;
>> > > >
>> > > > GENERATE
>> > > >
>> > > >         FLATTEN(group) as nda_id,
>> > > >         FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS bombay_code,
>> > > >         FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS singapore_code;
>> > > >
>> > > > }
>> > > >
>> > > >
>> > > >
>> > > > Thanks&Regards
>> > > > Xingbang
>> > > >
>> > >
>> >
>>
>
>
