Thanks for your help Dmitriy!
I've found the problem of the powerful machine being slow than weak machine.
The heap size is not the answer to the problem of powerful machine being
slower than week one.
It's the because the GC time on the powerful machine is more than twice on
those week ones.
In my case, JVM by default assign the powerful machine 18 GC threads(there
are 24 cores on one Node) while on the weak machine only 4(only 4 cores
on the Node) threads of GC. And the memory are the same, so the overhead
of GC on the powerful machine dominates.
I think that's the main reason of my problem.
Besides,I think the SurvivorRatio of Java heap also contributes to that.
My guess is for pig, most of the data on the flow will be somehow garbage
collected, so if the Survivor area too bigger(given that the New Generation
in JVM is constant), it means Eden area is smaller. Then more gc is needed.
There should be a pivotal point for the SurvivorRation.
my solution is add the following to mapred-site.xml.
<property>
<name>mapred.child.java.opts</name>
<value> -XX:ParallelGCThreads=4 -XX:SurvivorRatio=20</value>
</property>
Thanks
Regards
Xingbang Wang
2012/11/3 Dmitriy Ryaboy <[email protected]>
> mapred.child.java.opts should be in the gigabytes, 200M is way too low.
> Check this stack overflow thread for comments on how to ensure your setting
> actually takes effect -- it's possible you are not propagating it to the
> job. If you change it in the hadoop config files, you need to restart the
> MR daemons (JT and TTs).
> http://stackoverflow.com/questions/8464048/out-of-memory-error-in-hadoop
>
> I'll take a look at your script next time I have a few minutes, but try
> this first -- 200M is definitely too low to get much done in Hadoop.
>
>
> On Fri, Nov 2, 2012 at 3:17 AM, W W <[email protected]> wrote:
>
> > hi Dmitriy
> > Thanks for your explanation!
> > I think split on $2 is not easy because what I am doing is actually
> > rolling-up a table,which means they can not be get by join.
> > Here is the whole script with schema although I omitted many FLATTENs .
> >
> > IDF_VALID= LOAD '/user/hadoop/idf.dat'
> > USING PigStorage('^A') AS (
> >
> > ast_id : int,
> > value :chararray,
> > pro_id : int,
> > pag_id : int ,
> > bgr_id : int,
> >
> > );
> >
> > grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
> >
> > rollup= FOREACH grouped_recs {
> >
> > bombay_code= FILTER IDF_VALID BY $2 == 76 ;
> > singapore_code= FILTER IDF_VALID BY $2 == 90 ;
> >
> > GENERATE
> >
> > FLATTEN(group) as nda_id,
> > FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS
> bombay_code
> > ,
> > FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS
> > singapore_code;
> >
> > }
> >
> > STORE rollup INTO 'idf-out-full' USING PigStorage('^A');
> >
> >
> >
> > Besides, how can I " increase the amount of available heap". I've
> changed
> > mapred.child.java.opts from -Xmx200m to -Xmx1024m . It seems it
> doesn't
> > help. And that threshold value is still the same.
> > when I monitor the java process by top command, it seems the setting of
> > mapred.child.java.opts have NO influence on both VIRT and RES, it seems
> > mapred.child.java.opts has been overrided by pig.
> > Do you have any idea about that ?
> >
> > Thanks and Regards
> > Xingbang
> >
> >
> >
> > 2012/11/2 Dmitriy Ryaboy <[email protected]>
> >
> > > Rather than increase memory, rewrite the script so it does not need so
> > much
> > > ram to begin with.
> > > You can split on $2, group and generate what you need, then join things
> > > back.
> > > Hard to tell what exactly you are going for without schemas and
> expected
> > > inputs/outputs.
> > >
> > > If the hadoop configs are the same, the fact that it's the powerful
> > machine
> > > that fails doesn't mean anything -- you are running out of RAM, and you
> > > gave all machines the same amount of RAM for the reduce processes. It
> > just
> > > happens to be the one that a big group is hashing to.
> > >
> > > The threshold you are asking about is the threshold after which Pig
> will
> > > try to spill what it can, since GC is imminent. It's defined as 70% of
> > the
> > > largest memory pool found on the jvm. This threshold itself is not what
> > you
> > > want to increase -- you want to increase the amount of available heap
> if
> > > possible.
> > >
> > > You can set pig.spill.gc.activation.size (invoke GC if we managed to
> > spill
> > > at least this much) and pig.spill.size.threshold (how big a spill must
> be
> > > before it makes sense to spill anything) if you want.
> > >
> > > D
> > >
> > >
> > >
> > >
> > > On Thu, Nov 1, 2012 at 2:59 AM, W W <[email protected]> wrote:
> > >
> > > > hello
> > > >
> > > > I just have came across a problem with SpillableMemoryManager.
> > > > I've searched lots of discussion contained this key, but they are all
> > > > different from my problem.
> > > >
> > > > The problem is
> > > >
> > > > When I run a pig script,it takes longer to finish the same task on
> the
> > > > powerful machine. And the log(the part that is not clear to me ) of
> > the
> > > > task node is
> > > >
> > > > Week Node:
> > > >
> > > > 2001-06-28 04:04:39,356 INFO
> > > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > > call - Collection threshold init = 86048768(84032K) used =
> > > > 86048752(84031K) committed = 125304832(122368K) max =
> > > > 139853824(136576K)
> > > > 2001-06-28 04:04:39,940 INFO
> > > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > > call- Usage threshold init = 86048768(84032K) used = 98041880(95744K)
> > > > committed = 125304832(122368K) max = 139853824(136576K)
> > > > 2001-06-28 04:06:10,048 INFO org.apache.hadoop.mapred.Task:
> > > > Task:attempt_201211010504_0007_r_000018_0 is done. And is in the
> > > > process of commiting
> > > >
> > > >
> > > > Powerful Node:
> > > >
> > > > 2012-11-01 06:12:56,801 INFO
> > > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > > call- Usage threshold init = 139853824(136576K) used =
> > > > 99240424(96914K) committed = 139853824(136576K) max =
> > > > 139853824(136576K)
> > > > 2012-11-01 06:13:22,733 INFO
> > > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > > call - Collection threshold init = 139853824(136576K) used =
> > > > 77466824(75651K) committed = 139853824(136576K) max =
> > > > 139853824(136576K)
> > > > 2012-11-01 06:15:41,178 INFO org.apache.hadoop.mapred.Task:
> > > > Task:attempt_201211010504_0007_r_000014_0 is done. And is in the
> > > > process of commiting
> > > >
> > > >
> > > > My question is how to control the number following those like the
> > > "Usage
> > > > threshold init" , It seems I can't set them in the config files.
> > > > Are they default to some hardware parameters?
> > > >
> > > >
> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
> > > >
> > > >
> > > > The description of the cluster
> > > >
> > > > I have a heterogeneous cluster with
> > > > 6 virtual machines with 4-core and 8G memory for each.
> > > > 4 physical machines with 24-core and 32Gmemory for each.
> > > >
> > > > The hadoop configs are all the same for all nodes(I assigned the same
> > > slots
> > > > for M/R to the powerful machines even there is a waste)
> > > >
> > > >
> > > >
> > > >
> > > > The pig script that cause the problem:
> > > >
> > > > grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
> > > >
> > > > rollup= FOREACH grouped_recs {
> > > >
> > > > bombay_code= FILTER IDF_VALID BY $2 == 76 ;
> > > > singapore_code= FILTER IDF_VALID BY $2 == 90 ;
> > > >
> > > > GENERATE
> > > >
> > > > FLATTEN(group) as nda_id,
> > > > FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS
> > > bombay_code
> > > > ,
> > > > FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS
> > > > singapore_code;
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > Thanks&Regards
> > > > Xingbang
> > > >
> > >
> >
>