mapred.child.java.opts should be in the gigabytes, 200M is way too low.
Check this stack overflow thread for comments on how to ensure your setting
actually takes effect -- it's possible you are not propagating it to the
job. If you change it in the hadoop config files, you need to restart the
MR daemons (JT and TTs).
http://stackoverflow.com/questions/8464048/out-of-memory-error-in-hadoop

I'll take a look at your script next time I have a few minutes, but try
this first -- 200M is definitely too low to get much done in Hadoop.


On Fri, Nov 2, 2012 at 3:17 AM, W W <[email protected]> wrote:

> hi Dmitriy
> Thanks for your explanation!
> I think split on $2 is not easy because what I am doing is actually
> rolling-up a table,which means they can not be get by join.
> Here is the whole script with schema although I omitted many FLATTENs .
>
> IDF_VALID= LOAD '/user/hadoop/idf.dat'
> USING PigStorage('^A') AS (
>
>   ast_id : int,
>   value :chararray,
>   pro_id : int,
>   pag_id  : int ,
>   bgr_id : int,
>
> );
>
> grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
>
> rollup= FOREACH grouped_recs {
>
>         bombay_code= FILTER IDF_VALID BY $2 == 76 ;
>         singapore_code= FILTER IDF_VALID BY $2 == 90 ;
>
> GENERATE
>
>         FLATTEN(group) as nda_id,
>         FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS bombay_code
> ,
>   FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS
> singapore_code;
>
> }
>
> STORE rollup INTO 'idf-out-full' USING PigStorage('^A');
>
>
>
> Besides,  how can I "  increase the amount of available heap". I've changed
> mapred.child.java.opts   from -Xmx200m  to -Xmx1024m .  It seems it doesn't
> help. And that threshold value is still the same.
> when I monitor the java process by top command, it seems the setting of
> mapred.child.java.opts have NO influence on both VIRT and RES, it seems
>  mapred.child.java.opts has been overrided by pig.
>  Do you have any idea about that ?
>
> Thanks and Regards
> Xingbang
>
>
>
> 2012/11/2 Dmitriy Ryaboy <[email protected]>
>
> > Rather than increase memory, rewrite the script so it does not need so
> much
> > ram to begin with.
> > You can split on $2, group and generate what you need, then join things
> > back.
> > Hard to tell what exactly you are going for without schemas and expected
> > inputs/outputs.
> >
> > If the hadoop configs are the same, the fact that it's the powerful
> machine
> > that fails doesn't mean anything -- you are running out of RAM, and you
> > gave all machines the same amount of RAM for the reduce processes. It
> just
> > happens to be the one that a big group is hashing to.
> >
> > The threshold you are asking about is the threshold after which Pig will
> > try to spill what it can, since GC is imminent. It's defined as 70% of
> the
> > largest memory pool found on the jvm. This threshold itself is not what
> you
> > want to increase -- you want to increase the amount of available heap if
> > possible.
> >
> > You can set pig.spill.gc.activation.size (invoke GC if we managed to
> spill
> > at least this much) and pig.spill.size.threshold (how big a spill must be
> > before it makes sense to spill anything) if you want.
> >
> > D
> >
> >
> >
> >
> > On Thu, Nov 1, 2012 at 2:59 AM, W W <[email protected]> wrote:
> >
> > > hello
> > >
> > > I just have came across a problem with SpillableMemoryManager.
> > > I've searched lots of discussion contained this key, but they are all
> > > different from my problem.
> > >
> > > The problem is
> > >
> > > When I run a pig script,it takes longer to finish the same task on the
> > > powerful machine. And the log(the part that is not clear to me )  of
> the
> > > task node is
> > >
> > > Week Node:
> > >
> > > 2001-06-28 04:04:39,356 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call - Collection threshold init = 86048768(84032K) used =
> > > 86048752(84031K) committed = 125304832(122368K) max =
> > > 139853824(136576K)
> > > 2001-06-28 04:04:39,940 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call- Usage threshold init = 86048768(84032K) used = 98041880(95744K)
> > > committed = 125304832(122368K) max = 139853824(136576K)
> > > 2001-06-28 04:06:10,048 INFO org.apache.hadoop.mapred.Task:
> > > Task:attempt_201211010504_0007_r_000018_0 is done. And is in the
> > > process of commiting
> > >
> > >
> > > Powerful Node:
> > >
> > > 2012-11-01 06:12:56,801 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call- Usage threshold init = 139853824(136576K) used =
> > > 99240424(96914K) committed = 139853824(136576K) max =
> > > 139853824(136576K)
> > > 2012-11-01 06:13:22,733 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call - Collection threshold init = 139853824(136576K) used =
> > > 77466824(75651K) committed = 139853824(136576K) max =
> > > 139853824(136576K)
> > > 2012-11-01 06:15:41,178 INFO org.apache.hadoop.mapred.Task:
> > > Task:attempt_201211010504_0007_r_000014_0 is done. And is in the
> > > process of commiting
> > >
> > >
> > > My question is how to control the number following  those like  the
> >  "Usage
> > > threshold init" , It seems I can't set them in the config files.
> > > Are they default to some hardware parameters?
> > >
> > >
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
> > >
> > >
> > > The description of the cluster
> > >
> > > I have a heterogeneous cluster with
> > >  6 virtual machines with 4-core and 8G memory for each.
> > >  4 physical machines with 24-core and 32Gmemory for each.
> > >
> > > The hadoop configs are all the same for all nodes(I assigned the same
> > slots
> > > for M/R to the powerful machines even there is a waste)
> > >
> > >
> > >
> > >
> > > The pig script that cause the problem:
> > >
> > > grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
> > >
> > > rollup= FOREACH grouped_recs {
> > >
> > >         bombay_code= FILTER IDF_VALID BY $2 == 76 ;
> > >         singapore_code= FILTER IDF_VALID BY $2 == 90 ;
> > >
> > > GENERATE
> > >
> > >         FLATTEN(group) as nda_id,
> > >         FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS
> > bombay_code
> > > ,
> > >   FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS
> > > singapore_code;
> > >
> > > }
> > >
> > >
> > >
> > > Thanks&Regards
> > > Xingbang
> > >
> >
>

Reply via email to