mapred.child.java.opts should be in the gigabytes; 200M is way too low. Check this Stack Overflow thread for comments on how to ensure your setting actually takes effect -- it's possible you are not propagating it to the job. If you change it in the Hadoop config files, you need to restart the MR daemons (JT and TTs).

http://stackoverflow.com/questions/8464048/out-of-memory-error-in-hadoop
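One way to rule out a propagation problem is to set the option from the Pig script itself rather than relying on the cluster config files. A minimal sketch, assuming your Pig version passes SET properties through to the Hadoop job configuration, and with a heap size that is only illustrative:

```pig
-- Assumption: this Pig version forwards SET properties to the job configuration.
-- The 1024m value is illustrative; choose what your task slots can afford.
SET mapred.child.java.opts '-Xmx1024m';
```

If it works, the value should show up in the submitted job's configuration (job.xml in the JobTracker UI), which is a quick way to confirm that the tasks actually received it.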
I'll take a look at your script next time I have a few minutes, but try this first -- 200M is definitely too low to get much done in Hadoop.

On Fri, Nov 2, 2012 at 3:17 AM, W W <[email protected]> wrote:

> hi Dmitriy
>
> Thanks for your explanation!
> I think splitting on $2 is not easy, because what I am doing is actually
> rolling up a table, which means they cannot be obtained by a join.
> Here is the whole script with schema, although I omitted many FLATTENs.
>
> IDF_VALID= LOAD '/user/hadoop/idf.dat'
> USING PigStorage('^A') AS (
>
>   ast_id : int,
>   value :chararray,
>   pro_id : int,
>   pag_id : int ,
>   bgr_id : int,
>
> );
>
> grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
>
> rollup= FOREACH grouped_recs {
>
>   bombay_code= FILTER IDF_VALID BY $2 == 76 ;
>   singapore_code= FILTER IDF_VALID BY $2 == 90 ;
>
>   GENERATE
>
>     FLATTEN(group) as nda_id,
>     FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS bombay_code,
>     FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS singapore_code;
>
> }
>
> STORE rollup INTO 'idf-out-full' USING PigStorage('^A');
>
> Besides, how can I "increase the amount of available heap"? I've changed
> mapred.child.java.opts from -Xmx200m to -Xmx1024m. It seems it doesn't
> help, and that threshold value is still the same.
> When I monitor the java process with the top command, the setting of
> mapred.child.java.opts seems to have NO influence on either VIRT or RES; it
> seems mapred.child.java.opts has been overridden by Pig.
> Do you have any idea about that?
>
> Thanks and Regards
> Xingbang
>
> 2012/11/2 Dmitriy Ryaboy <[email protected]>
>
> > Rather than increase memory, rewrite the script so it does not need so much
> > ram to begin with.
> > You can split on $2, group and generate what you need, then join things
> > back.
> > Hard to tell what exactly you are going for without schemas and expected
> > inputs/outputs.
> >
> > If the hadoop configs are the same, the fact that it's the powerful machine
> > that fails doesn't mean anything -- you are running out of RAM, and you
> > gave all machines the same amount of RAM for the reduce processes. It just
> > happens to be the one that a big group is hashing to.
> >
> > The threshold you are asking about is the threshold after which Pig will
> > try to spill what it can, since GC is imminent. It's defined as 70% of the
> > largest memory pool found on the jvm. This threshold itself is not what you
> > want to increase -- you want to increase the amount of available heap if
> > possible.
> >
> > You can set pig.spill.gc.activation.size (invoke GC if we managed to spill
> > at least this much) and pig.spill.size.threshold (how big a spill must be
> > before it makes sense to spill anything) if you want.
> >
> > D
> >
> > On Thu, Nov 1, 2012 at 2:59 AM, W W <[email protected]> wrote:
> >
> > > hello
> > >
> > > I have just come across a problem with SpillableMemoryManager.
> > > I've searched through lots of discussions containing this keyword, but
> > > they are all different from my problem.
> > >
> > > The problem is
> > >
> > > When I run a pig script, it takes longer to finish the same task on the
> > > powerful machine.
> > > And the log (the part that is not clear to me) of the
> > > task node is
> > >
> > > Weak Node:
> > >
> > > 2001-06-28 04:04:39,356 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call - Collection threshold init = 86048768(84032K) used =
> > > 86048752(84031K) committed = 125304832(122368K) max =
> > > 139853824(136576K)
> > > 2001-06-28 04:04:39,940 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call- Usage threshold init = 86048768(84032K) used = 98041880(95744K)
> > > committed = 125304832(122368K) max = 139853824(136576K)
> > > 2001-06-28 04:06:10,048 INFO org.apache.hadoop.mapred.Task:
> > > Task:attempt_201211010504_0007_r_000018_0 is done. And is in the
> > > process of commiting
> > >
> > > Powerful Node:
> > >
> > > 2012-11-01 06:12:56,801 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call- Usage threshold init = 139853824(136576K) used =
> > > 99240424(96914K) committed = 139853824(136576K) max =
> > > 139853824(136576K)
> > > 2012-11-01 06:13:22,733 INFO
> > > org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> > > call - Collection threshold init = 139853824(136576K) used =
> > > 77466824(75651K) committed = 139853824(136576K) max =
> > > 139853824(136576K)
> > > 2012-11-01 06:15:41,178 INFO org.apache.hadoop.mapred.Task:
> > > Task:attempt_201211010504_0007_r_000014_0 is done. And is in the
> > > process of commiting
> > >
> > > My question is how to control the numbers following labels such as
> > > "Usage threshold init". It seems I can't set them in the config files.
> > > Do they default to some hardware parameters?
> > >
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
> > >
> > > The description of the cluster
> > >
> > > I have a heterogeneous cluster with
> > > 6 virtual machines, each with 4 cores and 8G of memory.
> > > 4 physical machines, each with 24 cores and 32G of memory.
> > >
> > > The hadoop configs are all the same for all nodes (I assigned the same
> > > M/R slots to the powerful machines, even though that wastes capacity)
> > >
> > > The pig script that causes the problem:
> > >
> > > grouped_recs= GROUP IDF_VALID BY ast_id PARALLEL 40;
> > >
> > > rollup= FOREACH grouped_recs {
> > >
> > >   bombay_code= FILTER IDF_VALID BY $2 == 76 ;
> > >   singapore_code= FILTER IDF_VALID BY $2 == 90 ;
> > >
> > >   GENERATE
> > >
> > >     FLATTEN(group) as nda_id,
> > >     FLATTEN((IsEmpty(bombay_code)?null:bombay_code.$1)) AS bombay_code,
> > >     FLATTEN((IsEmpty(singapore_code)?null:singapore_code.$1)) AS singapore_code;
> > >
> > > }
> > >
> > > Thanks & Regards
> > > Xingbang
