There's also a reason for the OOM: I group the data by ALL, so the parallelism is 1. With a big data bag, the reducer OOMs.
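When the combiner cannot be applied, one common workaround for a single-reducer GROUP ... ALL count is to pre-aggregate on a synthetic key first. A minimal sketch, assuming a relation named SourceData; the alias names and the key range of 100 are made up for illustration:

```pig
-- Two-stage count: spread the rows over 100 synthetic keys so each
-- reducer counts only a slice, then sum the 100 partial counts. The
-- final GROUP ALL reducer sees just 100 tuples, not the whole bag.
Keyed   = FOREACH SourceData GENERATE *, (int)(RANDOM() * 100) AS part;
Partial = FOREACH (GROUP Keyed BY part PARALLEL 100)
          GENERATE COUNT(Keyed) AS cnt;
Total   = FOREACH (GROUP Partial ALL) GENERATE SUM(Partial.cnt);
```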
After digging into the Pig source code, I found that replacing the data bag in BinSedesTuple is quite tricky and may cause other unknown problems… Has anybody else encountered the same problem?

Haitao Yao
[email protected]
weibo: @haitao_yao
Skype: haitao.yao.final

On 2012-7-9, at 11:11 AM, Haitao Yao wrote:

> Sorry for the improper statement.
> The problem is the DataBag: BinSedesTuple reads the full contents of the
> DataBag, and when COUNT is applied to the data, it causes the OOM.
> The diagrams also show that most of the objects come from ArrayList.
>
> I want to re-implement the DataBag read by BinSedesTuple so that it just
> holds a reference to the data input and reads the data one by one when the
> iterator is used to access it.
>
> I will give it a shot.
>
> On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:
>
>> BinSedesTuple is just the tuple; changing it won't do anything about the
>> fact that lots of tuples are being loaded.
>>
>> The snippet you provided will not load all the data for computation, since
>> COUNT implements the Algebraic interface (partial counts will be done in
>> combiners).
>>
>> Something else is causing tuples to be materialized. Are you using other
>> UDFs? Can you provide more details on the script? When you run "explain" on
>> "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc.?
>>
>> You can check the "pig.alias" property in the jobconf to identify which
>> relations are being calculated by a given MR job; that might help narrow
>> things down.
>>
>> -Dmitriy
>>
>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
>> hi,
>> I wrote a Pig script in which one of the reducers always OOMs no matter
>> how I change the parallelism.
>> Here's the script snippet:
>>
>> Data = group SourceData all;
>> Result = foreach Data generate group, COUNT(SourceData);
>> store Result into 'XX';
>>
>> I analyzed the dumped Java heap and found that the reason is that the
>> reducer loads all the data for the foreach and count.
>>
>> Can I re-implement BinSedesTuple to avoid the reducers loading all the
>> data for computation?
>>
>> Here's the object domination tree:
>> (image attachment)
>>
>> Here's the jmap result:
>> (image attachment)
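Dmitriy's suggestion to inspect the plan can be tried directly from the Grunt shell. A minimal sketch, assuming the aliases from the snippet above:

```pig
-- Build the same plan and ask Pig to print it.
Data   = GROUP SourceData ALL;
Result = FOREACH Data GENERATE group, COUNT(SourceData);
EXPLAIN Result;
-- In the Map Reduce Plan, COUNT$Initial should appear on the map side,
-- COUNT$Intermediate in the Combine plan, and COUNT$Final in the Reduce
-- plan. If the Combine plan is missing, the combiner was not applied and
-- the single reducer receives the raw bag.
```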

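One way the combiner can be defeated (per Dmitriy's question about other UDFs) is passing the grouped bag to a non-algebraic UDF, or projecting the bag itself, alongside COUNT. A hypothetical illustration; MyNonAlgebraicUdf is an invented name:

```pig
-- Hypothetical: mixing COUNT with a non-algebraic use of the same bag.
-- Pig then cannot use the combiner, so the whole bag is materialized
-- in the reducer, which is where a GROUP ALL count can OOM.
Data = GROUP SourceData ALL;
Bad  = FOREACH Data GENERATE group, COUNT(SourceData),
                             MyNonAlgebraicUdf(SourceData);
```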