Like I said earlier, if all you are doing is a COUNT, the data bag should not be growing. On the reduce side, it should just be a bag of partial counts, one from each combiner. Something else is preventing the algebraic and accumulative optimizations from kicking in. Can you share a minimal script that reproduces the problem for you?
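To illustrate what "a bag of partial counts" means here: an algebraic COUNT splits into initial/intermediate/final stages, so each map task collapses its tuples to a single number before anything reaches the reducer. A rough Python model of that decomposition (this is only a sketch of the idea — Pig's real Algebraic interface is Java, and these function names are illustrative):

```python
# Model of an algebraic COUNT: each map task emits one partial count,
# so the reducer sums a handful of integers instead of holding the
# whole bag of tuples in memory.

def initial(tuples):
    # Map side: collapse a chunk of input tuples to a single count.
    return sum(1 for _ in tuples)

def intermed(partial_counts):
    # Combiner: merge partial counts into one.
    return sum(partial_counts)

def final(partial_counts):
    # Reducer: same merge, producing the final count.
    return sum(partial_counts)

# Three map tasks with a million records each: the reducer's "bag"
# contains three integers, not three million tuples.
chunks = [range(1_000_000)] * 3
partials = [initial(c) for c in chunks]
total = final([intermed([p]) for p in partials])
print(total)
```

If this decomposition is active, the reduce-side bag stays tiny regardless of input size — which is why a growing bag suggests the optimization is being disabled.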
On Jul 9, 2012, at 3:24 AM, Haitao Yao <[email protected]> wrote:

> Seems like a big data bag is still a headache for Pig.
> Here's a mail archive I found:
> http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%[email protected]%3E
>
> I've tried all the ways I can think of, and none works.
> I think I have to play some tricks inside the Pig source code.
>
> Haitao Yao
> [email protected]
> weibo: @haitao_yao
> Skype: haitao.yao.final
>
> On 2012-7-9, at 2:18 PM, Haitao Yao wrote:
>
>> There's also a reason for the OOM: I group the data by all, and the
>> parallelism is 1. With a big data bag, the reducer OOMs.
>>
>> After digging into the Pig source code, I found that replacing the data
>> bag in BinSedesTuple is quite tricky and may cause other unknown
>> problems…
>>
>> Has anybody else encountered the same problem?
>>
>> Haitao Yao
>> [email protected]
>> weibo: @haitao_yao
>> Skype: haitao.yao.final
>>
>> On 2012-7-9, at 11:11 AM, Haitao Yao wrote:
>>
>>> Sorry for the improper statement.
>>> The problem is the DataBag. BinSedesTuple reads the full data of the
>>> DataBag, and when COUNT is applied to the data, it causes an OOM.
>>> The diagrams also show that most of the objects come from the ArrayList.
>>>
>>> I want to reimplement the DataBag read by BinSedesTuple so that it just
>>> holds a reference to the data input and reads the data one by one when
>>> an iterator is used to access it.
>>>
>>> I will give it a shot.
>>>
>>> Haitao Yao
>>> [email protected]
>>> weibo: @haitao_yao
>>> Skype: haitao.yao.final
>>>
>>> On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:
>>>
>>>> BinSedesTuple is just the tuple; changing it won't do anything about the
>>>> fact that lots of tuples are being loaded.
>>>>
>>>> The snippet you provided will not load all the data for computation, since
>>>> COUNT implements the Algebraic interface (partial counts will be done in
>>>> combiners).
>>>>
>>>> Something else is causing tuples to be materialized. Are you using other
>>>> UDFs?
>>>> Can you provide more details on the script? When you run "explain"
>>>> on "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc.?
>>>>
>>>> You can check the "pig.alias" property in the jobconf to identify which
>>>> relations are being calculated by a given MR job; that might help narrow
>>>> things down.
>>>>
>>>> -Dmitriy
>>>>
>>>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
>>>> Hi,
>>>> I wrote a Pig script in which one of the reducers always OOMs no matter
>>>> how I change the parallelism.
>>>> Here's the script snippet:
>>>>
>>>> Data = group SourceData all;
>>>> Result = foreach Data generate group, COUNT(SourceData);
>>>> store Result into 'XX';
>>>>
>>>> I analyzed the dumped Java heap and found that the reason is that the
>>>> reducer loads all the data for the foreach and count.
>>>>
>>>> Can I re-implement BinSedesTuple to avoid having reducers load all the
>>>> data for computation?
>>>>
>>>> Here's the object domination tree:
>>>>
>>>> Here's the jmap result:
>>>>
>>>> Haitao Yao
>>>> [email protected]
>>>> weibo: @haitao_yao
>>>> Skype: haitao.yao.final
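For what it's worth, the lazy-bag idea Haitao describes (hold a reference to the input and decode tuples only as the iterator is consumed, instead of reading everything into an ArrayList) can be modeled outside Pig. A minimal Python sketch of the two designs — class and function names here are illustrative, not Pig's Java API:

```python
# Contrast: a bag that materializes its entire contents up front versus
# one that decodes lazily from its source. The materialized version is
# the ArrayList-backed behavior dominating the heap dump; the streaming
# version is O(1) memory in the bag size for a single pass like COUNT.

from typing import Callable, Iterable, Iterator


class MaterializedBag:
    """Reads the whole bag into memory at construction time."""

    def __init__(self, source: Iterable):
        self._tuples = list(source)  # entire bag resident on the heap

    def __iter__(self) -> Iterator:
        return iter(self._tuples)


class StreamingBag:
    """Holds only a factory for the underlying reader; tuples are
    decoded one at a time, on demand, each time the bag is iterated."""

    def __init__(self, make_reader: Callable[[], Iterable]):
        self._make_reader = make_reader  # stands in for the data input

    def __iter__(self) -> Iterator:
        return iter(self._make_reader())


def count(bag) -> int:
    # One streaming pass; never needs the whole bag at once.
    return sum(1 for _ in bag)


big = 10**6
print(count(StreamingBag(lambda: range(big))))  # no million-entry list built
```

The trade-off is that a streaming bag like this can only be read forward from its source, which is fine for a single-pass aggregate such as COUNT but not for operators that need random access or multiple concurrent iterations — likely part of why swapping the bag inside BinSedesTuple is, as noted above, trickier than it looks.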
