I've found the reason: it's InternalCachedBag. I've posted all the details in a mail titled "What is the best way to do counting in pig?". I'm afraid I can't give you a link to that mail yet, since the Apache mail archiver still hasn't caught up with the message.
Haitao Yao
[email protected]
weibo: @haitao_yao
Skype: haitao.yao.final

On 2012-7-10, at 10:35 PM, Dmitriy Ryaboy wrote:

> Like I said earlier, if all you are doing is count, the data bag should not
> be growing. On the reduce side, it'll just be a bag of counts from each
> reducer. Something else is happening that's preventing the algebraic and
> accumulative optimizations from kicking in. Can you share a minimal script
> that reproduces the problem for you?
>
> On Jul 9, 2012, at 3:24 AM, Haitao Yao <[email protected]> wrote:
>
>> It seems that big data bags are still a headache for Pig.
>> Here's a mail archive I found:
>> http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%[email protected]%3E
>>
>> I've tried every approach I can think of, and none works.
>> I think I have to play some tricks inside the Pig source code.
>>
>> On 2012-7-9, at 2:18 PM, Haitao Yao wrote:
>>
>>> There's another reason for the OOM: I group the data by all, and the
>>> parallelism is 1. With a big data bag, the reducer OOMs.
>>>
>>> After digging into the Pig source code, I found that replacing the data
>>> bag in BinSedesTuple is quite tricky, and may cause other unknown
>>> problems…
>>>
>>> Has anybody else encountered the same problem?
>>>
>>> On 2012-7-9, at 11:11 AM, Haitao Yao wrote:
>>>
>>>> Sorry for the improper statement.
>>>> The problem is the DataBag. BinSedesTuple reads the full data of the
>>>> DataBag, and when COUNT is applied to it, this causes the OOM.
>>>> The diagrams also show that most of the objects come from the ArrayList.
>>>>
>>>> I want to reimplement the DataBag read by BinSedesTuple so that it just
>>>> holds a reference to the data input and reads the data one tuple at a
>>>> time when an iterator is used to access it.
>>>>
>>>> I will give it a shot.
>>>> On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:
>>>>
>>>>> BinSedesTuple is just the tuple; changing it won't do anything about the
>>>>> fact that lots of tuples are being loaded.
>>>>>
>>>>> The snippet you provided will not load all the data for computation,
>>>>> since COUNT implements the Algebraic interface (partial counts will be
>>>>> done on combiners).
>>>>>
>>>>> Something else is causing tuples to be materialized. Are you using other
>>>>> UDFs? Can you provide more details on the script? When you run "explain"
>>>>> on "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc.?
>>>>>
>>>>> You can check the "pig.alias" property in the jobconf to identify which
>>>>> relations are being calculated by a given MR job; that might help narrow
>>>>> things down.
>>>>>
>>>>> -Dmitriy
>>>>>
>>>>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
>>>>> hi,
>>>>> I wrote a Pig script in which one of the reducers always OOMs no matter
>>>>> how I change the parallelism.
>>>>> Here's the script snippet:
>>>>>
>>>>> Data = group SourceData all;
>>>>> Result = foreach Data generate group, COUNT(SourceData);
>>>>> store Result into 'XX';
>>>>>
>>>>> I analyzed the dumped Java heap and found that the reducer loads all
>>>>> the data for the foreach and COUNT.
>>>>>
>>>>> Can I re-implement BinSedesTuple to avoid reducers loading all the
>>>>> data for computation?
>>>>>
>>>>> Here's the object domination tree:
>>>>> [screenshot not included in archive]
>>>>>
>>>>> Here's the jmap result:
>>>>> [screenshot not included in archive]
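[Editor's note] The algebraic optimization Dmitriy describes is why COUNT should not require the whole bag on one reducer: partial counts are computed map-side and on combiners, and the reducer only merges small partial results. Here is a minimal Python sketch of that three-phase (initial / intermediate / final) pattern; it is an illustration of the idea, not Pig's actual implementation:

```python
# Sketch of the Algebraic-style COUNT that Pig's combiner optimization
# enables: each phase sees only small inputs, so no phase needs the
# full data bag in memory at once.

def count_initial(bag):
    # Map side: count the tuples this task sees, emit a partial count.
    return sum(1 for _ in bag)

def count_intermediate(partial_counts):
    # Combiner: merge partial counts into one partial count.
    return sum(partial_counts)

def count_final(partial_counts):
    # Reducer: merge the combiner outputs into the final count.
    return sum(partial_counts)

if __name__ == "__main__":
    # Simulate three map tasks, each seeing a slice of the data.
    map_slices = [range(4), range(5), range(3)]
    combiner_out = [count_intermediate([count_initial(s)]) for s in map_slices]
    print(count_final(combiner_out))  # 12
```

If EXPLAIN on the script shows COUNT$Initial/COUNT$Intermediate/COUNT$Final, this rewrite has kicked in; if not, something in the script (e.g. a non-algebraic UDF in the same foreach) is forcing the full bag to be materialized on the reducer.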
