I've found the cause: it's InternalCachedBag.
I've posted all the details in a mail titled "What is the best way to do
counting in Pig?".
I'm afraid I can't give you a link to it yet, since the Apache mailing list
archiver hasn't caught up with that message.
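
For anyone curious: the pattern that bites, as far as I can tell, looks
roughly like this (a minimal sketch in plain Java, NOT Pig's actual
InternalCachedBag code; in real Pig the in-memory budget is controlled by
the pig.cachedbag.memusage property):

    // Sketch of a memory-bounded bag: tuples are cached in an in-memory
    // list until a size estimate exceeds the budget, then overflow is
    // spilled to a temp file on disk.
    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;

    class CachedBagSketch {
        private final List<String> inMemory = new ArrayList<String>();
        private final long memoryLimitBytes;  // hypothetical budget
        private long usedBytes = 0;
        private DataOutputStream spillOut;    // created lazily on first spill

        CachedBagSketch(long memoryLimitBytes) {
            this.memoryLimitBytes = memoryLimitBytes;
        }

        void add(String tuple) throws IOException {
            usedBytes += tuple.length() * 2L; // crude size estimate
            if (usedBytes <= memoryLimitBytes) {
                inMemory.add(tuple);          // fast path: keep in memory
                return;
            }
            if (spillOut == null) {
                File spillFile = File.createTempFile("bag", ".spill");
                spillOut = new DataOutputStream(new BufferedOutputStream(
                        new FileOutputStream(spillFile)));
            }
            spillOut.writeUTF(tuple);         // overflow path: spill to disk
        }
    }

The point is just the shape of the code path: the bag decides, per tuple,
whether to cache or spill based on a memory estimate, so how that estimate
is configured matters a lot when everything goes to a single reducer.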


Haitao Yao
[email protected]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-10, at 10:35 PM, Dmitriy Ryaboy wrote:

> Like I said earlier, if all you are doing is a COUNT, the data bag should
> not be growing. On the reduce side, it'll just be a bag of partial counts
> from the combiners. Something else is happening that's preventing the
> algebraic and accumulative optimizations from kicking in. Can you share a
> minimal script that reproduces the problem for you?
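> 
> For reference, the accumulative path looks roughly like this (a simplified
> sketch against Pig's Accumulator interface, not the real builtin COUNT,
> which implements both Algebraic and Accumulator):
> 
>     import java.io.IOException;
>     import org.apache.pig.Accumulator;
>     import org.apache.pig.data.DataBag;
>     import org.apache.pig.data.Tuple;
> 
>     // Pig feeds the bag to accumulate() in batches, so only one batch
>     // has to be in memory at a time; the full bag is never materialized.
>     public class AccumulativeCountSketch implements Accumulator<Long> {
>         private long count = 0;
> 
>         public void accumulate(Tuple b) throws IOException {
>             DataBag batch = (DataBag) b.get(0);  // one batch of tuples
>             count += batch.size();
>         }
> 
>         public Long getValue() { return count; }
> 
>         public void cleanup() { count = 0; }
>     }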
> 
> On Jul 9, 2012, at 3:24 AM, Haitao Yao <[email protected]> wrote:
> 
>> Seems like a big data bag is still a headache for Pig.
>> Here's a mail archive thread I found:
>> http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%[email protected]%3E
>> 
>> I've tried every approach I can think of, and none of them works.
>> I think I have to play some tricks inside the Pig source code.
>> 
>> 
>> 
>> Haitao Yao
>> [email protected]
>> weibo: @haitao_yao
>> Skype:  haitao.yao.final
>> 
>> On 2012-7-9, at 2:18 PM, Haitao Yao wrote:
>> 
>>> There's another reason for the OOM: I group the data by all, and the
>>> parallelism is 1. With a big data bag, the reducer OOMs.
>>> 
>>> After digging into the Pig source code, I found that replacing the data
>>> bag in BinSedesTuple is quite tricky, and it may cause other unknown
>>> problems…
>>> 
>>> Has anybody else encountered the same problem?
>>> 
>>> 
>>> Haitao Yao
>>> [email protected]
>>> weibo: @haitao_yao
>>> Skype:  haitao.yao.final
>>> 
>>> On 2012-7-9, at 11:11 AM, Haitao Yao wrote:
>>> 
>>>> Sorry for the imprecise statement.
>>>> The problem is the DataBag: BinSedesTuple reads the full contents of the
>>>> DataBag, and when COUNT is applied to the data, it causes the OOM.
>>>> The diagrams also show that most of the objects come from the ArrayList.
>>>> 
>>>> I want to reimplement the DataBag that BinSedesTuple reads so that it
>>>> only holds a reference to the data input and reads tuples one by one
>>>> when the iterator is used to access the data.
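>>>> 
>>>> Roughly what I have in mind (just a sketch; the class name is made up
>>>> and readNextTuple stands in for the real sedes code that deserializes a
>>>> single tuple):
>>>> 
>>>>     import java.io.DataInput;
>>>>     import java.io.IOException;
>>>>     import java.util.Iterator;
>>>> 
>>>>     // Instead of materializing every tuple into an ArrayList during
>>>>     // readFields(), keep the DataInput and deserialize one tuple at a
>>>>     // time as the iterator is consumed.
>>>>     class LazyBagSketch {
>>>>         private final DataInput in;  // reference to the serialized stream
>>>>         private final long size;     // tuple count, read from the header
>>>> 
>>>>         LazyBagSketch(DataInput in, long size) {
>>>>             this.in = in;
>>>>             this.size = size;
>>>>         }
>>>> 
>>>>         Iterator<Object> iterator() {
>>>>             return new Iterator<Object>() {
>>>>                 private long read = 0;
>>>>                 public boolean hasNext() { return read < size; }
>>>>                 public Object next() {
>>>>                     read++;
>>>>                     try {
>>>>                         return readNextTuple(in);  // read on demand
>>>>                     } catch (IOException e) {
>>>>                         throw new RuntimeException(e);
>>>>                     }
>>>>                 }
>>>>                 public void remove() {
>>>>                     throw new UnsupportedOperationException();
>>>>                 }
>>>>             };
>>>>         }
>>>> 
>>>>         Object readNextTuple(DataInput input) throws IOException {
>>>>             return input.readUTF();  // placeholder for real tuple sedes
>>>>         }
>>>>     }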
>>>> 
>>>> I will give it a shot.
>>>> 
>>>> Haitao Yao
>>>> [email protected]
>>>> weibo: @haitao_yao
>>>> Skype:  haitao.yao.final
>>>> 
>>>> On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:
>>>> 
>>>>> BinSedesTuple is just the tuple; changing it won't do anything about
>>>>> the fact that lots of tuples are being loaded.
>>>>> 
>>>>> The snippet you provided will not load all the data for computation,
>>>>> since COUNT implements the Algebraic interface (partial counts will be
>>>>> done in combiners).
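>>>>> 
>>>>> The algebraic contract looks roughly like this (a simplified sketch of
>>>>> a count; the real one is org.apache.pig.builtin.COUNT):
>>>>> 
>>>>>     import java.io.IOException;
>>>>>     import org.apache.pig.Algebraic;
>>>>>     import org.apache.pig.EvalFunc;
>>>>>     import org.apache.pig.data.DataBag;
>>>>>     import org.apache.pig.data.Tuple;
>>>>>     import org.apache.pig.data.TupleFactory;
>>>>> 
>>>>>     // Algebraic exposes three stages: Initial runs map-side, Intermed
>>>>>     // runs in the combiner, Final runs in the reducer. The reducer
>>>>>     // therefore sees a small bag of partial counts, not the raw data.
>>>>>     public class CountSketch extends EvalFunc<Long> implements Algebraic {
>>>>>         public Long exec(Tuple input) throws IOException {
>>>>>             return ((DataBag) input.get(0)).size();
>>>>>         }
>>>>> 
>>>>>         public String getInitial()  { return Initial.class.getName(); }
>>>>>         public String getIntermed() { return Intermed.class.getName(); }
>>>>>         public String getFinal()    { return Final.class.getName(); }
>>>>> 
>>>>>         public static class Initial extends EvalFunc<Tuple> {
>>>>>             public Tuple exec(Tuple input) throws IOException {
>>>>>                 // map side: emit a partial count for this batch
>>>>>                 DataBag bag = (DataBag) input.get(0);
>>>>>                 return TupleFactory.getInstance().newTuple(bag.size());
>>>>>             }
>>>>>         }
>>>>> 
>>>>>         public static class Intermed extends EvalFunc<Tuple> {
>>>>>             public Tuple exec(Tuple input) throws IOException {
>>>>>                 // combiner: sum the partial counts
>>>>>                 return TupleFactory.getInstance().newTuple(sum(input));
>>>>>             }
>>>>>         }
>>>>> 
>>>>>         public static class Final extends EvalFunc<Long> {
>>>>>             public Long exec(Tuple input) throws IOException {
>>>>>                 // reduce side: sum the combiner outputs
>>>>>                 return sum(input);
>>>>>             }
>>>>>         }
>>>>> 
>>>>>         static long sum(Tuple input) throws IOException {
>>>>>             long total = 0;
>>>>>             for (Tuple t : (DataBag) input.get(0)) {
>>>>>                 total += (Long) t.get(0);
>>>>>             }
>>>>>             return total;
>>>>>         }
>>>>>     }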
>>>>> 
>>>>> Something else is causing tuples to be materialized. Are you using other 
>>>>> UDFs? Can you provide more details on the script? When you run "explain" 
>>>>> on "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?
>>>>> 
>>>>> You can check the "pig.alias" property in the jobconf to identify which 
>>>>> relations are being calculated by a given MR job; that might help narrow 
>>>>> things down.
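>>>>> 
>>>>> E.g., something like this from inside a UDF (a sketch; it assumes
>>>>> org.apache.pig.impl.util.UDFContext is available in your Pig version,
>>>>> and you can just as easily read the property off the job's
>>>>> configuration page in the JobTracker web UI):
>>>>> 
>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>     import org.apache.pig.impl.util.UDFContext;
>>>>> 
>>>>>     public class AliasProbe {
>>>>>         // Returns the aliases the current MR job is computing,
>>>>>         // as recorded by Pig in the pig.alias jobconf property.
>>>>>         public static String currentAliases() {
>>>>>             Configuration conf = UDFContext.getUDFContext().getJobConf();
>>>>>             return conf == null ? null : conf.get("pig.alias");
>>>>>         }
>>>>>     }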
>>>>> 
>>>>> -Dmitriy
>>>>> 
>>>>> 
>>>>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
>>>>> Hi,
>>>>>   I wrote a Pig script in which one of the reducers always OOMs, no
>>>>> matter how I change the parallelism.
>>>>>       Here's the script snippet:
>>>>>       Data = group SourceData all;
>>>>>       Result = foreach Data generate group, COUNT(SourceData);
>>>>>       store Result into 'XX';
>>>>> 
>>>>>   I analyzed the dumped Java heap and found that the reducer loads all
>>>>> the data for the foreach and the COUNT.
>>>>> 
>>>>>   Can I re-implement BinSedesTuple so that reducers don't load all the
>>>>> data for computation?
>>>>> 
>>>>> Here's the object dominator tree:
>>>>> [image attachment not preserved in the archive]
>>>>> 
>>>>> Here's the jmap result:
>>>>> [image attachment not preserved in the archive]
>>>>> 
>>>>> Haitao Yao
>>>>> [email protected]
>>>>> weibo: @haitao_yao
>>>>> Skype:  haitao.yao.final
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
