There's also a reason for the OOM: I group the data by ALL, so the parallelism 
is 1. With a big data bag, the reducer OOMs. 
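For context, GROUP ... ALL always funnels everything through a single reducer, so partial aggregation in combiners is the only thing keeping that reducer's memory bounded. Here is a minimal Java sketch of that idea; the class and method names are illustrative only and are not Pig's actual combiner API:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: partial counts are computed per input chunk
// (the "combiner" stage), so the single reducer only merges a handful
// of longs instead of materializing every tuple. This mirrors what
// Pig's algebraic COUNT does; nothing here is a real Pig class.
public class PartialCount {
    // "Initial/Intermediate" stage: count one chunk of records.
    static long countChunk(List<String> chunk) {
        return chunk.size();
    }

    // "Final" stage: the reducer merges partial counts only.
    static long mergePartials(List<Long> partials) {
        return partials.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        List<String> chunk1 = Arrays.asList("a", "b", "c");
        List<String> chunk2 = Arrays.asList("d", "e");
        long total = mergePartials(
                Arrays.asList(countChunk(chunk1), countChunk(chunk2)));
        System.out.println(total); // 5
    }
}
```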

After digging into the Pig source code, I found that replacing the data bag 
in BinSedesTuple is quite tricky, and it may cause other unknown problems… 

Has anybody else encountered the same problem? 
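For the lazy-bag replacement idea, a rough Java sketch of the shape it might take is below. This is purely illustrative: the names are made up, and Pig's real DataBag contract also covers spilling, size reporting, and re-iteration, which this single-pass sketch omits.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of a "lazy" bag: instead of reading every tuple into an
// ArrayList at deserialization time, it keeps a reference to the
// source and decodes one element per next() call. Names are
// hypothetical; Pig's actual DataBag interface is richer than this.
public class LazyBag<T> implements Iterable<T> {
    interface TupleSource<T> {
        boolean hasNext();
        T readNext(); // decode one tuple from the underlying input
    }

    private final TupleSource<T> source;

    public LazyBag(TupleSource<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            public boolean hasNext() { return source.hasNext(); }
            public T next() {
                if (!source.hasNext()) throw new NoSuchElementException();
                return source.readNext();
            }
        };
    }

    public static void main(String[] args) {
        // Simulated source backed by an array; in Pig this would wrap
        // the DataInput that BinSedesTuple reads from.
        final String[] data = {"t1", "t2", "t3"};
        LazyBag<String> bag = new LazyBag<>(new TupleSource<String>() {
            int i = 0;
            public boolean hasNext() { return i < data.length; }
            public String readNext() { return data[i++]; }
        });
        long count = 0;
        for (String t : bag) count++; // stream, never buffer
        System.out.println(count); // 3
    }
}
```

Note the sketch is single-pass: a real replacement would need to support re-reading (Pig may iterate a bag more than once) or fall back to spilling.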


Haitao Yao
[email protected]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-9, at 11:11 AM, Haitao Yao wrote:

> Sorry for the improper statement. 
> The problem is the DataBag. BinSedesTuple reads the full data of the 
> DataBag, and when COUNT is used on the data, it causes an OOM.
> The diagrams also show that most of the objects come from the ArrayList.
> 
> I want to reimplement the DataBag that BinSedesTuple reads, so that it just 
> holds a reference to the data input and reads the data one by one when an 
> iterator is used to access it.
> 
> I will give it a shot. 
> 
> Haitao Yao
> [email protected]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
> 
> On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:
> 
>> BinSedesTuple is just the tuple, changing it won't do anything about the 
>> fact that lots of tuples are being loaded.
>> 
>> The snippet you provided will not load all the data for computation, since 
>> COUNT implements the Algebraic interface (partial counts will be done in 
>> combiners).
>> 
>> Something else is causing tuples to be materialized. Are you using other 
>> UDFs? Can you provide more details on the script? When you run "explain" on 
>> "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?
>> 
>> You can check the "pig.alias" property in the jobconf to identify which 
>> relations are being calculated by a given MR job; that might help narrow 
>> things down.
>> 
>> -Dmitriy
>> 
>> 
>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
>> hi,
>>      I wrote a Pig script where one of the reducers always OOMs, no matter how I 
>> change the parallelism.
>>         Here's the script snippet:
>>              Data = group SourceData all;
>>              Result = foreach Data generate group, COUNT(SourceData);
>>              store Result into 'XX';
>>      
>>      I analyzed the dumped Java heap and found that the reason is that 
>> the reducer loads all the data for the foreach and count. 
>> 
>>      Can I re-implement BinSedesTuple to avoid reducers loading all the 
>> data for computation? 
>> 
>> Here's the object domination tree: [image attachment not included]
>> 
>> Here's the jmap result: [image attachment not included]
>> 
>> Haitao Yao
>> [email protected]
>> weibo: @haitao_yao
>> Skype:  haitao.yao.final
>> 
>> 
> 
