My reducers get 512 MB: -Xms512M -Xmx512M.
The reducer does not get an OOM when I manually invoke spill in my case.

Can you explain more about your solution?
And can your solution fit into a 512 MB reducer process?
Thanks very much.



Haitao Yao
[email protected]
weibo: @haitao_yao
Skype:  haitao.yao.final

On Jul 10, 2012, at 12:26 PM, Jonathan Coveney wrote:

> I have something in the mix that should reduce bag memory :) Question: how
> much memory are your reducers getting? In my experience, you'll get OOMs
> on spilling if you have allocated less than a gig to the JVM.
> 
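For context, on Hadoop 1.x-era clusters the per-task JVM heap Jonathan refers to is typically controlled by mapred.child.java.opts; a hedged sketch of raising it from a Pig script (the value is illustrative; check your own cluster's configuration):

```pig
-- illustrative: give the map/reduce child JVMs a 1 GB heap
SET mapred.child.java.opts '-Xmx1024m';
```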
> 2012/7/9 Haitao Yao <[email protected]>
> 
>> I have encountered a similar problem, and I got an OOM while running the
>> reducer.
>> I think the reason is that the data bag generated after group all is too
>> big to fit into the reducer's memory.
>> 
>> I have written a new COUNT implementation that explicitly invokes
>> System.gc() and spill() after the COUNT function finishes its job, but it
>> still gets an OOM.
>> 
>> Here's the code of the new COUNT implementation:
>>
>>     @Override
>>     public Long exec(Tuple input) throws IOException {
>>         DataBag bag = (DataBag) input.get(0);
>>         Long result = super.exec(input);
>>         LOG.warn(" before spill data bag memory : " + Runtime.getRuntime().freeMemory());
>>         bag.spill();
>>         System.gc();
>>         LOG.warn(" after spill data bag memory : " + Runtime.getRuntime().freeMemory());
>>         LOG.warn("big bag size: " + bag.size() + ", hashcode: " + bag.hashCode());
>>         return result;
>>     }
>> 
>> 
>> I think we have to redesign the data bag implementation so that it
>> consumes less memory.
>> 
>> 
>> 
>> Haitao Yao
>> [email protected]
>> weibo: @haitao_yao
>> Skype:  haitao.yao.final
>> 
>> On Jul 10, 2012, at 6:54 AM, Sheng Guo wrote:
>> 
>>> the pig script:
>>> 
>>> longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
>>> 
>>> grpall = group longDesc all;
>>> cnt = foreach grpall generate COUNT(longDesc) as allNumber;
>>> explain cnt;
>>> 
>>> 
>>> the explain relation output:
>>> 
>>> #-----------------------------------------------
>>> # New Logical Plan:
>>> #-----------------------------------------------
>>> cnt: (Name: LOStore Schema: allNumber#65:long)
>>> |
>>> |---cnt: (Name: LOForEach Schema: allNumber#65:long)
>>>   |   |
>>>   |   (Name: LOGenerate[false] Schema:
>>> allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
>>>   |   |   |
>>>   |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid:
>>> 65)
>>>   |   |   |
>>>   |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column:
>>> (*))
>>>   |   |
>>>   |   |---longDesc: (Name: LOInnerLoad[1] Schema:
>>> 
>> DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
>>>   |
>>>   |---grpall: (Name: LOCogroup Schema:
>>> 
>> group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
>>>       |   |
>>>       |   (Name: Constant Type: chararray Uid: 62)
>>>       |
>>>       |---longDesc: (Name: LOLoad Schema:
>>> 
>> DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null
>>> 
>>> #-----------------------------------------------
>>> # Physical Plan:
>>> #-----------------------------------------------
>>> cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
>>> |
>>> |---cnt: New For Each(false)[bag] - scope-8
>>>   |   |
>>>   |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-6
>>>   |   |
>>>   |   |---Project[bag][1] - scope-5
>>>   |
>>>   |---grpall: Package[tuple]{chararray} - scope-2
>>>       |
>>>       |---grpall: Global Rearrange[tuple] - scope-1
>>>           |
>>>           |---grpall: Local Rearrange[tuple]{chararray}(false) - scope-3
>>>               |   |
>>>               |   Constant(all) - scope-4
>>>               |
>>>               |---longDesc:
>>> Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0
>>> 
>>> 2012-07-09 15:47:02,441 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
>>> File concatenation threshold: 100 optimistic? false
>>> 2012-07-09 15:47:02,448 [main] INFO
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
>>> - Choosing to move algebraic foreach to combiner
>>> 2012-07-09 15:47:02,581 [main] INFO
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>> - MR plan size before optimization: 1
>>> 2012-07-09 15:47:02,581 [main] INFO
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>> - MR plan size after optimization: 1
>>> #--------------------------------------------------
>>> # Map Reduce Plan
>>> #--------------------------------------------------
>>> MapReduce node scope-10
>>> Map Plan
>>> grpall: Local Rearrange[tuple]{chararray}(false) - scope-22
>>> |   |
>>> |   Project[chararray][0] - scope-23
>>> |
>>> |---cnt: New For Each(false,false)[bag] - scope-11
>>>   |   |
>>>   |   Project[chararray][0] - scope-12
>>>   |   |
>>>   |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-13
>>>   |   |
>>>   |   |---Project[bag][1] - scope-14
>>>   |
>>>   |---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-24
>>>       |
>>>       |---longDesc:
>>> Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0--------
>>> Combine Plan
>>> grpall: Local Rearrange[tuple]{chararray}(false) - scope-26
>>> |   |
>>> |   Project[chararray][0] - scope-27
>>> |
>>> |---cnt: New For Each(false,false)[bag] - scope-15
>>>   |   |
>>>   |   Project[chararray][0] - scope-16
>>>   |   |
>>>   |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] -
>>> scope-17
>>>   |   |
>>>   |   |---Project[bag][1] - scope-18
>>>   |
>>>   |---POCombinerPackage[tuple]{chararray} - scope-20--------
>>> Reduce Plan
>>> cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
>>> |
>>> |---cnt: New For Each(false)[bag] - scope-8
>>>   |   |
>>>   |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - scope-6
>>>   |   |
>>>   |   |---Project[bag][1] - scope-19
>>>   |
>>>   |---POCombinerPackage[tuple]{chararray} - scope-28--------
>>> Global sort: false
>>> ----------------
>>> 
>>> 
>>> 
>>> On Tue, Jul 3, 2012 at 9:56 AM, Jonathan Coveney <[email protected]>
>> wrote:
>>> 
>>>> instead of doing "dump relation", do "explain relation" (then run it
>>>> identically) and paste the output here. It will show whether the
>>>> combiner is being used.
>>>> 
>>>> 2012/7/3 Ruslan Al-Fakikh <[email protected]>
>>>> 
>>>>> Hi,
>>>>> 
>>>>> As was said, COUNT is algebraic and should be fast, because it
>>>>> enables the combiner. You should make sure that the combiner is really
>>>>> used here; it can be disabled in some situations. I've encountered such
>>>>> situations many times, where a job is too heavy because no combiner is
>>>>> applied.
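One way to check the combiner's effect directly is to rerun with it disabled and compare; pig.exec.nocombiner is the standard switch. This sketch reuses the relation names from Sheng's script earlier in the thread and is untested here:

```pig
-- illustrative: disable the combiner to compare behavior and explain output
SET pig.exec.nocombiner 'true';
longDesc = LOAD '/user/xx/filtered_chunk' USING AvroStorage();
grpall = GROUP longDesc ALL;
cnt = FOREACH grpall GENERATE COUNT(longDesc) AS allNumber;
EXPLAIN cnt;  -- the "Combine Plan" section should be absent when disabled
```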
>>>>> 
>>>>> Ruslan
>>>>> 
>>>>> On Tue, Jul 3, 2012 at 1:35 AM, Subir S <[email protected]>
>>>> wrote:
>>>>>> Right!!
>>>>>> 
>>>>>> Since it is mentioned that the job is hanging, my wild guess is that
>>>>>> it must be the 'group all'. How can that be confirmed?
>>>>>> 
>>>>>> On 7/3/12, Jonathan Coveney <[email protected]> wrote:
>>>>>>> group all uses a single reducer, but COUNT is algebraic and, as such,
>>>>>>> will use combiners, so it is generally quite fast.
>>>>>>> 
>>>>>>> 2012/7/2 Subir S <[email protected]>
>>>>>>> 
>>>>>>>> Group all uses a single reducer, AFAIU. You could try counting per
>>>>>>>> group and then summing, maybe.
>>>>>>>> 
>>>>>>>> You may also try COUNT_STAR to include NULL fields.
>>>>>>>> 
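The per-group count-then-sum that Subir suggests could look roughly like this, using the relation and field names from Sheng's script below (an untested sketch, not the thread's verified solution):

```pig
-- count within each member_id group, which parallelizes across reducers...
m_skill_group = GROUP m_skills_filter BY member_id;
per_group = FOREACH m_skill_group GENERATE COUNT(m_skills_filter) AS c;
-- ...then sum the small per-group counts in the single 'group all' reducer
grpd = GROUP per_group ALL;
cnt = FOREACH grpd GENERATE SUM(per_group.c) AS total;
```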
>>>>>>>> On 7/3/12, Sheng Guo <[email protected]> wrote:
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> I used to use the following pig script to do the counting of the
>>>>>>>>> records.
>>>>>>>>> 
>>>>>>>>> m_skill_group = group m_skills_filter by member_id;
>>>>>>>>> grpd = group m_skill_group all;
>>>>>>>>> cnt = foreach grpd generate COUNT(m_skill_group);
>>>>>>>>> 
>>>>>>>>> cnt_filter = limit cnt 10;
>>>>>>>>> dump cnt_filter;
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> but sometimes, when the records get larger, it takes a lot of time
>>>>>>>>> and hangs or dies.
>>>>>>>>> I thought counting should be simple enough, so what is the best way
>>>>>>>>> to do a count in Pig?
>>>>>>>>> 
>>>>>>>>> Thanks!
>>>>>>>>> 
>>>>>>>>> Sheng
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best Regards,
>>>>> Ruslan Al-Fakikh
>>>>> 
>>>> 
>> 
>> 
