I have encountered a similar problem: I got an OOM while running the 
reducer.
I think the reason is that the data bag generated after 'group all' is too 
big to fit into the reducer's memory.

I have written a new COUNT implementation that explicitly invokes 
System.gc() and spills the bag after the COUNT function finishes its job, 
but it still gets an OOM.

here's the code of the new COUNT implementation:
        @Override
        public Long exec(Tuple input) throws IOException {
                DataBag bag = (DataBag) input.get(0);
                Long result = super.exec(input);
                LOG.warn("before spill data bag memory: " + Runtime.getRuntime().freeMemory());
                bag.spill();
                System.gc();
                LOG.warn("after spill data bag memory: " + Runtime.getRuntime().freeMemory());
                LOG.warn("big bag size: " + bag.size() + ", hashcode: " + bag.hashCode());
                return result;
        }
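One caveat about the logging above: Runtime.freeMemory() only reflects the currently committed heap, which can still grow toward -Xmx, so the before/after numbers are hard to compare. A small, hypothetical helper (names are mine, not Pig's) that reports used-versus-max instead:

```java
// Sketch: report used heap against the maximum the JVM may ever commit,
// which is more stable than logging freeMemory() alone.
public class HeapStat {
    static String usedHeap() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory(); // bytes currently in use
        return used + "/" + rt.maxMemory() + " bytes used";
    }

    public static void main(String[] args) {
        System.out.println(usedHeap());
    }
}
```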


I think we have to redesign the data bag implementation to consume less 
memory.
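As suggested later in this thread, a two-stage count avoids feeding one giant bag to the single 'all' reducer. A hypothetical sketch against the longDesc relation below (the key DISCUSSION_ID is just an illustrative pick from its schema, not a requirement):

```pig
-- Sketch: count per key first (combiner-friendly, spread over reducers),
-- then sum the small per-key counts in the single 'all' reducer.
grps   = GROUP longDesc BY DISCUSSION_ID;
perKey = FOREACH grps GENERATE COUNT(longDesc) AS c;
total  = FOREACH (GROUP perKey ALL) GENERATE SUM(perKey.c) AS allNumber;
```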



Haitao Yao
[email protected]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-10, at 6:54 AM, Sheng Guo wrote:

> the pig script:
> 
> longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
> 
> grpall = group longDesc all;
> cnt = foreach grpall generate COUNT(longDesc) as allNumber;
> explain cnt;
> 
> 
> the dump relation result:
> 
> #-----------------------------------------------
> # New Logical Plan:
> #-----------------------------------------------
> cnt: (Name: LOStore Schema: allNumber#65:long)
> |
> |---cnt: (Name: LOForEach Schema: allNumber#65:long)
>    |   |
>    |   (Name: LOGenerate[false] Schema:
> allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
>    |   |   |
>    |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid:
> 65)
>    |   |   |
>    |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column:
> (*))
>    |   |
>    |   |---longDesc: (Name: LOInnerLoad[1] Schema:
> DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
>    |
>    |---grpall: (Name: LOCogroup Schema:
> group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
>        |   |
>        |   (Name: Constant Type: chararray Uid: 62)
>        |
>        |---longDesc: (Name: LOLoad Schema:
> DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null
> 
> #-----------------------------------------------
> # Physical Plan:
> #-----------------------------------------------
> cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
> |
> |---cnt: New For Each(false)[bag] - scope-8
>    |   |
>    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-6
>    |   |
>    |   |---Project[bag][1] - scope-5
>    |
>    |---grpall: Package[tuple]{chararray} - scope-2
>        |
>        |---grpall: Global Rearrange[tuple] - scope-1
>            |
>            |---grpall: Local Rearrange[tuple]{chararray}(false) - scope-3
>                |   |
>                |   Constant(all) - scope-4
>                |
>                |---longDesc:
> Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0
> 
> 2012-07-09 15:47:02,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
> File concatenation threshold: 100 optimistic? false
> 2012-07-09 15:47:02,448 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
> - Choosing to move algebraic foreach to combiner
> 2012-07-09 15:47:02,581 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size before optimization: 1
> 2012-07-09 15:47:02,581 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size after optimization: 1
> #--------------------------------------------------
> # Map Reduce Plan
> #--------------------------------------------------
> MapReduce node scope-10
> Map Plan
> grpall: Local Rearrange[tuple]{chararray}(false) - scope-22
> |   |
> |   Project[chararray][0] - scope-23
> |
> |---cnt: New For Each(false,false)[bag] - scope-11
>    |   |
>    |   Project[chararray][0] - scope-12
>    |   |
>    |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-13
>    |   |
>    |   |---Project[bag][1] - scope-14
>    |
>    |---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-24
>        |
>        |---longDesc:
> Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0--------
> Combine Plan
> grpall: Local Rearrange[tuple]{chararray}(false) - scope-26
> |   |
> |   Project[chararray][0] - scope-27
> |
> |---cnt: New For Each(false,false)[bag] - scope-15
>    |   |
>    |   Project[chararray][0] - scope-16
>    |   |
>    |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] -
> scope-17
>    |   |
>    |   |---Project[bag][1] - scope-18
>    |
>    |---POCombinerPackage[tuple]{chararray} - scope-20--------
> Reduce Plan
> cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
> |
> |---cnt: New For Each(false)[bag] - scope-8
>    |   |
>    |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - scope-6
>    |   |
>    |   |---Project[bag][1] - scope-19
>    |
>    |---POCombinerPackage[tuple]{chararray} - scope-28--------
> Global sort: false
> ----------------
> 
> 
> 
> On Tue, Jul 3, 2012 at 9:56 AM, Jonathan Coveney <[email protected]> wrote:
> 
>> instead of doing "dump relation," do "explain relation" (then run
>> identically) and paste the output here. It will show whether the combiner
>> is being used.
>> 
>> 2012/7/3 Ruslan Al-Fakikh <[email protected]>
>> 
>>> Hi,
>>> 
>>> As was said, COUNT is algebraic and should be fast, because it
>>> forces the combiner. You should make sure that the combiner is really
>>> used here; it can be disabled in some situations. I've encountered such
>>> situations many times, where a job is too heavy because no combiner is
>>> applied.
>>> 
>>> Ruslan
>>> 
>>>> On Tue, Jul 3, 2012 at 1:35 AM, Subir S <[email protected]> wrote:
>>>> Right!!
>>>> 
>>>> Since it is mentioned that the job is hanging, my wild guess is that it
>>>> must be the 'group all'. How can that be confirmed?
>>>> 
>>>> On 7/3/12, Jonathan Coveney <[email protected]> wrote:
>>>>> group all uses a single reducer, but COUNT is algebraic, and as such
>>>>> will use combiners, so it is generally quite fast.
>>>>> 
>>>>> 2012/7/2 Subir S <[email protected]>
>>>>> 
>>>>>> Group all uses a single reducer, AFAIU. You can try to count per
>>>>>> group and then sum, maybe.
>>>>>> 
>>>>>> You may also try with COUNT_STAR to include NULL fields.
>>>>>> 
>>>>>> On 7/3/12, Sheng Guo <[email protected]> wrote:
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I used to use the following pig script to do the counting of the
>>>>>>> records.
>>>>>>> 
>>>>>>> m_skill_group = group m_skills_filter by member_id;
>>>>>>> grpd = group m_skill_group all;
>>>>>>> cnt = foreach grpd generate COUNT(m_skill_group);
>>>>>>> 
>>>>>>> cnt_filter = limit cnt 10;
>>>>>>> dump cnt_filter;
>>>>>>> 
>>>>>>> 
>>>>>>> but sometimes, when the records get larger, it takes a lot of time
>>>>>>> and hangs, or dies.
>>>>>>> I thought counting should be simple enough, so what is the best way
>>>>>>> to do a count in Pig?
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> Sheng
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Best Regards,
>>> Ruslan Al-Fakikh
>>> 
>> 
