I have something in the mix that should reduce bag memory :) Question: how much memory are your reducers getting? In my experience, you'll get OOMs on spilling if you have allocated less than a gig to the JVM.
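For readers following along: the knobs in question could be raised from inside a Pig script. A minimal sketch, assuming the Hadoop 1.x / Pig property names of that era (`mapred.child.java.opts`, `pig.cachedbag.memusage`); adjust to your own cluster configuration:

```pig
-- Sketch: give the map/reduce task JVMs more heap, and make bags
-- spill to disk earlier by shrinking the fraction of heap they may use.
SET mapred.child.java.opts '-Xmx2048m';  -- task JVM heap (Hadoop 1.x default was only 200m)
SET pig.cachedbag.memusage 0.1;          -- fraction of heap bags may occupy before spilling
```

These can equally be set cluster-wide in `mapred-site.xml` / `pig.properties` rather than per script.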
2012/7/9 Haitao Yao <[email protected]>

> I have encountered a similar problem, and I got an OOM while running the
> reducer. I think the reason is that the data bag generated after "group
> all" is too big to fit into the reducer's memory.
>
> I have written a new COUNT implementation that explicitly invokes
> System.gc() and spills after the COUNT function finishes its job, but it
> still gets an OOM. Here is the code of the new COUNT implementation:
>
> @Override
> public Long exec(Tuple input) throws IOException {
>     DataBag bag = (DataBag) input.get(0);
>     Long result = super.exec(input);
>     LOG.warn("before spill data bag memory: " + Runtime.getRuntime().freeMemory());
>     bag.spill();
>     System.gc();
>     LOG.warn("after spill data bag memory: " + Runtime.getRuntime().freeMemory());
>     LOG.warn("big bag size: " + bag.size() + ", hashcode: " + bag.hashCode());
>     return result;
> }
>
> I think we have to redesign the data bag implementation to consume less
> memory.
>
> Haitao Yao
> [email protected]
> weibo: @haitao_yao
> Skype: haitao.yao.final
>
> On 2012-7-10, at 6:54 AM, Sheng Guo wrote:
>
> > The pig script:
> >
> > longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
> >
> > grpall = group longDesc all;
> > cnt = foreach grpall generate COUNT(longDesc) as allNumber;
> > explain cnt;
> >
> > The explain output:
> >
> > #-----------------------------------------------
> > # New Logical Plan:
> > #-----------------------------------------------
> > cnt: (Name: LOStore Schema: allNumber#65:long)
> > |
> > |---cnt: (Name: LOForEach Schema: allNumber#65:long)
> >     |   |
> >     |   (Name: LOGenerate[false] Schema: allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
> >     |   |   |
> >     |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 65)
> >     |   |   |
> >     |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column: (*))
> >     |   |
> >     |   |---longDesc: (Name: LOInnerLoad[1] Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
> >     |
> >     |---grpall: (Name: LOCogroup Schema: group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
> >         |   |
> >         |   (Name: Constant Type: chararray Uid: 62)
> >         |
> >         |---longDesc: (Name: LOLoad Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null
> >
> > #-----------------------------------------------
> > # Physical Plan:
> > #-----------------------------------------------
> > cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
> > |
> > |---cnt: New For Each(false)[bag] - scope-8
> >     |   |
> >     |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-6
> >     |   |
> >     |   |---Project[bag][1] - scope-5
> >     |
> >     |---grpall: Package[tuple]{chararray} - scope-2
> >         |
> >         |---grpall: Global Rearrange[tuple] - scope-1
> >             |
> >             |---grpall: Local Rearrange[tuple]{chararray}(false) - scope-3
> >                 |   |
> >                 |   Constant(all) - scope-4
> >                 |
> >                 |---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0
> >
> > 2012-07-09 15:47:02,441 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> > 2012-07-09 15:47:02,448 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
> > 2012-07-09 15:47:02,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> > 2012-07-09 15:47:02,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> >
> > #--------------------------------------------------
> > # Map Reduce Plan
> > #--------------------------------------------------
> > MapReduce node scope-10
> > Map Plan
> > grpall: Local Rearrange[tuple]{chararray}(false) - scope-22
> > |   |
> > |   Project[chararray][0] - scope-23
> > |
> > |---cnt: New For Each(false,false)[bag] - scope-11
> >     |   |
> >     |   Project[chararray][0] - scope-12
> >     |   |
> >     |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-13
> >     |   |
> >     |   |---Project[bag][1] - scope-14
> >     |
> >     |---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-24
> >         |
> >         |---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0
> > --------
> > Combine Plan
> > grpall: Local Rearrange[tuple]{chararray}(false) - scope-26
> > |   |
> > |   Project[chararray][0] - scope-27
> > |
> > |---cnt: New For Each(false,false)[bag] - scope-15
> >     |   |
> >     |   Project[chararray][0] - scope-16
> >     |   |
> >     |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - scope-17
> >     |   |
> >     |   |---Project[bag][1] - scope-18
> >     |
> >     |---POCombinerPackage[tuple]{chararray} - scope-20
> > --------
> > Reduce Plan
> > cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
> > |
> > |---cnt: New For Each(false)[bag] - scope-8
> >     |   |
> >     |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - scope-6
> >     |   |
> >     |   |---Project[bag][1] - scope-19
> >     |
> >     |---POCombinerPackage[tuple]{chararray} - scope-28
> > --------
> > Global sort: false
> > ----------------
> >
> > On Tue, Jul 3, 2012 at 9:56 AM, Jonathan Coveney <[email protected]> wrote:
> >
> >> Instead of doing "dump relation," do "explain relation" (then run it
> >> identically) and paste the output here. It will show whether the
> >> combiner is being used.
> >>
> >> 2012/7/3 Ruslan Al-Fakikh <[email protected]>
> >>
> >>> Hi,
> >>>
> >>> As was said, COUNT is algebraic and should be fast, because it forces
> >>> a combiner. You should make sure that the combiner is really used
> >>> here; it can be disabled in some situations. I've encountered such
> >>> situations many times, where a job is far too heavy when no combiner
> >>> is applied.
> >>>
> >>> Ruslan
> >>>
> >>> On Tue, Jul 3, 2012 at 1:35 AM, Subir S <[email protected]> wrote:
> >>>> Right!!
> >>>>
> >>>> Since it is mentioned that the job is hanging, my wild guess is that
> >>>> it must be the 'group all'. How can that be confirmed?
> >>>>
> >>>> On 7/3/12, Jonathan Coveney <[email protected]> wrote:
> >>>>> "group all" uses a single reducer, but COUNT is algebraic and, as
> >>>>> such, will use combiners, so it is generally quite fast.
> >>>>>
> >>>>> 2012/7/2 Subir S <[email protected]>
> >>>>>
> >>>>>> "Group all" uses a single reducer, AFAIU. You could try to count
> >>>>>> per group and then sum, maybe.
> >>>>>>
> >>>>>> You may also try COUNT_STAR to include NULL fields.
> >>>>>>
> >>>>>> On 7/3/12, Sheng Guo <[email protected]> wrote:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I used to use the following pig script to do the counting of the
> >>>>>>> records:
> >>>>>>>
> >>>>>>> m_skill_group = group m_skills_filter by member_id;
> >>>>>>> grpd = group m_skill_group all;
> >>>>>>> cnt = foreach grpd generate COUNT(m_skill_group);
> >>>>>>>
> >>>>>>> cnt_filter = limit cnt 10;
> >>>>>>> dump cnt_filter;
> >>>>>>>
> >>>>>>> But sometimes, when the records get larger, it takes a lot of
> >>>>>>> time and hangs, or dies. I thought counting should be simple
> >>>>>>> enough, so what is the best way to do a counting in pig?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Sheng
> >>>
> >>> --
> >>> Best Regards,
> >>> Ruslan Al-Fakikh
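The "count per group and sum" alternative Subir suggests could look like the sketch below, reusing the relation names from Sheng's script (the schema of `m_skills_filter` is assumed; untested):

```pig
-- Sketch of "count per group, then sum". COUNT_STAR counts all tuples,
-- including those whose first field is null, which plain COUNT skips.
m_skill_group = GROUP m_skills_filter BY member_id;
per_member = FOREACH m_skill_group GENERATE group AS member_id,
                                            COUNT_STAR(m_skills_filter) AS n;
grpd = GROUP per_member ALL;
cnt  = FOREACH grpd GENERATE SUM(per_member.n) AS total;
DUMP cnt;
```

Both COUNT_STAR and SUM are algebraic, so the combiner still collapses the per-member counts before the single "group all" reducer sees them.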
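The COUNT$Initial / COUNT$Intermediate / COUNT$Final operators in the Map Reduce Plan above are the algebraic decomposition Jonathan describes. A toy Java sketch of the idea (illustrative names only, not Pig's actual EvalFunc/Algebraic API): the lone "group all" reducer ends up summing a few partial counts instead of materializing the whole bag.

```java
import java.util.Arrays;
import java.util.List;

public class AlgebraicCount {
    // Initial phase: each input record contributes a partial count of 1 (on the mappers).
    static long initial() { return 1L; }

    // Intermediate (combiner) and Final (reducer) phases both just sum partial counts.
    static long combine(List<Long> partials) {
        return partials.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Two "map tasks" each see some records and emit partial counts;
        // combiners collapse them locally before anything reaches the reducer.
        List<Long> mapper1 = Arrays.asList(initial(), initial(), initial());
        List<Long> mapper2 = Arrays.asList(initial(), initial());
        long combined1 = combine(mapper1);  // combiner on node 1 -> 3
        long combined2 = combine(mapper2);  // combiner on node 2 -> 2
        long total = combine(Arrays.asList(combined1, combined2));  // reducer
        System.out.println(total);  // prints 5
    }
}
```

A non-algebraic function (or anything that disables the combiner) forces the entire bag through that single reducer, which is where the OOMs in this thread come from.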
