The Pig script:
longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
grpall = group longDesc all;
cnt = foreach grpall generate COUNT(longDesc) as allNumber;
explain cnt;
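
(A side note on semantics, relevant to the COUNT_STAR suggestion in the thread
below: COUNT skips tuples whose first field is null, while COUNT_STAR counts
every tuple. If the intent is a raw record count, the COUNT_STAR variant of the
same script would be, as an untested sketch:

longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
grpall = group longDesc all;
cnt = foreach grpall generate COUNT_STAR(longDesc) as allNumber;
dump cnt;

Same shape otherwise, and COUNT_STAR is also algebraic, so the combiner still
applies.)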
The explain result:
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
cnt: (Name: LOStore Schema: allNumber#65:long)
|
|---cnt: (Name: LOForEach Schema: allNumber#65:long)
| |
| (Name: LOGenerate[false] Schema: allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
| | |
| | (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 65)
| | |
| | |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column: (*))
| |
| |---longDesc: (Name: LOInnerLoad[1] Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
|
|---grpall: (Name: LOCogroup Schema: group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
| |
| (Name: Constant Type: chararray Uid: 62)
|
|---longDesc: (Name: LOLoad Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
|
|---cnt: New For Each(false)[bag] - scope-8
| |
| POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-6
| |
| |---Project[bag][1] - scope-5
|
|---grpall: Package[tuple]{chararray} - scope-2
|
|---grpall: Global Rearrange[tuple] - scope-1
|
|---grpall: Local Rearrange[tuple]{chararray}(false) - scope-3
| |
| Constant(all) - scope-4
|
|---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0
2012-07-09 15:47:02,441 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-07-09 15:47:02,448 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2012-07-09 15:47:02,581 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-07-09 15:47:02,581 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-10
Map Plan
grpall: Local Rearrange[tuple]{chararray}(false) - scope-22
| |
| Project[chararray][0] - scope-23
|
|---cnt: New For Each(false,false)[bag] - scope-11
| |
| Project[chararray][0] - scope-12
| |
| POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-13
| |
| |---Project[bag][1] - scope-14
|
|---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-24
|
|---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0
--------
Combine Plan
grpall: Local Rearrange[tuple]{chararray}(false) - scope-26
| |
| Project[chararray][0] - scope-27
|
|---cnt: New For Each(false,false)[bag] - scope-15
| |
| Project[chararray][0] - scope-16
| |
| POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - scope-17
| |
| |---Project[bag][1] - scope-18
|
|---POCombinerPackage[tuple]{chararray} - scope-20
--------
Reduce Plan
cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
|
|---cnt: New For Each(false)[bag] - scope-8
| |
| POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - scope-6
| |
| |---Project[bag][1] - scope-19
|
|---POCombinerPackage[tuple]{chararray} - scope-28
--------
Global sort: false
----------------
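
For comparison, the count-per-group-then-sum alternative that Subir suggests
below would look roughly like this (relation names taken from his quoted
script; an untested sketch):

m_skill_group = group m_skills_filter by member_id;
per_member = foreach m_skill_group generate COUNT(m_skills_filter) as c;
grpd = group per_member all;
cnt = foreach grpd generate SUM(per_member.c) as total;
dump cnt;

The per-member counting runs in parallel across reducers, and only the small
per-member counts funnel into the final single-reducer sum.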
On Tue, Jul 3, 2012 at 9:56 AM, Jonathan Coveney <[email protected]> wrote:
> instead of doing "dump relation", do "explain relation" (then run
> identically) and paste the output here. It will show whether the combiner
> is being used.
>
> 2012/7/3 Ruslan Al-Fakikh <[email protected]>
>
> > Hi,
> >
> > As it was said, COUNT is algebraic and should be fast, because it
> > forces a combiner. You should make sure that the combiner is really used
> > here. It can be disabled in some situations. I've encountered such
> > situations many times, when a job is far too heavy because no combiner
> > is applied.
> >
> > Ruslan
> >
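
(An illustrative sketch of one common case where the combiner is disabled, not
taken from this thread: a nested plan inside the foreach, such as a LIMIT or
ORDER, is not algebraic, so every record must be shipped to the single
reducer:

grpd = group m_skills_filter all;
cnt = foreach grpd {
    firstTen = limit m_skills_filter 10; -- nested plan: combiner cannot fire
    generate COUNT(firstTen);
};

The explain output makes the difference visible: with the combiner active, the
plan shows COUNT$Initial, COUNT$Intermediate, and COUNT$Final stages.)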
> > > On Tue, Jul 3, 2012 at 1:35 AM, Subir S <[email protected]> wrote:
> > > Right!!
> > >
> > > Since it is mentioned that the job is hanging, my wild guess is that it
> > > must be the 'group all'. How can that be confirmed?
> > >
> > > On 7/3/12, Jonathan Coveney <[email protected]> wrote:
> > >> group all uses a single reducer, but COUNT is algebraic, and as such,
> > >> will use combiners, so it is generally quite fast.
> > >>
> > >> 2012/7/2 Subir S <[email protected]>
> > >>
> > >>> Group all uses a single reducer, AFAIU. You could try counting per
> > >>> group and then summing, maybe.
> > >>>
> > >>> You may also try with COUNT_STAR to include NULL fields.
> > >>>
> > >>> On 7/3/12, Sheng Guo <[email protected]> wrote:
> > >>> > Hi all,
> > >>> >
> > >>> > I used to use the following pig script to do the counting of the
> > >>> > records.
> > >>> >
> > >>> > m_skill_group = group m_skills_filter by member_id;
> > >>> > grpd = group m_skill_group all;
> > >>> > cnt = foreach grpd generate COUNT(m_skill_group);
> > >>> >
> > >>> > cnt_filter = limit cnt 10;
> > >>> > dump cnt_filter;
> > >>> >
> > >>> >
> > >>> > but sometimes, when the records get larger, it takes a lot of time
> > >>> > and hangs up, or dies.
> > >>> > I thought counting should be simple enough, so what is the best way
> > >>> > to do counting in Pig?
> > >>> >
> > >>> > Thanks!
> > >>> >
> > >>> > Sheng
> > >>> >
> > >>>
> > >>
> >
> >
> >
> > --
> > Best Regards,
> > Ruslan Al-Fakikh
> >
>