Instead of doing "dump relation", do "explain relation" (then run the script exactly as before) and paste the output here. It will show whether the combiner is being used.
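
For the script quoted below, that swap might look like the following sketch. It assumes m_skills_filter is already defined earlier in the script (its load and filter statements are not shown in the thread), and changes only the final statement.

```pig
-- Sketch only: m_skills_filter is assumed to be defined earlier in the script.
m_skill_group = group m_skills_filter by member_id;
grpd = group m_skill_group all;
cnt = foreach grpd generate COUNT(m_skill_group);
cnt_filter = limit cnt 10;

-- dump cnt_filter;    -- original statement: executes the job and prints the result
explain cnt_filter;    -- prints the logical, physical and MapReduce plans instead of running the job
```

In the MapReduce plan section of the output, each job is listed with its map and reduce plans. As far as I recall, a job that uses the combiner also lists a combine plan there; if none appears for the job doing the COUNT, the combiner is most likely not being applied.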

2012/7/3 Ruslan Al-Fakikh <[email protected]>

> Hi,
>
> As it was said, COUNT is algebraic and should be fast, because it
> forces combiner. You should make sure that combiner is really used
> here. It can be disabled in some situations. I've encountered such
> situations many times when a job is tooo heavy in case no combiner is
> applied.
>
> Ruslan
>
> On Tue, Jul 3, 2012 at 1:35 AM, Subir S <[email protected]> wrote:
> > Right!!
> >
> > Since it is mentioned that job is hanging, wild guess is it must be
> > 'group all'. How can that be confirmed?
> >
> > On 7/3/12, Jonathan Coveney <[email protected]> wrote:
> >> group all uses a single reducer, but COUNT is algebraic, and as such, will
> >> use combiners, so it is generally quite fast.
> >>
> >> 2012/7/2 Subir S <[email protected]>
> >>
> >>> Group all - uses single reducer AFAIU. You can try to count per group
> >>> and sum may be.
> >>>
> >>> You may also try with COUNT_STAR to include NULL fields.
> >>>
> >>> On 7/3/12, Sheng Guo <[email protected]> wrote:
> >>> > Hi all,
> >>> >
> >>> > I used to use the following pig script to do the counting of the
> >>> > records.
> >>> >
> >>> > m_skill_group = group m_skills_filter by member_id;
> >>> > grpd = group m_skill_group all;
> >>> > cnt = foreach grpd generate COUNT(m_skill_group);
> >>> >
> >>> > cnt_filter = limit cnt 10;
> >>> > dump cnt_filter;
> >>> >
> >>> >
> >>> > but sometimes, when the records get larger, it takes lots of time and hang
> >>> > up, and or die.
> >>> > I thought counting should be simple enough, so what is the best way to do a
> >>> > counting in pig?
> >>> >
> >>> > Thanks!
> >>> >
> >>> > Sheng
> >>> >
> >>>
> >>
> >
>
> --
> Best Regards,
> Ruslan Al-Fakikh
>
