Hello Robin. Thank you for your answer. I still have trouble getting it to work the way I want. I have about 400 unique features in a big dataset (US Census 1990 dataset, 30% sample, preprocessed).
I tried setting the number of groups to 4, 40, and 400. The map capacity of my cluster is 20 (10 nodes, 2 maps per node), but the "PFPGrowth" job always uses only a single map task. The "Parallel Counting" and "PFP Transaction Sorting" jobs each use 6 map tasks. In the job configuration of the "PFPGrowth" job the value of "mapred.map.tasks" is 1, although I set it to 20. The number of map tasks corresponds to the number of input-file splits, right? So can I somehow force the input file to be split, or would that alter the result? Or do you have another idea? Thanks a lot in advance; I have tried for hours to get this working.

Björn

> Hi Bjorn,
>
> The distribution of data is skewed. That is a problem with the algorithm
> as proposed in the paper. The way around it is to increase the number of
> groups parameter. For example, if you have 10K unique features, try to
> split them into groups such that there are around 10 features per split.
> Each reducer finds the TopK patterns by creating FP-Trees having
> predominantly those 10 features. So set the number of groups to 1000.
>
> Robin
>
> 2010/6/16 "Björn Jacobs" <[email protected]>
>
> > Hello everyone!
> >
> > I am trying to get used to PFPGrowth in the Mahout packages. I am
> > planning to adapt this code to run a parallelized subgroup discovery.
> > This is, by the way, the aim of my bachelor thesis, which I am
> > currently writing.
> >
> > My problem is that the algorithm does not distribute the workload
> > equally across the nodes in my cluster. I have 10 nodes, and I set the
> > mapred.map.tasks=15 as well as the mapred.reduce.tasks variable.
> >
> > Specifically, the "PFP Growth Driver running over
> > input/test002/sortedoutput" job did the following:
> >
> > Node 0 got nearly 100% of the work (finished in 20 minutes)
> > Nodes 1-3 got a very small piece (finished in less than 10 seconds)
> > Nodes 4-14 got nothing and finished execution immediately
> >
> > This way one node had to do all the work while the others had nothing
> > to do, and the job took really long to finish... that's not parallel.
> >
> > Is this a bug, or do I have to configure something to get this working?
> > Thanks a lot!
> >
> > Yours,
> > Björn Jacobs
> > --
> > GMX DSL: Internet, phone, and mobile flat rate from EUR 19.99/month.
> > Up to EUR 150 starting credit included! http://portal.gmx.net/de/go/dsl
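On the split question above: yes, the map count is driven by the number of input splits, and "mapred.map.tasks" is only a hint. Assuming the new-API FileInputFormat behavior, each split's size is max(minSplitSize, min(maxSplitSize, blockSize)), so if the sorted output is one file smaller than a block, you get one split and therefore one map; lowering the maximum split size (e.g. via "mapred.max.split.size", property name depending on the Hadoop version) should yield more maps. A rough sketch of that arithmetic (the file and block sizes below are made-up illustrations, not figures from this thread):

```python
import math

def num_map_tasks(file_size_bytes, block_size_bytes, min_split=1, max_split=None):
    """Rough model of new-API FileInputFormat split sizing:
    split_size = max(min_split, min(max_split, block_size))."""
    if max_split is None:
        max_split = float("inf")
    split_size = max(min_split, min(max_split, block_size_bytes))
    return math.ceil(file_size_bytes / split_size)

# One 512 MB file with a 64 MB block size -> 8 splits/maps by default;
# capping the split size at 16 MB raises that to 32 maps.
print(num_map_tasks(512 * 2**20, 64 * 2**20))                        # 8
print(num_map_tasks(512 * 2**20, 64 * 2**20, max_split=16 * 2**20))  # 32
```

Note this only models the default split computation; a custom InputFormat in the driver could behave differently.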
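Robin's rule of thumb above (roughly 10 features per group) can be written down directly; the 10K figure is Robin's example and 400 is the Census sample's feature count, the helper itself is just illustration:

```python
import math

def num_groups(num_unique_features, features_per_group=10):
    # Rule of thumb from the thread: ~10 features per group keeps each
    # reducer's FP-tree small, so the work spreads more evenly.
    return math.ceil(num_unique_features / features_per_group)

print(num_groups(10_000))  # 1000 groups, as in Robin's 10K-feature example
print(num_groups(400))     # 40 groups for the 400-feature Census sample
```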
