Hello Robin. Thank you for your answer. I still have trouble getting it to work the way I want. I have about 400 unique features in a big dataset (US Census 1990 dataset, 30% sample, preprocessed).
I tried setting the number of groups to 4, 40, and 400. The map capacity of my cluster is 20 (10 nodes, 2 maps per node), but the "PFPGrowth" job always uses only a single map task. The "Parallel Counting" and "PFP Transaction Sorting" jobs each use 6 map tasks. In the job configuration of the "PFPGrowth" job the value of "mapred.map.tasks" is 1, although I set it to 20. The number of map tasks corresponds to the number of input-file splits, right? So can I somehow force the input file to be split, or would that alter the result? Or do you have another idea? Thanks a lot in advance; I have tried for hours to get this working.

Björn

> Hi Bjorn,
>
> The distribution of data is skewed. That is a problem with the algorithm
> as proposed in the paper. The way around it is to increase the number of
> groups parameter. For example, if you have 10K unique features, try to
> split them into groups such that there are around 10 features per split.
> Each reducer finds the TopK patterns by creating FP-Trees having
> predominantly those 10 features. So set the number of groups to 1000.
>
> Robin
>
> 2010/6/16 "Björn Jacobs" <[email protected]>
>
> > Hello everyone!
> >
> > I am trying to get used to PFPGrowth in the Mahout packages. I am
> > planning to adapt this code to run a parallelized subgroup discovery.
> > This is, by the way, the aim of my bachelor thesis, which I am
> > currently writing.
> >
> > My problem is that the algorithm does not distribute the workload
> > equally across the nodes in my cluster. I have 10 nodes, and I set the
> > mapred.map.tasks=15 as well as the mapred.reduce.tasks variable.
> >
> > Specifically, the "PFP Growth Driver running over
> > input/test002/sortedoutput" job did the following:
> >
> > Node 0 got nearly 100% of the work (finished in 20 minutes)
> > Nodes 1-3 got a very small piece (finished in less than 10 seconds)
> > Nodes 4-14 got nothing and finished execution immediately
> >
> > This way one node had to do all the work while the others had nothing
> > to do, and the job took really long to finish... that's not parallel.
> >
> > Is this a bug, or do I have to configure something to get this working?
> > Thanks a lot!
> >
> > Yours,
> > Björn Jacobs
> > --
> > GMX DSL: Internet, phone, and mobile flat rate from EUR 19.99/month.
> > Up to EUR 150 starting credit included! http://portal.gmx.net/de/go/dsl
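On the split question above: yes, the map count is driven by the number of input splits, and "mapred.map.tasks" is only a hint. Assuming the new-API FileInputFormat behavior, each split's size is max(minSplitSize, min(maxSplitSize, blockSize)), so if the sorted output is one file smaller than a block, you get one split and therefore one map; lowering the maximum split size (e.g. via "mapred.max.split.size", property name depending on the Hadoop version) should yield more maps. A rough sketch of that arithmetic (the file and block sizes below are made-up illustrations, not figures from this thread):

```python
import math

def num_map_tasks(file_size_bytes, block_size_bytes, min_split=1, max_split=None):
    """Rough model of new-API FileInputFormat split sizing:
    split_size = max(min_split, min(max_split, block_size))."""
    if max_split is None:
        max_split = float("inf")
    split_size = max(min_split, min(max_split, block_size_bytes))
    return math.ceil(file_size_bytes / split_size)

# One 512 MB file with a 64 MB block size -> 8 splits/maps by default;
# capping the split size at 16 MB raises that to 32 maps.
print(num_map_tasks(512 * 2**20, 64 * 2**20))                        # 8
print(num_map_tasks(512 * 2**20, 64 * 2**20, max_split=16 * 2**20))  # 32
```

Note this only models the default split computation; a custom InputFormat in the driver could behave differently.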
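Robin's rule of thumb above (roughly 10 features per group) can be written down directly; the 10K figure is Robin's example and 400 is the Census sample's feature count, the helper itself is just illustration:

```python
import math

def num_groups(num_unique_features, features_per_group=10):
    # Rule of thumb from the thread: ~10 features per group keeps each
    # reducer's FP-tree small, so the work spreads more evenly.
    return math.ceil(num_unique_features / features_per_group)

print(num_groups(10_000))  # 1000 groups, as in Robin's 10K-feature example
print(num_groups(400))     # 40 groups for the 400-feature Census sample
```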
