Hello everyone! I am trying to get familiar with PFPGrowth in the Mahout packages. I plan to adapt this code to run a parallelized subgroup discovery, which is the aim of the bachelor thesis I am currently writing.
My problem is that the algorithm does not distribute the workload evenly across the nodes in my cluster. I have 10 nodes, and I set mapred.map.tasks=15 and also set the mapred.reduce.tasks variable. The job "PFP Growth Driver running over input/test002/sortedoutput" was distributed as follows:

- Node 0 got nearly 100% of the work (finished in 20 minutes)
- Nodes 1-3 got a very small piece (finished in less than 10 seconds)
- Nodes 4-14 got nothing and finished execution immediately

So one node had to do all the work while the others sat idle, and the job took very long to finish. That's not parallel.

Is this a bug, or do I have to configure something to get this working?

Thanks a lot!

Yours,
Björn Jacobs
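P.S. To put a rough number on the imbalance, here is a minimal sketch that computes how much of the summed run time the slowest entry accounts for, and the best-case speedup over running everything serially. The durations are assumptions taken from the figures above: 20 minutes for entry 0, the upper bound of 10 seconds for entries 1-3, and 0 seconds for the rest. The class and method names are made up for illustration, and the speedup figure is idealized (it assumes every entry runs at the same time).

```java
import java.util.Arrays;

public class LoadImbalance {
    // Fraction of the summed run time that the slowest entry accounts for.
    static double largestShare(double[] seconds) {
        double total = Arrays.stream(seconds).sum();
        double max = Arrays.stream(seconds).max().orElse(0.0);
        return total == 0.0 ? 0.0 : max / total;
    }

    // Idealized speedup over a serial run: summed time divided by the longest time.
    static double speedup(double[] seconds) {
        double total = Arrays.stream(seconds).sum();
        double max = Arrays.stream(seconds).max().orElse(0.0);
        return max == 0.0 ? 0.0 : total / max;
    }

    public static void main(String[] args) {
        // Assumed durations in seconds, one per entry 0..14 (not exact measurements).
        double[] seconds = new double[15];
        seconds[0] = 20 * 60;   // about 20 minutes
        seconds[1] = 10;        // "less than 10 seconds", upper bound used
        seconds[2] = 10;
        seconds[3] = 10;
        // entries 4..14 stay at 0: they finished immediately

        System.out.println("largest share of work: " + largestShare(seconds));
        System.out.println("idealized speedup:     " + speedup(seconds));
    }
}
```

With these assumed durations the speedup stays very close to 1, which matches what I observed: the job runs about as long as it would on a single node.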
