This sounds very much like a Hadoop configuration problem. Other users have used Mahout to compute frequent item sets over billions of items.

On Tue, Sep 21, 2010 at 11:09 AM, Mark <[email protected]> wrote:

> Smaller samples work. It seems that any time more than one reduce task is
> launched, the job hangs and never finishes. Is this a possible Hadoop
> configuration bug?
>
> On 9/18/10 12:08 PM, Ted Dunning wrote:
>
>> Good advice relative to Mahout as well. Trying it on a smaller sample will
>> tell you if it is due to bad scaling or really a hangup.
>>
>> On Sat, Sep 18, 2010 at 12:03 PM, Mark <[email protected]> wrote:
>>
>>> Thanks. I'll give this a try and see how it performs.
>>>
>>> On 9/18/10 12:01 PM, Neal Richter wrote:
>>>
>>>> I suggest you take a sample of your data and run it on these
>>>> non-Hadoop implementations of itemset miners. FPGrowth is one of the
>>>> available algorithms.
>>>>
>>>> http://www.borgelt.net/fpm.html
>>>>
>>>> If you have success on a small sample, then start upscaling the sample
>>>> as well as investigating the distributions of your data.
>>>>
>>>> - Neal
>>>>
>>>> On Sat, Sep 18, 2010 at 12:30 PM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> In order to encourage your excellent practice of reposting, I will
>>>>> repeat my (non)-answer here.
>>>>>
>>>>> -------------------------------------------
>>>>> I don't know the answer to this, but previously this kind of problem
>>>>> was caused by highly skewed statistics in the input data.
>>>>>
>>>>> If there are things that cooccur with everything, then you will have
>>>>> problems with the speed of the algorithm.
>>>>>
>>>>> Can you say something about the distribution of your data? Can you
>>>>> post a frequency by rank table?
>>>>>
>>>>> On Sat, Sep 18, 2010 at 10:37 AM, Mark <[email protected]> wrote:
>>>>>
>>>>>> I am trying to run FPGrowth:
>>>>>>
>>>>>> hadoop jar /opt/mahout-0.3/mahout-examples-0.3.job
>>>>>> org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i
>>>>>> output/product/part-r-00000 -o pfp -method mapreduce -regex [\\t] -s 5
>>>>>> -g 17500 -k 50
>>>>>>
>>>>>> However, the 3rd task, "Processing FPTree: Bottom Up FP Growth > reduce",
>>>>>> will not finish. It's basically stuck at 85% and hasn't budged in over
>>>>>> an hour. The first task reported about 37K features, so I set -g to
>>>>>> 17500. Does anyone know what's going on and how I can speed this up?
>>>>>>
>>>>>> Thanks
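
For anyone following the thread, the "frequency by rank table" requested above can be produced with a short script. This is only a rough sketch, not part of Mahout. It assumes a local copy of the input with one transaction per line and items separated by tabs (guessed from the -regex [\\t] option in the command). The function name frequency_by_rank and the file name transactions.tsv are made up for illustration.

```python
import re
from collections import Counter


def frequency_by_rank(lines, splitter=r"[\t]"):
    """Return [(rank, item, count, fraction_of_transactions), ...],
    most frequent item first.

    Assumes one transaction per line (an assumption, not confirmed
    by the thread). Each item is counted once per transaction.
    """
    pattern = re.compile(splitter)
    counts = Counter()
    n_transactions = 0
    for line in lines:
        # Deduplicate items within a transaction and drop empty tokens.
        items = {tok.strip() for tok in pattern.split(line.rstrip("\n")) if tok.strip()}
        if not items:
            continue
        n_transactions += 1
        counts.update(items)
    # Sort by descending count; break ties by item name for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, item, count, count / n_transactions)
        for rank, (item, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    import os

    path = "transactions.tsv"  # hypothetical local copy of the input
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            for rank, item, count, share in frequency_by_rank(handle)[:50]:
                print(f"{rank}\t{item}\t{count}\t{share:.3f}")
```

Items whose share is close to 1.0 occur in nearly every transaction, which is the kind of skew described earlier in the thread as a likely cause of slow FP-Growth runs.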
