It's not well documented, but there are actually two distinct implementations of FPGrowth, each of which can be run sequentially or as a mapreduce job.

The --method option lets you select sequential/mapreduce, and the --useFPG2/-2 flag selects the alternate implementation.

However you run FPG, patterns are collected in FrequentPatternMaxHeap instances; every implementation/mode combination makes use of this class.
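
To illustrate the kind of sub-pattern filtering discussed in this thread, here is a minimal sketch in Java. It is not Mahout's code: the class SubPatternSketch, the Pattern type, and the insert method are hypothetical names made up for the example, and a plain list stands in for the bounded heap. It only assumes the behaviour described in the thread, namely that when two patterns have equal support and one is a sub-pattern of the other, only the larger one is kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; these names are NOT Mahout's actual classes or API.
final class SubPatternSketch {

  /** An itemset (items encoded as ints) together with its support count. */
  static final class Pattern {
    final Set<Integer> items;
    final long support;

    Pattern(long support, int... items) {
      this.support = support;
      Set<Integer> s = new TreeSet<>();
      for (int i : items) {
        s.add(i);
      }
      this.items = s;
    }

    /** True if every item of this pattern also occurs in other. */
    boolean isSubPatternOf(Pattern other) {
      return other.items.containsAll(items);
    }
  }

  /**
   * Adds candidate to kept unless an equal-support superset is already there,
   * and evicts any equal-support patterns that candidate subsumes.
   */
  static void insert(List<Pattern> kept, Pattern candidate) {
    for (Pattern p : kept) {
      if (p.support == candidate.support && candidate.isSubPatternOf(p)) {
        return; // redundant: a larger (or identical) pattern is already kept
      }
    }
    kept.removeIf(p -> p.support == candidate.support && p.isSubPatternOf(candidate));
    kept.add(candidate);
  }

  public static void main(String[] args) {
    List<Pattern> kept = new ArrayList<>();
    insert(kept, new Pattern(3, 1, 2));    // kept
    insert(kept, new Pattern(3, 1, 2, 3)); // replaces {1, 2}, same support
    insert(kept, new Pattern(3, 1));       // dropped, subsumed by {1, 2, 3}
    insert(kept, new Pattern(5, 1));       // kept, different support
    for (Pattern p : kept) {
      System.out.println(p.items + " support=" + p.support);
    }
  }
}
```

Patterns with different support are never merged in this sketch, since a sub-pattern with higher support carries information its superset does not. A filter like this only sees the patterns that reach the same collection, so redundancy across independently mined groups would have to be handled at aggregation time.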

I do not recall the precise details right now, but something about the mining/aggregation strategy used in the original (default) implementation leads to redundant patterns appearing when running in mapreduce mode. If your question is driven by finding unexpected redundancies in FPG output, I'd be interested to hear if this persists after trying --useFPG2.

-tom


On 02/26/2012 12:06 PM, gaurav singh wrote:
Hi Tom,

I don't understand: why do you say I will get a lot of redundant patterns?
Each group-dependent shard generates patterns with respect to the
elements of that shard. As far as I know, fpg-2 is only a new sequential
implementation of fp-growth, not a map/reduce implementation.

My question was specifically whether subpatterns are eliminated from the
output of Mahout's parallel fp-growth (the map/reduce version). I know that
the function exists in FrequentPatternMaxHeap, but that's the sequential
algorithm; I am asking only about the map/reduce version.

On Sun, Feb 26, 2012 at 9:39 PM, tom <[email protected]> wrote:

Hi Gaurav,

The patterns are accumulated in a heap (see FrequentPatternMaxHeap), which
uses isSubPatternOf.

That said, I do think the default implementation of PFPGrowth will get you
many redundant patterns under certain circumstances, but the "-2"
implementation will reduce (perhaps eliminate?) redundant patterns.

-tom


On 02/26/2012 09:39 AM, gaurav singh wrote:

Hi Guys,


There is a function in Mahout's sequential fp-growth algorithm named
isSubPatternOf() which returns whether one pattern is a subpattern of
another. If both have equal support, only the larger of the two is
output. I can't find any such function being used in parallel fp-growth.
Does that mean that parallel fp-growth outputs all possible
patterns without eliminating such subpatterns?

Thanks for help!



