It's not well documented, but there are actually two distinct
implementations of FPGrowth, each of which can be run sequentially or as
a mapreduce job.
The --method option lets you select sequential/mapreduce, and the
--useFPG2/-2 flag selects the alternate implementation.
However you run FPG, patterns are collected in FrequentPatternMaxHeap
instances; every implementation/mode combination makes use of this
class.
I do not recall the precise details right now, but something about the
mining/aggregation strategy used in the original (default)
implementation leads to redundant patterns appearing when running in
mapreduce mode. If your question is driven by finding unexpected
redundancies in FPG output, I'd be interested to hear if this persists
after trying --useFPG2.
-tom
On 02/26/2012 12:06 PM, gaurav singh wrote:
Hi Tom,
I don't understand why you say I will get a lot of redundant patterns.
Each group-dependent shard generates patterns with respect to the
elements of that shard. As far as I know, fpg-2 is only a new sequential
implementation of fp-growth, not a map/reduce implementation.
My question was specifically whether subpatterns are eliminated from the
output of Mahout's parallel fp-growth (the map/reduce version). I know
that the function exists in FrequentPatternMaxHeap, but that is the
sequential algorithm; I am asking only about the map/reduce version.
On Sun, Feb 26, 2012 at 9:39 PM, tom <[email protected]> wrote:
Hi Gaurav,
The patterns are accumulated in a heap (see FrequentPatternMaxHeap), which
uses isSubPatternOf.
That said, I do think the default implementation of PFPGrowth will get you
many redundant patterns under certain circumstances, but the "-2"
implementation will reduce (perhaps eliminate?) redundant patterns.
-tom
On 02/26/2012 09:39 AM, gaurav singh wrote:
Hi Guys,
There is a function in Mahout's sequential fp-growth algorithm named
isSubPatternOf() which returns whether one pattern is a subpattern of
another; if both have equal support, only the larger of the two is
output. I can't find any such function being used in parallel fp-growth.
Does that mean that parallel fp-growth outputs all possible patterns
without eliminating such subpatterns?
Thanks for the help!
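
For reference, the equal-support rule described in the message above could be sketched roughly as follows. This is a minimal, hypothetical illustration and not Mahout's code: the class SubPatternSketch, the Pattern type, and the prune() method are invented names, and only the method name isSubPatternOf is taken from the thread. It also assumes a simple pairwise comparison over a list, whereas the thread says Mahout accumulates patterns in a heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; these are NOT Mahout's actual classes.
class SubPatternSketch {

  /** An itemset together with its support count. */
  static final class Pattern {
    final Set<String> items;
    final long support;

    Pattern(long support, String... items) {
      this.items = new HashSet<>(Arrays.asList(items));
      this.support = support;
    }

    /** True if every item of this pattern also occurs in other. */
    boolean isSubPatternOf(Pattern other) {
      return other.items.containsAll(items);
    }
  }

  /**
   * Drops every pattern that is a strict subpattern of another
   * pattern with exactly the same support; keeps everything else.
   */
  static List<Pattern> prune(List<Pattern> patterns) {
    List<Pattern> kept = new ArrayList<>();
    for (Pattern p : patterns) {
      boolean redundant = false;
      for (Pattern q : patterns) {
        if (p != q
            && p.support == q.support
            && p.items.size() < q.items.size()
            && p.isSubPatternOf(q)) {
          redundant = true; // a larger pattern carries the same support
          break;
        }
      }
      if (!redundant) {
        kept.add(p);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Pattern> mined = Arrays.asList(
        new Pattern(5, "a", "b"),
        new Pattern(5, "a", "b", "c"),
        new Pattern(7, "a"));
    for (Pattern p : prune(mined)) {
      System.out.println(p.items + " support=" + p.support);
    }
  }
}
```

In this example {a, b} is dropped because {a, b, c} has the same support of 5, while {a} is kept because its support of 7 differs from that of every larger pattern containing it.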