g is the number of groups the features get divided into, so that the total
size of the transactions (in bytes) handled by each reducer is roughly equal.
See the PFPGrowth paper. With g=1 you get the original sequential fpgrowth. I
usually suggest g == numFeatures / (10 or 20), which keeps parallel fpgrowth
scalable while still giving results similar to the sequential one.
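
Roughly, you can picture the grouping like this; a hypothetical sketch in
plain Java (not the actual Mahout code), assuming feature ids are positions
in the frequency-sorted f-list:

// Hypothetical sketch of choosing g and mapping a feature to a group.
// Not the Mahout implementation; it just illustrates the heuristic above.
public class GroupingSketch {
    public static void main(String[] args) {
        int numFeatures = 50000;                  // size of the f-list (made-up number)
        int g = Math.max(1, numFeatures / 10);    // heuristic: numFeatures / (10 or 20)
        int maxPerGroup = (numFeatures + g - 1) / g;

        int featureId = 1234;                     // position in the frequency-sorted f-list
        int group = featureId / maxPerGroup;      // contiguous, roughly equal-sized groups
        System.out.println("g = " + g + ", feature " + featureId + " -> group " + group);
    }
}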

Robin

On Wed, Nov 10, 2010 at 12:23 AM, <[email protected]> wrote:

> Hi Anil,
> Here is the result for the same features with g=1
> Key: 46705: Value: ([46705],2526), ([46705, 46840],698)
> Key: 46485: Value: ([46485],936), ([46705, 46485],355), ([46840,
> 46485],329), ([46847, 46485],211), ([46705, 46840, 46485],207), ([46485,
> 46815],175), ([46485, 46852],159), ([46840, 46847, 46485],130), ([46705,
> 46847, 46485],126), ([46705, 46485, 46815],105), ([46840, 46485, 46815],97),
> ([46840, 46485, 46852],96), ([46847, 46485, 46815],94), ([46705, 46485,
> 46852],93), ([46705, 46840, 46847, 46485],92), ([20975, 46485],92), ([16794,
> 46485],80), ([46847, 46485, 46852],76), ([46705, 46840, 46485, 46815],75),
> ([46485, 46852, 46815],75), ([46705, 46840, 46485, 46852],69), ([20924,
> 46485],68), ([46705, 46847, 46485, 46815],67), ([46840, 46847, 46485,
> 46815],66), ([20975, 46705, 46840, 46485],65), ([46840, 46847, 46485,
> 46852],56), ([20975, 46705, 46485],55), ([20975, 46840, 46485],54), ([46705,
> 46840, 46847, 46485, 46815],53)
>
> Full Result for same features when g=500 is:
> Key: 46705: Value: ([46705],2526)
> Key: 46485: Value: ([46485],936), ([46705, 46485],355), ([46840,
> 46485],329), ([46847, 46485],211), ([46705, 46840, 46485],205), ([46840,
> 46847, 46485],127), ([46705, 46847, 46485],124), ([20975, 46485],92),
> ([46705, 46840, 46847, 46485],90), ([20975, 46705, 46485],55), ([20975,
> 46840, 46485],54), ([21243, 46485],47), ([20975, 46705, 46840, 46485],43),
> ([39140, 46485],37), ([20975, 46847, 46485],31), ([20975, 46840, 46847,
> 46485],27), ([20975, 46705, 46847, 46485],26), ([20975, 46705, 46840, 46847,
> 46485],23), ([27984, 46705, 46485],23), ([21243, 46840, 46485],22), ([21243,
> 46705, 46485],21), ([39140, 46840, 46485],19), ([21243, 46847, 46485],18),
> ([39140, 46705, 46485],15), ([21243, 46705, 46840, 46485],14), ([6942,
> 46485],14), ([21243, 46840, 46847, 46485],13), ([39140, 46847, 46485],13),
> ([39140, 46840, 46847, 46485],11), ([20975, 39140, 46485],11), ([20975,
> 21243, 46485],11), ([39140, 46705, 46840, 46485],10), ([27984, 46705, 46840,
> 46847, 46485],9), ([39140, 46705, 46847, 46485],9), ([20975, 27984, 46705,
> 46485],8), ([39140, 46705, 46840, 46847, 46485],7), ([20975, 27984, 46705,
> 46840, 46485],7), ([21243, 46705, 46847, 46485],7), ([20975, 39140, 46840,
> 46485],7), ([6942, 46705, 46485],7), ([21243, 46705, 46840, 46847,
> 46485],6), ([20975, 21243, 46840, 46847, 46485],6), ([21243, 27984,
> 46485],6), ([39140, 27984, 46485],6), ([6942, 46840, 46485],6), ([20975,
> 27984, 46705, 46847, 46485],5), ([39140, 27984, 46847, 46485],5), ([20975,
> 39140, 46705, 46485],5), ([21243, 39140, 46485],5), ([4873, 46485],5)
>
> The results are obviously different. This raises another question. Are the
> frequent patterns supposed to change with different values of g?
>
> Praveen
>
> -----Original Message-----
> From: ext Robin Anil [mailto:[email protected]]
> Sent: Tuesday, November 09, 2010 1:11 PM
> To: [email protected]
> Subject: Re: Deriving associations from frequent patterns
>
> Can you try with g=1 and tell me the result?
>
> On Tue, Nov 9, 2010 at 11:37 PM, <[email protected]> wrote:
>
> > Here is the command I used to run PFPGrowth. I am still using only a
> > single machine; I will be setting up a hadoop cluster soon.
> >
> > $ hadoop jar mahout-examples-0.4-job.jar \
> >     org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
> >     -i downloads-input -o reco-patterns-output \
> >     -k 50 -method mapreduce -g 10 -regex '[\ ]' -s 500
> >
> > -----Original Message-----
> > From: ext Robin Anil [mailto:[email protected]]
> > Sent: Tuesday, November 09, 2010 1:01 PM
> > To: [email protected]
> > Subject: Re: Deriving associations from frequent patterns
> >
> > On Tue, Nov 9, 2010 at 11:20 PM, <[email protected]> wrote:
> >
> > > Hi Anil,
> > > 1. I am not sure I understand your answer to #1 (or were you asking me
> > > a question?). Could you please clarify? The sample patterns I gave are
> > > only a small subset of the output; I included only those two features
> > > for simplicity.
> > >
> >  Oh. Never mind. Let me see
> >
> >
> > > 2. I am sending the gzipped sample transaction file (1M downloads)
> > > to your private email since I am not sure if I can attach files to
> > > the
> > mailing list.
> > > Please check your email for the sample file.
> > >
> > > Praveen
> > >
> > > -----Original Message-----
> > > From: ext Robin Anil [mailto:[email protected]]
> > > Sent: Tuesday, November 09, 2010 12:40 PM
> > > To: [email protected]
> > > Subject: Re: Deriving associations from frequent patterns
> > >
> > > On Tue, Nov 9, 2010 at 9:50 PM, <[email protected]> wrote:
> > >
> > > > Hello all,
> > > > I am new to mahout. We started having scalability issues with our
> > > > current fpgrowth implementation, so I have started looking into
> > > > replacing it with the parallel FP-Growth that Mahout provides. From
> > > > the PFPGrowth documentation I noticed that it only produces the top K
> > > > frequent patterns, not the associations, and what we need is the
> > > > associations. So I was thinking of implementing a simple
> > > > AssociationGenerator on top of the frequent patterns output. However,
> > > > I am not sure of the best way to generate associations from the
> > > > frequent patterns produced by mahout.
> > > >
> > > > I have the following sample output from mahout.
> > > >
> > > > Key: 46485: Value: ([46485],936), ([46705, 46485],355)
> > > > Key: 46705: Value: ([46705],2526)
> > > >
> > > > We are interested only in itemsets of size 2, since we need
> > > > 1-antecedent to 1-consequent associations only.
> > > >
> > > > I was planning to calculate associations with confidence as follows:
> > > > For each key above as A {
> > > >        for each two-item set as [A,C] {
> > > >                confidence(A->C) = support([A,C]) / support(A);
> > > >                add association (A, C, confidence(A->C)) to the list;
> > > >        }
> > > > }
> > > >
> > > > Keeping the above requirement and pseudo code in mind, my questions
> > > > are as follows:
> > > > 1. Is the above algorithm efficient?
> > > >
> > > You are running it over a set of top-K patterns. It's small; it doesn't
> > > matter whether it's efficient or not.
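> > >
> > > For the 1->1 case, something along these lines would do. This is just a
> > > rough, untested sketch in plain Java (no Mahout classes; reading the
> > > top-K output into the map is left out), assuming each key feature maps
> > > to its list of (pattern, support) entries:
> > >
> > > import java.util.*;
> > >
> > > public class AssociationSketch {
> > >     // patterns: key feature -> (itemset -> support), e.g.
> > >     // 46485 -> { [46485]=936, [46705, 46485]=355, ... }
> > >     static void printRules(Map<Integer, Map<List<Integer>, Long>> patterns) {
> > >         for (Map.Entry<Integer, Map<List<Integer>, Long>> keyEntry : patterns.entrySet()) {
> > >             int a = keyEntry.getKey();
> > >             Map<List<Integer>, Long> perKey = keyEntry.getValue();
> > >             Long supportA = perKey.get(Collections.singletonList(a)); // support(A) from the [A] pattern
> > >             if (supportA == null) continue;
> > >             for (Map.Entry<List<Integer>, Long> e : perKey.entrySet()) {
> > >                 List<Integer> itemset = e.getKey();
> > >                 if (itemset.size() != 2) continue;                     // keep only 1 -> 1 rules
> > >                 int c = itemset.get(0) == a ? itemset.get(1) : itemset.get(0);
> > >                 double confidence = e.getValue().doubleValue() / supportA; // support([A,C]) / support(A)
> > >                 System.out.println(a + " -> " + c + " : " + confidence);
> > >             }
> > >         }
> > >     }
> > >
> > >     public static void main(String[] args) {
> > >         Map<List<Integer>, Long> key46485 = new HashMap<>();
> > >         key46485.put(Collections.singletonList(46485), 936L);
> > >         key46485.put(Arrays.asList(46705, 46485), 355L);
> > >         Map<Integer, Map<List<Integer>, Long>> patterns = new HashMap<>();
> > >         patterns.put(46485, key46485);
> > >         printRules(patterns); // prints "46485 -> 46705 : 0.379..."
> > >     }
> > > }
> > >
> > > Note it only uses patterns listed under the antecedent's own key, which
> > > is where your question 2 below comes in.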
> > >
> > > > 2. Under the first key, [46705, 46485] occurred 355 times, but why is
> > > > the same pattern not repeated under the second key? Because of this,
> > > > calculating confidence(46705 -> 46485) becomes difficult. As you can
> > > > see from the above code, I was planning to read the patterns for each
> > > > feature and calculate the confidence of all associations with that
> > > > feature as the antecedent. But when I read feature 46705, I cannot
> > > > calculate confidence(46705 -> 46485) since the itemset is not included
> > > > with that feature.
> > > >
> > > Good question. I guess the partitioning is screwing this up, as there
> > > are other k-1 patterns in the list with support greater than 355. Can
> > > you give me a sample to test?
> > >
> > > > 3. Has anyone implemented associations from the generated frequent
> > > > patterns?
> > > >
> > > Nope
> > >
> > > >
> > > >
> > > > Thanks
> > > > Praveen
> > > >
> > > >
> > >
> >
>
