Hello all,
Sorry to bother everyone but I still could not make any progress in generating 
item to item associations from the frequent patterns generated by Mahout PFP. I 
am still trying to understand the semantics of generated frequent patterns and 
what is the best way to generate associations from frequent patterns. 

For others' sake, I would like to repeat my questions:
1. Why are the frequent patterns not generated in both directions? The example 
below has the frequent pattern ([46705, 46840],698) on the first line but not 
[46840, 46705] on the 2nd line, so I cannot build associations by looping 
through products.
>> Key: 46705: Value: ([46705],2526), ([46705, 46840],698)
>> Key: 46485: Value: ([46485],936), ([46705, 46485],355)

2. Could someone give a high-level overview of an algorithm for generating 
associations from the generated frequent patterns?

Thanks
Praveen
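
One possible way to approach both questions is sketched below. It is only a
sketch under stated assumptions, not Mahout code: it assumes the dumped output
has exactly the "Key: k: Value: ([items],support), ..." text form shown above,
with each key on a single line, and the names parse_patterns and pair_rules are
made up for illustration. It treats each pattern as an unordered item set, so a
pair listed under only one key still yields rules in both directions, and it
uses the usual definition confidence(A -> C) = support({A, C}) / support({A}).
Where the same item set appears more than once, keeping the largest count is an
arbitrary choice made here.

```python
import re

# Matches one "([item, item, ...],support)" entry in the dumped pattern text.
PATTERN_RE = re.compile(r"\(\[([^\]]*)\],\s*(\d+)\)")


def parse_patterns(lines):
    """Collect item set -> support from 'Key: k: Value: ([..],n), ...' lines."""
    supports = {}
    for line in lines:
        for items, count in PATTERN_RE.findall(line):
            itemset = frozenset(int(i) for i in items.split(","))
            # Assumption: if an item set shows up under several keys,
            # keep the largest count seen.
            supports[itemset] = max(supports.get(itemset, 0), int(count))
    return supports


def pair_rules(supports, min_confidence=0.0):
    """Derive 1-antecedent -> 1-consequent rules from the 2-item patterns."""
    rules = []
    for itemset, pair_support in supports.items():
        if len(itemset) != 2:
            continue
        a, b = sorted(itemset)
        # Item sets are unordered, so try both directions for every pair.
        for antecedent, consequent in ((a, b), (b, a)):
            single = supports.get(frozenset([antecedent]))
            if not single:
                # The antecedent's own support is missing from the output.
                continue
            confidence = pair_support / single
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    # Highest confidence first.
    return sorted(rules, key=lambda rule: -rule[2])


sample = [
    "Key: 46705: Value: ([46705],2526), ([46705, 46840],698)",
    "Key: 46485: Value: ([46485],936), ([46705, 46485],355)",
]
supports = parse_patterns(sample)
rules = pair_rules(supports)
for antecedent, consequent, confidence in rules:
    print(f"{antecedent} -> {consequent}: {confidence:.3f}")
```

With the two sample lines, this produces rules in both directions for the pair
[46705, 46485], but only 46705 -> 46840 for the other pair, because the support
of [46840] alone is not present in the sample.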

-----Original Message-----
From: Peddi Praveen (Nokia-MS/Boston) 
Sent: Wednesday, November 10, 2010 7:44 AM
To: [email protected]
Subject: Re: Deriving associations from frequent patterns

Ok thanks Anil.

Please let me know if you need anything else from me regarding my original 
question about calculating association rules and what can be done to make the 
output include the necessary information.

Praveen

On Nov 9, 2010, at 11:17 PM, ext Robin Anil <[email protected]> wrote:

> g is the number of groups into which the features are divided so that 
> the total size of transactions in bytes is almost equal in each reducer. 
> See the PFPGrowth paper. With g=1 you get the original fpgrowth. I 
> usually suggest g = numfeatures / (10 or 20) so as to make parallel 
> fpgrowth scalable and still get results similar to the sequential 
> one.
> 
> Robin
> 
> On Wed, Nov 10, 2010 at 12:23 AM, <[email protected]> wrote:
> 
>> Hi Anil,
>> Here is the result for the same features with g=1
>> Key: 46705: Value: ([46705],2526), ([46705, 46840],698)
>> Key: 46485: Value: ([46485],936), ([46705, 46485],355), ([46840, 
>> 46485],329), ([46847, 46485],211), ([46705, 46840, 46485],207), 
>> ([46485, 46815],175), ([46485, 46852],159), ([46840, 46847, 
>> 46485],130), ([46705, 46847, 46485],126), ([46705, 46485, 
>> 46815],105), ([46840, 46485, 46815],97), ([46840, 46485, 46852],96), 
>> ([46847, 46485, 46815],94), ([46705, 46485, 46852],93), ([46705, 
>> 46840, 46847, 46485],92), ([20975, 46485],92), ([16794, 46485],80), 
>> ([46847, 46485, 46852],76), ([46705, 46840, 46485, 46815],75), 
>> ([46485, 46852, 46815],75), ([46705, 46840, 46485, 46852],69), 
>> ([20924, 46485],68), ([46705, 46847, 46485, 46815],67), ([46840, 
>> 46847, 46485, 46815],66), ([20975, 46705, 46840, 46485],65), ([46840, 
>> 46847, 46485, 46852],56), ([20975, 46705, 46485],55), ([20975, 46840, 
>> 46485],54), ([46705, 46840, 46847, 46485, 46815],53)
>> 
>> Full Result for same features when g=500 is:
>> Key: 46705: Value: ([46705],2526)
>> Key: 46485: Value: ([46485],936), ([46705, 46485],355), ([46840, 
>> 46485],329), ([46847, 46485],211), ([46705, 46840, 46485],205), 
>> ([46840, 46847, 46485],127), ([46705, 46847, 46485],124), ([20975, 
>> 46485],92), ([46705, 46840, 46847, 46485],90), ([20975, 46705, 
>> 46485],55), ([20975, 46840, 46485],54), ([21243, 46485],47), ([20975, 
>> 46705, 46840, 46485],43), ([39140, 46485],37), ([20975, 46847, 
>> 46485],31), ([20975, 46840, 46847, 46485],27), ([20975, 46705, 46847, 
>> 46485],26), ([20975, 46705, 46840, 46847, 46485],23), ([27984, 46705, 
>> 46485],23), ([21243, 46840, 46485],22), ([21243, 46705, 46485],21), 
>> ([39140, 46840, 46485],19), ([21243, 46847, 46485],18), ([39140, 
>> 46705, 46485],15), ([21243, 46705, 46840, 46485],14), ([6942, 
>> 46485],14), ([21243, 46840, 46847, 46485],13), ([39140, 46847, 
>> 46485],13), ([39140, 46840, 46847, 46485],11), ([20975, 39140, 
>> 46485],11), ([20975, 21243, 46485],11), ([39140, 46705, 46840, 
>> 46485],10), ([27984, 46705, 46840, 46847, 46485],9), ([39140, 46705, 
>> 46847, 46485],9), ([20975, 27984, 46705, 46485],8), ([39140, 46705, 
>> 46840, 46847, 46485],7), ([20975, 27984, 46705, 46840, 46485],7), 
>> ([21243, 46705, 46847, 46485],7), ([20975, 39140, 46840, 46485],7), 
>> ([6942, 46705, 46485],7), ([21243, 46705, 46840, 46847, 46485],6), 
>> ([20975, 21243, 46840, 46847, 46485],6), ([21243, 27984, 46485],6), 
>> ([39140, 27984, 46485],6), ([6942, 46840, 46485],6), ([20975, 27984, 
>> 46705, 46847, 46485],5), ([39140, 27984, 46847, 46485],5), ([20975, 
>> 39140, 46705, 46485],5), ([21243, 39140, 46485],5), ([4873, 46485],5)
>> 
>> The results are obviously different. This raises another question. 
>> Are the frequent patterns supposed to change with different values of g?
>> 
>> Praveen
>> 
>> -----Original Message-----
>> From: ext Robin Anil [mailto:[email protected]]
>> Sent: Tuesday, November 09, 2010 1:11 PM
>> To: [email protected]
>> Subject: Re: Deriving associations from frequent patterns
>> 
>> Can you try with g=1 and tell me the result?
>> 
>> On Tue, Nov 9, 2010 at 11:37 PM, <[email protected]> wrote:
>> 
>>> Here is the command I used to run PFPGrowth. I am still using only 
>>> a single machine. I will be setting up a Hadoop cluster soon.
>>> 
>>> $ hadoop jar mahout-examples-0.4-job.jar \
>>>     org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
>>>     -i downloads-input -o reco-patterns-output \
>>>     -k 50 -method mapreduce -g 10 -regex '[\ ]' -s 500
>>> 
>>> -----Original Message-----
>>> From: ext Robin Anil [mailto:[email protected]]
>>> Sent: Tuesday, November 09, 2010 1:01 PM
>>> To: [email protected]
>>> Subject: Re: Deriving associations from frequent patterns
>>> 
>>> On Tue, Nov 9, 2010 at 11:20 PM, <[email protected]> wrote:
>>> 
>>>> Hi Anil,
>>>> 1. I am not sure if I understand your answer to #1 (or were you 
>>>> asking me a question?). Could you please clarify? The sample patterns 
>>>> I gave are only a small subset of the output. I included only 
>>>> those two features for simplicity.
>>>> 
>>> Oh. Never mind. Let me see
>>> 
>>> 
>>>> 2. I am sending the gzipped sample transaction file (1M downloads) 
>>>> to your private email since I am not sure if I can attach files to 
>>>> the mailing list. Please check your email for the sample file.
>>>> 
>>>> Praveen
>>>> 
>>>> -----Original Message-----
>>>> From: ext Robin Anil [mailto:[email protected]]
>>>> Sent: Tuesday, November 09, 2010 12:40 PM
>>>> To: [email protected]
>>>> Subject: Re: Deriving associations from frequent patterns
>>>> 
>>>> On Tue, Nov 9, 2010 at 9:50 PM, <[email protected]> wrote:
>>>> 
>>>>> Hello all,
>>>>> I am new to Mahout. I have just started looking into Mahout to 
>>>>> replace our current fpgrowth implementation with the parallel 
>>>>> fpgrowth that Mahout provides, since we have started having 
>>>>> scalability issues. I looked at the PFPGrowth documentation and 
>>>>> noticed that it only produces the top K frequent patterns but not 
>>>>> the associations, and what we need is associations. So I was 
>>>>> thinking of implementing a simple AssociationGenerator that takes 
>>>>> the frequent patterns output. However, I am not sure what the best 
>>>>> way is to generate associations from the frequent patterns 
>>>>> produced by Mahout.
>>>>> 
>>>>> I have the following sample output from mahout.
>>>>> 
>>>>> Key: 46485: Value: ([46485],936), ([46705, 46485],355)
>>>>> Key: 46705: Value: ([46705],2526)
>>>>> 
>>>>> We are interested only in item sets of size 2, since we need only 
>>>>> 1-antecedent to 1-consequent associations.
>>>>> 
>>>>> I was planning to calculate associations with confidence as follows:
>>>>> for each key above as A {
>>>>>       for each two-item set containing A as [A,C] {
>>>>>               // confidence is the pair's support over the antecedent's support
>>>>>               confidence(A->C) = support([A,C]) / support([A]);
>>>>>               add association (A, C, confidence(A->C)) to the list;
>>>>>       }
>>>>> }
>>>>> 
>>>>> Keeping the above requirement and pseudocode in mind, my questions 
>>>>> are as follows:
>>>>> 1. Is the above algorithm efficient?
>>>>> 
>>>> You are running it over a set of top K patterns. It's small. It doesn't 
>>>> matter if it's inefficient or not.
>>>> 
>>>>> 2. In the first pattern, [46705, 46485] occurred 355 times, but why 
>>>>> is the same pattern not repeated in the second? Because of this, 
>>>>> calculating confidence (46705 -> 46485) becomes difficult. As you 
>>>>> can see from the above code, I was planning to read the patterns for 
>>>>> each feature and calculate the confidence of all associations with 
>>>>> that feature as the antecedent. But when I read feature 46705, I 
>>>>> cannot calculate the confidence of (46705 -> 46485) since the item 
>>>>> set is not included with the feature.
>>>>> 
>>>> Good question. I guess the partitioning is screwing this up as 
>>>> there are other K-1 patterns in the list > 355. Can you give a 
>>>> sample to test?
>>>> 
>>>>> 3. Has anyone implemented associations from the generated frequent 
>>>>> patterns?
>>>>> 
>>>> Nope
>>>> 
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Praveen
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
