Hi Ted,
Thanks for your response. Currently this is what we are doing.

1. We get product download information as a triplet (userId, productId,
downloaded_time).
2. We generate an ARFF transaction file keyed by userId -> all downloaded
product IDs (a rough sketch of this step appears below).
3. We feed that into WEKA's FPGrowth implementation to get association rules
in the form of (antecedent, consequent, confidence), where the antecedent and
consequent each contain a single item.
4. We use those association rules to recommend products based on past download
history (after applying all filters, of course).
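
Roughly, step 2 looks like the sketch below. This is simplified and assumes
product IDs have already been mapped to 0-based attribute indices; the real
pipeline has more bookkeeping around that mapping.

    import java.io.PrintWriter;
    import java.util.Map;
    import java.util.SortedSet;

    public class ArffWriter {
        // transactions: userId -> set of 0-based attribute indices for the
        // products that user downloaded
        static void writeSparseArff(Map<String, SortedSet<Integer>> transactions,
                                    int numProducts, PrintWriter out) {
            out.println("@relation downloads");
            // one binary attribute per product
            for (int i = 0; i < numProducts; i++) {
                out.println("@attribute p_" + i + " {0,1}");
            }
            out.println("@data");
            // sparse ARFF rows: only the products a user downloaded are listed,
            // everything else defaults to the first nominal value ("0")
            for (SortedSet<Integer> items : transactions.values()) {
                StringBuilder row = new StringBuilder("{");
                boolean first = true;
                for (int idx : items) {
                    if (!first) row.append(", ");
                    row.append(idx).append(" 1");
                    first = false;
                }
                out.println(row.append('}').toString());
            }
        }
    }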

The drawback of this implementation is that WEKA loads all transactions into
memory, and we hit the memory threshold once the total number of downloads
reaches 60 to 80 million. However, we have more than 200M transactions to
process and the volume is growing fast. Our goal is to find a highly scalable
solution.

Looking at Mahout's PFPGrowth implementation, I thought I could do exactly the
same thing in a distributed fashion, except that I would need to generate the
association rules myself. The formula I mentioned for calculating confidence is
from here: http://en.wikipedia.org/wiki/Association_rule_learning. When I
looked at the output generated by PFPGrowth, I thought I could simply process
the patterns feature by feature and calculate these rules (as mentioned in my
previous email), but I don't think that works with the way Mahout's PFPGrowth
structures its output.
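
To make that concrete, here is a rough Java sketch of the two-pass approach I
had in mind, assuming the PFPGrowth output has already been parsed into a map
from each key item to its list of (itemset, support) entries (the parsing and
the Hadoop plumbing are omitted, and the class names are just for
illustration):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class AssociationSketch {

        static class Pattern {
            final int[] items;   // the itemset, e.g. {46705, 46485}
            final long support;  // number of transactions containing it
            Pattern(int[] items, long support) {
                this.items = items;
                this.support = support;
            }
        }

        // Pass 1: collect singleton supports (e.g. 46485 -> 936, 46705 -> 2526).
        // Assumes every item's singleton pattern appears under its own key.
        static Map<Integer, Long> singletonSupports(Map<Integer, List<Pattern>> output) {
            Map<Integer, Long> supports = new HashMap<Integer, Long>();
            for (List<Pattern> patterns : output.values()) {
                for (Pattern p : patterns) {
                    if (p.items.length == 1) {
                        supports.put(p.items[0], p.support);
                    }
                }
            }
            return supports;
        }

        // Pass 2: for every 2-item pattern {a, c}, emit the rule in both
        // directions using confidence(X -> Y) = support({X, Y}) / support(X).
        static void printRules(Map<Integer, List<Pattern>> output,
                               Map<Integer, Long> singles) {
            for (List<Pattern> patterns : output.values()) {
                for (Pattern p : patterns) {
                    if (p.items.length != 2) continue;
                    int a = p.items[0], c = p.items[1];
                    System.out.printf("%d -> %d conf=%.3f%n", a, c,
                        (double) p.support / singles.get(a));
                    System.out.printf("%d -> %d conf=%.3f%n", c, a,
                        (double) p.support / singles.get(c));
                }
            }
        }
    }

With the sample output from my earlier mail, this would give
confidence(46485 -> 46705) = 355/936 ~ 0.38 and
confidence(46705 -> 46485) = 355/2526 ~ 0.14. Collecting the singleton supports
in a separate first pass is what gets around the problem that a 2-item pattern
is listed under only one of its items.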

Hope I clarified all your questions.

Praveen

-----Original Message-----
From: ext Ted Dunning [mailto:[email protected]] 
Sent: Tuesday, November 09, 2010 12:29 PM
To: [email protected]
Subject: Re: Deriving associations from frequent patterns

Praveen,

Could you define what you mean by association?  Do you mean temporal sequence?  
A causal relationship?  Or simply a cooccurrence?

There is considerable confusion in the world about these terms.  To answer your 
question, it will be important to be sure what you mean by your question.

Assuming that you are looking at cooccurrence, possibly with a temporal 
ordering, your measure of confidence is anything but a measure of confidence.  
More correctly, you are estimating the conditional probability P(B|A).  The 
estimate you are using, however, is subject to substantial error when counts 
are too small.
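
As a toy illustration (with made-up counts), the same ratio can come from
wildly different amounts of evidence:

    public class EstimateError {
        public static void main(String[] args) {
            // {count(A and B), count(A)} -- hypothetical numbers
            long[][] cases = { {2, 3}, {2000, 3000} };
            for (long[] c : cases) {
                double p = (double) c[0] / c[1];
                // rough binomial standard error of the estimate
                double se = Math.sqrt(p * (1 - p) / c[1]);
                System.out.printf("P(B|A) ~ %.2f +/- %.2f (n=%d)%n", p, se, c[1]);
            }
        }
    }

Both cases give the same point estimate of about 0.67, but one is 2 out of 3
and the other is 2000 out of 3000, so the error bars differ by more than an
order of magnitude.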

In many applications, you get much better results if you refrain from 
estimating conditional probabilities at all and satisfy yourself with simply 
separating those conditional probabilities that differ from the marginal 
probabilities (i.e. where P(B) != P(B | A) ).  This is a much simpler task and 
helps you avoid over-fitting.  In the Luduan system, for instance, I used a 
multinomial generalized log-likelihood ratio test (often called G^2) to find 
interesting query terms and then used general corpus frequencies to weight the 
terms.  The results were substantially better than methods that tried to weight 
the terms using conditional probabilities because the Luduan approach could 
avoid over-fitting better.

This G^2 test is available in Mahout, but I think that the PFPGrowth algorithm 
inherently imposes something like it during the winnowing of patterns.
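
For reference, a minimal sketch of scoring a single item pair with that test
in Mahout (if memory serves, it lives in org.apache.mahout.math.stats); the
pair counts below are taken from your sample output, and the total is just a
placeholder for your real transaction count:

    import org.apache.mahout.math.stats.LogLikelihood;

    public class LlrExample {
        public static void main(String[] args) {
            long total = 100000000L;    // total transactions (placeholder)
            long both = 355;            // transactions containing both items
            long aOnly = 936 - both;    // 46485 without 46705
            long bOnly = 2526 - both;   // 46705 without 46485
            long neither = total - both - aOnly - bOnly;
            double g2 = LogLikelihood.logLikelihoodRatio(both, aOnly, bOnly, neither);
            System.out.println("G^2 for (46705, 46485) = " + g2);
        }
    }

Pairs with a large G^2 are the ones whose cooccurrence departs from what the
marginal frequencies alone would predict.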

Another powerful method is to use spectral techniques to find cliques in a 
large graph.  This can have very dramatic results and very good scaling 
relative to iterative item-set growth techniques.

If you could say more about what you are trying to do at a high level, we could 
probably help you find capabilities in Mahout that suit your needs.

On Tue, Nov 9, 2010 at 8:20 AM, <[email protected]> wrote:

> Hello all,
> I am new to Mahout. I have just started looking into Mahout to replace
> our current FPGrowth implementation with the parallel FP-Growth that
> Mahout provides, since we started having scalability issues. I looked at
> the PFPGrowth documentation and noticed that it only produces the top K
> frequent patterns, not the associations, and associations are what we
> need. So I was thinking of implementing a simple AssociationGenerator
> given the frequent-patterns output. However, I am not sure what the best
> way is to generate associations from the frequent patterns produced by
> Mahout.
>
> I have the following sample output from mahout.
>
> Key: 46485: Value: ([46485],936), ([46705, 46485],355)
> Key: 46705: Value: ([46705],2526)
>
> We are interested only in item sets of size 2, since we need 1 ANTECEDENT
> to 1 CONSEQUENT associations only.
>
> I was planning to calculate associations with confidence as follows:
> For each key above as A {
>        for each two-item set as [A,C] {
>                confidence(A -> C) = support([A,C]) / support(A);
>                add association (A, C, confidence(A -> C)) to the list;
>        }
> }
>
> Keeping the above requirement and pseudocode in mind, my questions are as
> follows:
> 1. Is the above algorithm efficient?
> 2. In the first entry, [46705, 46485] occurred 355 times, but why is the
> same pattern not repeated in the second entry? Because of this, calculating
> confidence(46705 -> 46485) becomes difficult. As you can see from the code
> above, I was planning to read the patterns for each feature and calculate
> the confidence of every association with that feature as the antecedent.
> But when I read feature 46705, I cannot calculate confidence(46705 ->
> 46485) since the item set is not included with that feature.
> 3. Has anyone implemented association generation from the frequent
> patterns produced by Mahout?
>
>
> Thanks
> Praveen
>
>
