Praveen,

Could you define what you mean by association?  Do you mean temporal
sequence?  A causal relationship?  Or simply a cooccurrence?

There is considerable confusion about these terms, so before answering
your question it is important to be sure which of them you mean.

Assuming that you are looking at cooccurrence, possibly with a temporal
ordering, your measure of confidence is anything but a measure of
confidence.  More correctly, it is an estimate of the conditional
probability P(B|A).  That estimate is a simple ratio of counts, however,
and it is subject to substantial error when the counts are small.
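
To make the conditional probability point concrete, here is a minimal
sketch in Python (not Mahout code) that computes the count-ratio estimate
of P(B|A) from the three supports in your sample output.  The function
name conditional_estimates and the dictionary layout are my own
assumptions for illustration.  Indexing all patterns by item set first
also means it does not matter under which key the pair [46705, 46485]
was emitted.

```python
# Supports copied from the sample output in the quoted message.
support = {
    frozenset([46485]): 936,
    frozenset([46705]): 2526,
    frozenset([46705, 46485]): 355,
}


def conditional_estimates(support):
    """Return {(a, b): count-ratio estimate of P(b | a)} for each pair."""
    estimates = {}
    for itemset, pair_count in support.items():
        if len(itemset) != 2:
            continue
        x, y = sorted(itemset)
        # Each two-item set yields two candidate rules: x -> y and y -> x.
        for a, b in ((x, y), (y, x)):
            single = support.get(frozenset([a]))
            if single:  # skip antecedents whose own count is missing
                estimates[(a, b)] = pair_count / single
    return estimates


estimates = conditional_estimates(support)
for (a, b), p in sorted(estimates.items()):
    print(f"P({b} | {a}) ~= {p:.4f}")
```

With counts in the hundreds these ratios are reasonably stable.  The
same arithmetic applied to a pair seen 2 times out of 3 would report
0.67, which is exactly the small-count case where the estimate should
not be trusted.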

In many applications, you get much better results if you refrain from
estimating conditional probabilities at all and satisfy yourself with
simply separating those conditional probabilities that differ from the
marginal probabilities (i.e. where P(B|A) != P(B)).  This is a much
simpler task and helps you avoid over-fitting.  In the Luduan system, for
instance, I used a multinomial generalized log-likelihood ratio test
(often called G^2) to find interesting query terms and then used general
corpus frequencies to weight the terms.  The results were substantially
better than methods that tried to weight the terms using conditional
probabilities because the Luduan approach avoided over-fitting better.
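
As a rough sketch of the G^2 test for a single pair of items, the Python
below scores a 2x2 contingency table using the entropy form of the
statistic.  It is an illustration only, not the Luduan or Mahout code.
The total number of transactions is not given in your message, so the
value 100000 is a made-up assumption, and the names x_log_x, entropy and
llr are mine.

```python
from math import log


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    """Unnormalized entropy: N ln N - sum(k ln k), where N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """G^2 for a 2x2 table.

    k11: transactions with both A and B
    k12: transactions with A but not B
    k21: transactions with B but not A
    k22: transactions with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point round-off.
    return max(0.0, 2.0 * (row + col - mat))


# Counts from the quoted sample output; the total is a hypothetical value.
total = 100000          # ASSUMPTION: not present in the original message
both = 355              # support of [46705, 46485]
a_only = 936 - both     # 46485 without 46705
b_only = 2526 - both    # 46705 without 46485
neither = total - both - a_only - b_only

score = llr(both, a_only, b_only, neither)
print(f"G^2 = {score:.1f}")
```

A score near zero means the pair cooccurs about as often as independence
would predict.  Large scores flag pairs worth keeping, and you can then
rank or threshold on the score instead of on the raw count ratio.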

This G^2 test is available in Mahout, but I think that the PFPGrowth
algorithm inherently imposes something like it during the winnowing of
patterns.

Another powerful method is to use spectral techniques to find cliques in a
large graph.  This can have very dramatic results and very
good scaling relative to iterative item-set growth techniques.

If you could say more about what you are trying to do at a high level, we
could probably help you find capabilities in Mahout that
suit your needs.

On Tue, Nov 9, 2010 at 8:20 AM, <[email protected]> wrote:

> Hello all,
> I am new to mahout. I have just started looking into mahout to replace our
> current fpgrowth implementation with a parallel fp growth that Mahout since
> we started having scalability issues. I looked at PFPGrowth documentation
> and I noticed that it only produces top K frequent patterns but not the
> associations and what we need is associations. So I was thinking of
> implementing a simple AssociationGenerator given the frequent patterns
> output. However I am not sure what is the best way to generate associations
> given the frequent patterns produced by mahout.
>
> I have the following sample output from mahout.
>
> Key: 46485: Value: ([46485],936), ([46705, 46485],355)
> Key: 46705: Value: ([46705],2526)
>
> We are interested only in item set size of 2 since we need only 1
> ANTECEDENT to 1 CONSEQUENT ASSOCIATIONS ONLY.
>
> I was planning to calculate associations with confidence as follows:
> For each key above as A {
>        for each two-item set as [A,C] {
>                confidence (A->C) = support(A->C)/support(C);
>                add association (A, C, confidence(A->C) to the list;
>        }
> }
>
> Keeping the above requirement and pseudo code in mind, my questions are as
> follows:
> 1. Is the above algorithm efficient?
> 2. In the first pattern, [46705, 46485] occurred 355 times but in second
> pattern why is the same pattern not repeated. Because of this calculating
> confidence (46705 -> 46485) becomes difficult. As you can see from above
> code, I was planning to read patterns for each feature and calculate
> confidence of all association with antecedent. But when I read feature
> 46705, I cannot calculate confidence of (46705 -> 46485) since the item set
> is not included with the feature.
> 3. Has anyone implemented associations from the generated frequent
> patterns.
>
>
> Thanks
> Praveen
>
>
