Praveen, Could you define what you mean by association? Do you mean temporal sequence? A causal relationship? Or simply a cooccurrence?
There is considerable confusion in the world about these terms. To answer your question, it will be important to be sure what you mean by your question. Assuming that you are looking at cooccurrence, possibly with a temporal ordering, your measure of confidence is anything but a measure of confidence. More correctly, you are estimating the conditional probability P(B|A). The estimate you are using, however, is subject to substantial error when counts are too small. In many applications, you get much better results if you refrain from estimating conditional probabilities at all and satisfy yourself with simply separating those conditional probabilities that differ from the marginal probabilities (i.e. where P(B) != P(B | A) ). This is a much simpler task and helps you avoid over-fitting. In the Luduan system, for instance, I used a multinomial generalized log-likelihood ratio test (often called G^2) to finding interesting query terms and then used general corpus frequencies to weight the terms. The results were substantially better than methods that tried to weight the terms using conditional probabilities because the Luduan approach could avoid over-fitting better. This G^2 test is available in Mahout, but I think that the PFPGrowth algorithm inherently imposes something like it during the winnowing of patterns. Another powerful method is to use spectral techniques to find cliques in a large graph. This can have very dramatic results and very good scaling relative to iterative item-set growth techniques. If you could say more about what you are trying to do at a high level, we could probably help you find capabilities in Mahout that suit your needs. On Tue, Nov 9, 2010 at 8:20 AM, <[email protected]> wrote: > Hello all, > I am new to mahout. I have just started looking into mahout to replace our > current fpgrowth implementation with a parallel fp growth that Mahout since > we started having scalability issues. I looked at PFPGrowth documentation > and I noticed that it only produces top K frequent patterns but not the > associations and what we need is associations. So I was thinking of > implementing a simple AssociationGenerator given the frequent patterns > output. However I am not sure what is the best way to generate associations > given the frequent patterns produced by mahout. > > I have the following sample output from mahout. > > Key: 46485: Value: ([46485],936), ([46705, 46485],355) > Key: 46705: Value: ([46705],2526) > > We are interested only in item set size of 2 since we need only 1 > ANTECEDENT to 1 CONSEQUENT ASSOCIATIONS ONLY. > > I was planning to calculate associations with confidence as follows: > For each key above as A { > for each two-item set as [A,C] { > confidence (A->C) = support(A->C)/support(C); > add association (A, C, confidence(A->C) to the list; > } > } > > Keeping the above requirement and pseudo code n mind, my questions as > follows: > 1. Is the above algorithm efficient? > 2. In the first pattern, [46705, 46485] occurred 355 times but in second > pattern why is the same pattern not repeated. Because of this calculating > confidence (46705 -> 46485) becomes difficult. As you can see from above > code, I was planning to read patterns for each feature and calculate > confidence of all association with antecedent. But when I read feature > 46705, I cannot calculate confidence of (46705 -> 46485) since the item set > is not included with the feature. > 3. Has anyone implemented associations from the generated frequent > patterns. > > > Thanks > Praveen > >
