Robin, 

Here is how (P)FPGrowth behaves, as far as I can tell:

FPGrowth reports the support of each itemset in isolation. Say item X appears on its own 12 times and together with item Y 10 times (22 times in total), and item Y appears on its own 4 times (14 times in total). Then, for a minimum support of 2, the output is:

12 X
10 X Y
4   Y

If the minimum support is 5, the output is:
12 X
10 X Y

If the minimum support is 11, the output is:
12 X

If the minimum support is 13, there is NO output, even though X's support was 22 and Y's was 14 all along.

Attachment: XY
Description: Binary data



Even if we only wanted the maximal itemsets (although I would like to see ALL the frequent itemsets, maximal or not), this output is wrong: with a minimum support of 13 we should still see X (22) and Y (14).
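
To make the numbers concrete, here is a minimal sketch of the support count I would expect, run over the XY data described above. It is not Mahout code: the class SupportCount and its methods are made up for illustration, and the baskets are held as a map from each distinct basket to the number of times it occurs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout.
class SupportCount {
    // Support of an itemset = total count of baskets containing every item in it.
    public static long support(Map<Set<String>, Long> baskets, Set<String> itemset) {
        long total = 0;
        for (Map.Entry<Set<String>, Long> e : baskets.entrySet()) {
            if (e.getKey().containsAll(itemset)) {
                total += e.getValue();
            }
        }
        return total;
    }

    // The XY test data: distinct basket -> number of occurrences.
    public static Map<Set<String>, Long> xyData() {
        Map<Set<String>, Long> baskets = new LinkedHashMap<>();
        baskets.put(Set.of("X"), 12L);
        baskets.put(Set.of("X", "Y"), 10L);
        baskets.put(Set.of("Y"), 4L);
        return baskets;
    }

    public static void main(String[] args) {
        Map<Set<String>, Long> baskets = xyData();
        System.out.println(support(baskets, Set.of("X")) + " X");
        System.out.println(support(baskets, Set.of("Y")) + " Y");
        System.out.println(support(baskets, Set.of("X", "Y")) + " X Y");
    }
}
```

This prints 22 X, 14 Y and 10 X Y, so with a minimum support of 13 both X and Y should still be reported.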


Now say you add the basket X Y Z 11 times.

Attachment: XYZ
Description: Binary data


For minimum support 1 you'd see:
12 X
10 X Y
11 X Y Z
4   Y

And for minimum support 11 you'd see:
12 X
11 X Y Z

Whereas I'd expect the output (for s=11) to be:
33 X
25 Y
21 X Y
11 Z
11 X Z
11 Y Z
11 X Y Z
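
As a cross-check of that expectation, here is a brute-force sketch that enumerates every itemset over the XYZ data and keeps those with support of at least 11. It is only meant for a handful of items and says nothing about how FPGrowth works internally; the class BruteForceFrequentItemsets and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout.
class BruteForceFrequentItemsets {
    // Returns every non-empty itemset with support >= minSupport, mapped to its support.
    public static Map<Set<String>, Long> frequentItemsets(Map<Set<String>, Long> baskets,
                                                          long minSupport) {
        // Collect all distinct items in sorted order.
        TreeSet<String> universe = new TreeSet<>();
        for (Set<String> basket : baskets.keySet()) {
            universe.addAll(basket);
        }
        List<String> items = new ArrayList<>(universe);

        Map<Set<String>, Long> result = new LinkedHashMap<>();
        // Each bitmask selects one candidate itemset (exponential, so small inputs only).
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            Set<String> candidate = new TreeSet<>();
            for (int i = 0; i < items.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    candidate.add(items.get(i));
                }
            }
            // Support = total count of baskets that contain the whole candidate.
            long support = 0;
            for (Map.Entry<Set<String>, Long> e : baskets.entrySet()) {
                if (e.getKey().containsAll(candidate)) {
                    support += e.getValue();
                }
            }
            if (support >= minSupport) {
                result.put(candidate, support);
            }
        }
        return result;
    }

    // The XYZ test data: distinct basket -> number of occurrences.
    public static Map<Set<String>, Long> xyzData() {
        Map<Set<String>, Long> baskets = new LinkedHashMap<>();
        baskets.put(Set.of("X"), 12L);
        baskets.put(Set.of("X", "Y"), 10L);
        baskets.put(Set.of("Y"), 4L);
        baskets.put(Set.of("X", "Y", "Z"), 11L);
        return baskets;
    }

    public static void main(String[] args) {
        frequentItemsets(xyzData(), 11).forEach((itemset, support) ->
                System.out.println(support + " " + String.join(" ", itemset)));
    }
}
```

With s=11 all seven itemsets qualify, with the supports listed above (33, 25, 21 and four times 11).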

Hope this helps. 


Vipul

On Mar 5, 2011, at 2:13 AM, Robin Anil wrote:

Hi Vipul, could you attach some test data to a JIRA issue so that I can investigate?

Robin

On Sat, Mar 5, 2011 at 12:09 PM, Vipul Pandey <[email protected]> wrote:
Hi All,


I'm running into a different issue with PFPGrowth now. I see output like:

$ cat part-r-00000 | grep 1678807047
12      1678807047
38      1678807047 3159925415

which says that the support (12) for the item (1678807047) is less than the support (38) of a pair containing that item. Needless to say, that cannot be right: a pair can never occur more often than either of its items.
I get this even with the Sequential version of FPGrowth.

$ cat part-r-00000  | grep 1441690161
12              1441690161 3910019844
18              1604285941 1441690161 3910019844
75              1441690161
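
A quick way to scan the text output for this kind of inconsistency is sketched below. It assumes each line is "support, then items", separated by whitespace, as in the grep output above. The class MonotonicityCheck and its methods are hypothetical and not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout.
class MonotonicityCheck {
    // Parses "support item item ..." lines into itemset -> support.
    // If the same itemset appears more than once, the last line wins.
    public static Map<Set<String>, Long> parse(List<String> lines) {
        Map<Set<String>, Long> patterns = new LinkedHashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 2) {
                continue; // blank line, or a support with no items
            }
            Set<String> items = new HashSet<>(Arrays.asList(tokens).subList(1, tokens.length));
            patterns.put(items, Long.parseLong(tokens[0]));
        }
        return patterns;
    }

    // Lists every (subset, superset) pair where the superset has the larger support.
    public static List<String> violations(Map<Set<String>, Long> patterns) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<Set<String>, Long> small : patterns.entrySet()) {
            for (Map.Entry<Set<String>, Long> big : patterns.entrySet()) {
                boolean properSubset = big.getKey().size() > small.getKey().size()
                        && big.getKey().containsAll(small.getKey());
                if (properSubset && big.getValue() > small.getValue()) {
                    found.add(small.getKey() + "=" + small.getValue()
                            + " < " + big.getKey() + "=" + big.getValue());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "12      1678807047",
                "38      1678807047 3159925415");
        violations(parse(lines)).forEach(System.out::println);
    }
}
```

Run over the two grep outputs above, it flags one violation in each: the single item with support 12 against the pair with support 38, and the pair with support 12 against the triple with support 18.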


I'm sure I'm doing something "crafty" somewhere.

For the sequential method, I supply the file containing the baskets and get the output as a sequence file.

I run the following code to read the sequence file and write out the support and itemsets in plain text:

(The MapReduce job was written for the PFPGrowth output, which is bigger. My reducer is just an identity reducer.)
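    @Override
    protected void map(Text key, TopKStringPatterns input, Context context)
            throws IOException, InterruptedException {
        // Each pattern is a pair of (itemset, support).
        for (Pair<List<String>, Long> pair : input.getPatterns()) {
            // Join the items of the pattern into one space-separated string.
            StringBuilder sb = new StringBuilder();
            for (String item : pair.getFirst()) {
                sb.append(item).append(" ");
            }
            // Emit the support as the key and the itemset as the value.
            context.write(new LongWritable(pair.getSecond()), new Text(sb.toString()));
        }
    }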

This gives me the output above.
Is this the right way? Am I doing something wrong while parsing the output?

My command line arguments are :
-i ./baskets/part-r-00000 -o ./patterns -k 50 -method sequential -g 10 -regex '[\t]' -s 10

Any help would be highly appreciated.

Regards,
Vipul




On Feb 3, 2011, at 6:44 PM, <[email protected]> <[email protected]> wrote:

> Hi Vipul,
> Frequent patterns are reported per feature, which is why you are seeing the pattern twice. The first one is for feature 1518311 and the second one is for feature 1476937.
>
> However, both should have exactly the same support. I am not sure why you have different supports for the same itemset. If you send the full output from Mahout as is, we could take a look.
>
> Are you running on a multi-node Hadoop cluster? If so, did you read all the output files?
>
> Praveen
> ________________________________________
> From: ext Vipul Pandey [[email protected]]
> Sent: Thursday, February 03, 2011 8:21 PM
> To: [email protected]
> Subject: PFPGrowth - weird output?
>
> Hi all!
>
> I'm trying to run PFPgrowth on my data and this is an output I get. (Please
> note that I parse the output in frequentpatterns folder and generate this
> output with the support followed by the itemset)
>
> support : Itemset
> *234     1518311    1476937  *
> 235     55843184
> 238     1238079
> 244     34541
> 247     4516454
> 252     106478
> 252     670864
> *254     1476937   1518311  *
>
> You can see that the same pair of items (*1518311    1476937*) is reported
> twice, with different supports.
>
> And below are all the occurrences of these two items together. Notice that
> it contains all the permutations of the three items (*1476937* *720020*
> *1518311*):
>
> 22 *1476937* 720020 *1518311*
> 30 *1518311* *1476937* 720020
> 30 720020 *1518311* *1476937*
> 34 720020 *1476937* *1518311*
> 38 *1518311* 720020 *1476937*
> 42 *1476937* *1518311* 720020
> 234 *1518311* *1476937*
> 254 *1476937* *1518311*
>
> Does this mean that to get the support of just the pair (*1476937*
> *1518311*) I will have to add all of them up!?
>
> Even in that case, the total comes out to *684*, whereas if I count the
> number of co-occurrences of these two items in the original baskets the
> support is *766*. Why is there a difference? Any idea?
>
>
> Thanks!
> Vipul


