Hi All, 

I'm running into a different issue with PFPGrowth now. I see output like this:

$ cat part-r-00000 | grep 1678807047
12      1678807047  
38      1678807047 3159925415  

which says that the support (12) for the item (1678807047) is lower than the 
support (38) of a pair containing that item. That cannot be right: every 
basket that contains the pair also contains the item, so the item's support 
must be at least 38. 
I get this even with the sequential version of FPGrowth: 

$ cat part-r-00000  | grep 1441690161 
12              1441690161 3910019844  
18              1604285941 1441690161 3910019844  
75              1441690161  
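
The inconsistency above can be checked mechanically over the whole output. 
This is only a minimal sketch, assuming each line of the plain-text output is 
a support count followed by whitespace-separated items; the class and method 
names are hypothetical and not part of Mahout.

```java
import java.util.*;

class SupportCheck {
    /** Parses "support item item ..." lines into itemset -> largest support seen. */
    static Map<Set<String>, Long> parse(List<String> lines) {
        Map<Set<String>, Long> support = new HashMap<>();
        for (String line : lines) {
            String[] tok = line.trim().split("\\s+");
            if (tok.length < 2) continue; // skip blank or malformed lines
            Set<String> items = new TreeSet<>(Arrays.asList(tok).subList(1, tok.length));
            support.merge(items, Long.parseLong(tok[0]), Math::max);
        }
        return support;
    }

    /** Reports every itemset whose support is below that of one of its proper supersets. */
    static List<String> violations(Map<Set<String>, Long> support) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Set<String>, Long> sub : support.entrySet()) {
            for (Map.Entry<Set<String>, Long> sup : support.entrySet()) {
                boolean properSuperset = sup.getKey().size() > sub.getKey().size()
                        && sup.getKey().containsAll(sub.getKey());
                if (properSuperset && sub.getValue() < sup.getValue()) {
                    out.add(sub.getKey() + "=" + sub.getValue()
                            + " < " + sup.getKey() + "=" + sup.getValue());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Lines copied from the grep output in this message.
        List<String> lines = Arrays.asList(
                "12 1678807047",
                "38 1678807047 3159925415",
                "12 1441690161 3910019844",
                "18 1604285941 1441690161 3910019844",
                "75 1441690161");
        violations(parse(lines)).forEach(System.out::println);
    }
}
```

The pairwise comparison is quadratic in the number of patterns, which should 
be acceptable for a top-k output but would need an index for anything larger.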


I'm sure I'm doing something "crafty" somewhere.

For the sequential method, I supply the file containing the baskets and get 
the output as a sequence file.

I run the following code to read the sequence file and write out the support 
and itemsets in plain text :

(The MapReduce job was written for the PFPGrowth output, which is bigger. My 
reducer is just an identity reducer.)

        @Override
        protected void map(Text key, TopKStringPatterns input, Context context)
                        throws IOException, InterruptedException {
                // Emit one (support, itemset) record per pattern; the key is ignored.
                for (Pair<List<String>, Long> pair : input.getPatterns()) {
                        StringBuilder sb = new StringBuilder();
                        for (String item : pair.getFirst()) {
                                sb.append(item).append(" ");
                        }
                        context.write(new LongWritable(pair.getSecond()),
                                        new Text(sb.toString()));
                }
        }

This gives me the output above. 
Is this the right way? Am I doing something wrong while parsing the output?
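
One way to narrow this down is to group the text output by itemset regardless 
of item order and list every support value each itemset was reported with. 
The following is only a sketch under the same assumption about the line 
format (support first, then whitespace-separated items); the names are 
hypothetical, and the sample lines are taken from the earlier output quoted 
further down in this thread.

```java
import java.util.*;

class PatternGrouper {
    /** Sorts the items so that every ordering of an itemset yields the same key. */
    static String canonical(List<String> items) {
        List<String> sorted = new ArrayList<>(items);
        Collections.sort(sorted);
        return String.join(" ", sorted);
    }

    /** Maps each canonical itemset to every support value it was reported with. */
    static Map<String, List<Long>> group(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] tok = line.trim().split("\\s+");
            if (tok.length < 2) continue; // skip blank or malformed lines
            String key = canonical(Arrays.asList(tok).subList(1, tok.length));
            grouped.computeIfAbsent(key, k -> new ArrayList<>())
                    .add(Long.parseLong(tok[0]));
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "234 1518311 1476937",
                "254 1476937 1518311",
                "22 1476937 720020 1518311",
                "42 1476937 1518311 720020");
        // Prints one line per distinct itemset with all of its reported supports.
        group(lines).forEach((items, supports) ->
                System.out.println(items + " -> " + supports));
    }
}
```

An itemset that shows up with more than one support value is the kind of 
record worth tracing back to the key it was emitted under.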

My command line arguments are: 
-i ./baskets/part-r-00000 -o ./patterns -k 50 -method sequential -g 10 -regex '[\t]' -s 10

Any help would be highly appreciated.

Regards,
Vipul




On Feb 3, 2011, at 6:44 PM, <[email protected]> <[email protected]> 
wrote:

> Hi Vipul,
> Frequent patterns are reported per feature, which is why you are seeing the 
> same pattern twice. The first one is for feature 1518311 and the second one 
> is for feature 1476937.
> 
> However, both should have exactly the same support. I am not sure why you 
> have different supports for the same itemset. Maybe if you send the full 
> output from Mahout as is, we could take a look.
> 
> Are you running on a multi-node Hadoop cluster? If so, did you read all the 
> output files?
> 
> Praveen
> ________________________________________
> From: ext Vipul Pandey [[email protected]]
> Sent: Thursday, February 03, 2011 8:21 PM
> To: [email protected]
> Subject: PFPGrowth - weird output?
> 
> Hi all!
> 
> I'm trying to run PFPgrowth on my data and this is an output I get. (Please
> note that I parse the output in frequentpatterns folder and generate this
> output with the support followed by the itemset)
> 
> support : Itemset
> *234     1518311    1476937  *
> 235     55843184
> 238     1238079
> 244     34541
> 247     4516454
> 252     106478
> 252     670864
> *254     1476937   1518311  *
> 
> You can see that the same pair of items (*1518311    1476937*) is reported
> twice, with different supports.
> 
> And below are all the occurrences of these two items together. If you
> notice, it has all the permutations of the three items (*1476937* *720020*
> *1518311*):
> 
> 22 *1476937* 720020 *1518311*
> 30 *1518311* *1476937* 720020
> 30 720020 *1518311* *1476937*
> 34 720020 *1476937* *1518311*
> 38 *1518311* 720020 *1476937*
> 42 *1476937* *1518311* 720020
> 234 *1518311* *1476937*
> 254 *1476937* *1518311*
> 
> Does this mean that to get the support of just the pair (*1476937*
> *1518311*) I will have to add all of them up!?
> 
> Even in that case, the total comes out to *684*, and if I count the
> number of co-occurrences of these two items in the original baskets, the
> support is *766*. Why is there a difference? Any idea?
> 
> 
> Thanks!
> Vipul
