Hi Vipul,

Is it possible for you to attach test data to a JIRA issue for me to investigate?

Robin

On Sat, Mar 5, 2011 at 12:09 PM, Vipul Pandey <[email protected]> wrote:

> Hi All,
>
> I'm running into a different issue with PFPGrowth now. I see output
> like:
>
> $ cat part-r-00000 | grep 1678807047
> 12 1678807047
> 38 1678807047 3159925415
>
> which says that the support (12) for the item (1678807047) is less than
> the support (38) of a pair containing that item. Needless to say, this
> is ridiculous.
> I get this even with the sequential version of FPGrowth.
>
> $ cat part-r-00000 | grep 1441690161
> 12 1441690161 3910019844
> 18 1604285941 1441690161 3910019844
> 75 1441690161
>
> I'm sure I'm doing something "crafty" somewhere.
>
> For sequential, I supply the file containing baskets and get the output as
> a sequence file.
>
> I run the following code to read the sequence file and write out the
> support and itemsets in plain text:
>
> (The MapReduce job was written for PFPGrowth output, which is bigger. My
> reducer is just an identity reducer.)
>
>     @Override
>     protected void map(Text key, TopKStringPatterns input, Context context)
>             throws IOException, InterruptedException {
>         // Emit one (support, itemset) record per pattern stored under this key.
>         for (Pair<List<String>, Long> pair : input.getPatterns()) {
>             StringBuilder sb = new StringBuilder();
>             for (String item : pair.getFirst()) {
>                 sb.append(item).append(" ");
>             }
>             context.write(new LongWritable(pair.getSecond()),
>                     new Text(sb.toString()));
>         }
>     }
>
> This gives me the output above.
> Is this the right way? Am I doing something wrong while parsing the output?
>
> My command line arguments are:
> -i ./baskets/part-r-00000 -o ./patterns -k 50 -method sequential -g 10
> -regex '[\t]' -s 10
>
> Any help would be highly appreciated.
>
> Regards,
> Vipul
>
> On Feb 3, 2011, at 6:44 PM, <[email protected]> <[email protected]> wrote:
>
> > Hi Vipul,
> > Frequent patterns are reported per feature, which is why you are seeing
> > the two patterns twice. The first one is for feature 1518311 and the
> > second one is for feature 1476937.
> >
> > However, both should have exactly the same support. I am not sure why you
> > have different support for the same itemset. Maybe if you send the full
> > output from Mahout as it is, we could take a look.
> >
> > Are you running on a multi-node Hadoop cluster? If so, did you read all
> > the output files?
> >
> > Praveen
> > ________________________________________
> > From: ext Vipul Pandey [[email protected]]
> > Sent: Thursday, February 03, 2011 8:21 PM
> > To: [email protected]
> > Subject: PFPGrowth - weird output?
> >
> > Hi all!
> >
> > I'm trying to run PFPGrowth on my data and this is the output I get.
> > (Please note that I parse the output in the frequentpatterns folder and
> > generate this output with the support followed by the itemset.)
> >
> > support : Itemset
> > *234 1518311 1476937*
> > 235 55843184
> > 238 1238079
> > 244 34541
> > 247 4516454
> > 252 106478
> > 252 670864
> > *254 1476937 1518311*
> >
> > You can see that the same pair of items (*1518311 1476937*) is reported
> > twice with different supports.
> >
> > Below are all the occurrences of these two items together. Notice that
> > the output contains all the permutations of the three items
> > (*1476937* *720020* *1518311*):
> >
> > 22 *1476937* 720020 *1518311*
> > 30 *1518311* *1476937* 720020
> > 30 720020 *1518311* *1476937*
> > 34 720020 *1476937* *1518311*
> > 38 *1518311* 720020 *1476937*
> > 42 *1476937* *1518311* 720020
> > 234 *1518311* *1476937*
> > 254 *1476937* *1518311*
> >
> > Does this mean that to get the support of just the pair
> > (*1476937* *1518311*) I will have to add all of them up?
> >
> > Even in that case, the total comes out to *684*, and if I count the
> > number of co-occurrences of these two items in the original baskets the
> > support is *766*. Why is there a difference? Any idea?
> >
> > Thanks!
> > Vipul
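
One way to cross-check the numbers discussed in this thread is to recount support directly from the baskets, treating each itemset as an unordered set, and to group the reported patterns so that permutations of the same items land under one key. The sketch below is plain Java and not Mahout code: the class `SupportCheck` and its method names are hypothetical, and it assumes one basket per line with items separated by the same regex passed via -regex, and reported lines in the "support item item ..." form shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for sanity-checking frequent-pattern output; not part of Mahout.
class SupportCheck {

    // Parse one basket per line, splitting items on the given regex (assumed format).
    static List<Set<String>> parseBaskets(List<String> lines, String regex) {
        List<Set<String>> baskets = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            baskets.add(new HashSet<>(Arrays.asList(line.trim().split(regex))));
        }
        return baskets;
    }

    // Support = number of baskets that contain every item of the itemset, in any order.
    static long support(List<Set<String>> baskets, Collection<String> itemset) {
        long count = 0;
        for (Set<String> basket : baskets) {
            if (basket.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    // Canonical key: items sorted and space-joined, so permutations collapse to one key.
    static String canonical(Collection<String> itemset) {
        return String.join(" ", new TreeSet<>(itemset));
    }

    // Group reported "support item item ..." lines by canonical itemset and
    // collect every distinct support value reported for that itemset.
    static Map<String, Set<Long>> groupReported(List<String> outputLines) {
        Map<String, Set<Long>> grouped = new TreeMap<>();
        for (String line : outputLines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 2) {
                continue; // need a support value and at least one item
            }
            long reportedSupport = Long.parseLong(tokens[0]);
            String key = canonical(Arrays.asList(tokens).subList(1, tokens.length));
            grouped.computeIfAbsent(key, k -> new TreeSet<>()).add(reportedSupport);
        }
        return grouped;
    }
}
```

Any itemset whose group holds more than one support value, or whose reported support exceeds the recount for one of its own subsets, would be a candidate for the test data requested above. Note that `support` rescans every basket per query, which is fine for spot checks but not for large inputs.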
