Hi All,
I'm running into a different issue with PFPGrowth now. I see output like this:
$ cat part-r-00000 | grep 1678807047
12 1678807047
38 1678807047 3159925415
which says that the support (12) for the item (1678807047) is lower than the
support (38) of a pair containing that item. That cannot be right: every basket
that contains the pair also contains the item, so the item's support should be
at least 38.
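To make the check concrete, here is a minimal sketch in plain Java (no Mahout dependency) that scans "support item1 item2 ..." lines like the ones above and flags any itemset whose support is lower than that of one of its supersets. The class and method names are made up for illustration, and keeping the largest support when an itemset appears more than once is my own assumption, not something Mahout does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SupportSanityCheck {

    /** Parses lines of the form "support item1 item2 ..." into itemset -> support. */
    static Map<Set<String>, Long> parse(List<String> lines) {
        Map<Set<String>, Long> supports = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 2) {
                continue; // skip blank or malformed lines
            }
            long support = Long.parseLong(tokens[0]);
            // A sorted set makes the key independent of item order.
            Set<String> itemset = new TreeSet<>(Arrays.asList(tokens).subList(1, tokens.length));
            // Assumption: if an itemset is reported several times, keep the largest support.
            supports.merge(itemset, support, Math::max);
        }
        return supports;
    }

    /** Returns one message per (subset, superset) pair where the superset has higher support. */
    static List<String> findViolations(Map<Set<String>, Long> supports) {
        List<String> violations = new ArrayList<>();
        for (Map.Entry<Set<String>, Long> small : supports.entrySet()) {
            for (Map.Entry<Set<String>, Long> large : supports.entrySet()) {
                boolean properSuperset = large.getKey().size() > small.getKey().size()
                        && large.getKey().containsAll(small.getKey());
                if (properSuperset && large.getValue() > small.getValue()) {
                    violations.add(small.getKey() + "=" + small.getValue()
                            + " < " + large.getKey() + "=" + large.getValue());
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "12 1678807047",
                "38 1678807047 3159925415");
        for (String violation : findViolations(parse(lines))) {
            System.out.println(violation);
        }
    }
}
```

Run on the two lines above, it reports the single item as violating the rule against the pair. The pairwise comparison is quadratic in the number of patterns, which is fine for a grep-sized sample but not for the full output.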
I get this even with the Sequential version of FPGrowth.
$ cat part-r-00000 | grep 1441690161
12 1441690161 3910019844
18 1604285941 1441690161 3910019844
75 1441690161
I'm sure I'm doing something "crafty" somewhere.
For sequential, I supply the file containing baskets and get the output as a
file of sequences.
I run the following code to read the sequence file and write out the support
and itemsets in plain text:
(The MapReduce job was written for PFPGrowth output, which is bigger. My reducer
is just an identity reducer.)
@Override
protected void map(Text key, TopKStringPatterns input, Context context)
        throws IOException, InterruptedException {
    for (Pair<List<String>, Long> pair : input.getPatterns()) {
        // Build a space-separated itemset string for this pattern.
        StringBuilder sb = new StringBuilder();
        for (String item : pair.getFirst()) {
            sb.append(item).append(" ");
        }
        // Emit the support as the key and the itemset as the value.
        context.write(new LongWritable(pair.getSecond()), new Text(sb.toString()));
    }
}
This gives me the output above.
Is this the right way? Am I doing something wrong while parsing the output?
My command line arguments are:
-i ./baskets/part-r-00000 -o ./patterns -k 50 -method sequential -g 10 -regex '[\t]' -s 10
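For cross-checking a reported support against the raw data, a rough sketch like the following counts how many baskets contain every item of a given itemset. It assumes one basket per line with tab-separated items (matching the '[\t]' regex above); the class name and the sample baskets are hypothetical, not taken from my data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BasketSupportCounter {

    /** Counts the baskets (one per line, tab-separated items) that contain all items of the itemset. */
    static long countSupport(List<String> basketLines, Set<String> itemset) {
        long count = 0;
        for (String line : basketLines) {
            // Using a set means duplicate items within one basket are counted once.
            Set<String> basket = new HashSet<>(Arrays.asList(line.split("\t")));
            if (basket.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical baskets, for illustration only.
        List<String> baskets = Arrays.asList("a\tb\tc", "a\tb", "b\tc", "a");
        Set<String> pair = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(countSupport(baskets, pair));
    }
}
```

The count obtained this way for a single item can never be smaller than the count for any pair containing it, which is the property the output above violates.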
Any help would be highly appreciated.
Regards,
Vipul
On Feb 3, 2011, at 6:44 PM, <[email protected]> <[email protected]>
wrote:
> Hi Vipul,
> Frequent patterns are reported per feature, which is why you are seeing the
> same pair twice. The first one is for feature 1518311 and the second one is
> for feature 1476937.
>
> However, both should have exactly the same support. I am not sure why you have
> different supports for the same itemset. Maybe if you send the full output
> from Mahout as is, we could take a look.
>
> Are you running on a multi-node Hadoop cluster? If so, did you read all the
> output files?
>
> Praveen
> ________________________________________
> From: ext Vipul Pandey [[email protected]]
> Sent: Thursday, February 03, 2011 8:21 PM
> To: [email protected]
> Subject: PFPGrowth - weird output?
>
> Hi all!
>
> I'm trying to run PFPgrowth on my data and this is an output I get. (Please
> note that I parse the output in frequentpatterns folder and generate this
> output with the support followed by the itemset)
>
> support : Itemset
> *234 1518311 1476937 *
> 235 55843184
> 238 1238079
> 244 34541
> 247 4516454
> 252 106478
> 252 670864
> *254 1476937 1518311 *
>
> You can see that two items are reported twice (*1518311 1476937*) with
> different supports.
>
> And below are all the occurrences of these two items together. If you
> notice, it has all the permutations of the three items (*1476937* *720020*
> *1518311*):
>
> 22 *1476937* 720020 *1518311*
> 30 *1518311* *1476937* 720020
> 30 720020 *1518311* *1476937*
> 34 720020 *1476937* *1518311*
> 38 *1518311* 720020 *1476937*
> 42 *1476937* *1518311* 720020
> 234 *1518311* *1476937*
> 254 *1476937* *1518311*
>
> Does this mean that to get the support of just the pair (*1476937*
> *1518311*) I will have to add all of them up!?
>
> Even in that case, this total comes out to *684*, and if I count the
> number of co-occurrences of these two items in the original baskets the
> support is *766*. Why is there a difference? Any idea?
>
>
> Thanks!
> Vipul