Re: PFPGrowth - weird output?

Vipul Pandey Wed, 09 Mar 2011 20:38:06 -0800

Robin,

So here's how (P)FPGrowth behaves, as far as I can see:

FPGrowth reports the support of each itemset in isolation rather than cumulatively. Say item X appears on its own 12 times and together with item Y 10 times (22 times in total), and item Y appears on its own 4 times (14 times in total). Then, for a minimum support of 2, the output will be:

12 X

10 X Y

4 Y

If the minimum support is 5 then the output will look like:

12 X

10 X Y

If the minimum support is 11 then the output will look like:

12 X

If the minimum support is 13 then there will be NO output, even though X's true support was 22 and Y's was 14 all along.

Even if we only want to show the maximal itemsets (although I would like to see ALL the frequent itemsets, maximal or not), this output is wrong: with a minimum support of 13 we should still have seen X (22) and Y (14).

Now say you add the basket X Y Z 11 times.

For a minimum support of 1 you'd see:

12 X

10 X Y

11 X Y Z

4 Y

And for a minimum support of 11 you'd see:

12 X

11 X Y Z

I'd expect the output (for s=11) to be this instead:

33 X

25 Y

21 XY

11 Z

11 XZ

11 YZ

11 XYZ
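
To make the numbers I expect concrete, here is a minimal brute-force sketch of how I am counting support. It is not Mahout code: the class name BruteForceSupport and its methods are made up for illustration, and enumerating every subset like this is only feasible for a toy data set such as the X/Y/Z baskets above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BruteForceSupport {

    // Returns every itemset whose support is at least minSupport.
    // baskets maps each distinct basket to the number of times it occurs.
    static Map<String, Long> frequentItemsets(Map<Set<String>, Long> baskets, long minSupport) {
        // Sorted distinct items, so each itemset gets exactly one canonical label.
        TreeSet<String> distinct = new TreeSet<>();
        for (Set<String> basket : baskets.keySet()) {
            distinct.addAll(basket);
        }
        List<String> items = new ArrayList<>(distinct);

        Map<String, Long> result = new LinkedHashMap<>();
        // Enumerate all non-empty subsets of the items via bitmasks (toy data only).
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            Set<String> candidate = new TreeSet<>();
            for (int i = 0; i < items.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    candidate.add(items.get(i));
                }
            }
            // Support = number of baskets that contain every item of the candidate.
            long support = 0;
            for (Map.Entry<Set<String>, Long> e : baskets.entrySet()) {
                if (e.getKey().containsAll(candidate)) {
                    support += e.getValue();
                }
            }
            if (support >= minSupport) {
                result.put(String.join(" ", candidate), support);
            }
        }
        return result;
    }

    // The example data from this message: X x12, X Y x10, Y x4, X Y Z x11.
    static Map<Set<String>, Long> exampleBaskets() {
        Map<Set<String>, Long> baskets = new LinkedHashMap<>();
        baskets.put(Set.of("X"), 12L);
        baskets.put(Set.of("X", "Y"), 10L);
        baskets.put(Set.of("Y"), 4L);
        baskets.put(Set.of("X", "Y", "Z"), 11L);
        return baskets;
    }

    public static void main(String[] args) {
        // Print "support itemset" lines for a minimum support of 11.
        frequentItemsets(exampleBaskets(), 11)
            .forEach((itemset, support) -> System.out.println(support + " " + itemset));
    }
}
```

Counted this way, a basket X Y Z contributes to the support of X, Y, Z, X Y, X Z, Y Z and X Y Z alike, which is where the 33, 25 and 21 above come from.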

Hope this helps.

Vipul

On Mar 5, 2011, at 2:13 AM, Robin Anil wrote:

Hi Vipul, is it possible for you to attach test data to a JIRA issue for me to investigate?

Robin

On Sat, Mar 5, 2011 at 12:09 PM, Vipul Pandey <[email protected]> wrote:

Hi All,

I'm running into a different issue with PFPGrowth now. I see output like:

$ cat part-r-00000 | grep 1678807047
12 1678807047
38 1678807047 3159925415

which says that the support (12) of the item (1678807047) is less than the support (38) of a pair containing that item. That cannot be right, since every basket containing the pair also contains the item.
I get this even with the sequential version of FPGrowth:

$ cat part-r-00000 | grep 1441690161
12 1441690161 3910019844
18 1604285941 1441690161 3910019844
75 1441690161
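
Both samples break the rule that an itemset can never be more frequent than any of its subsets. A small check along these lines flags such lines in the text dump. This is my own sketch, not part of Mahout; the class name SupportSanityCheck and its methods are hypothetical, and it assumes each line has the form "support item item ..." separated by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportSanityCheck {

    // One parsed line of the text dump: "<support> <item> <item> ...".
    static final class Pattern {
        final long support;
        final Set<String> items;

        Pattern(long support, Set<String> items) {
            this.support = support;
            this.items = items;
        }
    }

    static Pattern parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        Set<String> items = new HashSet<>(Arrays.asList(tokens).subList(1, tokens.length));
        return new Pattern(Long.parseLong(tokens[0]), items);
    }

    // Reports every (subset, superset) pair where the superset has the larger support.
    static List<String> violations(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                patterns.add(parse(line));
            }
        }
        List<String> found = new ArrayList<>();
        for (Pattern small : patterns) {
            for (Pattern big : patterns) {
                boolean properSuperset = big.items.size() > small.items.size()
                        && big.items.containsAll(small.items);
                if (properSuperset && big.support > small.support) {
                    found.add(small.items + " has support " + small.support
                            + " but its superset " + big.items + " has " + big.support);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // The lines from the two grep outputs above.
        List<String> lines = Arrays.asList(
                "12 1678807047",
                "38 1678807047 3159925415",
                "12 1441690161 3910019844",
                "18 1604285941 1441690161 3910019844",
                "75 1441690161");
        violations(lines).forEach(System.out::println);
    }
}
```

The comparison is quadratic in the number of lines, so for a full PFPGrowth dump it would make sense to run it on the grep output for one item at a time.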

I'm sure I'm doing something "crafty" somewhere.

For the sequential method, I supply the file containing the baskets and get the output as a sequence file.

I run the following code to read the sequence file and write out the supports and itemsets in plain text:

(I wrote it as a MapReduce job for the PFPGrowth output, which is bigger. My reducer is just an identity reducer.)
@Override
protected void map(Text key, TopKStringPatterns input, Context context)
    throws IOException, InterruptedException {
  // Emit (support, space-separated items) for each pattern stored under this key.
  for (Pair<List<String>, Long> pair : input.getPatterns()) {
    StringBuilder sb = new StringBuilder();
    for (String item : pair.getFirst()) {
      sb.append(item).append(' ');
    }
    context.write(new LongWritable(pair.getSecond()), new Text(sb.toString()));
  }
}

This gives me the output above.
Is this the right way? Am I doing something wrong while parsing the output?

My command line arguments are:
-i ./baskets/part-r-00000 -o ./patterns -k 50 -method sequential -g 10 -regex '[\t]' -s 10

Any help would be highly appreciated.

Regards,

Vipul

On Feb 3, 2011, at 6:44 PM, <[email protected]> <[email protected]> wrote:

> Hi Vipul,
> Frequent patterns are reported per feature, which is why you are seeing the same pattern twice. The first one is for feature 1518311 and the second one is for feature 1476937.
>
> However, both should have exactly the same support. I am not sure why you have different supports for the same itemset. Maybe if you send the full output from Mahout as it is, we could take a look.
>
> Are you running on a multi-node Hadoop cluster? If so, did you read all the output files?
>
> Praveen
> ________________________________________
> From: ext Vipul Pandey [[email protected]]
> Sent: Thursday, February 03, 2011 8:21 PM
> To: [email protected]
> Subject: PFPGrowth - weird output?
>
> Hi all!
>
> I'm trying to run PFPgrowth on my data and this is an output I get. (Please
> note that I parse the output in frequentpatterns folder and generate this
> output with the support followed by the itemset)
>
> support : Itemset
> *234 1518311 1476937 *
> 235 55843184
> 238 1238079
> 244 34541
> 247 4516454
> 252 106478
> 252 670864
> *254 1476937 1518311 *
>
> You can see that the same pair of items (*1518311 1476937*) is reported
> twice, with different supports.
>
> And below are all the occurrences of these two items together. Notice
> that they include all the permutations of the three items
> (*1476937* *720020* *1518311*):
>
> 22 *1476937* 720020 *1518311*
> 30 *1518311* *1476937* 720020
> 30 720020 *1518311* *1476937*
> 34 720020 *1476937* *1518311*
> 38 *1518311* 720020 *1476937*
> 42 *1476937* *1518311* 720020
> 234 *1518311* *1476937*
> 254 *1476937* *1518311*
>
> Does this mean that to get the support of just the pair (*1476937*
> *1518311*) I will have to add all of them up!?
>
> Even in that case, the total comes out to *684*, whereas if I count the
> number of co-occurrences of these two items in the original baskets the
> support is *766*. Why is there a difference? Any idea?
>
>
> Thanks!
> Vipul
