Re: FPGrowth does not handle large result sets

2016-01-13 Thread Ritu Raj Tiwari
Thanks Sean! I'll start with a higher support threshold and work my way down.


Re: FPGrowth does not handle large result sets

2016-01-13 Thread Sean Owen
You're looking for subsets of items that appear in at least 200 of
200,000 transactions, which could be a whole lot. Keep in mind there
are 25,000 items, sure, but already 625,000,000 possible pairs of
items, and trillions of possible 3-item subsets. This sounds like it's
just far too low. Start with 0.1 and work down. I don't think there's
a general formula, since if each transaction contained just one item, no
sets would be frequent, and if every transaction contained every item, then
all sets would be frequent and that number is indescribably large.
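
For a rough feel of those counts, and for the "start with 0.1 and work down" approach, here is a minimal sketch against the Spark 1.x PySpark MLlib API (pyspark.mllib.fpm.FPGrowth). It is an illustration only: the input path, thresholds, and cutoff are placeholders rather than values from this thread, and math.comb needs Python 3.8+.

from math import comb
from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

# Scale of the candidate space for 25,000 distinct items.
n_items = 25000
print(n_items ** 2)        # 625,000,000 ordered pairs
print(comb(n_items, 2))    # ~3.1e8 unordered pairs
print(comb(n_items, 3))    # ~2.6e12 unordered 3-item subsets ("trillions")

sc = SparkContext(appName="fpgrowth-support-sweep")
# One whitespace-separated transaction per line (placeholder path); items
# within a transaction must be unique for FPGrowth, hence the set().
transactions = sc.textFile("transactions.txt") \
                 .map(lambda line: list(set(line.strip().split(" ")))) \
                 .cache()

# Start with a high support threshold and lower it only while the result
# set stays manageable.
for min_support in [0.1, 0.05, 0.01, 0.005, 0.001]:
    model = FPGrowth.train(transactions, minSupport=min_support, numPartitions=10)
    # count() keeps the itemsets distributed instead of collect()ing them.
    n = model.freqItemsets().count()
    print("minSupport=%g -> %d frequent itemsets" % (min_support, n))
    if n > 1000000:  # arbitrary cutoff; stop before the result set explodes
        break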


Re: FPGrowth does not handle large result sets

2016-01-13 Thread Ritu Raj Tiwari
Hi Sean:
Thanks for checking out my question here. It's possible I am making a
newbie error. Based on my dataset of about 200,000 transactions and a minimum
support level of 0.001, I am looking for items that appear at least 200 times.
Given that the items in my transactions are drawn from a set of about 25,000 (I
previously thought 17,000), what would be a rational way to determine the
(peak) memory needs of my driver node?
-Raj 
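
(There is no exact answer to this in the thread, but as a very rough, assumption-laden sanity check rather than anything from Spark itself: the driver-side cost of collecting results scales with the number of frequent itemsets times the per-itemset size, and it is that count which explodes at low support. Every constant below is a guess for illustration.)

# Back-of-envelope estimate of driver memory needed to collect() N itemsets.
avg_items_per_itemset = 3     # assumed
bytes_per_item_id = 60        # assumed: short SKU string plus object overhead
per_itemset_overhead = 100    # assumed: list/tuple and bookkeeping overhead

def rough_driver_bytes(num_frequent_itemsets):
    per_itemset = avg_items_per_itemset * bytes_per_item_id + per_itemset_overhead
    return num_frequent_itemsets * per_itemset

for n in (10**5, 10**7, 10**9):
    print("%d itemsets -> roughly %.2f GB" % (n, rough_driver_bytes(n) / 1e9))

# The hard part is estimating the itemset count itself; as Sean notes in his
# reply, there is no general formula for it, which is why sweeping down from
# a high minSupport is the practical approach.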


Re: FPGrowth does not handle large result sets

2016-01-13 Thread Sean Owen
As I said in your JIRA, the collect() in question is bringing results
back to the driver to return them. The assumption is that there isn't
a vast number of frequent items; if there is, then they aren't really
'frequent' and your min support is too low.
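
(A user-side workaround sketch only, not a change to the collect() inside Spark discussed here: the PySpark API exposes the result as an RDD, so a very large result set can be written out from the executors instead of being collected on the driver. Whether that avoids this particular OOM depends on where the failing collect() actually happens; the paths and parameters are placeholders.)

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="fpgrowth-save")
transactions = sc.textFile("transactions.txt") \
                 .map(lambda line: list(set(line.strip().split(" "))))

model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=10)

# freqItemsets() is an RDD of FreqItemset(items, freq); saving it keeps the
# itemsets distributed rather than materializing them all in driver memory.
model.freqItemsets() \
     .map(lambda fi: "%s\t%d" % (",".join(fi.items), fi.freq)) \
     .saveAsTextFile("frequent-itemsets-output")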


Re: FPGrowth does not handle large result sets

2016-01-12 Thread Ritu Raj Tiwari
I have been giving it 8-12 GB.

-Raj
Sent from my iPhone



Re: FPGrowth does not handle large result sets

2016-01-12 Thread Sabarish Sasidharan
How much RAM are you giving to the driver? 17,000 items being collected
shouldn't fail unless your driver memory is too low.

Regards
Sab


FPGrowth does not handle large result sets

2016-01-12 Thread Ritu Raj Tiwari
Folks:
We are running into a problem where FPGrowth seems to choke on data sets
that we think are not too large. We have about 200,000 transactions. Each
transaction is composed of, on average, 50 items. There are about 17,000
unique items (SKUs) that might show up in any transaction.

When running locally with 12 GB of RAM given to the PySpark process, the FPGrowth
code fails with an out-of-memory error for a minSupport of 0.001. The failure occurs
when we try to enumerate and save the frequent itemsets. Looking at the
FPGrowth code
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
it seems this is because the genFreqItems() method tries to collect() all
items. Is there a way the code could be rewritten so it does not try to collect,
and therefore store, all frequent itemsets in memory?
Thanks for any insights.
-Raj