As I said in your JIRA, the collect() in question is bringing results back to the driver in order to return them. The assumption is that there isn't a vast number of frequent items. If there is, then they aren't really 'frequent', and your min support is too low.
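
For concreteness, a minimal PySpark sketch along those lines; the paths and the minSupport value of 0.01 are placeholders, not taken from this thread. It raises minSupport so the result set stays small, and writes the frequent itemsets out as a distributed RDD rather than collect()-ing them on the driver:

    from pyspark import SparkContext
    from pyspark.mllib.fpm import FPGrowth

    sc = SparkContext(appName="fpgrowth-sketch")

    # One transaction per line, space-separated SKUs. FPGrowth requires the
    # items within a single transaction to be unique, hence the set().
    transactions = sc.textFile("hdfs:///path/to/transactions.txt") \
                     .map(lambda line: list(set(line.strip().split(" "))))

    # With 200,000 transactions, minSupport=0.001 makes any itemset seen in
    # just 200 transactions "frequent", which can yield a huge result set.
    # 0.01 (i.e. 2,000 transactions) is an arbitrary, stricter placeholder.
    model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=10)

    # freqItemsets() is an RDD, so it can be written out in parallel instead
    # of being materialized on the driver via collect().
    model.freqItemsets() \
         .map(lambda fi: (fi.items, fi.freq)) \
         .saveAsTextFile("hdfs:///path/to/freq-itemsets")
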
On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari <rituraj_tiw...@yahoo.com.invalid> wrote:
> Folks:
> We are running into a problem where FPGrowth seems to choke on data sets
> that we think are not too large. We have about 200,000 transactions. Each
> transaction is composed of, on average, 50 items. There are about 17,000
> unique items (SKUs) that might show up in any transaction.
>
> When running locally with 12G of RAM given to the PySpark process, the
> FPGrowth code fails with an out-of-memory error for a minSupport of 0.001.
> The failure occurs when we try to enumerate and save the frequent itemsets.
> Looking at the FPGrowth code
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
> it seems this is because the genFreqItems() method tries to collect() all
> items. Is there a way the code could be rewritten so it does not try to
> collect, and therefore store, all frequent itemsets in memory?
>
> Thanks for any insights.
>
> -Raj
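
A guess at the failing pattern Raj describes (the thread does not include the actual code, so the path and variable names below are hypothetical):

    from pyspark import SparkContext
    from pyspark.mllib.fpm import FPGrowth

    sc = SparkContext(appName="fpgrowth-oom")
    transactions = sc.textFile("transactions.txt") \
                     .map(lambda line: line.strip().split(" "))
    model = FPGrowth.train(transactions, minSupport=0.001, numPartitions=10)

    # collect() materializes every frequent itemset on the driver at once;
    # at minSupport=0.001 over ~17,000 SKUs, that list can exhaust 12G of RAM.
    itemsets = model.freqItemsets().collect()
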