As I said in your JIRA, the collect() in question is bringing results back to the driver in order to return them. The assumption is that there isn't a vast number of frequent items. If there is, then they aren't really 'frequent', and your min support is too low.
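
For concreteness, a minimal PySpark sketch along those lines; the paths and the minSupport value of 0.01 are placeholders, not taken from this thread. It raises minSupport so the result set stays small, and writes the frequent itemsets out as a distributed RDD rather than collect()-ing them on the driver:

    from pyspark import SparkContext
    from pyspark.mllib.fpm import FPGrowth

    sc = SparkContext(appName="fpgrowth-sketch")

    # One transaction per line, space-separated SKUs. FPGrowth requires the
    # items within a single transaction to be unique, hence the set().
    transactions = sc.textFile("hdfs:///path/to/transactions.txt") \
                     .map(lambda line: list(set(line.strip().split(" "))))

    # With 200,000 transactions, minSupport=0.001 makes any itemset seen in
    # just 200 transactions "frequent", which can yield a huge result set.
    # 0.01 (i.e. 2,000 transactions) is an arbitrary, stricter placeholder.
    model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=10)

    # freqItemsets() is an RDD, so it can be written out in parallel instead
    # of being materialized on the driver via collect().
    model.freqItemsets() \
         .map(lambda fi: (fi.items, fi.freq)) \
         .saveAsTextFile("hdfs:///path/to/freq-itemsets")
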
On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari <rituraj_tiw...@yahoo.com.invalid> wrote:
> Folks:
> We are running into a problem where FPGrowth seems to choke on data sets
> that we think are not too large. We have about 200,000 transactions. Each
> transaction is composed of, on average, 50 items. There are about 17,000
> unique items (SKUs) that might show up in any transaction.
>
> When running locally with 12G of RAM given to the PySpark process, the
> FPGrowth code fails with an out-of-memory error for a minSupport of 0.001.
> The failure occurs when we try to enumerate and save the frequent itemsets.
> Looking at the FPGrowth code
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
> it seems this is because the genFreqItems() method tries to collect() all
> items. Is there a way the code could be rewritten so it does not try to
> collect, and therefore store, all frequent itemsets in memory?
>
> Thanks for any insights.
>
> -Raj
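
A guess at the failing pattern Raj describes (the thread does not include the actual code, so the path and variable names below are hypothetical):

    from pyspark import SparkContext
    from pyspark.mllib.fpm import FPGrowth

    sc = SparkContext(appName="fpgrowth-oom")
    transactions = sc.textFile("transactions.txt") \
                     .map(lambda line: line.strip().split(" "))
    model = FPGrowth.train(transactions, minSupport=0.001, numPartitions=10)

    # collect() materializes every frequent itemset on the driver at once;
    # at minSupport=0.001 over ~17,000 SKUs, that list can exhaust 12G of RAM.
    itemsets = model.freqItemsets().collect()
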