The way the code works is:
1) Create a BiMap for every ID space in the client code (users and items). This 
is non-distributed code, typically run on the machine you launch from, although 
in yarn-cluster mode the actual machine may be different. In any case the heap 
used is the driver's, not the distributed code's.
2) The BiMap is broadcast (copied) to every worker. This instantiates it in 
memory shared by all executors on the worker, so there is only one copy per 
machine. Since the BiMap may be large, this is the most efficient way to handle it.
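
Roughly, in code (a sketch only; Guava's HashBiMap stands in here for the 
actual Mahout dictionary class, and the function names are illustrative):

    import com.google.common.collect.{BiMap, HashBiMap}
    import org.apache.spark.SparkContext

    // 1) Build the dictionary on the driver: external string ID <-> internal
    //    Mahout int ID. This runs in the driver JVM and consumes driver heap.
    def buildDictionary(ids: Seq[String]): BiMap[String, Integer] = {
      val dict = HashBiMap.create[String, Integer]()
      ids.zipWithIndex.foreach { case (id, i) => dict.put(id, i) }
      dict
    }

    // 2) Broadcast it so each worker holds one read-only copy, shared by the
    //    executors there, instead of shipping a copy with every task.
    def distribute(sc: SparkContext, dict: BiMap[String, Integer]) =
      sc.broadcast(dict)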

#1 requires that you have enough memory in the driver to create the BiMap. This 
memory is allocated when the driver is launched and is available as heap. If you 
are not using yarn this is plain JVM memory, so use one of the usual methods for 
setting -Xmx4g (or however much you need), something like “export 
JAVA_OPTS=-Xmx4g”. You would have to have a giant BiMap to use that much memory. 
A HashMap stores an index plus a copy of every key/value pair, and a BiMap 
contains two HashMaps. If your ID strings are very long this increases the space 
required, so, index aside, the memory needed grows with the size of your ID 
strings; ints are used for the Mahout-internal IDs.
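
To put numbers on that (hypothetical figures, using rough JVM overheads): with 
10M distinct IDs averaging 20 chars, each String costs roughly 40 bytes of 
object overhead plus 2 bytes per char, about 80 bytes, plus roughly 48 bytes of 
HashMap entry/index overhead and 16 bytes for a boxed int. Two HashMaps gives 
very roughly 2 × 10,000,000 × (80 + 48 + 16) ≈ 2.9GB, which is the scale at 
which something like -Xmx4g starts to matter.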

If you are using spark-submit you can change executor memory there. You can 
also change it in the Spark conf files or with the driver’s 
-D:spark.executor.memory=4g. These use different mechanisms to set the config, 
but all should work. Feel free to try a different method if you suspect -sem 
doesn’t (examples below).
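
For instance, any one of these should take effect (values illustrative):

  -D:spark.executor.memory=4g              on the Mahout driver command line
  spark.executor.memory 4g                 as a line in conf/spark-defaults.conf
  spark-submit --executor-memory 4g ...    if you invoke spark-submit yourself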

Are you using yarn-client or yarn-cluster? Can you share your entire command 
line and console error log? The log line also states that you have 1.8g free, 
so we need to pinpoint the memory chunk that is being exhausted. Also, could 
you share a snippet of your data?

On May 18, 2015, at 6:10 AM, Xavier Rampino <xramp...@senscritique.com> wrote:

I just did that but I ran into the same problem; I feel like -sem doesn't
work with my setup. For instance I have:

15/05/18 13:44:39 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
localhost:60596 in memory (size: 2.7 KB, free: *1761.1 MB*)

(Maybe it's not related though)

On Wed, May 13, 2015 at 7:27 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> There is a bug in mahout 0.10.0 that you can fix if you are able to build
> from source. Get the source tar for 0.10.0, not the current master.
> 
> Go to
> https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157
> 
> remove the line that says: interactions.collect()
> 
> See this Jira https://issues.apache.org/jira/browse/MAHOUT-1707
> 
> There is one other thing that can cause this and is fixed by increasing
> your client JVM heap space, but try the above first.
> 
> BTW, setting the executor memory twice is not necessary.
> 
> 
> On May 13, 2015, at 2:21 AM, Xavier Rampino <xramp...@senscritique.com>
> wrote:
> 
> Hello,
> 
> I've tried spark-rowsimilarity with an out-of-the-box setup (downloaded the
> Mahout distribution and Spark, and set up the PATH), and I stumbled upon a
> Java heap space error. My input file is ~100MB. It seems the various
> parameters I tried to pass make no difference. I do:
> 
> ~/mahout-distribution-0.10.0/bin/mahout spark-rowsimilarity --input
> ~/query_result.tsv --output ~/work/result -sem 24g
> -D:spark.executor.memory=24g
> 
> Do I just need to give it more memory, or is there another step I can take
> to solve this?
> 
> 
