As I said, the memory to worry about is for the driver, the client code that launches the Spark executor tasks. The driver runs on a single machine, so adding more machines will not help. Increase your driver memory with export MAHOUT_HEAPSIZE=6000 (or JAVA_MAX_HEAP) in your environment, or, if you are using spark-submit, with --driver-memory 6g. The memory must be set before the driver is launched.
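For example, something along these lines (just a sketch: the input/output paths are
the ones from your command further down the thread, my-driver.jar is a placeholder,
and if I remember right the mahout launch script reads MAHOUT_HEAPSIZE in megabytes):

# set the driver heap before bin/mahout launches the driver JVM
export MAHOUT_HEAPSIZE=6000
bin/mahout spark-itemsimilarity --master spark://node1:7077 --input filein.txt --output out

# or, when launching your own driver directly with spark-submit
spark-submit --master spark://node1:7077 --driver-memory 6g my-driver.jar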
Also, you should use Mahout 0.10.2 on Spark 1.2, or Mahout 0.11.0 on Spark 1.3+.
Please show the error you are getting so we can tell whether it is the driver or
an executor. This data should not run out of memory.

On Aug 5, 2015, at 7:11 PM, Rodolfo Viana <[email protected]> wrote:

Hi Pat,

I'm using the Wikipedia dataset
<https://en.m.wikipedia.org/wiki/Wikipedia:Database_download>, so my users and
items are numbers. For example:

225,5148363
225,5216791
225,5308944
225,5330132
225,5578696
226,436980
226,505673
226,550302
226,569080
226,569088
226,569094
226,573346
226,631250
226,650629
226,775638
226,899438
226,910306
226,1128374
226,1175762
226,1231654
226,1424177
226,1508647
226,1739522
226,2081174
226,2473511
226,2935004
226,3334949
226,3602167
226,3607953
226,3618166
226,4431009
226,4664960
226,4845331
226,5143691
226,5188581
226,5308939
226,5330132
227,4103131
228,226
229,5308939
230,4330310
231,226
232,505673
233,4103131
234,2858875
234,77935

Now I'm running Spark on 2 machines, but I got some memory errors, so I was
wondering how many machines I need in my Spark cluster to run this experiment.

On Tue, Aug 4, 2015 at 8:30 PM, Pat Ferrel <[email protected]> wrote:

> More machines won't help with the memory requirements, since they are for the
> client, the driver code, even if you use Mahout as a library. The amount of
> storage is proportional to the total needed for your ID strings. How many
> users and items do you have, and how long are their ID strings? That total
> will give you an idea of the minimum for your client. You will need more to
> hold the mapped integers and the indexes, but it gives you a rough idea.
>
> 6G is a lot of string storage.
>
> On Aug 4, 2015, at 11:58 AM, Rodolfo Viana <[email protected]> wrote:
>
> Thank you Pat, you were right: when I run Spark 1.3.1 with Mahout 0.10 I
> don't get this error.
>
> I'm trying to run Mahout with Spark on 20M, 50M, 1G and 10G inputs.
> Does anybody have an idea of how many machines with 6G of RAM I should
> configure in the Spark cluster to be able to run this experiment?
> So far I have configured 3 machines, but I think that will not be enough.
>
> On Tue, Jul 21, 2015 at 1:58 PM, Pat Ferrel <[email protected]> wrote:
>
>> That should be plenty of memory on your executors, but is that where you are
>> running low? This may be a low heap on your driver/client code.
>>
>> Increase driver memory by setting MAHOUT_HEAPSIZE=6g or some such when
>> launching the driver. I think the default is 4g. If you are using YARN, the
>> answer is more complicated.
>>
>> The code creates a BiMap for your user and item IDs, which will grow with
>> your total string storage needs; are your IDs very long? With the default 4g
>> of driver memory and the latest released 0.10.1 (be sure to upgrade!) or the
>> master 0.11.0-snapshot code I wouldn't expect this problem.
>>
>> The current master, mahout-0.11.0-snapshot, has better partitioning, as
>> Dmitriy mentions, but it is built for Spark 1.3.1, so I'm not sure it is
>> backward compatible. Some things won't work, but spark-itemsimilarity may be
>> OK. Somehow I doubt you are running into a partitioning problem.
>>
>> On Jul 20, 2015, at 2:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Assuming task memory x number of cores does not exceed ~5g, and the block
>> cache manager ratio does not have some really weird setting, the next best
>> thing to look at is the initial task split size.
>> I don't think the release you are looking at manages the initial off-DFS
>> splits satisfactorily (that is, in any way at all). Basically, you may want
>> smaller splits, and more tasks, than what DFS gives you from the beginning.
>> These apps tend to run a bit better when splits do not exceed 100k...500k
>> non-zero elements.
>>
>> I think Pat has done some stop-gap measure on current master for that (which
>> I don't believe is a truly optimal thing to do, though).
>>
>> On Mon, Jul 20, 2015 at 1:40 PM, Rodolfo Viana <[email protected]> wrote:
>>
>>> I'm trying to run Mahout 0.10 with Spark 1.1.1.
>>> I have input files of 8k, 10M, 20M, and 25M.
>>>
>>> So far I have run with the following configurations:
>>>
>>> 8k with 1, 2, 3 slaves
>>> 10M with 1, 2, 3 slaves
>>> 20M with 1, 2, 3 slaves
>>>
>>> But when I try to run
>>>
>>> bin/mahout spark-itemsimilarity --master spark://node1:7077 --input
>>> filein.txt --output out --sparkExecutorMem 6g
>>>
>>> with the 25M file I get this error:
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>>
>>> or
>>>
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> Is that normal? When I was running 20M I didn't get any error, and now I
>>> have only 5M more.
>>>
>>> Any ideas why this is happening?
>>>
>>> --
>>> Rodolfo de Lima Viana
>>> Undergraduate in Computer Science at UFCG
>
> --
> Rodolfo de Lima Viana
> Undergraduate in Computer Science at UFCG

--
Rodolfo de Lima Viana
Undergraduate in Computer Science at UFCG
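To make the split-size suggestion concrete, here is a rough sketch of one way to
ask for smaller input splits when driving Spark yourself with spark-submit. The
--conf property below is the standard Hadoop max-split-size setting, which Spark
forwards to the input format via its spark.hadoop.* prefix; how well it is honored
depends on which Hadoop input format the driver uses. The class name, jar, and
HDFS paths are placeholders, and whether the bin/mahout spark-itemsimilarity CLI
exposes a way to pass such properties depends on the Mahout version.

# request splits of at most ~64 MB (67108864 bytes) so the job gets more,
# smaller tasks than one per DFS block; class/jar/paths are placeholders
spark-submit \
  --master spark://node1:7077 \
  --driver-memory 6g \
  --conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=67108864 \
  --class com.example.ItemSimilarityDriver \
  my-driver.jar \
  --input hdfs:///path/to/filein.txt \
  --output hdfs:///path/to/out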
