Hi,

A quick update: I managed to squeeze out some more performance by tuning the DFS block size and the number of mappers and reducers. 16 HPC nodes (256 CPUs) now seem to give the highest performance for both alternating least squares and CoEM (an NLP algorithm). I have posted updated graphs at http://bickson.blogspot.com/2011/03/tunning-hadoop-configuration-for-high.html
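For anyone reproducing this, the knobs I mean are the standard 0.20-era configuration properties. A sketch of the kind of overrides involved (the values below are illustrative placeholders, not the tuned settings from these runs):

```xml
<!-- mapred-site.xml / hdfs-site.xml fragment; values are placeholders. -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>256</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>256</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048M</value>
</property>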
Unfortunately, I am out of EC2 experiment budget, so I will not be able to fine-tune performance further.

- Danny

On Sun, Mar 6, 2011 at 4:10 PM, Danny Bickson <[email protected]> wrote:

> Hi again,
> I think I found some problems in my setup and I will rerun the experiments
> soon. When using 32 or 64 machines, I think not enough mappers/reducers
> are allocated.
> Regarding the patch, I still need it. I ran all experiments with D=20;
> with D=30 and above I get memory errors.
>
> Thanks!
>
> On Sun, Mar 6, 2011 at 4:02 PM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi Danny,
>>
>> thanks for the nice writeup! I'm a little bit disappointed about the
>> performance though...
>>
>> Seems you got around those memory problems from last week without my
>> patch, which is good, since I unfortunately didn't have the time to
>> finish that one yet.
>>
>> On 05.03.2011 01:33, Danny Bickson wrote:
>>
>>> Hi Sebastian,
>>> As promised, you can find some results from testing your ALS code on
>>> 64 high-performance Amazon EC2 machines (with up to 1,024 cores):
>>>
>>> http://bickson.blogspot.com/2011/03/tunning-hadoop-configuration-for-high.html
>>>
>>> I would love to get any feedback you or others may have about the
>>> setup of this experiment.
>>>
>>> Best,
>>>
>>> Danny Bickson
>>>
>>> On Wed, Feb 23, 2011 at 4:41 PM, Sebastian Schelter <[email protected]> wrote:
>>>
>>> Hi Danny,
>>>
>>> please send all mails to [email protected] instead of directly
>>> sending them to me; there are a lot of smart people on that list who
>>> might join in with advice.
>>>
>>> I'm very excited that you are testing this code so intensively, and
>>> I'm positively surprised to see it give good results. Thank you for
>>> the effort you put into that!
>>>
>>> The exception seems to occur when ALSEvaluator is run. The code uses
>>> a quick-and-dirty approach to compute the error of the model: it
>>> simply loads the user and item feature matrices completely into
>>> memory. With an increasing number of features, memory consumption
>>> gets too large.
>>>
>>> The code of that evaluator step needs to be changed so that each
>>> (user, item) pair for which the rating shall be predicted gets joined
>>> with the corresponding user and item feature vectors, in a way that
>>> they are mapped to the same key and go to the same reducer, which can
>>> then compute the error.
>>>
>>> I already started implementing something like this, but unfortunately
>>> I don't have a lot of time these days. I could update the patch
>>> during the next week, if that's OK for you.
>>>
>>> --sebastian
>>>
>>> On 23.02.2011 21:57, Danny Bickson wrote:
>>>
>>> Another exception I am getting:
>>>
>>> 11/02/23 20:45:34 INFO common.AbstractJob: Command line arguments:
>>> {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/,
>>> --probes=/user/ubuntu/myout/probeSet/, --startPhase=0,
>>> --tempDir=temp, --userFeatures=/tmp/als/out/U/}
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>     at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:433)
>>>     at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
>>>     at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
>>>     at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:113)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>>>     at org.apache.mahout.utils.eval.ALSEvaluator.readMatrix(ALSEvaluator.java:113)
>>>     at org.apache.mahout.utils.eval.ALSEvaluator.run(ALSEvaluator.java:71)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>     at org.apache.mahout.utils.eval.ALSEvaluator.main(ALSEvaluator.java:52)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:616)
>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:616)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>> THANKS!
>>>
>>> ---------- Forwarded message ----------
>>> From: Danny Bickson <[email protected]>
>>> Date: Wed, Feb 23, 2011 at 3:05 PM
>>> Subject: Another mahout ALS question
>>> To: [email protected]
>>>
>>> Hi!
>>> I successfully ran 10 iterations of your ALS code with D=20 and
>>> lambda=0.065, and I get a very impressive RMSE of 0.93.
>>> However, when I try to increase D, I get various out-of-memory
>>> errors, even with a small Netflix subsample of 3M values.
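A back-of-envelope estimate (the matrix sizes and per-entry cost below are assumptions for illustration, not measurements) shows how holding both feature matrices in hash-backed sparse vectors scales with D, even before GC and deserialization overhead:

```java
// Rough heap estimate for keeping the full U and M feature matrices in
// memory. All constants here are illustrative assumptions, not measurements.
public class AlsMemoryEstimate {

    // Bytes for (users + items) rows of D doubles stored densely.
    static long denseBytes(long users, long items, int d) {
        return (users + items) * d * 8L;
    }

    // The same entries in an open-addressing int->double hash map cost far
    // more per stored value (key + value arrays kept below full load, plus
    // per-vector object overhead); 32 bytes/entry is a rough assumed figure.
    static long sparseBytes(long users, long items, int d) {
        return (users + items) * d * 32L;
    }

    public static void main(String[] args) {
        long users = 480_000;  // assumed full-Netflix user count
        long items = 18_000;   // assumed item count
        for (int d : new int[] {20, 30, 50}) {
            System.out.printf("D=%d: dense %d MB, hash-backed ~%d MB%n",
                    d, denseBytes(users, items, d) >> 20,
                    sparseBytes(users, items, d) >> 20);
        }
    }
}
```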
>>>
>>> One of the errors I am getting is in the evaluateALS step:
>>>
>>> 11/02/23 19:04:11 WARN driver.MahoutDriver: No evaluateALS.props found
>>> on classpath, will use command-line arguments only
>>> 11/02/23 19:04:12 INFO common.AbstractJob: Command line arguments:
>>> {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/,
>>> --probes=/user/ubuntu/myout/probeSet/, --startPhase=0,
>>> --tempDir=temp, --userFeatures=/tmp/als/out/U/}
>>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>     at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:433)
>>>     at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
>>>     at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
>>>     at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:113)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>>>     at org.apache.mahout.utils.eval.ALSEvaluator.readMatrix(ALSEvaluator.java:113)
>>>     at org.apache.mahout.utils.eval.ALSEvaluator.run(ALSEvaluator.java:71)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>     at org.apache.mahout.utils.eval.ALSEvaluator.main(ALSEvaluator.java:52)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:616)
>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:616)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>> There is no related exception in the Hadoop logs.
>>>
>>> I am running with Java child opts of -Xmx2048M.
>>>
>>> Do you have any tips for me? Do you want me to post this to the
>>> Mahout-542 thread?
>>>
>>> thanks,
>>>
>>> DB
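The evaluator redesign Sebastian describes above (joining each probe pair with its user and item feature vectors by key, so the error is computed in the reducer) could look roughly like this. Hadoop's shuffle is emulated here with in-memory maps, and all names, vectors, and ratings are illustrative toy data, not Mahout's actual implementation:

```java
import java.util.*;

// Sketch of a reducer-side join for evaluating ALS: probes meet their user
// vectors under the user key (job 1), then the joined records meet their
// item vectors under the item key (job 2). No full matrix is held in memory.
public class JoinEvaluatorSketch {

    // A tagged value flowing through the "shuffle": either a feature row
    // (features != null) or a probe/joined record carrying a rating.
    record Tagged(double[] features, double rating, int rekey) {}

    static double evaluate() {
        // Toy feature rows of U (by user id) and M (by item id).
        Map<Integer, double[]> u = Map.of(1, new double[] {1.0, 0.5},
                                          2, new double[] {0.2, 0.8});
        Map<Integer, double[]> m = Map.of(10, new double[] {2.0, 1.0},
                                          20, new double[] {0.5, 0.5});
        // Probe set: (user, item) pairs with their actual ratings.
        int[][] probes = {{1, 10}, {2, 20}};
        double[] ratings = {2.4, 0.6};

        // Job 1 "shuffle": group each user's feature row and probes by user id.
        Map<Integer, List<Tagged>> byUser = new HashMap<>();
        u.forEach((user, vec) -> byUser
                .computeIfAbsent(user, k -> new ArrayList<>())
                .add(new Tagged(vec, Double.NaN, -1)));
        for (int i = 0; i < probes.length; i++) {
            byUser.computeIfAbsent(probes[i][0], k -> new ArrayList<>())
                  .add(new Tagged(null, ratings[i], probes[i][1]));
        }
        // Job 1 reducer: attach the user vector to each probe, re-key by item.
        Map<Integer, List<Tagged>> byItem = new HashMap<>();
        for (List<Tagged> group : byUser.values()) {
            double[] userVec = group.stream()
                    .filter(t -> t.features() != null)
                    .findFirst().get().features();
            for (Tagged t : group) {
                if (t.features() == null) {
                    byItem.computeIfAbsent(t.rekey(), k -> new ArrayList<>())
                          .add(new Tagged(userVec, t.rating(), -1));
                }
            }
        }
        // Job 2 reducer: dot the carried user vector with the item vector
        // and accumulate the squared error.
        double sumSq = 0;
        int n = 0;
        for (Map.Entry<Integer, List<Tagged>> e : byItem.entrySet()) {
            double[] itemVec = m.get(e.getKey());
            for (Tagged t : e.getValue()) {
                double pred = 0;
                for (int k = 0; k < itemVec.length; k++) {
                    pred += t.features()[k] * itemVec[k];
                }
                double err = t.rating() - pred;
                sumSq += err * err;
                n++;
            }
        }
        return Math.sqrt(sumSq / n);
    }

    public static void main(String[] args) {
        System.out.printf("RMSE = %.3f%n", evaluate());
    }
}
```

The point of the two keyed groupings is that each reducer only ever sees one key's records at a time, so memory stays bounded by D regardless of how many users and items there are.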
