Ted,

Yes, memory per node is only 16 GB. Cached memory usage is at 100%, as the attached file shows, and CPU usage is at 100% too. The local disk space available for Hadoop temp data is at most 160 GB, and it gets used up completely. The key point seems to be the sixth step of the recommender, because the job fails at this step every time.
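As a rough check on the memory numbers, here is a minimal sketch of the per-node budget. It uses only figures from this thread: the slot counts and -Xmx512M heap sizes from the mapred configuration further down in this message, the cluster description in the original post, and the 4-6 GB per core guideline from Ted's reply. The function name is illustrative, and the result is only an upper bound on task heap, assuming every slot is busy at full heap; it ignores JVM overhead, the DataNode and TaskTracker daemons, and the OS page cache.

```python
# Rough per-node memory budget. All figures are taken from this thread;
# nothing here is measured.
CLUSTER_NODES = 28
NODE_MEMORY_GB = 16
CORES_PER_NODE = 8
MAP_SLOTS = 7        # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS = 7     # mapred.tasktracker.reduce.tasks.maximum
CHILD_HEAP_MB = 512  # -Xmx512M for both map and reduce child JVMs


def max_task_heap_gb(map_slots, reduce_slots, heap_mb):
    """Upper bound on task heap per node if every slot is busy (heap only)."""
    return (map_slots + reduce_slots) * heap_mb / 1024


heap_gb = max_task_heap_gb(MAP_SLOTS, REDUCE_SLOTS, CHILD_HEAP_MB)
memory_per_core_gb = NODE_MEMORY_GB / CORES_PER_NODE

# The 4-6 GB per core guideline, applied to 8 cores per node.
guideline_low_gb = 4 * CORES_PER_NODE
guideline_high_gb = 6 * CORES_PER_NODE

# Map slots across the whole cluster, for comparison with -Dmapred.map.tasks.
total_map_slots = CLUSTER_NODES * MAP_SLOTS

print(f"task heap if all slots are busy: {heap_gb:.1f} GB of {NODE_MEMORY_GB} GB")
print(f"memory per core: {memory_per_core_gb:.1f} GB")
print(f"guideline would mean {guideline_low_gb}-{guideline_high_gb} GB per node")
print(f"map slots in the cluster: {total_map_slots}")
```

With these settings the task heap alone can reach 7 GB per node, memory per core is 2 GB (half to a third of the guideline), and the cluster-wide map slot count equals the 196 passed as -Dmapred.map.tasks in the original command.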

I ran several tests; the logs are attached. From the first run to the fifth, I cut the data size by half each time (see the listing below), and every time the job failed at the sixth step. Even when I cut the data down to about the size of the 100M GroupLens movie rating file, it still failed (by the way, running the 100M GroupLens movie rating data takes about 16 minutes).

-rw-r--r--   3 hdfs supergroup 1505255088 2012-04-20 16:43 /user/hdfs/NetFlix_data
-rw-r--r--   3 hdfs supergroup 1058793314 2012-04-24 10:45 /user/hdfs/netFlixData2
-rw-r--r--   3 hdfs supergroup  793294103 2012-04-26 08:59 /user/hdfs/netFlixData3
-rw-r--r--   3 hdfs supergroup  476054038 2012-04-27 09:51 /user/hdfs/netFlixData4
-rw-r--r--   3 hdfs supergroup  135210043 2012-04-28 13:53 /user/hdfs/netFlixData6

So I decided to cut the user IDs to half to reduce the size of the matrix. When I did this, the recommender finished, but it took about 40 hours.

The mapred configuration of my cluster is:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx512M</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx512M</value>
</property>
<property>
  <name>mapred.child.ulimit</name>
  <value>-Xmx600M</value>
</property>

> -----Original Message-----
> From: "Ted Dunning" <[email protected]>
> Sent: Monday, May 14, 2012
> To: [email protected], [email protected]
> Cc:
> Subject: Re: 40 hours to run 1/2 Netflix Data?
>
> 许春玲,
>
> The nodes here are relatively under-provisioned with respect to memory.
> Current standard practice is to provide 4-6 GB per core. These
> machines have half to a third that much memory. As a result, it is pretty
> easy to cause swapping if you have too many map or reduce slots configured
> on these machines. That would be my first suspicion.
>
> A second worry is that you apparently only have a single disk per node.
> This will substantially slow down your processing.
> Even normal Hadoop can move 300 MB/s/node with more drives, and optimized
> systems like MapR can move more than 1 GB/s/node. With a single drive, you
> are going to be severely limited in terms of I/O bandwidth.
>
> Additionally, any swapping that you are doing is going to eat into that
> bandwidth even further.
>
> Have you looked at your swap rates, I/O rates, network rates and CPU usage
> during the execution of this program?
>
> On Sun, May 13, 2012 at 10:44 PM, Sebastian Schelter <[email protected]> wrote:
>
> > Hi,
> >
> > Something must be going completely wrong in this experiment. Please use
> > the latest version of Mahout (Mahout 0.6) and tell us exactly at which
> > point the job fails.
> >
> > I have been able to process datasets seven times as large as Netflix
> > (http://webscope.sandbox.yahoo.com/catalog.php?datatype=r) in a few
> > hours on a 6 machine cluster.
> >
> > --sebastian
> >
> > On 14.05.2012 03:44, 许春玲 wrote:
> > > Hi,
> > >
> > > I am running the item recommender on the Netflix data, but it always
> > > fails because there is not enough local disk space. So I cut the user
> > > IDs to half (not user account but user ID) to reduce the temp data.
> > > Now it finishes, but it takes 40 hours. The command is as follows:
> > >
> > > hadoop jar /app/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar \
> > >   org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
> > >   -Dmapred.map.tasks=196 -Dmapred.reduce.tasks=196 \
> > >   -Dmapred.input.dir=NetFlix_data_new -Dmapred.output.dir=output_netflix8
> > >
> > > My Hadoop cluster:
> > >
> > > 28 nodes
> > > 16G memory per node
> > > 8 cores per node
> > > 250G local disk per node
