Re: 40 hours to run 1/2 Netflix Data?

Ted Dunning Sun, 13 May 2012 22:59:13 -0700

许春玲,

The nodes here are relatively under-provisioned with respect to memory.
Current standard practice is to provide 4-6 GB per core.  With 16 GB and
8 cores per node, these machines have 2 GB per core, which is half to a
third of that.  As a result, it is pretty easy to cause swapping if you have
too many map or reduce slots configured on these machines.  That would be my
first suspicion.
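
To make the memory arithmetic concrete, here is a rough sketch in Python.
The node figures (16 GB, 8 cores) come from the cluster description quoted
below.  The 2 GB reserved for the OS and Hadoop daemons and the per-task heap
sizes are assumptions chosen purely for illustration, not values read from
the actual configuration.

```python
def memory_per_core_gb(node_memory_gb, cores):
    """Memory available per core, in GB."""
    return node_memory_gb / cores


def max_slots_without_swapping(node_memory_gb, reserved_gb, heap_per_task_gb):
    """Rough upper bound on map + reduce slots per node.

    reserved_gb and heap_per_task_gb are hypothetical inputs: memory set
    aside for the OS and daemons, and the JVM heap given to each task.
    """
    usable_gb = node_memory_gb - reserved_gb
    return max(0, int(usable_gb // heap_per_task_gb))


# Figures from the posted cluster: 16 GB and 8 cores per node.
per_core = memory_per_core_gb(16, 8)

# Assumed for illustration: 2 GB reserved, 1 GB or 2 GB heap per task.
slots_small_heap = max_slots_without_swapping(16, 2, 1)
slots_large_heap = max_slots_without_swapping(16, 2, 2)

print(per_core, slots_small_heap, slots_large_heap)
```

If the configured map and reduce slots per node add up to more than a bound
like this, swapping becomes likely.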

A second worry is that you apparently only have a single disk per node.
This will substantially slow down your processing.  Even normal Hadoop can
move 300 MB/s/node with more drives, and optimized systems like MapR can
move more than 1 GB/s/node.  With a single drive, you are going to be
severely limited in terms of I/O bandwidth.
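
As a back-of-the-envelope illustration of why per-node bandwidth matters,
the sketch below estimates how long it takes just to move a given volume of
data across the cluster.  The 300 MB/s/node and 1 GB/s/node figures are the
ones mentioned above and the 28 nodes come from the posted cluster.  The
1000 GB of intermediate data and the 100 MB/s single-drive rate are
hypothetical placeholders, not measurements.

```python
def hours_to_move(total_gb, nodes, mb_per_s_per_node):
    """Hours needed to move total_gb once, given aggregate disk bandwidth.

    Uses 1 GB = 1000 MB and assumes the load is spread evenly over nodes.
    """
    total_mb = total_gb * 1000
    aggregate_mb_per_s = nodes * mb_per_s_per_node
    return total_mb / aggregate_mb_per_s / 3600


NODES = 28                 # from the posted cluster description
INTERMEDIATE_GB = 1000     # hypothetical data volume, for illustration only

# 100 MB/s is an assumed single-drive rate; 300 and 1000 are from the text.
for label, rate in [("single drive (assumed)", 100),
                    ("normal Hadoop, more drives", 300),
                    ("optimized system", 1000)]:
    hours = hours_to_move(INTERMEDIATE_GB, NODES, rate)
    print(f"{label}: {hours:.3f} hours per pass")
```

A real job makes many read and write passes over its intermediate data, so
the per-pass numbers multiply quickly.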

Additionally, any swapping that you are doing is going to eat into that
bandwidth even further, since swap traffic has to share the same drive.

Have you looked at your swap rates, I/O rates, network rates and CPU usage
during the execution of this program?
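
If you do not already have monitoring in place, a minimal sketch for
checking swap rates on a Linux node could look like the following.  It
assumes the kernel exposes the `pswpin` and `pswpout` page counters in
/proc/vmstat.  The function names and the 5 second interval are my own
choices for illustration.

```python
import time


def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat format) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, seconds):
    """Pages swapped in and out per second between two samples."""
    return {
        key: (after.get(key, 0) - before.get(key, 0)) / seconds
        for key in ("pswpin", "pswpout")
    }


def measure_swap(interval=5.0, path="/proc/vmstat"):
    """Sample the counters twice, interval seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_vmstat(f.read())
    return swap_rates(first, second, interval)
```

Calling `measure_swap()` on a worker node while the job is running gives a
quick read; sustained non-zero rates would support the swapping theory.
Standard tools such as vmstat, iostat and sar report the same kind of
numbers along with I/O and CPU usage.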

On Sun, May 13, 2012 at 10:44 PM, Sebastian Schelter <[email protected]> wrote:

> Hi,
>
> Something must be going completely wrong in this experiment. Please use
> the latest version of Mahout (Mahout 0.6) and tell us exactly at which
> point the job fails.
>
> I have been able to process datasets seven times as large as Netflix
> (http://webscope.sandbox.yahoo.com/catalog.php?datatype=r) in a few
> hours on a 6 machine cluster.
>
> --sebastian
>
> On 14.05.2012 03:44, 许春玲 wrote:
> > Hi,
> >
> >    I ran the item recommender on the Netflix data, but it always failed
> > for lack of local disk space. So I cut the user IDs in half (not the
> > user accounts, but the user IDs) to reduce the temp data. Now it
> > finishes, but takes 40 hours. The command is as follows:
> >
> > hadoop jar /app/mahout-distribution-0.5/core/target/mahout-core-0.5-job.jar \
> >   org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
> >   -Dmapred.map.tasks=196 -Dmapred.reduce.tasks=196 \
> >   -Dmapred.input.dir=NetFlix_data_new -Dmapred.output.dir=output_netflix8
> >
> > My Hadoop cluster:
> >
> > 28 nodes
> > 16 GB memory per node
> > 8 cores per node
> > 250 GB local disk per node
> >
> >
> >
> >
>
>
