Hello
right now I'm testing Mahout (Taste) jobs on AWS EMR.
I wonder if anyone has experience with the best cluster size and the best EC2 instance types. Are there any best practices for Mahout (Taste) jobs?

In my first test I used a small 22 MB user-item model and computed an ItemSimilarityJob with 3 small EC2 instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.small --master-instance-type m1.small --num-instances 3 --name mahout-0.5-itemSimJob-TEST


ruby elastic-mapreduce \
  --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
  --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  --arg -i --arg s3://some-uri/input/data_small_in.csv \
  --arg -o --arg s3://some-uri/output/data_out_small.csv \
  --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
  --arg -m --arg 500 \
  --arg -mo --arg 500 \
  -j JobId

Everything worked well here, even though it took a few minutes.
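In case the input format matters: the file is a plain CSV of userID,itemID,preference triples, one interaction per line, something like this (the IDs and values here are just made-up examples):

1,101,5.0
1,102,3.0
2,101,2.5
2,103,4.0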

In a second test I used a bigger 200 MB user-item model and did the same with a cluster of large instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.large --master-instance-type m1.large --num-instances 8 --name mahout-0.5-itemSimJob-TEST2

I logged in to the master node via ssh and watched the syslog. For the first few hours everything looked fine, but then the job seemed to get stuck at 63% of a reduce step. I waited a few more hours, nothing happened, so I terminated the job. I couldn't even find any errors in the logs.
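In case it helps with debugging, this is roughly what I ran on the master node to check on the stuck job (the job ID below is only a placeholder, not the real one):

# list the MapReduce jobs currently known to the JobTracker
hadoop job -list

# show map/reduce completion and counters for the job that seems stuck
hadoop job -status job_201107251234_0002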

So here are my questions:
1. Are there any proven best-practice cluster sizes and instance types (standard, high-memory, or high-CPU instances) that work well for big recommender jobs, or do I have to test this for every job I run?
2. Would it have a positive effect if I split my big data_in.csv into many small CSVs? (See the sketch below for what I mean by splitting.)
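Regarding question 2, by splitting I mean something simple like this (just a sketch; it assumes s3cmd is available on the machine that prepares the data):

# split the 200 MB CSV into line-based chunks so no record gets cut in half
split -l 500000 data_in.csv data_in_part_

# upload the chunks under one S3 prefix and point the -i argument at that prefix
s3cmd put data_in_part_* s3://some-uri/input/split/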

Does anyone have experience with this and can share some hints?

Thanks in advance
Thomas
