Hello
right now I'm testing Mahout (Taste) jobs on AWS EMR.
I wonder if anyone has experience with the best cluster size and the best EC2 instance types. Are there any best practices for Mahout (Taste) jobs?

In my first test I used a small 22 MB user-item model and computed an ItemSimilarityJob with 3 small EC2 instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.small --master-instance-type m1.small --num-instances 3 --name mahout-0.5-itemSimJob-TEST


ruby elastic-mapreduce \
  --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
  --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  --arg -i --arg s3://some-uri/input/data_small_in.csv \
  --arg -o --arg s3://some-uri/output/data_out_small.csv \
  --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
  --arg -m --arg 500 \
  --arg -mo --arg 500 \
  -j JobId

Everything worked well here, even though it took a few minutes.
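In case the input format matters: the file is a plain CSV of userID,itemID,preference triples, one interaction per line, something like this (the IDs and values here are just made-up examples):

1,101,5.0
1,102,3.0
2,101,2.5
2,103,4.0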

In a second test I used a bigger 200 MB user-item model and did the same with a cluster of large instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.large --master-instance-type m1.large --num-instances 8 --name mahout-0.5-itemSimJob-TEST2

I logged in to the master node via ssh and watched the syslog. For the first few hours everything looked fine, but then the job seemed to get stuck at 63% of a reduce step. I waited a few more hours, nothing happened, so I terminated the job. I couldn't even find any errors in the logs.
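In case it helps with debugging, this is roughly what I ran on the master node to check on the stuck job (the job ID below is only a placeholder, not the real one):

# list the MapReduce jobs currently known to the JobTracker
hadoop job -list

# show map/reduce completion and counters for the job that seems stuck
hadoop job -status job_201107251234_0002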

So here are my questions:
1. Are there any proven best-practice cluster sizes and instance types (standard, high-memory, or high-CPU instances) that work well for big recommender jobs, or do I have to test this for every job I run?
2. Would it have a positive effect if I split my big data_in.csv into many small CSVs? (See the sketch below for what I mean by splitting.)
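Regarding question 2, by splitting I mean something simple like this (just a sketch; it assumes s3cmd is available on the machine that prepares the data):

# split the 200 MB CSV into line-based chunks so no record gets cut in half
split -l 500000 data_in.csv data_in_part_

# upload the chunks under one S3 prefix and point the -i argument at that prefix
s3cmd put data_in_part_* s3://some-uri/input/split/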

Does anyone have experience with this and can share some hints?

Thanks in advance
Thomas
