I agree. We looked at using EMR and found that we liked some custom Terraform + 
Docker much better. EMR as AWS defines it requires either refactoring PIO or 
running it in YARN’s cluster mode. EMR is not meant to host any application code 
except what is sent into Spark in serialized form. However, PIO expects to run 
the Spark “Driver” in the PIO process, which means on the PIO server machine. 
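
To make the distinction concrete, here is a rough sketch of the two deploy 
modes with plain spark-submit (the class and JAR names are placeholders, not 
PIO’s actual ones):

  # Client mode (what PIO assumes): the Driver runs inside the
  # submitting process, i.e. on the PIO server machine.
  spark-submit --master yarn --deploy-mode client \
    --class org.example.MyJob my-assembly.jar

  # Cluster mode: the Driver is shipped to the cluster and runs
  # inside a YARN container on an EMR node.
  spark-submit --master yarn --deploy-mode cluster \
    --class org.example.MyJob my-assembly.jar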

It is possible to make PIO use YARN’s cluster mode so that the “Driver” is 
serialized and shipped to the cluster too, but this is fairly complicated. I 
think I’ve seen Donald explain it before, but we chose not to do this. For one 
thing, optimizing and tuning YARN-managed Spark changes the meaning of some 
tuning parameters.
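
One concrete example of that shift (a sketch using standard Spark-on-YARN 
settings): in client mode the Driver is the local PIO process and the YARN 
ApplicationMaster is sized separately, while in cluster mode the Driver runs 
inside the ApplicationMaster, so spark.driver.memory is what sizes that YARN 
container:

  # client mode: the AM is a small helper process, sized on its own
  spark-submit --master yarn --deploy-mode client \
    --conf spark.yarn.am.memory=1g \
    --class org.example.MyJob my-assembly.jar

  # cluster mode: the Driver lives in the AM container, so
  # spark.driver.memory determines that container's size instead
  spark-submit --master yarn --deploy-mode cluster \
    --conf spark.driver.memory=4g \
    --class org.example.MyJob my-assembly.jar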

Spark is moving toward Kubernetes as a replacement for YARN, so we are quite 
interested in following that line of development.
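
The upcoming native Kubernetes scheduler (targeted for Spark 2.3) looks 
roughly like this; the API server address and container image are placeholders:

  spark-submit --master k8s://https://my-k8s-apiserver:6443 \
    --deploy-mode cluster \
    --conf spark.kubernetes.container.image=my-spark-image:2.3.0 \
    --class org.example.MyJob local:///opt/jars/my-assembly.jar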

One last thought on EMR: it was originally designed for Hadoop’s MapReduce. 
That meant that for a long time you couldn’t get big-memory machines in EMR 
(you can now). So the EMR team at AWS does not seem to target Spark or other 
clustered services as well as they could. This is another reason we decided it 
wasn’t worth the trouble.


From: Mars Hall <mars.h...@salesforce.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: February 5, 2018 at 11:45:46 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  Re: pio train on Amazon EMR  

Hi Malik,

This is a topic I've been investigating as well.

Given how EMR manages its clusters & their runtime, I don't think hacking 
configs to make the PredictionIO host act like a cluster member will be a 
simple or sustainable approach.

PredictionIO already operates Spark by building `spark-submit` commands.
  
https://github.com/apache/predictionio/blob/df406bf92463da4a79c8d84ec0ca439feaa0ec7f/tools/src/main/scala/org/apache/predictionio/tools/Runner.scala#L313
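
For reference, the command it builds is roughly of this shape (argument list 
heavily abbreviated; the real one is assembled in Runner.scala above):

  spark-submit --master yarn \
    --class org.apache.predictionio.workflow.CreateWorkflow \
    --jars <engine assembly JAR> \
    <pio assembly JAR> \
    <CreateWorkflow arguments: engine id, version, variant, ...>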

Implementing a new AWS EMR command runner in PredictionIO, so that `pio train` 
can switch from the existing plain `spark-submit` command to the AWS CLI's 
`aws emr add-steps --steps Args=spark-submit`, would likely solve a big part 
of this problem.
  https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html
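
A sketch of what that could look like, submitting a PIO training step to an 
existing cluster (the cluster id, bucket, and step arguments are placeholders):

  aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=Spark,Name="pio train",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--class,org.apache.predictionio.workflow.CreateWorkflow,s3://my-bucket/pio-assembly.jar]'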

Also, uploading the engine assembly JARs (the job code to run on Spark) to the 
cluster members or S3 for access from the EMR Spark runtime will be another 
part of this challenge.
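
For example, something like this (bucket and JAR names are hypothetical):

  # build the engine, then stage its assembly where EMR's Spark can fetch it
  pio build
  aws s3 cp target/scala-2.11/my-engine-assembly.jar \
    s3://my-bucket/jars/my-engine-assembly.jar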

On Mon, Feb 5, 2018 at 5:29 AM, Malik Twain <chacha...@gmail.com> wrote:
I'm trying to run pio train with Amazon EMR. I copied core-site.xml and 
yarn-site.xml from EMR to my training machine, and configured HADOOP_CONF_DIR 
in pio-env.sh accordingly.
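
The relevant pio-env.sh line looks like this (the path is just where I put the 
copied files):

  # pio-env.sh
  HADOOP_CONF_DIR=/opt/emr/conf   # contains the copied core-site.xml and yarn-site.xml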

I'm running pio train as below:

pio train -- --master yarn --deploy-mode cluster

It's failing with the following errors:

18/02/05 11:56:15 INFO Client: 
   client token: N/A
   diagnostics: Application application_1517819705059_0007 failed 2 times due 
to AM Container for appattempt_1517819705059_0007_000002 exited with  exitCode: 
1
Diagnostics: Exception from container-launch.

And below are the errors from EMR stdout and stderr respectively:

java.io.FileNotFoundException: /root/pio.log (Permission denied)
[ERROR] [CreateWorkflow$] Error reading from file: File 
file:/quickstartapp/MyExample/engine.json does not exist. Aborting workflow.

Thank you.



--
Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California
