I agree. We looked at using EMR and liked custom Terraform + Docker much better. EMR as defined by AWS requires either refactoring PIO or running it in YARN's cluster mode. EMR is not meant to host any application code except what is sent into Spark in serialized form, but PIO expects to run the Spark "Driver" in the PIO process, which means on the PIO server machine.
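To illustrate the distinction (the class and JAR names below are placeholders for illustration, not PIO's actual invocation): in client mode the driver stays on the submitting host, which is what PIO assumes; cluster mode ships the driver into a YARN container on the EMR cluster.

```shell
# Client mode: the driver runs in the submitting process, on the PIO server.
# This is what PIO assumes when it builds its spark-submit command.
spark-submit --master yarn --deploy-mode client --class com.example.Train engine-assembly.jar

# Cluster mode: the driver is serialized out to a YARN container on the
# cluster, which is what EMR expects but PIO does not do out of the box.
spark-submit --master yarn --deploy-mode cluster --class com.example.Train engine-assembly.jar
```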
It is possible to make PIO use YARN's cluster mode to serialize the "Driver" too, but this is fairly complicated. I think I've seen Donald explain it before, but we chose not to do this. For one thing, optimizing and tuning YARN-managed Spark changes the meaning of some tuning parameters. Spark is moving to Kubernetes as a replacement for YARN, so we are quite interested in following that line of development.

One last thought on EMR: it was originally designed for Hadoop's MapReduce, which meant that for a long time you couldn't get big-memory machines in EMR (you can now). So the EMR team at AWS does not seem to target Spark or other clustered services as well as they could. This is another reason we decided it wasn't worth the trouble.

From: Mars Hall <mars.h...@salesforce.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: February 5, 2018 at 11:45:46 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
Subject: Re: pio train on Amazon EMR

Hi Malik,

This is a topic I've been investigating as well. Given how EMR manages its clusters and their runtime, I don't think hacking configs to make the PredictionIO host act like a cluster member will be a simple or sustainable approach.

PredictionIO already operates Spark by building `spark-submit` commands:
https://github.com/apache/predictionio/blob/df406bf92463da4a79c8d84ec0ca439feaa0ec7f/tools/src/main/scala/org/apache/predictionio/tools/Runner.scala#L313

Implementing a new AWS EMR command runner in PredictionIO, so that we can switch `pio train` from the existing plain `spark-submit` command to the AWS CLI's `aws emr add-steps --steps Args=spark-submit`, would likely solve a big part of this problem:
https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html

Also, uploading the engine assembly JARs (the job code to run on Spark) to the cluster members or to S3 for access from the EMR Spark runtime will be another part of this challenge.
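As a rough sketch of the `add-steps` idea above, the invocation might look something like the following. The cluster ID, bucket, JAR path, and step name are all placeholders (nothing from this thread), and the command is echoed as a dry run:

```shell
# Hypothetical dry run of submitting a training job as an EMR Spark step.
# CLUSTER_ID and ENGINE_JAR are placeholders; drop the leading `echo`
# to actually invoke the AWS CLI.
CLUSTER_ID="j-XXXXXXXXXXXXX"
ENGINE_JAR="s3://my-bucket/pio-engine-assembly.jar"

echo aws emr add-steps \
  --cluster-id "$CLUSTER_ID" \
  --steps "Type=Spark,Name=pio-train,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--class,org.apache.predictionio.workflow.CreateWorkflow,$ENGINE_JAR]"
```

The engine assembly JAR would first need to be uploaded to S3 (or to the cluster nodes), as noted above, since the EMR Spark runtime cannot read files off the PIO host.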
On Mon, Feb 5, 2018 at 5:29 AM, Malik Twain <chacha...@gmail.com> wrote:

I'm trying to run pio train with Amazon EMR. I copied core-site.xml and yarn-site.xml from EMR to my training machine, and configured HADOOP_CONF_DIR in pio-env.sh accordingly. I'm running pio train as below:

pio train -- --master yarn --deploy-mode cluster

It's failing with the following errors:

18/02/05 11:56:15 INFO Client: client token: N/A
diagnostics: Application application_1517819705059_0007 failed 2 times due to AM Container for appattempt_1517819705059_0007_000002 exited with exitCode: 1
Diagnostics: Exception from container-launch.

And below are the errors from EMR stdout and stderr respectively:

java.io.FileNotFoundException: /root/pio.log (Permission denied)
[ERROR] [CreateWorkflow$] Error reading from file: File file:/quickstartapp/MyExample/engine.json does not exist. Aborting workflow.

Thank you.

--
Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California