I agree. We looked at using EMR and liked custom Terraform + Docker much better. The EMR images AWS provides require either refactoring PIO or running it in YARN's cluster mode: EMR is not meant to host any application code except what is sent into Spark in serialized form, whereas PIO expects to run the Spark "Driver" in the PIO process, which means on the PIO server machine. It is possible to make PIO use YARN's cluster mode so the "Driver" is serialized too, but this is fairly complicated. I think I've seen Donald explain it before, but we chose not to do this. For one thing, optimizing and tuning YARN-managed Spark changes the meaning of some tuning parameters.
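To make the client-vs-cluster distinction concrete, here is a minimal sketch of the two `spark-submit` deploy modes (the class name and jar path are placeholder assumptions, not PIO's actual invocation):

```shell
# Client mode (what PIO does today): the "Driver" runs inside the
# submitting process, i.e. on the PIO server machine.
spark-submit --master yarn --deploy-mode client \
  --class org.example.Train engine-assembly.jar

# Cluster mode: the "Driver" itself is shipped to the cluster and runs
# in a YARN container, so local paths on the PIO host are not visible.
spark-submit --master yarn --deploy-mode cluster \
  --class org.example.Train engine-assembly.jar
```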
Spark is moving to Kubernetes as a replacement for YARN, so we are quite interested in following that line of development.
One last thought on EMR: it was originally designed for Hadoop's MapReduce. That meant that for a long time you couldn't get big-memory machines in EMR (you can now). So the EMR team at AWS does not seem to target Spark or other clustered services as well as they could. This is another reason we decided it wasn't worth the trouble.
From: Mars Hall
Reply: user@predictionio.apache.org
Date: February 5, 2018 at 11:45:46 AM
To: user@predictionio.apache.org
Subject: Re: pio train on Amazon EMR
Hi Malik,
This is a topic I've been investigating as well.
Given how EMR manages its clusters & their runtime, I don't think hacking
configs to make the PredictionIO host act like a cluster member will be a
simple or sustainable approach.
PredictionIO already operates Spark by building `spark-submit` commands.
https://github.com/apache/predictionio/blob/df406bf92463da4a79c8d84ec0ca439feaa0ec7f/tools/src/main/scala/org/apache/predictionio/tools/Runner.scala#L313
Implementing a new AWS EMR command runner in PredictionIO, one that switches `pio train` from building a plain `spark-submit` command to invoking the AWS CLI (`aws emr add-steps --steps Args=spark-submit`), would likely solve a big part of this problem.
https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html
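As a rough sketch of what such a runner might emit, assuming the cluster already exists and the engine assembly is already on S3 (the cluster ID, bucket, and jar name below are placeholders; `org.apache.predictionio.workflow.CreateWorkflow` is the entry point PIO's own `spark-submit` commands use):

```shell
# Submit a training run as an EMR Spark step instead of a local
# spark-submit. Args are handed to spark-submit on the cluster.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=pio-train,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--class,org.apache.predictionio.workflow.CreateWorkflow,s3://my-bucket/engine-assembly.jar]'
```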
Also, uploading the engine assembly JARs (the job code to run on Spark) to the
cluster members or S3 for access from the EMR Spark runtime will be another
part of this challenge.
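Getting the assembly where the EMR Spark runtime can reach it might look like this (the bucket and jar names are placeholders; `pio build` produces the assembly jar under `target/`):

```shell
# Build the engine assembly, then copy it to S3 so EMR's Spark
# executors and driver can fetch it.
pio build
aws s3 cp target/scala-2.11/engine-assembly.jar \
  s3://my-bucket/engine-assembly.jar
```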
On Mon, Feb 5, 2018 at 5:29 AM, Malik Twain wrote:
I'm trying to run pio train with Amazon EMR. I copied core-site.xml and
yarn-site.xml from EMR to my training machine, and configured HADOOP_CONF_DIR
in pio-env.sh accordingly.
I'm running pio train as below:
pio train -- --master yarn --deploy-mode cluster
It's failing with the following errors:
18/02/05 11:56:15 INFO Client:
client token: N/A
diagnostics: Application application_1517819705059_0007 failed 2 times due to AM Container for appattempt_1517819705059_0007_02 exited with exitCode: 1
Diagnostics: Exception from container-launch.
And below are the errors from EMR stdout and stderr respectively:
java.io.FileNotFoundException: /root/pio.log (Permission denied)
[ERROR] [CreateWorkflow$] Error reading from file: File
file:/quickstartapp/MyExample/engine.json does not exist. Aborting workflow.
Thank you.
--
Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California