Hi Igor,

No, it was not a memory issue, but thanks for your question. It could indeed have been a resource problem :-)

Jochen

On Fri, Oct 4, 2019 at 19:51, igor cabral uchoa <igorucho...@yahoo.com.br> wrote:

> Maybe it is a basic question, but does your cluster have enough resources
> to run your application? It is requesting 208G of RAM.
>
> Thanks,
>
> On Friday, October 4, 2019, 2:31 PM, Jochen Hebbrecht
> <jochenhebbre...@gmail.com> wrote:
>
> Hi Igor,
>
> We are deploying by submitting a batch job on a Livy server (from our
> local PC or a Jenkins node). The Livy server then deploys the Spark job on
> the cluster itself.
>
> For example:
> ---
>
> Running '/usr/lib/spark/bin/spark-submit' '--class' '##MY_MAIN_CLASS##'
> '--conf' 'spark.driver.userClassPathFirst=true' '--conf'
> 'spark.default.parallelism=180' '--conf' 'spark.executor.memory=52g' '--conf'
> 'spark.driver.memory=52g' '--conf' 'spark.yarn.tags=livy-batch-0-owjPBdmC'
> '--conf' 'spark.executor.instances=3' '--conf'
> 'spark.executor.memoryOverhead=6144' '--conf' 'spark.driver.cores=6' '--conf'
> 'spark.driver.memoryOverhead=6144' '--conf'
> 'spark.executor.extraJavaOptions=-XX:ThreadStackSize=2048
> -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
> -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled
> -XX:OnOutOfMemoryError=\'kill -9 %p\'' '--conf'
> 'spark.executor.userClassPathFirst=true' '--conf'
> 'spark.submit.deployMode=cluster' '--conf'
> 'spark.yarn.submit.waitAppCompletion=false' '--conf'
> 'spark.executor.extraClassPath=true' '-- ...
>
> ---
>
> Jochen
>
> On Fri, Oct 4, 2019 at 17:42, igor cabral uchoa
> <igorucho...@yahoo.com.br> wrote:
>
> Hi Roland!
>
> What deploy mode are you using when you submit your applications? Is it
> client or cluster mode?
>
> Regards,
>
> On Friday, October 4, 2019, 12:37 PM, Roland Johann
> <roland.joh...@phenetic.io.INVALID> wrote:
>
> These are dynamic port ranges and depend on the configuration of your
> cluster. Per job there is a separate application master, so there can't be
> just one port.
> If I remember correctly, the default EMR setup creates worker security
> groups with unrestricted traffic within the group, e.g. between the worker
> nodes.
> Depending on your security requirements, I suggest that you start with a
> default-like setup and determine ports and port ranges from the docs
> afterwards to further restrict traffic between the nodes.
>
> Kind regards
>
> On Fri, Oct 4, 2019 at 17:16, Jochen Hebbrecht
> <jochenhebbre...@gmail.com> wrote:
>
> Hi Roland,
>
> We do indeed have custom security groups. Can you tell me exactly what
> needs to be able to reach what?
> For example, is it from the master instance to the driver instance? And
> which port should be open?
>
> Jochen
>
> On Fri, Oct 4, 2019 at 17:14, Roland Johann
> <roland.joh...@phenetic.io> wrote:
>
> Hi Jochen,
>
> did you set up the EMR cluster with custom security groups? Can you confirm
> that the relevant EC2 instances can connect through the relevant ports?
>
> Best regards
>
> On Fri, Oct 4, 2019 at 17:09, Jochen Hebbrecht
> <jochenhebbre...@gmail.com> wrote:
>
> Hi Jeff,
>
> Thanks! Just tried that, but the same timeout occurs :-( ...
>
> Jochen
>
> On Fri, Oct 4, 2019 at 16:37, Jeff Zhang <zjf...@gmail.com> wrote:
>
> You can try to increase the property spark.yarn.am.waitTime (by default it
> is 100s).
> Maybe you are doing some very time-consuming operation when initializing
> SparkContext, which causes the timeout.
>
> See this property here:
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
> On Fri, Oct 4, 2019 at 10:08 PM, Jochen Hebbrecht
> <jochenhebbre...@gmail.com> wrote:
>
> Hi,
>
> I'm using Spark 2.4.2 on AWS EMR 5.24.0.
> I'm trying to submit a Spark job to the cluster. The job gets accepted,
> but the YARN application fails with:
>
> {code}
> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception:
> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> {code}
>
> It actually goes wrong at this line:
> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>
> Now, I'm 100% sure Spark is OK and there's no bug, but there must be
> something wrong with my setup. I don't understand the code of the
> ApplicationMaster, so could somebody explain to me what it is trying to
> reach? Where exactly does the connection time out? Then at least I can
> debug it further, because I don't have a clue what it is doing :-)
>
> Thanks for any help!
> Jochen
>
> --
> Best Regards
>
> Jeff Zhang
>
> --
>
> *Roland Johann*
> Software Developer/Data Engineer
>
> *phenetic GmbH*
> Lütticher Straße 10, 50674 Köln, Germany
>
> Mobile: +49 172 365 26 46
> Mail: roland.joh...@phenetic.io
> Web: phenetic.io
>
> Commercial register: Amtsgericht Köln (HRB 92595)
> Managing directors: Roland Johann, Uwe Reimann
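
A side note on the 208G figure quoted above: it matches the heap settings in the spark-submit command (three executors plus the driver, 52g each), but it leaves out the memoryOverhead values. The Python sketch below redoes that arithmetic. The constant and function names are made up for illustration; it assumes the unitless overhead values are MiB, and it ignores any rounding YARN may apply to container sizes, so treat the result as a rough lower bound rather than what the cluster actually reserves.

```python
# Values copied from the spark-submit command quoted in this thread.
EXECUTOR_MEMORY_GIB = 52        # spark.executor.memory=52g
DRIVER_MEMORY_GIB = 52          # spark.driver.memory=52g
EXECUTOR_INSTANCES = 3          # spark.executor.instances=3
EXECUTOR_OVERHEAD_MIB = 6144    # spark.executor.memoryOverhead=6144 (assumed MiB)
DRIVER_OVERHEAD_MIB = 6144      # spark.driver.memoryOverhead=6144 (assumed MiB)


def requested_memory_gib(include_overhead: bool) -> float:
    """Rough total memory requested from YARN in cluster mode, in GiB.

    Hypothetical helper: it sums heap sizes and, optionally, the
    memoryOverhead settings. Scheduler rounding is not modelled.
    """
    heap_gib = EXECUTOR_INSTANCES * EXECUTOR_MEMORY_GIB + DRIVER_MEMORY_GIB
    if not include_overhead:
        return float(heap_gib)
    overhead_mib = EXECUTOR_INSTANCES * EXECUTOR_OVERHEAD_MIB + DRIVER_OVERHEAD_MIB
    return heap_gib + overhead_mib / 1024


if __name__ == "__main__":
    print("heap only:     ", requested_memory_gib(include_overhead=False), "GiB")
    print("with overhead: ", requested_memory_gib(include_overhead=True), "GiB")
```

With the settings above, heap alone comes to 208 GiB and the overhead adds another 24 GiB, so the cluster needs room for at least 232 GiB across four containers of 58 GiB each.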