Hi Roland,

I just tried what you suggested, and it helped me find the root cause. Once I had the default EMR cluster, I submitted the Spark job directly from the master instance (using the 'spark-submit' command in a terminal) instead of submitting it through Livy. This gave me much more logging in the terminal, and that logging showed what was causing the timeout. The timeout was related to a call to one of our company's services, and that call failed due to access constraints.
Fixing those access constraints made the Spark job succeed! So, in conclusion: the problem had nothing to do with Spark itself; the Livy output logging was hiding the real error details.

Thank you all for your help! :-)

Jochen

On Fri, 4 Oct 2019 at 19:32, Roland Johann <roland.joh...@phenetic.io> wrote:

> Hi Jochen,
>
> Can you create a small EMR cluster with all defaults and run the job there?
> This way we can make sure that the issue is not related to infrastructure or YARN configuration.
>
> Kind regards
>
> Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote on Fri, 4 Oct 2019 at 19:27:
>
>> Hi Roland,
>>
>> I switched to the default security groups and ran my job again, but the same exception pops up :-( ...
>> All traffic is open on the security groups now.
>>
>> Jochen
>>
>> On Fri, 4 Oct 2019 at 17:37, Roland Johann <roland.joh...@phenetic.io> wrote:
>>
>>> These are dynamic port ranges and depend on the configuration of your cluster. There is a separate application master per job, so there can't be just one port.
>>> If I remember correctly, the default EMR setup creates worker security groups with unrestricted traffic within the group, e.g. between the worker nodes.
>>> Depending on your security requirements, I suggest that you start with a default-like setup and afterwards determine the ports and port ranges from the docs to further restrict traffic between the nodes.
>>>
>>> Kind regards
>>>
>>> Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote on Fri, 4 Oct 2019 at 17:16:
>>>
>>>> Hi Roland,
>>>>
>>>> We do indeed have custom security groups. Can you tell me exactly what needs to be able to reach what?
>>>> For example, is it from the master instance to the driver instance? And which port should be open?
>>>>
>>>> Jochen
>>>>
>>>> On Fri, 4 Oct 2019 at 17:14, Roland Johann <roland.joh...@phenetic.io> wrote:
>>>>
>>>>> Hi Jochen,
>>>>>
>>>>> did you set up the EMR cluster with custom security groups?
>>>>> Can you confirm that the relevant EC2 instances can connect through the relevant ports?
>>>>>
>>>>> Best regards
>>>>>
>>>>> Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote on Fri, 4 Oct 2019 at 17:09:
>>>>>
>>>>>> Hi Jeff,
>>>>>>
>>>>>> Thanks! I just tried that, but the same timeout occurs :-( ...
>>>>>>
>>>>>> Jochen
>>>>>>
>>>>>> On Fri, 4 Oct 2019 at 16:37, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>
>>>>>>> You can try increasing the property spark.yarn.am.waitTime (100s by default).
>>>>>>> Maybe you are doing some very time-consuming operation when initializing SparkContext, which causes the timeout.
>>>>>>>
>>>>>>> See this property here:
>>>>>>> http://spark.apache.org/docs/latest/running-on-yarn.html
>>>>>>>
>>>>>>>
>>>>>>> Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote on Fri, 4 Oct 2019 at 10:08 PM:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job to the cluster. The job gets accepted, but the YARN application fails with:
>>>>>>>>
>>>>>>>>
>>>>>>>> {code}
>>>>>>>> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
>>>>>>>> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>>>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>>>>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>>>> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>>>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>>>>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>>>> {code}
>>>>>>>>
>>>>>>>> It actually goes wrong at this line:
>>>>>>>> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>>>>>>>>
>>>>>>>> Now, I'm 100% sure Spark is OK and there's no bug, so there must be something wrong with my setup. I don't understand the code of the ApplicationMaster, so could somebody explain to me what it is trying to reach?
>>>>>>>> Where exactly does the connection time out? Then I can at least debug it further, because I don't have a clue what it is doing :-)
>>>>>>>>
>>>>>>>> Thanks for any help!
>>>>>>>> Jochen
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Jeff Zhang
>>>>>>>
>>>>>> --
>>>>>
>>>>> *Roland Johann*
>>>>> Software Developer/Data Engineer
>>>>>
>>>>> *phenetic GmbH*
>>>>> Lütticher Straße 10, 50674 Köln, Germany
>>>>>
>>>>> Mobile: +49 172 365 26 46
>>>>> Mail: roland.joh...@phenetic.io
>>>>> Web: phenetic.io
>>>>>
>>>>> Commercial register: Amtsgericht Köln (HRB 92595)
>>>>> Managing directors: Roland Johann, Uwe Reimann
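
A note on the mechanism behind the trace in the original question, for anyone who finds this thread later. In cluster mode, spark.yarn.am.waitTime is the time the YARN ApplicationMaster waits for the user application to initialize its SparkContext, and the failing frame (ApplicationMaster.runDriver, via ThreadUtils.awaitResult) is that wait expiring after the default 100 seconds. If user code blocks before that point, for example on a service call that cannot get through, the job fails with exactly this TimeoutException. Below is a minimal Python sketch of that pattern only. It is not Spark code: the names run_driver and user_main and the shortened durations are hypothetical stand-ins, and it assumes the blocked service call happened before the context was created, which the thread does not state explicitly.

```python
import concurrent.futures
import threading
import time


def user_main(context_ready: concurrent.futures.Future, service_delay_s: float) -> None:
    """Hypothetical user application: a slow service call, then the 'context'."""
    # Stand-in for the company service call that was blocked by access constraints.
    time.sleep(service_delay_s)
    # Only after the call returns does the (simulated) SparkContext become available.
    context_ready.set_result("spark-context")


def run_driver(wait_time_s: float, service_delay_s: float) -> str:
    """Hypothetical ApplicationMaster: start user code, then wait for the context.

    wait_time_s plays the role of spark.yarn.am.waitTime (100s by default in Spark).
    """
    context_ready: concurrent.futures.Future = concurrent.futures.Future()
    worker = threading.Thread(
        target=user_main, args=(context_ready, service_delay_s), daemon=True
    )
    worker.start()
    try:
        # Block until the user code publishes the context or the wait time expires.
        return context_ready.result(timeout=wait_time_s)
    except concurrent.futures.TimeoutError:
        # The waiter only knows that it timed out, not why the user code is slow.
        return "FAILED: timed out waiting for context"


if __name__ == "__main__":
    print(run_driver(wait_time_s=0.2, service_delay_s=1.0))   # slow call: wait expires
    print(run_driver(wait_time_s=1.0, service_delay_s=0.05))  # fast call: context arrives
```

The waiting side reports only the timeout, never the reason the user code was slow, which fits what was seen here: the real cause became visible only in the more verbose 'spark-submit' output. Raising the wait time helps only if the slow step would eventually finish, so increasing spark.yarn.am.waitTime changed nothing until the access constraints were fixed.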