Hi Jochen,

Can you create a small EMR cluster with all defaults and run the job there? That way we can make sure that the issue is not related to infrastructure or YARN configuration.
Kind regards

On Fri, 4 Oct 2019 at 19:27, Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote:

> Hi Roland,
>
> I switched to the default security groups and ran my job again, but the same
> exception pops up :-( ...
> All traffic is open on the security groups now.
>
> Jochen
>
> On Fri, 4 Oct 2019 at 17:37, Roland Johann <roland.joh...@phenetic.io> wrote:
>
>> These are dynamic port ranges and depend on the configuration of your
>> cluster. Each job has a separate application master, so there can't be
>> just one port.
>> If I remember correctly, the default EMR setup creates worker security
>> groups with unrestricted traffic within the group, e.g. between the worker
>> nodes.
>> Depending on your security requirements, I suggest that you start with a
>> default-like setup and then determine the ports and port ranges from the
>> docs to further restrict traffic between the nodes.
>>
>> Kind regards
>>
>> On Fri, 4 Oct 2019 at 17:16, Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote:
>>
>>> Hi Roland,
>>>
>>> We do indeed have custom security groups. Can you tell me exactly what
>>> needs to be reachable from where?
>>> For example, is it from the master instance to the driver instance? And
>>> which port should be open?
>>>
>>> Jochen
>>>
>>> On Fri, 4 Oct 2019 at 17:14, Roland Johann <roland.joh...@phenetic.io> wrote:
>>>
>>>> Hi Jochen,
>>>>
>>>> Did you set up the EMR cluster with custom security groups? Can you
>>>> confirm that the relevant EC2 instances can connect through the relevant ports?
>>>>
>>>> Best regards
>>>>
>>>> On Fri, 4 Oct 2019 at 17:09, Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Thanks! Just tried that, but the same timeout occurs :-( ...
>>>>>
>>>>> Jochen
>>>>>
>>>>> On Fri, 4 Oct 2019 at 16:37, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>>> You can try increasing the property spark.yarn.am.waitTime (the default
>>>>>> is 100s).
>>>>>> Maybe you are doing some very time-consuming operation when
>>>>>> initializing the SparkContext, which causes the timeout.
>>>>>>
>>>>>> See this property here:
>>>>>> http://spark.apache.org/docs/latest/running-on-yarn.html
>>>>>>
>>>>>> On Fri, 4 Oct 2019 at 22:08, Jochen Hebbrecht <jochenhebbre...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark
>>>>>>> job to the cluster. The job gets accepted, but the YARN application
>>>>>>> fails with:
>>>>>>>
>>>>>>> {code}
>>>>>>> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
>>>>>>> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>>>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>>> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED,
>>>>>>> exitCode: 13, (reason: Uncaught exception:
>>>>>>> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>>>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>>>     at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>>> {code}
>>>>>>>
>>>>>>> It actually goes wrong at this line:
>>>>>>> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>>>>>>>
>>>>>>> Now, I'm 100% sure Spark is OK and there's no bug, but there must be
>>>>>>> something wrong with my setup. I don't understand the code of the
>>>>>>> ApplicationMaster, so could somebody explain to me what it is trying to
>>>>>>> reach?
>>>>>>> Where exactly does the connection time out? Then at least I can debug it
>>>>>>> further, because I don't have a clue what it is doing :-)
>>>>>>>
>>>>>>> Thanks for any help!
>>>>>>> Jochen
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang
>>>>>>
>>>>> --
>>>>
>>>> *Roland Johann*
>>>> Software Developer/Data Engineer
>>>>
>>>> *phenetic GmbH*
>>>> Lütticher Straße 10, 50674 Köln, Germany
>>>>
>>>> Mobile: +49 172 365 26 46
>>>> Mail: roland.joh...@phenetic.io
>>>> Web: phenetic.io
>>>>
>>>> Commercial register: Amtsgericht Köln (HRB 92595)
>>>> Managing directors: Roland Johann, Uwe Reimann
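
Roland's question in the thread, whether the relevant EC2 instances can connect through the relevant ports, can be checked with a short script run from each node. What follows is only a minimal sketch under stated assumptions, not an EMR or Spark tool: the function names `can_connect` and `check_targets` are made up for illustration, and the hosts and ports you pass in are placeholders for your own cluster's values. Port 8032 is the stock Hadoop default for `yarn.resourcemanager.address`, but your configuration may differ, and, as noted in the thread, the application master ports are dynamic.

```python
import socket
from typing import Dict, Iterable, Tuple


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # OSError covers refused connections, timeouts, and unresolvable names.
        return False


def check_targets(
    targets: Iterable[Tuple[str, int]], timeout_s: float = 3.0
) -> Dict[Tuple[str, int], bool]:
    """Probe every (host, port) pair and report which ones are reachable."""
    return {
        (host, port): can_connect(host, port, timeout_s)
        for host, port in targets
    }
```

For example, running `check_targets([("<master-private-dns>", 8032)])` on a core node reports whether that node can reach the ResourceManager address, where `<master-private-dns>` stands for the master node's host name. A `False` result only shows that the port is unreachable from that machine; it does not tell you whether a security group, a firewall, or a stopped service is the cause.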