I added some overkill-high timeouts to the OkHttpClient.Builder() in RetrofitClientFactory.scala, and I don't seem to be timing out anymore.
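For anyone else hitting this, the raised-timeout client then gets handed to Retrofit along these lines. This is a sketch from memory, not the exact factory code: `dispatcher` and `resolvedProxy` are defined elsewhere in RetrofitClientFactory.scala, `baseUri` is a placeholder for the staging server URI (the value of spark.kubernetes.resourceStagingServer.uri), and any converter factories are omitted.

```scala
import java.util.concurrent.TimeUnit
import okhttp3.OkHttpClient
import retrofit2.Retrofit

// Sketch only: raise all three OkHttp timeouts well above the slow
// first-contact window, then hand the client to Retrofit.
// dispatcher, resolvedProxy, and baseUri are assumed to come from
// the surrounding factory code.
val okHttpClient: OkHttpClient = new OkHttpClient.Builder()
  .dispatcher(dispatcher)
  .proxy(resolvedProxy)
  .connectTimeout(120, TimeUnit.SECONDS) // time to establish the TCP connection
  .writeTimeout(120, TimeUnit.SECONDS)   // gap between writes while uploading the jar
  .readTimeout(120, TimeUnit.SECONDS)    // gap between reads while waiting on the server
  .build()

val retrofit: Retrofit = new Retrofit.Builder()
  .baseUrl(baseUri)      // placeholder for the staging server URI
  .client(okHttpClient)
  .build()
```

Note OkHttp's read timeout is per-read, not a cap on the whole call, so 120 s here only has to outlast the longest single stall, not the full upload.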
val okHttpClientBuilder = new OkHttpClient.Builder()
  .dispatcher(dispatcher)
  .proxy(resolvedProxy)
  .connectTimeout(120, TimeUnit.SECONDS)
  .writeTimeout(120, TimeUnit.SECONDS)
  .readTimeout(120, TimeUnit.SECONDS)

-Jenna

On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <jenna.ho...@gmail.com> wrote:

> So I'm running into an issue with my resource staging server that's
> producing a stacktrace like Issue 342
> <https://github.com/apache-spark-on-k8s/spark/issues/342>, but I don't
> think for the same reasons. What's happening is that every time after I
> start up a resource staging server, the first job submitted that uses it
> will fail with a java.net.SocketTimeoutException: timeout, and then every
> subsequent job will run perfectly. Including with different jars and
> different users. It's only ever the first job that fails, and it always
> fails. I know I'm also running into Issue 577
> <https://github.com/apache-spark-on-k8s/spark/issues/577> in that it
> takes about three minutes before the resource staging server is
> accessible, but I'm still failing after waiting over ten minutes, or in
> one case overnight. And I'm just using the examples jar, so it's not a
> super large jar like in Issue 342.
>
> This isn't great for our CI process, so has anyone seen anything like
> this before, or does anyone know how to increase the timeout if it just
> takes a while on initial contact? Using spark.network.timeout has no
> effect.
>
> [jhoole@nid00006 spark]$ kubectl get pods | grep jhoole-spark
> jhoole-spark-resource-staging-server-64666675c8-w5cdm   1/1   Running   0   13m
>
> [jhoole@nid00006 spark]$ kubectl get svc | grep jhoole-spark
> jhoole-spark-resource-staging-service   NodePort   10.96.143.55   <none>   10000:30622/TCP   13m
>
> [jhoole@nid00006 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
> 2018-03-27 12:30:13 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-03-27 12:30:13 INFO  UserGroupInformation:966 - Login successful for user jhoole@local using keytab file /security/secrets/jhoole.keytab
> 2018-03-27 12:30:14 INFO  HadoopStepsOrchestrator:54 - Hadoop Conf directory: /etc/hadoop/conf
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls to: jhoole
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls to: jhoole
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls groups to:
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls groups to:
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jhoole); groups with view permissions: Set(); users with modify permissions: Set(jhoole); groups with modify permissions: Set()
> Exception in thread "main" java.net.SocketTimeoutException: timeout
>         at okio.Okio$4.newTimeoutException(Okio.java:230)
>         at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
>         at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
>         at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
>         at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
>         at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
>         at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
>         at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
>         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>         at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
>         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>         at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>         at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
>         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>         at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>         at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
>         at okhttp3.RealCall.execute(RealCall.java:69)
>         at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
>         at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
>         at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
>         at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)
>         at org.apache.spark.deploy.k8s.submit.submitsteps.initcontainer.SubmittedResourcesInitContainerConfigurationStep.configureInitContainer(SubmittedResourcesInitContainerConfigurationStep.scala:48)
>         at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:43)
>         at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:42)
>         at scala.collection.immutable.List.foreach(List.scala:381)
>         at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep.configureDriver(InitContainerBootstrapStep.scala:42)
>         at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:102)
>         at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:101)
>         at scala.collection.immutable.List.foreach(List.scala:381)
>         at org.apache.spark.deploy.k8s.submit.Client.run(Client.scala:101)
>         at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:200)
>         at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:193)
>         at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
>         at org.apache.spark.deploy.k8s.submit.Client$.run(Client.scala:193)
>         at org.apache.spark.deploy.k8s.submit.Client$.main(Client.scala:213)
>         at org.apache.spark.deploy.k8s.submit.Client.main(Client.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.read(SocketInputStream.java:204)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at okio.Okio$2.read(Okio.java:139)
>         at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
>         ... 47 more
> 2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Deleting directory /tmp/uploaded-jars-4c7ca1cf-31d6-4dba-9203-c9a6f1cd4099
>
> Thanks,
> Jenna
>