[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-07-12 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17854 @mariahualiu do you plan to address any of the feedback here? If not, this should probably be closed.

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-09 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17854 The reason `spark.yarn.containerLauncherMaxThreads` does not work here is that it only controls how many threads simultaneously send a container start command to YARN; that is usually a much
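
For context, a minimal sketch of the pattern that setting governs, with purely illustrative names (not the actual YarnAllocator/ExecutorRunnable code): a bounded pool caps how many start-container commands are in flight at once, but every launched executor still localizes and tries to register with the driver in roughly the same window.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{ExecutionContext, Future}

object LauncherPoolSketch {
  // Hypothetical stand-ins for a YARN container handle and the NM start call.
  final case class Container(id: String)
  def startContainerOnNodeManager(c: Container): Unit =
    println(s"start command sent for ${c.id}")

  def main(args: Array[String]): Unit = {
    val maxThreads = 25 // plays the role of spark.yarn.containerLauncherMaxThreads
    val pool = Executors.newFixedThreadPool(maxThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    val containers = (1 to 2500).map(i => Container(s"container_$i"))
    // At most `maxThreads` start commands are in flight at any moment,
    // but all 2500 executors still localize and register shortly afterwards.
    containers.foreach(c => Future(startContainerOnNodeManager(c)))

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```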

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-09 Thread foxish
Github user foxish commented on the issue: https://github.com/apache/spark/pull/17854 In Kubernetes/Spark we see fairly similar behavior in the scenario described. When simultaneous container launching is not throttled, it is capable of DoSing the system. Our solution so far is

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-09 Thread mariahualiu
Github user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 @tgravescs I used the default spark.network.timeout (120s). When an executor cannot connect to the driver, here is the executor log: 17/05/01 11:18:25 INFO [main] spark.SecurityManager:

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-08 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/17854 > It took 3~4 minutes to start an executor on an NM (most of the time was spent on container localization: downloading the spark jar, application jar, etc. from the HDFS staging folder). I

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-08 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17854 also what is the exact error/stack trace you see when you say "failed to connect"?

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-08 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17854 what is your network timeout (spark.network.timeout) set to?

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-06 Thread mariahualiu
Github user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 Now I can comfortably use 2500 executors, but when I pushed the executor count to 3000 I saw a lot of heartbeat timeout errors. That is something else we can improve, probably in another JIRA.

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-06 Thread mariahualiu
Github user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 I re-ran the same application adding these configurations "--conf spark.yarn.scheduler.heartbeat.interval-ms=15000 --conf spark.yarn.launchContainer.count.simultaneously=50". Though it took 50
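
For reference, a submit-time invocation along those lines might look like the following; the application class, jar, and values are placeholders, and spark.yarn.launchContainer.count.simultaneously is the knob proposed by this PR rather than an existing upstream setting.

```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2500 \
  --conf spark.yarn.scheduler.heartbeat.interval-ms=15000 \
  --conf spark.yarn.launchContainer.count.simultaneously=50 \
  --class com.example.MyApp my-app.jar
```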

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-06 Thread mariahualiu
Github user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 Let me describe what I've seen when using 2500 executors. 1. In the first few (2~3) requests, the AM received all (in this case 2500) containers from YARN. 2. In a few seconds, 2500

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-06 Thread mariahualiu
Github user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 @squito yes, I capped the number of resources in updateResourceRequests so that YarnAllocator asks for fewer resources in each iteration. When allocation fails in one iteration, the
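
A rough sketch of what that per-iteration cap amounts to, using illustrative names rather than the actual YarnAllocator.updateResourceRequests code: each allocation cycle requests at most a fixed number of new containers instead of the full outstanding shortfall.

```scala
object AllocatorCapSketch {
  // Hypothetical state for a YARN-style allocator.
  var targetNumExecutors = 2500
  var runningExecutors = 0
  var pendingRequests = 0
  val maxRequestsPerIteration = 50 // analogous to the cap described in this PR

  // Called once per AM heartbeat: request at most `maxRequestsPerIteration`
  // new containers instead of the entire remaining shortfall.
  def updateResourceRequests(): Int = {
    val shortfall = targetNumExecutors - runningExecutors - pendingRequests
    val toRequest = math.max(0, math.min(shortfall, maxRequestsPerIteration))
    pendingRequests += toRequest
    toRequest
  }

  def main(args: Array[String]): Unit = {
    println(updateResourceRequests()) // prints 50, not 2500
  }
}
```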

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17854 > Although looking at it maybe I'm missing how it's supposed to handle network failure? Spark has never really handled network failure. If the connection between the driver and the executor

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17854 > If that's what you mean, there's no need for retrying. No RPC calls retry anymore. See #16503 (comment) for an explanation. I see, I guess with the way we have the rpc implemented it

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17854 What do you mean by "not retrying"? Do you mean this line:
```
ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
```
If that's what you

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17854 I took a quick look at the registerExecutor call in CoarseGrainedExecutorBackend and it's not retrying at all. We should change that to retry. We retry heartbeats and many other things so it
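
For illustration only, a generic retry wrapper of the kind being suggested here; as the replies above note, no RPC calls retry anymore by design, so this is a sketch of the idea, not actual Spark code, and the wrapped call in the comment is just an example.

```scala
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Retry an arbitrary operation up to `maxAttempts` times with a fixed pause.
  def withRetries[T](maxAttempts: Int)(op: => T): T =
    Try(op) match {
      case Success(value) => value
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(1000) // simple fixed backoff between attempts
        withRetries(maxAttempts - 1)(op)
      case Failure(e) => throw e
    }

  def main(args: Array[String]): Unit = {
    // In the executor this might wrap something like:
    //   ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
    var attempts = 0
    val result = withRetries(3) {
      attempts += 1
      if (attempts < 3) throw new RuntimeException("transient failure") else "registered"
    }
    println(s"$result after $attempts attempts")
  }
}
```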

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17854 To slow down launching you could just set spark.yarn.containerLauncherMaxThreads to be smaller. That isn't guaranteed, but neither is this really. Just an alternative or something you can do
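
Concretely, that alternative is just another submit-time setting, e.g. with an illustrative value:

```
spark-submit ... --conf spark.yarn.containerLauncherMaxThreads=10 ...
```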

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/17854 also cc @tgravescs @vanzin

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/17854 It looks to me like this is actually making two behavior changes: 1) throttle the requests for new containers, as you describe in your description; 2) drop newly received containers if they

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17854 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76494/

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17854 Merged build finished. Test PASSed.

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17854 **[Test build #76494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76494/testReport)** for PR 17854 at commit

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17854 **[Test build #76494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76494/testReport)** for PR 17854 at commit

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-05 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/17854 Jenkins, ok to test

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17854 Can one of the admins verify this patch?