[jira] [Created] (SPARK-6305) Add support for log4j 2.x to Spark
Tal Sliwowicz created SPARK-6305:
---------------------------------

Summary: Add support for log4j 2.x to Spark
Key: SPARK-6305
URL: https://issues.apache.org/jira/browse/SPARK-6305
Project: Spark
Issue Type: New Feature
Components: Build
Reporter: Tal Sliwowicz

log4j 2 requires replacing the slf4j binding and adding the log4j 2 jars to the classpath. Since Spark ships shaded jars, this must be done during the build.
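For readers who want to try the swap on the application side, here is a minimal sbt sketch of the binding change the issue describes. The version numbers and exclusion targets are assumptions (log4j-slf4j-impl is log4j 2's slf4j binding; Spark 1.x pulled in slf4j-log4j12 and log4j 1.x); and, as the issue notes, because Spark's published assemblies are shaded, the real change has to happen in Spark's own build, not just downstream.

{code}
// build.sbt -- hypothetical sketch, not Spark's actual build change.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.3.0")
    .exclude("org.slf4j", "slf4j-log4j12")  // drop the log4j 1.x slf4j binding
    .exclude("log4j", "log4j"),             // drop log4j 1.x itself
  // log4j 2 jars plus its slf4j binding; versions are assumptions
  "org.apache.logging.log4j" % "log4j-api" % "2.2",
  "org.apache.logging.log4j" % "log4j-core" % "2.2",
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.2"
)
{code}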
[jira] [Commented] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182005#comment-14182005 ]

Tal Sliwowicz commented on SPARK-4006:
--------------------------------------

After the fix was merged to master, I created PRs for 1.0 and 1.1 and updated the original 0.9 PR (the merges were not clean).


Spark Driver crashes whenever an Executor is registered twice
-------------------------------------------------------------

Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Assignee: Tal Sliwowicz
Priority: Critical
Fix For: 1.2.0

This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
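For context on the fix referenced above, a minimal sketch of the direction it took; this is an assumed reconstruction, not the merged patch: on a duplicate registration the driver drops the stale block manager state for that executor and accepts the new registration, instead of calling System.exit(1).

{code}
// Assumed reconstruction of the fix's approach -- not the exact merged patch.
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(oldId) =>
        // The old executor died without a RemoveExecutor being delivered.
        // Clean up its state and let the new registration proceed.
        logError("Got two different block manager registrations on executor "
          + id.executorId + "; replacing " + oldId + " with " + id)
        removeExecutor(id.executorId) // assumed helper that drops the stale state
      case None =>
    }
    blockManagerIdByExecutor(id.executorId) = id
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}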
[jira] [Commented] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180512#comment-14180512 ]

Tal Sliwowicz commented on SPARK-4006:
--------------------------------------

Cool! Would be very interesting to know. For us it's hard to force-reproduce this (because you need two registers without a remove in between), but it always happens eventually, so I can tell for sure that the fix resolved our issue.


Spark Driver crashes whenever an Executor is registered twice
-------------------------------------------------------------

Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
[jira] [Comment Edited] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180512#comment-14180512 ]

Tal Sliwowicz edited comment on SPARK-4006 at 10/22/14 8:48 PM:
----------------------------------------------------------------

Cool! Would be very interesting to know. For us it's hard to force-reproduce this (because you need two registers without a remove in between), but it always happens eventually, so I can tell for sure that the fix resolved our issue.

was (Author: sliwo):
Cool! Would be very interesting to know. For us it's hard to force reproduce this (because you need to registers without a remove in between), but it always happens eventually, so I can tell for sure that the fix resolved our issue.


Spark Driver crashes whenever an Executor is registered twice
-------------------------------------------------------------

Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
[jira] [Commented] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179351#comment-14179351 ]

Tal Sliwowicz commented on SPARK-4006:
--------------------------------------

Another pull request, this time on master: https://github.com/apache/spark/pull/2886


Spark Driver crashes whenever an Executor is registered twice
-------------------------------------------------------------

Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
[jira] [Issue Comment Deleted] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tal Sliwowicz updated SPARK-4006:
---------------------------------

Comment: was deleted

(was: Another pull request - this time on master - https://github.com/apache/spark/pull/2886)


Spark Driver crashes whenever an Executor is registered twice
-------------------------------------------------------------

Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
[jira] [Commented] (SPARK-1042) spark cleans all java broadcast variables when it hits the spark.cleaner.ttl
[ https://issues.apache.org/jira/browse/SPARK-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176708#comment-14176708 ]

Tal Sliwowicz commented on SPARK-1042:
--------------------------------------

[~qqsun8819] I think the issue was resolved in 0.9.2. We are not experiencing it any more. Thanks!


spark cleans all java broadcast variables when it hits the spark.cleaner.ttl
-----------------------------------------------------------------------------

Key: SPARK-1042
URL: https://issues.apache.org/jira/browse/SPARK-1042
Project: Spark
Issue Type: Bug
Components: Java API, Spark Core
Affects Versions: 0.8.0, 0.8.1, 0.9.0
Reporter: Tal Sliwowicz
Assignee: OuyangJin
Priority: Critical
Labels: memory_leak

When spark.cleaner.ttl is set, Spark performs the cleanup on time, but it cleans all broadcast variables, not just those older than the TTL. This causes an exception in the next mapPartitions run, because the broadcast variable cannot be found, even when it was created immediately before the task ran. Our temporary workaround is to leave the TTL unset and suffer an ongoing memory leak (which forces a restart). We are using JavaSparkContext, and our broadcast variables are Java HashMaps.
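A minimal Scala sketch of the pattern the report describes, assuming the 0.9-era API (app name and values are illustrative): a broadcast variable created just before a job can still disappear when the TTL-based cleaner fires, because the cleaner removes all broadcasts rather than only expired ones.

{code}
// Illustrative repro sketch, assuming the 0.9-era API.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cleaner-ttl-sketch")
  .set("spark.cleaner.ttl", "3600") // clean metadata older than one hour
val sc = new SparkContext(conf)

// Per the report: a Java HashMap broadcast, created right before the job.
val lookup = sc.broadcast(new java.util.HashMap[String, String]())

// When the cleaner fires it removes *all* broadcasts, so this closure can
// fail with a missing-broadcast exception even though `lookup` is fresh.
sc.parallelize(1 to 100).mapPartitions { iter =>
  val m = lookup.value
  iter.map(i => (i, m.size))
}.count()
{code}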
[jira] [Created] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
Tal Sliwowicz created SPARK-4006:
---------------------------------

Summary: Spark Driver crashes whenever an Executor is registered twice
Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.0.2, 0.9.2
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tal Sliwowicz updated SPARK-4006:
---------------------------------

Description:

This is a huge robustness issue for us in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}

was:

We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}


Spark Driver crashes whenever an Executor is registered twice
-------------------------------------------------------------

Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

This is a huge robustness issue for us in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tal Sliwowicz updated SPARK-4006:
---------------------------------

Description:

This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}

was:

This is a huge robustness issue for us in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}


Spark Driver crashes whenever an Executor is registered twice
-------------------------------------------------------------

Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
[jira] [Commented] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice
[ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176774#comment-14176774 ]

Tal Sliwowicz commented on SPARK-4006:
--------------------------------------

Fixed in https://github.com/apache/spark/pull/2854


Spark Driver crashes whenever an Executor is registered twice
-------------------------------------------------------------

Key: SPARK-4006
URL: https://issues.apache.org/jira/browse/SPARK-4006
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs. We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed. The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) = new BlockManagerInfo(
      id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}
[jira] [Commented] (SPARK-1175) on shutting down a long running job, the cluster does not accept new jobs and gets hung
[ https://issues.apache.org/jira/browse/SPARK-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970641#comment-13970641 ]

Tal Sliwowicz commented on SPARK-1175:
--------------------------------------

This prevents us from having a real automated fail-over solution when the driver fails. We cannot automatically start a new driver because Spark is stuck on cleanup.


on shutting down a long running job, the cluster does not accept new jobs and gets hung
----------------------------------------------------------------------------------------

Key: SPARK-1175
URL: https://issues.apache.org/jira/browse/SPARK-1175
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.8.1, 0.9.0
Reporter: Tal Sliwowicz
Assignee: Nan Zhu
Labels: shutdown, worker

When shutting down a long-running job (24+ hours) that runs periodically on the same context and generates a lot of shuffles (many hundreds of GB), the Spark workers hang for a long while and the cluster does not accept new jobs. The only way to proceed is to kill -9 the workers. This is a big problem because when multiple contexts run on the same cluster, one must stop them all for a simple restart. The context is stopped using sc.stop(). This happens both in standalone mode and under Mesos. We suspect this is caused by the "delete Spark local dirs" thread. Attached is a thread dump of the worker; the relevant part may be:

{code}
"SIGTERM handler" - Thread t@41040
   java.lang.Thread.State: BLOCKED
	at java.lang.Shutdown.exit(Shutdown.java:168)
	- waiting to lock 69eab6a3 (a java.lang.Class) owned by "SIGTERM handler" t@41038
	at java.lang.Terminator$1.handle(Terminator.java:35)
	at sun.misc.Signal$1.run(Signal.java:195)
	at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
	- None

"delete Spark local dirs" - Thread t@40
   java.lang.Thread.State: RUNNABLE
	at java.io.UnixFileSystem.delete0(Native Method)
	at java.io.UnixFileSystem.delete(UnixFileSystem.java:251)
	at java.io.File.delete(File.java:904)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:482)
	at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
	at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
	at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
	at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
	at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:141)
	at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:139)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)

   Locked ownable synchronizers:
	- None

"SIGTERM handler" - Thread t@41038
   java.lang.Thread.State: WAITING
	at java.lang.Object.wait(Native Method)
	- waiting on 355c6c8d (a org.apache.spark.storage.DiskBlockManager$$anon$1)
	at java.lang.Thread.join(Thread.java:1186)
	at java.lang.Thread.join(Thread.java:1239)
	at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
	at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
	at java.lang.Shutdown.runHooks(Shutdown.java:79)
	at java.lang.Shutdown.sequence(Shutdown.java:123)
	at java.lang.Shutdown.exit(Shutdown.java:168)
	- locked 69eab6a3 (a java.lang.Class)
	at java.lang.Terminator$1.handle(Terminator.java:35)
	at sun.misc.Signal$1.run(Signal.java:195)
	at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
	- None
{code}
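The dump above explains the hang: Shutdown.exit holds the class lock while ApplicationShutdownHooks joins the "delete Spark local dirs" hook, which is busy recursively deleting hundreds of GB of shuffle files, so the second SIGTERM blocks behind the first. A self-contained, hypothetical sketch of that failure mode (illustrative only, not Spark code):

{code}
// Hypothetical standalone repro: a slow JVM shutdown hook blocks exit,
// so any further exit/SIGTERM just waits on the first one.
object SlowShutdownHook {
  def main(args: Array[String]): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread("delete local dirs") {
      override def run(): Unit = {
        // Stands in for Utils.deleteRecursively over a huge shuffle dir:
        // Shutdown.exit cannot complete until this returns.
        Thread.sleep(60L * 60 * 1000)
      }
    })
    println("Send SIGTERM now; the JVM will appear hung until the hook finishes.")
    Thread.sleep(Long.MaxValue)
  }
}
{code}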
[jira] [Commented] (SPARK-1175) on shutting down a long running job, the cluster does not accept new jobs and gets hung
[ https://issues.apache.org/jira/browse/SPARK-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971520#comment-13971520 ]

Tal Sliwowicz commented on SPARK-1175:
--------------------------------------

Yes


on shutting down a long running job, the cluster does not accept new jobs and gets hung
----------------------------------------------------------------------------------------

Key: SPARK-1175
URL: https://issues.apache.org/jira/browse/SPARK-1175
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.8.1, 0.9.0
Reporter: Tal Sliwowicz
Assignee: Nan Zhu
Labels: shutdown, worker

When shutting down a long-running job (24+ hours) that runs periodically on the same context and generates a lot of shuffles (many hundreds of GB), the Spark workers hang for a long while and the cluster does not accept new jobs. The only way to proceed is to kill -9 the workers. This is a big problem because when multiple contexts run on the same cluster, one must stop them all for a simple restart. The context is stopped using sc.stop(). This happens both in standalone mode and under Mesos. We suspect this is caused by the "delete Spark local dirs" thread. Attached is a thread dump of the worker; the relevant part may be:

{code}
"SIGTERM handler" - Thread t@41040
   java.lang.Thread.State: BLOCKED
	at java.lang.Shutdown.exit(Shutdown.java:168)
	- waiting to lock 69eab6a3 (a java.lang.Class) owned by "SIGTERM handler" t@41038
	at java.lang.Terminator$1.handle(Terminator.java:35)
	at sun.misc.Signal$1.run(Signal.java:195)
	at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
	- None

"delete Spark local dirs" - Thread t@40
   java.lang.Thread.State: RUNNABLE
	at java.io.UnixFileSystem.delete0(Native Method)
	at java.io.UnixFileSystem.delete(UnixFileSystem.java:251)
	at java.io.File.delete(File.java:904)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:482)
	at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
	at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
	at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
	at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
	at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:141)
	at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:139)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)

   Locked ownable synchronizers:
	- None

"SIGTERM handler" - Thread t@41038
   java.lang.Thread.State: WAITING
	at java.lang.Object.wait(Native Method)
	- waiting on 355c6c8d (a org.apache.spark.storage.DiskBlockManager$$anon$1)
	at java.lang.Thread.join(Thread.java:1186)
	at java.lang.Thread.join(Thread.java:1239)
	at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
	at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
	at java.lang.Shutdown.runHooks(Shutdown.java:79)
	at java.lang.Shutdown.sequence(Shutdown.java:123)
	at java.lang.Shutdown.exit(Shutdown.java:168)
	- locked 69eab6a3 (a java.lang.Class)
	at java.lang.Terminator$1.handle(Terminator.java:35)
	at sun.misc.Signal$1.run(Signal.java:195)
	at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
	- None
{code}