[jira] [Created] (SPARK-6305) Add support for log4j 2.x to Spark

2015-03-12 Thread Tal Sliwowicz (JIRA)
Tal Sliwowicz created SPARK-6305:


 Summary: Add support for log4j 2.x to Spark
 Key: SPARK-6305
 URL: https://issues.apache.org/jira/browse/SPARK-6305
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Tal Sliwowicz


Log4j 2.x requires replacing the SLF4J binding and adding the Log4j 2 jars to the 
classpath. Since Spark ships shaded jars, this must be done during the build.
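
For illustration, a hedged sbt sketch of the kind of dependency swap this implies (the Spark and Log4j 2 versions below are placeholders, not the ones any particular Spark release ships with):

{code}
// build.sbt fragment (sketch only): swap the Log4j 1.x SLF4J binding for Log4j 2.x.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.3.0")
    .exclude("org.slf4j", "slf4j-log4j12")   // drop the SLF4J -> Log4j 1.x binding
    .exclude("log4j", "log4j"),              // drop Log4j 1.x itself
  "org.apache.logging.log4j" % "log4j-api"        % "2.3",
  "org.apache.logging.log4j" % "log4j-core"       % "2.3",
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.3"  // SLF4J -> Log4j 2 bridge
)
{code}

Against Spark's own shaded assembly this swap cannot be done from a downstream build file alone, which is why the request is to support it in Spark's build.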






[jira] [Commented] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-23 Thread Tal Sliwowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182005#comment-14182005
 ] 

Tal Sliwowicz commented on SPARK-4006:
--

After the fix was merged to master, I created PRs for 1.0 and 1.1 and updated the 
original 0.9 PR (the merges were not clean).
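
To make the direction of the fix concrete, here is a minimal sketch of non-fatal handling for a duplicate registration (a hypothetical simplification, not the exact merged patch; it assumes the surrounding BlockManagerMasterActor members, including a removeExecutor helper that drops the stale executor's state):

{code}
// Sketch only: on a duplicate registration, replace the stale block manager for
// that executor and keep going instead of calling System.exit(1).
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(oldId) =>
        // The executor re-registered (e.g. after a disconnect) before any
        // RemoveExecutor arrived: clean up the old registration and continue.
        logError("Got two different block manager registrations on executor " +
          id.executorId + "; replacing " + oldId + " with " + id)
        removeExecutor(id.executorId)   // assumed helper that removes the stale state
      case None =>
    }
    blockManagerIdByExecutor(id.executorId) = id
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}

The key change is that a duplicate register becomes a recoverable event rather than a driver-killing one.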

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Assignee: Tal Sliwowicz
Priority: Critical
 Fix For: 1.2.0


 This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
 private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
   if (!blockManagerInfo.contains(id)) {
     blockManagerIdByExecutor.get(id.executorId) match {
       case Some(manager) =>
         // A block manager of the same executor already exists.
         // This should never happen. Let's just quit.
         logError("Got two different block manager registrations on " + id.executorId)
         System.exit(1)
       case None =>
         blockManagerIdByExecutor(id.executorId) = id
     }
     logInfo("Registering block manager %s with %s RAM".format(
       id.hostPort, Utils.bytesToString(maxMemSize)))
     blockManagerInfo(id) =
       new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
   }
   listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
 }
 {code}






[jira] [Commented] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-22 Thread Tal Sliwowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180512#comment-14180512
 ] 

Tal Sliwowicz commented on SPARK-4006:
--

Cool! It would be very interesting to know.
For us it's hard to force-reproduce this (because you need two registrations without 
a remove in between), but it always happens eventually, so I can say for sure 
that the fix resolved our issue.

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

 This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
 private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
   if (!blockManagerInfo.contains(id)) {
     blockManagerIdByExecutor.get(id.executorId) match {
       case Some(manager) =>
         // A block manager of the same executor already exists.
         // This should never happen. Let's just quit.
         logError("Got two different block manager registrations on " + id.executorId)
         System.exit(1)
       case None =>
         blockManagerIdByExecutor(id.executorId) = id
     }
     logInfo("Registering block manager %s with %s RAM".format(
       id.hostPort, Utils.bytesToString(maxMemSize)))
     blockManagerInfo(id) =
       new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
   }
   listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
 }
 {code}






[jira] [Comment Edited] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-22 Thread Tal Sliwowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180512#comment-14180512
 ] 

Tal Sliwowicz edited comment on SPARK-4006 at 10/22/14 8:48 PM:


Cool! It would be very interesting to know.
For us it's hard to force-reproduce this (because you need two registrations 
without a remove in between), but it always happens eventually, so I can say 
for sure that the fix resolved our issue.


was (Author: sliwo):
Cool! Would be very interesting to know.
For us it's hard to force reproduce this (because you need to registers without 
a remove in between), but it always happens eventually, so I can tell for sure 
that the fix resolved our issue.

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

 This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
 private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
   if (!blockManagerInfo.contains(id)) {
     blockManagerIdByExecutor.get(id.executorId) match {
       case Some(manager) =>
         // A block manager of the same executor already exists.
         // This should never happen. Let's just quit.
         logError("Got two different block manager registrations on " + id.executorId)
         System.exit(1)
       case None =>
         blockManagerIdByExecutor(id.executorId) = id
     }
     logInfo("Registering block manager %s with %s RAM".format(
       id.hostPort, Utils.bytesToString(maxMemSize)))
     blockManagerInfo(id) =
       new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
   }
   listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
 }
 {code}






[jira] [Commented] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-21 Thread Tal Sliwowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179351#comment-14179351
 ] 

Tal Sliwowicz commented on SPARK-4006:
--

Another pull request - this time on master - 
https://github.com/apache/spark/pull/2886

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

 This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
 private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
   if (!blockManagerInfo.contains(id)) {
     blockManagerIdByExecutor.get(id.executorId) match {
       case Some(manager) =>
         // A block manager of the same executor already exists.
         // This should never happen. Let's just quit.
         logError("Got two different block manager registrations on " + id.executorId)
         System.exit(1)
       case None =>
         blockManagerIdByExecutor(id.executorId) = id
     }
     logInfo("Registering block manager %s with %s RAM".format(
       id.hostPort, Utils.bytesToString(maxMemSize)))
     blockManagerInfo(id) =
       new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
   }
   listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
 }
 {code}






[jira] [Issue Comment Deleted] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-21 Thread Tal Sliwowicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tal Sliwowicz updated SPARK-4006:
-
Comment: was deleted

(was: Another pull request - this time on master - 
https://github.com/apache/spark/pull/2886)

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

 This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
 private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
   if (!blockManagerInfo.contains(id)) {
     blockManagerIdByExecutor.get(id.executorId) match {
       case Some(manager) =>
         // A block manager of the same executor already exists.
         // This should never happen. Let's just quit.
         logError("Got two different block manager registrations on " + id.executorId)
         System.exit(1)
       case None =>
         blockManagerIdByExecutor(id.executorId) = id
     }
     logInfo("Registering block manager %s with %s RAM".format(
       id.hostPort, Utils.bytesToString(maxMemSize)))
     blockManagerInfo(id) =
       new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
   }
   listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
 }
 {code}






[jira] [Commented] (SPARK-1042) spark cleans all java broadcast variables when it hits the spark.cleaner.ttl

2014-10-20 Thread Tal Sliwowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176708#comment-14176708
 ] 

Tal Sliwowicz commented on SPARK-1042:
--

[~qqsun8819] I think the issue was resolved in 0.9.2. We are not experiencing 
it any more. Thanks!

 spark cleans all java broadcast variables when it hits the spark.cleaner.ttl 
 -

 Key: SPARK-1042
 URL: https://issues.apache.org/jira/browse/SPARK-1042
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 0.8.0, 0.8.1, 0.9.0
Reporter: Tal Sliwowicz
Assignee: OuyangJin
Priority: Critical
  Labels: memory_leak

 When setting spark.cleaner.ttl, Spark performs the cleanup on time, but it cleans all broadcast variables, not just the ones older than the TTL. This causes an exception the next time mapPartitions runs, because it cannot find the broadcast variable, even when it was created immediately before running the task.
 Our temporary workaround is to not set the TTL and live with an ongoing memory leak (which forces a restart).
 We are using JavaSparkContext and our broadcast variables are Java HashMaps.
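
For illustration, a hedged, minimal sketch of the scenario described above (the names, the local master, the 60-second TTL, and the sleep are arbitrary choices for the sketch; it assumes the 0.9-era SparkConf/broadcast API):

{code}
import java.util.{HashMap => JHashMap}
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastTtlSketch {
  def main(args: Array[String]): Unit = {
    // Aggressive TTL so the periodic cleaner fires quickly (sketch only).
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("broadcast-ttl-sketch")
      .set("spark.cleaner.ttl", "60") // seconds
    val sc = new SparkContext(conf)

    // Broadcast a Java HashMap, as in the report above.
    val lookup = new JHashMap[String, Int]()
    lookup.put("a", 1)
    val bc = sc.broadcast(lookup)

    // Sleep past the TTL so the cleaner runs, then use the broadcast from a task.
    // Per this issue, the broadcast can already be gone and the task fails.
    Thread.sleep(90 * 1000)
    val values = sc.parallelize(Seq("a", "a"), 2)
      .mapPartitions(iter => iter.map(k => bc.value.get(k)))
      .collect()
    println(values.mkString(","))
    sc.stop()
  }
}
{code}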






[jira] [Created] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-20 Thread Tal Sliwowicz (JIRA)
Tal Sliwowicz created SPARK-4006:


 Summary: Spark Driver crashes whenever an Executor is registered 
twice
 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.0.2, 0.9.2
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical


We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.

The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}






[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-20 Thread Tal Sliwowicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tal Sliwowicz updated SPARK-4006:
-
Description: 
This is a huge robustness issue for us in mission-critical, time-sensitive (real-time) Spark jobs.

We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.

The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}

  was:
We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.

The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}


 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

 This is a huge robustness issue for us in mission-critical, time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
 private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
   if (!blockManagerInfo.contains(id)) {
     blockManagerIdByExecutor.get(id.executorId) match {
       case Some(manager) =>
         // A block manager of the same executor already exists.
         // This should never happen. Let's just quit.
         logError("Got two different block manager registrations on " + id.executorId)
         System.exit(1)
       case None =>
         blockManagerIdByExecutor(id.executorId) = id
     }
     logInfo("Registering block manager %s with %s RAM".format(
       id.hostPort, Utils.bytesToString(maxMemSize)))
     blockManagerInfo(id) =
       new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
   }
   listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
 }
 {code}






[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-20 Thread Tal Sliwowicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tal Sliwowicz updated SPARK-4006:
-
Description: 
This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.

We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.

The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}

  was:
This is a huge robustness issue for us in mission-critical, time-sensitive (real-time) Spark jobs.

We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.

The issue is the System.exit(1) in BlockManagerMasterActor:

{code}
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(manager) =>
        // A block manager of the same executor already exists.
        // This should never happen. Let's just quit.
        logError("Got two different block manager registrations on " + id.executorId)
        System.exit(1)
      case None =>
        blockManagerIdByExecutor(id.executorId) = id
    }
    logInfo("Registering block manager %s with %s RAM".format(
      id.hostPort, Utils.bytesToString(maxMemSize)))
    blockManagerInfo(id) =
      new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
  }
  listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
{code}


 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

 This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
 private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
   if (!blockManagerInfo.contains(id)) {
     blockManagerIdByExecutor.get(id.executorId) match {
       case Some(manager) =>
         // A block manager of the same executor already exists.
         // This should never happen. Let's just quit.
         logError("Got two different block manager registrations on " + id.executorId)
         System.exit(1)
       case None =>
         blockManagerIdByExecutor(id.executorId) = id
     }
     logInfo("Registering block manager %s with %s RAM".format(
       id.hostPort, Utils.bytesToString(maxMemSize)))
     blockManagerInfo(id) =
       new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
   }
   listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
 }
 {code}






[jira] [Commented] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-10-20 Thread Tal Sliwowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176774#comment-14176774
 ] 

Tal Sliwowicz commented on SPARK-4006:
--

Fixed in - https://github.com/apache/spark/pull/2854

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Priority: Critical

 This is a huge robustness issue for us (Taboola) in mission-critical, time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers and, even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
 private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
   if (!blockManagerInfo.contains(id)) {
     blockManagerIdByExecutor.get(id.executorId) match {
       case Some(manager) =>
         // A block manager of the same executor already exists.
         // This should never happen. Let's just quit.
         logError("Got two different block manager registrations on " + id.executorId)
         System.exit(1)
       case None =>
         blockManagerIdByExecutor(id.executorId) = id
     }
     logInfo("Registering block manager %s with %s RAM".format(
       id.hostPort, Utils.bytesToString(maxMemSize)))
     blockManagerInfo(id) =
       new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
   }
   listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
 }
 {code}






[jira] [Commented] (SPARK-1175) on shutting down a long running job, the cluster does not accept new jobs and gets hung

2014-04-16 Thread Tal Sliwowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970641#comment-13970641
 ] 

Tal Sliwowicz commented on SPARK-1175:
--

This prevents us from having a real automated failover solution when the 
driver fails. We cannot automatically start a new driver because Spark is stuck 
on cleanup.

 on shutting down a long running job, the cluster does not accept new jobs and 
 gets hung
 ---

 Key: SPARK-1175
 URL: https://issues.apache.org/jira/browse/SPARK-1175
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.1, 0.9.0
Reporter: Tal Sliwowicz
Assignee: Nan Zhu
  Labels: shutdown, worker

 When shutting down a long-running job (24+ hours) that runs periodically on the same context and generates a lot of shuffles (many hundreds of GB), the Spark workers hang for a long while and the cluster does not accept new jobs. The only way to proceed is to kill -9 the workers.
 This is a big problem because when multiple contexts run on the same cluster, one must stop them all for a simple restart.
 The context is stopped using sc.stop().
 This happens both in standalone mode and under Mesos.
 We suspect this is caused by the delete Spark local dirs thread. A thread dump of the worker is attached; the relevant part may be:
 SIGTERM handler - Thread t@41040
java.lang.Thread.State: BLOCKED
   at java.lang.Shutdown.exit(Shutdown.java:168)
   - waiting to lock 69eab6a3 (a java.lang.Class) owned by SIGTERM 
 handler t@41038
   at java.lang.Terminator$1.handle(Terminator.java:35)
   at sun.misc.Signal$1.run(Signal.java:195)
   at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
   - None
 delete Spark local dirs - Thread t@40
java.lang.Thread.State: RUNNABLE
   at java.io.UnixFileSystem.delete0(Native Method)
   at java.io.UnixFileSystem.delete(UnixFileSystem.java:251)
   at java.io.File.delete(File.java:904)
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:482)
   at 
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
   at 
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
   at 
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
   at 
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
   at 
 org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:141)
   at 
 org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:139)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at 
 org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)
Locked ownable synchronizers:
   - None
 SIGTERM handler - Thread t@41038
java.lang.Thread.State: WAITING
   at java.lang.Object.wait(Native Method)
   - waiting on 355c6c8d (a 
 org.apache.spark.storage.DiskBlockManager$$anon$1)
   at java.lang.Thread.join(Thread.java:1186)
   at java.lang.Thread.join(Thread.java:1239)
   at 
 java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
   at 
 java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
   at java.lang.Shutdown.runHooks(Shutdown.java:79)
   at java.lang.Shutdown.sequence(Shutdown.java:123)
   at java.lang.Shutdown.exit(Shutdown.java:168)
   - locked 69eab6a3 (a java.lang.Class)
   at java.lang.Terminator$1.handle(Terminator.java:35)
   at sun.misc.Signal$1.run(Signal.java:195)
   at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
   - None
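
For illustration, a hedged sketch of the mechanism the dump points at (hypothetical and heavily simplified, not Spark's actual DiskBlockManager/Worker code): a JVM shutdown hook that recursively deletes large local dirs keeps Shutdown.exit busy, so the second SIGTERM handler above stays BLOCKED until the delete finishes.

{code}
import java.io.File

object SlowShutdownHookSketch {
  // Recursively delete a directory tree; over many hundreds of GB this can take a long time.
  def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  def main(args: Array[String]): Unit = {
    val localDir = new File(args.headOption.getOrElse("/tmp/spark-local-sketch"))

    // Shutdown hooks run inside Shutdown.exit(); the JVM cannot finish exiting
    // until they return, which is why a second SIGTERM blocks as in the dump above.
    Runtime.getRuntime.addShutdownHook(new Thread("delete local dirs (sketch)") {
      override def run(): Unit = deleteRecursively(localDir)
    })

    // Stand-in for a worker waiting for SIGTERM.
    Thread.sleep(Long.MaxValue)
  }
}
{code}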





[jira] [Commented] (SPARK-1175) on shutting down a long running job, the cluster does not accept new jobs and gets hung

2014-04-16 Thread Tal Sliwowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971520#comment-13971520
 ] 

Tal Sliwowicz commented on SPARK-1175:
--

Yes

 on shutting down a long running job, the cluster does not accept new jobs and 
 gets hung
 ---

 Key: SPARK-1175
 URL: https://issues.apache.org/jira/browse/SPARK-1175
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.1, 0.9.0
Reporter: Tal Sliwowicz
Assignee: Nan Zhu
  Labels: shutdown, worker

 When shutting down a long-running job (24+ hours) that runs periodically on the same context and generates a lot of shuffles (many hundreds of GB), the Spark workers hang for a long while and the cluster does not accept new jobs. The only way to proceed is to kill -9 the workers.
 This is a big problem because when multiple contexts run on the same cluster, one must stop them all for a simple restart.
 The context is stopped using sc.stop().
 This happens both in standalone mode and under Mesos.
 We suspect this is caused by the delete Spark local dirs thread. A thread dump of the worker is attached; the relevant part may be:
 SIGTERM handler - Thread t@41040
java.lang.Thread.State: BLOCKED
   at java.lang.Shutdown.exit(Shutdown.java:168)
   - waiting to lock 69eab6a3 (a java.lang.Class) owned by SIGTERM 
 handler t@41038
   at java.lang.Terminator$1.handle(Terminator.java:35)
   at sun.misc.Signal$1.run(Signal.java:195)
   at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
   - None
 delete Spark local dirs - Thread t@40
java.lang.Thread.State: RUNNABLE
   at java.io.UnixFileSystem.delete0(Native Method)
   at java.io.UnixFileSystem.delete(UnixFileSystem.java:251)
   at java.io.File.delete(File.java:904)
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:482)
   at 
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
   at 
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
   at 
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
   at 
 org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
   at 
 org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:141)
   at 
 org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:139)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at 
 org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)
Locked ownable synchronizers:
   - None
 SIGTERM handler - Thread t@41038
java.lang.Thread.State: WAITING
   at java.lang.Object.wait(Native Method)
   - waiting on 355c6c8d (a 
 org.apache.spark.storage.DiskBlockManager$$anon$1)
   at java.lang.Thread.join(Thread.java:1186)
   at java.lang.Thread.join(Thread.java:1239)
   at 
 java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
   at 
 java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
   at java.lang.Shutdown.runHooks(Shutdown.java:79)
   at java.lang.Shutdown.sequence(Shutdown.java:123)
   at java.lang.Shutdown.exit(Shutdown.java:168)
   - locked 69eab6a3 (a java.lang.Class)
   at java.lang.Terminator$1.handle(Terminator.java:35)
   at sun.misc.Signal$1.run(Signal.java:195)
   at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
   - None


