[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-10 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080789#comment-16080789
 ] 

Kay Ousterhout commented on SPARK-21349:


Out of curiosity, what are the task sizes that you're seeing?

+[~shivaram] -- I know you've looked at task size a lot.  Are these getting 
bigger / do you think we should just raise the warning size for everyone?

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when the task size exceeds a 
> threshold (SPARK-2185). Although this is just a warning message, this issue 
> proposes making `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for 
> advanced users.
> According to the Jenkins log, we also have 123 of these warnings even in our 
> own unit tests.
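To make the proposal concrete, below is a minimal sketch of what such a setting could look like using Spark's internal ConfigBuilder API. The config key `spark.scheduler.taskSizeToWarnKb`, its default, and its placement are assumptions for illustration only, not the change actually proposed in the JIRA.

{code}
// Hypothetical sketch only: expose the warning threshold as a Spark config.
// The key name and default are assumed; ConfigBuilder is Spark's internal
// (private[spark]) builder in org.apache.spark.internal.config.
import org.apache.spark.internal.config.ConfigBuilder

object TaskSizeWarnConfigSketch {
  val TASK_SIZE_TO_WARN_KB =
    ConfigBuilder("spark.scheduler.taskSizeToWarnKb")
      .doc("Warn if a serialized task is larger than this many kilobytes.")
      .intConf
      .createWithDefault(100)
}
{code}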






[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079746#comment-16079746
 ] 

Kay Ousterhout commented on SPARK-21349:


Does that mean we should just raise this threshold for all users?

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when the task size exceeds a 
> threshold (SPARK-2185). Although this is just a warning message, this issue 
> proposes making `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for 
> advanced users.
> According to the Jenkins log, we also have 123 of these warnings even in our 
> own unit tests.






[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079724#comment-16079724
 ] 

Kay Ousterhout commented on SPARK-21349:


Is this a major usability issue (and what's the use case where task sizes are 
regularly > 100KB)?  I'm hesitant to make this a configuration parameter -- 
Spark already has a huge number of configuration parameters, making it hard for 
users to figure out which ones are relevant for them.

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when the task size exceeds a 
> threshold (SPARK-2185). Although this is just a warning message, this issue 
> proposes making `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for 
> advanced users.






[jira] [Commented] (SPARK-20219) Schedule tasks based on size of input from ScheduledRDD

2017-04-05 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957253#comment-15957253
 ] 

Kay Ousterhout commented on SPARK-20219:


Like [~mridulm80] (as mentioned on the PR), I'm hesitant about this idea because 
of the added complexity and information "leakage" from the TaskScheduler back 
to the DAGScheduler (in general, we should be making the interface between 
these components smaller, to make the code easier to reason about -- not 
larger).  [~jinxing6...@126.com], you mentioned some use cases where this is 
helpful; can you post some concrete performance numbers about the difference in 
runtimes?

cc [~imranr] -- thoughts here about whether the performance improvement is worth 
the added complexity?

> Schedule tasks based on size of input from ScheduledRDD
> ---
>
> Key: SPARK-20219
> URL: https://issues.apache.org/jira/browse/SPARK-20219
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
>
> When data is highly skewed in a ShuffledRDD, it makes sense to launch the 
> tasks that process much more input as soon as possible. The current 
> scheduling mechanism in *TaskSetManager* is quite simple:
> {code}
>   for (i <- (0 until numTasks).reverse) {
> addPendingTask(i)
>   }
> {code}
> In the scenario where "large tasks" sit in the bottom half of the tasks array, 
> launching the tasks with much more input early can significantly reduce the 
> time cost and save resources when *"dynamic allocation"* is disabled.
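A self-contained sketch of the idea follows; `taskInputSizes` is an invented helper standing in for whatever size estimate would be available, and this is not the actual TaskSetManager change:

{code}
// Illustrative sketch only: order pending task indices by estimated input size
// (descending) before adding them, so the largest tasks are offered first.
object PendingTaskOrderSketch {
  def orderByInputSize(numTasks: Int, taskInputSizes: Int => Long): Seq[Int] =
    (0 until numTasks).sortBy(i => -taskInputSizes(i))

  def main(args: Array[String]): Unit = {
    val sizes = Array(10L, 500L, 40L, 2000L)            // made-up bytes per task
    val order = orderByInputSize(sizes.length, sizes(_))
    order.foreach(i => println(s"addPendingTask($i)"))  // prints 3, 1, 2, 0
  }
}
{code}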






[jira] [Assigned] (SPARK-19868) conflict TasksetManager lead to spark stopped

2017-03-28 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reassigned SPARK-19868:
--

Assignee: liujianhui

> conflict TasksetManager lead to spark stopped
> -
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: liujianhui
>Assignee: liujianhui
> Fix For: 2.2.0
>
>
> ## Scenario
> A conflicting TaskSetManager threw an exception, which caused the SparkContext 
> to stop. Log:
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 4571114: 4571114.2,4571114.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason is that the resubmission of the stage conflicts with the 
> still-running stage; the missing tasks of the stage are resubmitted because 
> the TaskSetManager's zombie flag had been set to true.
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting
>  ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks 
> had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting
>  ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at 
> MainApp.scala:73), which has no missing parents
> {code}
> The executor that the shuffle task ran on was lost:
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring
>  possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> The time when the task set finished versus when the stage was resubmitted:
> {code}
> handleSuccessfulTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed
>  TaskSet 4571114.1, whose tasks have all completed, from pool 
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding
>  task set 4571114.2 with 1 tasks
> {code}
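For readers unfamiliar with the check that throws here, the sketch below models the invariant TaskSchedulerImpl.submitTasks enforces (at most one non-zombie task set attempt per stage); the class and method names are simplified stand-ins, not Spark's actual code:

{code}
import scala.collection.mutable

// Simplified model of the "more than one active taskSet" check.
final case class TaskSetAttempt(stageId: Int, attempt: Int, var isZombie: Boolean = false) {
  def id: String = s"$stageId.$attempt"
}

object ActiveTaskSetCheck {
  private val byStage = mutable.HashMap.empty[Int, mutable.HashMap[Int, TaskSetAttempt]]

  def submit(ts: TaskSetAttempt): Unit = {
    val attempts = byStage.getOrElseUpdate(ts.stageId, mutable.HashMap.empty)
    attempts(ts.attempt) = ts
    // If any other attempt for this stage is still active (not zombie), fail fast,
    // which is what stopped the SparkContext in the log above.
    val conflicting = attempts.values.exists(other => other.id != ts.id && !other.isZombie)
    if (conflicting) {
      throw new IllegalStateException(
        s"more than one active taskSet for stage ${ts.stageId}: " +
          attempts.values.map(_.id).mkString(","))
    }
  }
}
{code}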






[jira] [Resolved] (SPARK-19868) conflict TasksetManager lead to spark stopped

2017-03-28 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19868.

   Resolution: Fixed
Fix Version/s: 2.2.0

> conflict TasksetManager lead to spark stopped
> -
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: liujianhui
> Fix For: 2.2.0
>
>
> ## Scenario
> A conflicting TaskSetManager threw an exception, which caused the SparkContext 
> to stop. Log:
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 4571114: 4571114.2,4571114.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason is that the resubmission of the stage conflicts with the 
> still-running stage; the missing tasks of the stage are resubmitted because 
> the TaskSetManager's zombie flag had been set to true.
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting
>  ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks 
> had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting
>  ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at 
> MainApp.scala:73), which has no missing parents
> {code}
> The executor that the shuffle task ran on was lost:
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring
>  possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> The time when the task set finished versus when the stage was resubmitted:
> {code}
> handleSuccessfulTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed
>  TaskSet 4571114.1, whose tasks have all completed, from pool 
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding
>  task set 4571114.2 with 1 tasks
> {code}






[jira] [Created] (SPARK-20116) Remove task-level functionality from the DAGScheduler

2017-03-27 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-20116:
--

 Summary: Remove task-level functionality from the DAGScheduler
 Key: SPARK-20116
 URL: https://issues.apache.org/jira/browse/SPARK-20116
 Project: Spark
  Issue Type: Sub-task
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout


Long, long ago, the scheduler code was more modular, and the DAGScheduler 
handled the logic of scheduling DAGs of stages (as the name suggests) and the 
TaskSchedulerImpl handled scheduling the tasks within a stage.  Over time, more 
and more task-specific functionality has been added to the DAGScheduler, and 
now, the DAGScheduler duplicates a bunch of the task tracking that's done by 
other scheduler components.  This makes the scheduler code harder to reason 
about, and has led to some tricky bugs (e.g., SPARK-19263).  We should move all 
of this functionality back to the TaskSchedulerImpl and TaskSetManager, which 
should "hide" that complexity from the DAGScheduler.






[jira] [Commented] (SPARK-19612) Tests failing with timeout

2017-03-27 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944181#comment-15944181
 ] 

Kay Ousterhout commented on SPARK-19612:


And another: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75272/console

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]






[jira] [Assigned] (SPARK-19820) Allow users to kill tasks, and propagate a kill reason

2017-03-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reassigned SPARK-19820:
--

Assignee: Eric Liang

> Allow users to kill tasks, and propagate a kill reason
> --
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.2.0
>
>







[jira] [Resolved] (SPARK-19820) Allow users to kill tasks, and propagate a kill reason

2017-03-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19820.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Allow users to kill tasks, and propagate a kill reason
> --
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
> Fix For: 2.2.0
>
>







[jira] [Updated] (SPARK-19820) Allow users to kill tasks, and propagate a kill reason

2017-03-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19820:
---
Summary: Allow users to kill tasks, and propagate a kill reason  (was: 
Allow reason to be specified for task kill)

> Allow users to kill tasks, and propagate a kill reason
> --
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>







[jira] [Resolved] (SPARK-16929) Speculation-related synchronization bottleneck in checkSpeculatableTasks

2017-03-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-16929.

   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 2.2.0

> Speculation-related synchronization bottleneck in checkSpeculatableTasks
> 
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Nicholas Brown
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> Our cluster has been running slowly since I got speculation working.  I looked 
> into it and noticed that stderr was saying some tasks were taking almost an 
> hour to run, even though the application logs on the nodes showed those tasks 
> only took a minute or so.  Digging into the thread dump for the master node, 
> I noticed a number of threads are blocked, apparently by the speculation 
> thread.  At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler 
> while it looks through the tasks to see what needs to be rerun.  Unfortunately 
> that code loops through each of the tasks, so when you have even just a couple 
> hundred thousand tasks to run, that can be prohibitively slow inside of a 
> synchronized block.  Once I disabled speculation, the job went back to having 
> acceptable performance.
> There are no comments around that lock indicating why it was added, and the 
> git history has a couple of refactorings, so it's hard to find where it was 
> added.  I'm tempted to believe it's the result of someone assuming that an 
> extra synchronized block never hurt anyone (in reality I've seen probably as 
> many bugs caused by over-synchronization as by too little), since it looks too 
> broad to actually be guarding any potential concurrency issue.  But, since 
> concurrency issues can be tricky to reproduce (and yes, I understand that's 
> an extreme understatement), I'm not sure blindly removing it without being 
> familiar with the history is necessarily safe.  
> Can someone look into this?  Or at least make a note in the documentation 
> that speculation should not be used with large clusters?
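A stripped-down sketch of the pattern being described (not the actual TaskSchedulerImpl code): the speculation check does O(numTasks) work while holding the scheduler lock, so any other operation that needs the same lock stalls behind it.

{code}
import scala.collection.mutable

class SpeculationScanSketch {
  private val lock = new Object
  private val runningTaskDurationsMs = mutable.ArrayBuffer.empty[Long]

  def recordRunningTask(durationMs: Long): Unit = lock.synchronized {
    runningTaskDurationsMs += durationMs
  }

  // With hundreds of thousands of tasks, this loop holds the lock long enough
  // to block every other scheduler operation that synchronizes on it.
  def checkSpeculatableTasks(thresholdMs: Long): Seq[Int] = lock.synchronized {
    runningTaskDurationsMs.zipWithIndex.collect {
      case (duration, idx) if duration > thresholdMs => idx
    }.toSeq
  }
}
{code}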






[jira] [Updated] (SPARK-19612) Tests failing with timeout

2017-03-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19612:
---
Affects Version/s: (was: 2.1.1)
   2.2.0

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]






[jira] [Reopened] (SPARK-19612) Tests failing with timeout

2017-03-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reopened SPARK-19612:


This seems to be back; I saw two failures recently:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75124
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75127

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]






[jira] [Resolved] (SPARK-19567) Support some Schedulable variables immutability and access

2017-03-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19567.

   Resolution: Fixed
 Assignee: Eren Avsarogullari
Fix Version/s: 2.2.0

> Support some Schedulable variables immutability and access
> --
>
> Key: SPARK-19567
> URL: https://issues.apache.org/jira/browse/SPARK-19567
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Minor
> Fix For: 2.2.0
>
>
> Support some Schedulable variables immutability and access
> Some Schedulable variables need refactoring for immutability and access 
> modifiers, as follows:
> - from vars to vals (where mutation is not required): this is important to 
> support immutability as much as possible. 
> Sample => Pool: weight, minShare, priority, name and 
> taskSetSchedulingAlgorithm.
> - access modifiers: in particular, access to vars needs to be restricted from 
> other parts of the codebase to prevent potential side effects. 
> Sample => TaskSetManager: tasksSuccessful, totalResultSize, calculatedTasks, 
> etc.
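As a rough illustration of the refactoring being asked for (the field names mirror the samples above, but this is not the exact Spark source):

{code}
// Before: mutable and freely accessible from anywhere in the codebase.
class PoolBefore(var weight: Int, var minShare: Int)

// After: immutable where there is no need to mutate, and the remaining mutable
// state kept private so other parts of the code cannot cause side effects.
class PoolAfter(val weight: Int, val minShare: Int) {
  private[this] var runningTasks: Int = 0
  def incRunningTasks(): Unit = runningTasks += 1
  def numRunningTasks: Int = runningTasks
}
{code}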






[jira] [Updated] (SPARK-19868) conflict TasksetManager lead to spark stopped

2017-03-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19868:
---
Target Version/s: 2.2.0

> conflict TasksetManager lead to spark stopped
> -
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: liujianhui
>
> ## Scenario
> A conflicting TaskSetManager threw an exception, which caused the SparkContext 
> to stop. Log:
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 4571114: 4571114.2,4571114.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason is that the resubmission of the stage conflicts with the 
> still-running stage; the missing tasks of the stage are resubmitted because 
> the TaskSetManager's zombie flag had been set to true.
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting
>  ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks 
> had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting
>  ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at 
> MainApp.scala:73), which has no missing parents
> {code}
> The executor that the shuffle task ran on was lost:
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring
>  possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> The time when the task set finished versus when the stage was resubmitted:
> {code}
> handleSuccessfulTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed
>  TaskSet 4571114.1, whose tasks have all completed, from pool 
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding
>  task set 4571114.2 with 1 tasks
> {code}






[jira] [Resolved] (SPARK-19613) Flaky test: StateStoreRDDSuite.versioning and immutability

2017-03-22 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19613.

Resolution: Cannot Reproduce

I'm closing this because, while it had a burst of failures about a month ago 
(see here: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite=versioning+and+immutability),
it hasn't failed since.

> Flaky test: StateStoreRDDSuite.versioning and immutability
> --
>
> Key: SPARK-19613
> URL: https://issues.apache.org/jira/browse/SPARK-19613
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Priority: Minor
>
> This test: 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite.versioning 
> and immutability failed on a recent PR: 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72948/testReport/junit/org.apache.spark.sql.execution.streaming.state/StateStoreRDDSuite/versioning_and_immutability/






[jira] [Resolved] (SPARK-19612) Tests failing with timeout

2017-03-22 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19612.

Resolution: Cannot Reproduce

Closing this for now because I haven't seen this issue in a while (we can 
re-open if this starts occurring again)

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]






[jira] [Resolved] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive

2017-03-19 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19988.

   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/17344 (which was merged with 
the wrong JIRA number, but should resolve this issue)

> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column 
> written by Hive
> 
>
> Key: SPARK-19988
> URL: https://issues.apache.org/jira/browse/SPARK-19988
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Xiao Li
>  Labels: flaky-test
> Fix For: 2.2.0
>
> Attachments: trimmed-unit-test.log
>
>
> "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by 
> Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 
> days here:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive
> eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/
> {noformat}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> SemanticException [Error 10072]: Database does not exist: db2
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
> ...
> {noformat}






[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2017-03-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931009#comment-15931009
 ] 

Kay Ousterhout commented on SPARK-18886:


Sorry for the slow response here!  I realized this is the same issue as 
SPARK-11460 (although that JIRA proposed a slightly different solution), which 
stalled for reasons that are completely my fault (I neglected it because I 
couldn't think of a practical way of solving it).

Imran, unfortunately I don't think your latest idea will quite work.  Delay 
scheduling was originally intended for situations where the number of slots 
that a particular job could use was limited by a fairness policy.  In that 
case, it can be better to wait a bit for a "better" slot (i.e., one that 
satisfies locality preferences).  In particular, if you never wait, you end up 
with this "sticky slot" issue where tasks for a job keep finishing up in a 
"bad" slot (one with no locality preferences), and then they'll be re-offered 
to the same job, which will again accept the bad slot.  If the job just waited 
a bit, it could get a better slot (e.g., as a result of tasks from another job 
finishing). [1]

This relates to your idea because of the following situation: suppose you have 
a cluster with 10 machines, the job has locality preferences for 5 of them 
(with ids 1, 2, 3, 4, 5), and fairness dictates that the job can only use 3 
slots at a time (e.g., it's sharing equally with 2 other jobs).  Suppose that 
for a long time, the job has been running tasks on slots 1, 2, and 3 (so local 
slots).  At this point, the times for machines 6, 7, 8, 9, and 10 will have 
expired, because the job has been running for a while.  But if the job is now 
offered a slot on one of those non-local machines (e.g., 6), the job hasn't 
been waiting long for non-local resources: until this point, it's been running 
its full share of 3 slots at a time, and it's been doing so on machines that 
satisfy locality preferences.  So, we shouldn't accept that slot on machine 6 
-- we should wait a bit to see if we can get a slot on 1, 2, 3, 4, or 5.

The solution I proposed (in a long PR comment) for the other JIRA is: if the 
task set is using fewer than the number of slots it could be using (where “# 
slots it could be using” is all of the slots in the cluster if the job is 
running alone, or the job’s fair share, if it’s not) for some period of time, 
increase the locality level.   The problem with that solution is that I thought 
it was completely impractical to determine the number of slots a TSM "should" 
be allowed to use.

However, after thinking about this more today, I think we might be able to do 
this in a practical way:
- First, I thought that we could use information about when offers are rejected 
to determine this (e.g., if you've been rejecting offers for a while, then 
you're not using your fair share).  But the problem here is that it's not easy 
to determine when you *are* using your fair / allowed share: accepting a single 
offer doesn't necessarily mean that you're now using the allowed share.  This 
is precisely the problem with the current approach, hence this JIRA.
- v1: one possible proxy for this is whether there are slots currently 
available that haven't been accepted by any job.  The TaskSchedulerImpl could 
feasibly pass this information to each TaskSetManager, and the TSM could use it 
to update its delay timer: something like only reset the delay timer to 0 if 
(a) the TSM accepts an offer and (b) the flag passed by the TaskSchedulerImpl 
indicates that there are no other unused slots in the cluster (a rough sketch 
of this appears just after this list).  This fixes the problem described in the 
JIRA: in that case, the flag would indicate that there *were* other unused 
slots, even though a task got successfully scheduled with this offer, so the 
delay timer wouldn't be reset, and would eventually correctly expire.
- v2: The problem with v1 is that it doesn't correctly handle situations where 
e.g., you have two jobs A and B with equal shares.  B is "greedy" and will 
accept any slot (e.g., it's a reduce stage), and A is doing delay scheduling.  
In this case, A might have much less than its share, but the flag from the 
TaskSchedulerImpl would indicate that there were no other free slots in the 
cluster, so the delay timer wouldn't ever expire.  I suspect we could handle 
this (e.g., with some logic in the TaskSchedulerImpl to detect when a 
particular TSM is getting starved: when it keeps rejecting offers that are 
later accepted by someone else) but before thinking about this further, I 
wanted to run the general idea by you to see what your thoughts are.
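A rough, self-contained sketch of the v1 idea (all names are invented for illustration; this is not TaskSetManager code):

{code}
// Reset the locality-wait timer only when an offer is accepted AND the
// scheduler reports no other idle slots in the cluster.
class DelayTimerSketch(localityWaitMs: Long,
                       clock: () => Long = () => System.currentTimeMillis()) {
  private var lastResetTimeMs: Long = clock()

  def onOfferAccepted(clusterHasIdleSlots: Boolean): Unit = {
    // Current behavior resets on every accepted offer (the "sticky slot" problem);
    // here we reset only if the task set truly could not have used another slot.
    if (!clusterHasIdleSlots) {
      lastResetTimeMs = clock()
    }
  }

  def localityWaitExpired(): Boolean = clock() - lastResetTimeMs > localityWaitMs
}
{code}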

[1] There's a whole side question / discussion of how often this is useful for 
Spark at all.  It can be useful if you're running in a shared cluster where 
e.g. Yarn might be assigning you more slots over time, and it's also useful 
when a single Spark context is being shared across many 

[jira] [Resolved] (SPARK-11460) Locality waits should be based on task set creation time, not last launch time

2017-03-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-11460.

Resolution: Duplicate

> Locality waits should be based on task set creation time, not last launch time
> --
>
> Key: SPARK-11460
> URL: https://issues.apache.org/jira/browse/SPARK-11460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: YARN
>Reporter: Shengyue Ji
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Spark waits for the spark.locality.wait period before going from RACK_LOCAL to 
> ANY when selecting an executor for assignment. The timeout is essentially 
> reset each time a new assignment is made.
> We were running Spark streaming on Kafka with a 10 second batch window on 32 
> Kafka partitions with 16 executors. All executors were in the ANY group. At 
> one point one RACK_LOCAL executor was added and all tasks were assigned to 
> it. Each task took about 0.6 seconds to process, resetting the 
> spark.locality.wait timeout (3000ms) repeatedly. This caused the whole 
> process to under-utilize resources and created an increasing backlog.
> spark.locality.wait should be based on the task set creation time, not the 
> last launch time, so that 3000ms after initial creation, all executors can 
> get tasks assigned to them.
> We are specifying a zero timeout for now as a workaround to disable locality 
> optimization. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L556
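An illustrative sketch of the two policies contrasted here (not Spark code): the current behavior measures the locality wait from the last launch time, so every launched task pushes the deadline out, whereas the proposal measures it from the task set's creation time.

{code}
final case class LocalityTimes(taskSetCreationTimeMs: Long, lastLaunchTimeMs: Long)

object LocalityWaitSketch {
  // Returns true when the task set should stop insisting on local executors.
  def shouldFallBackToAny(t: LocalityTimes, nowMs: Long, waitMs: Long,
                          basedOnCreationTime: Boolean): Boolean = {
    val reference = if (basedOnCreationTime) t.taskSetCreationTimeMs else t.lastLaunchTimeMs
    nowMs - reference > waitMs
  }
}
{code}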






[jira] [Resolved] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-03-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-18890.

Resolution: Invalid

I closed this because, as [~imranr] pointed out on the PR, these already happen 
in the same thread.  [~witgo], can you change your PR to reference SPARK-19486, 
which describes the behavior you implemented?

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]
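A generic sketch of the "serialize tasks on different threads" idea mentioned above, using a plain thread pool and Java serialization; this is not Spark's scheduler or backend code:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object ParallelSerializationSketch {
  def serialize(task: java.io.Serializable): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(task)
    out.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    val tasks: Seq[java.io.Serializable] = (1 to 100).map(i => s"task-$i")
    // Each task is serialized on one of the pool's threads instead of a single
    // scheduler thread doing all of the work serially.
    val serialized = Await.result(Future.traverse(tasks)(t => Future(serialize(t))), 1.minute)
    println(s"serialized ${serialized.size} tasks")
    pool.shutdown()
  }
}
{code}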






[jira] [Commented] (SPARK-19565) After fetching failed, success of old attempt of stage should be taken as valid.

2017-03-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930910#comment-15930910
 ] 

Kay Ousterhout commented on SPARK-19565:


[~jinxing6...@126.com] I closed this because it looks like a duplicate of the 
work you did for SPARK-19263.  Feel free to re-open if I've misunderstood.

> After fetching failed, success of old attempt of stage should be taken as 
> valid.
> 
>
> Key: SPARK-19565
> URL: https://issues.apache.org/jira/browse/SPARK-19565
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>
> This is related to SPARK-19263. 
> When fetch failed, stage will be resubmitted. There can be running tasks from 
> both old and new stage attempts. Success of tasks from old stage attempt 
> should be taken as valid and partitionId should be removed from stage's 
> pendingPartitions accordingly. When pending partitions is empty, downstream 
> stage can be scheduled, even though there's still running tasks in the 
> active(new) stage attempt.






[jira] [Resolved] (SPARK-19565) After fetching failed, success of old attempt of stage should be taken as valid.

2017-03-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19565.

Resolution: Duplicate

> After fetching failed, success of old attempt of stage should be taken as 
> valid.
> 
>
> Key: SPARK-19565
> URL: https://issues.apache.org/jira/browse/SPARK-19565
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>
> This is related to SPARK-19263. 
> When fetch failed, stage will be resubmitted. There can be running tasks from 
> both old and new stage attempts. Success of tasks from old stage attempt 
> should be taken as valid and partitionId should be removed from stage's 
> pendingPartitions accordingly. When pending partitions is empty, downstream 
> stage can be scheduled, even though there's still running tasks in the 
> active(new) stage attempt.






[jira] [Commented] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.

2017-03-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930909#comment-15930909
 ] 

Kay Ousterhout commented on SPARK-19755:


I'm closing this because the configs you're proposing to add already exist: 
spark.blacklist.enabled already exists to turn off all blacklisting (this is 
false by default, so the fact that you're seeing blacklisting behavior means 
that your configuration enables blacklisting), and 
spark.blacklist.maxFailedTaskPerExecutor is the other thing you proposed 
adding.  All of the blacklisting parameters are listed here: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L101

Feel free to re-open this if I've misunderstood and the existing configs don't 
address the issues you're seeing!
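For reference, a minimal example of the existing knobs (the values are illustrative, and the per-executor key below is written from memory -- check the linked config package for the exact name):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Turn blacklisting off entirely (it is already false by default).
  .set("spark.blacklist.enabled", "false")
  // Or keep it on and raise the per-executor failure tolerance instead.
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "2")
{code}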

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result 
> - scheduler cannot create an executor after some time.
> ---
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are 
> dockerized.
>Reporter: Timur Abakumov
>
> When a task fails for some reason, MesosCoarseGrainedSchedulerBackend 
> increases the failure counter for the slave where that task was running.
> When the counter is >= 2 (MAX_SLAVE_FAILURES), the Mesos slave is excluded.
> Over time the scheduler cannot create a new executor, because every slave is 
> in the blacklist.  Task failures are not necessarily related to host health, 
> especially for long-running streaming apps.
> If accepted as a bug: a possible solution is to use spark.blacklist.enabled to 
> make that functionality optional, and, if it makes sense, MAX_SLAVE_FAILURES 
> could also be made configurable.






[jira] [Resolved] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.

2017-03-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19755.

Resolution: Not A Problem

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result 
> - scheduler cannot create an executor after some time.
> ---
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are 
> dockerized.
>Reporter: Timur Abakumov
>
> When a task fails for some reason, MesosCoarseGrainedSchedulerBackend 
> increases the failure counter for the slave where that task was running.
> When the counter is >= 2 (MAX_SLAVE_FAILURES), the Mesos slave is excluded.
> Over time the scheduler cannot create a new executor, because every slave is 
> in the blacklist.  Task failures are not necessarily related to host health, 
> especially for long-running streaming apps.
> If accepted as a bug: a possible solution is to use spark.blacklist.enabled to 
> make that functionality optional, and, if it makes sense, MAX_SLAVE_FAILURES 
> could also be made configurable.






[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929422#comment-15929422
 ] 

Kay Ousterhout commented on SPARK-19990:


Thanks [~windpiger]!

> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.runTest(HiveDDLSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> 

[jira] [Updated] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19990:
---
Description: 
This test seems to be failing consistently on all of the maven builds: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
 and is possibly caused by SPARK-19763.

Here's a stack trace for the failure: 

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: 
jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
  at org.apache.hadoop.fs.Path.initialize(Path.java:206)
  at org.apache.hadoop.fs.Path.<init>(Path.java:172)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
  at 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
  at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
  at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
  at 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
  at 
org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
  at 
org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
  at 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
  at 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
  at 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at 
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:41)
  at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
  at 
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.runTest(HiveDDLSuite.scala:41)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
  at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
  

[jira] [Commented] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive

2017-03-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929334#comment-15929334
 ] 

Kay Ousterhout commented on SPARK-19988:


With some help from [~joshrosen] I spent some time digging into this and found:

(1) If you look at the failures, they're all from the maven build.  In fact, 
100% of the maven builds shown there fail (and none of the SBT ones).  This is 
weird because this is also failing on the PR builder, which uses SBT. 

(2) The maven build failures are all accompanied by 3 other test failures; the 
group of 4 tests seems to consistently fail together.  3 of the tests fail with 
errors similar to this one (saying that some database does not exist).  The 4th 
test, org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary 
view using, fails with a more substantive error.  I filed SPARK-19990 for that issue.

(3) A commit right around the time the tests started failing: 
https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9#diff-b7094baa12601424a5d19cb930e3402fR46
 added code to remove all of the databases after each test (a rough sketch of 
that kind of cleanup hook is below).  I wonder if that's somehow getting run 
concurrently or asynchronously in the maven build (after the HiveCatalogedDDLSuite 
fails), which is why the error in the DDLSuite causes the other tests to fail 
saying that a database can't be found.  I have extremely limited knowledge of both 
(a) how the maven tests are executed and (b) the SQL code, so it's possible these 
are totally unrelated issues.

None of this explains why the test is failing in the PR builder, where the 
failures have been isolated to this test.
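
For reference, here is a minimal, hypothetical sketch of the kind of per-test cleanup described in point (3) -- this is not the code from the linked commit, just an illustration of how an afterEach hook that drops every non-default database could leave other suites (or later tests in the same JVM) pointing at databases that no longer exist. The trait name and the use of the public Catalog API are assumptions.

{code}
import org.scalatest.{BeforeAndAfterEach, Suite}
import org.apache.spark.sql.SparkSession

// Hypothetical cleanup hook, for illustration only (not the actual commit):
// after each test, drop every database except "default".
trait DropDatabasesAfterEach extends BeforeAndAfterEach { self: Suite =>
  def spark: SparkSession

  override protected def afterEach(): Unit = {
    try {
      spark.catalog.listDatabases().collect()
        .map(_.name)
        .filter(_ != "default")
        .foreach(db => spark.sql(s"DROP DATABASE IF EXISTS `$db` CASCADE"))
    } finally {
      super.afterEach()
    }
  }
}
{code}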

> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column 
> written by Hive
> 
>
> Key: SPARK-19988
> URL: https://issues.apache.org/jira/browse/SPARK-19988
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
> Attachments: trimmed-unit-test.log
>
>
> "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by 
> Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 
> days here:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive
> eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/
> {noformat}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> SemanticException [Error 10072]: Database does not exist: db2
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19990:
--

 Summary: Flaky test: 
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary 
view using
 Key: SPARK-19990
 URL: https://issues.apache.org/jira/browse/SPARK-19990
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.2.0
Reporter: Kay Ousterhout


This test seems to be failing consistently on all of the maven builds: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
 and is possibly caused by SPARK-19763.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout

2017-03-16 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19964:
---
Summary: Flaky test: SparkSubmitSuite fails due to Timeout  (was: 
SparkSubmitSuite fails due to Timeout)

> Flaky test: SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to TestFailedDueToTimeoutException
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout

2017-03-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929257#comment-15929257
 ] 

Kay Ousterhout edited comment on SPARK-19964 at 3/17/17 12:54 AM:
--

[~srowen] it looks like this is failing periodically in master: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars
 (I added flaky to the name, which I suspect is the source of confusion)



was (Author: kayousterhout):
[~srowen] it looks like this is failing periodically in master: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars


> Flaky test: SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to TestFailedDueToTimeoutException
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19964) SparkSubmitSuite fails due to Timeout

2017-03-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929257#comment-15929257
 ] 

Kay Ousterhout commented on SPARK-19964:


[~srowen] it looks like this is failing periodically in master: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars


> SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to TestFailedDueToTimeoutException
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive

2017-03-16 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19988:
---
Component/s: SQL

> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column 
> written by Hive
> 
>
> Key: SPARK-19988
> URL: https://issues.apache.org/jira/browse/SPARK-19988
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
> Attachments: trimmed-unit-test.log
>
>
> "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by 
> Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 
> days here:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive
> eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/
> {noformat}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> SemanticException [Error 10072]: Database does not exist: db2
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2017-03-16 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19989:
--

 Summary: Flaky Test: 
org.apache.spark.sql.kafka010.KafkaSourceStressSuite
 Key: SPARK-19989
 URL: https://issues.apache.org/jira/browse/SPARK-19989
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.2.0
Reporter: Kay Ousterhout
Priority: Minor


This test failed recently here: 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/

And based on Josh's dashboard 
(https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressSuite_name=stress+test+with+multiple+topics+and+partitions),
it seems to fail a few times every month.  Here's the full error from the most 
recent failure:

Error Message

org.scalatest.exceptions.TestFailedException:  Error adding data: replication factor: 1 larger than available brokers: 0
kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)

== Progress ==
   AssertOnQuery(, )
   CheckAnswer: 
   StopStream
   StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@5d888be0,Map())
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(0, 1, 2, 3, 4, 5, 6, 7, 8), message = )
   CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9]
   StopStream
   StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1be724ee,Map())
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(9, 10, 11, 12, 13, 14), message = )
   CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15]
   StopStream
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(), message = )
=> AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(15), message = Add topic stress7)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(16, 17, 18, 19, 20, 21, 22), message = Add partition)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(23, 24), message = Add partition)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = Add topic stress9)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(25, 26, 27, 28, 29, 30, 31, 32, 33), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(34, 35, 36, 37, 38, 39), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(40, 41, 42, 43), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(44), message = Add partition)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(45, 46, 47, 48, 49, 50, 51, 52), message = Add partition)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(53, 54, 55), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(56, 57, 58, 59, 60, 61, 62, 63), message = Add partition)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(64, 65, 66, 67, 68, 69, 70), message = )
   StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@65068637,Map())
   AddKafkaData(topics = Set(stress4, stress6, stress2, 

[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-15 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927539#comment-15927539
 ] 

Kay Ousterhout commented on SPARK-19803:


Awesome thanks!

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-15 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927263#comment-15927263
 ] 

Kay Ousterhout commented on SPARK-19803:


This failed again today: 

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74621/testReport/org.apache.spark.storage/BlockManagerProactiveReplicationSuite/proactive_block_replication___3_replicas___2_block_manager_deletions/

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers

2017-03-15 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-18066.

   Resolution: Fixed
 Assignee: Eren Avsarogullari
Fix Version/s: 2.2.0

> Add Pool usage policies test coverage for FIFO & FAIR Schedulers
> 
>
> Key: SPARK-18066
> URL: https://issues.apache.org/jira/browse/SPARK-18066
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Minor
> Fix For: 2.2.0
>
>
> The following Pool usage cases need to have unit test coverage:
> - The FIFO Scheduler just uses the *root pool*, so even if the *spark.scheduler.pool* 
> property is set, the related pool is not created and *TaskSetManagers* are added 
> to the root pool.
> - The FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property 
> is not set. This can happen when the Properties object is null or empty (*new 
> Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_).
> - The FAIR Scheduler creates a new pool with default values when the 
> *spark.scheduler.pool* property points to a _non-existent_ pool. This can 
> happen when the scheduler allocation file is not set or does not contain the 
> related pool.
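
For context, here is a minimal usage sketch of the property these cases revolve around. This is an assumed standalone example (not taken from the test suite): with the FAIR scheduler, jobs submitted from a thread run in the pool named by the thread-local spark.scheduler.pool property.

{code}
import org.apache.spark.sql.SparkSession

// Minimal sketch; assumptions: local master, and a pool named "production"
// either defined in the fair scheduler allocation file or created with defaults.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("pool-usage-sketch")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// Jobs submitted from this thread go to the "production" pool; if the pool is
// not defined in the allocation file, FAIR creates it with default values.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "production")
spark.sparkContext.parallelize(1 to 1000).count()

// Clearing the property sends subsequent jobs back to the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
{code}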



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-14 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reopened SPARK-19803:

  Assignee: Shubham Chopra  (was: Genmao Yu)

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-14 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925353#comment-15925353
 ] 

Kay Ousterhout commented on SPARK-19803:


This does not appear to be fixed -- it looks like there's some error condition 
in the underlying code that can cause this to break?  From 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74412/testReport/org.apache.spark.storage/BlockManagerProactiveReplicationSuite/proactive_block_replication___5_replicas___4_block_manager_deletions/:
 

org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 493 times over 5.00752125399 
seconds. Last failure message: 4 did not equal 5.
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at org.apache.spark.storage.BlockManagerProactiveReplicationSuite.testProactiveReplication(BlockManagerReplicationSuite.scala:492)
at org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply$mcV$sp(BlockManagerReplicationSuite.scala:464)
at org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply(BlockManagerReplicationSuite.scala:464)
at org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply(BlockManagerReplicationSuite.scala:464)

[~shubhamc] and [~cloud_fan], since you worked on the original code for this, 
can you take a look?  I dug into it for a bit, and based on some 
experimentation it looks like there are some race conditions in the 
underlying code.
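
For reference, the failing check has the shape of an eventually-style assertion like the sketch below (names assumed; the real test is the one at BlockManagerReplicationSuite.scala:492 in the trace above). The 5-second window only passes if proactive re-replication reliably restores the replica count after a block manager is removed, which is why a race in the replication path shows up as "4 did not equal 5".

{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Hypothetical helper mirroring the shape of the flaky assertion: poll until
// the observed replica count reaches the expected value, or time out.
def waitForReplicas(currentReplicas: () => Int, expected: Int): Unit =
  eventually(timeout(5.seconds), interval(10.millis)) {
    assert(currentReplicas() == expected)
  }
{code}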

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19803:
---
Labels: flaky-test  (was: )

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19803:
---
Affects Version/s: (was: 2.3.0)
   2.2.0

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19803:
---
Component/s: Tests

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19803.

   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.2.0

Thanks for fixing this [~uncleGen] and for reporting it [~sitalke...@gmail.com]

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling

2017-03-02 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19276.

   Resolution: Fixed
 Assignee: Imran Rashid
Fix Version/s: 2.2.0

> FetchFailures can be hidden by user (or sql) exception handling
> ---
>
> Key: SPARK-19276
> URL: https://issues.apache.org/jira/browse/SPARK-19276
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Critical
> Fix For: 2.2.0
>
>
> The scheduler handles node failures by looking for a special 
> {{FetchFailedException}} thrown by the shuffle block fetcher.  This is 
> handled in {{Executor}} and then passed as a special message back to the driver: 
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/core/src/main/scala/org/apache/spark/executor/Executor.scala#L403
> However, user code exists in between the shuffle block fetcher and that catch 
> block -- it could intercept the exception, wrap it with something else, and 
> throw a different exception.  If that happens, Spark treats it as an ordinary 
> task failure and retries the task, rather than regenerating the missing 
> shuffle data.  The task is eventually retried 4 times, it's doomed to fail 
> each time, and the job fails.
> You might think that no user code should do that -- but even Spark SQL does it:
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214
> Here's an example stack trace.  This is from Spark 1.6, so the SQL code is 
> not the same, but the problem is still there:
> {noformat}
> 17/01/13 19:18:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 1983.0 (TID 304851, xxx): org.apache.spark.SparkException: Task failed while 
> writing rows.
> at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect to xxx/yyy:zzz
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
> ...
> 17/01/13 19:19:29 ERROR scheduler.TaskSetManager: Task 0 in stage 1983.0 
> failed 4 times; aborting job
> {noformat}
> I think the right fix here is to also set a fetch failure status in the 
> {{TaskContextImpl}}, so the executor can check that instead of just one 
> exception.
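
To make the failure mode concrete, here is a minimal sketch (assumed names, not Spark's actual writer code) of how user code sitting between the shuffle fetcher and the executor's catch block can hide the fetch failure:

{code}
// The shuffle read happens lazily while the iterator is consumed, so a
// FetchFailedException can surface inside user code. Re-wrapping it means the
// executor only sees a plain RuntimeException and treats the failure as an
// ordinary task failure instead of regenerating the missing shuffle data.
def writeRows(rows: Iterator[String], write: String => Unit): Unit = {
  try {
    rows.foreach(write)
  } catch {
    case t: Throwable =>
      // t may be (or wrap) org.apache.spark.shuffle.FetchFailedException,
      // but that information is lost to the executor's special-case handling.
      throw new RuntimeException("Task failed while writing rows.", t)
  }
}
{code}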



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893376#comment-15893376
 ] 

Kay Ousterhout commented on SPARK-19796:


Do you think we should (separately) fix the underlying problem?  Specifically, 
we could:

(a) Not send the SPARK_JOB_DESCRIPTION property to the workers, since it's only 
used on the master for the UI (and while users *could* access it, the variable 
name SPARK_JOB_DESCRIPTION is spark-private, which suggests that it shouldn't 
be used by users).  Perhaps this is too risky because users could be relying on it?

(b) Truncate SPARK_JOB_DESCRIPTION to something reasonable (100 characters?) 
before sending it to the workers.  This is more backwards compatible if users 
are actually reading the property, but maybe a useless intermediate approach?  
(A rough sketch of this option is below.)

(c) (Possibly in addition to one of the above) Log a warning if any of the 
properties is longer than 100 characters (or some threshold).

Thoughts?  I can file a JIRA if you think any of these is worthwhile.
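
As a rough sketch of option (b) -- purely illustrative, with an assumed 100-character cap and an assumed helper name, not the actual TaskDescription code -- the driver could cap long string properties before they are serialized for each task:

{code}
import java.util.Properties
import scala.collection.JavaConverters._

// Hypothetical helper: copy the task-local properties, truncating any value
// longer than maxLen so writeUTF (with its 64 KB limit) can never fail on them.
def truncatedProperties(props: Properties, maxLen: Int = 100): Properties = {
  val copy = new Properties()
  for (key <- props.stringPropertyNames().asScala) {
    val value = props.getProperty(key)
    copy.setProperty(key, if (value.length > maxLen) value.take(maxLen) else value)
  }
  copy
}
{code}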

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to the Spark thrift 
> server and they fail at TaskDescription.scala:89 because writeUTF fails when 
> asked to write strings longer than 64 KB (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19631) OutputCommitCoordinator should not allow commits for already failed tasks

2017-03-02 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reassigned SPARK-19631:
--

Assignee: Patrick Woody

> OutputCommitCoordinator should not allow commits for already failed tasks
> -
>
> Key: SPARK-19631
> URL: https://issues.apache.org/jira/browse/SPARK-19631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Patrick Woody
>Assignee: Patrick Woody
> Fix For: 2.2.0
>
>
> This is similar to SPARK-6614, but there is a race condition where a task may 
> fail (e.g. Executor heartbeat timeout) and still manage to go through the 
> commit protocol successfully. After this, any retries of the task will fail 
> indefinitely because of TaskCommitDenied.
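
A minimal sketch of the invariant the fix enforces (assumed structure and names, not the real OutputCommitCoordinator): an attempt the scheduler already knows has failed must never be authorized to commit, so its stale commit grant cannot permanently deny all later retries.

{code}
// Hypothetical sketch, not the actual OutputCommitCoordinator: deny commit
// authorization to any attempt that the scheduler has already marked failed,
// so a "dead" attempt cannot hold the commit lock and block every retry.
class CommitCoordinatorSketch {
  private val failedAttempts = scala.collection.mutable.Set.empty[(Int, Int, Int)] // (stage, partition, attempt)
  private val authorized = scala.collection.mutable.Map.empty[(Int, Int), Int]     // (stage, partition) -> attempt

  def taskFailed(stage: Int, partition: Int, attempt: Int): Unit = synchronized {
    failedAttempts += ((stage, partition, attempt))
    // If the failed attempt held the commit authorization, release it.
    if (authorized.get((stage, partition)).contains(attempt)) authorized -= ((stage, partition))
  }

  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean = synchronized {
    if (failedAttempts.contains((stage, partition, attempt))) {
      false // the fix: an already-failed attempt may not commit
    } else authorized.get((stage, partition)) match {
      case Some(winner) => winner == attempt
      case None =>
        authorized((stage, partition)) = attempt
        true
    }
  }
}
{code}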



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19631) OutputCommitCoordinator should not allow commits for already failed tasks

2017-03-02 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19631.

   Resolution: Fixed
Fix Version/s: 2.2.0

> OutputCommitCoordinator should not allow commits for already failed tasks
> -
>
> Key: SPARK-19631
> URL: https://issues.apache.org/jira/browse/SPARK-19631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Patrick Woody
> Fix For: 2.2.0
>
>
> This is similar to SPARK-6614, but there is a race condition where a task may 
> fail (e.g. Executor heartbeat timeout) and still manage to go through the 
> commit protocol successfully. After this, any retries of the task will fail 
> indefinitely because of TaskCommitDenied.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13931) Resolve stage hanging up problem in a particular case

2017-03-01 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-13931.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Resolve stage hanging up problem in a particular case
> -
>
> Key: SPARK-13931
> URL: https://issues.apache.org/jira/browse/SPARK-13931
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 1.6.1
>Reporter: ZhengYaofeng
> Fix For: 2.2.0
>
>
> Suppose the following steps:
> 1. Turn on the speculation switch in the application. 
> 2. Run this app and suppose the last task of shuffleMapStage 1 finishes. To get 
> the record straight: from the eyes of the DAG this stage really finishes, and 
> from the eyes of the TaskSetManager the variable 'isZombie' is set to true, but 
> runningTasksSet isn't empty because of speculation.
> 3. Suddenly, executor 3 is lost. The TaskScheduler, receiving this signal, invokes 
> the executorLost functions of all of rootPool's taskSetManagers. The DAG, receiving 
> this signal, removes all of this executor's outputLocs.
> 4. The TaskSetManager adds all of this executor's tasks to pendingTasks and tells 
> the DAG they will be resubmitted (attention: possibly not on time).
> 5. The DAG starts to submit a new waiting stage, say shuffleMapStage 2, and finds 
> that shuffleMapStage 1 is its missing parent because some outputLocs were removed 
> due to the lost executor. Then the DAG submits shuffleMapStage 1 again.
> 6. The DAG still receives the Task 'Resubmitted' signal from the old taskSetManager, 
> and increases the number of pendingTasks of shuffleMapStage 1 each time. However, 
> the old taskSetManager won't submit any new task because its variable 'isZombie' 
> is set to true.
> 7. Finally, shuffleMapStage 1 never finishes in the DAG, and neither do the stages 
> depending on it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.

2017-03-01 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19777.

   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 2.2.0

> Scan runningTasksSet when check speculatable tasks in TaskSetManager.
> -
>
> Key: SPARK-19777
> URL: https://issues.apache.org/jira/browse/SPARK-19777
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Minor
> Fix For: 2.2.0
>
>
> When check speculatable tasks in TaskSetManager, only scan runningTasksSet 
> instead of scanning all taskInfos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19772) Flaky test: pyspark.streaming.tests.WindowFunctionTests

2017-02-28 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19772:
---
Labels: flaky-test  (was: )

> Flaky test: pyspark.streaming.tests.WindowFunctionTests
> ---
>
> Key: SPARK-19772
> URL: https://issues.apache.org/jira/browse/SPARK-19772
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>  Labels: flaky-test
>
> Here's the link to the failed build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73598
> FAIL [16.440s]: test_count_by_value_and_window 
> (pyspark.streaming.tests.WindowFunctionTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py",
>  line 668, in test_count_by_value_and_window
> self._test_func(input, func, expected)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py",
>  line 162, in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(0,[312 chars] 2), (5, 1)], [(0, 1), (1, 1), 
> (2, 1), (3, 1), (4, 1), (5, 1)]] != [[(0,[312 chars] 2), (5, 1)]]
> First list contains 1 additional elements.
> First extra element 9:
> [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
>   [[(0, 1)],
>[(0, 2), (1, 1)],
>[(0, 3), (1, 2), (2, 1)],
>[(0, 4), (1, 3), (2, 2), (3, 1)],
>[(0, 5), (1, 4), (2, 3), (3, 2), (4, 1)],
>[(0, 5), (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)],
>[(0, 4), (1, 4), (2, 4), (3, 3), (4, 2), (5, 1)],
>[(0, 3), (1, 3), (2, 3), (3, 3), (4, 2), (5, 1)],
> -  [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)],
> ?  ^
> +  [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)]]
> ?  ^
> -  [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]]
> Stdout:
> ('timeout after', 15)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19772) Flaky test: pyspark.streaming.tests.WindowFunctionTests

2017-02-28 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888758#comment-15888758
 ] 

Kay Ousterhout commented on SPARK-19772:


[~mengxr][~davies] It looks like this came up a while ago in SPARK-7497 and you 
fixed it.  Any chance you could look at this again?

> Flaky test: pyspark.streaming.tests.WindowFunctionTests
> ---
>
> Key: SPARK-19772
> URL: https://issues.apache.org/jira/browse/SPARK-19772
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> Here's the link to the failed build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73598
> FAIL [16.440s]: test_count_by_value_and_window 
> (pyspark.streaming.tests.WindowFunctionTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py",
>  line 668, in test_count_by_value_and_window
> self._test_func(input, func, expected)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py",
>  line 162, in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[(0,[312 chars] 2), (5, 1)], [(0, 1), (1, 1), 
> (2, 1), (3, 1), (4, 1), (5, 1)]] != [[(0,[312 chars] 2), (5, 1)]]
> First list contains 1 additional elements.
> First extra element 9:
> [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
>   [[(0, 1)],
>[(0, 2), (1, 1)],
>[(0, 3), (1, 2), (2, 1)],
>[(0, 4), (1, 3), (2, 2), (3, 1)],
>[(0, 5), (1, 4), (2, 3), (3, 2), (4, 1)],
>[(0, 5), (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)],
>[(0, 4), (1, 4), (2, 4), (3, 3), (4, 2), (5, 1)],
>[(0, 3), (1, 3), (2, 3), (3, 3), (4, 2), (5, 1)],
> -  [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)],
> ?  ^
> +  [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)]]
> ?  ^
> -  [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]]
> Stdout:
> ('timeout after', 15)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19772) Flaky test: pyspark.streaming.tests.WindowFunctionTests

2017-02-28 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19772:
--

 Summary: Flaky test: pyspark.streaming.tests.WindowFunctionTests
 Key: SPARK-19772
 URL: https://issues.apache.org/jira/browse/SPARK-19772
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Structured Streaming, Tests
Affects Versions: 2.2.0
Reporter: Kay Ousterhout


Here's the link to the failed build: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73598

FAIL [16.440s]: test_count_by_value_and_window 
(pyspark.streaming.tests.WindowFunctionTests)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py",
 line 668, in test_count_by_value_and_window
self._test_func(input, func, expected)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py",
 line 162, in _test_func
self.assertEqual(expected, result)
AssertionError: Lists differ: [[(0,[312 chars] 2), (5, 1)], [(0, 1), (1, 1), 
(2, 1), (3, 1), (4, 1), (5, 1)]] != [[(0,[312 chars] 2), (5, 1)]]

First list contains 1 additional elements.
First extra element 9:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]

  [[(0, 1)],
   [(0, 2), (1, 1)],
   [(0, 3), (1, 2), (2, 1)],
   [(0, 4), (1, 3), (2, 2), (3, 1)],
   [(0, 5), (1, 4), (2, 3), (3, 2), (4, 1)],
   [(0, 5), (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)],
   [(0, 4), (1, 4), (2, 4), (3, 3), (4, 2), (5, 1)],
   [(0, 3), (1, 3), (2, 3), (3, 3), (4, 2), (5, 1)],
-  [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)],
?  ^

+  [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)]]
?  ^

-  [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]]

Stdout:
('timeout after', 15)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19597) ExecutorSuite should have test for tasks that are not deserialiazable

2017-02-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19597.

   Resolution: Fixed
Fix Version/s: 2.2.0

> ExecutorSuite should have test for tasks that are not deserialiazable
> -
>
> Key: SPARK-19597
> URL: https://issues.apache.org/jira/browse/SPARK-19597
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.2.0
>
>
> We should have a test case that ensures that Executors gracefully handle a 
> task that fails to deserialize, by sending back a reasonable failure message.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4681) Turn on executor level blacklisting by default

2017-02-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-4681.
-
Resolution: Duplicate

This was for the old blacklisting mechanism.  The linked JIRAs introduce a new 
blacklisting mechanism that should eventually be enabled by default, but it is 
currently considered experimental.

> Turn on executor level blacklisting by default
> --
>
> Key: SPARK-4681
> URL: https://issues.apache.org/jira/browse/SPARK-4681
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Patrick Wendell
>Assignee: Kay Ousterhout
>
> Per discussion in https://github.com/apache/spark/pull/3541.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor

2017-02-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-19560.
--
  Resolution: Fixed
Target Version/s: 2.2.0

> Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask 
> from a failed executor
> 
>
> Key: SPARK-19560
> URL: https://issues.apache.org/jira/browse/SPARK-19560
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There's some tricky code around the case when the DAGScheduler learns of a 
> ShuffleMapTask that completed successfully, but ran on an executor that 
> failed sometime after the task was launched.  This case is tricky because the 
> TaskSetManager (i.e., the lower level scheduler) thinks the task completed 
> successfully, but the DAGScheduler considers the output it generated to be no 
> longer valid (because it was probably lost when the executor was lost).  As a 
> result, the DAGScheduler needs to re-submit the stage, so that the task can 
> be re-run.  This is tested in some of the tests but not clearly documented, 
> so we should improve this to prevent future bugs (this was encountered by 
> [~markhamstra] in attempting to find a better fix for SPARK-19263).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19596) After a Stage is completed, all Tasksets for the stage should be marked as zombie

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881726#comment-15881726
 ] 

Kay Ousterhout commented on SPARK-19596:


I agree that this is an issue (although it would be implicitly fixed if we 
cancel running tasks in zombie stages, because that would mean that a task 
attempt from an earlier, still-running stage attempt can't cause a stage to be 
marked as complete)

> After a Stage is completed, all Tasksets for the stage should be marked as 
> zombie
> -
>
> Key: SPARK-19596
> URL: https://issues.apache.org/jira/browse/SPARK-19596
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>
> Fetch Failures can lead to multiple simultaneous tasksets for one stage.  The 
> stage may eventually be finished by task completions from a prior stage 
> attempt.  When this happens, the most recent taskset is not marked as a 
> zombie.  This means that taskset may continue to submit new tasks even after 
> the stage is complete.
> This is not a correctness issue, but it will affect performance, as cluster 
> resources will get tied up running tasks that are not needed.
> This is a follow up to https://issues.apache.org/jira/browse/SPARK-19565.  
> See some discussion in the pr for that issue: 
> https://github.com/apache/spark/pull/16901



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881688#comment-15881688
 ] 

Kay Ousterhout commented on SPARK-19698:


My concern is that there are other cases in Spark where this issue could arise 
(so Spark tasks need to be very careful about how they modify external state).  
Here's another scenario:

- Attempt 0 of a task starts and takes a long time to run
- A second, speculative copy of the task is started (attempt 1)
- Attempt 0 finishes successfully, but attempt 1 is still running
- Attempt 1 gets partway through modifying the external state, but then gets 
killed because of an OOM on the machine
- Attempt 1 won't get re-started, because a copy of the task already finished 
successfully

This seems like it will have the same issue you mentioned in the JIRA, right?

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the previously failed one) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shut down the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.
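
To illustrate the kind of non-atomic state change described above (names assumed, not the reporter's actual code): if the stale attempt is killed between the two steps, the destination is simply gone even though the job as a whole reports success.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical publish step on an s3n-like FileSystem: delete the old output,
// then move the freshly written temporary output into place. A kill between
// the two calls (e.g. the driver tearing down a stale 2nd attempt) leaves the
// destination deleted but never replaced.
def publish(fs: FileSystem, tmp: Path, dest: Path): Unit = {
  fs.delete(dest, true)   // step 1: remove the previous output (recursive)
  fs.rename(tmp, dest)    // step 2: move the new output into place
}
{code}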



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881564#comment-15881564
 ] 

Kay Ousterhout commented on SPARK-19698:


I see -- I agree that everything in your description is correct.  The driver 
will allow all tasks to finish if it's still running (e.g., if other tasks are 
being submitted), but you're right that it will shut down the workers while some 
tasks are still in progress if the driver shuts down.

To think about how to fix this, let me ask you a question about your workload: 
suppose a task is in the middle of manipulating some external state (as you 
described in the JIRA description) and it gets killed suddenly because the JVM 
runs out of memory (e.g., because another concurrently running task used up all 
of the memory).  In that case, the job listener won't be told about the failed 
task, and it will be re-tried.  Does that pose a problem in the same way that 
the behavior described in the PR is problematic?

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation), the driver can improperly shut down the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14658) when an executor is lost, DAGScheduler may submit one stage twice even if the first running taskset for this stage is not finished

2017-02-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-14658:
---
Fix Version/s: 2.2.0

> when an executor is lost, DAGScheduler may submit one stage twice even if the 
> first running taskset for this stage is not finished
> --
>
> Key: SPARK-14658
> URL: https://issues.apache.org/jira/browse/SPARK-14658
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0
> Environment: spark1.6.1  hadoop-2.6.0-cdh5.4.2
>Reporter: yixiaohua
> Fix For: 2.2.0
>
>
> {code}
> 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 57: 
> 57.2,57.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> First Time:
> {code}
> 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 
> 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 
> 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, 
> 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, 
> 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, 
> 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, 
> 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, 
> 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, 
> 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, 
> 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, 
> 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, 
> 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, 
> 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110)
> {code}
> Second Time:
> {code}
> 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 26
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26)
> {code}
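For context, the {{IllegalStateException}} in the stack trace above comes from a sanity check in {{TaskSchedulerImpl.submitTasks}} that aborts if a new TaskSet is submitted for a stage while a non-zombie TaskSetManager for another attempt of that stage is still active. The following is a self-contained toy model of that check, written only to illustrate how the "more than one active taskSet for stage 57: 57.2,57.1" message arises; it is not the Spark source.

{code}
import scala.collection.mutable

// Toy model (illustration only) of the conflicting-TaskSet check.
case class ToyTaskSet(stageId: Int, attempt: Int) { def id: String = s"$stageId.$attempt" }
class ToyManager(val taskSet: ToyTaskSet) { var isZombie: Boolean = false }

object ToyScheduler {
  private val taskSetsByStage = mutable.HashMap[Int, mutable.HashMap[Int, ToyManager]]()

  def submitTasks(taskSet: ToyTaskSet): Unit = {
    val stageTaskSets = taskSetsByStage.getOrElseUpdate(taskSet.stageId, mutable.HashMap())
    stageTaskSets(taskSet.attempt) = new ToyManager(taskSet)
    // If an earlier attempt of this stage still has a live (non-zombie) manager,
    // submitting another attempt is treated as an inconsistency.
    val conflicting = stageTaskSets.values.exists(m => m.taskSet != taskSet && !m.isZombie)
    if (conflicting) {
      throw new IllegalStateException(
        s"more than one active taskSet for stage ${taskSet.stageId}: " +
          stageTaskSets.values.map(_.taskSet.id).mkString(","))
    }
  }
}

// Resubmitting stage 57 while attempt 1 is still running reproduces the error shape above:
//   ToyScheduler.submitTasks(ToyTaskSet(57, 1))
//   ToyScheduler.submitTasks(ToyTaskSet(57, 2))  // throws IllegalStateException
{code}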



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14658) when an executor is lost, DAGScheduler may submit one stage twice even if the first running taskset for this stage is not finished

2017-02-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-14658.

Resolution: Duplicate

I'm fairly sure this duplicates SPARK-19263, as Mark mentioned on the PR.  
Check out this comment for a description of what's going on: 
https://github.com/apache/spark/pull/16620#issuecomment-279125227

Josh, feel free to re-open if you think this is a different issue.

> when an executor is lost, DAGScheduler may submit one stage twice even if the 
> first running taskset for this stage is not finished
> --
>
> Key: SPARK-14658
> URL: https://issues.apache.org/jira/browse/SPARK-14658
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0
> Environment: spark1.6.1  hadoop-2.6.0-cdh5.4.2
>Reporter: yixiaohua
>
> {code}
> 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 57: 
> 57.2,57.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> First Time:
> {code}
> 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 
> 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 
> 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, 
> 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, 
> 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, 
> 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, 
> 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, 
> 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, 
> 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, 
> 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, 
> 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, 
> 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, 
> 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110)
> {code}
> Second Time:
> {code}
> 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 26
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19263) DAGScheduler should avoid sending conflicting task set.

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881406#comment-15881406
 ] 

Kay Ousterhout commented on SPARK-19263:


Just noting that this was fixed by https://github.com/apache/spark/pull/16620 
(the other PR was accidentally created with the same JIRA ID)

> DAGScheduler should avoid sending conflicting task set.
> ---
>
> Key: SPARK-19263
> URL: https://issues.apache.org/jira/browse/SPARK-19263
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> In current *DAGScheduler handleTaskCompletion* code, when *event.reason* is 
> *Success*, it will first do *stage.pendingPartitions -= task.partitionId*, 
> which may be a bug when *FetchFailed* happens. Consider the scenario below:
> # Stage 0 runs and generates shuffle output data.
> # Stage 1 reads the output from stage 0 and generates more shuffle data. It 
> has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are 
> launched on executorA.
> # ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to 
> the driver. The driver marks executorA as lost and updates failedEpoch;
> # The driver resubmits stage 0 so the missing output can be re-generated, and 
> then once it completes, resubmits stage 1 with ShuffleMapTask1x and 
> ShuffleMapTask2x.
> # ShuffleMapTask2 (from the original attempt of stage 1) successfully 
> finishes on executorA and sends Success back to driver. This causes 
> DAGScheduler::handleTaskCompletion to remove partition 2 from 
> stage.pendingPartitions (line 1149), but it does not add the partition to the 
> set of output locations (line 1192), because the task’s epoch is less than 
> the failure epoch for the executor (because of the earlier failure on 
> executor A)
> # ShuffleMapTask1x successfully finishes on executorB, causing the driver to 
> remove partition 1 from stage.pendingPartitions. Combined with the previous 
> step, this means that there are no more pending partitions for the stage, so 
> the DAGScheduler marks the stage as finished (line 1196). However, the 
> shuffle stage is not available (line 1215) because the completion for 
> ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler 
> resubmits the stage.
> # ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks 
> is called for the re-submitted stage, it throws an error, because there’s an 
> existing active task set
> To reproduce the bug:
> 1. We need to make a small modification in *ShuffleBlockFetcherIterator*: check 
> whether the task's index in *TaskSetManager* and the stage attempt are both 
> equal to 0; if so, throw FetchFailedException;
> 2. Rebuild Spark, then submit the following job:
> {code}
> val rdd = sc.parallelize(List((0, 1), (1, 1), (2, 1), (3, 1), (1, 2), (0, 
> 3), (2, 1), (3, 1)), 2)
> rdd.reduceByKey {
>   (v1, v2) => {
> Thread.sleep(1)
> v1 + v2
>   }
> }.map {
>   keyAndValue => {
> (keyAndValue._1 % 2, keyAndValue._2)
>   }
> }.reduceByKey {
>   (v1, v2) => {
> Thread.sleep(1)
> v1 + v2
>   }
> }.collect
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881358#comment-15881358
 ] 

Kay Ousterhout edited comment on SPARK-19698 at 2/23/17 9:57 PM:
-

I think this is the same issue as SPARK-19263 -- can you check to see if that 
fixes the problem / have you looked at that JIRA?  I wrote a super long 
description of the problem towards the end of the associated PR.

One more note is that right now, Spark won't cancel running task attempts 
(although there's a JIRA to fix this), even when a stage is marked as failed.  
So the exact scenario you described, where the 2nd task attempt gets shut down, 
shouldn't occur (the driver will wait for the 2nd task attempt to complete, but 
will ignore the result).


was (Author: kayousterhout):
I think this is the same issue as SPARK-19263 -- can you check to see if that 
fixes the problem / have you looked at that JIRA?  I wrote a super long 
description of the problem towards the end of the associated PR.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation), the driver can improperly shut down the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881358#comment-15881358
 ] 

Kay Ousterhout commented on SPARK-19698:


I think this is the same issue as SPARK-19263 -- can you check to see if that 
fixes the problem / have you looked at that JIRA?  I wrote a super long 
description of the problem towards the end of the associated PR.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation), the driver can improperly shut down the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19684) Move info about running specific tests to developer website

2017-02-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19684.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Move info about running specific tests to developer website
> ---
>
> Key: SPARK-19684
> URL: https://issues.apache.org/jira/browse/SPARK-19684
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 2.2.0
>
>
> This JIRA accompanies this change to the website: 
> https://github.com/apache/spark-website/pull/33.
> Running individual tests is not something that changes with new versions of 
> the project, and is primarily used by developers (not users), so it should be 
> moved to the developer-tools page of the main website (with a link from the 
> building-spark page on the release-specific docs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19684) Move info about running specific tests to developer website

2017-02-21 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19684:
--

 Summary: Move info about running specific tests to developer 
website
 Key: SPARK-19684
 URL: https://issues.apache.org/jira/browse/SPARK-19684
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.1
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


This JIRA accompanies this change to the website: 
https://github.com/apache/spark-website/pull/33.

Running individual tests is not something that changes with new versions of the 
project, and is primarily used by developers (not users), so it should be moved to 
the developer-tools page of the main website (with a link from the 
building-spark page on the release-specific docs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19263) DAGScheduler should avoid sending conflicting task set.

2017-02-18 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19263.

   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 1.2.0

> DAGScheduler should avoid sending conflicting task set.
> ---
>
> Key: SPARK-19263
> URL: https://issues.apache.org/jira/browse/SPARK-19263
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 1.2.0
>
>
> In current *DAGScheduler handleTaskCompletion* code, when *event.reason* is 
> *Success*, it will first do *stage.pendingPartitions -= task.partitionId*, 
> which may be a bug when *FetchFailed* happens. Consider the scenario below:
> # Stage 0 runs and generates shuffle output data.
> # Stage 1 reads the output from stage 0 and generates more shuffle data. It 
> has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are 
> launched on executorA.
> # ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to 
> the driver. The driver marks executorA as lost and updates failedEpoch;
> # The driver resubmits stage 0 so the missing output can be re-generated, and 
> then once it completes, resubmits stage 1 with ShuffleMapTask1x and 
> ShuffleMapTask2x.
> # ShuffleMapTask2 (from the original attempt of stage 1) successfully 
> finishes on executorA and sends Success back to driver. This causes 
> DAGScheduler::handleTaskCompletion to remove partition 2 from 
> stage.pendingPartitions (line 1149), but it does not add the partition to the 
> set of output locations (line 1192), because the task’s epoch is less than 
> the failure epoch for the executor (because of the earlier failure on 
> executor A)
> # ShuffleMapTask1x successfully finishes on executorB, causing the driver to 
> remove partition 1 from stage.pendingPartitions. Combined with the previous 
> step, this means that there are no more pending partitions for the stage, so 
> the DAGScheduler marks the stage as finished (line 1196). However, the 
> shuffle stage is not available (line 1215) because the completion for 
> ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler 
> resubmits the stage.
> # ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks 
> is called for the re-submitted stage, it throws an error, because there’s an 
> existing active task set
> To reproduce the bug:
> 1. We need to make a small modification in *ShuffleBlockFetcherIterator*: check 
> whether the task's index in *TaskSetManager* and the stage attempt are both 
> equal to 0; if so, throw FetchFailedException;
> 2. Rebuild Spark, then submit the following job:
> {code}
> val rdd = sc.parallelize(List((0, 1), (1, 1), (2, 1), (3, 1), (1, 2), (0, 
> 3), (2, 1), (3, 1)), 2)
> rdd.reduceByKey {
>   (v1, v2) => {
> Thread.sleep(1)
> v1 + v2
>   }
> }.map {
>   keyAndValue => {
> (keyAndValue._1 % 2, keyAndValue._2)
>   }
> }.reduceByKey {
>   (v1, v2) => {
> Thread.sleep(1)
> v1 + v2
>   }
> }.collect
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19613) Flaky test: StateStoreRDDSuite.versioning and immutability

2017-02-15 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19613:
--

 Summary: Flaky test: StateStoreRDDSuite.versioning and immutability
 Key: SPARK-19613
 URL: https://issues.apache.org/jira/browse/SPARK-19613
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming, Tests
Affects Versions: 2.1.1
Reporter: Kay Ousterhout
Priority: Minor


This test: 
org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite.versioning 
and immutability failed on a recent PR: 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72948/testReport/junit/org.apache.spark.sql.execution.streaming.state/StateStoreRDDSuite/versioning_and_immutability/




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19612) Tests failing with timeout

2017-02-15 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868565#comment-15868565
 ] 

Kay Ousterhout commented on SPARK-19612:


Does that mean we could potentially fix this by limiting the concurrency on 
Jenkins? 

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19612) Tests failing with timeout

2017-02-15 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19612:
--

 Summary: Tests failing with timeout
 Key: SPARK-19612
 URL: https://issues.apache.org/jira/browse/SPARK-19612
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.1.1
Reporter: Kay Ousterhout
Priority: Minor


I've seen at least one recent test failure due to hitting the 250m timeout: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/

Filing this JIRA to track this; if it happens repeatedly we should up the 
timeout.

cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19537.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Move the pendingPartitions variable from Stage to ShuffleMapStage
> -
>
> Key: SPARK-19537
> URL: https://issues.apache.org/jira/browse/SPARK-19537
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 2.2.0
>
>
> This variable is only used by ShuffleMapStages, and it is confusing to have 
> it in the Stage class rather than the ShuffleMapStage class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19502) Remove unnecessary code to re-submit stages in the DAGScheduler

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-19502.
--
Resolution: Not A Problem

This code actually is currently needed to handle cases where a ShuffleMapTask 
succeeds on an executor, but that executor was marked as failed (so the task 
needs to be re-run), as described in this comment: 
https://github.com/apache/spark/pull/16620#issuecomment-279125227

> Remove unnecessary code to re-submit stages in the DAGScheduler
> ---
>
> Key: SPARK-19502
> URL: https://issues.apache.org/jira/browse/SPARK-19502
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There are a [few lines of code in the 
> DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215)
>  to re-submit shuffle map stages when some of the tasks fail.  My 
> understanding is that there should be a 1:1 mapping between pending tasks 
> (which are tasks that haven't completed successfully) and available output 
> locations, so that code should never be reachable.  Furthermore, the approach 
> taken by that code (to re-submit an entire stage as a result of task 
> failures) is not how we handle task failures in a stage (the lower-level 
> scheduler resubmits the individual tasks), which is what the 5-year-old TODO 
> on that code seems to imply should be done.
> The big caveat is that there's a bug being fixed in SPARK-19263 that means 
> there is *not* a 1:1 relationship between pendingTasks and available 
> outputLocations, so that code is serving as a (buggy) band-aid.  This should 
> be fixed once we resolve SPARK-19263.
> cc [~imranr] [~markhamstra] [~jinxing6...@126.com] (let me know if any of you 
> see any reason we actually do need that code)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19538:
---
Priority: Minor  (was: Major)

> DAGScheduler and TaskSetManager can have an inconsistent view of whether a 
> stage is complete.
> -
>
> Key: SPARK-19538
> URL: https://issues.apache.org/jira/browse/SPARK-19538
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> The pendingPartitions in Stage tracks partitions that still need to be 
> computed, and is used by the DAGScheduler to determine when to mark the stage 
> as complete.  In most cases, this variable is exactly consistent with the 
> tasks in the TaskSetManager (for the current version of the stage) that are 
> still pending.  However, as discussed in SPARK-19263, these can become 
> inconsistent when a ShuffleMapTask for an earlier attempt of the stage 
> completes, in which case the DAGScheduler may think the stage has finished, 
> while the TaskSetManager is still waiting for some tasks to complete (see the 
> description in this pull request: 
> https://github.com/apache/spark/pull/16620).  This leads to bugs like 
> SPARK-19263.  Another problem with this behavior is that listeners can get 
> two StageCompleted messages: once when the DAGScheduler thinks the stage is 
> complete, and a second when the TaskSetManager later decides the stage is 
> complete.  We should fix this.
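As a way to observe the symptom described above, here is a minimal sketch (not part of any fix) of a {{SparkListener}} that counts {{StageCompleted}} events per stage id and flags when a stage is reported complete more than once. {{SparkListener}} and {{onStageCompleted}} are real Spark APIs; the detector class itself is illustrative only, and note that a legitimate stage re-attempt can also trigger it.

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import scala.collection.mutable

// Illustrative listener: counts StageCompleted events per stage id.
class DuplicateStageCompletedDetector extends SparkListener {
  private val completions = mutable.HashMap[Int, Int]().withDefaultValue(0)

  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = synchronized {
    val stageId = event.stageInfo.stageId
    completions(stageId) += 1
    if (completions(stageId) > 1) {
      // Either the DAGScheduler and TaskSetManager disagreed about when the stage
      // finished (the inconsistency described above), or the stage was legitimately
      // re-attempted; correlate with attempt ids in the event log to tell them apart.
      println(s"Stage $stageId reported complete ${completions(stageId)} times")
    }
  }
}

// Usage sketch: sc.addSparkListener(new DuplicateStageCompletedDetector)
{code}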



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor

2017-02-10 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19560:
--

 Summary: Improve tests for when DAGScheduler learns of 
"successful" ShuffleMapTask from a failed executor
 Key: SPARK-19560
 URL: https://issues.apache.org/jira/browse/SPARK-19560
 Project: Spark
  Issue Type: Test
  Components: Scheduler
Affects Versions: 2.1.1
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


There's some tricky code around the case when the DAGScheduler learns of a 
ShuffleMapTask that completed successfully, but ran on an executor that failed 
sometime after the task was launched.  This case is tricky because the 
TaskSetManager (i.e., the lower level scheduler) thinks the task completed 
successfully, but the DAGScheduler considers the output it generated to be no 
longer valid (because it was probably lost when the executor was lost).  As a 
result, the DAGScheduler needs to re-submit the stage, so that the task can be 
re-run.  This is tested in some of the tests but not clearly documented, so we 
should improve this to prevent future bugs (this was encountered by 
[~markhamstra] in attempting to find a better fix for SPARK-19263).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19559) Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19559:
---
Description: 
This test has started failing frequently recently; e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
 and 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/

cc [~zsxwing] and [~tcondie] who seemed to have modified the related code most 
recently

  was:This test has started failing frequently recently; e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
 and 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/


> Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions
> 
>
> Key: SPARK-19559
> URL: https://issues.apache.org/jira/browse/SPARK-19559
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>
> This test has started failing frequently recently; e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
>  and 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
> cc [~zsxwing] and [~tcondie] who seemed to have modified the related code 
> most recently



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19559) Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions

2017-02-10 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19559:
--

 Summary: Fix flaky KafkaSourceSuite.subscribing topic by pattern 
with topic deletions
 Key: SPARK-19559
 URL: https://issues.apache.org/jira/browse/SPARK-19559
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming, Tests
Affects Versions: 2.1.0
Reporter: Kay Ousterhout


This test has started failing frequently recently; e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/
 and 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19466) Improve Fair Scheduler Logging

2017-02-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19466.

   Resolution: Fixed
 Assignee: Eren Avsarogullari
Fix Version/s: 2.2.0

> Improve Fair Scheduler Logging
> --
>
> Key: SPARK-19466
> URL: https://issues.apache.org/jira/browse/SPARK-19466
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Minor
> Fix For: 2.2.0
>
>
> Fair Scheduler Logging for the following cases can be useful for the user.
> 1- If a *valid* spark.scheduler.allocation.file property is set, the user can be 
> informed, so they know which scheduler file is processed when 
> SparkContext initializes.
> 2- If an *invalid* spark.scheduler.allocation.file property is set, currently 
> the following stacktrace is shown to the user. In addition to this, a more 
> meaningful message could be shown by emphasizing that the problem occurs while 
> building the fair scheduler, and by covering other potential issues at this 
> level.
> {code:xml}
> Exception in thread "main" java.io.FileNotFoundException: INVALID_FILE (No 
> such file or directory)
>   at java.io.FileInputStream.open0(Native Method)
>   at java.io.FileInputStream.open(FileInputStream.java:195)
>   at java.io.FileInputStream.<init>(FileInputStream.java:138)
>   at java.io.FileInputStream.<init>(FileInputStream.java:93)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:76)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:75)
> {code}
> 3- If the spark.scheduler.allocation.file property is not set and the *default* 
> fair scheduler file (fairscheduler.xml) is found on the classpath, it will be 
> loaded, but currently the user is not informed, so logging can be useful.
> 4- If the spark.scheduler.allocation.file property is not set and the default 
> fair scheduler file does not exist, currently the user is not informed, so 
> logging can be useful.
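For reference, here is a minimal sketch of how these properties are wired up from application code. {{spark.scheduler.mode}}, {{spark.scheduler.allocation.file}}, and {{spark.scheduler.pool}} are real Spark settings; the file path and pool name below are illustrative placeholders.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Enable the fair scheduler and point it at an explicit allocation file.
val conf = new SparkConf()
  .setAppName("fair-scheduler-example")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // placeholder path
val sc = new SparkContext(conf)

// Optionally route jobs submitted from this thread into a specific pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
{code}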



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.

2017-02-09 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19538:
--

 Summary: DAGScheduler and TaskSetManager can have an inconsistent 
view of whether a stage is complete.
 Key: SPARK-19538
 URL: https://issues.apache.org/jira/browse/SPARK-19538
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout


The pendingPartitions in Stage tracks partitions that still need to be 
computed, and is used by the DAGScheduler to determine when to mark the stage 
as complete.  In most cases, this variable is exactly consistent with the tasks 
in the TaskSetManager (for the current version of the stage) that are still 
pending.  However, as discussed in SPARK-19263, these can become inconsistent 
when a ShuffleMapTask for an earlier attempt of the stage completes, in which 
case the DAGScheduler may think the stage has finished, while the 
TaskSetManager is still waiting for some tasks to complete (see the description 
in this pull request: https://github.com/apache/spark/pull/16620).  This leads 
to bugs like SPARK-19263.  Another problem with this behavior is that listeners 
can get two StageCompleted messages: once when the DAGScheduler thinks the 
stage is complete, and a second when the TaskSetManager later decides the stage 
is complete.  We should fix this.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage

2017-02-09 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19537:
--

 Summary: Move the pendingPartitions variable from Stage to 
ShuffleMapStage
 Key: SPARK-19537
 URL: https://issues.apache.org/jira/browse/SPARK-19537
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


This variable is only used by ShuffleMapStages, and it is confusing to have it 
in the Stage class rather than the ShuffleMapStage class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off

2017-02-07 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855615#comment-15855615
 ] 

Kay Ousterhout edited comment on SPARK-18967 at 2/7/17 9:42 PM:


Please use 2.2.0, not 2.2


was (Author: rxin):
Oops this was me [~rxin] sorry!

> Locality preferences should be used when scheduling even when delay 
> scheduling is turned off
> 
>
> Key: SPARK-18967
> URL: https://issues.apache.org/jira/browse/SPARK-18967
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 2.2.0
>
>
> If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you 
> effectively turn off the use of locality preferences when there is a bulk 
> scheduling event.  {{TaskSchedulerImpl}} will use resources based on whatever 
> random order it decides to shuffle them, rather than taking advantage of the 
> most local options.
> This happens because {{TaskSchedulerImpl}} offers resources to a 
> {{TaskSetManager}} one at a time, each time subject to a maxLocality 
> constraint.  However, that constraint doesn't move through all possible 
> locality levels -- it uses [{{tsm.myLocalityLevels}} 
> |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360].
>   And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait 
> == 0 | 
> https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953].
>   So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to 
> giving tsms the offers with {{maxLocality = ANY}}.
> *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use 
> {{spark.locality.wait=1ms}}.  The one downside of this is if you have tasks 
> that actually take less than 1ms.  You could even run into SPARK-18886.  But 
> that is a relatively unlikely scenario for real workloads.
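A minimal sketch of the workaround described above: keep delay scheduling effectively off by using a very small non-zero wait instead of 0, so locality levels are not skipped. The 1ms value is the workaround suggested in the description, not a tuned recommendation.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround: avoid spark.locality.wait=0 (which skips locality levels entirely);
// use a tiny non-zero wait so locality preferences are still considered.
val conf = new SparkConf()
  .setAppName("locality-wait-workaround")
  .set("spark.locality.wait", "1ms")
val sc = new SparkContext(conf)
{code}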



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off

2017-02-07 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856842#comment-15856842
 ] 

Kay Ousterhout commented on SPARK-18967:


Oops this was me [~rxin] sorry!

> Locality preferences should be used when scheduling even when delay 
> scheduling is turned off
> 
>
> Key: SPARK-18967
> URL: https://issues.apache.org/jira/browse/SPARK-18967
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 2.2.0
>
>
> If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you 
> effectively turn off the use of locality preferences when there is a bulk 
> scheduling event.  {{TaskSchedulerImpl}} will use resources based on whatever 
> random order it decides to shuffle them, rather than taking advantage of the 
> most local options.
> This happens because {{TaskSchedulerImpl}} offers resources to a 
> {{TaskSetManager}} one at a time, each time subject to a maxLocality 
> constraint.  However, that constraint doesn't move through all possible 
> locality levels -- it uses [{{tsm.myLocalityLevels}} 
> |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360].
>   And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait 
> == 0 | 
> https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953].
>   So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to 
> giving tsms the offers with {{maxLocality = ANY}}.
> *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use 
> {{spark.locality.wait=1ms}}.  The one downside of this is if you have tasks 
> that actually take less than 1ms.  You could even run into SPARK-18886.  But 
> that is a relatively unlikely scenario for real workloads.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off

2017-02-07 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855615#comment-15855615
 ] 

Kay Ousterhout edited comment on SPARK-18967 at 2/7/17 9:40 PM:


Oops this was me [~rxin] sorry!


was (Author: rxin):
[~imranr] please don't use "2.2" as the version. It should be "2.2.0". Thanks!


> Locality preferences should be used when scheduling even when delay 
> scheduling is turned off
> 
>
> Key: SPARK-18967
> URL: https://issues.apache.org/jira/browse/SPARK-18967
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 2.2.0
>
>
> If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you 
> effectively turn off the use of locality preferences when there is a bulk 
> scheduling event.  {{TaskSchedulerImpl}} will use resources based on whatever 
> random order it decides to shuffle them, rather than taking advantage of the 
> most local options.
> This happens because {{TaskSchedulerImpl}} offers resources to a 
> {{TaskSetManager}} one at a time, each time subject to a maxLocality 
> constraint.  However, that constraint doesn't move through all possible 
> locality levels -- it uses [{{tsm.myLocalityLevels}} 
> |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360].
>   And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait 
> == 0 | 
> https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953].
>   So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to 
> giving tsms the offers with {{maxLocality = ANY}}.
> *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use 
> {{spark.locality.wait=1ms}}.  The one downside of this is if you have tasks 
> that actually take less than 1ms.  You could even run into SPARK-18886.  But 
> that is a relatively unlikely scenario for real workloads.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19502) Remove unnecessary code to re-submit stages in the DAGScheduler

2017-02-07 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19502:
---
Description: 
There are a [few lines of code in the 
DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215)
 to re-submit shuffle map stages when some of the tasks fail.  My understanding 
is that there should be a 1:1 mapping between pending tasks (which are tasks 
that haven't completed successfully) and available output locations, so that 
code should never be reachable.  Furthermore, the approach taken by that code 
(to re-submit an entire stage as a result of task failures) is not how we 
handle task failures in a stage (the lower-level scheduler resubmits the 
individual tasks), which is what the 5-year-old TODO on that code seems to 
imply should be done.

The big caveat is that there's a bug being fixed in SPARK-19263 that means 
there is *not* a 1:1 relationship between pendingTasks and available 
outputLocations, so that code is serving as a (buggy) band-aid.  This should be 
fixed once we resolve SPARK-19263.

cc [~imranr] [~markhamstra] [~jinxing6...@126.com] (let me know if any of you 
see any reason we actually do need that code)


  was:
There are a [few lines of code in the 
DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215)
 to re-submit shuffle map stages when some of the tasks fail.  My understanding 
is that there should be a 1:1 mapping between pending tasks (which are tasks 
that haven't completed successfully) and available output locations, so that 
code should never be reachable.  Furthermore, the approach taken by that code 
(to re-submit an entire stage as a result of task failures) is not how we 
handle task failures in a stage (the lower-level scheduler resubmits the 
individual tasks), which is what the 5-year-old TODO on that code seems to 
imply should be done.

The big caveat is that there's a bug being fixed in SPARK-19263 that means 
there is *not* a 1:1 relationship between pendingTasks and available 
outputLocations, so that code is serving as a (buggy) band-aid.  This should be 
fixed once we resolve SPARK-19263.

cc [~imranr] [~markhamstra] [~jinxing6...@126.com]


> Remove unnecessary code to re-submit stages in the DAGScheduler
> ---
>
> Key: SPARK-19502
> URL: https://issues.apache.org/jira/browse/SPARK-19502
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There are a [few lines of code in the 
> DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215)
>  to re-submit shuffle map stages when some of the tasks fail.  My 
> understanding is that there should be a 1:1 mapping between pending tasks 
> (which are tasks that haven't completed successfully) and available output 
> locations, so that code should never be reachable.  Furthermore, the approach 
> taken by that code (to re-submit an entire stage as a result of task 
> failures) is not how we handle task failures in a stage (the lower-level 
> scheduler resubmits the individual tasks), which is what the 5-year-old TODO 
> on that code seems to imply should be done.
> The big caveat is that there's a bug being fixed in SPARK-19263 that means 
> there is *not* a 1:1 relationship between pendingTasks and available 
> outputLocations, so that code is serving as a (buggy) band-aid.  This should 
> be fixed once we resolve SPARK-19263.
> cc [~imranr] [~markhamstra] [~jinxing6...@126.com] (let me know if any of you 
> see any reason we actually do need that code)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19502) Remove unnecessary code to re-submit stages in the DAGScheduler

2017-02-07 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19502:
--

 Summary: Remove unnecessary code to re-submit stages in the 
DAGScheduler
 Key: SPARK-19502
 URL: https://issues.apache.org/jira/browse/SPARK-19502
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.1.1
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


There are a [few lines of code in the 
DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215)
 to re-submit shuffle map stages when some of the tasks fail.  My understanding 
is that there should be a 1:1 mapping between pending tasks (which are tasks 
that haven't completed successfully) and available output locations, so that 
code should never be reachable.  Furthermore, the approach taken by that code 
(re-submitting an entire stage as a result of task failures) is not how we 
handle task failures within a stage (the lower-level scheduler resubmits the 
individual tasks), which is what the 5-year-old TODO on that code seems to imply 
should be done.

The big caveat is that there's a bug being fixed in SPARK-19263 that means 
there is *not* a 1:1 relationship between pendingTasks and available 
outputLocations, so that code is serving as a (buggy) band-aid.  This should be 
fixed once we resolve SPARK-19263.

cc [~imranr] [~markhamstra] [~jinxing6...@126.com]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off

2017-02-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-18967.

   Resolution: Fixed
Fix Version/s: 2.2

> Locality preferences should be used when scheduling even when delay 
> scheduling is turned off
> 
>
> Key: SPARK-18967
> URL: https://issues.apache.org/jira/browse/SPARK-18967
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 2.2
>
>
> If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you 
> effectively turn off the use of locality preferences when there is a bulk 
> scheduling event.  {{TaskSchedulerImpl}} will use resources based on whatever 
> random order it decides to shuffle them, rather than taking advantage of the 
> most local options.
> This happens because {{TaskSchedulerImpl}} offers resources to a 
> {{TaskSetManager}} one at a time, each time subject to a maxLocality 
> constraint.  However, that constraint doesn't move through all possible 
> locality levels -- it uses [{{tsm.myLocalityLevels}} 
> |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360].
>   And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait 
> == 0 | 
> https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953].
>   So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to 
> giving tsms the offers with {{maxLocality = ANY}}.
> *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use 
> {{spark.locality.wait=1ms}}.  The one downside of this is if you have tasks 
> that actually take less than 1ms.  You could even run into SPARK-18886.  But 
> that is a relatively unlikely scenario for real workloads.
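
A minimal sketch of applying the workaround above, assuming the configuration is 
set programmatically (the value comes straight from the description; only the app 
name is made up):

{code}
import org.apache.spark.SparkConf

// Use a tiny non-zero wait instead of 0 so locality levels are still considered.
val conf = new SparkConf()
  .setAppName("locality-wait-workaround")   // hypothetical app name
  .set("spark.locality.wait", "1ms")        // instead of spark.locality.wait=0
{code}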



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19398) Log in TaskSetManager is not correct

2017-02-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19398:
---
Fix Version/s: 2.2

> Log in TaskSetManager is not correct
> 
>
> Key: SPARK-19398
> URL: https://issues.apache.org/jira/browse/SPARK-19398
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Trivial
> Fix For: 2.2
>
>
> The log below is misleading:
> {code:title="TaskSetManager.scala"}
> if (successful(index)) {
>   logInfo(
> s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
> "but another instance of the task has already succeeded, " +
> "so not re-queuing the task to be re-executed.")
> }
> {code}
> If a fetch failed, the task is marked as *successful* in 
> *TaskSetManager::handleFailedTask*. Then the log above will be printed. Here 
> *successful* just means the task will not be scheduled any longer, not that it 
> actually succeeded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19398) Log in TaskSetManager is not correct

2017-02-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19398.

Resolution: Fixed
  Assignee: jin xing

> Log in TaskSetManager is not correct
> 
>
> Key: SPARK-19398
> URL: https://issues.apache.org/jira/browse/SPARK-19398
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Trivial
>
> The log below is misleading:
> {code:title="TaskSetManager.scala"}
> if (successful(index)) {
>   logInfo(
> s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
> "but another instance of the task has already succeeded, " +
> "so not re-queuing the task to be re-executed.")
> }
> {code}
> If a fetch failed, the task is marked as *successful* in 
> *TaskSetManager::handleFailedTask*. Then the log above will be printed. Here 
> *successful* just means the task will not be scheduled any longer, not that it 
> actually succeeded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios

2017-02-03 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852085#comment-15852085
 ] 

Kay Ousterhout commented on SPARK-19326:


I see, that makes sense; thanks for the additional explanation.  [~andrewor14], 
did you think about this issue when implementing dynamic allocation originally? 
I noticed there's a [comment saying that speculation is not considered for 
simplicity](https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L579),
 but it does seem like this functionality can prevent speculation from 
occurring.
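
A toy sketch of the effect being described (field and method names are assumed, 
not ExecutorAllocationManager's real API): if pending speculative attempts are 
left out of the task count, the computed executor target can look satisfied even 
though the speculated copy has nowhere to run.

{code}
// Toy calculation, not Spark code.
def maxExecutorsNeeded(
    pendingTasks: Int,
    runningTasks: Int,
    pendingSpeculativeTasks: Int,
    tasksPerExecutor: Int): Int = {
  val totalTasks = pendingTasks + runningTasks + pendingSpeculativeTasks
  (totalTasks + tasksPerExecutor - 1) / tasksPerExecutor  // ceiling division
}

// Ignoring the speculative attempt (current behavior) says 1 executor is enough,
// so no new executor is requested and the speculated copy never gets a slot:
val withoutSpeculation = maxExecutorsNeeded(0, 4, 0, 4)  // = 1
val withSpeculation    = maxExecutorsNeeded(0, 4, 1, 4)  // = 2
{code}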

> Speculated task attempts do not get launched in few scenarios
> -
>
> Key: SPARK-19326
> URL: https://issues.apache.org/jira/browse/SPARK-19326
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> Speculated copies of tasks do not get launched in some cases.
> Examples:
> - All the running executors have no CPU slots left to accommodate a 
> speculated copy of the task(s). If all the running executors reside on a 
> set of slow / bad hosts, they will keep the job running for a long time
> - `spark.task.cpus` > 1 and the running executor has not filled up all its 
> CPU slots. This matters because [speculated copies of tasks must run on a different 
> host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283]
>  and not on the host where the first copy was launched.
> In both these cases, `ExecutorAllocationManager` does not know about pending 
> speculation task attempts and thinks that all the resource demands are well 
> taken care of. ([relevant 
> code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265])
> This adds variation to job completion times and, more importantly, causes SLA 
> misses :( In prod, with a large number of jobs, I see this happening more 
> often than one would think. Chasing down the bad hosts or the reason for the 
> slowness doesn't scale.
> Here is a tiny repro. Note that you need to launch this with (Mesos or YARN 
> or standalone deploy mode) along with `--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`
> {code}
> val n = 100
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
>   if (index == 1) {
>     Thread.sleep(Long.MaxValue)  // fake long running task(s)
>   }
>   it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios

2017-02-02 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850548#comment-15850548
 ] 

Kay Ousterhout commented on SPARK-19326:


What is the bad behavior that occurs with your example code?  Is the problem 
that only one executor is requested, so no speculation can occur because 
there's not a different node to run tasks on?

> Speculated task attempts do not get launched in few scenarios
> -
>
> Key: SPARK-19326
> URL: https://issues.apache.org/jira/browse/SPARK-19326
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> Speculated copies of tasks do not get launched in some cases.
> Examples:
> - All the running executors have no CPU slots left to accommodate a 
> speculated copy of the task(s). If all the running executors reside on a 
> set of slow / bad hosts, they will keep the job running for a long time
> - `spark.task.cpus` > 1 and the running executor has not filled up all its 
> CPU slots. This matters because [speculated copies of tasks must run on a different 
> host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283]
>  and not on the host where the first copy was launched.
> In both these cases, `ExecutorAllocationManager` does not know about pending 
> speculation task attempts and thinks that all the resource demands are well 
> taken care of. ([relevant 
> code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265])
> This adds variation to job completion times and, more importantly, causes SLA 
> misses :( In prod, with a large number of jobs, I see this happening more 
> often than one would think. Chasing down the bad hosts or the reason for the 
> slowness doesn't scale.
> Here is a tiny repro. Note that you need to launch this with (Mesos or YARN 
> or standalone deploy mode) along with `spark.speculation=true`
> {code}
> val someRDD = sc.parallelize(1 to 8, 8)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
>   if (index == 7) {
>     Thread.sleep(Long.MaxValue)  // fake long running task(s)
>   }
>   it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-01-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-18890:
---
Issue Type: Improvement  (was: Bug)

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]
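
One way to picture the parallelization benefit mentioned above (a toy sketch with 
made-up types, not Spark's serialization path): once all serialization happens in 
one place, per-task serialization can be fanned out to a small thread pool instead 
of running on the single scheduler thread.

{code}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

case class FakeTask(id: Long, payload: Array[Byte])
def serialize(t: FakeTask): Array[Byte] = t.payload  // stand-in for real serialization

implicit val pool: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

val tasks = (1L to 100L).map(i => FakeTask(i, Array.fill(16)(i.toByte)))
// Serialize tasks concurrently on the pool rather than one by one on a single thread.
val serialized =
  Await.result(Future.sequence(tasks.map(t => Future(serialize(t)))), Duration.Inf)
{code}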



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-01-06 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805495#comment-15805495
 ] 

Kay Ousterhout commented on SPARK-18890:


I just opened SPARK-19108 for the broadcast issue.  In the meantime, after 
thinking about this more (and also based on your comments on the associated PRs 
Imran) I think we should go ahead and merge this change to consolidate the 
serialization in one place.  If nothing else, that change makes the code more 
readable, and I suspect will make it easier to implement further optimizations 
to the serialization in the future.

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)

2017-01-06 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19108:
--

 Summary: Broadcast all shared parts of tasks (to reduce task 
serialization time)
 Key: SPARK-19108
 URL: https://issues.apache.org/jira/browse/SPARK-19108
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Kay Ousterhout


Expand the amount of information that's broadcasted for tasks, to avoid 
serializing data per-task that should only be sent to each executor once for 
the entire stage.

Conceptually, this means we'd have new classes specifically for sending the 
minimal necessary data to the executor, like:

{code}
/**
 * Metadata about the taskset needed by the executor for all tasks in this taskset.
 * Subset of the full data kept on the driver to make it faster to serialize and
 * send to executors.
 */
class ExecutorTaskSetMeta(
  val stageId: Int,
  val stageAttemptId: Int,
  val properties: Properties,
  val addedFiles: Map[String, String],
  val addedJars: Map[String, String]
  // maybe task metrics here?
)

class ExecutorTaskData(
  val partitionId: Int,
  val attemptNumber: Int,
  val taskId: Long,
  val taskBinary: Broadcast[Array[Byte]],
  val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
)
{code}

Then all the info you'd need to send to the executors would be a serialized 
version of ExecutorTaskData.  Furthermore, given the simplicity of that class, 
you could serialize manually, and then for each task you could just modify the 
first two ints & one long directly in the byte buffer.  (You could do the same 
trick for serialization even if ExecutorTaskSetMeta was not a broadcast, but 
that will keep the msgs small as well.)
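
As a rough illustration of that byte-buffer trick (purely a sketch with assumed 
field order and offsets, not Spark's actual wire format): serialize one template 
per stage, then patch only the per-task fields in place.

{code}
import java.nio.ByteBuffer

// Assumes the template was written with partitionId, attemptNumber (ints) and
// taskId (long) as the first 16 bytes, followed by the shared stage data.
def patchPerTaskFields(
    template: Array[Byte],
    partitionId: Int,
    attemptNumber: Int,
    taskId: Long): ByteBuffer = {
  val buf = ByteBuffer.wrap(template.clone())
  buf.putInt(0, partitionId)     // first int
  buf.putInt(4, attemptNumber)   // second int
  buf.putLong(8, taskId)         // one long
  buf
}
{code}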

There are a bunch of details I'm skipping here: you'd also need to do some special 
handling for the TaskMetrics; the way tasks get started in the executor would 
change; you'd also need to refactor {{Task}} to let it get reconstructed from 
this information (or add more to ExecutorTaskSetMeta); and probably other 
details I'm overlooking now.

(this is copied from SPARK-18890 and [~imranr]'s comment there; cc [~shivaram])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3937) Unsafe memory access inside of Snappy library

2017-01-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-3937.
-

> Unsafe memory access inside of Snappy library
> -
>
> Key: SPARK-3937
> URL: https://issues.apache.org/jira/browse/SPARK-3937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Patrick Wendell
>
> This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
> have much information about this other than the stack trace. However, it was 
> concerning enough I figured I should post it.
> {code}
> java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code
> org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
> org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
> org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
> 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
> 
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
> 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Resolved] (SPARK-3937) Unsafe memory access inside of Snappy library

2017-01-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-3937.
---
Resolution: Won't Fix

Closing this due to lack of activity / reports of issues on recent versions of 
Spark

> Unsafe memory access inside of Snappy library
> -
>
> Key: SPARK-3937
> URL: https://issues.apache.org/jira/browse/SPARK-3937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Patrick Wendell
>
> This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
> have much information about this other than the stack trace. However, it was 
> concerning enough I figured I should post it.
> {code}
> java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code
> org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
> org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
> org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
> 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
> 
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
> 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (SPARK-14958) Failed task hangs if error is encountered when getting task result

2017-01-05 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-14958.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Failed task hangs if error is encountered when getting task result
> --
>
> Key: SPARK-14958
> URL: https://issues.apache.org/jira/browse/SPARK-14958
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Rui Li
>Assignee: Rui Li
> Fix For: 2.2.0
>
>
> In {{TaskResultGetter}}, if we get an error when deserializing 
> {{TaskEndReason}}, the TaskScheduler won't have a chance to handle the failed 
> task and the task just hangs.
> {code}
>   def enqueueFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: 
> TaskState,
> serializedData: ByteBuffer) {
> var reason : TaskEndReason = UnknownReason
> try {
>   getTaskResultExecutor.execute(new Runnable {
> override def run(): Unit = Utils.logUncaughtExceptions {
>   val loader = Utils.getContextOrSparkClassLoader
>   try {
> if (serializedData != null && serializedData.limit() > 0) {
>   reason = serializer.get().deserialize[TaskEndReason](
> serializedData, loader)
> }
>   } catch {
> case cnd: ClassNotFoundException =>
>   // Log an error but keep going here -- the task failed, so not 
> catastrophic
>   // if we can't deserialize the reason.
>   logError(
> "Could not deserialize TaskEndReason: ClassNotFound with 
> classloader " + loader)
> case ex: Exception => {}
>   }
>   scheduler.handleFailedTask(taskSetManager, tid, taskState, reason)
> }
>   })
> } catch {
>   case e: RejectedExecutionException if sparkEnv.isStopped =>
> // ignore it
> }
>   }
> {code}
> In my specific case, I got a NoClassDefFoundError and the failed task hangs 
> forever.
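
The hang follows from a plain JVM rule, illustrated by this self-contained toy 
(not the eventual Spark fix): {{NoClassDefFoundError}} is an {{Error}}, not an 
{{Exception}}, so a {{case ex: Exception}} handler never sees it, and whatever was 
supposed to run after the try/catch -- here, notifying the scheduler -- never runs.

{code}
// Toy illustration only.
def handle(body: => Unit): String =
  try { body; "completed" }
  catch { case ex: Exception => "caught: " + ex.getClass.getSimpleName }

handle(throw new RuntimeException("boom"))      // "caught: RuntimeException"
// handle(throw new NoClassDefFoundError("x"))  // the Error escapes the catch entirely
{code}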



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14958) Failed task hangs if error is encountered when getting task result

2017-01-05 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-14958:
---
Affects Version/s: 1.6.0
   2.0.0
   2.1.0

> Failed task hangs if error is encountered when getting task result
> --
>
> Key: SPARK-14958
> URL: https://issues.apache.org/jira/browse/SPARK-14958
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Rui Li
>Assignee: Rui Li
> Fix For: 2.2.0
>
>
> In {{TaskResultGetter}}, if we get an error when deserializing 
> {{TaskEndReason}}, the TaskScheduler won't have a chance to handle the failed 
> task and the task just hangs.
> {code}
>   def enqueueFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: 
> TaskState,
> serializedData: ByteBuffer) {
> var reason : TaskEndReason = UnknownReason
> try {
>   getTaskResultExecutor.execute(new Runnable {
> override def run(): Unit = Utils.logUncaughtExceptions {
>   val loader = Utils.getContextOrSparkClassLoader
>   try {
> if (serializedData != null && serializedData.limit() > 0) {
>   reason = serializer.get().deserialize[TaskEndReason](
> serializedData, loader)
> }
>   } catch {
> case cnd: ClassNotFoundException =>
>   // Log an error but keep going here -- the task failed, so not 
> catastrophic
>   // if we can't deserialize the reason.
>   logError(
> "Could not deserialize TaskEndReason: ClassNotFound with 
> classloader " + loader)
> case ex: Exception => {}
>   }
>   scheduler.handleFailedTask(taskSetManager, tid, taskState, reason)
> }
>   })
> } catch {
>   case e: RejectedExecutionException if sparkEnv.isStopped =>
> // ignore it
> }
>   }
> {code}
> In my specific case, I got a NoClassDefFoundError and the failed task hangs 
> forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14958) Failed task hangs if error is encountered when getting task result

2017-01-05 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-14958:
---
Assignee: Rui Li

> Failed task hangs if error is encountered when getting task result
> --
>
> Key: SPARK-14958
> URL: https://issues.apache.org/jira/browse/SPARK-14958
> Project: Spark
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>
> In {{TaskResultGetter}}, if we get an error when deserializing 
> {{TaskEndReason}}, the TaskScheduler won't have a chance to handle the failed 
> task and the task just hangs.
> {code}
>   def enqueueFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: 
> TaskState,
> serializedData: ByteBuffer) {
> var reason : TaskEndReason = UnknownReason
> try {
>   getTaskResultExecutor.execute(new Runnable {
> override def run(): Unit = Utils.logUncaughtExceptions {
>   val loader = Utils.getContextOrSparkClassLoader
>   try {
> if (serializedData != null && serializedData.limit() > 0) {
>   reason = serializer.get().deserialize[TaskEndReason](
> serializedData, loader)
> }
>   } catch {
> case cnd: ClassNotFoundException =>
>   // Log an error but keep going here -- the task failed, so not 
> catastrophic
>   // if we can't deserialize the reason.
>   logError(
> "Could not deserialize TaskEndReason: ClassNotFound with 
> classloader " + loader)
> case ex: Exception => {}
>   }
>   scheduler.handleFailedTask(taskSetManager, tid, taskState, reason)
> }
>   })
> } catch {
>   case e: RejectedExecutionException if sparkEnv.isStopped =>
> // ignore it
> }
>   }
> {code}
> In my specific case, I got a NoClassDefFoundError and the failed task hangs 
> forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19062) Utils.writeByteBuffer should not modify buffer position

2017-01-04 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19062:
---
Affects Version/s: (was: 1.2.1)
   2.1.0

> Utils.writeByteBuffer should not modify buffer position
> ---
>
> Key: SPARK-19062
> URL: https://issues.apache.org/jira/browse/SPARK-19062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 2.2.0
>
>
> [~mridulm80] pointed out that Utils.writeByteBuffer may change the position 
> of the underlying byte buffer, which could potentially lead to subtle bugs 
> for callers of that function.  We should change this so Utils.writeByteBuffer 
> doesn't change the buffer position.
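
A minimal sketch of the position-preserving behavior being asked for (an 
assumption-laden stand-in, not Spark's actual Utils.writeByteBuffer): either read 
the backing array directly or work on a duplicate, so the caller's buffer position 
is left untouched.

{code}
import java.io.OutputStream
import java.nio.ByteBuffer

def writeByteBufferPreservingPosition(bb: ByteBuffer, out: OutputStream): Unit = {
  if (bb.hasArray) {
    // Array-backed buffer: read the backing array directly, no position change.
    out.write(bb.array(), bb.arrayOffset() + bb.position(), bb.remaining())
  } else {
    // Direct buffer: duplicate() shares the bytes but has its own position/limit.
    val copy = bb.duplicate()
    val bytes = new Array[Byte](copy.remaining())
    copy.get(bytes)
    out.write(bytes)
  }
}
{code}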



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19062) Utils.writeByteBuffer should not modify buffer position

2017-01-04 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19062.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Utils.writeByteBuffer should not modify buffer position
> ---
>
> Key: SPARK-19062
> URL: https://issues.apache.org/jira/browse/SPARK-19062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 2.2.0
>
>
> [~mridulm80] pointed out that Utils.writeByteBuffer may change the position 
> of the underlying byte buffer, which could potentially lead to subtle bugs 
> for callers of that function.  We should change this so Utils.writeByteBuffer 
> doesn't change the buffer position.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19072) Catalyst's IN always returns false for infinity

2017-01-03 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19072:
---
Description: 
This bug was caused by the fix for SPARK-18999 
(https://github.com/apache/spark/pull/16402)

This can be reproduced by adding the following test to PredicateSuite.scala 
(which will consistently fail):

val value = NonFoldableLiteral(Double.PositiveInfinity, DoubleType)
checkEvaluation(In(value, List(value)), true)

This bug is causing org.apache.spark.sql.catalyst.expressions.PredicateSuite.IN 
to fail approximately 10% of the time (it fails anytime the value is Infinity 
or -Infinity and the correct answer is True -- e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70826/testReport/org.apache.spark.sql.catalyst.expressions/PredicateSuite/IN/,
 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70830/console).

  was:
This can be reproduced by adding the following test to PredicateSuite.scala 
(which will consistently fail):

val value = NonFoldableLiteral(Double.PositiveInfinity, DoubleType)
checkEvaluation(In(value, List(value)), true)

This bug is causing org.apache.spark.sql.catalyst.expressions.PredicateSuite.IN 
to fail approximately 10% of the time (it fails anytime the value is Infinity 
or -Infinity and the correct answer is True -- e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70826/testReport/org.apache.spark.sql.catalyst.expressions/PredicateSuite/IN/,
 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70830/console).


> Catalyst's IN always returns false for infinity
> ---
>
> Key: SPARK-19072
> URL: https://issues.apache.org/jira/browse/SPARK-19072
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Kay Ousterhout
>
> This bug was caused by the fix for SPARK-18999 
> (https://github.com/apache/spark/pull/16402)
> This can be reproduced by adding the following test to PredicateSuite.scala 
> (which will consistently fail):
> val value = NonFoldableLiteral(Double.PositiveInfinity, DoubleType)
> checkEvaluation(In(value, List(value)), true)
> This bug is causing 
> org.apache.spark.sql.catalyst.expressions.PredicateSuite.IN to fail 
> approximately 10% of the time (it fails anytime the value is Infinity or 
> -Infinity and the correct answer is True -- e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70826/testReport/org.apache.spark.sql.catalyst.expressions/PredicateSuite/IN/,
>  
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70830/console).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


