[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113494#comment-16113494
 ] 

Mark Hamstra commented on SPARK-21619:
--

Can you provide a little more context, Reynold? On its face it would seem that 
if plans are to be blocked from executing based on their form, then 
non-canonical plans would be the ones that should be blocked. 

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.
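
As a rough illustration of the kind of guard being described, here is a minimal 
Scala sketch under assumed names (SparkPlanLike and isCanonicalizedPlan are 
hypothetical stand-ins, not the actual patch):

{noformat}
// Minimal sketch: a hypothetical marker on plans that exist only for
// comparison purposes, checked before execution. The real patch may differ.
class IllegalPlanStateException(msg: String) extends RuntimeException(msg)

abstract class SparkPlanLike {
  def isCanonicalizedPlan: Boolean = false   // hypothetical marker
  protected def doExecute(): Seq[Any]        // stand-in for the real execution path

  final def execute(): Seq[Any] = {
    if (isCanonicalizedPlan) {
      // Fail loudly and explicitly rather than producing a confusing error downstream.
      throw new IllegalPlanStateException(
        "A canonicalized plan is not supposed to be executed.")
    }
    doExecute()
  }
}
{noformat}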






[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113510#comment-16113510
 ] 

Mark Hamstra commented on SPARK-21619:
--

Ok, but my point is that if plans are to be canonicalized for some reason, 
maybe they should also be canonicalized before execution. It seems odd both to 
execute plans that are not in a canonical form and to be unable to execute 
plans that are in a canonical form. That view makes failing the execution of 
canonical plans look more like a workaround/hack (maybe needed in the short 
term) than a solution to a deeper issue. 

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.






[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113526#comment-16113526
 ] 

Mark Hamstra commented on SPARK-21619:
--

Two reasons, mostly: 1) to provide better guarantees that plans deemed 
semantically equivalent actually end up being expressed the same way before 
execution and thus go down the same code paths; 2) to simplify some downstream 
logic, so that instead of needing to maintain a mapping between multiple 
semantically equivalent plans and a single canonical form, after a certain 
canonicalization point the plans really are the same.

To perhaps clear up my confusion, maybe you can answer the question going the 
other way: Why would you want to execute multiple semantically equivalent plans 
in different forms?
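
To make point 2) concrete, here is a toy Scala sketch (all names are made-up 
stand-ins, not Spark's internal classes) of using a canonical form as the 
single key for downstream bookkeeping:

{noformat}
object CanonicalKeySketch {
  sealed trait ToyPlan
  case class Scan(table: String, alias: String) extends ToyPlan
  case class Project(cols: List[String], child: ToyPlan) extends ToyPlan

  def canonicalize(p: ToyPlan): ToyPlan = p match {
    case Scan(table, _)       => Scan(table, alias = "")                   // drop aliases
    case Project(cols, child) => Project(cols.sorted, canonicalize(child)) // fix column order
  }

  // One cache entry per *semantic* plan, regardless of aliases or column order.
  private val resultCache = scala.collection.mutable.Map.empty[ToyPlan, String]

  def lookupOrRun(p: ToyPlan)(run: => String): String =
    resultCache.getOrElseUpdate(canonicalize(p), run)
}
{noformat}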

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.






[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113555#comment-16113555
 ] 

Mark Hamstra commented on SPARK-21619:
--

Yes, I absolutely understand that this issue and PR are meant to address an 
immediate need, and that a deeper redesign would be one or likely more separate 
issues. I more trying to raise awareness or improve my understanding than to 
delay or block progress on addressing the immediate need.

I do have concerns, though, that making canonical plans unexecutable just 
because they are in canonical form does make certain evolutions of Spark more 
difficult. As one half-baked example, you could want to decouple query plans 
from a single execution engine, so that certain kinds of logical plans could be 
sent toward execution on one engine (or cluster configuration) while other 
plans could be directed to a separate engine (presumably more suitable to those 
plans in some way.) Splitting and forking Spark's query execution pipeline in 
that kind of way isn't really that difficult (I've done it in at least a 
proof-of-concept), and has some perhaps significant potential benefits. To do 
that, though, you'd really like to have a single, canonical form for any 
semantically equivalent queries by the time they reach your dispatch function 
for determining the destination execution engine for a query (and where results 
will be cached locally, etc.) Making the canonical form unexecutable throws a 
wrench into that.  
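
Roughly, the dispatch function in that half-baked example could look like the 
following sketch; every name here is a hypothetical stand-in, not how Spark's 
pipeline is structured today:

{noformat}
object DispatchSketch {
  sealed trait Engine
  case object InteractiveEngine extends Engine
  case object BatchEngine extends Engine

  // Route on the canonical form so that aliases or column ordering cannot split
  // semantically equivalent queries across engines (or their local result caches).
  def dispatch[P](plan: P)(canonicalize: P => P)(looksInteractive: P => Boolean): Engine = {
    val canonical = canonicalize(plan)
    if (looksInteractive(canonical)) InteractiveEngine else BatchEngine
  }
}
{noformat}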

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.






[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113571#comment-16113571
 ] 

Mark Hamstra commented on SPARK-21619:
--

_"Why would you want to execute multiple semantically equivalent plans in 
different forms?" -> Because they can be executed in different times, using 
different aliases, etc?_

Right, so for separate executions of semantically equivalent plans you need to 
maintain a mapping between the aliases of a particular plan and their canonical 
form, but after doing that you can more easily recover data and metadata 
associated with a prior execution of an equivalent plan.

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.






[jira] [Comment Edited] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113555#comment-16113555
 ] 

Mark Hamstra edited comment on SPARK-21619 at 8/3/17 10:01 PM:
---

Yes, I absolutely understand that this issue and PR are meant to address an 
immediate need, and that a deeper redesign would be one or likely more separate 
issues. I'm more trying to raise awareness or improve my understanding than to 
delay or block progress on addressing the immediate need.

I do have concerns, though, that making canonical plans unexecutable just 
because they are in canonical form does make certain evolutions of Spark more 
difficult. As one half-baked example, you could want to decouple query plans 
from a single execution engine, so that certain kinds of logical plans could be 
sent toward execution on one engine (or cluster configuration) while other 
plans could be directed to a separate engine (presumably more suitable to those 
plans in some way.) Splitting and forking Spark's query execution pipeline in 
that kind of way isn't really that difficult (I've done it in at least a 
proof-of-concept), and has some perhaps significant potential benefits. To do 
that, though, you'd really like to have a single, canonical form for any 
semantically equivalent queries by the time they reach your dispatch function 
for determining the destination execution engine for a query (and where results 
will be cached locally, etc.) Making the canonical form unexecutable throws a 
wrench into that.  


was (Author: markhamstra):
Yes, I absolutely understand that this issue and PR are meant to address an 
immediate need, and that a deeper redesign would be one or likely more separate 
issues. I more trying to raise awareness or improve my understanding than to 
delay or block progress on addressing the immediate need.

I do have concerns, though, that making canonical plans unexecutable just 
because they are in canonical form does make certain evolutions of Spark more 
difficult. As one half-baked example, you could want to decouple query plans 
from a single execution engine, so that certain kinds of logical plans could be 
sent toward execution on one engine (or cluster configuration) while other 
plans could be directed to a separate engine (presumably more suitable to those 
plans in some way.) Splitting and forking Spark's query execution pipeline in 
that kind of way isn't really that difficult (I've done it in at least a 
proof-of-concept), and has some perhaps significant potential benefits. To do 
that, though, you'd really like to have a single, canonical form for any 
semantically equivalent queries by the time they reach your dispatch function 
for determining the destination execution engine for a query (and where results 
will be cached locally, etc.) Making the canonical form unexecutable throws a 
wrench into that.  

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.






[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113586#comment-16113586
 ] 

Mark Hamstra commented on SPARK-21619:
--

Or you can just enlighten me on how one should design a dispatch function for 
multiple expressions of semantically equivalent query plans under the current 
architecture. :) Dispatching based on a canonical form of a plan seems like an 
obvious solution to me, but maybe I'm missing something.  

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.






[jira] [Commented] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-08-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113605#comment-16113605
 ] 

Mark Hamstra commented on SPARK-21619:
--

But part of the point of the split in my half-baked example is to fork the 
query execution pipeline before physical plan generation, allowing the cost of 
that generation to be parallelized with an instance per execution engine. Yes, 
maybe doing dispatch of physical plans via the CBO or other means is all that I 
should realistically hope for, but that doesn't mean it isn't worth thinking 
about alternatives. 

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.






[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2016-07-23 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390865#comment-15390865
 ] 

Mark Hamstra commented on SPARK-16693:
--

The 2.1.0 release is the very earliest that we can allow removal of things 
first deprecated in 2.0.0.  That is different from saying that we *should* 
remove those things in 2.1.0.  The decision as to exactly when we should and 
will remove things first deprecated in 2.0.0 needs more discussion. 

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0






[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2016-07-25 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392517#comment-15392517
 ] 

Mark Hamstra commented on SPARK-16693:
--

As much as makes sense and is possible, we should also strive to coordinate 
removal of stuff deprecated in SparkR in 2.0.0 with removal of stuff deprecated 
elsewhere in Spark 2.0.0. It's just easier for all parties concerned to be able 
to say that all the things were removed as of release x.y.z than to have Spark 
developers and users keep track that SparkR removed stuff in release x.y.a, 
Spark SQL removed stuff in x.y.b, etc.

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0






[jira] [Commented] (SPARK-26502) Get rid of hiveResultString() in QueryExecution

2019-01-02 Thread Mark Hamstra (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732435#comment-16732435
 ] 

Mark Hamstra commented on SPARK-26502:
--

Don't lose track of this comment: 
[https://github.com/apache/spark/blob/948414afe706e0b526d7f83f598cbd204d2fc687/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L41]

 

Any significant change to QueryExecution needs to be documented carefully and 
included in the release notes, since you will be forcing 3rd-party changes. 

> Get rid of hiveResultString() in QueryExecution
> ---
>
> Key: SPARK-26502
> URL: https://issues.apache.org/jira/browse/SPARK-26502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The method hiveResultString() of QueryExecution is used in test and 
> SparkSQLDriver. It should be moved from QueryExecution to more specific class.






[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling

2017-01-19 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830387#comment-15830387
 ] 

Mark Hamstra commented on SPARK-19276:
--

This all makes sense, and the PR is a good effort to fix this kind of 
"accidental" swallowing of FetchFailedException.  I guess my only real question 
is whether we should allow for the possibility of a FetchFailedException not only 
being caught, but also the failure being remedied by some means other than the 
usual handling in the driver. I'm not sure exactly how or why that kind of "fix 
the fetch failure before Spark tries to handle it" would be done; and it would 
seem that something like that would be prone to subtle errors, so maybe we 
should just set it in stone that nobody but the driver should try to fix a 
fetch failure -- which would make the approach of your "guarantee that the 
FetchFailedException is seen by the driver" PR completely correct.
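
To make the failure mode concrete, here is an illustrative sketch (using a 
local stand-in class rather than Spark's real FetchFailedException) of how 
task-side exception handling can hide a fetch failure from the driver:

{noformat}
object SwallowedFetchFailure {
  // Local stand-in for org.apache.spark.shuffle.FetchFailedException.
  class FakeFetchFailedException(msg: String) extends RuntimeException(msg)

  def readShuffleBlock(): Int =
    throw new FakeFetchFailedException("Failed to connect to xxx/yyy:zzz")

  // Anti-pattern: the wrapper hides the real cause, so the scheduler treats this
  // as an ordinary task failure and retries instead of regenerating shuffle data.
  def userTaskBody(): Int =
    try readShuffleBlock()
    catch {
      case t: Throwable =>
        throw new RuntimeException("Task failed while writing rows.", t)
    }
}
{noformat}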

> FetchFailures can be hidden by user (or sql) exception handling
> ---
>
> Key: SPARK-19276
> URL: https://issues.apache.org/jira/browse/SPARK-19276
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Priority: Critical
>
> The scheduler handles node failures by looking for a special 
> {{FetchFailedException}} thrown by the shuffle block fetcher.  This is 
> handled in {{Executor}} and then passed as a special msg back to the driver: 
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/core/src/main/scala/org/apache/spark/executor/Executor.scala#L403
> However, user code exists in between the shuffle block fetcher and that catch 
> block -- it could intercept the exception, wrap it with something else, and 
> throw a different exception.  If that happens, spark treats it as an ordinary 
> task failure, and retries the task, rather than regenerating the missing 
> shuffle data.  The task eventually is retried 4 times, its doomed to fail 
> each time, and the job is failed.
> You might think that no user code should do that -- but even sparksql does it:
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214
> Here's an example stack trace.  This is from Spark 1.6, so the sql code is 
> not the same, but the problem is still there:
> {noformat}
> 17/01/13 19:18:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 1983.0 (TID 304851, xxx): org.apache.spark.SparkException: Task failed while 
> writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to xxx/yyy:zzz
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
> ...
> 17/01/13 19:19:29 ERROR scheduler.TaskSetManager: Task 0 in stage 1983.0 
> failed 4 times; aborting job
> {noformat}
> I think the right fix here is to also set a fetch failure status in the 
> {{TaskContextImpl}}, so the executor can check that instead of just one 
> exception.
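
A minimal sketch of the direction proposed in that last paragraph, with made-up 
names (the eventual TaskContextImpl change may look different): the block 
fetcher records the failure on a per-task context before throwing, and the 
executor consults that flag rather than trusting whatever exception type 
finally surfaces.

{noformat}
object FetchFailureFlagSketch {
  class ToyTaskContext {
    @volatile private var failure: Option[Throwable] = None
    def setFetchFailed(t: Throwable): Unit = failure = Some(t)
    def fetchFailed: Option[Throwable] = failure
  }

  // Classification no longer depends on the (possibly wrapped) exception type.
  def classifyFailure(ctx: ToyTaskContext, error: Throwable): String =
    if (ctx.fetchFailed.isDefined) "FETCH_FAILED"
    else s"TASK_FAILED: ${error.getMessage}"
}
{noformat}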






[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling

2017-01-19 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830519#comment-15830519
 ] 

Mark Hamstra commented on SPARK-19276:
--

Ok, I haven't read your PR closely yet, so I missed that.

This question looks like something that could use more eyes and insights. 
[~kayousterhout][~matei][~r...@databricks.com]


> FetchFailures can be hidden by user (or sql) exception handling
> ---
>
> Key: SPARK-19276
> URL: https://issues.apache.org/jira/browse/SPARK-19276
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Priority: Critical
>
> The scheduler handles node failures by looking for a special 
> {{FetchFailedException}} thrown by the shuffle block fetcher.  This is 
> handled in {{Executor}} and then passed as a special msg back to the driver: 
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/core/src/main/scala/org/apache/spark/executor/Executor.scala#L403
> However, user code exists in between the shuffle block fetcher and that catch 
> block -- it could intercept the exception, wrap it with something else, and 
> throw a different exception.  If that happens, spark treats it as an ordinary 
> task failure, and retries the task, rather than regenerating the missing 
> shuffle data.  The task eventually is retried 4 times, its doomed to fail 
> each time, and the job is failed.
> You might think that no user code should do that -- but even sparksql does it:
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214
> Here's an example stack trace.  This is from Spark 1.6, so the sql code is 
> not the same, but the problem is still there:
> {noformat}
> 17/01/13 19:18:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 1983.0 (TID 304851, xxx): org.apache.spark.SparkException: Task failed while 
> writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to xxx/yyy:zzz
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
> ...
> 17/01/13 19:19:29 ERROR scheduler.TaskSetManager: Task 0 in stage 1983.0 
> failed 4 times; aborting job
> {noformat}
> I think the right fix here is to also set a fetch failure status in the 
> {{TaskContextImpl}}, so the executor can check that instead of just one 
> exception.






[jira] [Updated] (SPARK-17529) On highly skewed data, outer join merges are slow

2016-09-13 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-17529:
-
Priority: Major  (was: Trivial)

> On highly skewed data, outer join merges are slow
> -
>
> Key: SPARK-17529
> URL: https://issues.apache.org/jira/browse/SPARK-17529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: David C Navas
>
> All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the 
> same performance problem.
> My co-worker Yewei Zhang was investigating a performance problem with a 
> highly skewed dataset.
> "Part of this query performs a full outer join over [an ID] on highly skewed 
> data. On the left view, there is one record for id = 0 out of 2,272,486 
> records; On the right view there are 8,353,097 records for id = 0 out of 
> 12,685,073 records"
> The sub-query was taking 5.2 minutes.  We discovered that snappy was 
> responsible for some measure of this problem and incorporated the new snappy 
> release.  This brought the sub-query down to 2.4 minutes.  A large percentage 
> of the remaining time was spent in the merge code which I tracked down to a 
> BitSet clearing issue.  We have noted that you already have the snappy fix; 
> this issue describes the problem with the BitSet.
> The BitSet grows to handle the largest matching set of keys and is used to 
> track joins.  The BitSet is re-used in all subsequent joins (unless it is too 
> small).
> The skewing of our data caused a very large BitSet to be allocated on the 
> very first row joined.  Unfortunately, the entire BitSet is cleared on each 
> re-use.  For each of the remaining rows which likely match only a few rows on 
> the other side, the entire 1MB of the BitSet is cleared.  If unpartitioned, 
> this would happen roughly 6 million times.  The fix I developed for this is 
> to clear only the portion of the BitSet that is needed.  After applying it, 
> the sub-query dropped from 2.4 minutes to 29 seconds.
> Small (0 or negative) IDs are often used as place-holders for null, empty, or 
> unknown data, so I expect this fix to be generally useful, if rarely 
> encountered to this particular degree.
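
The described fix can be sketched with a simple long-backed bit set (this is 
not Spark's actual BitSet class): remember the highest bit touched since the 
last reset and clear only that prefix, so a single skewed key does not make 
every later row pay for clearing the whole allocation.

{noformat}
class ReusableBitSet(capacity: Int) {
  private val words = new Array[Long]((capacity + 63) / 64)
  private var maxSetBit = -1

  def set(i: Int): Unit = {
    words(i >> 6) |= 1L << (i & 63)
    if (i > maxSetBit) maxSetBit = i
  }

  def get(i: Int): Boolean = (words(i >> 6) & (1L << (i & 63))) != 0

  // Clear only the words that can contain set bits, not the full (e.g. 1MB) array.
  def clearUsed(): Unit = {
    if (maxSetBit >= 0) java.util.Arrays.fill(words, 0, (maxSetBit >> 6) + 1, 0L)
    maxSetBit = -1
  }
}
{noformat}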






[jira] [Created] (SPARK-17769) Some FetchFailure refactoring in the DAGScheduler

2016-10-03 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-17769:


 Summary: Some FetchFailure refactoring in the DAGScheduler
 Key: SPARK-17769
 URL: https://issues.apache.org/jira/browse/SPARK-17769
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Priority: Minor


SPARK-17644 opened up a discussion about further refactoring of the 
DAGScheduler's handling of FetchFailure events.  These include:
* rewriting code and comments to improve readability
* doing fetchFailedAttemptIds.add(stageAttemptId) even when 
disallowStageRetryForTest is true
* issuing a ResubmitFailedStages event based on whether one is already enqueued 
for the current failed stage, not any prior failed stage
* logging the resubmission of all failed stages 






[jira] [Commented] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-17 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583334#comment-15583334
 ] 

Mark Hamstra commented on SPARK-17911:
--

I think we're pretty much on the same page when it comes to the net effects of 
just eliminating the RESUBMIT_TIMEOUT delay.  I need to find some time to think 
about what something better than the current delayed-resubmit-event approach 
would look like.

> Scheduler does not need messageScheduler for ResubmitFailedStages
> -
>
> Key: SPARK-17911
> URL: https://issues.apache.org/jira/browse/SPARK-17911
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> Its not totally clear what the purpose of the {{messageScheduler}} is in 
> {{DAGScheduler}}.  It can perhaps be eliminated completely; or perhaps we 
> should just clearly document its purpose.
> This comes from a long discussion w/ [~markhamstra] on an unrelated PR here: 
> https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11
> But its tricky so breaking it out here for archiving the discussion.
> Note: this issue requires a decision on what to do before a code change, so 
> lets just discuss it on jira first.






[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355895#comment-16355895
 ] 

Mark Hamstra commented on SPARK-22683:
--

A concern that I have is that the discussion seems to be very focused on Spark 
"jobs" that are actually Spark Applications that run only a single Job – i.e. a 
workload where Applications and Jobs are submitted 1:1. Tuning for just this 
kind of workload has the potential to negatively impact other Spark usage 
patterns where several Jobs run in a single Application or the common 
Job-server pattern where a single, long-running Application serves many Jobs 
over an extended time.

I'm not saying that the proposals made here and in the associated PR will 
necessarily affect those other usage patterns negatively. What I am saying is 
that there must be a strong demonstration that those patterns will not be 
negatively affected before I will be comfortable approving the requested 
changes.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor.cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executo

[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355943#comment-16355943
 ] 

Mark Hamstra commented on SPARK-22683:
--

I agree that setting the config to 1 should be sufficient to retain current 
behaviour; however, there seemed to be at least some urging in the discussion 
toward a default of 2. I'm not sure that we want to change Spark's default 
behavior in that way.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor.cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
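
As a back-of-the-envelope sketch of what the proposal does to the 
dynamic-allocation executor target (parameter names here are illustrative, not 
the actual config keys):

{noformat}
object TasksPerSlotSketch {
  def targetExecutors(pendingAndRunningTasks: Int,
                      executorCores: Int,
                      taskCpus: Int,
                      tasksPerSlot: Int): Int = {
    val slotsPerExecutor = executorCores / taskCpus
    math.ceil(pendingAndRunningTasks.toDouble / (tasksPerSlot * slotsPerExecutor)).toInt
  }
  // e.g. 6000 pending tasks, 5-core executors, 1 cpu per task:
  //   tasksPerSlot = 1 -> 1200 executors; tasksPerSlot = 6 -> 200 executors.
}
{noformat}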




[jira] [Updated] (SPARK-21084) Improvements to dynamic allocation for notebook use cases

2018-02-07 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-21084:
-
Description: 
One important application of Spark is to support many notebook users with a 
single YARN or Spark Standalone cluster.  We at IBM have seen this requirement 
across multiple deployments of Spark: on-premises and private cloud deployments 
at our clients, as well as on the IBM cloud.  The scenario goes something like 
this: "Every morning at 9am, 500 analysts log into their computers and start 
running Spark notebooks intermittently for the next 8 hours." I'm sure that 
many other members of the community are interested in making similar scenarios 
work.

Dynamic allocation is supposed to support these kinds of use cases by shifting 
cluster resources towards users who are currently executing scalable code.  In 
our own testing, we have encountered a number of issues with using the current 
implementation of dynamic allocation for this purpose:
*Issue #1: Starvation.* A Spark job acquires all available containers, 
preventing other jobs or applications from starting.
*Issue #2: Request latency.* Jobs that would normally finish in less than 30 
seconds take 2-4x longer than normal with dynamic allocation.
*Issue #3: Unfair resource allocation due to cached data.* Applications that 
have cached RDD partitions hold onto executors indefinitely, denying those 
resources to other applications.
*Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
lose partitions of cached RDDs because the underlying executors are removed; 
the applications then need to rerun expensive computations.

This umbrella JIRA covers efforts to address these issues by making 
enhancements to Spark.
Here's a high-level summary of the current planned work:
* [SPARK-21097]: Preserve an executor's cached data when shutting down the 
executor.
* [SPARK-21122]: Make Spark give up executors in a controlled fashion when the 
RM indicates it is running low on capacity.
* (JIRA TBD): Reduce the delay for dynamic allocation to spin up new executors.

Note that this overall plan is subject to change, and other members of the 
community should feel free to suggest changes and to help out.

  was:
One important application of Spark is to support many notebook users with a 
single YARN or Spark Standalone cluster.  We at IBM have seen this requirement 
across multiple deployments of Spark: on-premises and private cloud deployments 
at our clients, as well as on the IBM cloud.  The scenario goes something like 
this: "Every morning at 9am, 500 analysts log into their computers and start 
running Spark notebooks intermittently for the next 8 hours." I'm sure that 
many other members of the community are interested in making similar scenarios 
work.

Dynamic allocation is supposed to support these kinds of use cases by shifting 
cluster resources towards users who are currently executing scalable code.  In 
our own testing, we have encountered a number of issues with using the current 
implementation of dynamic allocation for this purpose:
*Issue #1: Starvation.* A Spark job acquires all available containers, 
preventing other jobs or applications from starting.
*Issue #2: Request latency.* Jobs that would normally finish in less than 30 
seconds take 2-4x longer than normal with dynamic allocation.
*Issue #3: Unfair resource allocation due to cached data.* Applications that 
have cached RDD partitions hold onto executors indefinitely, denying those 
resources to other applications.
*Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
lose partitions of cached RDDs because the underlying executors are removed; 
the applications then need to rerun expensive computations.

This umbrella JIRA covers efforts to address these issues by making 
enhancements to Spark.
Here's a high-level summary of the current planned work:
* [SPARK-21097]: Preserve an executor's cached data when shutting down the 
executor.
* (JIRA TBD): Make Spark give up executors in a controlled fashion when the RM 
indicates it is running low on capacity.
* (JIRA TBD): Reduce the delay for dynamic allocation to spin up new executors.

Note that this overall plan is subject to change, and other members of the 
community should feel free to suggest changes and to help out.


> Improvements to dynamic allocation for notebook use cases
> -
>
> Key: SPARK-21084
> URL: https://issues.apache.org/jira/browse/SPARK-21084
> Project: Spark
>  Issue Type: Umbrella
>  Components: Block Manager, Scheduler, Spark Core, YARN
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Frederick Reiss
>Priority: Major
>
> One important application of Spark is to support many notebook users with a 
> single YARN

[jira] [Resolved] (SPARK-13523) Reuse the exchanges in a query

2016-03-04 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra resolved SPARK-13523.
--
Resolution: Duplicate

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> In exchange, the RDD will be materialized (shuffled or collected), it's a 
> good point to eliminate common part of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) could be used multiple times, we should re-use them to 
> avoid the duplicated work and reduce the memory for broadcast.






[jira] [Commented] (SPARK-13523) Reuse the exchanges in a query

2016-03-08 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185967#comment-15185967
 ] 

Mark Hamstra commented on SPARK-13523:
--

Yes that is a good point.  But they are closely enough linked (e.g. in the need 
for plan canonicalization) that I'm going to create an epic to cover both 
issues.

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> In exchange, the RDD will be materialized (shuffled or collected), it's a 
> good point to eliminate common part of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) could be used multiple times, we should re-use them to 
> avoid the duplicated work and reduce the memory for broadcast.






[jira] [Created] (SPARK-13756) Reuse Query Fragments

2016-03-08 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-13756:


 Summary: Reuse Query Fragments
 Key: SPARK-13756
 URL: https://issues.apache.org/jira/browse/SPARK-13756
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Mark Hamstra


Query fragments that have been materialized in RDDs can and should be reused in 
subsequent queries.






[jira] [Comment Edited] (SPARK-13523) Reuse the exchanges in a query

2016-03-08 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185967#comment-15185967
 ] 

Mark Hamstra edited comment on SPARK-13523 at 3/8/16 10:36 PM:
---

Yes that is a good point.  But they are closely enough linked (e.g. in the need 
for plan canonicalization) that I'm going to create an umbrella to cover both 
issues.


was (Author: markhamstra):
Yes that is a good point.  But they are closely enough linked (e.g. in the need 
for plan canonicalization) that I'm going to create an ubmrella to cover both 
issues.

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> In exchange, the RDD will be materialized (shuffled or collected), it's a 
> good point to eliminate common part of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) could be used multiple times, we should re-use them to 
> avoid the duplicated work and reduce the memory for broadcast.






[jira] [Comment Edited] (SPARK-13523) Reuse the exchanges in a query

2016-03-08 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185967#comment-15185967
 ] 

Mark Hamstra edited comment on SPARK-13523 at 3/8/16 10:36 PM:
---

Yes that is a good point.  But they are closely enough linked (e.g. in the need 
for plan canonicalization) that I'm going to create an ubmrella to cover both 
issues.


was (Author: markhamstra):
Yes that is a good point.  But they are closely enough linked (e.g. in the need 
for plan canonicalization) that I'm going to create an epic to cover both 
issues.

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> In exchange, the RDD will be materialized (shuffled or collected), it's a 
> good point to eliminate common part of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) could be used multiple times, we should re-use them to 
> avoid the duplicated work and reduce the memory for broadcast.






[jira] [Commented] (SPARK-13756) Reuse Query Fragments

2016-03-08 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185991#comment-15185991
 ] 

Mark Hamstra commented on SPARK-13756:
--

Fragment reuse across queries

> Reuse Query Fragments
> -
>
> Key: SPARK-13756
> URL: https://issues.apache.org/jira/browse/SPARK-13756
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Mark Hamstra
>
> Query fragments that have been materialized in RDDs can and should be reused 
> in subsequent queries.






[jira] [Updated] (SPARK-13756) Reuse Query Fragments

2016-03-08 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-13756:
-
Description: Query fragments that have been materialized in RDDs can and 
should be reused either within the same query or in subsequent queries.  (was: 
Query fragments that have been materialized in RDDs can and should be reused in 
subsequent queries.)

> Reuse Query Fragments
> -
>
> Key: SPARK-13756
> URL: https://issues.apache.org/jira/browse/SPARK-13756
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Mark Hamstra
>
> Query fragments that have been materialized in RDDs can and should be reused 
> either within the same query or in subsequent queries.






[jira] [Comment Edited] (SPARK-13756) Reuse Query Fragments

2016-03-08 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185991#comment-15185991
 ] 

Mark Hamstra edited comment on SPARK-13756 at 3/8/16 10:42 PM:
---

Collecting the prior issues under an umbrella.


was (Author: markhamstra):
Fragment reuse across queries

> Reuse Query Fragments
> -
>
> Key: SPARK-13756
> URL: https://issues.apache.org/jira/browse/SPARK-13756
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Mark Hamstra
>
> Query fragments that have been materialized in RDDs can and should be reused 
> either within the same query or in subsequent queries.






[jira] [Updated] (SPARK-11838) Spark SQL query fragment RDD reuse across queries

2016-03-08 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-11838:
-
Summary: Spark SQL query fragment RDD reuse across queries  (was: Spark SQL 
query fragment RDD reuse)

> Spark SQL query fragment RDD reuse across queries
> -
>
> Key: SPARK-11838
> URL: https://issues.apache.org/jira/browse/SPARK-11838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mikhail Bautin
>
> With many analytical Spark SQL workloads against slowly changing tables, 
> successive queries frequently share fragments that produce the same result. 
> Instead of re-computing those fragments for every query, it makes sense to 
> detect similar fragments and substitute RDDs previously created for matching 
> SparkPlan fragments into every new SparkPlan at the execution time whenever 
> possible. Even if no RDDs are persist()-ed to memory/disk/off-heap memory, 
> many stages can still be skipped due to map output files being present on 
> executor nodes.
> The implementation involves the following steps:
> (1) Logical plan "canonicalization". 
> Logical plans mapping to the same "canonical" logical plan should always 
> produce the same results (except for possible output column reordering), 
> although the inverse statement won't always be true. 
>   - Re-mapping expression ids to "canonical expression ids" (successively 
> increasing numbers always starting with 1).
>   - Eliminating alias names that are unimportant after analysis completion. 
> Only the names that are necessary to determine the Hive table columns to be 
> scanned are retained.
>   - Reordering columns in projections, grouping/aggregation expressions, etc. 
> This can be done e.g. by using the string representation as a sort key. Union 
> inputs always have to be reordered the same way.
>   - Tree traversal has to happen starting from leaves and progressing towards 
> the root, because we need to already have identified canonical expression ids 
> for children of a node before we can come up with sort keys that would allow 
> us to reorder expressions in a node deterministically. This is a bit more 
> complicated for Union nodes.
>   - Special handling for MetastoreRelations. We replace MetastoreRelation 
> with a special class CanonicalMetastoreRelation that uses attributes and 
> partitionKeys as part of its equals() and hashCode() implementation, but the 
> visible attributes and partitionKeys are restricted to expression ids that 
> the rest of the query actually needs from that MetastoreRelation.
> An example of logical plans and corresponding canonical logical plans: 
> https://gist.githubusercontent.com/mbautin/ef1317b341211d9606cf/raw
> (2) Tracking LogicalPlan fragments corresponding to SparkPlan fragments. When 
> generating a SparkPlan, we keep an optional reference to a LogicalPlan 
> instance in every node. This allows us to populate the cache with mappings 
> from canonical logical plans of query fragments to the corresponding RDDs 
> generated as part of query execution. Note that there is no new work 
> necessary to generate the RDDs, we are merely utilizing the RDDs that would 
> have been produced as part of SparkPlan execution anyway.
> (3) SparkPlan fragment substitution. After generating a SparkPlan and before 
> calling prepare() or execute() on it, we check if any of its nodes have an 
> associated LogicalPlan that maps to a canonical logical plan matching a cache 
> entry. If so, we substitute a PhysicalRDD (or a new class UnsafePhysicalRDD 
> wrapping an RDD of UnsafeRow) scanning the previously created RDD instead of 
> the current query fragment. If the expected column order differs from what 
> the current SparkPlan fragment produces, we add a projection to reorder the 
> columns. We also add safe/unsafe row conversions as necessary to match the 
> row type that is expected by the parent of the current SparkPlan fragment.
> (4) The execute() method of SparkPlan also needs to perform the cache lookup 
> and substitution described above before producing a new RDD for the current 
> SparkPlan node. The "loading cache" pattern (e.g. as implemented in Guava) 
> allows query fragments to be reused between simultaneously submitted queries: 
> whichever query runs execute() for a particular fragment's canonical logical 
> plan starts producing an RDD first, and if another query has a fragment with 
> the same canonical logical plan, it waits for the RDD to be produced by the 
> first query and inserts it in its SparkPlan instead.
> This kind of query fragment caching will mostly be useful for slowly-changing 
> or static tables. Even with slowly-changing tables, the cache needs to be 
> invalidated when those data set changes take place. One of the following 
> approaches could be us

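To make step (1) of the description above concrete, here is a minimal, 
illustrative sketch of the expression-id remapping idea. The names below (Expr, 
canonicalizeIds) are simplified stand-ins, not the real Catalyst types:

{code}
import scala.collection.mutable

// Simplified stand-in for a Catalyst expression tree: only the id and the children matter here.
final case class Expr(id: Long, children: Seq[Expr] = Nil)

// Rewrite expression ids into "canonical" ids that start at 1 and increase in the
// order in which the original ids are first encountered, visiting children before parents.
def canonicalizeIds(roots: Seq[Expr]): Seq[Expr] = {
  val mapping = mutable.LinkedHashMap.empty[Long, Long]
  def canonicalId(old: Long): Long = mapping.getOrElseUpdate(old, mapping.size + 1L)
  def rewrite(e: Expr): Expr = {
    val newChildren = e.children.map(rewrite)   // leaves first, as the description requires
    Expr(canonicalId(e.id), newChildren)
  }
  roots.map(rewrite)
}
{code}

Under such a remapping, two semantically equivalent fragments that differ only 
in the expression ids they happened to be assigned canonicalize to identical 
trees, which is what makes them usable as cache keys.
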
[jira] [Created] (SPARK-13806) SQL round() produces incorrect results for negative values

2016-03-10 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-13806:


 Summary: SQL round() produces incorrect results for negative values
 Key: SPARK-13806
 URL: https://issues.apache.org/jira/browse/SPARK-13806
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1, 1.6.0, 2.0.0
Reporter: Mark Hamstra


Round in catalyst/expressions/mathExpressions.scala appears to be untested with 
negative values, and it doesn't handle them correctly.

There are at least two issues here:

1) In the genCode for FloatType and DoubleType with _scale == 0, round() will 
not produce the same results as for the BigDecimal.ROUND_HALF_UP strategy used 
in all other cases.  This is because Math.round is used for these _scale == 0 
cases.  For example, Math.round(-3.5) is -3, while BigDecimal.ROUND_HALF_UP at 
scale 0 for -3.5 is -4. 

Even after this bug is fixed with something like...
{code}
if (${ce.value} < 0) {
  ${ev.value} = -1 * Math.round(-1 * ${ce.value});
} else {
  ${ev.value} = Math.round(${ce.value});
}
{code}
...which will allow an additional test like this to succeed in 
MathFunctionsSuite.scala:
{code}
checkEvaluation(Round(-3.5D, 0), -4.0D)
{code}
...there still appears to be a problem on at least the 
checkEvalutionWithUnsafeProjection path, where failures like this are produced:
{code}
Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
[0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
{code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13806) SQL round() produces incorrect results for negative values

2016-03-10 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-13806:
-
Description: 
Round in catalyst/expressions/mathExpressions.scala appears to be untested with 
negative values, and it doesn't handle them correctly.

There are at least two issues here:

First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
will not produce the same results as for the BigDecimal.ROUND_HALF_UP strategy 
used in all other cases.  This is because Math.round is used for these _scale 
== 0 cases.  For example, Math.round(-3.5) is -3, while 
BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 

Even after this bug is fixed with something like...
{code}
if (${ce.value} < 0) {
  ${ev.value} = -1 * Math.round(-1 * ${ce.value});
} else {
  ${ev.value} = Math.round(${ce.value});
}
{code}
...which will allow an additional test like this to succeed in 
MathFunctionsSuite.scala:
{code}
checkEvaluation(Round(-3.5D, 0), -4.0D)
{code}
...there still appears to be a problem on at least the 
checkEvalutionWithUnsafeProjection path, where failures like this are produced:
{code}
Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
[0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
{code} 

  was:
Round in catalyst/expressions/mathExpressions.scala appears to be untested with 
negative values, and it doesn't handle them correctly.

There are at least two issues here:

1) In the genCode for FloatType and DoubleType with _scale == 0, round() will 
not produce the same results as for the BigDecimal.ROUND_HALF_UP strategy used 
in all other cases.  This is because Math.round is used for these _scale == 0 
cases.  For example, Math.round(-3.5) is -3, while BigDecimal.ROUND_HALF_UP at 
scale 0 for -3.5 is -4. 

Even after this bug is fixed with something like...
{code}
if (${ce.value} < 0) {
  ${ev.value} = -1 * Math.round(-1 * ${ce.value});
} else {
  ${ev.value} = Math.round(${ce.value});
}
{code}
...which will allow an additional test like this to succeed in 
MathFunctionsSuite.scala:
{code}
checkEvaluation(Round(-3.5D, 0), -4.0D)
{code}
...there still appears to be a problem on at least the 
checkEvalutionWithUnsafeProjection path, where failures like this are produced:
{code}
Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
[0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
{code} 


> SQL round() produces incorrect results for negative values
> --
>
> Key: SPARK-13806
> URL: https://issues.apache.org/jira/browse/SPARK-13806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Mark Hamstra
>
> Round in catalyst/expressions/mathExpressions.scala appears to be untested 
> with negative values, and it doesn't handle them correctly.
> There are at least two issues here:
> First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
> will not produce the same results as for the BigDecimal.ROUND_HALF_UP 
> strategy used in all other cases.  This is because Math.round is used for 
> these _scale == 0 cases.  For example, Math.round(-3.5) is -3, while 
> BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 
> Even after this bug is fixed with something like...
> {code}
> if (${ce.value} < 0) {
>   ${ev.value} = -1 * Math.round(-1 * ${ce.value});
> } else {
>   ${ev.value} = Math.round(${ce.value});
> }
> {code}
> ...which will allow an additional test like this to succeed in 
> MathFunctionsSuite.scala:
> {code}
> checkEvaluation(Round(-3.5D, 0), -4.0D)
> {code}
> ...there still appears to be a problem on at least the 
> checkEvalutionWithUnsafeProjection path, where failures like this are 
> produced:
> {code}
> Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
> [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13806) SQL round() produces incorrect results for negative values

2016-03-11 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-13806:
-
Description: 
Round in catalyst/expressions/mathExpressions.scala appears to be untested with 
negative values, and it doesn't handle them correctly.

There are at least two issues here:

First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
will not produce the same results as for the BigDecimal.ROUND_HALF_UP strategy 
used in all other cases.  This is because Math.round is used for these _scale 
== 0 cases.  For example, Math.round(-3.5) is -3, while 
BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 

Even after this bug is fixed with something like...
{code}
if (${ce.value} < 0) {
  ${ev.value} = -1 * Math.round(-1 * ${ce.value});
} else {
  ${ev.value} = Math.round(${ce.value});
}
{code}
...which will allow an additional test like this to succeed in 
MathFunctionsSuite.scala:
{code}
checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow)
{code}
...there still appears to be a problem on at least the 
checkEvalutionWithUnsafeProjection path, where failures like this are produced:
{code}
Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
[0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
{code} 

  was:
Round in catalyst/expressions/mathExpressions.scala appears to be untested with 
negative values, and it doesn't handle them correctly.

There are at least two issues here:

First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
will not produce the same results as for the BigDecimal.ROUND_HALF_UP strategy 
used in all other cases.  This is because Math.round is used for these _scale 
== 0 cases.  For example, Math.round(-3.5) is -3, while 
BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 

Even after this bug is fixed with something like...
{code}
if (${ce.value} < 0) {
  ${ev.value} = -1 * Math.round(-1 * ${ce.value});
} else {
  ${ev.value} = Math.round(${ce.value});
}
{code}
...which will allow an additional test like this to succeed in 
MathFunctionsSuite.scala:
{code}
checkEvaluation(Round(-3.5D, 0), -4.0D)
{code}
...there still appears to be a problem on at least the 
checkEvalutionWithUnsafeProjection path, where failures like this are produced:
{code}
Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
[0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
{code} 


> SQL round() produces incorrect results for negative values
> --
>
> Key: SPARK-13806
> URL: https://issues.apache.org/jira/browse/SPARK-13806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Mark Hamstra
>
> Round in catalyst/expressions/mathExpressions.scala appears to be untested 
> with negative values, and it doesn't handle them correctly.
> There are at least two issues here:
> First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
> will not produce the same results as for the BigDecimal.ROUND_HALF_UP 
> strategy used in all other cases.  This is because Math.round is used for 
> these _scale == 0 cases.  For example, Math.round(-3.5) is -3, while 
> BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 
> Even after this bug is fixed with something like...
> {code}
> if (${ce.value} < 0) {
>   ${ev.value} = -1 * Math.round(-1 * ${ce.value});
> } else {
>   ${ev.value} = Math.round(${ce.value});
> }
> {code}
> ...which will allow an additional test like this to succeed in 
> MathFunctionsSuite.scala:
> {code}
> checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow)
> {code}
> ...there still appears to be a problem on at least the 
> checkEvalutionWithUnsafeProjection path, where failures like this are 
> produced:
> {code}
> Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
> [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13872) Memory leak in SortMergeOuterJoin

2016-03-15 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196561#comment-15196561
 ] 

Mark Hamstra commented on SPARK-13872:
--

[~joshrosen]

> Memory leak in SortMergeOuterJoin
> -
>
> Key: SPARK-13872
> URL: https://issues.apache.org/jira/browse/SPARK-13872
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ian
> Attachments: Screen Shot 2016-03-11 at 5.42.32 PM.png
>
>
> SortMergeJoin composes its partition/iterator from 
> org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to 
> UnsafeExternalRowSorter.
> UnsafeExternalRowSorter's implementation cleans up its resources when:
> 1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully 
> iterated, or
> 2. the task finishes execution.
> In the outer-join case of SortMergeJoin, when the left or right iterator is not 
> fully iterated, the only chance for the resources to be cleaned up is at the 
> end of the Spark task run. 
> This is probably OK most of the time; however, when a SortMergeOuterJoin is 
> nested within a CartesianProduct, the "deferred" resource cleanup allows a 
> non-negligible memory leak, amplified/accumulated by the loop driven by the 
> CartesianRdd's iteration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13806) SQL round() produces incorrect results for negative values

2016-03-21 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205134#comment-15205134
 ] 

Mark Hamstra commented on SPARK-13806:
--

Yes, there is the mostly orthogonal question about which rounding strategy 
should be used -- see the comments in SPARK-8279.  But, assuming that we are 
adopting the ROUND_HALF_UP strategy, there is the problem with negative values 
that this JIRA points out: When using ROUND_HALF_UP and scale == 0, -x.5 must 
round to -(x+1), but Math.round will round it to -x.

In addition to this, the code gen for rounding of negative floating point 
values with negative scales is broken.

All of this stems from Spark SQL's implementation of round() being untested 
with negative values. 
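
For anyone who wants to reproduce the discrepancy outside of Spark, a small 
Scala snippet (plain JDK, no Spark required) shows the two behaviors side by 
side:
{code}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Math.round rounds half-values toward positive infinity, so -3.5 becomes -3 ...
val viaMathRound = math.round(-3.5d)                                      // -3

// ... while HALF_UP (the strategy used in all the other cases of Round) rounds
// half-values away from zero, so -3.5 becomes -4.
val viaHalfUp = new JBigDecimal(-3.5d).setScale(0, RoundingMode.HALF_UP)  // -4

println(s"Math.round(-3.5) = $viaMathRound, ROUND_HALF_UP(-3.5, scale 0) = $viaHalfUp")
{code}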

> SQL round() produces incorrect results for negative values
> --
>
> Key: SPARK-13806
> URL: https://issues.apache.org/jira/browse/SPARK-13806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Mark Hamstra
>
> Round in catalyst/expressions/mathExpressions.scala appears to be untested 
> with negative values, and it doesn't handle them correctly.
> There are at least two issues here:
> First, in the genCode for FloatType and DoubleType with _scale == 0, round() 
> will not produce the same results as for the BigDecimal.ROUND_HALF_UP 
> strategy used in all other cases.  This is because Math.round is used for 
> these _scale == 0 cases.  For example, Math.round(-3.5) is -3, while 
> BigDecimal.ROUND_HALF_UP at scale 0 for -3.5 is -4. 
> Even after this bug is fixed with something like...
> {code}
> if (${ce.value} < 0) {
>   ${ev.value} = -1 * Math.round(-1 * ${ce.value});
> } else {
>   ${ev.value} = Math.round(${ce.value});
> }
> {code}
> ...which will allow an additional test like this to succeed in 
> MathFunctionsSuite.scala:
> {code}
> checkEvaluation(Round(-3.5D, 0), -4.0D, EmptyRow)
> {code}
> ...there still appears to be a problem on at least the 
> checkEvalutionWithUnsafeProjection path, where failures like this are 
> produced:
> {code}
> Incorrect evaluation in unsafe mode: round(-3.141592653589793, -6), actual: 
> [0,0], expected: [0,8000] (ExpressionEvalHelper.scala:145)
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-09-12 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1021:

Assignee: Erik Erlandson  (was: Mark Hamstra)

> sortByKey() launches a cluster job when it shouldn't
> 
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0
>Reporter: Andrew Ash
>Assignee: Erik Erlandson
>  Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.
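
A minimal sketch of the lazy-val idea from the quote above, using a simplified 
stand-in rather than the real RangePartitioner (the class name, the elided 
sampling, and the getPartition logic here are illustrative only):
{code}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

class LazyRangePartitioner[K](partitions: Int, rdd: RDD[_])(implicit ord: Ordering[K])
    extends Partitioner {

  // Captured eagerly: it is cheap and lets equals()/hashCode() avoid touching rangeBounds.
  private val rddId: Int = rdd.id

  // The expensive sampling job is deferred until a partition is actually requested,
  // i.e. until an action runs, instead of firing in the constructor.
  private lazy val rangeBounds: Seq[K] = {
    // sample `rdd` and compute the upper bound of each range (elided in this sketch)
    Seq.empty[K]
  }

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    val i = rangeBounds.indexWhere(b => ord.lteq(k, b))
    if (i < 0) rangeBounds.length else i
  }

  // Compare the partition count and the id of the source RDD, as suggested above,
  // so that equality checks never force rangeBounds to be computed.
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitioner[_] => o.numPartitions == numPartitions && o.rddId == rddId
    case _ => false
  }

  override def hashCode(): Int = 31 * partitions + rddId
}
{code}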



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API

2014-09-22 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143814#comment-14143814
 ] 

Mark Hamstra commented on SPARK-2321:
-

Which would be kind of the opposite half of the SparkListenerJobStart event, 
which includes an array of the StageIds in a Job.  I included that way back 
when as a suggestion of at least some of what might be needed to implement 
better job-based progress reporting.  I'd have to look, but I don't believe 
anything is actually using that stage-reporting on JobStart right now.  In any 
event, any proper progress reporting should rationalize, extend or eliminate 
that part of SparkListenerJobStart.
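
As a rough illustration of the kind of thing that stage-reporting makes 
possible, a listener can combine the stage ids reported at job start with 
stage-completion events to derive a crude per-job progress figure (a sketch 
only, not a proposal for the eventual API):
{code}
import org.apache.spark.scheduler._
import scala.collection.mutable

class JobProgressTracker extends SparkListener {
  private val stagesByJob     = mutable.Map.empty[Int, Set[Int]]
  private val completedStages = mutable.Set.empty[Int]

  // SparkListenerJobStart carries the ids of all stages that belong to the job.
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    stagesByJob(jobStart.jobId) = jobStart.stageIds.toSet

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    completedStages += stageCompleted.stageInfo.stageId

  // Fraction of a job's stages that have completed so far (0.0 for an unknown job).
  def progress(jobId: Int): Double = stagesByJob.get(jobId) match {
    case Some(stages) if stages.nonEmpty =>
      stages.count(completedStages.contains).toDouble / stages.size
    case _ => 0.0
  }
}
{code}
Such a tracker would be registered with sc.addSparkListener(...); the point is 
just that the job-to-stage mapping is the piece SparkListenerJobStart already 
provides.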

> Design a proper progress reporting & event listener API
> ---
>
> Key: SPARK-2321
> URL: https://issues.apache.org/jira/browse/SPARK-2321
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>Priority: Critical
>
> This is a ticket to track progress on redesigning the SparkListener and 
> JobProgressListener API.
> There are multiple problems with the current design, including:
> 0. I'm not sure if the API is usable in Java (there are at least some enums 
> we used in Scala and a bunch of case classes that might complicate things).
> 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
> attention to it yet. Something as important as progress reporting deserves a 
> more stable API.
> 2. There is no easy way to connect jobs with stages. Similarly, there is no 
> easy way to connect job groups with jobs / stages.
> 3. JobProgressListener itself has no encapsulation at all. States can be 
> arbitrarily mutated by external programs. Variable names are sort of randomly 
> decided and inconsistent. 
> We should just revisit these and propose a new, concrete design. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11326) Split networking in standalone mode

2015-11-05 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992282#comment-14992282
 ] 

Mark Hamstra commented on SPARK-11326:
--

Can someone clarify for me just what is motivating this JIRA and PR?  My 
meta-point is that even though the standalone mode has drifted a fair amount 
from its originally intended use as a simple, limited cluster deployment that 
could be quickly and easily set up when all the functionality of Mesos (and 
later YARN) was not needed, I think that it is still important that limits on 
its functionality be retained.  We don't really want to get into a pattern of 
continually adding more and more functionality to standalone mode until it 
becomes a functional rival of Mesos and YARN -- two such full-featured resource 
managers is already plenty, and I really don't see the advantage of sliding 
into the maintenance burden of developing a third within Spark itself.

So, my questions really boil down to: 1) Where do we intend to draw the line 
limiting the functionality of standalone mode? and 2) Why does adding stronger 
security clearly fall on the "include it" side of that line? or Why can't we 
just tell Spark users who need stronger security to use YARN or Mesos?

> Split networking in standalone mode
> ---
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections 
> need to use the same secure token if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like in Yarn mode - application internal communication, scheduler 
> internal communication and communication between the client and scheduler. 
> Such refactoring will allow for the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown for the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing Kerberos based SASL authentication/encryption for connections 
> between a client (Client or AppClient) and Spark Master, it will be possible 
> to introduce authentication and automatic generation of digest tokens and 
> safe sharing them among the application processes. 
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations the env variable overrides the conf variable if both are 
> present. 
> - Whenever a user has to pass a secret, it is better (safer) to do so through 
> the env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, and between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
> will look for it in the worker configuration and it will find it there (its 
> presence is implied). 
> 
> h5.Client mode, multiple secrets
> h6.Configuration
> - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
> {{spark.app.authenticate.secret=secret}}
> - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
> {{spark.submission.authenticate.secret=scheduler_secret}}
> h6.Description
> - The driver is running locally
> - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
> {{spark.submission.authenticate.secret}} to connect to the master
> - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor 
> conf: {{spark.submission.authenticate.secret}}
> - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: 
> {{spark.app.authenticate.secret}} for communication with the executors
> - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
> that the executors can use it to communicate with the driver
> - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it 
> will set it in env: {{SPARK_AUTH_SECRET}} which will be read by 
> _ExecutorBackend_ afterwards and used for all the connections (with driver, 
> other executors and external shuffle servi

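Purely to illustrate the client-mode, multiple-secret variant described above, 
the configuration could be expressed through SparkConf roughly as follows. The 
two submission/app secret keys are the ones proposed in this ticket and do not 
exist in released Spark:
{code}
import org.apache.spark.SparkConf

// Sketch only: the last two keys come from this proposal, not from any released Spark version.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("secured-app")
  .set("spark.authenticate", "true")
  // client <-> master (scheduler) secret
  .set("spark.submission.authenticate.secret", "scheduler_secret")
  // driver <-> executors (application) secret
  .set("spark.app.authenticate.secret", "app_secret")
{code}
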
[jira] [Commented] (SPARK-11326) Split networking in standalone mode

2015-11-05 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992468#comment-14992468
 ] 

Mark Hamstra commented on SPARK-11326:
--

bq. I do believe that there are quite a lot of people who don't want to setup 
the whole Yarn or Mesos clusters to achieve basic security

And I remain unconvinced, or at least not persuaded that there aren't at least 
as many who would rather not have the standalone mode be any more complex or 
any further encumbered with configuration requirements or options.

We really need to have a clear architectural direction for standalone mode in 
order to decide whether these kinds of new features should be added. [~rxin] or 
[~pwendell], care to take a pass at the larger question?

> Split networking in standalone mode
> ---
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections 
> need to use the same secure token if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like in Yarn mode - application internal communication, scheduler 
> internal communication and communication between the client and scheduler. 
> Such refactoring will allow for the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown for the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing Kerberos based SASL authentication/encryption for connections 
> between a client (Client or AppClient) and Spark Master, it will be possible 
> to introduce authentication and automatic generation of digest tokens and 
> safe sharing them among the application processes. 
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations the env variable overrides the conf variable if both are 
> present. 
> - Whenever a user has to pass a secret, it is better (safer) to do so through 
> the env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, and between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
> will look for it in the worker configuration and it will find it there (its 
> presence is implied). 
> 
> h5.Client mode, multiple secrets
> h6.Configuration
> - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
> {{spark.app.authenticate.secret=secret}}
> - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
> {{spark.submission.authenticate.secret=scheduler_secret}}
> h6.Description
> - The driver is running locally
> - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
> {{spark.submission.authenticate.secret}} to connect to the master
> - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor 
> conf: {{spark.submission.authenticate.secret}}
> - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: 
> {{spark.app.authenticate.secret}} for communication with the executors
> - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
> that the executors can use it to communicate with the driver
> - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it 
> will set it in env: {{SPARK_AUTH_SECRET}} which will be read by 
> _ExecutorBackend_ afterwards and used for all the connections (with driver, 
> other executors and external shuffle service).
> 
> h5.Cluster mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is run by _DriverRunner_, which is a part of the worker
> - The client will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The client will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master and submit the 
> drive

[jira] [Commented] (SPARK-11326) Split networking in standalone mode

2015-11-05 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992478#comment-14992478
 ] 

Mark Hamstra commented on SPARK-11326:
--

[~rspitzer] A focus on reaching feature parity is definitely not where 
standalone mode started.  Both it and Mesos were really begun by [~matei], so 
he clearly intended standalone mode to be something different from the already 
existing Mesos cluster manager.  Equally clearly, the distinction in purpose 
between standalone mode and each of the other cluster managers has become 
blurry.   

> Split networking in standalone mode
> ---
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections 
> need to use the same secure token if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like in Yarn mode - application internal communication, scheduler 
> internal communication and communication between the client and scheduler. 
> Such refactoring will allow for the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown for the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing Kerberos based SASL authentication/encryption for connections 
> between a client (Client or AppClient) and Spark Master, it will be possible 
> to introduce authentication and automatic generation of digest tokens and 
> safe sharing them among the application processes. 
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations the env variable overrides the conf variable if both are 
> present. 
> - Whenever a user has to pass a secret, it is better (safer) to do so through 
> the env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, and between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
> will look for it in the worker configuration and it will find it there (its 
> presence is implied). 
> 
> h5.Client mode, multiple secrets
> h6.Configuration
> - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
> {{spark.app.authenticate.secret=secret}}
> - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
> {{spark.submission.authenticate.secret=scheduler_secret}}
> h6.Description
> - The driver is running locally
> - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
> {{spark.submission.authenticate.secret}} to connect to the master
> - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor 
> conf: {{spark.submission.authenticate.secret}}
> - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: 
> {{spark.app.authenticate.secret}} for communication with the executors
> - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
> that the executors can use it to communicate with the driver
> - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it 
> will set it in env: {{SPARK_AUTH_SECRET}} which will be read by 
> _ExecutorBackend_ afterwards and used for all the connections (with driver, 
> other executors and external shuffle service).
> 
> h5.Cluster mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is run by _DriverRunner_, which is a part of the worker
> - The client will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The client will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master and submit the 
> driver
> - _DriverRunner_ will not find any secret in _DriverDescription_ so it will 
> look for it in the worker configuration and it will find it there (its 
> presence is implied)
> - _Driv

[jira] [Commented] (SPARK-11539) Debian packaging

2015-11-05 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992601#comment-14992601
 ] 

Mark Hamstra commented on SPARK-11539:
--

I agree with [~srowen].  See https://issues.apache.org/jira/browse/SPARK-5727 
-- I'm not seeing anything that would lead us to reverse that decision.

> Debian packaging
> 
>
> Key: SPARK-11539
> URL: https://issues.apache.org/jira/browse/SPARK-11539
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Simon Hafner
>
> This patch adds debian packaging with systemd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004944#comment-15004944
 ] 

Mark Hamstra commented on SPARK-11153:
--

Is there a reason why parquet.version hasn't been pushed up in Spark 1.6 and 
this issue actually fixed instead of just disabling filter push-down for 
strings and binaries?

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet files could be produced by any 
> Parquet data models.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-14 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005542#comment-15005542
 ] 

Mark Hamstra commented on SPARK-11153:
--

Thanks.

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet files could be produced by any 
> Parquet data models.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11838) Spark SQL query fragment RDD reuse

2015-11-18 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012724#comment-15012724
 ] 

Mark Hamstra commented on SPARK-11838:
--

One significant difference between this and CacheManager is that what is 
proposed here is caching and reuse of the RDD itself, not the blocks of data 
computed by that RDD.  As Mikhail noted, that can avoid significant amounts of 
duplicate computation even when nothing is explicitly persisted/cached.
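
A tiny sketch of the distinction being drawn here, with hypothetical names: the 
cache is keyed by the canonicalized form of a plan fragment and stores the RDD 
produced for that fragment, so an equivalent fragment in a later plan can reuse 
the RDD (and any map output already sitting on the executors) instead of being 
recomputed:
{code}
import org.apache.spark.rdd.RDD
import scala.collection.concurrent.TrieMap

// Hypothetical helper, not part of Spark: K is whatever a canonicalized plan
// fragment hashes/compares by, T is the row type of the fragment's output.
class FragmentRddCache[K, T] {
  private val cache = TrieMap.empty[K, RDD[T]]

  // Return the RDD previously registered for an equivalent fragment, or run
  // `compute` (the normal SparkPlan-to-RDD translation) and register its result.
  // A Guava-style loading cache would add blocking for concurrent queries, as
  // the proposal describes.
  def getOrElseCompute(key: K)(compute: => RDD[T]): RDD[T] =
    cache.getOrElseUpdate(key, compute)
}
{code}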

> Spark SQL query fragment RDD reuse
> --
>
> Key: SPARK-11838
> URL: https://issues.apache.org/jira/browse/SPARK-11838
> Project: Spark
>  Issue Type: Improvement
>Reporter: Mikhail Bautin
>
> With many analytical Spark SQL workloads against slowly changing tables, 
> successive queries frequently share fragments that produce the same result. 
> Instead of re-computing those fragments for every query, it makes sense to 
> detect similar fragments and substitute RDDs previously created for matching 
> SparkPlan fragments into every new SparkPlan at the execution time whenever 
> possible. Even if no RDDs are persist()-ed to memory/disk/off-heap memory, 
> many stages can still be skipped due to map output files being present on 
> executor nodes.
> The implementation involves the following steps:
> (1) Logical plan "canonicalization". 
> Logical plans mapping to the same "canonical" logical plan should always 
> produce the same results (except for possible output column reordering), 
> although the inverse statement won't always be true. 
>   - Re-mapping expression ids to "canonical expression ids" (successively 
> increasing numbers always starting with 1).
>   - Eliminating alias names that are unimportant after analysis completion. 
> Only the names that are necessary to determine the Hive table columns to be 
> scanned are retained.
>   - Reordering columns in projections, grouping/aggregation expressions, etc. 
> This can be done e.g. by using the string representation as a sort key. Union 
> inputs always have to be reordered the same way.
>   - Tree traversal has to happen starting from leaves and progressing towards 
> the root, because we need to already have identified canonical expression ids 
> for children of a node before we can come up with sort keys that would allow 
> us to reorder expressions in a node deterministically. This is a bit more 
> complicated for Union nodes.
>   - Special handling for MetastoreRelations. We replace MetastoreRelation 
> with a special class CanonicalMetastoreRelation that uses attributes and 
> partitionKeys as part of its equals() and hashCode() implementation, but the 
> visible attributes and partitionKeys are restricted to expression ids that 
> the rest of the query actually needs from that MetastoreRelation.
> An example of logical plans and corresponding canonical logical plans: 
> https://gist.githubusercontent.com/mbautin/ef1317b341211d9606cf/raw
> (2) Tracking LogicalPlan fragments corresponding to SparkPlan fragments. When 
> generating a SparkPlan, we keep an optional reference to a LogicalPlan 
> instance in every node. This allows us to populate the cache with mappings 
> from canonical logical plans of query fragments to the corresponding RDDs 
> generated as part of query execution. Note that there is no new work 
> necessary to generate the RDDs, we are merely utilizing the RDDs that would 
> have been produced as part of SparkPlan execution anyway.
> (3) SparkPlan fragment substitution. After generating a SparkPlan and before 
> calling prepare() or execute() on it, we check if any of its nodes have an 
> associated LogicalPlan that maps to a canonical logical plan matching a cache 
> entry. If so, we substitute a PhysicalRDD (or a new class UnsafePhysicalRDD 
> wrapping an RDD of UnsafeRow) scanning the previously created RDD instead of 
> the current query fragment. If the expected column order differs from what 
> the current SparkPlan fragment produces, we add a projection to reorder the 
> columns. We also add safe/unsafe row conversions as necessary to match the 
> row type that is expected by the parent of the current SparkPlan fragment.
> (4) The execute() method of SparkPlan also needs to perform the cache lookup 
> and substitution described above before producing a new RDD for the current 
> SparkPlan node. The "loading cache" pattern (e.g. as implemented in Guava) 
> allows query fragments to be reused between simultaneously submitted queries: 
> whichever query runs execute() for a particular fragment's canonical logical 
> plan starts producing an RDD first, and if another query has a fragment with 
> the same canonical logical plan, it waits for the RDD to be produced by the 
> first query and inserts it in its SparkPlan instead.
> This kind of query fragment caching will mostly be useful 

[jira] [Updated] (SPARK-12258) Hive Timestamp UDF is binded with '1969-12-31 15:59:59.999999' for null value

2015-12-09 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-12258:
-
Component/s: SQL

> Hive Timestamp UDF is binded with '1969-12-31 15:59:59.99' for null value
> -
>
> Key: SPARK-12258
> URL: https://issues.apache.org/jira/browse/SPARK-12258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ian
>
> {code}
>   test("Timestamp UDF and Null value") {
> hiveContext.runSqlHive("CREATE TABLE ts_test (ts TIMESTAMP) STORED AS 
> TEXTFILE")
> hiveContext.runSqlHive("INSERT INTO TABLE ts_test VALUES(Null)")
> hiveContext.udf.register("dummy",
>   (ts: Timestamp) => ts
> )
> val result = hiveContext.sql("SELECT dummy(ts) FROM 
> ts_test").collect().mkString("\n")
> assertResult("[null]")(result)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6416) RDD.fold() requires the operator to be commutative

2016-01-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076367#comment-15076367
 ] 

Mark Hamstra commented on SPARK-6416:
-

I don't see any reason to change the API wrt `fold`.  With operations on RDDs, 
we generally try to achieve the same semantics as for Scala parallel 
collections, and that does hold true for `fold`:

{code}
scala> val list = (1 to 10000).toList

scala> list.fold(0)(_ + _)
res0: Int = 50005000

scala> list.par.fold(0)(_ + _)
res1: Int = 50005000

scala> list.fold(1)(_ + _)
res2: Int = 50005001

scala> list.par.fold(1)(_ + _)
res3: Int = 50005039
{code}


If we need to change anything, it would simply be to change our API 
documentation to more closely match that of the Scala Standard Library, where 
the first argument to `fold` is described as: "a neutral element for the fold 
operation, it may be added to the result an arbitrary number of times, not 
changing the result (e.g. Nil for list concatenation, 0 for addition, or 1 for 
multiplication)".

> RDD.fold() requires the operator to be commutative
> --
>
> Key: SPARK-6416
> URL: https://issues.apache.org/jira/browse/SPARK-6416
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Reporter: Josh Rosen
>Priority: Critical
>
> Spark's {{RDD.fold}} operation has some confusing behaviors when a 
> non-commutative reduce function is used.
> Here's an example, which was originally reported on StackOverflow 
> (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
> {code}
> sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
> 8
> {code}
> To understand what's going on here, let's look at the definition of Spark's 
> `fold` operation.  
> I'm going to show the Python version of the code, but the Scala version 
> exhibits the exact same behavior (you can also [browse the source on 
> GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]:
> {code}
> def fold(self, zeroValue, op):
> """
> Aggregate the elements of each partition, and then the results for all
> the partitions, using a given associative function and a neutral "zero
> value."
> The function C{op(t1, t2)} is allowed to modify C{t1} and return it
> as its result value to avoid object allocation; however, it should not
> modify C{t2}.
> >>> from operator import add
> >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
> 15
> """
> def func(iterator):
> acc = zeroValue
> for obj in iterator:
> acc = op(obj, acc)
> yield acc
> vals = self.mapPartitions(func).collect()
> return reduce(op, vals, zeroValue)
> {code}
> (For comparison, see the [Scala implementation of 
> `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
> Spark's `fold` operates by first folding each partition and then folding the 
> results.  The problem is that an empty partition gets folded down to the zero 
> element, so the final driver-side fold ends up folding one value for _every_ 
> partition rather than one value for each _non-empty_ partition.  This means 
> that the result of `fold` is sensitive to the number of partitions:
> {code}
> >>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
> 100
> >>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
> 50
> >>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
> 1
> {code}
> In this last case, what's happening is that the single partition is being 
> folded down to the correct value, then that value is folded with the 
> zero-value at the driver to yield 1.
> I think the underlying problem here is that our fold() operation implicitly 
> requires the operator to be commutative in addition to associative, but this 
> isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
> Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
> Therefore, I think we should update the documentation and examples to clarify 
> this requirement and explain that our fold acts more like a reduce with a 
> default value than the type of ordering-sensitive fold() that users may 
> expect in functional languages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6416) RDD.fold() requires the operator to be commutative

2016-01-02 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076704#comment-15076704
 ] 

Mark Hamstra commented on SPARK-6416:
-

I still don't see RDD#fold as being out of bounds with what should be expected 
from the Scala parallel collections model -- there, too, you can get confusing 
results if you don't pay attention to the partitioned nature of the operation:
{code}
scala> val list1 = (1 to 10000).toList

scala> val list2 = (1 to 100).toList

scala> list1.fold(0){ case (a, b) => a + 1 }
res0: Int = 10000

scala> list1.par.fold(0){ case (a, b) => a + 1 }
res1: Int = 162

scala> list2.fold(0){ case (a, b) => a + 1 }
res2: Int = 100

scala> list2.par.fold(0){ case (a, b) => a + 1 }
res3: Int = 7
{code}
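
The practical upshot for the counting example in the description: if the goal 
is a count, map each element to 1 and fold with ordinary addition, which is 
commutative and associative and therefore insensitive to partitioning (a quick 
sketch against any SparkContext {{sc}}):
{code}
// A commutative, associative op with a true neutral zero gives the same answer
// regardless of how the data is partitioned.
val count50 = sc.parallelize(1 to 10000, 50).map(_ => 1).fold(0)(_ + _)  // 10000
val count7  = sc.parallelize(1 to 10000, 7).map(_ => 1).fold(0)(_ + _)   // still 10000
{code}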

> RDD.fold() requires the operator to be commutative
> --
>
> Key: SPARK-6416
> URL: https://issues.apache.org/jira/browse/SPARK-6416
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Reporter: Josh Rosen
>Priority: Critical
>
> Spark's {{RDD.fold}} operation has some confusing behaviors when a 
> non-commutative reduce function is used.
> Here's an example, which was originally reported on StackOverflow 
> (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
> {code}
> sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
> 8
> {code}
> To understand what's going on here, let's look at the definition of Spark's 
> `fold` operation.  
> I'm going to show the Python version of the code, but the Scala version 
> exhibits the exact same behavior (you can also [browse the source on 
> GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]:
> {code}
> def fold(self, zeroValue, op):
> """
> Aggregate the elements of each partition, and then the results for all
> the partitions, using a given associative function and a neutral "zero
> value."
> The function C{op(t1, t2)} is allowed to modify C{t1} and return it
> as its result value to avoid object allocation; however, it should not
> modify C{t2}.
> >>> from operator import add
> >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
> 15
> """
> def func(iterator):
> acc = zeroValue
> for obj in iterator:
> acc = op(obj, acc)
> yield acc
> vals = self.mapPartitions(func).collect()
> return reduce(op, vals, zeroValue)
> {code}
> (For comparison, see the [Scala implementation of 
> `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
> Spark's `fold` operates by first folding each partition and then folding the 
> results.  The problem is that an empty partition gets folded down to the zero 
> element, so the final driver-side fold ends up folding one value for _every_ 
> partition rather than one value for each _non-empty_ partition.  This means 
> that the result of `fold` is sensitive to the number of partitions:
> {code}
> >>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
> 100
> >>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
> 50
> >>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
> 1
> {code}
> In this last case, what's happening is that the single partition is being 
> folded down to the correct value, then that value is folded with the 
> zero-value at the driver to yield 1.
> I think the underlying problem here is that our fold() operation implicitly 
> requires the operator to be commutative in addition to associative, but this 
> isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
> Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
> Therefore, I think we should update the documentation and examples to clarify 
> this requirement and explain that our fold acts more like a reduce with a 
> default value than the type of ordering-sensitive fold() that users may 
> expect in functional languages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1201) Do not materialize partitions whenever possible in BlockManager

2014-06-13 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031257#comment-14031257
 ] 

Mark Hamstra commented on SPARK-1201:
-

What causes this to not be fixable within the scope of 1.0.1?

> Do not materialize partitions whenever possible in BlockManager
> ---
>
> Key: SPARK-1201
> URL: https://issues.apache.org/jira/browse/SPARK-1201
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.1.0
>
>
> This is a slightly more complex version of SPARK-942 where we try to avoid 
> unrolling iterators in other situations where it is possible. SPARK-942 
> focused on the case where the DISK_ONLY storage level was used. There are 
> other cases though, such as if data is stored serialized and in memory but 
> there is not enough memory left to store the RDD.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1201) Do not materialize partitions whenever possible in BlockManager

2014-06-13 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031426#comment-14031426
 ] 

Mark Hamstra commented on SPARK-1201:
-

Okay, but my question is really whether resolution of this issue will require 
new API that will exclude it from consideration from 1.0.x or whether this will 
be just implementation details that can be considered a bug fix and included in 
the maintenance branch.

> Do not materialize partitions whenever possible in BlockManager
> ---
>
> Key: SPARK-1201
> URL: https://issues.apache.org/jira/browse/SPARK-1201
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.1.0
>
>
> This is a slightly more complex version of SPARK-942 where we try to avoid 
> unrolling iterators in other situations where it is possible. SPARK-942 
> focused on the case where the DISK_ONLY storage level was used. There are 
> other cases though, such as if data is stored serialized and in memory but 
> there is not enough memory left to store the RDD.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2158) FileAppenderSuite is not cleaning up after itself

2014-06-16 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-2158:
---

 Summary: FileAppenderSuite is not cleaning up after itself
 Key: SPARK-2158
 URL: https://issues.apache.org/jira/browse/SPARK-2158
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Priority: Trivial
 Fix For: 1.1.0


FileAppenderSuite is leaving behind the file core/stdout



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2198) Partition the scala build file so that it is easier to maintain

2014-06-19 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037431#comment-14037431
 ] 

Mark Hamstra commented on SPARK-2198:
-

While this is an admirable goal, I'm afraid that hand editing the SBT build 
files won't be a very durable solution.  That is because it is currently our 
goal to consolidate the Maven and SBT builds by deriving the SBT build 
configuration from the Maven POMs: 
https://issues.apache.org/jira/browse/SPARK-1776.  As such, any partitioning of 
the SBT build file will really need to be incorporated into the code that is 
generating that file from the Maven input. 

> Partition the scala build file so that it is easier to maintain
> ---
>
> Key: SPARK-2198
> URL: https://issues.apache.org/jira/browse/SPARK-2198
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Helena Edelson
>Priority: Minor
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Partition into standard Dependencies.scala, Version.scala, Settings.scala, and 
> Publish.scala, keeping the SparkBuild clean to describe the modules and their 
> deps, so that changes in versions, for example, need only be made in 
> Version.scala, settings changes such as those for scalac in Settings.scala, etc.
> I'd be happy to do this ([~helena_e])



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2126) Move MapOutputTracker behind ShuffleManager interface

2014-06-19 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-2126:


Assignee: Nan Zhu

> Move MapOutputTracker behind ShuffleManager interface
> -
>
> Key: SPARK-2126
> URL: https://issues.apache.org/jira/browse/SPARK-2126
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>
> This will require changing the interface between the DAGScheduler and 
> MapOutputTracker to be method calls on the ShuffleManager instead. However, 
> it will make it easier to do push-based shuffle and other ideas requiring 
> changes to map output tracking.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2424) ApplicationState.MAX_NUM_RETRY should be configurable

2014-07-09 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-2424:
---

 Summary: ApplicationState.MAX_NUM_RETRY should be configurable
 Key: SPARK-2424
 URL: https://issues.apache.org/jira/browse/SPARK-2424
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra


ApplicationState.MAX_NUM_RETRY, controlling the number of times standalone 
Executors can exit unsuccessfully before Master will remove the Application 
that the Executors are trying to run, is currently hard-coded to 10.  There's 
no reason why this should be a single, fixed value for all standalone clusters 
(e.g., it should probably scale with the number of Executors), so it should be 
SparkConf-able. 
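A minimal sketch of what "SparkConf-able" could look like (the key name 
"spark.deploy.maxExecutorRetries" is illustrative here, not an existing setting):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Fall back to the current hard-coded limit of 10 when the key is not set:
val maxNumRetry = conf.getInt("spark.deploy.maxExecutorRetries", 10)
{code}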



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2425) Standalone Master is too aggressive in removing Applications

2014-07-09 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-2425:
---

 Summary: Standalone Master is too aggressive in removing 
Applications
 Key: SPARK-2425
 URL: https://issues.apache.org/jira/browse/SPARK-2425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra


When standalone Executors trying to run a particular Application fail a 
cumulative ApplicationState.MAX_NUM_RETRY times, Master will remove the 
Application.  This will be true even if there actually are a number of 
Executors that are successfully running the Application.  This makes 
long-running standalone-mode Applications in particular unnecessarily 
vulnerable to limited failures in the cluster -- e.g., a single bad node on 
which Executors repeatedly fail for any reason can prevent an Application from 
starting or can result in a running Application being removed even though it 
could continue to run successfully (just not making use of all potential 
Workers and Executors.) 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2158) FileAppenderSuite is not cleaning up after itself

2014-07-13 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra resolved SPARK-2158.
-

Resolution: Fixed

> FileAppenderSuite is not cleaning up after itself
> -
>
> Key: SPARK-2158
> URL: https://issues.apache.org/jira/browse/SPARK-2158
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Trivial
> Fix For: 1.1.0
>
>
> FileAppenderSuite is leaving behind the file core/stdout



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2158) FileAppenderSuite is not cleaning up after itself

2014-07-13 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060201#comment-14060201
 ] 

Mark Hamstra commented on SPARK-2158:
-

This is fixed at 4cb33a83e0 from https://github.com/apache/spark/pull/1100


> FileAppenderSuite is not cleaning up after itself
> -
>
> Key: SPARK-2158
> URL: https://issues.apache.org/jira/browse/SPARK-2158
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Trivial
> Fix For: 1.1.0
>
>
> FileAppenderSuite is leaving behind the file core/stdout



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

2014-07-18 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066539#comment-14066539
 ] 

Mark Hamstra commented on SPARK-2568:
-

At least as much of a problem as making three passes through the data is that 
the count and sample are separate hidden/special jobs within the 
RangePartitioner that aren't launched by RDD actions under the user's control.  
This not only breaks Spark's "transformations are lazy; jobs are only launched 
by actions" model, but it also messes up the construction of FutureActions on 
sorted RDDs, the accounting of resource usage for jobs that include a sort, etc.
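A quick way to see this (a sketch, assuming a live SparkContext {{sc}}, e.g. in the 
shell, and watching the web UI):
{code}
// sortByKey is nominally a transformation, yet constructing its RangePartitioner
// already runs the hidden count/sample jobs -- before any action is called:
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
val sorted = pairs.sortByKey()   // jobs show up in the UI here
sorted.collect()                 // the job the user actually asked for
{code}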

> RangePartitioner should go through the data only once
> -
>
> Key: SPARK-2568
> URL: https://issues.apache.org/jira/browse/SPARK-2568
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Xiangrui Meng
>
> As of Spark 1.0, RangePartitioner goes through data twice: once to compute 
> the count and once to do sampling. As a result, to do sortByKey, Spark goes 
> through data 3 times (once to count, once to sample, and once to sort).
> RangePartitioner should go through data only once (remove the count step).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

2014-07-18 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066553#comment-14066553
 ] 

Mark Hamstra commented on SPARK-2568:
-

Sure, if they can be cleanly separated -- but there's also interaction with the 
ShuffleManager refactoring.

Do you have some strategy in mind for addressing just SPARK-2568 in isolation?

> RangePartitioner should go through the data only once
> -
>
> Key: SPARK-2568
> URL: https://issues.apache.org/jira/browse/SPARK-2568
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Xiangrui Meng
>
> As of Spark 1.0, RangePartitioner goes through data twice: once to compute 
> the count and once to do sampling. As a result, to do sortByKey, Spark goes 
> through data 3 times (once to count, once to sample, and once to sort).
> RangePartitioner should go through data only once (remove the count step).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb)

2014-07-22 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070908#comment-14070908
 ] 

Mark Hamstra commented on SPARK-2614:
-

It's also common for installers/admins to not want all of the examples 
installed on their production machines.

This issue is really touching on the fact that Spark's Debian packaging isn't 
presently everything that it could be.  It's really just a hack to allow 
deployment tools like Chef to be able to manage Spark.  To really do Debian 
packaging right, we should be creating multiple packages -- perhaps spark-core, 
spark-sql, spark-mllib, spark-graphx, spark-examples, etc.  The only thing that 
is really preventing us from doing this is that proper Debian packaging would 
require a package maintainer willing and committed to do all of the maintenance 
work. 

> Add the spark-examples-xxx-.jar to the Debian package created by 
> assembly/pom.xml (e.g. -Pdeb)
> --
>
> Key: SPARK-2614
> URL: https://issues.apache.org/jira/browse/SPARK-2614
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Deploy
>Reporter: Christian Tzolov
>
> The tar.gz distribution includes already the spark-examples.jar in the 
> bundle. It is a common practice for installers to run SparkPi as a smoke test 
> to verify that the installation is OK
> /usr/share/spark/bin/spark-submit \
>   --num-executors 10  --master yarn-cluster \
>   --class org.apache.spark.examples.SparkPi \
>   /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2529) Clean the closure in foreach and foreachPartition

2014-07-24 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074002#comment-14074002
 ] 

Mark Hamstra commented on SPARK-2529:
-

Actually, we were cleaning those closures, but that was removed in 
https://github.com/apache/spark/commit/6b288b75d4c05f42ad3612813dc77ff824bb6203 
-- not sure why.

> Clean the closure in foreach and foreachPartition
> -
>
> Key: SPARK-2529
> URL: https://issues.apache.org/jira/browse/SPARK-2529
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>
> Somehow we didn't clean the closure for foreach and foreachPartition. Should 
> do that.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11

2014-07-25 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075176#comment-14075176
 ] 

Mark Hamstra commented on SPARK-1812:
-

FWIW scalatest can be pushed to 2.2.0 without any pain, and 2.2 does add some 
nice new functionality.

> Support cross-building with Scala 2.11
> --
>
> Key: SPARK-1812
> URL: https://issues.apache.org/jira/browse/SPARK-1812
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Spark Core
>Reporter: Matei Zaharia
>Assignee: Prashant Sharma
>
> Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
> for both versions. From what I understand there are basically three things we 
> need to figure out:
> 1. Have a two versions of our dependency graph, one that uses 2.11 
> dependencies and the other that uses 2.10 dependencies.
> 2. Figure out how to publish different poms for 2.10 and 2.11.
> I think (1) can be accomplished by having a scala 2.11 profile. (2) isn't 
> really well supported by Maven since published pom's aren't generated 
> dynamically. But we can probably script around it to make it work. I've done 
> some initial sanity checks with a simple build here:
> https://github.com/pwendell/scala-maven-crossbuild



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-07-28 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076778#comment-14076778
 ] 

Mark Hamstra commented on SPARK-1860:
-

I don't think that there is much in the way of conflict, but something to be 
aware of is that the proposed fix to SPARK-2425 does modify Executor state 
transitions and cleanup: https://github.com/apache/spark/pull/1360

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Critical
> Fix For: 1.1.0
>
>
> With its default settings, the standalone worker cleanup code cleans up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2714) DAGScheduler logs jobid when runJob finishes

2014-07-29 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-2714:


Issue Type: Improvement  (was: Documentation)

> DAGScheduler logs jobid when runJob finishes
> 
>
> Key: SPARK-2714
> URL: https://issues.apache.org/jira/browse/SPARK-2714
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: YanTang Zhai
>Priority: Minor
>
> DAGScheduler logs jobid when runJob finishes



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-3289) Avoid job failures due to rescheduling of failing tasks on buggy machines

2014-08-28 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114611#comment-14114611
 ] 

Mark Hamstra commented on SPARK-3289:
-

https://github.com/apache/spark/pull/1360

> Avoid job failures due to rescheduling of failing tasks on buggy machines
> -
>
> Key: SPARK-3289
> URL: https://issues.apache.org/jira/browse/SPARK-3289
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>
> Some users have reported issues where a task fails due to an environment / 
> configuration issue on some machine, then the task is reattempted _on that 
> same buggy machine_ until the entire job fails because that single task 
> has failed too many times.
> To guard against this, maybe we should add some randomization in how we 
> reschedule failed tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8279) udf_round_3 test fails

2016-01-08 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090098#comment-15090098
 ] 

Mark Hamstra commented on SPARK-8279:
-

Why is `round` in Spark SQL using HALF_UP instead of the typical database 
practice of using the IEEE 754 standard for arithmetic operations -- i.e. 
HALF_EVEN, a.k.a. "Banker's rounding"?

> udf_round_3 test fails
> --
>
> Key: SPARK-8279
> URL: https://issues.apache.org/jira/browse/SPARK-8279
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yijie Shen
>Priority: Blocker
> Fix For: 1.5.0
>
>
> query
> {code}
> select round(cast(negative(pow(2, 31)) as INT)), round(cast((pow(2, 31) - 1) 
> as INT)), round(-32769), round(32768) from src tablesample (1 rows);
> {code}
> {code}
> [info] - udf_round_3 *** FAILED *** (4 seconds, 803 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: java.lang.Integer cannot be cast to java.lang.Double
> [info]   java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Double
> [info]at 
> scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.BinaryMathExpression.eval(math.scala:86)
> [info]at 
> org.apache.spark.sql.hive.HiveInspectors$class.toInspector(HiveInspectors.scala:628)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.toInspector(hiveUdfs.scala:148)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$argumentInspectors$1.apply(hiveUdfs.scala:160)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$argumentInspectors$1.apply(hiveUdfs.scala:160)
> [info]at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]at scala.collection.immutable.List.foreach(List.scala:318)
> [info]at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> [info]at 
> scala.collection.AbstractTraversable.map(Traversable.scala:105)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.argumentInspectors$lzycompute(hiveUdfs.scala:160)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.argumentInspectors(hiveUdfs.scala:160)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.returnInspector$lzycompute(hiveUdfs.scala:164)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.returnInspector(hiveUdfs.scala:163)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.dataType$lzycompute(hiveUdfs.scala:180)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.dataType(hiveUdfs.scala:180)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:31)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:31)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121)
> [info]at 
> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
> [info]at scala.collection.immutable.List.forall(List.scala:84)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:121)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:109)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:109)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121)
> [info]at 
> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
> [info]at scala.collection.immutable.List.forall(List.scala:84)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:121)
> [info]at 
> org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$ConvertNaNs$$anonfun$apply$2$$anonfun$applyOrElse$2.applyOrElse(HiveTypeCoercion.scala:138)
> [info]at 
> org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$ConvertNaNs$$anonfun$apply$2$$anonfun$applyOrElse$2.applyOrElse(HiveTypeCoercion.scala:136)
> [info]at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> [info]at 
> org.apache.s

[jira] [Commented] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"

2016-01-19 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106820#comment-15106820
 ] 

Mark Hamstra commented on SPARK-12485:
--

Actually, Sean, I'd argue that they are not the same and that something like 
"elastic scaling" be retained for another concept.  What Spark is doing with 
dynamic allocation is to re-allocate resources from an essentially fixed pool 
of cluster resources -- a particular Application may be re-assigned a different 
number of Executors, for example, but the total number of Worker nodes remains 
the same.  Elastic scaling, on the other hand, I would argue should be applied 
to dynamically changing the number of Worker nodes or other cluster resources 
-- changing the size of the pool, not just re-allocating from a constant size 
pool.
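For reference, what "dynamic allocation" covers today is configured along these 
lines (a sketch using the standard spark.dynamicAllocation.* settings; note that 
the cluster's Worker count is untouched):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")     // grow/shrink this app's executors...
  .set("spark.dynamicAllocation.minExecutors", "2")   // ...within the existing cluster
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.shuffle.service.enabled", "true")       // keeps shuffle data usable when executors go away
{code}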

> Rename "dynamic allocation" to "elastic scaling"
> 
>
> Key: SPARK-12485
> URL: https://issues.apache.org/jira/browse/SPARK-12485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Fewer syllables, sounds more natural.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"

2016-01-19 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106966#comment-15106966
 ] 

Mark Hamstra commented on SPARK-12485:
--

Right, I am saying that "dynamic allocation" is the right name for what Spark 
currently does, and that "elastic scaling" would be a good name for changing 
the number of cluster nodes (or otherwise modifying the size of the cluster 
resource pool), which Spark currently doesn't do or effectively support (but 
which would be a very nice feature to have.)

> Rename "dynamic allocation" to "elastic scaling"
> 
>
> Key: SPARK-12485
> URL: https://issues.apache.org/jira/browse/SPARK-12485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Fewer syllables, sounds more natural.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8279) udf_round_3 test fails

2016-01-26 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117924#comment-15117924
 ] 

Mark Hamstra commented on SPARK-8279:
-

It's a bit of an under-defined mess, actually.  I don't have a SQL Standard 
document in front of me to be able to say whether any of the standards 
specifies a particular rounding scheme.  W3 suggests: "Many database systems 
have adopted the IEEE 754 standard for arithmetic operations, according to 
which the default rounding behavior is 'round half to even.'" PostgreSQL and 
Oracle both use round half away from zero for NUMBER values but bankers 
rounding/half-even for BINARY_FLOAT and BINARY_DOUBLE (cf. IEEE754 Conformance 
for BINARY_FLOAT and BINARY_DOUBLE in 
https://docs.oracle.com/database/121/SQLRF/sql_elements001.htm#SQLRF0021 -- 
where for these types "The default rounding mode [i.e. round half to even] is 
supported.").  Other data-related tools also don't rigorously adhere to a 
standard rounding scheme, and Microsoft products are not consistent across, 
e.g., versions of VB, .NET, SQL Server, and EXCEL.

Much of the time, the precise rounding scheme being used doesn't matter much.  
When it does matter, though, we should at least have the ability to require 
that bankers rounding (and perhaps other rounding schemes) be enforced.
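To make the distinction concrete, a minimal sketch in plain Scala (java.math, not 
any Spark API):
{code}
import java.math.{BigDecimal => JBigDecimal}
import java.math.RoundingMode

val x = new JBigDecimal("2.5")
x.setScale(0, RoundingMode.HALF_UP)    // 3 -- round half away from zero
x.setScale(0, RoundingMode.HALF_EVEN)  // 2 -- banker's rounding, the IEEE 754 default
{code}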

> udf_round_3 test fails
> --
>
> Key: SPARK-8279
> URL: https://issues.apache.org/jira/browse/SPARK-8279
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yijie Shen
>Priority: Blocker
> Fix For: 1.5.0
>
>
> query
> {code}
> select round(cast(negative(pow(2, 31)) as INT)), round(cast((pow(2, 31) - 1) 
> as INT)), round(-32769), round(32768) from src tablesample (1 rows);
> {code}
> {code}
> [info] - udf_round_3 *** FAILED *** (4 seconds, 803 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: java.lang.Integer cannot be cast to java.lang.Double
> [info]   java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Double
> [info]at 
> scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.BinaryMathExpression.eval(math.scala:86)
> [info]at 
> org.apache.spark.sql.hive.HiveInspectors$class.toInspector(HiveInspectors.scala:628)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.toInspector(hiveUdfs.scala:148)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$argumentInspectors$1.apply(hiveUdfs.scala:160)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$argumentInspectors$1.apply(hiveUdfs.scala:160)
> [info]at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]at scala.collection.immutable.List.foreach(List.scala:318)
> [info]at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> [info]at 
> scala.collection.AbstractTraversable.map(Traversable.scala:105)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.argumentInspectors$lzycompute(hiveUdfs.scala:160)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.argumentInspectors(hiveUdfs.scala:160)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.returnInspector$lzycompute(hiveUdfs.scala:164)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.returnInspector(hiveUdfs.scala:163)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.dataType$lzycompute(hiveUdfs.scala:180)
> [info]at 
> org.apache.spark.sql.hive.HiveGenericUdf.dataType(hiveUdfs.scala:180)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:31)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:31)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121)
> [info]at 
> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
> [info]at scala.collection.immutable.List.forall(List.scala:84)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:121)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:109)
> [info]at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:109)
> [info]at 
> org.apache.spark.sql.catalyst.expr

[jira] [Created] (SPARK-13087) Grouping by a complex expression may lead to incorrect AttributeReferences in aggregations

2016-01-29 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-13087:


 Summary: Grouping by a complex expression may lead to incorrect 
AttributeReferences in aggregations
 Key: SPARK-13087
 URL: https://issues.apache.org/jira/browse/SPARK-13087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Mark Hamstra


This is a regression from 1.5.

An example of the failure:

Working with this table...
{code}
0: jdbc:hive2://10.1.3.203:1> DESCRIBE 
csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d;
+-----------------------+------------+----------+
|       col_name        | data_type  | comment  |
+-----------------------+------------+----------+
| c_date                | timestamp  | NULL     |
| c_count               | int        | NULL     |
| c_location_fips_code  | string     | NULL     |
| c_airtemp             | float      | NULL     |
| c_dewtemp             | float      | NULL     |
| c_pressure            | int        | NULL     |
| c_rain                | float      | NULL     |
| c_snow                | float      | NULL     |
+-----------------------+------------+----------+
{code}
...and this query (which isn't necessarily all that sensical or useful, but has 
been adapted from a similarly failing query that uses a custom UDF where the 
Spark SQL built-in `day` function has been substituted into this query)...
{code}
SELECT day ( c_date )  AS c_date, percentile_approx(c_rain, 0.5) AS 
c_expr_1256887735 FROM csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d GROUP BY day ( 
c_date )  ORDER BY c_date;
{code}
Spark 1.5 produces the expected results without error.

In Spark 1.6, this plan is produced...
{code}
Exchange rangepartitioning(c_date#63009 ASC,16), None
+- SortBasedAggregate(key=[dayofmonth(cast(c_date#63011 as date))#63020], functions=[(hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6f211801),c_rain#63017,0.5,false,0,0),mode=Complete,isDistinct=false)], output=[c_date#63009,c_expr_1256887735#63010])
   +- ConvertToSafe
      +- !Sort [dayofmonth(cast(c_date#63011 as date))#63020 ASC], false, 0
         +- !TungstenExchange hashpartitioning(dayofmonth(cast(c_date#63011 as date))#63020,16), None
            +- ConvertToUnsafe
               +- HiveTableScan [c_date#63011,c_rain#63017], MetastoreRelation default, csd_0ae1abc1_a3af_4c63_95b0_9599faca6c3d, None
{code}
...which fails with a TreeNodeException and stack traces that include this...
{code}
Caused by: ! org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 2842.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 2842.0 (TID 15007, ip-10-1-1-59.dev.clearstory.com): 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: dayofmonth(cast(c_date#63011 as date))#63020
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:86)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:85)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:85)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(Projection.scala:62)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:254)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:254)
at 
org.apache.spark.sql.execution.Exchange.org$apache$spark$sql$execution$Exc

[jira] [Created] (SPARK-10666) Use properties from ActiveJob associated with a Stage

2015-09-17 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-10666:


 Summary: Use properties from ActiveJob associated with a Stage
 Key: SPARK-10666
 URL: https://issues.apache.org/jira/browse/SPARK-10666
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 1.5.0, 1.4.1
Reporter: Mark Hamstra
Assignee: Mark Hamstra


This issue was addressed in #5494, but the fix in that PR, while safe in the 
sense that it will prevent the SparkContext from shutting down, misses the 
actual bug. The intent of submitMissingTasks should be understood as "submit 
the Tasks that are missing for the Stage, and run them as part of the ActiveJob 
identified by jobId". Because of a long-standing bug, the jobId parameter was 
never being used. Instead, we were trying to use the jobId with which the Stage 
was created -- which may no longer exist as an ActiveJob, hence the crash 
reported in SPARK-6880.

The correct fix is to use the ActiveJob specified by the supplied jobId 
parameter, which is guaranteed to exist at the call sites of submitMissingTasks.

This fix should be applied to all maintenance branches, since the bug has 
existed since 1.0.

Tasks for a Stage that was previously part of a Job that is no longer active 
would be re-submitted as though they were part of the prior Job and with no 
properties set. Since properties are what are used to set an other-than-default 
scheduling pool, this would affect FAIR scheduler usage, but it would also 
affect anything else that depends on the settings of the properties (which 
would be just user code at this point, since Spark itself doesn't really use 
the properties for anything other than Job Group and Description, which end up 
in the WebUI and can be used to kill by JobGroup, etc.) Even the default, 
FIFO scheduling would be affected, however, since the resubmission of the Tasks 
under the earlier jobId would effectively give them a higher priority/greater 
urgency than the ActiveJob that now actually needs them. In any event, the 
Tasks would generate correct results.
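For context, the properties in question are the thread-local job properties that, 
among other things, select the FAIR scheduler pool. A sketch of the user-facing 
side, assuming a live SparkContext sc (the pool name is illustrative):
{code}
// Properties set here travel with the jobs submitted from this thread; if a
// resubmitted task set silently loses them, it falls back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
sc.parallelize(1 to 100).count()
{code}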



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-09-17 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803261#comment-14803261
 ] 

Mark Hamstra commented on SPARK-6880:
-

see SPARK-10666

> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>Assignee: pankaj arora
> Fix For: 1.4.0
>
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs
> Below is the stack trace
> 15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
> failed; shutting down SparkContext
> java.util.NoSuchElementException: key not found: 28
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234024#comment-15234024
 ] 

Mark Hamstra commented on SPARK-9882:
-

This isn't a very well written JIRA.  You are just duplicating the description 
you gave in the PR.  That description is good for a PR -- it explains what is 
being done.  That is not good for a JIRA ticket, which should explain why 
something should be changed or added.

> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> This is a priority-based scheduling based on the interface proposed in #7958. 
> It uses a XML configuration file to specify the submission pools for 
> applications. This patch adds a new parameter `--pool` to SparkSubmit under 
> standalone mode to specify which pool the submitted application should be 
> assigned to. The priority of the submitted application is defined by the 
> assigned pool. It also defines the cores it can acquire.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9882) Priority-based scheduling for Spark applications

2016-04-10 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234024#comment-15234024
 ] 

Mark Hamstra edited comment on SPARK-9882 at 4/10/16 9:49 AM:
--

This isn't a very well written JIRA.  You are just duplicating some of the 
description you gave in the PR.  That description is good for a PR -- it 
explains what is being done.  That is not good for a JIRA ticket, which should 
explain why something should be changed or added.  In other words, the 
"Motivation" section of your PR description really belongs here in the JIRA 
description.


was (Author: markhamstra):
This isn't a very well written JIRA.  You are just duplicating the description 
you gave in the PR.  That description is good for a PR -- it explains what is 
being done.  That is not good for a JIRA ticket, which should explain why 
something should be changed or added.

> Priority-based scheduling for Spark applications
> 
>
> Key: SPARK-9882
> URL: https://issues.apache.org/jira/browse/SPARK-9882
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> This is a priority-based scheduling based on the interface proposed in #7958. 
> It uses a XML configuration file to specify the submission pools for 
> applications. This patch adds a new parameter `--pool` to SparkSubmit under 
> standalone mode to specify which pool the submitted application should be 
> assigned to. The priority of the submitted application is defined by the 
> assigned pool. It also defines the cores it can acquire.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14582) Increase the parallelism for small tables

2016-04-13 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240389#comment-15240389
 ] 

Mark Hamstra commented on SPARK-14582:
--

The total absence of any description in both this JIRA and the accompanying PR 
is really annoying -- especially because I find that queries involving small 
tables frequently suffer from using too much parallelism, not too little.

> Increase the parallelism for small tables
> -
>
> Key: SPARK-14582
> URL: https://issues.apache.org/jira/browse/SPARK-14582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27816) make TreeNode tag type safe

2019-05-24 Thread Mark Hamstra (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847624#comment-16847624
 ] 

Mark Hamstra commented on SPARK-27816:
--

Jira issues with no description are really irritating – even worse when there 
is little to no discussion in any associated PR.

> make TreeNode tag type safe
> ---
>
> Key: SPARK-27816
> URL: https://issues.apache.org/jira/browse/SPARK-27816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17064) Reconsider spark.job.interruptOnCancel

2016-08-15 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-17064:


 Summary: Reconsider spark.job.interruptOnCancel
 Key: SPARK-17064
 URL: https://issues.apache.org/jira/browse/SPARK-17064
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Spark Core
Reporter: Mark Hamstra


There is a frequent need or desire in Spark to cancel already running Tasks.  
This has been recognized for a very long time (see, e.g., the ancient TODO 
comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
never had more than an incomplete solution.  Killing running Tasks at the 
Executor level has been implemented by interrupting the threads running the 
Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 
addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task 
threads in this way has only been possible if interruptThread is true, and that 
typically comes from the setting of the interruptOnCancel property in the 
JobGroup, which in turn typically comes from the setting of 
spark.job.interruptOnCancel.  Because of concerns over 
https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
being marked dead when a Task thread is interrupted, the default value of the 
boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
running on Executor even when the Task has been canceled in the DAGScheduler, 
or the Stage has been aborted, or the Job has been killed, etc.
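For reference, the usual user-facing path to that flag is the job group API -- a 
sketch, assuming a live SparkContext sc (the group id and description are made up):
{code}
sc.setJobGroup("etl-2016-08-15", "nightly ETL", interruptOnCancel = true)
// ... later, from another thread:
sc.cancelJobGroup("etl-2016-08-15")  // with interruptOnCancel = true, running task threads are interrupted
{code}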

There are several issues resulting from this current state of affairs, and they 
each probably need to spawn their own JIRA issue and PR once we decide on an 
overall strategy here.  Among those issues:

* Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
versions that Spark now supports so that we set the default value of 
spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
* Even if interrupting Task threads is no longer an issue for HDFS, is it still 
enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need 
protection similar to what the current default value of 
spark.job.interruptOnCancel provides?
* If interrupting Task threads isn't safe enough, what should we do instead?
* Once we have a safe mechanism to stop and clean up after already executing 
Tasks, there is still the question of whether we _should_ end executing Tasks.  
While that is likely a good thing to do in cases where individual Tasks are 
lightweight in terms of resource usage, at least in some cases not all running 
Tasks should be ended: https://github.com/apache/spark/pull/12436  That means 
that we probably need to continue to make allowing Task interruption 
configurable at the Job or JobGroup level (and we need better documentation 
explaining how and when to allow interruption or not.)
* There is one place in the current code (TaskSetManager#handleSuccessfulTask) 
that hard codes interruptThread to "true".  This should be fixed, and similar 
misuses of killTask be denied in pull requests until this issue is adequately 
resolved.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17064) Reconsider spark.job.interruptOnCancel

2016-08-15 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421503#comment-15421503
 ] 

Mark Hamstra commented on SPARK-17064:
--

[~kayousterhout] [~r...@databricks.com] [~imranr]

> Reconsider spark.job.interruptOnCancel
> --
>
> Key: SPARK-17064
> URL: https://issues.apache.org/jira/browse/SPARK-17064
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks.  
> This has been recognized for a very long time (see, e.g., the ancient TODO 
> comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
> never had more than an incomplete solution.  Killing running Tasks at the 
> Executor level has been implemented by interrupting the threads running the 
> Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
>  addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting 
> Task threads in this way has only been possible if interruptThread is true, 
> and that typically comes from the setting of the interruptOnCancel property 
> in the JobGroup, which in turn typically comes from the setting of 
> spark.job.interruptOnCancel.  Because of concerns over 
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
> being marked dead when a Task thread is interrupted, the default value of the 
> boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
> running on Executor even when the Task has been canceled in the DAGScheduler, 
> or the Stage has been aborted, or the Job has been killed, etc.
> There are several issues resulting from this current state of affairs, and 
> they each probably need to spawn their own JIRA issue and PR once we decide 
> on an overall strategy here.  Among those issues:
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
> versions that Spark now supports so that we set the default value of 
> spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it 
> still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still 
> need protection similar to what the current default value of 
> spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> * Once we have a safe mechanism to stop and clean up after already executing 
> Tasks, there is still the question of whether we _should_ end executing 
> Tasks.  While that is likely a good thing to do in cases where individual 
> Tasks are lightweight in terms of resource usage, at least in some cases not 
> all running Tasks should be ended: https://github.com/apache/spark/pull/12436 
>  That means that we probably need to continue to make allowing Task 
> interruption configurable at the Job or JobGroup level (and we need better 
> documentation explaining how and when to allow interruption or not.)
> * There is one place in the current code 
> (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to 
> "true".  This should be fixed, and similar misuses of killTask be denied in 
> pull requests until this issue is adequately resolved.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17064) Reconsider spark.job.interruptOnCancel

2016-08-15 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-17064:
-
Description: 
There is a frequent need or desire in Spark to cancel already running Tasks.  
This has been recognized for a very long time (see, e.g., the ancient TODO 
comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
never had more than an incomplete solution.  Killing running Tasks at the 
Executor level has been implemented by interrupting the threads running the 
Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 
addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task 
threads in this way has only been possible if interruptThread is true, and that 
typically comes from the setting of the interruptOnCancel property in the 
JobGroup, which in turn typically comes from the setting of 
spark.job.interruptOnCancel.  Because of concerns over 
https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
being marked dead when a Task thread is interrupted, the default value of the 
boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
running on Executor even when the Task has been canceled in the DAGScheduler, 
or the Stage has been aborted, or the Job has been killed, etc.

There are several issues resulting from this current state of affairs, and they 
each probably need to spawn their own JIRA issue and PR once we decide on an 
overall strategy here.  Among those issues:

* Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
versions that Spark now supports so that we can set the default value of 
spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
* Even if interrupting Task threads is no longer an issue for HDFS, is it still 
enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need 
protection similar to what the current default value of 
spark.job.interruptOnCancel provides?
* If interrupting Task threads isn't safe enough, what should we do instead?
* Once we have a safe mechanism to stop and clean up after already executing 
Tasks, there is still the question of whether we _should_ end executing Tasks.  
While that is likely a good thing to do in cases where individual Tasks are 
lightweight in terms of resource usage, at least in some cases not all running 
Tasks should be ended: https://github.com/apache/spark/pull/12436  That means 
that we probably need to continue to make allowing Task interruption 
configurable at the Job or JobGroup level (and we need better documentation 
explaining how and when to allow interruption or not.)
* There is one place in the current code (TaskSetManager#handleSuccessfulTask) 
that hard codes interruptThread to "true".  This should be fixed, and similar 
misuses of killTask be denied in pull requests until this issue is adequately 
resolved.   

  was:
There is a frequent need or desire in Spark to cancel already running Tasks.  
This has been recognized for a very long time (see, e.g., the ancient TODO 
comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
never had more than an incomplete solution.  Killing running Tasks at the 
Executor level has been implemented by interrupting the threads running the 
Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 
addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task 
threads in this way has only been possible if interruptThread is true, and that 
typically comes from the setting of the interruptOnCancel property in the 
JobGroup, which in turn typically comes from the setting of 
spark.job.interruptOnCancel.  Because of concerns over 
https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
being marked dead when a Task thread is interrupted, the default value of the 
boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
running on Executor even when the Task has been canceled in the DAGScheduler, 
or the Stage has been abort, or the Job has been killed, etc.

There are several issues resulting from this current state of affairs, and they 
each probably need to spawn their own JIRA issue and PR once we decide on an 
overall strategy here.  Among those issues:

* Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
versions that Spark now supports so that we set the default value of 
spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
* Even if interrupting Task threads is no longer an issue for HDFS, is it still 
enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need 
protection similar to what the current defau

[jira] [Commented] (SPARK-27726) Performance of InMemoryStore suffers under load

2019-05-15 Thread Mark Hamstra (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840636#comment-16840636
 ] 

Mark Hamstra commented on SPARK-27726:
--

[~vanzin]

> Performance of InMemoryStore suffers under load
> ---
>
> Key: SPARK-27726
> URL: https://issues.apache.org/jira/browse/SPARK-27726
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: David C Navas
>Priority: Major
> Attachments: GCRateIssues.pdf, PerformanceBeforeAndAfter.pdf
>
>
> When our Spark system has been under load for an extended period of time, GC 
> remains highly active and the jobs page becomes unresponsive even when load 
> eases.  Please see the attached GCRateIssues for more details regarding the 
> problem definition.
> We found a number of separate issues which are detailed in the subtasks.  I 
> anticipate committing a single PR for all subtasks whose commits roughly 
> align with the descriptions in the subtasks.
> The performance of the code is measured before and after the change and is 
> attached in the document PerformanceBeforeAndAfter.  tl;dr in our use case, 
> we saw about five (computed) orders of magnitude improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-839) Bug in how failed executors are removed by ID from standalone cluster

2015-02-08 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra closed SPARK-839.
--
Resolution: Fixed

> Bug in how failed executors are removed by ID from standalone cluster
> -
>
> Key: SPARK-839
> URL: https://issues.apache.org/jira/browse/SPARK-839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.7.3
>Reporter: Mark Hamstra
>Priority: Critical
>
> ClearStory Data reported the following issue, where some hashmaps are indexed 
> by executorId and some by appId/executorId, and we use the wrong string to 
> search for an executor: https://github.com/clearstorydata/spark/pull/9. This 
> affects fault tolerance (FT) in standalone mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-839) Bug in how failed executors are removed by ID from standalone cluster

2015-02-08 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311734#comment-14311734
 ] 

Mark Hamstra commented on SPARK-839:


Fixed long ago.
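
For anyone coming to this later, a hypothetical sketch of the keying mismatch 
(the map contents and ids below are made up, not the actual Master data 
structures):

{code}
// Some maps were keyed by the full "appId/execId" string while the lookup used
// the bare execId (or vice versa), so the failed executor was never found.
val execsByFullId = Map("app-20150208-0001/0" -> "worker-a")

val execId = "0"
execsByFullId.get(execId)                        // None -- wrong key
execsByFullId.get(s"app-20150208-0001/$execId")  // Some(worker-a) -- correct key
{code}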

> Bug in how failed executors are removed by ID from standalone cluster
> -
>
> Key: SPARK-839
> URL: https://issues.apache.org/jira/browse/SPARK-839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.7.3
>Reporter: Mark Hamstra
>Priority: Critical
>
> ClearStory Data reported the following issue, where some hashmaps are indexed 
> by executorId and some by appId/executorId, and we use the wrong string to 
> search for an executor: https://github.com/clearstorydata/spark/pull/9. This 
> affects fault tolerance (FT) in standalone mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-05-19 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551647#comment-14551647
 ] 

Mark Hamstra commented on SPARK-6880:
-

This fix should also be applied as far back as we care to go in the maintenance 
branches, since the bug also exists in branch-1.0, branch-1.1, branch-1.2 and 
branch-1.3.

> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>Assignee: pankaj arora
> Fix For: 1.4.0
>
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs
> Below is the stack trace
> 15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
> failed; shutting down SparkContext
> java.util.NoSuchElementException: key not found: 28
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7750) Rename "json" endpoints to "api" endpoints

2015-05-19 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551718#comment-14551718
 ] 

Mark Hamstra commented on SPARK-7750:
-

Including `@Produces` annotations is also probably a good idea, especially if 
we are anticipating producing more than one kind of output.
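
Roughly what I have in mind, sketched against plain JAX-RS (the resource and 
payload below are illustrative, not the actual Spark status API classes):

{code}
import javax.ws.rs.{GET, Path, Produces}
import javax.ws.rs.core.MediaType

@Path("/v1/applications")
class ApplicationListResource {

  // Declaring the media type up front keeps the door open for serving
  // non-JSON representations under the same /api prefix later on.
  @GET
  @Produces(Array(MediaType.APPLICATION_JSON))
  def list(): String = """[{"id": "app-0001", "name": "example"}]"""
}
{code}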

> Rename "json" endpoints to "api" endpoints
> --
>
> Key: SPARK-7750
> URL: https://issues.apache.org/jira/browse/SPARK-7750
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Hari Shreedharan
>Priority: Critical
>
> Sorry for bringing this in at the last moment, but we should rename the 
> endpoints from /json/v1... to /api/v1... since it is likely 
> we will add APIs that can do more than return JSON, like being able to 
> download event logs. In that case the current endpoint name can be misleading.
> This is a compatibility issue, so it has to go into 1.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6416) RDD.fold() requires the operator to be commutative

2015-05-21 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554871#comment-14554871
 ] 

Mark Hamstra commented on SPARK-6416:
-

Why remove it?  It's very useful when used correctly.
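
To make "used correctly" concrete, a small sketch with an operator that is both 
associative and commutative and whose zero value is a true identity, in which 
case the result does not depend on partitioning (assumes an existing 
SparkContext {{sc}}):

{code}
// Sum is associative and commutative, and 0 is an identity for it, so the
// result is the same no matter how many partitions are used.
sc.parallelize(1 to 5, 1).fold(0)(_ + _)    // 15
sc.parallelize(1 to 5, 50).fold(0)(_ + _)   // still 15

// The problematic example from the description uses an op that ignores one of
// its arguments, so the result varies with the number of partitions.
sc.parallelize(1 to 5, 50).fold(0)((a, b) => a + 1)  // partition-count dependent
{code}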

> RDD.fold() requires the operator to be commutative
> --
>
> Key: SPARK-6416
> URL: https://issues.apache.org/jira/browse/SPARK-6416
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Reporter: Josh Rosen
>Priority: Critical
>
> Spark's {{RDD.fold}} operation has some confusing behaviors when a 
> non-commutative reduce function is used.
> Here's an example, which was originally reported on StackOverflow 
> (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
> {code}
> sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
> 8
> {code}
> To understand what's going on here, let's look at the definition of Spark's 
> `fold` operation.  
> I'm going to show the Python version of the code, but the Scala version 
> exhibits the exact same behavior (you can also [browse the source on 
> GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]):
> {code}
> def fold(self, zeroValue, op):
> """
> Aggregate the elements of each partition, and then the results for all
> the partitions, using a given associative function and a neutral "zero
> value."
> The function C{op(t1, t2)} is allowed to modify C{t1} and return it
> as its result value to avoid object allocation; however, it should not
> modify C{t2}.
> >>> from operator import add
> >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
> 15
> """
> def func(iterator):
> acc = zeroValue
> for obj in iterator:
> acc = op(obj, acc)
> yield acc
> vals = self.mapPartitions(func).collect()
> return reduce(op, vals, zeroValue)
> {code}
> (For comparison, see the [Scala implementation of 
> `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
> Spark's `fold` operates by first folding each partition and then folding the 
> results.  The problem is that an empty partition gets folded down to the zero 
> element, so the final driver-side fold ends up folding one value for _every_ 
> partition rather than one value for each _non-empty_ partition.  This means 
> that the result of `fold` is sensitive to the number of partitions:
> {code}
> >>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
> 100
> >>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
> 50
> >>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
> 1
> {code}
> In this last case, what's happening is that the single partition is being 
> folded down to the correct value, then that value is folded with the 
> zero-value at the driver to yield 1.
> I think the underlying problem here is that our fold() operation implicitly 
> requires the operator to be commutative in addition to associative, but this 
> isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
> Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
> Therefore, I think we should update the documentation and examples to clarify 
> this requirement and explain that our fold acts more like a reduce with a 
> default value than the type of ordering-sensitive fold() that users may 
> expect in functional languages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956642#comment-13956642
 ] 

Mark Hamstra commented on SPARK-1355:
-

Resources are limited as we progress toward our 1.0 release.  I can't see 
reallocating those commitments just to avoid looking funny in the estimation of 
some observers.  If someone not otherwise occupied wants to contribute the work 
to convert to Apache CMS, that's another thing.

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956685#comment-13956685
 ] 

Mark Hamstra commented on SPARK-1378:
-

A tip: It can be useful to create a file called something like 
~/.m2/empty-settings.xml that contains nothing but an empty {{<settings/>}} 
element.  Then you can test a build with no local settings interference via 
'mvn -s ~/.m2/empty-settings.xml ...'

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956692#comment-13956692
 ] 

Mark Hamstra commented on SPARK-1355:
-

That looked more like an insult than a contribution.

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1398) FindBugs 1 --> FindBugs 2

2014-04-02 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-1398:
---

 Summary: FindBugs 1 --> FindBugs 2 
 Key: SPARK-1398
 URL: https://issues.apache.org/jira/browse/SPARK-1398
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Priority: Minor


We're not making much use of FindBugs at this point, but findbugs-2.0.x is a 
drop-in replacement for 1.3.9 and does offer significant improvements 
(http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we 
want to be for Spark 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1398) Remove FindBugs jsr305 dependency

2014-04-03 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959064#comment-13959064
 ] 

Mark Hamstra commented on SPARK-1398:
-

UPDATE: We can actually get away with not including the jsr305 jar at all, so 
I've changed this JIRA and associated PR to do just that instead of bumping the 
FindBugs version.

> Remove FindBugs jsr305 dependency
> -
>
> Key: SPARK-1398
> URL: https://issues.apache.org/jira/browse/SPARK-1398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> We're not making much use of FindBugs at this point, but findbugs-2.0.x is a 
> drop-in replacement for 1.3.9 and does offer significant improvements 
> (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we 
> want to be for Spark 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1398) Remove FindBugs jsr305 dependency

2014-04-03 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1398:


Summary: Remove FindBugs jsr305 dependency  (was: FindBugs 1 --> FindBugs 2 
)

> Remove FindBugs jsr305 dependency
> -
>
> Key: SPARK-1398
> URL: https://issues.apache.org/jira/browse/SPARK-1398
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> We're not making much use of FindBugs at this point, but findbugs-2.0.x is a 
> drop-in replacement for 1.3.9 and does offer significant improvements 
> (http://findbugs.sourceforge.net/findbugs2.html), so it's probably where we 
> want to be for Spark 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1419) Apache parent POM to version 14

2014-04-04 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-1419:
---

 Summary: Apache parent POM to version 14
 Key: SPARK-1419
 URL: https://issues.apache.org/jira/browse/SPARK-1419
 Project: Spark
  Issue Type: Bug
  Components: Build, Deploy
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra


Latest version of the Apache parent POM includes several improvements and 
bugfixes, including to the release plugin: 
http://svn.apache.org/viewvc/maven/pom/tags/apache-14/pom.xml?r1=HEAD&r2=1434717&diff_format=h




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-24 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-1620:
---

 Summary: Uncaught exception from Akka scheduler
 Key: SPARK-1620
 URL: https://issues.apache.org/jira/browse/SPARK-1620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Mark Hamstra
Priority: Blocker


I've been looking at this one in the context of a BlockManagerMaster that OOMs 
and doesn't respond to heartBeat(), but I suspect that there may be problems 
elsewhere where we use Akka's scheduler.

The basic nature of the problem is that we are expecting exceptions thrown from 
a scheduled function to be caught in the thread where 
_ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In fact, 
the scheduled function runs on its own thread, so any exceptions that it throws 
are not caught in the thread that called schedule() -- e.g., unanswered 
BlockManager heartBeats (scheduled in BlockManager#initialize) that end up 
throwing exceptions in BlockManagerMaster#askDriverWithReply do not cause those 
exceptions to be handled by the Executor thread's UncaughtExceptionHandler. 
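
A stripped-down illustration of the thread hop, in plain Akka 2.x (none of this 
is Spark code):

{code}
import akka.actor.ActorSystem
import scala.concurrent.duration._

val system = ActorSystem("demo")
import system.dispatcher  // ExecutionContext used by the scheduler

try {
  // The scheduled block runs later, on a dispatcher thread, so this try/catch
  // on the calling thread never sees the exception.
  system.scheduler.scheduleOnce(1.second) {
    throw new RuntimeException("lost unless handled on the scheduler thread")
  }
} catch {
  case e: Exception => println(s"never reached: $e")
}

// One mitigation: catch (or at least log) inside the scheduled closure itself.
system.scheduler.scheduleOnce(1.second) {
  try throw new RuntimeException("boom")
  catch { case e: Exception => println(s"handled where it was thrown: $e") }
}
{code}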



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-25 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981779#comment-13981779
 ] 

Mark Hamstra commented on SPARK-1620:
-

I'm going to close this one for now, since I think something different is 
happening than I previously thought.

> Uncaught exception from Akka scheduler
> --
>
> Key: SPARK-1620
> URL: https://issues.apache.org/jira/browse/SPARK-1620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Mark Hamstra
>Priority: Blocker
>
> I've been looking at this one in the context of a BlockManagerMaster that 
> OOMs and doesn't respond to heartBeat(), but I suspect that there may be 
> problems elsewhere where we use Akka's scheduler.
> The basic nature of the problem is that we are expecting exceptions thrown 
> from a scheduled function to be caught in the thread where 
> _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In 
> fact, the scheduled function runs on its own thread, so any exceptions that 
> it throws are not caught in the thread that called schedule() -- e.g., 
> unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) 
> that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do 
> not cause those exceptions to be handled by the Executor thread's 
> UncaughtExceptionHandler. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-25 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra closed SPARK-1620.
---

Resolution: Invalid

> Uncaught exception from Akka scheduler
> --
>
> Key: SPARK-1620
> URL: https://issues.apache.org/jira/browse/SPARK-1620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Mark Hamstra
>Priority: Blocker
>
> I've been looking at this one in the context of a BlockManagerMaster that 
> OOMs and doesn't respond to heartBeat(), but I suspect that there may be 
> problems elsewhere where we use Akka's scheduler.
> The basic nature of the problem is that we are expecting exceptions thrown 
> from a scheduled function to be caught in the thread where 
> _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In 
> fact, the scheduled function runs on its own thread, so any exceptions that 
> it throws are not caught in the thread that called schedule() -- e.g., 
> unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) 
> that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do 
> not cause those exceptions to be handled by the Executor thread's 
> UncaughtExceptionHandler. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient

2014-04-30 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-1685:
---

 Summary: retryTimer not canceled on actor restart in Worker and 
AppClient
 Key: SPARK-1685
 URL: https://issues.apache.org/jira/browse/SPARK-1685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Mark Hamstra
Assignee: Mark Hamstra


Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actors restarting, registerWithMaster being called again on 
restart, and another retryTimer being scheduled without canceling the already 
running retryTimer.  Assuming that all of the rest of the restart logic is 
correct for these actors (which I don't believe is actually a given), having 
multiple retryTimers running presents at least a condition in which the 
restarted actor will not be able to make the full number of retry attempts 
before an earlier retryTimer takes the "give up" action.

Canceling the retryTimer in the actor's postStop hook should suffice. 
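
Roughly the shape of the fix, as an illustrative Akka sketch (the actor and 
method names are simplified stand-ins, not the actual Worker/AppClient code):

{code}
import akka.actor.{Actor, Cancellable}
import scala.concurrent.duration._

class RegistrationActor extends Actor {
  import context.dispatcher
  private var retryTimer: Option[Cancellable] = None

  override def preStart(): Unit = {
    // Kick off periodic registration attempts whenever the actor (re)starts.
    retryTimer = Some(context.system.scheduler.schedule(0.seconds, 5.seconds) {
      tryRegisterAllMasters()
    })
  }

  // postStop also runs as part of the default restart sequence, so canceling
  // here keeps a restarted actor from ending up with two competing timers.
  override def postStop(): Unit = retryTimer.foreach(_.cancel())

  private def tryRegisterAllMasters(): Unit = { /* attempt registration */ }

  def receive = Actor.emptyBehavior
}
{code}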



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient

2014-04-30 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1685:


Description: 
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that all of the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running presents at least a condition in which the restarted actor 
will not be able to make the full number of retry attempts before an earlier 
retryTimer takes the "give up" action.

Canceling the retryTimer in the actor's postStop hook should suffice. 

  was:
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actors restarting, registerWithMaster being called again on 
restart, and another retryTimer being scheduled without canceling the already 
running retryTimer.  Assuming that all of the rest of the restart logic is 
correct for these actors (which I don't believe is actually a given), having 
multiple retryTimers running presents at least a condition in which the 
restarted actor will not be able to make the full number of retry attempts 
before an earlier retryTimer takes the "give up" action.

Canceling the retryTimer in the actor's postStop hook should suffice. 


> retryTimer not canceled on actor restart in Worker and AppClient
> 
>
> Key: SPARK-1685
> URL: https://issues.apache.org/jira/browse/SPARK-1685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>
> Both deploy.worker.Worker and deploy.client.AppClient try to 
> registerWithMaster when those Actors start.  The attempt at registration is 
> accomplished by starting a retryTimer via the Akka scheduler that will use 
> the registered timeout interval and retry number to make repeated attempts to 
> register with all known Masters before giving up and either marking as dead 
> or calling System.exit.
> The receive methods of these actors can, however, throw exceptions, which 
> will lead to the actor restarting, registerWithMaster being called again on 
> restart, and another retryTimer being scheduled without canceling the already 
> running retryTimer.  Assuming that all of the rest of the restart logic is 
> correct for these actors (which I don't believe is actually a given), having 
> multiple retryTimers running presents at least a condition in which the 
> restarted actor will not be able to make the full number of retry attempts 
> before an earlier retryTimer takes the "give up" action.
> Canceling the retryTimer in the actor's postStop hook should suffice. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient

2014-04-30 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1685:


Description: 
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that all of the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running presents at least a condition in which the restarted actor 
may not be able to make the full number of retry attempts before an earlier 
retryTimer takes the "give up" action.

Canceling the retryTimer in the actor's postStop hook should suffice. 

  was:
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that all of the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running presents at least a condition in which the restarted actor 
will not be able to make the full number of retry attempts before an earlier 
retryTimer takes the "give up" action.

Canceling the retryTimer in the actor's postStop hook should suffice. 


> retryTimer not canceled on actor restart in Worker and AppClient
> 
>
> Key: SPARK-1685
> URL: https://issues.apache.org/jira/browse/SPARK-1685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>
> Both deploy.worker.Worker and deploy.client.AppClient try to 
> registerWithMaster when those Actors start.  The attempt at registration is 
> accomplished by starting a retryTimer via the Akka scheduler that will use 
> the registered timeout interval and retry number to make repeated attempts to 
> register with all known Masters before giving up and either marking as dead 
> or calling System.exit.
> The receive methods of these actors can, however, throw exceptions, which 
> will lead to the actor restarting, registerWithMaster being called again on 
> restart, and another retryTimer being scheduled without canceling the already 
> running retryTimer.  Assuming that all of the rest of the restart logic is 
> correct for these actors (which I don't believe is actually a given), having 
> multiple retryTimers running presents at least a condition in which the 
> restarted actor may not be able to make the full number of retry attempts 
> before an earlier retryTimer takes the "give up" action.
> Canceling the retryTimer in the actor's postStop hook should suffice. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-30 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra reopened SPARK-1620:
-


On further investigation, it looks like there really is a problem with 
exceptions thrown by scheduled functions not being caught by any uncaught 
exception handler.

> Uncaught exception from Akka scheduler
> --
>
> Key: SPARK-1620
> URL: https://issues.apache.org/jira/browse/SPARK-1620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Mark Hamstra
>Priority: Blocker
>
> I've been looking at this one in the context of a BlockManagerMaster that 
> OOMs and doesn't respond to heartBeat(), but I suspect that there may be 
> problems elsewhere where we use Akka's scheduler.
> The basic nature of the problem is that we are expecting exceptions thrown 
> from a scheduled function to be caught in the thread where 
> _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In 
> fact, the scheduled function runs on its own thread, so any exceptions that 
> it throws are not caught in the thread that called schedule() -- e.g., 
> unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) 
> that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do 
> not cause those exceptions to be handled by the Executor thread's 
> UncaughtExceptionHandler. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1686) Master switches thread when ElectedLeader

2014-04-30 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-1686:
---

 Summary: Master switches thread when ElectedLeader
 Key: SPARK-1686
 URL: https://issues.apache.org/jira/browse/SPARK-1686
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Mark Hamstra


In deploy.master.Master, the completeRecovery method is the last thing to be 
called when a standalone Master is recovering from failure.  It is responsible 
for resetting some state, relaunching drivers, and eventually resuming its 
scheduling duties.

There are currently four places in Master.scala where completeRecovery is 
called.  Three of them are from within the actor's receive method, and aren't 
problems.  The last starts from within receive when the ElectedLeader message 
is received, but the actual completeRecovery() call is made from the Akka 
scheduler.  That means that it will execute on a different scheduler thread, 
and Master itself will end up running (i.e., schedule()) from that Akka 
scheduler thread.  Among other things, that means that uncaught exception 
handling will be different -- https://issues.apache.org/jira/browse/SPARK-1620 
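
One way to keep that work on the actor's own thread is to schedule a message 
back to self instead of scheduling the method call; a hypothetical sketch (the 
class and message names are illustrative, not the real Master code):

{code}
import akka.actor.Actor
import scala.concurrent.duration._

case object ElectedLeader
case object CompleteRecovery

class MasterLike extends Actor {
  import context.dispatcher

  def receive = {
    case ElectedLeader =>
      // Schedule a message rather than the call itself, so the actual work
      // still happens inside receive, on the actor's thread.
      context.system.scheduler.scheduleOnce(10.seconds, self, CompleteRecovery)

    case CompleteRecovery =>
      completeRecovery() // runs on the actor thread, with its exception handling
  }

  private def completeRecovery(): Unit = { /* reset state, relaunch drivers, schedule() */ }
}
{code}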



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-30 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986175#comment-13986175
 ] 

Mark Hamstra commented on SPARK-1620:
-

Another two instances of the problem that actually aren't a problem at the 
moment: In deploy.worker.Worker and deploy.client.AppClient, 
tryRegisterAllMasters() can throw exceptions (e.g., from 
Master.toAkkaUrl(masterUrl)), and those exceptions would go unhandled in the 
calls from within the Akka scheduler -- i.e. within an invocation of 
registerWithMaster, all but the first call to tryRegisterAllMasters.  Right 
now, any later call to tryRegisterAllMasters() that would throw an exception 
should already have thrown in the first call that occurs outside the scheduled 
thread, so we should never get to the problem case.  If in the future, however, 
that behavior would change so that tryRegisterAllMasters() could succeed on the 
first call but throw within the later, scheduled calls (or if code added within 
the scheduled retryTimer could throw an exception) then the exception thrown 
from the scheduler thread will not be caught. 

> Uncaught exception from Akka scheduler
> --
>
> Key: SPARK-1620
> URL: https://issues.apache.org/jira/browse/SPARK-1620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Mark Hamstra
>Priority: Blocker
>
> I've been looking at this one in the context of a BlockManagerMaster that 
> OOMs and doesn't respond to heartBeat(), but I suspect that there may be 
> problems elsewhere where we use Akka's scheduler.
> The basic nature of the problem is that we are expecting exceptions thrown 
> from a scheduled function to be caught in the thread where 
> _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In 
> fact, the scheduled function runs on its own thread, so any exceptions that 
> it throws are not caught in the thread that called schedule() -- e.g., 
> unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) 
> that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do 
> not cause those exceptions to be handled by the Executor thread's 
> UncaughtExceptionHandler. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-30 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986180#comment-13986180
 ] 

Mark Hamstra commented on SPARK-1620:
-

And one last instance: In scheduler.TaskSchedulerImpl#start(), 
checkSpeculatableTasks() is scheduled to be called every SPECULATION_INTERVAL.  
If checkSpeculatableTasks() throws an exception, that exception will not be 
caught and no more invocations of checkSpeculatableTasks() will occur. 
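
That suggests recurring scheduled calls like this one want a defensive wrapper 
inside the scheduled closure; an illustrative sketch (the logging is a 
stand-in, not an existing Spark utility):

{code}
import akka.actor.ActorSystem
import scala.concurrent.duration._

val system = ActorSystem("demo")
import system.dispatcher

def checkSpeculatableTasks(): Unit = { /* may throw */ }

// Catch and log inside the scheduled block so one failed check does not
// silently end every future invocation of the recurring schedule.
system.scheduler.schedule(1.second, 1.second) {
  try checkSpeculatableTasks()
  catch { case e: Exception => println(s"speculation check failed: $e") }
}
{code}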

> Uncaught exception from Akka scheduler
> --
>
> Key: SPARK-1620
> URL: https://issues.apache.org/jira/browse/SPARK-1620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Mark Hamstra
>Priority: Blocker
>
> I've been looking at this one in the context of a BlockManagerMaster that 
> OOMs and doesn't respond to heartBeat(), but I suspect that there may be 
> problems elsewhere where we use Akka's scheduler.
> The basic nature of the problem is that we are expecting exceptions thrown 
> from a scheduled function to be caught in the thread where 
> _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In 
> fact, the scheduled function runs on its own thread, so any exceptions that 
> it throws are not caught in the thread that called schedule() -- e.g., 
> unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) 
> that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do 
> not cause those exceptions to be handled by the Executor thread's 
> UncaughtExceptionHandler. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1715) Ensure actor is self-contained in DAGScheduler

2014-05-05 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1715:


Fix Version/s: (was: 1.1.0)
   1.0.1

> Ensure actor is self-contained in DAGScheduler
> --
>
> Key: SPARK-1715
> URL: https://issues.apache.org/jira/browse/SPARK-1715
> Project: Spark
>  Issue Type: Bug
>Reporter: Nan Zhu
>Assignee: Nan Zhu
> Fix For: 1.0.1
>
>
> Though the current supervisor-child structure works fine for fault tolerance, 
> it violates the basic rule that an actor should be self-contained.
> We should forward messages from the supervisor to the child actor, so that we 
> can eliminate the hard-coded timeout threshold for starting the DAGScheduler 
> and provide a more convenient interface for future development, like a 
> parallel DAGScheduler or new changes to the DAGScheduler.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

