[jira] [Commented] (SPARK-2026) Maven hadoop* Profiles Should Set the expected Hadoop Version.

2014-06-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018512#comment-14018512
 ] 

Sean Owen commented on SPARK-2026:
--

A few people have mentioned and asked for this, especially as it helps the 
build work cleanly in IntelliJ. FWIW I would like this change too. Do you have 
a PR?

 Maven hadoop* Profiles Should Set the expected Hadoop Version.
 

 Key: SPARK-2026
 URL: https://issues.apache.org/jira/browse/SPARK-2026
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.0.0
Reporter: Bernardo Gomez Palacio

 The Maven Profiles that refer to _hadoopX_, e.g. hadoop2.4, should set the 
 expected _hadoop.version_.
 e.g.
 {code}
 <profile>
   <id>hadoop-2.4</id>
   <properties>
     <protobuf.version>2.5.0</protobuf.version>
     <jets3t.version>0.9.0</jets3t.version>
   </properties>
 </profile>
 {code}
 whereas it is suggested to be
 {code}
 <profile>
   <id>hadoop-2.4</id>
   <properties>
     <hadoop.version>2.4.0</hadoop.version>
     <yarn.version>${hadoop.version}</yarn.version>
     <protobuf.version>2.5.0</protobuf.version>
     <jets3t.version>0.9.0</jets3t.version>
   </properties>
 </profile>
 {code}
 Builds can still define the -Dhadoop.version option, but this will correctly 
 default the Hadoop version to the one expected for the selected profile.
 e.g.
 {code}
 $ mvn -P hadoop-2.4,yarn clean compile
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2031) DAGScheduler supports pluggable clock

2014-06-05 Thread Chen Chao (JIRA)
Chen Chao created SPARK-2031:


 Summary: DAGScheduler supports pluggable clock
 Key: SPARK-2031
 URL: https://issues.apache.org/jira/browse/SPARK-2031
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0, 0.9.1
Reporter: Chen Chao


DAGScheduler should support a pluggable clock, as TaskSetManager already does. 
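
For illustration, a minimal sketch of what a pluggable clock could look like (the trait and class names below are assumptions, not Spark's actual API):

{code}
trait Clock {
  def getTime(): Long
}

object SystemClock extends Clock {
  override def getTime(): Long = System.currentTimeMillis()
}

// A test clock that can be advanced deterministically, which is the main
// reason to make the clock pluggable.
class ManualClock(var time: Long = 0L) extends Clock {
  override def getTime(): Long = time
  def advance(ms: Long): Unit = { time += ms }
}
{code}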



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2031) DAGScheduler supports pluggable clock

2014-06-05 Thread Chen Chao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018533#comment-14018533
 ] 

Chen Chao commented on SPARK-2031:
--

PR https://github.com/apache/spark/pull/976

 DAGScheduler supports pluggable clock
 -

 Key: SPARK-2031
 URL: https://issues.apache.org/jira/browse/SPARK-2031
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Chen Chao

 DAGScheduler should support a pluggable clock, as TaskSetManager already does. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2032) Add an RDD.samplePartitions method for partition-level sampling

2014-06-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2032:


 Summary: Add an RDD.samplePartitions method for partition-level 
sampling
 Key: SPARK-2032
 URL: https://issues.apache.org/jira/browse/SPARK-2032
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia


This would allow us to sample a percent of the partitions and not have to 
materialize all of them. It's less uniform but much faster and may be useful 
for quickly exploring data.
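
A rough sketch of the idea, assuming a helper built on mapPartitionsWithIndex (the name samplePartitions and its parameters are illustrative, not a proposed API):

{code}
import scala.reflect.ClassTag
import scala.util.Random

import org.apache.spark.rdd.RDD

// Keep roughly `fraction` of the partitions and skip the rest, so the
// unsampled partitions are never materialized.
def samplePartitions[T: ClassTag](rdd: RDD[T], fraction: Double, seed: Long = 17L): RDD[T] = {
  val rng = new Random(seed)
  val keep = (0 until rdd.partitions.length).filter(_ => rng.nextDouble() < fraction).toSet
  rdd.mapPartitionsWithIndex(
    (idx, iter) => if (keep.contains(idx)) iter else Iterator.empty,
    preservesPartitioning = true)
}
{code}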



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2032) Add an RDD.samplePartitions method for partition-level sampling

2014-06-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2032:
-

Priority: Minor  (was: Major)

 Add an RDD.samplePartitions method for partition-level sampling
 ---

 Key: SPARK-2032
 URL: https://issues.apache.org/jira/browse/SPARK-2032
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Priority: Minor

 This would allow us to sample a percent of the partitions and not have to 
 materialize all of them. It's less uniform but much faster and may be useful 
 for quickly exploring data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1228) confusion matrix

2014-06-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1228.


   Resolution: Implemented
Fix Version/s: 1.0.0
 Assignee: Xiangrui Meng

Confusion matrix was added in v1.0 as part of binary classification model 
evaluation.

 confusion matrix
 

 Key: SPARK-1228
 URL: https://issues.apache.org/jira/browse/SPARK-1228
 Project: Spark
  Issue Type: Story
  Components: MLlib
Reporter: Arshak Navruzyan
Assignee: Xiangrui Meng
  Labels: classification
 Fix For: 1.0.0


 A utility that prints a confusion matrix for multi-class classification, 
 including precision and recall.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2033) Automatically cleanup checkpoint

2014-06-05 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-2033:
--

 Summary: Automatically cleanup checkpoint 
 Key: SPARK-2033
 URL: https://issues.apache.org/jira/browse/SPARK-2033
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Guoqiang Li


ContextCleaner currently cleans up RDDs, shuffles, and broadcasts asynchronously, 
but it does not clean up checkpoint data.
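
For context, a small sketch of how checkpoint data accumulates today (the checkpoint directory path is illustrative):

{code}
sc.setCheckpointDir("/tmp/spark-ckpt")
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()
rdd.count()
// The files written under /tmp/spark-ckpt/<app-uuid>/rdd-<id> stay on disk even
// after `rdd` becomes unreachable; ContextCleaner removes RDD, shuffle, and
// broadcast state, but not checkpoint data.
{code}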



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-2019:
---

Affects Version/s: (was: 0.9.1)
   0.9.0

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam
Priority: Critical

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-2019:
---

Fix Version/s: (was: 0.9.2)

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam
Priority: Critical

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2033) Automatically cleanup checkpoint

2014-06-05 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018621#comment-14018621
 ] 

Guoqiang Li commented on SPARK-2033:


The PR: https://github.com/apache/spark/pull/855

 Automatically cleanup checkpoint 
 -

 Key: SPARK-2033
 URL: https://issues.apache.org/jira/browse/SPARK-2033
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Guoqiang Li
Assignee: Guoqiang Li

 ContextCleaner currently cleans up RDDs, shuffles, and broadcasts asynchronously, 
 but it does not clean up checkpoint data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018048#comment-14018048
 ] 

sam edited comment on SPARK-2019 at 6/5/14 9:47 AM:


Sorry. Its -0.9.1- 0.9.0


was (Author: sams):
Sorry. Its 0.9.1

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam
Priority: Critical

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-2019:
---

Priority: Major  (was: Critical)

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018756#comment-14018756
 ] 

sam commented on SPARK-2019:


[~srowen] so when will CDH package up and distribute Spark 1.0.0? Currently 
they only distribute 0.9.0.  Thanks.

We seem to be hitting a few bugs with 0.9.0 - in particular we know that the 
s3 jets3t problem is 0.9.0-specific and rears its head when we add s3 creds to 
our hdfs-site.xml.

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018760#comment-14018760
 ] 

Sean Owen commented on SPARK-2019:
--

I believe that's coming with 5.1 but I don't know when that is scheduled. We 
can talk about issues like this offline -- really your best bet is support 
anyway.

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2035) Make a stage's call stack available on the UI

2014-06-05 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-2035:
--

Attachment: example-html.tgz

I've sent a pull request (https://github.com/apache/spark/pull/981), and here 
is an example of the resulting HTML. It is the worst possible example, because 
I used `spark-shell`, but it's hopefully enough to demo the idea.

 Make a stage's call stack available on the UI
 -

 Key: SPARK-2035
 URL: https://issues.apache.org/jira/browse/SPARK-2035
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Daniel Darabos
Priority: Minor
 Attachments: example-html.tgz


 Currently the stage table displays the file name and line number that is the 
 call site that triggered the given stage. This is enormously useful for 
 understanding the execution. But once a project adds utility classes and 
 other indirections, the call site can become less meaningful, because the 
 interesting line is further up the stack.
 An idea to fix this is to display the entire call stack that triggered the 
 stage. It would be collapsed by default and could be revealed with a click.
 I have started working on this. It is a good way to learn about how the RDD 
 interface ties into the UI.
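
 A minimal sketch of capturing the user-level call stack (illustrative only; Spark's actual call-site handling lives in org.apache.spark.util.Utils):
 {code}
 // Drop JDK, Scala, and Spark-internal frames so only the user's frames remain.
 def userCallStack(): Seq[String] =
   Thread.currentThread.getStackTrace.toSeq
     .map(_.toString)
     .filterNot(frame => frame.startsWith("java.") ||
                         frame.startsWith("scala.") ||
                         frame.startsWith("org.apache.spark."))
 {code}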



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-2024) Add saveAsSequenceFile to PySpark

2014-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-2024:
-

Comment: was deleted

(was: You meant SPARK-1416?)

 Add saveAsSequenceFile to PySpark
 -

 Key: SPARK-2024
 URL: https://issues.apache.org/jira/browse/SPARK-2024
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Matei Zaharia

 After SPARK-1416 we will be able to read SequenceFiles from Python, but it 
 remains to write them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2036) CaseConversionExpression should check if the evaluated value is null.

2014-06-05 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2036:


 Summary: CaseConversionExpression should check if the evaluated 
value is null.
 Key: SPARK-2036
 URL: https://issues.apache.org/jira/browse/SPARK-2036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


{{CaseConversionExpression}} should check if the evaluated value is {{null}}.
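
A minimal standalone sketch of the intended behaviour (not Catalyst's actual classes):

{code}
// If the child expression evaluates to null, the case conversion should
// return null instead of throwing a NullPointerException.
def upperOrNull(evaluatedChild: Any): Any =
  if (evaluatedChild == null) null else evaluatedChild.toString.toUpperCase

assert(upperOrNull(null) == null)
assert(upperOrNull("spark") == "SPARK")
{code}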



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2036) CaseConversionExpression should check if the evaluated value is null.

2014-06-05 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018928#comment-14018928
 ] 

Takuya Ueshin commented on SPARK-2036:
--

PRed: https://github.com/apache/spark/pull/982

 CaseConversionExpression should check if the evaluated value is null.
 -

 Key: SPARK-2036
 URL: https://issues.apache.org/jira/browse/SPARK-2036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin

 {{CaseConversionExpression}} should check if the evaluated value is {{null}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2037) yarn client mode doesn't support spark.yarn.max.executor.failures

2014-06-05 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-2037:


 Summary: yarn client mode doesn't support 
spark.yarn.max.executor.failures
 Key: SPARK-2037
 URL: https://issues.apache.org/jira/browse/SPARK-2037
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves


yarn client mode doesn't support the config spark.yarn.max.executor.failures.  
We should investigate if we need it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2019:
---

Description: 
We either have to reboot all the nodes, or run 'sudo service spark-worker 
restart' across our cluster.  I don't think this should happen - the job 
failures are often not even that bad.  There is a 5 upvoted SO question here: 
http://stackoverflow.com/questions/22

We shouldn't be giving restart privileges to our devs, and therefore our sysadm 
has to frequently restart the workers.  When the sysadm is not around, there is 
nothing our devs can do.

Many thanks

  was:
We either have to reboot all the nodes, or run 'sudo service spark-worker 
restart' across our cluster.  I don't think this should happen - the job 
failures are often not even that bad.  There is a 5 upvoted SO question here: 
http://stackoverflow.com/questions/22Hey 
@sam031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails

We shouldn't be giving restart privileges to our devs, and therefore our sysadm 
has to frequently restart the workers.  When the sysadm is not around, there is 
nothing our devs can do.

Many thanks


 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2029) Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2029.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 974
[https://github.com/apache/spark/pull/974]

 Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.
 ---

 Key: SPARK-2029
 URL: https://issues.apache.org/jira/browse/SPARK-2029
 Project: Spark
  Issue Type: Bug
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0


 Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2030) Bump SparkBuild.scala version number of branch-1.0 to 1.0.1-SNAPSHOT.

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2030.


   Resolution: Fixed
Fix Version/s: 1.0.1

Issue resolved by pull request 975
[https://github.com/apache/spark/pull/975]

 Bump SparkBuild.scala version number of branch-1.0 to 1.0.1-SNAPSHOT.
 -

 Key: SPARK-2030
 URL: https://issues.apache.org/jira/browse/SPARK-2030
 Project: Spark
  Issue Type: Bug
Reporter: Takuya Ueshin
 Fix For: 1.0.1


 Bump SparkBuild.scala version number of branch-1.0 to 1.0.1-SNAPSHOT.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1677) Allow users to avoid Hadoop output checks if desired

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1677.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

Issue resolved by pull request 947
[https://github.com/apache/spark/pull/947]

 Allow users to avoid Hadoop output checks if desired
 

 Key: SPARK-1677
 URL: https://issues.apache.org/jira/browse/SPARK-1677
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Nan Zhu
 Fix For: 1.0.1, 1.1.0


 For compatibility with older versions of Spark it would be nice to have an 
 option `spark.hadoop.validateOutputSpecs` (default true) and a description: 
 "If set to true, validates the output specification used in saveAsHadoopFile 
 and other variants. This can be disabled to silence exceptions due to 
 pre-existing output directories."
 This would just wrap the checking done in this PR:
 https://issues.apache.org/jira/browse/SPARK-1100
 https://github.com/apache/spark/pull/11
 by first checking the Spark conf.
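
 A hedged usage sketch of the option described above (master and app name are placeholders):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 // Disable the output-spec validation; "false" silences the exception thrown
 // for pre-existing output directories.
 val conf = new SparkConf()
   .setMaster("local")
   .setAppName("validate-output-specs-example")
   .set("spark.hadoop.validateOutputSpecs", "false")
 val sc = new SparkContext(conf)
 {code}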



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2039) Run hadoop output checks for all formats

2014-06-05 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2039:
--

 Summary: Run hadoop output checks for all formats
 Key: SPARK-2039
 URL: https://issues.apache.org/jira/browse/SPARK-2039
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Nan Zhu


Now that SPARK-1677 allows users to disable output checks, we should just run 
them for all types of output formats. I'm not sure why we didn't do this 
originally but it might have been out of defensiveness since we weren't sure 
what all implementations did.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2040) Support cross-building with Scala 2.11

2014-06-05 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2040:
--

 Summary: Support cross-building with Scala 2.11
 Key: SPARK-2040
 URL: https://issues.apache.org/jira/browse/SPARK-2040
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma


Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
for both versions. From what I understand there are basically three things we 
need to figure out:

1. Have two versions of our dependency graph, one that uses 2.11 dependencies 
and the other that uses 2.10 dependencies.
2. Figure out how to publish different poms for 2.10 and 2.11.

I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't 
really well supported by Maven since published poms aren't generated 
dynamically. But we can probably script around it to make it work. I've done 
some initial sanity checks with a simple build here:

https://github.com/pwendell/scala-maven-crossbuild



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1812) Support cross-building with Scala 2.11

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1812:
---

Description: 
Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
for both versions. From what I understand there are basically three things we 
need to figure out:

1. Have two versions of our dependency graph, one that uses 2.11 dependencies 
and the other that uses 2.10 dependencies.
2. Figure out how to publish different poms for 2.10 and 2.11.

I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't 
really well supported by Maven since published poms aren't generated 
dynamically. But we can probably script around it to make it work. I've done 
some initial sanity checks with a simple build here:

https://github.com/pwendell/scala-maven-crossbuild

  was:We should cross-build for this in addition to 2.10.


 Support cross-building with Scala 2.11
 --

 Key: SPARK-1812
 URL: https://issues.apache.org/jira/browse/SPARK-1812
 Project: Spark
  Issue Type: New Feature
  Components: Build, Spark Core
Reporter: Matei Zaharia
Assignee: Prashant Sharma

 Since Scala 2.10/2.11 are source compatible, we should be able to cross build 
 for both versions. From what I understand there are basically three things we 
 need to figure out:
 1. Have two versions of our dependency graph, one that uses 2.11 
 dependencies and the other that uses 2.10 dependencies.
 2. Figure out how to publish different poms for 2.10 and 2.11.
 I think (1) can be accomplished by having a Scala 2.11 profile. (2) isn't 
 really well supported by Maven since published poms aren't generated 
 dynamically. But we can probably script around it to make it work. I've done 
 some initial sanity checks with a simple build here:
 https://github.com/pwendell/scala-maven-crossbuild



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1812) Support Scala 2.11

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1812:
---

Assignee: Prashant Sharma

 Support Scala 2.11
 --

 Key: SPARK-1812
 URL: https://issues.apache.org/jira/browse/SPARK-1812
 Project: Spark
  Issue Type: New Feature
  Components: Build, Spark Core
Reporter: Matei Zaharia
Assignee: Prashant Sharma

 We should cross-build for this in addition to 2.10.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1812) Support cross-building with Scala 2.11

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1812:
---

Summary: Support cross-building with Scala 2.11  (was: Support Scala 2.11)

 Support cross-building with Scala 2.11
 --

 Key: SPARK-1812
 URL: https://issues.apache.org/jira/browse/SPARK-1812
 Project: Spark
  Issue Type: New Feature
  Components: Build, Spark Core
Reporter: Matei Zaharia
Assignee: Prashant Sharma

 We should cross-build for this in addition to 2.10.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1749) DAGScheduler supervisor strategy broken with Mesos

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1749:
---

Target Version/s: 1.0.1, 1.1.0  (was: 1.0.1)

 DAGScheduler supervisor strategy broken with Mesos
 --

 Key: SPARK-1749
 URL: https://issues.apache.org/jira/browse/SPARK-1749
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Core
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
Assignee: Mark Hamstra
Priority: Blocker
  Labels: mesos, scheduler, scheduling

 Any bad Python code will trigger this bug, for example 
 `sc.parallelize(range(100)).map(lambda n: undefined_variable * 2).collect()` 
 will cause an `undefined_variable isn't defined` error, which will cause Spark 
 to try to kill the task, resulting in the following stacktrace:
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:184)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:182)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:175)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:175)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1058)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499)
   at 
 org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151)
   at 
 org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147)
   at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295)
   at 
 akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253)
   at akka.actor.ActorCell.handleFailure(ActorCell.scala:338)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
   at akka.dispatch.Mailbox.run(Mailbox.scala:218)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This is because killTask isn't implemented for the MesosSchedulerBackend. I 
 assume this isn't pyspark-specific, as there will be other instances where 
 you might want to kill the task 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2041) Exception when querying when tableName == columnName

2014-06-05 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2041:
---

 Summary: Exception when querying when tableName == columnName
 Key: SPARK-2041
 URL: https://issues.apache.org/jira/browse/SPARK-2041
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust


{code}
[info]   java.util.NoSuchElementException: next on empty iterator
[info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
[info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
[info]   at 
scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
[info]   at scala.collection.IterableLike$class.head(IterableLike.scala:91)
[info]   at 
scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
[info]   at 
scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
[info]   at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:68)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
[info]   at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
[info]   at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
[info]   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
[info]   at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
[info]   at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2010) Support for nested data in PySpark SQL

2014-06-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2010:


Assignee: Kan Zhang  (was: Michael Armbrust)

 Support for nested data in PySpark SQL
 --

 Key: SPARK-2010
 URL: https://issues.apache.org/jira/browse/SPARK-2010
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Kan Zhang
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-2010) Support for nested data in PySpark SQL

2014-06-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-2010:
---

Assignee: Michael Armbrust

 Support for nested data in PySpark SQL
 --

 Key: SPARK-2010
 URL: https://issues.apache.org/jira/browse/SPARK-2010
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2026) Maven hadoop* Profiles Should Set the expected Hadoop Version.

2014-06-05 Thread Bernardo Gomez Palacio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019279#comment-14019279
 ] 

Bernardo Gomez Palacio commented on SPARK-2026:
---

I'll submit a PR [~srowen]. I am not using Hadoop 0.23 but my guess is that 
using 0.23.10 as default will suffice.

 Maven hadoop* Profiles Should Set the expected Hadoop Version.
 

 Key: SPARK-2026
 URL: https://issues.apache.org/jira/browse/SPARK-2026
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.0.0
Reporter: Bernardo Gomez Palacio

 The Maven Profiles that refer to _hadoopX_, e.g. hadoop2.4, should set the 
 expected _hadoop.version_.
 e.g.
 {code}
 <profile>
   <id>hadoop-2.4</id>
   <properties>
     <protobuf.version>2.5.0</protobuf.version>
     <jets3t.version>0.9.0</jets3t.version>
   </properties>
 </profile>
 {code}
 whereas it is suggested to be
 {code}
 <profile>
   <id>hadoop-2.4</id>
   <properties>
     <hadoop.version>2.4.0</hadoop.version>
     <yarn.version>${hadoop.version}</yarn.version>
     <protobuf.version>2.5.0</protobuf.version>
     <jets3t.version>0.9.0</jets3t.version>
   </properties>
 </profile>
 {code}
 Builds can still define the -Dhadoop.version option, but this will correctly 
 default the Hadoop version to the one expected for the selected profile.
 e.g.
 {code}
 $ mvn -P hadoop-2.4,yarn clean compile
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2042) Take triggers unneeded shuffle.

2014-06-05 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2042:
---

 Summary: Take triggers unneeded shuffle.
 Key: SPARK-2042
 URL: https://issues.apache.org/jira/browse/SPARK-2042
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust


This query really shouldn't trigger a shuffle:

{code}
sql("SELECT * FROM src LIMIT 10").take(5)
{code}

One fix would be to make the following changes:
 * Fix take to insert a logical limit and then collect()
 * Add a rule for collapsing adjacent limits (a sketch of such a rule follows below)
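
A toy sketch of the limit-collapsing rule, using stand-in plan nodes rather than Catalyst's actual classes:

{code}
sealed trait Plan
case class Scan(table: String) extends Plan
case class Limit(n: Int, child: Plan) extends Plan

// Adjacent limits can always be merged into a single limit of the smaller value.
def collapseLimits(plan: Plan): Plan = plan match {
  case Limit(outer, Limit(inner, child)) => collapseLimits(Limit(math.min(outer, inner), child))
  case Limit(n, child)                   => Limit(n, collapseLimits(child))
  case other                             => other
}

assert(collapseLimits(Limit(5, Limit(10, Scan("src")))) == Limit(5, Scan("src")))
{code}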




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-937) Executors that exit cleanly should not have KILLED status

2014-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-937:


Comment: was deleted

(was: Hi Aaron, are you still working on this one? If not, could you assign it 
to me? I have a PR for SPARK-1118 (closed as a duplicate of this JIRA) that I 
could re-submit for this one. If you are still working on it or plan to, feel 
free to pick whatever might be useful to you 
https://github.com/apache/spark/pull/306)

 Executors that exit cleanly should not have KILLED status
 -

 Key: SPARK-937
 URL: https://issues.apache.org/jira/browse/SPARK-937
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.7.3
Reporter: Aaron Davidson
Assignee: Kan Zhang
Priority: Critical
 Fix For: 1.1.0


 This is an unintuitive and overloaded status message when Executors are 
 killed during normal termination of an application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-937) Executors that exit cleanly should not have KILLED status

2014-06-05 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019309#comment-14019309
 ] 

Kan Zhang commented on SPARK-937:
-

PR: https://github.com/apache/spark/pull/306

 Executors that exit cleanly should not have KILLED status
 -

 Key: SPARK-937
 URL: https://issues.apache.org/jira/browse/SPARK-937
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.7.3
Reporter: Aaron Davidson
Assignee: Kan Zhang
Priority: Critical
 Fix For: 1.1.0


 This is an unintuitive and overloaded status message when Executors are 
 killed during normal termination of an application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2043) ExternalAppendOnlyMap doesn't always find matching keys

2014-06-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2043:


 Summary: ExternalAppendOnlyMap doesn't always find matching keys
 Key: SPARK-2043
 URL: https://issues.apache.org/jira/browse/SPARK-2043
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 0.9.1, 0.9.0
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Blocker


The current implementation reads one key with the next hash code as it finishes 
reading the keys with the current hash code, which may cause it to miss some 
matches of the next key. This can cause operations like join to give the wrong 
result when reduce tasks spill to disk and there are hash collisions, as values 
won't be matched together.
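
To make the collision scenario concrete, a tiny illustration (classic colliding strings, not Spark code):

{code}
// Two distinct keys can share a hash code, so a merge that stops reading at
// the first key of the next hash code can miss matches for that key.
val a = "Aa"
val b = "BB"
assert(a.hashCode == b.hashCode)
assert(a != b)
{code}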



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-06-05 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019394#comment-14019394
 ] 

Mridul Muralidharan commented on SPARK-2017:


Currently, for our jobs, I run with spark.ui.retainedStages=3 (so that there is 
some visibility into past stages): this is to prevent OOMs in the master when 
the number of tasks per stage is not low (50k, for example, is not very high imo).

The stage details UI becomes very sluggish to pretty much unresponsive for our 
jobs where tasks > 30k ... though that might also be a browser issue 
(firefox/chrome)?

 web ui stage page becomes unresponsive when the number of tasks is large
 

 Key: SPARK-2017
 URL: https://issues.apache.org/jira/browse/SPARK-2017
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
  Labels: starter

 {code}
 sc.parallelize(1 to 1000000, 1000000).count()
 {code}
 The above code creates one million tasks to be executed. The stage detail web 
 ui page takes forever to load (if it ever completes).
 There are again a few different alternatives:
 0. Limit the number of tasks we show.
 1. Pagination
 2. By default only show the aggregate metrics and failed tasks, and hide the 
 successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2045) Sort-based shuffle implementation

2014-06-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2045:


 Summary: Sort-based shuffle implementation
 Key: SPARK-2045
 URL: https://issues.apache.org/jira/browse/SPARK-2045
 Project: Spark
  Issue Type: New Feature
Reporter: Matei Zaharia


Building on the pluggability in SPARK-2044, a sort-based shuffle implementation 
that takes advantage of an Ordering for keys (or just sorts by hashcode for 
keys that don't have it) would likely improve performance and memory usage in 
very large shuffles. Our current hash-based shuffle needs an open file for each 
reduce task, which can fill up a lot of memory for compression buffers and 
cause inefficient IO. This would avoid both of those issues.
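
A toy sketch of the ordering idea (illustrative only, not the proposed implementation):

{code}
// Sort by the key Ordering when one is available; otherwise fall back to the
// key's hash code so that equal keys still end up adjacent on disk.
def sortForShuffle[K, V](records: Seq[(K, V)], keyOrdering: Option[Ordering[K]]): Seq[(K, V)] =
  keyOrdering match {
    case Some(ord) => records.sortBy(_._1)(ord)
    case None      => records.sortBy { case (k, _) => k.hashCode }
  }
{code}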



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2011) Eliminate duplicate join in Pregel

2014-06-05 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-2011:
--

Priority: Minor  (was: Major)

 Eliminate duplicate join in Pregel
 --

 Key: SPARK-2011
 URL: https://issues.apache.org/jira/browse/SPARK-2011
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Minor

 In the iteration loop, Pregel currently performs an innerJoin to apply 
 messages to vertices followed by an outerJoinVertices to join the resulting 
 subset of vertices back to the graph. These two operations could be merged 
 into a single call to joinVertices, which should be reimplemented in a more 
 efficient manner. This would allow us to examine only the vertices that 
 received messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019449#comment-14019449
 ] 

Patrick Wendell commented on SPARK-2019:


Hey @sams - I'm going to temporarily close this until you get a bit more 
information. But please do re-open this and/or open other JIRA's if you have 
any specific issues with 0.9.1 or 1.0.0 that you'd like to report.

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
  
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2019.


Resolution: Incomplete

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
  
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2046) Support config properties that are changeable across tasks/stages within a job

2014-06-05 Thread Zongheng Yang (JIRA)
Zongheng Yang created SPARK-2046:


 Summary: Support config properties that are changeable across 
tasks/stages within a job
 Key: SPARK-2046
 URL: https://issues.apache.org/jira/browse/SPARK-2046
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Zongheng Yang


Suppose an application consists of multiple stages, where some stages contain 
computation-intensive tasks, and other stages contain less 
computation-intensive (or otherwise ordinary) tasks. 

For such a job to run efficiently, it might make sense to provide the user a 
function to set spark.task.cpus to a high number right before the 
computation-intensive stages/tasks are generated in the user code, and to set 
the property to a lower number for other stages/tasks.

As a first step, supporting this feature at the stage level rather than at the 
more fine-grained task level might suffice.
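
A hypothetical usage sketch of the stage-level knob from user code. setLocalProperty is an existing SparkContext method, but honoring spark.task.cpus per stage is the feature being proposed here, so this only illustrates the desired semantics:

{code}
def expensiveComputation(i: Int): Int =
  (1 to 1000).foldLeft(i)((acc, j) => acc ^ (j * i))

sc.setLocalProperty("spark.task.cpus", "4")   // hoped-for: CPU-heavy stage gets more cores per task
sc.parallelize(1 to 1000000).map(expensiveComputation).count()
sc.setLocalProperty("spark.task.cpus", "1")   // back to the default for later stages
{code}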



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2046) Support config properties that are changeable across tasks/stages within a job

2014-06-05 Thread Zongheng Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019457#comment-14019457
 ] 

Zongheng Yang commented on SPARK-2046:
--

[~shivaram]

 Support config properties that are changeable across tasks/stages within a job
 --

 Key: SPARK-2046
 URL: https://issues.apache.org/jira/browse/SPARK-2046
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Zongheng Yang

 Suppose an application consists of multiple stages, where some stages contain 
 computation-intensive tasks, and other stages contain less 
 computation-intensive (or otherwise ordinary) tasks. 
 For such a job to run efficiently, it might make sense to provide the user a 
 function to set spark.task.cpus to a high number right before the 
 computation-intensive stages/tasks are getting generated in the user code, 
 and set the property to a lower number for other stages/tasks.
 As a first step, supporting this feature across stages instead of the more 
 fine-grained task-level might suffice. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2046) Support config properties that are changeable across tasks/stages within a job

2014-06-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019458#comment-14019458
 ] 

Shivaram Venkataraman commented on SPARK-2046:
--

FWIW I have an older implementation that did this using LocalProperties in 
SparkContext. 
https://github.com/shivaram/spark-1/commit/256a34c12d4f3c8ed1a09174f331868a7bf30e11
 

I haven't tested it in a setting with multiple jobs running at the same time 
though

 Support config properties that are changeable across tasks/stages within a job
 --

 Key: SPARK-2046
 URL: https://issues.apache.org/jira/browse/SPARK-2046
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Zongheng Yang

 Suppose an application consists of multiple stages, where some stages contain 
 computation-intensive tasks, and other stages contain less 
 computation-intensive (or otherwise ordinary) tasks. 
 For such a job to run efficiently, it might make sense to provide the user a 
 function to set spark.task.cpus to a high number right before the 
 computation-intensive stages/tasks are getting generated in the user code, 
 and set the property to a lower number for other stages/tasks.
 As a first step, supporting this feature across stages instead of the more 
 fine-grained task-level might suffice. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-06-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2047:


 Summary: Use less memory in AppendOnlyMap.destructiveSortedIterator
 Key: SPARK-2047
 URL: https://issues.apache.org/jira/browse/SPARK-2047
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia


This method tries to sort the key-value pairs in the map in-place but ends 
up allocating a Tuple2 object for each one, which allocates a nontrivial amount 
of memory (32 or more bytes per entry on a 64-bit JVM). We could instead try to 
sort the objects in-place within the data array, or allocate an int array 
with the indices and sort those using a custom comparator. The latter is 
probably easiest to begin with.
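
A rough sketch of the index-array option, assuming the flat layout data(2*i) = key, data(2*i+1) = value used by AppendOnlyMap (names are illustrative):

{code}
// Sort entry indices by key instead of materializing a Tuple2 per entry. A
// production version would sort the Int array in place with a primitive sort
// to avoid boxing; this only shows the comparator indirection.
def sortedEntryIndices(
    data: Array[AnyRef],
    numEntries: Int,
    keyComparator: java.util.Comparator[AnyRef]): Array[Int] = {
  (0 until numEntries).toArray.sortWith { (a, b) =>
    keyComparator.compare(data(2 * a), data(2 * b)) < 0
  }
}
{code}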



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-06-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2047:
-

Priority: Minor  (was: Major)

 Use less memory in AppendOnlyMap.destructiveSortedIterator
 --

 Key: SPARK-2047
 URL: https://issues.apache.org/jira/browse/SPARK-2047
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Priority: Minor

 This method tries to sort the key-value pairs in the map in-place but ends 
 up allocating a Tuple2 object for each one, which allocates a nontrivial 
 amount of memory (32 or more bytes per entry on a 64-bit JVM). We could 
 instead try to sort the objects in-place within the data array, or allocate 
 an int array with the indices and sort those using a custom comparator. The 
 latter is probably easiest to begin with.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-06-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2047:
-

Priority: Major  (was: Minor)

 Use less memory in AppendOnlyMap.destructiveSortedIterator
 --

 Key: SPARK-2047
 URL: https://issues.apache.org/jira/browse/SPARK-2047
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia

 This method tries to sort the key-value pairs in the map in-place but ends 
 up allocating a Tuple2 object for each one, which allocates a nontrivial 
 amount of memory (32 or more bytes per entry on a 64-bit JVM). We could 
 instead try to sort the objects in-place within the data array, or allocate 
 an int array with the indices and sort those using a custom comparator. The 
 latter is probably easiest to begin with.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2043) ExternalAppendOnlyMap doesn't always find matching keys

2014-06-05 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019482#comment-14019482
 ] 

Matei Zaharia commented on SPARK-2043:
--

https://github.com/apache/spark/pull/986

 ExternalAppendOnlyMap doesn't always find matching keys
 ---

 Key: SPARK-2043
 URL: https://issues.apache.org/jira/browse/SPARK-2043
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Blocker

 The current implementation reads one key with the next hash code as it 
 finishes reading the keys with the current hash code, which may cause it to 
 miss some matches of the next key. This can cause operations like join to 
 give the wrong result when reduce tasks spill to disk and there are hash 
 collisions, as values won't be matched together.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2049) avg function in aggregation may cause overflow

2014-06-05 Thread egraldlo (JIRA)
egraldlo created SPARK-2049:
---

 Summary: avg function in aggregation may cause overflow 
 Key: SPARK-2049
 URL: https://issues.apache.org/jira/browse/SPARK-2049
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: egraldlo


https://github.com/apache/spark/pull/978
Taking the avg of 2147483644 and 2147483646 will cause an overflow in the 
current implementation. Maybe this is a problem.
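
A small demonstration of the overflow (plain Scala arithmetic, not the SQL code path):

{code}
val x = 2147483644
val y = 2147483646
val intAvg  = (x + y) / 2                // Int sum wraps around and yields -3
val longAvg = (x.toLong + y.toLong) / 2  // 2147483645, the expected average
{code}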



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1988) Enable storing edges out-of-core

2014-06-05 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-1988.
---

Resolution: Fixed

This is mitigated by SPARK-1991, because the user can increase the number of 
edge partitions so that each edge partition individually fits in memory, then 
set the storage level of the edges to MEMORY_AND_DISK.

 Enable storing edges out-of-core
 

 Key: SPARK-1988
 URL: https://issues.apache.org/jira/browse/SPARK-1988
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Minor

 A graph's edges are usually the largest component of the graph, and a cluster 
 may not have enough memory to hold them. For example, a graph with 20 billion 
 edges requires at least 400 GB of memory, because each edge takes 20 bytes.
 GraphX only ever accesses the edges using full table scans or cluster scans 
 using the clustered index on source vertex ID. The edges are therefore 
 amenable to being stored on disk. EdgePartition should provide the option of 
 storing edges on disk transparently and streaming through them as needed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2042) Take triggers unneeded shuffle.

2014-06-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2042:


Assignee: Sameer Agarwal

 Take triggers unneeded shuffle.
 ---

 Key: SPARK-2042
 URL: https://issues.apache.org/jira/browse/SPARK-2042
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Sameer Agarwal

 This query really shouldn't trigger a shuffle:
 {code}
 sql("SELECT * FROM src LIMIT 10").take(5)
 {code}
 One fix would be to make the following changes:
  * Fix take to insert a logical limit and then collect()
  * Add a rule for collapsing adjacent limits (a rough sketch follows below)
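 
 A rough sketch of such a collapsing rule against Catalyst's Rule/transform API (the 
 names are illustrative, not taken from an actual patch):
 {code}
 import org.apache.spark.sql.catalyst.expressions.{If, LessThan}
 import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan}
 import org.apache.spark.sql.catalyst.rules.Rule
 
 object CombineLimits extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     // Two adjacent limits collapse into one that keeps the smaller count.
     case Limit(outer, Limit(inner, child)) =>
       Limit(If(LessThan(inner, outer), inner, outer), child)
   }
 }
 {code}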



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-05 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019511#comment-14019511
 ] 

Saisai Shao commented on SPARK-2044:


Hi Matei, it's great to see that you have plans for the shuffle code. We also 
implemented a pluggable shuffle manager and are planning to submit a PR. I think 
the basic idea is much the same; would you mind taking a look at our implementation 
(https://github.com/jerryshao/apache-spark/tree/shuffle-write-improvement/core/src/main/scala/org/apache/spark/storage/shuffle)?
Also, I'm wondering whether I could contribute to this proposal or whether there is 
a chance to cooperate. Thanks a lot.

 Pluggable interface for shuffles
 

 Key: SPARK-2044
 URL: https://issues.apache.org/jira/browse/SPARK-2044
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Attachments: Pluggableshuffleproposal.pdf


 Given that a lot of the current activity in Spark Core is in shuffles, I 
 wanted to propose factoring out shuffle implementations in a way that will 
 make experimentation easier. Ideally we will converge on one implementation, 
 but for a while, this could also be used to have several implementations 
 coexist. I'm suggesting this because I'm aware of at least three efforts to 
 look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
 people are investigating are:
 * Push-based shuffle where data moves directly from mappers to reducers
 * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
 lot with file handles and memory usage on large shuffles)
 * External spilling within a key
 * Changing the level of parallelism or even algorithm for downstream stages 
 at runtime based on statistics of the map output (this is a thing we had 
 prototyped in the Shark research project but never merged in core)
 I've attached a design doc with a proposed interface. It's not too crazy 
 because the interface between shuffles and the rest of the code is already 
 pretty narrow (just some iterators for reading data and a writer interface 
 for writing it). Bigger changes will be needed in the interaction with 
 DAGScheduler and BlockManager for some of the ideas above, but we can handle 
 those separately, and this interface will allow us to experiment with some 
 short-term stuff sooner.
 If things go well I'd also like to send a sort-based shuffle implementation 
 for 1.1, but we'll see how the timing on that works out.
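 
 For readers who have not opened the attached doc, a very rough sketch of the shape 
 such a narrow interface could take (names are illustrative only, not the proposed 
 API):
 {code}
 import org.apache.spark.{Partitioner, TaskContext}
 
 // Used by a map task to write its output records.
 trait ShuffleWriter[K, V] {
   def write(records: Iterator[Product2[K, V]]): Unit
   def stop(success: Boolean): Unit
 }
 
 // Used by a reduce task to fetch its input records.
 trait ShuffleReader[K, C] {
   def read(): Iterator[Product2[K, C]]
 }
 
 trait ShuffleManager {
   // Driver side: register a shuffle and return an opaque handle to pass to tasks.
   def registerShuffle(shuffleId: Int, numMaps: Int, partitioner: Partitioner): AnyRef
   // Executor side: one writer per map task, one reader per reduce task (or range).
   def getWriter[K, V](handle: AnyRef, mapId: Int, context: TaskContext): ShuffleWriter[K, V]
   def getReader[K, C](handle: AnyRef, startPartition: Int, endPartition: Int,
                       context: TaskContext): ShuffleReader[K, C]
   def stop(): Unit
 }
 {code}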



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2044) Pluggable interface for shuffles

2014-06-05 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019531#comment-14019531
 ] 

Raymond Liu edited comment on SPARK-2044 at 6/6/14 3:11 AM:


Hi Matei

Regarding the changes to the block manager:

That will allow ShuffleManagers to reuse a common block manager. 
However the interface also allows ShuffleManagers to try new approaches. 

Have you figured out what that interface should look like? I see the shuffle 
writer/reader interface is generalized to Product2, while eventually the specific 
shuffle module will interact with the disk and go through the block manager. Do you 
expect it to still be Product2 when talking to the DiskBlockManager, or to keep the 
current implementation that uses Files (where a lot of shortcuts are involved in 
various components such as shuffle, spill, etc.), or something else such as a 
buffer, an iterator, etc.?

I ask because we also have pluggable storage support in mind (SPARK-1733), and the 
actual IO for a store, even a disk store, might not always go through a File 
interface.




 Pluggable interface for shuffles
 

 Key: SPARK-2044
 URL: https://issues.apache.org/jira/browse/SPARK-2044
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Attachments: Pluggableshuffleproposal.pdf


 Given that a lot of the current activity in Spark Core is in shuffles, I 
 wanted to propose factoring out shuffle implementations in a way that will 
 make experimentation easier. Ideally we will converge on one implementation, 
 but for a while, this could also be used to have several implementations 
 coexist. I'm suggesting this because I'm aware of at least three efforts to 
 look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
 people are investigating are:
 * Push-based shuffle where data moves directly from mappers to reducers
 * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
 lot with file handles and memory usage on large shuffles)
 * External spilling within a key
 * Changing the level of parallelism or even algorithm for downstream stages 
 at runtime based on statistics of the map output (this is a thing we had 
 prototyped in the Shark research project but never merged in core)
 I've attached a design doc with a proposed interface. It's not too crazy 
 because the interface between shuffles and the rest of the code is already 
 pretty narrow (just some iterators for reading data and a writer interface 
 for writing it). Bigger changes will be needed in the interaction with 
 DAGScheduler and BlockManager for some of the ideas above, but we can handle 
 those separately, and this interface will allow us to experiment with some 
 short-term stuff sooner.
 If things go well I'd also like to send a sort-based shuffle implementation 
 for 1.1, but we'll see how the timing on that works out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2051) In yarn.ClientBase spark.yarn.dist.* do not work

2014-06-05 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-2051:
--

 Summary: In yarn.ClientBase spark.yarn.dist.* do not work
 Key: SPARK-2051
 URL: https://issues.apache.org/jira/browse/SPARK-2051
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Guoqiang Li


  Spark configuration
{{conf/spark-defaults.conf}}:
{quote}
spark.yarn.dist.archives /toona/conf
spark.executor.extraClassPath ./conf
spark.driver.extraClassPath  ./conf
{quote}


HDFS directory
{{hadoop dfs -cat /toona/conf/toona.conf}} :
{quote}
 redis.num=4
{quote}

The following command execution fails
{code}
YARN_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --num-executors 2 
--driver-memory 2g --executor-memory 2g --master yarn-cluster --class 
toona.DeployTest toona-assembly.jar  
{code}


The following is the test code:
{code}
package toona

import com.typesafe.config.Config
import com.typesafe.config.ConfigFactory

object DeployTest {
  def main(args: Array[String]) {
    val conf = ConfigFactory.load("toona.conf")
    val redisNum = conf.getInt("redis.num") // This throws a ConfigException
    assert(redisNum == 4)
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2051) In yarn.ClientBase spark.yarn.dist.* do not work

2014-06-05 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2051:
---

Description: 
  Spark configuration
{{conf/spark-defaults.conf}}:
{quote}
spark.yarn.dist.archives /toona/conf
spark.executor.extraClassPath ./conf
spark.driver.extraClassPath  ./conf
{quote}


HDFS directory
{{hadoop dfs -cat /toona/conf/toona.conf}} :
{quote}
 redis.num=4
{quote}

The following command execution fails
{code}
YARN_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --num-executors 2 
--driver-memory 2g --executor-memory 2g --master yarn-cluster --class 
toona.DeployTest toona-assembly.jar  
{code}


The following is the test code:
{code}
package toona

import com.typesafe.config.Config
import com.typesafe.config.ConfigFactory

object DeployTest {
  def main(args: Array[String]) {
    val conf = ConfigFactory.load("toona.conf")
    val redisNum = conf.getInt("redis.num") // This throws a ConfigException
    assert(redisNum == 4)
  }
}
{code}


 In yarn.ClientBase spark.yarn.dist.* do not work
 

 Key: SPARK-2051
 URL: https://issues.apache.org/jira/browse/SPARK-2051
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Guoqiang Li

   Spark configuration
 {{conf/spark-defaults.conf}}:
 {quote}
 spark.yarn.dist.archives /toona/conf
 spark.executor.extraClassPath ./conf
 spark.driver.extraClassPath  ./conf
 {quote}
 
 HDFS directory
 {{hadoop dfs -cat /toona/conf/toona.conf}} :
 {quote}
  redis.num=4
 {quote}
 
 The following command execution fails
 {code}
 YARN_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --num-executors 2 
 --driver-memory 2g --executor-memory 2g --master yarn-cluster --class 
 toona.DeployTest toona-assembly.jar  
 {code}
 
 The following is the test code:
 {code}
 package toona
 import com.typesafe.config.Config
 import com.typesafe.config.ConfigFactory
 object DeployTest {
   def main(args: Array[String]) {
     val conf = ConfigFactory.load("toona.conf")
     val redisNum = conf.getInt("redis.num") // This throws a ConfigException
     assert(redisNum == 4)
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2051) In yarn.ClientBase spark.yarn.dist.* do not work

2014-06-05 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019560#comment-14019560
 ] 

Guoqiang Li commented on SPARK-2051:


The PR: https://github.com/apache/spark/pull/969

 In yarn.ClientBase spark.yarn.dist.* do not work
 

 Key: SPARK-2051
 URL: https://issues.apache.org/jira/browse/SPARK-2051
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Guoqiang Li

   Spark configuration
 {{conf/spark-defaults.conf}}:
 {quote}
 spark.yarn.dist.archives /toona/conf
 spark.executor.extraClassPath ./conf
 spark.driver.extraClassPath  ./conf
 {quote}
 
 HDFS directory
 {{hadoop dfs -cat /toona/conf/toona.conf}} :
 {quote}
  redis.num=4
 {quote}
 
 The following command execution fails
 {code}
 YARN_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --num-executors 2 
 --driver-memory 2g --executor-memory 2g --master yarn-cluster --class 
 toona.DeployTest toona-assembly.jar  
 {code}
 
 The following is the test code:
 {code}
 package toona
 import com.typesafe.config.Config
 import com.typesafe.config.ConfigFactory
 object DeployTest {
   def main(args: Array[String]) {
     val conf = ConfigFactory.load("toona.conf")
     val redisNum = conf.getInt("redis.num") // This throws a ConfigException
     assert(redisNum == 4)
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2052) Add optimization for CaseConversionExpression's.

2014-06-05 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2052:


 Summary: Add optimization for CaseConversionExpression's.
 Key: SPARK-2052
 URL: https://issues.apache.org/jira/browse/SPARK-2052
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin


Add optimization for {{CaseConversionExpression}}'s.
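
Presumably something along the lines of collapsing nested case conversions, e.g. (a 
guess at the intent; names are not taken from the PR):
{code}
import org.apache.spark.sql.catalyst.expressions.{Lower, Upper}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object SimplifyCaseConversions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Only the outermost conversion matters: UPPER(LOWER(x)) == UPPER(x), and so on.
    case Upper(Upper(child)) => Upper(child)
    case Upper(Lower(child)) => Upper(child)
    case Lower(Upper(child)) => Lower(child)
    case Lower(Lower(child)) => Lower(child)
  }
}
{code}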



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2052) Add optimization for CaseConversionExpression's.

2014-06-05 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019578#comment-14019578
 ] 

Takuya Ueshin commented on SPARK-2052:
--

PRed: https://github.com/apache/spark/pull/990

 Add optimization for CaseConversionExpression's.
 

 Key: SPARK-2052
 URL: https://issues.apache.org/jira/browse/SPARK-2052
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin

 Add optimization for {{CaseConversionExpression}}'s.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1704) java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*])

2014-06-05 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019599#comment-14019599
 ] 

Reynold Xin commented on SPARK-1704:


EXPLAIN should probably just print out Spark SQL's own query plan instead of 
delegating to Hive ...
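
In other words, something like the following (a usage sketch only; HiveContext.hql and 
queryExecution/executedPlan are the 1.0 APIs visible in the stack trace below, and sc 
is assumed to be an existing SparkContext):
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext (assumed)
// Print Spark SQL's own plan for the inner query rather than handing EXPLAIN to Hive.
val plan = hiveContext.hql("SELECT * FROM src").queryExecution.executedPlan
println(plan)
{code}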

 java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
 (Project [*])
 

 Key: SPARK-1704
 URL: https://issues.apache.org/jira/browse/SPARK-1704
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
 Environment: linux
Reporter: Yangjp
  Labels: sql
 Fix For: 1.1.0

   Original Estimate: 612h
  Remaining Estimate: 612h

 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src
 14/05/03 22:08:40 INFO ParseDriver: Parse Completed
 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION :
 java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
 (Project [*])
 at scala.Predef$.assert(Predef.scala:179)
 at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248)
 at 
 org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39)
 at 
 org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
 at 
 org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
 at 
 org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
 at 
 org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
 at 
 org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124)
 at 
 org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:701)



--
This message was sent by Atlassian JIRA
(v6.2#6252)