[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-09 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126686#comment-14126686
 ] 

Shay Rojansky commented on SPARK-2972:
--

 you're right! imho, this means your program is written better than the 
 examples. it would be good to enhance the examples w/ try/finally semantics. 
 however,

Then I can submit a pull request for that, no problem.

 getting the shutdown semantics right is difficult, and may not apply broadly 
 across applications. for instance, your application may want to catch a 
 failure in stop() and retry to make sure that a history record is written. 
 another application may be ok w/ best effort writing history events. still 
 another application may want to exit w/o stop() to avoid having a history 
 event written.

I don't think explicit stop() should be removed - of course users may choose to 
manually manage stop(), catch exceptions and retry, etc. For me it's just a 
question of what to do with a context that *didn't* get explicitly closed at 
the end of the application.

As to apps that need to exit without a history event - it's a requirement 
that's hard to imagine (for me). At least with YARN/Mesos you will be leaving 
traces anyway, and these traces will be partial and difficult to understand, 
since the corresponding Spark traces haven't been produced.

 asking the context creator to do context destruction shifts burden to the 
 application writer and maintains flexibility for applications.

I guess it's a question of how high-level a tool you want Spark to be. It seems 
a bit strange for Spark to handle so much of the troublesome low-level details, 
while forcing the user to boilerplate-wrap all their programs with try/finally.

But I do understand the points you're making, and it can be argued both ways. At 
a minimum, I suggest having context implement the language-specific dispose 
patterns ('using' in Java, 'with' in Python), so at least the code looks better?

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.
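 For reference, a minimal sketch of this workaround (the application body and input path below are placeholders, not from the original report):
 {code}
 from pyspark import SparkContext
 
 sc = SparkContext(appName="MyApp")
 try:
     # application logic goes here; counting lines is only a placeholder
     print(sc.textFile("hdfs:///tmp/input.txt").count())
 finally:
     # runs even if the job above fails, so APPLICATION_COMPLETE gets written
     sc.stop()
 {code}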



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3452:
---
Priority: Critical  (was: Major)

 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical

 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3452:
---
Affects Version/s: 1.1.0
   1.0.0

 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical

 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2014-09-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3452:
--

 Summary: Maven build should skip publishing artifacts people 
shouldn't depend on
 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Prashant Sharma


I think it's easy to do this by just adding a skip configuration somewhere. We 
shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126698#comment-14126698
 ] 

Apache Spark commented on SPARK-3404:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2328

 SparkSubmitSuite fails with spark-submit exits with code 1
 

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
Reporter: Sean Owen
Priority: Critical

 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely due to some difference in how the tests 
 are run rather than a problem with the test or the core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330, but the 
 cause identified in that JIRA is, at least, not the only cause. (Although it 
 wouldn't hurt to be doubly sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} as a single {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
 least print stdout to the log too.
 The SparkSubmit program exits with 1 when the main class it is supposed to 
 run is not found 
 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322). 
 This is the case, for example, for SimpleApplicationTest 
 (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339).
 The test actually submits an empty JAR not containing this class. It relies 
 on {{spark-submit}} finding the class within the compiled test-classes of the 
 Spark project. However it does seem to be compiled and present even with 
 Maven.
 If modified to print stdout and stderr, and dump the 

[jira] [Created] (SPARK-3453) Refactor Netty module to use BlockTransferService

2014-09-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3453:
--

 Summary: Refactor Netty module to use BlockTransferService
 Key: SPARK-3453
 URL: https://issues.apache.org/jira/browse/SPARK-3453
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3453) Refactor Netty module to use BlockTransferService

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126726#comment-14126726
 ] 

Apache Spark commented on SPARK-3453:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2330

 Refactor Netty module to use BlockTransferService
 -

 Key: SPARK-3453
 URL: https://issues.apache.org/jira/browse/SPARK-3453
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree

2014-09-09 Thread Qiping Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126791#comment-14126791
 ] 

Qiping Li commented on SPARK-3272:
--

Hi Joseph, I created a PR [#2332|https://github.com/apache/spark/pull/2332] 
based on our discussion.
Could you please help me review it? Thanks for your help.

 Calculate prediction for nodes separately from calculating information gain 
 for splits in decision tree
 ---

 Key: SPARK-3272
 URL: https://issues.apache.org/jira/browse/SPARK-3272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Qiping Li
 Fix For: 1.1.0


 In the current implementation, the prediction for a node is calculated along 
 with the information gain stats for each possible split. The value to predict 
 for a specific node is determined no matter what the splits are.
 To save computation, we can calculate the prediction first and then calculate 
 the information gain stats for each split.
 This is also necessary if we want to support a minimum-instances-per-node 
 parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), 
 because when no split satisfies the minimum-instances requirement, we don't 
 use the information gain of any split, yet there should still be a way to get 
 the prediction value.
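 As a schematic illustration of the proposed ordering (plain Python, not the MLlib code; the entropy-based gain and the counts-per-split input format are simplifying assumptions):
 {code}
 import math
 
 def entropy(counts):
     total = float(sum(counts))
     return -sum((c / total) * math.log(c / total, 2) for c in counts if c > 0)
 
 def choose_split(node_counts, splits, min_instances_per_node):
     # 1. The node's prediction depends only on its own label counts,
     #    so it can be computed before any split is evaluated.
     prediction = max(range(len(node_counts)), key=lambda k: node_counts[k])
 
     # 2. Only then evaluate the information gain of each candidate split,
     #    skipping splits that violate the minimum-instances constraint.
     total = float(sum(node_counts))
     parent = entropy(node_counts)
     best = None
     for split, (left, right) in splits:
         if sum(left) < min_instances_per_node or sum(right) < min_instances_per_node:
             continue
         gain = parent - (sum(left) / total) * entropy(left) - (sum(right) / total) * entropy(right)
         if best is None or gain > best[1]:
             best = (split, gain)
 
     # Even if every split was rejected, the prediction is still available.
     return prediction, best
 {code}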



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126799#comment-14126799
 ] 

Apache Spark commented on SPARK-3272:
-

User 'chouqin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2332

 Calculate prediction for nodes separately from calculating information gain 
 for splits in decision tree
 ---

 Key: SPARK-3272
 URL: https://issues.apache.org/jira/browse/SPARK-3272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Qiping Li
 Fix For: 1.1.0


 In the current implementation, the prediction for a node is calculated along 
 with the information gain stats for each possible split. The value to predict 
 for a specific node is determined no matter what the splits are.
 To save computation, we can calculate the prediction first and then calculate 
 the information gain stats for each split.
 This is also necessary if we want to support a minimum-instances-per-node 
 parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), 
 because when no split satisfies the minimum-instances requirement, we don't 
 use the information gain of any split, yet there should still be a way to get 
 the prediction value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2207) Add minimum information gain and minimum instances per node as training parameters for decision tree.

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126798#comment-14126798
 ] 

Apache Spark commented on SPARK-2207:
-

User 'chouqin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2332

 Add minimum information gain and minimum instances per node as training 
 parameters for decision tree.
 -

 Key: SPARK-2207
 URL: https://issues.apache.org/jira/browse/SPARK-2207
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Manish Amde
Assignee: Qiping Li





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3222) cross join support in HiveQl

2014-09-09 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang resolved SPARK-3222.

   Resolution: Fixed
Fix Version/s: 1.1.0

Resolved by PR #2124.

 cross join support in HiveQl
 

 Key: SPARK-3222
 URL: https://issues.apache.org/jira/browse/SPARK-3222
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
 Fix For: 1.1.0


 Spark SQL HiveQL should support cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3454) Expose JSON representation of data shown in WebUI

2014-09-09 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3454:
--
Summary: Expose JSON representation of data shown in WebUI  (was: Expose 
JSON expression of data shown in WebUI)

 Expose JSON representation of data shown in WebUI
 -

 Key: SPARK-3454
 URL: https://issues.apache.org/jira/browse/SPARK-3454
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 If the WebUI supported extracting its data as JSON, it would be helpful for 
 users who want to analyse stage / task / executor information.
 Fortunately, WebUI already has a renderJson method, so we can implement that 
 method in each subclass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3454) Expose JSON expression of data shown in WebUI

2014-09-09 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3454:
-

 Summary: Expose JSON expression of data shown in WebUI
 Key: SPARK-3454
 URL: https://issues.apache.org/jira/browse/SPARK-3454
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Kousuke Saruta


If the WebUI supported extracting its data as JSON, it would be helpful for users 
who want to analyse stage / task / executor information.

Fortunately, WebUI already has a renderJson method, so we can implement that 
method in each subclass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3454) Expose JSON representation of data shown in WebUI

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126854#comment-14126854
 ] 

Apache Spark commented on SPARK-3454:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2333

 Expose JSON representation of data shown in WebUI
 -

 Key: SPARK-3454
 URL: https://issues.apache.org/jira/browse/SPARK-3454
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 If the WebUI supported extracting its data as JSON, it would be helpful for 
 users who want to analyse stage / task / executor information.
 Fortunately, WebUI already has a renderJson method, so we can implement that 
 method in each subclass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3455) **HotFix** Unit test failed due to can not resolve the attribute references

2014-09-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3455:


 Summary: **HotFix** Unit test failed due to can not resolve the 
attribute references
 Key: SPARK-3455
 URL: https://issues.apache.org/jira/browse/SPARK-3455
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker


The test case "SPARK-3349 partitioning after limit" failed with the following exception:
{panel}
23:10:04.117 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 
274.0 failed 1 times; aborting job
[info] - SPARK-3349 partitioning after limit *** FAILED ***
[info]   Exception thrown while executing query:
[info]   == Parsed Logical Plan ==
[info]   Project [*]
[info]Join Inner, Some(('subset1.n = 'lowerCaseData.n))
[info] UnresolvedRelation None, lowerCaseData, None
[info] UnresolvedRelation None, subset1, None
[info]   
[info]   == Analyzed Logical Plan ==
[info]   Project [n#605,l#606,n#12]
[info]Join Inner, Some((n#12 = n#605))
[info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at 
mapPartitions at basicOperators.scala:219)
[info] Limit 2
[info]  Sort [n#12 DESC]
[info]   Distinct 
[info]Project [n#12]
[info] SparkLogicalPlan (ExistingRdd [n#607,l#608], 
MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219)
[info]   
[info]   == Optimized Logical Plan ==
[info]   Project [n#605,l#606,n#12]
[info]Join Inner, Some((n#12 = n#605))
[info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at 
mapPartitions at basicOperators.scala:219)
[info] Limit 2
[info]  Sort [n#12 DESC]
[info]   Distinct 
[info]Project [n#12]
[info] SparkLogicalPlan (ExistingRdd [n#607,l#608], 
MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219)
[info]   
[info]   == Physical Plan ==
[info]   Project [n#605,l#606,n#12]
[info]ShuffledHashJoin [n#605], [n#12], BuildRight
[info] Exchange (HashPartitioning [n#605], 10)
[info]  ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at mapPartitions at 
basicOperators.scala:219
[info] Exchange (HashPartitioning [n#12], 10)
[info]  TakeOrdered 2, [n#12 DESC]
[info]   Distinct false
[info]Exchange (HashPartitioning [n#12], 10)
[info] Distinct true
[info]  Project [n#12]
[info]   ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
mapPartitions at basicOperators.scala:219
[info]   
[info]   Code Generation: false
[info]   == RDD ==
[info]   == Exception ==
[info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
execute, tree:
[info]   Exchange (HashPartitioning [n#12], 10)
[info]TakeOrdered 2, [n#12 DESC]
[info] Distinct false
[info]  Exchange (HashPartitioning [n#12], 10)
[info]   Distinct true
[info]Project [n#12]
[info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at mapPartitions 
at basicOperators.scala:219
[info]   
[info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
execute, tree:
[info]   Exchange (HashPartitioning [n#12], 10)
[info]TakeOrdered 2, [n#12 DESC]
[info] Distinct false
[info]  Exchange (HashPartitioning [n#12], 10)
[info]   Distinct true
[info]Project [n#12]
[info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at mapPartitions 
at basicOperators.scala:219
[info]   
[info]  at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
[info]  at 
org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
[info]  at 
org.apache.spark.sql.execution.ShuffledHashJoin.execute(joins.scala:354)
[info]  at 
org.apache.spark.sql.execution.Project.execute(basicOperators.scala:42)
[info]  at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
[info]  at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
[info]  at 
org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:40)
[info]  at 
org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply$mcV$sp(SQLQuerySuite.scala:369)
[info]  at 
org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362)
[info]  at 
org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362)
[info]  at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
[info]  at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
[info]  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]  at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]  at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]  at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
[info]  at 

[jira] [Commented] (SPARK-3455) **HotFix** Unit test failed due to can not resolve the attribute references

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126909#comment-14126909
 ] 

Apache Spark commented on SPARK-3455:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2334

 **HotFix** Unit test failed due to can not resolve the attribute references
 ---

 Key: SPARK-3455
 URL: https://issues.apache.org/jira/browse/SPARK-3455
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 The test case "SPARK-3349 partitioning after limit" failed with the following exception:
 {panel}
 23:10:04.117 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 
 274.0 failed 1 times; aborting job
 [info] - SPARK-3349 partitioning after limit *** FAILED ***
 [info]   Exception thrown while executing query:
 [info]   == Parsed Logical Plan ==
 [info]   Project [*]
 [info]Join Inner, Some(('subset1.n = 'lowerCaseData.n))
 [info] UnresolvedRelation None, lowerCaseData, None
 [info] UnresolvedRelation None, subset1, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Project [n#605,l#606,n#12]
 [info]Join Inner, Some((n#12 = n#605))
 [info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] 
 at mapPartitions at basicOperators.scala:219)
 [info] Limit 2
 [info]  Sort [n#12 DESC]
 [info]   Distinct 
 [info]Project [n#12]
 [info] SparkLogicalPlan (ExistingRdd [n#607,l#608], 
 MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219)
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Project [n#605,l#606,n#12]
 [info]Join Inner, Some((n#12 = n#605))
 [info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] 
 at mapPartitions at basicOperators.scala:219)
 [info] Limit 2
 [info]  Sort [n#12 DESC]
 [info]   Distinct 
 [info]Project [n#12]
 [info] SparkLogicalPlan (ExistingRdd [n#607,l#608], 
 MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219)
 [info]   
 [info]   == Physical Plan ==
 [info]   Project [n#605,l#606,n#12]
 [info]ShuffledHashJoin [n#605], [n#12], BuildRight
 [info] Exchange (HashPartitioning [n#605], 10)
 [info]  ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at mapPartitions 
 at basicOperators.scala:219
 [info] Exchange (HashPartitioning [n#12], 10)
 [info]  TakeOrdered 2, [n#12 DESC]
 [info]   Distinct false
 [info]Exchange (HashPartitioning [n#12], 10)
 [info] Distinct true
 [info]  Project [n#12]
 [info]   ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
 mapPartitions at basicOperators.scala:219
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   == Exception ==
 [info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
 execute, tree:
 [info]   Exchange (HashPartitioning [n#12], 10)
 [info]TakeOrdered 2, [n#12 DESC]
 [info] Distinct false
 [info]  Exchange (HashPartitioning [n#12], 10)
 [info]   Distinct true
 [info]Project [n#12]
 [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
 mapPartitions at basicOperators.scala:219
 [info]   
 [info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
 execute, tree:
 [info]   Exchange (HashPartitioning [n#12], 10)
 [info]TakeOrdered 2, [n#12 DESC]
 [info] Distinct false
 [info]  Exchange (HashPartitioning [n#12], 10)
 [info]   Distinct true
 [info]Project [n#12]
 [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
 mapPartitions at basicOperators.scala:219
 [info]   
 [info]at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
 [info]at 
 org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
 [info]at 
 org.apache.spark.sql.execution.ShuffledHashJoin.execute(joins.scala:354)
 [info]at 
 org.apache.spark.sql.execution.Project.execute(basicOperators.scala:42)
 [info]at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 [info]at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
 [info]at 
 org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:40)
 [info]at 
 org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply$mcV$sp(SQLQuerySuite.scala:369)
 [info]at 
 org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362)
 [info]at 
 org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362)
 [info]at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
 [info]at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
 

[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-09 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127016#comment-14127016
 ] 

Matthew Farrellee commented on SPARK-2972:
--

 I suggest having context implement the language-specific dispose patterns 
 ('using' in Java, 'with' in Python), so at least the code looks better?

that's a great idea. i'll spec this out for python, would you care to do it for 
java / scala?

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3456) YarnAllocator can lose container requests to RM

2014-09-09 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3456:


 Summary: YarnAllocator can lose container requests to RM
 Key: SPARK-3456
 URL: https://issues.apache.org/jira/browse/SPARK-3456
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Priority: Critical


I haven't actually tested this yet, but I believe that Spark on YARN can lose 
container requests to the RM. The reason is that we ask for the total number 
upfront (say x), but then we don't ask for any more unless some are missing, and 
if we do, we can erase the original request.

For example:

- ask for 3 containers
- 1 is allocated
- ask for 0 containers, since we asked for 3 originally (2 left)
- the 1 allocated container dies
- we now ask for 1 since it's missing; this overrides whatever is on the 
YARN side (in this case 2)

Then we lose the 2 more we need.
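As a toy model of this accounting (the class and method names below are invented for illustration; the real YarnAllocator / RM protocol is more involved):
{code}
class ToyResourceManager(object):
    """Models yarn-alpha semantics: a non-empty ask *replaces* the outstanding request."""
    def __init__(self):
        self.outstanding = 0
        self.running = 0

    def ask(self, num_containers):
        if num_containers > 0:
            self.outstanding = num_containers  # override, not add

    def allocate(self, n):
        granted = min(n, self.outstanding)
        self.outstanding -= granted
        self.running += granted

rm = ToyResourceManager()
rm.ask(3)        # ask for 3 containers upfront
rm.allocate(1)   # 1 is allocated; 2 are still outstanding on the YARN side
rm.ask(0)        # nothing new requested, since 3 were asked for originally
rm.running -= 1  # the allocated container dies
rm.ask(1)        # ask only for the 1 "missing" container -> overrides the outstanding 2
assert rm.outstanding == 1  # we need 3 running but will only ever get 1 more
{code}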



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3456) YarnAllocator can lose container requests to RM

2014-09-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127088#comment-14127088
 ] 

Thomas Graves commented on SPARK-3456:
--

Note this is only a problem on yarn-alpha, because on yarn-stable we use the 
AMRMClient interface, which actually does an add.

 YarnAllocator can lose container requests to RM
 ---

 Key: SPARK-3456
 URL: https://issues.apache.org/jira/browse/SPARK-3456
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Priority: Critical

 I haven't actually tested this yet, but I believe that Spark on YARN can lose 
 container requests to the RM. The reason is that we ask for the total number 
 upfront (say x), but then we don't ask for any more unless some are missing, 
 and if we do, we can erase the original request.
 For example:
 - ask for 3 containers
 - 1 is allocated
 - ask for 0 containers, since we asked for 3 originally (2 left)
 - the 1 allocated container dies
 - we now ask for 1 since it's missing; this overrides whatever is on the 
 YARN side (in this case 2)
 Then we lose the 2 more we need.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3457) ConcurrentModificationException starting up pyspark

2014-09-09 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-3457:


 Summary: ConcurrentModificationException starting up pyspark
 Key: SPARK-3457
 URL: https://issues.apache.org/jira/browse/SPARK-3457
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Hadoop 2.3 (CDH 5.1) on Ubuntu precise
Reporter: Shay Rojansky


Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in 
yarn-client mode (no additional params or anything), I got the exception below. 
Rerunning pyspark 5 times afterwards did not reproduce the issue.

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM:
 appMasterRpcPort: 0
 appStartTime: 1410275267606
 yarnAppState: RUNNING

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master.
grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011,
 /proxy/application_1410268447887_0011
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 44, in <module>
14/09/09 18:07:58 INFO JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/opt/spark/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/opt/spark/python/pyspark/context.py", line 155, in _do_init
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/opt/spark/python/pyspark/context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1167)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454)
at scala.collection.Iterator$class.toStream(Iterator.scala:1143)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1157)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream.length(Stream.scala:284)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:608)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324)
at 
org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1297)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:334)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3457) ConcurrentModificationException starting up pyspark

2014-09-09 Thread Shay Rojansky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Rojansky updated SPARK-3457:
-
Description: 
Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in 
yarn-client mode (no additional params or anything), I got the exception below. 
Rerunning pyspark 5 times afterwards did not reproduce the issue.

{code}
14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM:
 appMasterRpcPort: 0
 appStartTime: 1410275267606
 yarnAppState: RUNNING

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master.
grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011,
 /proxy/application_1410268447887_0011
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 44, in <module>
14/09/09 18:07:58 INFO JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/opt/spark/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/opt/spark/python/pyspark/context.py", line 155, in _do_init
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/opt/spark/python/pyspark/context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1167)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454)
at scala.collection.Iterator$class.toStream(Iterator.scala:1143)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1157)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream.length(Stream.scala:284)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:608)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324)
at 
org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1297)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:334)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in 
yarn-client mode (no additional params or anything), I got the exception below. 
Rerunning pyspark 5 times afterwards did not reproduce the issue.

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM:
 appMasterRpcPort: 0
 appStartTime: 1410275267606
 yarnAppState: RUNNING

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master.
grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011,
 

[jira] [Created] (SPARK-3458) enable use of python's with statements for SparkContext management

2014-09-09 Thread Matthew Farrellee (JIRA)
Matthew Farrellee created SPARK-3458:


 Summary: enable use of python's with statements for SparkContext 
management
 Key: SPARK-3458
 URL: https://issues.apache.org/jira/browse/SPARK-3458
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Matthew Farrellee


best practice for managing SparkContexts involves exception handling, e.g.

```
sc = SparkContext()
try:
  app(sc)
finally:
  sc.stop()
```

python provides the with statement to simplify this code, e.g.

```
with SparkContext() as sc:
  app(sc)
```

the SparkContext should be usable in a with statement
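
one way to make that work would be to give SparkContext __enter__/__exit__ methods so it acts as a context manager (a sketch under that assumption, not the actual patch):

```
class SparkContext(object):
    ...  # existing implementation

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # stop() runs whether the block exited normally or with an exception
        self.stop()
        return False  # do not suppress the exception, if any
```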



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-09 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127171#comment-14127171
 ] 

Shay Rojansky commented on SPARK-2972:
--

I'd love to help on this, but I know 0 Scala (I could have helped with the 
Python though :)).

A quick search shows that Scala has no built-in equivalent of Python's 'with' or 
Java's Closeable. There are several third-party implementations out there, 
but it doesn't seem right to bring in a non-core library for this kind of 
thing. I think someone with real Scala knowledge should take a look at this.

We can close this issue and open a separate one for the Scala closeability if 
you want.

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-09 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127187#comment-14127187
 ] 

Matthew Farrellee commented on SPARK-2972:
--

+1 close this and open 2 feature requests, one for java and one for scala that 
mirror SPARK-3458

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127188#comment-14127188
 ] 

Thomas Graves commented on SPARK-3174:
--

Since you mention the graceful decommission as being large enough to be a feature 
of its own, the only way we would give executors back is if they are not being 
used and have no data in the cache, correct?

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
 Attachments: SPARK-3174design.pdf


 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3459) MulticlassMetrics is not serializable

2014-09-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3459:


 Summary: MulticlassMetrics is not serializable
 Key: SPARK-3459
 URL: https://issues.apache.org/jira/browse/SPARK-3459
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Xiangrui Meng


Some task closures contain member variables and hence hold a reference to the 
enclosing object, which causes a task-not-serializable exception on a real cluster.
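
The usual shape of this pitfall, and the standard local-variable fix, in a hedged PySpark-style sketch (the class and field names are made up, not the MulticlassMetrics code):
{code}
class Evaluator(object):
    def __init__(self, sc, threshold):
        self.threshold = threshold
        self.rdd = sc.parallelize(range(100))

    def count_bad(self):
        # BUG: referencing self.threshold inside the lambda pulls the whole
        # Evaluator (including self.rdd) into the task closure, which cannot
        # be serialized and fails on a real cluster.
        return self.rdd.filter(lambda x: x > self.threshold).count()

    def count_good(self):
        # FIX: copy the member into a local variable so the closure captures
        # only the plain value, not a reference to self.
        threshold = self.threshold
        return self.rdd.filter(lambda x: x > threshold).count()
{code}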



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3458) enable use of python's with statements for SparkContext management

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127210#comment-14127210
 ] 

Apache Spark commented on SPARK-3458:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2335

 enable use of python's with statements for SparkContext management
 

 Key: SPARK-3458
 URL: https://issues.apache.org/jira/browse/SPARK-3458
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Matthew Farrellee
Assignee: Matthew Farrellee
  Labels: features, python, sparkcontext

 best practice for managing SparkContexts involves exception handling, e.g.
 {code}
 sc = SparkContext()
 try:
   app(sc)
 finally:
   sc.stop()
 {code}
 python provides the with statement to simplify this code, e.g.
 {code}
 with SparkContext() as sc:
   app(sc)
 {code}
 the SparkContext should be usable in a with statement



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127188#comment-14127188
 ] 

Thomas Graves edited comment on SPARK-3174 at 9/9/14 4:51 PM:
--

Since you mention the graceful decommission as being large enough to be a feature 
of its own, the only way we would give executors back is if they are not being 
used and have no data in the cache, correct?

Perhaps this needs an umbrella JIRA if we are splitting those apart.


was (Author: tgraves):
Since you mention the graceful decommission as large enough to be a feature of 
its own the only way we would give executors back is if they are not being used 
and have no data in the cache, correct?

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
 Attachments: SPARK-3174design.pdf


 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3460) Graceful decommission of idle YARN sessions

2014-09-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3460:
--

 Summary: Graceful decommission of idle YARN sessions
 Key: SPARK-3460
 URL: https://issues.apache.org/jira/browse/SPARK-3460
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Patrick Wendell
Assignee: Andrew Or


This is a simpler case of the more general ideas discussed in SPARK-3174. If we 
have a YARN session that is no longer submitting tasks and has no in-scope 
shuffle data or cached blocks, then we should scale down the cluster and give 
up containers.

This general behavior could be enabled/disabled with a config setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127232#comment-14127232
 ] 

Patrick Wendell edited comment on SPARK-3174 at 9/9/14 5:08 PM:


Yeah, so how about we create a sub-task that covers only graceful decommission? 
IMO that's a much simpler feature to implement. [~tgraves], is this an issue 
you've run into at Yahoo (people leaving clusters up that are no longer using 
any resources)?


was (Author: pwendell):
Yeah so how about we create a sub-task that covers only graceful decommission. 
IMO that's a much simpler feature to implement.

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
 Attachments: SPARK-3174design.pdf


 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3438:
---
Component/s: Deploy

 Support for accessing secured HDFS in Standalone Mode
 -

 Key: SPARK-3438
 URL: https://issues.apache.org/jira/browse/SPARK-3438
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Core
Affects Versions: 1.0.2
Reporter: Zhanfeng Huo

 Reading data from secure HDFS into Spark is a useful feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3438:
---
Summary: Support for accessing secured HDFS in Standalone Mode  (was: 
Adding support for accessing secured HDFS)

 Support for accessing secured HDFS in Standalone Mode
 -

 Key: SPARK-3438
 URL: https://issues.apache.org/jira/browse/SPARK-3438
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Core
Affects Versions: 1.0.2
Reporter: Zhanfeng Huo

 Reading data from secure HDFS into Spark is a useful feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3438:
---
Description: Secured HDFS is supported in YARN currently, but not in 
standalone mode. The tricky bit is how to disseminate the delegation tokens 
securely in standalone mode.  (was: Reading data from secure HDFS into spark is 
a usefull feature. )

 Support for accessing secured HDFS in Standalone Mode
 -

 Key: SPARK-3438
 URL: https://issues.apache.org/jira/browse/SPARK-3438
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Core
Affects Versions: 1.0.2
Reporter: Zhanfeng Huo

 Secured HDFS is supported in YARN currently, but not in standalone mode. The 
 tricky bit is how to disseminate the delegation tokens securely in standalone 
 mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2182) Scalastyle rule blocking unicode operators

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2182:
---
Assignee: Prashant Sharma

 Scalastyle rule blocking unicode operators
 --

 Key: SPARK-2182
 URL: https://issues.apache.org/jira/browse/SPARK-2182
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Andrew Ash
Assignee: Prashant Sharma
 Attachments: Screen Shot 2014-06-18 at 3.28.44 PM.png


 Some IDEs don't support Scala's [unicode 
 operators|http://www.scala-lang.org/old/node/4723] so we should consider 
 adding a scalastyle rule to block them for wider compatibility among 
 contributors.
 See this PR for a place we reverted a unicode operator: 
 https://github.com/apache/spark/pull/1119



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2182) Scalastyle rule blocking unicode operators

2014-09-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127248#comment-14127248
 ] 

Patrick Wendell commented on SPARK-2182:


[~prashant_] as our resident expert on scalastyle... is this possible?

 Scalastyle rule blocking unicode operators
 --

 Key: SPARK-2182
 URL: https://issues.apache.org/jira/browse/SPARK-2182
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Andrew Ash
Assignee: Prashant Sharma
 Attachments: Screen Shot 2014-06-18 at 3.28.44 PM.png


 Some IDEs don't support Scala's [unicode 
 operators|http://www.scala-lang.org/old/node/4723] so we should consider 
 adding a scalastyle rule to block them for wider compatibility among 
 contributors.
 See this PR for a place we reverted a unicode operator: 
 https://github.com/apache/spark/pull/1119



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3426) Sort-based shuffle compression behavior is inconsistent

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3426:
---
Priority: Blocker  (was: Critical)

 Sort-based shuffle compression behavior is inconsistent
 ---

 Key: SPARK-3426
 URL: https://issues.apache.org/jira/browse/SPARK-3426
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker

 We have the following configs:
 {code}
 spark.shuffle.compress
 spark.shuffle.spill.compress
 {code}
 When these two diverge, sort-based shuffle fails with a compression exception 
 under certain workloads. This is because in sort-based shuffle we serve the 
 index file (using spark.shuffle.spill.compress) as a normal shuffle file 
 (using spark.shuffle.compress). It was unfortunate in retrospect that these 
 two configs were exposed so we can't easily remove them.
 Here is how this can be reproduced. Set the following in your 
 spark-defaults.conf:
 {code}
 spark.master  local-cluster[1,1,512]
 spark.shuffle.spill.compress  false
 spark.shuffle.compresstrue
 spark.shuffle.manager sort
 spark.shuffle.memoryFraction  0.001
 {code}
 Then run the following in spark-shell:
 {code}
 sc.parallelize(0 until 10).map(i => (i/4, i)).groupByKey().collect()
 {code}
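
Until the two settings are unified, one defensive option for applications is to force them to agree. A minimal sketch, assuming only the config keys listed above (a workaround sketch, not the fix tracked by this issue):

{code}
import org.apache.spark.SparkConf

// Keep the index file and the shuffle data on the same compression setting so
// sort-based shuffle reads both with consistent codec expectations.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")
  .set("spark.shuffle.compress", "true")
  .set("spark.shuffle.spill.compress", "true")  // keep in sync with spark.shuffle.compress

require(
  conf.get("spark.shuffle.compress") == conf.get("spark.shuffle.spill.compress"),
  "spark.shuffle.compress and spark.shuffle.spill.compress should agree under sort-based shuffle")
{code}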



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3461) Support external group-by

2014-09-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3461:
--

 Summary: Support external group-by
 Key: SPARK-3461
 URL: https://issues.apache.org/jira/browse/SPARK-3461
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Patrick Wendell


Given that we have SPARK-2978, it seems like we could support an external group 
by operator pretty easily. We'd just have to wrap the existing iterator exposed 
by SPARK-2978 with a lookahead iterator that detects the group boundaries. 
Also, we'd have to override the cache() operator to cache the parent RDD so 
that if this object is cached it doesn't wind through the iterator.

I haven't totally followed all the sort-shuffle internals, but just given the 
stated semantics of SPARK-2978 it seems like this would be possible.
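
A minimal sketch of the lookahead grouping described above, over an already key-sorted iterator. This is illustrative only and assumes nothing about the actual SPARK-2978 interface; it just shows how group boundaries can be detected with one element of lookahead:

{code}
// Wraps a key-sorted iterator of (K, V) pairs and emits one (K, Iterable[V]) per
// run of equal keys. Only one group's values are buffered at a time.
def groupSorted[K, V](sorted: Iterator[(K, V)]): Iterator[(K, Iterable[V])] =
  new Iterator[(K, Iterable[V])] {
    private val buffered = sorted.buffered
    def hasNext: Boolean = buffered.hasNext
    def next(): (K, Iterable[V]) = {
      val key = buffered.head._1
      val values = scala.collection.mutable.ArrayBuffer[V]()
      while (buffered.hasNext && buffered.head._1 == key) {
        values += buffered.next()._2  // lookahead: stop at the first different key
      }
      (key, values)
    }
  }
{code}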



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3461) Support external groupBy

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3461:
---
Summary: Support external groupBy  (was: Support external group-by)

 Support external groupBy
 

 Key: SPARK-3461
 URL: https://issues.apache.org/jira/browse/SPARK-3461
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 Given that we have SPARK-2978, it seems like we could support an external 
 group by operator pretty easily. We'd just have to wrap the existing iterator 
 exposed by SPARK-2978 with a lookahead iterator that detects the group 
 boundaries. Also, we'd have to override the cache() operator to cache the 
 parent RDD so that if this object is cached it doesn't wind through the 
 iterator.
 I haven't totally followed all the sort-shuffle internals, but just given the 
 stated semantics of SPARK-2978 it seems like this would be possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3461) Support external groupBy

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3461:
---
Description: 
Given that we have SPARK-2978, it seems like we could support an external group 
by operator pretty easily. We'd just have to wrap the existing iterator exposed 
by SPARK-2978 with a lookahead iterator that detects the group boundaries. 
Also, we'd have to override the cache() operator to cache the parent RDD so 
that if this object is cached it doesn't wind through the iterator.

I haven't totally followed all the sort-shuffle internals, but just given the 
stated semantics of SPARK-2978 it seems like this would be possible.

It would be really nice to externalize this because many beginner users write 
jobs in terms of groupBy.

  was:
Given that we have SPARK-2978, it seems like we could support an external group 
by operator pretty easily. We'd just have to wrap the existing iterator exposed 
by SPARK-2978 with a lookahead iterator that detects the group boundaries. 
Also, we'd have to override the cache() operator to cache the parent RDD so 
that if this object is cached it doesn't wind through the iterator.

I haven't totally followed all the sort-shuffle internals, but just given the 
stated semantics of SPARK-2978 it seems like this would be possible.


 Support external groupBy
 

 Key: SPARK-3461
 URL: https://issues.apache.org/jira/browse/SPARK-3461
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 Given that we have SPARK-2978, it seems like we could support an external 
 group by operator pretty easily. We'd just have to wrap the existing iterator 
 exposed by SPARK-2978 with a lookahead iterator that detects the group 
 boundaries. Also, we'd have to override the cache() operator to cache the 
 parent RDD so that if this object is cached it doesn't wind through the 
 iterator.
 I haven't totally followed all the sort-shuffle internals, but just given the 
 stated semantics of SPARK-2978 it seems like this would be possible.
 It would be really nice to externalize this because many beginner users write 
 jobs in terms of groupBy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3461) Support external groupBy

2014-09-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3461:
---
Component/s: Spark Core

 Support external groupBy
 

 Key: SPARK-3461
 URL: https://issues.apache.org/jira/browse/SPARK-3461
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 Given that we have SPARK-2978, it seems like we could support an external 
 group by operator pretty easily. We'd just have to wrap the existing iterator 
 exposed by SPARK-2978 with a lookahead iterator that detects the group 
 boundaries. Also, we'd have to override the cache() operator to cache the 
 parent RDD so that if this object is cached it doesn't wind through the 
 iterator.
 I haven't totally followed all the sort-shuffle internals, but just given the 
 stated semantics of SPARK-2978 it seems like this would be possible.
 It would be really nice to externalize this because many beginner users write 
 jobs in terms of groupBy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3459) MulticlassMetrics is not serializable

2014-09-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-3459.

Resolution: Cannot Reproduce

 MulticlassMetrics is not serializable
 -

 Key: SPARK-3459
 URL: https://issues.apache.org/jira/browse/SPARK-3459
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Xiangrui Meng

 Some task closures contains member variables and hence have reference to 
 itself, which causes task not serializable exception on a real cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3445) Deprecate and later remove YARN alpha support

2014-09-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127301#comment-14127301
 ] 

Reynold Xin commented on SPARK-3445:


[~tgraves] when is Yahoo moving? (or was that already completed?)

 Deprecate and later remove YARN alpha support
 -

 Key: SPARK-3445
 URL: https://issues.apache.org/jira/browse/SPARK-3445
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Patrick Wendell

 This will depend a bit on both user demand and the commitment level of 
 maintainers, but I'd like to propose the following timeline for yarn-alpha 
 support.
 Spark 1.2: Deprecate YARN-alpha
 Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable)
 Since YARN-alpha is clearly identified as an alpha API, it seems reasonable 
 to drop support for it in a minor release. However, it does depend a bit on 
 whether anyone uses this outside of Yahoo!, and that I'm not sure of. In the 
 past this API has been used and maintained by Yahoo, but they'll be migrating 
 soon to the stable APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3462) parquet pushdown for unionAll

2014-09-09 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-3462:
-

 Summary: parquet pushdown for unionAll
 Key: SPARK-3462
 URL: https://issues.apache.org/jira/browse/SPARK-3462
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Cody Koeninger


http://apache-spark-developers-list.1001551.n3.nabble.com/parquet-predicate-projection-pushdown-into-unionAll-td8339.html

// single table, pushdown
scala> p.where('age > 40).select('name)
res36: org.apache.spark.sql.SchemaRDD =
SchemaRDD[97] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 ParquetTableScan [name#3,age#4], (ParquetRelation /var/tmp/people, 
Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, 
mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), [(age#4 > 40)]


// union of 2 tables, no pushdown
scala> b.where('age > 40).select('name)
res37: org.apache.spark.sql.SchemaRDD =
SchemaRDD[99] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [name#3]
 Filter (age#4 > 40)
  Union [ParquetTableScan [name#3,age#4,phones#5], (ParquetRelation 
/var/tmp/people, Some(Configuration: core-default.xml, core-site.xml, 
mapred-default.xml, mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, 
[]), []
,ParquetTableScan [name#0,age#1,phones#2], (ParquetRelation /var/tmp/people2, 
Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, 
mapred-site.xml), org.apache.spark.sql.SQLContext@6d7e79f6, []), []
]  
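
For reference, a sketch of how the two SchemaRDDs in the plans above were presumably set up in spark-shell. The paths and names are taken from the plan output; the session setup itself is an assumption:

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val p  = parquetFile("/var/tmp/people")   // single table: the predicate is pushed into the scan
val p2 = parquetFile("/var/tmp/people2")
val b  = p.unionAll(p2)                   // union: the Filter stays above the Union node
{code}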



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2425) Standalone Master is too aggressive in removing Applications

2014-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2425:
-
Fix Version/s: (was: 1.1.1)

 Standalone Master is too aggressive in removing Applications
 

 Key: SPARK-2425
 URL: https://issues.apache.org/jira/browse/SPARK-2425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Priority: Critical
 Fix For: 1.2.0


 When standalone Executors trying to run a particular Application fail a 
 cumulative ApplicationState.MAX_NUM_RETRY times, Master will remove the 
 Application.  This will be true even if there actually are a number of 
 Executors that are successfully running the Application.  This makes 
 long-running standalone-mode Applications in particular unnecessarily 
 vulnerable to limited failures in the cluster -- e.g., a single bad node on 
 which Executors repeatedly fail for any reason can prevent an Application 
 from starting or can result in a running Application being removed even 
 though it could continue to run successfully (just not making use of all 
 potential Workers and Executors.) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3445) Deprecate and later remove YARN alpha support

2014-09-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127318#comment-14127318
 ] 

Thomas Graves commented on SPARK-3445:
--

We are in the process of moving, and the proposed timeline fits with the 
rest of our plans.

 Deprecate and later remove YARN alpha support
 -

 Key: SPARK-3445
 URL: https://issues.apache.org/jira/browse/SPARK-3445
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Patrick Wendell

 This will depend a bit on both user demand and the commitment level of 
 maintainers, but I'd like to propose the following timeline for yarn-alpha 
 support.
 Spark 1.2: Deprecate YARN-alpha
 Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable)
 Since YARN-alpha is clearly identified as an alpha API, it seems reasonable 
 to drop support for it in a minor release. However, it does depend a bit on 
 whether anyone uses this outside of Yahoo!, and that I'm not sure of. In the 
 past this API has been used and maintained by Yahoo, but they'll be migrating 
 soon to the stable APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2425) Standalone Master is too aggressive in removing Applications

2014-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-2425:
--

 Standalone Master is too aggressive in removing Applications
 

 Key: SPARK-2425
 URL: https://issues.apache.org/jira/browse/SPARK-2425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Priority: Critical
 Fix For: 1.2.0


 When standalone Executors trying to run a particular Application fail a 
 cumulative ApplicationState.MAX_NUM_RETRY times, Master will remove the 
 Application.  This will be true even if there actually are a number of 
 Executors that are successfully running the Application.  This makes 
 long-running standalone-mode Applications in particular unnecessarily 
 vulnerable to limited failures in the cluster -- e.g., a single bad node on 
 which Executors repeatedly fail for any reason can prevent an Application 
 from starting or can result in a running Application being removed even 
 though it could continue to run successfully (just not making use of all 
 potential Workers and Executors.) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2425) Standalone Master is too aggressive in removing Applications

2014-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2425.

Resolution: Fixed

 Standalone Master is too aggressive in removing Applications
 

 Key: SPARK-2425
 URL: https://issues.apache.org/jira/browse/SPARK-2425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Priority: Critical
 Fix For: 1.2.0


 When standalone Executors trying to run a particular Application fail a 
 cumulative ApplicationState.MAX_NUM_RETRY times, Master will remove the 
 Application.  This will be true even if there actually are a number of 
 Executors that are successfully running the Application.  This makes 
 long-running standalone-mode Applications in particular unnecessarily 
 vulnerable to limited failures in the cluster -- e.g., a single bad node on 
 which Executors repeatedly fail for any reason can prevent an Application 
 from starting or can result in a running Application being removed even 
 though it could continue to run successfully (just not making use of all 
 potential Workers and Executors.) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3464) Graceful decommission of executors

2014-09-09 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3464:
--
Description: In most cases, even when an application is utilizing only a 
small fraction of its available resources, executors will still have tasks 
running or blocks cached.  It would be useful to have a mechanism for waiting 
for running tasks on an executor to finish and migrating its cached blocks 
elsewhere before discarding it.

 Graceful decommission of executors
 --

 Key: SPARK-3464
 URL: https://issues.apache.org/jira/browse/SPARK-3464
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Sandy Ryza

 In most cases, even when an application is utilizing only a small fraction of 
 its available resources, executors will still have tasks running or blocks 
 cached.  It would be useful to have a mechanism for waiting for running tasks 
 on an executor to finish and migrating its cached blocks elsewhere before 
 discarding it.
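
A minimal sketch of the decommission sequence described above: stop handing the executor new work, drain running tasks, migrate cached blocks, then release it. The interface names here are illustrative assumptions, not Spark APIs:

{code}
import scala.concurrent.duration._

trait DecommissionableExecutor {
  def stopAcceptingTasks(): Unit
  def runningTasks: Int
  def cachedBlockIds: Seq[String]
  def replicateBlockElsewhere(blockId: String): Unit
  def release(): Unit
}

def decommission(exec: DecommissionableExecutor, poll: FiniteDuration = 1.second): Unit = {
  exec.stopAcceptingTasks()
  while (exec.runningTasks > 0) Thread.sleep(poll.toMillis)  // wait for running tasks to finish
  exec.cachedBlockIds.foreach(exec.replicateBlockElsewhere)  // migrate cached blocks elsewhere
  exec.release()                                             // now safe to give the container back
}
{code}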



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127446#comment-14127446
 ] 

Thomas Graves commented on SPARK-3174:
--

{quote}
Yeah so how about we create a sub-task that covers only graceful decommission. 
IMO that's a much simpler feature to implement. Thomas Graves is this an issue 
you've run into at Yahoo (people leaving clusters up that are no longer using 
any resources?).
{quote}

I haven't seen a lot of it at this point. Generally when I do, it's someone 
using spark-shell or pyspark who left it up. I haven't analyzed many customer 
jobs deeply enough to know whether they were wasting resources half the time, 
either. I can definitely see its usefulness.

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
 Attachments: SPARK-3174design.pdf


 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1985) SPARK_HOME shouldn't be required when spark.executor.uri is provided

2014-09-09 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127455#comment-14127455
 ] 

Chip Senkbeil commented on SPARK-1985:
--

Does anyone know what the status of this is?

 SPARK_HOME shouldn't be required when spark.executor.uri is provided
 

 Key: SPARK-1985
 URL: https://issues.apache.org/jira/browse/SPARK-1985
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: MESOS
Reporter: Gerard Maas
  Labels: mesos

 When trying to run that simple example on a  Mesos installation,  I get an 
 error that SPARK_HOME is not set. A local spark installation should not be 
 required to run a job on Mesos. All that's needed is the executor package, 
 being the assembly.tar.gz on a reachable location (HDFS/S3/HTTP).
 I went looking into the code and indeed there's a check on SPARK_HOME [2] 
 regardless of the presence of the assembly but it's actually only used if the 
 assembly is not provided (which is a kind-of best-effort recovery strategy).
 Current flow:
 if (!SPARK_HOME) { fail("No SPARK_HOME") }
 else if (assembly) { use assembly }
 else { try to use SPARK_HOME to build spark_executor }
 Should be:
 sparkExecutor = if (assembly) { assembly }
                 else if (SPARK_HOME) { try to use SPARK_HOME to build spark_executor }
                 else { fail("No executor found. Please provide 
 spark.executor.uri (preferred) or spark.home") }
 [1] 
 http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-Spark-Mesos-spark-shell-works-fine-td6165.html
 [2] 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L89
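
A hedged Scala sketch of the proposed resolution order. The option names come from the description above; the executor path under SPARK_HOME and the error handling are illustrative assumptions:

{code}
// Prefer spark.executor.uri, fall back to SPARK_HOME, fail only when neither is set.
def resolveExecutor(executorUri: Option[String], sparkHome: Option[String]): String =
  executorUri
    .orElse(sparkHome.map(home => s"$home/sbin/spark-executor"))  // illustrative path
    .getOrElse(throw new IllegalArgumentException(
      "No executor found. Please provide spark.executor.uri (preferred) or spark.home"))
{code}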



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3450) Enable specifiying the --jars CLI option multiple times

2014-09-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127533#comment-14127533
 ] 

Marcelo Vanzin commented on SPARK-3450:
---

My only concern is that adding this would probably break things for those 
relying on the current behavior for whatever reason. I don't expect many of 
those to exist, but you never know...

 Enable specifiying the --jars CLI option multiple times
 ---

 Key: SPARK-3450
 URL: https://issues.apache.org/jira/browse/SPARK-3450
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.2
Reporter: wolfgang hoschek

 spark-submit should support specifying the --jars option multiple times, e.g. 
 --jars foo.jar,bar.jar --jars baz.jar,oops.jar should be equivalent to --jars 
 foo.jar,bar.jar,baz.jar,oops.jar
 This would allow using wrapper scripts that simplify usage for enterprise 
 customers along the following lines:
 {code}
 my-spark-submit.sh:
 # Build a comma-separated list of all jars under /opt/myapp.
 jars=""
 for i in /opt/myapp/*.jar; do
   if [ -n "$jars" ]; then
     jars="$jars,"
   fi
   jars="$jars$i"
 done
 spark-submit --jars "$jars" "$@"
 {code}
 Example usage:
 {code}
 my-spark-submit.sh --jars myUserDefinedFunction.jar 
 {code}
 The relevant enhancement code might go into SparkSubmitArguments.
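
A minimal sketch of the merging itself, written as a standalone helper rather than the actual SparkSubmitArguments code (which the reporter only guesses is the right home for it):

{code}
// Hypothetical helper: collapse every occurrence of --jars into one
// comma-separated list, preserving order and dropping empty entries.
def mergeJarsArgs(args: List[String]): List[String] = {
  def split(rest: List[String], jars: List[String], other: List[String]): (List[String], List[String]) =
    rest match {
      case "--jars" :: value :: tail => split(tail, jars ++ value.split(",").filter(_.nonEmpty), other)
      case x :: tail                 => split(tail, jars, other :+ x)
      case Nil                       => (jars, other)
    }
  val (jars, other) = split(args, Nil, Nil)
  if (jars.isEmpty) other else other ++ List("--jars", jars.mkString(","))
}
{code}

For example, mergeJarsArgs(List("--jars", "foo.jar,bar.jar", "--jars", "baz.jar", "app.jar")) yields List("app.jar", "--jars", "foo.jar,bar.jar,baz.jar").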



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-09-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127564#comment-14127564
 ] 

Marcelo Vanzin commented on SPARK-3215:
---

For those following, I moved the prototype to this location:
https://github.com/vanzin/spark-client

This is so the Hive-on-Spark project can start playing with while we work on 
all the details.

 Add remote interface for SparkContext
 -

 Key: SPARK-3215
 URL: https://issues.apache.org/jira/browse/SPARK-3215
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Marcelo Vanzin
  Labels: hive
 Attachments: RemoteSparkContext.pdf


 A quick description of the issue: as part of running Hive jobs on top of 
 Spark, it's desirable to have a SparkContext that is running in the 
 background and listening for job requests for a particular user session.
 Running multiple contexts in the same JVM is not a very good solution. Not 
 only does SparkContext currently have issues sharing the same JVM among multiple 
 instances, but it also turns the JVM running the contexts into a huge bottleneck 
 in the system.
 So I'm proposing a solution where we have a SparkContext that is running in a 
 separate process, and listening for requests from the client application via 
 some RPC interface (most probably Akka).
 I'll attach a document shortly with the current proposal. Let's use this bug 
 to discuss the proposal and any other suggestions.
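
To make the shape of the proposal concrete, a rough sketch of what such a remote interface might look like. The names (RemoteSparkContext, JobHandle) and signatures are illustrative assumptions, not the attached design:

{code}
import scala.concurrent.Future

trait JobHandle[T] {
  def jobId: String
  def result: Future[T]
  def cancel(): Unit
}

trait RemoteSparkContext {
  // Ship a closure to the long-running context process and run it as a job there.
  def submit[T](job: org.apache.spark.SparkContext => T): JobHandle[T]
  def stop(): Unit
}
{code}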



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1

2014-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3404.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

Tests are now failing due to HiveQL test problems, but you can see they have 
passed SparkSubmitSuite:
https://amplab.cs.berkeley.edu/jenkins/view/Spark/

I think this one's resolved now.

 SparkSubmitSuite fails with spark-submit exits with code 1
 

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
Reporter: Sean Owen
Priority: Critical
 Fix For: 1.1.1, 1.2.0


 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely due to some difference in how 
 the tests are run rather than a problem with the test or core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
 cause identified in that JIRA is, at least, not the only cause. (Although, it 
 wouldn't hurt to be doubly sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} instead of {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
 least print stdout to the log too.
 The SparkSubmit program exits with 1 when the main class it is supposed to 
 run is not found 
 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322)
  This is for example SimpleApplicationTest 
 (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339)
 The test actually submits an empty JAR not containing this class. It relies 
 on {{spark-submit}} finding the class within the compiled test-classes of the 
 

[jira] [Created] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3465:
-

 Summary: Task metrics are not aggregated correctly in local mode
 Key: SPARK-3465
 URL: https://issues.apache.org/jira/browse/SPARK-3465
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Davies Liu
Priority: Blocker


In local mode, after onExecutorMetricsUpdate(), t.taskMetrics is the same 
object as the one in TaskContext (because there is no serialization of 
MetricsUpdate in local mode), so all subsequent changes to the metrics are 
lost, because updateAggregateMetrics() only counts the difference between the two. 
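
A minimal, self-contained sketch of the aliasing pitfall being described; the class names are illustrative, not Spark's:

{code}
// If the "last seen" snapshot and the live metrics are the same object, every
// computed delta after the first update is zero and later changes are lost.
class Metrics(var bytesRead: Long = 0L)

class Aggregator {
  private var lastSeen = new Metrics()
  private var total = 0L
  def update(current: Metrics): Unit = {
    total += current.bytesRead - lastSeen.bytesRead  // delta since the last update
    lastSeen = current                               // aliases the live object from now on
  }
  def totalBytes: Long = total
}

object AliasingDemo extends App {
  val live = new Metrics()
  val agg = new Aggregator()
  live.bytesRead = 10; agg.update(live)  // delta = 10
  live.bytesRead = 25; agg.update(live)  // lastSeen eq live, so delta = 0
  println(agg.totalBytes)                // prints 10 instead of 25
}
{code}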



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3465:
--
Description: 
In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same 
object with that in TaskContext (because there is no serialization for 
MetricsUpdate in local mode), then all the upcoming changes in metrics will be 
lost, because updateAggregateMetrics() only counts the difference in these two. 

This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099.

  was:
In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same 
object with that in TaskContext (because there is no serialization for 
MetricsUpdate in local mode), then all the upcoming changes in metrics will be 
lost, because updateAggregateMetrics() only counts the difference in these two. 

This bug was introduced in #2099.


 Task metrics are not aggregated correctly in local mode
 ---

 Key: SPARK-3465
 URL: https://issues.apache.org/jira/browse/SPARK-3465
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the 
 same object with that in TaskContext (because there is no serialization for 
 MetricsUpdate in local mode), then all the upcoming changes in metrics will 
 be lost, because updateAggregateMetrics() only counts the difference in these 
 two. 
 This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3465:
--
Description: 
In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same 
object with that in TaskContext (because there is no serialization for 
MetricsUpdate in local mode), then all the upcoming changes in metrics will be 
lost, because updateAggregateMetrics() only counts the difference in these two. 

This bug was introduced in #2099.

  was:In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the 
same object with that in TaskContext (because there is no serialization for 
MetricsUpdate in local mode), then all the upcoming changes in metrics will be 
lost, because updateAggregateMetrics() only counts the difference in these two. 


 Task metrics are not aggregated correctly in local mode
 ---

 Key: SPARK-3465
 URL: https://issues.apache.org/jira/browse/SPARK-3465
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the 
 same object with that in TaskContext (because there is no serialization for 
 MetricsUpdate in local mode), then all the upcoming changes in metrics will 
 be lost, because updateAggregateMetrics() only counts the difference in these 
 two. 
 This bug was introduced in #2099.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3465:
--
Description: 
In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same 
object with that in TaskContext (because there is no serialization for 
MetricsUpdate in local mode), then all the upcoming changes in metrics will be 
lost, because updateAggregateMetrics() only counts the difference in these two. 

This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099, cc 
@sandy rayza

  was:
In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same 
object with that in TaskContext (because there is no serialization for 
MetricsUpdate in local mode), then all the upcoming changes in metrics will be 
lost, because updateAggregateMetrics() only counts the difference in these two. 

This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099.


 Task metrics are not aggregated correctly in local mode
 ---

 Key: SPARK-3465
 URL: https://issues.apache.org/jira/browse/SPARK-3465
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the 
 same object with that in TaskContext (because there is no serialization for 
 MetricsUpdate in local mode), then all the upcoming changes in metrics will 
 be lost, because updateAggregateMetrics() only counts the difference in these 
 two. 
 This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099, 
 cc @sandy rayza



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3465:
--
Description: 
In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same 
object with that in TaskContext (because there is no serialization for 
MetricsUpdate in local mode), then all the upcoming changes in metrics will be 
lost, because updateAggregateMetrics() only counts the difference in these two. 

This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099, cc 
[~sandyr]

  was:
In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same 
object with that in TaskContext (because there is no serialization for 
MetricsUpdate in local mode), then all the upcoming changes in metrics will be 
lost, because updateAggregateMetrics() only counts the difference in these two. 

This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099, cc 
@sandy rayza


 Task metrics are not aggregated correctly in local mode
 ---

 Key: SPARK-3465
 URL: https://issues.apache.org/jira/browse/SPARK-3465
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the 
 same object with that in TaskContext (because there is no serialization for 
 MetricsUpdate in local mode), then all the upcoming changes in metrics will 
 be lost, because updateAggregateMetrics() only counts the difference in these 
 two. 
 This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099, 
 cc [~sandyr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures

2014-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3409:
-
Fix Version/s: 1.1.1

 Avoid pulling in Exchange operator itself in Exchange's closures
 

 Key: SPARK-3409
 URL: https://issues.apache.org/jira/browse/SPARK-3409
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.1.1, 1.2.0


 {code}
 val rdd = child.execute().mapPartitions { iter =>
   if (sortBasedShuffleOn) {
     iter.map(r => (null, r.copy()))
   } else {
     val mutablePair = new MutablePair[Null, Row]()
     iter.map(r => mutablePair.update(null, r))
   }
 }
 {code}
 The above snippet from Exchange references sortBasedShuffleOn within a 
 closure, which requires pulling in the entire Exchange object in the closure. 
 This is a tiny teeny optimization.
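
The general fix pattern, shown on a toy operator rather than the actual Exchange code: copy the field into a local val before building the closure, so the task serializes only that value instead of the enclosing operator:

{code}
import org.apache.spark.SparkContext

// Illustrative operator, not Spark code. Referencing `flag` directly inside the
// closure would capture `this`; the local val keeps the closure down to a Boolean.
class MyOperator(sc: SparkContext) {
  val flag: Boolean = true

  def run(): Long = {
    val localFlag = flag  // hoisted copy; the closure no longer needs `this`
    sc.parallelize(1 to 100).filter(_ => localFlag).count()
  }
}
{code}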



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures

2014-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3409:
-
Affects Version/s: 1.1.0

 Avoid pulling in Exchange operator itself in Exchange's closures
 

 Key: SPARK-3409
 URL: https://issues.apache.org/jira/browse/SPARK-3409
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.1.1, 1.2.0


 {code}
 val rdd = child.execute().mapPartitions { iter =>
   if (sortBasedShuffleOn) {
     iter.map(r => (null, r.copy()))
   } else {
     val mutablePair = new MutablePair[Null, Row]()
     iter.map(r => mutablePair.update(null, r))
   }
 }
 {code}
 The above snippet from Exchange references sortBasedShuffleOn within a 
 closure, which requires pulling in the entire Exchange object in the closure. 
 This is a tiny teeny optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3345) Do correct parameters for ShuffleFileGroup

2014-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3345:
-
Fix Version/s: 1.1.1

 Do correct parameters for ShuffleFileGroup
 --

 Key: SPARK-3345
 URL: https://issues.apache.org/jira/browse/SPARK-3345
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 In the method newFileGroup of class FileShuffleBlockManager, the parameters 
 for creating the new ShuffleFileGroup object are in the wrong order.
 Wrong: new ShuffleFileGroup(fileId, shuffleId, files)
 Correct: new ShuffleFileGroup(shuffleId, fileId, files)
 Because the parameters shuffleId and fileId are not used in the current code, 
 this doesn't cause a problem yet. However, it should be corrected for 
 readability and to avoid future problems.
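
A small, self-contained illustration of the corrected call and of one way to make such call sites robust to ordering mistakes. The constructor shape is taken from the description; the class here is a stand-in, not the real one:

{code}
object ShuffleFileGroupExample extends App {
  // Stand-in for the real class, with the corrected parameter order.
  case class ShuffleFileGroup(shuffleId: Int, fileId: Int, files: Array[java.io.File])

  // Named arguments make the binding explicit and immune to argument-order bugs.
  val group = ShuffleFileGroup(shuffleId = 3, fileId = 7, files = Array.empty)
  println(group.shuffleId + " " + group.fileId)
}
{code}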



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3061) Maven build fails in Windows OS

2014-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3061:
-
Fix Version/s: 1.1.1

 Maven build fails in Windows OS
 ---

 Key: SPARK-3061
 URL: https://issues.apache.org/jira/browse/SPARK-3061
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
 Environment: Windows
Reporter: Masayoshi TSUZUKI
Assignee: Andrew Or
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 Maven build fails in Windows OS with this error message.
 {noformat}
 [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec 
 (default) on project spark-core_2.10: Command execution failed. Cannot run 
 program unzip (in directory C:\path\to\gitofspark\python): CreateProcess 
 error=2, Žw’肳‚ꂽƒtƒ@ƒ - [Help 1]
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3061) Maven build fails in Windows OS

2014-09-09 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127677#comment-14127677
 ] 

Andrew Or commented on SPARK-3061:
--

Ok, backported. Thanks Josh.

 Maven build fails in Windows OS
 ---

 Key: SPARK-3061
 URL: https://issues.apache.org/jira/browse/SPARK-3061
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
 Environment: Windows
Reporter: Masayoshi TSUZUKI
Assignee: Josh Rosen
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 Maven build fails in Windows OS with this error message.
 {noformat}
 [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec 
 (default) on project spark-core_2.10: Command execution failed. Cannot run 
 program unzip (in directory C:\path\to\gitofspark\python): CreateProcess 
 error=2, Žw’肳‚ꂽƒtƒ@ƒ - [Help 1]
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3463) Show metrics about spilling in Python

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127918#comment-14127918
 ] 

Apache Spark commented on SPARK-3463:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2336

 Show metrics about spilling in Python
 -

 Key: SPARK-3463
 URL: https://issues.apache.org/jira/browse/SPARK-3463
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu

 It should also show the number of bytes spilled to disk while doing 
 aggregation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127920#comment-14127920
 ] 

Apache Spark commented on SPARK-3465:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2338

 Task metrics are not aggregated correctly in local mode
 ---

 Key: SPARK-3465
 URL: https://issues.apache.org/jira/browse/SPARK-3465
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the 
 same object with that in TaskContext (because there is no serialization for 
 MetricsUpdate in local mode), then all the upcoming changes in metrics will 
 be lost, because updateAggregateMetrics() only counts the difference in these 
 two. 
 This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099, 
 cc [~sandyr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3446) FutureAction should expose the job ID

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127919#comment-14127919
 ] 

Apache Spark commented on SPARK-3446:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2337

 FutureAction should expose the job ID
 -

 Key: SPARK-3446
 URL: https://issues.apache.org/jira/browse/SPARK-3446
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin

 This is a follow up to SPARK-2636.
 The patch for that bug added a {{jobId}} method to {{SimpleFutureAction}}. 
 The problem is that {{SimpleFutureAction}} is not exposed through any 
 existing API; all the {{AsyncRDDActions}} methods return just 
 {{FutureAction}}. So clients have to resort to casting / isInstanceOf to be 
 able to use that.
 Exposing the {{jobId}} through {{FutureAction}} has extra complications, 
 though, because {{ComplexFutureAction}} also extends that class.
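
For context, a hedged sketch of the casting workaround the description alludes to, assuming the {{jobId}} accessor added by SPARK-2636 returns an Int; this is exactly the awkwardness the issue proposes to remove:

{code}
import org.apache.spark.{FutureAction, SimpleFutureAction}

// Recover the job ID by downcasting; ComplexFutureAction has no single job ID.
def jobIdOf(action: FutureAction[_]): Option[Int] = action match {
  case s: SimpleFutureAction[_] => Some(s.jobId)
  case _                        => None
}
{code}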



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3458) enable use of python's with statements for SparkContext management

2014-09-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3458.

   Resolution: Fixed
Fix Version/s: 1.2.0

 enable use of python's with statements for SparkContext management
 

 Key: SPARK-3458
 URL: https://issues.apache.org/jira/browse/SPARK-3458
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Matthew Farrellee
Assignee: Matthew Farrellee
  Labels: features, python, sparkcontext
 Fix For: 1.2.0


 best practice for managing SparkContexts involves exception handling, e.g.
 {code}
 try:
   sc = SparkContext()
   app(sc)
 finally:
   sc.stop()
 {code}
 python provides the with statement to simplify this code, e.g.
 {code}
 with SparkContext() as sc:
   app(sc)
 {code}
 the SparkContext should be usable in a with statement



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3212) Improve the clarity of caching semantics

2014-09-09 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127959#comment-14127959
 ] 

Michael Armbrust commented on SPARK-3212:
-

[~matei] also points out that we should make sure to uncache cached RDDs when 
the base table is dropped.

 Improve the clarity of caching semantics
 

 Key: SPARK-3212
 URL: https://issues.apache.org/jira/browse/SPARK-3212
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker

 Right now there are a bunch of different ways to cache tables in Spark SQL. 
 For example:
  - tweets.cache()
  - sql("SELECT * FROM tweets").cache()
  - table("tweets").cache()
  - tweets.cache().registerTempTable("tweets")
  - sql("CACHE TABLE tweets")
  - cacheTable("tweets")
 Each of the above commands has subtly different semantics, leading to a very 
 confusing user experience.  Ideally, we would stop doing caching based on 
 simple tables names and instead have a phase of optimization that does 
 intelligent matching of query plans with available cached data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3160) Simplify DecisionTree data structure for training

2014-09-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3160:
-
Description: 
Improvement: code clarity

Currently, we maintain a tree structure, a flat array of nodes, and a 
parentImpurities array.

Proposed fix: Maintain everything within a growing tree structure.

This would let us eliminate the flat array of nodes, thus saving storage when 
we do not grow a full tree.  It would also potentially make it easier to pass 
subtrees to compute nodes for local training.

Note:
* This JIRA used to have this item as well: We could have a “LearningNode 
extends Node” setup where the LearningNode holds metadata for learning (such as 
impurities).  The test-time model could be extracted from this training-time 
model, so that extra information (such as impurities) does not have to be kept 
after training.
* However, this is really a separate issue, so I removed it.

  was:
Improvement: code clarity

Currently, we maintain a tree structure, a flat array of nodes, and a 
parentImpurities array.

Proposed fix: Maintain everything within a growing tree structure.  For this, 
we could have a “LearningNode extends Node” setup where the LearningNode holds 
metadata for learning (such as impurities).  The test-time model could be 
extracted from this training-time model, so that extra information (such as 
impurities) does not have to be kept after training.

This would let us eliminate the flat array of nodes, thus saving storage when 
we do not grow a full tree.  It would also potentially make it easier to pass 
subtrees to compute nodes for local training.
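
To make the growing-tree idea concrete, a purely illustrative Python sketch 
(not MLlib's actual Scala classes): each node carries its own training-time 
impurity, and children are allocated only when a node is actually split, so 
neither the flat node array nor the separate parentImpurities array is needed.
{code}
# illustrative sketch only -- not MLlib code: a training-time tree that grows
# in place, so storage is proportional to the nodes actually created
class LearningNode(object):
    def __init__(self, node_id, impurity):
        self.node_id = node_id
        self.impurity = impurity   # training-time metadata kept on the node
        self.split = None          # (feature, threshold) once the node is split
        self.left = None           # children exist only after grow() is called
        self.right = None

    def grow(self, split, left_impurity, right_impurity):
        """Split this node, creating exactly the two children that are needed."""
        self.split = split
        self.left = LearningNode(2 * self.node_id, left_impurity)
        self.right = LearningNode(2 * self.node_id + 1, right_impurity)
        return self.left, self.right

# grow only one branch: an array sized for the full tree is never allocated
root = LearningNode(node_id=1, impurity=0.5)
left, right = root.grow(split=("feature0", 0.3),
                        left_impurity=0.2, right_impurity=0.4)
left.grow(split=("feature1", 1.5), left_impurity=0.0, right_impurity=0.1)
{code}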



 Simplify DecisionTree data structure for training
 -

 Key: SPARK-3160
 URL: https://issues.apache.org/jira/browse/SPARK-3160
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor

 Improvement: code clarity
 Currently, we maintain a tree structure, a flat array of nodes, and a 
 parentImpurities array.
 Proposed fix: Maintain everything within a growing tree structure.
 This would let us eliminate the flat array of nodes, thus saving storage when 
 we do not grow a full tree.  It would also potentially make it easier to pass 
 subtrees to compute nodes for local training.
 Note:
 * This JIRA used to have this item as well: We could have a “LearningNode 
 extends Node” setup where the LearningNode holds metadata for learning (such 
 as impurities).  The test-time model could be extracted from this 
 training-time model, so that extra information (such as impurities) does not 
 have to be kept after training.
 * However, this is really a separate issue, so I removed it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3160) Simplify DecisionTree data structure for training

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128048#comment-14128048
 ] 

Apache Spark commented on SPARK-3160:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/2341

 Simplify DecisionTree data structure for training
 -

 Key: SPARK-3160
 URL: https://issues.apache.org/jira/browse/SPARK-3160
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor

 Improvement: code clarity
 Currently, we maintain a tree structure, a flat array of nodes, and a 
 parentImpurities array.
 Proposed fix: Maintain everything within a growing tree structure.
 This would let us eliminate the flat array of nodes, thus saving storage when 
 we do not grow a full tree.  It would also potentially make it easier to pass 
 subtrees to compute nodes for local training.
 Note:
 * This JIRA used to have this item as well: We could have a “LearningNode 
 extends Node” setup where the LearningNode holds metadata for learning (such 
 as impurities).  The test-time model could be extracted from this 
 training-time model, so that extra information (such as impurities) does not 
 have to be kept after training.
 * However, this is really a separate issue, so I removed it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3468) WebUI Timeline-View feature

2014-09-09 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3468:
-

 Summary: WebUI Timeline-View feature
 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Kousuke Saruta


I sometimes troubleshoot and analyse the causes of long-running jobs.

In those cases, I first find the stages which take a long time or fail, then I 
find the tasks which take a long time or fail, and finally I analyse the 
proportion of each phase within a task.

In other cases, I find executors which take a long time to run a task and 
analyse the details of that task.

In such situations, I think it would be helpful to visualize a timeline view of 
stages / tasks / executors and the breakdown of activity within each task.

I'm now developing prototypes like the captures I attached.
I'll integrate these viewers into the WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2014-09-09 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Attachment: executors.png

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Kousuke Saruta
 Attachments: executors.png, stages.png, taskDetails.png, tasks.png


 I sometimes troubleshoot and analyse the causes of long-running jobs.
 In those cases, I first find the stages which take a long time or fail, then 
 I find the tasks which take a long time or fail, and finally I analyse the 
 proportion of each phase within a task.
 In other cases, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it would be helpful to visualize a timeline view 
 of stages / tasks / executors and the breakdown of activity within each task.
 I'm now developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2014-09-09 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Attachment: taskDetails.png

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Kousuke Saruta
 Attachments: executors.png, stages.png, taskDetails.png, tasks.png


 I sometimes troubleshoot and analyse the causes of long-running jobs.
 In those cases, I first find the stages which take a long time or fail, then 
 I find the tasks which take a long time or fail, and finally I analyse the 
 proportion of each phase within a task.
 In other cases, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it would be helpful to visualize a timeline view 
 of stages / tasks / executors and the breakdown of activity within each task.
 I'm now developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3468) WebUI Timeline-View feature

2014-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128109#comment-14128109
 ] 

Apache Spark commented on SPARK-3468:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2342

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Kousuke Saruta
 Attachments: executors.png, stages.png, taskDetails.png, tasks.png


 I sometimes troubleshoot and analyse the causes of long-running jobs.
 In those cases, I first find the stages which take a long time or fail, then 
 I find the tasks which take a long time or fail, and finally I analyse the 
 proportion of each phase within a task.
 In other cases, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it would be helpful to visualize a timeline view 
 of stages / tasks / executors and the breakdown of activity within each task.
 I'm now developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org