[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2015-01-07 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267469#comment-14267469
 ] 

Aniket Bhatnagar commented on SPARK-3452:
-

Here is the exception I am getting while triggering a job whose SparkContext 
has its master set to yarn-client. A quick look at the 1.2.0 source code 
suggests I should depend on the spark-yarn module, which I can't do as it is no 
longer published. Do you want me to log a separate defect for this and submit 
an appropriate pull request? 

2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - MemoryS
tore started with capacity 731.7 MB
Exception in thread "pool-10-thread-13" java.lang.ExceptionInInitializerError
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:180)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:232)
at com.myimpl.Server:23)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
at scala.util.Try$.apply(Try.scala:191)
at scala.util.Success.map(Try.scala:236)
at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
at scala.util.Try$.apply(Try.scala:191)
at scala.util.Success.map(Try.scala:236)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:194)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
... 27 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195)
... 29 more
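
For reference, a minimal sketch of the kind of driver code that triggers this 
path (the app name and surrounding class names are illustrative, not the actual 
com.myimpl code):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-job")        // illustrative name
  .setMaster("yarn-client")    // forces the YARN code path inside SparkContext
// Fails with the ExceptionInInitializerError above when the
// org.apache.spark.deploy.yarn classes are not on the driver classpath:
val sc = new SparkContext(conf)
{code}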


 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical
 Fix For: 1.2.0


 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5068) When the path is not found in HDFS, we can't get the result

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267483#comment-14267483
 ] 

Apache Spark commented on SPARK-5068:
-

User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/3907

 When the path is not found in HDFS, we can't get the result
 ---

 Key: SPARK-5068
 URL: https://issues.apache.org/jira/browse/SPARK-5068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn

 when a partition path is found in the metastore but not found in HDFS, it 
 will cause some problems, as follows:
 {noformat}
 hive> show partitions partition_test;
 OK
 dt=1
 dt=2
 dt=3
 dt=4
 Time taken: 0.168 seconds, Fetched: 4 row(s)
 {noformat}
 {noformat}
 hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
 Found 3 items
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
 /user/jeanlyn/warehouse/partition_test/dt=1
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
 /user/jeanlyn/warehouse/partition_test/dt=3
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
 /user/jeanlyn/warehouse/partition_test/dt=4
 {noformat}
 when I run the SQL 
 {noformat}
 select * from partition_test limit 10
 {noformat} in *hive* I get no problem, but when I run it in *spark-sql* I get 
 the error as follows:
 {noformat}
 Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: 
 Input path does not exist: 
 hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
 at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
 at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
 at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
 at org.apache.spark.sql.hive.testpartition.main(test.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {noformat}
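 For context, a minimal way to reach this code path from Spark SQL (a sketch; 
 it assumes an existing SparkContext named sc and the partition_test table 
 from above):
 {code}
 import org.apache.spark.sql.hive.HiveContext

 val hiveContext = new HiveContext(sc)
 // Throws InvalidInputException when a partition (here dt=2) is registered in
 // the metastore but its directory is missing from HDFS:
 hiveContext.sql("select * from partition_test limit 10").collect()
 {code}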



--
This message was sent by Atlassian JIRA

[jira] [Created] (SPARK-5131) A typo in configuration doc

2015-01-07 Thread uncleGen (JIRA)
uncleGen created SPARK-5131:
---

 Summary: A typo in configuration doc
 Key: SPARK-5131
 URL: https://issues.apache.org/jira/browse/SPARK-5131
 Project: Spark
  Issue Type: Bug
Reporter: uncleGen
Priority: Minor
 Fix For: 1.2.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5120) Output the thread name in log4j.properties

2015-01-07 Thread WangTaoTheTonic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WangTaoTheTonic closed SPARK-5120.
--
Resolution: Won't Fix

 Output the thread name in log4j.properties
 --

 Key: SPARK-5120
 URL: https://issues.apache.org/jira/browse/SPARK-5120
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: WangTaoTheTonic
Priority: Minor

 In most cases the thread name is very useful for analysing a running job, so 
 it is better to log it in log4j.properties.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5131) A typo in configuration doc

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267441#comment-14267441
 ] 

Apache Spark commented on SPARK-5131:
-

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3930

 A typo in configuration doc
 ---

 Key: SPARK-5131
 URL: https://issues.apache.org/jira/browse/SPARK-5131
 Project: Spark
  Issue Type: Bug
Reporter: uncleGen
Priority: Minor
 Fix For: 1.2.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5129) make SqlContext support select date +/- XX DAYS from table

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267442#comment-14267442
 ] 

Apache Spark commented on SPARK-5129:
-

User 'DoingDone9' has created a pull request for this issue:
https://github.com/apache/spark/pull/3931

 make SqlContext support select date +/- XX DAYS from table  
 --

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor

 Example :
 create table test (date: Date)
 2014-01-01
 2014-01-02
 2014-01-03
 when running select date + 10 DAYS from test, I want to get
 2014-01-11 
 2014-01-12
 2014-01-13
 and when running select date - 10 DAYS from test, I want to get
 2013-12-22
 2013-12-23
 2013-12-24
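 Until such syntax is supported natively, roughly the same result can already 
 be obtained through Hive's date_add/date_sub UDFs via HiveContext (a sketch, 
 assuming a Hive-backed table named test and an existing SparkContext sc):
 {code}
 import org.apache.spark.sql.hive.HiveContext

 val hiveContext = new HiveContext(sc)
 hiveContext.sql("SELECT date_add(`date`, 10) FROM test").collect()  // date + 10 DAYS
 hiveContext.sql("SELECT date_sub(`date`, 10) FROM test").collect()  // date - 10 DAYS
 {code}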



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types

2015-01-07 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267450#comment-14267450
 ] 

Kai Sasaki commented on SPARK-4284:
---

I'd like to work on this issue if it has not been fixed yet. Could you assign 
it to me? 

 BinaryClassificationMetrics precision-recall method names should correspond 
 to return types
 ---

 Key: SPARK-4284
 URL: https://issues.apache.org/jira/browse/SPARK-4284
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 BinaryClassificationMetrics has several methods which work with (recall, 
 precision) pairs, but the method names all use the wrong order (pr).  This 
 order should be fixed.
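 For illustration, this is how the naming reads today (a sketch; scoreAndLabels 
 is an assumed RDD of (score, label) pairs):
 {code}
 import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
 import org.apache.spark.rdd.RDD

 def prCurve(scoreAndLabels: RDD[(Double, Double)]): RDD[(Double, Double)] = {
   val metrics = new BinaryClassificationMetrics(scoreAndLabels)
   // Despite the name "pr", each point on the returned curve is
   // (recall, precision), which is the inconsistency described above.
   metrics.pr()
 }
 {code}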



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5132) The name used to get the stage info attempt ID from JSON was wrong

2015-01-07 Thread SuYan (JIRA)
SuYan created SPARK-5132:


 Summary: The name used to get the stage info attempt ID from JSON was wrong
 Key: SPARK-5132
 URL: https://issues.apache.org/jira/browse/SPARK-5132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: SuYan
Priority: Minor
 Fix For: 1.2.0


stageInfoToJson: Stage Attempt Id
stageInfoFromJson: Attempt Id
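
In other words, the key written by stageInfoToJson and the key read by 
stageInfoFromJson do not match. A minimal json4s illustration of the mismatch 
(illustrative only, not the actual JsonProtocol code):

{code}
import org.json4s._
import org.json4s.JsonDSL._

val written: JValue = ("Stage Attempt Id" -> 1)  // key emitted by the writer
val readBack = written \ "Attempt Id"            // key looked up by the reader
// readBack is JNothing, so the attempt id is lost on the round trip.
{code}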



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5132) The name used to get the stage info attempt ID from JSON was wrong

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267454#comment-14267454
 ] 

Apache Spark commented on SPARK-5132:
-

User 'suyanNone' has created a pull request for this issue:
https://github.com/apache/spark/pull/3932

 The name used to get the stage info attempt ID from JSON was wrong
 -

 Key: SPARK-5132
 URL: https://issues.apache.org/jira/browse/SPARK-5132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: SuYan
Priority: Minor
 Fix For: 1.2.0


 stageInfoToJson: Stage Attempt Id
 stageInfoFromJson: Attempt Id



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types

2015-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267457#comment-14267457
 ] 

Sean Owen commented on SPARK-4284:
--

[~lewuathe] I think you can just start working on it and submit a PR. For 
long-running efforts it may make sense to officially declare you're working on 
it, and try to get consensus that it's your issue, but this should be quite a 
quick/small change.

 BinaryClassificationMetrics precision-recall method names should correspond 
 to return types
 ---

 Key: SPARK-4284
 URL: https://issues.apache.org/jira/browse/SPARK-4284
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 BinaryClassificationMetrics has several methods which work with (recall, 
 precision) pairs, but the method names all use the wrong order (pr).  This 
 order should be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5128) Add stable log1pExp impl

2015-01-07 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5128:


 Summary: Add stable log1pExp impl
 Key: SPARK-5128
 URL: https://issues.apache.org/jira/browse/SPARK-5128
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: DB Tsai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table

2015-01-07 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5129:
--
Description: 
Example :
create table test (date: Date)

2014-01-01
2014-01-02
2014-01-03

when running select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13




  was:
Example :
create table test (date: Date, name: String)

2014-01-01   a
2014-01-02   b
2014-01-03   c

when running select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13





 make SqlContext support select date + XX DAYS from table  
 

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor

 Example :
 create table test (date: Date)
 2014-01-01
 2014-01-02
 2014-01-03
 when running select date + 10 DAYS from test, I want to get
 2014-01-11 
 2014-01-12
 2014-01-13



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table

2015-01-07 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5129:
--
Description: 
Example :
create table test (date: Date, name: String)

2014-01-01   a
2014-01-02   b
2014-01-03   c

when running select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13




  was:
Example :
create table test (date: Date, name: String)
date    name
2014-01-01 a
2014-01-02 b
2014-01-03 c

when running select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13





 make SqlContext support select date + XX DAYS from table  
 

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor

 Example :
 create table test (date: Date, name: String)
 2014-01-01   a
 2014-01-02   b
 2014-01-03   c
 when running select date + 10 DAYS from test, I want to get
 2014-01-11 
 2014-01-12
 2014-01-13



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5127) Fixed overflow when there are outliers in data in Logistic Regression

2015-01-07 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267390#comment-14267390
 ] 

DB Tsai commented on SPARK-5127:


Not an issue in binary logistic regression. Problem only occurs in MLOR.

 Fixed overflow when there are outliers in data in Logistic Regression
 -

 Key: SPARK-5127
 URL: https://issues.apache.org/jira/browse/SPARK-5127
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: DB Tsai

 gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label
 However, the first part of gradientMultiplier will suffer from overflow 
 if there are samples far away from the hyperplane, and this happens when there 
 are outliers in the data. As a result, we use an equivalent but more 
 numerically stable formula.
 val gradientMultiplier =
   if (margin > 0.0) {
     val temp = math.exp(-margin)
     temp / (1.0 + temp) - label
   } else {
     1.0 / (1.0 + math.exp(margin)) - label
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5127) Fixed overflow when there are outliers in data in Logistic Regression

2015-01-07 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai closed SPARK-5127.
--
Resolution: Not a Problem

 Fixed overflow when there are outliers in data in Logistic Regression
 -

 Key: SPARK-5127
 URL: https://issues.apache.org/jira/browse/SPARK-5127
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: DB Tsai

 gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label
 However, the first part of gradientMultiplier will suffer from overflow 
 if there are samples far away from the hyperplane, and this happens when there 
 are outliers in the data. As a result, we use an equivalent but more 
 numerically stable formula.
 val gradientMultiplier =
   if (margin > 0.0) {
     val temp = math.exp(-margin)
     temp / (1.0 + temp) - label
   } else {
     1.0 / (1.0 + math.exp(margin)) - label
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5097:
---
Priority: Critical  (was: Major)

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5127) Fixed overflow when there are outliers in data in Logistic Regression

2015-01-07 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-5127:
---
Description: 
gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier will suffer from overflow if 
there are samples far away from the hyperplane, and this happens when there are 
outliers in the data. As a result, we use an equivalent but more numerically 
stable formula.

val gradientMultiplier =
  if (margin > 0.0) {
    val temp = math.exp(-margin)
    temp / (1.0 + temp) - label
  } else {
    1.0 / (1.0 + math.exp(margin)) - label
  }


  was:
gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier will suffer from overflow if 
there are samples far away from the hyperplane, and this happens when there are 
outliers in the data. As a result, we use an equivalent but more numerically 
stable formula.
```
val gradientMultiplier =
  if (margin > 0.0) {
    val temp = math.exp(-margin)
    temp / (1.0 + temp) - label
  } else {
    1.0 / (1.0 + math.exp(margin)) - label
  }
```


 Fixed overflow when there are outliers in data in Logistic Regression
 -

 Key: SPARK-5127
 URL: https://issues.apache.org/jira/browse/SPARK-5127
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: DB Tsai

 gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label
 However, the first part of gradientMultiplier will suffer from overflow 
 if there are samples far away from the hyperplane, and this happens when there 
 are outliers in the data. As a result, we use an equivalent but more 
 numerically stable formula.
 val gradientMultiplier =
   if (margin > 0.0) {
     val temp = math.exp(-margin)
     temp / (1.0 + temp) - label
   } else {
     1.0 / (1.0 + math.exp(margin)) - label
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5127) Fixed overflow when there are outliers in data in Logistic Regression

2015-01-07 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-5127:
---
Description: 
gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier will suffer from overflow if 
there are samples far away from the hyperplane, and this happens when there are 
outliers in the data. As a result, we use an equivalent but more numerically 
stable formula.
```
val gradientMultiplier =
  if (margin > 0.0) {
    val temp = math.exp(-margin)
    temp / (1.0 + temp) - label
  } else {
    1.0 / (1.0 + math.exp(margin)) - label
  }
```

  was:
gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier will suffer from overflow if 
there are samples far away from the hyperplane, and this happens when there are 
outliers in the data. As a result, we use an equivalent but more numerically 
stable formula.

val gradientMultiplier =
  if (margin > 0.0) {
    val temp = math.exp(-margin)
    temp / (1.0 + temp) - label
  } else {
    1.0 / (1.0 + math.exp(margin)) - label
  }


 Fixed overflow when there are outliers in data in Logistic Regression
 -

 Key: SPARK-5127
 URL: https://issues.apache.org/jira/browse/SPARK-5127
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: DB Tsai

 gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label
 However, the first part of gradientMultiplier will suffer from overflow 
 if there are samples far away from the hyperplane, and this happens when there 
 are outliers in the data. As a result, we use an equivalent but more 
 numerically stable formula.
 ```
 val gradientMultiplier =
   if (margin > 0.0) {
     val temp = math.exp(-margin)
     temp / (1.0 + temp) - label
   } else {
     1.0 / (1.0 + math.exp(margin)) - label
   }
 ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4257) Spark master can only be accessed by hostname

2015-01-07 Thread Alister Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267387#comment-14267387
 ] 

Alister Lee commented on SPARK-4257:


Further, the spark URL is set correctly when SPARK_MASTER_IP is set, but not if 
the -h option is used from sbin/start-master.sh.

e.g. 
$ sbin/start-master.sh -h `hostname --ip-address`
starting org.apache.spark.deploy.master.Master, logging to 
/tmp/log/spark-ec2-user-org.apache.spark.deploy.master.Master-1-ip-172-31-12-155.out
$ grep spark:// /tmp/log/spark*.out
15/01/07 08:04:12 INFO Master: Starting Spark master at 
spark://ip-172-31-12-155:7077
$ sbin/stop-master.sh
stopping org.apache.spark.deploy.master.Master
$ export SPARK_MASTER_IP=`hostname --ip-address`
$ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to 
/tmp/log/spark-ec2-user-org.apache.spark.deploy.master.Master-1-ip-172-31-12-155.out
$ grep spark:// /tmp/log/spark*.out
15/01/07 08:05:39 INFO Master: Starting Spark master at 
spark://172.31.12.155:7077


 Spark master can only be accessed by hostname
 -

 Key: SPARK-4257
 URL: https://issues.apache.org/jira/browse/SPARK-4257
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Davies Liu
Priority: Critical

 After sbin/start-all.sh, the spark shell can not connect to standalone master 
 by spark://IP:7077, it works if replace IP by hostname.
 In the docs[1], it says use `spark://IP:PORT` to connect to master.
 [1] http://spark.apache.org/docs/latest/spark-standalone.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table

2015-01-07 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5129:
--
Priority: Minor  (was: Major)

 make SqlContext support select date + XX DAYS from table  
 

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor

 Example :



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table

2015-01-07 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5129:
--
Description: 
Example :





 make SqlContext support select date + XX DAYS from table  
 

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9

 Example :



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table

2015-01-07 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5129:
--
Description: 
Example :
create table test (date: Date, name: String)
date    name
2014-01-01 a
2014-01-02 b
2014-01-03 c

when I run select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13




  was:
Example :






 make SqlContext support select date + XX DAYS from table  
 

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor

 Example :
 create table test (date: Date, name: String)
 date    name
 2014-01-01 a
 2014-01-02 b
 2014-01-03 c
 when I run select date + 10 DAYS from test, I want to get
 2014-01-11 
 2014-01-12
 2014-01-13



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table

2015-01-07 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5129:
--
Description: 
Example :
create table test (date: Date, name: String)
date    name
2014-01-01 a
2014-01-02 b
2014-01-03 c

when running select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13




  was:
Example :
create table test (date: Date, name: String)
date    name
2014-01-01 a
2014-01-02 b
2014-01-03 c

when I run select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13





 make SqlContext support select date + XX DAYS from table  
 

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor

 Example :
 create table test (date: Date, name: String)
 date    name
 2014-01-01 a
 2014-01-02 b
 2014-01-03 c
 when running select date + 10 DAYS from test, I want to get
 2014-01-11 
 2014-01-12
 2014-01-13



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5128) Add stable log1pExp impl

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267391#comment-14267391
 ] 

Apache Spark commented on SPARK-5128:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3915

 Add stable log1pExp impl
 

 Key: SPARK-5128
 URL: https://issues.apache.org/jira/browse/SPARK-5128
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: DB Tsai





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date +/- XX DAYS from table

2015-01-07 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5129:
--
Summary: make SqlContext support select date +/- XX DAYS from table
(was: make SqlContext support select date + XX DAYS from table  )

 make SqlContext support select date +/- XX DAYS from table  
 --

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor

 Example :
 create table test (date: Date)
 2014-01-01
 2014-01-02
 2014-01-03
 when running select date + 10 DAYS from test, I want to get
 2014-01-11 
 2014-01-12
 2014-01-13



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5128) Add stable log1pExp impl

2015-01-07 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267392#comment-14267392
 ] 

DB Tsai commented on SPARK-5128:


https://github.com/apache/spark/pull/3915/commits
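
For reference, a numerically stable form of log(1 + exp(x)) typically looks 
like this (a sketch, not necessarily the exact implementation in the PR above):

{code}
def log1pExp(x: Double): Double =
  if (x > 0) {
    // log(1 + exp(x)) = x + log(1 + exp(-x)); avoids overflow of exp(x) for large x
    x + math.log1p(math.exp(-x))
  } else {
    // exp(x) <= 1 here, so no overflow
    math.log1p(math.exp(x))
  }
{code}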

 Add stable log1pExp impl
 

 Key: SPARK-5128
 URL: https://issues.apache.org/jira/browse/SPARK-5128
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: DB Tsai





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date +/- XX DAYS from table

2015-01-07 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5129:
--
Description: 
Example :
create table test (date: Date)

2014-01-01
2014-01-02
2014-01-03

when running select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13

and when running select date - 10 DAYS from test, I want to get

2013-12-22
2013-12-23
2013-12-24



  was:
Example :
create table test (date: Date)

2014-01-01
2014-01-02
2014-01-03

when running select date + 10 DAYS from test, I want to get

2014-01-11 
2014-01-12
2014-01-13





 make SqlContext support select date +/- XX DAYS from table  
 --

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor

 Example :
 create table test (date: Date)
 2014-01-01
 2014-01-02
 2014-01-03
 when running select date + 10 DAYS from test, I want to get
 2014-01-11 
 2014-01-12
 2014-01-13
 and when running select date - 10 DAYS from test, I want to get
 2013-12-22
 2013-12-23
 2013-12-24



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5130) yarn-cluster mode should not be considered as client mode in spark-submit

2015-01-07 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-5130:
--

 Summary: yarn-cluster mode should not be considered as client mode 
in spark-submit
 Key: SPARK-5130
 URL: https://issues.apache.org/jira/browse/SPARK-5130
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: WangTaoTheTonic


spark-submit will choose SparkSubmitDriverBootstrapper or SparkSubmit to launch 
according to --deploy-mode.
When submitting an application using yarn-cluster we do not need to specify 
--deploy-mode, so spark-submit will launch SparkSubmitDriverBootstrapper, which 
is not the proper behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5130) yarn-cluster mode should not be considered as client mode in spark-submit

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267414#comment-14267414
 ] 

Apache Spark commented on SPARK-5130:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/3929

 yarn-cluster mode should not be considered as client mode in spark-submit
 -

 Key: SPARK-5130
 URL: https://issues.apache.org/jira/browse/SPARK-5130
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: WangTaoTheTonic

 spark-submit will choose SparkSubmitDriverBootstrapper or SparkSubmit to 
 launch according to --deploy-mode.
 When submitting an application using yarn-cluster we do not need to specify 
 --deploy-mode, so spark-submit will launch SparkSubmitDriverBootstrapper, 
 which is not the proper behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267419#comment-14267419
 ] 

Patrick Wendell commented on SPARK-1529:


Hey Sean,

From what I remember of this, the issue is that MapR clusters are not 
typically provisioned with much local disk space available, because MapRFS 
supports accessing local volumes in its API, unlike the HDFS API. So in 
general the expectation is that large amounts of local data should be written 
through MapR's API to its local filesystem. They have an NFS mount you can use 
as a workaround to provide POSIX APIs, and I think most MapR users set this 
mount up and then have Spark write shuffle data there.

Option 2, which [~rkannan82] mentions, is not actually feasible in Spark right 
now. We don't support writing shuffle data through the Hadoop APIs, and I 
think Cheng's patch was only a prototype of how we might do that...

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian

 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267424#comment-14267424
 ] 

Patrick Wendell commented on SPARK-1529:


BTW - I think if MapR wants to have a customized shuffle, the direction 
proposed in this patch is probably not the best way to do it. It would make 
more sense to implement a DFS-based shuffle using the new pluggable shuffle 
API. I.e. a shuffle that communicates through the filesystem rather than doing 
transfers through Spark.

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian

 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267430#comment-14267430
 ] 

Sean Owen commented on SPARK-1529:
--

[~pwendell] Gotcha, that begins to make sense. I assume the cluster can be 
provisioned with as much local disk as desired, regardless of defaults. The 
alternative, to write temp files across the network and read them back in order 
to then broadcast them back over the network, seems a lot worse than just 
setting up the right amount of local disk. But if it works well enough and is 
easier in some situations, sounds like that's also an option. I suppose I'm 
asking / questioning why the project would want to encourage remote shuffle 
files by trying to not just use the HDFS APIs, but even maintain a specialized 
version of it, just to make a third workaround for a vendor config issue? 
Surely MapR should just set up clusters that are provisioned with Spark more 
how Spark needs them.

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian

 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file

2015-01-07 Thread Kanwaljit Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kanwaljit Singh resolved SPARK-2641.

Resolution: Fixed

 Spark submit doesn't pick up executor instances from properties file
 

 Key: SPARK-2641
 URL: https://issues.apache.org/jira/browse/SPARK-2641
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kanwaljit Singh

 When running spark-submit in Yarn cluster mode, we provide properties file 
 using --properties-file option.
 spark.executor.instances=5
 spark.executor.memory=2120m
 spark.executor.cores=3
 The submitted job picks up the cores and memory, but not the correct 
 instances.
 I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:
 // Use properties file as fallback for values which have a direct analog to
 // arguments in this script.
 master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
 executorMemory = Option(executorMemory)
   .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
 executorCores = Option(executorCores)
   .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
 totalExecutorCores = Option(totalExecutorCores)
   .getOrElse(defaultProperties.get("spark.cores.max").orNull)
 name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
 jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)
 Along with these defaults, we should also set a default for instances:
 numExecutors = Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull)
 PS: spark.executor.instances is also not mentioned on 
 http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file

2015-01-07 Thread Kanwaljit Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kanwaljit Singh closed SPARK-2641.
--

 Spark submit doesn't pick up executor instances from properties file
 

 Key: SPARK-2641
 URL: https://issues.apache.org/jira/browse/SPARK-2641
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kanwaljit Singh

 When running spark-submit in Yarn cluster mode, we provide properties file 
 using --properties-file option.
 spark.executor.instances=5
 spark.executor.memory=2120m
 spark.executor.cores=3
 The submitted job picks up the cores and memory, but not the correct 
 instances.
 I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:
 // Use properties file as fallback for values which have a direct analog to
 // arguments in this script.
 master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
 executorMemory = Option(executorMemory)
   .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
 executorCores = Option(executorCores)
   .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
 totalExecutorCores = Option(totalExecutorCores)
   .getOrElse(defaultProperties.get("spark.cores.max").orNull)
 name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
 jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)
 Along with these defaults, we should also set a default for instances:
 numExecutors = Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull)
 PS: spark.executor.instances is also not mentioned on 
 http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267564#comment-14267564
 ] 

Apache Spark commented on SPARK-4284:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/3933

 BinaryClassificationMetrics precision-recall method names should correspond 
 to return types
 ---

 Key: SPARK-4284
 URL: https://issues.apache.org/jira/browse/SPARK-4284
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 BinaryClassificationMetrics has several methods which work with (recall, 
 precision) pairs, but the method names all use the wrong order (pr).  This 
 order should be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267697#comment-14267697
 ] 

Apache Spark commented on SPARK-3619:
-

User 'jongyoul' has created a pull request for this issue:
https://github.com/apache/spark/pull/3934

 Upgrade to Mesos 0.21 to work around MESOS-1688
 ---

 Key: SPARK-3619
 URL: https://issues.apache.org/jira/browse/SPARK-3619
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Matei Zaharia
Assignee: Timothy Chen

 The Mesos 0.21 release has a fix for 
 https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4929) Yarn Client mode can not support the HA after the exitcode change

2015-01-07 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-4929.
--
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

 Yarn Client mode can not support the HA after the exitcode change
 -

 Key: SPARK-4929
 URL: https://issues.apache.org/jira/browse/SPARK-4929
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: SaintBacchus
 Fix For: 1.3.0, 1.2.1


 Currently, yarn-client mode will exit directly when an HA change happens, no 
 matter how many times the AM should retry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-01-07 Thread Peter Prettenhofer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Prettenhofer updated SPARK-5133:
--
Description: 
Add feature importance to decision tree model and tree ensemble models.
If people are interested in this feature I could implement it given a mentor 
(API decisions, etc). Please find a description of the feature below:

Decision trees intrinsically perform feature selection by selecting appropriate 
split points. This information can be used to assess the relative importance of 
a feature. 
Relative feature importance gives valuable insight into a decision tree or tree 
ensemble and can even be used for feature selection.

More information on feature importance (via decrease in impurity) can be found 
in ESLII (10.13.1) or here [1].
R's randomForest package uses a different technique for assessing variable 
importance that is based on permutation tests.

All necessary information to create relative importance scores should be 
available in the tree representation (class Node; split, impurity gain, 
(weighted) nr of samples?).

[1] 
http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation

  was:
Add feature importance to decision tree model and tree ensemble models.
If people are interested in this feature I could implement it given a mentor 
(API decisions, etc). Please find a description of the feature below:

Decision trees intrinsically perform feature selection by selecting appropriate 
split points. This information can be used to assess the relative importance of 
a feature. 
Relative feature importance gives valuable insight into a decision tree or tree 
ensemble and can even be used for feature selection.

All necessary information to create relative importance scores should be 
available in the tree representation (class Node; split, impurity gain, 
(weighted) nr of samples?).


 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
Priority: Minor

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
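 A minimal sketch of the impurity-decrease aggregation described above, over a 
 hypothetical node structure (MLlib's actual Node/Split classes differ):
 {code}
 case class SimpleNode(featureIndex: Int, impurityGain: Double,
                       numSamples: Long, children: Seq[SimpleNode])

 def featureImportances(root: SimpleNode, numFeatures: Int): Array[Double] = {
   val imp = Array.fill(numFeatures)(0.0)
   def visit(n: SimpleNode): Unit = {
     // Internal nodes credit their split feature with the (weighted) impurity decrease.
     if (n.children.nonEmpty) imp(n.featureIndex) += n.numSamples * n.impurityGain
     n.children.foreach(visit)
   }
   visit(root)
   val total = imp.sum
   if (total > 0) imp.map(_ / total) else imp  // normalize so importances sum to 1
 }
 {code}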



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types

2015-01-07 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267535#comment-14267535
 ] 

Kai Sasaki commented on SPARK-4284:
---

[~srowen] It's very helpful advice.  Thank you!

 BinaryClassificationMetrics precision-recall method names should correspond 
 to return types
 ---

 Key: SPARK-4284
 URL: https://issues.apache.org/jira/browse/SPARK-4284
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 BinaryClassificationMetrics has several methods which work with (recall, 
 precision) pairs, but the method names all use the wrong order (pr).  This 
 order should be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-01-07 Thread Peter Prettenhofer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Prettenhofer updated SPARK-5133:
--
Summary: Feature Importance for Decision Tree (Ensembles)  (was: Feature 
Importance for Tree (Ensembles))

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
Priority: Minor

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2165) spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext

2015-01-07 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-2165.
--
  Resolution: Fixed
   Fix Version/s: 1.3.0
Target Version/s: 1.3.0

 spark on yarn: add support for setting maxAppAttempts in the 
 ApplicationSubmissionContext
 -

 Key: SPARK-2165
 URL: https://issues.apache.org/jira/browse/SPARK-2165
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
 Fix For: 1.3.0


 Hadoop 2.x adds support for allowing the application to specify the maximum 
 application attempts. We should add support for it by setting it in the 
 ApplicationSubmissionContext.
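 A minimal sketch of what the client-side change could look like (the config 
 key spark.yarn.maxAppAttempts is assumed here for illustration; the final name 
 may differ):
 {code}
 import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext
 import org.apache.spark.SparkConf

 // Sketch only: copy an (assumed) Spark option into the YARN submission context.
 def applyMaxAppAttempts(conf: SparkConf, appContext: ApplicationSubmissionContext): Unit = {
   conf.getOption("spark.yarn.maxAppAttempts").map(_.toInt).foreach { attempts =>
     // If unset, YARN falls back to yarn.resourcemanager.am.max-attempts.
     appContext.setMaxAppAttempts(attempts)
   }
 }
 {code}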



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2165) spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext

2015-01-07 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-2165:


Assignee: Thomas Graves

 spark on yarn: add support for setting maxAppAttempts in the 
 ApplicationSubmissionContext
 -

 Key: SPARK-2165
 URL: https://issues.apache.org/jira/browse/SPARK-2165
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 1.3.0


 Hadoop 2.x adds support for allowing the application to specify the maximum 
 application attempts. We should add support for it by setting it in the 
 ApplicationSubmissionContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries

2015-01-07 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268894#comment-14268894
 ] 

Cheng Lian commented on SPARK-4908:
---

It was considered a quick fix because we hadn't figured out the root cause 
when the PR was submitted, but it has turned out to be a valid fix :)

 Spark SQL built for Hive 13 fails under concurrent metadata queries
 ---

 Key: SPARK-4908
 URL: https://issues.apache.org/jira/browse/SPARK-4908
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: David Ross
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.3.0, 1.2.1


 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: 
 https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6
 We are using Spark built for Hive 13, using this option:
 {{-Phive-0.13.1}}
 In single-threaded mode, normal operations look fine. However, under 
 concurrency, with at least 2 concurrent connections, metadata queries fail.
 For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} 
 statement when you pass a default schema in the JDBC URL, all fail.
 {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.
 Here is some example code:
 {code}
 object main extends App {
   import java.sql._
   import scala.concurrent._
   import scala.concurrent.duration._
   import scala.concurrent.ExecutionContext.Implicits.global
   Class.forName("org.apache.hive.jdbc.HiveDriver")
   val host = "localhost" // update this
   val url = s"jdbc:hive2://${host}:10511/some_db" // update this
   val future = Future.traverse(1 to 3) { i =>
     Future {
       println("Starting: " + i)
       try {
         val conn = DriverManager.getConnection(url)
       } catch {
         case e: Throwable =>
           e.printStackTrace()
           println("Failed: " + i)
       }
       println("Finishing: " + i)
     }
   }
   Await.result(future, 2.minutes)
   println("done!")
 }
 {code}
 Here is the output:
 {code}
 Starting: 1
 Starting: 3
 Starting: 2
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
   at 
 scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Failed: 3
 Finishing: 3
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 

[jira] [Commented] (SPARK-5117) Hive Generic UDFs don't cast correctly

2015-01-07 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268861#comment-14268861
 ] 

Cheng Hao commented on SPARK-5117:
--

Definitely we can do that then.

 Hive Generic UDFs don't cast correctly
 --

 Key: SPARK-5117
 URL: https://issues.apache.org/jira/browse/SPARK-5117
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust
Assignee: Cheng Hao
Priority: Blocker

 Here's a test case that is failing in master:
 {code}
   createQueryTest("generic udf casting",
     "SELECT LPAD(test, 5, 0) FROM src LIMIT 1")
 {code}
 This appears to be a regression from Spark 1.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4960) Interceptor pattern in receivers

2015-01-07 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268886#comment-14268886
 ] 

Saisai Shao edited comment on SPARK-4960 at 1/8/15 6:45 AM:


Hi all,

I just updated the doc according to TD's comment; would you mind taking a look 
at it? Thanks a lot.

Currently it's just a simple solution: since we don't need to take care of data 
type conversion, the tricky corner case is removed. The implementation is quite 
simple, with only one remaining problem, as previously mentioned: how to support 
the store(ByteBuffer) API. Also, this design should be aligned with SPARK-5042.

Here is the link:
https://docs.google.com/document/d/1-JfFkFlc5APstIcvCeqqv2t5np30ft5qaTIiNCGZfdI/edit?usp=sharing


was (Author: jerryshao):
Hi all,

I just update the doc according to TD's comment, would you mind taking a look 
at this, thanks a lot.

Here is the link:
https://docs.google.com/document/d/1-JfFkFlc5APstIcvCeqqv2t5np30ft5qaTIiNCGZfdI/edit?usp=sharing

 Interceptor pattern in receivers
 

 Key: SPARK-4960
 URL: https://issues.apache.org/jira/browse/SPARK-4960
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Tathagata Das

 Sometimes it is good to intercept a message received through a receiver and 
 modify / do something with the message before it is stored into Spark. This 
 is often referred to as the interceptor pattern. There should be a general 
 way to specify an interceptor function that gets applied to all receivers. 
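 Purely as an illustration of the pattern being asked for, a hypothetical 
 interceptor trait could look like the following (none of these names exist in 
 Spark; they are only a sketch):
 {code}
 // Hypothetical API sketch; Interceptor and UpperCaseInterceptor are illustrative only.
 trait Interceptor[T] extends Serializable {
   // Return Some(transformed record) to store it, or None to drop it.
   def intercept(record: T): Option[T]
 }

 class UpperCaseInterceptor extends Interceptor[String] {
   override def intercept(record: String): Option[String] =
     if (record.nonEmpty) Some(record.toUpperCase) else None
 }
 {code}
 A receiver (or the framework) would then run every record through the 
 configured interceptor before calling store().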



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268830#comment-14268830
 ] 

Apache Spark commented on SPARK-4943:
-

User 'alexliu68' has created a pull request for this issue:
https://github.com/apache/spark/pull/3941

 Parsing error for query with table name having dot
 --

 Key: SPARK-4943
 URL: https://issues.apache.org/jira/browse/SPARK-4943
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Alex Liu

 When integrating Spark 1.2.0 with Cassandra SQL, the following query is 
 broken. It was working with Spark 1.1.0. Basically, we use a table name 
 containing a dot to include the database name: 
 {code}
 [info]   java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but 
 `.' found
 [info] 
 [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT 
 test2.a FROM sql_test.test2 AS test2
 [info] ^
 [info]   at scala.sys.package$.error(package.scala:27)
 [info]   at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
 [info]   at 
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
 [info]   at 
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 [info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
 [info]   at 
 scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
 [info]   at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
 [info]   at 
 org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
 [info]   at 
 org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
 [info]   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
 [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
 [info]   at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
 [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
 [info]   at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
 [info]   at 
 org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644)
 [info]   at 
 

[jira] [Commented] (SPARK-5042) Updated Receiver API to make it easier to write reliable receivers that ack source

2015-01-07 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268846#comment-14268846
 ] 

Saisai Shao commented on SPARK-5042:


Hey TD, what is your schedule on this?

 Updated Receiver API to make it easier to write reliable receivers that ack 
 source
 --

 Key: SPARK-5042
 URL: https://issues.apache.org/jira/browse/SPARK-5042
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das

 Receivers in Spark Streaming receive data from different sources and push 
 them into Spark’s block manager. However, the received records must be 
 chunked into blocks before being pushed into the BlockManager. Related to 
 this, the Receiver API provides two kinds of store() - 
 1. store(single record) - The receiver implementation submits one record at a 
 time and the system takes care of dividing the stream into right-sized blocks 
 and limiting the ingestion rate. In the future, it should also be able to do 
 automatic rate / flow control. However, there is no feedback to the receiver 
 on when blocks are formed, and thus no way to provide reliability guarantees. 
 Overall, receivers using this are easy to implement.
 2. store(multiple records) - The receiver submits multiple records at once, 
 and these form the blocks that are stored in the block manager. The receiver 
 implementation has full control over block generation, which allows the 
 receiver to acknowledge the source once blocks have been reliably received by 
 the BlockManager and/or WriteAheadLog. However, such receivers do not get 
 automatic block sizing and rate control; the developer has to take care of 
 that, which adds to the complexity of the receiver implementation.
 So, to summarize, option (2) has the advantage of full control over block 
 generation, but users have to deal with the complexity of generating blocks 
 of the right size and controlling the rate. 
 We want to update this API so that it becomes easier for developers to achieve 
 reliable receiving of records without sacrificing automatic block sizing and 
 rate control. 
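 As a rough sketch of the second style, a receiver built on the current API 
 might look like the following (fetchBatch and ack are hypothetical 
 placeholders for source-specific calls):
 {code}
 import scala.collection.mutable.ArrayBuffer
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.receiver.Receiver

 // Sketch of the store(multiple records) style: the receiver controls block
 // boundaries and can ack its source after handing a batch to Spark.
 class AckingReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

   def onStart(): Unit = {
     new Thread("acking-receiver-loop") {
       override def run(): Unit = {
         while (!isStopped()) {
           val batch = fetchBatch()        // hypothetical read from the source
           if (batch.nonEmpty) {
             store(batch)                  // hand the whole batch to Spark as one block
             ack(batch.size)               // hypothetical source acknowledgement
           }
         }
       }
     }.start()
   }

   def onStop(): Unit = {}

   private def fetchBatch(): ArrayBuffer[String] = ArrayBuffer.empty  // placeholder
   private def ack(n: Int): Unit = ()                                 // placeholder
 }
 {code}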
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers

2015-01-07 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268886#comment-14268886
 ] 

Saisai Shao commented on SPARK-4960:


Hi all,

I just updated the doc according to TD's comment; would you mind taking a look 
at it? Thanks a lot.

Here is the link:
https://docs.google.com/document/d/1-JfFkFlc5APstIcvCeqqv2t5np30ft5qaTIiNCGZfdI/edit?usp=sharing

 Interceptor pattern in receivers
 

 Key: SPARK-4960
 URL: https://issues.apache.org/jira/browse/SPARK-4960
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Tathagata Das

 Sometimes it is good to intercept a message received through a receiver and 
 modify / do something with the message before it is stored into Spark. This 
 is often referred to as the interceptor pattern. There should be a general 
 way to specify an interceptor function that gets applied to all receivers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5080) Expose more cluster resource information to user

2015-01-07 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268786#comment-14268786
 ] 

Xuefu Zhang commented on SPARK-5080:


cc: [~sandyr]

 Expose more cluster resource information to user
 

 Key: SPARK-5080
 URL: https://issues.apache.org/jira/browse/SPARK-5080
 Project: Spark
  Issue Type: Improvement
Reporter: Rui Li

 It'll be useful if users can get detailed cluster resource info, e.g. 
 granted/allocated executors, memory, and CPU.
 Such information is available via the WebUI, but SparkContext doesn't seem to 
 have such APIs.
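 For reference, a small sketch of what is already reachable from SparkContext 
 today versus what this issue asks for (local mode only for illustration):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 object ClusterInfoExample extends App {
   val sc = new SparkContext(new SparkConf().setAppName("cluster-info").setMaster("local[2]"))
   // Already available: per-executor memory status (max and remaining bytes).
   sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remainingMem)) =>
     println(s"$executor: max=$maxMem bytes, remaining=$remainingMem bytes")
   }
   // What this issue asks for goes further: granted/allocated executor counts and
   // CPU info, which today only show up in the WebUI.
   sc.stop()
 }
 {code}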



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1825) Windows Spark fails to work with Linux YARN

2015-01-07 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268863#comment-14268863
 ] 

Masayoshi TSUZUKI commented on SPARK-1825:
--

It is necessary to use $$() to solve this problem, but as discussed on PR #899, 
if we use $$() the build for Hadoop versions below 2.4 will fail.
So PR #3943 uses reflection to avoid the build failure across all versions of 
Hadoop.
Windows clients work fine with a Linux YARN cluster only when we use Hadoop 
2.4+; it still doesn't work with Hadoop versions below 2.4 even after this patch.


 Windows Spark fails to work with Linux YARN
 ---

 Key: SPARK-1825
 URL: https://issues.apache.org/jira/browse/SPARK-1825
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Taeyun Kim
 Attachments: SPARK-1825.patch


 Windows Spark fails to work with Linux YARN.
 This is a cross-platform problem.
 This error occurs when 'yarn-client' mode is used.
 (yarn-cluster/yarn-standalone mode was not tested.)
 On YARN side, Hadoop 2.4.0 resolved the issue as follows:
 https://issues.apache.org/jira/browse/YARN-1824
 But the Spark YARN module does not incorporate the new YARN API yet, so the 
 problem persists for Spark.
 First, the following source files should be changed:
 - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
 - 
 /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala
 Change is as follows:
 - Replace .$() with .$$()
 - Replace File.pathSeparator for Environment.CLASSPATH.name to 
 ApplicationConstants.CLASS_PATH_SEPARATOR (import 
 org.apache.hadoop.yarn.api.ApplicationConstants is required for this)
 Unless the above are applied, launch_container.sh will contain invalid shell 
 script statements (since they will contain Windows-specific separators), and 
 the job will fail.
 Also, the following symptom should also be fixed (I could not find the 
 relevant source code):
 - The SPARK_HOME environment variable is copied straight into launch_container.sh. 
 It should be changed to the path format of the server OS, or, better, a 
 separate environment variable or a configuration variable should be created.
 - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after 
 the above change is applied. Maybe I missed a few lines.
 I'm not sure whether this is all, since I'm new to both Spark and YARN.
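 A rough sketch of the kind of change described above (not the actual patch; it 
 requires Hadoop 2.4+, where $$() and CLASS_PATH_SEPARATOR were introduced):
 {code}
 import org.apache.hadoop.yarn.api.ApplicationConstants
 import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

 // Build a container classpath entry that is expanded on the node that runs the
 // container rather than on the submitting client.
 def classPathEntry(): String = {
   // Before: Environment.PWD.$() expands using the client OS conventions, so a
   // Windows client produces %PWD%-style references and ';' separators that a
   // Linux NodeManager cannot interpret.
   // After: $$() defers expansion to the cluster side, and CLASS_PATH_SEPARATOR
   // is likewise resolved on the node running the container.
   Environment.PWD.$$() + ApplicationConstants.CLASS_PATH_SEPARATOR + "*"
 }
 {code}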



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib

2015-01-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5116:
-
Assignee: Shuo Xiang

 Add extractor for SparseVector and DenseVector in MLlib 
 

 Key: SPARK-5116
 URL: https://issues.apache.org/jira/browse/SPARK-5116
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang
Priority: Minor
 Fix For: 1.3.0


 Add extractor for SparseVector and DenseVector in MLlib to save some code 
 while performing pattern matching on Vectors. For example, previously we needed 
 to use:
 {code:title=A.scala|borderStyle=solid}
 vec match {
   case dv: DenseVector =>
     val values = dv.values
     ...
   case sv: SparseVector =>
     val indices = sv.indices
     val values = sv.values
     val size = sv.size
     ...
 }
 {code}
 with extractor it is:
 {code:title=B.scala|borderStyle=solid}
 vec match {
   case DenseVector(values) =>
     ...
   case SparseVector(size, indices, values) =>
     ...
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5141) CaseInsensitiveMap throws java.io.NotSerializableException

2015-01-07 Thread Gankun Luo (JIRA)
Gankun Luo created SPARK-5141:
-

 Summary: CaseInsensitiveMap throws 
java.io.NotSerializableException
 Key: SPARK-5141
 URL: https://issues.apache.org/jira/browse/SPARK-5141
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Gankun Luo
Priority: Minor


The following code, which uses 
[https://github.com/luogankun/spark-jdbc|https://github.com/luogankun/spark-jdbc], 
throws a serialization exception:

{code}
CREATE TEMPORARY TABLE jdbc_table
USING com.luogankun.spark.jdbc
OPTIONS (
  sparksql_table_schema '(TBL_ID int, TBL_NAME string, TBL_TYPE string)',
  jdbc_table_name 'TBLS',
  jdbc_table_schema '(TBL_ID, TBL_NAME, TBL_TYPE)',
  url 'jdbc:mysql://hadoop000:3306/hive',
  user 'root',
  password 'root'
);

select TBL_ID, TBL_ID, TBL_TYPE from jdbc_table;
{code}

I get the following stack trace:

{code}
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1448)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:616)
at 
org.apache.spark.sql.execution.Project.execute(basicOperators.scala:43)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:81)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:386)
at 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:365)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: 
org.apache.spark.sql.sources.CaseInsensitiveMap
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
..
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
{code}
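As a generic illustration of the failure mode (not Spark's actual class), a map 
wrapper that does not extend Serializable breaks as soon as it is captured by an 
RDD closure; mixing in Serializable is the minimal fix:

{code}
// Illustrative only: a wrapper like this fails ClosureCleaner's
// ensureSerializable check when captured in a task closure ...
class LowerKeyMap(underlying: Map[String, String]) {
  def get(key: String): Option[String] = underlying.get(key.toLowerCase)
}

// ... whereas this version can be serialized and shipped to executors.
class SerializableLowerKeyMap(underlying: Map[String, String]) extends Serializable {
  def get(key: String): Option[String] = underlying.get(key.toLowerCase)
}
{code}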



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4922) Support dynamic allocation for coarse-grained Mesos

2015-01-07 Thread Jongyoul Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268826#comment-14268826
 ] 

Jongyoul Lee commented on SPARK-4922:
-

[~andrewor14] Hi, I have a basic question about your idea. I'm using 
fine-grained Mesos mode to run my jobs, and that mode already allocates 
resources dynamically when the task scheduler wants them. What do you think the 
difference is between your idea and fine-grained mode? Unlike coarse-grained 
mode, fine-grained mode adjusts the number of cores per executor and makes it 
possible to run two or more executors on each slave. I think that if we make 
the number of cores for each Mesos executor configurable in fine-grained mode - 
currently only one core is fixed for each executor - we can satisfy the dynamic 
allocation idea. I also read SPARK-4751, and I'll handle this issue using 
fine-grained mode. And how do you think resources would be adjusted: a new API 
for increasing or decreasing cores, or just {{spark.cores.max}}?

 Support dynamic allocation for coarse-grained Mesos
 ---

 Key: SPARK-4922
 URL: https://issues.apache.org/jira/browse/SPARK-4922
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.2.0
Reporter: Andrew Or
Priority: Critical

 This brings SPARK-3174, which provided dynamic allocation of cluster 
 resources to Spark on YARN applications, to Mesos coarse-grained mode. 
 Note that the translation is not as trivial as adding a code path that 
 exposes the request and kill mechanisms as we did for YARN in SPARK-3822. 
 This is because Mesos coarse-grained mode schedules on the notion of setting 
 the number of cores allowed for an application (as in standalone mode) 
 instead of number of executors (as in YARN mode). For more detail, please see 
 SPARK-4751.
 If you intend to work on this, please provide a detailed design doc!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4922) Support dynamic allocation for coarse-grained Mesos

2015-01-07 Thread Jongyoul Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268826#comment-14268826
 ] 

Jongyoul Lee edited comment on SPARK-4922 at 1/8/15 5:35 AM:
-

[~andrewor14] Hi, I have a basic question about your idea. I'm using 
fine-grained Mesos mode to run my jobs, and that mode already allocates 
resources dynamically when the task scheduler wants them. What do you think the 
difference is between your idea and fine-grained mode? Unlike coarse-grained 
mode, fine-grained mode adjusts the number of cores per executor and makes it 
possible to run two or more executors on each slave. I think that if we make 
the number of cores for each Mesos executor configurable in fine-grained mode - 
currently only one core is fixed for each executor - we can satisfy the dynamic 
allocation idea. I also read SPARK-4751, and I can handle this issue using 
fine-grained mode. And how do you think resources would be adjusted: a new API 
for increasing or decreasing cores, or just {{spark.cores.max}}?


was (Author: jongyoul):
[~andrewor14] Hi, I have a basic question about your idea. I'm using 
fine-grained mesos for running my jobs. that mode already allocate resources 
dynamically when task scheduler wants. What you think the difference is between 
your idea and fine-grained mode? Unlike coarse-grained mode, fine-grained mode 
adjusts # of cores for a executor and enables to make two more executor on each 
slave. I think if we set # of cores for each mesos executor in a configuration 
on fine-grained mode - now, only one core fixed for each executor -, we can 
satisfy dynamic allocation idea. and I read SPARK-4751, and I'll handle this 
issue via using fine-grain mode. And how do you think you adjust resources? new 
API for increasing or decreasing cores or just use {{spark.cores.max}}?

 Support dynamic allocation for coarse-grained Mesos
 ---

 Key: SPARK-4922
 URL: https://issues.apache.org/jira/browse/SPARK-4922
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.2.0
Reporter: Andrew Or
Priority: Critical

 This brings SPARK-3174, which provided dynamic allocation of cluster 
 resources to Spark on YARN applications, to Mesos coarse-grained mode. 
 Note that the translation is not as trivial as adding a code path that 
 exposes the request and kill mechanisms as we did for YARN in SPARK-3822. 
 This is because Mesos coarse-grained mode schedules on the notion of setting 
 the number of cores allowed for an application (as in standalone mode) 
 instead of number of executors (as in YARN mode). For more detail, please see 
 SPARK-4751.
 If you intend to work on this, please provide a detailed design doc!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1825) Windows Spark fails to work with Linux YARN

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268858#comment-14268858
 ] 

Apache Spark commented on SPARK-1825:
-

User 'tsudukim' has created a pull request for this issue:
https://github.com/apache/spark/pull/3943

 Windows Spark fails to work with Linux YARN
 ---

 Key: SPARK-1825
 URL: https://issues.apache.org/jira/browse/SPARK-1825
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Taeyun Kim
 Attachments: SPARK-1825.patch


 Windows Spark fails to work with Linux YARN.
 This is a cross-platform problem.
 This error occurs when 'yarn-client' mode is used.
 (yarn-cluster/yarn-standalone mode was not tested.)
 On YARN side, Hadoop 2.4.0 resolved the issue as follows:
 https://issues.apache.org/jira/browse/YARN-1824
 But the Spark YARN module does not incorporate the new YARN API yet, so the 
 problem persists for Spark.
 First, the following source files should be changed:
 - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
 - 
 /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala
 Change is as follows:
 - Replace .$() with .$$()
 - Replace File.pathSeparator for Environment.CLASSPATH.name to 
 ApplicationConstants.CLASS_PATH_SEPARATOR (import 
 org.apache.hadoop.yarn.api.ApplicationConstants is required for this)
 Unless the above are applied, launch_container.sh will contain invalid shell 
 script statements (since they will contain Windows-specific separators), and 
 the job will fail.
 Also, the following symptom should also be fixed (I could not find the 
 relevant source code):
 - The SPARK_HOME environment variable is copied straight into launch_container.sh. 
 It should be changed to the path format of the server OS, or, better, a 
 separate environment variable or a configuration variable should be created.
 - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after 
 the above change is applied. Maybe I missed a few lines.
 I'm not sure whether this is all, since I'm new to both Spark and YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries

2015-01-07 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268892#comment-14268892
 ] 

David Ross commented on SPARK-4908:
---

I've verified that this is fixed on trunk. Since his commit message calls it 
just a quick fix, I will let [~marmbrus] decide whether or not to keep this 
JIRA open.

 Spark SQL built for Hive 13 fails under concurrent metadata queries
 ---

 Key: SPARK-4908
 URL: https://issues.apache.org/jira/browse/SPARK-4908
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: David Ross
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.3.0, 1.2.1


 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: 
 https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6
 We are using Spark built for Hive 13, using this option:
 {{-Phive-0.13.1}}
 In single-threaded mode, normal operations look fine. However, under 
 concurrency, with at least 2 concurrent connections, metadata queries fail.
 For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} 
 statement when you pass a default schema in the JDBC URL, all fail.
 {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.
 Here is some example code:
 {code}
 object main extends App {
   import java.sql._
   import scala.concurrent._
   import scala.concurrent.duration._
   import scala.concurrent.ExecutionContext.Implicits.global
   Class.forName("org.apache.hive.jdbc.HiveDriver")
   val host = "localhost" // update this
   val url = s"jdbc:hive2://${host}:10511/some_db" // update this
   val future = Future.traverse(1 to 3) { i =>
     Future {
       println("Starting: " + i)
       try {
         val conn = DriverManager.getConnection(url)
       } catch {
         case e: Throwable =>
           e.printStackTrace()
           println("Failed: " + i)
       }
       println("Finishing: " + i)
     }
   }
   Await.result(future, 2.minutes)
   println("done!")
 }
 {code}
 Here is the output:
 {code}
 Starting: 1
 Starting: 3
 Starting: 2
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
   at 
 scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Failed: 3
 Finishing: 3
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 

[jira] [Resolved] (SPARK-5126) No error log for a typo master url

2015-01-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5126.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Shixiong Zhu

 No error log for a typo master url 
 ---

 Key: SPARK-5126
 URL: https://issues.apache.org/jira/browse/SPARK-5126
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor
 Fix For: 1.3.0


 If a mistyped master URL is passed to the Worker, it only prints the following logs:
 {noformat}
 15/01/07 14:30:02 INFO worker.Worker: Connecting to master spark://master 
 url:7077...
 15/01/07 14:30:02 INFO 
 remote.RemoteActorRefProvider$RemoteDeadLetterActorRef: Message 
 [org.apache.spark.deploy.DeployMessages$RegisterWorker] from 
 Actor[akka://sparkWorker/user/Worker#-282880172] to 
 Actor[akka://sparkWorker/deadLetters] was not delivered. [3] dead letters 
 encountered. This logging can be turned off or adjusted with configuration 
 settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
 {noformat}
 It's not obvious that the URL is wrong, and 
 {{akka://sparkWorker/deadLetters}} is also confusing. The `deadLetters` actor 
 appears because `actorSelection` returns `deadLetters` for an invalid path.
 {code}
   def actorSelection(path: String): ActorSelection = path match {
     case RelativeActorPath(elems) ⇒
       if (elems.isEmpty) ActorSelection(provider.deadLetters, "")
       else if (elems.head.isEmpty) ActorSelection(provider.rootGuardian, elems.tail)
       else ActorSelection(lookupRoot, elems)
     case ActorPathExtractor(address, elems) ⇒
       ActorSelection(provider.rootGuardianAt(address), elems)
     case _ ⇒
       ActorSelection(provider.deadLetters, "")
   }
 {code}
 I think logging an error about an invalid URL would be better.
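 For example, a hypothetical up-front check (names below are illustrative, not 
 the actual Worker code) could fail fast instead of silently routing messages 
 to deadLetters:
 {code}
 // Illustrative sketch: validate the master URL before building the actor path.
 val sparkUrlPattern = """spark://([^:]+):(\d+)""".r

 def toAkkaMasterUrl(masterUrl: String): String = masterUrl match {
   case sparkUrlPattern(host, port) =>
     s"akka.tcp://sparkMaster@$host:$port/user/Master"
   case _ =>
     // Log and fail fast instead of letting actorSelection fall back to deadLetters.
     throw new IllegalArgumentException(s"Invalid master URL: $masterUrl")
 }
 {code}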



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib

2015-01-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5116.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3919
[https://github.com/apache/spark/pull/3919]

 Add extractor for SparseVector and DenseVector in MLlib 
 

 Key: SPARK-5116
 URL: https://issues.apache.org/jira/browse/SPARK-5116
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Shuo Xiang
Priority: Minor
 Fix For: 1.3.0


 Add extractor for SparseVector and DenseVector in MLlib to save some code 
 while performing pattern matching on Vectors. For example, previously we needed 
 to use:
 {code:title=A.scala|borderStyle=solid}
 vec match {
   case dv: DenseVector =>
     val values = dv.values
     ...
   case sv: SparseVector =>
     val indices = sv.indices
     val values = sv.values
     val size = sv.size
     ...
 }
 {code}
 with extractor it is:
 {code:title=B.scala|borderStyle=solid}
 vec match {
   case DenseVector(values) =>
     ...
   case SparseVector(size, indices, values) =>
     ...
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-01-07 Thread Gerard Maas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268440#comment-14268440
 ] 

Gerard Maas commented on SPARK-4940:


Hi Tim,

We are indeed using Coarse Grain mode. I'm not sure fine-grained mode makes 
much sense for Spark Streaming.

Here are a few examples of resource allocation, taken from several runs of the 
same job with identical configuration:
Job config:
spark.cores.max = 18
spark.mesos.coarse = true
spark.executor.memory = 4g

The job logic will start 6 Kafka receivers.

#1
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 4 |  4GB | 3  | 2  |
| 2 | 6 |  4GB | 2  | 1  | 
| 3 | 7 | 4GB  | 3  | 2  |
| 4 | 1 | 4GB | 1 | 1 |

Total mem: 16 GB
Total CPUs: 18

Observations: 
Node #4, with only 1 CPU and 1 Kafka receiver, does not have the capacity to 
process the received data, so all data received needs to be sent to other nodes 
for non-local processing (not sure whether replication helps in this case; the 
blocks of data are processed on other nodes). Also, the nodes with 2 streaming 
receivers have a higher load than the node with 1 receiver.

#2
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 7 |  4GB | 7  | 4  |
| 2 | 2 |  4GB | 2  | 2  | 

Total mem: 8 GB
Total CPUs: 9

Observations: 
This is the worst configuration of the day. Totally unbalanced (4 vs 2 
receivers) and for some reason, the job didn't get all the resources assigned 
in the configuration. The job processing time is also slower as there are fewer 
cores to handle the data and less overall memory.

#3
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 3 |  4GB | 3  | 2  |
| 2 | 8 |  4GB | 2  | 2  | 
| 3 | 7 | 4GB  | 3  | 2  |

Total mem: 12GB
Total CPU: 18

Observations: 
This is a fairly good configuration, with more evenly distributed receivers 
and CPUs, although there's one considerably smaller node in terms of CPU 
assignment.
 
We can observe that the current resource assignment policy results in less than 
ideal and, in particular, random assignments that have a strong impact on job 
execution and performance. Since CPU allocation is per executor (and not per 
job), the total memory for the job becomes variable, as it can get 2 to 4 
executors assigned. It's also weird and unexpected to observe less than the max 
CPU allocation.
Here's a performance chart of the same job across two configurations, one with 
3 (left) nodes and one with 2 (right): 
!https://lh3.googleusercontent.com/Z1C71OKoQzGA13uNJ8Yvf_xz_glRUqU_IGGvLsfkPvUPK2lahrEatweiWl-PDDfysjXtbs1Sl_k=w1682-h689!
(chart line: processing time in ms, load is fairly constant)

 Support more evenly distributing cores for Mesos mode
 -

 Key: SPARK-4940
 URL: https://issues.apache.org/jira/browse/SPARK-4940
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Currently in Coarse grain mode the spark scheduler simply takes all the 
 resources it can on each node, but can cause uneven distribution based on 
 resources available on each slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-01-07 Thread Gerard Maas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268440#comment-14268440
 ] 

Gerard Maas edited comment on SPARK-4940 at 1/7/15 10:54 PM:
-

Hi Tim,

We are indeed using Coarse Grain mode. I'm not sure fine-grained mode makes 
much sense for Spark Streaming.

Here are a few examples of resource allocation, taken from several runs of the 
same job with identical configuration:
Job config:
spark.cores.max = 18
spark.mesos.coarse = true
spark.executor.memory = 4g

The job logic will start 6 Kafka receivers.

#1
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 4 |  4GB | 3  | 2  |
| 2 | 6 |  4GB | 2  | 1  | 
| 3 | 7 | 4GB  | 3  | 2  |
| 4 | 1 | 4GB | 1 | 1 |

Total mem: 16 GB
Total CPUs: 18

Observations: 
Node #4, with only 1 CPU and 1 Kafka receiver, does not have the capacity to 
process the received data, so all data received needs to be sent to other nodes 
for non-local processing (not sure whether replication helps in this case; the 
blocks of data are processed on other nodes). Also, the nodes with 2 streaming 
receivers have a higher load than the node with 1 receiver.

#2
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 7 |  4GB | 7  | 4  |
| 2 | 2 |  4GB | 2  | 2  | 

Total mem: 8 GB
Total CPUs: 9

Observations: 
This is the worst configuration of the day. Totally unbalanced (4 vs 2 
receivers) and for some reason, the job didn't get all the resources assigned 
in the configuration. The job processing time is also slower as there are fewer 
cores to handle the data and less overall memory.

#3
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 3 |  4GB | 3  | 2  |
| 2 | 8 |  4GB | 2  | 2  | 
| 3 | 7 | 4GB  | 3  | 2  |

Total mem: 12GB
Total CPU: 18

Observations: 
This is a fairly good configuration, with more evenly distributed receivers 
and CPUs, although there's one considerably smaller node in terms of CPU 
assignment.
 
We can observe that the current resource assignment policy results in less than 
ideal and, in particular, random assignments that have a strong impact on job 
execution and performance. Since CPU allocation is per executor (and not per 
job), the total memory for the job becomes variable, as it can get 2 to 4 
executors assigned. It's also weird and unexpected to observe less than the max 
CPU allocation.
Here's a performance chart of the same job jumping from one config to another 
(*):  3 nodes (left of the spike)  and  2 nodes (right):
 
!https://lh3.googleusercontent.com/Z1C71OKoQzGA13uNJ8Yvf_xz_glRUqU_IGGvLsfkPvUPK2lahrEatweiWl-PDDfysjXtbs1Sl_k=w1682-h689!
(chart line: processing time in ms; load is fairly constant; higher is worse. 
Note how the job performance is degraded.)

(*) For a reason we haven't found yet, Mesos often kills the job. When 
Marathon relaunches it, it results in a different resource assignment.


was (Author: gmaas):
Hi Tim,

We are indeed using Coarse Grain mode. I'm not sure fine-grained mode makes 
much sense for Spark Streaming.

Here're few examples of resource allocation. They are taken from several runs 
of the same job with identical configuration:
Job config:
spark.cores.max = 18
spark.mesos.coarse = true
spark.executor.memory = 4g

The job logic will start 6 Kafka receivers.

#1
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 4 |  4GB | 3  | 2  |
| 2 | 6 |  4GB | 2  | 1  | 
| 3 | 7 | 4GB  | 3  | 2  |
| 4 | 1 | 4GB | 1 | 1 |

Total mem: 16 GB
Total CPUs: 18

Observations: 
Node#4 with only 1 CPU and 1 Kafka receiver does not have capacity to process 
the received data, so all data received needs to be sent to other node for 
non-local processing  (not sure how replication helps or not in this case, the 
blocks of data are processed on other nodes). Also the nodes with 2 streaming 
receivers have higher load that the node with 1 receiver.

#2
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 7 |  4GB | 7  | 4  |
| 2 | 2 |  4GB | 2  | 2  | 

Total mem: 8 GB
Total CPUs: 9

Observations: 
This is the worst configuration of the day. Totally unbalanced (4 vs 2 
receivers) and for some reason, the job didn't get all the resources assigned 
in the configuration. The job processing time is also slower as there're less 
cores to handle the data and less overall memory.

#3
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 3 |  4GB | 3  | 2  |
| 2 | 8 |  4GB | 2  | 2  | 
| 3 | 7 | 4GB  | 3  | 2  |

Total mem: 12GB
Total CPU: 18

Observations: 
This is a fairly good configuration with a more evenly distributed receivers 
and CPUs although there's one  considerable smaller node in terms of CPU 
assignment.
 
We can observe that the current resource assignment policy results in less than 
ideal 

[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-01-07 Thread Gerard Maas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268440#comment-14268440
 ] 

Gerard Maas edited comment on SPARK-4940 at 1/7/15 10:53 PM:
-

Hi Tim,

We are indeed using Coarse Grain mode. I'm not sure fine-grained mode makes 
much sense for Spark Streaming.

Here are a few examples of resource allocation, taken from several runs of the 
same job with identical configuration:
Job config:
spark.cores.max = 18
spark.mesos.coarse = true
spark.executor.memory = 4g

The job logic will start 6 Kafka receivers.

#1
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 4 |  4GB | 3  | 2  |
| 2 | 6 |  4GB | 2  | 1  | 
| 3 | 7 | 4GB  | 3  | 2  |
| 4 | 1 | 4GB | 1 | 1 |

Total mem: 16 GB
Total CPUs: 18

Observations: 
Node #4, with only 1 CPU and 1 Kafka receiver, does not have the capacity to 
process the received data, so all data received needs to be sent to other nodes 
for non-local processing (not sure whether replication helps in this case; the 
blocks of data are processed on other nodes). Also, the nodes with 2 streaming 
receivers have a higher load than the node with 1 receiver.

#2
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 7 |  4GB | 7  | 4  |
| 2 | 2 |  4GB | 2  | 2  | 

Total mem: 8 GB
Total CPUs: 9

Observations: 
This is the worst configuration of the day. Totally unbalanced (4 vs 2 
receivers) and for some reason, the job didn't get all the resources assigned 
in the configuration. The job processing time is also slower as there are fewer 
cores to handle the data and less overall memory.

#3
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 3 |  4GB | 3  | 2  |
| 2 | 8 |  4GB | 2  | 2  | 
| 3 | 7 | 4GB  | 3  | 2  |

Total mem: 12GB
Total CPU: 18

Observations: 
This is a fairly good configuration, with more evenly distributed receivers 
and CPUs, although there's one considerably smaller node in terms of CPU 
assignment.
 
We can observe that the current resource assignment policy results in less than 
ideal and, in particular, random assignments that have a strong impact on job 
execution and performance. Since CPU allocation is per executor (and not per 
job), the total memory for the job becomes variable, as it can get 2 to 4 
executors assigned. It's also weird and unexpected to observe less than the max 
CPU allocation.
Here's a performance chart of the same job jumping from one config to another 
(*), one with 3 (left) nodes and one with 2 (right): 
!https://lh3.googleusercontent.com/Z1C71OKoQzGA13uNJ8Yvf_xz_glRUqU_IGGvLsfkPvUPK2lahrEatweiWl-PDDfysjXtbs1Sl_k=w1682-h689!
(chart line: processing time in ms, load is fairly constant)

(*) For a reason we haven't found yet, Mesos often kills the job. When 
Marathon relaunches it, it results in a different resource assignment.


was (Author: gmaas):
Hi Tim,

We are indeed using Coarse Grain mode. I'm not sure fine-grained mode makes 
much sense for Spark Streaming.

Here're few examples of resource allocation. They are taken from several runs 
of the same job with identical configuration:
Job config:
spark.cores.max = 18
spark.mesos.coarse = true
spark.executor.memory = 4g

The job logic will start 6 Kafka receivers.

#1
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 4 |  4GB | 3  | 2  |
| 2 | 6 |  4GB | 2  | 1  | 
| 3 | 7 | 4GB  | 3  | 2  |
| 4 | 1 | 4GB | 1 | 1 |

Total mem: 16 GB
Total CPUs: 18

Observations: 
Node#4 with only 1 CPU and 1 Kafka receiver does not have capacity to process 
the received data, so all data received needs to be sent to other node for 
non-local processing  (not sure how replication helps or not in this case, the 
blocks of data are processed on other nodes). Also the nodes with 2 streaming 
receivers have higher load that the node with 1 receiver.

#2
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 7 |  4GB | 7  | 4  |
| 2 | 2 |  4GB | 2  | 2  | 

Total mem: 8 GB
Total CPUs: 9

Observations: 
This is the worst configuration of the day. Totally unbalanced (4 vs 2 
receivers) and for some reason, the job didn't get all the resources assigned 
in the configuration. The job processing time is also slower as there're less 
cores to handle the data and less overall memory.

#3
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 3 |  4GB | 3  | 2  |
| 2 | 8 |  4GB | 2  | 2  | 
| 3 | 7 | 4GB  | 3  | 2  |

Total mem: 12GB
Total CPU: 18

Observations: 
This is a fairly good configuration with a more evenly distributed receivers 
and CPUs although there's one  considerable smaller node in terms of CPU 
assignment.
 
We can observe that the current resource assignment policy results in less than 
ideal and in particular random assignments that have a strong impact on 

[jira] [Commented] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project

2015-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268458#comment-14268458
 ] 

Sean Owen commented on SPARK-5136:
--

[~pwendell] Before I suggest a change to the IntelliJ build notes in {{docs/}}, 
which are indeed a little out of date, I remember that you created 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA
 for much the same purpose.

That's better but I think it's also a little out of date (e.g. the YARN 
structure has changed). Best to have this info in just one place.

Should the docs link to the wiki, and should I suggest a few changes to the 
wiki? 
Or should we try to put all of this info into docs and remove the wiki?

 Improve documentation around setting up Spark IntelliJ project
 --

 Key: SPARK-5136
 URL: https://issues.apache.org/jira/browse/SPARK-5136
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor

 [The documentation about setting up a Spark project in 
 Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea]
  is somewhat short/cryptic and targets [an IntelliJ version released in 
 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is 
 probably warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore

2015-01-07 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268475#comment-14268475
 ] 

Nicholas Chammas commented on SPARK-2541:
-

By the way, should this issue be linked to [SPARK-3438]?

 Standalone mode can't access secure HDFS anymore
 

 Key: SPARK-2541
 URL: https://issues.apache.org/jira/browse/SPARK-2541
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0, 1.0.1
Reporter: Thomas Graves
 Attachments: SPARK-2541-partial.patch


 In Spark 0.9.x you could access secure HDFS from the Standalone deploy; that 
 no longer works in 1.x. 
 It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it 
 wouldn't do the doAs if the currentUser == user. Not sure how it behaves 
 when the daemons run as a superuser but SPARK_USER is set to someone else.
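 For illustration, a minimal sketch (not the actual SparkHadoopUtil code) of the 
 kind of guard described above, using the Hadoop UserGroupInformation API; the 
 method name is made up:
 {code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Sketch only: run `body` as the requested user, but skip the doAs when that
// user is already the current Hadoop user, so existing Kerberos credentials
// (needed for secure HDFS) are preserved.
def runAsUser(user: String)(body: => Unit): Unit = {
  val current = UserGroupInformation.getCurrentUser
  if (user == null || user == current.getShortUserName) {
    body
  } else {
    UserGroupInformation.createRemoteUser(user).doAs(
      new PrivilegedExceptionAction[Unit] { override def run(): Unit = body })
  }
}
 {code}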



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore

2015-01-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268480#comment-14268480
 ] 

Thomas Graves commented on SPARK-2541:
--

Yeah, kind of. I guess SPARK-3438 is to officially add support for it. It used 
to work, which is why I filed this JIRA, but perhaps it was never really 
officially supported, at least not in a documented way, so that one sounds like 
it should be more comprehensive.

 Standalone mode can't access secure HDFS anymore
 

 Key: SPARK-2541
 URL: https://issues.apache.org/jira/browse/SPARK-2541
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0, 1.0.1
Reporter: Thomas Graves
 Attachments: SPARK-2541-partial.patch


 In Spark 0.9.x you could access secure HDFS from the Standalone deploy; that 
 no longer works in 1.x. 
 It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it 
 wouldn't do the doAs if the currentUser == user. Not sure how it behaves 
 when the daemons run as a superuser but SPARK_USER is set to someone else.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2015-01-07 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268529#comment-14268529
 ] 

Davies Liu commented on SPARK-3910:
---

[~joshrosen] I think we should backport SPARK-4348 and SPARK-4821 into 
branch-1.1; that would also remove the hack in pyspark/__init__.py.

 ./python/pyspark/mllib/classification.py doctests fails with module name 
 pollution
 --

 Key: SPARK-3910
 URL: https://issues.apache.org/jira/browse/SPARK-3910
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
 Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
 argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
 pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
 unittest2==0.5.1, wsgiref==0.1.2
Reporter: Tomohiko K.
  Labels: pyspark, testing

 In ./python/run-tests script, we run the doctests in 
 ./pyspark/mllib/classification.py.
 The output is as following:
 {noformat}
 $ ./python/run-tests
 ...
 Running test: pyspark/mllib/classification.py
 Traceback (most recent call last):
   File pyspark/mllib/classification.py, line 20, in module
 import numpy
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py,
  line 170, in module
 from . import add_newdocs
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py,
  line 13, in module
 from numpy.lib import add_newdoc
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py,
  line 8, in module
 from .type_check import *
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py,
  line 11, in module
 import numpy.core.numeric as _nx
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py,
  line 46, in module
 from numpy.testing import Tester
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py,
  line 13, in module
 from .utils import *
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py,
  line 15, in module
 from tempfile import mkdtemp
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py,
  line 34, in module
 from random import Random as _Random
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py, 
 line 24, in module
 from pyspark.rdd import RDD
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py, line 
 51, in module
 from pyspark.context import SparkContext
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py, line 
 22, in module
 from tempfile import NamedTemporaryFile
 ImportError: cannot import name NamedTemporaryFile
 0.07 real 0.04 user 0.02 sys
 Had test failures; see logs.
 {noformat}
 The problem is a cyclic import of the tempfile module.
 The cause is that the pyspark.mllib.random module lives in the same directory 
 as the pyspark.mllib.classification module.
 The classification module imports numpy, and numpy in turn imports tempfile.
 Because the first entry of sys.path is the directory ./python/pyspark/mllib 
 (where the executed file classification.py lives), tempfile imports 
 pyspark.mllib.random instead of the standard-library random module.
 The import chain then reaches tempfile again, so a cyclic import is formed.
 Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
 → (cyclic import!!)
 Furthermore, stat is also a standard-library module, and a pyspark.mllib.stat 
 module exists. This may be troublesome in the same way.
 commit: 0e8203f4fb721158fb27897680da476174d24c4b
 A fundamental solution is to avoid reusing module names from the standard 
 library (currently random and stat).
 A difficulty of this solution is that pyspark.mllib.random and 
 pyspark.mllib.stat may already be in use, which makes renaming them hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-01-07 Thread Zach Fry (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Fry updated SPARK-4879:

Attachment: speculation2.txt
speculation.txt

Hey Josh, 

I have been playing around with your repro above, and I think I can 
consistently trigger the bad behavior just by tweaking the values of 
{{spark.speculation.multiplier}} and {{spark.speculation.quantile}}.

I set the {{multiplier}} to 1 and the {{quantile}} to 0.01, so that only 1% 
of tasks have to finish before any task that takes longer than those 1% of 
tasks is speculated. 
As expected, I see a lot of tasks getting speculated. 
After running the repro about 5 times, I have seen 2 errors (stack traces are 
at the bottom, and the full run from the REPL is attached to this comment). 

One thing I do notice is that the part-0 associated with Stage 1 was always 
where I expected it to be in HDFS, and all lines were present (checked using 
{{wc -l}}).
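
For reference, a minimal sketch of the settings described above (the three 
config keys are real Spark settings; the app name and values simply mirror this 
experiment):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Aggressive speculation: with quantile = 0.01 speculation starts once 1% of
// tasks have finished, and multiplier = 1 flags any task slower than the median.
val conf = new SparkConf()
  .setAppName("speculation-repro")
  .set("spark.speculation", "true")
  .set("spark.speculation.multiplier", "1")
  .set("spark.speculation.quantile", "0.01")
val sc = new SparkContext(conf)
{code}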


{code}
scala 15/01/07 13:44:26 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 
0.0 (TID 119, redacted-host-02): java.io.IOException: The temporary 
job-output directory hdfs://redacted-host-01:8020/test6/_temporary doesn't 
exist!

org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)

org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)

org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

{code}
15/01/07 15:17:39 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 
(TID 120, redacted-host-03): org.apache.hadoop.ipc.RemoteException: No lease 
on /test7/_temporary/_attempt_201501071517__m_00_120/part-0: File 
does not exist. Holder DFSClient_NONMAPREDUCE_-469253416_73 does not have any 
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2609)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2426)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2339)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:501)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:299)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44954)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)

org.apache.hadoop.ipc.Client.call(Client.java:1238)

org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
com.sun.proxy.$Proxy9.addBlock(Unknown Source)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)

org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)

org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
com.sun.proxy.$Proxy9.addBlock(Unknown Source)

org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:291)

org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1177)


[jira] [Created] (SPARK-5139) select table_alias.* with joins and selecting column names from inner queries not supported

2015-01-07 Thread Sunita Koppar (JIRA)
Sunita Koppar created SPARK-5139:


 Summary: select table_alias.* with joins  and selecting column 
names from inner queries not supported
 Key: SPARK-5139
 URL: https://issues.apache.org/jira/browse/SPARK-5139
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1
 Environment: Eclipse + SBT as well as linux cluster
Reporter: Sunita Koppar
Priority: Blocker


There are 2 issues here:
1. select table_alias.*  on a joined query is not supported
The exception thrown is as below:

at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
at croevss.WfPlsRej$.plsrej(WfPlsRej.scala:80)
at croevss.WfPlsRej$.main(WfPlsRej.scala:40)
at croevss.WfPlsRej.main(WfPlsRej.scala)

2. Multilevel nesting chokes up with messages like this:
Exception in thread main 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes:

Below is a sample query which runs on Hive but fails in Spark SQL for the above 
reasons. 

SELECT sq.* ,r.*
FROM   (SELECT cs.*, 
   w.primary_key, 
   w.id  AS s_id1, 
   w.d_cd, 
   w.d_name, 
   w.rd, 
   w.completion_date AS completion_date1, 
   w.sales_type  AS sales_type1 
FROM   (SELECT stg.s_id, 
   stg.c_id, 
   stg.v, 
   stg.flg1, 
   stg.flg2, 
   comstg.d1, 
   comstg.d2, 
   comstg.d3 
FROM   croe_rej_stage_pq stg 
   JOIN croe_rej_stage_comments_pq comstg 
 ON ( stg.s_id = comstg.s_id ) 
WHERE  comstg.valid_flg_txt = 'Y' 
   AND stg.valid_flg_txt = 'Y' 
ORDER  BY stg.s_id) cs 
   JOIN croe_rej_work_pq w 
 ON ( cs.s_id = w.s_id )) sq 
   JOIN CROE_rdr_pq r 
 ON ( sq.d_cd = r.d_number )


This is very cumbersome to deal with, and we end up creating StructTypes for 
every level.
If there is a better way to deal with this, please let us know.
Regards,
Sunita



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-01-07 Thread Gerard Maas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268440#comment-14268440
 ] 

Gerard Maas edited comment on SPARK-4940 at 1/7/15 10:52 PM:
-

Hi Tim,

We are indeed using Coarse Grain mode. I'm not sure fine-grained mode makes 
much sense for Spark Streaming.

Here are a few examples of resource allocation. They are taken from several 
runs of the same job with identical configuration:
Job config:
spark.cores.max = 18
spark.mesos.coarse = true
spark.executor.memory = 4g

The job logic will start 6 Kafka receivers.

#1
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 4 |  4GB | 3  | 2  |
| 2 | 6 |  4GB | 2  | 1  | 
| 3 | 7 | 4GB  | 3  | 2  |
| 4 | 1 | 4GB | 1 | 1 |

Total mem: 16 GB
Total CPUs: 18

Observations: 
Node #4, with only 1 CPU and 1 Kafka receiver, does not have the capacity to 
process the received data, so everything it receives has to be sent to other 
nodes for non-local processing (not sure whether replication helps in this 
case; the blocks of data are processed on other nodes). Also, the nodes with 2 
streaming receivers have a higher load than the node with 1 receiver.

#2
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 7 |  4GB | 7  | 4  |
| 2 | 2 |  4GB | 2  | 2  | 

Total mem: 8 GB
Total CPUs: 9

Observations: 
This is the worst configuration of the day: totally unbalanced (4 vs 2 
receivers), and for some reason the job didn't get all the resources requested 
in the configuration. Job processing time is also slower, as there are fewer 
cores to handle the data and less overall memory.

#3
--
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 3 |  4GB | 3  | 2  |
| 2 | 8 |  4GB | 2  | 2  | 
| 3 | 7 | 4GB  | 3  | 2  |

Total mem: 12GB
Total CPU: 18

Observations: 
This is a fairly good configuration, with receivers and CPUs more evenly 
distributed, although one node is considerably smaller in terms of CPU 
assignment.
 
We can observe that the current resource assignment policy results in 
less-than-ideal and, in particular, random assignments that have a strong 
impact on job execution and performance. Since CPU allocation is per executor 
(and not per job), the total memory available to the job also varies, as it 
can be assigned anywhere from 2 to 4 executors. It's also odd and unexpected 
to observe allocations below the configured maximum number of CPUs.
Here's a performance chart of the same job jumping from one config to another 
(*), one with 3 nodes (left) and one with 2 (right): 
!https://lh3.googleusercontent.com/Z1C71OKoQzGA13uNJ8Yvf_xz_glRUqU_IGGvLsfkPvUPK2lahrEatweiWl-PDDfysjXtbs1Sl_k=w1682-h689!
(chart line: processing time in ms, load is fairly constant)

(*) For a reason we haven't found yet, Mesos often kills the job. When 
Marathon relaunches it, it ends up with a different resource assignment.
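
For context, a rough sketch of the kind of job described above (only the three 
config values and the use of 6 Kafka receivers come from this comment; the 
topic, group, ZooKeeper host and batch interval are made up):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Coarse-grained Mesos job with 6 Kafka receivers, mirroring the config above.
val conf = new SparkConf()
  .setAppName("streaming-on-mesos")
  .set("spark.cores.max", "18")
  .set("spark.mesos.coarse", "true")
  .set("spark.executor.memory", "4g")
val ssc = new StreamingContext(conf, Seconds(10))

// Six receivers; each becomes a long-running task pinned to whatever executor
// the scheduler places it on, which is where the imbalance shows up.
val streams = (1 to 6).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "example-group",
    Map("example-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
}
ssc.union(streams).count().print()

ssc.start()
ssc.awaitTermination()
{code}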



[jira] [Updated] (SPARK-4951) A busy executor may be killed when dynamicAllocation is enabled

2015-01-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4951:
-
Assignee: Shixiong Zhu

 A busy executor may be killed when dynamicAllocation is enabled
 ---

 Key: SPARK-4951
 URL: https://issues.apache.org/jira/browse/SPARK-4951
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu

 If a task runs for longer than `spark.dynamicAllocation.executorIdleTimeout`, 
 the executor running that task will be killed.
 The following steps (yarn-client mode) can reproduce this bug:
 1. Start `spark-shell` using
 {code}
 ./bin/spark-shell --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=4 \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.dynamicAllocation.executorIdleTimeout=30 \
 --master yarn-client \
 --driver-memory 512m \
 --executor-memory 512m \
 --executor-cores 1
 {code}
 2. Wait more than 30 seconds until there is only one executor.
 3. Run the following code (a task needs at least 50 seconds to finish)
 {code}
 val r = sc.parallelize(1 to 1000, 20).map{t => Thread.sleep(1000); 
 t}.groupBy(_ % 2).collect()
 {code}
 4. Executors will be killed and allocated all the time, which makes the Job 
 fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2015-01-07 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268496#comment-14268496
 ] 

Davies Liu commented on SPARK-3910:
---

The 1.2 branch should not fail in a clean environment. Where is the log output 
for the failure?

 ./python/pyspark/mllib/classification.py doctests fails with module name 
 pollution
 --

 Key: SPARK-3910
 URL: https://issues.apache.org/jira/browse/SPARK-3910
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
 Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
 argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
 pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
 unittest2==0.5.1, wsgiref==0.1.2
Reporter: Tomohiko K.
  Labels: pyspark, testing

 In ./python/run-tests script, we run the doctests in 
 ./pyspark/mllib/classification.py.
 The output is as following:
 {noformat}
 $ ./python/run-tests
 ...
 Running test: pyspark/mllib/classification.py
 Traceback (most recent call last):
   File pyspark/mllib/classification.py, line 20, in module
 import numpy
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py,
  line 170, in module
 from . import add_newdocs
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py,
  line 13, in module
 from numpy.lib import add_newdoc
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py,
  line 8, in module
 from .type_check import *
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py,
  line 11, in module
 import numpy.core.numeric as _nx
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py,
  line 46, in module
 from numpy.testing import Tester
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py,
  line 13, in module
 from .utils import *
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py,
  line 15, in module
 from tempfile import mkdtemp
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py,
  line 34, in module
 from random import Random as _Random
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py, 
 line 24, in module
 from pyspark.rdd import RDD
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py, line 
 51, in module
 from pyspark.context import SparkContext
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py, line 
 22, in module
 from tempfile import NamedTemporaryFile
 ImportError: cannot import name NamedTemporaryFile
 0.07 real 0.04 user 0.02 sys
 Had test failures; see logs.
 {noformat}
 The problem is a cyclic import of the tempfile module.
 The cause is that the pyspark.mllib.random module lives in the same directory 
 as the pyspark.mllib.classification module.
 The classification module imports numpy, and numpy in turn imports tempfile.
 Because the first entry of sys.path is the directory ./python/pyspark/mllib 
 (where the executed file classification.py lives), tempfile imports 
 pyspark.mllib.random instead of the standard-library random module.
 The import chain then reaches tempfile again, so a cyclic import is formed.
 Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
 → (cyclic import!!)
 Furthermore, stat is also a standard-library module, and a pyspark.mllib.stat 
 module exists. This may be troublesome in the same way.
 commit: 0e8203f4fb721158fb27897680da476174d24c4b
 A fundamental solution is to avoid reusing module names from the standard 
 library (currently random and stat).
 A difficulty of this solution is that pyspark.mllib.random and 
 pyspark.mllib.stat may already be in use, which makes renaming them hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2015-01-07 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268498#comment-14268498
 ] 

Josh Rosen commented on SPARK-3910:
---

Oh, I noticed this in 1.1 (while setting up SBT tests for the backport 
branches: SPARK-5053).  Here's a sample failure: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.1-SBT/1/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/console

 ./python/pyspark/mllib/classification.py doctests fails with module name 
 pollution
 --

 Key: SPARK-3910
 URL: https://issues.apache.org/jira/browse/SPARK-3910
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
 Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
 argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
 pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
 unittest2==0.5.1, wsgiref==0.1.2
Reporter: Tomohiko K.
  Labels: pyspark, testing

 In ./python/run-tests script, we run the doctests in 
 ./pyspark/mllib/classification.py.
 The output is as following:
 {noformat}
 $ ./python/run-tests
 ...
 Running test: pyspark/mllib/classification.py
 Traceback (most recent call last):
   File pyspark/mllib/classification.py, line 20, in module
 import numpy
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py,
  line 170, in module
 from . import add_newdocs
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py,
  line 13, in module
 from numpy.lib import add_newdoc
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py,
  line 8, in module
 from .type_check import *
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py,
  line 11, in module
 import numpy.core.numeric as _nx
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py,
  line 46, in module
 from numpy.testing import Tester
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py,
  line 13, in module
 from .utils import *
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py,
  line 15, in module
 from tempfile import mkdtemp
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py,
  line 34, in module
 from random import Random as _Random
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py, 
 line 24, in module
 from pyspark.rdd import RDD
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py, line 
 51, in module
 from pyspark.context import SparkContext
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py, line 
 22, in module
 from tempfile import NamedTemporaryFile
 ImportError: cannot import name NamedTemporaryFile
 0.07 real 0.04 user 0.02 sys
 Had test failures; see logs.
 {noformat}
 The problem is a cyclic import of the tempfile module.
 The cause is that the pyspark.mllib.random module lives in the same directory 
 as the pyspark.mllib.classification module.
 The classification module imports numpy, and numpy in turn imports tempfile.
 Because the first entry of sys.path is the directory ./python/pyspark/mllib 
 (where the executed file classification.py lives), tempfile imports 
 pyspark.mllib.random instead of the standard-library random module.
 The import chain then reaches tempfile again, so a cyclic import is formed.
 Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
 → (cyclic import!!)
 Furthermore, stat is also a standard-library module, and a pyspark.mllib.stat 
 module exists. This may be troublesome in the same way.
 commit: 0e8203f4fb721158fb27897680da476174d24c4b
 A fundamental solution is to avoid reusing module names from the standard 
 library (currently random and stat).
 A difficulty of this solution is that pyspark.mllib.random and 
 pyspark.mllib.stat may already be in use, which makes renaming them hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2015-01-07 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268559#comment-14268559
 ] 

Davies Liu commented on SPARK-3910:
---

branch-1.0 does not have random.py, so 1.1 is the only branch we need to 
backport to or patch.

 ./python/pyspark/mllib/classification.py doctests fails with module name 
 pollution
 --

 Key: SPARK-3910
 URL: https://issues.apache.org/jira/browse/SPARK-3910
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
 Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
 argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
 pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
 unittest2==0.5.1, wsgiref==0.1.2
Reporter: Tomohiko K.
  Labels: pyspark, testing

 In ./python/run-tests script, we run the doctests in 
 ./pyspark/mllib/classification.py.
 The output is as following:
 {noformat}
 $ ./python/run-tests
 ...
 Running test: pyspark/mllib/classification.py
 Traceback (most recent call last):
   File pyspark/mllib/classification.py, line 20, in module
 import numpy
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py,
  line 170, in module
 from . import add_newdocs
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py,
  line 13, in module
 from numpy.lib import add_newdoc
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py,
  line 8, in module
 from .type_check import *
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py,
  line 11, in module
 import numpy.core.numeric as _nx
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py,
  line 46, in module
 from numpy.testing import Tester
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py,
  line 13, in module
 from .utils import *
   File 
 /Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py,
  line 15, in module
 from tempfile import mkdtemp
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py,
  line 34, in module
 from random import Random as _Random
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py, 
 line 24, in module
 from pyspark.rdd import RDD
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py, line 
 51, in module
 from pyspark.context import SparkContext
   File /Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py, line 
 22, in module
 from tempfile import NamedTemporaryFile
 ImportError: cannot import name NamedTemporaryFile
 0.07 real 0.04 user 0.02 sys
 Had test failures; see logs.
 {noformat}
 The problem is a cyclic import of the tempfile module.
 The cause is that the pyspark.mllib.random module lives in the same directory 
 as the pyspark.mllib.classification module.
 The classification module imports numpy, and numpy in turn imports tempfile.
 Because the first entry of sys.path is the directory ./python/pyspark/mllib 
 (where the executed file classification.py lives), tempfile imports 
 pyspark.mllib.random instead of the standard-library random module.
 The import chain then reaches tempfile again, so a cyclic import is formed.
 Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
 → (cyclic import!!)
 Furthermore, stat is also a standard-library module, and a pyspark.mllib.stat 
 module exists. This may be troublesome in the same way.
 commit: 0e8203f4fb721158fb27897680da476174d24c4b
 A fundamental solution is to avoid reusing module names from the standard 
 library (currently random and stat).
 A difficulty of this solution is that pyspark.mllib.random and 
 pyspark.mllib.stat may already be in use, which makes renaming them hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-07 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268640#comment-14268640
 ] 

Davies Liu commented on SPARK-3789:
---

Any updates?

 Python bindings for GraphX
 --

 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar
Assignee: Kushal Datta





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-07 Thread Kushal Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268649#comment-14268649
 ] 

Kushal Datta commented on SPARK-3789:
-

Hi Davies,

Here is the list of things I have completed so far:
- Java API for VertexRDD, EdgeRDD and Graph
- Unit tests for JavaVertexRDD, JavaEdgeRDD and JavaGraph
- Python API for VertexRDD, EdgeRDD and Graph in Scala including
  -- PythonVertexRDD, PythonEdgeRDD and PythonGraph
  -- Also includes vertex, edge and graph transformations and actions

In progress are:
- Pregel API in Python which includes
  -- Adding the new Pregel API in Python
  -- serializing vertexProgram, sendMessage, mergeMsg and initialMsg

-Kushal

 Python bindings for GraphX
 --

 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar
Assignee: Kushal Datta





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project

2015-01-07 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268146#comment-14268146
 ] 

Ryan Williams commented on SPARK-5136:
--

I've not started on it so feel free to grab the lock. If I've not heard from 
you I'll take a crack at it in the next week or so.

 Improve documentation around setting up Spark IntelliJ project
 --

 Key: SPARK-5136
 URL: https://issues.apache.org/jira/browse/SPARK-5136
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor

 [The documentation about setting up a Spark project in 
 Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea]
  is somewhat short/cryptic and targets [an IntelliJ version released in 
 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is 
 probably warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project

2015-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268106#comment-14268106
 ] 

Sean Owen commented on SPARK-5136:
--

[~rdub] Are you taking a crack at this or should I? I think the instructions 
could be elaborated a bit, particularly about picking profiles. It will be 
correct for any recent IntelliJ.

 Improve documentation around setting up Spark IntelliJ project
 --

 Key: SPARK-5136
 URL: https://issues.apache.org/jira/browse/SPARK-5136
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor

 [The documentation about setting up a Spark project in 
 Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea]
  is somewhat short/cryptic and targets [an IntelliJ version released in 
 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is 
 probably warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5039) Spark 1.0 2.0.0-mr1-cdh4.1.2 Maven build fails due to javax.servlet.FilterRegistration's signer information errors

2015-01-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5039:
--
Assignee: Sean Owen

 Spark 1.0 2.0.0-mr1-cdh4.1.2 Maven build fails due to 
 javax.servlet.FilterRegistration's signer information errors 
 -

 Key: SPARK-5039
 URL: https://issues.apache.org/jira/browse/SPARK-5039
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Affects Versions: 1.0.2
Reporter: Josh Rosen
Assignee: Sean Owen
  Labels: starter
 Fix For: 1.0.3


 One of the four {{branch-1.0}} maven builds has been consistently failing due 
 to servlet class signing errors:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.0-Maven-pre-YARN/
 For example:
 {code}
 ContextCleanerSuite:
 Exception encountered when invoking run on a nested suite - class 
 javax.servlet.FilterRegistration's signer information does not match signer 
 information of other classes in the same package *** ABORTED ***
   java.lang.SecurityException: class javax.servlet.FilterRegistration's 
 signer information does not match signer information of other classes in the 
 same package
   at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   ...
 {code}
 The fix for this issue is declaring proper exclusions for some 
 implementations of the servlet API.  I know how to do this, but I don't have 
 time to take care of it now, so I'm tossing up this JIRA to facilitate 
 work-stealing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5132) The name for get stage info atempt ID from Json was wrong

2015-01-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5132.
---
   Resolution: Fixed
Fix Version/s: (was: 1.2.0)
   1.2.1
   1.3.0
   1.1.2

Issue resolved by pull request 3932
[https://github.com/apache/spark/pull/3932]

 The name for get stage info atempt ID from Json was wrong
 -

 Key: SPARK-5132
 URL: https://issues.apache.org/jira/browse/SPARK-5132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: SuYan
Priority: Minor
 Fix For: 1.1.2, 1.3.0, 1.2.1


 stageInfoToJson: Stage Attempt Id
 stageInfoFromJson: Attempt Id
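 A small sketch of the mismatch described above, assuming json4s (which Spark's 
 JsonProtocol is built on); the field values are made up:
 {code}
import org.json4s._
import org.json4s.JsonDSL._

// What stageInfoToJson writes vs. what stageInfoFromJson looks up:
val written: JValue = ("Stage ID" -> 3) ~ ("Stage Attempt Id" -> 1)
val readBack = written \ "Attempt Id"   // JNothing: that key was never written
 {code}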



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268241#comment-14268241
 ] 

Apache Spark commented on SPARK-5108:
-

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/3937

 Need to make jackson dependency version consistent with hadoop-2.6.0.
 -

 Key: SPARK-5108
 URL: https://issues.apache.org/jira/browse/SPARK-5108
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Zhan Zhang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4389) Set akka.remote.netty.tcp.bind-hostname=0.0.0.0 so driver can be located behind NAT

2015-01-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4389:
-
Affects Version/s: 1.2.0

 Set akka.remote.netty.tcp.bind-hostname=0.0.0.0 so driver can be located 
 behind NAT
 -

 Key: SPARK-4389
 URL: https://issues.apache.org/jira/browse/SPARK-4389
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Josh Rosen
Priority: Minor

 We should set {{akka.remote.netty.tcp.bind-hostname=0.0.0.0}} in our Akka 
 configuration so that Spark drivers can be located behind NATs / work with 
 weird DNS setups.
 This is blocked on upgrading our Akka version, since this configuration is 
 not present in Akka 2.3.4. There might be a different approach / workaround 
 that works on our current Akka version, though.
 EDIT: this is blocked by Akka 2.4, since this feature is only available in 
 the 2.4 snapshot release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener

2015-01-07 Thread Zach Fry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268113#comment-14268113
 ] 

Zach Fry commented on SPARK-4906:
-

[~pwendell],

Looks like you wanted to ping [~mkim]. 
He's away until the end of next week, so when he gets back he can take a look 
at this and get back to you. 
We also have some more datapoints to go from, so more to come. 

Zach 

 Spark master OOMs with exception stack trace stored in JobProgressListener
 --

 Key: SPARK-4906
 URL: https://issues.apache.org/jira/browse/SPARK-4906
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.1.1
Reporter: Mingyu Kim

 Spark master was OOMing with a lot of stack traces retained in 
 JobProgressListener. The object dependency chain goes like the following:
 JobProgressListener.stageIdToData => StageUIData.taskData => 
 TaskUIData.errorMessage
 Each error message is ~10 kb since it holds the entire stack trace. As we have 
 a lot of tasks, when all of the tasks across multiple stages went bad, these 
 error messages accounted for 0.5 GB of heap at some point.
 Please correct me if I'm wrong, but it looks like all the task info for 
 running applications is kept in memory, which means long-running applications 
 are almost always bound to OOM. Would it make sense to fix this, for example, 
 by spilling some UI state to disk?
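 A possible mitigation sketch (not from this report, and not a fix for the root 
 cause): cap how many completed stages, and hence their per-task data, the UI 
 listener retains. {{spark.ui.retainedStages}} is an existing Spark setting; 
 the value below is only an example.
 {code}
import org.apache.spark.{SparkConf, SparkContext}

// Retain fewer completed stages in JobProgressListener so old per-task error
// messages are dropped sooner (the default is 1000).
val conf = new SparkConf()
  .setAppName("long-running-app")
  .set("spark.ui.retainedStages", "200")
val sc = new SparkContext(conf)
 {code}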



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4406) SVD should check for k < 1

2015-01-07 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268180#comment-14268180
 ] 

Manoj Kumar commented on SPARK-4406:


Hi Joseph, I believe this issue would be simple enough for me to start working 
on? Does it require you to assign it to me, or can I send a Pull Request right 
away?

 SVD should check for k < 1
 --

 Key: SPARK-4406
 URL: https://issues.apache.org/jira/browse/SPARK-4406
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 When SVD is called with k < 1, it still tries to compute the SVD, causing a 
 lower-level error. It should fail early.
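 A minimal sketch of the kind of early check the issue asks for (the method and 
 parameter names here are illustrative, not the actual MLlib signature):
 {code}
// Illustrative only: validate k before doing any heavy computation.
def computeSVD(k: Int): Unit = {
  require(k >= 1, s"Requested k = $k singular values, but k must be at least 1.")
  // ... proceed with the actual SVD computation ...
}
 {code}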



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2298) Show stage attempt in UI

2015-01-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2298:
--
Fix Version/s: 1.1.0

 Show stage attempt in UI
 

 Key: SPARK-2298
 URL: https://issues.apache.org/jira/browse/SPARK-2298
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Fix For: 1.1.0

 Attachments: Screen Shot 2014-06-25 at 4.54.46 PM.png


 We should add a column to the web ui to show stage attempt id. Then tasks 
 should be grouped by (stageId, stageAttempt) tuple.
 When a stage is resubmitted (e.g. due to fetch failures), we should get a 
 different entry in the web ui and tasks for the resubmission go there.
 See the attached screenshot for the confusing status quo. We currently show 
 the same stage entry twice, and then tasks appear in both. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5132) The name for get stage info atempt ID from Json was wrong

2015-01-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5132:
--
Component/s: Web UI

 The name for get stage info atempt ID from Json was wrong
 -

 Key: SPARK-5132
 URL: https://issues.apache.org/jira/browse/SPARK-5132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.2.0
Reporter: SuYan
Assignee: SuYan
Priority: Minor
 Fix For: 1.3.0, 1.1.2, 1.2.1


 stageInfoToJson: Stage Attempt Id
 stageInfoFromJson: Attempt Id



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5132) The name for get stage info atempt ID from Json was wrong

2015-01-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5132:
--
Assignee: SuYan

 The name for get stage info atempt ID from Json was wrong
 -

 Key: SPARK-5132
 URL: https://issues.apache.org/jira/browse/SPARK-5132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.2.0
Reporter: SuYan
Assignee: SuYan
Priority: Minor
 Fix For: 1.3.0, 1.1.2, 1.2.1


 stageInfoToJson: Stage Attempt Id
 stageInfoFromJson: Attempt Id



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4406) SVD should check for k < 1

2015-01-07 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268180#comment-14268180
 ] 

Manoj Kumar edited comment on SPARK-4406 at 1/7/15 8:40 PM:


Hi Joseph, I believe this issue would be simple enough for me to start working 
on. Does it require you to assign it to me, or can I send a Pull Request right 
away?


was (Author: mechcoder):
Hi Joseph, I believe this issue would be simple enough for me to start working 
on? Does it require you to assign it to me, or can I send a Pull Request right 
away?

 SVD should check for k < 1
 --

 Key: SPARK-4406
 URL: https://issues.apache.org/jira/browse/SPARK-4406
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 When SVD is called with k < 1, it still tries to compute the SVD, causing a 
 lower-level error. It should fail early.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2015-01-07 Thread Jongyoul Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267701#comment-14267701
 ] 

Jongyoul Lee commented on SPARK-3619:
-

[~tnachen] Bumping the version from 0.18.1 to 0.21.0 is easy. I'm running 
simple and more complex job tests on my real Mesos clusters.

 Upgrade to Mesos 0.21 to work around MESOS-1688
 ---

 Key: SPARK-3619
 URL: https://issues.apache.org/jira/browse/SPARK-3619
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Matei Zaharia
Assignee: Timothy Chen

 The Mesos 0.21 release has a fix for 
 https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2458) Make failed application log visible on History Server

2015-01-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2458.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Masayoshi TSUZUKI
Target Version/s: 1.3.0

 Make failed application log visible on History Server
 -

 Key: SPARK-2458
 URL: https://issues.apache.org/jira/browse/SPARK-2458
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Masayoshi TSUZUKI
Assignee: Masayoshi TSUZUKI
 Fix For: 1.3.0


 History server is very helpful for debugging application correctness and 
 performance after the application has finished. However, when the application 
 fails, the link is not listed on the history server UI and the history can't 
 be viewed.
 It would be very useful if we could check the history of failed applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5053) Test maintenance branches on Jenkins using SBT

2015-01-07 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268313#comment-14268313
 ] 

Josh Rosen commented on SPARK-5053:
---

It looks like nearly all of these new builds are failing for various reasons, 
so I could use some help fixing them.

One issue is that several of the PySpark tests are failing with

{code}

OK (skipped=1)
Traceback (most recent call last):
  File pyspark/mllib/_common.py, line 20, in module
import numpy
  File /usr/lib64/python2.6/site-packages/numpy/__init__.py, line 170, in 
module
from . import add_newdocs
  File /usr/lib64/python2.6/site-packages/numpy/add_newdocs.py, line 13, in 
module
from numpy.lib import add_newdoc
  File /usr/lib64/python2.6/site-packages/numpy/lib/__init__.py, line 8, in 
module
from .type_check import *
  File /usr/lib64/python2.6/site-packages/numpy/lib/type_check.py, line 11, 
in module
import numpy.core.numeric as _nx
  File /usr/lib64/python2.6/site-packages/numpy/core/__init__.py, line 46, in 
module
from numpy.testing import Tester
  File /usr/lib64/python2.6/site-packages/numpy/testing/__init__.py, line 13, 
in module
from .utils import *
  File /usr/lib64/python2.6/site-packages/numpy/testing/utils.py, line 15, in 
module
from tempfile import mkdtemp
  File /usr/lib64/python2.6/tempfile.py, line 34, in module
from random import Random as _Random
  File 
/home/jenkins/workspace/Spark-1.1-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/python/pyspark/mllib/random.py,
 line 23, in module
from pyspark.rdd import RDD
  File 
/home/jenkins/workspace/Spark-1.1-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/python/pyspark/__init__.py,
 line 63, in module
from pyspark.context import SparkContext
  File 
/home/jenkins/workspace/Spark-1.1-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/python/pyspark/context.py,
 line 22, in module
from tempfile import NamedTemporaryFile
ImportError: cannot import name NamedTemporaryFile
{code}

Some of the other failures might just be due to flaky tests exposed by higher 
Jenkins loads; let's see if they persist after rebuilds.

 Test maintenance branches on Jenkins using SBT
 --

 Key: SPARK-5053
 URL: https://issues.apache.org/jira/browse/SPARK-5053
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Josh Rosen
Priority: Blocker

 We need to create Jenkins jobs to test maintenance branches using SBT.  The 
 current Maven jobs for backport branches do not run the same checks that the 
 pull request builder / SBT builds do (e.g. MiMa checks, PySpark, RAT, etc.) 
 which means that cherry-picking backports can silently break things and we'll 
 only discover it once PRs that are explicitly opened against those branches 
 fail tests; this long delay between introducing test failures and detecting 
 them is a huge productivity issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268336#comment-14268336
 ] 

Apache Spark commented on SPARK-5108:
-

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/3938

 Need to make jackson dependency version consistent with hadoop-2.6.0.
 -

 Key: SPARK-5108
 URL: https://issues.apache.org/jira/browse/SPARK-5108
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Zhan Zhang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4406) SVD should check for k < 1

2015-01-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268358#comment-14268358
 ] 

Joseph K. Bradley commented on SPARK-4406:
--

It's good to get it assigned if it will take a while, but feel free to submit a 
PR if it's simple like this one.  If a PR will take time, then posting a 
comment that you're working on it is helpful.  Thanks in advance!

 SVD should check for k < 1
 --

 Key: SPARK-4406
 URL: https://issues.apache.org/jira/browse/SPARK-4406
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 When SVD is called with k < 1, it still tries to compute the SVD, causing a 
 lower-level error.  It should fail early.
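As an illustration of the fail-early check being asked for, here is a minimal standalone Python sketch (illustrative only; this is not MLlib's implementation, and the function name, the numpy-based SVD, and the exact bounds check are assumptions made for this example):

{code}
# Standalone sketch of a fail-early validation for the rank parameter k.
# Not MLlib code; uses numpy so the example is runnable on its own.
import numpy as np

def svd_top_k(matrix, k):
    max_k = min(matrix.shape)  # number of singular values available
    # Fail early with a clear message instead of letting a lower-level
    # routine blow up on a nonsensical k.
    if not 1 <= k <= max_k:
        raise ValueError("k must be between 1 and %d, got k=%d" % (max_k, k))
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]
{code}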



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5053) Test maintenance branches on Jenkins using SBT

2015-01-07 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268321#comment-14268321
 ] 

Josh Rosen commented on SPARK-5053:
---

Hmm, it looks like the Python issue is an occurrence of SPARK-3910.  This 
_used_ to work, so I'm not sure why it's failing now.

 Test maintenance branches on Jenkins using SBT
 --

 Key: SPARK-5053
 URL: https://issues.apache.org/jira/browse/SPARK-5053
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Josh Rosen
Priority: Blocker

 We need to create Jenkins jobs to test maintenance branches using SBT.  The 
 current Maven jobs for backport branches do not run the same checks that the 
 pull request builder / SBT builds do (e.g. MiMa checks, PySpark, RAT, etc.) 
 which means that cherry-picking backports can silently break things and we'll 
 only discover it once PRs that are explicitly opened against those branches 
 fail tests; this long delay between introducing test failures and detecting 
 them is a huge productivity issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2015-01-07 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268332#comment-14268332
 ] 

Josh Rosen commented on SPARK-3910:
---

It looks like this was fixed in SPARK-4348, but we're now hitting this error 
when running PySpark tests in Jenkins jobs for maintenance branches (it turns 
out Jenkins wasn't previously running these tests for those branches, so it's 
not clear when the problem was introduced).  I'll see if I can figure out a 
fix for the backport branches.

 ./python/pyspark/mllib/classification.py doctests fails with module name 
 pollution
 --

 Key: SPARK-3910
 URL: https://issues.apache.org/jira/browse/SPARK-3910
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
 Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
 argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
 pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
 unittest2==0.5.1, wsgiref==0.1.2
Reporter: Tomohiko K.
  Labels: pyspark, testing

 In the ./python/run-tests script, we run the doctests in 
 ./pyspark/mllib/classification.py.
 The output is as follows:
 {noformat}
 $ ./python/run-tests
 ...
 Running test: pyspark/mllib/classification.py
 Traceback (most recent call last):
   File "pyspark/mllib/classification.py", line 20, in <module>
     import numpy
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
     from . import add_newdocs
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
     from numpy.lib import add_newdoc
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
     from .type_check import *
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
     import numpy.core.numeric as _nx
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
     from numpy.testing import Tester
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
     from .utils import *
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
     from tempfile import mkdtemp
   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
     from random import Random as _Random
   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
     from pyspark.rdd import RDD
   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
     from pyspark.context import SparkContext
   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
     from tempfile import NamedTemporaryFile
 ImportError: cannot import name NamedTemporaryFile
 0.07 real 0.04 user 0.02 sys
 Had test failures; see logs.
 {noformat}
 The problem is a cyclic import of the tempfile module.
 The cause is that the pyspark.mllib.random module lives in the same directory 
 as the pyspark.mllib.classification module.
 The classification module imports numpy, and numpy in turn imports tempfile 
 internally.
 Because the first entry of sys.path is the directory ./python/pyspark/mllib 
 (where the executed file classification.py resides), tempfile's import of 
 random resolves to pyspark.mllib.random instead of the standard library 
 random module.
 The import chain eventually reaches tempfile again, forming a cyclic import.
 Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
 → (cyclic import!!)
 Furthermore, stat is a standard library module and a pyspark.mllib.stat 
 module also exists, which may cause the same kind of trouble.
 commit: 0e8203f4fb721158fb27897680da476174d24c4b
 A fundamental solution is to avoid reusing module names from the standard 
 library (currently random and stat).
 The difficulty is that renaming pyspark.mllib.random and pyspark.mllib.stat 
 may break code that already uses them.
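To make the shadowing concrete, here is a minimal standalone sketch (illustrative only, not Spark code; the path mentioned in the comments is an assumption about the layout described above):

{code}
# Minimal illustration of the shadowing described above. When a script inside
# python/pyspark/mllib is executed directly, Python puts that directory first
# on sys.path, so "import random" (including the one tempfile performs
# internally) resolves to pyspark/mllib/random.py instead of the standard
# library module.
import sys
print(sys.path[0])      # directory of the executed script, searched first

import random
print(random.__file__)  # e.g. .../python/pyspark/mllib/random.py when shadowed
{code}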



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4406) SVD should check for k < 1

2015-01-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268368#comment-14268368
 ] 

Joseph K. Bradley commented on SPARK-4406:
--

Also, to get JIRAs assigned to you, you will need to get an admin like 
[~mengxr] to add you to the developer group for this project.  (For this JIRA, 
the comment should be good enough.)

 SVD should check for k < 1
 --

 Key: SPARK-4406
 URL: https://issues.apache.org/jira/browse/SPARK-4406
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 When SVD is called with k < 1, it still tries to compute the SVD, causing a 
 lower-level error.  It should fail early.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5122) Remove Shark from spark-ec2

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268390#comment-14268390
 ] 

Apache Spark commented on SPARK-5122:
-

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3939

 Remove Shark from spark-ec2
 ---

 Key: SPARK-5122
 URL: https://issues.apache.org/jira/browse/SPARK-5122
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} 
 anymore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4951) A busy executor may be killed when dynamicAllocation is enabled

2015-01-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4951:
-
Affects Version/s: 1.2.0

 A busy executor may be killed when dynamicAllocation is enabled
 ---

 Key: SPARK-4951
 URL: https://issues.apache.org/jira/browse/SPARK-4951
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Shixiong Zhu

 If a task runs longer than `spark.dynamicAllocation.executorIdleTimeout`, the 
 executor running that task will be killed.
 The following steps (yarn-client mode) can reproduce this bug:
 1. Start `spark-shell` using
 {code}
 ./bin/spark-shell --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=4 \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.dynamicAllocation.executorIdleTimeout=30 \
 --master yarn-client \
 --driver-memory 512m \
 --executor-memory 512m \
 --executor-cores 1
 {code}
 2. Wait more than 30 seconds until there is only one executor.
 3. Run the following code (a task needs at least 50 seconds to finish)
 {code}
 val r = sc.parallelize(1 to 1000, 20).map { t => Thread.sleep(1000); 
 t }.groupBy(_ % 2).collect()
 {code}
 4. Executors are repeatedly killed and re-allocated, which makes the job fail 
 (see the sketch below).
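The behavior the report implies is that the idle timer should only apply to executors with no running tasks. The following standalone Python sketch (illustrative only; it is not Spark's ExecutorAllocationManager, and all names and the data layout are assumptions for this example) captures that rule:

{code}
# Illustrative-only sketch of the intended idle-timeout rule: an executor that
# is still running tasks must never be treated as idle, no matter how long ago
# it last became idle.
import time

def executors_to_remove(executors, idle_timeout_s, now=None):
    """executors: dict of executor_id -> {'running_tasks': int, 'idle_since': float}"""
    now = time.time() if now is None else now
    removable = []
    for executor_id, state in executors.items():
        if state['running_tasks'] > 0:
            continue  # busy executor: skip it even if the timeout has elapsed
        if now - state['idle_since'] >= idle_timeout_s:
            removable.append(executor_id)
    return removable
{code}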



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4983) Tag EC2 instances in the same call that launches them

2015-01-07 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-4983:

Labels: starter  (was: )

 Tag EC2 instances in the same call that launches them
 -

 Key: SPARK-4983
 URL: https://issues.apache.org/jira/browse/SPARK-4983
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Priority: Minor
  Labels: starter

 We launch EC2 instances in {{spark-ec2}} and then immediately tag them in a 
 separate boto call. Sometimes, EC2 doesn't get enough time to propagate 
 information about the just-launched instances, so when we go to tag them we 
 may hit an API endpoint that does not yet know about them.
 This yields the following type of error:
 {code}
 Launching instances...
 Launched 1 slaves in us-east-1b, regid = r-cf780321
 Launched master in us-east-1b, regid = r-da7e0534
 Traceback (most recent call last):
   File "./ec2/spark_ec2.py", line 1284, in <module>
     main()
   File "./ec2/spark_ec2.py", line 1276, in main
     real_main()
   File "./ec2/spark_ec2.py", line 1122, in real_main
     (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
   File "./ec2/spark_ec2.py", line 646, in launch_cluster
     value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
   File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 80, in add_tag
     self.add_tags({key: value}, dry_run)
   File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 97, in add_tags
     dry_run=dry_run
   File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 4202, in create_tags
     return self.get_status('CreateTags', params, verb='POST')
   File ".../spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1223, in get_status
     raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The 
 instance ID 'i-585219a6' does not 
 exist</Message></Error></Errors><RequestID>b9f1ad6e-59b9-47fd-a693-527be1f779eb</RequestID></Response>
 {code}
 The solution is to tag the instances in the same call that launches them, or, 
 less desirably, to tag the instances after a short wait and retry until EC2 
 knows about them (a retry-based fallback is sketched below).
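For the fallback path, a retry loop along these lines could work. This is only a sketch, not the actual spark_ec2 change; the helper name and retry policy are assumptions, and it relies on boto 2.x's batch create_tags call and EC2ResponseError:

{code}
# Sketch of the "tag after a short wait" fallback: retry the batch tagging
# call until EC2 has propagated the just-launched instances.
import time
from boto.exception import EC2ResponseError

def tag_instances_with_retry(conn, instances, tags, attempts=5, delay_s=5):
    instance_ids = [i.id for i in instances]
    for attempt in range(attempts):
        try:
            conn.create_tags(instance_ids, tags)  # boto 2.x batch tagging
            return
        except EC2ResponseError as e:
            # The instances are not visible to this API endpoint yet; wait
            # briefly and try again. Re-raise anything else.
            if e.error_code != 'InvalidInstanceID.NotFound':
                raise
            time.sleep(delay_s)
    raise RuntimeError(
        "Could not tag instances %s after %d attempts" % (instance_ids, attempts))
{code}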



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2015-01-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268411#comment-14268411
 ] 

Apache Spark commented on SPARK-3910:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3940

 ./python/pyspark/mllib/classification.py doctests fails with module name 
 pollution
 --

 Key: SPARK-3910
 URL: https://issues.apache.org/jira/browse/SPARK-3910
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
 Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
 argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
 pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
 unittest2==0.5.1, wsgiref==0.1.2
Reporter: Tomohiko K.
  Labels: pyspark, testing

 In the ./python/run-tests script, we run the doctests in 
 ./pyspark/mllib/classification.py.
 The output is as follows:
 {noformat}
 $ ./python/run-tests
 ...
 Running test: pyspark/mllib/classification.py
 Traceback (most recent call last):
   File "pyspark/mllib/classification.py", line 20, in <module>
     import numpy
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
     from . import add_newdocs
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
     from numpy.lib import add_newdoc
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
     from .type_check import *
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
     import numpy.core.numeric as _nx
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
     from numpy.testing import Tester
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
     from .utils import *
   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
     from tempfile import mkdtemp
   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
     from random import Random as _Random
   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
     from pyspark.rdd import RDD
   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
     from pyspark.context import SparkContext
   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
     from tempfile import NamedTemporaryFile
 ImportError: cannot import name NamedTemporaryFile
 0.07 real 0.04 user 0.02 sys
 Had test failures; see logs.
 {noformat}
 The problem is a cyclic import of the tempfile module.
 The cause is that the pyspark.mllib.random module lives in the same directory 
 as the pyspark.mllib.classification module.
 The classification module imports numpy, and numpy in turn imports tempfile 
 internally.
 Because the first entry of sys.path is the directory ./python/pyspark/mllib 
 (where the executed file classification.py resides), tempfile's import of 
 random resolves to pyspark.mllib.random instead of the standard library 
 random module.
 The import chain eventually reaches tempfile again, forming a cyclic import.
 Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
 → (cyclic import!!)
 Furthermore, stat is a standard library module and a pyspark.mllib.stat 
 module also exists, which may cause the same kind of trouble.
 commit: 0e8203f4fb721158fb27897680da476174d24c4b
 A fundamental solution is to avoid reusing module names from the standard 
 library (currently random and stat).
 The difficulty is that renaming pyspark.mllib.random and pyspark.mllib.stat 
 may break code that already uses them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) Python bindings for GraphX

2015-01-07 Thread Ameet Talwalkar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268658#comment-14268658
 ] 

Ameet Talwalkar commented on SPARK-3789:


Agreed, thanks for the update! Also, the 1.3 release is a good target if I'm 
going to use this in my MOOC...




 Python bindings for GraphX
 --

 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar
Assignee: Kushal Datta





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4777) Some block memory after unrollSafely is not counted into used memory (memoryStore.entries or unrollMemory)

2015-01-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4777:
-
Priority: Major  (was: Minor)

 Some block memory after unrollSafely is not counted into used 
 memory (memoryStore.entries or unrollMemory)
 ---

 Key: SPARK-4777
 URL: https://issues.apache.org/jira/browse/SPARK-4777
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: SuYan

 Some block memory is not counted into the memory used by memoryStore or into 
 unrollMemory.
 After thread A unrolls a block via unrollSafely, it releases its 40 MB of 
 unrollMemory (which other threads can then use), and then waits to acquire 
 accountingLock before calling tryToPut for block A (30 MB). Until thread A 
 acquires accountingLock, block A's 30 MB is counted neither in unrollMemory 
 nor in memoryStore.currentMemory, so during that window free memory is 
 over-reported by the size of the block (a toy walkthrough follows below).
 IIUC, freeMemory should subtract that block's memory.
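A toy walkthrough of the numbers above; the 100 MB total and the 20 MB already in the store are assumptions added for illustration, and this is not Spark's MemoryStore code:

{code}
# Toy walkthrough of the accounting gap, using the report's 40 MB / 30 MB
# figures and assumed totals. Not Spark code.
max_memory    = 100  # MB of storage memory (assumed)
store_used    = 20   # MB already accounted in memoryStore.currentMemory (assumed)
unroll_memory = 40   # MB reserved by thread A while unrolling block A

free_while_unrolling = max_memory - store_used - unroll_memory  # 40 MB

# Thread A finishes unrolling and releases its unroll memory, but has not yet
# acquired accountingLock to put block A (30 MB) into the store.
unroll_memory = 0
free_during_gap = max_memory - store_used - unroll_memory       # 80 MB

# Block A's 30 MB is held but counted nowhere, so free memory is over-reported:
actual_free = free_during_gap - 30                              # 50 MB
print(free_during_gap, actual_free)
{code}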



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


