[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267469#comment-14267469 ]

Aniket Bhatnagar commented on SPARK-3452:
-----------------------------------------

Here is the exception I get while triggering a job whose SparkContext has master set to yarn-client. A quick look at the 1.2.0 source code suggests I should depend on the spark-yarn module, which I can't, as it is no longer published. Do you want me to log a separate defect for this and submit an appropriate pull request?

{noformat}
2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - MemoryStore started with capacity 731.7 MB
Exception in thread "pool-10-thread-13" java.lang.ExceptionInInitializerError
	at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784)
	at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
	at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:180)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:232)
	at com.myimpl.Server:23)
	at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
	at scala.util.Try$.apply(Try.scala:191)
	at scala.util.Success.map(Try.scala:236)
	at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
	at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
	at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
	at scala.util.Try$.apply(Try.scala:191)
	at scala.util.Success.map(Try.scala:236)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unable to load YARN support
	at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199)
	at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:194)
	at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
	... 27 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:190)
	at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195)
	... 29 more
{noformat}

Maven build should skip publishing artifacts people shouldn't depend on
------------------------------------------------------------------------

    Key: SPARK-3452
    URL: https://issues.apache.org/jira/browse/SPARK-3452
    Project: Spark
    Issue Type: Bug
    Components: Build
    Affects Versions: 1.0.0, 1.1.0
    Reporter: Patrick Wendell
    Assignee: Prashant Sharma
    Priority: Critical
    Fix For: 1.2.0

I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.
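For context on the failure above: the ClassNotFoundException is raised by Spark's reflective lookup of its YARN support class. A minimal sketch of that lookup, with the class name taken from the stack trace and the surrounding code illustrative rather than Spark's exact source:

{code}
// Fails with ClassNotFoundException when the spark-yarn classes are not on
// the classpath; Spark then reports this as "Unable to load YARN support".
val yarnUtil =
  try {
    Class.forName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil").newInstance()
  } catch {
    case e: ClassNotFoundException =>
      throw new RuntimeException("Unable to load YARN support", e)
  }
{code}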
[jira] [Commented] (SPARK-5068) When the path is not found in HDFS, we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267483#comment-14267483 ]

Apache Spark commented on SPARK-5068:
--------------------------------------

User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/3907

When the path is not found in HDFS, we can't get the result
------------------------------------------------------------

    Key: SPARK-5068
    URL: https://issues.apache.org/jira/browse/SPARK-5068
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Reporter: jeanlyn

When a partition path is found in the metastore but not found in HDFS, it causes problems like the following:

{noformat}
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
{noformat}

{noformat}
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4
{noformat}

When I run the query

{noformat}
select * from partition_test limit 10
{noformat}

in *hive*, I get no problem, but when I run it in *spark-sql* I get the following error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
	at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
	at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
	at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
	at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
	at org.apache.spark.sql.hive.testpartition.main(test.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{noformat}
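The fix the PR aims for amounts to checking partition paths against the filesystem before building the scan. A minimal sketch under that assumption (paths and configuration illustrative; this is not the PR's actual code):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Keep only the partition directories that actually exist on HDFS, so a
// metastore-only partition such as dt=2 no longer breaks the query.
val conf = new Configuration()
val partitionPaths = Seq(
  "hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=1",
  "hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2")
val existingPaths = partitionPaths.filter { p =>
  val path = new Path(p)
  path.getFileSystem(conf).exists(path) // dt=2 is filtered out here
}
{code}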
[jira] [Created] (SPARK-5131) A typo in configuration doc
uncleGen created SPARK-5131:
--------------------------------

    Summary: A typo in configuration doc
    Key: SPARK-5131
    URL: https://issues.apache.org/jira/browse/SPARK-5131
    Project: Spark
    Issue Type: Bug
    Reporter: uncleGen
    Priority: Minor
    Fix For: 1.2.1
[jira] [Closed] (SPARK-5120) Output the thread name in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

WangTaoTheTonic closed SPARK-5120.
----------------------------------
    Resolution: Won't Fix

Output the thread name in log4j.properties
------------------------------------------

    Key: SPARK-5120
    URL: https://issues.apache.org/jira/browse/SPARK-5120
    Project: Spark
    Issue Type: Improvement
    Components: Deploy
    Reporter: WangTaoTheTonic
    Priority: Minor

In most cases the thread name is very useful for analysing a running job, so it would be better to log it via the log4j properties.
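For reference, log4j exposes the thread name through the %t conversion character; a sketch of the kind of pattern the request implies (layout values illustrative, not Spark's shipped default):

{code}
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p [%t] %c{1}: %m%n
{code}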
[jira] [Commented] (SPARK-5131) A typo in configuration doc
[ https://issues.apache.org/jira/browse/SPARK-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267441#comment-14267441 ]

Apache Spark commented on SPARK-5131:
--------------------------------------

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3930

A typo in configuration doc
---------------------------

    Key: SPARK-5131
    URL: https://issues.apache.org/jira/browse/SPARK-5131
    Project: Spark
    Issue Type: Bug
    Reporter: uncleGen
    Priority: Minor
    Fix For: 1.2.1
[jira] [Commented] (SPARK-5129) make SqlContext support select date +/- XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267442#comment-14267442 ]

Apache Spark commented on SPARK-5129:
--------------------------------------

User 'DoingDone9' has created a pull request for this issue:
https://github.com/apache/spark/pull/3931

make SqlContext support select date +/- XX DAYS from table
----------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9
    Priority: Minor

Example:
create table test (date: Date)
2014-01-01
2014-01-02
2014-01-03

When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

and when running "select date - 10 DAYS from test":
2013-12-22
2013-12-23
2013-12-24
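The requested semantics are ordinary calendar-day arithmetic; expressed with plain JDK types rather than Spark SQL (illustrative only, not the PR's implementation):

{code}
import java.sql.Date
import java.util.Calendar

// Shift a SQL date by a number of calendar days.
def addDays(d: Date, days: Int): Date = {
  val cal = Calendar.getInstance()
  cal.setTime(d)
  cal.add(Calendar.DAY_OF_MONTH, days)
  new Date(cal.getTimeInMillis)
}

addDays(Date.valueOf("2014-01-01"), 10)  // 2014-01-11
addDays(Date.valueOf("2014-01-01"), -10) // 2013-12-22
{code}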
[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types
[ https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267450#comment-14267450 ]

Kai Sasaki commented on SPARK-4284:
-----------------------------------

I'd like to work on this issue if it is not fixed yet. Could you assign it to me?

BinaryClassificationMetrics precision-recall method names should correspond to return types
--------------------------------------------------------------------------------------------

    Key: SPARK-4284
    URL: https://issues.apache.org/jira/browse/SPARK-4284
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Affects Versions: 1.2.0
    Reporter: Joseph K. Bradley
    Priority: Minor

BinaryClassificationMetrics has several methods which work with (recall, precision) pairs, but the method names all use the wrong order (pr). This order should be fixed.
[jira] [Created] (SPARK-5132) The name used to get the stage info attempt ID from JSON was wrong
SuYan created SPARK-5132:
-----------------------------

    Summary: The name used to get the stage info attempt ID from JSON was wrong
    Key: SPARK-5132
    URL: https://issues.apache.org/jira/browse/SPARK-5132
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 1.2.0
    Reporter: SuYan
    Priority: Minor
    Fix For: 1.2.0

stageInfoToJson: Stage Attempt Id
stageInfoFromJson: Attempt Id
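The mismatch in a nutshell, with the JSON handling reduced to a plain map (field names from the issue body; this is not the actual JsonProtocol code):

{code}
// The writer emits the value under "Stage Attempt Id", but the reader
// looks up "Attempt Id", so the attempt ID is lost on round-trip.
val written = Map("Stage Attempt Id" -> 1)
val roundTripped = written.get("Attempt Id") // None: wrong key
{code}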
[jira] [Commented] (SPARK-5132) The name used to get the stage info attempt ID from JSON was wrong
[ https://issues.apache.org/jira/browse/SPARK-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267454#comment-14267454 ]

Apache Spark commented on SPARK-5132:
--------------------------------------

User 'suyanNone' has created a pull request for this issue:
https://github.com/apache/spark/pull/3932

The name used to get the stage info attempt ID from JSON was wrong
-------------------------------------------------------------------

    Key: SPARK-5132
    URL: https://issues.apache.org/jira/browse/SPARK-5132
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 1.2.0
    Reporter: SuYan
    Priority: Minor
    Fix For: 1.2.0

stageInfoToJson: Stage Attempt Id
stageInfoFromJson: Attempt Id
[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types
[ https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267457#comment-14267457 ]

Sean Owen commented on SPARK-4284:
----------------------------------

[~lewuathe] I think you can just start working on it and submit a PR. For long-running efforts it may make sense to officially declare that you're working on it, and try to get consensus that it's your issue, but this should be quite a quick/small change.

BinaryClassificationMetrics precision-recall method names should correspond to return types
--------------------------------------------------------------------------------------------

    Key: SPARK-4284
    URL: https://issues.apache.org/jira/browse/SPARK-4284
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Affects Versions: 1.2.0
    Reporter: Joseph K. Bradley
    Priority: Minor

BinaryClassificationMetrics has several methods which work with (recall, precision) pairs, but the method names all use the wrong order (pr). This order should be fixed.
[jira] [Created] (SPARK-5128) Add stable log1pExp impl
Xiangrui Meng created SPARK-5128:
-------------------------------------

    Summary: Add stable log1pExp impl
    Key: SPARK-5128
    URL: https://issues.apache.org/jira/browse/SPARK-5128
    Project: Spark
    Issue Type: Sub-task
    Components: MLlib
    Reporter: Xiangrui Meng
    Assignee: DB Tsai
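The standard numerically stable formulation of log(1 + exp(x)) is sketched below; the actual implementation lives in the linked PR, this only illustrates the idea:

{code}
// For large positive x, math.exp(x) overflows to Infinity, while
// x + log1p(exp(-x)) stays finite and equals log(1 + exp(x)).
def log1pExp(x: Double): Double =
  if (x > 0) x + math.log1p(math.exp(-x))
  else math.log1p(math.exp(x))
{code}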
[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DoingDone9 updated SPARK-5129:
------------------------------
    Description:
Example:
create table test (date: Date)
2014-01-01
2014-01-02
2014-01-03
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

  was:
Example:
create table test (date: Date, name: String)
2014-01-01 a
2014-01-02 b
2014-01-03 c
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

make SqlContext support select date + XX DAYS from table
---------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9
    Priority: Minor

Example:
create table test (date: Date)
2014-01-01
2014-01-02
2014-01-03
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13
[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DoingDone9 updated SPARK-5129:
------------------------------
    Description:
Example:
create table test (date: Date, name: String)
2014-01-01 a
2014-01-02 b
2014-01-03 c
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

  was:
Example:
create table test (date: Date, name: String)
date       name
2014-01-01 a
2014-01-02 b
2014-01-03 c
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

make SqlContext support select date + XX DAYS from table
---------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9
    Priority: Minor

Example:
create table test (date: Date, name: String)
2014-01-01 a
2014-01-02 b
2014-01-03 c
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13
[jira] [Commented] (SPARK-5127) Fixed overflow when there are outliers in data in Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267390#comment-14267390 ]

DB Tsai commented on SPARK-5127:
--------------------------------

Not an issue in binary logistic regression. The problem only occurs in MLOR.

Fixed overflow when there are outliers in data in Logistic Regression
----------------------------------------------------------------------

    Key: SPARK-5127
    URL: https://issues.apache.org/jira/browse/SPARK-5127
    Project: Spark
    Issue Type: Bug
    Components: MLlib
    Reporter: DB Tsai

gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier can overflow if there are samples far away from the hyperplane, which happens when there are outliers in the data. As a result, we use an equivalent but more numerically stable formula:

val gradientMultiplier = if (margin > 0.0) {
  val temp = math.exp(-margin)
  temp / (1.0 + temp) - label
} else {
  1.0 / (1.0 + math.exp(margin)) - label
}
[jira] [Closed] (SPARK-5127) Fixed overflow when there are outliers in data in Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai closed SPARK-5127.
--------------------------
    Resolution: Not a Problem

Fixed overflow when there are outliers in data in Logistic Regression
----------------------------------------------------------------------

    Key: SPARK-5127
    URL: https://issues.apache.org/jira/browse/SPARK-5127
    Project: Spark
    Issue Type: Bug
    Components: MLlib
    Reporter: DB Tsai

gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier can overflow if there are samples far away from the hyperplane, which happens when there are outliers in the data. As a result, we use an equivalent but more numerically stable formula:

val gradientMultiplier = if (margin > 0.0) {
  val temp = math.exp(-margin)
  temp / (1.0 + temp) - label
} else {
  1.0 / (1.0 + math.exp(margin)) - label
}
[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-5097:
-----------------------------------
    Priority: Critical  (was: Major)

Adding data frame APIs to SchemaRDD
-----------------------------------

    Key: SPARK-5097
    URL: https://issues.apache.org/jira/browse/SPARK-5097
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin
    Priority: Critical
    Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf

SchemaRDD, through its DSL, already provides common data frame functionality. However, the DSL was originally created for constructing test cases, without much end-user usability or API stability in mind. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable.
[jira] [Updated] (SPARK-5127) Fixed overflow when there are outliers in data in Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai updated SPARK-5127:
---------------------------
    Description:
gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier can overflow if there are samples far away from the hyperplane, which happens when there are outliers in the data. As a result, we use an equivalent but more numerically stable formula:

val gradientMultiplier = if (margin > 0.0) {
  val temp = math.exp(-margin)
  temp / (1.0 + temp) - label
} else {
  1.0 / (1.0 + math.exp(margin)) - label
}

  was:
gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier can overflow if there are samples far away from the hyperplane, which happens when there are outliers in the data. As a result, we use an equivalent but more numerically stable formula:

```
val gradientMultiplier = if (margin > 0.0) {
  val temp = math.exp(-margin)
  temp / (1.0 + temp) - label
} else {
  1.0 / (1.0 + math.exp(margin)) - label
}
```

Fixed overflow when there are outliers in data in Logistic Regression
----------------------------------------------------------------------

    Key: SPARK-5127
    URL: https://issues.apache.org/jira/browse/SPARK-5127
    Project: Spark
    Issue Type: Bug
    Components: MLlib
    Reporter: DB Tsai

gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier can overflow if there are samples far away from the hyperplane, which happens when there are outliers in the data. As a result, we use an equivalent but more numerically stable formula:

val gradientMultiplier = if (margin > 0.0) {
  val temp = math.exp(-margin)
  temp / (1.0 + temp) - label
} else {
  1.0 / (1.0 + math.exp(margin)) - label
}
[jira] [Updated] (SPARK-5127) Fixed overflow when there are outliers in data in Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai updated SPARK-5127:
---------------------------
    Description:
gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier can overflow if there are samples far away from the hyperplane, which happens when there are outliers in the data. As a result, we use an equivalent but more numerically stable formula:

```
val gradientMultiplier = if (margin > 0.0) {
  val temp = math.exp(-margin)
  temp / (1.0 + temp) - label
} else {
  1.0 / (1.0 + math.exp(margin)) - label
}
```

  was:
gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier can overflow if there are samples far away from the hyperplane, which happens when there are outliers in the data. As a result, we use an equivalent but more numerically stable formula:

val gradientMultiplier = if (margin > 0.0) {
  val temp = math.exp(-margin)
  temp / (1.0 + temp) - label
} else {
  1.0 / (1.0 + math.exp(margin)) - label
}

Fixed overflow when there are outliers in data in Logistic Regression
----------------------------------------------------------------------

    Key: SPARK-5127
    URL: https://issues.apache.org/jira/browse/SPARK-5127
    Project: Spark
    Issue Type: Bug
    Components: MLlib
    Reporter: DB Tsai

gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

However, the first part of gradientMultiplier can overflow if there are samples far away from the hyperplane, which happens when there are outliers in the data. As a result, we use an equivalent but more numerically stable formula:

```
val gradientMultiplier = if (margin > 0.0) {
  val temp = math.exp(-margin)
  temp / (1.0 + temp) - label
} else {
  1.0 / (1.0 + math.exp(margin)) - label
}
```
[jira] [Commented] (SPARK-4257) Spark master can only be accessed by hostname
[ https://issues.apache.org/jira/browse/SPARK-4257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267387#comment-14267387 ]

Alister Lee commented on SPARK-4257:
------------------------------------

Further, the spark URL is set correctly when SPARK_MASTER_IP is set, but not if the -h option is used with sbin/start-master.sh. E.g.:

{noformat}
$ sbin/start-master.sh -h `hostname --ip-address`
starting org.apache.spark.deploy.master.Master, logging to /tmp/log/spark-ec2-user-org.apache.spark.deploy.master.Master-1-ip-172-31-12-155.out
$ grep spark:// /tmp/log/spark*.out
15/01/07 08:04:12 INFO Master: Starting Spark master at spark://ip-172-31-12-155:7077
$ sbin/stop-master.sh
stopping org.apache.spark.deploy.master.Master
$ export SPARK_MASTER_IP=`hostname --ip-address`
$ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /tmp/log/spark-ec2-user-org.apache.spark.deploy.master.Master-1-ip-172-31-12-155.out
$ grep spark:// /tmp/log/spark*.out
15/01/07 08:05:39 INFO Master: Starting Spark master at spark://172.31.12.155:7077
{noformat}

Spark master can only be accessed by hostname
---------------------------------------------

    Key: SPARK-4257
    URL: https://issues.apache.org/jira/browse/SPARK-4257
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 1.2.0
    Reporter: Davies Liu
    Priority: Critical

After sbin/start-all.sh, the spark shell cannot connect to the standalone master by spark://IP:7077; it works if the IP is replaced by the hostname. The docs [1] say to use `spark://IP:PORT` to connect to the master.

[1] http://spark.apache.org/docs/latest/spark-standalone.html
[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DoingDone9 updated SPARK-5129:
------------------------------
    Priority: Minor  (was: Major)

make SqlContext support select date + XX DAYS from table
---------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9
    Priority: Minor

Example:
[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DoingDone9 updated SPARK-5129:
------------------------------
    Description: Example:

make SqlContext support select date + XX DAYS from table
---------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9

Example:
[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DoingDone9 updated SPARK-5129:
------------------------------
    Description:
Example:
create table test (date: Date, name: String)
date       name
2014-01-01 a
2014-01-02 b
2014-01-03 c
When I run "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

  was: Example:

make SqlContext support select date + XX DAYS from table
---------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9
    Priority: Minor

Example:
create table test (date: Date, name: String)
date       name
2014-01-01 a
2014-01-02 b
2014-01-03 c
When I run "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13
[jira] [Updated] (SPARK-5129) make SqlContext support select date + XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DoingDone9 updated SPARK-5129:
------------------------------
    Description:
Example:
create table test (date: Date, name: String)
date       name
2014-01-01 a
2014-01-02 b
2014-01-03 c
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

  was:
Example:
create table test (date: Date, name: String)
date       name
2014-01-01 a
2014-01-02 b
2014-01-03 c
When I run "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

make SqlContext support select date + XX DAYS from table
---------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9
    Priority: Minor

Example:
create table test (date: Date, name: String)
date       name
2014-01-01 a
2014-01-02 b
2014-01-03 c
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13
[jira] [Commented] (SPARK-5128) Add stable log1pExp impl
[ https://issues.apache.org/jira/browse/SPARK-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267391#comment-14267391 ]

Apache Spark commented on SPARK-5128:
--------------------------------------

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3915

Add stable log1pExp impl
------------------------

    Key: SPARK-5128
    URL: https://issues.apache.org/jira/browse/SPARK-5128
    Project: Spark
    Issue Type: Sub-task
    Components: MLlib
    Reporter: Xiangrui Meng
    Assignee: DB Tsai
[jira] [Updated] (SPARK-5129) make SqlContext support select date +/- XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DoingDone9 updated SPARK-5129:
------------------------------
    Summary: make SqlContext support select date +/- XX DAYS from table  (was: make SqlContext support select date + XX DAYS from table)

make SqlContext support select date +/- XX DAYS from table
----------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9
    Priority: Minor

Example:
create table test (date: Date)
2014-01-01
2014-01-02
2014-01-03
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13
[jira] [Commented] (SPARK-5128) Add stable log1pExp impl
[ https://issues.apache.org/jira/browse/SPARK-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267392#comment-14267392 ]

DB Tsai commented on SPARK-5128:
--------------------------------

https://github.com/apache/spark/pull/3915/commits

Add stable log1pExp impl
------------------------

    Key: SPARK-5128
    URL: https://issues.apache.org/jira/browse/SPARK-5128
    Project: Spark
    Issue Type: Sub-task
    Components: MLlib
    Reporter: Xiangrui Meng
    Assignee: DB Tsai
[jira] [Updated] (SPARK-5129) make SqlContext support select date +/- XX DAYS from table
[ https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DoingDone9 updated SPARK-5129:
------------------------------
    Description:
Example:
create table test (date: Date)
2014-01-01
2014-01-02
2014-01-03
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13
and when running "select date - 10 DAYS from test":
2013-12-22
2013-12-23
2013-12-24

  was:
Example:
create table test (date: Date)
2014-01-01
2014-01-02
2014-01-03
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13

make SqlContext support select date +/- XX DAYS from table
----------------------------------------------------------

    Key: SPARK-5129
    URL: https://issues.apache.org/jira/browse/SPARK-5129
    Project: Spark
    Issue Type: Improvement
    Reporter: DoingDone9
    Priority: Minor

Example:
create table test (date: Date)
2014-01-01
2014-01-02
2014-01-03
When running "select date + 10 DAYS from test", I want to get:
2014-01-11
2014-01-12
2014-01-13
and when running "select date - 10 DAYS from test":
2013-12-22
2013-12-23
2013-12-24
[jira] [Created] (SPARK-5130) yarn-cluster mode should not be considered as client mode in spark-submit
WangTaoTheTonic created SPARK-5130:
--------------------------------------

    Summary: yarn-cluster mode should not be considered as client mode in spark-submit
    Key: SPARK-5130
    URL: https://issues.apache.org/jira/browse/SPARK-5130
    Project: Spark
    Issue Type: Bug
    Components: Deploy
    Reporter: WangTaoTheTonic

spark-submit chooses SparkSubmitDriverBootstrapper or SparkSubmit to launch according to --deploy-mode. When submitting an application using yarn-cluster we do not need to specify --deploy-mode, so spark-submit launches SparkSubmitDriverBootstrapper, which is not the proper behaviour.
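A simplified sketch of the decision being described (names hypothetical, not SparkSubmit's actual code): the master string yarn-cluster implies cluster deploy mode even when --deploy-mode is omitted, so the client-mode-only bootstrapper must not be chosen for it.

{code}
def isClientMode(master: String, deployMode: Option[String]): Boolean =
  deployMode.getOrElse(
    if (master == "yarn-cluster") "cluster" else "client") == "client"

assert(!isClientMode("yarn-cluster", None)) // previously mis-detected as client mode
{code}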
[jira] [Commented] (SPARK-5130) yarn-cluster mode should not be considered as client mode in spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267414#comment-14267414 ]

Apache Spark commented on SPARK-5130:
--------------------------------------

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/3929

yarn-cluster mode should not be considered as client mode in spark-submit
---------------------------------------------------------------------------

    Key: SPARK-5130
    URL: https://issues.apache.org/jira/browse/SPARK-5130
    Project: Spark
    Issue Type: Bug
    Components: Deploy
    Reporter: WangTaoTheTonic

spark-submit chooses SparkSubmitDriverBootstrapper or SparkSubmit to launch according to --deploy-mode. When submitting an application using yarn-cluster we do not need to specify --deploy-mode, so spark-submit launches SparkSubmitDriverBootstrapper, which is not the proper behaviour.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267419#comment-14267419 ]

Patrick Wendell commented on SPARK-1529:
----------------------------------------

Hey Sean,

From what I remember of this, the issue is that MapR clusters are not typically provisioned with much local disk space available, because MapRFS supports accessing local volumes in its API, unlike the HDFS API. So in general the expectation is that large amounts of local data should be written through MapR's API to its local filesystem. They have an NFS mount you can use as a workaround to provide POSIX APIs, and I think most MapR users set this mount up and then have Spark write shuffle data there.

Option 2, which [~rkannan82] mentions, is not actually feasible in Spark right now. We don't support writing shuffle data through the Hadoop APIs right now, and I think Cheng's patch was only a prototype of how we might do that...

Support setting spark.local.dirs to a hadoop FileSystem
--------------------------------------------------------

    Key: SPARK-1529
    URL: https://issues.apache.org/jira/browse/SPARK-1529
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Reporter: Patrick Wendell
    Assignee: Cheng Lian

In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location.
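For illustration, the NFS-mount workaround described above boils down to pointing spark.local.dir at the mounted path (the mount point below is hypothetical):

{code}
# spark-defaults.conf
spark.local.dir  /mapr/my.cluster.com/tmp/spark-local
{code}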
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267424#comment-14267424 ]

Patrick Wendell commented on SPARK-1529:
----------------------------------------

BTW, I think if MapR wants to have a customized shuffle, the direction proposed in this patch is probably not the best way to do it. It would make more sense to implement a DFS-based shuffle using the new pluggable shuffle API, i.e. a shuffle that communicates through the filesystem rather than doing transfers through Spark.

Support setting spark.local.dirs to a hadoop FileSystem
--------------------------------------------------------

    Key: SPARK-1529
    URL: https://issues.apache.org/jira/browse/SPARK-1529
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Reporter: Patrick Wendell
    Assignee: Cheng Lian

In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location.
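For reference, the pluggable shuffle API referred to here is selected through a configuration key; a custom DFS-based shuffle would plug in along these lines (the implementation class name is hypothetical):

{code}
# spark-defaults.conf
spark.shuffle.manager  com.example.shuffle.DfsShuffleManager
{code}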
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267430#comment-14267430 ]

Sean Owen commented on SPARK-1529:
----------------------------------

[~pwendell] Gotcha, that begins to make sense. I assume the cluster can be provisioned with as much local disk as desired, regardless of defaults. The alternative, writing temp files across the network and reading them back in order to then broadcast them back over the network, seems a lot worse than just setting up the right amount of local disk. But if it works well enough and is easier in some situations, it sounds like that's also an option.

I suppose I'm asking/questioning why the project would want to encourage remote shuffle files by trying to not just use the HDFS APIs, but even maintain a specialized version of them, just to make a third workaround for a vendor config issue? Surely MapR should just set up clusters that are provisioned more in line with how Spark needs them.

Support setting spark.local.dirs to a hadoop FileSystem
--------------------------------------------------------

    Key: SPARK-1529
    URL: https://issues.apache.org/jira/browse/SPARK-1529
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Reporter: Patrick Wendell
    Assignee: Cheng Lian

In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location.
[jira] [Resolved] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
[ https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kanwaljit Singh resolved SPARK-2641.
------------------------------------
    Resolution: Fixed

Spark submit doesn't pick up executor instances from properties file
---------------------------------------------------------------------

    Key: SPARK-2641
    URL: https://issues.apache.org/jira/browse/SPARK-2641
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 1.0.0
    Reporter: Kanwaljit Singh

When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option:

spark.executor.instances=5
spark.executor.memory=2120m
spark.executor.cores=3

The submitted job picks up the cores and memory, but not the correct number of instances. I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:

// Use properties file as fallback for values which have a direct analog to
// arguments in this script.
master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
executorMemory = Option(executorMemory)
  .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
executorCores = Option(executorCores)
  .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
totalExecutorCores = Option(totalExecutorCores)
  .getOrElse(defaultProperties.get("spark.cores.max").orNull)
name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)

Along with these defaults, we should also set a default for instances:

numExecutors = Option(numExecutors)
  .getOrElse(defaultProperties.get("spark.executor.instances").orNull)

PS: spark.executor.instances is also not mentioned on http://spark.apache.org/docs/latest/configuration.html
[jira] [Closed] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
[ https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kanwaljit Singh closed SPARK-2641.
----------------------------------

Spark submit doesn't pick up executor instances from properties file
---------------------------------------------------------------------

    Key: SPARK-2641
    URL: https://issues.apache.org/jira/browse/SPARK-2641
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 1.0.0
    Reporter: Kanwaljit Singh

When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option:

spark.executor.instances=5
spark.executor.memory=2120m
spark.executor.cores=3

The submitted job picks up the cores and memory, but not the correct number of instances. I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:

// Use properties file as fallback for values which have a direct analog to
// arguments in this script.
master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
executorMemory = Option(executorMemory)
  .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
executorCores = Option(executorCores)
  .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
totalExecutorCores = Option(totalExecutorCores)
  .getOrElse(defaultProperties.get("spark.cores.max").orNull)
name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)

Along with these defaults, we should also set a default for instances:

numExecutors = Option(numExecutors)
  .getOrElse(defaultProperties.get("spark.executor.instances").orNull)

PS: spark.executor.instances is also not mentioned on http://spark.apache.org/docs/latest/configuration.html
[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types
[ https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267564#comment-14267564 ]

Apache Spark commented on SPARK-4284:
--------------------------------------

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/3933

BinaryClassificationMetrics precision-recall method names should correspond to return types
--------------------------------------------------------------------------------------------

    Key: SPARK-4284
    URL: https://issues.apache.org/jira/browse/SPARK-4284
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Affects Versions: 1.2.0
    Reporter: Joseph K. Bradley
    Priority: Minor

BinaryClassificationMetrics has several methods which work with (recall, precision) pairs, but the method names all use the wrong order (pr). This order should be fixed.
[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267697#comment-14267697 ]

Apache Spark commented on SPARK-3619:
--------------------------------------

User 'jongyoul' has created a pull request for this issue:
https://github.com/apache/spark/pull/3934

Upgrade to Mesos 0.21 to work around MESOS-1688
-----------------------------------------------

    Key: SPARK-3619
    URL: https://issues.apache.org/jira/browse/SPARK-3619
    Project: Spark
    Issue Type: Improvement
    Components: Mesos
    Reporter: Matei Zaharia
    Assignee: Timothy Chen

The Mesos 0.21 release has a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.
[jira] [Resolved] (SPARK-4929) Yarn Client mode can not support the HA after the exitcode change
[ https://issues.apache.org/jira/browse/SPARK-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-4929.
----------------------------------
    Resolution: Fixed
    Fix Version/s: 1.2.1
                   1.3.0

Yarn Client mode can not support the HA after the exitcode change
------------------------------------------------------------------

    Key: SPARK-4929
    URL: https://issues.apache.org/jira/browse/SPARK-4929
    Project: Spark
    Issue Type: Bug
    Components: YARN
    Affects Versions: 1.2.0
    Reporter: SaintBacchus
    Fix For: 1.3.0, 1.2.1

Currently, yarn-client exits directly when an HA change happens, no matter how many times the AM should retry.
[jira] [Updated] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Prettenhofer updated SPARK-5133:
--------------------------------------
    Description:
Add feature importance to the decision tree model and tree ensemble models. If people are interested in this feature, I could implement it given a mentor (API decisions, etc). A description of the feature follows:

Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests.

All the information necessary to compute relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) number of samples?).

[1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation

  was:
Add feature importance to the decision tree model and tree ensemble models. If people are interested in this feature, I could implement it given a mentor (API decisions, etc). A description of the feature follows:

Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection.

All the information necessary to compute relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) number of samples?).

Feature Importance for Decision Tree (Ensembles)
------------------------------------------------

    Key: SPARK-5133
    URL: https://issues.apache.org/jira/browse/SPARK-5133
    Project: Spark
    Issue Type: New Feature
    Components: ML, MLlib
    Reporter: Peter Prettenhofer
    Priority: Minor

Add feature importance to the decision tree model and tree ensemble models. If people are interested in this feature, I could implement it given a mentor (API decisions, etc). A description of the feature follows:

Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests.

All the information necessary to compute relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) number of samples?).

[1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
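A sketch of the impurity-decrease importance described above, for a single tree, assuming a simplified node type (field names hypothetical; MLlib's Node class differs):

{code}
case class TreeNode(featureIndex: Int,
                    impurityGain: Double,
                    weightedSamples: Double,
                    children: Seq[TreeNode])

// Importance of feature f = sum over internal nodes splitting on f of
// (weighted sample count) * (impurity gain), normalized to sum to one.
def featureImportances(root: TreeNode, numFeatures: Int): Array[Double] = {
  val imp = Array.fill(numFeatures)(0.0)
  def visit(n: TreeNode): Unit = if (n.children.nonEmpty) {
    imp(n.featureIndex) += n.weightedSamples * n.impurityGain
    n.children.foreach(visit)
  }
  visit(root)
  val total = imp.sum
  if (total > 0) imp.map(_ / total) else imp
}
{code}

For an ensemble, the per-tree scores would simply be averaged before normalization.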
[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types
[ https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267535#comment-14267535 ]

Kai Sasaki commented on SPARK-4284:
-----------------------------------

[~srowen] That's very helpful advice. Thank you!

BinaryClassificationMetrics precision-recall method names should correspond to return types
--------------------------------------------------------------------------------------------

    Key: SPARK-4284
    URL: https://issues.apache.org/jira/browse/SPARK-4284
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Affects Versions: 1.2.0
    Reporter: Joseph K. Bradley
    Priority: Minor

BinaryClassificationMetrics has several methods which work with (recall, precision) pairs, but the method names all use the wrong order (pr). This order should be fixed.
[jira] [Updated] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Prettenhofer updated SPARK-5133:
--------------------------------------
    Summary: Feature Importance for Decision Tree (Ensembles)  (was: Feature Importance for Tree (Ensembles))

Feature Importance for Decision Tree (Ensembles)
------------------------------------------------

    Key: SPARK-5133
    URL: https://issues.apache.org/jira/browse/SPARK-5133
    Project: Spark
    Issue Type: New Feature
    Components: ML, MLlib
    Reporter: Peter Prettenhofer
    Priority: Minor

Add feature importance to the decision tree model and tree ensemble models. If people are interested in this feature, I could implement it given a mentor (API decisions, etc). A description of the feature follows:

Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection.

All the information necessary to compute relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) number of samples?).
[jira] [Resolved] (SPARK-2165) spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext
[ https://issues.apache.org/jira/browse/SPARK-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-2165.
----------------------------------
    Resolution: Fixed
    Fix Version/s: 1.3.0
    Target Version/s: 1.3.0

spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext
-------------------------------------------------------------------------------------------

    Key: SPARK-2165
    URL: https://issues.apache.org/jira/browse/SPARK-2165
    Project: Spark
    Issue Type: Improvement
    Components: YARN
    Affects Versions: 1.0.0
    Reporter: Thomas Graves
    Fix For: 1.3.0

Hadoop 2.x adds support for allowing an application to specify its maximum number of application attempts. We should support this by setting it in the ApplicationSubmissionContext.
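The Hadoop 2.x hook in question, in isolation (context construction elided; the wiring into Spark's YARN client is what the fix adds, driven by a configuration value such as spark.yarn.maxAppAttempts):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

// Only set the maximum attempts when the user configured one; otherwise
// leave the YARN cluster default in place.
def applyMaxAttempts(appContext: ApplicationSubmissionContext,
                     maxAttempts: Option[Int]): Unit =
  maxAttempts.foreach(appContext.setMaxAppAttempts)
{code}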
[jira] [Assigned] (SPARK-2165) spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext
[ https://issues.apache.org/jira/browse/SPARK-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned SPARK-2165:
------------------------------------
    Assignee: Thomas Graves

spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext
-------------------------------------------------------------------------------------------

    Key: SPARK-2165
    URL: https://issues.apache.org/jira/browse/SPARK-2165
    Project: Spark
    Issue Type: Improvement
    Components: YARN
    Affects Versions: 1.0.0
    Reporter: Thomas Graves
    Assignee: Thomas Graves
    Fix For: 1.3.0

Hadoop 2.x adds support for allowing an application to specify its maximum number of application attempts. We should support this by setting it in the ApplicationSubmissionContext.
[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268894#comment-14268894 ]

Cheng Lian commented on SPARK-4908:
-----------------------------------

It was considered a quick fix because we hadn't figured out the root cause when the PR was submitted. But it has now turned out to be a valid fix :)

Spark SQL built for Hive 13 fails under concurrent metadata queries
--------------------------------------------------------------------

    Key: SPARK-4908
    URL: https://issues.apache.org/jira/browse/SPARK-4908
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Reporter: David Ross
    Assignee: Cheng Lian
    Priority: Blocker
    Fix For: 1.3.0, 1.2.1

We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6

We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}}

In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.

Here is some example code:

{code}
object main extends App {
  import java.sql._
  import scala.concurrent._
  import scala.concurrent.duration._
  import scala.concurrent.ExecutionContext.Implicits.global

  Class.forName("org.apache.hive.jdbc.HiveDriver")

  val host = "localhost" // update this
  val url = s"jdbc:hive2://${host}:10511/some_db" // update this

  val future = Future.traverse(1 to 3) { i =>
    Future {
      println("Starting: " + i)
      try {
        val conn = DriverManager.getConnection(url)
      } catch {
        case e: Throwable =>
          e.printStackTrace()
          println("Failed: " + i)
      }
      println("Finishing: " + i)
    }
  }

  Await.result(future, 2.minutes)
  println("done!")
}
{code}

Here is the output:

{code}
Starting: 1
Starting: 3
Starting: 2
java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
	at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
	at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:195)
	at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
	at java.sql.DriverManager.getConnection(DriverManager.java:664)
	at java.sql.DriverManager.getConnection(DriverManager.java:270)
	at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
	at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
	at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Failed: 3
Finishing: 3
java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
	at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
	at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:195)
	at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
	at java.sql.DriverManager.getConnection(DriverManager.java:664)
	at java.sql.DriverManager.getConnection(DriverManager.java:270)
	at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
	at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
	at
{code}
[jira] [Commented] (SPARK-5117) Hive Generic UDFs don't cast correctly
[ https://issues.apache.org/jira/browse/SPARK-5117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268861#comment-14268861 ] Cheng Hao commented on SPARK-5117: -- Definitely we can do that then. Hive Generic UDFs don't cast correctly -- Key: SPARK-5117 URL: https://issues.apache.org/jira/browse/SPARK-5117 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Cheng Hao Priority: Blocker Here's a test case that is failing in master: {code} createQueryTest("generic udf casting", "SELECT LPAD(test, 5, 0) FROM src LIMIT 1") {code} This appears to be a regression from Spark 1.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268886#comment-14268886 ] Saisai Shao edited comment on SPARK-4960 at 1/8/15 6:45 AM: Hi all, I just updated the doc according to TD's comment; would you mind taking a look at it? Thanks a lot. Currently it's just a simple solution: since we don't need to take care of data type conversion, the tricky corner case is removed. This implementation is quite simple, with only one problem, as previously mentioned: how to support the store(ByteBuffer) API. Also, this design should be aligned with SPARK-5042. Here is the link: https://docs.google.com/document/d/1-JfFkFlc5APstIcvCeqqv2t5np30ft5qaTIiNCGZfdI/edit?usp=sharing was (Author: jerryshao): Hi all, I just update the doc according to TD's comment, would you mind taking a look at this, thanks a lot. Here is the link: https://docs.google.com/document/d/1-JfFkFlc5APstIcvCeqqv2t5np30ft5qaTIiNCGZfdI/edit?usp=sharing Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
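To make the proposal concrete, an interceptor here is just a function from a received record to an optional transformed record, applied before {{store()}}. The sketch below is purely illustrative: SPARK-4960 is a design discussion, so the trait and class names are invented, not an existing Spark API.
{code}
// Hypothetical interceptor shape; not an actual Spark Streaming API.
trait ReceiverInterceptor[T] extends Serializable {
  // Return Some(transformed) to store the record, or None to drop it.
  def intercept(record: T): Option[T]
}

// Example: trim whitespace and drop blank records before they are stored.
class TrimInterceptor extends ReceiverInterceptor[String] {
  override def intercept(record: String): Option[String] = {
    val trimmed = record.trim
    if (trimmed.isEmpty) None else Some(trimmed)
  }
}
{code}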
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268830#comment-14268830 ] Apache Spark commented on SPARK-4943: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/3941 Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken; it was working in Spark 1.1.0. Basically, we use a table name containing a dot to include the database name: {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at
org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683) [info] at org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644) [info] at
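For reference, the shape of fix a dotted-name parsing bug typically needs is a grammar rule that accepts an optional database qualifier before the table name. The fragment below is a hypothetical illustration in Scala's parser combinators; it is not the actual change in the pull request above.
{code}
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

object DottedNameParser extends StandardTokenParsers {
  lexical.delimiters += "."

  // Accept `table` or `db.table`, yielding the name parts in order.
  lazy val tableIdentifier: Parser[Seq[String]] =
    opt(ident <~ ".") ~ ident ^^ {
      case Some(db) ~ table => Seq(db, table)
      case None ~ table     => Seq(table)
    }
}
{code}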
[jira] [Commented] (SPARK-5042) Updated Receiver API to make it easier to write reliable receivers that ack source
[ https://issues.apache.org/jira/browse/SPARK-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268846#comment-14268846 ] Saisai Shao commented on SPARK-5042: Hey TD, what is your schedule on this? Updated Receiver API to make it easier to write reliable receivers that ack source -- Key: SPARK-5042 URL: https://issues.apache.org/jira/browse/SPARK-5042 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Receivers in Spark Streaming receive data from different sources and push it into Spark’s block manager. However, the received records must be chunked into blocks before being pushed into the BlockManager. Related to this, the Receiver API provides two kinds of store() - 1. store(single record) - The receiver implementation submits one record at a time and the system takes care of dividing the records into right-sized blocks and limiting the ingestion rate. In the future, it should also be able to do automatic rate / flow control. However, there is no feedback to the receiver on when blocks are formed, and thus no way to ensure reliability guarantees. Overall, receivers using this are easy to implement. 2. store(multiple records) - The receiver submits multiple records, and those form the blocks that are stored in the block manager. The receiver implementation has full control over block generation, which allows the receiver to acknowledge the source when blocks have been reliably received by the BlockManager and/or WriteAheadLog. However, the implementation of such receivers will not get automatic block sizing and rate control; the developer has to take care of that. All this adds to the complexity of the receiver implementation. So, to summarize, (2) has the advantage of full control over block generation, but users have to deal with the complexity of generating blocks of the right size and controlling the rate. So we want to update this API such that it becomes easier for developers to achieve reliable receiving of records without sacrificing automatic block sizing and rate control. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
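The two {{store()}} variants the description contrasts look like this inside a custom receiver. This is a minimal sketch against the existing Receiver API (the class name and data are stand-ins); in a real receiver the work would happen on a separate thread started from {{onStart()}}.
{code}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SketchReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def onStart(): Unit = {
    // (1) Record-at-a-time: the system handles block sizing and rate
    // limiting, but the receiver never learns when a block is formed.
    store("one record")

    // (2) Multiple records: this buffer becomes the stored block, so the
    // receiver can acknowledge its source afterwards, at the cost of
    // choosing block sizes and rates itself.
    val buffer = ArrayBuffer("record-a", "record-b")
    store(buffer)
    // a reliable receiver would ack its source here
  }
  override def onStop(): Unit = {}
}
{code}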
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268886#comment-14268886 ] Saisai Shao commented on SPARK-4960: Hi all, I just updated the doc according to TD's comment; would you mind taking a look at it? Thanks a lot. Here is the link: https://docs.google.com/document/d/1-JfFkFlc5APstIcvCeqqv2t5np30ft5qaTIiNCGZfdI/edit?usp=sharing Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5080) Expose more cluster resource information to user
[ https://issues.apache.org/jira/browse/SPARK-5080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268786#comment-14268786 ] Xuefu Zhang commented on SPARK-5080: cc: [~sandyr] Expose more cluster resource information to user Key: SPARK-5080 URL: https://issues.apache.org/jira/browse/SPARK-5080 Project: Spark Issue Type: Improvement Reporter: Rui Li It'll be useful if users can get detailed cluster resource info, e.g. granted/allocated executors, memory and CPU. Such information is available via the WebUI, but it seems SparkContext doesn't expose such APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1825) Windows Spark fails to work with Linux YARN
[ https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268863#comment-14268863 ] Masayoshi TSUZUKI commented on SPARK-1825: -- It is necessary to use $$() to solve this problem, but as discussed on PR #899, if we use $$() the build for Hadoop < 2.4 will fail. So PR #3943 uses reflection to avoid the build failure across Hadoop versions. Windows clients work fine with a Linux YARN cluster only when we use Hadoop 2.4+; it still doesn't work under Hadoop < 2.4 even after this patch. Windows Spark fails to work with Linux YARN --- Key: SPARK-1825 URL: https://issues.apache.org/jira/browse/SPARK-1825 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Taeyun Kim Attachments: SPARK-1825.patch Windows Spark fails to work with Linux YARN. This is a cross-platform problem. This error occurs when 'yarn-client' mode is used. (yarn-cluster/yarn-standalone mode was not tested.) On the YARN side, Hadoop 2.4.0 resolved the issue as follows: https://issues.apache.org/jira/browse/YARN-1824 But the Spark YARN module does not incorporate the new YARN API yet, so the problem persists for Spark. First, the following source files should be changed: - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala The change is as follows: - Replace .$() with .$$() - Replace File.pathSeparator for Environment.CLASSPATH.name with ApplicationConstants.CLASS_PATH_SEPARATOR (import org.apache.hadoop.yarn.api.ApplicationConstants is required for this) Unless the above are applied, launch_container.sh will contain invalid shell script statements (since they will contain Windows-specific separators), and the job will fail. The following symptoms should also be fixed (I could not find the relevant source code): - The SPARK_HOME environment variable is copied straight to launch_container.sh. It should be changed to the path format of the server OS, or, better, a separate environment variable or a configuration variable should be created. - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after the above change is applied; maybe I missed a few lines. I'm not sure whether this is all, since I'm new to both Spark and YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
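A sketch of the cross-platform change the report describes, against the Hadoop 2.4+ API: {{$()}} expands an environment variable with the submitting client's OS conventions, while {{$$()}} defers expansion to the node that runs the container, and {{CLASS_PATH_SEPARATOR}} plays the same role for {{File.pathSeparator}}. This illustrates the idea only; the actual PR #3943 reaches these through reflection so the build still works against pre-2.4 Hadoop.
{code}
import java.io.File
import org.apache.hadoop.yarn.api.ApplicationConstants
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Before: expanded on the client, so a Windows client emits %VAR% and ';'
// into launch_container.sh on a Linux node.
val brokenEntry = Environment.PWD.$() + File.separator + "*"
val brokenSeparator = File.pathSeparator

// After (Hadoop 2.4+): expansion and the separator are resolved by the OS
// of the node that actually runs the container.
val portableEntry = Environment.PWD.$$() + "/*"
val portableSeparator = ApplicationConstants.CLASS_PATH_SEPARATOR
{code}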
[jira] [Updated] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
[ https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5116: - Assignee: Shuo Xiang Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Priority: Minor Fix For: 1.3.0 Add extractors for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we needed to use:
{code:title=A.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractors it is:
{code:title=B.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
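For readers unfamiliar with the mechanics: such extractors are {{unapply}} methods on companion objects. The sketch below shows one plausible implementation; the real change adds these to the vector classes' own companion objects, while the standalone object names here are just for illustration.
{code}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

// Illustrative extractors; the real ones live on the companion objects.
object DenseVectorExtractor {
  def unapply(dv: DenseVector): Option[Array[Double]] = Some(dv.values)
}

object SparseVectorExtractor {
  def unapply(sv: SparseVector): Option[(Int, Array[Int], Array[Double])] =
    Some((sv.size, sv.indices, sv.values))
}
{code}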
[jira] [Created] (SPARK-5141) CaseInsensitiveMap throws java.io.NotSerializableException
Gankun Luo created SPARK-5141: - Summary: CaseInsensitiveMap throws java.io.NotSerializableException Key: SPARK-5141 URL: https://issues.apache.org/jira/browse/SPARK-5141 Project: Spark Issue Type: Bug Components: SQL Reporter: Gankun Luo Priority: Minor The following code throws a java.io.NotSerializableException. [https://github.com/luogankun/spark-jdbc|https://github.com/luogankun/spark-jdbc]
{code}
CREATE TEMPORARY TABLE jdbc_table
USING com.luogankun.spark.jdbc
OPTIONS (
  sparksql_table_schema '(TBL_ID int, TBL_NAME string, TBL_TYPE string)',
  jdbc_table_name 'TBLS',
  jdbc_table_schema '(TBL_ID, TBL_NAME, TBL_TYPE)',
  url 'jdbc:mysql://hadoop000:3306/hive',
  user 'root',
  password 'root'
);

select TBL_ID, TBL_ID, TBL_TYPE from jdbc_table;
{code}
I get the following stack trace: {code} org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1448) at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:616) at org.apache.spark.sql.execution.Project.execute(basicOperators.scala:43) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:81) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:386) at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:365) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: org.apache.spark.sql.sources.CaseInsensitiveMap at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) .. at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
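The stack trace points at the usual fix for this class of error: the options map is captured in a task closure, so its class must be java-serializable. A minimal sketch of a case-insensitive map that survives closure serialization follows; it is an illustration of the technique, not the actual Spark patch.
{code}
// Sketch: a Map wrapper that lower-cases keys and is Serializable, so it
// can safely be captured in task closures. Not the actual Spark fix.
class SerializableCaseInsensitiveMap(map: Map[String, String])
  extends Map[String, String] with Serializable {

  private val baseMap = map.map { case (k, v) => (k.toLowerCase, v) }

  override def get(key: String): Option[String] = baseMap.get(key.toLowerCase)
  override def iterator: Iterator[(String, String)] = baseMap.iterator
  override def +[B1 >: String](kv: (String, B1)): Map[String, B1] = baseMap + kv
  override def -(key: String): Map[String, String] =
    new SerializableCaseInsensitiveMap(baseMap - key.toLowerCase)
}
{code}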
[jira] [Commented] (SPARK-4922) Support dynamic allocation for coarse-grained Mesos
[ https://issues.apache.org/jira/browse/SPARK-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268826#comment-14268826 ] Jongyoul Lee commented on SPARK-4922: - [~andrewor14] Hi, I have a basic question about your idea. I'm using fine-grained Mesos for running my jobs; that mode already allocates resources dynamically when the task scheduler wants them. What do you think the difference is between your idea and fine-grained mode? Unlike coarse-grained mode, fine-grained mode adjusts the # of cores per executor and makes it possible to run two or more executors on each slave. I think if we could set the # of cores for each Mesos executor in a configuration in fine-grained mode - now, only one core is fixed per executor - we could satisfy the dynamic allocation idea. I read SPARK-4751, and I'll handle this issue by using fine-grained mode. And how do you think resources should be adjusted: a new API for increasing or decreasing cores, or just {{spark.cores.max}}? Support dynamic allocation for coarse-grained Mesos --- Key: SPARK-4922 URL: https://issues.apache.org/jira/browse/SPARK-4922 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.2.0 Reporter: Andrew Or Priority: Critical This brings SPARK-3174, which provided dynamic allocation of cluster resources to Spark on YARN applications, to Mesos coarse-grained mode. Note that the translation is not as trivial as adding a code path that exposes the request and kill mechanisms as we did for YARN in SPARK-3822. This is because Mesos coarse-grained mode schedules on the notion of setting the number of cores allowed for an application (as in standalone mode) instead of the number of executors (as in YARN mode). For more detail, please see SPARK-4751. If you intend to work on this, please provide a detailed design doc! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4922) Support dynamic allocation for coarse-grained Mesos
[ https://issues.apache.org/jira/browse/SPARK-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268826#comment-14268826 ] Jongyoul Lee edited comment on SPARK-4922 at 1/8/15 5:35 AM: - [~andrewor14] Hi, I have a basic question about your idea. I'm using fine-grained Mesos for running my jobs; that mode already allocates resources dynamically when the task scheduler wants them. What do you think the difference is between your idea and fine-grained mode? Unlike coarse-grained mode, fine-grained mode adjusts the # of cores per executor and makes it possible to run two or more executors on each slave. I think if we could set the # of cores for each Mesos executor in a configuration in fine-grained mode - now, only one core is fixed per executor - we could satisfy the dynamic allocation idea. I read SPARK-4751, and I can handle this issue by using fine-grained mode. And how do you think resources should be adjusted: a new API for increasing or decreasing cores, or just {{spark.cores.max}}? was (Author: jongyoul): [~andrewor14] Hi, I have a basic question about your idea. I'm using fine-grained mesos for running my jobs. that mode already allocate resources dynamically when task scheduler wants. What you think the difference is between your idea and fine-grained mode? Unlike coarse-grained mode, fine-grained mode adjusts # of cores for a executor and enables to make two more executor on each slave. I think if we set # of cores for each mesos executor in a configuration on fine-grained mode - now, only one core fixed for each executor -, we can satisfy dynamic allocation idea. and I read SPARK-4751, and I'll handle this issue via using fine-grain mode. And how do you think you adjust resources? new API for increasing or decreasing cores or just use {{spark.cores.max}}? Support dynamic allocation for coarse-grained Mesos --- Key: SPARK-4922 URL: https://issues.apache.org/jira/browse/SPARK-4922 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.2.0 Reporter: Andrew Or Priority: Critical This brings SPARK-3174, which provided dynamic allocation of cluster resources to Spark on YARN applications, to Mesos coarse-grained mode. Note that the translation is not as trivial as adding a code path that exposes the request and kill mechanisms as we did for YARN in SPARK-3822. This is because Mesos coarse-grained mode schedules on the notion of setting the number of cores allowed for an application (as in standalone mode) instead of the number of executors (as in YARN mode). For more detail, please see SPARK-4751. If you intend to work on this, please provide a detailed design doc! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1825) Windows Spark fails to work with Linux YARN
[ https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268858#comment-14268858 ] Apache Spark commented on SPARK-1825: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/3943 Windows Spark fails to work with Linux YARN --- Key: SPARK-1825 URL: https://issues.apache.org/jira/browse/SPARK-1825 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Taeyun Kim Attachments: SPARK-1825.patch Windows Spark fails to work with Linux YARN. This is a cross-platform problem. This error occurs when 'yarn-client' mode is used. (yarn-cluster/yarn-standalone mode was not tested.) On the YARN side, Hadoop 2.4.0 resolved the issue as follows: https://issues.apache.org/jira/browse/YARN-1824 But the Spark YARN module does not incorporate the new YARN API yet, so the problem persists for Spark. First, the following source files should be changed: - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala The change is as follows: - Replace .$() with .$$() - Replace File.pathSeparator for Environment.CLASSPATH.name with ApplicationConstants.CLASS_PATH_SEPARATOR (import org.apache.hadoop.yarn.api.ApplicationConstants is required for this) Unless the above are applied, launch_container.sh will contain invalid shell script statements (since they will contain Windows-specific separators), and the job will fail. The following symptoms should also be fixed (I could not find the relevant source code): - The SPARK_HOME environment variable is copied straight to launch_container.sh. It should be changed to the path format of the server OS, or, better, a separate environment variable or a configuration variable should be created. - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after the above change is applied; maybe I missed a few lines. I'm not sure whether this is all, since I'm new to both Spark and YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268892#comment-14268892 ] David Ross commented on SPARK-4908: --- I've verified that this is fixed on trunk. Since his commit message says it is just a quick fix, I will let [~marmbrus] decide whether or not to keep this JIRA open. Spark SQL built for Hive 13 fails under concurrent metadata queries --- Key: SPARK-4908 URL: https://issues.apache.org/jira/browse/SPARK-4908 Project: Spark Issue Type: Bug Components: SQL Reporter: David Ross Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.0, 1.2.1 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6 We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}} In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL, all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue. Here is some example code:
{code}
object main extends App {
  import java.sql._
  import scala.concurrent._
  import scala.concurrent.duration._
  import scala.concurrent.ExecutionContext.Implicits.global

  Class.forName("org.apache.hive.jdbc.HiveDriver")

  val host = "localhost" // update this
  val url = s"jdbc:hive2://${host}:10511/some_db" // update this

  val future = Future.traverse(1 to 3) { i =>
    Future {
      println("Starting: " + i)
      try {
        val conn = DriverManager.getConnection(url)
      } catch {
        case e: Throwable =>
          e.printStackTrace()
          println("Failed: " + i)
      }
      println("Finishing: " + i)
    }
  }

  Await.result(future, 2.minutes)
  println("done!")
}
{code} Here is the output: {code} Starting: 1 Starting: 3 Starting: 2 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Failed: 3 Finishing: 3 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at
[jira] [Resolved] (SPARK-5126) No error log for a typo master url
[ https://issues.apache.org/jira/browse/SPARK-5126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5126. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Shixiong Zhu No error log for a typo master url --- Key: SPARK-5126 URL: https://issues.apache.org/jira/browse/SPARK-5126 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0 If a mistyped master URL is passed to Worker, it only prints the following logs:
{noformat}
15/01/07 14:30:02 INFO worker.Worker: Connecting to master spark://master url:7077...
15/01/07 14:30:02 INFO remote.RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-282880172] to Actor[akka://sparkWorker/deadLetters] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
{noformat}
It's not obvious that the URL is wrong. And {{akka://sparkWorker/deadLetters}} is also confusing. The `deadLetters` actor appears because `actorSelection` returns `deadLetters` for an invalid path:
{code}
def actorSelection(path: String): ActorSelection = path match {
  case RelativeActorPath(elems) ⇒
    if (elems.isEmpty) ActorSelection(provider.deadLetters, "")
    else if (elems.head.isEmpty) ActorSelection(provider.rootGuardian, elems.tail)
    else ActorSelection(lookupRoot, elems)
  case ActorPathExtractor(address, elems) ⇒
    ActorSelection(provider.rootGuardianAt(address), elems)
  case _ ⇒
    ActorSelection(provider.deadLetters, "")
}
{code}
I think logging an error about the invalid URL is better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
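The resolution follows the suggestion at the end: validate the URL shape eagerly and fail with a clear error instead of letting Akka route the registration to {{deadLetters}}. A sketch of that kind of check follows, with an illustrative regex and message rather than the exact merged code.
{code}
// Reject a malformed master URL up front with an explicit error.
val sparkUrlRegex = """spark://([^:]+):(\d+)""".r

def toAkkaUrl(sparkUrl: String): String = sparkUrl match {
  case sparkUrlRegex(host, port) =>
    s"akka.tcp://sparkMaster@$host:$port/user/Master"
  case _ =>
    throw new org.apache.spark.SparkException(s"Invalid master URL: $sparkUrl")
}
{code}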
[jira] [Resolved] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
[ https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5116. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3919 [https://github.com/apache/spark/pull/3919] Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Fix For: 1.3.0 Add extractors for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we needed to use:
{code:title=A.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractors it is:
{code:title=B.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268440#comment-14268440 ] Gerard Maas commented on SPARK-4940: Hi Tim, We are indeed using coarse-grained mode. I'm not sure fine-grained mode makes much sense for Spark Streaming. Here are a few examples of resource allocation, taken from several runs of the same job with identical configuration:
Job config:
spark.cores.max = 18
spark.mesos.coarse = true
spark.executor.memory = 4g
The job logic will start 6 Kafka receivers.
#1 --
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 4 | 4GB | 3 | 2 |
| 2 | 6 | 4GB | 2 | 1 |
| 3 | 7 | 4GB | 3 | 2 |
| 4 | 1 | 4GB | 1 | 1 |
Total mem: 16 GB Total CPUs: 18
Observations: Node #4, with only 1 CPU and 1 Kafka receiver, does not have the capacity to process the received data, so all received data needs to be sent to other nodes for non-local processing (not sure how replication helps or not in this case; the blocks of data are processed on other nodes). Also, the nodes with 2 streaming receivers have a higher load than the node with 1 receiver.
#2 --
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 7 | 4GB | 7 | 4 |
| 2 | 2 | 4GB | 2 | 2 |
Total mem: 8 GB Total CPUs: 9
Observations: This is the worst configuration of the day: totally unbalanced (4 vs 2 receivers), and for some reason the job didn't get all the resources assigned in the configuration. The job processing time is also slower, as there are fewer cores to handle the data and less overall memory.
#3 --
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 3 | 4GB | 3 | 2 |
| 2 | 8 | 4GB | 2 | 2 |
| 3 | 7 | 4GB | 3 | 2 |
Total mem: 12GB Total CPU: 18
Observations: This is a fairly good configuration, with more evenly distributed receivers and CPUs, although there's one considerably smaller node in terms of CPU assignment.
We can observe that the current resource assignment policy results in suboptimal and, in particular, random assignments that have a strong impact on job execution and performance. Given that CPU allocation is by executor (and not by job), the total memory for the job is also variable, as the job can get 2 to 4 executors assigned. It's also weird and unexpected to observe less-than-max CPU allocations. Here's a performance chart of the same job across two configurations, one with 3 nodes (left) and one with 2 (right): !https://lh3.googleusercontent.com/Z1C71OKoQzGA13uNJ8Yvf_xz_glRUqU_IGGvLsfkPvUPK2lahrEatweiWl-PDDfysjXtbs1Sl_k=w1682-h689! (chart line: processing time in ms; load is fairly constant) Support more evenly distributing cores for Mesos mode - Key: SPARK-4940 URL: https://issues.apache.org/jira/browse/SPARK-4940 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently, in coarse-grained mode, the Spark scheduler simply takes all the resources it can on each node, which can cause uneven distribution based on the resources available on each slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268440#comment-14268440 ] Gerard Maas edited comment on SPARK-4940 at 1/7/15 10:54 PM: - Hi Tim, We are indeed using coarse-grained mode. I'm not sure fine-grained mode makes much sense for Spark Streaming. Here are a few examples of resource allocation, taken from several runs of the same job with identical configuration: Job config: spark.cores.max = 18 spark.mesos.coarse = true spark.executor.memory = 4g The job logic will start 6 Kafka receivers. #1 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 4 | 4GB | 3 | 2 | | 2 | 6 | 4GB | 2 | 1 | | 3 | 7 | 4GB | 3 | 2 | | 4 | 1 | 4GB | 1 | 1 | Total mem: 16 GB Total CPUs: 18 Observations: Node #4, with only 1 CPU and 1 Kafka receiver, does not have the capacity to process the received data, so all received data needs to be sent to other nodes for non-local processing (not sure how replication helps or not in this case; the blocks of data are processed on other nodes). Also, the nodes with 2 streaming receivers have a higher load than the node with 1 receiver. #2 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 7 | 4GB | 7 | 4 | | 2 | 2 | 4GB | 2 | 2 | Total mem: 8 GB Total CPUs: 9 Observations: This is the worst configuration of the day: totally unbalanced (4 vs 2 receivers), and for some reason the job didn't get all the resources assigned in the configuration. The job processing time is also slower, as there are fewer cores to handle the data and less overall memory. #3 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 3 | 4GB | 3 | 2 | | 2 | 8 | 4GB | 2 | 2 | | 3 | 7 | 4GB | 3 | 2 | Total mem: 12GB Total CPU: 18 Observations: This is a fairly good configuration, with more evenly distributed receivers and CPUs, although there's one considerably smaller node in terms of CPU assignment. We can observe that the current resource assignment policy results in suboptimal and, in particular, random assignments that have a strong impact on job execution and performance. Given that CPU allocation is by executor (and not by job), the total memory for the job is also variable, as the job can get 2 to 4 executors assigned. It's also weird and unexpected to observe less-than-max CPU allocations. Here's a performance chart of the same job jumping from one config to another (*): 3 nodes (left of the spike) and 2 nodes (right): !https://lh3.googleusercontent.com/Z1C71OKoQzGA13uNJ8Yvf_xz_glRUqU_IGGvLsfkPvUPK2lahrEatweiWl-PDDfysjXtbs1Sl_k=w1682-h689! (chart line: processing time in ms; load is fairly constant; higher is worse. Note how the job performance is degraded.) (*) For some reason we haven't found yet, Mesos often kills the job; when Marathon relaunches it, it results in a different resource assignment. was (Author: gmaas): Hi Tim, We are indeed using Coarse Grain mode. I'm not sure fine-grained mode makes much sense for Spark Streaming. Here're few examples of resource allocation. They are taken from several runs of the same job with identical configuration: Job config: spark.cores.max = 18 spark.mesos.coarse = true spark.executor.memory = 4g The job logic will start 6 Kafka receivers. #1 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 4 | 4GB | 3 | 2 | | 2 | 6 | 4GB | 2 | 1 | | 3 | 7 | 4GB | 3 | 2 | | 4 | 1 | 4GB | 1 | 1 | Total mem: 16 GB Total CPUs: 18 Observations: Node#4 with only 1 CPU and 1 Kafka receiver does not have capacity to process the received data, so all data received needs to be sent to other node for non-local processing (not sure how replication helps or not in this case, the blocks of data are processed on other nodes). Also the nodes with 2 streaming receivers have higher load that the node with 1 receiver. #2 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 7 | 4GB | 7 | 4 | | 2 | 2 | 4GB | 2 | 2 | Total mem: 8 GB Total CPUs: 9 Observations: This is the worst configuration of the day. Totally unbalanced (4 vs 2 receivers) and for some reason, the job didn't get all the resources assigned in the configuration. The job processing time is also slower as there're less cores to handle the data and less overall memory. #3 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 3 | 4GB | 3 | 2 | | 2 | 8 | 4GB | 2 | 2 | | 3 | 7 | 4GB | 3 | 2 | Total mem: 12GB Total CPU: 18 Observations: This is a fairly good configuration with a more evenly distributed receivers and CPUs although there's one considerable smaller node in terms of CPU assignment. We can observe that the current resource assignment policy results in less than ideal
[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268440#comment-14268440 ] Gerard Maas edited comment on SPARK-4940 at 1/7/15 10:53 PM: - Hi Tim, We are indeed using coarse-grained mode. I'm not sure fine-grained mode makes much sense for Spark Streaming. Here are a few examples of resource allocation, taken from several runs of the same job with identical configuration: Job config: spark.cores.max = 18 spark.mesos.coarse = true spark.executor.memory = 4g The job logic will start 6 Kafka receivers. #1 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 4 | 4GB | 3 | 2 | | 2 | 6 | 4GB | 2 | 1 | | 3 | 7 | 4GB | 3 | 2 | | 4 | 1 | 4GB | 1 | 1 | Total mem: 16 GB Total CPUs: 18 Observations: Node #4, with only 1 CPU and 1 Kafka receiver, does not have the capacity to process the received data, so all received data needs to be sent to other nodes for non-local processing (not sure how replication helps or not in this case; the blocks of data are processed on other nodes). Also, the nodes with 2 streaming receivers have a higher load than the node with 1 receiver. #2 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 7 | 4GB | 7 | 4 | | 2 | 2 | 4GB | 2 | 2 | Total mem: 8 GB Total CPUs: 9 Observations: This is the worst configuration of the day: totally unbalanced (4 vs 2 receivers), and for some reason the job didn't get all the resources assigned in the configuration. The job processing time is also slower, as there are fewer cores to handle the data and less overall memory. #3 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 3 | 4GB | 3 | 2 | | 2 | 8 | 4GB | 2 | 2 | | 3 | 7 | 4GB | 3 | 2 | Total mem: 12GB Total CPU: 18 Observations: This is a fairly good configuration, with more evenly distributed receivers and CPUs, although there's one considerably smaller node in terms of CPU assignment. We can observe that the current resource assignment policy results in suboptimal and, in particular, random assignments that have a strong impact on job execution and performance. Given that CPU allocation is by executor (and not by job), the total memory for the job is also variable, as the job can get 2 to 4 executors assigned. It's also weird and unexpected to observe less-than-max CPU allocations. Here's a performance chart of the same job jumping from one config to another (*), one with 3 nodes (left) and one with 2 (right): !https://lh3.googleusercontent.com/Z1C71OKoQzGA13uNJ8Yvf_xz_glRUqU_IGGvLsfkPvUPK2lahrEatweiWl-PDDfysjXtbs1Sl_k=w1682-h689! (chart line: processing time in ms; load is fairly constant) (*) For some reason we haven't found yet, Mesos often kills the job; when Marathon relaunches it, it results in a different resource assignment. was (Author: gmaas): Hi Tim, We are indeed using Coarse Grain mode. I'm not sure fine-grained mode makes much sense for Spark Streaming. Here're few examples of resource allocation. They are taken from several runs of the same job with identical configuration: Job config: spark.cores.max = 18 spark.mesos.coarse = true spark.executor.memory = 4g The job logic will start 6 Kafka receivers. #1 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 4 | 4GB | 3 | 2 | | 2 | 6 | 4GB | 2 | 1 | | 3 | 7 | 4GB | 3 | 2 | | 4 | 1 | 4GB | 1 | 1 | Total mem: 16 GB Total CPUs: 18 Observations: Node#4 with only 1 CPU and 1 Kafka receiver does not have capacity to process the received data, so all data received needs to be sent to other node for non-local processing (not sure how replication helps or not in this case, the blocks of data are processed on other nodes). Also the nodes with 2 streaming receivers have higher load that the node with 1 receiver. #2 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 7 | 4GB | 7 | 4 | | 2 | 2 | 4GB | 2 | 2 | Total mem: 8 GB Total CPUs: 9 Observations: This is the worst configuration of the day. Totally unbalanced (4 vs 2 receivers) and for some reason, the job didn't get all the resources assigned in the configuration. The job processing time is also slower as there're less cores to handle the data and less overall memory. #3 -- || Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers || | 1 | 3 | 4GB | 3 | 2 | | 2 | 8 | 4GB | 2 | 2 | | 3 | 7 | 4GB | 3 | 2 | Total mem: 12GB Total CPU: 18 Observations: This is a fairly good configuration with a more evenly distributed receivers and CPUs although there's one considerable smaller node in terms of CPU assignment. We can observe that the current resource assignment policy results in less than ideal and in particular random assignments that have a strong impact on
[jira] [Commented] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project
[ https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268458#comment-14268458 ] Sean Owen commented on SPARK-5136: -- [~pwendell] Before I suggest a change to the IntelliJ build notes in {{docs/}}, which are indeed a little out of date, I remember that you created https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA for much the same purpose. That's better but I think it's also a little out of date (e.g. the YARN structure has changed). Best to have this info in just one place. Should the docs link to the wiki, and should I suggest a few changes to the wiki? Or should we try to put all of this info into docs and remove the wiki? Improve documentation around setting up Spark IntelliJ project -- Key: SPARK-5136 URL: https://issues.apache.org/jira/browse/SPARK-5136 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [The documentation about setting up a Spark project in Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea] is somewhat short/cryptic and targets [an IntelliJ version released in 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is probably warranted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268475#comment-14268475 ] Nicholas Chammas commented on SPARK-2541: - By the way, should this issue be linked to [SPARK-3438]? Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves Attachments: SPARK-2541-partial.patch In Spark 0.9.x you could access secure HDFS from standalone deploy mode; that doesn't work in 1.X anymore. It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if currentUser == user. Not sure how it behaves when the daemons run as a superuser but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268480#comment-14268480 ] Thomas Graves commented on SPARK-2541: -- Yeah, kind of. I guess 3438 is to officially add support for it. It used to work, which is why I filed this jira, but perhaps it was never really officially supported, at least not in a documented way, so that one sounds like it should be more comprehensive. Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves Attachments: SPARK-2541-partial.patch In Spark 0.9.x you could access secure HDFS from standalone deploy mode; that doesn't work in 1.X anymore. It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if currentUser == user. Not sure how it behaves when the daemons run as a superuser but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
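The 0.9.x behaviour being described - skip impersonation when the effective user already matches, so an existing Kerberos login keeps its credentials - can be sketched as below. This is a reconstruction from the description above, not the current Spark code.
{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Only wrap the body in doAs when the target user differs from the
// current one; otherwise run it directly with the existing credentials.
def runAsSparkUser(user: String)(body: () => Unit): Unit = {
  val current = UserGroupInformation.getCurrentUser
  if (current.getShortUserName == user) {
    body()
  } else {
    UserGroupInformation.createRemoteUser(user).doAs(
      new PrivilegedExceptionAction[Unit] {
        override def run(): Unit = body()
      })
  }
}
{code}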
[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution
[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268529#comment-14268529 ] Davies Liu commented on SPARK-3910: --- [~joshrosen] I think we should backport SPARK-4348 and SPARK-4821 into branch-1.1; that also removes the hack in pyspark/__init__.py.
./python/pyspark/mllib/classification.py doctests fails with module name pollution
-- Key: SPARK-3910 URL: https://issues.apache.org/jira/browse/SPARK-3910 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, unittest2==0.5.1, wsgiref==0.1.2 Reporter: Tomohiko K. Labels: pyspark, testing
In the ./python/run-tests script, we run the doctests in ./pyspark/mllib/classification.py. The output is as follows:
{noformat}
$ ./python/run-tests
...
Running test: pyspark/mllib/classification.py
Traceback (most recent call last):
  File "pyspark/mllib/classification.py", line 20, in <module>
    import numpy
  File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
    from . import add_newdocs
  File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
    from numpy.testing import Tester
  File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
    from .utils import *
  File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
    from tempfile import mkdtemp
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
    from random import Random as _Random
  File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
    from pyspark.rdd import RDD
  File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
    from tempfile import NamedTemporaryFile
ImportError: cannot import name NamedTemporaryFile
        0.07 real         0.04 user         0.02 sys
Had test failures; see logs.
{noformat}
The problem is a cyclic import of the tempfile module. The cause is that the pyspark.mllib.random module lives in the same directory as pyspark.mllib.classification. The classification module imports numpy, and numpy in turn imports tempfile internally. Because the first entry of sys.path is the directory ./python/pyspark/mllib (where the executed file classification.py lives), tempfile imports pyspark.mllib.random instead of the standard library random module. The import chain eventually reaches tempfile again, forming a cycle. Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile → (cyclic import!!)
Furthermore, the stat module is part of the standard library and a pyspark.mllib.stat module also exists; this may cause the same kind of trouble. commit: 0e8203f4fb721158fb27897680da476174d24c4b
A fundamental solution is to avoid module names used by standard libraries (currently random and stat). The difficulty with this solution is that renaming pyspark.mllib.random and pyspark.mllib.stat may break code that already uses them.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
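For those unfamiliar with the mechanism, the shadowing is easy to demonstrate outside of Spark. A minimal Python sketch (all file and module names here are illustrative):
{code}
import os
import subprocess
import sys
import tempfile  # the real stdlib module, imported before anything is shadowed

# Create a scratch directory containing a module named "random.py", mirroring
# how pyspark/mllib/random.py sits next to classification.py.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "random.py"), "w") as f:
    f.write("print('local random.py imported instead of the stdlib random')\n")
with open(os.path.join(workdir, "main.py"), "w") as f:
    # Running "python main.py" puts main.py's own directory first on sys.path,
    # so "import random" resolves to the local file, not the standard library.
    f.write("import random\n")

subprocess.check_call([sys.executable, os.path.join(workdir, "main.py")])
# prints: local random.py imported instead of the stdlib random
{code}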
[jira] [Updated] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Fry updated SPARK-4879: Attachment: speculation2.txt speculation.txt
Hey Josh, I have been playing around with your repro above and I think I can consistently trigger the bad behavior just by tweaking the values of {{spark.speculation.multiplier}} and {{spark.speculation.quantile}}. I set the {{multiplier}} to 1 and the {{quantile}} to 0.01, so that speculation starts once only 1% of tasks have finished, and any task running longer than the median duration of those completed tasks is speculated. As expected, I see a lot of tasks getting speculated. After running the repro about 5 times, I have seen 2 errors (stack traces are at the bottom, and the full run from the REPL is attached to this comment). One thing I do notice is that the part-0 associated with Stage 1 was always where I expected it to be in HDFS, and all lines were present (checked with {{wc -l}}).
{code}
scala> 15/01/07 13:44:26 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 119, redacted-host-02): java.io.IOException: The temporary job-output directory hdfs://redacted-host-01:8020/test6/_temporary doesn't exist!
        org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
        org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)
        org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
        org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)
{code}
{code}
15/01/07 15:17:39 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 120, redacted-host-03): org.apache.hadoop.ipc.RemoteException: No lease on /test7/_temporary/_attempt_201501071517__m_00_120/part-0: File does not exist. Holder DFSClient_NONMAPREDUCE_-469253416_73 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2609)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2426)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2339)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:501)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:299)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44954)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)
        org.apache.hadoop.ipc.Client.call(Client.java:1238)
        org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        com.sun.proxy.$Proxy9.addBlock(Unknown Source)
        sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:606)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        com.sun.proxy.$Proxy9.addBlock(Unknown Source)
        org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:291)
        org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1177)
{code}
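For reference, the aggressive-speculation setup described above as a minimal PySpark sketch; only the multiplier/quantile values and the speculation settings come from the comment, and the job body is illustrative:
{code}
from pyspark import SparkConf, SparkContext

# Aggressive speculation: after 1% of tasks finish, any task slower than
# 1x the median duration of completed tasks is re-launched speculatively.
conf = (SparkConf()
        .setAppName("speculation-repro")
        .set("spark.speculation", "true")
        .set("spark.speculation.multiplier", "1")
        .set("spark.speculation.quantile", "0.01"))

sc = SparkContext(conf=conf)

# Illustrative job: write output to HDFS so duplicate speculative attempts
# race on the same _temporary directory, as in the attached logs.
sc.parallelize(range(1000000), 100).map(lambda x: str(x)).saveAsTextFile("hdfs:///test-speculation")
{code}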
[jira] [Created] (SPARK-5139) select table_alias.* with joins and selecting column names from inner queries not supported
Sunita Koppar created SPARK-5139: Summary: select table_alias.* with joins and selecting column names from inner queries not supported Key: SPARK-5139 URL: https://issues.apache.org/jira/browse/SPARK-5139 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Environment: Eclipse + SBT as well as linux cluster Reporter: Sunita Koppar Priority: Blocker
There are 2 issues here:
1. select table_alias.* on a joined query is not supported. The exception thrown is as below:
{noformat}
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
 at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
 at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
 at croevss.WfPlsRej$.plsrej(WfPlsRej.scala:80)
 at croevss.WfPlsRej$.main(WfPlsRej.scala:40)
 at croevss.WfPlsRej.main(WfPlsRej.scala)
{noformat}
2. Multilevel nesting chokes with messages like this:
{noformat}
Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes:
{noformat}
Below is a sample query that runs on Hive but fails with Spark SQL for the above reasons (see also the stripped-down reproduction after this message):
{noformat}
SELECT sq.*, r.*
FROM (SELECT cs.*, w.primary_key, w.id AS s_id1, w.d_cd, w.d_name, w.rd, w.completion_date AS completion_date1, w.sales_type AS sales_type1
      FROM (SELECT stg.s_id, stg.c_id, stg.v, stg.flg1, stg.flg2, comstg.d1, comstg.d2, comstg.d3,
            FROM croe_rej_stage_pq stg
            JOIN croe_rej_stage_comments_pq comstg ON ( stg.s_id = comstg.s_id )
            WHERE comstg.valid_flg_txt = 'Y' AND stg.valid_flg_txt = 'Y'
            ORDER BY stg.s_id) cs
      JOIN croe_rej_work_pq w ON ( cs.s_id = w.s_id )) sq
JOIN CROE_rdr_pq r ON ( sq.d_cd = r.d_number )
{noformat}
This is very cumbersome to deal with, and we end up creating StructTypes for every level. If there is a better way to deal with this, please let us know.
Regards, Sunita
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
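A stripped-down reproduction of issue 1 against the 1.1-era PySpark SQL API; the table and column names below are hypothetical stand-ins for the original query's tables:
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="alias-star-repro")
sqlCtx = SQLContext(sc)

a = sqlCtx.inferSchema(sc.parallelize([Row(id=1, v="x")]))
b = sqlCtx.inferSchema(sc.parallelize([Row(id=1, w="y")]))
a.registerTempTable("a")
b.registerTempTable("b")

# "sq.*" on an aliased, joined subquery is what SqlParser rejects in 1.1.x.
sqlCtx.sql("""
    SELECT sq.* FROM
      (SELECT a.id, a.v, b.w FROM a JOIN b ON (a.id = b.id)) sq
""").collect()
{code}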
[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268440#comment-14268440 ] Gerard Maas edited comment on SPARK-4940 at 1/7/15 10:52 PM: - Hi Tim, We are indeed using coarse-grained mode. I'm not sure fine-grained mode makes much sense for Spark Streaming. Here are a few examples of resource allocation, taken from several runs of the same job with identical configuration:
Job config: spark.cores.max = 18, spark.mesos.coarse = true, spark.executor.memory = 4g. The job logic starts 6 Kafka receivers.
#1 --
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 4 | 4GB | 3 | 2 |
| 2 | 6 | 4GB | 2 | 1 |
| 3 | 7 | 4GB | 3 | 2 |
| 4 | 1 | 4GB | 1 | 1 |
Total mem: 16 GB. Total CPUs: 18.
Observations: Node #4, with only 1 CPU and 1 Kafka receiver, does not have the capacity to process the received data, so everything it receives must be sent to other nodes for non-local processing (it is unclear whether replication helps here, since the blocks of data are processed on other nodes anyway). The nodes with 2 streaming receivers also carry a higher load than the node with 1 receiver.
#2 --
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 7 | 4GB | 7 | 4 |
| 2 | 2 | 4GB | 2 | 2 |
Total mem: 8 GB. Total CPUs: 9.
Observations: This is the worst configuration of the day: totally unbalanced (4 vs 2 receivers), and for some reason the job did not get all the resources requested in the configuration. Processing time is also slower, as there are fewer cores to handle the data and less overall memory.
#3 --
|| Node || Mesos CPU || Mesos Mem || Spark tasks || Streaming receivers ||
| 1 | 3 | 4GB | 3 | 2 |
| 2 | 8 | 4GB | 2 | 2 |
| 3 | 7 | 4GB | 3 | 2 |
Total mem: 12GB. Total CPU: 18.
Observations: This is a fairly good configuration, with receivers and CPUs more evenly distributed, although one node is considerably smaller in terms of CPU assignment.
We can observe that the current resource assignment policy produces suboptimal and, in particular, random assignments that have a strong impact on job execution and performance. Because CPU allocation is per executor (not per job), the total memory for the job also varies, as the job can be assigned 2 to 4 executors. It is also odd and unexpected to observe allocations below the configured maximum CPUs.
Here's a performance chart of the same job jumping from one config to another (*), one with 3 nodes (left) and one with 2 (right): !https://lh3.googleusercontent.com/Z1C71OKoQzGA13uNJ8Yvf_xz_glRUqU_IGGvLsfkPvUPK2lahrEatweiWl-PDDfysjXtbs1Sl_k=w1682-h689! (chart line: processing time in ms; load is fairly constant)
(*) For a reason we have not yet found, Mesos often kills the job; when Marathon relaunches it, the job ends up with a different resource assignment.
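The quoted job configuration, expressed as a minimal PySpark sketch; the master URL is a placeholder and the streaming setup is illustrative (the original job may well be Scala):
{code}
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Configuration from the comment: 18 cores max, coarse-grained Mesos, 4 GB executors.
conf = (SparkConf()
        .setMaster("mesos://zk://mesos-master:2181/mesos")  # placeholder master URL
        .set("spark.cores.max", "18")
        .set("spark.mesos.coarse", "true")
        .set("spark.executor.memory", "4g"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=10)
# The job then starts 6 Kafka receivers (omitted here); how Mesos spreads the
# 18 cores across nodes determines where those receivers land.
{code}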
[jira] [Updated] (SPARK-4951) A busy executor may be killed when dynamicAllocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4951: - Assignee: Shixiong Zhu
A busy executor may be killed when dynamicAllocation is enabled
--- Key: SPARK-4951 URL: https://issues.apache.org/jira/browse/SPARK-4951 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu
If a task runs longer than `spark.dynamicAllocation.executorIdleTimeout`, the executor which runs this task will be killed. The following steps (yarn-client mode) can reproduce this bug:
1. Start `spark-shell` using
{code}
./bin/spark-shell --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=4 \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.dynamicAllocation.executorIdleTimeout=30 \
 --master yarn-client \
 --driver-memory 512m \
 --executor-memory 512m \
 --executor-cores 1
{code}
2. Wait more than 30 seconds until there is only one executor.
3. Run the following code (a task needs at least 50 seconds to finish):
{code}
val r = sc.parallelize(1 to 1000, 20).map { t => Thread.sleep(1000); t }.groupBy(_ % 2).collect()
{code}
4. Executors will be killed and allocated all the time, which makes the job fail.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
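A PySpark translation of the step-3 job, for anyone testing from the Python shell; it assumes the shell-provided {{sc}} and the same cluster setup, and each of the 20 tasks sleeps about 50 seconds in total, exceeding the 30-second idle timeout:
{code}
import time

# 20 tasks over 1000 elements; each task sleeps ~50 seconds in total,
# comfortably exceeding spark.dynamicAllocation.executorIdleTimeout=30.
def slow(t):
    time.sleep(1)
    return t

r = (sc.parallelize(range(1, 1001), 20)
       .map(slow)
       .groupBy(lambda t: t % 2)
       .collect())
{code}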
[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution
[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268496#comment-14268496 ] Davies Liu commented on SPARK-3910: --- The 1.2 branch should not fail in a clean environment, where is the logging about the failure?
./python/pyspark/mllib/classification.py doctests fails with module name pollution
-- Key: SPARK-3910 URL: https://issues.apache.org/jira/browse/SPARK-3910 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Reporter: Tomohiko K. Labels: pyspark, testing
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution
[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268498#comment-14268498 ] Josh Rosen commented on SPARK-3910: --- Oh, I noticed this in 1.1 (while setting up SBT tests for the backport branches: SPARK-5053). Here's a sample failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.1-SBT/1/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/console
./python/pyspark/mllib/classification.py doctests fails with module name pollution
-- Key: SPARK-3910 URL: https://issues.apache.org/jira/browse/SPARK-3910 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Reporter: Tomohiko K. Labels: pyspark, testing
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution
[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268559#comment-14268559 ] Davies Liu commented on SPARK-3910: --- It does not have random.py in branch-1.0, so 1.1 is the only branch we need to backport or patch.
./python/pyspark/mllib/classification.py doctests fails with module name pollution
-- Key: SPARK-3910 URL: https://issues.apache.org/jira/browse/SPARK-3910 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Reporter: Tomohiko K. Labels: pyspark, testing
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268640#comment-14268640 ] Davies Liu commented on SPARK-3789: --- Any updates? Python bindings for GraphX -- Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar Assignee: Kushal Datta -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268649#comment-14268649 ] Kushal Datta commented on SPARK-3789: - Hi Davies, Here is the list of things I have completed so far:
- Java API for VertexRDD, EdgeRDD and Graph
- Unit tests for JavaVertexRDD, JavaEdgeRDD and JavaGraph
- Python API for VertexRDD, EdgeRDD and Graph in Scala, including
-- PythonVertexRDD, PythonEdgeRDD and PythonGraph
-- vertex, edge and graph transformations and actions
In progress:
- Pregel API in Python, which includes
-- adding the new Pregel API in Python
-- serializing vertexProgram, sendMessage, mergeMsg and initialMsg
-Kushal
Python bindings for GraphX -- Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar Assignee: Kushal Datta
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project
[ https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268146#comment-14268146 ] Ryan Williams commented on SPARK-5136: -- I've not started on it so feel free to grab the lock. If I've not heard from you I'll take a crack at it in the next week or so. Improve documentation around setting up Spark IntelliJ project -- Key: SPARK-5136 URL: https://issues.apache.org/jira/browse/SPARK-5136 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [The documentation about setting up a Spark project in Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea] is somewhat short/cryptic and targets [an IntelliJ version released in 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is probably warranted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5136) Improve documentation around setting up Spark IntelliJ project
[ https://issues.apache.org/jira/browse/SPARK-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268106#comment-14268106 ] Sean Owen commented on SPARK-5136: -- [~rdub] Are you taking a crack at this or should I? I think the instructions could be elaborated a bit, particularly about picking profiles. It will be correct for any recent IntelliJ. Improve documentation around setting up Spark IntelliJ project -- Key: SPARK-5136 URL: https://issues.apache.org/jira/browse/SPARK-5136 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [The documentation about setting up a Spark project in Intellij|http://spark.apache.org/docs/latest/building-spark.html#using-with-intellij-idea] is somewhat short/cryptic and targets [an IntelliJ version released in 2012|https://www.jetbrains.com/company/history.jsp]. A refresh / upgrade is probably warranted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5039) Spark 1.0 2.0.0-mr1-cdh4.1.2 Maven build fails due to javax.servlet.FilterRegistration's signer information errors
[ https://issues.apache.org/jira/browse/SPARK-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5039: -- Assignee: Sean Owen
Spark 1.0 2.0.0-mr1-cdh4.1.2 Maven build fails due to javax.servlet.FilterRegistration's signer information errors
- Key: SPARK-5039 URL: https://issues.apache.org/jira/browse/SPARK-5039 Project: Spark Issue Type: Bug Components: Build, Project Infra Affects Versions: 1.0.2 Reporter: Josh Rosen Assignee: Sean Owen Labels: starter Fix For: 1.0.3
One of the four {{branch-1.0}} maven builds has been consistently failing due to servlet class signing errors: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.0-Maven-pre-YARN/ For example:
{code}
ContextCleanerSuite:
Exception encountered when invoking run on a nested suite - class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package *** ABORTED ***
  java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package
  at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
  at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  ...
{code}
The fix for this issue is declaring proper exclusions for some implementations of the servlet API. I know how to do this, but I don't have time to take care of it now, so I'm tossing up this JIRA to facilitate work-stealing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5132) The name for get stage info attempt ID from Json was wrong
[ https://issues.apache.org/jira/browse/SPARK-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5132. --- Resolution: Fixed Fix Version/s: (was: 1.2.0) 1.2.1 1.3.0 1.1.2 Issue resolved by pull request 3932 [https://github.com/apache/spark/pull/3932] The name for get stage info attempt ID from Json was wrong - Key: SPARK-5132 URL: https://issues.apache.org/jira/browse/SPARK-5132 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: SuYan Priority: Minor Fix For: 1.1.2, 1.3.0, 1.2.1 stageInfoToJson: Stage Attempt Id stageInfoFromJson: Attempt Id -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268241#comment-14268241 ] Apache Spark commented on SPARK-5108: - User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/3937 Need to make jackson dependency version consistent with hadoop-2.6.0. - Key: SPARK-5108 URL: https://issues.apache.org/jira/browse/SPARK-5108 Project: Spark Issue Type: Bug Components: Build Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4389) Set akka.remote.netty.tcp.bind-hostname=0.0.0.0 so driver can be located behind NAT
[ https://issues.apache.org/jira/browse/SPARK-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4389: - Affects Version/s: 1.2.0 Set akka.remote.netty.tcp.bind-hostname=0.0.0.0 so driver can be located behind NAT - Key: SPARK-4389 URL: https://issues.apache.org/jira/browse/SPARK-4389 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Priority: Minor We should set {{akka.remote.netty.tcp.bind-hostname=0.0.0.0}} in our Akka configuration so that Spark drivers can be located behind NATs / work with weird DNS setups. This is blocked by upgrading our Akka version, since this configuration is not present in Akka 2.3.4. There might be a different approach / workaround that works on our current Akka version, though. EDIT: this is blocked by Akka 2.4, since this feature is only available in the 2.4 snapshot release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268113#comment-14268113 ] Zach Fry commented on SPARK-4906: - [~pwendell], Looks like you wanted to ping [~mkim]. He's away until the end of next week, so when he gets back he can take a look at this and get back to you. We also have some more datapoints to go from, so more to come. Zach Spark master OOMs with exception stack trace stored in JobProgressListener -- Key: SPARK-4906 URL: https://issues.apache.org/jira/browse/SPARK-4906 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.1.1 Reporter: Mingyu Kim Spark master was OOMing with a lot of stack traces retained in JobProgressListener. The object dependency goes like the following: JobProgressListener.stageIdToData => StageUIData.taskData => TaskUIData.errorMessage. Each error message is ~10kb since it has the entire stack trace. As we have a lot of tasks, when all of the tasks across multiple stages go bad, these error messages accounted for 0.5GB of heap at some point. Please correct me if I'm wrong, but it looks like all the task info for running applications are kept in memory, which means it's almost always bound to OOM for long-running applications. Would it make sense to fix this, for example, by spilling some UI states to disk? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4406) SVD should check for k < 1
[ https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268180#comment-14268180 ] Manoj Kumar commented on SPARK-4406: Hi Joseph, I believe this issue would be simple enough for me to start working on? Does it require you to assign it to me, or can I send a Pull Request right away? SVD should check for k < 1 -- Key: SPARK-4406 URL: https://issues.apache.org/jira/browse/SPARK-4406 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor When SVD is called with k < 1, it still tries to compute the SVD, causing a lower-level error. It should fail early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
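The fail-early guard itself is simple. An illustrative Python sketch of the intended check (MLlib's actual SVD is implemented in Scala's RowMatrix, so this is only the shape of the fix, not the patch itself):
{code}
def compute_svd(matrix, k):
    """Compute the top-k singular values/vectors of `matrix` (illustrative)."""
    # Fail early with a clear message instead of surfacing an opaque
    # lower-level error from the underlying eigensolver.
    n_cols = len(matrix[0])
    if not 1 <= k <= n_cols:
        raise ValueError(
            "k must satisfy 1 <= k <= n (number of columns); "
            "got k=%d, n=%d" % (k, n_cols))
    # ... actual decomposition would follow here ...
    raise NotImplementedError("decomposition omitted in this sketch")
{code}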
[jira] [Updated] (SPARK-2298) Show stage attempt in UI
[ https://issues.apache.org/jira/browse/SPARK-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2298: -- Fix Version/s: 1.1.0 Show stage attempt in UI Key: SPARK-2298 URL: https://issues.apache.org/jira/browse/SPARK-2298 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0 Attachments: Screen Shot 2014-06-25 at 4.54.46 PM.png We should add a column to the web ui to show stage attempt id. Then tasks should be grouped by (stageId, stageAttempt) tuple. When a stage is resubmitted (e.g. due to fetch failures), we should get a different entry in the web ui and tasks for the resubmission go there. See the attached screenshot for the confusing status quo. We currently show the same stage entry twice, and then tasks appear in both. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5132) The name for get stage info attempt ID from Json was wrong
[ https://issues.apache.org/jira/browse/SPARK-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5132: -- Component/s: Web UI The name for get stage info attempt ID from Json was wrong - Key: SPARK-5132 URL: https://issues.apache.org/jira/browse/SPARK-5132 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.2.0 Reporter: SuYan Assignee: SuYan Priority: Minor Fix For: 1.3.0, 1.1.2, 1.2.1 stageInfoToJson: Stage Attempt Id stageInfoFromJson: Attempt Id -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5132) The name for get stage info attempt ID from Json was wrong
[ https://issues.apache.org/jira/browse/SPARK-5132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5132: -- Assignee: SuYan The name for get stage info attempt ID from Json was wrong - Key: SPARK-5132 URL: https://issues.apache.org/jira/browse/SPARK-5132 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.2.0 Reporter: SuYan Assignee: SuYan Priority: Minor Fix For: 1.3.0, 1.1.2, 1.2.1 stageInfoToJson: Stage Attempt Id stageInfoFromJson: Attempt Id -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4406) SVD should check for k < 1
[ https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268180#comment-14268180 ] Manoj Kumar edited comment on SPARK-4406 at 1/7/15 8:40 PM: Hi Joseph, I believe this issue would be simple enough for me to start working on. Does it require you to assign it to me, or can I send a Pull Request right away?
was (Author: mechcoder): Hi Joseph, I believe this issue would be simple enough for me to start working on? Does it require you to assign it to me, or can I send a Pull Request right away?
SVD should check for k < 1 -- Key: SPARK-4406 URL: https://issues.apache.org/jira/browse/SPARK-4406 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor When SVD is called with k < 1, it still tries to compute the SVD, causing a lower-level error. It should fail early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267701#comment-14267701 ] Jongyoul Lee commented on SPARK-3619: - [~tnachen] Bumping the version from 0.18.1 to 0.21.0 is easy. I'm running simple and complex job tests on my real Mesos clusters. Upgrade to Mesos 0.21 to work around MESOS-1688 --- Key: SPARK-3619 URL: https://issues.apache.org/jira/browse/SPARK-3619 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Matei Zaharia Assignee: Timothy Chen The Mesos 0.21 release has a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2458) Make failed application log visible on History Server
[ https://issues.apache.org/jira/browse/SPARK-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-2458. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Masayoshi TSUZUKI Target Version/s: 1.3.0 Make failed application log visible on History Server - Key: SPARK-2458 URL: https://issues.apache.org/jira/browse/SPARK-2458 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Fix For: 1.3.0 The history server is very helpful for debugging application correctness and performance after an application has finished. However, when an application fails, its link is not listed on the history server UI and its history can't be viewed. It would be very useful if we could check the history of failed applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5053) Test maintenance branches on Jenkins using SBT
[ https://issues.apache.org/jira/browse/SPARK-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268313#comment-14268313 ] Josh Rosen commented on SPARK-5053: --- It looks like nearly all of these new builds are failing for various reasons, so I could use some help fixing them. One issue is that several of the PySpark tests are failing with
{code}
OK (skipped=1)
Traceback (most recent call last):
  File "pyspark/mllib/_common.py", line 20, in <module>
    import numpy
  File "/usr/lib64/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
    from . import add_newdocs
  File "/usr/lib64/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/usr/lib64/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/usr/lib64/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/usr/lib64/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
    from numpy.testing import Tester
  File "/usr/lib64/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
    from .utils import *
  File "/usr/lib64/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
    from tempfile import mkdtemp
  File "/usr/lib64/python2.6/tempfile.py", line 34, in <module>
    from random import Random as _Random
  File "/home/jenkins/workspace/Spark-1.1-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/python/pyspark/mllib/random.py", line 23, in <module>
    from pyspark.rdd import RDD
  File "/home/jenkins/workspace/Spark-1.1-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/python/pyspark/__init__.py", line 63, in <module>
    from pyspark.context import SparkContext
  File "/home/jenkins/workspace/Spark-1.1-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/python/pyspark/context.py", line 22, in <module>
    from tempfile import NamedTemporaryFile
ImportError: cannot import name NamedTemporaryFile
{code}
Some of the other failures might just be due to flaky tests exposed by higher Jenkins loads; let's see if they persist after rebuilds.
Test maintenance branches on Jenkins using SBT
-- Key: SPARK-5053 URL: https://issues.apache.org/jira/browse/SPARK-5053 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Josh Rosen Priority: Blocker
We need to create Jenkins jobs to test maintenance branches using SBT. The current Maven jobs for backport branches do not run the same checks that the pull request builder / SBT builds do (e.g. MiMa checks, PySpark, RAT, etc.), which means that cherry-picking backports can silently break things and we'll only discover it once PRs that are explicitly opened against those branches fail tests; this long delay between introducing test failures and detecting them is a huge productivity issue.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
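A quick diagnostic for this kind of stdlib shadowing on a worker (illustrative; not part of the build scripts): print where random and stat actually resolve from, and check whether either path points into python/pyspark/mllib.
{code}
# If either printed path points into python/pyspark/mllib, the interpreter is
# picking up the PySpark siblings instead of the standard library modules.
import random
import stat

for mod in (random, stat):
    print("%s -> %s" % (mod.__name__, getattr(mod, "__file__", "<built-in>")))
{code}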
[jira] [Commented] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268336#comment-14268336 ] Apache Spark commented on SPARK-5108: - User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/3938 Need to make jackson dependency version consistent with hadoop-2.6.0. - Key: SPARK-5108 URL: https://issues.apache.org/jira/browse/SPARK-5108 Project: Spark Issue Type: Bug Components: Build Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4406) SVD should check for k < 1
[ https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268358#comment-14268358 ] Joseph K. Bradley commented on SPARK-4406: -- It's good to get it assigned if it will take a while, but feel free to submit a PR if it's simple like this one. If a PR will take time, then posting a comment that you're working on it is helpful. Thanks in advance! SVD should check for k < 1 -- Key: SPARK-4406 URL: https://issues.apache.org/jira/browse/SPARK-4406 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor When SVD is called with k < 1, it still tries to compute the SVD, causing a lower-level error. It should fail early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5053) Test maintenance branches on Jenkins using SBT
[ https://issues.apache.org/jira/browse/SPARK-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268321#comment-14268321 ] Josh Rosen commented on SPARK-5053: --- Hmm, it looks like the Python issue is an occurrence of SPARK-3910. This _used_ to work, so I'm not sure why it's failing now. Test maintenance branches on Jenkins using SBT -- Key: SPARK-5053 URL: https://issues.apache.org/jira/browse/SPARK-5053 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Josh Rosen Priority: Blocker We need to create Jenkins jobs to test maintenance branches using SBT. The current Maven jobs for backport branches do not run the same checks that the pull request builder / SBT builds do (e.g. MiMa checks, PySpark, RAT, etc.) which means that cherry-picking backports can silently break things and we'll only discover it once PRs that are explicitly opened against those branches fail tests; this long delay between introducing test failures and detecting them is a huge productivity issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution
[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268332#comment-14268332 ] Josh Rosen commented on SPARK-3910: --- It looks like this was fixed in SPARK-4348, but we're now hitting this error when running PySpark tests in Jenkins jobs for maintenance branches (Jenkins wasn't previously running these tests for those branches, so it's not clear when the problem was introduced). I'll see if I can figure out a fix for the backport branches.
./python/pyspark/mllib/classification.py doctests fails with module name pollution
-- Key: SPARK-3910 URL: https://issues.apache.org/jira/browse/SPARK-3910 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Reporter: Tomohiko K. Labels: pyspark, testing
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
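For branches that cannot take the full fix, one general mitigation for this class of shadowing (not necessarily what SPARK-4348 did) is to drop the module's own directory from sys.path before anything imports the stdlib:
{code}
# Illustrative guard at the top of an affected module such as
# pyspark/mllib/classification.py; hypothetical, not the actual Spark patch.
import os
import sys

_here = os.path.dirname(os.path.abspath(__file__))
# Remove this package directory from sys.path so that "import random" and
# "import stat" made elsewhere resolve to the standard library rather than
# to sibling files in pyspark/mllib. An empty entry means the current dir.
sys.path = [p for p in sys.path if os.path.abspath(p or ".") != _here]
{code}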
[jira] [Commented] (SPARK-4406) SVD should check for k < 1
[ https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268368#comment-14268368 ] Joseph K. Bradley commented on SPARK-4406: -- Also, to get JIRAs assigned to you, you will need to get an admin like [~mengxr] to add you to the developer group for this project. (For this JIRA, the comment should be good enough.) SVD should check for k < 1 -- Key: SPARK-4406 URL: https://issues.apache.org/jira/browse/SPARK-4406 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor When SVD is called with k < 1, it still tries to compute the SVD, causing a lower-level error. It should fail early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268390#comment-14268390 ] Apache Spark commented on SPARK-5122: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/3939 Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} anymore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4951) A busy executor may be killed when dynamicAllocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4951: - Affects Version/s: 1.2.0 A busy executor may be killed when dynamicAllocation is enabled --- Key: SPARK-4951 URL: https://issues.apache.org/jira/browse/SPARK-4951 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Shixiong Zhu If a task runs for longer than `spark.dynamicAllocation.executorIdleTimeout`, the executor running that task will be killed. The following steps (yarn-client mode) reproduce this bug: 1. Start `spark-shell` using
{code}
./bin/spark-shell --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=30 \
  --master yarn-client \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1
{code}
2. Wait more than 30 seconds until there is only one executor. 3. Run the following code (each task needs at least 50 seconds to finish)
{code}
val r = sc.parallelize(1 to 1000, 20).map{t => Thread.sleep(1000); t}.groupBy(_ % 2).collect()
{code}
4. Executors are killed and reallocated continually, which makes the job fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
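The reproduction works because the idle timer ignores whether tasks are still running. A sketch in Python of what correct bookkeeping has to track (the class and method names are illustrative stand-ins, not Spark's ExecutorAllocationManager code):
{code}
import time

class IdleTracker(object):
    """An executor is removable only once it has been idle for the timeout."""
    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.running_tasks = {}  # executor id -> number of running tasks
        self.idle_since = {}     # executor id -> time it last became idle

    def task_started(self, executor):
        self.running_tasks[executor] = self.running_tasks.get(executor, 0) + 1
        self.idle_since.pop(executor, None)  # cancel any pending removal

    def task_finished(self, executor):
        self.running_tasks[executor] -= 1
        if self.running_tasks[executor] == 0:
            self.idle_since[executor] = time.time()

    def removable(self, executor):
        # The reported bug behaves as if this first check were missing:
        # an executor gets killed while a long task is still running on it.
        if self.running_tasks.get(executor, 0) > 0:
            return False
        idle_start = self.idle_since.get(executor)
        return idle_start is not None and \
            time.time() - idle_start >= self.idle_timeout
{code}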
[jira] [Updated] (SPARK-4983) Tag EC2 instances in the same call that launches them
[ https://issues.apache.org/jira/browse/SPARK-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-4983: Labels: starter (was: ) Tag EC2 instances in the same call that launches them - Key: SPARK-4983 URL: https://issues.apache.org/jira/browse/SPARK-4983 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor Labels: starter We launch EC2 instances in {{spark-ec2}} and then immediately tag them in a separate boto call. Sometimes EC2 doesn't have enough time to propagate information about the just-launched instances, so when we go to tag them we hit a server that doesn't know about them yet. This yields the following type of error:
{code}
Launching instances...
Launched 1 slaves in us-east-1b, regid = r-cf780321
Launched master in us-east-1b, regid = r-da7e0534
Traceback (most recent call last):
  File "./ec2/spark_ec2.py", line 1284, in <module>
    main()
  File "./ec2/spark_ec2.py", line 1276, in main
    real_main()
  File "./ec2/spark_ec2.py", line 1122, in real_main
    (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
  File "./ec2/spark_ec2.py", line 646, in launch_cluster
    value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
  File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 80, in add_tag
    self.add_tags({key: value}, dry_run)
  File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 97, in add_tags
    dry_run=dry_run
  File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 4202, in create_tags
    return self.get_status('CreateTags', params, verb='POST')
  File ".../spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1223, in get_status
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-585219a6' does not exist</Message></Error></Errors><RequestID>b9f1ad6e-59b9-47fd-a693-527be1f779eb</RequestID></Response>
{code}
The solution is to tag the instances in the same call that launches them or, less desirably, to tag the instances after some short wait. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
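If tags cannot be attached in the launch call itself, the "short wait" fallback amounts to retrying the tagging call until the instance becomes visible. A hedged boto 2 sketch (the helper name and retry policy are made up for illustration; instance.add_tag is the call from the traceback above):
{code}
import time
from boto.exception import EC2ResponseError

def add_tag_with_retry(instance, key, value, attempts=5, delay_seconds=5):
    # EC2 is eventually consistent: a just-launched instance id may not be
    # visible to CreateTags yet, so retry on InvalidInstanceID.NotFound.
    for attempt in range(attempts):
        try:
            instance.add_tag(key, value)
            return
        except EC2ResponseError as e:
            if e.error_code != 'InvalidInstanceID.NotFound':
                raise  # unrelated failure; don't mask it
            if attempt == attempts - 1:
                raise  # out of retries
            time.sleep(delay_seconds)
{code}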
[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fail with module name pollution
[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268411#comment-14268411 ] Apache Spark commented on SPARK-3910: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3940 ./python/pyspark/mllib/classification.py doctests fail with module name pollution -- Key: SPARK-3910 URL: https://issues.apache.org/jira/browse/SPARK-3910 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, unittest2==0.5.1, wsgiref==0.1.2 Reporter: Tomohiko K. Labels: pyspark, testing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3789) Python bindings for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268658#comment-14268658 ] Ameet Talwalkar commented on SPARK-3789: Agreed, thanks for the update! Also, the 1.3 release is a good target if I'm going to use this in my MOOC... Python bindings for GraphX -- Key: SPARK-3789 URL: https://issues.apache.org/jira/browse/SPARK-3789 Project: Spark Issue Type: New Feature Components: GraphX, PySpark Reporter: Ameet Talwalkar Assignee: Kushal Datta -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4777) Some block memory after unrollSafely is not counted as used memory (memoryStore.entries or unrollMemory)
[ https://issues.apache.org/jira/browse/SPARK-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4777: - Priority: Major (was: Minor) Some block memory after unrollSafely is not counted as used memory (memoryStore.entries or unrollMemory) --- Key: SPARK-4777 URL: https://issues.apache.org/jira/browse/SPARK-4777 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: SuYan Some block memory is not counted as memory used by memoryStore or unrollMemory. After thread A unrolls a block via unrollSafely, it releases its 40 MB of unrollMemory (which other threads can then use) and waits to acquire accountingLock so it can tryToPut blockA (30 MB). Until thread A acquires accountingLock, blockA's size is counted in neither unrollMemory nor memoryStore.currentMemory. IIUC, freeMemory should subtract that block's memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
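The race is easier to see as a timeline. A small Python model of the accounting gap described above (the names are simplified stand-ins for MemoryStore and accountingLock, not Spark's actual code):
{code}
import threading

class MemoryStoreModel(object):
    def __init__(self, capacity):
        self.accounting_lock = threading.Lock()
        self.current_memory = 0  # bytes held by stored blocks
        self.unroll_memory = 0   # bytes reserved while unrolling
        self.capacity = capacity

    def free_memory(self):
        return self.capacity - self.current_memory - self.unroll_memory

    def unroll_then_put(self, unroll_bytes, block_bytes):
        self.unroll_memory += unroll_bytes  # reserved during unrollSafely
        # ... the block is unrolled here ...
        self.unroll_memory -= unroll_bytes  # released before the put
        # GAP: from here until the lock below is acquired, block_bytes is
        # counted nowhere, so free_memory() over-reports and other threads
        # may admit blocks that do not actually fit.
        with self.accounting_lock:
            self.current_memory += block_bytes  # tryToPut finally counts it
{code}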