[jira] [Updated] (SPARK-4359) Empty classifier in avro-mapred is misinterpreted in SBT
[ https://issues.apache.org/jira/browse/SPARK-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4359: - Description: In the parent pom, avro.mapred.classifier is set to hadoop2 for Yarn but not otherwise set. As a result, when an application that uses spark-hive_2.10 as a module is built with SBT, it will try to resolve a jar that is literally called the following: {code} [warn] Maven Repository: tried [warn] http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.6/avro-mapred-1.7.6-${avro.mapred.classifier}.jar [warn] :: [warn] :: FAILED DOWNLOADS:: [warn] :: ^ see resolution messages for details ^ :: [warn] :: [warn] :: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar [warn] :: sbt.ResolveException: download failed: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar {code} This is because avro.mapred.classifier is not a variable according to SBT. was: In the parent pom, avro.mapred.classifier is set to hadoop2 for Yarn but not otherwise set. As a result, when an application that uses spark-hive_2.10 as a module is built with SBT, it will try to resolve a jar that is literally called {code} avro-mapred-1.7.6-${avro.mapred.classifier}.jar {code} This is because avro.mapred.classifier is not a variable according to SBT. Empty classifier in avro-mapred is misinterpreted in SBT -- Key: SPARK-4359 URL: https://issues.apache.org/jira/browse/SPARK-4359 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In the parent pom, avro.mapred.classifier is set to hadoop2 for Yarn but not otherwise set. As a result, when an application that uses spark-hive_2.10 as a module is built with SBT, it will try to resolve a jar that is literally called the following: {code} [warn] Maven Repository: tried [warn] http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.6/avro-mapred-1.7.6-${avro.mapred.classifier}.jar [warn]:: [warn]:: FAILED DOWNLOADS:: [warn]:: ^ see resolution messages for details ^ :: [warn]:: [warn]:: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar [warn]:: sbt.ResolveException: download failed: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar {code} This is because avro.mapred.classifier is not a variable according to SBT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
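A possible SBT-side workaround (a sketch only, not the fix adopted in the parent pom): exclude the unresolvable avro-mapred artifact pulled in through spark-hive and add it back with an explicit classifier. The version and classifier values below are assumptions for illustration.
{code}
// build.sbt sketch (assumed coordinates)
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.1.0" excludeAll(
  ExclusionRule(organization = "org.apache.avro", name = "avro-mapred"))
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
{code}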
[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207793#comment-14207793 ] Sean Owen commented on SPARK-4341: -- The problem is that the number of executors is then not appropriate for anything but the first action that is computed. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen A MapReduce job can set the number of map tasks automatically, but in Spark we have to set num-executors, executor memory and cores. It's difficult for users to set these arguments, especially for users who want to use Spark SQL. So when the user hasn't set num-executors, Spark should set num-executors automatically according to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
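For context, these are the knobs the description says users must currently set by hand on YARN; the values below are purely illustrative, not recommendations.
{code}
// manual settings the description refers to
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.instances", "12")  // --num-executors
  .set("spark.executor.memory", "4g")     // --executor-memory
  .set("spark.executor.cores", "2")       // --executor-cores
{code}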
[jira] [Commented] (SPARK-4359) Empty classifier in avro-mapred is misinterpreted in SBT
[ https://issues.apache.org/jira/browse/SPARK-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207794#comment-14207794 ] Andrew Or commented on SPARK-4359: -- Ok, I reverted commit https://github.com/apache/spark/commit/78887f94a0ae9cdcfb851910ab9c7d51a1ef2acb for branch-1.1 for now. Empty classifier in avro-mapred is misinterpreted in SBT -- Key: SPARK-4359 URL: https://issues.apache.org/jira/browse/SPARK-4359 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In the parent pom, avro.mapred.classifier is set to hadoop2 for Yarn but not otherwise set. As a result, when an application that uses spark-hive_2.10 as a module is built with SBT, it will try to resolve a jar that is literally called the following: {code} [warn] Maven Repository: tried [warn] http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.6/avro-mapred-1.7.6-${avro.mapred.classifier}.jar [warn]:: [warn]:: FAILED DOWNLOADS:: [warn]:: ^ see resolution messages for details ^ :: [warn]:: [warn]:: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar [warn]:: sbt.ResolveException: download failed: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar {code} This is because avro.mapred.classifier is not a variable according to SBT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207801#comment-14207801 ] Apache Spark commented on SPARK-2426: - User 'debasish83' has created a pull request for this issue: https://github.com/apache/spark/pull/3221 Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4353) Delete the val that never used in Catalog
[ https://issues.apache.org/jira/browse/SPARK-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4353. -- Resolution: Not a Problem Delete the val that never used in Catalog - Key: SPARK-4353 URL: https://issues.apache.org/jira/browse/SPARK-4353 Project: Spark Issue Type: Improvement Components: SQL Reporter: DoingDone9 Priority: Minor dbName in Catalog is never used, as in:
{code}
val (dbName, tblName) = processDatabaseAndTableName(databaseName, tableName)
tables -= tblName
{code}
I think it should be deleted; the assignment could instead be:
{code}
val tblName = processDatabaseAndTableName(databaseName, tableName)._2
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207816#comment-14207816 ] Hong Shen commented on SPARK-4341: -- After the first action is computed, we can set minPartitions for the following HadoopRDD. Then the following HadoopRDD's partitions won't be fewer than num-executors, and it will prevent wasting resources. On the other hand, if the following HadoopRDD has many more partitions than num-executors, we can report the new numExecutors to the ApplicationMaster and allocate new executors. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen A MapReduce job can set the number of map tasks automatically, but in Spark we have to set num-executors, executor memory and cores. It's difficult for users to set these arguments, especially for users who want to use Spark SQL. So when the user hasn't set num-executors, Spark should set num-executors automatically according to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
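A minimal sketch of the idea in the comment above; how targetParallelism is derived here is an assumption for illustration, not part of the proposal.
{code}
// Once the first action has run and the executor count is known,
// later Hadoop reads can request at least that many partitions.
val coresPerExecutor = 3                                        // assumed, normally taken from config
val targetParallelism = sc.getExecutorMemoryStatus.size * coresPerExecutor
val next = sc.textFile("/data/next-input", minPartitions = targetParallelism)
next.count()
{code}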
[jira] [Comment Edited] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207816#comment-14207816 ] Hong Shen edited comment on SPARK-4341 at 11/12/14 8:40 AM: After the first action computed, we can set nimPartition for the following HadoopRDD. So the following HadoopRDD's partitions won't less than num-executors, and it will prevent wasting of resources. On the other hand if the following HadoopRDD's partitions is much bigger than num-executors, we can reset numExecuors to ApplicaitonMaster and allocate new executors. was (Author: shenhong): After the first action computed, we can set set nimPartition for the following HadoopRDD. So the following HadoopRDD's partitions won't less than num-executors, and it will prevent wasting of resources. On the other hand if the following HadoopRDD's partitions is much bigger than num-executors, we can reset numExecuors to ApplicaitonMaster and allocate new executors. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen The mapreduce job can set maptask automaticlly, but in spark, we have to set num-executors, executor memory and cores. It's difficult for users to set these args, especially for the users want to use spark sql. So when user havn't set num-executors, spark should set num-executors automatically accroding to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207822#comment-14207822 ] Ankur Dave commented on SPARK-3206: --- I just tested this with the standalone version of PageRank that was introduced in SPARK-3427, and it seems to be fixed, so I'm closing this.
{code}
scala> val e = sc.parallelize(List((1, 2), (1, 3), (3, 2), (3, 4), (5, 3), (6, 7), (7, 8), (8, 9), (9, 7)))
scala> val g = Graph.fromEdgeTuples(e.map(kv => (kv._1.toLong, kv._2.toLong)), 0)
scala> g.pageRank(0.0001).vertices.collect.foreach(println)
(8,1.2808550959634413)
(1,0.15)
(9,1.2387268204156412)
(2,0.358781244)
(3,0.341249994)
(4,0.295031247)
(5,0.15)
(6,0.15)
(7,1.330417786200011)
scala> g.staticPageRank(100).vertices.collect.foreach(println)
(8,1.2803346052504254)
(1,0.15)
(9,1.2381240056453071)
(2,0.358781244)
(3,0.341249994)
(4,0.295031247)
(5,0.15)
(6,0.15)
(7,1.3299054047985106)
{code}
Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table:
|| Node1 || Node2 ||
| 1 | 2 |
| 1 | 3 |
| 3 | 2 |
| 3 | 4 |
| 5 | 3 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 7 |
Node Table (note the extra node):
|| NodeID || NodeName ||
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
| f | 6 |
| g | 7 |
| h | 8 |
| i | 9 |
| j.longaddress.com | 10 |
with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3206. --- Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Ankur Dave Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana Assignee: Ankur Dave Fix For: 1.2.0 I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: || Node1 || Node2 || |1 | 2 | |1 | 3| |3 | 2| |3 | 4| |5 | 3| |6 | 7| |7 | 8| |8 | 9| |9 | 7| Node Table (note the extra node): || NodeID || NodeName || |a | 1| |b | 2| |c | 3| |d | 4| |e | 5| |f | 6| |g | 7| |h | 8| |i | 9| |j.longaddress.com | 10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4360) task only execute on one node when spark on yarn
seekerak created SPARK-4360: --- Summary: task only execute on one node when spark on yarn Key: SPARK-4360 URL: https://issues.apache.org/jira/browse/SPARK-4360 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: seekerak hadoop version: hadoop 2.0.3-alpha spark version: 1.0.2 When I run Spark jobs on YARN, I found that all the tasks run on only one node. My cluster has 4 nodes and 3 executors were started, but only one of them gets tasks; the others get none. My command looks like this: /opt/hadoopcluster/spark-1.0.2-bin-hadoop2/bin/spark-submit --class org.sr.scala.Spark_LineCount_G0 --executor-memory 2G --num-executors 12 --master yarn-cluster /home/Spark_G0.jar /data /output/ou_1 Does anyone know why? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
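One frequent cause of this symptom (stated here as an assumption, not a confirmed diagnosis for this report) is an input that yields very few partitions, so there are not enough tasks to occupy every executor. A quick check and mitigation, sketched in Scala:
{code}
val lines = sc.textFile("/data")
println(lines.partitions.length)   // if this is 1 (e.g. one unsplittable gzip file), only one task can run
val spread = lines.repartition(12) // force a wider spread across executors
println(spread.count())
{code}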
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207826#comment-14207826 ] zzc commented on SPARK-2468: Hi, Aaron Davidson, I am sure that I ran my last test with the patch #3155 applied. Configuration:
spark.shuffle.consolidateFiles true
spark.storage.memoryFraction 0.2
spark.shuffle.memoryFraction 0.2
spark.shuffle.file.buffer.kb 100
spark.reducer.maxMbInFlight 48
spark.shuffle.blockTransferService netty
spark.shuffle.io.mode nio
spark.shuffle.io.connectionTimeout 120
spark.shuffle.manager SORT
spark.shuffle.io.preferDirectBufs true
spark.shuffle.io.maxRetries 3
spark.shuffle.io.retryWaitMs 5000
spark.shuffle.io.maxUsableCores 3
Command: --num-executors 17 --executor-memory 12g --executor-cores 3 If spark.shuffle.io.preferDirectBufs=false, it's OK. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC. Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207839#comment-14207839 ] zzc commented on SPARK-2468: Hi, Aaron Davidson, can you describe your test, including the environment, configuration, data volume? Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4251) Add Restricted Boltzmann machine(RBM) algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207855#comment-14207855 ] Apache Spark commented on SPARK-4251: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/3222 Add Restricted Boltzmann machine(RBM) algorithm to MLlib Key: SPARK-4251 URL: https://issues.apache.org/jira/browse/SPARK-4251 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4361) SparkContext HadoopRDD is not clear about how to use a Hadoop Configuration
Shixiong Zhu created SPARK-4361: --- Summary: SparkContext HadoopRDD is not clear about how to use a Hadoop Configuration Key: SPARK-4361 URL: https://issues.apache.org/jira/browse/SPARK-4361 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor When I answered this question: http://apache-spark-user-list.1001560.n3.nabble.com/How-did-the-RDD-union-work-td18686.html, I found that SparkContext does not explain how to use a Hadoop Configuration. It would be better to add docs clarifying that the Configuration will be put into a Broadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
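For context, a minimal sketch of the behavior the docs could call out (the input paths are assumptions): the Configuration handed to a Hadoop RDD is captured in a broadcast, so reusing and mutating one shared instance across several RDDs can be surprising; building a fresh Configuration per RDD avoids that.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf1 = new Configuration()
conf1.set("mapreduce.input.fileinputformat.inputdir", "/data/a")
val rdd1 = sc.newAPIHadoopRDD(conf1, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

val conf2 = new Configuration()   // a separate instance, not a mutation of conf1
conf2.set("mapreduce.input.fileinputformat.inputdir", "/data/b")
val rdd2 = sc.newAPIHadoopRDD(conf2, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
{code}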
[jira] [Resolved] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4355. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3220 [https://github.com/apache/spark/pull/3220] OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 It happens when the mean on one side is zero. I will send an PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
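A generic sketch of the invariant the fix has to preserve (not the actual MLlib code): when two partial summaries are merged, the combined mean must be weighted by both sample counts, which holds even when one side's mean is exactly zero.
{code}
// merge two partial (count, mean) summaries
def mergeMean(n1: Long, mean1: Double, n2: Long, mean2: Double): Double =
  if (n1 + n2 == 0) 0.0 else (mean1 * n1 + mean2 * n2) / (n1 + n2)

mergeMean(10, 0.0, 5, 3.0)   // 1.0, even though the first mean is zero
{code}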
[jira] [Reopened] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-4355: -- Reopen this issue because we haven't fixed branch-1.1 and branch-1.0 yet. OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 It happens when the mean on one side is zero. I will send an PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4361) SparkContext HadoopRDD is not clear about how to use a Hadoop Configuration
[ https://issues.apache.org/jira/browse/SPARK-4361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207882#comment-14207882 ] Apache Spark commented on SPARK-4361: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3225 SparkContext HadoopRDD is not clear about how to use a Hadoop Configuration --- Key: SPARK-4361 URL: https://issues.apache.org/jira/browse/SPARK-4361 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor Labels: doc, easyfix When I answered this question: http://apache-spark-user-list.1001560.n3.nabble.com/How-did-the-RDD-union-work-td18686.html, I found SparkContext did not explain how to use a Hadoop Configuration. More docs to clarify that Configuration will be put into a Broadcast is better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207885#comment-14207885 ] Sean Owen commented on SPARK-4341: -- So I think some of this is already done by Spark. For example, the number of partitions is determined in the same way that Hadoop does, and carries through a pipeline of transformations. Some of this is not necessarily the right thing to do. For example I could be running several transformations at once, and trying to match each's parallelism to the number of executors may be inefficient, not only because it may mean making partitions that are excessively small or large, but because it may require a shuffle, which is expensive. Finally I think the issue of resource usage is better dealt with by increasing/decreasing the number of executors dynamically in response to demand or load, and there is already work in progress on those. So maybe that covers what you are thinking of already. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen The mapreduce job can set maptask automaticlly, but in spark, we have to set num-executors, executor memory and cores. It's difficult for users to set these args, especially for the users want to use spark sql. So when user havn't set num-executors, spark should set num-executors automatically accroding to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
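The dynamic-allocation work mentioned above is exposed through settings along these lines; shown only as a sketch of the direction, since the exact keys and their availability depend on the Spark version:
{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.shuffle.service.enabled", "true")   // external shuffle service, so executors can be removed safely
{code}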
[jira] [Commented] (SPARK-4360) task only execute on one node when spark on yarn
[ https://issues.apache.org/jira/browse/SPARK-4360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207894#comment-14207894 ] Sean Owen commented on SPARK-4360: -- I don't think there's enough info here; this maybe should have been a question on the list first. Is there more than 1 partition in the input? did more than 1 executor actually allocate? are you definitely observing tasks running and not some single-threaded process on the driver? task only execute on one node when spark on yarn Key: SPARK-4360 URL: https://issues.apache.org/jira/browse/SPARK-4360 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: seekerak hadoop version: hadoop 2.0.3-alpha spark version: 1.0.2 when i run spark jobs on yarn, i found all the task only run on one node, my cluster has 4 nodes, executors has 3, but only one has task, the others hasn't, my command like this : /opt/hadoopcluster/spark-1.0.2-bin-hadoop2/bin/spark-submit --class org.sr.scala.Spark_LineCount_G0 --executor-memory 2G --num-executors 12 --master yarn-cluster /home/Spark_G0.jar /data /output/ou_1 is there any one knows why? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207899#comment-14207899 ] Hong Shen commented on SPARK-4341: -- My main point is that when running Spark (especially Spark SQL), not all users want to set parallelism to match executors; we can provide an easy way for them to use Spark. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen A MapReduce job can set the number of map tasks automatically, but in Spark we have to set num-executors, executor memory and cores. It's difficult for users to set these arguments, especially for users who want to use Spark SQL. So when the user hasn't set num-executors, Spark should set num-executors automatically according to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207925#comment-14207925 ] Ashutosh Trivedi commented on SPARK-4038: - The questions raised are valid and we want the community to discuss them. This algorithm deals with categorical data; to my knowledge it uses the simplest approach, calculating the frequency of each attribute value in the data set. Some people in the community are already reviewing it and I am working on it. I did not find any other well-known algorithm that works on categorical data to find outliers. If you are aware of one, please share it with us. Outlier Detection Algorithm for MLlib - Key: SPARK-4038 URL: https://issues.apache.org/jira/browse/SPARK-4038 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ashutosh Trivedi Priority: Minor The aim of this JIRA is to discuss which parallel outlier detection algorithms can be included in MLlib. The one I am familiar with is Attribute Value Frequency (AVF). It scales linearly with the number of data points and attributes, relies on a single data scan, is not distance based, and is well suited for categorical data. In the original paper a parallel version is also given, which is not complicated to implement. I am working on the implementation and will soon submit the initial code for review. Here is the link to the paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382 As pointed out by Xiangrui in the discussion http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html there are other algorithms as well. Let's discuss which will be more general and more easily parallelized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
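A rough RDD-level sketch of AVF as described above (an illustration only, not the code under review): the score of a row is the mean frequency of its attribute values, and rows with the lowest scores are outlier candidates.
{code}
// tiny categorical data set, purely illustrative
val data = sc.parallelize(Seq(
  Array("red", "small"), Array("red", "large"), Array("red", "small"), Array("blue", "tiny")))

// frequency of each (column index, value) pair -- one scan over the data
val freqs = data.flatMap(_.zipWithIndex.map { case (v, i) => ((i, v), 1L) })
                .reduceByKey(_ + _)
                .collectAsMap()
val bcFreqs = sc.broadcast(freqs)

// AVF score per row: average frequency of its attribute values (lower = more anomalous)
val scored = data.map { row =>
  val score = row.zipWithIndex.map { case (v, i) => bcFreqs.value((i, v)).toDouble }.sum / row.length
  (row.mkString(","), score)
}
scored.sortBy(_._2).take(1)   // the most outlying row
{code}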
[jira] [Comment Edited] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207925#comment-14207925 ] Ashutosh Trivedi edited comment on SPARK-4038 at 11/12/14 10:53 AM: The questions raised are valid and we want community to discuss it. This algorithm deals with categorical data, It uses the simplest approach by calculating frequency of each attribute in the data set. Some of the people in community are already doing the review and I am working on it. I did not find any other algorithm which work on categorical data to find outliers. If you are aware of any other algorithm which is well known please share with us. was (Author: rusty): The questions raised are valid and we want community to discuss it. This algorithm deals with categorical data, In my knowledge it uses the simplest approach by calculating frequency of each attribute in the data set. Some of the people in community are already doing the review and I am working on it. I did not find any other algorithm which work on categorical data to find outliers. If you are aware of any other algorithm which is well known please share with us. Outlier Detection Algorithm for MLlib - Key: SPARK-4038 URL: https://issues.apache.org/jira/browse/SPARK-4038 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ashutosh Trivedi Priority: Minor The aim of this JIRA is to discuss about which parallel outlier detection algorithms can be included in MLlib. The one which I am familiar with is Attribute Value Frequency (AVF). It scales linearly with the number of data points and attributes, and relies on a single data scan. It is not distance based and well suited for categorical data. In original paper a parallel version is also given, which is not complected to implement. I am working on the implementation and soon submit the initial code for review. Here is the Link for the paper http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382 As pointed out by Xiangrui in discussion http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html There are other algorithms also. Lets discuss about which will be more general and easily paralleled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4362) Make prediction probability available in Naive Baye's Model
Jatinpreet Singh created SPARK-4362: --- Summary: Make prediction probability available in Naive Baye's Model Key: SPARK-4362 URL: https://issues.apache.org/jira/browse/SPARK-4362 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jatinpreet Singh Priority: Minor Fix For: 1.2.0 There is currently no way to get the posterior probability of a prediction with Naive Baye's model during prediction. This should be made available along with the label. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
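A generic sketch of what exposing the posterior could look like (this is not the MLlib API; pi and theta stand for the model's log class priors and log conditional probabilities, and the whole function is only an illustration):
{code}
// returns one posterior probability per class for a multinomial Naive Bayes model
def posteriors(pi: Array[Double], theta: Array[Array[Double]], x: Array[Double]): Array[Double] = {
  val logProb = pi.indices.map { k =>
    pi(k) + theta(k).zip(x).map { case (t, xi) => t * xi }.sum
  }
  val m = logProb.max
  val unnorm = logProb.map(lp => math.exp(lp - m))   // stabilise before normalising
  val z = unnorm.sum
  unnorm.map(_ / z).toArray
}
{code}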
[jira] [Created] (SPARK-4363) The Broadcast example is out of date
Shixiong Zhu created SPARK-4363: --- Summary: The Broadcast example is out of date Key: SPARK-4363 URL: https://issues.apache.org/jira/browse/SPARK-4363 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Shixiong Zhu Priority: Trivial The Broadcast example is out of date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4363) The Broadcast example is out of date
[ https://issues.apache.org/jira/browse/SPARK-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207942#comment-14207942 ] Apache Spark commented on SPARK-4363: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3226 The Broadcast example is out of date Key: SPARK-4363 URL: https://issues.apache.org/jira/browse/SPARK-4363 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Shixiong Zhu Priority: Trivial The Broadcast example is out of date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4364) Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong
Shixiong Zhu created SPARK-4364: --- Summary: Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong Key: SPARK-4364 URL: https://issues.apache.org/jira/browse/SPARK-4364 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial Because of type erasure, the unit tests still pass. However, the wrong variable types will confuse people. The locations of these variables can be found in my PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4364) Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong
[ https://issues.apache.org/jira/browse/SPARK-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207981#comment-14207981 ] Sean Owen commented on SPARK-4364: -- This is already covered in SPARK-4297 Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong Key: SPARK-4364 URL: https://issues.apache.org/jira/browse/SPARK-4364 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial Labels: unit-test Because of the type erase, the unit tests will pass. However, the wrong variable types will confuse people. The locations of these variables can be found in my PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4362) Make prediction probability available in NaiveBayesModel
[ https://issues.apache.org/jira/browse/SPARK-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4362: - Summary: Make prediction probability available in NaiveBayesModel (was: Make prediction probability available in Naive Baye's Model) Make prediction probability available in NaiveBayesModel Key: SPARK-4362 URL: https://issues.apache.org/jira/browse/SPARK-4362 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jatinpreet Singh Priority: Minor Labels: naive-bayes Fix For: 1.2.0 There is currently no way to get the posterior probability of a prediction with Naive Baye's model during prediction. This should be made available along with the label. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4364) Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong
[ https://issues.apache.org/jira/browse/SPARK-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207995#comment-14207995 ] Apache Spark commented on SPARK-4364: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3227 Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong Key: SPARK-4364 URL: https://issues.apache.org/jira/browse/SPARK-4364 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial Labels: unit-test Because of the type erase, the unit tests will pass. However, the wrong variable types will confuse people. The locations of these variables can be found in my PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2867) saveAsHadoopFile() in PairRDDFunction.scala should allow use other OutputCommiter class
[ https://issues.apache.org/jira/browse/SPARK-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207997#comment-14207997 ] Romi Kuntsman commented on SPARK-2867: -- In the latest code, this seems to be resolved:
{code}
// Use configured output committer if already set
if (conf.getOutputCommitter == null) {
  hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
}
{code}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L934 saveAsHadoopFile() in PairRDDFunction.scala should allow use other OutputCommiter class --- Key: SPARK-2867 URL: https://issues.apache.org/jira/browse/SPARK-2867 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Joseph Su Priority: Minor The saveAsHadoopFile() in PairRDDFunction.scala hard-coded the OutputCommitter class as FileOutputCommitter because of the following code in the source: hadoopConf.setOutputCommitter(classOf[FileOutputCommitter]) However, OutputCommitter is a configurable option in a regular Hadoop MapReduce program. Users can specify mapred.output.committer.class to change the committer class used by other Hadoop programs. The saveAsHadoopFile() function should remove this hard-coded assignment and provide a way to specify the OutputCommitter used here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
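Given that check, a caller should be able to choose its own committer by configuring the JobConf before the save. A hedged sketch (MyCustomOutputCommitter and pairRdd are stand-ins introduced only for illustration):
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf, TextOutputFormat}

// hypothetical custom committer, standing in for whatever the user really wants
class MyCustomOutputCommitter extends FileOutputCommitter

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
jobConf.setOutputCommitter(classOf[MyCustomOutputCommitter])
FileOutputFormat.setOutputPath(jobConf, new Path("/out"))

// pairRdd: RDD[(Text, Text)], assumed to exist
pairRdd.saveAsHadoopDataset(jobConf)   // with the null check above, this committer is no longer overwritten
{code}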
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208028#comment-14208028 ] Cristian Opris commented on SPARK-3633: --- FWIW I get this as well, with a very straightforward job and setup. Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Critical Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4365) Remove unnecessary filter call on records returned from parquet library
Yash Datta created SPARK-4365: - Summary: Remove unnecessary filter call on records returned from parquet library Key: SPARK-4365 URL: https://issues.apache.org/jira/browse/SPARK-4365 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.2.0 Since the parquet library has been updated, we no longer need to filter the records returned from the parquet library for null records, as the library now skips those. From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:
{code}
public boolean nextKeyValue() throws IOException, InterruptedException {
  boolean recordFound = false;
  while (!recordFound) {
    // no more records left
    if (current >= total) { return false; }
    try {
      checkRead();
      currentValue = recordReader.read();
      current ++;
      if (recordReader.shouldSkipCurrentRecord()) {
        // this record is being filtered via the filter2 package
        if (DEBUG) LOG.debug("skipping record");
        continue;
      }
      if (currentValue == null) {
        // only happens with FilteredRecordReader at end of block
        current = totalCountLoadedSoFar;
        if (DEBUG) LOG.debug("filtered record reader reached end of block");
        continue;
      }
      recordFound = true;
      if (DEBUG) LOG.debug("read value: " + currentValue);
    } catch (RuntimeException e) {
      throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
    }
  }
  return true;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4365) Remove unnecessary filter call on records returned from parquet library
[ https://issues.apache.org/jira/browse/SPARK-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208068#comment-14208068 ] Apache Spark commented on SPARK-4365: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/3229 Remove unnecessary filter call on records returned from parquet library --- Key: SPARK-4365 URL: https://issues.apache.org/jira/browse/SPARK-4365 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.2.0 Since parquet library has been updated , we no longer need to filter the records returned from parquet library for null records , as now the library skips those : from parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java public boolean nextKeyValue() throws IOException, InterruptedException { boolean recordFound = false; while (!recordFound) { // no more records left if (current = total) { return false; } try { checkRead(); currentValue = recordReader.read(); current ++; if (recordReader.shouldSkipCurrentRecord()) { // this record is being filtered via the filter2 package if (DEBUG) LOG.debug(skipping record); continue; } if (currentValue == null) { // only happens with FilteredRecordReader at end of block current = totalCountLoadedSoFar; if (DEBUG) LOG.debug(filtered record reader reached end of block); continue; } recordFound = true; if (DEBUG) LOG.debug(read value: + currentValue); } catch (RuntimeException e) { throw new ParquetDecodingException(format(Can not read value at %d in block %d in file %s, current, currentBlock, file), e); } } return true; } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
[ https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208084#comment-14208084 ] Corey J. Nolet commented on SPARK-4320: --- Since this is a simple change, I wanted to work on this myself to get more familiar with the code base. Could someone w/ the proper privileges give me access to be able to assign this ticket to myself? JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object Key: SPARK-4320 URL: https://issues.apache.org/jira/browse/SPARK-4320 Project: Spark Issue Type: Improvement Components: Input/Output, Spark Core Reporter: Corey J. Nolet Fix For: 1.1.1, 1.2.0 I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile() and that works- though passing an empty path is a bit weird. Being that it isn't really a file I'm storing, but rather a generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDateset method. Personally, I think there should be two ways of calling into this method. Instead of forcing the user to always set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
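For reference, the pattern described above -- saving to a non-file OutputFormat through saveAsNewAPIHadoopFile with an unused path -- looks roughly like this (the Accumulo classes and the rdd/conf values are assumed for illustration):
{code}
import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat
import org.apache.accumulo.core.data.Mutation
import org.apache.hadoop.io.Text

// rdd: RDD[(Text, Mutation)], conf: a Configuration already populated for Accumulo (both assumed)
rdd.saveAsNewAPIHadoopFile(
  "",                              // path argument is ignored by this OutputFormat
  classOf[Text],
  classOf[Mutation],
  classOf[AccumuloOutputFormat],
  conf)
{code}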
[jira] [Created] (SPARK-4366) Aggregation Optimization
Cheng Hao created SPARK-4366: Summary: Aggregation Optimization Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao This improvement actually includes couple of sub tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4367) Process the distinct value before shuffling for aggregation
Cheng Hao created SPARK-4367: Summary: Process the distinct value before shuffling for aggregation Key: SPARK-4367 URL: https://issues.apache.org/jira/browse/SPARK-4367 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Most of aggregate function(e.g average) with distinct value will requires all of the records in the same group to be shuffled into a single node, however, as part of the optimization, those records can be partially aggregated before shuffling, that probably reduces the overhead of shuffling significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
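A rough RDD-level illustration of the idea (the real change would live in the SQL aggregation operators, so this is only a sketch): de-duplicating (group, value) pairs on the map side means far fewer records cross the shuffle for an aggregate like AVG(DISTINCT value).
{code}
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 1.0), ("a", 3.0), ("b", 2.0)))

// de-duplicate within each partition before anything is shuffled
val locallyDistinct = pairs.mapPartitions(iter => iter.toSet.iterator, preservesPartitioning = true)

val avgDistinct = locallyDistinct
  .distinct()                                               // global de-dup over the already-reduced data
  .mapValues(v => (v, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }           // ("a", 2.0), ("b", 2.0)
{code}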
[jira] [Updated] (SPARK-4233) Simplify the Aggregation Function implementation
[ https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4233: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-4366 Simplify the Aggregation Function implementation Key: SPARK-4233 URL: https://issues.apache.org/jira/browse/SPARK-4233 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, the UDAF implementation is quite complicated, and we have to provide distinct non-distinct version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4367) Process the distinct value before shuffling for aggregation
[ https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4367: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-4366 Process the distinct value before shuffling for aggregation - Key: SPARK-4367 URL: https://issues.apache.org/jira/browse/SPARK-4367 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Most of aggregate function(e.g average) with distinct value will requires all of the records in the same group to be shuffled into a single node, however, as part of the optimization, those records can be partially aggregated before shuffling, that probably reduces the overhead of shuffling significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208028#comment-14208028 ] Cristian Opris edited comment on SPARK-3633 at 11/12/14 3:20 PM: - FWIW I get this as well, with a very straightforward job and setup. Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? EDIT: This might be due to Full GC on the executors during the shuffle block transfer phase. What's interesting is that it doesn't go OOM and the same amount is collected every time. (Old gen is 1.5 GB) 2014-11-12T07:17:06.899-0800: 477.697: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1355469K-1301675K(1398272K)] 1603789K-1301675K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6565240 secs] [Times: user=3.35 sys=0.00, real=0.66 secs] 2014-11-12T07:17:07.751-0800: 478.549: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1301681K-1268312K(1398272K)] 1550001K-1268312K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.5821160 secs] [Times: user=3.16 sys=0.00, real=0.58 secs] 2014-11-12T07:17:08.495-0800: 479.294: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1268314K-1300497K(1398272K)] 1516634K-1300497K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6400670 secs] [Times: user=4.07 sys=0.01, real=0.64 secs] was (Author: onetoinfin...@yahoo.com): FWIW I get this as well, with a very straightforward job and setup. Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Critical Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. 
As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
[jira] [Updated] (SPARK-3056) Sort-based Aggregation
[ https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-3056: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-4366 Sort-based Aggregation -- Key: SPARK-3056 URL: https://issues.apache.org/jira/browse/SPARK-3056 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, Spark SQL only supports hash-based aggregation, which may cause OOM if there are too many identical keys in the input tuples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
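For context, the idea behind sort-based aggregation is that once the input is sorted by key, each group can be aggregated in a single streaming pass, so memory use no longer grows with the number of keys. A minimal sketch of the idea in plain Scala (illustrative only, not Spark SQL's actual implementation):
{code}
// Sort-based aggregation sketch: with rows already sorted by key, only the
// current group's running state has to live in memory at any point.
object SortAggregationSketch {
  def countSorted[K](sortedKeys: Iterator[K]): Iterator[(K, Long)] =
    new Iterator[(K, Long)] {
      private val in = sortedKeys.buffered
      def hasNext: Boolean = in.hasNext
      def next(): (K, Long) = {
        val key = in.head
        var count = 0L
        while (in.hasNext && in.head == key) { in.next(); count += 1 }
        (key, count)
      }
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a", "a", "b", "b", "b", "c").iterator   // assumed pre-sorted by key
    countSorted(rows).foreach(println)                       // (a,2) (b,3) (c,1)
  }
}
{code}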
[jira] [Comment Edited] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208028#comment-14208028 ] Cristian Opris edited comment on SPARK-3633 at 11/12/14 3:36 PM: - FWIW I get this as well, with a very straightforward job and setup. Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? EDIT: This might be due to Full GC on the executors during the shuffle block transfer phase. What's interesting is that it doesn't go OOM and the same amount is collected every time. (Old gen is 1.5 GB) 2014-11-12T07:17:06.899-0800: 477.697: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1355469K-1301675K(1398272K)] 1603789K-1301675K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6565240 secs] [Times: user=3.35 sys=0.00, real=0.66 secs] 2014-11-12T07:17:07.751-0800: 478.549: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1301681K-1268312K(1398272K)] 1550001K-1268312K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.5821160 secs] [Times: user=3.16 sys=0.00, real=0.58 secs] 2014-11-12T07:17:08.495-0800: 479.294: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1268314K-1300497K(1398272K)] 1516634K-1300497K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6400670 secs] [Times: user=4.07 sys=0.01, real=0.64 secs] EDIT2: Changing to G1 collector actually causes it to go OOM. This must be related somehow to the number of shuffle files and hence perhaps open buffers as lowering the number of reducers from 72 to 10 runs without issues (note I'm using consolidated shuffle files). 14/11/12 07:30:53 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-2,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$2.apply(BlockFetcherIterator.scala:124) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$2.apply(BlockFetcherIterator.scala:121) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) was (Author: onetoinfin...@yahoo.com): FWIW I get this as well, with a very straightforward job and setup. 
Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? EDIT: This might be due to Full GC on the executors during the shuffle block transfer phase. What's interesting is that it doesn't go OOM and the same amount is collected every time. (Old gen is 1.5 GB) 2014-11-12T07:17:06.899-0800: 477.697: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1355469K-1301675K(1398272K)] 1603789K-1301675K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6565240 secs] [Times: user=3.35 sys=0.00, real=0.66 secs] 2014-11-12T07:17:07.751-0800: 478.549: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1301681K-1268312K(1398272K)] 1550001K-1268312K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.5821160 secs] [Times: user=3.16 sys=0.00, real=0.58 secs] 2014-11-12T07:17:08.495-0800: 479.294: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1268314K-1300497K(1398272K)] 1516634K-1300497K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6400670 secs] [Times: user=4.07 sys=0.01, real=0.64 secs] Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug
[jira] [Commented] (SPARK-1014) MultilogisticRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208226#comment-14208226 ] Sean Owen commented on SPARK-1014: -- I'm curious if this is still active -- where was the PR? was this just one-vs-all LR ? MultilogisticRegressionWithSGD -- Key: SPARK-1014 URL: https://issues.apache.org/jira/browse/SPARK-1014 Project: Spark Issue Type: New Feature Affects Versions: 0.9.0 Reporter: Kun Yang Multilogistic Regression With SGD based on mllib packages Use labeledpoint, gradientDescent to train the model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
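For readers unfamiliar with the term, one-vs-all (one-vs-rest) reduces a k-class problem to k binary problems. A rough sketch built from the binary LogisticRegressionWithSGD that already exists in MLlib (illustrative only; not the code proposed for this ticket):
{code}
// One-vs-rest sketch: train one binary model per class, relabeling each point as
// 1.0 for "this class" and 0.0 for "any other class".
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainOneVsRest(data: RDD[LabeledPoint], numClasses: Int) = {
  (0 until numClasses).map { k =>
    val binary = data.map(p => LabeledPoint(if (p.label == k.toDouble) 1.0 else 0.0, p.features))
    LogisticRegressionWithSGD.train(binary, 100)   // model for "class k vs rest"
  }
}
{code}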
[jira] [Resolved] (SPARK-1245) Can't read EMR HBase cluster from properly built Cloudera Spark Cluster.
[ https://issues.apache.org/jira/browse/SPARK-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1245. -- Resolution: Not a Problem I'm guessing this is now either obsolete, or, a case of matching HBase / Hadoop versions exactly. Spark should be provided, and not marking as such may mean the Spark Hadoop / cluster Hadoop / HBase Hadoop deps are colliding. Can't read EMR HBase cluster from properly built Cloudera Spark Cluster. Key: SPARK-1245 URL: https://issues.apache.org/jira/browse/SPARK-1245 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne Can't read EMR HBase cluster from properly built Cloudera Spark Cluster. If I scp hadoop-yarn-client-2.2.0.jar from our EMR hbase cluster lib dir and manually add it as a lib to my jar it does NOT give me a noSuchMethod error, but does give me a weird EOF exception (see below). Usually I use SBT to build Jars, but the EMR distros are very strange I can't find a proper repository for them. I'm thinking only thing we can do is get our sysadm to rebuild the hbase cluster to use a proper cloudera hbase / hadoop. SBT Dependencies include: org.apache.spark % spark-core_2.10 % 0.9.0-incubating, org.apache.hbase % hbase % 0.94.7, 14/03/11 19:08:06 WARN scheduler.TaskSetManager: Lost TID 95 (task 0.0:3) 14/03/11 19:08:06 WARN scheduler.TaskSetManager: Loss was due to java.io.EOFException java.io.EOFException at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744) at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1015) at org.apache.hadoop.io.WritableUtils.readCompressedByteArray(WritableUtils.java:39) at org.apache.hadoop.io.WritableUtils.readCompressedString(WritableUtils.java:87) at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:185) at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2433) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280) at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75) at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39) at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
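A minimal build.sbt sketch of the "Spark should be provided" suggestion above (artifact versions are illustrative and would need to match what the cluster actually runs):
{code}
// build.sbt sketch: mark Spark and hadoop-client as "provided" so the cluster's
// own jars are used at runtime, and keep the HBase/Hadoop client versions aligned
// with the cluster to avoid colliding Hadoop dependencies in the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "0.9.0-incubating" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.2.0"            % "provided",
  "org.apache.hbase"  %  "hbase"         % "0.94.7"
)
{code}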
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208258#comment-14208258 ] Anson Abraham commented on SPARK-1867: -- I'm running 1.1 (standalone) w/o yarn on CDH 5.2. I'm just doing a quick test: val source = sc.textFile(/tmp/testfile.txt) source.saveAsTextFile(/tmp/test_spark_output) and I'm hitting that issue, java.lang.IllegalStateException: unread block data. The versions on all the nodes are identical. I can't figure out what the exact issue is. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
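For readability, here is the same driver setup as a self-contained sketch, with string quoting restored and placeholders for the cluster-specific values (master URL, Spark home and HDFS path are assumptions, not values from the report):
{code}
// Minimal sketch of the driver setup described in the report above.
import org.apache.spark.{SparkConf, SparkContext}

object UnreadBlockDataRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")      // placeholder
      .setAppName("unread-block-data-repro")
      .setSparkHome("/opt/spark")                 // placeholder
      .setJars(SparkContext.jarOfClass(this.getClass).toSeq)  // ship the fat jar to executors
    val count = new SparkContext(conf).textFile("hdfs:///some/path").count()
    println(s"count = $count")
  }
}
{code}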
[jira] [Created] (SPARK-4368) Ceph integration?
Serge Smertin created SPARK-4368: Summary: Ceph integration? Key: SPARK-4368 URL: https://issues.apache.org/jira/browse/SPARK-4368 Project: Spark Issue Type: Bug Components: Input/Output Reporter: Serge Smertin There is a use-case of storing a large number of relatively small BLOB objects (2-20Mb), which requires some ugly workarounds in HDFS environments. There is a need to process those BLOBs close to the data themselves, which is why the MapReduce paradigm is a good fit, as it guarantees data locality. Ceph seems to be one of the systems that maintains both properties (small files and data locality) - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I already know that Spark supports GlusterFS - http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E So I wonder, could there be an integration with this storage solution, and what would be the effort of doing that? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208318#comment-14208318 ] Cristian Opris commented on SPARK-3633: --- This looks like a memory leak in ConnectionManager where responses (BufferMessage) are retained by the TimerTask waiting for ACK even after the Future completes with Success, please see the Possibly related to https://github.com/apache/spark/commit/76fa0eaf515fd6771cdd69422b1259485debcae5 +--+--+--+-+ |Class | Objects| Shallow Size | Retained Size | +--+--+--+-+ | java.util.TaskQueue |10 % | 240 % | 885,048,168 100 % | | java.util.TimerTask[] |10 % | 2,0640 % | 885,048,144 99 % | | org.apache.spark.network.ConnectionManager$$anon$5 | 2865 % | 13,7280 % | ~ 885,046,080 99 % | | org.apache.spark.network.BufferMessage | 572 10 % | 36,6080 % | ~ 885,018,624 99 % | | scala.concurrent.impl.Promise$DefaultPromise| 2865 % | 4,5760 % | ~ 884,968,288 99 % | | scala.util.Success | 2865 % | 4,5760 % | ~ 884,963,712 99 % | | scala.collection.mutable.ArrayBuffer| 572 10 % | 13,7280 % | ~ 884,915,768 99 % | | java.lang.Object[] | 572 10 % | 45,7600 % | ~ 884,902,040 99 % | | java.nio.HeapByteBuffer | 2865 % | 13,7280 % | ~ 884,856,280 99 % | | byte[] | 2865 % | 884,842,552 99 % | ~ 884,842,552 99 % | | java.net.InetSocketAddress | 572 10 % | 9,1520 % | ~ 66,2480 % | | java.net.InetSocketAddress$InetSocketAddressHolder | 572 10 % | 13,7280 % | ~ 57,0960 % | | java.net.Inet4Address | 2865 % | 6,8640 % | ~ 43,3680 % | | java.net.InetAddress$InetAddressHolder | 2865 % | 6,8640 % | ~ 36,5040 % | | java.lang.String| 2855 % | 6,8400 % | ~ 29,6400 % | | char[] | 2855 % | 22,8000 % | ~ 22,8000 % | | java.lang.Object| 2865 % | 4,5760 % |~ 4,5760 % | +--+--+--+-+ Generated by YourKit Java Profiler 2014 build 14110 12-Nov-2014 17:44:32 Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Critical Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
[jira] [Comment Edited] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208318#comment-14208318 ] Cristian Opris edited comment on SPARK-3633 at 11/12/14 5:48 PM: - This looks like a memory leak in ConnectionManager where responses (BufferMessage) are retained by the TimerTask waiting for ACK even after the Future completes with Success, please see the reference chain from a heap dump below Possibly related to https://github.com/apache/spark/commit/76fa0eaf515fd6771cdd69422b1259485debcae5 +--+--+--+-+ |Class | Objects| Shallow Size | Retained Size | +--+--+--+-+ | java.util.TaskQueue |10 % | 240 % | 885,048,168 100 % | | java.util.TimerTask[] |10 % | 2,0640 % | 885,048,144 99 % | | org.apache.spark.network.ConnectionManager$$anon$5 | 2865 % | 13,7280 % | ~ 885,046,080 99 % | | org.apache.spark.network.BufferMessage | 572 10 % | 36,6080 % | ~ 885,018,624 99 % | | scala.concurrent.impl.Promise$DefaultPromise| 2865 % | 4,5760 % | ~ 884,968,288 99 % | | scala.util.Success | 2865 % | 4,5760 % | ~ 884,963,712 99 % | | scala.collection.mutable.ArrayBuffer| 572 10 % | 13,7280 % | ~ 884,915,768 99 % | | java.lang.Object[] | 572 10 % | 45,7600 % | ~ 884,902,040 99 % | | java.nio.HeapByteBuffer | 2865 % | 13,7280 % | ~ 884,856,280 99 % | | byte[] | 2865 % | 884,842,552 99 % | ~ 884,842,552 99 % | | java.net.InetSocketAddress | 572 10 % | 9,1520 % | ~ 66,2480 % | | java.net.InetSocketAddress$InetSocketAddressHolder | 572 10 % | 13,7280 % | ~ 57,0960 % | | java.net.Inet4Address | 2865 % | 6,8640 % | ~ 43,3680 % | | java.net.InetAddress$InetAddressHolder | 2865 % | 6,8640 % | ~ 36,5040 % | | java.lang.String| 2855 % | 6,8400 % | ~ 29,6400 % | | char[] | 2855 % | 22,8000 % | ~ 22,8000 % | | java.lang.Object| 2865 % | 4,5760 % |~ 4,5760 % | +--+--+--+-+ Generated by YourKit Java Profiler 2014 build 14110 12-Nov-2014 17:44:32 was (Author: onetoinfin...@yahoo.com): This looks like a memory leak in ConnectionManager where responses (BufferMessage) are retained by the TimerTask waiting for ACK even after the Future completes with Success, please see the Possibly related to https://github.com/apache/spark/commit/76fa0eaf515fd6771cdd69422b1259485debcae5 +--+--+--+-+ |Class | Objects| Shallow Size | Retained Size | +--+--+--+-+ | java.util.TaskQueue |10 % | 240 % | 885,048,168 100 % | | java.util.TimerTask[] |10 % | 2,0640 % | 885,048,144 99 % | | org.apache.spark.network.ConnectionManager$$anon$5 | 2865 % | 13,7280 % | ~ 885,046,080 99 % | | org.apache.spark.network.BufferMessage | 572 10 % | 36,6080 % | ~ 885,018,624 99 % | | scala.concurrent.impl.Promise$DefaultPromise| 2865 % | 4,5760 % | ~ 884,968,288 99 % | | scala.util.Success | 2865 % | 4,5760 % | ~ 884,963,712 99 % | | scala.collection.mutable.ArrayBuffer| 572 10 % | 13,7280 % | ~ 884,915,768 99 % | | java.lang.Object[] | 572 10 % | 45,7600 % | ~ 884,902,040 99 % | | java.nio.HeapByteBuffer
[jira] [Created] (SPARK-4369) TreeModel.predict does not work with RDD
Davies Liu created SPARK-4369: - Summary: TreeModel.predict does not work with RDD Key: SPARK-4369 URL: https://issues.apache.org/jira/browse/SPARK-4369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker {code} Stack Trace - Traceback (most recent call last): File /home/rprabhu/Coding/github/SDNDDoS/classification/DecisionTree.py, line 49, in module predictions = model.predict(parsedData.map(lambda x: x.features)) File /home/rprabhu/Software/spark/python/pyspark/mllib/tree.py, line 42, in predict return self.call(predict, x.map(_convert_to_vector)) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 140, in call return callJavaFunc(self._sc, getattr(self._java_model, name), *a) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 117, in callJavaFunc return _java2py(sc, func(*args)) File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o39.predict. Trace: py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1014) MultilogisticRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208352#comment-14208352 ] Kun Yang commented on SPARK-1014: - I am not sure if you can find the pr on the repository. Please find it on my github: https://github.com/kunyang1987/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/MultilogisticRegression.scala MultilogisticRegressionWithSGD -- Key: SPARK-1014 URL: https://issues.apache.org/jira/browse/SPARK-1014 Project: Spark Issue Type: New Feature Affects Versions: 0.9.0 Reporter: Kun Yang Multilogistic Regression With SGD based on mllib packages Use labeledpoint, gradientDescent to train the model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4370) Limit cores used by Netty transfer service based on executor size
Aaron Davidson created SPARK-4370: - Summary: Limit cores used by Netty transfer service based on executor size Key: SPARK-4370 URL: https://issues.apache.org/jira/browse/SPARK-4370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Right now, the NettyBlockTransferService uses the total number of cores on the system as the number of threads and buffer arenas to create. The latter is more troubling -- this can lead to significant allocation of extra heap and direct memory in situations where executors are relatively small compared to the whole machine. For instance, on a machine with 32 cores, we will allocate (32 cores * 16MB per arena = 512MB) * 2 for client and server = 1GB direct and heap memory. This can be a huge overhead if you're only using, say, 8 of those cores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
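The memory estimate in the description can be reproduced with a quick back-of-the-envelope calculation (the 16MB-per-arena figure is taken from the description itself):
{code}
// Back-of-the-envelope from the description: arenas scale with machine cores,
// not with the cores the executor actually uses.
val machineCores = 32
val arenaSizeMb  = 16                           // per-arena size quoted in the description
val perSideMb    = machineCores * arenaSizeMb   // 512 MB for one transport service
val totalMb      = perSideMb * 2                // client + server => 1024 MB
println(s"estimated Netty buffer memory: $totalMb MB")
{code}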
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208378#comment-14208378 ] Cristian Opris commented on SPARK-3633: --- At first sight (haven't tested this) the problem is in the code below. The TimerTask is cancelled on Success but this doesn't actually remove it from the Timer TaskQueue since the TimerThread doesn't actually remove cancelled tasks until they're actually scheduled to run, which in this case is by default 60 secs ack timeout. A quick fix would be to call Timer.purge() after task cancel below, or better yet change to a better Timer like the HashedWheel one from Netty {code:title=|borderStyle=solid} val status = new MessageStatus(message, connectionManagerId, s = { timeoutTask.cancel() s.ackMessage match { case None = // Indicates a failure where we either never sent or never got ACK'd promise.failure(new IOException(sendMessageReliably failed without being ACK'd)) case Some(ackMessage) = if (ackMessage.hasError) { promise.failure( new IOException(sendMessageReliably failed with ACK that signalled a remote error)) } else { promise.success(ackMessage) } } }) {code} Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Critical Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} All the way until Aug 4th. Turns out the problem changeset is 4fde28c. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
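A minimal sketch of the purge suggestion from the comment above (illustrative only, not the actual ConnectionManager code):
{code}
// After cancelling the per-message ack-timeout task, also purge the Timer so the
// cancelled task (and the BufferMessage it references) is removed from the queue
// right away, instead of lingering until its 60s timeout slot would have fired.
import java.util.{Timer, TimerTask}

val ackTimeoutTimer = new Timer("ack-timeout", /* isDaemon = */ true)

def scheduleAckTimeout(onTimeout: () => Unit, timeoutMs: Long): TimerTask = {
  val task = new TimerTask { override def run(): Unit = onTimeout() }
  ackTimeoutTimer.schedule(task, timeoutMs)
  task
}

def onAckReceived(timeoutTask: TimerTask): Unit = {
  timeoutTask.cancel()     // marks the task cancelled but leaves it queued
  ackTimeoutTimer.purge()  // drops cancelled tasks, releasing their references
}
{code}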
[jira] [Commented] (SPARK-4369) TreeModel.predict does not work with RDD
[ https://issues.apache.org/jira/browse/SPARK-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208377#comment-14208377 ] Apache Spark commented on SPARK-4369: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3230 TreeModel.predict does not work with RDD Key: SPARK-4369 URL: https://issues.apache.org/jira/browse/SPARK-4369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker {code} Stack Trace - Traceback (most recent call last): File /home/rprabhu/Coding/github/SDNDDoS/classification/DecisionTree.py, line 49, in module predictions = model.predict(parsedData.map(lambda x: x.features)) File /home/rprabhu/Software/spark/python/pyspark/mllib/tree.py, line 42, in predict return self.call(predict, x.map(_convert_to_vector)) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 140, in call return callJavaFunc(self._sc, getattr(self._java_model, name), *a) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 117, in callJavaFunc return _java2py(sc, func(*args)) File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o39.predict. Trace: py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4370) Limit cores used by Netty transfer service based on executor size
[ https://issues.apache.org/jira/browse/SPARK-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208391#comment-14208391 ] Apache Spark commented on SPARK-4370: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3155 Limit cores used by Netty transfer service based on executor size - Key: SPARK-4370 URL: https://issues.apache.org/jira/browse/SPARK-4370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Right now, the NettyBlockTransferService uses the total number of cores on the system as the number of threads and buffer arenas to create. The latter is more troubling -- this can lead to significant allocation of extra heap and direct memory in situations where executors are relatively small compared to the whole machine. For instance, on a machine with 32 cores, we will allocate (32 cores * 16MB per arena = 512MB) * 2 for client and server = 1GB direct and heap memory. This can be a huge overhead if you're only using, say, 8 of those cores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3530. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3099 [https://github.com/apache/spark/pull/3099] Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.2.0 This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3315) Support hyperparameter tuning
[ https://issues.apache.org/jira/browse/SPARK-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-3315. Resolution: Fixed Fix Version/s: 1.2.0 CrossValidator and ParamGridBuilder were included in the PR for SPARK-3530. I'm closing this now and I will create separate JIRAs for other tuning features. Support hyperparameter tuning - Key: SPARK-3315 URL: https://issues.apache.org/jira/browse/SPARK-3315 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 Tuning a pipeline and selecting the best set of parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
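For illustration, the shape of the Pipeline / CrossValidator / ParamGridBuilder API introduced by that PR looks roughly like the following sketch (exact class locations and signatures should be checked against the 1.2 release; the column names are assumptions):
{code}
// Rough sketch of the spark.ml pipeline-tuning usage.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Grid of hyperparameters to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(training)   // training: a SchemaRDD / DataFrame of (label, text) rows
{code}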
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208472#comment-14208472 ] Manish Amde commented on SPARK-3717: [~bbnsumanth] Look forward to your details of your approach. This is an important ticket and want to make sure that we all agree on the architecture before pursuing the implementation work. Also, as [~josephkb] suggested it might be a good idea to get your feet wet with a couple of small patches to get used to the Spark contribution workflow. DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and training subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. 
Comparing Partitioning Methods Partitioning features cost < partitioning instances cost when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
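The example numbers can be checked with a few lines of arithmetic (values copied from the example above):
{code}
// Reproducing the example's communication estimates.
val n = 2000000L   // instances
val m = 3500L      // features
val b = 100L       // bins
val c = 5L         // classes
val featureCost  = 6L * (m * 8 + n)       // 6 level-iterations        => ~1.2e7
val instanceCost = 32L * m * b * c * 8    // 2^5 nodes at depth 5      => ~4.5e8
println(s"partition by feature: $featureCost, partition by instance: $instanceCost")
{code}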
[jira] [Comment Edited] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208472#comment-14208472 ] Manish Amde edited comment on SPARK-3717 at 11/12/14 7:02 PM: -- [~bbnsumanth] Look forward to the details of your approach. This is an important ticket and want to make sure that we all agree on the architecture before pursuing the implementation work. Also, as [~josephkb] suggested it might be a good idea to get your feet wet with a couple of small patches to get used to the Spark contribution workflow. was (Author: manishamde): [~bbnsumanth] Look forward to your details of your approach. This is an important ticket and want to make sure that we all agree on the architecture before pursuing the implementation work. Also, as [~josephkb] suggested it might be a good idea to get your feet wet with a couple of small patches to get used to the Spark contribution workflow. DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and training subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. 
Comparing Partitioning Methods Partitioning features cost < partitioning instances cost when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4371) Spark crashes with JBoss Logging 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208525#comment-14208525 ] Sean Owen commented on SPARK-4371: -- SLF4J is pretty backwards compatible. The right thing to do in general is update your dependency to 1.7.x in your app. Does that not work? Spark crashes with JBoss Logging 3.6.1 -- Key: SPARK-4371 URL: https://issues.apache.org/jira/browse/SPARK-4371 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Florent Pellerin When using JBoss-logging which itself depends on slf4j 1.6.1, Since SLF4JBridgeHandler.removeHandlersForRootLogger() was added in slf4j 1.6.5, Since spark/Logging.scala is doing at line 147: bridgeClass.getMethod(removeHandlersForRootLogger).invoke(null) Spark is crashing: java.lang.ExceptionInInitializerError: null at java.lang.Class.getMethod(Class.java:1670) at org.apache.spark.Logging$.init(Logging.scala:147) at org.apache.spark.Logging$.clinit(Logging.scala) at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:104) at org.apache.spark.Logging$class.log(Logging.scala:51) at org.apache.spark.SecurityManager.log(SecurityManager.scala:143) at org.apache.spark.Logging$class.logInfo(Logging.scala:59) at org.apache.spark.SecurityManager.logInfo(SecurityManager.scala:143) at org.apache.spark.SecurityManager.setViewAcls(SecurityManager.scala:208) at org.apache.spark.SecurityManager.init(SecurityManager.scala:167) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:151) at org.apache.spark.SparkContext.init(SparkContext.scala:203) I suggest Spark should at least silently swallow the exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
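A sketch of the "silently swallow the exception" suggestion from the report (illustrative only, not Spark's actual Logging code):
{code}
// Look up the slf4j bridge method reflectively and simply skip the call when the
// installed slf4j version predates removeHandlersForRootLogger (added in 1.6.5).
import scala.util.control.NonFatal

def uninstallJulHandlersIfPossible(): Unit = {
  try {
    val bridgeClass = Class.forName("org.slf4j.bridge.SLF4JBridgeHandler")
    bridgeClass.getMethod("removeHandlersForRootLogger").invoke(null)
  } catch {
    case _: ClassNotFoundException | _: NoSuchMethodException =>
      // old or absent slf4j-bridge: nothing to uninstall, carry on
    case NonFatal(_) =>
      // optionally log and continue rather than failing SparkContext creation
  }
}
{code}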
[jira] [Reopened] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-3039: -- Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy Assignee: Bertrand Bossy Fix For: 1.2.0 The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using {code} sc.newAPIHadoopFile[AvroKey[SomeClass]],NullWritable,AvroKeyInputFormat[SomeClass]](hdfs://path/to/file.avro) {code} The following error occurs: {code} java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} This error usually is a hint that there was a mix up of the old and the new Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
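The work-around in the last paragraph of the description can be expressed as an explicit dependency on the hadoop2-classified artifact. A build.sbt sketch (the version shown is illustrative, not taken from this report):
{code}
// Depend explicitly on the hadoop2 flavour of avro-mapred so it takes precedence
// over the hadoop1 flavour bundled in the Spark assembly.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
{code}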
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208595#comment-14208595 ] Kousuke Saruta commented on SPARK-4267: --- Hi [~ozawa], On my YARN-2.5.1(JDK 1.7.0_60) cluster, Spark Shell works well. I built with following command. {code} sbt/sbt -Dhadoop.version=2.5.1 -Pyarn assembly {code} And launched Spark Shell with following command. {code} bin/spark-shell --master yarn --deploy-mode client --executor-cores 1 --driver-memory 512M --executor-memory 512M --num-executors 1 {code} And then, I ran job with following script. {code} sc.textFile(hdfs:///user/kou/LICENSE.txt).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/kou/LICENSE.txt.count) {code} So I think the problem is not caused by the version of Hadoop. One possible case is that SparkContext#stop is called between instantiating SparkContext and running job accidentally. Did you see any ERROR log on the shell? Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
[jira] [Comment Edited] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208595#comment-14208595 ] Kousuke Saruta edited comment on SPARK-4267 at 11/12/14 8:07 PM: - Hi [~ozawa], On my YARN 2.5.1(JDK 1.7.0_60) cluster, Spark Shell works well. I built with following command. {code} sbt/sbt -Dhadoop.version=2.5.1 -Pyarn assembly {code} And launched Spark Shell with following command. {code} bin/spark-shell --master yarn --deploy-mode client --executor-cores 1 --driver-memory 512M --executor-memory 512M --num-executors 1 {code} And then, I ran job with following script. {code} sc.textFile(hdfs:///user/kou/LICENSE.txt).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/kou/LICENSE.txt.count) {code} So I think the problem is not caused by the version of Hadoop. One possible case is that SparkContext#stop is called between instantiating SparkContext and running job accidentally. Did you see any ERROR log on the shell? was (Author: sarutak): Hi [~ozawa], On my YARN-2.5.1(JDK 1.7.0_60) cluster, Spark Shell works well. I built with following command. {code} sbt/sbt -Dhadoop.version=2.5.1 -Pyarn assembly {code} And launched Spark Shell with following command. {code} bin/spark-shell --master yarn --deploy-mode client --executor-cores 1 --driver-memory 512M --executor-memory 512M --num-executors 1 {code} And then, I ran job with following script. {code} sc.textFile(hdfs:///user/kou/LICENSE.txt).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/kou/LICENSE.txt.count) {code} So I think the problem is not caused by the version of Hadoop. One possible case is that SparkContext#stop is called between instantiating SparkContext and running job accidentally. Did you see any ERROR log on the shell? Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console)
[jira] [Resolved] (SPARK-3660) Initial RDD for updateStateByKey transformation
[ https://issues.apache.org/jira/browse/SPARK-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-3660. -- Resolution: Fixed Fix Version/s: 1.3.0 Initial RDD for updateStateByKey transformation --- Key: SPARK-3660 URL: https://issues.apache.org/jira/browse/SPARK-3660 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Soumitra Kumar Priority: Minor Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h How to initialize the state transformation updateStateByKey? I have word counts from a previous spark-submit run, and want to load them in the next spark-submit job to continue counting from there. One proposal is to add the following argument to the updateStateByKey methods: initial : Option [RDD [(K, S)]] = None This maintains backward compatibility as well. I have working code as well. This thread started on the spark-user list at: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-updateStateByKey-operation-td14772.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
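A sketch of how the proposed argument would be used (the exact overload that shipped in 1.3 may differ; the path, partitioner and update function are illustrative):
{code}
// Seed the running word counts from a previous run's output, then keep updating
// them in the current streaming job.
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

// sc: the SparkContext (e.g. from spark-shell); path is a placeholder
val previousCounts = sc.textFile("hdfs:///wordcounts/previous")
  .map(_.split("\t"))
  .map(parts => (parts(0), parts(1).toInt))

val updateFunc = (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0))

// words: DStream[(String, Int)] built from the current run's input stream
// val runningCounts = words.updateStateByKey(updateFunc, new HashPartitioner(4), previousCounts)
{code}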
[jira] [Updated] (SPARK-3660) Initial RDD for updateStateByKey transformation
[ https://issues.apache.org/jira/browse/SPARK-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3660: - Priority: Major (was: Minor) Initial RDD for updateStateByKey transformation --- Key: SPARK-3660 URL: https://issues.apache.org/jira/browse/SPARK-3660 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Soumitra Kumar Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h How do I initialize the state of the updateStateByKey transformation? I have word counts from a previous spark-submit run, and want to load them in the next spark-submit job to continue counting from there. One proposal is to add the following argument to the updateStateByKey methods: initial: Option[RDD[(K, S)]] = None. This will maintain backward compatibility as well. I have working code as well. This thread started on the spark-user list at: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-updateStateByKey-operation-td14772.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4372) Make LR and SVM's default parameters consistent in Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-4372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208597#comment-14208597 ] Apache Spark commented on SPARK-4372: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3232 Make LR and SVM's default parameters consistent in Scala and Python Key: SPARK-4372 URL: https://issues.apache.org/jira/browse/SPARK-4372 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
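Until the defaults are aligned, one way to avoid surprises is to set the regularization explicitly rather than relying on either language's default. A rough Scala sketch; the numeric values are illustrative and {{training}} is an assumed RDD[LabeledPoint]:
{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.01)                 // state the regularization strength explicitly
  .setUpdater(new SquaredL2Updater)  // and the regularization type (L2)
val model = lr.run(training)         // training: RDD[LabeledPoint], assumed defined
{code}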
[jira] [Resolved] (SPARK-3666) Extract interfaces for EdgeRDD and VertexRDD
[ https://issues.apache.org/jira/browse/SPARK-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3666. Resolution: Fixed Fix Version/s: 1.2.0 Extract interfaces for EdgeRDD and VertexRDD Key: SPARK-3666 URL: https://issues.apache.org/jira/browse/SPARK-3666 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave Priority: Blocker Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208766#comment-14208766 ] Josh Rosen commented on SPARK-3630: --- Hi [~rdub], Thanks for the detailed logs. Do you have access to the executor logs from the executors where fetch failures occurred? I'd like to see whether those logs contain more information about why those fetches failed. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
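For context, an application-specific registrator of the kind mentioned above looks roughly like the sketch below; the registered classes and the class name are illustrative. The PARSING_ERROR surfaces while Kryo reads such data back through a Snappy-compressed stream.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register the application's frequently serialized types.
    kryo.register(classOf[Array[Float]])
    kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]])
  }
}

// Wiring it up (sketch):
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
{code}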
[jira] [Commented] (SPARK-2996) Standalone and Yarn have different settings for adding the user classpath first
[ https://issues.apache.org/jira/browse/SPARK-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208767#comment-14208767 ] Apache Spark commented on SPARK-2996: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/3233 Standalone and Yarn have different settings for adding the user classpath first --- Key: SPARK-2996 URL: https://issues.apache.org/jira/browse/SPARK-2996 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Standalone uses spark.files.userClassPathFirst while Yarn uses spark.yarn.user.classpath.first. Adding support for the former in Yarn should be pretty trivial. Don't know if Mesos has anything similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
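A minimal sketch of working around the inconsistency today: set both properties (names as given in the description), so the intent carries over whichever cluster manager runs the job.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.files.userClassPathFirst", "true")   // honored by standalone
  .set("spark.yarn.user.classpath.first", "true")  // honored by YARN
{code}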
[jira] [Resolved] (SPARK-4369) TreeModel.predict does not work with RDD
[ https://issues.apache.org/jira/browse/SPARK-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4369. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3230 [https://github.com/apache/spark/pull/3230] TreeModel.predict does not work with RDD Key: SPARK-4369 URL: https://issues.apache.org/jira/browse/SPARK-4369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker Fix For: 1.2.0 {code} Stack Trace - Traceback (most recent call last): File /home/rprabhu/Coding/github/SDNDDoS/classification/DecisionTree.py, line 49, in module predictions = model.predict(parsedData.map(lambda x: x.features)) File /home/rprabhu/Software/spark/python/pyspark/mllib/tree.py, line 42, in predict return self.call(predict, x.map(_convert_to_vector)) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 140, in call return callJavaFunc(self._sc, getattr(self._java_model, name), *a) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 117, in callJavaFunc return _java2py(sc, func(*args)) File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o39.predict. Trace: py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4369) TreeModel.predict does not work with RDD
[ https://issues.apache.org/jira/browse/SPARK-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4369: - Assignee: Davies Liu TreeModel.predict does not work with RDD Key: SPARK-4369 URL: https://issues.apache.org/jira/browse/SPARK-4369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.2.0 {code} Stack Trace - Traceback (most recent call last): File /home/rprabhu/Coding/github/SDNDDoS/classification/DecisionTree.py, line 49, in module predictions = model.predict(parsedData.map(lambda x: x.features)) File /home/rprabhu/Software/spark/python/pyspark/mllib/tree.py, line 42, in predict return self.call(predict, x.map(_convert_to_vector)) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 140, in call return callJavaFunc(self._sc, getattr(self._java_model, name), *a) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 117, in callJavaFunc return _java2py(sc, func(*args)) File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o39.predict. Trace: py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3667) Deprecate Graph#unpersistVertices and document how to correctly unpersist graphs
[ https://issues.apache.org/jira/browse/SPARK-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3667. --- Resolution: Won't Fix Target Version/s: (was: 1.2.0) Deprecate Graph#unpersistVertices and document how to correctly unpersist graphs Key: SPARK-3667 URL: https://issues.apache.org/jira/browse/SPARK-3667 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208857#comment-14208857 ] Ryan Williams commented on SPARK-3630: -- I ran a few more instances of this job, toggling {{spark.shuffle.manager}} between {{hash}} and {{sort}}, and wasn't able to continue reproducing the Snappy errors. Some jobs did go into a millions-of-FetchFailures death spiral, and some passed. Not sure how to help debug these transient failures. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
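For reference, the toggle described above is just a configuration switch, e.g.:
{code}
import org.apache.spark.SparkConf

// "sort" is the other value tried in the runs described above.
val conf = new SparkConf().set("spark.shuffle.manager", "hash")
{code}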
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208863#comment-14208863 ] Ryan Williams commented on SPARK-3630: -- [~joshrosen] I do have access to the logs, though I don't remember exactly which job was which. Let me try to put them somewhere you can see them. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208876#comment-14208876 ] Ryan Williams commented on SPARK-3630: -- [~joshrosen] can you see [this dropbox folder|https://www.dropbox.com/sh/pn0bik3tvy73wfi/AAByFlQVJ3QUOqiKYKXt31RGa?dl=0]? The {{\*.logs}} and {{\*.stacks}} files there are the raw yarn logs and a histogram of stack traces, respectively, for four of my jobs that have Snappy exceptions in the logs (0005, 0006, 0007, and 0008). Let me know if that helps or I can provide other info, thanks. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4373) MLlib unit tests failed maven test
Xiangrui Meng created SPARK-4373: Summary: MLlib unit tests failed maven test Key: SPARK-4373 URL: https://issues.apache.org/jira/browse/SPARK-4373 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should make sure there is at most one SparkContext running at any time inside the same JVM. Maven initializes all test classes first and then runs tests. So we cannot initialize sc as a member. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
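The pattern implied by the description, sketched with ScalaTest: create the SparkContext in beforeAll() and stop it in afterAll(), rather than as a member initialized at construction time (Maven instantiates every test class up front before running any tests). Suite and test names are illustrative.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ExampleMLlibSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("ExampleMLlibSuite"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()  // keep at most one SparkContext per JVM
    sc = null
    super.afterAll()
  }

  test("parallelize and count") {
    assert(sc.parallelize(1 to 100).count() === 100)
  }
}
{code}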
[jira] [Updated] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3665: -- Description: The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameters lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD was: The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameters lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD # JavaGraphLoader #- removes optional params, or uses builder pattern Java API for GraphX --- Key: SPARK-3665 URL: https://issues.apache.org/jira/browse/SPARK-3665 Project: Spark Issue Type: Improvement Components: GraphX, Java API Reporter: Ankur Dave Assignee: Ankur Dave The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameters lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208916#comment-14208916 ] Apache Spark commented on SPARK-3665: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/3234 Java API for GraphX --- Key: SPARK-3665 URL: https://issues.apache.org/jira/browse/SPARK-3665 Project: Spark Issue Type: Improvement Components: GraphX, Java API Reporter: Ankur Dave Assignee: Ankur Dave The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameters lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4374) LibraryClientSuite has been flaky
Reynold Xin created SPARK-4374: -- Summary: LibraryClientSuite has been flaky Key: SPARK-4374 URL: https://issues.apache.org/jira/browse/SPARK-4374 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Timothy Hunter Priority: Critical https://github.com/databricks/universe/pull/1780#issuecomment-62809791 LibraryClientSuite: PROD-2230 sanity checks for old data (historical: 55.00% [n=20], recent: 55.00% [n=20]) LibraryClientSuite: A simple case for python (historical: 55.00% [n=20], recent: 55.00% [n=20]) I disabled the two test cases in LibraryClientSuite. Tim - can you look into that? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2672) Support compression in wholeFile()
[ https://issues.apache.org/jira/browse/SPARK-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2672. --- Resolution: Fixed Fix Version/s: 1.3.0 1.2.0 Issue resolved by pull request 3005 [https://github.com/apache/spark/pull/3005] Support compression in wholeFile() -- Key: SPARK-2672 URL: https://issues.apache.org/jira/browse/SPARK-2672 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0, 1.3.0 Original Estimate: 72h Remaining Estimate: 72h The wholeFile() method cannot read compressed files; it should be able to, just like textFile(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
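With the fix, compressed inputs should be readable through the whole-file API just as they are through textFile(). A rough Scala sketch, assuming an existing SparkContext named sc and an illustrative path:
{code}
import org.apache.spark.SparkContext._

// Each record is (filePath, entireFileContents); compressed parts such as .gz
// should be decompressed transparently once compression is supported here.
val pages = sc.wholeTextFiles("hdfs:///data/pages/*.gz")
pages.mapValues(_.length).take(5).foreach(println)
{code}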
[jira] [Deleted] (SPARK-4374) LibraryClientSuite has been flaky
[ https://issues.apache.org/jira/browse/SPARK-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin deleted SPARK-4374: --- LibraryClientSuite has been flaky - Key: SPARK-4374 URL: https://issues.apache.org/jira/browse/SPARK-4374 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Timothy Hunter Priority: Critical https://github.com/databricks/universe/pull/1780#issuecomment-62809791 LibraryClientSuite: PROD-2230 sanity checks for old data (historical: 55.00% [n=20], recent: 55.00% [n=20]) LibraryClientSuite: A simple case for python (historical: 55.00% [n=20], recent: 55.00% [n=20]) I disabled the two test cases in LibraryClientSuite. Tim - can you look into that? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4326) unidoc is broken on master
[ https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208998#comment-14208998 ] Marcelo Vanzin commented on SPARK-4326: --- So, this is really weird. Unidoc is run by the sbt build, where none of the shading shenanigans from the maven build should apply. The root pom.xml adds guava as a dependency for everybody with compile scope when the sbt profile is enabled. That being said, if you look at the output of {{show allDependencies}} from within an sbt shell, it will show some components with a guava 11.0.2 provided dependency. So the profile isn't taking? Another fun fact is that the dependencies for the core project, where the errors above come from, are correct in the output of {{show allDependencies}}; it shows guava 14.0.1 compile as it should. I was able to workaround this by adding guava explicitly in SparkBuild.scala, in the {{sharedSettings}} variable: {code} libraryDependencies += com.google.guava % guava % 14.0.1 {code} That got rid of the above errors, but it didn't fix the overall build. Anyone more familiar with sbt/unidoc knows what's going on here? Here are the errors with that hack applied: {noformat} [error] /work/apache/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type [error] protected Type type() { return Type.UPLOAD_BLOCK; } [error] ^ [error] /work/apache/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: not found: type Type [error] protected Type type() { return Type.REGISTER_EXECUTOR; } [error] ^ [error] /work/apache/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: not found: type Type [error] protected Type type() { return Type.OPEN_BLOCKS; } [error] ^ [error] /work/apache/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39: not found: type Type [error] protected Type type() { return Type.STREAM_HANDLE; } [error] ^ {noformat} unidoc is broken on master -- Key: SPARK-4326 URL: https://issues.apache.org/jira/browse/SPARK-4326 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.3.0 Reporter: Xiangrui Meng On master, `jekyll build` throws the following error: {code} [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def rehash(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def hashcode(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error]^ [error] 
/Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37: type mismatch; [error] found : java.util.Iterator[T] [error] required: Iterable[?] [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator [error] ^ [error] /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421: value putAll is not a member of com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer] [error] footerCache.putAll(newFooters) [error] ^ [warn] /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is only intended as a + [warn] ^
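The workaround described above amounts to one line in project/SparkBuild.scala; with sbt's usual string-literal syntax it reads roughly as follows (version pinned to the one core's pom declares):
{code}
// project/SparkBuild.scala, inside sharedSettings
libraryDependencies += "com.google.guava" % "guava" % "14.0.1"
{code}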
[jira] [Created] (SPARK-4375) assembly built with Maven is missing most of repl classes
Sandy Ryza created SPARK-4375: - Summary: assembly built with Maven is missing most of repl classes Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4375) assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4375: -- Description: In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4326) unidoc is broken on master
[ https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209058#comment-14209058 ] Xiangrui Meng commented on SPARK-4326: -- [~vanzin] Thanks for looking into this issue! This is the commit that caused the problem: SPARK-3796: https://github.com/apache/spark/commit/f55218aeb1e9d638df6229b36a59a15ce5363482 It adds Guava 11.0.1 in the pom, which is perhaps not the correct way to specify Guava version. [~adav] Could you explain which Guava version you need per Hadoop profile? unidoc is broken on master -- Key: SPARK-4326 URL: https://issues.apache.org/jira/browse/SPARK-4326 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.3.0 Reporter: Xiangrui Meng On master, `jekyll build` throws the following error: {code} [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def rehash(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def hashcode(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error]^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37: type mismatch; [error] found : java.util.Iterator[T] [error] required: Iterable[?] [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator [error] ^ [error] /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421: value putAll is not a member of com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer] [error] footerCache.putAll(newFooters) [error] ^ [warn] /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is only intended as a + [warn] ^ [info] No documentation generated with unsucessful compiler run [warn] two warnings found [error] 6 errors found [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM {code} It doesn't happen on branch-1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4326) unidoc is broken on master
[ https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4326: - Priority: Critical (was: Major) unidoc is broken on master -- Key: SPARK-4326 URL: https://issues.apache.org/jira/browse/SPARK-4326 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.3.0 Reporter: Xiangrui Meng Priority: Critical On master, `jekyll build` throws the following error: {code} [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def rehash(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def hashcode(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error]^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37: type mismatch; [error] found : java.util.Iterator[T] [error] required: Iterable[?] [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator [error] ^ [error] /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421: value putAll is not a member of com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer] [error] footerCache.putAll(newFooters) [error] ^ [warn] /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is only intended as a + [warn] ^ [info] No documentation generated with unsucessful compiler run [warn] two warnings found [error] 6 errors found [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM {code} It doesn't happen on branch-1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4179) Streaming Linear Regression example has type mismatch
[ https://issues.apache.org/jira/browse/SPARK-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-4179. Resolution: Not a Problem Assignee: Xiangrui Meng (was: Anant Daksh Asthana) I'm closing this JIRA because this is already fixed in SPARK-3108 and the user guide is up-to-date. But please feel free to re-open it if I missed something. Streaming Linear Regression example has type mismatch - Key: SPARK-4179 URL: https://issues.apache.org/jira/browse/SPARK-4179 Project: Spark Issue Type: Bug Components: Examples, MLlib Affects Versions: 1.1.0 Reporter: Anant Daksh Asthana Assignee: Xiangrui Meng The example for Streaming Linear Regression on line 65 calls predictOn with a DStream of (Double, Vector) pairs when the expected argument type is DStream[Vector]. This throws a type error. examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala#65 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
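On the caller side, the fix is to map the labeled stream down to its feature vectors before calling predictOn, or to use predictOnValues to keep the labels. A sketch, with testData assumed to be a DStream[LabeledPoint] and model an assumed StreamingLinearRegressionWithSGD instance:
{code}
// predictOn expects a DStream[Vector]:
model.predictOn(testData.map(_.features)).print()

// or keep the label next to each prediction:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
{code}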
[jira] [Created] (SPARK-4376) Put external modules behind build profiles
Patrick Wendell created SPARK-4376: -- Summary: Put external modules behind build profiles Key: SPARK-4376 URL: https://issues.apache.org/jira/browse/SPARK-4376 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Sandy Ryza Priority: Blocker Several people have asked me whether, to speed up the build, we can put the external projects behind build flags similar to the kinesis-asl module. Since these aren't in the assembly there isn't a great reason to build them by default. We can just modify our release script to build them and when we run tests. This doesn't technically block Spark 1.2 but it is going to be looped into a separate fix that does block Spark 1.2 so I'm upgrading it to blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3325) Add a parameter to the method print in class DStream.
[ https://issues.apache.org/jira/browse/SPARK-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209153#comment-14209153 ] Apache Spark commented on SPARK-3325: - User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/3237 Add a parameter to the method print in class DStream. - Key: SPARK-3325 URL: https://issues.apache.org/jira/browse/SPARK-3325 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: Yadong Qi def print(num: Int = 10) The user can then control the number of elements to print. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
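Usage would look like the following sketch, assuming the proposed parameter is added and wordCounts is an existing DStream:
{code}
wordCounts.print()    // current behaviour: first 10 elements of each batch
wordCounts.print(25)  // with the proposal: first 25 elements of each batch
{code}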
[jira] [Closed] (SPARK-4364) Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong
[ https://issues.apache.org/jira/browse/SPARK-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu closed SPARK-4364. --- Resolution: Duplicate Sorry, didn't notice SPARK-4297. Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong Key: SPARK-4364 URL: https://issues.apache.org/jira/browse/SPARK-4364 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial Labels: unit-test Because of type erasure, the unit tests still pass. However, the wrong variable types will confuse people. The locations of these variables can be found in my PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4373) MLlib unit tests failed maven test
[ https://issues.apache.org/jira/browse/SPARK-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4373. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3235 [https://github.com/apache/spark/pull/3235] MLlib unit tests failed maven test -- Key: SPARK-4373 URL: https://issues.apache.org/jira/browse/SPARK-4373 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.2.0 We should make sure there is at most one SparkContext running at any time inside the same JVM. Maven initializes all test classes first and then runs tests. So we cannot initialize sc as a member. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4370) Limit cores used by Netty transfer service based on executor size
[ https://issues.apache.org/jira/browse/SPARK-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4370. Resolution: Fixed Fix Version/s: 1.2.0 Limit cores used by Netty transfer service based on executor size - Key: SPARK-4370 URL: https://issues.apache.org/jira/browse/SPARK-4370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.2.0 Right now, the NettyBlockTransferService uses the total number of cores on the system as the number of threads and buffer arenas to create. The latter is more troubling -- this can lead to significant allocation of extra heap and direct memory in situations where executors are relatively small compared to the whole machine. For instance, on a machine with 32 cores, we will allocate (32 cores * 16MB per arena = 512MB) * 2 for client and server = 1GB direct and heap memory. This can be a huge overhead if you're only using, say, 8 of those cores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4375) Assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4375: -- Summary: Assembly built with Maven is missing most of repl classes (was: assembly built with Maven is missing most of repl classes) Assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4326) unidoc is broken on master
[ https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209170#comment-14209170 ] Marcelo Vanzin commented on SPARK-4326: --- Hmm, but core/pom.xml defines an explicit dependency on guava 14, so it should override the 11.0.2 dependency from the shuffle module (which is correct, btw). And maven's / sbt's dependency resolution seems to indicate that's happening, although unidoc doesn't. That's the weird part. Maybe some bug in the unidoc plugin? unidoc is broken on master -- Key: SPARK-4326 URL: https://issues.apache.org/jira/browse/SPARK-4326 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.3.0 Reporter: Xiangrui Meng Priority: Critical On master, `jekyll build` throws the following error: {code} [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def rehash(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def hashcode(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error]^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37: type mismatch; [error] found : java.util.Iterator[T] [error] required: Iterable[?] [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator [error] ^ [error] /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421: value putAll is not a member of com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer] [error] footerCache.putAll(newFooters) [error] ^ [warn] /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is only intended as a + [warn] ^ [info] No documentation generated with unsucessful compiler run [warn] two warnings found [error] 6 errors found [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM {code} It doesn't happen on branch-1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209183#comment-14209183 ] Matthew Daniel commented on SPARK-4267: --- Apologies, I don't know if we want log verbiage inline or as an attachment. I experienced this NPE on an EMR cluster, AMI 3.3.0 which is Amazon Hadoop 2.4.0 against a {{make-distribution.sh}} version with {{-Pyarn}} and {{-Phadoop-2.2}} with {{-Dhadoop.version=2.2.0}}. I built it against 2.2 because some of our jobs run on 2.2, and I thought 2.4 would be backwards compatible. I will try building as you said, using {{sbt assembly}}, but I wanted to reply to your comment that yes, I do see an {{ERROR}} line but it isn't helpful to me, so I hope it's meaningful to others. {noformat} 14/11/13 02:58:23 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1415847498993 yarnAppState: ACCEPTED 14/11/13 02:58:23 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=10.166.39.198,PROXY_URI_BASE=http://10.166.39.198:9046/proxy/application_1415840940647_0001, /proxy/application_1415840940647_0001 14/11/13 02:58:23 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter 14/11/13 02:58:24 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: 0 appStartTime: 1415847498993 yarnAppState: RUNNING 14/11/13 02:58:29 ERROR cluster.YarnClientSchedulerBackend: Yarn application already ended: FINISHED 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null} 14/11/13 02:58:29 INFO ui.SparkUI: Stopped Spark web UI at 
http://ip-10-166-39-198.ec2.internal:4040 14/11/13 02:58:29 INFO scheduler.DAGScheduler: Stopping DAGScheduler 14/11/13 02:58:29 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors 14/11/13 02:58:29 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down 14/11/13 02:58:29 INFO cluster.YarnClientSchedulerBackend: Stopped 14/11/13 02:58:30 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 14/11/13 02:58:30 INFO network.ConnectionManager: Selector thread was interrupted! 14/11/13 02:58:30 INFO network.ConnectionManager: ConnectionManager stopped 14/11/13 02:58:30 INFO storage.MemoryStore: MemoryStore cleared 14/11/13 02:58:30 INFO storage.BlockManager: BlockManager stopped 14/11/13 02:58:30 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 14/11/13 02:58:30 INFO spark.SparkContext: Successfully stopped SparkContext 14/11/13 02:58:30 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 14/11/13 02:58:30 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 14/11/13 02:58:30 INFO Remoting: Remoting shut down 14/11/13 02:58:30 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 14/11/13 02:58:47 INFO
[jira] [Commented] (SPARK-750) LocalSparkContext should be included in Spark JAR
[ https://issues.apache.org/jira/browse/SPARK-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209167#comment-14209167 ] Nathan M commented on SPARK-750: +1. This shouldn't be hard; in Maven it's a plugin to add to the spark/core/pom.xml file, as described here: http://maven.apache.org/plugins/maven-jar-plugin/examples/create-test-jar.html LocalSparkContext should be included in Spark JAR - Key: SPARK-750 URL: https://issues.apache.org/jira/browse/SPARK-750 Project: Spark Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Josh Rosen Priority: Minor To aid third-party developers in writing unit tests with Spark, LocalSparkContext should be included in the Spark JAR. Right now, it appears to be excluded because it is located in one of the Spark test directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4294) The same function should have the same realization.
[ https://issues.apache.org/jira/browse/SPARK-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-4294: Assignee: Tathagata Das The same function should have the same realization. --- Key: SPARK-4294 URL: https://issues.apache.org/jira/browse/SPARK-4294 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Yadong Qi Assignee: Tathagata Das Priority: Minor Fix For: 1.2.0 In class TransformedDStream: require(parents.length > 0, "List of DStreams to transform is empty") require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts") require(parents.map(_.slideDuration).distinct.size == 1, "Some of the DStreams have different slide durations") In class UnionDStream: if (parents.length == 0) { throw new IllegalArgumentException("Empty array of parents") } if (parents.map(_.ssc).distinct.size > 1) { throw new IllegalArgumentException("Array of parents have different StreamingContexts") } if (parents.map(_.slideDuration).distinct.size > 1) { throw new IllegalArgumentException("Array of parents have different slide times") } The function is the same, but the realization is not. I think they should be the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
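A sketch of what the unified version could look like, expressing UnionDStream's checks in the same require style already used by TransformedDStream:
{code}
require(parents.length > 0, "List of DStreams to union is empty")
require(parents.map(_.ssc).distinct.size == 1,
  "Some of the DStreams have different contexts")
require(parents.map(_.slideDuration).distinct.size == 1,
  "Some of the DStreams have different slide durations")
{code}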
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209239#comment-14209239 ] Kousuke Saruta commented on SPARK-4267: --- Hi [~bugzi...@mdaniel.scdi.com]. The NPE is caused by the SparkContext having been stopped because the application finished unexpectedly. I don't yet know why your application finished before the job ran. Can you see any ERROR message in the logs of the ApplicationMaster or ResourceManager? Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with an NPE. {code} $ bin/spark-shell --master yarn-client scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2"); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13) at $iwC$$iwC$$iwC.<init>(<console>:18) at $iwC$$iwC.<init>(<console>:20) at $iwC.<init>(<console>:22) at <init>(<console>:24) at .<init>(<console>:28) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209255#comment-14209255 ] Derrick Burns commented on SPARK-2620: -- I also hit the bug when running Spark 1.1.0 in local mode. case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: {code} case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect {code} [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, which should be equivalent to: {code} sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect {code} Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also exhibit the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
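Until the underlying issue is fixed, a hypothetical workaround sketch for the spark-shell (assuming the same session and data as above) is to reduce on a plain value derived from the case class and rebuild the pairs afterwards:
{code}
// Hypothetical workaround: reduce on a String key, which is unaffected by the
// bug, then map the keys back into the case class for downstream use.
case class P(name: String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
val counts = sc.parallelize(ps)
  .map(p => (p.name, 1))     // key by the field, not the case class
  .reduceByKey(_ + _)
  .map { case (name, n) => (P(name), n) }
counts.collect()             // P("bob") should now count as 2 (order may vary)
{code}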
[jira] [Comment Edited] (SPARK-4375) Assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209305#comment-14209305 ] Patrick Wendell edited comment on SPARK-4375 at 11/13/14 5:37 AM: -- Hey Sandy, What about the following solution: 1. For the repl case, we make the change you are suggesting and simply drop the need for -Pscala-2.10 to be there explicitly. 2. We no longer include the examples module or the external project modules by default in the build. There is a profile for each of the external projects and a profile for the examples. 3. When building the examples, you need to specify, somewhat pedantically, all of the necessary external sub projects and also -Pscala-2.10 or -Pscala-2.11. We can just give people the exact commands to run for the 2.10 and 2.11 examples in the maven docs. The main benefit I see is that there is no regression for someone doing a package for Scala 2.10, which is the common case. If someone wants to build the examples, they need to go and do a bit of extra work to look up the new command, but it's mostly straightforward. Of course, all of our packages will still have the examples pre-built. was (Author: pwendell): Hey Sandy, What about the following solution: 1. For the repl case, we make the change you are suggesting and simply drop the need for -Pscala-2.10 to be there explicitly. 2. We no longer include the examples module or the external project modules by default in the build. 3. When building the examples, you need to specify, somewhat pedantically, all of the necessary external sub projects and also -Pscala-2.10 or -Pscala-2.11. We can just give people the exact commands to run for the 2.10 and 2.11 examples in the maven docs. The main benefit I see is that there is no regression for someone doing a package for Scala 2.10, which is the common case. If someone wants to build the examples, they need to go and do a bit of extra work to look up the new command, but it's mostly straightforward. Of course, all of our packages will still have the examples pre-built. Assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4375) Assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209305#comment-14209305 ] Patrick Wendell commented on SPARK-4375: Hey Sandy, What about the following solution: 1. For the repl case, we make the change you are suggesting and simply drop the need for -Pscala-2.10 to be there explicitly. 2. We no longer include the examples module or the external project modules by default in the build. 3. When building the examples, you need to specify, somewhat pedantically, all of the necessary external sub projects and also -Pscala-2.10 or -Pscala-2.11. We can just give people the exact commands to run for the 2.10 and 2.11 examples in the maven docs. The main benefit I see is that there is no regression for someone doing a package for Scala 2.10, which is the common case. If someone wants to build the examples, they need to go and do a bit of extra work to look up the new command, but it's mostly straightforward. Of course, all of our packages will still have the examples pre-built. Assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4375) Assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209314#comment-14209314 ] Patrick Wendell commented on SPARK-4375: One thing we could add on top of that, so the user has to type less, is something like this: {code} mvn package -Pexamples -Pscala-2.10 -Dexamples-2.10 {code} Then internally we have profiles that are activated by the examples-2.10 property and that add the relevant modules required by the 2.10 examples build. Assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org