[jira] [Updated] (SPARK-3526) Docs section on data locality
[ https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ash updated SPARK-3526:
------------------------------
    Summary: Docs section on data locality  (was: Section on data locality)

Docs section on data locality
-----------------------------
                 Key: SPARK-3526
                 URL: https://issues.apache.org/jira/browse/SPARK-3526
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 1.0.2
            Reporter: Andrew Ash

Several threads on the mailing list have been about data locality and how to interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc. Let's get some more details in the docs on this concept so we can point future questions there. A couple of people appreciated the description of locality below, so it could be a good starting point:

{quote}
Locality is how close the data is to the code that's processing it. PROCESS_LOCAL means the data is in the same JVM as the running code, so it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same node, or in another executor on the same node, so it is a little slower because the data has to travel across an IPC connection. RACK_LOCAL is even slower -- the data is on a different server, so it needs to be sent over the network.

Spark switches to lower locality levels when there's no unprocessed data on a node that has idle CPUs. In that situation you have two options: wait until the busy CPUs free up so you can start another task that uses data on that server, or start a new task on a farther-away server that needs to bring the data from that remote place. What Spark typically does is wait a bit in the hope that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU.
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
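The "wait a bit" behavior described in the quote is governed by a configurable locality-wait timeout. A hedged sketch of the relevant settings in conf/spark-defaults.conf (the 3000 ms values are illustrative, not tuning advice):

```
# Milliseconds to wait for a free CPU at a better locality level before
# falling back to the next level (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY)
spark.locality.wait          3000

# Per-level overrides, which default to the value of spark.locality.wait
spark.locality.wait.process  3000
spark.locality.wait.node     3000
spark.locality.wait.rack     3000
```

Raising these values keeps tasks waiting longer for better locality; setting them to 0 makes the scheduler fall back immediately.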
[jira] [Commented] (SPARK-3526) Docs section on data locality
[ https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133662#comment-14133662 ]

Andrew Ash commented on SPARK-3526:
-----------------------------------
Note: reports from users that reading from {{file://}} may be logged as {{PROCESS_LOCAL}}?
[jira] [Created] (SPARK-3527) Strip the physical plan message margin
Cheng Hao created SPARK-3527:
--------------------------------
             Summary: Strip the physical plan message margin
                 Key: SPARK-3527
                 URL: https://issues.apache.org/jira/browse/SPARK-3527
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Cheng Hao
            Priority: Trivial
[jira] [Commented] (SPARK-3527) Strip the physical plan message margin
[ https://issues.apache.org/jira/browse/SPARK-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133669#comment-14133669 ]

Apache Spark commented on SPARK-3527:
-------------------------------------
User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2392
[jira] [Created] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
Andrew Ash created SPARK-3528:
---------------------------------
             Summary: Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
                 Key: SPARK-3528
                 URL: https://issues.apache.org/jira/browse/SPARK-3528
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.1.0
            Reporter: Andrew Ash

Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task:

{noformat}
14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:20862+20863
14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:0+20862
{noformat}

There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:

{noformat}
override def getPreferredLocations(split: Partition): Seq[String] = {
  // TODO: Filtering out "localhost" in case of file:// URLs
  val hadoopSplit = split.asInstanceOf[HadoopPartition]
  hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
}
{noformat}
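The TODO quoted above suggests dropping the placeholder "localhost" entry that local-filesystem splits report as their location, so the scheduler is left with only genuine host preferences. A minimal sketch of that filtering idea in plain Java (the class and method names are hypothetical illustrations, not Spark's API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PreferredLocations {

    // Drop the "localhost" placeholder that file:// splits report,
    // keeping only real hostnames the scheduler can act on.
    static List<String> filterLocalhost(List<String> locations) {
        return locations.stream()
                .filter(host -> !host.equals("localhost"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A file:// split reports only "localhost", so nothing survives:
        System.out.println(filterLocalhost(Arrays.asList("localhost")));
        // An HDFS split reports real datanode hostnames, which are kept:
        System.out.println(filterLocalhost(Arrays.asList("datanode1", "datanode2")));
    }
}
```

With no preferred locations left, a task on a file:// split would no longer be reported as if its data lived in the executor's JVM.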
[jira] [Comment Edited] (SPARK-3526) Docs section on data locality
[ https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133662#comment-14133662 ]

Andrew Ash edited comment on SPARK-3526 at 9/15/14 8:14 AM:
------------------------------------------------------------
Note: reports from users that reading from {{file://}} may be logged as {{PROCESS_LOCAL}}?

Edit: repro'd and filed as SPARK-3528

was (Author: aash):
Note: reports from users that reading from {{file://}} may be logged as {{PROCESS_LOCAL}}?
[jira] [Updated] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ash updated SPARK-3528:
------------------------------
    Description:

Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task:

{noformat}
spark> sc.textFile("pom.xml").count
...
14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:20862+20863
14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:0+20862
{noformat}

There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:

{noformat}
override def getPreferredLocations(split: Partition): Seq[String] = {
  // TODO: Filtering out "localhost" in case of file:// URLs
  val hadoopSplit = split.asInstanceOf[HadoopPartition]
  hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
}
{noformat}
[jira] [Created] (SPARK-3529) Delete the temporal files after test exit
Cheng Hao created SPARK-3529:
--------------------------------
             Summary: Delete the temporal files after test exit
                 Key: SPARK-3529
                 URL: https://issues.apache.org/jira/browse/SPARK-3529
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Cheng Hao
            Priority: Minor
[jira] [Commented] (SPARK-3529) Delete the temporal files after test exit
[ https://issues.apache.org/jira/browse/SPARK-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133683#comment-14133683 ]

Apache Spark commented on SPARK-3529:
-------------------------------------
User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2393
[jira] [Created] (SPARK-3530) Pipeline and Parameters
Xiangrui Meng created SPARK-3530:
------------------------------------
             Summary: Pipeline and Parameters
                 Key: SPARK-3530
                 URL: https://issues.apache.org/jira/browse/SPARK-3530
             Project: Spark
          Issue Type: Sub-task
          Components: ML, MLlib
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng
            Priority: Critical

This part of the design doc is for pipelines and parameters. I put the design doc at:
https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at:
https://github.com/mengxr/spark-ml/

Please help review the design and post your comments here. Thanks!
[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133694#comment-14133694 ]

Saisai Shao commented on SPARK-2926:
------------------------------------
Hey [~rxin], here is the branch rebased on your code (https://github.com/jerryshao/apache-spark/tree/sort-shuffle-read-new-netty), mind taking a look at it? Thanks a lot.

Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
------------------------------------------------------------------
                 Key: SPARK-2926
                 URL: https://issues.apache.org/jira/browse/SPARK-2926
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 1.1.0
            Reporter: Saisai Shao
         Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report (contd).pdf, Spark Shuffle Test Report.pdf

Spark has already integrated sort-based shuffle write, which greatly improves IO performance and reduces memory consumption when the number of reducers is very large. But the reducer side still adopts the hash-based shuffle reader implementation, which neglects the ordering attributes of the map output data in some situations. Here we propose an MR-style, sort-merge-like shuffle reader for sort-based shuffle to further improve its performance. Work-in-progress code and a performance test report will be posted later, once some unit test bugs are fixed. Any comments would be greatly appreciated. Thanks a lot.
[jira] [Closed] (SPARK-3521) Missing modules in 1.1.0 source distribution - cant be build with maven
[ https://issues.apache.org/jira/browse/SPARK-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar closed SPARK-3521.
------------------------------
       Resolution: Not a Problem
    Fix Version/s: 1.1.1

The compile problem is fixed on the github branch-1.1.

Missing modules in 1.1.0 source distribution - cant be build with maven
-----------------------------------------------------------------------
                 Key: SPARK-3521
                 URL: https://issues.apache.org/jira/browse/SPARK-3521
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 1.1.0
            Reporter: Radim Kolar
            Priority: Minor
             Fix For: 1.1.1

The modules {{bagel}}, {{mllib}}, {{flume-sink}} and {{flume}} are missing from the source code distro, so Spark can't be built with Maven. It can't be built by {{sbt/sbt}} either, due to another bug (_java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1_).

{noformat}
(hsn@sanatana:pts/6):work/spark-1.1.0% mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package
[INFO] Scanning for projects...
[ERROR] The build could not read 1 project - [Help 1]
[ERROR]
[ERROR] The project org.apache.spark:spark-parent:1.1.0 (/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml) has 4 errors
[ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/bagel of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
[ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/mllib of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
[ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
[ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume-sink/pom.xml of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
{noformat}
[jira] [Created] (SPARK-3531) select null from table would throw a MatchError
Adrian Wang created SPARK-3531:
----------------------------------
             Summary: select null from table would throw a MatchError
                 Key: SPARK-3531
                 URL: https://issues.apache.org/jira/browse/SPARK-3531
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Adrian Wang

{{select null from src limit 1}} will lead to a scala.MatchError.
[jira] [Commented] (SPARK-3531) select null from table would throw a MatchError
[ https://issues.apache.org/jira/browse/SPARK-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133728#comment-14133728 ]

Apache Spark commented on SPARK-3531:
-------------------------------------
User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/2396
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133760#comment-14133760 ]

Sean Owen commented on SPARK-3530:
----------------------------------
A few high-level questions:

Is this a rewrite of MLlib? I see the old code will be deprecated. I assume the algorithms will come along, but in a fairly different form. I think that's actually a good thing. But is this targeted at a 2.x release, or sooner?

How does this relate to MLI and MLbase? I had thought they would in theory handle things like grid search, but haven't seen activity or mention of these in a while. Is this at all a merge of the two, or is MLlib going to take over these concerns?

I don't think you will need or want to use this code, but the oryx project already has an implementation of grid search on Spark -- at least another take on the API for such a thing to consider:
https://github.com/OryxProject/oryx/tree/master/oryx-ml/src/main/java/com/cloudera/oryx/ml/param

Big +1 for parameter tuning. That belongs as a first-class citizen. I'm also intrigued by doing better than trying every possible combination of parameters separately, maybe by sharing partial results to speed up training several models. Is this realistic for any parameters besides things like # iterations, which isn't really a hyperparameter? I don't know, for example, of ways to build N models with N different overfitting params and share some work. I would love to know that's possible. Good to design for it anyway.

I see mention of a Dataset abstraction, which I'm assuming contains some type information, like distinguishing categorical and numeric features. I think that's very good!

I've always found the 'pipeline' part hard to build. It's tempting to construct a framework for feature extraction. To some degree you can, by providing transformations, 1-hot encoding, etc. But I think a framework for understanding arbitrary databases and fields and so on quickly becomes too endlessly large a scope. Spark Core to me is already the right abstraction for upstream ETL of data before entering an ML framework. I mention it just because it's in the first picture, but I don't see discussion of actually doing user/product attribute selection later. So maybe it's not meant to be part of the proposal.

I'd certainly like to keep up more with your work here. This is a big step forward in making MLlib more relevant to production deployments rather than just pure algorithm implementations.
[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133776#comment-14133776 ]

Apache Spark commented on SPARK-2594:
-------------------------------------
User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2397

Add CACHE TABLE name AS SELECT ...
----------------------------------
                 Key: SPARK-2594
                 URL: https://issues.apache.org/jira/browse/SPARK-2594
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Michael Armbrust
            Priority: Critical
[jira] [Created] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.
Prashant Sharma created SPARK-3532:
--------------------------------------
             Summary: Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.
                 Key: SPARK-3532
                 URL: https://issues.apache.org/jira/browse/SPARK-3532
             Project: Spark
          Issue Type: Bug
            Reporter: Prashant Sharma
            Priority: Minor

While trying out Spark on FreeBSD, this seemed like the first blocker.

Workaround: in conf/spark-defaults.conf, set:

{noformat}
spark.broadcast.compress false
{noformat}
[jira] [Updated] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.
[ https://issues.apache.org/jira/browse/SPARK-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Sharma updated SPARK-3532:
-----------------------------------
    Description:

While trying out Spark on FreeBSD, this seemed like the first blocker.

Workaround: in conf/spark-defaults.conf, set:

{noformat}
spark.broadcast.compress false
{noformat}

An even better workaround is to set:

{noformat}
spark.io.compression.codec lzf
{noformat}
[jira] [Commented] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.
[ https://issues.apache.org/jira/browse/SPARK-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133821#comment-14133821 ]

Radim Kolar commented on SPARK-3532:
------------------------------------
You need to grab the snappy native library from the _snappy-java_ FreeBSD port; it's not included in the Maven Central JAR.

{noformat}
(hsn@sanatana:pts/8):~% pkg info -l snappyjava
snappyjava-1.0.4.1_1:
    /usr/local/lib/libsnappyjava.so
    /usr/local/share/java/classes/snappy-java.jar
{noformat}
[jira] [Resolved] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-3410.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.2.0
         Assignee: Kousuke Saruta

The priority of shutdownhook for ApplicationMaster should not be integer literal
--------------------------------------------------------------------------------
                 Key: SPARK-3410
                 URL: https://issues.apache.org/jira/browse/SPARK-3410
             Project: Spark
          Issue Type: Improvement
          Components: YARN
    Affects Versions: 1.1.0
            Reporter: Kousuke Saruta
            Assignee: Kousuke Saruta
            Priority: Minor
             Fix For: 1.2.0

In ApplicationMaster, the priority of the shutdown hook is set to 30, which is expected to be higher than the priority of o.a.h.FileSystem's. In FileSystem, the shutdown hook priority is expressed as a public constant named SHUTDOWN_HOOK_PRIORITY, so I think it's better to use this constant for the priority of ApplicationMaster's shutdown hook.
[jira] [Commented] (SPARK-3396) Change LogistricRegressionWithSGD's default regType to L2
[ https://issues.apache.org/jira/browse/SPARK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133917#comment-14133917 ]

Apache Spark commented on SPARK-3396:
-------------------------------------
User 'BigCrunsh' has created a pull request for this issue:
https://github.com/apache/spark/pull/2398

Change LogistricRegressionWithSGD's default regType to L2
---------------------------------------------------------
                 Key: SPARK-3396
                 URL: https://issues.apache.org/jira/browse/SPARK-3396
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark
    Affects Versions: 1.1.0
            Reporter: Xiangrui Meng
            Assignee: Christoph Sawade

The default updater is SimpleUpdater, which doesn't add any regularization.
[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133927#comment-14133927 ]

Nicholas Chammas commented on SPARK-3528:
-----------------------------------------
[~aash] - How about for data read from S3? I see that being marked as {{PROCESS_LOCAL}} as well.

{code}
sc.textFile('s3n://...').count()
14/09/15 10:12:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1242 bytes)
14/09/15 10:12:20 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1242 bytes)
{code}
[jira] [Commented] (SPARK-3526) Docs section on data locality
[ https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133935#comment-14133935 ] Nicholas Chammas commented on SPARK-3526: - FYI: Looks like the valid localities are [enumerated here|https://github.com/apache/spark/blob/cc14644460872efb344e8d895859d70213a40840/core/src/main/scala/org/apache/spark/scheduler/TaskLocality.scala#L25]. Docs section on data locality - Key: SPARK-3526 URL: https://issues.apache.org/jira/browse/SPARK-3526 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.0.2 Reporter: Andrew Ash Assignee: Andrew Ash Several threads on the mailing list have been about data locality and how to interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc. Let's get some more details in the docs on this concept so we can point future questions there. A couple people appreciated the below description of locality so it could be a good starting point: {quote} The locality is how close the data is to the code that's processing it. PROCESS_LOCAL means data is in the same JVM as the code that's running, so it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same node, or in another executor on the same node, so is a little slower because the data has to travel across an IPC connection. RACK_LOCAL is even slower -- data is on a different server so needs to be sent over the network. Spark switches to lower locality levels when there's no unprocessed data on a node that has idle CPUs. In that situation you have two options: wait until the busy CPUs free up so you can start another task that uses data on that server, or start a new task on a farther away server that needs to bring data from that remote place. What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU. 
{quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
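The "wait a bit" timeout described above is governed by Spark's spark.locality.wait settings, which can be tuned per locality level. A sketch of the relevant spark-defaults.conf entries (property names are real Spark configuration; the values shown are the documented defaults of this era, in milliseconds, for illustration only):

```
# How long to wait for a free CPU at a given locality level before
# falling back to the next (less local) level.
spark.locality.wait          3000
spark.locality.wait.process  3000
spark.locality.wait.node     3000
spark.locality.wait.rack     3000
```

The per-level properties default to the value of spark.locality.wait, so most deployments only set the first one.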
[jira] [Resolved] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3470. -- Resolution: Fixed Fix Version/s: 1.2.0 Have JavaSparkContext implement Closeable/AutoCloseable --- Key: SPARK-3470 URL: https://issues.apache.org/jira/browse/SPARK-3470 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor Fix For: 1.2.0 After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so: {code:java} try (JavaSparkContext ctx = new JavaSparkContext(...)) { // use ctx; it is stopped automatically on exit } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
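For illustration, here is a minimal standalone sketch of how implementing AutoCloseable enables this pattern. FakeContext is a hypothetical stand-in (a real JavaSparkContext needs a running Spark deployment); its close() plays the role that JavaSparkContext's close() plays by delegating to stop():

```java
// Any class implementing AutoCloseable can be managed by try-with-resources.
class FakeContext implements AutoCloseable {
    boolean stopped = false;

    @Override
    public void close() {
        stopped = true; // JavaSparkContext.close() would delegate to stop()
    }
}

public class AutoCloseDemo {
    public static void main(String[] args) {
        FakeContext captured = new FakeContext();
        try (FakeContext ctx = captured) {
            // use ctx; close() runs automatically on scope exit, even on exception
        }
        System.out.println("stopped = " + captured.stopped); // prints "stopped = true"
    }
}
```

The benefit over an explicit try/finally calling stop() is that the context can never be leaked by an early return or exception.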
[jira] [Commented] (SPARK-1895) Run tests on windows
[ https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133964#comment-14133964 ] Sean Owen commented on SPARK-1895: -- Can anyone still reproduce this? I know test temp file cleanup was improved in 1.0.x, and am not sure I have heard of this since. Run tests on windows Key: SPARK-1895 URL: https://issues.apache.org/jira/browse/SPARK-1895 Project: Spark Issue Type: Bug Components: PySpark, Windows Affects Versions: 0.9.1 Environment: spark-0.9.1-bin-hadoop1 Reporter: stribog Priority: Trivial bin\pyspark python\pyspark\rdd.py Sometimes tests complete without error _. Last tests fail log: 14/05/21 18:31:40 INFO Executor: Running task ID 321 14/05/21 18:31:40 INFO Executor: Running task ID 324 14/05/21 18:31:40 INFO Executor: Running task ID 322 14/05/21 18:31:40 INFO Executor: Running task ID 323 14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 324 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on localhost (progress: 1/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3) 14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 323 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on localhost (progress: 2/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607 14/05/21 18:31:41 INFO Executor: Sending result 
for 322 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 322 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on localhost (progress: 3/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 321 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on localhost (progress: 4/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks have all completed, from pool 14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest __main__.RDD.top[0]:1) finished in 1,051 s 14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest __main__.RDD.top[0]:1, took 1.053832912 s 14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest __main__.RDD.top[1]:1 14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest __main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false) 14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List() 14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List() 14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1), which has no missing parents 14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 
2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes in 1 ms 14/05/21 18:31:41 INFO Executor: Running task ID 326 14/05/21 18:31:41 INFO Executor:
[jira] [Resolved] (SPARK-1258) RDD.countByValue optimization
[ https://issues.apache.org/jira/browse/SPARK-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1258. -- Resolution: Won't Fix I'm taking the liberty of closing this, since this refers to an optimization using fastutil classes, which were removed from Spark. An equivalent optimization is employed now, using Spark's OpenHashMap. RDD.countByValue optimization - Key: SPARK-1258 URL: https://issues.apache.org/jira/browse/SPARK-1258 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Jaroslav Kamenik Priority: Trivial Class Object2LongOpenHashMap has method add(key, incr) (addTo in new version) for incrementing the value assigned to the key. It should be faster than the currently used map.put(v, map.getLong(v) + 1L). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
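The proposed optimization can be sketched outside Spark with plain Java: a get-then-put increment does two hash lookups per element, while a single-call update expresses the same thing in one operation. Here java.util.Map.merge plays the role of fastutil's addTo (or Spark's OpenHashMap.changeValue); the class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountByValueSketch {
    // Count occurrences of each value, incrementing with a single merge()
    // call per element rather than a separate get plus put.
    public static Map<String, Long> countByValue(List<String> values) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum); // single-call increment
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countByValue(List.of("a", "b", "a"));
        System.out.println(counts.get("a") + " " + counts.get("b")); // prints "2 1"
    }
}
```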
[jira] [Commented] (SPARK-3506) 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest
[ https://issues.apache.org/jira/browse/SPARK-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133976#comment-14133976 ] Sean Owen commented on SPARK-3506: -- Yeah, I imagine that can be touched up right now. For the future, I imagine the issue was just that the site was built from the branch before the release plugin upped the version and created the artifacts? So the site might be better built from the final released source artifact. I imagine it's a release-process doc change but don't know where that lives. 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest -- Key: SPARK-3506 URL: https://issues.apache.org/jira/browse/SPARK-3506 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Jacek Laskowski Priority: Trivial In https://spark.apache.org/docs/latest/ there are references to 1.1.0-SNAPSHOT: * This documentation is for Spark version 1.1.0-SNAPSHOT. * For the Scala API, Spark 1.1.0-SNAPSHOT uses Scala 2.10. It should be version 1.1.0 since that's the latest released version and the header tells so, too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134096#comment-14134096 ] Sean Owen commented on SPARK-2620: -- FWIW, here is a mailing list comment that suggests 1.1 works with these case classes, although this is not a case where the REPL is being used: http://apache-spark-user-list.1001560.n3.nabble.com/Compiler-issues-for-multiple-map-on-RDD-td14248.html case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
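Outside the REPL, the symptom above (keys that look equal failing to combine) is exactly what happens when a key type lacks value-based equals/hashCode, which is why REPL-regenerated case classes can break reduceByKey. A standalone Java sketch of that failure mode (class names hypothetical, not Spark code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyEqualityDemo {
    // No equals/hashCode: two instances with the same name stay distinct
    // keys, mirroring the duplicated P(bob) entries in the output above.
    static class BadKey {
        final String name;
        BadKey(String name) { this.name = name; }
    }

    // Value-based equals/hashCode: equal names collapse into one key,
    // like the String-keyed variant that works.
    static class GoodKey {
        final String name;
        GoodKey(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof GoodKey && ((GoodKey) o).name.equals(name);
        }
        @Override public int hashCode() { return Objects.hash(name); }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> bad = new HashMap<>();
        bad.merge(new BadKey("bob"), 1, Integer::sum);
        bad.merge(new BadKey("bob"), 1, Integer::sum);

        Map<GoodKey, Integer> good = new HashMap<>();
        good.merge(new GoodKey("bob"), 1, Integer::sum);
        good.merge(new GoodKey("bob"), 1, Integer::sum);

        System.out.println(bad.size() + " " + good.size()); // prints "2 1"
    }
}
```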
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134101#comment-14134101 ] Daniel Siegmann commented on SPARK-2620: I have tested the case in spark-shell on Spark 1.1.0. It is still broken. case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, that should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134073#comment-14134073 ] Gregory Phillips edited comment on SPARK-2984 at 9/15/14 4:43 PM: -- I'm running into this as well. But to respond to this theory: {quote} I think this may be related to spark.speculation. I think the error condition might manifest in this circumstance: 1) task T starts on an executor E1 2) it takes a long time, so task T' is started on another executor E2 3) T finishes in E1 so moves its data from _temporary to the final destination and deletes the _temporary directory during cleanup 4) T' finishes in E2 and attempts to move its data from _temporary, but those files no longer exist! exception {quote} Speculation is not necessary for this to occur. I am consistently running into this while testing some code against local mode without speculation, where I am trying to download, manipulate and merge 2 sets of data from S3 and serialize the resulting RDD using saveAsTextFile back to S3: {code} Job aborted due to stage failure: Task 3.0:754 failed 1 times, most recent failure: Exception failure in TID 762 on host localhost: java.io.FileNotFoundException: s3n://bucket/_temporary/_attempt_201409151537__m_000754_762/part-00754.deflate: No such file or directory. 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:786) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) Driver stacktrace: {code} I'm happy to provide more information or help investigate further to figure this one out. Edit: I forgot to mention that the file in question actually did exist on S3 when I checked after receiving this exception.
FileNotFoundException on _temporary directory
[jira] [Commented] (SPARK-2932) Move MasterFailureTest out of main source directory
[ https://issues.apache.org/jira/browse/SPARK-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134116#comment-14134116 ] Apache Spark commented on SPARK-2932: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/2399 Move MasterFailureTest out of main source directory - Key: SPARK-2932 URL: https://issues.apache.org/jira/browse/SPARK-2932 Project: Spark Issue Type: Task Components: Streaming Reporter: Marcelo Vanzin Priority: Trivial Currently, MasterFailureTest.scala lives in streaming/src/main, which means it ends up in the published streaming jar. It's only used by other test code, and although it also provides a main() entry point, that's also only usable for testing, so the code should probably be moved to the test directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1895) Run tests on windows
[ https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-1895: -- Description: bin\pyspark python\pyspark\rdd.py Sometimes tests complete without error _. Last tests fail log: {noformat} 14/05/21 18:31:40 INFO Executor: Running task ID 321 14/05/21 18:31:40 INFO Executor: Running task ID 324 14/05/21 18:31:40 INFO Executor: Running task ID 322 14/05/21 18:31:40 INFO Executor: Running task ID 323 14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 324 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on localhost (progress: 1/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3) 14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, finish = 0 14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607 14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver 14/05/21 18:31:40 INFO Executor: Finished task ID 323 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on localhost (progress: 2/4) 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607 14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 322 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on localhost (progress: 3/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1) 14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, finish = 0 14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607 
14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver 14/05/21 18:31:41 INFO Executor: Finished task ID 321 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on localhost (progress: 4/4) 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks have all completed, from pool 14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest __main__.RDD.top[0]:1) finished in 1,051 s 14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest __main__.RDD.top[0]:1, took 1.053832912 s 14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest __main__.RDD.top[1]:1 14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest __main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false) 14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List() 14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List() 14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1), which has no missing parents 14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1) 14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 
as 2594 bytes in 0 ms 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on executor localhost: localhost (PROCESS_LOCAL) 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes in 1 ms 14/05/21 18:31:41 INFO Executor: Running task ID 326 14/05/21 18:31:41 INFO Executor: Running task ID 328 14/05/21 18:31:41 INFO Executor: Running task ID 327 14/05/21 18:31:41 INFO Executor: Running task ID 325 14/05/21 18:31:41 INFO CacheManager: Partition rdd_212_3 not found, computing it 14/05/21 18:31:41 INFO MemoryStore: ensureFreeSpace(152) called with curMem=1120, maxMem=311387750 14/05/21 18:31:41 INFO MemoryStore: Block rdd_212_3 stored as values to memory (estimated size 152.0 B, free 297.0 MB) 14/05/21 18:31:41 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_212_3 in memory on stribog-pc:37187 (size: 152.0 B, free: 297.0 MB) 14/05/21 18:31:41 INFO
[jira] [Updated] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-1764: -- Description: I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: {noformat} 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext {noformat} And the other has a full stacktrace: {noformat} 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at 
akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {noformat} This error causes the SparkContext to shut down. I have not been able to reliably reproduce this bug; it seems to happen randomly, but if you run enough tasks on a SparkContext it'll happen eventually.
[jira] [Updated] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down
[ https://issues.apache.org/jira/browse/SPARK-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-2586: -- Description: When running Spark with Tachyon, if the connection to the Tachyon master is down (due to a network problem or the Master node being down), there is no clear log or error message to indicate it. Here is a sample log from running the SparkTachyonPi example with Tachyon connecting: {noformat} 14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to a loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5) 14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra 14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hsaputra) 14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started 14/07/15 16:43:11 INFO Remoting: Starting remoting 14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker 14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster 14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c 14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id = ConnectionManagerId(office-5-148.pa.gopivotal.com,53204) 14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB 14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager 14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block 
manager office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM 14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at http://10.64.5.148:53205 14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:12 INFO SparkUI: Started SparkUI at http://office-5-148.pa.gopivotal.com:4040 2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from SCDynamicStore 14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/15 16:43:12 INFO SparkContext: Added JAR examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar with timestamp 1405467792813 14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master spark://henry-pivotal.local:7077... 
14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at SparkTachyonPi.scala:43 14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false) 14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkTachyonPi.scala:43) 14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List() 14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List() 14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39), which has no missing parents 14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39) 14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140715164313- 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: app-20140715164313-/0 on worker-20140715164009-office-5-148.pa.gopivotal.com-52519 (office-5-148.pa.gopivotal.com:52519) with 8 cores 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140715164313-/0 on hostPort office-5-148.pa.gopivotal.com:52519 with 8 cores, 512.0 MB RAM 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor updated: app-20140715164313-/0 is now RUNNING 14/07/15 16:43:15 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@office-5-148.pa.gopivotal.com:53213/user/Executor#-423405256] with ID 0 14/07/15 16:43:15 INFO TaskSetManager: Re-computing pending task lists. 14/07/15 16:43:15 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: office-5-148.pa.gopivotal.com (PROCESS_LOCAL) 14/07/15 16:43:15 INFO
[jira] [Updated] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down
[ https://issues.apache.org/jira/browse/SPARK-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-2586: -- Description: When running Spark with Tachyon and the connection to the Tachyon master is down (due to a network problem or because the master node is down), there is no clear log or error message to indicate it. Here is a sample log from running the SparkTachyonPi example with Tachyon: {noformat} 14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to a loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5) 14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra 14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hsaputra) 14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started 14/07/15 16:43:11 INFO Remoting: Starting remoting 14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker 14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster 14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c 14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id = ConnectionManagerId(office-5-148.pa.gopivotal.com,53204) 14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB 14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager 14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block 
manager office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM 14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at http://10.64.5.148:53205 14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:12 INFO SparkUI: Started SparkUI at http://office-5-148.pa.gopivotal.com:4040 2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from SCDynamicStore 14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/15 16:43:12 INFO SparkContext: Added JAR examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar with timestamp 1405467792813 14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master spark://henry-pivotal.local:7077... 
14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at SparkTachyonPi.scala:43 14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false) 14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkTachyonPi.scala:43) 14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List() 14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List() 14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39), which has no missing parents 14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39) 14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140715164313- 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: app-20140715164313-/0 on worker-20140715164009-office-5-148.pa.gopivotal.com-52519 (office-5-148.pa.gopivotal.com:52519) with 8 cores 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140715164313-/0 on hostPort office-5-148.pa.gopivotal.com:52519 with 8 cores, 512.0 MB RAM 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor updated: app-20140715164313-/0 is now RUNNING 14/07/15 16:43:15 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@office-5-148.pa.gopivotal.com:53213/user/Executor#-423405256] with ID 0 14/07/15 16:43:15 INFO TaskSetManager: Re-computing pending task lists. 14/07/15 16:43:15 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: office-5-148.pa.gopivotal.com (PROCESS_LOCAL) 14/07/15 16:43:15 INFO
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134160#comment-14134160 ] Andrew Ash commented on SPARK-1239: --- For large statuses, would we expect that to exceed {{spark.akka.frameSize}} and cause the below exception? {noformat} 2014-09-14T01:34:21.305 ERROR [spark-akka.actor.default-dispatcher-4] org.apache.spark.MapOutputTrackerMasterActor - Map output statuses were 13920119 bytes which exceeds spark.akka.frameSize (10485760 bytes). {noformat} Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Assignee: Kostas Sakellis Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134198#comment-14134198 ] Patrick Wendell commented on SPARK-1239: Yes, the current state of the art is to just increase the frame size. Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Assignee: Kostas Sakellis Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
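The exchange above can be made concrete with a back-of-envelope model. The constants below are illustrative assumptions, not Spark's actual wire format: the driver ships, for every map task, a compressed size for every reduce partition, so treat the payload as roughly one byte per (mapper, reducer) pair plus a small per-mapper overhead, and compare against the 10 MB default of {{spark.akka.frameSize}} mentioned in the exception.

```python
# Rough sketch: would serialized map output statuses exceed spark.akka.frameSize?
# The one-byte-per-pair payload and 50-byte per-mapper overhead are assumptions
# made for illustration only.

DEFAULT_FRAME_SIZE = 10 * 1024 * 1024  # 10485760 bytes, as in the exception above

def estimated_status_bytes(num_mappers, num_reducers, overhead_per_mapper=50):
    # O(mappers * reducers) growth is the point: statuses scale with the
    # product of the two task counts, not with either one alone.
    return num_mappers * (overhead_per_mapper + num_reducers)

def exceeds_frame_size(num_mappers, num_reducers, frame_size=DEFAULT_FRAME_SIZE):
    return estimated_status_bytes(num_mappers, num_reducers) > frame_size

# Under this model a 5000 x 5000 shuffle is already past the default limit,
# which is why "just increase the frame size" is the usual workaround.
print(exceeds_frame_size(2000, 2000), exceeds_frame_size(5000, 5000))
```

The quadratic growth is also why the ticket proposes fetching statuses per (mapper, reducer) pair or piggybacking them on tasks rather than shipping the full map to every reducer.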
[jira] [Resolved] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize
[ https://issues.apache.org/jira/browse/SPARK-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3425. Resolution: Fixed Fixed by: https://github.com/apache/spark/pull/2301 OpenJDK - when run with jvm 1.8, should not set MaxPermSize --- Key: SPARK-3425 URL: https://issues.apache.org/jira/browse/SPARK-3425 Project: Spark Issue Type: Improvement Reporter: Matthew Farrellee Assignee: Matthew Farrellee Priority: Minor Fix For: 1.2.0 In JVM 1.8.0, MaxPermSize is no longer supported. In Spark stderr output, there would be a line like: {noformat}Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134210#comment-14134210 ] Jyotiska NK edited comment on SPARK-2377 at 9/15/14 6:00 PM: - I have been watching the work going on PR #11 for a while. Is there any way to contribute to this? was (Author: jyotiska): I have been watching the work going on PR #11. Is there any way to contribute to this? Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas Assignee: Kenichi Takagiwa [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be great feature add to have a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134210#comment-14134210 ] Jyotiska NK commented on SPARK-2377: I have been watching the work going on PR #11. Is there any way to contribute to this? Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas Assignee: Kenichi Takagiwa [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be great feature add to have a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134225#comment-14134225 ] Matthew Farrellee commented on SPARK-2377: -- it's a little tricky. you need to clone tdas' or giwa's repository, make changes on master (it's far from current spark master) and submit pull requests to giwa or tdas. imho, it'd be much simpler if the PR was tagged [WIP] and directed toward the apache/spark repo! (pls!) Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas Assignee: Kenichi Takagiwa [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be great feature add to have a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134246#comment-14134246 ] Derrick Burns commented on SPARK-2308: -- I have implemented MiniBatch KMeans in Spark. We do not need a special iterator type or random access to get the advantages of MiniBatch because the gain comes primarily from decreasing the number of distance calculations, and not from decreasing the number of points that are touched. MiniBatch is good if the number of points is dramatically larger than the number of clusters. In that case, any sampling of points will impact a large number of clusters, leading to faster convergence. MiniBatch is less useful when the number of desired clusters is large. In that case, a better approach is to track which clusters are dirty and which points are assigned to which clusters. Using this information, one can eliminate more and more distance calculations per round. This leads to shorter and shorter rounds, and consequently faster convergence. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134264#comment-14134264 ] RJ Nowling commented on SPARK-2308: --- It is true that we will save on the distance calculations for high dimensional data sets. There is also work under way to improve sampling in Spark, so this will also benefit further from that. Are you planning on creating a PR for your implementation? It would be valuable for the community. I closed mine due to the sampling issues. But I'd be happy to review and test yours. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
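For readers following the discussion above, here is a minimal sketch of the mini-batch variant being discussed, with Sculley-style sequential updates and a per-center decaying learning rate. This is an illustration only, not either contributor's implementation nor MLlib's: the farthest-first seeding, constants, and function names are all choices made for the sketch.

```python
import numpy as np

def farthest_first_init(points, k, rng):
    """Simple seeding: a random point, then repeatedly the point farthest
    from the centers chosen so far. Illustrative, not KMeans||."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = ((points[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(2)
        centers.append(points[d2.min(axis=1).argmax()])
    return np.asarray(centers, dtype=float)

def minibatch_kmeans(points, k, batch_size=32, iterations=100, seed=0):
    """Each iteration touches only a sampled batch, so distance calculations
    per round drop from k * n to k * batch_size -- the saving the comment
    above attributes to sampling."""
    rng = np.random.default_rng(seed)
    centers = farthest_first_init(points, k, rng)
    counts = np.zeros(k)
    for _ in range(iterations):
        batch = points[rng.choice(len(points), batch_size)]
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        for x, c in zip(batch, d2.argmin(axis=1)):
            counts[c] += 1
            eta = 1.0 / counts[c]                 # per-center learning rate
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers
```

Note how the sketch matches the trade-off stated above: when k is small relative to n, every batch touches most centers and convergence is fast; when k is large, most centers go untouched in each batch and the sampling buys little.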
[jira] [Created] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs
Michael Armbrust created SPARK-3534: --- Summary: Avoid running MLlib and Streaming tests when testing SQL PRs Key: SPARK-3534 URL: https://issues.apache.org/jira/browse/SPARK-3534 Project: Spark Issue Type: Bug Components: Project Infra, SQL Reporter: Michael Armbrust Priority: Blocker We are bumping up against the 120 minute time limit for tests pretty regularly now. Since we have decreased the number of shuffle partitions and upped the parallelism, I don't think there is much low-hanging fruit to speed up the SQL tests. (The tests that are listed as taking 2-3 minutes are actually 100s of tests that I think are valuable). Instead I propose we avoid running tests that we don't need to. This will have the added benefit of eliminating failures in SQL due to flaky streaming tests. Note that this won't fix the full builds that are run for every commit. There I think we should just up the test timeout. cc: [~joshrosen] [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs
[ https://issues.apache.org/jira/browse/SPARK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134290#comment-14134290 ] Josh Rosen commented on SPARK-3534: --- Looks like this has been proposed before: SPARK-1455 Avoid running MLlib and Streaming tests when testing SQL PRs Key: SPARK-3534 URL: https://issues.apache.org/jira/browse/SPARK-3534 Project: Spark Issue Type: Bug Components: Project Infra, SQL Reporter: Michael Armbrust Priority: Blocker We are bumping up against the 120 minute time limit for tests pretty regularly now. Since we have decreased the number of shuffle partitions and upped the parallelism, I don't think there is much low-hanging fruit to speed up the SQL tests. (The tests that are listed as taking 2-3 minutes are actually 100s of tests that I think are valuable). Instead I propose we avoid running tests that we don't need to. This will have the added benefit of eliminating failures in SQL due to flaky streaming tests. Note that this won't fix the full builds that are run for every commit. There I think we should just up the test timeout. cc: [~joshrosen] [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134291#comment-14134291 ] Derrick Burns commented on SPARK-2308: -- I have submitted several issues regarding the Spark K-Means clustering implementation. I would like to contribute my new K-Means implementation to Spark. However, my implementation is a re-write, and not simply a trivial pull request. Why did I re-write it? The initial reason for rewriting the code is that the 1.0.2 (and 1.1.0) implementation does not support the distance function that I wanted. I use the KL-divergence measure, whereas yours uses the squared Euclidean distance. Unfortunately, the Euclidean metric is so deeply ingrained in the 1.0.2 implementation that a major re-write was required. My implementation has a pluggable distance function. I wrote an implementation of the squared Euclidean distance function (actually I wrote two versions), and an implementation of KL-divergence. With this abstraction, the core K-Means algorithm knows nothing about the distance function. The second reason is that I noticed that the 1.0.2 implementation is rather slow because it recomputes all distances to all points on all rounds. Further, in my experiments on millions of points and thousands of clusters with hundreds of dimensions, the K-Means algorithm is CPU bound, with very little memory being used. I observed that we can eliminate the vast majority of distance calculations by maintaining, for each point, its closest cluster, the distance to that cluster, and the round in which either of those values last changed, and, for each cluster, the round in which it last moved. This is a modest amount of additional space, but it results in a dramatic improvement in running time. I have tested my new implementation extensively on EC2 using up to 16 c3.8xlarge worker machines. The application is in financial services, so I need the results to be right. 
:) I call my implementation Tracking K-Means. If you would entertain including my implementation in a future Spark release, please let me know. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
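The exact bookkeeping in the "Tracking K-Means" code above is not shown in this thread, so the sketch below substitutes the closely related Hamerly-style distance bounds to illustrate the same idea: record how far each center moved in a round, and skip every point whose cached assignment provably cannot have changed. All names and constants here are mine, made up for the illustration.

```python
import numpy as np

def tracking_kmeans(points, k, iterations=25, seed=0):
    """Lloyd's algorithm with Hamerly-style bounds: ub[i] is an upper bound
    on the distance from point i to its assigned center, lb[i] a lower bound
    on its distance to every other center. When ub <= lb the cached
    assignment is provably still correct, so all k distance computations
    for that point are skipped."""
    rng = np.random.default_rng(seed)
    n = len(points)
    centers = points[rng.choice(n, k, replace=False)].astype(float)
    assign = np.zeros(n, dtype=int)
    ub = np.full(n, np.inf)   # stale-but-valid upper bound to own center
    lb = np.zeros(n)          # lower bound to any other center
    skipped = 0
    for _ in range(iterations):
        for i in range(n):
            if ub[i] <= lb[i]:
                skipped += 1          # no distances computed for this point
                continue
            d = np.linalg.norm(points[i] - centers, axis=1)
            assign[i] = int(d.argmin())
            ub[i] = d[assign[i]]
            d[assign[i]] = np.inf
            lb[i] = d.min()           # distance to the second-closest center
        new_centers = np.array([points[assign == c].mean(axis=0)
                                if np.any(assign == c) else centers[c]
                                for c in range(k)])
        delta = np.linalg.norm(new_centers - centers, axis=1)  # center movement
        centers = new_centers
        if delta.max() == 0.0:        # no center moved: converged
            break
        ub += delta[assign]           # own center may have drifted away
        lb = np.maximum(lb - delta.max(), 0.0)  # others may have come closer
    return centers, assign, skipped
```

As the comment above predicts, the skip rate climbs round over round: once centers stop moving much, nearly every point satisfies ub <= lb and the rounds get shorter.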
[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-1517: -- Fix Version/s: (was: 1.1.0) 1.2.0 We should revisit this for the 1.2.0 release cycle, since this would have solved the issue that we ran into with bumping the SNAPSHOT version before the 1.1 artifacts were published on Maven. Publish nightly snapshots of documentation, maven artifacts, and binary builds -- Key: SPARK-1517 URL: https://issues.apache.org/jira/browse/SPARK-1517 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Patrick Wendell Fix For: 1.2.0 Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that jenkins can publish this stuff somewhere on apache infra. Ideally we don't want to have to put a private key on every jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the jenkins build can download the credentials provided we set a passphrase in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3104) Jenkins failing to test some PRs when asked to
[ https://issues.apache.org/jira/browse/SPARK-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3104. --- Resolution: Cannot Reproduce Resolving this as cannot reproduce for now, since Jenkins seems to have become responsive again. Feel free to re-open if it begins experiencing issues. We have an alternative to the pull request builder plugin ready to deploy if it breaks again. Jenkins failing to test some PRs when asked to -- Key: SPARK-3104 URL: https://issues.apache.org/jira/browse/SPARK-3104 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Erik Erlandson I've seen a few PRs where Jenkins does not appear to be testing when requested: https://github.com/apache/spark/pull/1964 https://github.com/apache/spark/pull/1254 https://github.com/apache/spark/pull/1839 Maybe the Jenkins logs have a record of what's going wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134301#comment-14134301 ] RJ Nowling commented on SPARK-2308: --- I'm not a committer but [~mengxr] is. That said, I'm very happy to help in any way I can. The issue of different distance metrics has come up on the mailing list -- a much-requested feature. If you provide it as a PR, maybe others who are more familiar with the work to add additional distance metrics can comment and we, as a community, can move forward to get it included. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2232) Fix Jenkins tests in Maven
[ https://issues.apache.org/jira/browse/SPARK-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2232. --- Resolution: Fixed This has been fixed; the Maven builds have now been green for a few days. Fix Jenkins tests in Maven -- Key: SPARK-2232 URL: https://issues.apache.org/jira/browse/SPARK-2232 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Priority: Blocker It appears Maven tests are failing under the newer Hadoop configurations. We need to go through and make sure all the Spark master build configurations are passing. https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3533: Affects Version/s: 1.1.0 Add saveAsTextFileByKey() method to RDDs Key: SPARK-3533 URL: https://issues.apache.org/jira/browse/SPARK-3533 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Nicholas Chammas Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys. For example, say I have an RDD like this: {code} a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0]) a.collect() [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] a.keys().distinct().collect() ['B', 'F', 'N'] {code} Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition. So the output would look something like: {code} /path/prefix/B [/part-1, /part-2, etc] /path/prefix/F [/part-1, /part-2, etc] /path/prefix/N [/part-1, /part-2, etc] {code} Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark. Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339 ] Derrick Burns commented on SPARK-3219: -- The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. Then, one can implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, such as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again, without touching the generic K-Means implementation. trait PointOps[P <: FP[T], C <: FP[T], T] { def distance(p: P, c: C, upperBound: Distance): Distance def userToPoint(v: Array[Double], index: Option[T]): P def centerToPoint(v: C): P def pointToCenter(v: P): C def centroidToCenter(v: Centroid): C def centroidToPoint(v: Centroid): P def centerMoved(v: P, w: C): Boolean } K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
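The PointOps abstraction above is Scala; as a language-neutral illustration of the same design, here is a tiny Python sketch in which the assignment step is generic over a pluggable divergence, with squared Euclidean distance and KL divergence (a Bregman divergence covered by the Banerjee et al. paper linked above) as the two plug-ins. Function names are mine, invented for the sketch.

```python
import numpy as np

def squared_euclidean(p, c):
    # The distance hard-coded into the 1.0.2 clusterer.
    return float(((p - c) ** 2).sum())

def kl_divergence(p, c, eps=1e-12):
    """KL(p || c) for points on the probability simplex; the Bregman
    divergence generated by negative entropy. eps guards against log(0)."""
    p = np.clip(p, eps, None)
    c = np.clip(c, eps, None)
    return float((p * np.log(p / c)).sum())

def nearest_center(point, centers, divergence):
    """The core assignment step: it knows nothing about which divergence
    it is given, mirroring the pluggable-distance design above."""
    return min(range(len(centers)), key=lambda j: divergence(point, centers[j]))
```

The reason this plug-in design is enough for k-means in particular is the result of the linked paper: for any Bregman divergence, the point minimizing the mean divergence to a set of members is their arithmetic mean, so Lloyd's center-update step carries over unchanged and only the assignment step needs the pluggable divergence.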
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339 ] Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:14 PM:
---
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are Point (P), Center (C), and Centroid. One can then implement a distance-function trait (called PointOps below) in a way that lets the implementer pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representations of Point and Center are abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

K-Means clusterer should support Bregman distance functions
---
Key: SPARK-3219
URL: https://issues.apache.org/jira/browse/SPARK-3219
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions, which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork.
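The issue asks for pluggable Bregman distance functions. As background only (this sketch is not part of the proposed patch, and `BregmanSketch`, `bregman`, and the helper names are hypothetical), a Bregman divergence is generated by a convex function phi via D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, which recovers both squared Euclidean distance and the generalized Kullback-Leibler divergence:

```scala
// Hypothetical sketch: Bregman divergences parameterized by a convex
// function phi and its gradient. Not part of the proposed SPARK-3219 patch.
object BregmanSketch {
  type Vec = Array[Double]

  // D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
  def bregman(phi: Vec => Double, gradPhi: Vec => Vec)(x: Vec, y: Vec): Double = {
    val g = gradPhi(y)
    phi(x) - phi(y) - x.indices.map(i => g(i) * (x(i) - y(i))).sum
  }

  // phi(v) = ||v||^2 yields the squared Euclidean distance.
  val squaredEuclidean: (Vec, Vec) => Double =
    bregman(v => v.map(a => a * a).sum, v => v.map(2 * _))

  // phi(v) = sum_i v_i log v_i yields the generalized KL divergence
  // (coordinates must be strictly positive).
  val kullbackLeibler: (Vec, Vec) => Double =
    bregman(v => v.map(a => a * math.log(a)).sum, v => v.map(a => math.log(a) + 1))
}
```

Plugging different phi functions into one `bregman` combinator is the same separation the PointOps trait below aims for: the generic clusterer never needs to know which divergence is in use.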
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339 ] Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (the type used to represent a distance, such as Float or Double), T (the data type used for a point by the K-Means client), P (the data type used for a point by the distance function), C (the data type used for a cluster center by the distance function), and Centroid. By separating the user type T from the types P (point) and C (center), one can pre-compute values, as is done in the 1.0.2 implementation, whose fast Euclidean distance pre-computes vector magnitudes. (With more complex distance functions such as the Kullback-Leibler divergence, one can pre-compute the logs of the points, which would be too expensive to re-compute in every distance calculation!) Further, since the representations of point and center are abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
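To illustrate the pre-computation the comment describes (a hypothetical sketch, not the actual patch: `FastPoint`, `norm2`, and `fastSquaredDistance` are illustrative names), caching the squared magnitude once per point lets each distance evaluation cost only a dot product, via ||x - c||^2 = ||x||^2 + ||c||^2 - 2<x, c>:

```scala
// Hypothetical sketch of the magnitude-caching idea behind the fast squared
// Euclidean distance. Illustrative names; not part of the proposed trait.
final case class FastPoint(raw: Array[Double], norm2: Double)

object FastPoint {
  // Pre-compute ||v||^2 once, when the point is constructed.
  def apply(raw: Array[Double]): FastPoint =
    FastPoint(raw, raw.map(a => a * a).sum)

  // ||x - c||^2 = ||x||^2 + ||c||^2 - 2 <x, c>; only the dot product
  // is computed per call, the norms are read from the cache.
  def fastSquaredDistance(p: FastPoint, c: FastPoint): Double = {
    val dot = p.raw.indices.map(i => p.raw(i) * c.raw(i)).sum
    math.max(0.0, p.norm2 + c.norm2 - 2.0 * dot) // clamp tiny negatives from rounding
  }
}
```

The same construction is why the proposed split between T (user data) and P/C (distance-function representations) matters: the cached `norm2` (or cached logs, for Kullback-Leibler) lives in the representation type, invisible to the generic clusterer.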
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339 ] Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM: --- The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (the type used to represent distance, such as Float or Double), T (the data type used for a point by the K-Means client), P (the data type used for a point by the distance function), C (the data type used for a cluster center by the distance function), and Centroid. By separating the user type T from the types P (point) and C (center), one can do things like pre-compute values (as is done with the Fast Euclidean distance in the 1.0.2 implementation that pre-computes magnitudes). (With more complex distance functions such as the Kullback-Leibler function, one can pre-compute the log of the points, which is too expensive to re-compute in the distance calculation!) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again, without touching the generic K-Means implementation. 
{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
was (Author: derrickburns): The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (the type used to represent distance, such as Float or Double), T (the data type used for a point by the K-Means client), P (the data type used for a point by the distance function), C (the data type used for a cluster center by the distance function), and Centroid. By separating the user type T from the types P (point) and C (center), one can do things like pre-compute values (as is done with the Fast Euclidean distance in the 1.0.2 implementation that pre-computes magnitudes). (With more complex distance functions such as the Kullback-Leibler function, one can pre-compute the log of the points, which is too expensive to re-compute in the distance calculation!) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again, without touching the generic K-Means implementation.
{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions, which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
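The magnitude pre-computation mentioned above (the Fast Euclidean trick) can be sketched in a few lines. This is a hedged illustration only, not the code from SPARK-3219 or the linked repository; the names FastPoint, fastPoint, and fastSquaredDistance are hypothetical.

```scala
// A point that caches its squared magnitude, so the distance kernel only has to
// recompute the dot product: ||p - c||^2 = ||p||^2 + ||c||^2 - 2 * (p . c).
final case class FastPoint(raw: Array[Double], normSq: Double)

def fastPoint(raw: Array[Double]): FastPoint =
  FastPoint(raw, raw.map(x => x * x).sum)

def fastSquaredDistance(p: FastPoint, c: FastPoint): Double = {
  var dot = 0.0
  var i = 0
  while (i < p.raw.length) { dot += p.raw(i) * c.raw(i); i += 1 }
  // Clamp at zero to guard against tiny negative values from rounding.
  math.max(0.0, p.normSq + c.normSq - 2.0 * dot)
}
```

Because the cached norms live in the point representation (P/C) rather than the user type (T), the generic K-Means loop never sees the optimization.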
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120325#comment-14120325 ] Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM: --- Great! You can find my work here: https://github.com/derrickburns/generalized-kmeans-clustering.git. I should warn you that I rewrote much of the original Spark clusterer because the original is too tightly coupled to using the Euclidean norm and does not allow one to identify efficiently which points belong to which clusters. I have tested this version extensively. You will notice a package called com.rincaro.clusterer.metrics. Please take a look at the two files EuOps.scala and FastEuclideansOps.scala. They both implement the Euclidean norm. However, one is much faster than the other by using the same algebraic transformations that the Spark version uses. This demonstrates that it is possible to be efficient while not being tightly coupled. One could easily re-implement FastEuclideanOps using Breeze or Blas without affecting the core K-Means implementation. Not included in this project is another distance function that I have implemented: the Kullback-Leibler distance function, a.k.a. relative entropy. In my implementation, I also perform algebraic transformations to expedite the computation, resulting in a distance computation that is even faster than the fast Euclidean norm. Let me know if this is useful to you. was (Author: derrickburns): Great! You can find my work here: https://github.com/derrickburns/generalized-kmeans-clustering.git. I should warn you that I rewrote much of the original Spark clusterer because the original is too tightly coupled to using the Euclidean norm and does not allow one to identify efficiently which points belong to which clusters. I have tested this version extensively. You will notice a package called com.rincaro.clusterer.metrics.
Please take a look at the two files EuOps.scala and FastEuclideansOps.scala. They both implement the Euclidean norm. However, one is much faster than the other by using the same algebraic transformations that the Spark version uses. This demonstrates that it is possible to be efficient while not being tightly coupled. One could easily re-implement FastEuclideanOps using Breeze or Blas without affecting the core K-Means implementation. Not included in this project is another distance function that I have implemented: the Kullback-Leibler distance function, a.k.a. relative entropy. In my implementation, I also perform algebraic transformations to expedite the computation, resulting in a distance computation that is even faster than the fast Euclidean norm. Let me know if this is useful to you. K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions, which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
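The Kullback-Leibler pre-computation described above (caching the logs so they are not recomputed per distance call) can be sketched as follows. This is an illustration of the idea, not the code from the linked repository; KLPoint, KLCenter, and the function names are hypothetical.

```scala
// KL(p || q) = sum_i p_i * log(p_i / q_i)
//            = sum_i p_i * log p_i  -  sum_i p_i * log q_i.
// The first term depends only on the point, so cache it once per point;
// cache log q once per center. Per-distance work is then a single dot product.
final case class KLPoint(p: Array[Double], negEntropy: Double)
final case class KLCenter(logQ: Array[Double])

def klPoint(p: Array[Double]): KLPoint =
  KLPoint(p, p.filter(_ > 0).map(x => x * math.log(x)).sum)

def klCenter(q: Array[Double]): KLCenter = KLCenter(q.map(math.log))

def klDistance(p: KLPoint, c: KLCenter): Double = {
  var dot = 0.0
  var i = 0
  while (i < p.p.length) { dot += p.p(i) * c.logQ(i); i += 1 }
  p.negEntropy - dot
}
```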
[jira] [Commented] (SPARK-3308) Ability to read JSON Arrays as tables
[ https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134385#comment-14134385 ] Apache Spark commented on SPARK-3308: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/2400 Ability to read JSON Arrays as tables - Key: SPARK-3308 URL: https://issues.apache.org/jira/browse/SPARK-3308 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Critical Right now we can only read json where each object is on its own line. It would be nice to be able to read top level json arrays where each element maps to a row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs
[ https://issues.apache.org/jira/browse/SPARK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134399#comment-14134399 ] Michael Armbrust commented on SPARK-3534: - Ah, yeah. Good catch Josh. I'll settle for a simpler version where we only fix SQL (since that is where the pain is being felt now). Bonus points for implementing the full solution :) Avoid running MLlib and Streaming tests when testing SQL PRs Key: SPARK-3534 URL: https://issues.apache.org/jira/browse/SPARK-3534 Project: Spark Issue Type: Bug Components: Project Infra, SQL Reporter: Michael Armbrust Priority: Blocker We are bumping up against the 120 minute time limit for tests pretty regularly now. Since we have decreased the number of shuffle partitions and upped the parallelism I don't think there is much low-hanging fruit to speed up the SQL tests. (The tests that are listed as taking 2-3 minutes are actually 100s of tests that I think are valuable). Instead I propose we avoid running tests that we don't need to. This will have the added benefit of eliminating failures in SQL due to flaky streaming tests. Note that this won't fix the full builds that are run for every commit. There I think we should just up the test timeout. cc: [~joshrosen] [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
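The proposal above (run only the test modules a PR actually touches) could look roughly like the sketch below. The path prefixes and module names are assumptions for illustration, not Spark's actual build configuration.

```scala
// Map changed file paths to the test modules that must run, so a SQL-only PR
// skips the MLlib and Streaming suites. Anything outside a known module
// (core/, build files, ...) conservatively forces a full run.
def modulesToTest(changedFiles: Seq[String]): Set[String] = {
  val known = Seq("sql", "mllib", "streaming")
  def moduleOf(f: String): Option[String] = known.find(m => f.startsWith(m + "/"))
  if (changedFiles.isEmpty || changedFiles.exists(f => moduleOf(f).isEmpty))
    known.toSet // unknown change: run everything
  else
    changedFiles.flatMap(moduleOf).toSet
}
```

A SQL-only change would then yield Set("sql"), while a change under core/ triggers every suite, which matches the "only fix SQL for now" compromise in the comment.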
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134445#comment-14134445 ] Pedro Rodriguez commented on SPARK-1405: Hi All. Just wanted to quickly introduce myself. I am an undergrad at UC Berkeley working in the Amplab and in particular with LDA (continuation of a grad class final project from last spring). Generally speaking my focus will be to use one LDA implementation as a baseline (probably Joey's since it is fully distributed in all parts, particularly the token-topic matrix), write unit tests + test cases, and benchmark it at scale. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Xusen Yin Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
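For context on the "Gibbs sampling core" mentioned above: the heart of a collapsed Gibbs sampler for LDA is the per-token topic update, which computes unnormalized topic weights from the document-topic, topic-word, and topic count matrices. The sketch below is the textbook formula, not the implementation from this PR; all names and hyperparameter values are illustrative.

```scala
// Unnormalized p(z = t | rest) for one token, with counts excluding that token:
//   (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
// where n_dt = topic count in the doc, n_tw = count of this word in the topic,
// n_t = total tokens in the topic, V = vocabulary size.
def topicWeights(nDk: Array[Int], nKw: Array[Int], nK: Array[Int],
                 alpha: Double, beta: Double, vocabSize: Int): Array[Double] =
  Array.tabulate(nDk.length) { t =>
    (nDk(t) + alpha) * (nKw(t) + beta) / (nK(t) + vocabSize * beta)
  }
```

A sampler draws the token's new topic proportionally to these weights, then updates the three count structures; the distributed token-topic matrix mentioned in the comment is what makes this step hard to scale.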
[jira] [Created] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
Brenden Matthews created SPARK-3535: --- Summary: Spark on Mesos not correctly setting heap overhead Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
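The proposed fix amounts to requesting heap plus a reserved overhead slice from Mesos rather than the heap size alone. A minimal sketch, loosely modeled on the Hadoop-on-Mesos ResourcePolicy linked above; the function names, the 15% fraction, and the 384 MB floor are illustrative assumptions, not the actual patch in the pull request.

```scala
// Reserve a fraction of executor memory for non-heap JVM usage (stack, GC
// bookkeeping, native allocations), with a floor so small executors still
// get a sane cushion.
val OverheadFraction = 0.15
val MinOverheadMb    = 384

def executorOverheadMb(executorMemoryMb: Int): Int =
  math.max((executorMemoryMb * OverheadFraction).toInt, MinOverheadMb)

// The Mesos task should request heap + overhead, not just the heap size.
def mesosTaskMemoryMb(executorMemoryMb: Int): Int =
  executorMemoryMb + executorOverheadMb(executorMemoryMb)
```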
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134499#comment-14134499 ] Timothy St. Clair commented on SPARK-3535: -- Are you seeing this under fine-grained, coarse-grained, or both? Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134524#comment-14134524 ] Brenden Matthews commented on SPARK-3535: - I'm seeing this in fine-grained mode. I confirmed that Spark sets the heap size equal to the task memory. Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3536) SELECT on empty parquet table throws exception
Michael Armbrust created SPARK-3536: --- Summary: SELECT on empty parquet table throws exception Key: SPARK-3536 URL: https://issues.apache.org/jira/browse/SPARK-3536 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Reported by [~matei]. Reproduce as follows:
{code}
scala> case class Data(i: Int)
defined class Data

scala> createParquetFile[Data]("testParquet")

scala> parquetFile("testParquet").count()
14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due to exception - job: 0
java.lang.NullPointerException
	at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438)
	at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134533#comment-14134533 ] Brenden Matthews commented on SPARK-3535: - I wrote a patch: https://github.com/apache/spark/pull/2401 I'm not very familiar with Spark internals or conventions, so feel free to rip this apart. Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3537) Statistics for cached RDDs
Michael Armbrust created SPARK-3537: --- Summary: Statistics for cached RDDs Key: SPARK-3537 URL: https://issues.apache.org/jira/browse/SPARK-3537 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Right now we only have limited statistics for hive tables. We could easily collect this data when caching an RDD as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3538) Provide way for workers to log messages to driver's out/err
Matthew Farrellee created SPARK-3538: Summary: Provide way for workers to log messages to driver's out/err Key: SPARK-3538 URL: https://issues.apache.org/jira/browse/SPARK-3538 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, Spark Shell Reporter: Matthew Farrellee Priority: Minor As part of SPARK-927 we encountered a use case for the code running on a worker to be able to emit messages back to the driver. The communication channel is for trace/debug messages to an application's (shell or app) user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134592#comment-14134592 ] Andrew Ash commented on SPARK-3535: --- Why does the task need extra memory if the heap size equals the available task memory? Filesystem cache? Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134599#comment-14134599 ] Brenden Matthews commented on SPARK-3535: - The JVM heap size does not include the stack size, GC overhead, or anything malloc'd by other libraries linked into the JVM. Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3518) Remove useless statement in JsonProtocol
[ https://issues.apache.org/jira/browse/SPARK-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3518. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Kousuke Saruta Fixed by: https://github.com/apache/spark/pull/2380 Remove useless statement in JsonProtocol Key: SPARK-3518 URL: https://issues.apache.org/jira/browse/SPARK-3518 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Minor Fix For: 1.1.1, 1.2.0 In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named accumUpdateMap is created as follows.
{code}
val accumUpdateMap = taskInfo.accumulables
{code}
But accumUpdateMap is never used and there is a 2nd invocation of taskInfo.accumulables as follows.
{code}
("Accumulables" -> JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList))
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2505) Weighted Regularizer
[ https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2505: --- Fix Version/s: (was: 1.1.0) 1.2.0 Weighted Regularizer Key: SPARK-2505 URL: https://issues.apache.org/jira/browse/SPARK-2505 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Fix For: 1.2.0 The current implementation of regularization in the linear model uses `Updater`, and this design has a couple of issues, as follows. 1) It will penalize all the weights including the intercept. In the machine learning training process, typically, people don't penalize the intercept. 2) The `Updater` has the logic of adaptive step size for gradient descent, and we would like to clean it up by separating the logic of regularization out from updater to regularizer so in the LBFGS optimizer, we don't need the trick for getting the loss and gradient of the objective function. In this work, a weighted regularizer will be implemented, and users can exclude the intercept or any weight from regularization by setting that term with a zero weighted penalty. Since the regularizer will return a tuple of loss and gradient, the adaptive step size logic, and soft thresholding for L1 in Updater will be moved to the SGD optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
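The proposal above (a regularizer that returns a (loss, gradient) tuple and skips any term given a zero penalty weight, such as the intercept) can be sketched for the L2 case. This is an illustration of the design, not MLlib code; the function name and signature are hypothetical.

```scala
// Weighted L2 regularization: loss = 0.5 * lambda * sum_i w_i * c_i^2,
// gradient_i = lambda * w_i * c_i. Setting penalty(i) = 0.0 excludes that
// coefficient (e.g. the intercept) from regularization entirely.
def weightedL2(coefs: Array[Double], penalty: Array[Double],
               lambda: Double): (Double, Array[Double]) = {
  var loss = 0.0
  val grad = new Array[Double](coefs.length)
  var i = 0
  while (i < coefs.length) {
    loss += 0.5 * lambda * penalty(i) * coefs(i) * coefs(i)
    grad(i) = lambda * penalty(i) * coefs(i)
    i += 1
  }
  (loss, grad)
}
```

Returning the pair directly is what lets an L-BFGS optimizer add it to the data loss and gradient without the step-size tricks baked into Updater.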
[jira] [Updated] (SPARK-2314) RDD actions are only overridden in Scala, not java or python
[ https://issues.apache.org/jira/browse/SPARK-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2314: --- Fix Version/s: (was: 1.1.0) 1.2.0 RDD actions are only overridden in Scala, not java or python Key: SPARK-2314 URL: https://issues.apache.org/jira/browse/SPARK-2314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Aaron Staple Labels: starter Fix For: 1.2.0, 1.0.3 For example, collect and take(). We should keep these two in sync, or move this code to schemaRDD like if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3403: --- Fix Version/s: (was: 1.1.0) 1.2.0 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) - Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.2.0 Attachments: NativeNN.scala Code:
{code}
val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)
{code}
Result: program crashes with: Process finished with exit code -1073741819 (0xC0000005) after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.
[ https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2703: --- Fix Version/s: (was: 1.1.0) 1.2.0 Make Tachyon related unit tests execute without deploying a Tachyon system locally. --- Key: SPARK-2703 URL: https://issues.apache.org/jira/browse/SPARK-2703 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Haoyuan Li Assignee: Rong Gu Fix For: 1.2.0 Use the LocalTachyonCluster class in tachyon-test.jar in 0.5.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3038) delete history server logs when there are too many logs
[ https://issues.apache.org/jira/browse/SPARK-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3038: --- Fix Version/s: (was: 1.1.0) 1.2.0 delete history server logs when there are too many logs Key: SPARK-3038 URL: https://issues.apache.org/jira/browse/SPARK-3038 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: wangfei Fix For: 1.2.0 enhance history server to delete logs automatically:
1. use spark.history.deletelogs.enable to enable this function
2. when the app log count is greater than spark.history.maxsavedapplication, delete the older logs
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
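The proposed behavior can be sketched as a pure selection function: when cleanup is enabled and the number of application logs exceeds the cap, the oldest logs are chosen for deletion. AppLog and logsToDelete are hypothetical names for illustration; the two config properties from the description would simply feed the enabled and maxSaved arguments.

```scala
// A log directory entry: name plus last-modified time in milliseconds.
final case class AppLog(name: String, mtimeMs: Long)

// Returns the logs to delete: everything except the newest `maxSaved`,
// or nothing when the feature is disabled or the cap is not exceeded.
def logsToDelete(logs: Seq[AppLog], enabled: Boolean, maxSaved: Int): Seq[AppLog] =
  if (!enabled || logs.size <= maxSaved) Seq.empty
  else logs.sortBy(_.mtimeMs).dropRight(maxSaved) // keep the newest maxSaved
```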
[jira] [Updated] (SPARK-1832) Executor UI improvement suggestions
[ https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1832: --- Fix Version/s: (was: 1.1.0) 1.2.0 Executor UI improvement suggestions --- Key: SPARK-1832 URL: https://issues.apache.org/jira/browse/SPARK-1832 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Thomas Graves Fix For: 1.2.0 I received some suggestions from a user for the /executors UI page to make it more helpful. This gets more important when you have a really large number of executors. Fill some of the cells with color in order to make it easier to absorb the info, e.g.:
- RED if Failed Tasks greater than 0 (maybe the more failed, the more intense the red)
- GREEN if Active Tasks greater than 0 (maybe more intense the larger the number)
- Possibly color code COMPLETE TASKS using various shades of blue (e.g., based on the log(# completed)) - if dark blue then write the value in white (same for the RED and GREEN above)
- Maybe mark the MASTER task somehow
- Report the TOTALS in each column (do this at the TOP so no need to scroll to the bottom, or print both at top and bottom).
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2167) spark-submit should return exit code based on failure/success
[ https://issues.apache.org/jira/browse/SPARK-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2167: --- Fix Version/s: (was: 1.1.0) 1.2.0 spark-submit should return exit code based on failure/success - Key: SPARK-2167 URL: https://issues.apache.org/jira/browse/SPARK-2167 Project: Spark Issue Type: New Feature Components: Deploy Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Guoqiang Li Fix For: 1.2.0 spark-submit script and Java class should exit with 0 for success and non-zero with failure so that other command line tools and workflow managers (like oozie) can properly tell if the spark app succeeded or failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
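The requested behavior is simply to map application success or failure onto a process exit status so tools like Oozie can branch on it. A trivial sketch of the convention (names are illustrative, not the spark-submit code):

```scala
import scala.util.{Failure, Success, Try}

// 0 for success, non-zero for failure: the contract workflow managers expect.
def exitCodeFor(result: Try[Unit]): Int = result match {
  case Success(_) => 0
  case Failure(_) => 1
}

// The launcher would end with something like: sys.exit(exitCodeFor(runApp()))
```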
[jira] [Updated] (SPARK-2754) Document standalone-cluster mode now that it's working
[ https://issues.apache.org/jira/browse/SPARK-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2754: --- Fix Version/s: (was: 1.1.0) 1.2.0 Document standalone-cluster mode now that it's working -- Key: SPARK-2754 URL: https://issues.apache.org/jira/browse/SPARK-2754 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.2.0 This was previously broken before SPARK-2260, so we (attempted to) remove all documentation related to this mode. We should add it back now that we have fixed it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2947) DAGScheduler resubmit the stage into an infinite loop
[ https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2947: --- Fix Version/s: (was: 1.1.0) 1.2.0 DAGScheduler resubmit the stage into an infinite loop - Key: SPARK-2947 URL: https://issues.apache.org/jira/browse/SPARK-2947 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.2 Reporter: Guoqiang Li Priority: Blocker Fix For: 1.2.0, 1.0.3 The stage is resubmitted more than 5 times. This seems to be caused by {{FetchFailed.bmAddress}} being null. I don't know how to reproduce it. master log:
{noformat}
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 3060 bytes in 0 ms
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 3060 bytes in 0 ms
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141)
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
-- 5 times ---
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 1.189, whose tasks have all completed, from pool
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 ms on jilin (progress: 280/280)
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 269)
14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 2.1, whose tasks have all completed, from pool
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at DealCF.scala:207) finished in 129.544 s
{noformat}
worker log:
{noformat}
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, computing it
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, computing it
14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18017
14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18151
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, computing it
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, computing it
14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18285
14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18419
14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
{noformat}
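The infinite loop described in this issue could be bounded by tracking how many times each stage has been resubmitted and aborting the job once a limit is hit. A minimal sketch of that guard (in Java for illustration; Spark's DAGScheduler is Scala, and the class and method names here are hypothetical, not Spark's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard against unbounded stage resubmission: after
// MAX_RESUBMISSIONS fetch failures for the same stage, abort the job
// instead of resubmitting yet again.
public class ResubmissionGuard {
    private static final int MAX_RESUBMISSIONS = 5;
    private final Map<Integer, Integer> attempts = new HashMap<>();

    /** Returns true if the stage may be resubmitted, false if the job should be aborted. */
    public boolean allowResubmit(int stageId) {
        int n = attempts.merge(stageId, 1, Integer::sum);
        return n <= MAX_RESUBMISSIONS;
    }
}
```

The scheduler would consult `allowResubmit(stageId)` each time a fetch failure marks a stage for resubmission, failing the job with a clear error once the count is exhausted rather than looping.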
[jira] [Updated] (SPARK-2638) Improve concurrency of fetching Map outputs
[ https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2638: --- Fix Version/s: (was: 1.1.0) 1.2.0 Improve concurrency of fetching Map outputs --- Key: SPARK-2638 URL: https://issues.apache.org/jira/browse/SPARK-2638 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Environment: All Reporter: Stephen Boesch Assignee: Josh Rosen Priority: Minor Labels: MapOutput, concurrency Fix For: 1.2.0 Original Estimate: 0h Remaining Estimate: 0h This issue was noticed while perusing the MapOutputTracker source code. Notice that the synchronization is on the containing {{fetching}} collection, which makes ALL fetches wait if any fetch is occurring. The fix is to synchronize instead on the shuffleId (interned as a string to ensure JVM-wide visibility).
{code}
def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = {
  val statuses = mapStatuses.get(shuffleId).orNull
  if (statuses == null) {
    logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
    var fetchedStatuses: Array[MapStatus] = null
    fetching.synchronized {                     // This is existing code
    // shuffleId.toString.intern.synchronized { // New Code
      if (fetching.contains(shuffleId)) {
        // Someone else is fetching it; wait for them to be done
        while (fetching.contains(shuffleId)) {
          try {
            fetching.wait()
          } catch {
            case e: InterruptedException =>
          }
        }
      }
{code}
This is only a small code change, but the test cases to prove (a) proper functionality and (b) proper performance improvement are not so trivial. For (b) it is not worthwhile to add a test case to the codebase. Instead I have added a git project that demonstrates the concurrency/performance improvement using the fine-grained approach. The github project is at https://github.com/javadba/scalatesting.git . Simply run sbt test.
Note: it is unclear how/where to include this ancillary testing/verification information, since it will not be included in the git PR. I am open to any suggestions - even as far as simply removing references to it.
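The fine-grained approach the reporter suggests (synchronizing on `shuffleId.toString.intern`) achieves per-shuffle locking, but lock striping with a dedicated lock object per key avoids leaning on the global string intern pool. A sketch of that alternative (in Java for illustration; Spark's MapOutputTracker is Scala, and these names are illustrative, not Spark's):

```java
import java.util.concurrent.ConcurrentHashMap;

// Per-shuffle lock objects, so a fetch for one shuffleId never blocks
// fetches for other shuffle ids (unlike synchronizing on the shared
// `fetching` collection, which serializes ALL fetches).
public class ShuffleLocks {
    private final ConcurrentHashMap<Integer, Object> locks = new ConcurrentHashMap<>();

    /** Returns the lock object unique to this shuffleId, creating it on first use. */
    public Object lockFor(int shuffleId) {
        return locks.computeIfAbsent(shuffleId, id -> new Object());
    }
}
```

Callers would then write `synchronized (shuffleLocks.lockFor(shuffleId)) { ... }` around the wait-and-fetch logic, giving the same mutual exclusion per shuffle without a process-wide bottleneck.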
[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1911: --- Fix Version/s: (was: 1.1.0) 1.2.0 Warn users if their assembly jars are not built with Java 6 --- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.2.0 The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. This is an issue especially for PySpark on YARN, which relies on the python files included within the assembly jar. Currently we warn users only in make-distribution.sh, but most users build the jars directly. At the very least we need to emphasize this in the docs (currently missing entirely). The next step is to add a warning prompt in the mvn scripts whenever Java 7+ is detected.
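Which Java release a jar targets can be read straight out of its class files: every class file begins with the magic number 0xCAFEBABE followed by a minor and major version, and Java 6 compilers emit a major version of at most 50 (Java 7 emits 51). A hedged sketch of that check (a warning tool could apply it to entries in the assembly jar; the class name is hypothetical):

```java
// Reads the class-file major version from raw bytes: magic 0xCAFEBABE
// (4 bytes), then minor version (2 bytes), then major version (2 bytes).
// Major version 50 = Java 6, 51 = Java 7.
public class ClassVersion {
    public static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8
                || (classBytes[0] & 0xFF) != 0xCA || (classBytes[1] & 0xFF) != 0xFE
                || (classBytes[2] & 0xFF) != 0xBA || (classBytes[3] & 0xFF) != 0xBE) {
            throw new IllegalArgumentException("not a class file");
        }
        return ((classBytes[6] & 0xFF) << 8) | (classBytes[7] & 0xFF);
    }

    /** True if the class is loadable on a Java 6 runtime. */
    public static boolean builtForJava6(byte[] classBytes) {
        return majorVersion(classBytes) <= 50;
    }
}
```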
[jira] [Updated] (SPARK-2793) Correctly lock directory creation in DiskBlockManager.getFile
[ https://issues.apache.org/jira/browse/SPARK-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2793: --- Fix Version/s: (was: 1.1.0) 1.2.0 Correctly lock directory creation in DiskBlockManager.getFile - Key: SPARK-2793 URL: https://issues.apache.org/jira/browse/SPARK-2793 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Matei Zaharia Fix For: 1.2.0
[jira] [Updated] (SPARK-2069) MIMA false positives (umbrella)
[ https://issues.apache.org/jira/browse/SPARK-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2069: --- Fix Version/s: (was: 1.1.0) 1.2.0 MIMA false positives (umbrella) --- Key: SPARK-2069 URL: https://issues.apache.org/jira/browse/SPARK-2069 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 Since we started using MIMA more actively in core we've been running into situations where we get false positives. We should address these ASAP, as they require manual excludes in our build files, which is pretty tedious.
[jira] [Updated] (SPARK-1830) Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
[ https://issues.apache.org/jira/browse/SPARK-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1830: --- Fix Version/s: (was: 1.1.0) 1.2.0 Deploy failover, Make Persistence engine and LeaderAgent Pluggable. --- Key: SPARK-1830 URL: https://issues.apache.org/jira/browse/SPARK-1830 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Prashant Sharma Fix For: 1.0.1, 1.2.0 With the current code base it is difficult to plug in an external, user-specified persistence engine or election agent. It would be good to expose this as a pluggable API.
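A pluggable persistence API of the kind this issue asks for might take the shape of a small interface plus interchangeable implementations (in-memory, ZooKeeper, filesystem, ...). The following is a hypothetical sketch in Java (Spark's real PersistenceEngine is a Scala trait; none of these names are guaranteed to match what Spark adopted):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical pluggable API for persisting master failover state.
interface PersistenceEngine {
    void persist(String name, byte[] data);
    void unpersist(String name);
    Optional<byte[]> read(String name);
}

// Trivial in-memory implementation, useful as a default and in tests;
// a user-supplied engine (ZooKeeper, a database, ...) would implement
// the same interface and be selected via configuration.
class InMemoryPersistenceEngine implements PersistenceEngine {
    private final Map<String, byte[]> store = new HashMap<>();
    public void persist(String name, byte[] data) { store.put(name, data); }
    public void unpersist(String name) { store.remove(name); }
    public Optional<byte[]> read(String name) {
        return Optional.ofNullable(store.get(name));
    }
}
```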
[jira] [Updated] (SPARK-2795) Improve DiskBlockObjectWriter API
[ https://issues.apache.org/jira/browse/SPARK-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2795: --- Fix Version/s: (was: 1.1.0) 1.2.0 Improve DiskBlockObjectWriter API - Key: SPARK-2795 URL: https://issues.apache.org/jira/browse/SPARK-2795 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Matei Zaharia Fix For: 1.2.0
[jira] [Updated] (SPARK-1706) Allow multiple executors per worker in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1706: --- Fix Version/s: (was: 1.1.0) 1.2.0 Allow multiple executors per worker in Standalone mode -- Key: SPARK-1706 URL: https://issues.apache.org/jira/browse/SPARK-1706 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Patrick Wendell Assignee: Nan Zhu Fix For: 1.2.0 Right now if people want to launch multiple executors on each machine they need to start multiple standalone workers. This is not too difficult, but it means you have extra JVMs sitting around. We should just allow users to set the number of cores they want per executor in standalone mode and then allow packing multiple executors on each node. This would make standalone mode more consistent with YARN in the way you request resources. It's not too big of a change as far as I can see. You'd need to: 1. Introduce a configuration for how many cores you want per executor. 2. Change the scheduling logic in Master.scala to take this into account. 3. Change CoarseGrainedSchedulerBackend to not assume a 1-1 correspondence between hosts and executors. And maybe modify a few other places.
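The scheduling change in step 2 essentially reduces to a packing rule: a worker can host as many executors of the configured size as its free cores allow, capped by the application's remaining core budget. A small sketch of that rule (Java for illustration; Spark's Master is Scala, and this helper is hypothetical):

```java
// Hypothetical packing rule for standalone mode: how many executors of
// `coresPerExecutor` cores fit on a worker with `freeCores` free cores,
// without exceeding the cores the application is still allowed to use.
public class ExecutorPacking {
    public static int executorsOnWorker(int freeCores, int coresPerExecutor,
                                        int coresLeftForApp) {
        int byWorker = freeCores / coresPerExecutor;   // limited by the worker
        int byApp = coresLeftForApp / coresPerExecutor; // limited by the app budget
        return Math.min(byWorker, byApp);
    }
}
```

With `coresPerExecutor` equal to the worker's core count this degenerates to today's one-executor-per-worker behavior, which is one reason the change is not too invasive.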
[jira] [Updated] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC
[ https://issues.apache.org/jira/browse/SPARK-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1989: --- Fix Version/s: (was: 1.1.0) 1.2.0 Exit executors faster if they get into a cycle of heavy GC -- Key: SPARK-1989 URL: https://issues.apache.org/jira/browse/SPARK-1989 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Fix For: 1.2.0 I've seen situations where an application is allocating too much memory across its tasks + cache to proceed, but Java gets into a cycle where it repeatedly runs full GCs, frees up a bit of the heap, and continues instead of giving up. This then leads to timeouts and confusing error messages. It would be better to crash with an OOM sooner. The JVM has options to support this: http://java.dzone.com/articles/tracking-excessive-garbage. The right solution would probably be: - Add some config options used by spark-submit to set -XX:GCTimeLimit and -XX:GCHeapFreeLimit, with more conservative values than the defaults (e.g. 90% time limit, 5% free limit) - Make sure we pass these into the Java options for executors in each deployment mode
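The policy behind those two flags can be stated as a simple predicate: give up when more than the time limit's share of recent wall time went to GC while less than the free limit's share of the heap was recovered. A sketch of the policy only (not of how HotSpot implements it; the class and parameter names are illustrative):

```java
// The OOM policy the issue proposes, as a pure predicate. Mirrors the
// meaning of -XX:GCTimeLimit (percent of time in GC) and
// -XX:GCHeapFreeLimit (percent of heap free after a full GC); the 90/5
// thresholds come from the issue text.
public class GcOverheadPolicy {
    public static boolean shouldGiveUp(double gcTimeFraction, double heapFreeFraction,
                                       double gcTimeLimitPct, double heapFreeLimitPct) {
        return gcTimeFraction * 100.0 > gcTimeLimitPct
            && heapFreeFraction * 100.0 < heapFreeLimitPct;
    }
}
```

In practice the flags would be delivered via executor Java options (for example through `spark.executor.extraJavaOptions`, though the exact plumbing per deploy mode is what the issue asks to implement), letting the JVM itself throw the OutOfMemoryError early.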
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Fix Version/s: (was: 1.1.0) 1.2.0 Standalone Worker cleanup should not clean up running executors --- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.2.0 The default values of the standalone worker cleanup code clean up all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executors' log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default.
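The fix amounts to adding one more condition to the cleanup loop: a directory's TTL being expired is necessary but not sufficient; the application must also have no executor still running on that worker. A hedged sketch of that guard (hypothetical names, not the Worker's actual code):

```java
import java.util.Set;

// Hypothetical guard for the standalone worker cleanup loop: an
// application directory is deletable only if its TTL has expired AND no
// executor of that application is still running on this worker.
public class WorkerCleanup {
    public static boolean shouldClean(String appDir, long lastModifiedMs, long nowMs,
                                      long ttlMs, Set<String> appsWithRunningExecutors) {
        boolean expired = nowMs - lastModifiedMs > ttlMs;
        boolean stillRunning = appsWithRunningExecutors.contains(appDir);
        return expired && !stillRunning;
    }
}
```

A long-lived streaming job's directory thus survives every cleanup pass until its executors actually exit, which is exactly the behavior the issue asks for.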
[jira] [Updated] (SPARK-1924) Make local:/ scheme work in more deploy modes
[ https://issues.apache.org/jira/browse/SPARK-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1924: --- Fix Version/s: (was: 1.1.0) 1.2.0 Make local:/ scheme work in more deploy modes - Key: SPARK-1924 URL: https://issues.apache.org/jira/browse/SPARK-1924 Project: Spark Issue Type: Sub-task Components: Deploy Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Minor Fix For: 1.2.0 A resource marked local:/ is assumed to be available on the local file system of every node. In that case, no data should be copied over the network. I tested different deploy modes in v1.0; right now we only support the local:/ scheme for the app jar and secondary jars in the following modes: 1) local (jars are copied to the working directory) 2) standalone client 3) yarn client It doesn’t work for --files and python apps (--py-files and app.py). For the next release, we could support more deploy modes.
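The ship-or-don't-ship decision described here comes down to inspecting the resource URI's scheme. A sketch of that dispatch (Java for illustration; the class name is hypothetical and this is not Spark's actual resolution code):

```java
import java.net.URI;

// Sketch of the dispatch the issue describes: resources with the
// "local" URI scheme are assumed present on every node's file system
// and are never shipped; anything else (file, hdfs, http, or a bare
// path with no scheme) may need to be copied to the nodes.
public class ResourceLocality {
    public static boolean needsDistribution(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null || !scheme.equals("local");
    }
}
```

Supporting more deploy modes then means applying this same check to --files and Python resources (--py-files, app.py), not just to the app jar and secondary jars.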
[jira] [Updated] (SPARK-1379) Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects.
[ https://issues.apache.org/jira/browse/SPARK-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1379: --- Fix Version/s: (was: 1.1.0) 1.2.0 Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects. --- Key: SPARK-1379 URL: https://issues.apache.org/jira/browse/SPARK-1379 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Fix For: 1.2.0 Since rows aren't black boxes we could use InMemoryColumnarTableScan. This would significantly reduce GC pressure on the workers.
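The GC-pressure point is that columnar caching stores a table's values in a handful of flat primitive arrays instead of one heap object per row. A toy illustration of the idea (Java for illustration; this is not InMemoryColumnarTableScan, just the row-to-column transposition it relies on, with hypothetical names):

```java
import java.util.List;

// Toy columnar cache for an (id, score) table: two primitive arrays
// instead of rows.size() small row objects, which is what removes the
// per-row GC pressure the issue mentions.
public class ColumnarCache {
    public final int[] ids;
    public final double[] scores;

    private ColumnarCache(int[] ids, double[] scores) {
        this.ids = ids;
        this.scores = scores;
    }

    /** Packs (id, score) rows, given as {Integer, Double} pairs, into column arrays. */
    public static ColumnarCache fromRows(List<Object[]> rows) {
        int[] ids = new int[rows.size()];
        double[] scores = new double[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            ids[i] = (Integer) rows.get(i)[0];
            scores[i] = (Double) rows.get(i)[1];
        }
        return new ColumnarCache(ids, scores);
    }
}
```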
[jira] [Updated] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1684: --- Fix Version/s: (was: 1.1.0) 1.2.0 Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Fix For: 1.2.0 If users write "[SPARK-XXX] Issue", "SPARK-XXX. Issue", or "SPARK XXX: Issue", we should convert it to "SPARK-XXX: Issue"
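The normalization rule itself is a single regex rewrite. A sketch in Java for illustration (Spark's merge script is a Python tool, so this is only the rule, not the script; the class name is hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Normalizes PR titles like "[SPARK-1234] Issue", "SPARK-1234. Issue",
// or "SPARK 1234: Issue" to the canonical "SPARK-1234: Issue".
public class TitleNormalizer {
    private static final Pattern PREFIX =
        Pattern.compile("^\\[?SPARK[- ](\\d+)\\]?[:.]?\\s*(.*)$");

    public static String normalize(String title) {
        Matcher m = PREFIX.matcher(title);
        if (!m.matches()) {
            return title; // no recognizable prefix; leave untouched
        }
        return "SPARK-" + m.group(1) + ": " + m.group(2);
    }
}
```

The rewrite is idempotent: a title already in canonical form maps to itself, so the merge script can apply it unconditionally.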
[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1853: --- Fix Version/s: (was: 1.1.0) 1.2.0 Show Streaming application code context (file, line number) in Spark Stages UI -- Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Mubarak Seyed Fix For: 1.2.0 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png Right now, the code context (file and line number) shown for streaming jobs in the stages UI is meaningless, as it refers to a random line of internal DStream code rather than the user application file.
[jira] [Updated] (SPARK-1201) Do not materialize partitions whenever possible in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1201: --- Fix Version/s: (was: 1.1.0) 1.2.0 Do not materialize partitions whenever possible in BlockManager --- Key: SPARK-1201 URL: https://issues.apache.org/jira/browse/SPARK-1201 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Reporter: Patrick Wendell Assignee: Andrew Or Fix For: 1.2.0 This is a slightly more complex version of SPARK-942, where we try to avoid unrolling iterators in other situations where it is possible. SPARK-942 focused on the case where the DISK_ONLY storage level was used. There are other cases though, such as when data is stored serialized in memory but there is not enough memory left to store the RDD.
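The core technique is size-bounded unrolling: pull values from the iterator one at a time, track an estimated size, and bail out before exhausting memory instead of materializing the whole partition first. A simplified sketch (Java for illustration; Spark's actual logic lives in the BlockManager/MemoryStore and uses real size estimation, so the names and the flat per-value size here are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Size-bounded unrolling: materialize the iterator only while the
// estimated size stays under budget; otherwise give up, letting the
// caller fall back to disk or recomputation instead of OOM-ing.
public class BoundedUnroll {
    public static <T> Optional<List<T>> unroll(Iterator<T> values,
                                               long bytesPerValue, long maxBytes) {
        List<T> buffer = new ArrayList<>();
        long estimate = 0;
        while (values.hasNext()) {
            buffer.add(values.next());
            estimate += bytesPerValue;
            if (estimate > maxBytes) {
                return Optional.empty(); // budget exceeded; do not materialize
            }
        }
        return Optional.of(buffer);
    }
}
```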
[jira] [Updated] (SPARK-2159) Spark shell exit() does not stop SparkContext
[ https://issues.apache.org/jira/browse/SPARK-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2159: --- Fix Version/s: (was: 1.1.0) 1.2.0 Spark shell exit() does not stop SparkContext - Key: SPARK-2159 URL: https://issues.apache.org/jira/browse/SPARK-2159 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Priority: Minor Fix For: 1.2.0 If you type exit() in the Spark shell, it is equivalent to Ctrl+C and does not stop the SparkContext. exit() is very commonly used to leave a shell, and it would be good if it were equivalent to Ctrl+D instead, which does stop the SparkContext.