[jira] [Commented] (SPARK-5649) Throw exception when can not apply datatype cast
[ https://issues.apache.org/jira/browse/SPARK-5649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319724#comment-14319724 ] Michael Armbrust commented on SPARK-5649:

https://github.com/apache/spark/pull/4558

Throw exception when can not apply datatype cast
Key: SPARK-5649
URL: https://issues.apache.org/jira/browse/SPARK-5649
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei
Fix For: 1.3.0

Throw an exception when a datatype cast cannot be applied, to inform the user of the cast issue in their SQL.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5649) Throw exception when can not apply datatype cast
[ https://issues.apache.org/jira/browse/SPARK-5649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5649.
Resolution: Fixed
Fix Version/s: 1.3.0
Assignee: wangfei

Throw exception when can not apply datatype cast
Key: SPARK-5649
URL: https://issues.apache.org/jira/browse/SPARK-5649
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei
Assignee: wangfei
Fix For: 1.3.0

Throw an exception when a datatype cast cannot be applied, to inform the user of the cast issue in their SQL.
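The behavior the ticket asks for, failing fast instead of leaving an unresolvable cast in the query plan, can be illustrated with a toy type checker. This is only a hypothetical sketch: the enum and the castability table below are invented for illustration and are not Spark SQL's actual type lattice or Cast implementation.

```java
// Toy model of "throw when the cast cannot be applied" (hypothetical types,
// not Spark SQL's real ones).
public class CastCheck {
    enum DataType { INT, STRING, BINARY, MAP }

    // Assumed castability rules, for illustration only.
    static boolean canCast(DataType from, DataType to) {
        if (from == to) return true;
        switch (from) {
            case INT:    return to == DataType.STRING;
            case STRING: return to == DataType.INT || to == DataType.BINARY;
            default:     return false; // e.g. MAP cannot be cast to anything else
        }
    }

    // Analysis-time check: reject the plan with an exception rather than
    // silently keeping an inapplicable cast.
    static void resolveCast(DataType from, DataType to) {
        if (!canCast(from, to)) {
            throw new IllegalArgumentException("cannot cast " + from + " to " + to);
        }
    }
}
```

The point is only where the error surfaces: at analysis time with an explicit message, instead of later (or never) with a confusing result.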
[jira] [Created] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
Littlestar created SPARK-5795:
Summary: api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
Key: SPARK-5795
URL: https://issues.apache.org/jira/browse/SPARK-5795
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

The following code can't compile in Java:

JavaPairDStream<Integer, Integer> rs = ...
rs.saveAsNewAPIHadoopFiles(prefix, "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf);

but similar code with JavaPairRDD works OK:

JavaPairRDD<String, String> counts = ...
counts.saveAsNewAPIHadoopFile(out, Text.class, Text.class, TextOutputFormat.class, jobConf);

Maybe change

def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}

to

def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}
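The conflict here is between a wildcard-typed `Class` parameter and a method-level type variable. Below is a minimal, self-contained Java model of the two signature styles; the classes are hypothetical stand-ins, not the real Hadoop/Spark ones. The type-variable form (as in `JavaPairRDD.saveAsNewAPIHadoopFile`) lets javac infer `F` from a raw class literal such as `TextOutputFormat.class`; the wildcard form mirrors the `JavaPairDStream` signature the ticket reports as unusable from Java.

```java
// Toy stand-ins: Format plays the role of NewOutputFormat, TextFormat of
// TextOutputFormat (note it is itself generic, like the real class).
interface Format<K, V> {}
class TextFormat<K, V> implements Format<K, V> {}

public class SaveSignatures {
    // Mirrors Class[F] with F <: NewOutputFormat[_, _] - callable from Java:
    // F is inferred as the raw TextFormat (with an unchecked warning).
    static <F extends Format<?, ?>> String saveTypeVar(Class<F> cls) {
        return "saving with " + cls.getSimpleName();
    }

    // Mirrors Class[_ <: NewOutputFormat[_, _]] - the shape the ticket
    // reports as not compiling from Java when the format class is generic.
    static String saveWildcard(Class<? extends Format<?, ?>> cls) {
        return "saving with " + cls.getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(saveTypeVar(TextFormat.class)); // compiles
        // saveWildcard(TextFormat.class) is the problematic call shape.
    }
}
```

The change the reporter suggests is exactly this move: lift the bound out of the parameter's wildcard into a method type parameter `F`.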
[jira] [Commented] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319769#comment-14319769 ] Littlestar commented on SPARK-5795:

org.apache.spark.api.java.JavaPairRDD[K, V]
{noformat}
/** Output the RDD to any Hadoop-supported file system. */
def saveAsHadoopFile[F <: OutputFormat[_, _]](
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F],
    conf: JobConf) {
  rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)
}

/** Output the RDD to any Hadoop-supported file system. */
def saveAsHadoopFile[F <: OutputFormat[_, _]](
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F]) {
  rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass)
}

/** Output the RDD to any Hadoop-supported file system, compressing with the supplied codec. */
def saveAsHadoopFile[F <: OutputFormat[_, _]](
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F],
    codec: Class[_ <: CompressionCodec]) {
  rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass, codec)
}

/** Output the RDD to any Hadoop-supported file system. */
def saveAsNewAPIHadoopFile[F <: NewOutputFormat[_, _]](
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F],
    conf: Configuration) {
  rdd.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)
}

/**
 * Output the RDD to any Hadoop-supported storage system, using
 * a Configuration object for that storage system.
 */
def saveAsNewAPIHadoopDataset(conf: Configuration) {
  rdd.saveAsNewAPIHadoopDataset(conf)
}

/** Output the RDD to any Hadoop-supported file system. */
def saveAsNewAPIHadoopFile[F <: NewOutputFormat[_, _]](
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F]) {
  rdd.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass)
}
{noformat}

org.apache.spark.streaming.api.java.JavaPairDStream[K, V]
{noformat}
/**
 * Save each RDD in `this` DStream as a Hadoop file. The file name at each batch interval is
 * generated based on `prefix` and `suffix`: prefix-TIME_IN_MS.suffix.
 */
def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String) {
  dstream.saveAsHadoopFiles(prefix, suffix)
}

/**
 * Save each RDD in `this` DStream as a Hadoop file. The file name at each batch interval is
 * generated based on `prefix` and `suffix`: prefix-TIME_IN_MS.suffix.
 */
def saveAsHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: OutputFormat[_, _]]) {
  dstream.saveAsHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass)
}

/**
 * Save each RDD in `this` DStream as a Hadoop file. The file name at each batch interval is
 * generated based on `prefix` and `suffix`: prefix-TIME_IN_MS.suffix.
 */
def saveAsHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: OutputFormat[_, _]],
    conf: JobConf) {
  dstream.saveAsHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}

/**
 * Save each RDD in `this` DStream as a Hadoop file. The file name at each batch interval is
 * generated based on `prefix` and `suffix`: prefix-TIME_IN_MS.suffix.
 */
def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[K, V]](prefix: String, suffix: String) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix)
}

/**
 * Save each RDD in `this` DStream as a Hadoop file. The file name at each batch interval is
 * generated based on `prefix` and `suffix`: prefix-TIME_IN_MS.suffix.
 */
def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]]) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass)
}

/**
 * Save each RDD in `this` DStream as a Hadoop file. The file name at each batch interval is
 * generated based on `prefix` and `suffix`: prefix-TIME_IN_MS.suffix.
 */
def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}
{noformat}

api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319777#comment-14319777 ] Sam Halliday commented on SPARK-3785:

Hi all, just joining the thread :-) I'm the author of netlib-java. I recommend watching my ScalaX talk http://fommil.github.io/scalax14/#/ for anybody who hasn't seen it yet. I talk about beyond-CPU acceleration in the last few slides (just after the Breeze examples).

In my decade of industrial experience with these things, the GPU is a *lot* faster than the CPU for large matrix operations, but slower for smaller ones (1000 elements or less). Typically, operations that are highly parallelisable, such as matrix multiplication, have a constant time cost rather than one linear in the number of elements.

However, the big problem with GPUs is memory management. If you have a problem that you're happy to solve entirely on the GPU, you're going to get great performance at the cost of less portability... a major consideration for a JVM based application. The trick is minimising how much data you need to transmit between the traditional CPU memory space and the GPU memory space, and further optimisations can be obtained by using the GPU profilers that come with the card. It is for this reason that GPU-backed implementations of BLAS/LAPACK can only match, but not surpass, the performance of Intel MKL.

There exist BLAS-like and LAPACK-like implementations for GPUs (e.g. cuBLAS, clBLAS) but they can only be used when you hold pointers to the GPU memory regions, and they are not good for use from Java/Scala (unless you are using macros/code generators to really generate native code). I have links with FPGA companies and I'd love to see a full BLAS implementation using that custom hardware... but it's such a mammoth task that the FPGA implementors (not me) would need to be funded to do it.

I am very hopeful about the cutting edge commodity tech coming from Intel/AMD (e.g. APUs) which allows the CPU and GPU to share the memory region. I would love to buy one of these machines and write a minimal BLAS implementation to do some benchmarks and see if we can get GPU performance without the memory transfer overhead. My project https://github.com/fommil/multiblas (which was abandoned until the tech caught up) would be a perfect place to do this, and it would involve only runtime changes for Spark users to benefit. But, to be honest, I'd probably need funding to turn my attention to this because I've got a few other personal priorities at the moment. I've heard the raspberry pi has such a shared region. It might be interesting to use it as a cheapo dev environment.

Support off-loading computations to a GPU
Key: SPARK-3785
URL: https://issues.apache.org/jira/browse/SPARK-3785
Project: Spark
Issue Type: Brainstorming
Components: MLlib
Reporter: Thomas Darimont
Priority: Minor

Are there any plans to add support for off-loading computations to the GPU, e.g. via an OpenCL binding?
http://www.jocl.org/
https://code.google.com/p/javacl/
http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL
[jira] [Updated] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5795:
Priority: Minor (was: Major)

When you say it doesn't compile, you should show the compilation error, although I think I know what it is. There's a workaround, but I agree we can look at fixing it. If it breaks binary compatibility, it would have to wait until later.

api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
Key: SPARK-5795
URL: https://issues.apache.org/jira/browse/SPARK-5795
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

The following code can't compile in Java:

JavaPairDStream<Integer, Integer> rs = ...
rs.saveAsNewAPIHadoopFiles(prefix, "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf);

but similar code with JavaPairRDD works OK:

JavaPairRDD<String, String> counts = ...
counts.saveAsNewAPIHadoopFile(out, Text.class, Text.class, TextOutputFormat.class, jobConf);

Maybe change

def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}

to

def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}
[jira] [Updated] (SPARK-5728) MQTTStreamSuite leaves behind ActiveMQ database files
[ https://issues.apache.org/jira/browse/SPARK-5728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5728:
Fix Version/s: 1.2.2

MQTTStreamSuite leaves behind ActiveMQ database files
Key: SPARK-5728
URL: https://issues.apache.org/jira/browse/SPARK-5728
Project: Spark
Issue Type: Bug
Components: Streaming, Tests
Affects Versions: 1.2.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Trivial
Fix For: 1.3.0, 1.2.2

I've seen this several times and finally wanted to fix it: {{MQTTStreamSuite}} uses a local ActiveMQ broker that creates a working dir for its database in the {{external/mqtt}} directory called {{activemq}}. This doesn't get cleaned up, at least often it does not for me. It's trivial to set it to use a temp directory, which the test harness does clean up.
[jira] [Resolved] (SPARK-4832) some other processes might take the daemon pid
[ https://issues.apache.org/jira/browse/SPARK-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4832.
Resolution: Fixed
Fix Version/s: 1.2.2, 1.3.0

Issue resolved by pull request 3683 [https://github.com/apache/spark/pull/3683]

some other processes might take the daemon pid
Key: SPARK-4832
URL: https://issues.apache.org/jira/browse/SPARK-4832
Project: Spark
Issue Type: Bug
Components: Deploy
Reporter: Tao Wang
Priority: Minor
Fix For: 1.3.0, 1.2.2

Some other processes might use the pid saved in the pid file. In that case we should ignore it and launch the daemons.
[jira] [Commented] (SPARK-5726) Hadamard Vector Product Transformer
[ https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319839#comment-14319839 ] Sean Owen commented on SPARK-5726:

Go ahead and change it; my guess is that Xiangrui is OK with that, but he can comment too.

Hadamard Vector Product Transformer
Key: SPARK-5726
URL: https://issues.apache.org/jira/browse/SPARK-5726
Project: Spark
Issue Type: Improvement
Components: ML, MLlib
Reporter: Octavian Geagla
Assignee: Octavian Geagla

I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html

A draft of this feature is implemented, documented, and tested already. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting

I'm curious if there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there are not any examples which use Vectors.
[jira] [Updated] (SPARK-4832) some other processes might take the daemon pid
[ https://issues.apache.org/jira/browse/SPARK-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4832:
Assignee: Tao Wang

some other processes might take the daemon pid
Key: SPARK-4832
URL: https://issues.apache.org/jira/browse/SPARK-4832
Project: Spark
Issue Type: Bug
Components: Deploy
Reporter: Tao Wang
Assignee: Tao Wang
Priority: Minor
Fix For: 1.3.0, 1.2.2

Some other processes might use the pid saved in the pid file. In that case we should ignore it and launch the daemons.
[jira] [Updated] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4631:
Target Version/s: (was: 1.3.0)
Fix Version/s: 1.2.2

Add real unit test for MQTT
Key: SPARK-4631
URL: https://issues.apache.org/jira/browse/SPARK-4631
Project: Spark
Issue Type: Test
Components: Streaming
Reporter: Tathagata Das
Priority: Critical
Fix For: 1.3.0, 1.2.2

A real unit test that actually transfers data, to ensure that the MQTTUtil is functional.
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319948#comment-14319948 ] Dr. Christian Betz commented on SPARK-5081:

From SPARK-5715: I see a *factor four performance loss* in my Spark jobs when migrating from Spark 1.1.0 to Spark 1.2.0 or 1.2.1. Also, I see an *increase in the size of shuffle writes* (which is also reported by Kevin Jung on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-write-increases-in-spark-1-2-tt20894.html). Together with this I experience a *huge number of disk spills*.

I'm experiencing these with my job under the following circumstances:
* Spark 1.2.0 with Sort-based Shuffle
* Spark 1.2.0 with Hash-based Shuffle
* Spark 1.2.1 with Sort-based Shuffle

All three combinations show the same behavior, which contrasts with Spark 1.1.0. In Spark 1.1.0 my job runs for about an hour; in Spark 1.2.x it runs for almost four hours. Configuration is identical otherwise - I only added org.apache.spark.scheduler.CompressedMapStatus to the Kryo registrator for Spark 1.2.0 to cope with https://issues.apache.org/jira/browse/SPARK-5102. As a consequence (I think, but the causality might be different) I see lots and lots of disk spills. I cannot provide a small test case, but maybe the log entries for a single worker thread can help someone investigate this. (See below.) I will also open up an issue, if nobody stops me by providing an answer ;) Any help will be greatly appreciated, because otherwise I'm stuck with Spark 1.1.0, as quadrupling the runtime is not an option.
Sincerely, Chris

2015-02-09T14:06:06.328+01:00 INFO org.apache.spark.executor.Executor Running task 9.0 in stage 18.0 (TID 300) Executor task launch worker-18
2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.CacheManager Partition rdd_35_9 not found, computing it Executor task launch worker-18
2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty blocks out of 10 blocks Executor task launch worker-18
2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms Executor task launch worker-18
2015-02-09T14:06:07.396+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(2582904) called with curMem=300174944, maxMe... Executor task launch worker-18
2015-02-09T14:06:07.397+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_35_9 stored as bytes in memory (estimated size 2.5... Executor task launch worker-18
2015-02-09T14:06:07.398+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_35_9 Executor task launch worker-18
2015-02-09T14:06:07.399+01:00 INFO org.apache.spark.CacheManager Partition rdd_38_9 not found, computing it Executor task launch worker-18
2015-02-09T14:06:07.399+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty blocks out of 10 blocks Executor task launch worker-18
2015-02-09T14:06:07.400+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms Executor task launch worker-18
2015-02-09T14:06:07.567+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(944848) called with curMem=302757848, maxMem... Executor task launch worker-18
2015-02-09T14:06:07.568+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_38_9 stored as values in memory (estimated size 92... Executor task launch worker-18
2015-02-09T14:06:07.569+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_38_9 Executor task launch worker-18
2015-02-09T14:06:07.573+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 34 non-empty blocks out of 50 blocks Executor task launch worker-18
2015-02-09T14:06:07.573+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 1 ms Executor task launch worker-18
2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.CacheManager Partition rdd_41_9 not found, computing it Executor task launch worker-18
2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 3 non-empty blocks out of 10 blocks Executor task launch worker-18
2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms Executor task launch worker-18
2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(0) called with curMem=307529127, maxMem=9261... Executor task launch worker-18
2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_41_9 stored as bytes in memory (estimated size 0.0... Executor task launch worker-18
2015-02-09T14:06:38.946+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block
[jira] [Resolved] (SPARK-5285) Removed GroupExpression in catalyst
[ https://issues.apache.org/jira/browse/SPARK-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5285.
Resolution: Won't Fix

Removed GroupExpression in catalyst
Key: SPARK-5285
URL: https://issues.apache.org/jira/browse/SPARK-5285
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei

Removed GroupExpression in catalyst
[jira] [Resolved] (SPARK-5518) Error messages for plans with invalid AttributeReferences
[ https://issues.apache.org/jira/browse/SPARK-5518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5518.
Resolution: Fixed
Fix Version/s: 1.3.0

Error messages for plans with invalid AttributeReferences
Key: SPARK-5518
URL: https://issues.apache.org/jira/browse/SPARK-5518
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
Fix For: 1.3.0

It is now possible for users to put invalid attribute references into query plans. We should check for this case at the end of analysis.
[jira] [Commented] (SPARK-5518) Error messages for plans with invalid AttributeReferences
[ https://issues.apache.org/jira/browse/SPARK-5518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319718#comment-14319718 ] Michael Armbrust commented on SPARK-5518:

https://github.com/apache/spark/pull/4558

Error messages for plans with invalid AttributeReferences
Key: SPARK-5518
URL: https://issues.apache.org/jira/browse/SPARK-5518
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
Fix For: 1.3.0

It is now possible for users to put invalid attribute references into query plans. We should check for this case at the end of analysis.
[jira] [Comment Edited] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master
[ https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319965#comment-14319965 ] Wojciech Pituła edited comment on SPARK-5265 at 2/13/15 11:24 AM:

We have the same issue. Such a master URL works fine with --deploy-mode client but breaks with --deploy-mode cluster.

was (Author: krever): We have the same issue. Such master url works fine with --deploy-mode client but breaks with --deploy-mode cluster.

Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master
Key: SPARK-5265
URL: https://issues.apache.org/jira/browse/SPARK-5265
Project: Spark
Issue Type: Bug
Components: Deploy
Reporter: Roque Vassal'lo
Labels: cluster, spark-submit, standalone, zookeeper

Hi, this is my first JIRA here, so I hope it is clear enough. I'm using Spark 1.2.0 and trying to submit an application on a Spark Standalone cluster in cluster deploy mode with supervise. The Standalone cluster is running in high availability mode, using Zookeeper to provide leader election between the three available Masters (named master1, master2 and master3).

As read in Spark's documentation, to register a Worker with the Standalone cluster, I provide the complete cluster info as the spark route, I mean spark://master1:7077,master2:7077,master3:7077, and that route is parsed so that three attempts are launched: the first one to master1:7077, the second one to master2:7077 and the third one to master3:7077. This works great!

But if I try to do the same while submitting applications, it fails. I mean, if I provide the complete cluster info as the --master option to the spark-submit script, it throws an exception because it tries to connect as if it were a single node.

Example:

spark-submit --class org.apache.spark.examples.SparkPi --master spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster --supervise examples.jar 100

This is the output I got:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest
15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest
15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mytest); users with modify permissions: Set(mytest)
15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started
15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on port 53930.
15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: spark://master1:7077,master2:7077,master3:7077
akka.actor.ActorInitializationException: exception during creation
	at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
	at akka.actor.ActorCell.create(ActorCell.scala:596)
	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: spark://master1:7077,master2:7077,master3:7077
	at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
	at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
	at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
	at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
	at akka.actor.ActorCell.create(ActorCell.scala:580)
	... 9 more

Shouldn't it be parsed as on Worker registration? That would not force the client to know which is the current active Master of the Standalone cluster.
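The fix the reporter asks for amounts to a small parsing step before connecting. A minimal hypothetical sketch (this is not Spark's actual Master.toAkkaUrl code): split the comma-separated spark:// URL into one URL per master and try each in turn, the way Worker registration already treats it.

```java
import java.util.ArrayList;
import java.util.List;

public class MasterUrls {
    // Split "spark://h1:p1,h2:p2,..." into one "spark://host:port" URL per master.
    static List<String> split(String masterUrl) {
        String prefix = "spark://";
        if (!masterUrl.startsWith(prefix)) {
            throw new IllegalArgumentException("Invalid master URL: " + masterUrl);
        }
        List<String> urls = new ArrayList<>();
        for (String hostPort : masterUrl.substring(prefix.length()).split(",")) {
            urls.add(prefix + hostPort); // the client can then attempt each master
        }
        return urls;
    }
}
```

With this shape the submit client would attempt master1, then master2, then master3, and would not need to know which one is currently the active leader.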
[jira] [Updated] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar updated SPARK-5795:
Attachment: TestStreamCompile.java

My test case on Java 1.7 and Spark 1.3 trunk. Thanks.

api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
Key: SPARK-5795
URL: https://issues.apache.org/jira/browse/SPARK-5795
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor
Attachments: TestStreamCompile.java

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

The following code can't compile in Java:

JavaPairDStream<Integer, Integer> rs = ...
rs.saveAsNewAPIHadoopFiles(prefix, "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf);

but similar code with JavaPairRDD works OK:

JavaPairRDD<String, String> counts = ...
counts.saveAsNewAPIHadoopFile(out, Text.class, Text.class, TextOutputFormat.class, jobConf);

Maybe change

def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}

to

def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}
[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320038#comment-14320038 ] Peter Rudenko commented on SPARK-4766: -- Very important feature that could make pretty big speedup. Let me explain why. I have a pipeline with 4 transformers and 1 estimator model (LogisticRegression) with 3 folds for cross validation and 3 hyper parameters in grid search: {code} val paramGrid = new ParamGridBuilder() .addGrid(model.regParam, Array(0.1, 0.01, 0.001)) .build() crossval.setEstimatorParamMaps(paramGrid) crossval.setNumFolds(3) {code} Transformers don't have any parameters in grid search. Right now for every possible combination of hyperparam + crossvalidation fold it transforms a data (with the same transformers) thus creating new RDD with a new ID, but the same data. Thus i cannot cache it. What i come with is to use 2 pipelines: # Transformer pipeline - transforming once whole data # Model pipeline with just a model in it. I modified [Pipeline|https://issues.apache.org/jira/browse/SPARK-5796] and LogisticRegression class (commented instances.unpersist() because the same instances would be for each hyperparameter). This reduced the time of LogisticRegression Pipeline significantly. But would be cool to do it in Pipeline: if there's no parameters for Transformer stages - just construct a data once and for each hyperparameter in estimator pass the same data. Thus for 3 folds it would read and cache data 3 times ((1 to 3).combination(2)) and wouldn't depend on number of Hyperparameters to estimator (now it's doing 9 times 3 folds combination * 3 model parameters). ML Estimator Params should subclass Transformer Params -- Key: SPARK-4766 URL: https://issues.apache.org/jira/browse/SPARK-4766 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. 
There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one. E.g., it is weird to be able to do: {code} val model: LogisticRegressionModel = ... model.getMaxIter() {code} (This is the only case where this happens currently, but it is worth setting a precedent.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
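The proposed split can be sketched in plain Python (illustrative class and method names, not Spark's actual API): fit-only settings such as maxIter live in the Estimator params class, which extends the Transformer (model) params class, so a fitted model no longer exposes them.

```python
class LogisticRegressionModelParams:
    """Params meaningful on the fitted model (a Transformer)."""
    def __init__(self):
        self.threshold = 0.5

    def get_threshold(self):
        return self.threshold


class LogisticRegressionParams(LogisticRegressionModelParams):
    """Estimator params extend the model params, adding fit-only settings."""
    def __init__(self):
        super().__init__()
        self.max_iter = 100

    def get_max_iter(self):
        return self.max_iter


estimator_params = LogisticRegressionParams()
model_params = LogisticRegressionModelParams()
assert estimator_params.get_max_iter() == 100       # fit-time setting is available
assert estimator_params.get_threshold() == 0.5      # shared param is inherited
assert not hasattr(model_params, "get_max_iter")    # the model no longer exposes maxIter
```

With this hierarchy, `model.getMaxIter()` from the description would be a compile-time (or attribute) error rather than a confusing no-op.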
[jira] [Updated] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4267: - Target Version/s: (was: 1.3.0) Fix Version/s: 1.2.2 Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Components: YARN Reporter: Tsuyoshi OZAWA Assignee: Sean Owen Priority: Blocker Fix For: 1.3.0, 1.2.2 Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with an NPE. {code} $ bin/spark-shell --master yarn-client scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2"); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13) at $iwC$$iwC$$iwC.<init>(<console>:18) at $iwC$$iwC.<init>(<console>:20) at $iwC.<init>(<console>:22) at <init>(<console>:24) at .<init>(<console>:28) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
[jira] [Commented] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320109#comment-14320109 ] Littlestar commented on SPARK-5795: --- Is this the same problem as SPARK-5297? Thanks. api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java - Key: SPARK-5795 URL: https://issues.apache.org/jira/browse/SPARK-5795 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Minor Attachments: TestStreamCompile.java import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; The following code can't compile in Java: {code} JavaPairDStream<Integer, Integer> rs = ...; rs.saveAsNewAPIHadoopFiles(prefix, "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf); {code} but similar code with JavaPairRDD works OK: {code} JavaPairRDD<String, String> counts = ...; counts.saveAsNewAPIHadoopFile(out, Text.class, Text.class, TextOutputFormat.class, jobConf); {code} Maybe {code} def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } {code} should become {code} def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]]( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5252) Streaming StatefulNetworkWordCount example hangs
[ https://issues.apache.org/jira/browse/SPARK-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5252: - Component/s: PySpark Examples Looks like you have an environment problem: {code} java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set. {code} Can you resolve this and then see if you have this problem? Streaming StatefulNetworkWordCount example hangs Key: SPARK-5252 URL: https://issues.apache.org/jira/browse/SPARK-5252 Project: Spark Issue Type: Bug Components: Examples, PySpark, Streaming Affects Versions: 1.2.0 Environment: Ubuntu Linux Reporter: Lutz Buech Attachments: debug.txt Running the stateful network word count example in Python (on one local node): https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py At the beginning, when no data is streamed, empty status outputs are generated, only decorated by the current Time, e.g.: --- Time: 2015-01-14 17:58:20 --- --- Time: 2015-01-14 17:58:21 --- As soon as I stream some data via netcat, no new status updates will show. Instead, one line saying [Stage number: (2 + 0) / 3] where number is some integer number, e.g. 132. There is no further output on stdout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
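The state-update semantics of that stateful wordcount example can be sketched in plain Python (illustrative function, not the pyspark API itself): updateStateByKey calls a user function with the values that arrived for a key in the current batch and that key's previous state, which is None the first time the key is seen.

```python
def update_count(new_values, last_sum):
    # updateStateByKey-style reducer: fold newly arrived counts into the
    # running total for a key; last_sum is None the first time a key is seen.
    return sum(new_values) + (last_sum or 0)

# Simulate three batches of counts for one word:
state = None
for batch in ([1, 1], [], [1]):
    state = update_count(batch, state)
print(state)  # -> 3, the running count after three batches
```

When no data arrives, the update function still runs (with an empty value list), which is why the example prints empty per-batch status blocks before any input is streamed.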
[jira] [Resolved] (SPARK-5756) Analyzer should not throw scala.NotImplementedError for illegitimate sql
[ https://issues.apache.org/jira/browse/SPARK-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei resolved SPARK-5756. Resolution: Fixed Analyzer should not throw scala.NotImplementedError for illegitimate sql Key: SPARK-5756 URL: https://issues.apache.org/jira/browse/SPARK-5756 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: wangfei {code}SELECT CAST(x AS STRING) FROM src{code} throws a NotImplementedError: CliDriver: scala.NotImplementedError: an implementation is missing at scala.Predef$.$qmark$qmark$qmark(Predef.scala:252) at org.apache.spark.sql.catalyst.expressions.PrettyAttribute.dataType(namedExpressions.scala:221) at org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:30) at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:30) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:68) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:68) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:68) at org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:56) at org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:56) at org.apache.spark.sql.catalyst.expressions.NamedExpression.typeSuffix(namedExpressions.scala:62) at org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:124) at org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:78) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1$$anonfun$7.apply(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1$$anonfun$7.apply(Analyzer.scala:83) at 
scala.collection.immutable.Stream.map(Stream.scala:376) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:81) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:204) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:79) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320083#comment-14320083 ] Littlestar commented on SPARK-5795: --- Error info: {code} The method saveAsNewAPIHadoopFiles(String, String, Class<?>, Class<?>, Class<? extends OutputFormat<?,?>>) in the type JavaPairDStream<Integer,Integer> is not applicable for the arguments (String, String, Class<Integer>, Class<Integer>, Class<TextOutputFormat>) {code} api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java - Key: SPARK-5795 URL: https://issues.apache.org/jira/browse/SPARK-5795 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Minor import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; The following code can't compile in Java: {code} JavaPairDStream<Integer, Integer> rs = ...; rs.saveAsNewAPIHadoopFiles(prefix, "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf); {code} but similar code with JavaPairRDD works OK: {code} JavaPairRDD<String, String> counts = ...; counts.saveAsNewAPIHadoopFile(out, Text.class, Text.class, TextOutputFormat.class, jobConf); {code} Maybe {code} def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } {code} should become {code} def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]]( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5799) Compute aggregation function on specified numeric columns
[ https://issues.apache.org/jira/browse/SPARK-5799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320352#comment-14320352 ] Apache Spark commented on SPARK-5799: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4592 Compute aggregation function on specified numeric columns - Key: SPARK-5799 URL: https://issues.apache.org/jira/browse/SPARK-5799 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Compute aggregation function on specified numeric columns. For example: {code} val df = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d")).toDataFrame("key", "value1", "value2", "rest") df.groupBy("key").min("value2") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5799) Compute aggregation function on specified numeric columns
Liang-Chi Hsieh created SPARK-5799: -- Summary: Compute aggregation function on specified numeric columns Key: SPARK-5799 URL: https://issues.apache.org/jira/browse/SPARK-5799 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Compute aggregation function on specified numeric columns. For example: {code} val df = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d")).toDataFrame("key", "value1", "value2", "rest") df.groupBy("key").min("value2") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
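The intended semantics (aggregate only the named numeric column, per group, ignoring the other columns) can be illustrated in plain Python:

```python
rows = [("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d")]  # (key, value1, value2, rest)

def group_min(rows, key_idx, value_idx):
    # Group rows by the key column and take the minimum of one chosen
    # numeric column -- what df.groupBy("key").min("value2") computes.
    result = {}
    for row in rows:
        key, value = row[key_idx], row[value_idx]
        result[key] = value if key not in result else min(result[key], value)
    return result

print(group_min(rows, key_idx=0, value_idx=2))  # -> {'a': 0, 'b': 4}
```

Passing a different column index (e.g. value1) aggregates that column instead, which is the point of letting the caller name the columns.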
[jira] [Created] (SPARK-5798) Spark shell issue
DeepakVohra created SPARK-5798: -- Summary: Spark shell issue Key: SPARK-5798 URL: https://issues.apache.org/jira/browse/SPARK-5798 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0 Environment: Spark 1.2 Scala 2.10.4 Reporter: DeepakVohra The Spark shell terminates when Spark code is run, indicating an issue with the Spark shell. The error comes from line 48 of the spark-shell script /apachespark/spark-1.2.0-bin-cdh4/bin/spark-shell: {code} "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "${SUBMISSION_OPTS[@]}" spark-shell "${APPLICATION_OPTS[@]}" {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
[ https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320309#comment-14320309 ] Mark Khaitman commented on SPARK-5782: -- Would it make sense to instead make _next_limit return the MIN of the 2 values as opposed to the MAX? Python Worker / Pyspark Daemon Memory Issue --- Key: SPARK-5782 URL: https://issues.apache.org/jira/browse/SPARK-5782 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.3.0, 1.2.1, 1.2.2 Environment: CentOS 7, Spark Standalone Reporter: Mark Khaitman I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it. It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers compared to the default 1 task / 1 core configuration in our environment. This can become problematic when you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB Python worker memory limit and a few gigs of executor memory. Another related issue is that the individual Python workers are not supposed to exceed 512MB by much; beyond that, they're supposed to spill to disk. Some of our Python workers are somehow reaching 2GB each (which, when multiplied by the number of cores per executor * the number of joins occurring in some cases), causing the Out-of-Memory killer to step up to its unfortunate job! :( I think with the _next_limit method in shuffle.py, if the current memory usage is close to the memory limit, the 1.05 multiplier can endlessly cause more memory to be consumed by a single Python worker, since the max of (512 vs 511 * 1.05) would end up blowing up towards the latter of the two... 
Shouldn't the memory limit be the absolute cap in this case? I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
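The runaway-limit concern can be sketched in plain Python (illustrative numbers and function names, not the actual shuffle.py code): growing the limit to max(used * 1.05, limit) whenever usage nears the cap means the cap itself keeps ratcheting upward, whereas min() would hold it at the configured value.

```python
MEMORY_LIMIT = 512  # MB, the configured worker cap

def next_limit_max(used):
    # Sketch of the reported behaviour: the limit ratchets upward with usage.
    return max(used * 1.05, MEMORY_LIMIT)

def next_limit_min(used):
    # Proposed alternative: never hand out more than the configured cap.
    return min(used * 1.05, MEMORY_LIMIT)

used = 511.0
for _ in range(50):          # worker repeatedly fills up to its current limit
    used = next_limit_max(used)
print(used > 2048)           # -> True: the max() variant blows far past the cap

used = 511.0
for _ in range(50):
    used = min(next_limit_min(used), MEMORY_LIMIT)
print(used <= MEMORY_LIMIT)  # -> True: the min() variant stays capped
```

This matches the "(512 vs 511 * 1.05)" observation above: once usage is near the limit, each round hands the worker roughly 5% more headroom with no upper bound.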
[jira] [Updated] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
[ https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Khaitman updated SPARK-5782: - Priority: Critical (was: Major) Python Worker / Pyspark Daemon Memory Issue --- Key: SPARK-5782 URL: https://issues.apache.org/jira/browse/SPARK-5782 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.3.0, 1.2.1, 1.2.2 Environment: CentOS 7, Spark Standalone Reporter: Mark Khaitman Priority: Critical I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns up twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. This can become problematic in the cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though is set to have a 512MB python worker memory limit and few gigs of executor memory. Another related issue to this is that the individual python workers are not supposed to even exceed that far beyond 512MB, otherwise they're supposed to spill to disk. Some of our python workers are somehow reaching 2GB each (which when multiplied by the number of cores per executor * the number of joins occurring in some cases), causing the Out-of-Memory killer to step up to its unfortunate job! :( I think with the _next_limit method in shuffle.py, if the current memory usage is close to the memory limit, then a 1.05 multiplier can endlessly cause more memory to be consumed by the single python worker, since the max of (512 vs 511 * 1.05) would end up blowing up towards the latter of the two... Shouldn't the memory limit be the absolute cap in this case? 
I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5726) Hadamard Vector Product Transformer
[ https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320514#comment-14320514 ] Octavian Geagla commented on SPARK-5726: Ok, I've made the change on the PR. Thanks, Sean! Hadamard Vector Product Transformer --- Key: SPARK-5726 URL: https://issues.apache.org/jira/browse/SPARK-5726 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Octavian Geagla Assignee: Octavian Geagla I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html A draft of this feature is implemented, documented, and tested already. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting I'm curious if there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there are not any examples which use Vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5345) Fix unstable test case in FsHistoryProviderSuite
[ https://issues.apache.org/jira/browse/SPARK-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5345. --- Resolution: Fixed It looks like this has been fixed by SPARK-5600, so I'm going to resolve this for now. Let's re-open if the test becomes flaky again. Fix unstable test case in FsHistoryProviderSuite Key: SPARK-5345 URL: https://issues.apache.org/jira/browse/SPARK-5345 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta Labels: flaky-test In FsHistoryProviderSuite, the test "Parse new and old application logs" sometimes fails and sometimes succeeds. It's unstable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5735) Replace uses of EasyMock with Mockito
[ https://issues.apache.org/jira/browse/SPARK-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5735. Resolution: Fixed Fix Version/s: 1.3.0 Replace uses of EasyMock with Mockito - Key: SPARK-5735 URL: https://issues.apache.org/jira/browse/SPARK-5735 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Josh Rosen Fix For: 1.3.0 There are a few reasons we should drop EasyMock. First, we should have a single mocking framework in our tests in general to keep things consistent. Second, EasyMock has caused us some dependency pain in our tests due to objenesis. We aren't totally sure but suspect such conflicts might be causing non deterministic test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5802) Cache scaled data in GLM
Xiangrui Meng created SPARK-5802: Summary: Cache scaled data in GLM Key: SPARK-5802 URL: https://issues.apache.org/jira/browse/SPARK-5802 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng If we modify the input data (to append bias or to scale features), we should cache the output to avoid recomputing transformed vectors each time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320504#comment-14320504 ] Marcelo Vanzin commented on SPARK-5770: --- bq. but the classloader still load the old one. Could you clarify what that means? Due to the way class loading works, if you reference a class that has already been loaded, you won't get the new one, but the one already loaded. Which is one reason why this "addJar() can overwrite existing jars" functionality is a little sketchy. Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar content and upload it again. We can see that the jar file on the local disk has been updated, but the classloader still loads the old one. The executor log has no error or exception pointing to it. I used spark-shell to test this, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5785) Pyspark does not support narrow dependencies
[ https://issues.apache.org/jira/browse/SPARK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-5785: Description: joins ( cogroups etc.) are always considered to have wide dependencies in pyspark, they are never narrow. This can cause unnecessary shuffles. eg., this simple job should shuffle rddA rddB once each, but it also will do a third shuffle of the unioned data: {code} rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) joined = rddA.join(rddB) joined.count() rddA._partitionFunc == rddB._partitionFunc True {code} (Or the docs should somewhere explain that this feature is missing from pyspark.) was: joins ( cogroups etc.) are always considered to have wide dependencies in pyspark, they are never narrow. This can cause unnecessary shuffles. eg., this simple job should shuffle rddA rddB once each, but it also will do a third shuffle of the unioned data: {code} rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) joined = rddA.join(rddB) joined.count() rddA._partitionFunc == rddB._partitionFunc True {code} (Or the docs should somewhere explain that this feature is missing from spark.) Pyspark does not support narrow dependencies Key: SPARK-5785 URL: https://issues.apache.org/jira/browse/SPARK-5785 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Imran Rashid joins ( cogroups etc.) are always considered to have wide dependencies in pyspark, they are never narrow. This can cause unnecessary shuffles. 
eg., this simple job should shuffle rddA rddB once each, but it also will do a third shuffle of the unioned data: {code} rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) joined = rddA.join(rddB) joined.count() rddA._partitionFunc == rddB._partitionFunc True {code} (Or the docs should somewhere explain that this feature is missing from pyspark.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5801) Shuffle creates too many nested directories
[ https://issues.apache.org/jira/browse/SPARK-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-5801: -- Component/s: Shuffle Shuffle creates too many nested directories --- Key: SPARK-5801 URL: https://issues.apache.org/jira/browse/SPARK-5801 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.1 Reporter: Kay Ousterhout When running Spark on EC2, there are 4 nested shuffle directories before the hashed directory names, for example: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/spark-675133f0-b2c8-44a1-8775-5e394674609b/spark-69c1ea15-4e7f-454a-9f57-19763c7bdd17/spark-b036335c-60fa-48ab-a346-f1b420af2027/0c My understanding is that this should look like: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/0c This happened when I was using the sort-based shuffle (all default configurations for Spark on EC2). This is not a correctness problem (the shuffle still works fine). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4903) RDD remains cached after DROP TABLE
[ https://issues.apache.org/jira/browse/SPARK-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320439#comment-14320439 ] Yin Huai commented on SPARK-4903: - I believe that it has been resolved in 1.3 ([see this|https://github.com/apache/spark/blob/v1.3.0-snapshot1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L61]). I tried the following snippet in "build/sbt -Phive sparkShell" and verified the cached RDD was unpersisted after I dropped the table. {code} sqlContext.jsonRDD(sc.parallelize("""{"a":1}""" :: Nil)).registerTempTable("test") sqlContext.sql("create table jt as select a from test") sqlContext.sql("cache table jt").collect sqlContext.sql("select * from jt").collect sqlContext.sql("drop table jt").collect {code} RDD remains cached after DROP TABLE - Key: SPARK-4903 URL: https://issues.apache.org/jira/browse/SPARK-4903 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Spark master @ Dec 17 (3cd516191baadf8496ccdae499771020e89acd7e) Reporter: Evert Lammerts Priority: Critical In beeline, when I run: {code:sql} CREATE TABLE test AS select col from table; CACHE TABLE test; DROP TABLE test; {code} The table is removed but the RDD is still cached. Running UNCACHE is not possible anymore (table not found in the metastore). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5732) Add an option to print the spark version in spark script
[ https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5732. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: uncleGen Add an option to print the spark version in spark script Key: SPARK-5732 URL: https://issues.apache.org/jira/browse/SPARK-5732 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.0.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.3.0 Naturally, we may need to add an option to print the spark version in spark script. It is pretty common in many script tools -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
[ https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Khaitman updated SPARK-5782: - Comment: was deleted (was: Would it make sense to instead make the _next_limit return the MIN of the 2 values as opposed to the MAX?) Python Worker / Pyspark Daemon Memory Issue --- Key: SPARK-5782 URL: https://issues.apache.org/jira/browse/SPARK-5782 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.3.0, 1.2.1, 1.2.2 Environment: CentOS 7, Spark Standalone Reporter: Mark Khaitman I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns up twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. This can become problematic in the cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though is set to have a 512MB python worker memory limit and few gigs of executor memory. Another related issue to this is that the individual python workers are not supposed to even exceed that far beyond 512MB, otherwise they're supposed to spill to disk. Some of our python workers are somehow reaching 2GB each (which when multiplied by the number of cores per executor * the number of joins occurring in some cases), causing the Out-of-Memory killer to step up to its unfortunate job! :( I think with the _next_limit method in shuffle.py, if the current memory usage is close to the memory limit, then a 1.05 multiplier can endlessly cause more memory to be consumed by the single python worker, since the max of (512 vs 511 * 1.05) would end up blowing up towards the latter of the two... 
Shouldn't the memory limit be the absolute cap in this case? I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5529) Executor is still hold while BlockManager has been removed
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5529: - Component/s: YARN Executor is still hold while BlockManager has been removed -- Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Attachments: SPARK-5529.patch When I run a Spark job, one executor hangs; after 120s its BlockManager is removed by the driver, but it takes another half hour before the executor itself is removed by the driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5529: - Assignee: Hong Shen BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Attachments: SPARK-5529.patch When I run a Spark job, one executor hangs; after 120s its BlockManager is removed by the driver, but it takes another half hour before the executor itself is removed by the driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters
[ https://issues.apache.org/jira/browse/SPARK-5296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320499#comment-14320499 ] Michael Armbrust commented on SPARK-5296: - Oh, good point... We should pass down nested ANDs Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters -- Key: SPARK-5296 URL: https://issues.apache.org/jira/browse/SPARK-5296 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Assignee: Cheng Lian Priority: Critical Currently, the BaseRelation API allows a FilteredRelation to handle an Array[Filter] which represents filter expressions that are applied as an AND operator. We should support OR operations in a BaseRelation as well. I'm not sure what this would look like in terms of API changes, but it almost seems like a FilteredUnionedScan BaseRelation (the name stinks but you get the idea) would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
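The discussion above hinges on the fact that the `Array[Filter]` handed to a filtered scan is implicitly a conjunction: nested ANDs can be flattened into it, but an OR cannot be split and must be pushed down as a whole subtree. The sketch below illustrates that distinction with toy Python classes; these are stand-ins for, not copies of, Spark's `sources.Filter` hierarchy.

```python
# Toy filter tree (hypothetical classes, modeled loosely on Spark's
# data source Filter API) showing why the flat Array[Filter] contract
# is AND-only.

class And:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Or:
    def __init__(self, left, right):
        self.left, self.right = left, right

class EqualTo:
    def __init__(self, attr, value):
        self.attr, self.value = attr, value


def split_conjuncts(f):
    """Flatten nested ANDs into the flat list a filtered scan receives.

    An OR cannot be split this way: it is returned intact, so pushing it
    down requires the relation to accept the whole subtree."""
    if isinstance(f, And):
        return split_conjuncts(f.left) + split_conjuncts(f.right)
    return [f]


expr = And(EqualTo("a", 1), And(EqualTo("b", 2), EqualTo("c", 3)))
print(len(split_conjuncts(expr)))  # 3: nested ANDs flatten into conjuncts

disj = Or(EqualTo("a", 1), EqualTo("b", 2))
print(len(split_conjuncts(disj)))  # 1: the whole Or stays intact
```

This is why supporting OR pushdown is an API question rather than a flattening tweak: the relation has to receive and evaluate tree-shaped predicates.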
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320510#comment-14320510 ] Sean Owen commented on SPARK-5770: -- Yeah, I think that's the point: overwriting an existing JAR won't cause any classes to be reloaded, so should it be an error? or a warning? Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar content and upload it again. We can see that the jar file on local disk has been updated, but the classloader still loads the old one. The executor log shows no error or exception pointing this out. I used spark-shell to test it, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5626) Spurious test failures due to NullPointerException in EasyMock test code
[ https://issues.apache.org/jira/browse/SPARK-5626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5626. --- Resolution: Fixed Spurious test failures due to NullPointerException in EasyMock test code Key: SPARK-5626 URL: https://issues.apache.org/jira/browse/SPARK-5626 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test Attachments: consoleText.txt I've seen a few cases where a test failure will trigger a cascade of spurious failures when instantiating test suites that use EasyMock. Here's a sample symptom: {code} [info] CacheManagerSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.CacheManagerSuite *** ABORTED *** (137 milliseconds) [info] java.lang.NullPointerException: [info] at org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52) [info] at org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90) [info] at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73) [info] at org.objenesis.ObjenesisHelper.newInstance(ObjenesisHelper.java:43) [info] at org.easymock.internal.ObjenesisClassInstantiator.newInstance(ObjenesisClassInstantiator.java:26) [info] at org.easymock.internal.ClassProxyFactory.createProxy(ClassProxyFactory.java:219) [info] at org.easymock.internal.MocksControl.createMock(MocksControl.java:59) [info] at org.easymock.EasyMock.createMock(EasyMock.java:103) [info] at org.scalatest.mock.EasyMockSugar$class.mock(EasyMockSugar.scala:267) [info] at org.apache.spark.CacheManagerSuite.mock(CacheManagerSuite.scala:28) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply$mcV$sp(CacheManagerSuite.scala:40) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38) [info] at 
org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195) [info] at org.apache.spark.CacheManagerSuite.runTest(CacheManagerSuite.scala:28) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) [info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545) [info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) [info] at org.apache.spark.CacheManagerSuite.org$scalatest$BeforeAndAfter$$super$run(CacheManagerSuite.scala:28) [info] at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) [info] at org.apache.spark.CacheManagerSuite.run(CacheManagerSuite.scala:28) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:294) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:284) [info] at 
java.util.concurrent.FutureTask.run(FutureTask.java:262) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} This is from https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26852/consoleFull. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5803) Use ArrayBuilder instead of ArrayBuffer for primitive types
[ https://issues.apache.org/jira/browse/SPARK-5803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320528#comment-14320528 ] Apache Spark commented on SPARK-5803: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4594 Use ArrayBuilder instead of ArrayBuffer for primitive types --- Key: SPARK-5803 URL: https://issues.apache.org/jira/browse/SPARK-5803 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng ArrayBuffer is not specialized and hence it boxes primitive-typed values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5798) Spark shell issue
[ https://issues.apache.org/jira/browse/SPARK-5798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320533#comment-14320533 ] DeepakVohra commented on SPARK-5798: Thanks Sean for testing. Not all Spark/Scala code generates an error in Spark Shell. For example, run all pre-requisite import, var, and method code and subsequently run the following code to test: model(sc, rawUserArtistData, rawArtistData, rawArtistAlias) from: https://github.com/sryza/aas/blob/master/ch03-recommender/src/main/scala/com/cloudera/datascience/recommender/RunRecommender.scala Data files are local to Spark/Scala and not in HDFS. Environment is different: Oracle Linux 6.5, but that shouldn't be a factor. If the preceding test also does not generate an error, I would agree it is some other factor and not a bug. Spark shell issue - Key: SPARK-5798 URL: https://issues.apache.org/jira/browse/SPARK-5798 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0 Environment: Spark 1.2 Scala 2.10.4 Reporter: DeepakVohra The Spark shell terminates when Spark code is run, indicating an issue with the Spark shell. The error is coming from the spark shell file /apachespark/spark-1.2.0-bin-cdh4/bin/spark-shell: line 48 $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_OPTS[@]} spark-shell ${APPLICATION_OPTS[@]} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5503) Example code for Power Iteration Clustering
[ https://issues.apache.org/jira/browse/SPARK-5503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5503. -- Resolution: Fixed Fix Version/s: 1.3.0 Example code for Power Iteration Clustering --- Key: SPARK-5503 URL: https://issues.apache.org/jira/browse/SPARK-5503 Project: Spark Issue Type: Documentation Components: Documentation, Examples, MLlib Reporter: Xiangrui Meng Assignee: Stephen Boesch Fix For: 1.3.0 There are two places we need to put examples: 1. In the user guide, we should add a small example (as in the unit test). 2. Under examples/, we can have something fancy but still need to keep it minimal. 3. The user guide contains some out-of-date links, which need to be updated as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5801) Shuffle creates too many nested directories
Kay Ousterhout created SPARK-5801: - Summary: Shuffle creates too many nested directories Key: SPARK-5801 URL: https://issues.apache.org/jira/browse/SPARK-5801 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Reporter: Kay Ousterhout When running Spark on EC2, there are 4 nested shuffle directories before the hashed directory names, for example: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/spark-675133f0-b2c8-44a1-8775-5e394674609b/spark-69c1ea15-4e7f-454a-9f57-19763c7bdd17/spark-b036335c-60fa-48ab-a346-f1b420af2027/0c My understanding is that this should look like: /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/0c This happened when I was using the sort-based shuffle (all default configurations for Spark on EC2). This is not a correctness problem (the shuffle still works fine). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5529: - Summary: BlockManager heartbeat expiration does not kill executor (was: Executor is still hold while BlockManager has been removed) BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Attachments: SPARK-5529.patch When I run a Spark job, one executor hangs; after 120s its BlockManager is removed by the driver, but it takes another half hour before the executor itself is removed by the driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5626) Spurious test failures due to NullPointerException in EasyMock test code
[ https://issues.apache.org/jira/browse/SPARK-5626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320519#comment-14320519 ] Josh Rosen commented on SPARK-5626: --- This should hopefully be fixed now that I've merged SPARK-5735 to remove EasyMock. I'm going to resolve this issue for now, but let's re-open it if we observe this flakiness again. Spurious test failures due to NullPointerException in EasyMock test code Key: SPARK-5626 URL: https://issues.apache.org/jira/browse/SPARK-5626 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test Attachments: consoleText.txt I've seen a few cases where a test failure will trigger a cascade of spurious failures when instantiating test suites that use EasyMock. Here's a sample symptom: {code} [info] CacheManagerSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.CacheManagerSuite *** ABORTED *** (137 milliseconds) [info] java.lang.NullPointerException: [info] at org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52) [info] at org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90) [info] at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73) [info] at org.objenesis.ObjenesisHelper.newInstance(ObjenesisHelper.java:43) [info] at org.easymock.internal.ObjenesisClassInstantiator.newInstance(ObjenesisClassInstantiator.java:26) [info] at org.easymock.internal.ClassProxyFactory.createProxy(ClassProxyFactory.java:219) [info] at org.easymock.internal.MocksControl.createMock(MocksControl.java:59) [info] at org.easymock.EasyMock.createMock(EasyMock.java:103) [info] at org.scalatest.mock.EasyMockSugar$class.mock(EasyMockSugar.scala:267) [info] at org.apache.spark.CacheManagerSuite.mock(CacheManagerSuite.scala:28) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply$mcV$sp(CacheManagerSuite.scala:40) [info] at 
org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38) [info] at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195) [info] at org.apache.spark.CacheManagerSuite.runTest(CacheManagerSuite.scala:28) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) [info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545) [info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) [info] at org.apache.spark.CacheManagerSuite.org$scalatest$BeforeAndAfter$$super$run(CacheManagerSuite.scala:28) [info] at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) [info] at org.apache.spark.CacheManagerSuite.run(CacheManagerSuite.scala:28) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:294) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:284) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} This is from https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26852/consoleFull. -- This message was sent by Atlassian JIRA
[jira] [Created] (SPARK-5803) Use ArrayBuilder instead of ArrayBuffer for primitive types
Xiangrui Meng created SPARK-5803: Summary: Use ArrayBuilder instead of ArrayBuffer for primitive types Key: SPARK-5803 URL: https://issues.apache.org/jira/browse/SPARK-5803 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng ArrayBuffer is not specialized and hence it boxes primitive-typed values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5805) Fix the type error in the final example given in MLlib - Clustering documentation
[ https://issues.apache.org/jira/browse/SPARK-5805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320680#comment-14320680 ] Apache Spark commented on SPARK-5805: - User 'emres' has created a pull request for this issue: https://github.com/apache/spark/pull/4596 Fix the type error in the final example given in MLlib - Clustering documentation - Key: SPARK-5805 URL: https://issues.apache.org/jira/browse/SPARK-5805 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.2.0, 1.2.1 Reporter: Emre Sevinç Priority: Minor Labels: documentation, easyfix, newbie Original Estimate: 1h Remaining Estimate: 1h The final example in [MLlib - Clustering|http://spark.apache.org/docs/1.2.0/mllib-clustering.html] documentation has a code line that leads to a type error. The problematic line reads as: {code} model.predictOnValues(testData).print() {code} but it should be {code} model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5805) Fix the type error in the final example given in MLlib - Clustering documentation
[ https://issues.apache.org/jira/browse/SPARK-5805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5805: - Assignee: Emre Sevinç Fix the type error in the final example given in MLlib - Clustering documentation - Key: SPARK-5805 URL: https://issues.apache.org/jira/browse/SPARK-5805 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.2.0, 1.2.1 Reporter: Emre Sevinç Assignee: Emre Sevinç Priority: Minor Labels: documentation, easyfix, newbie Original Estimate: 1h Remaining Estimate: 1h The final example in [MLlib - Clustering|http://spark.apache.org/docs/1.2.0/mllib-clustering.html] documentation has a code line that leads to a type error. The problematic line reads as: {code} model.predictOnValues(testData).print() {code} but it should be {code} model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5746) INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table.
[ https://issues.apache.org/jira/browse/SPARK-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320562#comment-14320562 ] Yin Huai commented on SPARK-5746: - For now, we will throw an error when we find this case. INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table. -- Key: SPARK-5746 URL: https://issues.apache.org/jira/browse/SPARK-5746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker With the newly introduced write support of the data source API, {{JSONRelation}} and {{ParquetRelation2}} both suffer from this bug. The root cause is that we remove the source table before insertion ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]). The correct solution is to first insert into a temporary folder, and then overwrite the source table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5806) Organize sections in mllib-clustering.md
Xiangrui Meng created SPARK-5806: Summary: Organize sections in mllib-clustering.md Key: SPARK-5806 URL: https://issues.apache.org/jira/browse/SPARK-5806 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5804) Explicitly manage cache in Crossvalidation k-fold loop
Peter Rudenko created SPARK-5804: Summary: Explicitly manage cache in Crossvalidation k-fold loop Key: SPARK-5804 URL: https://issues.apache.org/jira/browse/SPARK-5804 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Peter Rudenko Priority: Minor On a big dataset, explicitly unpersisting the train and validation folds allows more data to be loaded into memory in the next loop iteration. In my environment (single node, 8GB worker RAM, 2GB dataset file, 3 folds for cross validation), this saved more than 5 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
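The loop structure behind this improvement can be sketched as follows. This is a plain-Python illustration of the k-fold pattern, with the persist/unpersist placement shown as comments; it is not Spark's actual CrossValidator code, and the fold-splitting scheme here is a simple modulo split chosen for the example.

```python
# Sketch of the k-fold loop the issue describes: hold each fold's
# train/validation pair only for the duration of one iteration, then
# release it before building the next fold. In Spark, "release" would
# be RDD.unpersist(); here the data is a plain list for illustration.

def k_folds(data, k):
    """Yield (train, validation) splits using a modulo assignment.

    The caller releases each pair before requesting the next one, so
    only one fold pair needs to fit in memory at a time."""
    for i in range(k):
        validation = [x for j, x in enumerate(data) if j % k == i]
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, validation


data = list(range(9))
for train, validation in k_folds(data, 3):
    # In Spark this is where the explicit cache management would go:
    #   train.cache(); validation.cache()
    #   ... fit the model on train, compute the metric on validation ...
    #   train.unpersist(); validation.unpersist()
    assert len(train) + len(validation) == len(data)
```

The point of the issue is the bracketing: without the final unpersist calls, all k fold pairs stay cached simultaneously and evict each other (or spill), which is what cost the reported 5+ minutes.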
[jira] [Commented] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles
[ https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320653#comment-14320653 ] Josh Rosen commented on SPARK-5227: --- I think this might be caused by HADOOP-8490: the test code might be getting a cached FileSystem instance that was created by an earlier test run, causing the configuration from the earlier test to be re-used here. We could try to completely disable this caching, but this could have a large negative performance impact on Hadoop library code which assumes that FileSystem creation is cheap. I wonder if there's a way that we can clear this cache in between our test runs, which would at least address the test-flakiness issues. InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles - Key: SPARK-5227 URL: https://issues.apache.org/jira/browse/SPARK-5227 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Josh Rosen Priority: Blocker Labels: flaky-test The InputOutputMetricsSuite input metrics when reading text file with multiple splits test has been failing consistently in our new {{branch-1.2}} Jenkins SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/ Here's the error message {code} ArrayBuffer(32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 
32, 32, 32, ... [the remainder of the message is a long run of identical 32-byte read counts; truncated in the archive] {code}
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320552#comment-14320552 ] Marcelo Vanzin commented on SPARK-5770: --- It might be possible to fix the behavior, although even then the results might be sketchy. Basically, when overwriting jars, you'd have to replace the executor's class loader. That means you need to keep track of the jars added to the class loader, and when adding a new jar, you place it in front of the others and use Thread.currentThread().setContextClassLoader() to replace the class loader. But that's after like 5 seconds of thinking, so there may be a lot of corner cases in doing that. I think the best approach would be to say that overwriting jars is not allowed, even if that doesn't cover all cases. You could still add a different jar that tries to override already loaded classes, and that will have the same confusing effect of the old classes still being used. Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar content and upload it again. We can see that the local jar file has been updated, but the classloader still loads the old one. The executor log has no error or exception pointing this out. I used spark-shell to test this, with spark.files.overwrite set to true.
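The workaround Marcelo sketches above can be illustrated in plain JDK terms. This is a hypothetical sketch, not Spark's actual executor code; the names JarTracker and addJar are invented for the example:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the approach from the comment above: track every
// jar added so far, put the newest jar first, and swap in a fresh context
// class loader so lookups for new classes hit the updated jar first.
public class JarTracker {
    private final Deque<URL> jarUrls = new ArrayDeque<>();

    // Rebuild the class loader with the new jar ahead of the older ones and
    // install it as the current thread's context class loader.
    public ClassLoader addJar(URL newJar, ClassLoader parent) {
        jarUrls.addFirst(newJar);
        URLClassLoader fresh =
            new URLClassLoader(jarUrls.toArray(new URL[0]), parent);
        Thread.currentThread().setContextClassLoader(fresh);
        return fresh;
    }
}
```

As the comment warns, classes already resolved through the old loader keep their old definitions; only lookups that go through the new context class loader see the updated jar, which is one of the corner cases that makes this sketchy.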
[jira] [Commented] (SPARK-5804) Explicitly manage cache in Crossvalidation k-fold loop
[ https://issues.apache.org/jira/browse/SPARK-5804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320607#comment-14320607 ] Apache Spark commented on SPARK-5804: - User 'petro-rudenko' has created a pull request for this issue: https://github.com/apache/spark/pull/4595 Explicitly manage cache in Crossvalidation k-fold loop -- Key: SPARK-5804 URL: https://issues.apache.org/jira/browse/SPARK-5804 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Peter Rudenko Priority: Minor On a big dataset, explicitly unpersisting the train and validation folds allows more data to be loaded into memory in the next loop iteration. On my environment (single node, 8 GB worker RAM, 2 GB dataset file, 3 folds for cross-validation), this saved more than 5 minutes.
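The pattern behind this improvement can be sketched without Spark. Fold below is a stub standing in for a cached RDD/DataFrame fold; only the explicit cache()/unpersist() lifecycle per iteration matters here, and all the names are illustrative:

```java
import java.util.List;

// Stand-in for an RDD/DataFrame fold: only the cache lifecycle matters here.
class Fold {
    boolean cached;
    void cache() { cached = true; }
    void unpersist() { cached = false; }
}

public class KFoldLoop {
    // Cache each train/validation pair while it is in use, then free it
    // explicitly before the next iteration so the next fold's data can
    // fit in memory instead of waiting for LRU eviction.
    public static int run(List<Fold[]> folds) {
        int evaluated = 0;
        for (Fold[] pair : folds) {
            Fold train = pair[0], validation = pair[1];
            train.cache();
            validation.cache();
            // ... fit the model on train and evaluate on validation here ...
            evaluated++;
            train.unpersist();
            validation.unpersist();
        }
        return evaluated;
    }
}
```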
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320874#comment-14320874 ] Chris Love commented on SPARK-3821: --- I notice that the Packer-built AMI comes with Java 7; how would you recommend handling Java 8? Should both be installed? Also, which AWS Linux were the new AMIs built off of? Will this be in a 1.2.x branch or just 1.3? Thanks, Chris Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-5779) Python broadcast does not work with Kryo serializer
[ https://issues.apache.org/jira/browse/SPARK-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320897#comment-14320897 ] Davies Liu commented on SPARK-5779: --- Yes, I will close it. Python broadcast does not work with Kryo serializer --- Key: SPARK-5779 URL: https://issues.apache.org/jira/browse/SPARK-5779 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0, 1.2.1 Reporter: Davies Liu Priority: Critical The PythonBroadcast introduced in 1.2 cannot be serialized by Kryo.
[jira] [Commented] (SPARK-5730) Group methods in the generated doc for spark.ml algorithms.
[ https://issues.apache.org/jira/browse/SPARK-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320908#comment-14320908 ] Apache Spark commented on SPARK-5730: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4600 Group methods in the generated doc for spark.ml algorithms. --- Key: SPARK-5730 URL: https://issues.apache.org/jira/browse/SPARK-5730 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng In spark.ml, we have params and their setters/getters. It is nice to group them in the generated docs. Params should be at the top, while setters/getters should be at the bottom.
[jira] [Created] (SPARK-5812) Potential flaky test JavaAPISuite.glom
Tathagata Das created SPARK-5812: Summary: Potential flaky test JavaAPISuite.glom Key: SPARK-5812 URL: https://issues.apache.org/jira/browse/SPARK-5812 Project: Spark Issue Type: Bug Components: Java API, Spark Core Affects Versions: 1.3.0 Reporter: Tathagata Das https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27455/
[jira] [Created] (SPARK-5805) Fix the type error in the final example given in MLlib - Clustering documentation
Emre Sevinç created SPARK-5805: -- Summary: Fix the type error in the final example given in MLlib - Clustering documentation Key: SPARK-5805 URL: https://issues.apache.org/jira/browse/SPARK-5805 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.2.1, 1.2.0 Reporter: Emre Sevinç Priority: Minor The final example in [MLlib - Clustering|http://spark.apache.org/docs/1.2.0/mllib-clustering.html] documentation has a code line that leads to a type error. The problematic line reads as: {code} model.predictOnValues(testData).print() {code} but it should be {code} model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print() {code}
[jira] [Updated] (SPARK-5806) Organize sections in mllib-clustering.md
[ https://issues.apache.org/jira/browse/SPARK-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5806: - Description: We separate code examples from algorithm descriptions. It would be better if we put the example code close to each algorithm description. Organize sections in mllib-clustering.md Key: SPARK-5806 URL: https://issues.apache.org/jira/browse/SPARK-5806 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We separate code examples from algorithm descriptions. It would be better if we put the example code close to each algorithm description.
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Priority: Blocker (was: Major) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at 
org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
[jira] [Closed] (SPARK-5779) Python broadcast does not work with Kryo serializer
[ https://issues.apache.org/jira/browse/SPARK-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-5779. - Resolution: Duplicate Python broadcast does not work with Kryo serializer --- Key: SPARK-5779 URL: https://issues.apache.org/jira/browse/SPARK-5779 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0, 1.2.1 Reporter: Davies Liu Priority: Critical The PythonBroadcast introduced in 1.2 cannot be serialized by Kryo.
[jira] [Commented] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles
[ https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320903#comment-14320903 ] Apache Spark commented on SPARK-5227: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4599 InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles - Key: SPARK-5227 URL: https://issues.apache.org/jira/browse/SPARK-5227 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Josh Rosen Priority: Blocker Labels: flaky-test The InputOutputMetricsSuite input metrics when reading text file with multiple splits test has been failing consistently in our new {{branch-1.2}} Jenkins SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/ Here's the error message {code} ArrayBuffer(32, 32, 32, 32, 32, 32, 32, 32, ...) {code}
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320905#comment-14320905 ] Nicholas Chammas commented on SPARK-3821: - If you want Java 8 alongside 7, you can install both to separate paths. For spark-ec2's purposes, we only need 7. The AMIs used as the base are [defined in the Packer template|https://github.com/nchammas/spark-ec2/blob/0f313de64ad9542d1a0f0d6f27131ca4bc01d8c3/image-build/spark-packer-template.json#L5-L6]. The generated AMIs do not include Spark itself, just its dependencies plus related tools for spark-ec2. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method
[ https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320904#comment-14320904 ] Apache Spark commented on SPARK-5679: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4599 Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method -- Key: SPARK-5679 URL: https://issues.apache.org/jira/browse/SPARK-5679 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Kostas Sakellis Labels: flaky-test Please audit these and see if there are any assumptions with respect to File IO that might not hold in all cases. I'm happy to help if you can't find anything. These both failed in the same run: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink {code} org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed read method Failing for the past 13 builds (Since Failed#26 ) Took 48 sec. 
Error Message 2030 did not equal 6496 Stacktrace sbt.ForkMain$ForkError: 2030 did not equal 6496 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
[jira] [Updated] (SPARK-5779) Python broadcast does not work with Kryo serializer
[ https://issues.apache.org/jira/browse/SPARK-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5779: -- Affects Version/s: (was: 1.2.1) (was: 1.3.0) 1.2.0 Python broadcast does not work with Kryo serializer --- Key: SPARK-5779 URL: https://issues.apache.org/jira/browse/SPARK-5779 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Critical The PythonBroadcast introduced in 1.2 cannot be serialized by Kryo.
[jira] [Updated] (SPARK-5812) Potential flaky test JavaAPISuite.glom
[ https://issues.apache.org/jira/browse/SPARK-5812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5812: - Labels: flaky-test (was: ) Potential flaky test JavaAPISuite.glom -- Key: SPARK-5812 URL: https://issues.apache.org/jira/browse/SPARK-5812 Project: Spark Issue Type: Bug Components: Java API, Spark Core Affects Versions: 1.3.0 Reporter: Tathagata Das Labels: flaky-test https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27455/
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320935#comment-14320935 ] Xiangrui Meng commented on SPARK-5016: -- I think we should compute the inverse in parallel. In https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L166, we don't collect to the driver; instead we use aggregateByKey to keep the sums on the reducers. Then on each reducer, we update the Gaussians, and finally collect them to the driver. GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization.
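The per-reducer combination described in the comment can be sketched without Spark. sumByComponent below is a hypothetical stand-in for what aggregateByKey would do with the per-component sums; component keys and contributions are simplified to primitives:

```java
import java.util.HashMap;
import java.util.Map;

public class PerKeySums {
    // Combine per-point partial sums by Gaussian component key, mimicking
    // what aggregateByKey does on each reducer instead of collecting every
    // contribution to the driver first.
    public static Map<Integer, Double> sumByComponent(int[] keys, double[] vals) {
        Map<Integer, Double> sums = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            sums.merge(keys[i], vals[i], Double::sum);
        }
        return sums;
    }
}
```

With the sums already grouped by component, each reducer can update (and invert the covariance of) its own Gaussians before only the finished parameters are collected to the driver.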
[jira] [Resolved] (SPARK-5806) Organize sections in mllib-clustering.md
[ https://issues.apache.org/jira/browse/SPARK-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5806. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4598 [https://github.com/apache/spark/pull/4598] Organize sections in mllib-clustering.md Key: SPARK-5806 URL: https://issues.apache.org/jira/browse/SPARK-5806 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 We separate code examples from algorithm descriptions. It would be better if we put the example code close to each algorithm description.
[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320739#comment-14320739 ] Patrick Wendell commented on SPARK-5731: [~c...@koeninger.org] [~tdas] FYI we've disabled this test because it's caused a huge productivity loss to ongoing development with frequent failures. Please try to get this test into good shape ASAP - otherwise this code will be untested. Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at
[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5731: --- Labels: flaky-test (was: ) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 20.070287525 seconds. Last failure message: 300 did not equal 48 didn't get all messages. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at 
org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38) at 
org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
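For context, the failing assertion above runs inside ScalaTest's `eventually`, which retries a block until it passes or a patience window expires. A minimal stand-in sketch (not ScalaTest's actual implementation; the `timeoutMs`/`intervalMs` names are assumed) behaves like this:

```scala
object Retry {
  // Retry `check` until it returns normally or `timeoutMs` elapses,
  // sleeping `intervalMs` between attempts -- the shape of ScalaTest's eventually.
  def eventually[A](timeoutMs: Long, intervalMs: Long)(check: => A): A = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var last: Throwable = null
    while (System.currentTimeMillis() < deadline) {
      try return check
      catch { case e: AssertionError => last = e; Thread.sleep(intervalMs) }
    }
    throw new AssertionError("The code passed to eventually never returned normally", last)
  }
}
```

Note that retrying only helps when the condition can still become true; the failure quoted above ("300 did not equal 48") saw *more* messages than expected, so no amount of retrying within the 20-second window could make it pass.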
[jira] [Updated] (SPARK-5807) Parallel grid search
[ https://issues.apache.org/jira/browse/SPARK-5807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Rudenko updated SPARK-5807: - Description: Right now in CrossValidator, for each fold combination and ParamGrid hyperparameter pair, it searches for the best parameter sequentially. Assuming there's enough worker memory on the cluster to cache all training/validation folds, it's possible to parallelize execution. Here's a draft I came up with:
{code}
import scala.collection.immutable.{Vector => ScalaVec}

val metrics = ScalaVec.fill(numModels)(0.0) // Scala Vector is thread safe
val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex

def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
  case ((training, validation), splitIndex) => {
    val trainingDataset = sqlCtx.applySchema(training, schema).cache()
    val validationDataset = sqlCtx.applySchema(validation, schema).cache()
    // multi-model training
    logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    var i = 0
    trainingDataset.unpersist()
    while (i < numModels) {
      val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
      logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
      metrics(i) += metric
      i += 1
    }
    validationDataset.unpersist()
  }
}

if (parallel) {
  splits.par.foreach(processFold)
} else {
  splits.foreach(processFold)
}
{code}
Assuming there are 3 folds, it would redundantly cache all the combinations (a lot of memory), so maybe it's possible to cache each fold separately. Parallel grid search - Key: SPARK-5807 URL: https://issues.apache.org/jira/browse/SPARK-5807 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.3.0 Reporter: Peter Rudenko Priority: Minor
[jira] [Created] (SPARK-5807) Parallel grid search
Peter Rudenko created SPARK-5807: - Summary: Parallel grid search Key: SPARK-5807 URL: https://issues.apache.org/jira/browse/SPARK-5807 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.3.0 Reporter: Peter Rudenko Priority: Minor Right now in CrossValidator, for each fold combination and ParamGrid hyperparameter pair, it searches for the best parameter sequentially. Assuming there's enough worker memory on the cluster to cache all training/validation folds, it's possible to parallelize execution. Here's a draft I came up with:
{code}
import scala.collection.immutable.{Vector => ScalaVec}

val metrics = ScalaVec.fill(numModels)(0.0) // Scala Vector is thread safe
val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex

def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
  case ((training, validation), splitIndex) => {
    val trainingDataset = sqlCtx.applySchema(training, schema).cache()
    val validationDataset = sqlCtx.applySchema(validation, schema).cache()
    // multi-model training
    logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    var i = 0
    trainingDataset.unpersist()
    while (i < numModels) {
      val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
      logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
      metrics(i) += metric
      i += 1
    }
    validationDataset.unpersist()
  }
}

if (parallel) {
  splits.par.foreach(processFold)
} else {
  splits.foreach(processFold)
}
{code}
Assuming there are 3 folds, it would redundantly cache all the combinations (a lot of memory), so maybe it's possible to cache each fold separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
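The draft above hinges on `.par` plus a thread-safe metrics container. A self-contained sketch of that accumulation pattern (using Java parallel streams and a lock in place of the RDD and estimator machinery; `numFolds`, `numModels`, and the constant metric are illustrative stand-ins) looks like:

```scala
import java.util.stream.IntStream

object ParFoldSketch {
  // Process folds in parallel, accumulating one metric per model.
  // The synchronized block makes the += on the shared array thread safe.
  def run(numFolds: Int, numModels: Int): Array[Double] = {
    val metrics = Array.fill(numModels)(0.0)
    val lock = new Object
    IntStream.range(0, numFolds).parallel().forEach { fold =>
      var i = 0
      while (i < numModels) {
        lock.synchronized { metrics(i) += 1.0 } // stand-in for the evaluated metric
        i += 1
      }
    }
    metrics
  }
}
```

Each model's slot ends up holding the sum over all folds, mirroring `metrics(i) += metric` in the draft; the terminal `forEach` blocks until all folds are done, so reading `metrics` afterwards is safe.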
[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320754#comment-14320754 ] Tathagata Das commented on SPARK-5731: -- This is very weird. The stream is receiving more messages than it is supposed to. Let me try recreating it. Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test
[jira] [Updated] (SPARK-5730) Group methods in the generated doc for spark.ml algorithms.
[ https://issues.apache.org/jira/browse/SPARK-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5730: - Assignee: Xiangrui Meng Group methods in the generated doc for spark.ml algorithms. --- Key: SPARK-5730 URL: https://issues.apache.org/jira/browse/SPARK-5730 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng In spark.ml, we have params and their setters/getters. It would be nice to group them in the generated docs: params should be at the top, while setters/getters should be at the bottom. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
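Scaladoc supports exactly this kind of grouping through its `@group`/`@groupname`/`@groupprio` tags (rendered when the doc tool's grouping support is enabled). A hedged sketch with made-up member names, not the actual spark.ml classes:

```scala
/**
 * @groupname param Parameters
 * @groupprio param 0
 * @groupname setParam Parameter setters
 * @groupprio setParam 10
 */
class ExampleEstimator {
  /** Regularization parameter. @group param */
  val regParam: Double = 0.0

  /** Sets the regularization parameter. @group setParam */
  def setRegParam(value: Double): this.type = this // doc-grouping sketch; setter body elided
}
```

With the lower `@groupprio`, the "Parameters" group is listed before "Parameter setters" in the generated page, which is the ordering the issue asks for.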
[jira] [Created] (SPARK-5810) Maven Coordinate Inclusion failing in pySpark
Burak Yavuz created SPARK-5810: -- Summary: Maven Coordinate Inclusion failing in pySpark Key: SPARK-5810 URL: https://issues.apache.org/jira/browse/SPARK-5810 Project: Spark Issue Type: Bug Components: Deploy, PySpark Affects Versions: 1.3.0 Reporter: Burak Yavuz Priority: Blocker Fix For: 1.3.0 When Maven coordinates are included to download dependencies in PySpark, PySpark returns a GatewayError because it cannot read the proper port to communicate with the JVM. This is because PySpark relies on STDIN to read the port number, and in the meantime Ivy prints out a whole lot of logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
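The failure mode described (the launcher's port number drowned among Ivy's resolution logs on the same stream) is at heart a parsing problem: the reader must skip log noise until it finds the handshake value. A hedged sketch of that idea — the log lines and port below are invented, and this is not PySpark's actual handshake code:

```scala
object PortScan {
  // Return the first line of mixed launcher output that is purely an integer,
  // skipping any log noise printed before it.
  def findPort(lines: Iterator[String]): Option[Int] =
    lines.map(_.trim).collectFirst {
      case line if line.nonEmpty && line.forall(_.isDigit) => line.toInt
    }
}
```

A tolerant scan like this survives interleaved logs, whereas reading a single line blindly (the behavior the issue describes) fails as soon as Ivy writes first.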
[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320749#comment-14320749 ] Tathagata Das commented on SPARK-5731: -- Let me take a pass at it. Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test
[jira] [Commented] (SPARK-5798) Spark shell issue
[ https://issues.apache.org/jira/browse/SPARK-5798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320763#comment-14320763 ] DeepakVohra commented on SPARK-5798: Re-tested on local OS Oracle Linux 6.5 and did not get the Spark shell issue. The earlier test, which generated the Spark shell error, was on Amazon EC2. Issue may be closed. Spark shell issue - Key: SPARK-5798 URL: https://issues.apache.org/jira/browse/SPARK-5798 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0 Environment: Spark 1.2 Scala 2.10.4 Reporter: DeepakVohra The Spark shell terminates when Spark code is run indicating an issue with Spark shell. The error is coming from the spark shell file /apachespark/spark-1.2.0-bin-cdh4/bin/spark-shell: line 48 $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_OPTS[@]} spark-shell ${APPLICATION_OPTS[@]} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5807) Parallel grid search
[ https://issues.apache.org/jira/browse/SPARK-5807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Rudenko updated SPARK-5807: - Description: Right now in CrossValidator, for each fold combination and ParamGrid hyperparameter pair, it searches for the best parameter sequentially. Assuming there's enough worker memory on the cluster to cache all training/validation folds, it's possible to parallelize execution. Here's a draft I came up with:
{code}
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

val metrics = new ArrayBuffer[Double](numModels) with mutable.SynchronizedBuffer[Double]
val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex

def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
  case ((training, validation), splitIndex) => {
    val trainingDataset = sqlCtx.applySchema(training, schema).cache()
    val validationDataset = sqlCtx.applySchema(validation, schema).cache()
    // multi-model training
    logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    var i = 0
    trainingDataset.unpersist()
    while (i < numModels) {
      val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
      logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
      metrics(i) += metric
      i += 1
    }
    validationDataset.unpersist()
  }
}

if (parallel) {
  splits.par.foreach(processFold)
} else {
  splits.foreach(processFold)
}
{code}
Assuming there are 3 folds, it would redundantly cache all the combinations (a lot of memory), so maybe it's possible to cache each fold separately. Parallel grid search - Key: SPARK-5807 URL: https://issues.apache.org/jira/browse/SPARK-5807 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.3.0 Reporter: Peter Rudenko Priority: Minor
[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset
[ https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320796#comment-14320796 ] Apache Spark commented on SPARK-5731: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/4597 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset Key: SPARK-5731 URL: https://issues.apache.org/jira/browse/SPARK-5731 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Blocker Labels: flaky-test
[jira] [Commented] (SPARK-5779) Python broadcast does not work with Kryo serializer
[ https://issues.apache.org/jira/browse/SPARK-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320803#comment-14320803 ] Josh Rosen commented on SPARK-5779: --- I thought we fixed this in SPARK-4882: https://github.com/apache/spark/pull/3831. Have you observed a new version of this issue? Python broadcast does not work with Kryo serializer --- Key: SPARK-5779 URL: https://issues.apache.org/jira/browse/SPARK-5779 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0, 1.2.1 Reporter: Davies Liu Priority: Critical The PythonBroadcast cannot be serialized by Kryo, which is introduced in 1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4865) Include temporary tables in SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320804#comment-14320804 ] Yin Huai commented on SPARK-4865: - I will start to work on it based on SPARK-3299. Include temporary tables in SHOW TABLES --- Key: SPARK-4865 URL: https://issues.apache.org/jira/browse/SPARK-4865 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Misha Chernetsov Priority: Blocker
[jira] [Updated] (SPARK-4865) Include temporary tables in SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4865: Priority: Blocker (was: Critical) Include temporary tables in SHOW TABLES --- Key: SPARK-4865 URL: https://issues.apache.org/jira/browse/SPARK-4865 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Misha Chernetsov Priority: Blocker
[jira] [Created] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala
Devesh Parekh created SPARK-5809: Summary: OutOfMemoryError in logDebug in RandomForest.scala Key: SPARK-5809 URL: https://issues.apache.org/jira/browse/SPARK-5809 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Devesh Parekh When training a GBM on sparse vectors produced by HashingTF, I get the following OutOfMemoryError, where RandomForest is building a debug string to log. Exception in thread main java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3326) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121 ) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421) at java.lang.StringBuilder.append(StringBuilder.java:136) at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197) at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327 ) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320) at scala.collection.AbstractTraversable.addString(Traversable.scala:105) at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286) at scala.collection.AbstractTraversable.mkString(Traversable.scala:105) at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288) at scala.collection.AbstractTraversable.mkString(Traversable.scala:105) at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152) at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152) at org.apache.spark.Logging$class.logDebug(Logging.scala:63) at 
org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67) at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150) at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64) at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150) at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63) at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96) A workaround until this is fixed is to modify log4j.properties in the conf directory to filter out debug logs in RandomForest. For example: log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN
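The stack trace above shows the OOM comes from `mkString` assembling a huge debug string inside `logDebug`. Beyond the log4j workaround quoted in the report, the defensive pattern can be sketched as follows (a standalone sketch with hypothetical names, not the actual RandomForest code): evaluate the message lazily and bound its size.

```scala
// Hypothetical sketch: guard expensive debug output so the giant
// string is only built when DEBUG is actually enabled, and cap it.
object LazyDebug {
  var debugEnabled: Boolean = false // stand-in for log4j's isDebugEnabled

  // `msg` is by-name: the string is never constructed unless needed.
  def logDebug(msg: => String): Unit = if (debugEnabled) println(msg)

  // Summarize instead of mkString-ing millions of elements.
  def describe(bins: Array[Int], maxElems: Int = 8): String = {
    val shown = bins.take(maxElems).mkString(", ")
    if (bins.length > maxElems) s"bins = [$shown, ...]" else s"bins = [$shown]"
  }
}
```

With `debugEnabled` left false, `LazyDebug.logDebug(LazyDebug.describe(hugeArray))` never materializes the string at all, which is the property the report is missing.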
[jira] [Resolved] (SPARK-5789) Throw a better error message if JsonRDD.parseJson encounters unrecoverable parsing errors.
[ https://issues.apache.org/jira/browse/SPARK-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5789. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4582 [https://github.com/apache/spark/pull/4582] Throw a better error message if JsonRDD.parseJson encounters unrecoverable parsing errors. -- Key: SPARK-5789 URL: https://issues.apache.org/jira/browse/SPARK-5789 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Fix For: 1.3.0 For example {code} sqlContext.jsonRDD(sc.parallelize(a:1}::Nil)) {code} will throw {code} scala.MatchError: a (of class java.lang.String) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:302) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:300) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:879) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/02/12 15:08:55 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 26) in 10 ms on localhost (7/8) 15/02/12 15:08:55 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 4.0 (TID 33, localhost): scala.MatchError: a (of class java.lang.String) at 
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:302) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:300) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:879) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code}
[jira] [Created] (SPARK-5811) Documentation for --packages and --repositories on Spark Shell
Burak Yavuz created SPARK-5811: -- Summary: Documentation for --packages and --repositories on Spark Shell Key: SPARK-5811 URL: https://issues.apache.org/jira/browse/SPARK-5811 Project: Spark Issue Type: Documentation Components: Deploy, Spark Shell Affects Versions: 1.3.0 Reporter: Burak Yavuz Priority: Critical Fix For: 1.3.0 Documentation for the new dependency-management support via Maven coordinates, using --packages and --repositories
[jira] [Commented] (SPARK-5363) Spark 1.2 freeze without error notification
[ https://issues.apache.org/jira/browse/SPARK-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320986#comment-14320986 ] Apache Spark commented on SPARK-5363: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4601 Spark 1.2 freeze without error notification --- Key: SPARK-5363 URL: https://issues.apache.org/jira/browse/SPARK-5363 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Tassilo Klein Assignee: Davies Liu Priority: Critical After a number of calls to a map().collect() statement, Spark freezes without reporting any error. Within the map, a large broadcast variable is used. The freeze can be avoided by setting 'spark.python.worker.reuse = false' (Spark 1.2) or by using an earlier version, however at the price of lower speed.
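The workaround quoted in the report can be applied cluster-wide by adding the setting to `conf/spark-defaults.conf` (a plain config fragment; the key and value are exactly those quoted above):

```
spark.python.worker.reuse   false
```

This disables Python worker reuse, trading the freeze for slower job startup, as the reporter notes.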
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320995#comment-14320995 ] Florian Verhein commented on SPARK-3821: RE: Java, that reminds me... We should probably be using Oracle JDK rather than OpenJDK. But I think this should be a separate issue, so I just created SPARK-5813. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Resolved] (SPARK-5730) Group methods in the generated doc for spark.ml algorithms.
[ https://issues.apache.org/jira/browse/SPARK-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5730. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4600 [https://github.com/apache/spark/pull/4600] Group methods in the generated doc for spark.ml algorithms. --- Key: SPARK-5730 URL: https://issues.apache.org/jira/browse/SPARK-5730 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 In spark.ml, we have params and their setters/getters. It is nice to group them in the generated docs. Params should be at the top, while setters/getters should be at the bottom.
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321099#comment-14321099 ] Travis Galoppo commented on SPARK-5016: --- Hmm. I'm having trouble conceptualizing how to use aggregateByKey here; the breezeData RDD is not keyed. We could have a keyed RDD of expectation sums (with a little rework), but each entry in the breezeData RDD would need to be operated on by each reducer (which seems awkward?)... or am I way off? GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization.
[jira] [Resolved] (SPARK-5803) Use ArrayBuilder instead of ArrayBuffer for primitive types
[ https://issues.apache.org/jira/browse/SPARK-5803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5803. -- Resolution: Fixed Fix Version/s: 1.3.0 Use ArrayBuilder instead of ArrayBuffer for primitive types --- Key: SPARK-5803 URL: https://issues.apache.org/jira/browse/SPARK-5803 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 ArrayBuffer is not specialized and hence it boxes primitive-typed values.
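The boxing difference behind this change can be shown in a minimal standalone sketch (plain Scala, not the MLlib code itself): `ArrayBuffer[Double]` stores boxed `java.lang.Double` values, while `ArrayBuilder.make[Double]` resolves to a specialized builder backed by a raw `Array[Double]`.

```scala
import scala.collection.mutable.{ArrayBuffer, ArrayBuilder}

// ArrayBuffer is generic and unspecialized: each += boxes the Double.
val buffer = ArrayBuffer[Double]()
(1 to 3).foreach(i => buffer += i.toDouble)

// ArrayBuilder.make[Double] picks ArrayBuilder.ofDouble, which grows
// a primitive Array[Double] with no per-element boxing.
val builder = ArrayBuilder.make[Double]
(1 to 3).foreach(i => builder += i.toDouble)
val arr: Array[Double] = builder.result()
```

Both produce the same elements; the builder simply avoids one object allocation per appended value, which matters in MLlib's hot loops.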
[jira] [Commented] (SPARK-5363) Spark 1.2 freeze without error notification
[ https://issues.apache.org/jira/browse/SPARK-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320987#comment-14320987 ] Davies Liu commented on SPARK-5363: --- [~TJKlein] Could you try the patch in https://github.com/apache/spark/pull/4601 to see whether it fixes your problem? Spark 1.2 freeze without error notification --- Key: SPARK-5363 URL: https://issues.apache.org/jira/browse/SPARK-5363 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Tassilo Klein Assignee: Davies Liu Priority: Critical After a number of calls to a map().collect() statement, Spark freezes without reporting any error. Within the map, a large broadcast variable is used. The freeze can be avoided by setting 'spark.python.worker.reuse = false' (Spark 1.2) or by using an earlier version, however at the price of lower speed.
[jira] [Created] (SPARK-5813) Spark-ec2: Switch to OracleJDK
Florian Verhein created SPARK-5813: -- Summary: Spark-ec2: Switch to OracleJDK Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor Currently we use OpenJDK; however, Oracle JDK is generally recommended, especially for Hadoop deployments.
[jira] [Created] (SPARK-5814) Remove JBLAS from runtime dependencies
Xiangrui Meng created SPARK-5814: Summary: Remove JBLAS from runtime dependencies Key: SPARK-5814 URL: https://issues.apache.org/jira/browse/SPARK-5814 Project: Spark Issue Type: Dependency upgrade Components: GraphX, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng We are using mixed breeze/netlib-java and jblas code in MLlib. They take different approaches to utilizing native libraries, and we should keep only one of them. netlib-java has a clear separation between the Java implementation and the native JNI libraries, while JBLAS packs statically linked binaries that cause license issues (SPARK-5669). So we want to remove JBLAS from the Spark runtime. One issue with this approach is that we have JBLAS' DoubleMatrix exposed (by mistake) in SVDPlusPlus of GraphX. We should deprecate it and replace `DoubleMatrix` with `Array[Double]`.
[jira] [Created] (SPARK-5815) Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS
Xiangrui Meng created SPARK-5815: Summary: Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS Key: SPARK-5815 URL: https://issues.apache.org/jira/browse/SPARK-5815 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is generally bad to expose types defined in a 3rd-party package in Spark public APIs. We should deprecate those methods in SVDPlusPlus and replace them in the next release.
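The deprecate-and-replace pattern this issue calls for can be sketched as follows (all names hypothetical; `ThirdPartyMatrix` stands in for jblas' `DoubleMatrix`, and the placeholder result is not real SVD++ output): keep the old signature for compatibility but flag it, and add a variant returning plain `Array[Double]`.

```scala
// Stand-in for the third-party type leaked by the public API.
class ThirdPartyMatrix(val data: Array[Double])

object SvdPlusPlusLike {
  // Old method kept for source compatibility, but flagged for removal.
  @deprecated("Exposes a third-party type; use runArray instead", "1.3.0")
  def run(): ThirdPartyMatrix = new ThirdPartyMatrix(runArray())

  // Replacement returns a plain Array[Double], as SPARK-5814 suggests.
  def runArray(): Array[Double] = Array(1.0, 2.0) // placeholder result
}
```

Callers of the deprecated method get a compiler warning but keep working until the next release, which matches the migration plan in the description.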
[jira] [Updated] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-5124: Attachment: Pluggable RPC - draft 2.pdf Compared to the first version, this doc adds an ActionScheduler interface and changes the fault tolerance to: Any error thrown by `onStart`, `receive` and `onStop` will be sent to `onError`. If `onError` throws an error, it will be ignored. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we could standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future.
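The fault-tolerance rule quoted from draft 2 can be sketched as follows (hypothetical names, not the actual interface from the attached PDF): every error from the lifecycle callbacks is routed to `onError`, and an error thrown by `onError` itself is swallowed.

```scala
// Minimal endpoint with the lifecycle hooks named in the draft.
trait Endpoint {
  def onStart(): Unit = {}
  def receive(msg: Any): Unit
  def onStop(): Unit = {}
  def onError(cause: Throwable): Unit = {}
}

// Wrapper implementing the quoted rule: errors thrown by
// onStart/receive/onStop go to onError; anything onError throws is ignored.
def safeInvoke(endpoint: Endpoint)(body: => Unit): Unit =
  try body
  catch {
    case t: Throwable =>
      try endpoint.onError(t)
      catch { case _: Throwable => () } // errors from onError are ignored
  }
```

The point of the rule is that a misbehaving error handler can never crash the RPC dispatch loop itself.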