[jira] [Commented] (SPARK-4916) [SQL][DOCS]Update SQL programming guide about cache section
[ https://issues.apache.org/jira/browse/SPARK-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255538#comment-14255538 ] Apache Spark commented on SPARK-4916: - User 'luogankun' has created a pull request for this issue: https://github.com/apache/spark/pull/3759 [SQL][DOCS]Update SQL programming guide about cache section --- Key: SPARK-4916 URL: https://issues.apache.org/jira/browse/SPARK-4916 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Gankun Luo Priority: Trivial SchemaRDD.cache() now uses in-memory columnar storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
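For context, a minimal sketch of what the updated cache section describes (the JSON file and table name below are illustrative, not from the issue): with this change, calling cache() on a SchemaRDD is expected to use the same in-memory columnar format as sqlContext.cacheTable.

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)              // sc: an existing SparkContext
val people = sqlContext.jsonFile("people.json")  // illustrative input path
people.registerTempTable("people")

sqlContext.cacheTable("people")  // columnar in-memory cache via the table name
people.cache()                   // per this change, also cached in columnar form
{code}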
[jira] [Created] (SPARK-4917) Add a function to convert into a graph with canonical edges in GraphOps
Takeshi Yamamuro created SPARK-4917: --- Summary: Add a function to convert into a graph with canonical edges in GraphOps Key: SPARK-4917 URL: https://issues.apache.org/jira/browse/SPARK-4917 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Convert bi-directional edges into uni-directional ones, instead of relying on 'canonicalOrientation' in GraphLoader.edgeListFile. This function is useful when a graph is loaded as-is and later transformed into one with canonical edges. It rewrites the vertex ids of edges so that srcIds are bigger than dstIds, and merges the duplicated edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
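A minimal sketch of the proposed operation, assuming the semantics described above (the method name, attribute type, and merge function here are illustrative; the actual GraphOps API is defined by the pull request):

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._

// Canonicalize edges so that srcId > dstId, then merge duplicate edges
// (here by summing their attributes).
def toCanonicalEdges(graph: Graph[Int, Int]): Graph[Int, Int] = {
  val canonical = graph.edges
    .map { e =>
      if (e.srcId > e.dstId) ((e.srcId, e.dstId), e.attr)
      else ((e.dstId, e.srcId), e.attr)
    }
    .reduceByKey(_ + _)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(graph.vertices, canonical)
}
{code}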
[jira] [Commented] (SPARK-4917) Add a function to convert into a graph with canonical edges in GraphOps
[ https://issues.apache.org/jira/browse/SPARK-4917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255543#comment-14255543 ] Apache Spark commented on SPARK-4917: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/3760 Add a function to convert into a graph with canonical edges in GraphOps --- Key: SPARK-4917 URL: https://issues.apache.org/jira/browse/SPARK-4917 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Convert bi-directional edges into uni-directional ones, instead of relying on 'canonicalOrientation' in GraphLoader.edgeListFile. This function is useful when a graph is loaded as-is and later transformed into one with canonical edges. It rewrites the vertex ids of edges so that srcIds are bigger than dstIds, and merges the duplicated edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255606#comment-14255606 ] Sean Owen commented on SPARK-2075: -- [~sunrui] Yes, but I am still not clear that anyone has observed the problem with two binaries built for the same version of Hadoop. Most of the situations listed here do not match that description. I might not understand someone's issue report here. In any event, it sounds like an underlying cause has been fixed already anyway. Anonymous classes are missing from Spark distribution - Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Assignee: Shixiong Zhu Priority: Critical Fix For: 1.3.0, 1.2.1 Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4918) Reuse Text in saveAsTextFile
Shixiong Zhu created SPARK-4918: --- Summary: Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-4918: Component/s: Spark Core Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255697#comment-14255697 ] Apache Spark commented on SPARK-4918: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3762 Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4919) Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
Xiaoyu Wang created SPARK-4919: -- Summary: Can Spark SQL thrift server UI provide JOB kill operate or any REST API? Key: SPARK-4919 URL: https://issues.apache.org/jira/browse/SPARK-4919 Project: Spark Issue Type: Wish Components: SQL, Web UI Affects Versions: 1.2.0 Reporter: Xiaoyu Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4919) Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
[ https://issues.apache.org/jira/browse/SPARK-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255707#comment-14255707 ] Sean Owen commented on SPARK-4919: -- (Could you ask questions at u...@apache.org instead? JIRA is for recording bugs or enhancements, along with proposed code changes.) Can Spark SQL thrift server UI provide JOB kill operate or any REST API? Key: SPARK-4919 URL: https://issues.apache.org/jira/browse/SPARK-4919 Project: Spark Issue Type: Wish Components: SQL, Web UI Affects Versions: 1.2.0 Reporter: Xiaoyu Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4919) Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
[ https://issues.apache.org/jira/browse/SPARK-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoyu Wang updated SPARK-4919: --- Description: Can the Spark SQL thrift server UI provide a “JOB” kill operation or any REST API? Stages can already be killed! Can Spark SQL thrift server UI provide JOB kill operate or any REST API? Key: SPARK-4919 URL: https://issues.apache.org/jira/browse/SPARK-4919 Project: Spark Issue Type: Wish Components: SQL, Web UI Affects Versions: 1.2.0 Reporter: Xiaoyu Wang Can the Spark SQL thrift server UI provide a “JOB” kill operation or any REST API? Stages can already be killed! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255717#comment-14255717 ] Apache Spark commented on SPARK-2808: - User 'helena' has created a pull request for this issue: https://github.com/apache/spark/pull/3631 update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044. Following is my testing result:

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

Spark Sort Benchmark

Test the influence of memory size per core: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks)

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
|--|--|--|--|--|--|
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

Test the influence of partition number: 18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
|--|--|--|--|--|--|
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044. Following is my testing result:

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

Spark Sort Benchmark

Test the influence of memory size per core: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks)

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

Test the influence of partition number: 18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to have several implementations of shuffle.
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044.

1. Following is my testing result (some heavy shuffle operations):

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

2. Following is a Spark Sort Benchmark:

2.1. Test the influence of memory size per core
precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks)

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

2.2. Test the influence of partition number
18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to have several implementations of shuffle.
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044.

1. Following is my testing result (some heavy shuffle operations):

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

2. Following is a Spark Sort Benchmark:

2.1. Test the influence of memory size per core
precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks). There is no tuning for disk shuffle.

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

2.2. Test the influence of partition number
18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to have several
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044.

1. Following is my testing result (some heavy shuffle operations):

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

2. Following is a Spark Sort Benchmark (in Spark 1.1.1). There is no tuning for disk shuffle.

2.1. Test the influence of memory size per core
precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks).

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

2.2. Test the influence of partition number
18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to
[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255772#comment-14255772 ] Thomas Graves commented on SPARK-2541: -- Sorry, I don't follow your question; are you wondering whether someone can work on this jira and fix this issue, or something else? I haven't had time to get back to it, so feel free to work on it if you have the same issue. Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves Attachments: SPARK-2541-partial.patch In Spark 0.9.x you could access secure HDFS from Standalone deploy; that doesn't work in 1.x anymore. It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if the currentUser == user. Not sure how it behaves when the daemons run as a super user but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
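A hedged sketch of the behavior described above (an illustration of the old 0.9.x-style check, not the attached SPARK-2541-partial.patch): skip the doAs when the requested user is already the current user, so the current login's HDFS credentials still apply.

{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def runAsSparkUser(func: () => Unit): Unit = {
  val currentUser = UserGroupInformation.getCurrentUser.getShortUserName
  val sparkUser = Option(System.getenv("SPARK_USER")).getOrElse(currentUser)
  if (sparkUser == currentUser) {
    func()  // same user: no doAs, existing credentials (e.g. HDFS tokens) are used directly
  } else {
    val ugi = UserGroupInformation.createRemoteUser(sparkUser)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = func()
    })
  }
}
{code}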
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044.

1. Following is my testing result (some heavy shuffle operations):

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

2. Following is a Spark Sort Benchmark (in Spark 1.1.1). There is no tuning for disk shuffle.

2.1. Test the influence of memory size per core
precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks).

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

2.2. Test the influence of partition number
18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to
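A hedged sketch of the configuration implied by the description above ("inmemory" as a manager name is hypothetical and comes from this proposal, not from released Spark; spark.shuffle.manager and spark.shuffle.spill are real Spark 1.x settings):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("InMemoryShuffleTest")
  .set("spark.shuffle.manager", "inmemory")  // hypothetical manager from this proposal
  .set("spark.shuffle.spill", "false")       // keep reduce-side data in memory
{code}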
[jira] [Created] (SPARK-4920) current spark version In UI is not striking
uncleGen created SPARK-4920: --- Summary: current spark version In UI is not striking Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-4920: Summary: current spark version in UI is not striking (was: current spark version In UI is not striking) current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-4920: Description: !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-4920: Description: It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! was: we can keep the same style with Spark website !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-4920: Description: we can keep the same style with Spark website !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! was:!https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor we can keep the same style with Spark website !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255850#comment-14255850 ] Apache Spark commented on SPARK-4920: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/3763 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255855#comment-14255855 ] Sean Owen commented on SPARK-4920: -- I slightly prefer the current UI, where the version is in the footer. Putting the version here pushes the tabs right significantly when the version is the long 1.3.0-SNAPSHOT. That said, it is consistent with the website. I imagine the necessary CSS is simple if they are both Bootstrap-based. current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R
[ https://issues.apache.org/jira/browse/SPARK-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255954#comment-14255954 ] Sean Owen commented on SPARK-4907: -- PS I think you will want to update the docs too, for example, at http://spark.apache.org/docs/latest/mllib-linear-methods.html There may be other places where the loss function formula is mentioned. Inconsistent loss and gradient in LeastSquaresGradient compared with R -- Key: SPARK-4907 URL: https://issues.apache.org/jira/browse/SPARK-4907 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai In most academic papers and algorithm implementations, people use L = 1/(2n) ||A*weights - y||^2 instead of L = 1/n ||A*weights - y||^2 for the least-squares loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses a different convention, this will result in different residuals, and all the stats properties will be different from the GLMNET package in R. The model coefficients will still be the same under this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
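For reference, a hedged restatement of the two conventions in LaTeX (notation inferred from the description above; not taken from the pull request):

{code}
% GLMNET-style convention (Eq. (1) in the glmnet paper):
%   L(w) = \frac{1}{2n} \lVert A w - y \rVert_2^2,
%   \qquad \nabla L(w) = \frac{1}{n} A^\top (A w - y)
%
% MLlib's convention at the time of this issue:
%   L(w) = \frac{1}{n} \lVert A w - y \rVert_2^2,
%   \qquad \nabla L(w) = \frac{2}{n} A^\top (A w - y)
%
% Both are minimized by the same w; losses, gradients, and derived statistics
% differ by a constant factor of 2.
{code}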
[jira] [Created] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks
Xuefu Zhang created SPARK-4921: -- Summary: Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks
[ https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256015#comment-14256015 ] Xuefu Zhang commented on SPARK-4921: cc: [~lirui], [~sandyr] Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks - Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4919) Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
[ https://issues.apache.org/jira/browse/SPARK-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4919. --- Resolution: Invalid Can Spark SQL thrift server UI provide JOB kill operate or any REST API? Key: SPARK-4919 URL: https://issues.apache.org/jira/browse/SPARK-4919 Project: Spark Issue Type: Wish Components: SQL, Web UI Affects Versions: 1.2.0 Reporter: Xiaoyu Wang Can the Spark SQL thrift server UI provide a “JOB” kill operation or any REST API? Stages can already be killed! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks
[ https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated SPARK-4921: --- Attachment: NO_PREF.patch Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks - Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang Attachments: NO_PREF.patch During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Priority: Critical (was: Blocker) Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if:
1) App 1 kills an executor
2) App 2, with spark.cores.max set, grabs a subset of cores on a worker
3) App 1 requests an executor

In this case, the new executor that App 1 gets back will be smaller than the rest and can execute fewer tasks in parallel. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
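As a concrete illustration of the standalone-mode setting mentioned above (the master URL and values are assumed for the example, not taken from the issue):

{code}
import org.apache.spark.SparkConf

// Cap this application's total core usage so other applications on the same
// standalone cluster can still be allocated executors.
val conf = new SparkConf()
  .setMaster("spark://master:7077")  // assumed standalone master URL
  .setAppName("example-app")
  .set("spark.cores.max", "8")
{code}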
[jira] [Created] (SPARK-4922) Support dynamic allocation for coarse-grained Mesos
Andrew Or created SPARK-4922: Summary: Support dynamic allocation for coarse-grained Mesos Key: SPARK-4922 URL: https://issues.apache.org/jira/browse/SPARK-4922 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.2.0 Reporter: Andrew Or Priority: Critical This brings SPARK-3174, which provided dynamic allocation of cluster resources to Spark on YARN applications, to Mesos coarse-grained mode. Note that the translation is not as trivial as adding a code path that exposes the request and kill mechanisms as we did for YARN in SPARK-3822. This is because Mesos coarse-grained mode schedules on the notion of setting the number of cores allowed for an application (as in standalone mode) instead of the number of executors (as in YARN mode). For more detail, please see SPARK-4751. If you intend to work on this, please provide a detailed design doc! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256045#comment-14256045 ] Andrew Or commented on SPARK-3174: -- Hey [~nemccarthy] I filed one at SPARK-4922, which is for coarse-grained mode. For fine-grained mode, there is already one that enables dynamically scaling memory instead of just CPU at SPARK-1882. I believe there has not been progress on either issue yet. Provide elastic scaling within a Spark application -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Fix For: 1.2.0 Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. It would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Discard executors when they are idle See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4918: --- Description: When code reviewing https://github.com/apache/spark/pull/3740, [~rxin] pointed out that we could reuse the Hadoop Text object in saveAsTextFile to reduce GC impact. Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor When code reviewing https://github.com/apache/spark/pull/3740, [~rxin] pointed out that we could reuse the Hadoop Text object in saveAsTextFile to reduce GC impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
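A hedged sketch of the idea (not the merged patch itself; rdd and path below are placeholders): reuse a single mutable Text instance per partition instead of allocating one per record.

{code}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.spark.SparkContext._

// One Text buffer per partition, overwritten in place for each record,
// which avoids allocating a new Text object per line and reduces GC pressure.
rdd.mapPartitions { iter =>
  val text = new Text()
  iter.map { x =>
    text.set(x.toString)
    (NullWritable.get(), text)
  }
}.saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
{code}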
[jira] [Updated] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4918: --- Assignee: Shixiong Zhu Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor When code reviewing https://github.com/apache/spark/pull/3740, [~rxin] pointed out that we could reuse the Hadoop Text object in saveAsTextFile to reduce GC impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4918. Resolution: Fixed Fix Version/s: 1.3.0 Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0 When code reviewing https://github.com/apache/spark/pull/3740, [~rxin] pointed out that we could reuse the Hadoop Text object in saveAsTextFile to reduce GC impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4915) Wrong classname of external shuffle service in the doc for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4915: - Affects Version/s: 1.2.0 Wrong classname of external shuffle service in the doc for dynamic allocation - Key: SPARK-4915 URL: https://issues.apache.org/jira/browse/SPARK-4915 Project: Spark Issue Type: Documentation Components: Documentation, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Fix For: 1.2.0, 1.3.0 docs/job-scheduling.md says as follows: {quote} To enable this service, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` in your cluster. {quote} The class name org.apache.spark.yarn.network.YarnShuffleService is wrong. org.apache.spark.network.yarn.YarnShuffleService is correct class name to be specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4915) Wrong classname of external shuffle service in the doc for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4915. Resolution: Fixed Fix Version/s: 1.3.0 1.2.0 Assignee: Tsuyoshi OZAWA Target Version/s: 1.2.0, 1.3.0 Wrong classname of external shuffle service in the doc for dynamic allocation - Key: SPARK-4915 URL: https://issues.apache.org/jira/browse/SPARK-4915 Project: Spark Issue Type: Documentation Components: Documentation, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 1.2.0, 1.3.0 docs/job-scheduling.md says as follows: {quote} To enable this service, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` in your cluster. {quote} The class name org.apache.spark.yarn.network.YarnShuffleService is wrong. org.apache.spark.network.yarn.YarnShuffleService is correct class name to be specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4915) Wrong classname of external shuffle service in the doc for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-4915: -- Fix Version/s: (was: 1.2.0) 1.2.1 Wrong classname of external shuffle service in the doc for dynamic allocation - Key: SPARK-4915 URL: https://issues.apache.org/jira/browse/SPARK-4915 Project: Spark Issue Type: Documentation Components: Documentation, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 1.3.0, 1.2.1 docs/job-scheduling.md says as follows: {quote} To enable this service, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` in your cluster. {quote} The class name org.apache.spark.yarn.network.YarnShuffleService is wrong. org.apache.spark.network.yarn.YarnShuffleService is correct class name to be specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4881) Use SparkConf#getBoolean instead of get().toBoolean
[ https://issues.apache.org/jira/browse/SPARK-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4881: - Priority: Trivial (was: Minor) Use SparkConf#getBoolean instead of get().toBoolean --- Key: SPARK-4881 URL: https://issues.apache.org/jira/browse/SPARK-4881 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Kousuke Saruta Priority: Trivial It's really a minor issue. In ApplicationMaster, there is code as follows. {code} val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean {code} I think the code can be simplified as follows. {code} val preserveFiles = sparkConf.getBoolean("spark.yarn.preserve.staging.files", false) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
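A small self-contained sketch (not from the original report) showing that the two forms read the same flag; the configuration key is used purely for illustration:
{code}
import org.apache.spark.SparkConf

object GetBooleanExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().set("spark.yarn.preserve.staging.files", "true")

    // Verbose form: fetch the String value and convert it manually.
    val viaString = conf.get("spark.yarn.preserve.staging.files", "false").toBoolean

    // Simplified form suggested in the issue: let SparkConf do the conversion.
    val viaGetter = conf.getBoolean("spark.yarn.preserve.staging.files", defaultValue = false)

    println(s"$viaString $viaGetter") // prints: true true
  }
}
{code}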
[jira] [Commented] (SPARK-4915) Wrong classname of external shuffle service in the doc for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256084#comment-14256084 ] Andrew Ash commented on SPARK-4915: --- Changing fix version to 1.2.1 from 1.2.0 because this didn't make it in in-time for 1.2.0 {noformat} aash@aash-mbp ~/git/spark$ git log origin/branch-1.2 --oneline | grep SPARK-4915 31d42c4 [SPARK-4915][YARN] Fix classname to be specified for external shuffle service. aash@aash-mbp ~/git/spark$ git log v1.2.0 --oneline | grep SPARK-4915 aash@aash-mbp ~/git/spark$ {noformat} Wrong classname of external shuffle service in the doc for dynamic allocation - Key: SPARK-4915 URL: https://issues.apache.org/jira/browse/SPARK-4915 Project: Spark Issue Type: Documentation Components: Documentation, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 1.3.0, 1.2.1 docs/job-scheduling.md says as follows: {quote} To enable this service, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` in your cluster. {quote} The class name org.apache.spark.yarn.network.YarnShuffleService is wrong. org.apache.spark.network.yarn.YarnShuffleService is correct class name to be specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4870) Add version information to driver log
[ https://issues.apache.org/jira/browse/SPARK-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4870: - Priority: Minor (was: Major) Add version information to driver log - Key: SPARK-4870 URL: https://issues.apache.org/jira/browse/SPARK-4870 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Zhang, Liye Priority: Minor Driver log doesn't include spark version information, version info is important in testing different spark version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4870) Add version information to driver log
[ https://issues.apache.org/jira/browse/SPARK-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4870. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Zhang, Liye Target Version/s: 1.3.0 Add version information to driver log - Key: SPARK-4870 URL: https://issues.apache.org/jira/browse/SPARK-4870 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Minor Fix For: 1.3.0 Driver log doesn't include spark version information, version info is important in testing different spark version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4883) Add a name to the directoryCleaner thread
[ https://issues.apache.org/jira/browse/SPARK-4883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4883. Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Assignee: Shixiong Zhu Target Version/s: 1.3.0, 1.2.1 Add a name to the directoryCleaner thread - Key: SPARK-4883 URL: https://issues.apache.org/jira/browse/SPARK-4883 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4883) Add a name to the directoryCleaner thread
[ https://issues.apache.org/jira/browse/SPARK-4883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4883: - Affects Version/s: 1.2.0 Add a name to the directoryCleaner thread - Key: SPARK-4883 URL: https://issues.apache.org/jira/browse/SPARK-4883 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.2.0 Reporter: Shixiong Zhu Priority: Minor Fix For: 1.3.0, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4520: Assignee: sadhan sood SparkSQL exception when reading certain columns from a parquet file --- Key: SPARK-4520 URL: https://issues.apache.org/jira/browse/SPARK-4520 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sadhan sood Assignee: sadhan sood Priority: Critical Attachments: part-r-0.parquet I am seeing this issue with spark sql throwing an exception when trying to read selective columns from a thrift parquet file and also when caching them. On some further digging, I was able to narrow it down to at-least one particular column type: mapstring, setstring to be causing this issue. To reproduce this I created a test thrift file with a very basic schema and stored some sample data in a parquet file: Test.thrift === {code} typedef binary SomeId enum SomeExclusionCause { WHITELIST = 1, HAS_PURCHASE = 2, } struct SampleThriftObject { 10: string col_a; 20: string col_b; 30: string col_c; 40: optional mapSomeExclusionCause, setSomeId col_d; } {code} = And loading the data in spark through schemaRDD: {code} import org.apache.spark.sql.SchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc); val parquetFile = /path/to/generated/parquet/file val parquetFileRDD = sqlContext.parquetFile(parquetFile) parquetFileRDD.printSchema root |-- col_a: string (nullable = true) |-- col_b: string (nullable = true) |-- col_c: string (nullable = true) |-- col_d: map (nullable = true) ||-- key: string ||-- value: array (valueContainsNull = true) |||-- element: string (containsNull = false) parquetFileRDD.registerTempTable(test) sqlContext.cacheTable(test) sqlContext.sql(select col_a from test).collect() -- see the exception stack here {code} {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at java.util.ArrayList.get(ArrayList.java:431) at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95) at
[jira] [Updated] (SPARK-4733) Add missing prameter comments in ShuffleDependency
[ https://issues.apache.org/jira/browse/SPARK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4733: - Affects Version/s: 1.2.0 Add missing prameter comments in ShuffleDependency -- Key: SPARK-4733 URL: https://issues.apache.org/jira/browse/SPARK-4733 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.2.0 Reporter: Takeshi Yamamuro Priority: Trivial Add missing Javadoc comments in ShuffleDependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2973) Add a way to show tables without executing a job
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256140#comment-14256140 ] Michael Armbrust commented on SPARK-2973: - I think the confusion there would be if someone then ran .map(...) on that RDD. It would be pretty confusing if it did not run a Spark job. What is wrong with the approach we are already using for executeCollect()? We can add an executeTake with a default implementation and override that in ExecutedCommand. Add a way to show tables without executing a job Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 Right now, sql("show tables").collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
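A minimal sketch of the idea in the comment above, using stand-in types rather than Spark's real SparkPlan/ExecutedCommand classes (the actual signatures may differ): the base plan gets an executeTake with a default implementation, and the command node overrides it so it can answer locally without launching a job.
{code}
object ExecuteTakeSketch {
  type Row = Seq[Any] // stand-in for Spark SQL's Row

  abstract class SparkPlan {
    def executeCollect(): Array[Row]
    // Default: take from a full collect (which, in the real engine, may run a job).
    def executeTake(n: Int): Array[Row] = executeCollect().take(n)
  }

  // A command such as SHOW TABLES already holds its result on the driver,
  // so it can serve take() without any distributed work.
  class ExecutedCommand(result: Seq[Row]) extends SparkPlan {
    override def executeCollect(): Array[Row] = result.toArray
    override def executeTake(n: Int): Array[Row] = result.take(n).toArray
  }

  def main(args: Array[String]): Unit = {
    val showTables = new ExecutedCommand(Seq(Seq("table_a"), Seq("table_b")))
    println(showTables.executeTake(1).mkString(", ")) // answered locally, no job
  }
}
{code}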
[jira] [Closed] (SPARK-4733) Add missing prameter comments in ShuffleDependency
[ https://issues.apache.org/jira/browse/SPARK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4733. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Takeshi Yamamuro Target Version/s: 1.3.0 Add missing prameter comments in ShuffleDependency -- Key: SPARK-4733 URL: https://issues.apache.org/jira/browse/SPARK-4733 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.2.0 Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Priority: Trivial Fix For: 1.3.0 Add missing Javadoc comments in ShuffleDependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4447. Resolution: Fixed Fix Version/s: 1.3.0 Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0 For example, YarnRMClient and YarnRMClientImpl can be merged YarnAllocator and YarnAllocationHandler can be merged -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4447: - Priority: Critical (was: Major) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.3.0 For example, YarnRMClient and YarnRMClientImpl can be merged YarnAllocator and YarnAllocationHandler can be merged -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256174#comment-14256174 ] David Ross commented on SPARK-4908: --- Note that I noticed this line from native Hive logging: {code} 14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager {code} It seems to be tied to this config: https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719 I have added this to our {{hive-site.xml}} in the spark {{conf}} directory: {code} <property> <name>hive.support.concurrency</name> <value>true</value> </property> {code} And I still have the issue. Perhaps there is more I need to do to support concurrency? Spark SQL built for Hive 13 fails under concurrent metadata queries --- Key: SPARK-4908 URL: https://issues.apache.org/jira/browse/SPARK-4908 Project: Spark Issue Type: Bug Reporter: David Ross We are trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6 We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}} In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL, all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue. Here is some example code: {code} object main extends App { import java.sql._ import scala.concurrent._ import scala.concurrent.duration._ import scala.concurrent.ExecutionContext.Implicits.global Class.forName("org.apache.hive.jdbc.HiveDriver") val host = "localhost" // update this val url = s"jdbc:hive2://${host}:10511/some_db" // update this val future = Future.traverse(1 to 3) { i => Future { println("Starting: " + i) try { val conn = DriverManager.getConnection(url) } catch { case e: Throwable => e.printStackTrace() println("Failed: " + i) } println("Finishing: " + i) } } Await.result(future, 2.minutes) println("done!") 
} {code} Here is the output: {code} Starting: 1 Starting: 3 Starting: 2 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Failed: 3 Finishing: 3 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at
[jira] [Comment Edited] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256174#comment-14256174 ] David Ross edited comment on SPARK-4908 at 12/22/14 8:43 PM: - Note that I noticed this line in the logs that seems to come from Hive logging (not spark code): {code} 14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager {code} It seems to be tied to this config: https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719 I have this to our {{hive-site.xml}} in the spark {{conf}} directory: {code} property namehive.support.concurrency/name valuetrue/value /property {code} And I still have the issue. Perhaps there is more I need to do to support concurrency? was (Author: dyross): Note that noticed this line from native Hive logging: {code} 14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager {code} It seems to be tied to this config: https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719 I have this to our {{hive-site.xml}} in the spark {{conf}} directory: {code} property namehive.support.concurrency/name valuetrue/value /property {code} And I still have the issue. Perhaps there is more I need to do to support concurrency? Spark SQL built for Hive 13 fails under concurrent metadata queries --- Key: SPARK-4908 URL: https://issues.apache.org/jira/browse/SPARK-4908 Project: Spark Issue Type: Bug Reporter: David Ross We are trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6 We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}} In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL, all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue. Here is some example code: {code} object main extends App { import java.sql._ import scala.concurrent._ import scala.concurrent.duration._ import scala.concurrent.ExecutionContext.Implicits.global Class.forName(org.apache.hive.jdbc.HiveDriver) val host = localhost // update this val url = sjdbc:hive2://${host}:10511/some_db // update this val future = Future.traverse(1 to 3) { i = Future { println(Starting: + i) try { val conn = DriverManager.getConnection(url) } catch { case e: Throwable = e.printStackTrace() println(Failed: + i) } println(Finishing: + i) } } Await.result(future, 2.minutes) println(done!) 
} {code} Here is the output: {code} Starting: 1 Starting: 3 Starting: 2 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Failed: 3 Finishing: 3 java.sql.SQLException:
[jira] [Resolved] (SPARK-4079) Snappy bundled with Spark does not work on older Linux distributions
[ https://issues.apache.org/jira/browse/SPARK-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4079. Resolution: Fixed Fix Version/s: 1.3.0 Snappy bundled with Spark does not work on older Linux distributions Key: SPARK-4079 URL: https://issues.apache.org/jira/browse/SPARK-4079 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Kostas Sakellis Fix For: 1.3.0 This issue has existed at least since 1.0, but has been made worse by 1.1 since snappy is now the default compression algorithm. When trying to use it on a CentOS 5 machine, for example, you'll get something like this: {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:319) at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:226) at org.xerial.snappy.Snappy.clinit(Snappy.java:48) at org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79) at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207) ... Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5.3-af72bf3c-9dab-43af-a662-f9af657f06b1-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by /tmp/snappy-1.0.5.3-af72bf3c-9dab-43af-a662-f9af657f06b1-libsnappyjava.so) at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1957) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1882) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1843) at java.lang.Runtime.load0(Runtime.java:795) at java.lang.System.load(System.java:1061) at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39) ... 29 more {noformat} There are two approaches I can see here (well, 3): * Declare CentOS 5 (and similar OSes) not supported, although that would suck for the people who are still on it and already use Spark * Fallback to another compression codec if Snappy cannot be loaded * Ask the Snappy guys to compile the library on an older OS... I think the second would be the best compromise. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
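As a stopgap on affected machines (until a built-in fallback exists), the codec can be switched away from Snappy explicitly; a minimal sketch, assuming the LZF codec shipped with this era of Spark is acceptable:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object LzfFallbackExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("lzf-fallback")
      .setMaster("local[*]")
      // Avoid loading the Snappy native library on distros with an old libstdc++.
      .set("spark.io.compression.codec", "lzf")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).map(_ * 2).sum())
    sc.stop()
  }
}
{code}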
[jira] [Resolved] (SPARK-4864) Add documentation to Netty-based configs
[ https://issues.apache.org/jira/browse/SPARK-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4864. Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Add documentation to Netty-based configs Key: SPARK-4864 URL: https://issues.apache.org/jira/browse/SPARK-4864 Project: Spark Issue Type: Bug Components: Documentation Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.3.0, 1.2.1 Currently there is no public documentation for the NettyBlockTransferService or various configuration options of the network package. We should add some. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4520: --- Target Version/s: 1.2.1 (was: 1.3.0) SparkSQL exception when reading certain columns from a parquet file --- Key: SPARK-4520 URL: https://issues.apache.org/jira/browse/SPARK-4520 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sadhan sood Assignee: sadhan sood Priority: Critical Attachments: part-r-0.parquet I am seeing this issue with spark sql throwing an exception when trying to read selective columns from a thrift parquet file and also when caching them. On some further digging, I was able to narrow it down to at-least one particular column type: mapstring, setstring to be causing this issue. To reproduce this I created a test thrift file with a very basic schema and stored some sample data in a parquet file: Test.thrift === {code} typedef binary SomeId enum SomeExclusionCause { WHITELIST = 1, HAS_PURCHASE = 2, } struct SampleThriftObject { 10: string col_a; 20: string col_b; 30: string col_c; 40: optional mapSomeExclusionCause, setSomeId col_d; } {code} = And loading the data in spark through schemaRDD: {code} import org.apache.spark.sql.SchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc); val parquetFile = /path/to/generated/parquet/file val parquetFileRDD = sqlContext.parquetFile(parquetFile) parquetFileRDD.printSchema root |-- col_a: string (nullable = true) |-- col_b: string (nullable = true) |-- col_c: string (nullable = true) |-- col_d: map (nullable = true) ||-- key: string ||-- value: array (valueContainsNull = true) |||-- element: string (containsNull = false) parquetFileRDD.registerTempTable(test) sqlContext.cacheTable(test) sqlContext.sql(select col_a from test).collect() -- see the exception stack here {code} {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at java.util.ArrayList.get(ArrayList.java:431) at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95) at
[jira] [Commented] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256216#comment-14256216 ] Apache Spark commented on SPARK-1714: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3765 Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-1714: -- Target Version/s: 1.3.0 Affects Version/s: 1.2.0 Fix Version/s: (was: 1.2.0) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256228#comment-14256228 ] Peng Cheng commented on SPARK-3452: --- IMHO REPL should be kept being published, one of my project extends its API to initialize some third-party components upon launching. This should be made an 'official' API to encourage more platform integrate with it. Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4923) Maven build should keep publishing spark-repl
Peng Cheng created SPARK-4923: - Summary: Maven build should keep publishing spark-repl Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
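For context, a downstream project that extends the REPL would pull the artifact in roughly like this (an illustrative build.sbt fragment; the version shown is a placeholder, and this only works while the artifact keeps being published):
{code}
// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0",
  "org.apache.spark" %% "spark-repl" % "1.2.0" // the artifact this issue asks to keep publishing
)
{code}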
[jira] [Resolved] (SPARK-4818) Join operation should use iterator/lazy evaluation
[ https://issues.apache.org/jira/browse/SPARK-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4818. --- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 1.1.2 Issue resolved by pull request 3671 [https://github.com/apache/spark/pull/3671] Join operation should use iterator/lazy evaluation -- Key: SPARK-4818 URL: https://issues.apache.org/jira/browse/SPARK-4818 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Johannes Simon Fix For: 1.1.2, 1.3.0, 1.2.1 The current implementation of the join operation does not use an iterator (i.e. lazy evaluation), causing it to explicitly evaluate the co-grouped values. In big data applications, these value collections can be very large. This causes the *cartesian product of all co-grouped values* for a specific key of both RDDs to be kept in memory during the flatMapValues operation, resulting in an *O(size(pair._1)*size(pair._2))* memory consumption instead of *O(1)*. Very large value collections will therefore cause GC overhead limit exceeded exceptions and fail the task, or at least slow down execution dramatically. {code:title=PairRDDFunctions.scala|borderStyle=solid} //... def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = { this.cogroup(other, partitioner).flatMapValues( pair = for (v - pair._1; w - pair._2) yield (v, w) ) } //... {code} Since cogroup returns an Iterable instance of an Array, the join implementation could be changed to the following, which uses lazy evaluation instead, and has almost no memory overhead: {code:title=PairRDDFunctions.scala|borderStyle=solid} //... def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = { this.cogroup(other, partitioner).flatMapValues( pair = for (v - pair._1.iterator; w - pair._2.iterator) yield (v, w) ) } //... {code} Alternatively, if the current implementation is intentionally not using lazy evaluation for some reason, there could be a *lazyJoin()* method next to the original join implementation that utilizes lazy evaluation. This of course applies to other join operations as well. Thanks! :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4818) Join operation should use iterator/lazy evaluation
[ https://issues.apache.org/jira/browse/SPARK-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4818: -- Assignee: Shixiong Zhu Join operation should use iterator/lazy evaluation -- Key: SPARK-4818 URL: https://issues.apache.org/jira/browse/SPARK-4818 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Johannes Simon Assignee: Shixiong Zhu Fix For: 1.3.0, 1.1.2, 1.2.1 The current implementation of the join operation does not use an iterator (i.e. lazy evaluation), causing it to explicitly evaluate the co-grouped values. In big data applications, these value collections can be very large. This causes the *cartesian product of all co-grouped values* for a specific key of both RDDs to be kept in memory during the flatMapValues operation, resulting in an *O(size(pair._1)*size(pair._2))* memory consumption instead of *O(1)*. Very large value collections will therefore cause GC overhead limit exceeded exceptions and fail the task, or at least slow down execution dramatically. {code:title=PairRDDFunctions.scala|borderStyle=solid} //... def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = { this.cogroup(other, partitioner).flatMapValues( pair = for (v - pair._1; w - pair._2) yield (v, w) ) } //... {code} Since cogroup returns an Iterable instance of an Array, the join implementation could be changed to the following, which uses lazy evaluation instead, and has almost no memory overhead: {code:title=PairRDDFunctions.scala|borderStyle=solid} //... def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = { this.cogroup(other, partitioner).flatMapValues( pair = for (v - pair._1.iterator; w - pair._2.iterator) yield (v, w) ) } //... {code} Alternatively, if the current implementation is intentionally not using lazy evaluation for some reason, there could be a *lazyJoin()* method next to the original join implementation that utilizes lazy evaluation. This of course applies to other join operations as well. Thanks! :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
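To make the laziness argument above concrete outside of Spark, here is a small self-contained illustration with plain Scala collections (not the PairRDDFunctions code itself): the eager for-comprehension materializes the full cross product, while the iterator-based one produces pairs on demand.
{code}
object LazyJoinSketch {
  def main(args: Array[String]): Unit = {
    val left: Iterable[Int]  = 1 to 3
    val right: Iterable[Int] = 10 to 12

    // Eager: the whole cross product (9 pairs here) is built in memory at once.
    val eager: Iterable[(Int, Int)] = for (v <- left; w <- right) yield (v, w)

    // Lazy: an Iterator that only produces a pair when it is asked for one.
    val lazyPairs: Iterator[(Int, Int)] =
      for (v <- left.iterator; w <- right.iterator) yield (v, w)

    println(eager.size)               // 9 (already materialized)
    println(lazyPairs.take(2).toList) // List((1,10), (1,11)) -- only two pairs produced
  }
}
{code}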
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256308#comment-14256308 ] Patrick Wendell commented on SPARK-4923: Hey [~pc...@uowmail.edu.au] - we removed this from Maven because it's not meant as a stable API in Spark. Could you talk about which parts of the repl API you are using and how you are using it? Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256319#comment-14256319 ] Peng Cheng commented on SPARK-4923: --- Hey Patrick, The following APIs have been integrated since 1.0.0; IMHO they are stable enough for daily prototyping (creating case classes used to be defective but was fixed a long time ago): SparkILoop.getAddedJars() $SparkIMain.bind $SparkIMain.quietBind $SparkIMain.interpret end of :) At first I assumed that further development on it had been moved to Databricks Cloud. But the JIRA ticket was already there in September, so maybe demand on this API from the community is indeed low enough. However, I would still suggest keeping it, or even promoting it into a Developer API; this would encourage more projects to integrate in a more flexible way, and save prototyping/QA cost by customizing fixtures of the REPL. People will still move to Databricks Cloud, which has far more features than that. Many influential projects already depend on the routinely published Scala REPL (e.g. playFW), so it would be strange for Spark not to do the same. What do you think? Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256322#comment-14256322 ] Peng Cheng commented on SPARK-4923: --- Sorry my project is https://github.com/tribbloid/ISpark Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4924) Factor out code to launch Spark applications into a separate library
Marcelo Vanzin created SPARK-4924: - Summary: Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin One of the questions we run into rather commonly is how to start a Spark application from my Java/Scala program?. There currently isn't a good answer to that: - Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode) - Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts - Calling the shell script directly is doable, but sort of ugly from an API point of view. I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
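Until such a library exists, the "call the shell script directly" option mentioned above looks roughly like the following purely illustrative sketch (class name, jar path, and master are placeholders; this is not a proposed API and nothing here is an existing Spark class):
{code}
import scala.sys.process._

object SparkSubmitFromCode {
  def main(args: Array[String]): Unit = {
    // Hypothetical invocation: shell out to spark-submit instead of wiring up
    // SparkSubmit or SparkContext directly, which is exactly the awkwardness
    // the proposed launcher library aims to remove.
    val cmd = Seq(
      "spark-submit",
      "--master", "yarn-cluster",
      "--class", "com.example.MyApp", // placeholder application class
      "/path/to/my-app.jar")          // placeholder jar
    val exitCode = cmd.!              // blocks until spark-submit exits
    println(s"spark-submit exited with $exitCode")
  }
}
{code}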
[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-4924: -- Attachment: spark-launcher.txt Attaching a mini-spec to describe the motivation and a proposal for the library. I'm currently working on a prototype based on that spec and should have something to share soon-ish. Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin Attachments: spark-launcher.txt One of the questions we run into rather commonly is how to start a Spark application from my Java/Scala program?. There currently isn't a good answer to that: - Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode) - Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts - Calling the shell script directly is doable, but sort of ugly from an API point of view. I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256357#comment-14256357 ] Nicholas Chammas commented on SPARK-2541: - I was just wondering if we needed to ping someone to work on this. Taking another look at the history on this issue, it looks like you already started working on it, so no worries. Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves Attachments: SPARK-2541-partial.patch In spark 0.9.x you could access secure HDFS from Standalone deploy, that doesn't work in 1.X anymore. It looks like the issues is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if the currentUser == user. Not sure how it affects when the daemons run as a super user but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
Alex Liu created SPARK-4925: --- Summary: Publish Spark SQL hive-thriftserver maven artifact Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Alex Liu Fix For: 1.2.0, 1.1.2 The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R
[ https://issues.apache.org/jira/browse/SPARK-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4907. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3746 [https://github.com/apache/spark/pull/3746] Inconsistent loss and gradient in LeastSquaresGradient compared with R -- Key: SPARK-4907 URL: https://issues.apache.org/jira/browse/SPARK-4907 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai Fix For: 1.3.0 In most of the academic paper and algorithm implementations, people use L = 1/2n ||A weights-y||^2 instead of L = 1/n ||A weights-y||^2 for least-squared loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses different convention, this will result different residuals and all the stats properties will be different from GLMNET package in R. The model coefficients will be still the same under this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
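For reference, the only thing the convention change affects is the scale of the reported loss and gradient, not the minimizer; a short derivation (standard calculus, with w the weight vector, A the design matrix, y the labels, n the number of samples):
{code}
L_old(w) = (1/n)    * ||A w - y||^2   =>   grad L_old(w) = (2/n) * A^T (A w - y)
L_new(w) = (1/(2n)) * ||A w - y||^2   =>   grad L_new(w) = (1/n) * A^T (A w - y)
{code}
Both gradients point in the same direction for every w, so the optimal coefficients are unchanged; only the reported loss and gradient values are halved, which is what makes the statistics line up with R's glmnet convention.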
[jira] [Updated] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R
[ https://issues.apache.org/jira/browse/SPARK-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4907: - Assignee: DB Tsai Inconsistent loss and gradient in LeastSquaresGradient compared with R -- Key: SPARK-4907 URL: https://issues.apache.org/jira/browse/SPARK-4907 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.3.0 In most of the academic paper and algorithm implementations, people use L = 1/2n ||A weights-y||^2 instead of L = 1/n ||A weights-y||^2 for least-squared loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses different convention, this will result different residuals and all the stats properties will be different from GLMNET package in R. The model coefficients will be still the same under this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Fix Version/s: (was: 1.1.2) (was: 1.2.0) Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256393#comment-14256393 ] Patrick Wendell commented on SPARK-4923: Hey [~pc...@uowmail.edu.au], thanks for filling that in - I didn't even realize we had code in there that was bytecode public. By stable I meant that we are promising it is an unchanging API. This is what we usually think about when we release things. For 1.2.0 I refactored our build and found out that we were publishing a bunch of random internal build components, so I took them all out of the published artifacts (examples, our assembly jar, etc) in SPARK-4923. Anyways - perhaps we could just annotate these as developer API's and be clear that they might change in the future. If you wanted to do that, and re-enable publishing them, I'd be happy to do that. Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256397#comment-14256397 ] Patrick Wendell commented on SPARK-4925: The hive-thriftserver module is just used when building a Spark distribution, user applications shouldn't need to link against it. Could you talk a bit more about how you are actually using the thriftserver? Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Component/s: Build Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.0 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Affects Version/s: (was: 1.1.1) 1.2.0 Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.0 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4923: --- Component/s: Build Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it is in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3816) Add configureOutputJobPropertiesForStorageHandler to JobConf in SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256399#comment-14256399 ] Apache Spark commented on SPARK-3816: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/3766 Add configureOutputJobPropertiesForStorageHandler to JobConf in SparkHadoopWriter - Key: SPARK-3816 URL: https://issues.apache.org/jira/browse/SPARK-3816 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu Fix For: 1.2.0 It's similar to SPARK-2846. We should add PlanUtils.configureInputJobPropertiesForStorageHandler to SparkHadoopWriter, so that the writer can add configuration from a customized StorageHandler to the JobConf. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256405#comment-14256405 ] Alex Liu commented on SPARK-4925: - Our build.xml downloads the hive-thriftserver maven artifact and adds the downloaded jar file to the classpath. Currently we have it published in our private repository, but we hope we won't need to maintain our private Spark build repository and can depend only on the public maven repository to download it. Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.0 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4920: --- Assignee: uncleGen current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4920. Resolution: Fixed Fix Version/s: 1.2.1 I believe this has been fixed: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a8a8e0e8752194d82b6c6e20cedbb3871b221916 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256415#comment-14256415 ] Apache Spark commented on SPARK-4925: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/3766 Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.0 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks
[ https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256425#comment-14256425 ] Rui Li commented on SPARK-4921: --- I'm not sure if this is intended, but returning process_local for no_pref tasks may reset {{currentLocalityIndex}} to 0, which may cause more delay later. There seems to be a check to avoid this, but I doubt it's sufficient:
{code}
// Update our locality level for delay scheduling
// NO_PREF will not affect the variables related to delay scheduling
if (maxLocality != TaskLocality.NO_PREF) {
  currentLocalityIndex = getLocalityIndex(taskLocality)
  lastLaunchTime = curTime
}
{code}
Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks - Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang Attachments: NO_PREF.patch During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
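To spell out the concern in the comment above, here is a small self-contained model of the locality-index bookkeeping; it is not the actual TaskSetManager code, and the names and locality ladder are simplified for illustration. It shows how reporting a NO_PREF launch as PROCESS_LOCAL snaps the index back to the most restrictive level, so later launches wait for locality again.
{code}
object LocalityModel {
  // Simplified locality ladder, from most to least local.
  val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")

  def getLocalityIndex(level: String): Int = levels.indexOf(level)

  // currentIndex tracks how far down the ladder delay scheduling has relaxed.
  def launchTask(currentIndex: Int, reportedLevel: String, isNoPref: Boolean): Int =
    if (isNoPref) currentIndex            // the guard quoted above: NO_PREF leaves the index alone
    else getLocalityIndex(reportedLevel)  // but a NO_PREF task reported as PROCESS_LOCAL resets it to 0

  def main(args: Array[String]): Unit = {
    var idx = getLocalityIndex("RACK_LOCAL")                   // scheduler had already relaxed to RACK_LOCAL
    idx = launchTask(idx, "PROCESS_LOCAL", isNoPref = false)   // mislabelled NO_PREF launch
    println(s"locality index after launch: $idx")              // prints 0, so delay scheduling starts over
  }
}
{code}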
[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-4923: -- Attachment: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Thank you so much! First patch uploaded. Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it is in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-4923: -- Target Version/s: 1.3.0, 1.2.1 (was: 1.3.0) Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it is in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3860) Improve dimension joins
[ https://issues.apache.org/jira/browse/SPARK-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-3860: --- Assignee: Michael Armbrust Improve dimension joins --- Key: SPARK-3860 URL: https://issues.apache.org/jira/browse/SPARK-3860 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Critical This is an umbrella ticket for improving performance for joining multiple dimension tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4926) Spark manipulate Hbase
Lily created SPARK-4926: --- Summary: Spark manipulate Hbase Key: SPARK-4926 URL: https://issues.apache.org/jira/browse/SPARK-4926 Project: Spark Issue Type: Question Reporter: Lily When I run the program below, I got an error “Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 14) had a not serializable result: org.apache.hadoop.hbase.io.ImmutableBytesWritable”. How can I manipulate the results? How can I implement get, put, and scan for HBase in Scala? There aren't any examples in the source code files.
{code}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark._

object HbaseTest extends Serializable {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseTest")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("hbase.zookeeper.quorum", "192.168.179.146")
    conf.set(TableInputFormat.INPUT_TABLE, "sensteer_rawdata")
    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable("sensteer_rawdata")) {
      val tableDesc = new HTableDescriptor("sensteer_rawdata")
      admin.createTable(tableDesc)
    }
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    val count = hBaseRDD.count()
    println("--" + hBaseRDD.count() + "--")
    val res = hBaseRDD.take(count.toInt)
    sc.stop()
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
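One common way around the non-serializable result, offered here as a suggestion rather than an official recipe, is to map each HBase row to plain serializable values before calling take() or collect(). The snippet builds on the hBaseRDD from the program above; the column family "cf" and qualifier "col" are placeholder names.
{code}
import org.apache.hadoop.hbase.util.Bytes

// Convert each (ImmutableBytesWritable, Result) pair into serializable Strings
// before bringing rows back to the driver.
val rows = hBaseRDD.map { case (key, result) =>
  val rowKey = Bytes.toString(key.get())
  val value = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    .map(Bytes.toString)
    .orNull
  (rowKey, value)
}
val sample = rows.take(10) // the driver now receives plain String tuples
{code}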
[jira] [Commented] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256468#comment-14256468 ] Patrick Wendell commented on SPARK-4906: Hey [~mingyu.z...@gmail.com] - could you say a bit more about how a workload can generate this number of failed tasks in the live set of running stages? If they are each 10kb and you see them taking an aggregate of 500MB, this means you have 50,000 failed tasks in the live set. I've never seen this before because typically once a few tasks have failed the stage will fail, so this definitely seems like an extreme case. Running a job with hundreds of thousands of tasks might require a good-sized heap at the driver even for other reasons. How big of a heap are you using? We might be able to limit the number of unique string objects that are allocated if we have a large number of tasks that refer to an identical stack trace. Spark master OOMs with exception stack trace stored in JobProgressListener -- Key: SPARK-4906 URL: https://issues.apache.org/jira/browse/SPARK-4906 Project: Spark Issue Type: Bug Affects Versions: 1.1.1 Reporter: Mingyu Kim Spark master was OOMing with a lot of stack traces retained in JobProgressListener. The object dependency goes like the following: JobProgressListener.stageIdToData => StageUIData.taskData => TaskUIData.errorMessage. Each error message is ~10kb since it has the entire stack trace. As we have a lot of tasks, when all of the tasks across multiple stages go bad, these error messages accounted for 0.5GB of heap at some point. Please correct me if I'm wrong, but it looks like all the task info for running applications is kept in memory, which means it's almost always bound to OOM for long-running applications. Would it make sense to fix this, for example, by spilling some UI state to disk? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
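The deduplication idea mentioned in the comment could look roughly like the following; this is a minimal sketch of interning identical error strings, not the actual JobProgressListener change.
{code}
import scala.collection.mutable

object ErrorMessageCache {
  private val canonical = mutable.HashMap.empty[String, String]

  // Return a shared copy of an error message: many failed tasks with the same
  // stack trace then reference one String instead of ~10 KB each.
  def intern(message: String): String = synchronized {
    canonical.getOrElseUpdate(message, message)
  }
}

// Usage sketch: store ErrorMessageCache.intern(errorMessage) in the UI data
// structures rather than the raw per-task string.
{code}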
[jira] [Created] (SPARK-4927) Spark does not clean up properly during long jobs.
Ilya Ganelin created SPARK-4927: --- Summary: Spark does not clean up properly during long jobs. Key: SPARK-4927 URL: https://issues.apache.org/jira/browse/SPARK-4927 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Ilya Ganelin On a long-running Spark job, Spark will eventually run out of memory on the driver node due to metadata overhead from the shuffle operation. Spark will continue to operate, but with drastically decreased performance (since swapping now occurs with every operation). The spark.cleaner.ttl parameter allows a user to configure when cleanup happens, but the issue is that it isn't done safely: if this clears a cached RDD or active task in the middle of processing a stage, it ultimately causes a KeyNotFoundException when the next stage attempts to reference the cleared RDD or task. There should be a sustainable mechanism for cleaning up stale metadata that allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
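For reference, the TTL-based cleanup discussed in this report is configured on the SparkConf; a minimal sketch follows, where the 3600-second value is only an example and, as the report warns, an aggressive value can clear data that is still in use.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Enable periodic metadata cleanup; the value is the age in seconds after
// which Spark considers metadata stale and eligible for removal.
val conf = new SparkConf()
  .setAppName("LongRunningJob")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)
{code}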
[jira] [Created] (SPARK-4928) Operation <,>,<=,>= with Decimal report error
guowei created SPARK-4928: - Summary: Operation <,>,<=,>= with Decimal report error Key: SPARK-4928 URL: https://issues.apache.org/jira/browse/SPARK-4928 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei
{code}
create table test (a Decimal(10,1));
select * from test where a > 1;
{code}
{code}
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Types do not match DecimalType(10,1) != DecimalType(10,0), tree: (input[0] > 1)
at org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:249)
at org.apache.spark.sql.catalyst.expressions.GreaterThan.eval(predicates.scala:204)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:794)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:794)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1324)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1324)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
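Until the fix lands, a possible workaround, offered only as an untested sketch against a HiveContext (the table and column names follow the reproduction above), is to cast the literal so both sides share the same precision and scale.
{code}
// Untested sketch: align the literal's DecimalType with the column's type so
// the comparison no longer mixes DecimalType(10,1) and DecimalType(10,0).
sqlContext.sql("SELECT * FROM test WHERE a > CAST(1 AS DECIMAL(10,1))").collect()
{code}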
[jira] [Commented] (SPARK-4928) Operator <,>,<=,>= with decimal between different precision report error
[ https://issues.apache.org/jira/browse/SPARK-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256509#comment-14256509 ] Apache Spark commented on SPARK-4928: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/3767 Operator <,>,<=,>= with decimal between different precision report error -- Key: SPARK-4928 URL: https://issues.apache.org/jira/browse/SPARK-4928 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei
{code}
create table test (a Decimal(10,1));
select * from test where a > 1;
{code}
{code}
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Types do not match DecimalType(10,1) != DecimalType(10,0), tree: (input[0] > 1)
at org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:249)
at org.apache.spark.sql.catalyst.expressions.GreaterThan.eval(predicates.scala:204)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:794)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:794)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1324)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1324)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256526#comment-14256526 ] Apache Spark commented on SPARK-4920: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/3768 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256544#comment-14256544 ] Apache Spark commented on SPARK-4920: - User 'liyezhang556520' has created a pull request for this issue: https://github.com/apache/spark/pull/3769 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256545#comment-14256545 ] Zhang, Liye commented on SPARK-4920: Seems standalone mode is not with the version. I agree with [~sowen], it'll be not good looking when the version is the long. Putting the version on footer will be not string but will be flexible for extension. current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256545#comment-14256545 ] Zhang, Liye edited comment on SPARK-4920 at 12/23/14 3:25 AM: -- Seems standalone mode is not with the version on web UI. I agree with [~sowen], it'll be not good looking when the version is the long. Putting the version on footer will be not striking but will be flexible for extension. was (Author: liyezhang556520): Seems standalone mode is not with the version. I agree with [~sowen], it'll be not good looking when the version is the long. Putting the version on footer will be not string but will be flexible for extension. current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4890) Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it
[ https://issues.apache.org/jira/browse/SPARK-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256581#comment-14256581 ] Apache Spark commented on SPARK-4890: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/3770 Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it - Key: SPARK-4890 URL: https://issues.apache.org/jira/browse/SPARK-4890 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.3.0 We should upgrade to a newer version of Boto (2.34.0), since this is blocking several features. It looks like newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources. Therefore, I think we should change {{spark-ec2}} to automatically download Boto from PyPi if it's not present in {{SPARK_EC2_DIR/lib}}, similar to what we do in the {{sbt/sbt}} scripts. This shouldn't be an issue for users since they already need to have an internet connection to launch an EC2 cluster. By performing the download in {{spark_ec2.py}} instead of the Bash script, this should also work for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4929) Yarn Client mode can not support the HA after the exitcode change
SaintBacchus created SPARK-4929: --- Summary: Yarn Client mode can not support the HA after the exitcode change Key: SPARK-4929 URL: https://issues.apache.org/jira/browse/SPARK-4929 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: SaintBacchus Currently, yarn-client will exit directly when an HA change happens, no matter how many times the AM should retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4929) Yarn Client mode can not support the HA after the exitcode change
[ https://issues.apache.org/jira/browse/SPARK-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256585#comment-14256585 ] Apache Spark commented on SPARK-4929: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/3771 Yarn Client mode can not support the HA after the exitcode change - Key: SPARK-4929 URL: https://issues.apache.org/jira/browse/SPARK-4929 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: SaintBacchus Currently, yarn-client will exit directly when an HA change happens, no matter how many times the AM should retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4930) [SQL][DOCS]Update SQL programming guide
Gankun Luo created SPARK-4930: - Summary: [SQL][DOCS]Update SQL programming guide Key: SPARK-4930 URL: https://issues.apache.org/jira/browse/SPARK-4930 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Gankun Luo Priority: Trivial `CACHE TABLE tbl` is now eager by default not lazy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
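For readers of the guide, the behavior being documented, eager caching by default with an explicit LAZY form, can be illustrated roughly as follows; this assumes a SQLContext with a registered table named records and the CACHE LAZY TABLE syntax available in this release line.
{code}
// CACHE TABLE is now eager: the table is materialized in the in-memory
// columnar store as soon as the statement runs.
sqlContext.sql("CACHE TABLE records")

// The previous lazy behavior remains available explicitly.
sqlContext.sql("CACHE LAZY TABLE records")

// UNCACHE TABLE releases the cached data.
sqlContext.sql("UNCACHE TABLE records")
{code}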
[jira] [Commented] (SPARK-4930) [SQL][DOCS]Update SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256632#comment-14256632 ] Apache Spark commented on SPARK-4930: - User 'luogankun' has created a pull request for this issue: https://github.com/apache/spark/pull/3773 [SQL][DOCS]Update SQL programming guide Key: SPARK-4930 URL: https://issues.apache.org/jira/browse/SPARK-4930 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Gankun Luo Priority: Trivial `CACHE TABLE tbl` is now eager by default not lazy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org