[jira] [Commented] (SPARK-4916) [SQL][DOCS]Update SQL programming guide about cache section
[ https://issues.apache.org/jira/browse/SPARK-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255538#comment-14255538 ] Apache Spark commented on SPARK-4916: - User 'luogankun' has created a pull request for this issue: https://github.com/apache/spark/pull/3759 [SQL][DOCS]Update SQL programming guide about cache section --- Key: SPARK-4916 URL: https://issues.apache.org/jira/browse/SPARK-4916 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Gankun Luo Priority: Trivial SchemaRDD.cache() now uses in-memory columnar storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
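For context, a minimal sketch of what the updated cache section describes (the JSON file and table name below are illustrative, not from the issue): with this change, calling cache() on a SchemaRDD is expected to use the same in-memory columnar format as sqlContext.cacheTable.

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)              // sc: an existing SparkContext
val people = sqlContext.jsonFile("people.json")  // illustrative input path
people.registerTempTable("people")

sqlContext.cacheTable("people")  // columnar in-memory cache via the table name
people.cache()                   // per this change, also cached in columnar form
{code}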
[jira] [Created] (SPARK-4917) Add a function to convert into a graph with canonical edges in GraphOps
Takeshi Yamamuro created SPARK-4917: --- Summary: Add a function to convert into a graph with canonical edges in GraphOps Key: SPARK-4917 URL: https://issues.apache.org/jira/browse/SPARK-4917 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Convert bi-directional edges into uni-directional ones, instead of relying on 'canonicalOrientation' in GraphLoader.edgeListFile. This function is useful when a graph is loaded as-is and later transformed into one with canonical edges. It rewrites the vertex ids of edges so that srcIds are bigger than dstIds, and merges the duplicated edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
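A minimal sketch of the proposed operation, assuming the semantics described above (the method name, attribute type, and merge function here are illustrative; the actual GraphOps API is defined by the pull request):

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._

// Canonicalize edges so that srcId > dstId, then merge duplicate edges
// (here by summing their attributes).
def toCanonicalEdges(graph: Graph[Int, Int]): Graph[Int, Int] = {
  val canonical = graph.edges
    .map { e =>
      if (e.srcId > e.dstId) ((e.srcId, e.dstId), e.attr)
      else ((e.dstId, e.srcId), e.attr)
    }
    .reduceByKey(_ + _)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(graph.vertices, canonical)
}
{code}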
[jira] [Commented] (SPARK-4917) Add a function to convert into a graph with canonical edges in GraphOps
[ https://issues.apache.org/jira/browse/SPARK-4917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255543#comment-14255543 ] Apache Spark commented on SPARK-4917: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/3760 Add a function to convert into a graph with canonical edges in GraphOps --- Key: SPARK-4917 URL: https://issues.apache.org/jira/browse/SPARK-4917 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Convert bi-directional edges into uni-directional ones, instead of relying on 'canonicalOrientation' in GraphLoader.edgeListFile. This function is useful when a graph is loaded as-is and later transformed into one with canonical edges. It rewrites the vertex ids of edges so that srcIds are bigger than dstIds, and merges the duplicated edges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255606#comment-14255606 ] Sean Owen commented on SPARK-2075: -- [~sunrui] Yes, but I am still not clear that anyone has observed the problem with two binaries built for the same version of Hadoop. Most of the situations listed here do not match that description. I might not understand someone's issue report here. In any event, it sounds like an underlying cause has been fixed already anyway. Anonymous classes are missing from Spark distribution - Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Assignee: Shixiong Zhu Priority: Critical Fix For: 1.3.0, 1.2.1 Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4918) Reuse Text in saveAsTextFile
Shixiong Zhu created SPARK-4918: --- Summary: Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-4918: Component/s: Spark Core Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255697#comment-14255697 ] Apache Spark commented on SPARK-4918: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3762 Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4919) Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
Xiaoyu Wang created SPARK-4919: -- Summary: Can Spark SQL thrift server UI provide JOB kill operate or any REST API? Key: SPARK-4919 URL: https://issues.apache.org/jira/browse/SPARK-4919 Project: Spark Issue Type: Wish Components: SQL, Web UI Affects Versions: 1.2.0 Reporter: Xiaoyu Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4919) Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
[ https://issues.apache.org/jira/browse/SPARK-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255707#comment-14255707 ] Sean Owen commented on SPARK-4919: -- (Could you ask questions at u...@apache.org instead? JIRA is for recording bugs or enhancements, along with proposed code changes.) Can Spark SQL thrift server UI provide JOB kill operate or any REST API? Key: SPARK-4919 URL: https://issues.apache.org/jira/browse/SPARK-4919 Project: Spark Issue Type: Wish Components: SQL, Web UI Affects Versions: 1.2.0 Reporter: Xiaoyu Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4919) Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
[ https://issues.apache.org/jira/browse/SPARK-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoyu Wang updated SPARK-4919: --- Description: Can the Spark SQL thrift server UI provide a “JOB” kill operation or any REST API? Stages can already be killed! Can Spark SQL thrift server UI provide JOB kill operate or any REST API? Key: SPARK-4919 URL: https://issues.apache.org/jira/browse/SPARK-4919 Project: Spark Issue Type: Wish Components: SQL, Web UI Affects Versions: 1.2.0 Reporter: Xiaoyu Wang Can the Spark SQL thrift server UI provide a “JOB” kill operation or any REST API? Stages can already be killed! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255717#comment-14255717 ] Apache Spark commented on SPARK-2808: - User 'helena' has created a pull request for this issue: https://github.com/apache/spark/pull/3631 update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044. Following is my testing result:

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

Spark Sort Benchmark

Test the influence of memory size per core: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks)

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
|--|--|--|--|--|--|
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

Test the influence of partition number: 18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
|--|--|--|--|--|--|
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044. Following is my testing result:

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

Spark Sort Benchmark

Test the influence of memory size per core: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks)

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

Test the influence of partition number: 18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to have several implementations of shuffle.
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044.

1. Following is my testing result (some heavy shuffle operations):

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

2. Following is a Spark Sort Benchmark:

2.1. Test the influence of memory size per core
precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks)

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

2.2. Test the influence of partition number
18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to have several implementations of shuffle.
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044.

1. Following is my testing result (some heavy shuffle operations):

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

2. Following is a Spark Sort Benchmark:

2.1. Test the influence of memory size per core
precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks). There is no tuning for disk shuffle.

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

2.2. Test the influence of partition number
18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to have several
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044.

1. Following is my testing result (some heavy shuffle operations):

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

2. Following is a Spark Sort Benchmark (in Spark 1.1.1). There is no tuning for disk shuffle.

2.1. Test the influence of memory size per core
precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks).

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

2.2. Test the influence of partition number
18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to
[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255772#comment-14255772 ] Thomas Graves commented on SPARK-2541: -- Sorry, I don't follow your question; are you wondering whether someone can work on this jira and fix this issue, or something else? I haven't had time to get back to it, so feel free to work on it if you have the same issue. Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves Attachments: SPARK-2541-partial.patch In Spark 0.9.x you could access secure HDFS from Standalone deploy; that doesn't work in 1.x anymore. It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if the currentUser == user. Not sure how it behaves when the daemons run as a super user but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
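A hedged sketch of the behavior described above (an illustration of the old 0.9.x-style check, not the attached SPARK-2541-partial.patch): skip the doAs when the requested user is already the current user, so the current login's HDFS credentials still apply.

{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def runAsSparkUser(func: () => Unit): Unit = {
  val currentUser = UserGroupInformation.getCurrentUser.getShortUserName
  val sparkUser = Option(System.getenv("SPARK_USER")).getOrElse(currentUser)
  if (sparkUser == currentUser) {
    func()  // same user: no doAs, existing credentials (e.g. HDFS tokens) are used directly
  } else {
    val ugi = UserGroupInformation.createRemoteUser(sparkUser)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = func()
    })
  }
}
{code}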
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions about it. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them use disk in some stages. For example, on the map side, all the intermediate data is written into temporary files; on the reduce side, Spark sometimes uses an external sort. In any case, disk I/O brings some performance loss. Maybe we can provide a pure-memory shuffle manager in which intermediate data only goes through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044.

1. Following is my testing result (some heavy shuffle operations):

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle managers already achieve good performance, and with parameter tuning they can obtain performance similar to the memory-based shuffle manager. However, I will continue my work and improve it. As an alternative tuning option, in-memory shuffle is a good choice.

Future work includes, but is not limited to:
- memory usage management in InMemory Shuffle mode
- data management when intermediate data cannot fit in memory

Test code:
{code:borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(","))
  .map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}

2. Following is a Spark Sort Benchmark (in Spark 1.1.1). There is no tuning for disk shuffle.

2.1. Test the influence of memory size per core
precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks).

| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |

2.2. Test the influence of partition number
18GB / 15 cores per executor

| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% | -30.34% |
| 1491 | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
| 2000 | 18.385s | 26.653720s | 30.103s | +31.02% | +38.92% |

was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know is there any plan to do something about it. Or any suggestion about it. Base on the work (SPARK-2044), it is feasible to
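A hedged sketch of the configuration implied by the description above ("inmemory" as a manager name is hypothetical and comes from this proposal, not from released Spark; spark.shuffle.manager and spark.shuffle.spill are real Spark 1.x settings):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("InMemoryShuffleTest")
  .set("spark.shuffle.manager", "inmemory")  // hypothetical manager from this proposal
  .set("spark.shuffle.spill", "false")       // keep reduce-side data in memory
{code}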
[jira] [Created] (SPARK-4920) current spark version In UI is not striking
uncleGen created SPARK-4920: --- Summary: current spark version In UI is not striking Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-4920: Summary: current spark version in UI is not striking (was: current spark version In UI is not striking) current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-4920: Description: !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-4920: Description: It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! was: we can keep the same style with Spark website !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-4920: Description: we can keep the same style with Spark website !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! was:!https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor we can keep the same style with Spark website !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255850#comment-14255850 ] Apache Spark commented on SPARK-4920: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/3763 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255855#comment-14255855 ] Sean Owen commented on SPARK-4920: -- I slightly prefer the current UI, where the version is in the footer. Putting the version here pushes the tabs right significantly when the version is the long 1.3.0-SNAPSHOT. That said, it is consistent with the website. I imagine the necessary CSS is simple if they are both Bootstrap-based. current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Priority: Minor It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R
[ https://issues.apache.org/jira/browse/SPARK-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255954#comment-14255954 ] Sean Owen commented on SPARK-4907: -- PS I think you will want to update the docs too, for example, at http://spark.apache.org/docs/latest/mllib-linear-methods.html There may be other places where the loss function formula is mentioned. Inconsistent loss and gradient in LeastSquaresGradient compared with R -- Key: SPARK-4907 URL: https://issues.apache.org/jira/browse/SPARK-4907 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai In most academic papers and algorithm implementations, people use L = 1/(2n) ||A*weights - y||^2 instead of L = 1/n ||A*weights - y||^2 for the least-squares loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses a different convention, this will result in different residuals, and all the stats properties will be different from the GLMNET package in R. The model coefficients will still be the same under this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
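For reference, a hedged restatement of the two conventions in LaTeX (notation inferred from the description above; not taken from the pull request):

{code}
% GLMNET-style convention (Eq. (1) in the glmnet paper):
%   L(w) = \frac{1}{2n} \lVert A w - y \rVert_2^2,
%   \qquad \nabla L(w) = \frac{1}{n} A^\top (A w - y)
%
% MLlib's convention at the time of this issue:
%   L(w) = \frac{1}{n} \lVert A w - y \rVert_2^2,
%   \qquad \nabla L(w) = \frac{2}{n} A^\top (A w - y)
%
% Both are minimized by the same w; losses, gradients, and derived statistics
% differ by a constant factor of 2.
{code}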
[jira] [Created] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks
Xuefu Zhang created SPARK-4921: -- Summary: Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks
[ https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256015#comment-14256015 ] Xuefu Zhang commented on SPARK-4921: cc: [~lirui], [~sandyr] Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks - Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4919) Can Spark SQL thrift server UI provide JOB kill operate or any REST API?
[ https://issues.apache.org/jira/browse/SPARK-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4919. --- Resolution: Invalid Can Spark SQL thrift server UI provide JOB kill operate or any REST API? Key: SPARK-4919 URL: https://issues.apache.org/jira/browse/SPARK-4919 Project: Spark Issue Type: Wish Components: SQL, Web UI Affects Versions: 1.2.0 Reporter: Xiaoyu Wang Can the Spark SQL thrift server UI provide a “JOB” kill operation or any REST API? Stages can already be killed! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks
[ https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated SPARK-4921: --- Attachment: NO_PREF.patch Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks - Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang Attachments: NO_PREF.patch During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Priority: Critical (was: Blocker) Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if:
1) App 1 kills an executor
2) App 2, with spark.cores.max set, grabs a subset of cores on a worker
3) App 1 requests an executor

In this case, the new executor that App 1 gets back will be smaller than the rest and can execute fewer tasks in parallel. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
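As a concrete illustration of the standalone-mode setting mentioned above (the master URL and values are assumed for the example, not taken from the issue):

{code}
import org.apache.spark.SparkConf

// Cap this application's total core usage so other applications on the same
// standalone cluster can still be allocated executors.
val conf = new SparkConf()
  .setMaster("spark://master:7077")  // assumed standalone master URL
  .setAppName("example-app")
  .set("spark.cores.max", "8")
{code}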
[jira] [Created] (SPARK-4922) Support dynamic allocation for coarse-grained Mesos
Andrew Or created SPARK-4922: Summary: Support dynamic allocation for coarse-grained Mesos Key: SPARK-4922 URL: https://issues.apache.org/jira/browse/SPARK-4922 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.2.0 Reporter: Andrew Or Priority: Critical This brings SPARK-3174, which provided dynamic allocation of cluster resources to Spark on YARN applications, to Mesos coarse-grained mode. Note that the translation is not as trivial as adding a code path that exposes the request and kill mechanisms as we did for YARN in SPARK-3822. This is because Mesos coarse-grained mode schedules on the notion of setting the number of cores allowed for an application (as in standalone mode) instead of the number of executors (as in YARN mode). For more detail, please see SPARK-4751. If you intend to work on this, please provide a detailed design doc! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256045#comment-14256045 ] Andrew Or commented on SPARK-3174: -- Hey [~nemccarthy] I filed one at SPARK-4922, which is for coarse-grained mode. For fine-grained mode, there is already one that enables dynamically scaling memory instead of just CPU at SPARK-1882. I believe there has not been progress on either issue yet. Provide elastic scaling within a Spark application -- Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Andrew Or Fix For: 1.2.0 Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. It would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Discard executors when they are idle See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4918: --- Description: When code reviewing https://github.com/apache/spark/pull/3740, [~rxin] pointed out that we could reuse the Hadoop Text object in saveAsTextFile to reduce GC impact. Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Priority: Minor When code reviewing https://github.com/apache/spark/pull/3740, [~rxin] pointed out that we could reuse the Hadoop Text object in saveAsTextFile to reduce GC impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
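A hedged sketch of the idea (not the merged patch itself; rdd and path below are placeholders): reuse a single mutable Text instance per partition instead of allocating one per record.

{code}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.spark.SparkContext._

// One Text buffer per partition, overwritten in place for each record,
// which avoids allocating a new Text object per line and reduces GC pressure.
rdd.mapPartitions { iter =>
  val text = new Text()
  iter.map { x =>
    text.set(x.toString)
    (NullWritable.get(), text)
  }
}.saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
{code}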
[jira] [Updated] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4918: --- Assignee: Shixiong Zhu Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor When code reviewing https://github.com/apache/spark/pull/3740, [~rxin] pointed out that we could reuse the Hadoop Text object in saveAsTextFile to reduce GC impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4918) Reuse Text in saveAsTextFile
[ https://issues.apache.org/jira/browse/SPARK-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4918. Resolution: Fixed Fix Version/s: 1.3.0 Reuse Text in saveAsTextFile Key: SPARK-4918 URL: https://issues.apache.org/jira/browse/SPARK-4918 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0 When code reviewing https://github.com/apache/spark/pull/3740, [~rxin] pointed out that we could reuse the Hadoop Text object in saveAsTextFile to reduce GC impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4915) Wrong classname of external shuffle service in the doc for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4915: - Affects Version/s: 1.2.0 Wrong classname of external shuffle service in the doc for dynamic allocation - Key: SPARK-4915 URL: https://issues.apache.org/jira/browse/SPARK-4915 Project: Spark Issue Type: Documentation Components: Documentation, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Fix For: 1.2.0, 1.3.0 docs/job-scheduling.md says as follows: {quote} To enable this service, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` in your cluster. {quote} The class name org.apache.spark.yarn.network.YarnShuffleService is wrong. org.apache.spark.network.yarn.YarnShuffleService is correct class name to be specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4915) Wrong classname of external shuffle service in the doc for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4915. Resolution: Fixed Fix Version/s: 1.3.0 1.2.0 Assignee: Tsuyoshi OZAWA Target Version/s: 1.2.0, 1.3.0 Wrong classname of external shuffle service in the doc for dynamic allocation - Key: SPARK-4915 URL: https://issues.apache.org/jira/browse/SPARK-4915 Project: Spark Issue Type: Documentation Components: Documentation, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 1.2.0, 1.3.0 docs/job-scheduling.md says as follows: {quote} To enable this service, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` in your cluster. {quote} The class name org.apache.spark.yarn.network.YarnShuffleService is wrong. org.apache.spark.network.yarn.YarnShuffleService is correct class name to be specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4915) Wrong classname of external shuffle service in the doc for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-4915: -- Fix Version/s: (was: 1.2.0) 1.2.1 Wrong classname of external shuffle service in the doc for dynamic allocation - Key: SPARK-4915 URL: https://issues.apache.org/jira/browse/SPARK-4915 Project: Spark Issue Type: Documentation Components: Documentation, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 1.3.0, 1.2.1 docs/job-scheduling.md says as follows: {quote} To enable this service, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` in your cluster. {quote} The class name org.apache.spark.yarn.network.YarnShuffleService is wrong. org.apache.spark.network.yarn.YarnShuffleService is correct class name to be specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4881) Use SparkConf#getBoolean instead of get().toBoolean
[ https://issues.apache.org/jira/browse/SPARK-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4881: - Priority: Trivial (was: Minor) Use SparkConf#getBoolean instead of get().toBoolean --- Key: SPARK-4881 URL: https://issues.apache.org/jira/browse/SPARK-4881 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Kousuke Saruta Priority: Trivial It's really a minor issue. In ApplicationMaster, there is code as follows. {code} val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean {code} I think the code can be simplified as follows. {code} val preserveFiles = sparkConf.getBoolean("spark.yarn.preserve.staging.files", false) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
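A small self-contained sketch (not from the original report) showing that the two forms read the same flag; the configuration key is used purely for illustration:
{code}
import org.apache.spark.SparkConf

object GetBooleanExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().set("spark.yarn.preserve.staging.files", "true")

    // Verbose form: fetch the String value and convert it manually.
    val viaString = conf.get("spark.yarn.preserve.staging.files", "false").toBoolean

    // Simplified form suggested in the issue: let SparkConf do the conversion.
    val viaGetter = conf.getBoolean("spark.yarn.preserve.staging.files", defaultValue = false)

    println(s"$viaString $viaGetter") // prints: true true
  }
}
{code}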
[jira] [Commented] (SPARK-4915) Wrong classname of external shuffle service in the doc for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256084#comment-14256084 ] Andrew Ash commented on SPARK-4915: --- Changing fix version to 1.2.1 from 1.2.0 because this didn't make it in in-time for 1.2.0 {noformat} aash@aash-mbp ~/git/spark$ git log origin/branch-1.2 --oneline | grep SPARK-4915 31d42c4 [SPARK-4915][YARN] Fix classname to be specified for external shuffle service. aash@aash-mbp ~/git/spark$ git log v1.2.0 --oneline | grep SPARK-4915 aash@aash-mbp ~/git/spark$ {noformat} Wrong classname of external shuffle service in the doc for dynamic allocation - Key: SPARK-4915 URL: https://issues.apache.org/jira/browse/SPARK-4915 Project: Spark Issue Type: Documentation Components: Documentation, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 1.3.0, 1.2.1 docs/job-scheduling.md says as follows: {quote} To enable this service, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` in your cluster. {quote} The class name org.apache.spark.yarn.network.YarnShuffleService is wrong. org.apache.spark.network.yarn.YarnShuffleService is correct class name to be specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4870) Add version information to driver log
[ https://issues.apache.org/jira/browse/SPARK-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4870: - Priority: Minor (was: Major) Add version information to driver log - Key: SPARK-4870 URL: https://issues.apache.org/jira/browse/SPARK-4870 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Zhang, Liye Priority: Minor Driver log doesn't include spark version information, version info is important in testing different spark version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4870) Add version information to driver log
[ https://issues.apache.org/jira/browse/SPARK-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4870. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Zhang, Liye Target Version/s: 1.3.0 Add version information to driver log - Key: SPARK-4870 URL: https://issues.apache.org/jira/browse/SPARK-4870 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Minor Fix For: 1.3.0 Driver log doesn't include spark version information, version info is important in testing different spark version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4883) Add a name to the directoryCleaner thread
[ https://issues.apache.org/jira/browse/SPARK-4883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4883. Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Assignee: Shixiong Zhu Target Version/s: 1.3.0, 1.2.1 Add a name to the directoryCleaner thread - Key: SPARK-4883 URL: https://issues.apache.org/jira/browse/SPARK-4883 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4883) Add a name to the directoryCleaner thread
[ https://issues.apache.org/jira/browse/SPARK-4883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4883: - Affects Version/s: 1.2.0 Add a name to the directoryCleaner thread - Key: SPARK-4883 URL: https://issues.apache.org/jira/browse/SPARK-4883 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.2.0 Reporter: Shixiong Zhu Priority: Minor Fix For: 1.3.0, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4520: Assignee: sadhan sood SparkSQL exception when reading certain columns from a parquet file --- Key: SPARK-4520 URL: https://issues.apache.org/jira/browse/SPARK-4520 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sadhan sood Assignee: sadhan sood Priority: Critical Attachments: part-r-0.parquet I am seeing this issue with spark sql throwing an exception when trying to read selective columns from a thrift parquet file and also when caching them. On some further digging, I was able to narrow it down to at-least one particular column type: mapstring, setstring to be causing this issue. To reproduce this I created a test thrift file with a very basic schema and stored some sample data in a parquet file: Test.thrift === {code} typedef binary SomeId enum SomeExclusionCause { WHITELIST = 1, HAS_PURCHASE = 2, } struct SampleThriftObject { 10: string col_a; 20: string col_b; 30: string col_c; 40: optional mapSomeExclusionCause, setSomeId col_d; } {code} = And loading the data in spark through schemaRDD: {code} import org.apache.spark.sql.SchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc); val parquetFile = /path/to/generated/parquet/file val parquetFileRDD = sqlContext.parquetFile(parquetFile) parquetFileRDD.printSchema root |-- col_a: string (nullable = true) |-- col_b: string (nullable = true) |-- col_c: string (nullable = true) |-- col_d: map (nullable = true) ||-- key: string ||-- value: array (valueContainsNull = true) |||-- element: string (containsNull = false) parquetFileRDD.registerTempTable(test) sqlContext.cacheTable(test) sqlContext.sql(select col_a from test).collect() -- see the exception stack here {code} {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at java.util.ArrayList.get(ArrayList.java:431) at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95) at
[jira] [Updated] (SPARK-4733) Add missing prameter comments in ShuffleDependency
[ https://issues.apache.org/jira/browse/SPARK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4733: - Affects Version/s: 1.2.0 Add missing prameter comments in ShuffleDependency -- Key: SPARK-4733 URL: https://issues.apache.org/jira/browse/SPARK-4733 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.2.0 Reporter: Takeshi Yamamuro Priority: Trivial Add missing Javadoc comments in ShuffleDependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2973) Add a way to show tables without executing a job
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256140#comment-14256140 ] Michael Armbrust commented on SPARK-2973: - I think the confusion there would be if someone then ran .map(...) on that RDD. It would be pretty confusing if it did not run a Spark job. What is wrong with the approach we are already using for executeCollect()? We can add an executeTake with a default implementation and override that in ExecutedCommand. Add a way to show tables without executing a job Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 Right now, sql("show tables").collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
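A minimal sketch of the idea in the comment above, using stand-in types rather than Spark's real SparkPlan/ExecutedCommand classes (the actual signatures may differ): the base plan gets an executeTake with a default implementation, and the command node overrides it so it can answer locally without launching a job.
{code}
object ExecuteTakeSketch {
  type Row = Seq[Any] // stand-in for Spark SQL's Row

  abstract class SparkPlan {
    def executeCollect(): Array[Row]
    // Default: take from a full collect (which, in the real engine, may run a job).
    def executeTake(n: Int): Array[Row] = executeCollect().take(n)
  }

  // A command such as SHOW TABLES already holds its result on the driver,
  // so it can serve take() without any distributed work.
  class ExecutedCommand(result: Seq[Row]) extends SparkPlan {
    override def executeCollect(): Array[Row] = result.toArray
    override def executeTake(n: Int): Array[Row] = result.take(n).toArray
  }

  def main(args: Array[String]): Unit = {
    val showTables = new ExecutedCommand(Seq(Seq("table_a"), Seq("table_b")))
    println(showTables.executeTake(1).mkString(", ")) // answered locally, no job
  }
}
{code}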
[jira] [Closed] (SPARK-4733) Add missing prameter comments in ShuffleDependency
[ https://issues.apache.org/jira/browse/SPARK-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4733. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Takeshi Yamamuro Target Version/s: 1.3.0 Add missing prameter comments in ShuffleDependency -- Key: SPARK-4733 URL: https://issues.apache.org/jira/browse/SPARK-4733 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.2.0 Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Priority: Trivial Fix For: 1.3.0 Add missing Javadoc comments in ShuffleDependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4447. Resolution: Fixed Fix Version/s: 1.3.0 Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0 For example, YarnRMClient and YarnRMClientImpl can be merged YarnAllocator and YarnAllocationHandler can be merged -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4447: - Priority: Critical (was: Major) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Fix For: 1.3.0 For example, YarnRMClient and YarnRMClientImpl can be merged YarnAllocator and YarnAllocationHandler can be merged -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256174#comment-14256174 ] David Ross commented on SPARK-4908: --- Note that I noticed this line from native Hive logging: {code} 14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager {code} It seems to be tied to this config: https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719 I have added this to our {{hive-site.xml}} in the spark {{conf}} directory: {code} <property> <name>hive.support.concurrency</name> <value>true</value> </property> {code} And I still have the issue. Perhaps there is more I need to do to support concurrency? Spark SQL built for Hive 13 fails under concurrent metadata queries --- Key: SPARK-4908 URL: https://issues.apache.org/jira/browse/SPARK-4908 Project: Spark Issue Type: Bug Reporter: David Ross We are trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6 We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}} In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL, all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue. Here is some example code: {code} object main extends App { import java.sql._ import scala.concurrent._ import scala.concurrent.duration._ import scala.concurrent.ExecutionContext.Implicits.global Class.forName("org.apache.hive.jdbc.HiveDriver") val host = "localhost" // update this val url = s"jdbc:hive2://${host}:10511/some_db" // update this val future = Future.traverse(1 to 3) { i => Future { println("Starting: " + i) try { val conn = DriverManager.getConnection(url) } catch { case e: Throwable => e.printStackTrace() println("Failed: " + i) } println("Finishing: " + i) } } Await.result(future, 2.minutes) println("done!") 
} {code} Here is the output: {code} Starting: 1 Starting: 3 Starting: 2 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Failed: 3 Finishing: 3 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at
[jira] [Comment Edited] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256174#comment-14256174 ] David Ross edited comment on SPARK-4908 at 12/22/14 8:43 PM: - Note that I noticed this line in the logs that seems to come from Hive logging (not spark code): {code} 14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager {code} It seems to be tied to this config: https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719 I have this to our {{hive-site.xml}} in the spark {{conf}} directory: {code} property namehive.support.concurrency/name valuetrue/value /property {code} And I still have the issue. Perhaps there is more I need to do to support concurrency? was (Author: dyross): Note that noticed this line from native Hive logging: {code} 14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager {code} It seems to be tied to this config: https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719 I have this to our {{hive-site.xml}} in the spark {{conf}} directory: {code} property namehive.support.concurrency/name valuetrue/value /property {code} And I still have the issue. Perhaps there is more I need to do to support concurrency? Spark SQL built for Hive 13 fails under concurrent metadata queries --- Key: SPARK-4908 URL: https://issues.apache.org/jira/browse/SPARK-4908 Project: Spark Issue Type: Bug Reporter: David Ross We are trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6 We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}} In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL, all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue. Here is some example code: {code} object main extends App { import java.sql._ import scala.concurrent._ import scala.concurrent.duration._ import scala.concurrent.ExecutionContext.Implicits.global Class.forName(org.apache.hive.jdbc.HiveDriver) val host = localhost // update this val url = sjdbc:hive2://${host}:10511/some_db // update this val future = Future.traverse(1 to 3) { i = Future { println(Starting: + i) try { val conn = DriverManager.getConnection(url) } catch { case e: Throwable = e.printStackTrace() println(Failed: + i) } println(Finishing: + i) } } Await.result(future, 2.minutes) println(done!) 
} {code} Here is the output: {code} Starting: 1 Starting: 3 Starting: 2 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Failed: 3 Finishing: 3 java.sql.SQLException:
[jira] [Resolved] (SPARK-4079) Snappy bundled with Spark does not work on older Linux distributions
[ https://issues.apache.org/jira/browse/SPARK-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4079. Resolution: Fixed Fix Version/s: 1.3.0 Snappy bundled with Spark does not work on older Linux distributions Key: SPARK-4079 URL: https://issues.apache.org/jira/browse/SPARK-4079 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Kostas Sakellis Fix For: 1.3.0 This issue has existed at least since 1.0, but has been made worse by 1.1 since snappy is now the default compression algorithm. When trying to use it on a CentOS 5 machine, for example, you'll get something like this: {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:319) at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:226) at org.xerial.snappy.Snappy.clinit(Snappy.java:48) at org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79) at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207) ... Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5.3-af72bf3c-9dab-43af-a662-f9af657f06b1-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by /tmp/snappy-1.0.5.3-af72bf3c-9dab-43af-a662-f9af657f06b1-libsnappyjava.so) at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1957) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1882) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1843) at java.lang.Runtime.load0(Runtime.java:795) at java.lang.System.load(System.java:1061) at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39) ... 29 more {noformat} There are two approaches I can see here (well, 3): * Declare CentOS 5 (and similar OSes) not supported, although that would suck for the people who are still on it and already use Spark * Fallback to another compression codec if Snappy cannot be loaded * Ask the Snappy guys to compile the library on an older OS... I think the second would be the best compromise. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
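As a stopgap on affected machines (until a built-in fallback exists), the codec can be switched away from Snappy explicitly; a minimal sketch, assuming the LZF codec shipped with this era of Spark is acceptable:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object LzfFallbackExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("lzf-fallback")
      .setMaster("local[*]")
      // Avoid loading the Snappy native library on distros with an old libstdc++.
      .set("spark.io.compression.codec", "lzf")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).map(_ * 2).sum())
    sc.stop()
  }
}
{code}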
[jira] [Resolved] (SPARK-4864) Add documentation to Netty-based configs
[ https://issues.apache.org/jira/browse/SPARK-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4864. Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Add documentation to Netty-based configs Key: SPARK-4864 URL: https://issues.apache.org/jira/browse/SPARK-4864 Project: Spark Issue Type: Bug Components: Documentation Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.3.0, 1.2.1 Currently there is no public documentation for the NettyBlockTransferService or various configuration options of the network package. We should add some. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4520: --- Target Version/s: 1.2.1 (was: 1.3.0) SparkSQL exception when reading certain columns from a parquet file --- Key: SPARK-4520 URL: https://issues.apache.org/jira/browse/SPARK-4520 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sadhan sood Assignee: sadhan sood Priority: Critical Attachments: part-r-0.parquet I am seeing this issue with spark sql throwing an exception when trying to read selective columns from a thrift parquet file and also when caching them. On some further digging, I was able to narrow it down to at-least one particular column type: mapstring, setstring to be causing this issue. To reproduce this I created a test thrift file with a very basic schema and stored some sample data in a parquet file: Test.thrift === {code} typedef binary SomeId enum SomeExclusionCause { WHITELIST = 1, HAS_PURCHASE = 2, } struct SampleThriftObject { 10: string col_a; 20: string col_b; 30: string col_c; 40: optional mapSomeExclusionCause, setSomeId col_d; } {code} = And loading the data in spark through schemaRDD: {code} import org.apache.spark.sql.SchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc); val parquetFile = /path/to/generated/parquet/file val parquetFileRDD = sqlContext.parquetFile(parquetFile) parquetFileRDD.printSchema root |-- col_a: string (nullable = true) |-- col_b: string (nullable = true) |-- col_c: string (nullable = true) |-- col_d: map (nullable = true) ||-- key: string ||-- value: array (valueContainsNull = true) |||-- element: string (containsNull = false) parquetFileRDD.registerTempTable(test) sqlContext.cacheTable(test) sqlContext.sql(select col_a from test).collect() -- see the exception stack here {code} {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at java.util.ArrayList.get(ArrayList.java:431) at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95) at
[jira] [Commented] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256216#comment-14256216 ] Apache Spark commented on SPARK-1714: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3765 Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-1714: -- Target Version/s: 1.3.0 Affects Version/s: 1.2.0 Fix Version/s: (was: 1.2.0) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256228#comment-14256228 ] Peng Cheng commented on SPARK-3452: --- IMHO REPL should be kept being published, one of my project extends its API to initialize some third-party components upon launching. This should be made an 'official' API to encourage more platform integrate with it. Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4923) Maven build should keep publishing spark-repl
Peng Cheng created SPARK-4923: - Summary: Maven build should keep publishing spark-repl Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
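For context, a downstream project that extends the REPL would pull the artifact in roughly like this (an illustrative build.sbt fragment; the version shown is a placeholder, and this only works while the artifact keeps being published):
{code}
// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0",
  "org.apache.spark" %% "spark-repl" % "1.2.0" // the artifact this issue asks to keep publishing
)
{code}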
[jira] [Resolved] (SPARK-4818) Join operation should use iterator/lazy evaluation
[ https://issues.apache.org/jira/browse/SPARK-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4818. --- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 1.1.2 Issue resolved by pull request 3671 [https://github.com/apache/spark/pull/3671] Join operation should use iterator/lazy evaluation -- Key: SPARK-4818 URL: https://issues.apache.org/jira/browse/SPARK-4818 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Johannes Simon Fix For: 1.1.2, 1.3.0, 1.2.1 The current implementation of the join operation does not use an iterator (i.e. lazy evaluation), causing it to explicitly evaluate the co-grouped values. In big data applications, these value collections can be very large. This causes the *cartesian product of all co-grouped values* for a specific key of both RDDs to be kept in memory during the flatMapValues operation, resulting in an *O(size(pair._1)*size(pair._2))* memory consumption instead of *O(1)*. Very large value collections will therefore cause GC overhead limit exceeded exceptions and fail the task, or at least slow down execution dramatically. {code:title=PairRDDFunctions.scala|borderStyle=solid} //... def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = { this.cogroup(other, partitioner).flatMapValues( pair = for (v - pair._1; w - pair._2) yield (v, w) ) } //... {code} Since cogroup returns an Iterable instance of an Array, the join implementation could be changed to the following, which uses lazy evaluation instead, and has almost no memory overhead: {code:title=PairRDDFunctions.scala|borderStyle=solid} //... def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = { this.cogroup(other, partitioner).flatMapValues( pair = for (v - pair._1.iterator; w - pair._2.iterator) yield (v, w) ) } //... {code} Alternatively, if the current implementation is intentionally not using lazy evaluation for some reason, there could be a *lazyJoin()* method next to the original join implementation that utilizes lazy evaluation. This of course applies to other join operations as well. Thanks! :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4818) Join operation should use iterator/lazy evaluation
[ https://issues.apache.org/jira/browse/SPARK-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4818: -- Assignee: Shixiong Zhu Join operation should use iterator/lazy evaluation -- Key: SPARK-4818 URL: https://issues.apache.org/jira/browse/SPARK-4818 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Johannes Simon Assignee: Shixiong Zhu Fix For: 1.3.0, 1.1.2, 1.2.1 The current implementation of the join operation does not use an iterator (i.e. lazy evaluation), causing it to explicitly evaluate the co-grouped values. In big data applications, these value collections can be very large. This causes the *cartesian product of all co-grouped values* for a specific key of both RDDs to be kept in memory during the flatMapValues operation, resulting in an *O(size(pair._1)*size(pair._2))* memory consumption instead of *O(1)*. Very large value collections will therefore cause GC overhead limit exceeded exceptions and fail the task, or at least slow down execution dramatically. {code:title=PairRDDFunctions.scala|borderStyle=solid} //... def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = { this.cogroup(other, partitioner).flatMapValues( pair = for (v - pair._1; w - pair._2) yield (v, w) ) } //... {code} Since cogroup returns an Iterable instance of an Array, the join implementation could be changed to the following, which uses lazy evaluation instead, and has almost no memory overhead: {code:title=PairRDDFunctions.scala|borderStyle=solid} //... def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = { this.cogroup(other, partitioner).flatMapValues( pair = for (v - pair._1.iterator; w - pair._2.iterator) yield (v, w) ) } //... {code} Alternatively, if the current implementation is intentionally not using lazy evaluation for some reason, there could be a *lazyJoin()* method next to the original join implementation that utilizes lazy evaluation. This of course applies to other join operations as well. Thanks! :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
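To make the laziness argument above concrete outside of Spark, here is a small self-contained illustration with plain Scala collections (not the PairRDDFunctions code itself): the eager for-comprehension materializes the full cross product, while the iterator-based one produces pairs on demand.
{code}
object LazyJoinSketch {
  def main(args: Array[String]): Unit = {
    val left: Iterable[Int]  = 1 to 3
    val right: Iterable[Int] = 10 to 12

    // Eager: the whole cross product (9 pairs here) is built in memory at once.
    val eager: Iterable[(Int, Int)] = for (v <- left; w <- right) yield (v, w)

    // Lazy: an Iterator that only produces a pair when it is asked for one.
    val lazyPairs: Iterator[(Int, Int)] =
      for (v <- left.iterator; w <- right.iterator) yield (v, w)

    println(eager.size)               // 9 (already materialized)
    println(lazyPairs.take(2).toList) // List((1,10), (1,11)) -- only two pairs produced
  }
}
{code}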
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256308#comment-14256308 ] Patrick Wendell commented on SPARK-4923: Hey [~pc...@uowmail.edu.au] - we removed this from Maven because it's not meant as a stable API in Spark. Could you talk about which parts of the repl API you are using and how you are using it? Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256319#comment-14256319 ] Peng Cheng commented on SPARK-4923: --- Hey Patrick, The following APIs have been integrated since 1.0.0; IMHO they are stable enough for daily prototyping (creating case classes used to be defective but was fixed a long time ago): SparkILoop.getAddedJars() $SparkIMain.bind $SparkIMain.quietBind $SparkIMain.interpret end of :) At first I assumed that further development on it had been moved to Databricks Cloud. But the JIRA ticket was already there in September, so maybe demand on this API from the community is indeed low enough. However, I would still suggest keeping it, or even promoting it into a Developer API; this would encourage more projects to integrate in a more flexible way, and save prototyping/QA cost by customizing fixtures of the REPL. People will still move to Databricks Cloud, which has far more features than that. Many influential projects already depend on the routinely published Scala REPL (e.g. playFW), so it would be strange for Spark not to do the same. What do you think? Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256322#comment-14256322 ] Peng Cheng commented on SPARK-4923: --- Sorry my project is https://github.com/tribbloid/ISpark Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4924) Factor out code to launch Spark applications into a separate library
Marcelo Vanzin created SPARK-4924: - Summary: Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin One of the questions we run into rather commonly is how to start a Spark application from my Java/Scala program?. There currently isn't a good answer to that: - Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode) - Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts - Calling the shell script directly is doable, but sort of ugly from an API point of view. I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
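Until such a library exists, the "call the shell script directly" option mentioned above looks roughly like the following purely illustrative sketch (class name, jar path, and master are placeholders; this is not a proposed API and nothing here is an existing Spark class):
{code}
import scala.sys.process._

object SparkSubmitFromCode {
  def main(args: Array[String]): Unit = {
    // Hypothetical invocation: shell out to spark-submit instead of wiring up
    // SparkSubmit or SparkContext directly, which is exactly the awkwardness
    // the proposed launcher library aims to remove.
    val cmd = Seq(
      "spark-submit",
      "--master", "yarn-cluster",
      "--class", "com.example.MyApp", // placeholder application class
      "/path/to/my-app.jar")          // placeholder jar
    val exitCode = cmd.!              // blocks until spark-submit exits
    println(s"spark-submit exited with $exitCode")
  }
}
{code}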
[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-4924: -- Attachment: spark-launcher.txt Attaching a mini-spec to describe the motivation and a proposal for the library. I'm currently working on a prototype based on that spec and should have something to share soon-ish. Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin Attachments: spark-launcher.txt One of the questions we run into rather commonly is how to start a Spark application from my Java/Scala program?. There currently isn't a good answer to that: - Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode) - Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts - Calling the shell script directly is doable, but sort of ugly from an API point of view. I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256357#comment-14256357 ] Nicholas Chammas commented on SPARK-2541: - I was just wondering if we needed to ping someone to work on this. Taking another look at the history on this issue, it looks like you already started working on it, so no worries. Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves Attachments: SPARK-2541-partial.patch In spark 0.9.x you could access secure HDFS from Standalone deploy, that doesn't work in 1.X anymore. It looks like the issues is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if the currentUser == user. Not sure how it affects when the daemons run as a super user but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
Alex Liu created SPARK-4925: --- Summary: Publish Spark SQL hive-thriftserver maven artifact Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Alex Liu Fix For: 1.2.0, 1.1.2 The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R
[ https://issues.apache.org/jira/browse/SPARK-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4907. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3746 [https://github.com/apache/spark/pull/3746] Inconsistent loss and gradient in LeastSquaresGradient compared with R -- Key: SPARK-4907 URL: https://issues.apache.org/jira/browse/SPARK-4907 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai Fix For: 1.3.0 In most of the academic paper and algorithm implementations, people use L = 1/2n ||A weights-y||^2 instead of L = 1/n ||A weights-y||^2 for least-squared loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses different convention, this will result different residuals and all the stats properties will be different from GLMNET package in R. The model coefficients will be still the same under this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
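For reference, the only thing the convention change affects is the scale of the reported loss and gradient, not the minimizer; a short derivation (standard calculus, with w the weight vector, A the design matrix, y the labels, n the number of samples):
{code}
L_old(w) = (1/n)    * ||A w - y||^2   =>   grad L_old(w) = (2/n) * A^T (A w - y)
L_new(w) = (1/(2n)) * ||A w - y||^2   =>   grad L_new(w) = (1/n) * A^T (A w - y)
{code}
Both gradients point in the same direction for every w, so the optimal coefficients are unchanged; only the reported loss and gradient values are halved, which is what makes the statistics line up with R's glmnet convention.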
[jira] [Updated] (SPARK-4907) Inconsistent loss and gradient in LeastSquaresGradient compared with R
[ https://issues.apache.org/jira/browse/SPARK-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4907: - Assignee: DB Tsai Inconsistent loss and gradient in LeastSquaresGradient compared with R -- Key: SPARK-4907 URL: https://issues.apache.org/jira/browse/SPARK-4907 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.3.0 In most of the academic paper and algorithm implementations, people use L = 1/2n ||A weights-y||^2 instead of L = 1/n ||A weights-y||^2 for least-squared loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf Since MLlib uses different convention, this will result different residuals and all the stats properties will be different from GLMNET package in R. The model coefficients will be still the same under this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Fix Version/s: (was: 1.1.2) (was: 1.2.0) Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256393#comment-14256393 ] Patrick Wendell commented on SPARK-4923: Hey [~pc...@uowmail.edu.au], thanks for filling that in - I didn't even realize we had code in there that was bytecode public. By stable I meant that we are promising it is an unchanging API. This is what we usually think about when we release things. For 1.2.0 I refactored our build and found out that we were publishing a bunch of random internal build components, so I took them all out of the published artifacts (examples, our assembly jar, etc) in SPARK-4923. Anyways - perhaps we could just annotate these as developer API's and be clear that they might change in the future. If you wanted to do that, and re-enable publishing them, I'd be happy to do that. Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But its in the dependency list of a few projects that extends its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256397#comment-14256397 ] Patrick Wendell commented on SPARK-4925: The hive-thriftserver module is just used when building a Spark distribution, user applications shouldn't need to link against it. Could you talk a bit more about how you are actually using the thriftserver? Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Component/s: Build Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.0 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Affects Version/s: (was: 1.1.1) 1.2.0 Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.0 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4923: --- Component/s: Build Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it is in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3816) Add configureOutputJobPropertiesForStorageHandler to JobConf in SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256399#comment-14256399 ] Apache Spark commented on SPARK-3816: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/3766 Add configureOutputJobPropertiesForStorageHandler to JobConf in SparkHadoopWriter - Key: SPARK-3816 URL: https://issues.apache.org/jira/browse/SPARK-3816 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu Fix For: 1.2.0 It's similar to SPARK-2846. We should add PlanUtils.configureInputJobPropertiesForStorageHandler to SparkHadoopWriter, so that the writer can add configuration from a customized StorageHandler to the JobConf. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256405#comment-14256405 ] Alex Liu commented on SPARK-4925: - Our build.xml downloads the hive-thriftserver maven artifact and adds the downloaded jar file to the classpath. Currently we have it published in our private repository, but we hope we won't need to maintain our private Spark build repository and can depend only on the public maven repository to download it. Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.0 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4920: --- Assignee: uncleGen current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4920. Resolution: Fixed Fix Version/s: 1.2.1 I believe this has been fixed: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a8a8e0e8752194d82b6c6e20cedbb3871b221916 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256415#comment-14256415 ] Apache Spark commented on SPARK-4925: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/3766 Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.0 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks
[ https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256425#comment-14256425 ] Rui Li commented on SPARK-4921: --- I'm not sure if this is intended, but returning process_local for no_pref tasks may reset {{currentLocalityIndex}} to 0, which may cause more delay later. There seems to be a check to avoid this, but I doubt it's sufficient:
{code}
// Update our locality level for delay scheduling
// NO_PREF will not affect the variables related to delay scheduling
if (maxLocality != TaskLocality.NO_PREF) {
  currentLocalityIndex = getLocalityIndex(taskLocality)
  lastLaunchTime = curTime
}
{code}
Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks - Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang Attachments: NO_PREF.patch During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
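To spell out the concern in the comment above, here is a small self-contained model of the locality-index bookkeeping; it is not the actual TaskSetManager code, and the names and locality ladder are simplified for illustration. It shows how reporting a NO_PREF launch as PROCESS_LOCAL snaps the index back to the most restrictive level, so later launches wait for locality again.
{code}
object LocalityModel {
  // Simplified locality ladder, from most to least local.
  val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")

  def getLocalityIndex(level: String): Int = levels.indexOf(level)

  // currentIndex tracks how far down the ladder delay scheduling has relaxed.
  def launchTask(currentIndex: Int, reportedLevel: String, isNoPref: Boolean): Int =
    if (isNoPref) currentIndex            // the guard quoted above: NO_PREF leaves the index alone
    else getLocalityIndex(reportedLevel)  // but a NO_PREF task reported as PROCESS_LOCAL resets it to 0

  def main(args: Array[String]): Unit = {
    var idx = getLocalityIndex("RACK_LOCAL")                   // scheduler had already relaxed to RACK_LOCAL
    idx = launchTask(idx, "PROCESS_LOCAL", isNoPref = false)   // mislabelled NO_PREF launch
    println(s"locality index after launch: $idx")              // prints 0, so delay scheduling starts over
  }
}
{code}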
[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-4923: -- Attachment: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Thank you so much! First patch uploaded. Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it is in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-4923: -- Target Version/s: 1.3.0, 1.2.1 (was: 1.3.0) Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it is in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3860) Improve dimension joins
[ https://issues.apache.org/jira/browse/SPARK-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-3860: --- Assignee: Michael Armbrust Improve dimension joins --- Key: SPARK-3860 URL: https://issues.apache.org/jira/browse/SPARK-3860 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Critical This is an umbrella ticket for improving performance for joining multiple dimension tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4926) Spark manipulate Hbase
Lily created SPARK-4926: --- Summary: Spark manipulate Hbase Key: SPARK-4926 URL: https://issues.apache.org/jira/browse/SPARK-4926 Project: Spark Issue Type: Question Reporter: Lily When I run the program below, I got an error “Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 14) had a not serializable result: org.apache.hadoop.hbase.io.ImmutableBytesWritable”. How can I manipulate the results? How can I implement get, put, and scan for HBase in Scala? There aren't any examples in the source code files.
{code}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark._

object HbaseTest extends Serializable {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseTest")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("hbase.zookeeper.quorum", "192.168.179.146")
    conf.set(TableInputFormat.INPUT_TABLE, "sensteer_rawdata")
    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable("sensteer_rawdata")) {
      val tableDesc = new HTableDescriptor("sensteer_rawdata")
      admin.createTable(tableDesc)
    }
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    val count = hBaseRDD.count()
    println("--" + hBaseRDD.count() + "--")
    val res = hBaseRDD.take(count.toInt)
    sc.stop()
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
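One common way around the non-serializable result, offered here as a suggestion rather than an official recipe, is to map each HBase row to plain serializable values before calling take() or collect(). The snippet builds on the hBaseRDD from the program above; the column family "cf" and qualifier "col" are placeholder names.
{code}
import org.apache.hadoop.hbase.util.Bytes

// Convert each (ImmutableBytesWritable, Result) pair into serializable Strings
// before bringing rows back to the driver.
val rows = hBaseRDD.map { case (key, result) =>
  val rowKey = Bytes.toString(key.get())
  val value = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    .map(Bytes.toString)
    .orNull
  (rowKey, value)
}
val sample = rows.take(10) // the driver now receives plain String tuples
{code}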
[jira] [Commented] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256468#comment-14256468 ] Patrick Wendell commented on SPARK-4906: Hey [~mingyu.z...@gmail.com] - could you say a bit more about how a workload can generate this number of failed tasks in the live set of running stages? If they are each 10kb and you see them taking an aggregate of 500MB, this means you have 50,000 failed tasks in the live set. I've never seen this before because typically once a few tasks have failed the stage will fail, so this definitely seems like an extreme case. Running a job with hundreds of thousands of tasks might require a good-sized heap at the driver even for other reasons. How big of a heap are you using? We might be able to limit the number of unique string objects that are allocated if we have a large number of tasks that refer to an identical stack trace. Spark master OOMs with exception stack trace stored in JobProgressListener -- Key: SPARK-4906 URL: https://issues.apache.org/jira/browse/SPARK-4906 Project: Spark Issue Type: Bug Affects Versions: 1.1.1 Reporter: Mingyu Kim Spark master was OOMing with a lot of stack traces retained in JobProgressListener. The object dependency goes like the following: JobProgressListener.stageIdToData => StageUIData.taskData => TaskUIData.errorMessage. Each error message is ~10kb since it has the entire stack trace. As we have a lot of tasks, when all of the tasks across multiple stages go bad, these error messages accounted for 0.5GB of heap at some point. Please correct me if I'm wrong, but it looks like all the task info for running applications is kept in memory, which means it's almost always bound to OOM for long-running applications. Would it make sense to fix this, for example, by spilling some UI state to disk? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
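The deduplication idea mentioned in the comment could look roughly like the following; this is a minimal sketch of interning identical error strings, not the actual JobProgressListener change.
{code}
import scala.collection.mutable

object ErrorMessageCache {
  private val canonical = mutable.HashMap.empty[String, String]

  // Return a shared copy of an error message: many failed tasks with the same
  // stack trace then reference one String instead of ~10 KB each.
  def intern(message: String): String = synchronized {
    canonical.getOrElseUpdate(message, message)
  }
}

// Usage sketch: store ErrorMessageCache.intern(errorMessage) in the UI data
// structures rather than the raw per-task string.
{code}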
[jira] [Created] (SPARK-4927) Spark does not clean up properly during long jobs.
Ilya Ganelin created SPARK-4927: --- Summary: Spark does not clean up properly during long jobs. Key: SPARK-4927 URL: https://issues.apache.org/jira/browse/SPARK-4927 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Ilya Ganelin On a long-running Spark job, Spark will eventually run out of memory on the driver node due to metadata overhead from the shuffle operation. Spark will continue to operate, but with drastically decreased performance (since swapping now occurs with every operation). The spark.cleaner.ttl parameter allows a user to configure when cleanup happens, but the issue is that it isn't done safely: if this clears a cached RDD or active task in the middle of processing a stage, it ultimately causes a KeyNotFoundException when the next stage attempts to reference the cleared RDD or task. There should be a sustainable mechanism for cleaning up stale metadata that allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
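For reference, the TTL-based cleanup discussed in this report is configured on the SparkConf; a minimal sketch follows, where the 3600-second value is only an example and, as the report warns, an aggressive value can clear data that is still in use.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Enable periodic metadata cleanup; the value is the age in seconds after
// which Spark considers metadata stale and eligible for removal.
val conf = new SparkConf()
  .setAppName("LongRunningJob")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)
{code}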
[jira] [Created] (SPARK-4928) Operation <,>,<=,>= with Decimal report error
guowei created SPARK-4928: - Summary: Operation <,>,<=,>= with Decimal report error Key: SPARK-4928 URL: https://issues.apache.org/jira/browse/SPARK-4928 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei
{code}
create table test (a Decimal(10,1));
select * from test where a > 1;
{code}
{code}
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Types do not match DecimalType(10,1) != DecimalType(10,0), tree: (input[0] > 1)
at org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:249)
at org.apache.spark.sql.catalyst.expressions.GreaterThan.eval(predicates.scala:204)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:794)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:794)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1324)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1324)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
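Until the fix lands, a possible workaround, offered only as an untested sketch against a HiveContext (the table and column names follow the reproduction above), is to cast the literal so both sides share the same precision and scale.
{code}
// Untested sketch: align the literal's DecimalType with the column's type so
// the comparison no longer mixes DecimalType(10,1) and DecimalType(10,0).
sqlContext.sql("SELECT * FROM test WHERE a > CAST(1 AS DECIMAL(10,1))").collect()
{code}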
[jira] [Commented] (SPARK-4928) Operator <,>,<=,>= with decimal between different precision report error
[ https://issues.apache.org/jira/browse/SPARK-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256509#comment-14256509 ] Apache Spark commented on SPARK-4928: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/3767 Operator <,>,<=,>= with decimal between different precision report error -- Key: SPARK-4928 URL: https://issues.apache.org/jira/browse/SPARK-4928 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei
{code}
create table test (a Decimal(10,1));
select * from test where a > 1;
{code}
{code}
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Types do not match DecimalType(10,1) != DecimalType(10,0), tree: (input[0] > 1)
at org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:249)
at org.apache.spark.sql.catalyst.expressions.GreaterThan.eval(predicates.scala:204)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$apply$1.apply(predicates.scala:30)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:794)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:794)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1324)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1324)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256526#comment-14256526 ] Apache Spark commented on SPARK-4920: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/3768 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256544#comment-14256544 ] Apache Spark commented on SPARK-4920: - User 'liyezhang556520' has created a pull request for this issue: https://github.com/apache/spark/pull/3769 current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256545#comment-14256545 ] Zhang, Liye commented on SPARK-4920: Seems standalone mode is not with the version. I agree with [~sowen], it'll be not good looking when the version is the long. Putting the version on footer will be not string but will be flexible for extension. current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4920) current spark version in UI is not striking
[ https://issues.apache.org/jira/browse/SPARK-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256545#comment-14256545 ] Zhang, Liye edited comment on SPARK-4920 at 12/23/14 3:25 AM: -- Seems standalone mode is not with the version on web UI. I agree with [~sowen], it'll be not good looking when the version is the long. Putting the version on footer will be not striking but will be flexible for extension. was (Author: liyezhang556520): Seems standalone mode is not with the version. I agree with [~sowen], it'll be not good looking when the version is the long. Putting the version on footer will be not string but will be flexible for extension. current spark version in UI is not striking --- Key: SPARK-4920 URL: https://issues.apache.org/jira/browse/SPARK-4920 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.2.1 It is not convenient to see the Spark version. We can keep the same style with Spark website. !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/spark_version.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4890) Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it
[ https://issues.apache.org/jira/browse/SPARK-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256581#comment-14256581 ] Apache Spark commented on SPARK-4890: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/3770 Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it - Key: SPARK-4890 URL: https://issues.apache.org/jira/browse/SPARK-4890 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.3.0 We should upgrade to a newer version of Boto (2.34.0), since this is blocking several features. It looks like newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources. Therefore, I think we should change {{spark-ec2}} to automatically download Boto from PyPi if it's not present in {{SPARK_EC2_DIR/lib}}, similar to what we do in the {{sbt/sbt}} scripts. This shouldn't be an issue for users since they already need to have an internet connection to launch an EC2 cluster. By performing the download in {{spark_ec2.py}} instead of the Bash script, this should also work for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4929) Yarn Client mode can not support the HA after the exitcode change
SaintBacchus created SPARK-4929: --- Summary: Yarn Client mode can not support the HA after the exitcode change Key: SPARK-4929 URL: https://issues.apache.org/jira/browse/SPARK-4929 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: SaintBacchus Currently, yarn-client will exit directly when an HA change happens, no matter how many times the AM should retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4929) Yarn Client mode can not support the HA after the exitcode change
[ https://issues.apache.org/jira/browse/SPARK-4929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256585#comment-14256585 ] Apache Spark commented on SPARK-4929: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/3771 Yarn Client mode can not support the HA after the exitcode change - Key: SPARK-4929 URL: https://issues.apache.org/jira/browse/SPARK-4929 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: SaintBacchus Currently, yarn-client will exit directly when an HA change happens, no matter how many times the AM should retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4930) [SQL][DOCS]Update SQL programming guide
Gankun Luo created SPARK-4930: - Summary: [SQL][DOCS]Update SQL programming guide Key: SPARK-4930 URL: https://issues.apache.org/jira/browse/SPARK-4930 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Gankun Luo Priority: Trivial `CACHE TABLE tbl` is now eager by default not lazy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
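For readers of the guide, the behavior being documented, eager caching by default with an explicit LAZY form, can be illustrated roughly as follows; this assumes a SQLContext with a registered table named records and the CACHE LAZY TABLE syntax available in this release line.
{code}
// CACHE TABLE is now eager: the table is materialized in the in-memory
// columnar store as soon as the statement runs.
sqlContext.sql("CACHE TABLE records")

// The previous lazy behavior remains available explicitly.
sqlContext.sql("CACHE LAZY TABLE records")

// UNCACHE TABLE releases the cached data.
sqlContext.sql("UNCACHE TABLE records")
{code}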
[jira] [Commented] (SPARK-4930) [SQL][DOCS]Update SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256632#comment-14256632 ] Apache Spark commented on SPARK-4930: - User 'luogankun' has created a pull request for this issue: https://github.com/apache/spark/pull/3773 [SQL][DOCS]Update SQL programming guide Key: SPARK-4930 URL: https://issues.apache.org/jira/browse/SPARK-4930 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Gankun Luo Priority: Trivial `CACHE TABLE tbl` is now eager by default not lazy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org