[jira] [Commented] (SPARK-21104) Support sort with index when parse LibSVM Record

2017-06-15 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051388#comment-16051388
 ] 

darion yaphet commented on SPARK-21104:
---

We should make sure the array is in order, instead of checking and reporting 
with an exception.

> Support sort with index when parse LibSVM Record
> 
>
> Key: SPARK-21104
> URL: https://issues.apache.org/jira/browse/SPARK-21104
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When loading LibSVM data from HDFS, I found that the feature indices must be 
> in ascending order. 
> We can sort by *indices* when we parse each input line into (index, value) 
> tuples, and avoid checking whether the indices are in ascending order afterwards.
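A minimal sketch of the idea (illustrative only, not MLlib's actual parser), 
assuming whitespace-separated {{index:value}} tokens with 1-based indices:

{code}
// Hedged sketch: parse one LibSVM line into (index, value) pairs and sort
// them by index during parsing, instead of checking afterwards that the
// indices are ascending.
def parseLibSVMLine(line: String): (Double, Array[(Int, Double)]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.map { token =>
    val Array(index, value) = token.split(":")
    (index.toInt - 1, value.toDouble)  // LibSVM indices are 1-based
  }
  (label, features.sortBy(_._1))
}

// parseLibSVMLine("1 3:0.5 1:2.0") yields (1.0, Array((0, 2.0), (2, 0.5)))
{code}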



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21114) Test failure in Spark 2.1 due to name mismatch

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21114.
-
   Resolution: Fixed
Fix Version/s: 2.1.2

> Test failure in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.2
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21072.
-
   Resolution: Fixed
 Assignee: coneyliu
Fix Version/s: 2.2.0
   2.1.2

> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>Assignee: coneyliu
> Fix For: 2.1.2, 2.2.0
>
>
> As the name and comments of `TreeNode.mapChildren` indicate, the function 
> should apply only to the current node's children. So the following code 
> should check whether the argument is actually a child node.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
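A toy sketch of the intended behaviour (illustrative only, not the Catalyst 
{{TreeNode}} code): a function applied via {{mapChildren}} should touch only 
arguments that really are children of the current node and leave other fields 
of the same type alone.

{code}
// Toy sketch: a node whose constructor arguments contain both real children
// and an unrelated field of the same node type. The mapChildren analogue
// must rewrite only the real children.
case class Node(name: String, children: Seq[Node], lookup: Option[Node]) {
  // `lookup` has the child type but is not a child, so it must be skipped.
  def mapChildren(f: Node => Node): Node =
    copy(children = children.map(f))
}

val child = Node("child", Nil, None)
val other = Node("other", Nil, None)
val root  = Node("root", Seq(child), lookup = Some(other))

// Only "child" is renamed; the `lookup` reference is left untouched.
val renamed = root.mapChildren(c => c.copy(name = c.name.toUpperCase))
{code}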



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12552) Recovered driver's resource is not counted in the Master

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051322#comment-16051322
 ] 

Apache Spark commented on SPARK-12552:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18321

> Recovered driver's resource is not counted in the Master
> 
>
> Key: SPARK-12552
> URL: https://issues.apache.org/jira/browse/SPARK-12552
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.2.1, 2.3.0
>
>
> Currently, in the Standalone Master HA implementation, if an application is 
> submitted in cluster mode, the driver's resources (CPU cores and memory) are 
> not counted again when it is recovered from a failure. This leads to 
> unexpected behaviour, such as more executors than expected and negative core 
> and memory usage in the web UI. Also, the recovered application's state is 
> always {{WAITING}}; we have to change the state to {{RUNNING}} once it is 
> fully recovered.
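A simplified, hedged sketch of the intended accounting fix (names are 
illustrative, not the actual Master code): on recovery, a cluster-mode 
driver's cores and memory must be charged back to the worker that hosts it, 
and the recovered application's state should move from WAITING to RUNNING.

{code}
// Hedged sketch only: charge a recovered driver's resources back to its
// worker, otherwise the web UI ends up showing negative core/memory usage.
case class WorkerInfo(var coresFree: Int, var memoryFree: Int)
case class DriverInfo(cores: Int, memory: Int)

def recoverDriver(worker: WorkerInfo, driver: DriverInfo): Unit = {
  worker.coresFree -= driver.cores
  worker.memoryFree -= driver.memory
}
{code}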



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21093:


Assignee: (was: Apache Spark)

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(+0x11ee91)[0x7fe69c5abe91]
> 

[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051308#comment-16051308
 ] 

Apache Spark commented on SPARK-21093:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18320

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> 

[jira] [Assigned] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21093:


Assignee: Apache Spark

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(+0x11ee91)[0x7fe69c5abe91]
> 

[jira] [Resolved] (SPARK-21112) ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21112.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18318
[https://github.com/apache/spark/pull/18318]

> ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT
> --
>
> Key: SPARK-21112
> URL: https://issues.apache.org/jira/browse/SPARK-21112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> {{ALTER TABLE SET TBLPROPERTIES}} should not overwrite the COMMENT even if 
> the input does not include the `COMMENT` property.
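A small reproduction sketch of the expected behaviour (illustrative table 
name; assumes a spark-shell session):

{code}
// Illustrative reproduction: setting an unrelated table property should
// leave the table COMMENT untouched.
spark.sql("CREATE TABLE t_comment (id INT) USING parquet COMMENT 'original comment'")
spark.sql("ALTER TABLE t_comment SET TBLPROPERTIES ('some.key' = 'v1')")

// Expected: the comment is still 'original comment'.
spark.sql("DESCRIBE TABLE EXTENDED t_comment").show(false)
{code}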



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21114) Test failure in Spark 2.1 due to name mismatch

2017-06-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21114:

Description: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/

  was:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/


> Test failure in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21114) Test failure in Spark 2.1 due to name mismatch

2017-06-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21114:

Description: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/

  
was:https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/


> Test failure in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21114) Test failure in Spark 2.1 due to name mismatch

2017-06-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21114:

Affects Version/s: 2.0.2

> Test failure in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21114) Test failure in Spark 2.1 due to name mismatch

2017-06-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21114:

Summary: Test failure in Spark 2.1 due to name mismatch  (was: Test failure 
fix in Spark 2.1 due to name mismatch)

> Test failure in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21114) Test failure fix in Spark 2.1 due to name mismatch

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21114:


Assignee: Xiao Li  (was: Apache Spark)

> Test failure fix in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21114) Test failure fix in Spark 2.1 due to name mismatch

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051290#comment-16051290
 ] 

Apache Spark commented on SPARK-21114:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18319

> Test failure fix in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21114) Test failure fix in Spark 2.1 due to name mismatch

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21114:


Assignee: Apache Spark  (was: Xiao Li)

> Test failure fix in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21114) Test failure fix in Spark 2.1 due to name mismatch

2017-06-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21114:
---

 Summary: Test failure fix in Spark 2.1 due to name mismatch
 Key: SPARK-21114
 URL: https://issues.apache.org/jira/browse/SPARK-21114
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.1.1
Reporter: Xiao Li
Assignee: Xiao Li


https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21114) Test failure fix in Spark 2.1 due to name mismatch

2017-06-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21114:

Description: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
  (was: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/)

> Test failure fix in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException

2017-06-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051288#comment-16051288
 ] 

Takeshi Yamamuro commented on SPARK-17237:
--

I checked the other behaviours, and I think it is correct to use aggregated 
column names without qualifiers (since the other aggregated columns have no 
qualifier: https://github.com/apache/spark/pull/18302/files). So the current 
master behaviour looks okay to me for now.

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -
>
> Key: SPARK-17237
> URL: https://issues.apache.org/jira/browse/SPARK-17237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jiang Qiqi
>Assignee: Takeshi Yamamuro
>  Labels: newbie
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> I am trying to run a pivot transformation that worked on a Spark 1.6 
> cluster, namely:
> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): 
> double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+--++--++
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+--++--++
> |  2| 1| 4.0| 0| 0.0|
> |  3| 0| 0.0| 1| 5.0|
> +---+--++--++
> After upgrading the environment to Spark 2.0, I got an error while 
> executing the .na.fill method:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `3_count(`c`)`;
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
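For affected versions, one hedged workaround (not from this ticket's fix) is 
to rename the pivoted columns so they contain no parentheses or backticks 
before calling na.fill; a sketch assuming a spark-shell session:

{code}
// Hedged workaround sketch: strip parentheses/backticks from the generated
// pivot column names, then fill as usual.
import org.apache.spark.sql.functions.{avg, count}

val df = Seq((2, 3, 4), (3, 4, 5)).toDF("a", "b", "c")
val pivoted = df.groupBy("a").pivot("b").agg(count("c"), avg("c"))

// e.g. "3_count(c)" -> "3_count_c_"
val sanitized = pivoted.toDF(pivoted.columns.map(_.replaceAll("[()`]", "_")): _*)
sanitized.na.fill(0).show()
{code}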



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21111) Fix test failure in 2.2

2017-06-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-21111.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18316
[https://github.com/apache/spark/pull/18316]

> Fix test failure in 2.2 
> 
>
> Key: SPARK-21111
> URL: https://issues.apache.org/jira/browse/SPARK-21111
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Test failure:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.2-test-sbt-hadoop-2.7/203/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051264#comment-16051264
 ] 

Hyukjin Kwon commented on SPARK-21093:
--

BTW, {{mcfork}} in R appears to open a pipe ahead of time, but the existing 
logic does not properly close it when it is executed repeatedly. This is why I 
suspect it is an R issue.

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> 

[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051244#comment-16051244
 ] 

Hyukjin Kwon commented on SPARK-21093:
--

This does disappear under certain conditions that I could not pin down; 
however, the `daemon.R` process eventually keeps the pipes open when it is 
executed heavily.

While executing the SparkR code:

{code}
501  2041  1968   0  -:--AM ttys0000:01.50 
/Library/Frameworks/R.framework/Resources/bin/exec/R --slave --no-restore 
--vanilla --file=.../spark/R/lib/SparkR/worker/daemon.R
{code}


{code}
lsof -p 2041
{code}

{code}
...
R   2041 hyukjinkwon7   PIPE 0x15b5dd8f99c8d451 16384
R   2041 hyukjinkwon8   PIPE 0x15b5dd8f99c00911 16384
R   2041 hyukjinkwon9   PIPE 0x15b5dd8f940201d1 16384
R   2041 hyukjinkwon   10   PIPE 0x15b5dd8f9401f811 16384
R   2041 hyukjinkwon   11   PIPE 0x15b5dd8f99deec11 16384
R   2041 hyukjinkwon   12   PIPE 0x15b5dd8f96a18851 16384
R   2041 hyukjinkwon   13   PIPE 0x15b5dd8f99deee51 16384
R   2041 hyukjinkwon   14   PIPE 0x15b5dd8f99dee851 16384
R   2041 hyukjinkwon   15   PIPE 0x15b5dd8f99df13d1 16384
R   2041 hyukjinkwon   16   PIPE 0x15b5dd8f99c8ce51 16384
R   2041 hyukjinkwon   17   PIPE 0x15b5dd8f96a18f11 16384
R   2041 hyukjinkwon   18   PIPE 0x15b5dd8f96a195d1 16384
R   2041 hyukjinkwon   19   PIPE 0x15b5dd8f99e73d91 16384
R   2041 hyukjinkwon   20   PIPE 0x15b5dd8f96a18c11 16384
R   2041 hyukjinkwon   21   PIPE 0x15b5dd8f99df0651 16384
R   2041 hyukjinkwon   22   PIPE 0x15b5dd8f99def691 16384
R   2041 hyukjinkwon   23   PIPE 0x15b5dd8f99e75f51 16384
R   2041 hyukjinkwon   24   PIPE 0x15b5dd8f99def091 16384
R   2041 hyukjinkwon   25   PIPE 0x15b5dd8f99e74d51 16384
R   2041 hyukjinkwon   26   PIPE 0x15b5dd8f99e75591 16384
{code}

{code}
lsof -p 2041 | wc -l
{code}

This number keeps increasing and decreasing, but it eventually increases 
consistently.

The same thing happens on CentOS too (in this case, it looks like it increases 
more aggressively):

{code}
R   29617 root   16r  FIFO  0,8   0t040330039 pipe
R   29617 root   17r  FIFO  0,8   0t040330197 pipe
R   29617 root   18r  FIFO  0,8   0t040330041 pipe
R   29617 root   19r  FIFO  0,8   0t040330067 pipe
R   29617 root   20w  FIFO  0,8   0t040326795 pipe
R   29617 root   21r  FIFO  0,8   0t040326800 pipe
R   29617 root   22r  FIFO  0,8   0t040330191 pipe
R   29617 root   23r  FIFO  0,8   0t040330047 pipe
R   29617 root   24r  FIFO  0,8   0t040330049 pipe
R   29617 root   25r  FIFO  0,8   0t040330011 pipe
R   29617 root   26w  FIFO  0,8   0t040326801 pipe
R   29617 root   27r  FIFO  0,8   0t040330199 pipe
R   29617 root   28r  FIFO  0,8   0t040330051 pipe
R   29617 root   29r  FIFO  0,8   0t040326808 pipe
R   29617 root   30w  FIFO  0,8   0t040330124 pipe
R   29617 root   31r  FIFO  0,8   0t040326810 pipe
R   29617 root   32r  FIFO  0,8   0t040330133 pipe
R   29617 root   33w  FIFO  0,8   0t040330012 pipe
{code} 

One workaround, just to make the test pass, is, I think, to avoid reusing 
`daemon.R` by setting {{spark.sparkr.use.daemon}} to {{false}}. I will 
double-check and propose this as a fix for now.
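For reference, a minimal sketch of where that flag lives on the JVM side 
(illustration only; in practice it would usually be passed via spark-submit 
--conf or the SparkR session config):

{code}
import org.apache.spark.SparkConf

// Illustration only: disable daemon.R reuse so each R worker is forked
// freshly, as a temporary workaround for the leaked pipes described above.
val conf = new SparkConf()
  .setAppName("sparkr-gapply-workaround")
  .set("spark.sparkr.use.daemon", "false")
{code}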

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, 

[jira] [Resolved] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21096.
--
Resolution: Not A Problem

That's not an issue in Spark, but rather in cloudpickle or Python.

I wonder how we could avoid this, though:

{code}
>>> class A():
... def __init__(self):
... self.a = "a"
... self.b = "b"
...
>>> obj = A()
>>> def test(): obj.b
...
>>> b = "a"
>>> def test1(): b
...
>>> import dis
>>> dis.dis(test)
  1   0 LOAD_GLOBAL  0 (obj)
  3 LOAD_ATTR1 (b)
  6 POP_TOP
  7 LOAD_CONST   0 (None)
 10 RETURN_VALUE
>>> dis.dis(test1)
  1   0 LOAD_GLOBAL  0 (b)
  3 POP_TOP
  4 LOAD_CONST   0 (None)
  7 RETURN_VALUE
{code}

> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {quote}
> def build_fail(self):
> processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
> return processed.collect()
> {quote}
> While this method will run just fine:
> {quote}
> def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> {quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21112) ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21112:


Assignee: Xiao Li  (was: Apache Spark)

> ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT
> --
>
> Key: SPARK-21112
> URL: https://issues.apache.org/jira/browse/SPARK-21112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {{ALTER TABLE SET TBLPROPERTIES}} should not overwrite the COMMENT even if 
> the input does not include the `COMMENT` property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21112) ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051140#comment-16051140
 ] 

Apache Spark commented on SPARK-21112:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18318

> ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT
> --
>
> Key: SPARK-21112
> URL: https://issues.apache.org/jira/browse/SPARK-21112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {{ALTER TABLE SET TBLPROPERTIES}} should not overwrite the COMMENT even if 
> the input does not include the `COMMENT` property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21112) ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21112:


Assignee: Apache Spark  (was: Xiao Li)

> ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT
> --
>
> Key: SPARK-21112
> URL: https://issues.apache.org/jira/browse/SPARK-21112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {{ALTER TABLE SET TBLPROPERTIES}} should not overwrite the COMMENT even if 
> the input does not include the `COMMENT` property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21113) Support for read ahead input stream to amortize disk IO cost in the Spill reader

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21113:


Assignee: (was: Apache Spark)

> Support for read ahead input stream to amortize disk IO cost in the Spill 
> reader
> 
>
> Key: SPARK-21113
> URL: https://issues.apache.org/jira/browse/SPARK-21113
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Sital Kedia
>Priority: Minor
>
> Profiling some of our big jobs, we see that around 30% of the time is being 
> spent in reading the spill files from disk. In order to amortize the disk IO 
> cost, the idea is to implement a read-ahead input stream which asynchronously 
> reads ahead from the underlying input stream once a specified amount of data 
> has been read from the current buffer. It does this by maintaining two 
> buffers: an active buffer and a read-ahead buffer. The active buffer contains 
> data that should be returned when a read() call is issued. The read-ahead 
> buffer is used to asynchronously read from the underlying input stream, and 
> once the current active buffer is exhausted, we flip the two buffers so that 
> we can start reading from the read-ahead buffer without being blocked on 
> disk I/O.
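A rough, hedged sketch of that double-buffering idea (illustrative only, not 
the proposed patch):

{code}
import java.io.InputStream
import java.util.concurrent.{Callable, Executors, Future}

// Simplified sketch: read() is served from the active buffer while a
// background task refills the read-ahead buffer; when the active buffer is
// exhausted the two buffers are flipped.
class ReadAheadSketch(underlying: InputStream, bufferSize: Int = 64 * 1024) {
  private val executor = Executors.newSingleThreadExecutor()
  private var active = new Array[Byte](bufferSize)
  private var ahead = new Array[Byte](bufferSize)
  private var pos = 0
  private var activeLen = underlying.read(active)

  // Asynchronously fill the current read-ahead buffer.
  private def fillAhead(): Future[Integer] =
    executor.submit(new Callable[Integer] {
      override def call(): Integer = underlying.read(ahead)
    })

  private var pending = fillAhead()

  def read(): Int = {
    if (pos >= activeLen) {
      val aheadLen = pending.get().intValue()        // wait for the background read
      if (aheadLen <= 0) return -1                   // end of the underlying stream
      val tmp = active; active = ahead; ahead = tmp  // flip the buffers
      activeLen = aheadLen
      pos = 0
      pending = fillAhead()                          // refill the new read-ahead buffer
    }
    val b = active(pos) & 0xff
    pos += 1
    b
  }
}
{code}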



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21113) Support for read ahead input stream to amortize disk IO cost in the Spill reader

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051134#comment-16051134
 ] 

Apache Spark commented on SPARK-21113:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/18317

> Support for read ahead input stream to amortize disk IO cost in the Spill 
> reader
> 
>
> Key: SPARK-21113
> URL: https://issues.apache.org/jira/browse/SPARK-21113
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Sital Kedia
>Priority: Minor
>
> Profiling some of our big jobs, we see that around 30% of the time is being 
> spent in reading the spill files from disk. In order to amortize the disk IO 
> cost, the idea is to implement a read-ahead input stream which asynchronously 
> reads ahead from the underlying input stream once a specified amount of data 
> has been read from the current buffer. It does this by maintaining two 
> buffers: an active buffer and a read-ahead buffer. The active buffer contains 
> data that should be returned when a read() call is issued. The read-ahead 
> buffer is used to asynchronously read from the underlying input stream, and 
> once the current active buffer is exhausted, we flip the two buffers so that 
> we can start reading from the read-ahead buffer without being blocked on 
> disk I/O.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21113) Support for read ahead input stream to amortize disk IO cost in the Spill reader

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21113:


Assignee: Apache Spark

> Support for read ahead input stream to amortize disk IO cost in the Spill 
> reader
> 
>
> Key: SPARK-21113
> URL: https://issues.apache.org/jira/browse/SPARK-21113
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Sital Kedia
>Assignee: Apache Spark
>Priority: Minor
>
> Profiling some of our big jobs, we see that around 30% of the time is being 
> spent in reading the spill files from disk. In order to amortize the disk IO 
> cost, the idea is to implement a read-ahead input stream which asynchronously 
> reads ahead from the underlying input stream once a specified amount of data 
> has been read from the current buffer. It does this by maintaining two 
> buffers: an active buffer and a read-ahead buffer. The active buffer contains 
> data that should be returned when a read() call is issued. The read-ahead 
> buffer is used to asynchronously read from the underlying input stream, and 
> once the current active buffer is exhausted, we flip the two buffers so that 
> we can start reading from the read-ahead buffer without being blocked on 
> disk I/O.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21112) ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT

2017-06-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21112:
---

 Summary: ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT
 Key: SPARK-21112
 URL: https://issues.apache.org/jira/browse/SPARK-21112
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


{{ALTER TABLE SET TBLPROPERTIES}} should not overwrite the COMMENT even if the 
input does not include the `COMMENT` property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21113) Support for read ahead input stream to amortize disk IO cost in the Spill reader

2017-06-15 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-21113:
---

 Summary: Support for read ahead input stream to amortize disk IO 
cost in the Spill reader
 Key: SPARK-21113
 URL: https://issues.apache.org/jira/browse/SPARK-21113
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.2
Reporter: Sital Kedia
Priority: Minor


Profiling some of our big jobs, we see that around 30% of the time is being 
spent in reading the spill files from disk. In order to amortize the disk IO 
cost, the idea is to implement a read-ahead input stream which asynchronously 
reads ahead from the underlying input stream once a specified amount of data 
has been read from the current buffer. It does this by maintaining two 
buffers: an active buffer and a read-ahead buffer. The active buffer contains 
data that should be returned when a read() call is issued. The read-ahead 
buffer is used to asynchronously read from the underlying input stream, and 
once the current active buffer is exhausted, we flip the two buffers so that 
we can start reading from the read-ahead buffer without being blocked on disk I/O.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-06-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051109#comment-16051109
 ] 

Marcelo Vanzin commented on SPARK-18838:


bq. If we're careful in performance optimization of Spark's core internal 
listeners ... then it might be okay to publish events directly to those 
listeners

While it's possible to squeeze better performance out of the current code, it's 
hard to make statements without an actual goal in mind. What are the 
requirements for a listener to be considered for this feature? E.g., no more 
than "x" µs of processing time per event, or something like that? How many 
listeners do you allow to operate in this manner before they start to slow down 
the code that's publishing events?

I actually spent a whole lot of time refactoring and optimizing listener code 
as part of the work in SPARK-18085 (which I'm slowly sending out for review), 
but there's only so much you can do in certain cases. The executor allocation 
listener is probably the only one I'd even consider for this category. Any 
other listener, especially ones that collect data in one way or another, ends 
up hitting slow paths every once in a while - e.g. you're collecting things in 
some data structure and suddenly you need to resize / copy a bunch of things, 
and there goes your processing time.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job.  For example, a 
> significant delay in receiving `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of events 
> per listener.  The queue used for the executor service will be bounded to 
> guarantee we do not grow memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint. 
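
Roughly, the proposal amounts to something like the following sketch (the listener 
trait and the queue-full policy are simplified stand-ins for illustration, not 
Spark's actual listener interfaces):

{code}
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Sketch of the proposal: one bounded, single-threaded executor per listener,
// so a slow listener only delays its own queue instead of every listener.
trait Listener { def onEvent(event: AnyRef): Unit }

class PerListenerBus(listeners: Seq[Listener], queueCapacity: Int = 10000) {
  // A single worker thread and a bounded queue per listener. CallerRunsPolicy is
  // just one possible choice for a full queue (back-pressure rather than drops).
  private val executors = listeners.map { l =>
    (l, new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
      new LinkedBlockingQueue[Runnable](queueCapacity),
      new ThreadPoolExecutor.CallerRunsPolicy()))
  }

  def post(event: AnyRef): Unit = {
    // Submit the event to every listener's executor; each worker thread drains
    // its own queue, so one slow listener does not block the others.
    executors.foreach { case (listener, exec) =>
      exec.execute(new Runnable { override def run(): Unit = listener.onEvent(event) })
    }
  }

  def stop(): Unit = executors.foreach { case (_, exec) => exec.shutdown() }
}
{code}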



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-06-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051097#comment-16051097
 ] 

Marcelo Vanzin commented on SPARK-18838:


[~sitalke...@gmail.com] I requested your PR be closed. It was sitting there 
with unaddressed feedback and multiple people asking for updates for about 6 
months. If you're still interested in updating it you can re-open it and, more 
importantly, actually update it.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job.  For example, a 
> significant delay in receiving `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of events 
> per listener.  The queue used for the executor service will be bounded to 
> guarantee we do not grow memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21111) Fix test failure in 2.2

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051095#comment-16051095
 ] 

Apache Spark commented on SPARK-21111:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18316

> Fix test failure in 2.2 
> 
>
> Key: SPARK-21111
> URL: https://issues.apache.org/jira/browse/SPARK-21111
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> Test failure:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.2-test-sbt-hadoop-2.7/203/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21111) Fix test failure in 2.2

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21111:


Assignee: Xiao Li  (was: Apache Spark)

> Fix test failure in 2.2 
> 
>
> Key: SPARK-21111
> URL: https://issues.apache.org/jira/browse/SPARK-21111
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> Test failure:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.2-test-sbt-hadoop-2.7/203/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21111) Fix test failure in 2.2

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21111:


Assignee: Apache Spark  (was: Xiao Li)

> Fix test failure in 2.2 
> 
>
> Key: SPARK-21111
> URL: https://issues.apache.org/jira/browse/SPARK-21111
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Blocker
>
> Test failure:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.2-test-sbt-hadoop-2.7/203/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051094#comment-16051094
 ] 

Shivaram Venkataraman commented on SPARK-21093:
---

Thanks [~hyukjin.kwon] -- those are very useful debugging notes. In addition to 
filing the bug in R, I am wondering if there is something we can do in our 
SparkR code to mitigate this. Could we, say, add a sleep or pause before the 
gapply tests? Or, in other words, do the pipes / sockets disappear after some 
time?

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> 

[jira] [Commented] (SPARK-21025) missing data in jsc.union

2017-06-15 Thread meng xi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051090#comment-16051090
 ] 

meng xi commented on SPARK-21025:
-

I created a new list for each RDD and it works, thanks! 
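
If I understand the behavior correctly, parallelize() only snapshots the collection 
when the RDD's partitions are first computed, so clearing the live list beforehand 
empties the data the RDD still references; that would also explain why the 
commented-out rdd.count() call makes it work. A minimal Scala sketch of the 
"new list per RDD" workaround (spark-shell style, `sc` in scope; names are 
illustrative):

{code}
// Sketch of the workaround: snapshot the buffer before parallelizing it, so a
// later clear() cannot empty the data the RDD still points to.
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

val src = sc.parallelize((0 until 3000).map(_ => new Array[String](10)))
val buffer = ArrayBuffer[Array[String]]()
var rdds = List.empty[RDD[Array[String]]]

for (row <- src.toLocalIterator) {
  buffer += row
  if (buffer.size == 1000) {
    rdds ::= sc.parallelize(buffer.toList)   // toList takes an immutable snapshot
    buffer.clear()
  }
}
println(sc.union(sc.parallelize(buffer.toList) :: rdds).count())   // expect 3000
{code}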

> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
> Attachments: SparkTest.java
>
>
> We are using an iterator over an RDD for some special data processing, and 
> then using union to rebuild a new RDD. We found the resulting RDD is often 
> empty or missing most of the data. Here is a simplified code snippet for this 
> bug:
> SparkConf sparkConf = new 
> SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = 
> JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 
> 3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator<String[]> it = src.toLocalIterator();
> List<JavaRDD<String[]>> rddList = new LinkedList<>();
> List<String[]> resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
> resultBuffer.add(it.next());
> if (resultBuffer.size() == 1000) {
> JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
> //rdd.count();
> rddList.add(rdd);
> resultBuffer.clear();
> }
> }
> JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), 
> rddList);
> System.out.println(desc.count());
> This code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it returns the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21111) Fix test failure in 2.2

2017-06-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21111:
---

 Summary: Fix test failure in 2.2 
 Key: SPARK-21111
 URL: https://issues.apache.org/jira/browse/SPARK-21111
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li
Priority: Blocker


Test failure:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.2-test-sbt-hadoop-2.7/203/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-15 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051017#comment-16051017
 ] 

Dayou Zhou commented on SPARK-21101:


Hi [~srowen], thanks for the helpful and constructive comment.  So yes, I have 
also tried starting STS using the --jars option, i.e.

./start-thriftserver.sh --jars /path/to/udf.jar

and have also verified that by doing this, I no longer need to specify USING 
JAR when creating my udf, i.e.

CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf'

However, the bad news is that when I invoke the udf, I get exactly the same 
error as before, i.e.

>> No handler for Hive UDF 'com.foo.MyUdtf': java.lang.NullPointerException; 
>> line 1 pos 7

So I have reported what I wanted to report, and I will leave it to the 
authorities to decide whether this is a bug or a 'question' (even though I do 
have an opinion on which).  Thanks for your help.

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-06-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051013#comment-16051013
 ] 

Reynold Xin commented on SPARK-13333:
-

But this ticket has nothing to do with SQL?

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}
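
One possible workaround for the repro above (a hedged sketch, not a fix for the 
underlying nondeterminism handling): materialize the random column once, e.g. by 
caching and forcing evaluation, so both branches of the union reuse the same values.

{code}
// Hedged workaround sketch, continuing from the repro above: cache df1 and force
// evaluation so randn() is computed once, then both union branches reuse those values.
val df1Cached = df1.cache()
df1Cached.count()                          // materialize the random column
val df3Fixed = df1Cached.unionAll(df1Cached.select("id", "b"))
df3Fixed.show()                            // expected: two identical copies of df1's row
{code}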



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide

2017-06-15 Thread Neil Parker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050999#comment-16050999
 ] 

Neil Parker edited comment on SPARK-20937 at 6/15/17 8:11 PM:
--

+1
I ran into an issue recently and it took me a while to find this option


was (Author: nwparker):
+1
I ran into the issue recently and took me awhile to find this option

> Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
> DataFrames and Datasets Guide
> -
>
> Key: SPARK-20937
> URL: https://issues.apache.org/jira/browse/SPARK-20937
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As a follow-up to SPARK-20297 (and SPARK-10400) in which 
> {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala 
> and Hive, Spark SQL docs for [Parquet 
> Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration]
>  should have it documented.
> p.s. It was asked about in [Why can't Impala read parquet files after Spark 
> SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow 
> today.
> p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance 
> Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 
> 3-10. Parquet data source options) that gives the option some wider publicity.
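
Until the guide is updated, a short usage sketch of the property (the {{spark}} 
session, {{df}} DataFrame, and output path below are placeholders):

{code}
// Sketch: write Parquet in the legacy (Spark 1.x / Hive-compatible) format so that
// Impala / Hive can read the output. `df` and the path are placeholders; the
// property can also be passed at submit time via
// --conf spark.sql.parquet.writeLegacyFormat=true.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.parquet("/tmp/table_readable_by_impala")
{code}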



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide

2017-06-15 Thread Neil Parker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050999#comment-16050999
 ] 

Neil Parker commented on SPARK-20937:
-

+1
I ran into the issue recently and took me awhile to find this option

> Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
> DataFrames and Datasets Guide
> -
>
> Key: SPARK-20937
> URL: https://issues.apache.org/jira/browse/SPARK-20937
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As a follow-up to SPARK-20297 (and SPARK-10400) in which 
> {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala 
> and Hive, Spark SQL docs for [Parquet 
> Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration]
>  should have it documented.
> p.s. It was asked about in [Why can't Impala read parquet files after Spark 
> SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow 
> today.
> p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance 
> Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 
> 3-10. Parquet data source options) that gives the option some wider publicity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19490) Hive partition columns are case-sensitive

2017-06-15 Thread Taklon Stephen Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050993#comment-16050993
 ] 

Taklon Stephen Wu commented on SPARK-19490:
---

I pinged on the PR, but I didn't get anything back. So, is there any plan to 
review this patch?

> Hive partition columns are case-sensitive
> -
>
> Key: SPARK-19490
> URL: https://issues.apache.org/jira/browse/SPARK-19490
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> The real partition columns are lower case (year, month, day)
> {code}
> Caused by: java.lang.RuntimeException: Expected only partition pruning 
> predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> {code}
> The following SQL reproduces this bug:
> CREATE TABLE partition_test (key Int) partitioned by (date string)
> SELECT * FROM partition_test where DATE = '20170101'



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21110) Structs should be usable in inequality filters

2017-06-15 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-21110:
-
Summary: Structs should be usable in inequality filters  (was: Structs 
should be orderable)

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21110) Structs should be orderable

2017-06-15 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-21110:


 Summary: Structs should be orderable
 Key: SPARK-21110
 URL: https://issues.apache.org/jira/browse/SPARK-21110
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Nicholas Chammas
Priority: Minor


It seems like a missing feature that you can't compare structs in a filter on a 
DataFrame.

Here's a simple demonstration of a) where this would be useful and b) how it's 
different from simply comparing each of the components of the structs.

{code}
import pyspark
from pyspark.sql.functions import col, struct, concat

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
[
('Boston', 'Bob'),
('Boston', 'Nick'),
('San Francisco', 'Bob'),
('San Francisco', 'Nick'),
],
['city', 'person']
)
pairs = (
df.select(
struct('city', 'person').alias('p1')
)
.crossJoin(
df.select(
struct('city', 'person').alias('p2')
)
)
)

print("Everything")
pairs.show()

print("Comparing parts separately (doesn't give me what I want)")
(pairs
.where(col('p1.city') < col('p2.city'))
.where(col('p1.person') < col('p2.person'))
.show())

print("Comparing parts together with concat (gives me what I want but is 
hacky)")
(pairs
.where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
.show())

print("Comparing parts together with struct (my desired solution but currently 
yields an error)")
(pairs
.where(col('p1') < col('p2'))
.show())
{code}

The last query yields the following error in Spark 2.1.1:

{code}
org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint or 
int or bigint or float or double or decimal or timestamp or date or string or 
binary) type, not struct;;
'Filter (p1#5 < p2#8)
+- Join Cross
   :- Project [named_struct(city, city#0, person, person#1) AS p1#5]
   :  +- LogicalRDD [city#0, person#1]
   +- Project [named_struct(city, city#0, person, person#1) AS p2#8]
  +- LogicalRDD [city#0, person#1]
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21109) union two dataset[A] don't work as expected if one of the datasets is originated from a dataframe

2017-06-15 Thread Jerry Lam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Lam updated SPARK-21109:
--
Description: 
To reproduce the issue:
{code}
case class my_case(id0: Long, id1: Int, id2: Int, id3: String)
val data1 = Seq(my_case(0L, 0, 0, "0")).toDS
val data2 = Seq(("1", 1, 1, 1L)).toDF("id3", "id1", "id2", "id0").as[my_case]

data1.show
+---+---+---+---+
|id0|id1|id2|id3|
+---+---+---+---+
|  0|  0|  0|  0|
+---+---+---+---+

data2.show
+---+---+---+---+
|id3|id1|id2|id0|
+---+---+---+---+
|  1|  1|  1|  1|
+---+---+---+---+

data1.union(data2).show

org.apache.spark.sql.AnalysisException: Cannot up cast `id0` from string to 
bigint as it may truncate
The type path of the target object is:
- field (class: "scala.Long", name: "id0")
- root class: "my_case"
You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object;
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2123)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2153)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2140)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:336)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:334)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2140)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2136)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
  at 

[jira] [Updated] (SPARK-21109) union two dataset[A] don't work as expected if one of the datasets is originated from a dataframe

2017-06-15 Thread Jerry Lam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Lam updated SPARK-21109:
--
Description: 
To reproduce the issue:
{code}
case class my_case(id0: Long, id1: Int, id2: Int, id3: String)
val data1 = Seq(my_case(0L, 0, 0, "0")).toDS
val data2 = Seq(("1", 1, 1, 1L)).toDF("id3", "id1", "id2", "id0").as[my_case]

data1.show
+---+---+---+---+
|id0|id1|id2|id3|
+---+---+---+---+
|  0|  0|  0|  0|
+---+---+---+---+

data2.show
+---+---+---+---+
|id3|id1|id2|id0|
+---+---+---+---+
|  1|  1|  1|  1|
+---+---+---+---+

data1.union(data2).show

org.apache.spark.sql.AnalysisException: Cannot up cast `id0` from string to 
bigint as it may truncate
The type path of the target object is:
- field (class: "scala.Long", name: "id0")
- root class: "my_case"
You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object;
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2123)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2153)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2140)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:336)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:334)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2140)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2136)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
  at 

[jira] [Comment Edited] (SPARK-18649) sc.textFile(my_file).collect() raises socket.timeout on large files

2017-06-15 Thread Serge Vilvovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050940#comment-16050940
 ] 

Serge Vilvovsky edited comment on SPARK-18649 at 6/15/17 6:57 PM:
--

Does anybody work on the issue? Same problem (PythonRDD: Error while sending 
iterator
java.net.SocketException: Connection reset) occurs for me on collect() of the 
large elasticsearch RDD. 


was (Author: sergevil):
Does anybody work on the issue? Same problem (PythonRDD: Error while sending 
iterator
java.net.SocketException: Connection reset) occures for me on collect() of the 
large elasticsearch RDD. 

> sc.textFile(my_file).collect() raises socket.timeout on large files
> ---
>
> Key: SPARK-18649
> URL: https://issues.apache.org/jira/browse/SPARK-18649
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: PySpark version 1.6.2
>Reporter: Erik Cederstrand
>
> I'm trying to load a file into the driver with this code:
> contents = sc.textFile('hdfs://path/to/big_file.csv').collect()
> Loading into the driver instead of creating a distributed RDD is intentional 
> in this case. The file is ca. 6GB, and I have adjusted driver memory 
> accordingly to fit the local data. After some time, my spark-submitted job 
> crashes with the stack trace below.
> I have traced this to pyspark/rdd.py where the _load_from_socket() method 
> creates a socket with a hard-coded timeout of 3 seconds (this code is also 
> present in HEAD although I'm on PySpark 1.6.2). Raising this hard-coded value 
> to e.g. 600 lets me read the entire file.
> Is there any reason that this value does not use e.g. the 
> 'spark.network.timeout' setting instead?
> Traceback (most recent call last):
>   File "my_textfile_test.py", line 119, in 
> contents = sc.textFile('hdfs://path/to/file.csv').collect()
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 772, in collect
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 142, in _load_from_socket
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 517, in load_stream
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 511, in loads
>   File "/usr/lib/python2.7/socket.py", line 380, in read
> data = self._sock.recv(left)
> socket.timeout: timed out
> 16/11/30 13:33:14 WARN Utils: Suppressing exception in finally: Broken pipe
> java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$2.apply$mcV$sp(PythonRDD.scala:650)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1248)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649)
>   Suppressed: java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at 
> java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at 
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>   ... 3 more
> 16/11/30 13:33:14 ERROR PythonRDD: Error while sending iterator
> java.net.SocketException: Connection reset
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
>   at 

[jira] [Created] (SPARK-21109) union two dataset[A] don't work as expected if one of the datasets is originated from a dataframe

2017-06-15 Thread Jerry Lam (JIRA)
Jerry Lam created SPARK-21109:
-

 Summary: union two dataset[A] don't work as expected if one of the 
datasets is originated from a dataframe
 Key: SPARK-21109
 URL: https://issues.apache.org/jira/browse/SPARK-21109
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Jerry Lam


To reproduce the issue:
{code}
case class my_case(id0: Long, id1: Int, id2: Int, id3: String)
val data1 = Seq(my_case(0L, 0, 0, "0")).toDS
val data2 = Seq(("1", 1, 1, 1L)).toDF("id3", "id1", "id2", "id0").as[my_case]

data1.show
+---+---+---+---+
|id0|id1|id2|id3|
+---+---+---+---+
|  0|  0|  0|  0|
+---+---+---+---+

data2.show
+---+---+---+---+
|id3|id1|id2|id0|
+---+---+---+---+
|  1|  1|  1|  1|
+---+---+---+---+

data1.union(data2).show

org.apache.spark.sql.AnalysisException: Cannot up cast `id0` from string to 
bigint as it may truncate
The type path of the target object is:
- field (class: "scala.Long", name: "id0")
- root class: "my_case"
You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object;
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2123)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2153)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2140)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:336)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:334)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:245)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2140)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2136)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 

[jira] [Comment Edited] (SPARK-18649) sc.textFile(my_file).collect() raises socket.timeout on large files

2017-06-15 Thread Serge Vilvovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050940#comment-16050940
 ] 

Serge Vilvovsky edited comment on SPARK-18649 at 6/15/17 6:57 PM:
--

Does anybody work on the issue? Same problem (PythonRDD: Error while sending 
iterator
java.net.SocketException: Connection reset) occures for me on collect() of the 
large elasticsearch RDD. 


was (Author: sergevil):
Does anybody work on the issue? Same problem happens for me on collect() of the 
large elasticsearch RDD. 

> sc.textFile(my_file).collect() raises socket.timeout on large files
> ---
>
> Key: SPARK-18649
> URL: https://issues.apache.org/jira/browse/SPARK-18649
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: PySpark version 1.6.2
>Reporter: Erik Cederstrand
>
> I'm trying to load a file into the driver with this code:
> contents = sc.textFile('hdfs://path/to/big_file.csv').collect()
> Loading into the driver instead of creating a distributed RDD is intentional 
> in this case. The file is ca. 6GB, and I have adjusted driver memory 
> accordingly to fit the local data. After some time, my spark-submitted job 
> crashes with the stack trace below.
> I have traced this to pyspark/rdd.py where the _load_from_socket() method 
> creates a socket with a hard-coded timeout of 3 seconds (this code is also 
> present in HEAD although I'm on PySpark 1.6.2). Raising this hard-coded value 
> to e.g. 600 lets me read the entire file.
> Is there any reason that this value does not use e.g. the 
> 'spark.network.timeout' setting instead?
> Traceback (most recent call last):
>   File "my_textfile_test.py", line 119, in 
> contents = sc.textFile('hdfs://path/to/file.csv').collect()
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 772, in collect
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 142, in _load_from_socket
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 517, in load_stream
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 511, in loads
>   File "/usr/lib/python2.7/socket.py", line 380, in read
> data = self._sock.recv(left)
> socket.timeout: timed out
> 16/11/30 13:33:14 WARN Utils: Suppressing exception in finally: Broken pipe
> java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$2.apply$mcV$sp(PythonRDD.scala:650)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1248)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649)
>   Suppressed: java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at 
> java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at 
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>   ... 3 more
> 16/11/30 13:33:14 ERROR PythonRDD: Error while sending iterator
> java.net.SocketException: Connection reset
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
>   at 
> 

[jira] [Commented] (SPARK-18649) sc.textFile(my_file).collect() raises socket.timeout on large files

2017-06-15 Thread Serge Vilvovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050940#comment-16050940
 ] 

Serge Vilvovsky commented on SPARK-18649:
-

Is anybody working on this issue? The same problem happens for me on collect() of a 
large Elasticsearch RDD. 

> sc.textFile(my_file).collect() raises socket.timeout on large files
> ---
>
> Key: SPARK-18649
> URL: https://issues.apache.org/jira/browse/SPARK-18649
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: PySpark version 1.6.2
>Reporter: Erik Cederstrand
>
> I'm trying to load a file into the driver with this code:
> contents = sc.textFile('hdfs://path/to/big_file.csv').collect()
> Loading into the driver instead of creating a distributed RDD is intentional 
> in this case. The file is ca. 6GB, and I have adjusted driver memory 
> accordingly to fit the local data. After some time, my spark/submitted job 
> crashes with the stack trace below.
> I have traced this to pyspark/rdd.py where the _load_from_socket() method 
> creates a socket with a hard-coded timeout of 3 seconds (this code is also 
> present in HEAD although I'm on PySpark 1.6.2). Raising this hard-coded value 
> to e.g. 600 lets me read the entire file.
> Is there any reason that this value does not use e.g. the 
> 'spark.network.timeout' setting instead?
> Traceback (most recent call last):
>   File "my_textfile_test.py", line 119, in 
> contents = sc.textFile('hdfs://path/to/file.csv').collect()
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 772, in collect
>   File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 142, in _load_from_socket
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 517, in load_stream
>   File 
> "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 511, in loads
>   File "/usr/lib/python2.7/socket.py", line 380, in read
> data = self._sock.recv(left)
> socket.timeout: timed out
> 16/11/30 13:33:14 WARN Utils: Suppressing exception in finally: Broken pipe
> java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$2.apply$mcV$sp(PythonRDD.scala:650)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1248)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649)
>   Suppressed: java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
>   at 
> java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at 
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>   ... 3 more
> 16/11/30 13:33:14 ERROR PythonRDD: Error while sending iterator
> java.net.SocketException: Connection reset
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 

[jira] [Resolved] (SPARK-20434) Move Hadoop delegation token code from yarn to core

2017-06-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20434.

   Resolution: Fixed
 Assignee: Michael Gummelt
Fix Version/s: 2.3.0

> Move Hadoop delegation token code from yarn to core
> ---
>
> Key: SPARK-20434
> URL: https://issues.apache.org/jira/browse/SPARK-20434
> Project: Spark
>  Issue Type: Task
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 2.3.0
>
>
> This is to enable kerberos support for other schedulers, such as Mesos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21108) convert LinearSVC to aggregator framework

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21108:


Assignee: Apache Spark

> convert LinearSVC to aggregator framework
> -
>
> Key: SPARK-21108
> URL: https://issues.apache.org/jira/browse/SPARK-21108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21108) convert LinearSVC to aggregator framework

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050931#comment-16050931
 ] 

Apache Spark commented on SPARK-21108:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/18315

> convert LinearSVC to aggregator framework
> -
>
> Key: SPARK-21108
> URL: https://issues.apache.org/jira/browse/SPARK-21108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21108) convert LinearSVC to aggregator framework

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21108:


Assignee: (was: Apache Spark)

> convert LinearSVC to aggregator framework
> -
>
> Key: SPARK-21108
> URL: https://issues.apache.org/jira/browse/SPARK-21108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21108) convert LinearSVC to aggregator framework

2017-06-15 Thread yuhao yang (JIRA)
yuhao yang created SPARK-21108:
--

 Summary: convert LinearSVC to aggregator framework
 Key: SPARK-21108
 URL: https://issues.apache.org/jira/browse/SPARK-21108
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21081) Throw specific IllegalStateException subtype when asserting that SparkContext not stopped

2017-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050927#comment-16050927
 ] 

Sean Owen edited comment on SPARK-21081 at 6/15/17 6:39 PM:


[~fzhinkin] When isn't it possible? You can write {code}catch { case e: 
IllegalStateException if sc.isStopped => ... }{code} I'm not sure there's an 
argument for this change; it would be somewhat inconsistent with how the rest 
of the code deals with exceptions.


was (Author: srowen):
[~fzhinkin] when isn't it possible? you can {{ catch { case e: 
IllegalStateException if sc.isStopped => ... } }} I'm not sure there's an 
argument for this change, which would be sort of inconsistent with how the rest 
of the code deals with exceptions.

> Throw specific IllegalStateException subtype when asserting that SparkContext 
> not stopped
> -
>
> Key: SPARK-21081
> URL: https://issues.apache.org/jira/browse/SPARK-21081
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Filipp Zhinkin
>Priority: Minor
>
> org.apache.spark.SparkContext.assertNotStopped throws IllegalStateException 
> if the context was stopped.
> Unfortunately, it is not so easy to distinguish IAE caused by a stopped 
> context from some other  failed assertion and handle it properly (start a new 
> context, for example).
> I'm suggesting to add a specific IllegalStateException subclass 
> (SparkContextClosedException) and start throwing it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21081) Throw specific IllegalStateException subtype when asserting that SparkContext not stopped

2017-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050927#comment-16050927
 ] 

Sean Owen commented on SPARK-21081:
---

[~fzhinkin] When isn't it possible? You can write {{ catch { case e: 
IllegalStateException if sc.isStopped => ... } }} I'm not sure there's an 
argument for this change; it would be somewhat inconsistent with how the rest 
of the code deals with exceptions.
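
For illustration only, here is a minimal Scala sketch of the pattern described 
above; it is not an excerpt from Spark, and it relies on the {{sc.isStopped}} 
guard mentioned in the comment:

{code}
// Sketch: distinguish "the SparkContext was stopped" from any other
// IllegalStateException using a pattern-match guard on sc.isStopped.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("guard-example").setMaster("local[*]"))
sc.stop()

try {
  sc.parallelize(1 to 10).count()   // throws because the context is stopped
} catch {
  case e: IllegalStateException if sc.isStopped =>
    // The stopped-context case: handle it, e.g. by creating a new context.
    println(s"Context was stopped: ${e.getMessage}")
  case e: IllegalStateException =>
    throw e   // some other failed assertion: rethrow
}
{code}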

> Throw specific IllegalStateException subtype when asserting that SparkContext 
> not stopped
> -
>
> Key: SPARK-21081
> URL: https://issues.apache.org/jira/browse/SPARK-21081
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Filipp Zhinkin
>Priority: Minor
>
> org.apache.spark.SparkContext.assertNotStopped throws IllegalStateException 
> if the context was stopped.
> Unfortunately, it is not so easy to distinguish IAE caused by a stopped 
> context from some other  failed assertion and handle it properly (start a new 
> context, for example).
> I'm suggesting to add a specific IllegalStateException subclass 
> (SparkContextClosedException) and start throwing it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21107) Pyspark: ISO-8859-1 column names inconsistently converted to UTF-8

2017-06-15 Thread Tavis Barr (JIRA)
Tavis Barr created SPARK-21107:
--

 Summary: Pyspark: ISO-8859-1 column names inconsistently converted 
to UTF-8
 Key: SPARK-21107
 URL: https://issues.apache.org/jira/browse/SPARK-21107
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: Windows 7 standalone
Reporter: Tavis Barr
Priority: Minor


When I create a column name with ISO-8859-1 (or possibly, I suspect, other 
non-UTF-8) characters in it, they are sometimes converted to UTF-8, sometimes 
not.

Examples:
>>> df = sc.parallelize([[1,2],[1,4],[2,5],[2,6]]).toDF([u"L\xe0",u"Here"])
>>> df.show()
+---++
| Là|Here|
+---++
|  1|   2|
|  1|   4|
|  2|   5|
|  2|   6|
+---++

>>> df.columns
['L\xc3\xa0', 'Here']
>>> df.select(u'L\xc3\xa0').show()
Traceback (most recent call last):
  File "", line 1, in 
  File 
"F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\dataframe.py",
 line 992, in select
jdf = self._jdf.select(self._jcols(*cols))
  File 
"F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py",
 line 1133, in __call__
  File 
"F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\utils.py",
 line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`L\xc3\xa0`' given input 
columns: [L\xe0, Here];;\n'Project ['L\xc3\xa0]\n+- LogicalRDD [L\xe0#14L, 
Here#15L]\n"
>>> df.select(u'L\xe0').show()
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+
>>> df.select(u'L\xe0').collect()[0].asDict()
{'L\xc3\xa0': 1}

This does not seem to affect the Scala version:

scala> val df = 
sc.parallelize(Seq((1,2),(1,4),(2,5),(2,6))).toDF("L\u00e0","Here")
df: org.apache.spark.sql.DataFrame = [Lα: int, Here: int]

scala> df.select("L\u00e0").show()
[...output elided..]
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+

scala> df.columns(0).map(c => c.toInt )
res8: scala.collection.immutable.IndexedSeq[Int] = Vector(76, 224)

[Note that 224 is \u00e0, i.e., the original value]




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050896#comment-16050896
 ] 

Hyukjin Kwon edited comment on SPARK-21093 at 6/15/17 6:19 PM:
---

In the case of my Mac, it looks like the problem is here - 
https://github.com/apache/spark/blob/2881a2d1d1a650a91df2c6a01275eba14a43b42a/R/pkg/inst/worker/daemon.R#L45-L53

This code path looks particularly busy for {{dapply}} and {{gapply}}. I could 
reproduce the issue for both APIs too.

{code}
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(dapply(repartition(df, 200), function(x) { x }, schema(df)))
{code}

It looks like processes and sockets are being closed fine, but the pipes remain. This 
seems to lead to an error such as "Error in parallel:::mcfork() :". I checked 
this via {{watch -n 0.1 "lsof -c R | wc -l"}} on my Mac.

Running the code below multiple times (this should be varied for ulimit ...):

{code}
for(i in 0:200) {
  p <- parallel:::mcfork()
  if (inherits(p, "masterProcess")) {
tools::pskill(Sys.getpid(), tools::SIGUSR1)
parallel:::mcexit(0L)
  }
}
{code}

reproduced the problem. Please correct me if I am wrong. I suspect an issue in 
R. I will update a comment after testing this on CentOS soon.


was (Author: hyukjin.kwon):
In case of my Mac, it looks the problem is here - 
https://github.com/apache/spark/blob/2881a2d1d1a650a91df2c6a01275eba14a43b42a/R/pkg/inst/worker/daemon.R#L45-L53

This code path looks particularly busy for {{dapply}} and {{gapply}}. I could 
reproduce the issue for both APIs too.

{code}
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(dapply(repartition(df, 200), function(x) { x }, schema(df)))
{code}

It looks processes / sockets are being closed fine but the pipes remind. This 
looks leading to an error, such as, "Error in parallel:::mcfork() :" due to 
"Resource temporarily unavailable". I checked this via {{watch -n 0.1 "lsof -c 
R | wc -l"}} in my Mac.

Running the code below multiple times (this should be varied for ulimit ...):

{code}
for(i in 0:200) {
  p <- parallel:::mcfork()
  if (inherits(p, "masterProcess")) {
tools::pskill(Sys.getpid(), tools::SIGUSR1)
parallel:::mcexit(0L)
  }
}
{code}

reproduced the problem. Please correct me if I am wrong. I suspect an issue in 
R. I will update a comment after testing this on CentOS soon.

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On Centos 7.2.1511 with R 3.4.0/3.3.0, multiple execution of {{gapply}} looks 
> failed as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> 

[jira] [Comment Edited] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049944#comment-16049944
 ] 

Hyukjin Kwon edited comment on SPARK-21093 at 6/15/17 6:17 PM:
---

I am taking a look here gdb with bt:

In case of CentOS above,

{code}
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /usr/lib64/R/bin/exec/R...Reading symbols from 
/usr/lib64/R/bin/exec/R...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 25284]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/R/bin/exec/R --slave --no-restore --vanilla 
--file=/home/hyukjinkwon'.
Program terminated with signal 6, Aborted.
#0  0x7fbdffb545f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install R-core-3.4.0-2.el7.x86_64
(gdb) where
#0  0x7fbdffb545f7 in raise () from /lib64/libc.so.6
#1  0x7fbdffb55ce8 in abort () from /lib64/libc.so.6
#2  0x7fbdffb94327 in __libc_message () from /lib64/libc.so.6
#3  0x7fbdffc2d597 in __fortify_fail () from /lib64/libc.so.6
#4  0x7fbdffc2b750 in __chk_fail () from /lib64/libc.so.6
#5  0x7fbdffc2d507 in __fdelt_warn () from /lib64/libc.so.6
#6  0x7fbdefca5015 in R_SockConnect () from 
/usr/lib64/R/modules//internet.so
#7  0x7fbdefcad81e in sock_open () from /usr/lib64/R/modules//internet.so
#8  0x7fbe026381b6 in do_sockconn () from /usr/lib64/R/lib/libR.so
#9  0x7fbe0268b4d0 in bcEval () from /usr/lib64/R/lib/libR.so
#10 0x7fbe0269b138 in Rf_eval () from /usr/lib64/R/lib/libR.so
#11 0x7fbe0269d1af in R_execClosure () from /usr/lib64/R/lib/libR.so
#12 0x7fbe0269b2f4 in Rf_eval () from /usr/lib64/R/lib/libR.so
#13 0x7fbe0269ef8e in do_set () from /usr/lib64/R/lib/libR.so
#14 0x7fbe0269b529 in Rf_eval () from /usr/lib64/R/lib/libR.so
#15 0x7fbe026a04ce in do_eval () from /usr/lib64/R/lib/libR.so
#16 0x7fbe0268b4d0 in bcEval () from /usr/lib64/R/lib/libR.so
#17 0x7fbe0269b138 in Rf_eval () from /usr/lib64/R/lib/libR.so
#18 0x7fbe0269d1af in R_execClosure () from /usr/lib64/R/lib/libR.so
#19 0x7fbe02694101 in bcEval () from /usr/lib64/R/lib/libR.so
#20 0x7fbe0269b138 in Rf_eval () from /usr/lib64/R/lib/libR.so
#21 0x7fbe0269ba7e in forcePromise () from /usr/lib64/R/lib/libR.so
#22 0x7fbe0269b7b7 in Rf_eval () from /usr/lib64/R/lib/libR.so
#23 0x7fbe026a06d1 in do_withVisible () from /usr/lib64/R/lib/libR.so
#24 0x7fbe026d02e9 in do_internal () from /usr/lib64/R/lib/libR.so
---Type <return> to continue, or q <return> to quit---
{code}

Another thing I found is, this looks actually reproduced in my Mac too. If the 
command above is executed multiple times (for my Mac, it has to be executed 
(14~16-ish times) but the error message looks apparently different. However, my 
wild guess is the root cause is the same. 



was (Author: hyukjin.kwon):
I am taking a look here gdb with bt:

{code}
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /usr/lib64/R/bin/exec/R...Reading symbols from 
/usr/lib64/R/bin/exec/R...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 25284]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/R/bin/exec/R --slave --no-restore --vanilla 
--file=/home/hyukjinkwon'.
Program terminated with signal 6, Aborted.
#0  0x7fbdffb545f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install R-core-3.4.0-2.el7.x86_64
(gdb) where
#0  0x7fbdffb545f7 in raise () from /lib64/libc.so.6
#1  0x7fbdffb55ce8 in abort () from /lib64/libc.so.6
#2  0x7fbdffb94327 in __libc_message () from /lib64/libc.so.6
#3  0x7fbdffc2d597 in __fortify_fail () from /lib64/libc.so.6
#4  0x7fbdffc2b750 in __chk_fail () from /lib64/libc.so.6
#5  0x7fbdffc2d507 in __fdelt_warn () from 

[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050896#comment-16050896
 ] 

Hyukjin Kwon commented on SPARK-21093:
--

In the case of my Mac, it looks like the problem is here - 
https://github.com/apache/spark/blob/2881a2d1d1a650a91df2c6a01275eba14a43b42a/R/pkg/inst/worker/daemon.R#L45-L53

This code path looks particularly busy for {{dapply}} and {{gapply}}. I could 
reproduce the issue for both APIs too.

{code}
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(dapply(repartition(df, 200), function(x) { x }, schema(df)))
{code}

It looks like processes and sockets are being closed fine, but the pipes remain. This 
seems to lead to an error such as "Error in parallel:::mcfork() :" due to 
"Resource temporarily unavailable". I checked this via {{watch -n 0.1 "lsof -c 
R | wc -l"}} on my Mac.

Running the code below multiple times (this should be varied for ulimit ...):

{code}
for(i in 0:200) {
  p <- parallel:::mcfork()
  if (inherits(p, "masterProcess")) {
tools::pskill(Sys.getpid(), tools::SIGUSR1)
parallel:::mcexit(0L)
  }
}
{code}

reproduced the problem. Please correct me if I am wrong. I suspect an issue in 
R. I will update a comment after testing this on CentOS soon.

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On Centos 7.2.1511 with R 3.4.0/3.3.0, multiple execution of {{gapply}} looks 
> failed as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> 

[jira] [Commented] (SPARK-21104) Support sort with index when parse LibSVM Record

2017-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050850#comment-16050850
 ] 

Sean Owen commented on SPARK-21104:
---

I'm not sure what this is about. If the input isn't sorted, it doesn't need to 
be sorted. 
It should probably be checked though.
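
For reference, a minimal sketch (not the actual MLlib parser) of what sorting at 
parse time could look like, turning each {{index:value}} token into a tuple and 
sorting by index:

{code}
// Sketch only: parse one LibSVM line into (label, indices, values),
// sorting the features by index instead of requiring pre-sorted input.
def parseLibSVMLine(line: String): (Double, Array[Int], Array[Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.map { item =>
    val Array(index, value) = item.split(':')
    (index.toInt - 1, value.toDouble)   // LibSVM indices are one-based
  }.sortBy(_._1)
  (label, features.map(_._1), features.map(_._2))
}
{code}

Whether to sort silently like this or to check and reject unsorted input is 
exactly the question being discussed here.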

> Support sort with index when parse LibSVM Record
> 
>
> Key: SPARK-21104
> URL: https://issues.apache.org/jira/browse/SPARK-21104
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When I'm loading LibSVM from HDFS, I found the feature indices should be in 
> ascending order. 
> We could sort by *indices* when we parse the input line into (index, 
> value) tuples and avoid checking whether indices are in ascending order after that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13210) NPE in Sort

2017-06-15 Thread Michael Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050807#comment-16050807
 ] 

Michael Smith commented on SPARK-13210:
---

I also saw this in a run on Spark 2.1.1. Same stack trace as in David Whorter's 
comment from May 10. It's not consistent though - I got an OutOfMemoryError in 
another attempt:

{code}
java.lang.OutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.(UnsafeInMemorySorter.java:126)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:154)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:121)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextPartition(WindowExec.scala:340)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:391)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:290)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.(WindowExec.scala:310)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:290)
{code}

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> 

[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-06-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050776#comment-16050776
 ] 

Xiao Li commented on SPARK-1:
-

This function is still missing in the SQL interface.

We can achieve the resolution by names by using the CORRESPONDING BY clause. 
For example,

{noformat}
(select * from t1) union corresponding by (c1, c2) (select * from t2);
{noformat}
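
On the DataFrame side, where no CORRESPONDING BY exists, a rough sketch of the 
same by-name resolution (assuming both DataFrames simply share the same column 
names, and using the 2.x {{union}}, formerly {{unionAll}}) is to reorder one 
side's columns before the positional union:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: align t2's columns to t1's order by name, then union positionally.
def unionByColumnName(t1: DataFrame, t2: DataFrame): DataFrame =
  t1.union(t2.select(t1.columns.map(col): _*))
{code}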


> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050774#comment-16050774
 ] 

Sean Owen commented on SPARK-21101:
---

[~dyzhou] I see your reply. The thrift server should just be another job that 
is spark-submitted, so I think you can in fact use {{--jars}} to add JARs to 
its classpath. That is what the guidance is getting at here. It is therefore a 
bit more of a question than a JIRA issue; I see why you're not convinced of 
that, but the way forward is to give the idea in that link a try next. 

Generally: I would only treat comments from committers or regular contributors 
as authoritative.

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-15 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050762#comment-16050762
 ] 

Dayou Zhou commented on SPARK-21101:


Hi [~zhangzr1026], I'm still waiting for someone (anyone) to explain to me why 
this is not a bug, but whatever.  If this is how you treat people who love 
Spark, use Spark, and are trying to help make it better, then fine.

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-15 Thread Brad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad updated SPARK-21097:
-
Attachment: Preserving Cached Data with Dynamic Allocation.pdf

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases. 
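
As a rough illustration only: the sketch below shows how the proposed flag would 
sit next to the existing dynamic-allocation settings. Note that 
{{spark.dynamicAllocation.recoverCachedData}} is only proposed in this ticket and 
does not exist in released Spark; the other settings are standard.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")             // required by dynamic allocation
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.dynamicAllocation.recoverCachedData", "true") // proposed here, not yet available
{code}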



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-15 Thread Brad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad updated SPARK-21097:
-
Attachment: (was: Preserving Cached Data with Dynamic Allocation.docx)

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-15 Thread Brad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad updated SPARK-21097:
-
Attachment: (was: Preserving Cached Data with Dynamic Allocation.docx)

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-15 Thread Brad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad updated SPARK-21097:
-
Attachment: Preserving Cached Data with Dynamic Allocation.docx

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-06-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050730#comment-16050730
 ] 

Reynold Xin commented on SPARK-1:
-

What's left in this ticket? Didn't we fix it already? If it is about by-name 
resolution, that belongs to a separate ticket. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-15 Thread Brad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad updated SPARK-21097:
-
Attachment: Preserving Cached Data with Dynamic Allocation.docx

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
> Attachments: Preserving Cached Data with Dynamic Allocation.docx
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16251) LocalCheckpointSuite's - missing checkpoint block fails with informative message is flaky.

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-16251:
---

Assignee: Jiang Xingbo

> LocalCheckpointSuite's - missing checkpoint block fails with informative 
> message is flaky.
> --
>
> Key: SPARK-16251
> URL: https://issues.apache.org/jira/browse/SPARK-16251
> Project: Spark
>  Issue Type: Bug
>Reporter: Prashant Sharma
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20200:
---

Assignee: Jiang Xingbo

> Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
> -
>
> Key: SPARK-20200
> URL: https://issues.apache.org/jira/browse/SPARK-20200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Assignee: Jiang Xingbo
>Priority: Minor
>  Labels: flaky-test
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite_name=missing+checkpoint+block+fails+with+informative+message
> Error Message
> {code}
> Collect should have failed if local checkpoint block is removed...
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: Collect should have failed if 
> local checkpoint block is removed...
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply$mcV$sp(LocalCheckpointSuite.scala:173)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)

[jira] [Resolved] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20200.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0
   2.0.3

Issue resolved by pull request 18314
[https://github.com/apache/spark/pull/18314]

> Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
> -
>
> Key: SPARK-20200
> URL: https://issues.apache.org/jira/browse/SPARK-20200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
> Fix For: 2.0.3, 2.2.0, 2.1.2
>
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite_name=missing+checkpoint+block+fails+with+informative+message
> Error Message
> {code}
> Collect should have failed if local checkpoint block is removed...
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: Collect should have failed if 
> local checkpoint block is removed...
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply$mcV$sp(LocalCheckpointSuite.scala:173)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 

[jira] [Resolved] (SPARK-16251) LocalCheckpointSuite's - missing checkpoint block fails with informative message is flaky.

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16251.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0
   2.0.3

Issue resolved by pull request 18314
[https://github.com/apache/spark/pull/18314]

> LocalCheckpointSuite's - missing checkpoint block fails with informative 
> message is flaky.
> --
>
> Key: SPARK-16251
> URL: https://issues.apache.org/jira/browse/SPARK-16251
> Project: Spark
>  Issue Type: Bug
>Reporter: Prashant Sharma
>Priority: Minor
> Fix For: 2.0.3, 2.2.0, 2.1.2
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-15 Thread Irina Truong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Truong reopened SPARK-21096:
--

The 2 methods I described should be equivalent, but they are not.

> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {code}
> def build_fail(self):
>     processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
>     return processed.collect()
> {code}
> While this method will run just fine:
> {code}
> def build_ok(self):
>     mult = self.multiplier
>     processed = self.rdd.map(lambda row: process_row(row, mult))
>     return processed.collect()
> {code}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-15 Thread Irina Truong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050609#comment-16050609
 ] 

Irina Truong commented on SPARK-21096:
--

I am not passing in {{self}}. I am passing in {{self.multiplier}}, an integer 
value.

If this Spark behavior is correct, why does the second method not break?

{code}
def build_ok(self):
    mult = self.multiplier
    processed = self.rdd.map(lambda row: process_row(row, mult))
    return processed.collect()
{code}

> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {code}
> def build_fail(self):
>     processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
>     return processed.collect()
> {code}
> While this method will run just fine:
> {code}
> def build_ok(self):
>     mult = self.multiplier
>     processed = self.rdd.map(lambda row: process_row(row, mult))
>     return processed.collect()
> {code}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21106) compile error

2017-06-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050445#comment-16050445
 ] 

Hyukjin Kwon commented on SPARK-21106:
--

It should build fine on Windows per AppVeyor. This sounds like a question that 
should go to the mailing list.

> compile error
> -
>
> Key: SPARK-21106
> URL: https://issues.apache.org/jira/browse/SPARK-21106
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
> Environment: win10 
>Reporter: yinbinfeng0451
>
> [INFO] 
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory 
> D:\bigdata\spark\common\network-shuffle\src\main\resources
> [INFO] Copying 3 resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/javax/activation/activation/1.1/activation-1.1.jar
>  (62 KB at 101.0 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:45:
>  not found: type SparkFlumeProtocol
> [ERROR]   val transactionTimeout: Int, val backOffInterval: Int) extends 
> SparkFlumeProtocol with Logging {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:70:
>  not found: type EventBatch
> [ERROR]   override def getEventBatch(n: Int): EventBatch = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:80:
>  not found: type EventBatch
> [ERROR]   def getEventBatch: EventBatch = {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSinkUtils.scala:25:
>  not found: type EventBatch
> [ERROR]   def isErrorBatch(batch: EventBatch): Boolean = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:85:
>  not found: type EventBatch
> [ERROR] new EventBatch("Spark sink has been stopped!", "", 
> java.util.Collections.emptyList())
> [ERROR] ^
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
>  (18 KB at 30.1 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelPipelineFactory not found - 
> continuing with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.group.ChannelGroup not found - 
> continuing with a stub.
> [ERROR] error while loading Protocol, invalid LOC header (bad signature)
> [ERROR] error while loading SpecificData, invalid LOC header (bad signature)
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSink.scala:86:
>  not found: type SparkFlumeProtocol
> [ERROR] val responder = new 
> SpecificResponder(classOf[SparkFlumeProtocol], handler.get)
> [ERROR]   ^
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> [INFO] Using incremental compilation
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:48:
>  not found: type EventBatch
> [ERROR]   @volatile private var 

[jira] [Commented] (SPARK-21056) InMemoryFileIndex.listLeafFiles should create at most one spark job when listing files in parallel

2017-06-15 Thread Bertrand Bossy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050426#comment-16050426
 ] 

Bertrand Bossy commented on SPARK-21056:


cc [~michael]

> InMemoryFileIndex.listLeafFiles should create at most one spark job when 
> listing files in parallel
> --
>
> Key: SPARK-21056
> URL: https://issues.apache.org/jira/browse/SPARK-21056
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Bertrand Bossy
>
> Given a partitioned file relation (e.g. Parquet):
> {code}
> root/a=../b=../c=..
> {code}
> InMemoryFileIndex.listLeafFiles runs numberOfPartitions(a) times 
> numberOfPartitions(b) Spark jobs sequentially to list leaf files, if both 
> numberOfPartitions(a) and numberOfPartitions(b) are below 
> {{spark.sql.sources.parallelPartitionDiscovery.threshold}} and 
> numberOfPartitions(c) is above 
> {{spark.sql.sources.parallelPartitionDiscovery.threshold}}.
> Since the jobs are run sequentially, the overhead of the jobs dominates and 
> the file listing operation can become significantly slower than listing the 
> files from the driver.
> I propose that InMemoryFileIndex.listLeafFiles should launch at most one 
> Spark job for listing leaf files.
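> As a hedged illustration (not the proposed fix), raising the threshold above 
> numberOfPartitions(c) makes the listing happen on the driver instead of via 
> many sequential Spark jobs; the value below is an assumption for this example:
> {code}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder()
>   .master("local[*]")  // local master only so the sketch runs standalone
>   .appName("listing-example")
>   .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "128")
>   .getOrCreate()
> 
> // With the higher threshold, root/a=../b=../c=.. is listed from the driver
> // rather than through numberOfPartitions(a) * numberOfPartitions(b) jobs.
> val df = spark.read.parquet("root")
> {code}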



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21098) Set lineseparator csv multiline and csv write to \n

2017-06-15 Thread Daniel van der Ende (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel van der Ende updated SPARK-21098:

Description: The Univocity-parser library uses the system line ending 
character as the default line ending character. Rather than remain dependent on 
the setting in this lib, we could set the default to \n.  We cannot make this 
configurable for reading as it depends on LineReader from Hadoop, which has a 
hardcoded \n as line ending.  (was: sets the lineseparator for reading a 
multiline csv file or writing a csv file to '\n'. We cannot make this 
configurable for reading as it depends on LineReader from Hadoop, which has a 
hardcoded \n as line ending.)

> Set lineseparator csv multiline and csv write to \n
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> The Univocity-parser library uses the system line ending character as the 
> default line ending character. Rather than remain dependent on the setting in 
> this lib, we could set the default to \n.  We cannot make this configurable 
> for reading as it depends on LineReader from Hadoop, which has a hardcoded \n 
> as line ending.
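> As a hedged sketch of the affected read and write paths (as in spark-shell, 
> where {{spark}} is predefined; the paths are illustrative):
> {code}
> // Multiline CSV reads go through the univocity parser, whose default record
> // separator follows the operating system; writes are where a "\n" default would apply.
> val df = spark.read
>   .option("header", "true")
>   .option("multiLine", "true")
>   .csv("/data/input.csv")
> 
> df.write.option("header", "true").csv("/data/output")
> {code}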



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21098) Set lineseparator csv multiline and csv write to \n

2017-06-15 Thread Daniel van der Ende (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel van der Ende updated SPARK-21098:

Description: sets the lineseparator for reading a multiline csv file or 
writing a csv file to '\n'. We cannot make this configurable for reading as it 
depends on LineReader from Hadoop, which has a hardcoded \n as line ending.  
(was: In order to allow users to work with csv files with non-unix line 
endings, it would be nice to allow users to pass their line separator (aka 
newline character) as an option to the spark.read.csv command.)

> Set lineseparator csv multiline and csv write to \n
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> sets the lineseparator for reading a multiline csv file or writing a csv file 
> to '\n'. We cannot make this configurable for reading as it depends on 
> LineReader from Hadoop, which has a hardcoded \n as line ending.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21098) Set lineseparator csv multiline and csv write to \n

2017-06-15 Thread Daniel van der Ende (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel van der Ende updated SPARK-21098:

Summary: Set lineseparator csv multiline and csv write to \n  (was: Add 
line separator option to csv read/write)

> Set lineseparator csv multiline and csv write to \n
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> In order to allow users to work with csv files with non-unix line endings, it 
> would be nice to allow users to pass their line separator (aka newline 
> character) as an option to the spark.read.csv command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2017-06-15 Thread Abdelrahman Elsaidy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050359#comment-16050359
 ] 

Abdelrahman Elsaidy commented on SPARK-14437:
-

[~hogeland] Was an issue created for the spark 2.0.0 error? I am getting the 
same error when I run a spark job with multiple executors on a single worker 
node using spark standalone cluster. However, when I run the job with a single 
executor on a single worker it completes successfully.

When using two executors, one executor completes while the other gets this 
error:
{code:java}
17/06/15 07:06:33 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
java.lang.RuntimeException: Stream '/files/script.py' was not found.
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:222)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
{code}


> Spark using Netty RPC gets wrong address in some setups
> ---
>
> Key: SPARK-14437
> URL: https://issues.apache.org/jira/browse/SPARK-14437
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: AWS, Docker, Flannel
>Reporter: Kevin Hogeland
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Netty can't get the correct origin address in certain network setups. Spark 
> should handle this, as relying on Netty correctly reporting all addresses 
> leads to incompatible and unpredictable network states. We're currently using 
> Docker with Flannel on AWS. Container communication looks something like: 
> {{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
> -> Container 2 (4.5.6.1)}}
> If the client in that setup is Container 1 (1.2.3.4), Netty channels from 
> there to Container 2 will have a client address of 1.2.3.0.
> The {{RequestMessage}} 

[jira] [Commented] (SPARK-21104) Support sort with index when parse LibSVM Record

2017-06-15 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050267#comment-16050267
 ] 

darion yaphet commented on SPARK-21104:
---

Requiring features to already be sorted by index is not necessary, so I added a 
sort step in the record parser.

> Support sort with index when parse LibSVM Record
> 
>
> Key: SPARK-21104
> URL: https://issues.apache.org/jira/browse/SPARK-21104
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When I'm loading LibSVM from HDFS , I found feature index should be in 
> ascending order . 
> We can sorted with *indices* when we parse the input line input a (index, 
> value) tuple and avoid check if indices are in ascending order after that.
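> A rough sketch of the parse-and-sort idea in plain Scala (not the actual 
> MLlib parser; the sample record is made up):
> {code}
> // Parse a LibSVM record like "1 3:0.5 1:0.2 7:1.0" into a label and (index, value)
> // pairs, sorting by index during parsing so no separate ordering check is needed.
> val line = "1 3:0.5 1:0.2 7:1.0"
> val tokens = line.trim.split(' ')
> val label = tokens.head.toDouble
> val (indices, values) = tokens.tail.map { item =>
>   val Array(i, v) = item.split(':')
>   (i.toInt - 1, v.toDouble)  // LibSVM indices are 1-based
> }.sortBy(_._1).unzip
> {code}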



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21081) Throw specific IllegalStateException subtype when asserting that SparkContext not stopped

2017-06-15 Thread Filipp Zhinkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050256#comment-16050256
 ] 

Filipp Zhinkin commented on SPARK-21081:


It's not always an option to call isStopped.

> Throw specific IllegalStateException subtype when asserting that SparkContext 
> not stopped
> -
>
> Key: SPARK-21081
> URL: https://issues.apache.org/jira/browse/SPARK-21081
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Filipp Zhinkin
>Priority: Minor
>
> org.apache.spark.SparkContext.assertNotStopped throws IllegalStateException 
> if the context was stopped.
> Unfortunately, it is not easy to distinguish an IAE caused by a stopped 
> context from some other failed assertion and handle it properly (start a new 
> context, for example).
> I suggest adding a specific IllegalStateException subclass 
> (SparkContextClosedException) and throwing it there instead.
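> A minimal sketch of what the proposed subtype and its use might look like 
> (names here are taken from this suggestion, not from any existing Spark API):
> {code}
> // Hypothetical subtype so callers can tell "context stopped" apart from other
> // assertion failures that also surface as IllegalStateException.
> class SparkContextClosedException(message: String) extends IllegalStateException(message)
> 
> def assertNotStopped(stopped: Boolean): Unit =
>   if (stopped) throw new SparkContextClosedException("Cannot call methods on a stopped SparkContext")
> 
> try {
>   assertNotStopped(stopped = true)
> } catch {
>   case _: SparkContextClosedException =>
>     println("Context was stopped; the caller could create a new one instead of failing")
> }
> {code}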



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21106) compile error

2017-06-15 Thread yinbinfeng0451 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050252#comment-16050252
 ] 

yinbinfeng0451 commented on SPARK-21106:


All the source is unmodified Spark source; I did not change anything.
I built with Maven and it reports this error. Please help!

> compile error
> -
>
> Key: SPARK-21106
> URL: https://issues.apache.org/jira/browse/SPARK-21106
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
> Environment: win10 
>Reporter: yinbinfeng0451
>
> [INFO] 
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory 
> D:\bigdata\spark\common\network-shuffle\src\main\resources
> [INFO] Copying 3 resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/javax/activation/activation/1.1/activation-1.1.jar
>  (62 KB at 101.0 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:45:
>  not found: type SparkFlumeProtocol
> [ERROR]   val transactionTimeout: Int, val backOffInterval: Int) extends 
> SparkFlumeProtocol with Logging {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:70:
>  not found: type EventBatch
> [ERROR]   override def getEventBatch(n: Int): EventBatch = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:80:
>  not found: type EventBatch
> [ERROR]   def getEventBatch: EventBatch = {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSinkUtils.scala:25:
>  not found: type EventBatch
> [ERROR]   def isErrorBatch(batch: EventBatch): Boolean = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:85:
>  not found: type EventBatch
> [ERROR] new EventBatch("Spark sink has been stopped!", "", 
> java.util.Collections.emptyList())
> [ERROR] ^
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
>  (18 KB at 30.1 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelPipelineFactory not found - 
> continuing with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.group.ChannelGroup not found - 
> continuing with a stub.
> [ERROR] error while loading Protocol, invalid LOC header (bad signature)
> [ERROR] error while loading SpecificData, invalid LOC header (bad signature)
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSink.scala:86:
>  not found: type SparkFlumeProtocol
> [ERROR] val responder = new 
> SpecificResponder(classOf[SparkFlumeProtocol], handler.get)
> [ERROR]   ^
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> [INFO] Using incremental compilation
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:48:
>  not found: type EventBatch
> [ERROR]   @volatile private var 

[jira] [Commented] (SPARK-21106) compile error

2017-06-15 Thread yinbinfeng0451 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050247#comment-16050247
 ] 

yinbinfeng0451 commented on SPARK-21106:


I am compiling the Spark 2.1.1 source.

> compile error
> -
>
> Key: SPARK-21106
> URL: https://issues.apache.org/jira/browse/SPARK-21106
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
> Environment: win10 
>Reporter: yinbinfeng0451
>
> [INFO] 
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory 
> D:\bigdata\spark\common\network-shuffle\src\main\resources
> [INFO] Copying 3 resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/javax/activation/activation/1.1/activation-1.1.jar
>  (62 KB at 101.0 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:45:
>  not found: type SparkFlumeProtocol
> [ERROR]   val transactionTimeout: Int, val backOffInterval: Int) extends 
> SparkFlumeProtocol with Logging {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:70:
>  not found: type EventBatch
> [ERROR]   override def getEventBatch(n: Int): EventBatch = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:80:
>  not found: type EventBatch
> [ERROR]   def getEventBatch: EventBatch = {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSinkUtils.scala:25:
>  not found: type EventBatch
> [ERROR]   def isErrorBatch(batch: EventBatch): Boolean = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:85:
>  not found: type EventBatch
> [ERROR] new EventBatch("Spark sink has been stopped!", "", 
> java.util.Collections.emptyList())
> [ERROR] ^
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
>  (18 KB at 30.1 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelPipelineFactory not found - 
> continuing with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.group.ChannelGroup not found - 
> continuing with a stub.
> [ERROR] error while loading Protocol, invalid LOC header (bad signature)
> [ERROR] error while loading SpecificData, invalid LOC header (bad signature)
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSink.scala:86:
>  not found: type SparkFlumeProtocol
> [ERROR] val responder = new 
> SpecificResponder(classOf[SparkFlumeProtocol], handler.get)
> [ERROR]   ^
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> [INFO] Using incremental compilation
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:48:
>  not found: type EventBatch
> [ERROR]   @volatile private var eventBatch: EventBatch = new 
> EventBatch("Unknown Error", "",
> 

[jira] [Commented] (SPARK-21106) compile error

2017-06-15 Thread yinbinfeng0451 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050245#comment-16050245
 ] 

yinbinfeng0451 commented on SPARK-21106:


I am using Eclipse with Maven, building with clean package.

> compile error
> -
>
> Key: SPARK-21106
> URL: https://issues.apache.org/jira/browse/SPARK-21106
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
> Environment: win10 
>Reporter: yinbinfeng0451
>
> [INFO] 
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory 
> D:\bigdata\spark\common\network-shuffle\src\main\resources
> [INFO] Copying 3 resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/javax/activation/activation/1.1/activation-1.1.jar
>  (62 KB at 101.0 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:45:
>  not found: type SparkFlumeProtocol
> [ERROR]   val transactionTimeout: Int, val backOffInterval: Int) extends 
> SparkFlumeProtocol with Logging {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:70:
>  not found: type EventBatch
> [ERROR]   override def getEventBatch(n: Int): EventBatch = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:80:
>  not found: type EventBatch
> [ERROR]   def getEventBatch: EventBatch = {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSinkUtils.scala:25:
>  not found: type EventBatch
> [ERROR]   def isErrorBatch(batch: EventBatch): Boolean = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:85:
>  not found: type EventBatch
> [ERROR] new EventBatch("Spark sink has been stopped!", "", 
> java.util.Collections.emptyList())
> [ERROR] ^
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
>  (18 KB at 30.1 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelPipelineFactory not found - 
> continuing with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.group.ChannelGroup not found - 
> continuing with a stub.
> [ERROR] error while loading Protocol, invalid LOC header (bad signature)
> [ERROR] error while loading SpecificData, invalid LOC header (bad signature)
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSink.scala:86:
>  not found: type SparkFlumeProtocol
> [ERROR] val responder = new 
> SpecificResponder(classOf[SparkFlumeProtocol], handler.get)
> [ERROR]   ^
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> [INFO] Using incremental compilation
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:48:
>  not found: type EventBatch
> [ERROR]   @volatile private var eventBatch: EventBatch = new 
> EventBatch("Unknown Error", 

[jira] [Resolved] (SPARK-21106) compile error

2017-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21106.
---
  Resolution: Cannot Reproduce
   Fix Version/s: (was: 2.1.1)
Target Version/s:   (was: 2.1.1)

This does not fail for me or on Jenkins. You didn't say how you are building. 

> compile error
> -
>
> Key: SPARK-21106
> URL: https://issues.apache.org/jira/browse/SPARK-21106
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
> Environment: win10 
>Reporter: yinbinfeng0451
>
> [INFO] 
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory 
> D:\bigdata\spark\common\network-shuffle\src\main\resources
> [INFO] Copying 3 resources
> [INFO] Copying 3 resources
> [INFO] 
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
> spark-network-shuffle_2.11 ---
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/javax/activation/activation/1.1/activation-1.1.jar
>  (62 KB at 101.0 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:45:
>  not found: type SparkFlumeProtocol
> [ERROR]   val transactionTimeout: Int, val backOffInterval: Int) extends 
> SparkFlumeProtocol with Logging {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:70:
>  not found: type EventBatch
> [ERROR]   override def getEventBatch(n: Int): EventBatch = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:80:
>  not found: type EventBatch
> [ERROR]   def getEventBatch: EventBatch = {
> [ERROR]  ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSinkUtils.scala:25:
>  not found: type EventBatch
> [ERROR]   def isErrorBatch(batch: EventBatch): Boolean = {
> [ERROR]   ^
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:85:
>  not found: type EventBatch
> [ERROR] new EventBatch("Spark sink has been stopped!", "", 
> java.util.Collections.emptyList())
> [ERROR] ^
> [INFO] Downloaded: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
>  (18 KB at 30.1 KB/sec)
> [INFO] Downloading: 
> https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelPipelineFactory not found - 
> continuing with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
> with a stub.
> [WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found 
> - continuing with a stub.
> [WARNING] Class org.jboss.netty.channel.group.ChannelGroup not found - 
> continuing with a stub.
> [ERROR] error while loading Protocol, invalid LOC header (bad signature)
> [ERROR] error while loading SpecificData, invalid LOC header (bad signature)
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSink.scala:86:
>  not found: type SparkFlumeProtocol
> [ERROR] val responder = new 
> SpecificResponder(classOf[SparkFlumeProtocol], handler.get)
> [ERROR]   ^
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
> with a stub.
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> [INFO] Using incremental compilation
> [ERROR] 
> D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:48:
>  not found: type EventBatch
> [ERROR] 

[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-15 Thread Peter Bykov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050196#comment-16050196
 ] 

Peter Bykov commented on SPARK-21063:
-

Do you have any ideas about why this does not work?

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver from version 1.1.1.
> Code snippet is:
> {code:java}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20200:


Assignee: (was: Apache Spark)

> Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
> -
>
> Key: SPARK-20200
> URL: https://issues.apache.org/jira/browse/SPARK-20200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite_name=missing+checkpoint+block+fails+with+informative+message
> Error Message
> {code}
> Collect should have failed if local checkpoint block is removed...
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: Collect should have failed if 
> local checkpoint block is removed...
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply$mcV$sp(LocalCheckpointSuite.scala:173)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at 
> 

[jira] [Assigned] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite

2017-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20200:


Assignee: Apache Spark

> Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
> -
>
> Key: SPARK-20200
> URL: https://issues.apache.org/jira/browse/SPARK-20200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite_name=missing+checkpoint+block+fails+with+informative+message
> Error Message
> {code}
> Collect should have failed if local checkpoint block is removed...
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: Collect should have failed if 
> local checkpoint block is removed...
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply$mcV$sp(LocalCheckpointSuite.scala:173)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at 
> 

[jira] [Commented] (SPARK-16251) LocalCheckpointSuite's - missing checkpoint block fails with informative message is flaky.

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050189#comment-16050189
 ] 

Apache Spark commented on SPARK-16251:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/18314

> LocalCheckpointSuite's - missing checkpoint block fails with informative 
> message is flaky.
> --
>
> Key: SPARK-16251
> URL: https://issues.apache.org/jira/browse/SPARK-16251
> Project: Spark
>  Issue Type: Bug
>Reporter: Prashant Sharma
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite

2017-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050190#comment-16050190
 ] 

Apache Spark commented on SPARK-20200:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/18314

> Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
> -
>
> Key: SPARK-20200
> URL: https://issues.apache.org/jira/browse/SPARK-20200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite_name=missing+checkpoint+block+fails+with+informative+message
> Error Message
> {code}
> Collect should have failed if local checkpoint block is removed...
> {code}

[jira] [Created] (SPARK-21106) compile error

2017-06-15 Thread yinbinfeng0451 (JIRA)
yinbinfeng0451 created SPARK-21106:
--

 Summary: compile error
 Key: SPARK-21106
 URL: https://issues.apache.org/jira/browse/SPARK-21106
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.1.1
 Environment: win10 
Reporter: yinbinfeng0451
 Fix For: 2.1.1


[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
spark-network-shuffle_2.11 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 
D:\bigdata\spark\common\network-shuffle\src\main\resources
[INFO] Copying 3 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
spark-network-shuffle_2.11 ---
[INFO] Downloaded: 
https://repo1.maven.org/maven2/javax/activation/activation/1.1/activation-1.1.jar
 (62 KB at 101.0 KB/sec)
[INFO] Downloading: 
https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
[ERROR] 
D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:45:
 not found: type SparkFlumeProtocol
[ERROR]   val transactionTimeout: Int, val backOffInterval: Int) extends 
SparkFlumeProtocol with Logging {
[ERROR]  ^
[ERROR] 
D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:70:
 not found: type EventBatch
[ERROR]   override def getEventBatch(n: Int): EventBatch = {
[ERROR]   ^
[ERROR] 
D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:80:
 not found: type EventBatch
[ERROR]   def getEventBatch: EventBatch = {
[ERROR]  ^
[ERROR] 
D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSinkUtils.scala:25:
 not found: type EventBatch
[ERROR]   def isErrorBatch(batch: EventBatch): Boolean = {
[ERROR]   ^
[ERROR] 
D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:85:
 not found: type EventBatch
[ERROR] new EventBatch("Spark sink has been stopped!", "", 
java.util.Collections.emptyList())
[ERROR] ^
[INFO] Downloaded: 
https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
 (18 KB at 30.1 KB/sec)
[INFO] Downloading: 
https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar
[WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
with a stub.
[WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
with a stub.
[WARNING] Class org.jboss.netty.channel.ChannelPipelineFactory not found - 
continuing with a stub.
[WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found - 
continuing with a stub.
[WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
with a stub.
[WARNING] Class org.jboss.netty.handler.execution.ExecutionHandler not found - 
continuing with a stub.
[WARNING] Class org.jboss.netty.channel.group.ChannelGroup not found - 
continuing with a stub.
[ERROR] error while loading Protocol, invalid LOC header (bad signature)
[ERROR] error while loading SpecificData, invalid LOC header (bad signature)
[ERROR] 
D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkSink.scala:86:
 not found: type SparkFlumeProtocol
[ERROR] val responder = new SpecificResponder(classOf[SparkFlumeProtocol], 
handler.get)
[ERROR]   ^
[WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
with a stub.
[WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
with a stub.
[WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
with a stub.
[WARNING] Class com.google.common.collect.ImmutableMap not found - continuing 
with a stub.
[WARNING] Zinc server is not available at port 3030 - reverting to normal 
incremental compile
[INFO] Using incremental compilation
[ERROR] 
D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:48:
 not found: type EventBatch
[ERROR]   @volatile private var eventBatch: EventBatch = new 
EventBatch("Unknown Error", "",
[ERROR] ^
[ERROR] 
D:\bigdata\spark\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\TransactionProcessor.scala:48:
 not found: type EventBatch
[ERROR]   @volatile private var eventBatch: EventBatch = new 
EventBatch("Unknown Error", "",
[ERROR]  

[jira] [Updated] (SPARK-21105) Useless empty files in hive table

2017-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21105:
--
Priority: Minor  (was: Major)

I'm not sure, but isn't that required? I'm not sure you can avoid outputting those files.

> Useless empty files in hive table
> -
>
> Key: SPARK-21105
> URL: https://issues.apache.org/jira/browse/SPARK-21105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: pin_zhang
>Priority: Minor
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.{SaveMode, SparkSession}
>
> case class Base(v: Option[Double])
>
> object EmptyFiles {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf().setAppName("scala").setMaster("local[12]")
>     val ctx = new SparkContext(conf)
>     val spark =
>       SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
>     val seq = Seq(Base(Some(1D)), Base(Some(1D)))
>     val rdd = ctx.makeRDD[Base](seq)
>     import spark.implicits._
>     rdd.toDS().write.format("json").mode(SaveMode.Append).saveAsTable("EmptyFiles")
>   }
> }
> // The Dataset writer creates a useless empty file for every empty partition.
> // Inserting a small RDD into the table many times results in too many empty
> // files, which slows down queries.
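A minimal workaround sketch, not part of the original report: assuming it is acceptable to collapse the small Dataset into a single partition before writing, coalescing avoids emitting one empty output file per empty partition. The names reuse the reproduction above.

{code}
// Sketch only: coalesce the tiny Dataset so the writer sees one non-empty
// partition instead of 12 mostly-empty local partitions. `rdd`, `SaveMode`
// and `spark.implicits._` come from the reproduction above.
import spark.implicits._

rdd.toDS()
  .coalesce(1)
  .write.format("json")
  .mode(SaveMode.Append)
  .saveAsTable("EmptyFiles")
{code}

Whether this is acceptable depends on the size of each appended batch; for genuinely large inserts, coalescing to one partition would also limit write parallelism.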



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21104) Support sort with index when parse LibSVM Record

2017-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050172#comment-16050172
 ] 

Sean Owen commented on SPARK-21104:
---

Doesn't the format dictate that it's sorted already?

> Support sort with index when parse LibSVM Record
> 
>
> Key: SPARK-21104
> URL: https://issues.apache.org/jira/browse/SPARK-21104
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When loading LibSVM files from HDFS, I found that feature indices must be in 
> ascending order. 
> We can sort by *indices* when we parse each input line into (index, value) 
> tuples and avoid checking whether the indices are in ascending order afterwards.
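A minimal sketch of the proposed parsing step, assuming the usual LibSVM line layout (a label followed by space-separated index:value pairs with 1-based indices). The parseLibSVMLine helper is illustrative only, not Spark's actual MLUtils implementation.

{code}
// Illustrative only: parse one LibSVM record and sort by feature index,
// rather than checking afterwards that the indices were already ascending.
def parseLibSVMLine(line: String): (Double, Array[(Int, Double)]) = {
  val items = line.split(' ').filter(_.nonEmpty)
  val label = items.head.toDouble
  val features = items.tail.map { item =>
    val Array(index, value) = item.split(':')
    (index.toInt - 1, value.toDouble)  // convert 1-based LibSVM indices to 0-based
  }
  (label, features.sortBy(_._1))       // sorting replaces the ascending-order check
}

// parseLibSVMLine("1.0 3:0.5 1:2.0") yields (1.0, Array((0, 2.0), (2, 0.5)))
{code}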



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21104) Support sort with index when parse LibSVM Record

2017-06-15 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21104:
--
Description: 
When loading LibSVM files from HDFS, I found that feature indices must be in 
ascending order. 
We can sort by *indices* when we parse each input line into (index, value) 
tuples and avoid checking whether the indices are in ascending order afterwards.

  was:When loading LibSVM files from HDFS, I found that feature indices must be 
in ascending order. We can sort by *indices* when we parse each input line into 
(index, value) tuples and avoid checking whether the indices are in ascending 
order afterwards.


> Support sort with index when parse LibSVM Record
> 
>
> Key: SPARK-21104
> URL: https://issues.apache.org/jira/browse/SPARK-21104
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When loading LibSVM files from HDFS, I found that feature indices must be in 
> ascending order. 
> We can sort by *indices* when we parse each input line into (index, value) 
> tuples and avoid checking whether the indices are in ascending order afterwards.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


