[jira] [Assigned] (SPARK-16639) query fails if having condition contains grouping column

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16639:


Assignee: Apache Spark

> query fails if having condition contains grouping column
> 
>
> Key: SPARK-16639
> URL: https://issues.apache.org/jira/browse/SPARK-16639
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> {code}
> create table tbl(a int, b string);
> select count(b) from tbl group by a + 1 having a + 1 = 2;
> {code}
> This query fails analysis.
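For readers following along, here is a small spark-shell sketch of the failure and one possible workaround. It assumes a Spark 2.x shell with the usual SparkSession named `spark` and the `tbl` table from the description; the subquery rewrite is only a common way to sidestep the problem, not the fix tracked by this issue.

{code}
// Assumes a Spark 2.x spark-shell (SparkSession available as `spark`) and the
// table from the report: create table tbl(a int, b string)

// Fails analysis per this report: HAVING refers back to the grouping expression a + 1.
spark.sql("SELECT count(b) FROM tbl GROUP BY a + 1 HAVING a + 1 = 2")

// Possible workaround: compute the grouping expression as a column first, so both
// GROUP BY and HAVING reference a plain attribute instead of the raw expression.
spark.sql("""
  SELECT count(b)
  FROM (SELECT a + 1 AS a1, b FROM tbl) t
  GROUP BY a1
  HAVING a1 = 2
""").show()
{code}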






[jira] [Assigned] (SPARK-16639) query fails if having condition contains grouping column

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16639:


Assignee: (was: Apache Spark)

> query fails if having condition contains grouping column
> 
>
> Key: SPARK-16639
> URL: https://issues.apache.org/jira/browse/SPARK-16639
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b string);
> select count(b) from tbl group by a + 1 having a + 1 = 2;
> {code}
> This query fails analysis.






[jira] [Commented] (SPARK-16639) query fails if having condition contains grouping column

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387176#comment-15387176
 ] 

Apache Spark commented on SPARK-16639:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/14296

> query fails if having condition contains grouping column
> 
>
> Key: SPARK-16639
> URL: https://issues.apache.org/jira/browse/SPARK-16639
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b string);
> select count(b) from tbl group by a + 1 having a + 1 = 2;
> {code}
> This query fails analysis.






[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387172#comment-15387172
 ] 

Cheng Lian commented on SPARK-16646:


Thanks for the help! I'm not working on this.

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Hyukjin Kwon
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works in 1.6.
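For reference, a short spark-shell sketch of the failure and an explicit-cast workaround. It assumes a Spark 2.x shell with a SparkSession named `spark`; casting both arguments to a common type is just one way to satisfy the same-type requirement until implicit coercion is fixed, not the fix itself.

{code}
// Assumes a Spark 2.x spark-shell (SparkSession available as `spark`).

// Fails in 2.0.0 per this report: the integer and decimal arguments are not coerced.
spark.sql("SELECT LEAST(1, 1.5)").show()

// Possible workaround: cast the arguments to one common type explicitly.
spark.sql("SELECT LEAST(CAST(1 AS DOUBLE), CAST(1.5 AS DOUBLE))").show()
{code}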






[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16646:
---
Reporter: Cheng Lian  (was: liancheng)

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works in 1.6.






[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16646:
---
Assignee: Hyukjin Kwon

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Hyukjin Kwon
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works in 1.6.






[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16648:
---
Reporter: Cheng Lian  (was: liancheng)

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> 

[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16648:

Assignee: Cheng Lian

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>Assignee: Cheng Lian
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> 

[jira] [Assigned] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16648:


Assignee: Apache Spark

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>Assignee: Apache Spark
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> 

[jira] [Commented] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387162#comment-15387162
 ] 

Apache Spark commented on SPARK-16648:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/14295

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> 

[jira] [Assigned] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16648:


Assignee: (was: Apache Spark)

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> 

[jira] [Comment Edited] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387145#comment-15387145
 ] 

Lianhui Wang edited comment on SPARK-2666 at 7/21/16 4:38 AM:
--

Thanks. I think what [~irashid] said is mostly about non-external shuffle, but in our use cases we usually use the YARN external shuffle service. For a stage chain 0 --> 1 --> 2, if a shuffle fetch failure happens while running stage 2, say on executor A, then the map output of stage 1 that lived on executor A has to be regenerated, but stage 0 does not have to be rerun on executor A.
So I think we can first handle FetchFailed on the YARN external shuffle service (caused by e.g. connection timeouts, out of memory, etc.). I think many users have hit FetchFailed with the YARN external shuffle service.
As [~tgraves] said before, when a stage fails because of FetchFailed, option 1) currently reruns all the tasks that have not yet succeeded in the failed stage (including the ones that may still be running), which causes many duplicate running tasks for that stage: once there is a FetchFailed, all unsuccessful tasks of the failed stage are rerun.
For now, I think our first target is: for the YARN external shuffle service, when a stage fails because of FetchFailed, reduce the number of tasks that have to be rerun for the failed stage. As I pointed out before, the best approach is what MapReduce does: resubmit only the map side of the failed stage.
1. When a task hits FetchFailed, the task is not finished; it keeps fetching the other results and only reports the failed ShuffleBlockId to the DAGScheduler. The other running tasks of this stage do the same.
2. The DAGScheduler receives the failed ShuffleBlockId and resubmits only the map task for that block. Once that task finishes, its map output is registered with the MapOutputTracker.
3. The task that hit FetchFailed asks the MapOutputTracker for the missing map output on every heartbeat. Once step 2 is finished, the task obtains the new map output and fetches the previously failed results.
There is a potential deadlock if the task from step 2 cannot run because there are no free slots for it; in that situation some running tasks should be killed to make room.
In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 implements option 2): it only reruns the failed tasks and waits for the ones still running in the failed stage. The disadvantage of SPARK-14649 is that the other running tasks of the failed stage may still need a long time to rerun when they spend time fetching the other results.




[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387145#comment-15387145
 ] 

Lianhui Wang commented on SPARK-2666:
-

I think what [~irashid] said is mostly about non-external shuffle, but in our use cases we usually use the YARN external shuffle service. For a stage chain 0 --> 1 --> 2, if a shuffle fetch failure happens while running stage 2, say on executor A, then the map output of stage 1 that lived on executor A has to be regenerated, but stage 0 does not have to be rerun on executor A.
So I think we can first handle FetchFailed on the YARN external shuffle service (caused by e.g. connection timeouts, out of memory, etc.). I think many users have hit FetchFailed with the YARN external shuffle service.
As [~tgraves] said before, when a stage fails because of FetchFailed, option 1) currently reruns all the tasks that have not yet succeeded in the failed stage (including the ones that may still be running), which causes many duplicate running tasks for that stage: once there is a FetchFailed, all unsuccessful tasks of the failed stage are rerun.
For now, I think our first target is: for the YARN external shuffle service, when a stage fails because of FetchFailed, reduce the number of tasks that have to be rerun for the failed stage. As I pointed out before, the best approach is what MapReduce does: resubmit only the map side of the failed stage.
1. When a task hits FetchFailed, the task is not finished; it keeps fetching the other results and only reports the failed ShuffleBlockId to the DAGScheduler. The other running tasks of this stage do the same.
2. The DAGScheduler receives the failed ShuffleBlockId and resubmits only the map task for that block. Once that task finishes, its map output is registered with the MapOutputTracker.
3. The task that hit FetchFailed asks the MapOutputTracker for the missing map output on every heartbeat. Once step 2 is finished, the task obtains the new map output and fetches the previously failed results.
There is a potential deadlock if the task from step 2 cannot run because there are no free slots for it; in that situation some running tasks should be killed to make room.
In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 implements option 2): it only reruns the failed tasks and waits for the ones still running in the failed stage. The disadvantage of SPARK-14649 is that the other running tasks of the failed stage may still need a long time to rerun when they spend time fetching the other results.
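To make the three steps above easier to follow, here is a small, self-contained Scala sketch of the proposed flow. It is purely illustrative: ToyMapOutputTracker, ToyScheduler and fetchWithRetry are hypothetical stand-ins, not Spark's actual DAGScheduler or MapOutputTracker APIs.

{code}
// Hypothetical sketch only: models the proposed report/resubmit/re-fetch flow with
// toy stand-ins; none of these types exist in Spark itself.
import scala.collection.mutable

case class ShuffleBlockRef(shuffleId: Int, mapId: Int, reduceId: Int)

// Stand-in for the MapOutputTracker: remembers which executor holds each map output.
class ToyMapOutputTracker {
  private val locations = mutable.Map.empty[ShuffleBlockRef, String]
  def register(block: ShuffleBlockRef, executor: String): Unit = locations(block) = executor
  def lookup(block: ShuffleBlockRef): Option[String] = locations.get(block)
}

// Stand-in for the DAGScheduler: resubmits only the map task behind the failed block
// (step 2) and registers the regenerated output.
class ToyScheduler(tracker: ToyMapOutputTracker) {
  def handleFetchFailure(block: ShuffleBlockRef): Unit = {
    val rerunLocation = s"executor-retry-${block.mapId}"
    tracker.register(block, rerunLocation)
  }
}

// Steps 1 and 3 from the reducer's point of view: report the failed block instead of
// failing the whole task, then look it up again once the scheduler has regenerated it
// (standing in for the per-heartbeat polling described in step 3).
def fetchWithRetry(block: ShuffleBlockRef,
                   tracker: ToyMapOutputTracker,
                   scheduler: ToyScheduler): String =
  tracker.lookup(block).getOrElse {
    scheduler.handleFetchFailure(block)
    tracker.lookup(block).getOrElse(sys.error(s"map output for $block still missing"))
  }
{code}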


> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before the all tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in case (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}}s
> * Testing this *might* benefit from SPARK-10372




[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387067#comment-15387067
 ] 

Hyukjin Kwon commented on SPARK-16646:
--

Maybe I should have asked whether you were working on this. I will close mine if you are!

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works in 1.6.






[jira] [Assigned] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16646:


Assignee: Apache Spark

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>Assignee: Apache Spark
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works in 1.6.






[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387025#comment-15387025
 ] 

Apache Spark commented on SPARK-16646:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14294

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works in 1.6.






[jira] [Assigned] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16646:


Assignee: (was: Apache Spark)

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works in 1.6.






[jira] [Created] (SPARK-16659) use Maven project to submit spark application via yarn-client

2016-07-20 Thread Jack Jiang (JIRA)
Jack Jiang created SPARK-16659:
--

 Summary: use Maven project to submit spark application via 
yarn-client
 Key: SPARK-16659
 URL: https://issues.apache.org/jira/browse/SPARK-16659
 Project: Spark
  Issue Type: Question
Reporter: Jack Jiang


I want to use Spark SQL to execute Hive SQL in my Maven project. Here is the main code:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.hive.HiveContext;

// Needed on Windows so Hadoop can locate its native binaries (winutils).
System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");

SparkConf sparkConf = new SparkConf()
        .setAppName("test")
        .setMaster("yarn-client");
        // .set("hive.metastore.uris", "thrift://172.30.115.59:9083");

SparkContext ctx = new SparkContext(sparkConf);
// ctx.addJar("lib/hive-hbase-handler-0.14.0.2.2.6.0-2800.jar");

HiveContext sqlContext = new HiveContext(ctx);

// List the tables visible through the Hive metastore.
String[] tables = sqlContext.tableNames();
for (String tablename : tables) {
    System.out.println("tablename : " + tablename);
}
When I run it, I get the following error:
10:16:17,496  INFO Client:59 - 
 client token: N/A
 diagnostics: Application application_1468409747983_0280 failed 2 times 
due to AM Container for appattempt_1468409747983_0280_02 exited with  
exitCode: -1000
For more detailed output, check application tracking 
page:http://hadoop003.icccuat.com:8088/proxy/application_1468409747983_0280/Then,
 click on links to logs of each attempt.
Diagnostics: File 
file:/C:/Users/uatxj990267/AppData/Local/Temp/spark-8874c486-893d-4ac3-a088-48e4cdb484e1/__spark_conf__9007071161920501082.zip
 does not exist
java.io.FileNotFoundException: File 
file:/C:/Users/uatxj990267/AppData/Local/Temp/spark-8874c486-893d-4ac3-a088-48e4cdb484e1/__spark_conf__9007071161920501082.zip
 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:608)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:821)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:598)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:414)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Failing this attempt. Failing the application.
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1469067373412
 final status: FAILED
 tracking URL: 
http://hadoop003.icccuat.com:8088/cluster/app/application_1468409747983_0280
 user: uatxj990267
10:16:17,496 ERROR SparkContext:96 - Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might 
have been killed or unable to launch application master.
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
at com.huateng.test.SparkSqlDemo.main(SparkSqlDemo.java:33)
But when I change setMaster("yarn-client") to setMaster("local[2]"), it works fine. What is wrong here? Can anyone help me?






[jira] [Commented] (SPARK-16643) When doing Shuffle, report "java.io.FileNotFoundException"

2016-07-20 Thread Deng Changchun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387012#comment-15387012
 ] 

Deng Changchun commented on SPARK-16643:


Thank you for the response.
I think this problem is completely different from https://issues.apache.org/jira/browse/SPARK-12240, even though they both report a FileNotFoundException.

SPARK-12240 can be solved by raising the ulimit. Coming back to this problem, I have already set ulimit to unlimited, and the error is not "too many open files" but "no such file or directory", so I don't think they are the same problem.

By the way, I will try with a more recent version.

> When doing Shuffle, report "java.io.FileNotFoundException"
> --
>
> Key: SPARK-16643
> URL: https://issues.apache.org/jira/browse/SPARK-16643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: LSB Version: 
> :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
> Distributor ID:   CentOS
> Description:  CentOS release 6.6 (Final)
> Release:  6.6
> Codename: Final
> java version "1.7.0_10"
> Java(TM) SE Runtime Environment (build 1.7.0_10-b18)
> Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)
>Reporter: Deng Changchun
>
> In our Spark standalone cluster we execute some SQL through Spark SQL, for example 
> aggregate queries such as "select count(rowKey) from HVRC_B_LOG where 1=1 
> and RESULTTIME >= 146332800 and RESULTTIME <= 1463414399000".
> At the beginning everything works, but after about 15 days the aggregate queries 
> start to report errors. The log looks like this:
> 【Notice: strangely, the error is not reported every time an aggregate query runs; 
> it shows up at random, only occasionally after some aggregate queries.】
> 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] 
> executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = 
> 624
> 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] 
> executor.Executor: Exception in task 0.3 in stage 580.0 (TID 624)
> java.io.FileNotFoundException: 
> /tmp/spark-cb199fce-bb80-4e6f-853f-4d7984bf5f34/executor-fb7c2149-c6c4-4697-ba2f-3b53dcd7f34a/blockmgr-0a9003ad-23b3-4ff5-b76f-6fbc5d71e730/3e/temp_shuffle_ef68b340-85e4-483c-90e8-5e8c8d8ee4ee
>  (No such file or directory)
>   at java.io.FileOutputStream.open(Native Method)
>   at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)






[jira] [Updated] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16644:
-
Fix Version/s: (was: 2.0.0)
   2.1.0
   2.0.1

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> This query fails in 2.0 but works in 1.6.






[jira] [Updated] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16644:
-
Assignee: Wenchen Fan

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> This query fails in 2.0 but works in 1.6.






[jira] [Resolved] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16644.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 14281
[https://github.com/apache/spark/pull/14281]

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> This query fails in 2.0 but works in 1.6.






[jira] [Resolved] (SPARK-16272) Allow configs to reference other configs, env and system properties

2016-07-20 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16272.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0

> Allow configs to reference other configs, env and system properties
> ---
>
> Key: SPARK-16272
> URL: https://issues.apache.org/jira/browse/SPARK-16272
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, Spark's configuration is static; it is whatever is written to the 
> config file, with some rare exceptions (such as some YARN code that does 
> expansion of Hadoop configuration).
> But there are a few use cases that don't work well in that situation. For 
> example, consider {{spark.sql.hive.metastore.jars}}. It references a list of 
> paths containing the classpath for accessing Hive's metastore. If you're 
> launching an application in cluster mode, it means that whatever is in the 
> configuration of the edge node needs to match the configuration of the random 
> node in the cluster where the driver will actually run.
> This would be easily solved if there was a way to reference system properties 
> or env variables; for example, when YARN launches a container, a bunch of env 
> variables are set, which could be used to modify that path to match the 
> correct location on the node.
> So I'm proposing a change where config properties can opt in to use this 
> variable expansion feature; it's opt-in to avoid breaking existing code (who 
> knows) and to avoid the extra cost of doing variable expansion on every 
> config read.
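
Below is a minimal, hedged sketch of what the opt-in expansion could look like from 
application code. The "${env:...}" reference syntax and the choice of config key are 
illustrative assumptions for this example; the actual reference format and the set of 
opted-in configs are defined by the patch itself.

{code}
// Hypothetical Scala sketch (the reference syntax is an assumption, not the
// confirmed format): the config value embeds a reference to an environment
// variable that YARN sets per container, so an opted-in config entry could
// resolve it on the node where the driver actually runs; configs that have not
// opted in would keep the literal string untouched.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.hive.metastore.jars", "${env:HIVE_HOME}/lib/*")
{code}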



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16344.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14014
[https://github.com/apache/spark/pull/14014]

> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> 
>
> Key: SPARK-16344
> URL: https://issues.apache.org/jira/browse/SPARK-16344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> This is a weird corner case. Users may hit this issue if they have a schema 
> that
> # has an array field whose element type is a struct, and
> # the struct has one and only one field, and
> # that field is named as "element".
> The following Spark shell snippet for Spark 1.6 reproduces this bug:
> {code}
> case class A(element: Long)
> case class B(f: Array[A])
> val path = "/tmp/silly.parquet"
> Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)
> val df = sqlContext.read.parquet(path)
> df.printSchema()
> // root
> //  |-- f0: array (nullable = true)
> //  ||-- element: struct (containsNull = true)
> //  |||-- element: long (nullable = true)
> df.show()
> {code}
> Exception thrown:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: Expected instance of group converter 
> but got 
> "org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
> at 
> org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
> at 
> org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:266)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
> at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
> at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
> 

[jira] [Assigned] (SPARK-16658) Add EdgePartition.withVertexAttributes

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16658:


Assignee: (was: Apache Spark)

> Add EdgePartition.withVertexAttributes
> --
>
> Key: SPARK-16658
> URL: https://issues.apache.org/jira/browse/SPARK-16658
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben McCann
>
> I'm using cloudml/zen, which has forked graphx. I'd like to see their changes 
> upstreamed, so that they can go back to using the upstream graphx instead of 
> having a fork.
> Their implementation of withVertexAttributes: 
> https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala
> Their usage of that method: 
> https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16658) Add EdgePartition.withVertexAttributes

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386796#comment-15386796
 ] 

Apache Spark commented on SPARK-16658:
--

User 'benmccann' has created a pull request for this issue:
https://github.com/apache/spark/pull/14291

> Add EdgePartition.withVertexAttributes
> --
>
> Key: SPARK-16658
> URL: https://issues.apache.org/jira/browse/SPARK-16658
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben McCann
>
> I'm using cloudml/zen, which has forked graphx. I'd like to see their changes 
> upstreamed, so that they can go back to using the upstream graphx instead of 
> having a fork.
> Their implementation of withVertexAttributes: 
> https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala
> Their usage of that method: 
> https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16658) Add EdgePartition.withVertexAttributes

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16658:


Assignee: Apache Spark

> Add EdgePartition.withVertexAttributes
> --
>
> Key: SPARK-16658
> URL: https://issues.apache.org/jira/browse/SPARK-16658
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ben McCann
>Assignee: Apache Spark
>
> I'm using cloudml/zen, which has forked graphx. I'd like to see their changes 
> upstreamed, so that they can go back to using the upstream graphx instead of 
> having a fork.
> Their implementation of withVertexAttributes: 
> https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala
> Their usage of that method: 
> https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16658) Add EdgePartition.withVertexAttributes

2016-07-20 Thread Ben McCann (JIRA)
Ben McCann created SPARK-16658:
--

 Summary: Add EdgePartition.withVertexAttributes
 Key: SPARK-16658
 URL: https://issues.apache.org/jira/browse/SPARK-16658
 Project: Spark
  Issue Type: Improvement
Reporter: Ben McCann


I'm using cloudml/zen, which has forked graphx. I'd like to see their changes 
upstreamed, so that they can go back to using the upstream graphx instead of 
having a fork.

Their implementation of withVertexAttributes: 
https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala

Their usage of that method: 
https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16280) Implement histogram_numeric SQL function

2016-07-20 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386787#comment-15386787
 ] 

MIN-FU YANG commented on SPARK-16280:
-

I implemented histogram_numeric according to Algorithms I & II in the paper: 
Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", 
J. Machine Learning Research 11 (2010), pp. 849--872.

After doing some benchmarking, I came up with an optimized solution, and it is ready for 
review: https://github.com/apache/spark/pull/14129
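
For readers unfamiliar with the paper, here is a rough, hedged sketch of the core update 
step of Algorithm I (insert the new value as a unit-weight bin, then repeatedly merge the 
two closest centroids). It only illustrates the idea; it is not the code in the pull request.

{code}
// Illustrative sketch only (not the PR implementation): Algorithm I keeps at most
// `numBins` (centroid, count) pairs and merges the two closest centroids after
// every insertion, which is what makes the histogram streaming and mergeable.
case class Bin(centroid: Double, count: Long)

def update(bins: Vector[Bin], value: Double, numBins: Int): Vector[Bin] = {
  var hist = (bins :+ Bin(value, 1L)).sortBy(_.centroid)
  while (hist.size > numBins) {
    // index of the adjacent pair with the smallest centroid gap
    val i = (0 until hist.size - 1).minBy(j => hist(j + 1).centroid - hist(j).centroid)
    val (a, b) = (hist(i), hist(i + 1))
    val merged = Bin(
      (a.centroid * a.count + b.centroid * b.count) / (a.count + b.count),
      a.count + b.count)
    hist = (hist.take(i) :+ merged) ++ hist.drop(i + 2)
  }
  hist
}
{code}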

> Implement histogram_numeric SQL function
> 
>
> Key: SPARK-16280
> URL: https://issues.apache.org/jira/browse/SPARK-16280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16657) Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386765#comment-15386765
 ] 

Apache Spark commented on SPARK-16657:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14290

> Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and 
> CreateHiveTableAsSelectCommand
> -
>
> Key: SPARK-16657
> URL: https://issues.apache.org/jira/browse/SPARK-16657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> The query in `InsertIntoHadoopFsRelationCommand` and 
> `CreateHiveTableAsSelectCommand` should be treated as inner children (exposed 
> via `innerChildren`), like what we did for the other similar nodes 
> `CreateDataSourceTableAsSelectCommand` and `InsertIntoDataSourceCommand`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16657) Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16657:


Assignee: (was: Apache Spark)

> Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and 
> CreateHiveTableAsSelectCommand
> -
>
> Key: SPARK-16657
> URL: https://issues.apache.org/jira/browse/SPARK-16657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> The query in `InsertIntoHadoopFsRelationCommand` and 
> `CreateHiveTableAsSelectCommand` should be treated as inner children (exposed 
> via `innerChildren`), like what we did for the other similar nodes 
> `CreateDataSourceTableAsSelectCommand` and `InsertIntoDataSourceCommand`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16657) Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16657:


Assignee: Apache Spark

> Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and 
> CreateHiveTableAsSelectCommand
> -
>
> Key: SPARK-16657
> URL: https://issues.apache.org/jira/browse/SPARK-16657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> The query in `InsertIntoHadoopFsRelationCommand` and 
> `CreateHiveTableAsSelectCommand` should be treated as inner children (exposed 
> via `innerChildren`), like what we did for the other similar nodes 
> `CreateDataSourceTableAsSelectCommand` and `InsertIntoDataSourceCommand`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16657) Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand

2016-07-20 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16657:
---

 Summary: Replace children by innerChildren in 
InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand
 Key: SPARK-16657
 URL: https://issues.apache.org/jira/browse/SPARK-16657
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


The query in `InsertIntoHadoopFsRelationCommand` and 
`CreateHiveTableAsSelectCommand` should be treated as inner children (exposed via 
`innerChildren`), like what we did for the other similar nodes 
`CreateDataSourceTableAsSelectCommand` and `InsertIntoDataSourceCommand`. 
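
As a hedged, self-contained illustration of the distinction this relies on (these are toy 
classes, not Spark's actual plan node hierarchy): nodes in `children` are the ones plan 
transformations recurse into, while `innerChildren` only participate in string/explain output.

{code}
// Toy sketch, NOT Spark classes: a command keeps the query it wraps out of
// `children` (so a "planner" walking children does not rewrite it) and exposes
// it through `innerChildren` so it still shows up when the tree is printed.
trait ToyNode {
  def name: String
  def children: Seq[ToyNode] = Nil       // traversed by plan transformations
  def innerChildren: Seq[ToyNode] = Nil  // rendered in treeString only
  def treeString(indent: Int = 0): String =
    (" " * indent) + name + "\n" +
      (children ++ innerChildren).map(_.treeString(indent + 2)).mkString
}

case class ToyScan(table: String) extends ToyNode { def name = s"ToyScan($table)" }

case class ToyInsertCommand(query: ToyNode) extends ToyNode {
  def name = "ToyInsertCommand"
  override def innerChildren: Seq[ToyNode] = Seq(query)
}

// ToyInsertCommand(ToyScan("t")).treeString() prints both nodes,
// while ToyInsertCommand(ToyScan("t")).children is empty.
{code}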



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386764#comment-15386764
 ] 

Yin Huai commented on SPARK-16644:
--

oh, I think this problem only happens if an aggregate expression uses a 
grouping expression and that exact grouping expression is part of the output of 
the aggregate operator.
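
For concreteness, a minimal sketch of that shape (assuming a SparkSession named {{spark}} 
and the {{tbl(a int, b int)}} table from the description already exist): {{max(b)}} uses 
the grouping expression {{b}}, and that same {{b}} is also output as {{c2}}.

{code}
// Minimal reproduction sketch of the shape described above; `spark` is assumed
// to be an existing SparkSession and `tbl(a int, b int)` an existing table.
spark.sql("""
  SELECT a, max(b) AS c1, b AS c2
  FROM tbl
  WHERE a = b
  GROUP BY a, b
  HAVING c1 = 1
""").show()
{code}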

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> this query fails in 2.0, but works in 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-07-20 Thread Emma Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386759#comment-15386759
 ] 

Emma Tang commented on SPARK-12675:
---

This issue is happening in 1.6.2. I'm on a 32 node cluster. This issue only 
occurs when the input data exceeds a certain size. 

> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
>Reporter: Alexandru Rosianu
>
> I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script 
> which doesn't work (a bit simplified):
> {code:title=Script.scala}
> // Prepare data sets
> logInfo("Getting datasets")
> val emoTrainingData = 
> sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet")
> val trainingData = emoTrainingData
> // Configure the pipeline
> val pipeline = new Pipeline().setStages(Array(
>   new 
> FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"),
>   new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"),
>   new Tokenizer().setInputCol("text").setOutputCol("raw_words"),
>   new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"),
>   new HashingTF().setInputCol("words").setOutputCol("features"),
>   new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"),
>   new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", 
> "raw_words", "words", "features")
> ))
> // Fit the pipeline
> logInfo(s"Training model on ${trainingData.count()} rows")
> val model = pipeline.fit(trainingData)
> {code}
> It executes up to the last line. It prints "Training model on xx rows", then 
> it starts fitting, the executor dies, the driver doesn't receive heartbeats 
> from the executor and it times out, then the script exits. It doesn't get 
> past that line.
> This is the exception that kills the executor:
> {code}
> java.io.IOException: java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.HashMap$SerializationProxy to field 
> org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type 
> scala.collection.immutable.Map in instance of 
> org.apache.spark.executor.TaskMetrics
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
>   at 
> org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.util.Utils$.deserialize(Utils.scala:92)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>   at 

[jira] [Commented] (SPARK-16656) CreateTableAsSelectSuite is flaky

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386750#comment-15386750
 ] 

Apache Spark commented on SPARK-16656:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14289

> CreateTableAsSelectSuite is flaky
> -
>
> Key: SPARK-16656
> URL: https://issues.apache.org/jira/browse/SPARK-16656
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16656) CreateTableAsSelectSuite is flaky

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16656:


Assignee: Yin Huai  (was: Apache Spark)

> CreateTableAsSelectSuite is flaky
> -
>
> Key: SPARK-16656
> URL: https://issues.apache.org/jira/browse/SPARK-16656
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16656) CreateTableAsSelectSuite is flaky

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16656:


Assignee: Apache Spark  (was: Yin Huai)

> CreateTableAsSelectSuite is flaky
> -
>
> Key: SPARK-16656
> URL: https://issues.apache.org/jira/browse/SPARK-16656
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16656) CreateTableAsSelectSuite is flaky

2016-07-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-16656:


 Summary: CreateTableAsSelectSuite is flaky
 Key: SPARK-16656
 URL: https://issues.apache.org/jira/browse/SPARK-16656
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

*UPDATE*
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE)
                for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i))
                              for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

*UPDATE 2*
Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same 
source.
I attached some VisualVM profiles there.
Most interesting are from queries.
https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

*UPDATE*
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE)
                for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i))
                              for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

*UPDATE 2*
Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same 
source.
There is more 


> Spark 2.0 slower than 1.6 when querying nested columns
> 

[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

*UPDATE*
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE)
                for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i))
                              for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

*UPDATE 2*
Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same 
source.
There is more 

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

*UPDATE*
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE)
                for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i))
                              for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

*UPDATE 2*
Analysis in 


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: 

[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

*UPDATE*
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE)
                for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i))
                              for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

*UPDATE 2*
Analysis in 

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

*UPDATE*
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE)
                for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i))
                              for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some test on parquet file with many nested columns (about 

[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-20 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386586#comment-15386586
 ] 

Michael Allman commented on SPARK-16320:


Okay, so the metastore is not a factor.

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE)
>                 for i in range(cols)}
> 
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
> 
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i))
>                               for i in range(levels[0])])
> 
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF

2016-07-20 Thread Sam Fishman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386564#comment-15386564
 ] 

Sam Fishman edited comment on SPARK-13301 at 7/20/16 8:27 PM:
--

I am having the same issue when applying a udf to a DataFrame. I've noticed 
that it seems to only occur when data is read in using sqlContext.sql(). When I 
use parallelize on a local Python collection and then call toDF on the 
resulting RDD, I don't seem to have the issue:

Works:
data = [ ["1", "A->B"], ["2", "B->C"], ["3", "C->A"], ["4", "D->E"], ["5", 
"E->D"] ]
rdd = sc.parallelize(data)
df = rdd.toDF(["id", "segment"])
def myFunction(number):
    id_test = number + ' test'
    return id_test
test_function_udf = udf(myFunction, StringType())
df2 = df.withColumn('test', test_function_udf("id"))

Does not work (ie "wrong results" similar to Simone's output):
df = sqlContext.sql("select * from my_table")
def myFunction(number):
    id_test = number + ' test'
    return id_test
test_function_udf = udf(myFunction, StringType())
df2 = df.withColumn('test', test_function_udf("id"))

I should note that this also only seems to be an issue when using a UDF. If I 
use the builtin concat function, I do not get the error. 



was (Author: sfishman):
I am having the same issue when applying a udf to a DataFrame. I've noticed 
that it seems to only occur when data is read in using sqlContext.sql(). When I 
use parallelize on a local Python collection and then call toDF on the 
resulting RDD, I don't seem to have the issue:

Works:
data = [ ["1", "A->B"], ["2", "B->C"], ["3", "C->A"], ["4", "D->E"], ["5", 
"E->D"] ]
rdd = sc.parallelize(data)
df = rdd.toDF(["id", "segment"])
def myFunction(number):
    id_test = number + ' test'
    return id_test
test_function_udf = udf(myFunction, StringType())
df2 = df.withColumn('test', test_function_udf("id"))

Does not work (ie "wrong results" similar to Simone's output):
# Assuming I have data stored in a table
df = sqlContext.sql("select * from my_table")
def myFunction(number):
    id_test = number + ' test'
    return id_test
test_function_udf = udf(myFunction, StringType())
df2 = df.withColumn('test', test_function_udf("id"))

I should note that this also only seems to be an issue when using a UDF. If I 
use the builtin concat function, I do not get the error. 


> PySpark Dataframe return wrong results with custom UDF
> --
>
> Key: SPARK-13301
> URL: https://issues.apache.org/jira/browse/SPARK-13301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: PySpark in yarn-client mode - CDH 5.5.1
>Reporter: Simone
>Priority: Critical
>
> Using a User Defined Function in PySpark inside the withColumn() method of a 
> DataFrame gives wrong results.
> Here is an example:
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> |col1|   col2|col3|
> |1265AB4F65C05740E...|Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...|   Cristina|4f008903600a0133e...|
> The results are wrong and seem to be random: some records are OK (for example 
> the third), others are not (for example the first two).
> The problem does not seem to occur with Spark built-in functions:
> from pyspark.sql.functions import *
> myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> Without the withColumn() method, the results always seem to be correct:
> myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
> This is only a partial workaround, because you have to list all the columns of 
> your DataFrame each time.
> The problem also does not seem to occur in Scala/Java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386561#comment-15386561
 ] 

Thomas Graves commented on SPARK-2666:
--

thanks for the explanation.

I guess we would have to look through the failure cases, but if you are using 
the external shuffle service, it feels like marking everything on that node as 
bad, even if it's from another executor, would be better, because this case seems 
more like a node failure or something that would be much more likely to 
affect other map outputs.

I guess if it's serving shuffle from the executor, it could just be something 
bad on that executor (out of memory, timeout due to overload, etc.). 



> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before the all tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in stage (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}} s
> * Testing this *might* benefit from SPARK-10372



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF

2016-07-20 Thread Sam Fishman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386564#comment-15386564
 ] 

Sam Fishman commented on SPARK-13301:
-

I am having the same issue when applying a udf to a DataFrame. I've noticed 
that it seems to only occur when data is read in using sqlContext.sql(). When I 
use parallelize on a local Python collection and then call toDF on the 
resulting RDD, I don't seem to have the issue:

Works:
data = [ ["1", "A->B"], ["2", "B->C"], ["3", "C->A"], ["4", "D->E"], ["5", 
"E->D"] ]
rdd = sc.parallelize(data)
df = rdd.toDF(["id", "segment"])
def myFunction(number):
    id_test = number + ' test'
    return id_test
test_function_udf = udf(myFunction, StringType())
df2 = df.withColumn('test', test_function_udf("id"))

Does not work (ie "wrong results" similar to Simone's output):
# Assuming I have data stored in a table
df = sqlContext.sql("select * from my_table")
def myFunction(number):
    id_test = number + ' test'
    return id_test
test_function_udf = udf(myFunction, StringType())
df2 = df.withColumn('test', test_function_udf("id"))

I should note that this also only seems to be an issue when using a UDF. If I 
use the builtin concat function, I do not get the error. 


> PySpark Dataframe return wrong results with custom UDF
> --
>
> Key: SPARK-13301
> URL: https://issues.apache.org/jira/browse/SPARK-13301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: PySpark in yarn-client mode - CDH 5.5.1
>Reporter: Simone
>Priority: Critical
>
> Using a User Defined Function in PySpark inside the withColumn() method of a 
> DataFrame gives wrong results.
> Here is an example:
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> |col1|   col2|col3|
> |1265AB4F65C05740E...|Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...|   Cristina|4f008903600a0133e...|
> The results are wrong and seem to be random: some records are OK (for example 
> the third), others are not (for example the first two).
> The problem does not seem to occur with Spark built-in functions:
> from pyspark.sql.functions import *
> myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> Without the withColumn() method, the results always seem to be correct:
> myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
> This is only a partial workaround, because you have to list all the columns of 
> your DataFrame each time.
> The problem also does not seem to occur in Scala/Java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386552#comment-15386552
 ] 

Maciej Bryński commented on SPARK-16320:


[~michael]
Could you check SPARK-16321? 
I attached some VisualVM snapshots there, and the underlying data is the same.

Another thing is that I'm using the read.parquet method, which means that Spark 
shouldn't connect to the metastore.


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE)
>                 for i in range(cols)}
> 
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
> 
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i))
>                               for i in range(levels[0])])
> 
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386524#comment-15386524
 ] 

Imran Rashid commented on SPARK-2666:
-

[~tgraves] [~lianhuiwang].  When there is a fetch failure, spark considers all 
shuffle output on that executor to be gone.  (The code is rather confusing -- 
first it just removes the one block whose fetch failed: 
https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1134
  but just after that, it removes everything on the executor: 
https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1184)


When a stage is retried, it reruns all the tasks for the shuffle outputs that are missing at the time the stage is retried.  Usually, this is just all of the map output that was on the executor whose fetch failed.  But it's not necessarily exactly the same, as even more shuffle outputs could be lost before the stage retry kicks in.

* Suppose you had three stages in a row, 0 --> 1 --> 2, and you hit a shuffle fetch failure while running stage 2, say on executor A.  So you need to regenerate the map output for stage 1 that was on executor A.  But most likely spark will discover that to regenerate that missing output, it needs some map output from stage 0, which was on executor A.  So first it will go re-run the missing parts of stage 0, and then when it gets to stage 1, the dag scheduler will look again at which map outputs are missing.  So there is some extra time in there to discover more missing shuffle outputs.

* Spark only marks the shuffle output as missing for the *executor* that shuffle data couldn't be read from, not for the entire node.  So if it's a hardware failure, you're likely to hit more failures even after the first fetch failure comes in, since you probably can't read from any of the executors on that host.

Despite this, I don't think there is a very good reason to leave tasks running 
after there is a fetch failure.  If there is a hardware failure, then the rest 
of the retry process is also likely to discover this and remove those executors 
as well.  (Kay and I had discussed this earlier in the thread and we seemed to 
agree, though I dunno if we had thought through all the details at that time.)  
If anything, I wonder if when there is a fetch failure, we should mark all data 
as missing on the entire node, not just the executor, but I don't think that is 
necessary.
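
A toy model of the bookkeeping described above (not the actual DAGScheduler code, just a sketch of the idea):

{code}
# partition -> executor that holds its map output
map_outputs = {0: 'exec-A', 1: 'exec-B', 2: 'exec-A'}

def handle_fetch_failure(failed_executor):
    # on a fetch failure, treat every map output registered on that executor as lost
    for partition, executor in list(map_outputs.items()):
        if executor == failed_executor:
            del map_outputs[partition]

def missing_partitions(num_partitions):
    # the retried stage only resubmits tasks whose output is now missing
    return [p for p in range(num_partitions) if p not in map_outputs]

handle_fetch_failure('exec-A')
print(missing_partitions(3))   # [0, 2]
{code}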

> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before the all tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in stage (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}} s
> * Testing this *might* benefit from SPARK-10372

[jira] [Updated] (SPARK-16507) Add CRAN checks to SparkR

2016-07-20 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-16507:
-
Target Version/s: 2.0.0, 2.1.0  (was: 2.1.0)

> Add CRAN checks to SparkR 
> --
>
> Key: SPARK-16507
> URL: https://issues.apache.org/jira/browse/SPARK-16507
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 2.0.0
>
>
> One of the steps to publishing SparkR is to pass `R CMD check --as-cran`. 
> We should add a script to do this and fix any errors/warnings we find.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables

2016-07-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386409#comment-15386409
 ] 

Marcelo Vanzin commented on SPARK-16632:


Yes, that's the right stack trace. It's a CTAS query which is probably why the 
write path shows up there.

> Vectorized parquet reader fails to read certain fields from Hive tables
> ---
>
> Key: SPARK-16632
> URL: https://issues.apache.org/jira/browse/SPARK-16632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hive 1.1 (CDH)
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.1.0
>
>
> The vectorized parquet reader fails to read certain tables created by Hive. 
> When the tables have type "tinyint" or "smallint", Catalyst converts those to 
> "ByteType" and "ShortType" respectively. But when Hive writes those tables in 
> parquet format, the parquet schema in the files contains "int32" fields.
> To reproduce, run these commands in the hive shell (or beeline):
> {code}
> create table abyte (value tinyint) stored as parquet;
> create table ashort (value smallint) stored as parquet;
> insert into abyte values (1);
> insert into ashort values (1);
> {code}
> Then query them with Spark 2.0:
> {code}
> spark.sql("select * from abyte").show();
> spark.sql("select * from ashort").show();
> {code}
> You'll see this exception (for the byte case):
> {noformat}
> 16/07/13 12:24:23 ERROR datasources.InsertIntoHadoopFsRelationCommand: 
> Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, scm-centos71-iqalat-2.gce.cloudera.com): 
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getByte(OnHeapColumnVector.java:159)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {noformat}
> This works when you point Spark directly at the files (instead of using the 
> metastore data), or when you disable the vectorized parquet reader.
> The root cause seems to be that Hive creates these tables with a 
> not-so-complete schema:
> {noformat}
> $ parquet-tools schema /tmp/byte.parquet 
> message hive_schema {
>   optional int32 value;
> }
> {noformat}
> There's no indication that the field is a 32-bit field used to store 8-bit 
> values. When the ParquetReadSupport code tries to consolidate both schemas, 
> it just chooses whatever is in the parquet file for primitive types (see 
> ParquetReadSupport.clipParquetType); the vectorized reader uses the catalyst 
> schema, which comes from the Hive metastore, and says it's a byte field, so 
> when it tries to read the data, the byte data stored in "OnHeapColumnVector" 
> is null.
> I have tested a small change to 

[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-20 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386395#comment-15386395
 ] 

Michael Allman commented on SPARK-16320:


The code path for reading data from parquet files has been refactored 
extensively. The fact that [~maver1ck] is testing performance on a table with 
400 partitions makes me wonder if my PR for 
https://issues.apache.org/jira/browse/SPARK-15968 will make a difference for 
repeated queries on partitioned tables. That PR was merged into master and 
backported to 2.0. The commit short hash is d5d2457.

Another issue that's in play when querying Hive metastore tables with a large 
number of partitions is the time it takes to query the Hive metastore for the 
table's partition metadata. Especially if your metastore is not configured to 
use direct sql, this in and of itself can take from seconds to minutes. With 
400 partitions and without direct sql, you might see about 10 to 20 seconds or 
so of query planning time. It would be helpful to look at query times minus the 
time spent in query planning. Since query planning happens in the driver before 
the job's stages are launched, you can estimate this by looking at the actual 
stage times in your SQL query job.
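
One rough way to separate the two, as a sketch only (explain() and count() each trigger planning, so the split is approximate; path and sqlctx are the ones from the script in the description):

{code}
import time

t0 = time.time()
df = sqlctx.read.parquet(path).where("column1.column5.column50 > 0")
df.explain()      # forces analysis/optimization/planning in the driver
t1 = time.time()
df.count()        # launches the job; compare with the stage times shown in the UI
t2 = time.time()
print("planning ~ %.1fs, execution ~ %.1fs" % (t1 - t0, t2 - t1))
{code}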

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created a script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
>
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
>
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
>
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
>
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
>
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits

2016-07-20 Thread Hayri Volkan Agun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386393#comment-15386393
 ] 

Hayri Volkan Agun commented on SPARK-14887:
---

The same issue in 1.6.2 can be reproduced with around 300 iterative udf transformations after about 20 unionAll calls on an approximately 4000-row (~25 column) dataframe...
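
A scaled-down sketch of that kind of job (hypothetical column names and smaller loop counts; it may or may not trip the 64 KB limit depending on the version, but it shows how the generated projection grows):

{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

inc = udf(lambda x: x + 1, LongType())

df = sqlctx.range(4000).withColumnRenamed('id', 'c0')
for i in range(50):                               # ~300 UDF transformations in the report
    df = df.withColumn('c{}'.format(i + 1), inc(df['c{}'.format(i)]))
for _ in range(5):                                # ~20 unionAll calls in the report
    df = df.unionAll(df)
df.count()   # very large generated projections are what hit "grows beyond 64 KB"
{code}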

> Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits
> ---
>
> Key: SPARK-14887
> URL: https://issues.apache.org/jira/browse/SPARK-14887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: fang fang chen
>
> Similiar issue with SPARK-14138 and SPARK-8443:
> With large sql syntax(673K), following error happened:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15425:
-
Labels: release_notes releasenotes  (was: )

> Disallow cartesian joins by default
> ---
>
> Key: SPARK-15425
> URL: https://issues.apache.org/jira/browse/SPARK-15425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sameer Agarwal
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> It is fairly easy for users to shoot themselves in the foot if they run 
> cartesian joins. Often they might not even be aware of the join methods 
> chosen. This happened to me a few times in the last few weeks.
> It would be a good idea to disable cartesian joins by default, and require 
> explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
> however might be too large of a scope for 2.0 given the timing. As a small 
> and quick fix, we can just have a single config option 
> (spark.sql.join.enableCartesian) that controls this behavior. In the future 
> we can implement the fine-grained control.
> Note that the error message should be friendly and say "Set 
> spark.sql.join.enableCartesian to true to turn on cartesian joins."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16644:
-
Component/s: SQL

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> this query fails in 2.0, but works in 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16644:
-
Target Version/s: 2.0.1  (was: 2.0.0)

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> this query fails in 2.0, but works in 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16633) lag/lead does not return the default value when the offset row does not exist

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16633:
-
Target Version/s: 2.0.1  (was: 2.0.0)

> lag/lead does not return the default value when the offset row does not exist
> -
>
> Key: SPARK-16633
> URL: https://issues.apache.org/jira/browse/SPARK-16633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Attachments: window_function_bug.html
>
>
> Please see the attached notebook. Seems lag/lead somehow fail to recognize 
> that an offset row does not exist and generate wrong results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16642:
-
Target Version/s: 2.0.1  (was: 2.0.0)

> ResolveWindowFrame should not be triggered on UnresolvedFunctions.
> --
>
> Key: SPARK-16642
> URL: https://issues.apache.org/jira/browse/SPARK-16642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The case at 
> https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792
>  is shown below
> {code}
> case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, 
> UnspecifiedFrame)) =>
>   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, 
> acceptWindowFrame = true)
>   we.copy(windowSpec = s.copy(frameSpecification = frame))
> {code}
> This case will be triggered even when the function is unresolved. So, when 
> functions like lead are used, we may see errors like {{Window Frame RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the 
> frame specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16648:
-
Target Version/s: 2.0.1  (was: 2.0.0)

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> 

[jira] [Updated] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16644:
-
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> this query fails in 2.0, but works in 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386360#comment-15386360
 ] 

Dongjoon Hyun commented on SPARK-16651:
---

I made a PR for you and other people, but I'm not sure it'll be merged.

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-07-20 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386338#comment-15386338
 ] 

Alex Bozarth commented on SPARK-16654:
--

Perhaps we can change the status column to "Blacklisted" or "Alive (Blacklisted)" instead of Alive or Dead? I'm not very familiar with how the blacklisting works, but I would be willing to learn and add this once the other PR is merged.

> UI Should show blacklisted executors & nodes
> 
>
> Key: SPARK-16654
> URL: https://issues.apache.org/jira/browse/SPARK-16654
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> SPARK-8425 will add the ability to blacklist entire executors and nodes to 
> deal w/ faulty hardware.  However, without displaying it on the UI, it can be 
> hard to realize which executor is bad, and why tasks aren't getting scheduled 
> on certain executors.
> As a first step, we should just show nodes and executors that are blacklisted 
> for the entire application (no need to show blacklisting for tasks & stages).
> This should also ensure that blacklisting events get into the event logs for 
> the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-20 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386333#comment-15386333
 ] 

Michael Allman commented on SPARK-16320:


[~maver1ck] Would it be possible for you to share your parquet file on S3? I 
would like to test with it specifically. If not publicly, could you share it 
with me privately? Thanks.

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created a script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
>
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
>
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
>
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
>
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
>
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16642:
-
Target Version/s: 2.0.0

> ResolveWindowFrame should not be triggered on UnresolvedFunctions.
> --
>
> Key: SPARK-16642
> URL: https://issues.apache.org/jira/browse/SPARK-16642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The case at 
> https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792
>  is shown below
> {code}
> case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, 
> UnspecifiedFrame)) =>
>   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, 
> acceptWindowFrame = true)
>   we.copy(windowSpec = s.copy(frameSpecification = frame))
> {code}
> This case will be triggered even when the function is unresolved. So, when 
> functions like lead are used, we may see errors like {{Window Frame RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the 
> frame specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16655) Spark thrift server application is not stopped if its in ACCEPTED stage

2016-07-20 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-16655:
--

 Summary: Spark thrift server application is not stopped if its in 
ACCEPTED stage
 Key: SPARK-16655
 URL: https://issues.apache.org/jira/browse/SPARK-16655
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Yesha Vora


When spark-thriftserver is started in yarn-client mode, it starts a yarn application.  If the yarn application is in the ACCEPTED state and a stop operation is performed on the spark thrift server, the yarn application does not get killed/stopped.

On stop, the spark thriftserver should stop the yarn application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16651:


Assignee: (was: Apache Spark)

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386317#comment-15386317
 ] 

Apache Spark commented on SPARK-16651:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14288

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16651:


Assignee: Apache Spark

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>Assignee: Apache Spark
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16650:


Assignee: Apache Spark

> Improve documentation of spark.task.maxFailures   
> 
>
> Key: SPARK-16650
> URL: https://issues.apache.org/jira/browse/SPARK-16650
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> The documentation for spark.task.maxFailures isn't clear as to whether this 
> is just the total number of task failures or a single task failing 
> multiple attempts.  
> It turns out it's the latter: a single task has to fail spark.task.maxFailures 
> attempts before it fails the job.
> We should try to make that clearer in the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386311#comment-15386311
 ] 

Apache Spark commented on SPARK-16650:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/14287

> Improve documentation of spark.task.maxFailures   
> 
>
> Key: SPARK-16650
> URL: https://issues.apache.org/jira/browse/SPARK-16650
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>
> The documentation for spark.task.maxFailures isn't clear as to whether this 
> is just the total number of task failures or a single task failing 
> multiple attempts.  
> It turns out it's the latter: a single task has to fail spark.task.maxFailures 
> attempts before it fails the job.
> We should try to make that clearer in the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16650:


Assignee: (was: Apache Spark)

> Improve documentation of spark.task.maxFailures   
> 
>
> Key: SPARK-16650
> URL: https://issues.apache.org/jira/browse/SPARK-16650
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>
> The documentation for spark.task.maxFailures isn't clear as to whether this 
> is just the total number of task failures or a single task failing 
> multiple attempts.  
> It turns out it's the latter: a single task has to fail spark.task.maxFailures 
> attempts before it fails the job.
> We should try to make that clearer in the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386295#comment-15386295
 ] 

Dongjoon Hyun edited comment on SPARK-16651 at 7/20/16 5:55 PM:


Also, in Spark 2.0 RC5, DataFrame is merged into Dataset and the documentation 
of Dataset still has that notice.

{code}
def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't 
contain existingName.
Since 2.0.0
{code}

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset


was (Author: dongjoon):
Also, in Spark 2.0 RC, Dataframe is merged into Dataset and the documentation 
of Dataset still has that notice.

{code}
def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't 
contain existingName.
Since 2.0.0
{code}

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386295#comment-15386295
 ] 

Dongjoon Hyun commented on SPARK-16651:
---

Also, in Spark 2.0 RC, Dataframe is merged into Dataset and the documentation 
of Dataset still has that notice.

{code}
def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't 
contain existingName.
Since 2.0.0
{code}

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386286#comment-15386286
 ] 

Dongjoon Hyun commented on SPARK-16651:
---

Hi, [~tomwphillips].
It's documented behavior on the Scala side since 1.3.

{code}
def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new DataFrame with a column renamed. This is a no-op if schema 
doesn't contain existingName.
{code}

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

Maybe we can update the Python API docs, but we cannot change the behavior.
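
In the meantime, a small guard on the caller's side avoids the silent no-op (just a sketch, not a proposed API change):

{code}
def with_column_renamed_strict(df, existing, new):
    if existing not in df.columns:
        raise ValueError("column '{}' not found in {}".format(existing, df.columns))
    return df.withColumnRenamed(existing, new)

df = with_column_renamed_strict(df, 'dob', 'date_of_birth')   # raises instead of silently returning df unchanged
{code}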

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16634) GenericArrayData can't be loaded in certain JVMs

2016-07-20 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16634.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0
   2.0.1

> GenericArrayData can't be loaded in certain JVMs
> 
>
> Key: SPARK-16634
> URL: https://issues.apache.org/jira/browse/SPARK-16634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> There's an annoying bug in some JVMs that causes certain scala-generated 
> bytecode to not load. The current code in GenericArrayData.scala triggers 
> that bug (at least with 1.7.0_67, maybe others).
> Since it's easy to work around the bug, I'd rather do that instead of asking 
> people who might be running that version to have to upgrade.
> Error:
> {noformat}
> 16/07/19 16:02:35 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 
> (TID 2) on executor vanzin-st1-3.vpc.cloudera.com: java.lang.VerifyError (Bad 
>  method call from inside of a branch
> Exception Details:
>   Location:
> 
> org/apache/spark/sql/catalyst/util/GenericArrayData.(Ljava/lang/Object;)V
>  @52: invokespecial
>   Reason:
> Error exists in the bytecode
>   Bytecode:
> 000: 2a2b 4d2c c100 dc99 000e 2cc0 00dc 4e2d
> 010: 3a04 a700 20b2 0129 2c04 b601 2d99 001b
> 020: 2c3a 05b2 007a 1905 b600 7eb9 00fe 0100
> 030: 3a04 1904 b700 f3b1 bb01 2f59 2cb7 0131
> 040: bf 
>   Stackmap Table:
> 
> full_frame(@21,{UninitializedThis,Object[#177],Object[#177]},{UninitializedThis})
> 
> full_frame(@50,{UninitializedThis,Object[#177],Object[#177],Top,Object[#220]},{UninitializedThis})
> 
> full_frame(@56,{UninitializedThis,Object[#177],Object[#177]},{UninitializedThis})
> ) [duplicate 2]
> {noformat}
> I didn't run into this with 2.0, not sure whether the issue exists there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-07-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386242#comment-15386242
 ] 

Imran Rashid commented on SPARK-16654:
--

cc [~tgraves]

> UI Should show blacklisted executors & nodes
> 
>
> Key: SPARK-16654
> URL: https://issues.apache.org/jira/browse/SPARK-16654
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> SPARK-8425 will add the ability to blacklist entire executors and nodes to 
> deal w/ faulty hardware.  However, without displaying it on the UI, it can be 
> hard to realize which executor is bad, and why tasks aren't getting scheduled 
> on certain executors.
> As a first step, we should just show nodes and executors that are blacklisted 
> for the entire application (no need to show blacklisting for tasks & stages).
> This should also ensure that blacklisting events get into the event logs for 
> the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-07-20 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-16654:


 Summary: UI Should show blacklisted executors & nodes
 Key: SPARK-16654
 URL: https://issues.apache.org/jira/browse/SPARK-16654
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Web UI
Affects Versions: 2.0.0
Reporter: Imran Rashid


SPARK-8425 will add the ability to blacklist entire executors and nodes to deal 
w/ faulty hardware.  However, without displaying it on the UI, it can be hard 
to realize which executor is bad, and why tasks aren't getting scheduled on 
certain executors.

As a first step, we should just show nodes and executors that are blacklisted 
for the entire application (no need to show blacklisting for tasks & stages).

This should also ensure that blacklisting events get into the event logs for 
the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-07-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-15951:
--
Assignee: Kishor Patil

> Change Executors Page to use datatables to support sorting columns and 
> searching
> 
>
> Key: SPARK-15951
> URL: https://issues.apache.org/jira/browse/SPARK-15951
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>Assignee: Kishor Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Support column sorting and search for the Executors page using jQuery DataTables 
> and the REST API. Before this change, the executors page was generated as hard-coded 
> HTML and did not support search; sorting was also disabled if there was 
> any application that had more than one attempt. Supporting search and sort 
> (over all applications rather than the 20 entries on the current page) in any 
> case will greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-07-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-15951.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Change Executors Page to use datatables to support sorting columns and 
> searching
> 
>
> Key: SPARK-15951
> URL: https://issues.apache.org/jira/browse/SPARK-15951
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Support column sorting and search for the Executors page using jQuery DataTables 
> and the REST API. Before this change, the executors page was generated as hard-coded 
> HTML and did not support search; sorting was also disabled if there was 
> any application that had more than one attempt. Supporting search and sort 
> (over all applications rather than the 20 entries on the current page) in any 
> case will greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386235#comment-15386235
 ] 

Thomas Graves commented on SPARK-2666:
--

I think eventually adding prestart (a MapReduce slowstart-type setting) makes sense.  This is actually why I didn't change the map output statuses to go along with task launch; I wanted to be able to do this or get incremental map output status results.

But as far as keeping the remaining tasks running, I think it depends on the behavior, and I haven't had time to go look in more detail.

If the stage fails, what tasks does it rerun:

1) does it rerun all the ones that have not succeeded yet in the failed stage (including the ones that could still be running)?
2) does it only rerun the failed ones and wait for the ones still running in the failed stage?  If they succeed it uses those results.

From what I saw with this job I thought it was acting like number 1 above. The only use of leaving the ones running is to see if they get FetchFailures; this seems like a lot of overhead to find that out if the task takes a long time.

When a fetch failure happens, does the schedule re-run all maps that had run on 
that node or just the ones specifically mentioned by the fetch failure?  Again 
I thought it was just the specific map that the fetch failure failed to get, 
thus why it needs to know if the other reducers get fetch failures.

I can kind of understand letting them run to see if they hit fetch failures as 
well but on a large job or with tasks that take a long time, if we aren't  
counting them as success then its more a waste of resources and just extends 
the job time as well as confuses the user since the UI doesn't represent those 
still running.

 In the case i was seeing my tasks took roughly an hour.  One stage failed so 
it restarted that stage, but since it didn't kill the tasks from the original 
stage it had very few executors open to run new ones, thus the job took a lot 
longer then it should.   I don't remember the exact cause of the failures 
anymore.

Anyway I think the results are going to vary a lot based on the type of job and 
length of each stage (map vs reduce). 

personally I think it would be better to change to fail all maps that ran on 
the host it failed to fetch from and kill the rest of the running reducers in 
that stage. But I would have to investigate the code more to fully understand.



> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled; however, it does not cancel the currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the 
> behavior a little clearer to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * Marking a stage as zombie before all of its tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a) the stage attempt 
> has failed, but in case (b) we are not canceling b/c of a failure, rather just 
> b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}}s
> * Testing this *might* benefit from SPARK-10372
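A minimal sketch of the {{markZombie}} idea from the implementation notes above. 
Every name here ({{MyTaskSetManager}}, {{MySchedulerBackend}}, 
{{runningTaskIds}}) is a hypothetical stand-in rather than Spark's actual 
scheduler internals; the point is only that marking the task set as zombie and 
canceling its running tasks become one operation, and that 
{{UnsupportedOperationException}} is swallowed for backends that don't implement 
{{killTask}}.

{code}
trait MySchedulerBackend {
  def killTask(taskId: Long): Unit
}

class MyTaskSetManager(backend: MySchedulerBackend) {
  private var zombie = false
  private val runningTaskIds = scala.collection.mutable.Set[Long]()

  def isZombie: Boolean = zombie

  // Zombie-marking and task cancellation always happen together.
  def markZombie(): Unit = {
    if (!zombie) {
      zombie = true
      runningTaskIds.foreach { taskId =>
        // Backends are free to not implement killTask.
        try backend.killTask(taskId)
        catch { case _: UnsupportedOperationException => () }
      }
      runningTaskIds.clear()
    }
  }
}
{code}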



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order

2016-07-20 Thread Wade Salazar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386225#comment-15386225
 ] 

Wade Salazar commented on SPARK-15918:
--

Is the stance on this issue that, since other SQL interpreters require both 
SELECT statements to list columns in the same order, Spark will follow suit?  
Though this seems relatively innocuous, it leads to a considerable amount of 
troubleshooting when the behavior is not expected.  Can we request, as an 
improvement to Spark SQL, that the requirement to order the columns of each 
SELECT statement identically be eliminated?
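A workaround sketch under the current behavior, assuming a spark-shell session 
(so {{sqlContext}} and its implicits are in scope) and two DataFrames with the 
same columns in different orders, as in the repro quoted below: align one side's 
column order programmatically before the positional unionAll instead of listing 
every column by hand.

{code}
import org.apache.spark.sql.functions.col
import sqlContext.implicits._

val dfA = Seq((1, "a"), (2, "b")).toDF("id", "name")
val dfB = Seq(("c", 3), ("d", 4)).toDF("name", "id")  // same columns, different order

// Reorder dfB's columns to match dfA's before the positional union.
val dfBAligned = dfB.select(dfA.columns.map(col): _*)
val unioned = dfA.unionAll(dfBAligned)
unioned.show()
{code}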

> unionAll returns wrong result when two dataframes has schema in different 
> order
> ---
>
> Key: SPARK-15918
> URL: https://issues.apache.org/jira/browse/SPARK-15918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: CentOS
>Reporter: Prabhu Joseph
>
> On applying the unionAll operation between dataframes A and B, which have the 
> same schema but with the columns in a different order, the result has the 
> column-to-value mapping changed.
> Repro:
> {code}
> A.show()
> +---++---+--+--+-++---+--+---+---+-+
> |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---++---+--+--+-++---+--+---+---+-+
> +---++---+--+--+-++---+--+---+---+-+
> B.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT718.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT703Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUR716A.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT803Z.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT728.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUR806.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> +-+---+--+---+---+--+--+--+---+---+--++
> A = A.unionAll(B)
> A.show()
> +---+---+--+--+--+-++---+--+---+---+-+
> |tag|   year_day|   
> tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---+---+--+--+--+-++---+--+---+---+-+
> |  F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> +---+---+--+--+--+-++---+--+---+---+-+
> {code}
> On changing the column order of A to match B's, unionAll works fine:
> {code}
> C = 
> A.select("dtype","tag","time","tm_hour","tm_mday","tm_min","tm_mon","tm_sec","tm_yday","tm_year","value","year_day")
> A = C.unionAll(B)
> A.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |

[jira] [Updated] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-07-20 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-15968:
---
Fix Version/s: 2.0.0

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>  Labels: hive, metastore
> Fix For: 2.0.0, 2.1.0
>
>
> The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
> {{pathsInMetastore}} from the metastore relation's catalog table. This only 
> returns the table base path, which is not correct for non-empty partitioned 
> tables. As a result, cache lookups on non-empty partitioned tables always 
> miss and these relations are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server

2016-07-20 Thread Rahul Bhatia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386192#comment-15386192
 ] 

Rahul Bhatia commented on SPARK-13767:
--

I'm seeing the error that Venkata showed as well. If anyone has any thoughts on 
why that would occur, I'd really appreciate it. 

Thanks, 

> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> 
>
> Key: SPARK-13767
> URL: https://issues.apache.org/jira/browse/SPARK-13767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Poonam Agrawal
>
> I am trying to create a SparkContext object with the following commands in 
> pyspark:
> from pyspark import SparkContext, SparkConf
> conf = 
> SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host',
>  'cassandra-machine-ip').set('spark.storage.memoryFraction', 
> '0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', 
> 500).set('spark.serializer', 
> 'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', 
> 'FAIR').set('spark.mesos.coarse', 'true')
> sc = SparkContext(conf=conf)
> but I am getting the following error:
> Traceback (most recent call last):
> File "", line 1, in 
> File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in 
> __init__
>   self._jconf = _jvm.SparkConf(loadDefaults)
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 766, in __getattr__
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 362, in send_command
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 318, in _get_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 325, in _create_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 432, in start
> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> I am getting the same error executing the command : conf = 
> SparkConf().setAppName("App_name").setMaster("spark://127.0.0.1:7077")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16613) RDD.pipe returns values for empty partitions

2016-07-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16613.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.1.0
   2.0.1

> RDD.pipe returns values for empty partitions
> 
>
> Key: SPARK-16613
> URL: https://issues.apache.org/jira/browse/SPARK-16613
> Project: Spark
>  Issue Type: Bug
>Reporter: Alex Krasnyansky
>Assignee: Sean Owen
> Fix For: 2.0.1, 2.1.0
>
>
> Suppose we have the following Spark code:
> {code}
> object PipeExample {
>   def main(args: Array[String]) {
> val fstRdd = sc.parallelize(List("hi", "hello", "how", "are", "you"))
> val pipeRdd = 
> fstRdd.pipe("/Users/finkel/spark-pipe-example/src/main/resources/len.sh")
> pipeRdd.collect.foreach(println)
>   }
> }
> {code}
> It uses a bash script to convert a string to its length.
> {code}
> #!/bin/sh
> read input
> len=${#input}
> echo $len
> {code}
> So far so good, but when I run the code, it prints incorrect output. For 
> example:
> {code}
> 0
> 2
> 0
> 5
> 3
> 0
> 3
> 3
> {code}
> I expect to see
> {code}
> 2
> 5
> 3
> 3
> 3
> {code}
> which is the correct output for the app. I think this is a bug: only positive 
> integers are expected, with no zeros.
> Environment:
> 1. Spark version is 1.6.2
> 2. Scala version is 2.11.6
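The zeros line up with empty partitions: the script is still launched once per 
partition, reads nothing, and echoes a length of 0 (the 8 output lines with 3 
zeros suggest 8 partitions, 3 of them empty). A workaround sketch, not a fix for 
the underlying behavior, assuming the same five-element input: pin the number of 
partitions so none are empty.

{code}
// Request as many partitions as there are elements, so every partition
// feeds at least one line to the piped script.
val fstRdd = sc.parallelize(List("hi", "hello", "how", "are", "you"), 5)
val pipeRdd = fstRdd.pipe("/path/to/len.sh")  // path is illustrative
pipeRdd.collect.foreach(println)
{code}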



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-07-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15968:

Fix Version/s: (was: 2.1.0)

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>  Labels: hive, metastore
> Fix For: 2.0.0
>
>
> The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
> {{pathsInMetastore}} from the metastore relation's catalog table. This only 
> returns the table base path, which is not correct for non-empty partitioned 
> tables. As a result, cache lookups on non-empty partitioned tables always 
> miss and these relations are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16653:


Assignee: Apache Spark

> Make convergence tolerance param in ANN default value consistent with other 
> algorithm using LBFGS
> -
>
> Key: SPARK-16653
> URL: https://issues.apache.org/jira/browse/SPARK-16653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Optimizer
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The default value of the convergence tolerance param in ANN is 1e-4,
> but for other algorithms that use the LBFGS optimizer (such as LinearRegression, 
> LogisticRegression, and so on) this param's default value is 1e-6.
> I think making them the same would be better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386028#comment-15386028
 ] 

Apache Spark commented on SPARK-16653:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/14286

> Make convergence tolerance param in ANN default value consistent with other 
> algorithm using LBFGS
> -
>
> Key: SPARK-16653
> URL: https://issues.apache.org/jira/browse/SPARK-16653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Optimizer
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The default value of the convergence tolerance param in ANN is 1e-4,
> but for other algorithms that use the LBFGS optimizer (such as LinearRegression, 
> LogisticRegression, and so on) this param's default value is 1e-6.
> I think making them the same would be better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16653:


Assignee: (was: Apache Spark)

> Make convergence tolerance param in ANN default value consistent with other 
> algorithm using LBFGS
> -
>
> Key: SPARK-16653
> URL: https://issues.apache.org/jira/browse/SPARK-16653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Optimizer
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The default value of the convergence tolerance param in ANN is 1e-4,
> but for other algorithms that use the LBFGS optimizer (such as LinearRegression, 
> LogisticRegression, and so on) this param's default value is 1e-6.
> I think making them the same would be better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-16653:
---
Component/s: Optimizer
 ML

> Make convergence tolerance param in ANN default value consistent with other 
> algorithm using LBFGS
> -
>
> Key: SPARK-16653
> URL: https://issues.apache.org/jira/browse/SPARK-16653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Optimizer
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The default value of the convergence tolerance param in ANN is 1e-4,
> but for other algorithms that use the LBFGS optimizer (such as LinearRegression, 
> LogisticRegression, and so on) this param's default value is 1e-6.
> I think making them the same would be better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-16653:
--

 Summary: Make convergence tolerance param in ANN default value 
consistent with other algorithm using LBFGS
 Key: SPARK-16653
 URL: https://issues.apache.org/jira/browse/SPARK-16653
 Project: Spark
  Issue Type: Improvement
Reporter: Weichen Xu


The default value of the convergence tolerance param in ANN is 1e-4, but for 
other algorithms that use the LBFGS optimizer (such as LinearRegression, 
LogisticRegression, and so on) this param's default value is 1e-6.

I think making them the same would be better.
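For illustration, a sketch of what aligning on 1e-6 looks like from the user 
side today, assuming the Scala ML API ({{setTol}} is already exposed on 
{{MultilayerPerceptronClassifier}}; the layer sizes below are illustrative):

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 8, 3))  // input, hidden, output sizes (illustrative only)
  .setMaxIter(100)
  .setTol(1e-6)               // explicitly match the 1e-6 default of the other LBFGS-based algorithms
{code}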



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3210) Flume Polling Receiver must be more tolerant to connection failures.

2016-07-20 Thread Ian Brooks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386002#comment-15386002
 ] 

Ian Brooks commented on SPARK-3210:
---

Hi,

I have also noticed a resiliency issue with the Flume polling receiver. The 
issue I see is as follows:

1. Start the Flume agent, then the Spark application.
2. The Spark application correctly connects to the Flume agent and can receive 
data that is sent to Flume.
3. Restart Flume.
4. The Spark application doesn't detect that Flume has been restarted, and so it 
never reconnects, preventing the Spark application from receiving any more data 
until it is restarted.

I've had a trawl through the documentation and source code for 
FlumeUtils.createPollingStream but couldn't see any way to test the connection 
and reconnect if needed.

-Ian 
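For context, a minimal sketch of the polling setup being discussed, assuming 
spark-streaming-flume on the classpath and a Flume agent exposing a Spark sink 
on flume-host:9988 (host and port are illustrative). The stream created this way 
is the one that currently has no reconnect path after the agent restarts.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("flume-polling-example")
val ssc = new StreamingContext(conf, Seconds(10))

// Pulls events from the agent's SparkSink; if the agent restarts,
// this receiver does not detect it and never reconnects.
val events = FlumeUtils.createPollingStream(ssc, "flume-host", 9988)
events.count().print()

ssc.start()
ssc.awaitTermination()
{code}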

> Flume Polling Receiver must be more tolerant to connection failures.
> 
>
> Key: SPARK-3210
> URL: https://issues.apache.org/jira/browse/SPARK-3210
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385984#comment-15385984
 ] 

Daniel Barclay commented on SPARK-16652:


More info.:

Having the {{String}} member isn't necessary to trigger the bug.

Maybe any kind of list fails:  {{List\[Long]}}, {{List\[Int]}}, 
{{List\[Double]}}, {{List\[Any]}}, {{List\[AnyRef]}} and {{List\[String]}} all 
fail.

> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (at 
> least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Barclay updated SPARK-16652:
---
Description: 
Generating and writing out a {{Dataset}} of a class that has a {{List}} (at 
least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
crash.

The crash seems to be related to unsafe memory access especially because 
earlier code (before I got it reduced to the current bug test case)  reported 
"{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
operation in compiled Java code}}".



  was:
Generating and writing out a {{Dataset}} of a class that has a {{List}} (or at 
least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
crash.

The crash seems to be related to unsafe memory access especially because 
earlier code (before I got it reduced to the current bug test case)  reported 
"{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
operation in compiled Java code}}".




> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (at 
> least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385966#comment-15385966
 ] 

Daniel Barclay commented on SPARK-16652:


SPARK-3947 reports the same InternalError message that I saw in trimming down 
my test case for SPARK-16652, although the stack traces are much different.

> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (or 
> at least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Barclay updated SPARK-16652:
---
Attachment: UnsafeAccessCrashBugTest.scala

> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (or 
> at least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)
Daniel Barclay created SPARK-16652:
--

 Summary: JVM crash from unsafe memory access for Dataset of class 
with List[Long]
 Key: SPARK-16652
 URL: https://issues.apache.org/jira/browse/SPARK-16652
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.2, 1.6.1
 Environment: Scala 2.10.
JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
MacOs 10.11.2
Reporter: Daniel Barclay


Generating and writing out a {{Dataset}} of a class that has a {{List}} (or at 
least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
crash.

The crash seems to be related to unsafe memory access especially because 
earlier code (before I got it reduced to the current bug test case)  reported 
"{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
operation in compiled Java code}}".
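A hypothetical minimal shape of the failing pattern described above (a sketch, 
not the attached UnsafeAccessCrashBugTest.scala), assuming Spark 1.6 with a 
{{sqlContext}} in scope, e.g. in spark-shell:

{code}
// A case class with a List[Long] member and a String member, as described above.
case class Record(name: String, values: List[Long])

import sqlContext.implicits._

val ds = Seq(Record("a", List(1L, 2L, 3L)), Record("b", List(4L))).toDS()

// Writing the Dataset out is where the unsafe-memory-access crash was reported.
ds.toDF().write.parquet("/tmp/spark-16652-records")
{code}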





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Tom Phillips (JIRA)
Tom Phillips created SPARK-16651:


 Summary: No exception using DataFrame.withColumnRenamed when 
existing column doesn't exist
 Key: SPARK-16651
 URL: https://issues.apache.org/jira/browse/SPARK-16651
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.0
Reporter: Tom Phillips


The {{withColumnRenamed}} method does not raise an exception when the existing 
column does not exist in the dataframe.

Example:

{code}
In [4]: df.show()
+---+-+
|age| name|
+---+-+
|  1|Alice|
+---+-+


In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')

In [6]: df.show()
+---+-+
|age| name|
+---+-+
|  1|Alice|
+---+-+
{code}
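Until that changes, a small guard is one way to fail fast. A sketch 
({{renameOrFail}} is a hypothetical helper, not part of the API), written 
against the Scala API, which PySpark delegates to and which shows the same 
silent behavior:

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: raise an error instead of silently returning the
// DataFrame unchanged when the column to rename does not exist.
def renameOrFail(df: DataFrame, existing: String, newName: String): DataFrame = {
  require(df.columns.contains(existing),
    s"Column '$existing' not found; available columns: ${df.columns.mkString(", ")}")
  df.withColumnRenamed(existing, newName)
}
{code}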



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-16650:
-

 Summary: Improve documentation of spark.task.maxFailures   
 Key: SPARK-16650
 URL: https://issues.apache.org/jira/browse/SPARK-16650
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.6.2
Reporter: Thomas Graves


The documentation for spark.task.maxFailures isn't clear as to whether this is 
the total number of task failures across the job or a single task failing 
multiple attempts.

It turns out it's the latter: a single task has to fail spark.task.maxFailures 
attempts before it fails the job.

We should try to make that clearer in the docs.
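For reference, the limit applies per task, counting that task's attempts. A 
minimal sketch of raising it (the value 8 is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("max-failures-example")
  // A single task may fail up to 8 attempts before the whole job is failed.
  .set("spark.task.maxFailures", "8")

val sc = new SparkContext(conf)
{code}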



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread liancheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liancheng updated SPARK-16648:
--
Description: 
The following simple SQL query reproduces this issue:

{code:sql}
SELECT LAST_VALUE(FALSE) OVER ();
{code}

Exception thrown:

{noformat}
java.lang.IndexOutOfBoundsException: 0
  at 
scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
  at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
  at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
  at 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
  at 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:79)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:78)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 

[jira] [Commented] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread liancheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385886#comment-15385886
 ] 

liancheng commented on SPARK-16648:
---

The problematic code is the newly introduced {{TreeNode.withNewChildren}}. 
{{Last}} is a unary expression with two {{Expression}} arguments:

{code}
case class Last(child: Expression, ignoreNullsExpr: Expression) extends 
DeclarativeAggregate {
  ...
  override def children: Seq[Expression] = child :: Nil
  ...
}
{code}

Argument {{ignoreNullsExpr}} defaults to {{Literal.FalseLiteral}}. Thus 
{{LAST_VALUE(FALSE)}} is equivalent to {{Last(Literal.FalseLiteral, 
Literal.FalseLiteral)}}. This breaks the following case branch in 
{{TreeNode.withNewChildren}}:

{code}
case arg: TreeNode[_] if containsChild(arg) =>// Both `child` and 
`ignoreNullsExpr` hit this branch,
  val newChild = remainingNewChildren.remove(0)   // but only `child` is the 
real child node of `Last`.
  val oldChild = remainingOldChildren.remove(0)
  if (newChild fastEquals oldChild) {
oldChild
  } else {
changed = true
newChild
  }
{code}
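A self-contained toy illustration of the counting mismatch (plain Scala, not 
Catalyst code): because both constructor arguments are equal to the single 
declared child, the child check matches twice, but only one replacement child 
was supplied, so the second {{remove(0)}} throws the reported 
{{IndexOutOfBoundsException}}.

{code}
import scala.collection.mutable.ArrayBuffer

case class Node(value: String)

val child           = Node("FALSE")
val ignoreNullsExpr = Node("FALSE")                // equal to `child`, like Literal.FalseLiteral

val containsChild   = Set[Node](child)             // what `children.toSet` reports
val constructorArgs = Seq(child, ignoreNullsExpr)  // what productIterator yields

val remainingNewChildren = ArrayBuffer(Node("CAST(FALSE AS BOOLEAN)"))  // one new child supplied

constructorArgs.foreach { arg =>
  if (containsChild(arg)) {
    // Matches for both arguments; the second call throws
    // java.lang.IndexOutOfBoundsException: 0
    remainingNewChildren.remove(0)
  }
}
{code}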


> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> 

[jira] [Assigned] (SPARK-16629) UDTs can not be compared to DataTypes in dataframes.

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16629:


Assignee: (was: Apache Spark)

> UDTs can not be compared to DataTypes in dataframes.
> 
>
> Key: SPARK-16629
> URL: https://issues.apache.org/jira/browse/SPARK-16629
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>
> Currently UDTs cannot be compared to DataTypes even if their sqlTypes match. 
> This leads to errors like this: 
> {code}
> In [12]: filtered = df.filter(df['udt_time'] > threshold)
> ---
> AnalysisException Traceback (most recent call last)
> /Users/franklyndsouza/dev/starscream/bin/starscream in ()
> > 1 thresholded = df.filter(df['udt_time'] > threshold)
> AnalysisException: u"cannot resolve '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' due to data typ mismatch: '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' requires (boolean or tinyint or smallint or int or bigint or 
> float or double or decimal or timestamp or date or string or binary) type, 
> not pythonuserdefined"
> {code}
> i've proposed a fix for this here https://github.com/apache/spark/pull/14164



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


