[jira] [Assigned] (SPARK-16639) query fails if having condition contains grouping column
[ https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16639:
------------------------------------

    Assignee: Apache Spark

> query fails if having condition contains grouping column
>
>                 Key: SPARK-16639
>                 URL: https://issues.apache.org/jira/browse/SPARK-16639
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
>
> {code}
> create table tbl(a int, b string);
> select count(b) from tbl group by a + 1 having a + 1 = 2;
> {code}
> this will fail analysis

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16639) query fails if having condition contains grouping column
[ https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16639:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16639) query fails if having condition contains grouping column
[ https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387176#comment-15387176 ]

Apache Spark commented on SPARK-16639:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/14296
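The query in SPARK-16639 groups by the expression `a + 1` and then references that same expression in the HAVING clause; per standard SQL this should resolve against the grouping expression rather than fail analysis. A minimal sketch of the expected semantics using Python's built-in `sqlite3` (SQLite, not Spark; the sample rows are invented for illustration):

```python
import sqlite3

# In-memory stand-in for the table from the bug report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)",
                 [(1, "x"), (1, "y"), (2, "z")])

# HAVING references the grouping expression a + 1 -- valid SQL,
# and SQLite accepts it where Spark 2.0's analyzer rejected it.
rows = conn.execute(
    "SELECT count(b) FROM tbl GROUP BY a + 1 HAVING a + 1 = 2"
).fetchall()
print(rows)  # [(2,)] -- the a = 1 group has two rows
```

The group with `a + 1 = 2` contains the two `a = 1` rows, so the single surviving group reports `count(b) = 2`.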
[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387172#comment-15387172 ]

Cheng Lian commented on SPARK-16646:
------------------------------------

Thanks for the help! I'm not working on this.

> LEAST doesn't accept numeric arguments with different data types
>
>                 Key: SPARK-16646
>                 URL: https://issues.apache.org/jira/browse/SPARK-16646
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Hyukjin Kwon
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should all have the same type, got LEAST (ArrayBuffer(IntegerType, DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works in 1.6.
[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-16646:
-------------------------------

    Reporter: Cheng Lian  (was: liancheng)
[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-16646:
-------------------------------

    Assignee: Hyukjin Kwon
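SPARK-16646 is a type-coercion regression: `LEAST` in Spark 2.0.0 refused to compare an integer with a decimal, whereas most SQL engines implicitly widen mixed numeric arguments to a common type. A sketch of the expected behavior using SQLite's scalar `min()`, which plays the role of `LEAST` here (SQLite, not Spark):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Mixed INTEGER and REAL arguments are compared under numeric affinity
# instead of raising a type-mismatch error.
(least,) = conn.execute("SELECT min(1, 1.5)").fetchone()
print(least)  # 1
```

This mirrors what the reporter observed in Spark 1.6, where the same mixed-type query resolved without error.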
[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-16648:
-------------------------------

    Reporter: Cheng Lian  (was: liancheng)

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
>
>                 Key: SPARK-16648
>                 URL: https://issues.apache.org/jira/browse/SPARK-16648
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>     at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>     at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>     at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>     at org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>     at org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>     at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.immutable.List.foreach(List.scala:381)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at scala.collection.immutable.List.map(List.scala:285)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>     at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>     at
[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-16648:
--------------------------------

    Assignee: Cheng Lian
[jira] [Assigned] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16648:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387162#comment-15387162 ]

Apache Spark commented on SPARK-16648:
--------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/14295
[jira] [Assigned] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16648:
------------------------------------

    Assignee: (was: Apache Spark)
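The SPARK-16648 stack trace points at `TypeCoercion.ImplicitTypeCasts` calling `TreeNode.withNewChildren`, which pops one replacement per existing child out of an `ArrayBuffer` positionally; if a rule supplies fewer replacement children than the node actually has (plausible for the zero-argument frame of `LAST_VALUE(FALSE) OVER ()`), the buffer underflows with `IndexOutOfBoundsException: 0`. A simplified Python sketch of that failure mode (toy class and field names, not Spark's actual implementation):

```python
class TreeNode:
    """Toy expression-tree node modeling Catalyst's TreeNode."""

    def __init__(self, name, children):
        self.name = name
        self.children = children

    def with_new_children(self, new_children):
        # Mirrors withNewChildren: consume one replacement per existing
        # child, positionally. Too few replacements -> buffer underflow,
        # analogous to ArrayBuffer.remove throwing IndexOutOfBoundsException.
        buf = list(new_children)
        replaced = [buf.pop(0) for _ in self.children]
        return TreeNode(self.name, replaced)


node = TreeNode("WindowExpression", ["lastValue", "windowSpec"])
try:
    # A rewrite rule that produced only one replacement child:
    node.with_new_children([TreeNode("cast", [])])
except IndexError as e:
    print("underflow:", e)
```

The fix direction suggested by the trace is to make the coercion rule produce a replacement list whose length matches the node's child count.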
[jira] [Comment Edited] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387145#comment-15387145 ] Lianhui Wang edited comment on SPARK-2666 at 7/21/16 4:38 AM: -- Thanks. I think what [~irashid] said is more about the non-external shuffle, but in our use cases we usually use the YARN external shuffle service. For 0 --> 1 --> 2, if it hits a shuffle fetch failure while running stage 2, say on executor A, it needs to regenerate the map output for stage 1 that was on executor A, but it does not rerun stage 0 on executor A. So I think we can first handle FetchFailed on the YARN external shuffle service (maybe connection timeout, out of memory, etc.); I think many users have met FetchFailed on the YARN external shuffle service. As [~tgraves] said before, currently if a stage fails because of a FetchFailed, it reruns 1) all the tasks not yet succeeded in the failed stage (including ones that could still be running), which causes many duplicate running tasks in the failed stage: once there is a FetchFailed, it reruns all the unsuccessful tasks of the failed stage. For now, I think our first target is: for the YARN external shuffle service, if a stage fails because of a FetchFailed, we should decrease the number of rerun tasks in the failed stage. As I pointed out before, the best way is to do what MapReduce does and just resubmit the map stage of the failed stage: 1. When a FetchFailed happens in a task, the task is not failed and continues to fetch the other results; it just reports the ShuffleBlockId of the FetchFailed to the DAGScheduler. The other running tasks of this stage do the same. 2. The DAGScheduler receives the ShuffleBlockId of the FetchFailed and resubmits the map task for that ShuffleBlockId. Once that task finishes, it registers its map output with the MapOutputTracker. 3. The task that hit the FetchFailed polls the MapOutputTracker for the missing map output on every heartbeat. Once step 2 is finished, the task gets the missing map output and fetches those results. But there is a deadlock if the task of step 2 cannot run because there is no slot for it; in that situation we should kill some running tasks to make room. In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 did 2) it only reruns the failed tasks and waits for the ones still running in the failed stage. The disadvantage of SPARK-14649 is that the other running tasks of the failed stage may need a long time to rerun when they spend time fetching the other results.
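The three-step proposal above can be sketched as a toy simulation (hypothetical names and a deliberately simplified model, not Spark's actual DAGScheduler/MapOutputTracker API):

```python
# Toy simulation of the proposed FetchFailed handling: a reducer task that
# hits a fetch failure keeps running, reports only the missing shuffle block,
# and the scheduler regenerates just that one map output.

class MapOutputTracker:
    """Simplified stand-in: maps shuffle block id -> map output location."""
    def __init__(self, outputs):
        self.outputs = dict(outputs)

    def get(self, block_id):
        return self.outputs.get(block_id)

    def register(self, block_id, location):
        self.outputs[block_id] = location


class Scheduler:
    """Step 2: resubmit only the map task for the reported block."""
    def __init__(self, tracker):
        self.tracker = tracker
        self.resubmitted = []

    def report_fetch_failed(self, block_id):
        self.resubmitted.append(block_id)
        # Rerun the single map task and register its fresh output.
        self.tracker.register(block_id, "executor-B")


def run_reducer(block_ids, tracker, scheduler):
    """Steps 1 and 3: fetch all blocks, reporting (not failing on) misses,
    then retry the missing ones after the scheduler regenerates them."""
    fetched, missing = [], []
    for b in block_ids:
        loc = tracker.get(b)
        if loc is None:
            missing.append(b)               # step 1: report, keep going
            scheduler.report_fetch_failed(b)
        else:
            fetched.append((b, loc))
    for b in missing:                       # step 3: retry after regeneration
        loc = tracker.get(b)
        assert loc is not None, "deadlock case: map task never got a slot"
        fetched.append((b, loc))
    return fetched


tracker = MapOutputTracker({"shuffle_0_0": "executor-A", "shuffle_0_2": "executor-C"})
scheduler = Scheduler(tracker)
result = run_reducer(["shuffle_0_0", "shuffle_0_1", "shuffle_0_2"], tracker, scheduler)
print(sorted(b for b, _ in result))   # all three blocks end up fetched
print(scheduler.resubmitted)          # only the missing map output was rerun
```

The deadlock noted above shows up in this model as the inner assert: if the scheduler cannot run the resubmitted map task, step 3 never succeeds.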
[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387145#comment-15387145 ] Lianhui Wang commented on SPARK-2666: - I think what [~irashid] said is more about the non-external shuffle, but in our use cases we usually use the YARN external shuffle service. For 0 --> 1 --> 2, if it hits a shuffle fetch failure while running stage 2, say on executor A, it needs to regenerate the map output for stage 1 that was on executor A, but it does not rerun stage 0 on executor A. So I think we can first handle FetchFailed on the YARN external shuffle service (maybe connection timeout, out of memory, etc.); I think many users have met FetchFailed on the YARN external shuffle service. As [~tgraves] said before, currently if a stage fails because of a FetchFailed, it reruns 1) all the tasks not yet succeeded in the failed stage (including ones that could still be running), which causes many duplicate running tasks in the failed stage: once there is a FetchFailed, it reruns all the unsuccessful tasks of the failed stage. For now, I think our first target is: for the YARN external shuffle service, if a stage fails because of a FetchFailed, we should decrease the number of rerun tasks in the failed stage. As I pointed out before, the best way is to do what MapReduce does and just resubmit the map stage of the failed stage: 1. When a FetchFailed happens in a task, the task is not failed and continues to fetch the other results; it just reports the ShuffleBlockId of the FetchFailed to the DAGScheduler. The other running tasks of this stage do the same. 2. The DAGScheduler receives the ShuffleBlockId of the FetchFailed and resubmits the map task for that ShuffleBlockId. Once that task finishes, it registers its map output with the MapOutputTracker. 3. The task that hit the FetchFailed polls the MapOutputTracker for the missing map output on every heartbeat. Once step 2 is finished, the task gets the missing map output and fetches those results. But there is a deadlock if the task of step 2 cannot run because there is no slot for it; in that situation we should kill some running tasks to make room. In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 did 2) it only reruns the failed tasks and waits for the ones still running in the failed stage. The disadvantage of SPARK-14649 is that the other running tasks of the failed stage may need a long time to rerun when they spend time fetching the other results. > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled, however it does not cancel all currently running tasks. We should > cancel all running tasks to avoid wasting resources (and also to make the behavior > a little more clear to the end user). Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie. > Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something. 
> * Marking a stage as zombie before all tasks have completed does *not* > necessarily mean the stage attempt has failed. In case (a), the stage > attempt has failed, but in case (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want e.g. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}}s > * Testing this *might* benefit from SPARK-10372 -- This message was sent by Atlassian JIRA
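A minimal sketch of the refactoring described in the implementation notes above (hypothetical names; the real TaskSetManager is far more involved), showing how marking zombie and cancelling running tasks become one operation:

```python
# Sketch: couple "mark as zombie" with "cancel running tasks" in a single
# method, so no code path can do one without the other.

class TaskSetManager:
    def __init__(self, backend_supports_kill=True):
        self._zombie = False            # formerly the public isZombie flag
        self.running = {1, 2, 3}        # ids of tasks still running
        self.killed = []
        self._supports_kill = backend_supports_kill

    @property
    def is_zombie(self):
        return self._zombie

    def mark_zombie(self):
        """The only way to become a zombie: also cancels running tasks."""
        self._zombie = True
        for tid in sorted(self.running):
            try:
                self._kill_task(tid)
            except NotImplementedError:
                # SchedulerBackends are free not to implement killTask,
                # so a missing implementation must not fail the scheduler.
                break

    def _kill_task(self, tid):
        if not self._supports_kill:
            raise NotImplementedError("backend does not implement killTask")
        self.running.discard(tid)
        self.killed.append(tid)


tsm = TaskSetManager()
tsm.mark_zombie()
print(tsm.is_zombie, tsm.killed)   # True [1, 2, 3]
```

Note that, as the description says, this method deliberately carries none of the failure side effects of cancelTasks, so it is safe to call in case (b) where nothing has failed.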
[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387067#comment-15387067 ] Hyukjin Kwon commented on SPARK-16646: -- Maybe I should have asked first whether you are working on this. I will close mine if you are! > LEAST doesn't accept numeric arguments with different data types > > > Key: SPARK-16646 > URL: https://issues.apache.org/jira/browse/SPARK-16646 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: liancheng > > {code:sql} > SELECT LEAST(1, 1.5); > {code} > {noformat} > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, > CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should > all have the same type, got LEAST (ArrayBuffer(IntegerType, > DecimalType(2,1))).; line 1 pos 7 (state=,code=0) > {noformat} > This query works for 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
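The missing piece behind this bug is the usual least-common-type coercion: cast all arguments to one widened type before comparing. A toy version of that rule (illustrative Python over int/Decimal arguments, not Spark's analyzer code):

```python
from decimal import Decimal

# Toy type coercion for LEAST/GREATEST: if the arguments mix ints and
# decimals, widen everything to Decimal first, then compare. This mirrors
# what a SQL analyzer's "find the tightest common type" rule does.

def least(*args):
    if any(isinstance(a, Decimal) for a in args):
        # Widen the int arguments; Decimal(int) is exact.
        args = tuple(a if isinstance(a, Decimal) else Decimal(a) for a in args)
    return min(args)

print(least(1, Decimal("1.5")))   # 1 is widened, so the comparison is legal
```

Since the error only fires when the argument types differ, casting by hand in the query (e.g. {{LEAST(CAST(1 AS DECIMAL(2,1)), 1.5)}}) should work around the bug until a fix lands.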
[jira] [Assigned] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16646: Assignee: Apache Spark > LEAST doesn't accept numeric arguments with different data types > > > Key: SPARK-16646 > URL: https://issues.apache.org/jira/browse/SPARK-16646 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: liancheng >Assignee: Apache Spark > > {code:sql} > SELECT LEAST(1, 1.5); > {code} > {noformat} > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, > CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should > all have the same type, got LEAST (ArrayBuffer(IntegerType, > DecimalType(2,1))).; line 1 pos 7 (state=,code=0) > {noformat} > This query works for 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387025#comment-15387025 ] Apache Spark commented on SPARK-16646: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14294 > LEAST doesn't accept numeric arguments with different data types > > > Key: SPARK-16646 > URL: https://issues.apache.org/jira/browse/SPARK-16646 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: liancheng > > {code:sql} > SELECT LEAST(1, 1.5); > {code} > {noformat} > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, > CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should > all have the same type, got LEAST (ArrayBuffer(IntegerType, > DecimalType(2,1))).; line 1 pos 7 (state=,code=0) > {noformat} > This query works for 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16646: Assignee: (was: Apache Spark) > LEAST doesn't accept numeric arguments with different data types > > > Key: SPARK-16646 > URL: https://issues.apache.org/jira/browse/SPARK-16646 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: liancheng > > {code:sql} > SELECT LEAST(1, 1.5); > {code} > {noformat} > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, > CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should > all have the same type, got LEAST (ArrayBuffer(IntegerType, > DecimalType(2,1))).; line 1 pos 7 (state=,code=0) > {noformat} > This query works for 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16659) use Maven project to submit spark application via yarn-client
Jack Jiang created SPARK-16659: -- Summary: use Maven project to submit spark application via yarn-client Key: SPARK-16659 URL: https://issues.apache.org/jira/browse/SPARK-16659 Project: Spark Issue Type: Question Reporter: Jack Jiang I want to use Spark SQL to execute Hive SQL in my Maven project. Here is the main code:

System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("yarn-client");
// .set("hive.metastore.uris", "thrift://172.30.115.59:9083");
SparkContext ctx = new SparkContext(sparkConf);
// ctx.addJar("lib/hive-hbase-handler-0.14.0.2.2.6.0-2800.jar");
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(ctx);
String[] tables = sqlContext.tableNames();
for (String tablename : tables) {
    System.out.println("tablename : " + tablename);
}

When I run it, it fails with this error: 10:16:17,496 INFO Client:59 - client token: N/A diagnostics: Application application_1468409747983_0280 failed 2 times due to AM Container for appattempt_1468409747983_0280_02 exited with exitCode: -1000 For more detailed output, check application tracking page:http://hadoop003.icccuat.com:8088/proxy/application_1468409747983_0280/Then, click on links to logs of each attempt. 
Diagnostics: File file:/C:/Users/uatxj990267/AppData/Local/Temp/spark-8874c486-893d-4ac3-a088-48e4cdb484e1/__spark_conf__9007071161920501082.zip does not exist java.io.FileNotFoundException: File file:/C:/Users/uatxj990267/AppData/Local/Temp/spark-8874c486-893d-4ac3-a088-48e4cdb484e1/__spark_conf__9007071161920501082.zip does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:608) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:821) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:598) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:414) at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61) at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359) at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Failing this attempt. Failing the application. 
ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1469067373412 final status: FAILED tracking URL: http://hadoop003.icccuat.com:8088/cluster/app/application_1468409747983_0280 user: uatxj990267 10:16:17,496 ERROR SparkContext:96 - Error initializing SparkContext. org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master. at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144) at org.apache.spark.SparkContext.(SparkContext.scala:523) at com.huateng.test.SparkSqlDemo.main(SparkSqlDemo.java:33) But when I change setMaster("yarn-client") to setMaster("local[2]"), it works fine. What is wrong with it? Can anyone help me? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16643) When doing Shuffle, report "java.io.FileNotFoundException"
[ https://issues.apache.org/jira/browse/SPARK-16643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387012#comment-15387012 ] Deng Changchun commented on SPARK-16643: Thank you for the response. I think this problem is totally different from https://issues.apache.org/jira/browse/SPARK-12240, even though they both report a FileNotFoundException. SPARK-12240 I could solve by raising the ulimit. Coming back to this problem: I have already set the ulimit to unlimited, and the error is not "too many open files" but "no such file or directory", so I don't think they are similar problems. By the way, I will try a more recent version. > When doing Shuffle, report "java.io.FileNotFoundException" > -- > > Key: SPARK-16643 > URL: https://issues.apache.org/jira/browse/SPARK-16643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 > Environment: LSB Version: > :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch > Distributor ID: CentOS > Description: CentOS release 6.6 (Final) > Release: 6.6 > Codename: Final > java version "1.7.0_10" > Java(TM) SE Runtime Environment (build 1.7.0_10-b18) > Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode) >Reporter: Deng Changchun > > In our Spark cluster in standalone mode, we execute some SQL on SparkSQL, > such as aggregate SQL like "select count(rowKey) from HVRC_B_LOG where 1=1 > and RESULTTIME >= 146332800 and RESULTTIME <= 1463414399000". > At the beginning all is good; however, after about 15 days, executing the > aggregate SQL starts to report an error. The log looks like: > [Notice: > it is very strange that the error is not reported every time an > aggregate SQL is executed; after executing some aggregate SQLs, it will > log the error seemingly at random.] > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = > 624 > 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] > executor.Executor: Exception in task 0.3 in stage 580.0 (TID 624) > java.io.FileNotFoundException: > /tmp/spark-cb199fce-bb80-4e6f-853f-4d7984bf5f34/executor-fb7c2149-c6c4-4697-ba2f-3b53dcd7f34a/blockmgr-0a9003ad-23b3-4ff5-b76f-6fbc5d71e730/3e/temp_shuffle_ef68b340-85e4-483c-90e8-5e8c8d8ee4ee > (No such file or directory) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:212) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16644) constraints propagation may fail the query
[ https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16644: - Fix Version/s: (was: 2.0.0) 2.1.0 2.0.1 > constraints propagation may fail the query > -- > > Key: SPARK-16644 > URL: https://issues.apache.org/jira/browse/SPARK-16644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > > {code} > create table tbl(a int, b int); > select > a, > max(b) as c1, > b as c2 > from tbl > where a = b > group by a, b > having c1 = 1 > {code} > this query fails in 2.0, but works in 1.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16644) constraints propagation may fail the query
[ https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16644: - Assignee: Wenchen Fan > constraints propagation may fail the query > -- > > Key: SPARK-16644 > URL: https://issues.apache.org/jira/browse/SPARK-16644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > {code} > create table tbl(a int, b int); > select > a, > max(b) as c1, > b as c2 > from tbl > where a = b > group by a, b > having c1 = 1 > {code} > this query fails in 2.0, but works in 1.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16644) constraints propagation may fail the query
[ https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16644. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 14281 [https://github.com/apache/spark/pull/14281] > constraints propagation may fail the query > -- > > Key: SPARK-16644 > URL: https://issues.apache.org/jira/browse/SPARK-16644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > Fix For: 2.0.0 > > > {code} > create table tbl(a int, b int); > select > a, > max(b) as c1, > b as c2 > from tbl > where a = b > group by a, b > having c1 = 1 > {code} > this query fails in 2.0, but works in 1.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16272) Allow configs to reference other configs, env and system properties
[ https://issues.apache.org/jira/browse/SPARK-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16272. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.1.0 > Allow configs to reference other configs, env and system properties > --- > > Key: SPARK-16272 > URL: https://issues.apache.org/jira/browse/SPARK-16272 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.0 > > > Currently, Spark's configuration is static; it is whatever is written to the > config file, with some rare exceptions (such as some YARN code that does > expansion of Hadoop configuration). > But there are a few use cases that don't work well in that situation. For > example, consider {{spark.sql.hive.metastore.jars}}. It references a list of > paths containing the classpath for accessing Hive's metastore. If you're > launching an application in cluster mode, it means that whatever is in the > configuration of the edge node needs to match the configuration of the random > node in the cluster where the driver will actually run. > This would be easily solved if there was a way to reference system properties > or env variables; for example, when YARN launches a container, a bunch of env > variables are set, which could be used to modify that path to match the > correct location on the node. > So I'm proposing a change where config properties can opt-in to use this > variable expansion feature; it's opt-in to avoid breaking existing code (who > knows) and to avoid the extra cost of doing the variable expansion of every > config read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
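As a hedged illustration of the metastore-jars use case described above (the exact substitution syntax is defined by the patch itself; the `${env:...}` form shown here is an assumption based on the feature description, not a confirmed API):

```properties
# Hypothetical spark-defaults.conf fragment: resolve the Hive metastore
# classpath on the node where the driver actually runs, instead of
# hard-coding a path that is only valid on the edge node.
spark.sql.hive.metastore.jars    ${env:HIVE_HOME}/lib/*
```

With an opt-in design like this, configs that do not declare support for expansion keep their literal values, so existing deployments are unaffected.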
[jira] [Resolved] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+
[ https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16344. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14014 [https://github.com/apache/spark/pull/14014] > Array of struct with a single field name "element" can't be decoded from > Parquet files written by Spark 1.6+ > > > Key: SPARK-16344 > URL: https://issues.apache.org/jira/browse/SPARK-16344 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.1.0 > > > This is a weird corner case. Users may hit this issue if they have a schema > that > # has an array field whose element type is a struct, and > # the struct has one and only one field, and > # that field is named as "element". > The following Spark shell snippet for Spark 1.6 reproduces this bug: > {code} > case class A(element: Long) > case class B(f: Array[A]) > val path = "/tmp/silly.parquet" > Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path) > val df = sqlContext.read.parquet(path) > df.printSchema() > // root > // |-- f0: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- element: long (nullable = true) > df.show() > {code} > Exception thrown: > {noformat} > org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in > block -1 in file > file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: Expected instance of group converter > but got > "org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter" > at > org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37) > at > org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:266) > at > 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134) > at > org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99) > at > org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154) > at > org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137) >
[jira] [Assigned] (SPARK-16658) Add EdgePartition.withVertexAttributes
[ https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16658: Assignee: (was: Apache Spark) > Add EdgePartition.withVertexAttributes > -- > > Key: SPARK-16658 > URL: https://issues.apache.org/jira/browse/SPARK-16658 > Project: Spark > Issue Type: Improvement >Reporter: Ben McCann > > I'm using cloudml/zen, which has forked graphx. I'd like to see their changes > upstreamed, so that they can go back to using the upstream graphx instead of > having a fork. > Their implementation of withVertexAttributes: > https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala > Their usage of that method: > https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16658) Add EdgePartition.withVertexAttributes
[ https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386796#comment-15386796 ] Apache Spark commented on SPARK-16658: -- User 'benmccann' has created a pull request for this issue: https://github.com/apache/spark/pull/14291 > Add EdgePartition.withVertexAttributes > -- > > Key: SPARK-16658 > URL: https://issues.apache.org/jira/browse/SPARK-16658 > Project: Spark > Issue Type: Improvement >Reporter: Ben McCann > > I'm using cloudml/zen, which has forked graphx. I'd like to see their changes > upstreamed, so that they can go back to using the upstream graphx instead of > having a fork. > Their implementation of withVertexAttributes: > https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala > Their usage of that method: > https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16658) Add EdgePartition.withVertexAttributes
[ https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16658: Assignee: Apache Spark > Add EdgePartition.withVertexAttributes > -- > > Key: SPARK-16658 > URL: https://issues.apache.org/jira/browse/SPARK-16658 > Project: Spark > Issue Type: Improvement >Reporter: Ben McCann >Assignee: Apache Spark > > I'm using cloudml/zen, which has forked graphx. I'd like to see their changes > upstreamed, so that they can go back to using the upstream graphx instead of > having a fork. > Their implementation of withVertexAttributes: > https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala > Their usage of that method: > https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16658) Add EdgePartition.withVertexAttributes
Ben McCann created SPARK-16658: -- Summary: Add EdgePartition.withVertexAttributes Key: SPARK-16658 URL: https://issues.apache.org/jira/browse/SPARK-16658 Project: Spark Issue Type: Improvement Reporter: Ben McCann I'm using cloudml/zen, which has forked graphx. I'd like to see their changes upstreamed, so that they can go back to using the upstream graphx instead of having a fork. Their implementation of withVertexAttributes: https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala Their usage of that method: https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
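The exact signature lives in the linked zen fork; as a rough, hypothetical illustration of the shape such a method could take (all names here are illustrative, not GraphX's real API), it swaps the local vertex-attribute array of an edge partition while sharing the immutable edge arrays:

```python
# Toy model of an edge partition: edges index into a local vertex-attribute
# array. withVertexAttributes returns a copy that shares the edge arrays but
# carries new vertex attributes, which is cheap because edges are immutable.

class EdgePartition:
    def __init__(self, src_ids, dst_ids, vertex_attrs):
        self.src_ids = src_ids          # local vertex indices, shared
        self.dst_ids = dst_ids          # local vertex indices, shared
        self.vertex_attrs = vertex_attrs

    def with_vertex_attributes(self, new_attrs):
        """Return a partition with the same edges but new vertex attributes."""
        assert len(new_attrs) == len(self.vertex_attrs)
        return EdgePartition(self.src_ids, self.dst_ids, new_attrs)


part = EdgePartition([0, 1], [1, 2], ["a", "b", "c"])
updated = part.with_vertex_attributes(["x", "y", "z"])
print(updated.vertex_attrs)             # new attributes
print(updated.src_ids is part.src_ids)  # edge arrays are shared, not copied
```

An operation like this lets algorithms such as LDA refresh per-vertex state between iterations without rebuilding the edge structure.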
[jira] [Commented] (SPARK-16280) Implement histogram_numeric SQL function
[ https://issues.apache.org/jira/browse/SPARK-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386787#comment-15386787 ] MIN-FU YANG commented on SPARK-16280: - I implemented histogram_numeric according to Algorithms I & II in the paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. After doing some benchmarking, I came up with an optimal solution, which is ready for review: https://github.com/apache/spark/pull/14129 > Implement histogram_numeric SQL function > > > Key: SPARK-16280 > URL: https://issues.apache.org/jira/browse/SPARK-16280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >
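The core of the Ben-Haim & Tom-Tov streaming histogram referenced above can be sketched in a few lines. This is a minimal pure-Python illustration of the algorithm itself, not Spark's implementation or the linked PR: keep at most `max_bins` (value, count) centroids, and when a new point overflows that budget, merge the two closest adjacent bins into their weighted mean. The `merge` method shows why the structure parallelizes well (partial histograms combine the same way).

```python
# Illustrative sketch of the Ben-Haim & Tom-Tov (2010) streaming histogram.
import bisect

class StreamingHistogram:
    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid_value, count]

    def update(self, x):
        i = bisect.bisect_left(self.bins, [x, 0])
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1          # exact hit: just bump the count
        else:
            self.bins.insert(i, [x, 1])   # new centroid, then enforce the budget
            self._trim()

    def merge(self, other):
        """Combine a partial histogram (the parallel/combiner step)."""
        for value, count in other.bins:
            i = bisect.bisect_left(self.bins, [value, 0])
            if i < len(self.bins) and self.bins[i][0] == value:
                self.bins[i][1] += count
            else:
                self.bins.insert(i, [value, count])
        self._trim()

    def _trim(self):
        while len(self.bins) > self.max_bins:
            # merge the adjacent pair with the smallest gap into a weighted mean
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (v1, c1), (v2, c2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]]

h = StreamingHistogram(max_bins=4)
for x in [1, 1, 2, 9, 10, 10, 25, 26]:
    h.update(x)
print(h.bins)  # at most 4 weighted centroids summarizing all 8 points
```

Counts are preserved across merges, so quantile and bucket estimates can be read off the surviving centroids.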
[jira] [Commented] (SPARK-16657) Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand
[ https://issues.apache.org/jira/browse/SPARK-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386765#comment-15386765 ] Apache Spark commented on SPARK-16657: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14290 > Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and > CreateHiveTableAsSelectCommand > - > > Key: SPARK-16657 > URL: https://issues.apache.org/jira/browse/SPARK-16657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > The query in `InsertIntoHadoopFsRelationCommand` and > `CreateHiveTableAsSelectCommand` should be treated as inner children, like > what we did for the other similar nodes: > `CreateDataSourceTableAsSelectCommand`, `innerChildren`, and > `InsertIntoDataSourceCommand`.
[jira] [Assigned] (SPARK-16657) Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand
[ https://issues.apache.org/jira/browse/SPARK-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16657: Assignee: (was: Apache Spark) > Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and > CreateHiveTableAsSelectCommand > - > > Key: SPARK-16657 > URL: https://issues.apache.org/jira/browse/SPARK-16657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > The query in `InsertIntoHadoopFsRelationCommand` and > `CreateHiveTableAsSelectCommand` should be treated as inner children, like > what we did for the other similar nodes: > `CreateDataSourceTableAsSelectCommand`, `innerChildren`, and > `InsertIntoDataSourceCommand`.
[jira] [Assigned] (SPARK-16657) Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand
[ https://issues.apache.org/jira/browse/SPARK-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16657: Assignee: Apache Spark > Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and > CreateHiveTableAsSelectCommand > - > > Key: SPARK-16657 > URL: https://issues.apache.org/jira/browse/SPARK-16657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > The query in `InsertIntoHadoopFsRelationCommand` and > `CreateHiveTableAsSelectCommand` should be treated as inner children, like > what we did for the other similar nodes: > `CreateDataSourceTableAsSelectCommand`, `innerChildren`, and > `InsertIntoDataSourceCommand`.
[jira] [Created] (SPARK-16657) Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand
Xiao Li created SPARK-16657: --- Summary: Replace children by innerChildren in InsertIntoHadoopFsRelationCommand and CreateHiveTableAsSelectCommand Key: SPARK-16657 URL: https://issues.apache.org/jira/browse/SPARK-16657 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li The query in `InsertIntoHadoopFsRelationCommand` and `CreateHiveTableAsSelectCommand` should be treated as inner children, like what we did for the other similar nodes: `CreateDataSourceTableAsSelectCommand`, `innerChildren`, and `InsertIntoDataSourceCommand`.
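The children-vs-innerChildren distinction this issue is about can be illustrated with a toy tree. This is a hypothetical pure-Python sketch, not Spark's Catalyst `TreeNode`: `children` are nodes that transformation rules recurse into, while `inner_children` are only displayed nested in the tree string. A write command that executes its own wrapped query should hold that query as an inner child so optimizer rules don't rewrite it as if it were a child operator.

```python
# Toy sketch (assumed names, not Catalyst) of children vs. innerChildren.
class TreeNode:
    def __init__(self, name, children=(), inner_children=()):
        self.name = name
        self.children = list(children)
        self.inner_children = list(inner_children)

    def tree_string(self, depth=0):
        # innerChildren are rendered nested, just like real children
        out = ["   " * depth + self.name]
        for node in self.inner_children + self.children:
            out.append(node.tree_string(depth + 1))
        return "\n".join(out)

    def transform(self, rule):
        """Apply a rewrite rule to this node and its children only;
        inner_children are deliberately not visited."""
        node = rule(self)
        node.children = [c.transform(rule) for c in node.children]
        return node

query = TreeNode("Project", children=[TreeNode("Relation tbl")])
cmd = TreeNode("InsertIntoHadoopFsRelationCommand", inner_children=[query])
print(cmd.tree_string())  # the wrapped query still shows up in the plan string

renamed = cmd.transform(lambda n: TreeNode(n.name.upper(), n.children, n.inner_children))
# The command node is rewritten, but the wrapped query keeps its original names.
```

With the query as a plain child instead, the same `transform` would have rewritten it too, which is exactly what commands like these need to avoid.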
[jira] [Commented] (SPARK-16644) constraints propagation may fail the query
[ https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386764#comment-15386764 ] Yin Huai commented on SPARK-16644: -- oh, I think this problem only happens if an aggregate expression uses a grouping expression and that exact grouping expression is part of the output of the aggregate operator. > constraints propagation may fail the query > -- > > Key: SPARK-16644 > URL: https://issues.apache.org/jira/browse/SPARK-16644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > > {code} > create table tbl(a int, b int); > select > a, > max(b) as c1, > b as c2 > from tbl > where a = b > group by a, b > having c1 = 1 > {code} > this query fails in 2.0, but works in 1.6
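As a sanity check of the report, the same shape of query can be run outside Spark. This is an illustration only (SQLite via Python's stdlib, not Spark): it shows the query has a well-defined answer, which matches the claim that it worked in Spark 1.6 and that the 2.0 failure is an analysis-time regression rather than invalid SQL. The sample rows are made up for the demo.

```python
# Run the SPARK-16644 reproduction query in SQLite to show it is answerable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl(a int, b int)")
# rows (1,1) and (2,2) survive the a = b filter; (3,4) does not
conn.executemany("insert into tbl values (?, ?)", [(1, 1), (2, 2), (3, 4)])
rows = conn.execute("""
    select a, max(b) as c1, b as c2
    from tbl
    where a = b
    group by a, b
    having c1 = 1
""").fetchall()
print(rows)  # only the (a=1, b=1) group has max(b) = 1
```

Note that SQLite (like Spark 1.6) accepts the alias `c1` in the HAVING clause; the bug here is that Spark 2.0's constraint propagation failed on it during analysis.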
[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout
[ https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386759#comment-15386759 ] Emma Tang commented on SPARK-12675: --- This issue is happening in 1.6.2. I'm on a 32-node cluster. This issue only occurs when the input data exceeds a certain size. > Executor dies because of ClassCastException and causes timeout > -- > > Key: SPARK-12675 > URL: https://issues.apache.org/jira/browse/SPARK-12675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1, 2.0.0 > Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz >Reporter: Alexandru Rosianu > > I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script > which doesn't work (a bit simplified): > {code:title=Script.scala} > // Prepare data sets > logInfo("Getting datasets") > val emoTrainingData = > sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet") > val trainingData = emoTrainingData > // Configure the pipeline > val pipeline = new Pipeline().setStages(Array( > new > FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"), > new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"), > new Tokenizer().setInputCol("text").setOutputCol("raw_words"), > new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"), > new HashingTF().setInputCol("words").setOutputCol("features"), > new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"), > new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", > "raw_words", "words", "features") > )) > // Fit the pipeline > logInfo(s"Training model on ${trainingData.count()} rows") > val model = pipeline.fit(trainingData) > {code} > It executes up to the last line. It prints "Training model on xx rows", then > it starts fitting, the executor dies, the driver doesn't receive heartbeats > from the executor and it times out, then the script exits. It doesn't get > past that line. 
> This is the exception that kills the executor: > {code} > java.io.IOException: java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.HashMap$SerializationProxy to field > org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type > scala.collection.immutable.Map in instance of > org.apache.spark.executor.TaskMetrics > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) > at > org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at 
scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741) > at
[jira] [Commented] (SPARK-16656) CreateTableAsSelectSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386750#comment-15386750 ] Apache Spark commented on SPARK-16656: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/14289 > CreateTableAsSelectSuite is flaky > - > > Key: SPARK-16656 > URL: https://issues.apache.org/jira/browse/SPARK-16656 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/
[jira] [Assigned] (SPARK-16656) CreateTableAsSelectSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16656: Assignee: Yin Huai (was: Apache Spark) > CreateTableAsSelectSuite is flaky > - > > Key: SPARK-16656 > URL: https://issues.apache.org/jira/browse/SPARK-16656 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/
[jira] [Assigned] (SPARK-16656) CreateTableAsSelectSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-16656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16656: Assignee: Apache Spark (was: Yin Huai) > CreateTableAsSelectSuite is flaky > - > > Key: SPARK-16656 > URL: https://issues.apache.org/jira/browse/SPARK-16656 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/
[jira] [Created] (SPARK-16656) CreateTableAsSelectSuite is flaky
Yin Huai created SPARK-16656: Summary: CreateTableAsSelectSuite is flaky Key: SPARK-16656 URL: https://issues.apache.org/jira/browse/SPARK-16656 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62593/testReport/junit/org.apache.spark.sql.sources/CreateTableAsSelectSuite/create_a_table__drop_it_and_create_another_one_with_the_same_name/
[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-16320: --- Description: I did some test on parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes slower. I tested following queries: 1) {code}select count(*) where id > some_id{code} In this query performance is similar. (about 1 sec) 2) {code}select count(*) where nested_column.id > some_id{code} Spark 1.6 -> 1.6 min Spark 2.0 -> 2.1 min Should I expect such a drop in performance ? I don't know how to prepare sample data to show the problem. Any ideas ? Or public data with many nested columns ? *UPDATE* I created script to generate data and to confirm this problem. {code} #Initialization from pyspark import SparkContext, SparkConf from pyspark.sql import HiveContext from pyspark.sql.functions import struct conf = SparkConf() conf.set('spark.cores.max', 15) conf.set('spark.executor.memory', '30g') conf.set('spark.driver.memory', '30g') sc = SparkContext(conf=conf) sqlctx = HiveContext(sc) #Data creation MAX_SIZE = 2**32 - 1 path = '/mnt/mfs/parquet_nested' def create_sample_data(levels, rows, path): def _create_column_data(cols): import random random.seed() return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)} def _create_sample_df(cols, rows): rdd = sc.parallelize(range(rows)) data = rdd.map(lambda r: _create_column_data(cols)) df = sqlctx.createDataFrame(data) return df def _create_nested_data(levels, rows): if len(levels) == 1: return _create_sample_df(levels[0], rows).cache() else: df = _create_nested_data(levels[1:], rows) return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])]) df = _create_nested_data(levels, rows) df.write.mode('overwrite').parquet(path) #Sample data create_sample_data([2,10,200], 100, path) #Query df = sqlctx.read.parquet(path) %%timeit df.where("column1.column5.column50 > 
{}".format(int(MAX_SIZE / 2))).count() {code} Results Spark 1.6 1 loop, best of 3: *1min 5s* per loop Spark 2.0 1 loop, best of 3: *1min 21s* per loop *UPDATE 2* Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same source. I attached some VisualVM profiles there. Most interesting are from queries. https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps was: I did some test on parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes slower. I tested following queries: 1) {code}select count(*) where id > some_id{code} In this query performance is similar. (about 1 sec) 2) {code}select count(*) where nested_column.id > some_id{code} Spark 1.6 -> 1.6 min Spark 2.0 -> 2.1 min Should I expect such a drop in performance ? I don't know how to prepare sample data to show the problem. Any ideas ? Or public data with many nested columns ? *UPDATE* I created script to generate data and to confirm this problem. 
{code} #Initialization from pyspark import SparkContext, SparkConf from pyspark.sql import HiveContext from pyspark.sql.functions import struct conf = SparkConf() conf.set('spark.cores.max', 15) conf.set('spark.executor.memory', '30g') conf.set('spark.driver.memory', '30g') sc = SparkContext(conf=conf) sqlctx = HiveContext(sc) #Data creation MAX_SIZE = 2**32 - 1 path = '/mnt/mfs/parquet_nested' def create_sample_data(levels, rows, path): def _create_column_data(cols): import random random.seed() return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)} def _create_sample_df(cols, rows): rdd = sc.parallelize(range(rows)) data = rdd.map(lambda r: _create_column_data(cols)) df = sqlctx.createDataFrame(data) return df def _create_nested_data(levels, rows): if len(levels) == 1: return _create_sample_df(levels[0], rows).cache() else: df = _create_nested_data(levels[1:], rows) return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])]) df = _create_nested_data(levels, rows) df.write.mode('overwrite').parquet(path) #Sample data create_sample_data([2,10,200], 100, path) #Query df = sqlctx.read.parquet(path) %%timeit df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() {code} Results Spark 1.6 1 loop, best of 3: *1min 5s* per loop Spark 2.0 1 loop, best of 3: *1min 21s* per loop *UPDATE 2* Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same source. There is more > Spark 2.0 slower than 1.6 when querying nested columns >
[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-16320: --- Description: I did some test on parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes slower. I tested following queries: 1) {code}select count(*) where id > some_id{code} In this query performance is similar. (about 1 sec) 2) {code}select count(*) where nested_column.id > some_id{code} Spark 1.6 -> 1.6 min Spark 2.0 -> 2.1 min Should I expect such a drop in performance ? I don't know how to prepare sample data to show the problem. Any ideas ? Or public data with many nested columns ? *UPDATE* I created script to generate data and to confirm this problem. {code} #Initialization from pyspark import SparkContext, SparkConf from pyspark.sql import HiveContext from pyspark.sql.functions import struct conf = SparkConf() conf.set('spark.cores.max', 15) conf.set('spark.executor.memory', '30g') conf.set('spark.driver.memory', '30g') sc = SparkContext(conf=conf) sqlctx = HiveContext(sc) #Data creation MAX_SIZE = 2**32 - 1 path = '/mnt/mfs/parquet_nested' def create_sample_data(levels, rows, path): def _create_column_data(cols): import random random.seed() return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)} def _create_sample_df(cols, rows): rdd = sc.parallelize(range(rows)) data = rdd.map(lambda r: _create_column_data(cols)) df = sqlctx.createDataFrame(data) return df def _create_nested_data(levels, rows): if len(levels) == 1: return _create_sample_df(levels[0], rows).cache() else: df = _create_nested_data(levels[1:], rows) return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])]) df = _create_nested_data(levels, rows) df.write.mode('overwrite').parquet(path) #Sample data create_sample_data([2,10,200], 100, path) #Query df = sqlctx.read.parquet(path) %%timeit df.where("column1.column5.column50 > 
{}".format(int(MAX_SIZE / 2))).count() {code} Results Spark 1.6 1 loop, best of 3: *1min 5s* per loop Spark 2.0 1 loop, best of 3: *1min 21s* per loop *UPDATE 2* Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same source. There is more was: I did some test on parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes slower. I tested following queries: 1) {code}select count(*) where id > some_id{code} In this query performance is similar. (about 1 sec) 2) {code}select count(*) where nested_column.id > some_id{code} Spark 1.6 -> 1.6 min Spark 2.0 -> 2.1 min Should I expect such a drop in performance ? I don't know how to prepare sample data to show the problem. Any ideas ? Or public data with many nested columns ? *UPDATE* I created script to generate data and to confirm this problem. {code} #Initialization from pyspark import SparkContext, SparkConf from pyspark.sql import HiveContext from pyspark.sql.functions import struct conf = SparkConf() conf.set('spark.cores.max', 15) conf.set('spark.executor.memory', '30g') conf.set('spark.driver.memory', '30g') sc = SparkContext(conf=conf) sqlctx = HiveContext(sc) #Data creation MAX_SIZE = 2**32 - 1 path = '/mnt/mfs/parquet_nested' def create_sample_data(levels, rows, path): def _create_column_data(cols): import random random.seed() return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)} def _create_sample_df(cols, rows): rdd = sc.parallelize(range(rows)) data = rdd.map(lambda r: _create_column_data(cols)) df = sqlctx.createDataFrame(data) return df def _create_nested_data(levels, rows): if len(levels) == 1: return _create_sample_df(levels[0], rows).cache() else: df = _create_nested_data(levels[1:], rows) return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])]) df = _create_nested_data(levels, rows) df.write.mode('overwrite').parquet(path) #Sample data create_sample_data([2,10,200], 100, path) #Query df 
= sqlctx.read.parquet(path) %%timeit df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() {code} Results Spark 1.6 1 loop, best of 3: *1min 5s* per loop Spark 2.0 1 loop, best of 3: *1min 21s* per loop *UPDATE 2* Analysis in > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter:
[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-16320: --- Description: I did some test on parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes slower. I tested following queries: 1) {code}select count(*) where id > some_id{code} In this query performance is similar. (about 1 sec) 2) {code}select count(*) where nested_column.id > some_id{code} Spark 1.6 -> 1.6 min Spark 2.0 -> 2.1 min Should I expect such a drop in performance ? I don't know how to prepare sample data to show the problem. Any ideas ? Or public data with many nested columns ? *UPDATE* I created script to generate data and to confirm this problem. {code} #Initialization from pyspark import SparkContext, SparkConf from pyspark.sql import HiveContext from pyspark.sql.functions import struct conf = SparkConf() conf.set('spark.cores.max', 15) conf.set('spark.executor.memory', '30g') conf.set('spark.driver.memory', '30g') sc = SparkContext(conf=conf) sqlctx = HiveContext(sc) #Data creation MAX_SIZE = 2**32 - 1 path = '/mnt/mfs/parquet_nested' def create_sample_data(levels, rows, path): def _create_column_data(cols): import random random.seed() return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)} def _create_sample_df(cols, rows): rdd = sc.parallelize(range(rows)) data = rdd.map(lambda r: _create_column_data(cols)) df = sqlctx.createDataFrame(data) return df def _create_nested_data(levels, rows): if len(levels) == 1: return _create_sample_df(levels[0], rows).cache() else: df = _create_nested_data(levels[1:], rows) return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])]) df = _create_nested_data(levels, rows) df.write.mode('overwrite').parquet(path) #Sample data create_sample_data([2,10,200], 100, path) #Query df = sqlctx.read.parquet(path) %%timeit df.where("column1.column5.column50 > 
{}".format(int(MAX_SIZE / 2))).count() {code} Results Spark 1.6 1 loop, best of 3: *1min 5s* per loop Spark 2.0 1 loop, best of 3: *1min 21s* per loop *UPDATE 2* Analysis in was: I did some test on parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes slower. I tested following queries: 1) {code}select count(*) where id > some_id{code} In this query performance is similar. (about 1 sec) 2) {code}select count(*) where nested_column.id > some_id{code} Spark 1.6 -> 1.6 min Spark 2.0 -> 2.1 min Should I expect such a drop in performance ? I don't know how to prepare sample data to show the problem. Any ideas ? Or public data with many nested columns ? *UPDATE* I created script to generate data and to confirm this problem. {code} #Initialization from pyspark import SparkContext, SparkConf from pyspark.sql import HiveContext from pyspark.sql.functions import struct conf = SparkConf() conf.set('spark.cores.max', 15) conf.set('spark.executor.memory', '30g') conf.set('spark.driver.memory', '30g') sc = SparkContext(conf=conf) sqlctx = HiveContext(sc) #Data creation MAX_SIZE = 2**32 - 1 path = '/mnt/mfs/parquet_nested' def create_sample_data(levels, rows, path): def _create_column_data(cols): import random random.seed() return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)} def _create_sample_df(cols, rows): rdd = sc.parallelize(range(rows)) data = rdd.map(lambda r: _create_column_data(cols)) df = sqlctx.createDataFrame(data) return df def _create_nested_data(levels, rows): if len(levels) == 1: return _create_sample_df(levels[0], rows).cache() else: df = _create_nested_data(levels[1:], rows) return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])]) df = _create_nested_data(levels, rows) df.write.mode('overwrite').parquet(path) #Sample data create_sample_data([2,10,200], 100, path) #Query df = sqlctx.read.parquet(path) %%timeit df.where("column1.column5.column50 > 
{}".format(int(MAX_SIZE / 2))).count() {code} Results Spark 1.6 1 loop, best of 3: *1min 5s* per loop Spark 2.0 1 loop, best of 3: *1min 21s* per loop > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some test on parquet file with many nested columns (about
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386586#comment-15386586 ] Michael Allman commented on SPARK-16320: Okay, so the metastore is not a factor. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop
[jira] [Comment Edited] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF
[ https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386564#comment-15386564 ] Sam Fishman edited comment on SPARK-13301 at 7/20/16 8:27 PM: -- I am having the same issue when applying a udf to a DataFrame. I've noticed that it seems to only occur when data is read in using sqlContext.sql(). When I use parallelize on a local Python collection and then call toDF on the resulting RDD, I don't seem to have the issue: Works: data = [ ["1", "A->B"], ["2", "B->C"], ["3", "C->A"], ["4", "D->E"], ["5", "E->D"] ] rdd = sc.parallelize(data) df = rdd.toDF(["id", "segment"]) def myFunction(number): id_test = number+ ' test' return(id_test) test_function_udf = udf(myFunction, StringType()) df2 = df.withColumn('test', test_function_udf("id")) Does not work (ie "wrong results" similar to Simone's output): df = sqlContext.sql("select * from my_table") def myFunction(number): id_test = number+ ' test' return(id_test) test_function_udf = udf(myFunction, StringType()) df2 = df.withColumn('test', test_function_udf("id")) I should note that this also only seems to be an issue when using a UDF. If I use the builtin concat function, I do not get the error. was (Author: sfishman): I am having the same issue when applying a udf to a DataFrame. I've noticed that it seems to only occur when data is read in using sqlContext.sql(). 
When I use parallelize on a local Python collection and then call toDF on the resulting RDD, I don't seem to have the issue: Works: data = [ ["1", "A->B"], ["2", "B->C"], ["3", "C->A"], ["4", "D->E"], ["5", "E->D"] ] rdd = sc.parallelize(data) df = rdd.toDF(["id", "segment"]) def myFunction(number): id_test = number+ ' test' return(id_test) test_function_udf = udf(myFunction, StringType()) df2 = df.withColumn('test', test_function_udf("id")) Does not work (ie "wrong results" similar to Simone's output): # Assuming I have data stored in a table df = sqlContext.sql("select * from my_table") def myFunction(number): id_test = number+ ' test' return(id_test) test_function_udf = udf(myFunction, StringType()) df2 = df.withColumn('test', test_function_udf("id")) I should note that this also only seems to be an issue when using a UDF. If I use the builtin concat function, I do not get the error. > PySpark Dataframe return wrong results with custom UDF > -- > > Key: SPARK-13301 > URL: https://issues.apache.org/jira/browse/SPARK-13301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: PySpark in yarn-client mode - CDH 5.5.1 >Reporter: Simone >Priority: Critical > > Using a User Defined Function in PySpark inside the withColumn() method of > Dataframe, gives wrong results. > Here an example: > from pyspark.sql import functions > import string > myFunc = functions.udf(lambda s: string.lower(s)) > myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show() > |col1| col2|col3| > |1265AB4F65C05740E...|Ivo|4f00ae514e7c015be...| > |1D94AB4F75C83B51E...| Raffaele|4f00dcf6422100c0e...| > |4F008903600A0133E...| Cristina|4f008903600a0133e...| > The results are wrong and seem to be random: some record are OK (for example > the third) some others NO (for example the first 2). 
> The problem does not seem to occur with Spark built-in functions: > from pyspark.sql.functions import * > myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show() > Without the withColumn() method, results always seem to be correct: > myDF.select("col1", "col2", myFunc(myDF["col1"])).show() > This is only partly a workaround because you have to list > all the columns of your DataFrame each time. > Also in Scala/Java the problem does not seem to occur.
[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386561#comment-15386561 ] Thomas Graves commented on SPARK-2666: -- thanks for the explanation. I guess we would have to look through the failure cases, but if you are using the external shuffle service it feels like marking everything on that node as bad even if it's from another executor would be better, because this case seems like more of a node failure or something that would be much more likely to affect other map outputs. I guess if it's serving shuffle from the executor, it could just be something bad on that executor (out of memory, timeout due to overload, etc.). > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled, however it does not cancel all currently running tasks. We should > cancel all running tasks to avoid wasting resources (and also to make the behavior > a little more clear to the end user). Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie.
> Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something. > * marking a stage as zombie before all the tasks have completed does *not* > necessarily mean the stage attempt has failed. In case (a), the stage > attempt has failed, but in case (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want e.g. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}}s > * Testing this *might* benefit from SPARK-10372
[jira] [Commented] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF
[ https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386564#comment-15386564 ] Sam Fishman commented on SPARK-13301: - I am having the same issue when applying a udf to a DataFrame. I've noticed that it seems to only occur when data is read in using sqlContext.sql(). When I use parallelize on a local Python collection and then call toDF on the resulting RDD, I don't seem to have the issue: Works: data = [ ["1", "A->B"], ["2", "B->C"], ["3", "C->A"], ["4", "D->E"], ["5", "E->D"] ] rdd = sc.parallelize(data) df = rdd.toDF(["id", "segment"]) def myFunction(number): id_test = number+ ' test' return(id_test) test_function_udf = udf(myFunction, StringType()) df2 = df.withColumn('test', test_function_udf("id")) Does not work (ie "wrong results" similar to Simone's output): # Assuming I have data stored in a table df = sqlContext.sql("select * from my_table") def myFunction(number): id_test = number+ ' test' return(id_test) test_function_udf = udf(myFunction, StringType()) df2 = df.withColumn('test', test_function_udf("id")) I should note that this also only seems to be an issue when using a UDF. If I use the builtin concat function, I do not get the error. > PySpark Dataframe return wrong results with custom UDF > -- > > Key: SPARK-13301 > URL: https://issues.apache.org/jira/browse/SPARK-13301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: PySpark in yarn-client mode - CDH 5.5.1 >Reporter: Simone >Priority: Critical > > Using a User Defined Function in PySpark inside the withColumn() method of > Dataframe, gives wrong results. 
> Here is an example: > from pyspark.sql import functions > import string > myFunc = functions.udf(lambda s: string.lower(s)) > myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show() > |col1| col2|col3| > |1265AB4F65C05740E...|Ivo|4f00ae514e7c015be...| > |1D94AB4F75C83B51E...| Raffaele|4f00dcf6422100c0e...| > |4F008903600A0133E...| Cristina|4f008903600a0133e...| > The results are wrong and seem to be random: some records are OK (for example > the third), others are not (for example the first two). > The problem does not seem to occur with Spark built-in functions: > from pyspark.sql.functions import * > myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show() > Without the withColumn() method, results always seem to be correct: > myDF.select("col1", "col2", myFunc(myDF["col1"])).show() > This is only partly a workaround because you have to list > all the columns of your DataFrame each time. > Also in Scala/Java the problem does not seem to occur.
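Setting Spark aside, the behavior the id-appending UDF from Sam Fishman's comment is expected to reproduce can be pinned down in plain Python; a minimal sketch using the sample rows from that comment (no Spark required, purely illustrative):

```python
# Pure-Python reference for the UDF logic quoted above (no Spark involved).
# myFunction appends ' test' to the id column, so on the sample rows the
# resulting 'test' column should be deterministic, never scrambled.

def my_function(number):
    return number + ' test'

rows = [["1", "A->B"], ["2", "B->C"], ["3", "C->A"], ["4", "D->E"], ["5", "E->D"]]

# What df2['test'] should contain for each row if the UDF were applied correctly.
expected_test_column = [my_function(r[0]) for r in rows]
print(expected_test_column)  # ['1 test', '2 test', '3 test', '4 test', '5 test']
```

Comparing this list against the actual `df2` output is a quick way to tell the correct rows from the scrambled ones described in the bug report.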
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386552#comment-15386552 ] Maciej Bryński commented on SPARK-16320: [~michael] Could you check SPARK-16321? I attached some VisualVM snapshots there. And the underlying data is the same. Another thing is that I'm using the read.parquet method, which means that Spark shouldn't connect to the metastore. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some tests on a parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested the following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance? > I don't know how to prepare sample data to show the problem. > Any ideas? Or public data with many nested columns? > *UPDATE* > I created a script to generate data and to confirm this problem.
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop
[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386524#comment-15386524 ] Imran Rashid commented on SPARK-2666: - [~tgraves] [~lianhuiwang]. When there is a fetch failure, spark considers all shuffle output on that executor to be gone. (The code is rather confusing -- first it just removes the one block whose fetch failed: https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1134 but just after that, it removes everything on the executor: https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1184) When a stage is retried, it reruns all the tasks for the missing shuffle outputs, at the time the stage is retried. Usually, this is just all of the map output that was on the executor which had the fetch failure. But it's not necessarily exactly the same, as even more shuffle outputs could be lost before the stage retry kicks in. * Suppose you had three stages in a row, 0 --> 1 --> 2, and you hit a shuffle fetch failure while running stage 2, say on executor A. So you need to regenerate the map output for stage 1 that was on executor A. But most likely spark will discover that to regenerate that missing output, it needs some map output from stage 0, which was on executor A. So first it will go re-run the missing parts of stage 0, and then when it gets to stage 1, the DAG scheduler will look at what map outputs are missing. So there is some extra time in there to discover more missing shuffle outputs. * Spark only marks the shuffle output as missing for the *executor* that shuffle data couldn't be read from, not for the entire node. So if it's a hardware failure, you're likely to hit more failures even after the first fetch failure comes in, since you probably can't read from any of the executors on that host.
Despite this, I don't think there is a very good reason to leave tasks running after there is a fetch failure. If there is a hardware failure, then the rest of the retry process is also likely to discover this and remove those executors as well. (Kay and I had discussed this earlier in the thread and we seemed to agree, though I dunno if we had thought through all the details at that time.) If anything, I wonder if when there is a fetch failure, we should mark all data as missing on the entire node, not just the executor, but I don't think that is necessary. > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled, however it does not cancel all currently running tasks. We should > cancel all running tasks to avoid wasting resources (and also to make the behavior > a little more clear to the end user). Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie. > Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something.
> * marking a stage as zombie before all the tasks have completed does *not* > necessarily mean the stage attempt has failed. In case (a), the stage > attempt has failed, but in case (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want e.g. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}}s > * Testing this *might* benefit from SPARK-10372
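The refactoring proposed in the notes above (make zombie-marking and task cancellation a single operation, behind one {{markZombie}} entry point) can be modeled in a few lines. This is a hypothetical Python toy model of the design idea only, not Spark's actual Scala scheduler:

```python
class TaskSetModel:
    """Toy model of the proposed coupling: marking a task set as zombie
    always cancels its running tasks (hypothetical, not Spark code)."""

    def __init__(self, running_tasks):
        self.running = set(running_tasks)
        self._zombie = False  # kept private, per the isZombie note above

    @property
    def is_zombie(self):
        return self._zombie

    def mark_zombie(self):
        # The two actions go hand-in-hand, as the issue proposes; callers
        # can no longer mark zombie without also cancelling running tasks.
        self._zombie = True
        cancelled, self.running = self.running, set()
        return cancelled

ts = TaskSetModel({1, 2, 3})
print(sorted(ts.mark_zombie()))  # [1, 2, 3]
print(ts.is_zombie, ts.running)  # True set()
```

In the real scheduler the cancellation step would also have to tolerate backends that don't implement killTask, as the notes point out.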
[jira] [Updated] (SPARK-16507) Add CRAN checks to SparkR
[ https://issues.apache.org/jira/browse/SPARK-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-16507: - Target Version/s: 2.0.0, 2.1.0 (was: 2.1.0) > Add CRAN checks to SparkR > -- > > Key: SPARK-16507 > URL: https://issues.apache.org/jira/browse/SPARK-16507 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman > Fix For: 2.0.0 > > > One of the steps to publishing SparkR is to pass the `R CMD check --as-cran`. > We should add a script to do this and fix any errors / warnings we find
[jira] [Commented] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables
[ https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386409#comment-15386409 ] Marcelo Vanzin commented on SPARK-16632: Yes, that's the right stack trace. It's a CTAS query which is probably why the write path shows up there. > Vectorized parquet reader fails to read certain fields from Hive tables > --- > > Key: SPARK-16632 > URL: https://issues.apache.org/jira/browse/SPARK-16632 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Hive 1.1 (CDH) >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.1.0 > > > The vectorized parquet reader fails to read certain tables created by Hive. > When the tables have type "tinyint" or "smallint", Catalyst converts those to > "ByteType" and "ShortType" respectively. But when Hive writes those tables in > parquet format, the parquet schema in the files contains "int32" fields. > To reproduce, run these commands in the hive shell (or beeline): > {code} > create table abyte (value tinyint) stored as parquet; > create table ashort (value smallint) stored as parquet; > insert into abyte values (1); > insert into ashort values (1); > {code} > Then query them with Spark 2.0: > {code} > spark.sql("select * from abyte").show(); > spark.sql("select * from ashort").show(); > {code} > You'll see this exception (for the byte case): > {noformat} > 16/07/13 12:24:23 ERROR datasources.InsertIntoHadoopFsRelationCommand: > Aborting job. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 3, scm-centos71-iqalat-2.gce.cloudera.com): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getByte(OnHeapColumnVector.java:159) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {noformat} > This works when you point Spark directly at the files (instead of using the > metastore data), or when you disable the vectorized parquet reader. > The root cause seems to be that Hive creates these tables with a > not-so-complete schema: > {noformat} > $ parquet-tools schema /tmp/byte.parquet > message hive_schema { > optional int32 value; > } > {noformat} > There's no indication that the field is a 32-bit field used to store 8-bit > values. When the ParquetReadSupport code tries to consolidate both schemas, > it just chooses whatever is in the parquet file for primitive types (see > ParquetReadSupport.clipParquetType); the vectorized reader uses the catalyst > schema, which comes from the Hive metastore, and says it's a byte field, so > when it tries to read the data, the byte data stored in "OnHeapColumnVector" > is null. > I have tested a small change to
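The schema mismatch described above is small enough to sketch without Spark: the parquet file stores plain int32, while the catalyst schema from the metastore says tinyint (ByteType), so a safe reader has to narrow the stored 32-bit values rather than expect byte-sized data. A hypothetical pure-Python illustration, not the actual reader code:

```python
# Illustrative only: Hive wrote the tinyint column as parquet int32, while
# the catalyst schema (from the metastore) says ByteType. A reader keyed off
# the catalyst type must narrow the int32 values instead of expecting byte
# data -- otherwise it reads column storage that was never populated.

INT8_MIN, INT8_MAX = -128, 127

def narrow_int32_to_tinyint(int32_values):
    """Narrow int32-backed values to the tinyint range, failing loudly on overflow."""
    for v in int32_values:
        if not (INT8_MIN <= v <= INT8_MAX):
            raise ValueError("int32 value %d does not fit in tinyint" % v)
    return list(int32_values)

print(narrow_int32_to_tinyint([1]))  # [1] -- the single row from the repro
```

The same narrowing argument applies to smallint (ShortType) backed by int32, the other case the report mentions.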
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386395#comment-15386395 ] Michael Allman commented on SPARK-16320: The code path for reading data from parquet files has been refactored extensively. The fact that [~maver1ck] is testing performance on a table with 400 partitions makes me wonder if my PR for https://issues.apache.org/jira/browse/SPARK-15968 will make a difference for repeated queries on partitioned tables. That PR was merged into master and backported to 2.0. The commit short hash is d5d2457. Another issue that's in play when querying Hive metastore tables with a large number of partitions is the time it takes to query the Hive metastore for the table's partition metadata. Especially if your metastore is not configured to use direct sql, this in and of itself can take from seconds to minutes. With 400 partitions and without direct sql, you might see about 10 to 20 seconds or so of query planning time. It would be helpful to look at query times minus the time spent in query planning. Since query planning happens in the driver before the job's stages are launched, you can estimate this by looking at the actual stage times in your SQL query job. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? 
> I don't know how to prepare sample data to show the problem. > Any ideas? Or public data with many nested columns? > *UPDATE* > I created a script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop
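The shape of the dataset the script above produces can be sanity-checked without running Spark; a small pure-Python sketch of the leaf-field count and the dotted column naming that `_create_nested_data` yields, assuming the script as written:

```python
from functools import reduce

def nested_leaf_count(levels):
    # Each nesting level multiplies the field count, so levels [2, 10, 200]
    # yield 2 * 10 * 200 = 4000 innermost integer fields per row.
    return reduce(lambda acc, n: acc * n, levels, 1)

def leaf_path(indices):
    # Mirrors the "column{}".format(i) naming applied at every nesting level.
    return ".".join("column{}".format(i) for i in indices)

print(nested_leaf_count([2, 10, 200]))  # 4000
print(leaf_path([1, 5, 50]))            # column1.column5.column50 -- the queried path
```

So the benchmark query touches one leaf out of 4000 per row, which is exactly the nested-column pruning scenario whose performance is being compared between 1.6 and 2.0.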
[jira] [Commented] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386393#comment-15386393 ] Hayri Volkan Agun commented on SPARK-14887: --- The same issue in 1.6.2 can be repeated with around 300 iterative UDF transformations after 20 unionAll calls on an approximately 4000-row (~25 column) DataFrame... > Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits > --- > > Key: SPARK-14887 > URL: https://issues.apache.org/jira/browse/SPARK-14887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: fang fang chen > > Similar issue to SPARK-14138 and SPARK-8443: > With large SQL syntax (673K), the following error happened: > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default
[ https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15425: - Labels: release_notes releasenotes (was: ) > Disallow cartesian joins by default > --- > > Key: SPARK-15425 > URL: https://issues.apache.org/jira/browse/SPARK-15425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sameer Agarwal > Labels: release_notes, releasenotes > Fix For: 2.0.0 > > > It is fairly easy for users to shoot themselves in the foot if they run > cartesian joins. Often they might not even be aware of the join methods > chosen. This happened to me a few times in the last few weeks. > It would be a good idea to disable cartesian joins by default, and require > explicit enabling of it via "crossJoin" method or in SQL "cross join". This > however might be too large of a scope for 2.0 given the timing. As a small > and quick fix, we can just have a single config option > (spark.sql.join.enableCartesian) that controls this behavior. In the future > we can implement the fine-grained control. > Note that the error message should be friendly and say "Set > spark.sql.join.enableCartesian to true to turn on cartesian joins."
[jira] [Updated] (SPARK-16644) constraints propagation may fail the query
[ https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16644: - Component/s: SQL > constraints propagation may fail the query > -- > > Key: SPARK-16644 > URL: https://issues.apache.org/jira/browse/SPARK-16644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > > {code} > create table tbl(a int, b int); > select > a, > max(b) as c1, > b as c2 > from tbl > where a = b > group by a, b > having c1 = 1 > {code} > this query fails in 2.0, but works in 1.6
[jira] [Updated] (SPARK-16644) constraints propagation may fail the query
[ https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16644: - Target Version/s: 2.0.1 (was: 2.0.0) > constraints propagation may fail the query > -- > > Key: SPARK-16644 > URL: https://issues.apache.org/jira/browse/SPARK-16644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > > {code} > create table tbl(a int, b int); > select > a, > max(b) as c1, > b as c2 > from tbl > where a = b > group by a, b > having c1 = 1 > {code} > this query fails in 2.0, but works in 1.6
[jira] [Updated] (SPARK-16633) lag/lead does not return the default value when the offset row does not exist
[ https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16633: - Target Version/s: 2.0.1 (was: 2.0.0) > lag/lead does not return the default value when the offset row does not exist > - > > Key: SPARK-16633 > URL: https://issues.apache.org/jira/browse/SPARK-16633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Attachments: window_function_bug.html > > > Please see the attached notebook. Seems lag/lead somehow fail to recognize > that an offset row does not exist and generate wrong results.
[jira] [Updated] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.
[ https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16642: - Target Version/s: 2.0.1 (was: 2.0.0) > ResolveWindowFrame should not be triggered on UnresolvedFunctions. > -- > > Key: SPARK-16642 > URL: https://issues.apache.org/jira/browse/SPARK-16642 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > The case at > https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792 > is shown below > {code} > case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, > UnspecifiedFrame)) => > val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, > acceptWindowFrame = true) > we.copy(windowSpec = s.copy(frameSpecification = frame)) > {code} > This case will be triggered even when the function is unresolved. So, when > functions like lead are used, we may see errors like {{Window Frame RANGE > BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame > ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the > frame specification.
[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16648: - Target Version/s: 2.0.1 (was: 2.0.0) > LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException > -- > > Key: SPARK-16648 > URL: https://issues.apache.org/jira/browse/SPARK-16648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: liancheng > > The following simple SQL query reproduces this issue: > {code:sql} > SELECT LAST_VALUE(FALSE) OVER (); > {code} > Exception thrown: > {noformat} > java.lang.IndexOutOfBoundsException: 0 > at > scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) > at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48) > at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637) > at > org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175) > at >
[jira] [Updated] (SPARK-16644) constraints propagation may fail the query
[ https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16644: - Affects Version/s: 2.0.0 Target Version/s: 2.0.0 > constraints propagation may fail the query > -- > > Key: SPARK-16644 > URL: https://issues.apache.org/jira/browse/SPARK-16644 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > > {code} > create table tbl(a int, b int); > select > a, > max(b) as c1, > b as c2 > from tbl > where a = b > group by a, b > having c1 = 1 > {code} > This query fails in 2.0 but works in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386360#comment-15386360 ] Dongjoon Hyun commented on SPARK-16651: --- I made a PR for you and other people, but I'm not sure it'll be merged. > No exception using DataFrame.withColumnRenamed when existing column doesn't > exist > - > > Key: SPARK-16651 > URL: https://issues.apache.org/jira/browse/SPARK-16651 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Tom Phillips > > The {{withColumnRenamed}} method does not raise an exception when the > existing column does not exist in the dataframe. > Example: > {code} > In [4]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > In [5]: df = df.withColumnRenamed('dob', 'date_of_birth') > In [6]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes
[ https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386338#comment-15386338 ] Alex Bozarth commented on SPARK-16654: -- Perhaps we can change the status column to "Blacklisted" or "Alive (Blacklisted)" instead of Alive or Dead? I'm not very familiar with how the blacklisting works but I would be willing to learn and add this once the other PR is merged. > UI Should show blacklisted executors & nodes > > > Key: SPARK-16654 > URL: https://issues.apache.org/jira/browse/SPARK-16654 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Imran Rashid > > SPARK-8425 will add the ability to blacklist entire executors and nodes to > deal w/ faulty hardware. However, without displaying it on the UI, it can be > hard to realize which executor is bad, and why tasks aren't getting scheduled > on certain executors. > As a first step, we should just show nodes and executors that are blacklisted > for the entire application (no need to show blacklisting for tasks & stages). > This should also ensure that blacklisting events get into the event logs for > the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386333#comment-15386333 ] Michael Allman commented on SPARK-16320: [~maver1ck] Would it be possible for you to share your parquet file on S3? I would like to test with it specifically. If not publicly, could you share it with me privately? Thanks. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some tests on a parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested the following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance? > I don't know how to prepare sample data to show the problem. > Any ideas? Or public data with many nested columns? > *UPDATE* > I created a script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.
[ https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16642: - Target Version/s: 2.0.0 > ResolveWindowFrame should not be triggered on UnresolvedFunctions. > -- > > Key: SPARK-16642 > URL: https://issues.apache.org/jira/browse/SPARK-16642 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > The case at > https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792 > is shown below > {code} > case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, > UnspecifiedFrame)) => > val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, > acceptWindowFrame = true) > we.copy(windowSpec = s.copy(frameSpecification = frame)) > {code} > This case will be triggered even when the function is unresolved. So, when > functions like lead are used, we may see errors like {{Window Frame RANGE > BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame > ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the > frame specification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16655) Spark thrift server application is not stopped if its in ACCEPTED stage
Yesha Vora created SPARK-16655: -- Summary: Spark thrift server application is not stopped if its in ACCEPTED stage Key: SPARK-16655 URL: https://issues.apache.org/jira/browse/SPARK-16655 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Yesha Vora When the Spark Thrift Server is started in yarn-client mode, it starts a YARN application. If that YARN application is still in the ACCEPTED state and a stop operation is performed on the Spark Thrift Server, the YARN application does not get killed/stopped. On stop, the Spark Thrift Server should stop the YARN application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16651: Assignee: (was: Apache Spark) > No exception using DataFrame.withColumnRenamed when existing column doesn't > exist > - > > Key: SPARK-16651 > URL: https://issues.apache.org/jira/browse/SPARK-16651 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Tom Phillips > > The {{withColumnRenamed}} method does not raise an exception when the > existing column does not exist in the dataframe. > Example: > {code} > In [4]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > In [5]: df = df.withColumnRenamed('dob', 'date_of_birth') > In [6]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386317#comment-15386317 ] Apache Spark commented on SPARK-16651: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/14288 > No exception using DataFrame.withColumnRenamed when existing column doesn't > exist > - > > Key: SPARK-16651 > URL: https://issues.apache.org/jira/browse/SPARK-16651 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Tom Phillips > > The {{withColumnRenamed}} method does not raise an exception when the > existing column does not exist in the dataframe. > Example: > {code} > In [4]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > In [5]: df = df.withColumnRenamed('dob', 'date_of_birth') > In [6]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16651: Assignee: Apache Spark > No exception using DataFrame.withColumnRenamed when existing column doesn't > exist > - > > Key: SPARK-16651 > URL: https://issues.apache.org/jira/browse/SPARK-16651 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Tom Phillips >Assignee: Apache Spark > > The {{withColumnRenamed}} method does not raise an exception when the > existing column does not exist in the dataframe. > Example: > {code} > In [4]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > In [5]: df = df.withColumnRenamed('dob', 'date_of_birth') > In [6]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16650) Improve documentation of spark.task.maxFailures
[ https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16650: Assignee: Apache Spark > Improve documentation of spark.task.maxFailures > > > Key: SPARK-16650 > URL: https://issues.apache.org/jira/browse/SPARK-16650 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Assignee: Apache Spark > > The documentation for spark.task.maxFailures isn't clear as to whether this > is just the number of total task failures or if its a single task failing > multiple attempts. > It turns out its the latter, a single task has to fail spark.task.maxFailures > number of attempts before it fails the job. > We should try to make that more clear in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16650) Improve documentation of spark.task.maxFailures
[ https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386311#comment-15386311 ] Apache Spark commented on SPARK-16650: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/14287 > Improve documentation of spark.task.maxFailures > > > Key: SPARK-16650 > URL: https://issues.apache.org/jira/browse/SPARK-16650 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.6.2 >Reporter: Thomas Graves > > The documentation for spark.task.maxFailures isn't clear as to whether this > is just the number of total task failures or if its a single task failing > multiple attempts. > It turns out its the latter, a single task has to fail spark.task.maxFailures > number of attempts before it fails the job. > We should try to make that more clear in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16650) Improve documentation of spark.task.maxFailures
[ https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16650: Assignee: (was: Apache Spark) > Improve documentation of spark.task.maxFailures > > > Key: SPARK-16650 > URL: https://issues.apache.org/jira/browse/SPARK-16650 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.6.2 >Reporter: Thomas Graves > > The documentation for spark.task.maxFailures isn't clear as to whether this > is just the number of total task failures or if its a single task failing > multiple attempts. > It turns out its the latter, a single task has to fail spark.task.maxFailures > number of attempts before it fails the job. > We should try to make that more clear in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
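[Editor's note] The semantics being clarified can be modeled with a toy counter (a sketch, not the scheduler's actual code): the job aborts only when a *single* task accumulates spark.task.maxFailures failed attempts, not when that many failures occur across different tasks. The default of 4 is assumed here.

```python
def job_aborts(failure_log, max_failures=4):
    """failure_log: task ids, one entry per failed attempt, in order.
    Returns True once any single task reaches max_failures failed attempts,
    mirroring the per-task (not job-wide) meaning of spark.task.maxFailures."""
    attempts = {}
    for task_id in failure_log:
        attempts[task_id] = attempts.get(task_id, 0) + 1
        if attempts[task_id] >= max_failures:
            return True
    return False
```

So four different tasks each failing once does not abort the job, while one task failing four times does — the distinction the documentation change aims to make explicit.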
[jira] [Comment Edited] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386295#comment-15386295 ] Dongjoon Hyun edited comment on SPARK-16651 at 7/20/16 5:55 PM: Also, in Spark 2.0 RC5, DataFrame is merged into Dataset and the documentation of Dataset still has that notice. {code} def withColumnRenamed(existingName: String, newName: String): DataFrame Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName. Since 2.0.0 {code} http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset was (Author: dongjoon): Also, in Spark 2.0 RC, Dataframe is merged into Dataset and the documentation of Dataset still has that notice. {code} def withColumnRenamed(existingName: String, newName: String): DataFrame Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName. Since 2.0.0 {code} http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset > No exception using DataFrame.withColumnRenamed when existing column doesn't > exist > - > > Key: SPARK-16651 > URL: https://issues.apache.org/jira/browse/SPARK-16651 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Tom Phillips > > The {{withColumnRenamed}} method does not raise an exception when the > existing column does not exist in the dataframe. > Example: > {code} > In [4]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > In [5]: df = df.withColumnRenamed('dob', 'date_of_birth') > In [6]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386295#comment-15386295 ] Dongjoon Hyun commented on SPARK-16651: --- Also, in Spark 2.0 RC, Dataframe is merged into Dataset and the documentation of Dataset still has that notice. {code} def withColumnRenamed(existingName: String, newName: String): DataFrame Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName. Since 2.0.0 {code} http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset > No exception using DataFrame.withColumnRenamed when existing column doesn't > exist > - > > Key: SPARK-16651 > URL: https://issues.apache.org/jira/browse/SPARK-16651 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Tom Phillips > > The {{withColumnRenamed}} method does not raise an exception when the > existing column does not exist in the dataframe. > Example: > {code} > In [4]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > In [5]: df = df.withColumnRenamed('dob', 'date_of_birth') > In [6]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386286#comment-15386286 ] Dongjoon Hyun commented on SPARK-16651: --- Hi, [~tomwphillips]. It's a documented behavior in Scala side since 1.3. {code} def withColumnRenamed(existingName: String, newName: String): DataFrame Returns a new DataFrame with a column renamed. This is a no-op if schema doesn't contain existingName. {code} http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame Maybe, we can update Python API but we can not change the behavior. > No exception using DataFrame.withColumnRenamed when existing column doesn't > exist > - > > Key: SPARK-16651 > URL: https://issues.apache.org/jira/browse/SPARK-16651 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Tom Phillips > > The {{withColumnRenamed}} method does not raise an exception when the > existing column does not exist in the dataframe. > Example: > {code} > In [4]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > In [5]: df = df.withColumnRenamed('dob', 'date_of_birth') > In [6]: df.show() > +---+-+ > |age| name| > +---+-+ > | 1|Alice| > +---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
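[Editor's note] Since the no-op behavior is documented and unlikely to change, a caller-side workaround is a strict wrapper that validates the column name first. The helper below is hypothetical, sketched over a plain list of column names rather than a real DataFrame, to show the behavior the reporter expected.

```python
def rename_column_strict(columns, existing, new):
    """Return a renamed copy of `columns`, raising when `existing` is absent --
    the strict variant of the silently no-op withColumnRenamed behavior."""
    if existing not in columns:
        raise ValueError(
            "column {!r} does not exist in {}".format(existing, columns))
    return [new if c == existing else c for c in columns]
```

In a real PySpark wrapper one would check `existing in df.columns` before calling `df.withColumnRenamed(existing, new)`, keeping the built-in no-op semantics intact for callers who rely on them.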
[jira] [Resolved] (SPARK-16634) GenericArrayData can't be loaded in certain JVMs
[ https://issues.apache.org/jira/browse/SPARK-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16634. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.1.0 2.0.1 > GenericArrayData can't be loaded in certain JVMs > > > Key: SPARK-16634 > URL: https://issues.apache.org/jira/browse/SPARK-16634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > There's an annoying bug in some JVMs that causes certain scala-generated > bytecode to not load. The current code in GenericArrayData.scala triggers > that bug (at least with 1.7.0_67, maybe others). > Since it's easy to work around the bug, I'd rather do that instead of asking > people who might be running that version to have to upgrade. > Error: > {noformat} > 16/07/19 16:02:35 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 > (TID 2) on executor vanzin-st1-3.vpc.cloudera.com: java.lang.VerifyError (Bad > method call from inside of a branch > Exception Details: > Location: > > org/apache/spark/sql/catalyst/util/GenericArrayData.(Ljava/lang/Object;)V > @52: invokespecial > Reason: > Error exists in the bytecode > Bytecode: > 000: 2a2b 4d2c c100 dc99 000e 2cc0 00dc 4e2d > 010: 3a04 a700 20b2 0129 2c04 b601 2d99 001b > 020: 2c3a 05b2 007a 1905 b600 7eb9 00fe 0100 > 030: 3a04 1904 b700 f3b1 bb01 2f59 2cb7 0131 > 040: bf > Stackmap Table: > > full_frame(@21,{UninitializedThis,Object[#177],Object[#177]},{UninitializedThis}) > > full_frame(@50,{UninitializedThis,Object[#177],Object[#177],Top,Object[#220]},{UninitializedThis}) > > full_frame(@56,{UninitializedThis,Object[#177],Object[#177]},{UninitializedThis}) > ) [duplicate 2] > {noformat} > I didn't run into this with 2.0, not sure whether the issue exists there. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes
[ https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386242#comment-15386242 ] Imran Rashid commented on SPARK-16654: -- cc [~tgraves] > UI Should show blacklisted executors & nodes > > > Key: SPARK-16654 > URL: https://issues.apache.org/jira/browse/SPARK-16654 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Imran Rashid > > SPARK-8425 will add the ability to blacklist entire executors and nodes to > deal w/ faulty hardware. However, without displaying it on the UI, it can be > hard to realize which executor is bad, and why tasks aren't getting scheduled > on certain executors. > As a first step, we should just show nodes and executors that are blacklisted > for the entire application (no need to show blacklisting for tasks & stages). > This should also ensure that blacklisting events get into the event logs for > the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16654) UI Should show blacklisted executors & nodes
Imran Rashid created SPARK-16654: Summary: UI Should show blacklisted executors & nodes Key: SPARK-16654 URL: https://issues.apache.org/jira/browse/SPARK-16654 Project: Spark Issue Type: Improvement Components: Scheduler, Web UI Affects Versions: 2.0.0 Reporter: Imran Rashid SPARK-8425 will add the ability to blacklist entire executors and nodes to deal w/ faulty hardware. However, without displaying it on the UI, it can be hard to realize which executor is bad, and why tasks aren't getting scheduled on certain executors. As a first step, we should just show nodes and executors that are blacklisted for the entire application (no need to show blacklisting for tasks & stages). This should also ensure that blacklisting events get into the event logs for the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching
[ https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-15951: -- Assignee: Kishor Patil > Change Executors Page to use datatables to support sorting columns and > searching > > > Key: SPARK-15951 > URL: https://issues.apache.org/jira/browse/SPARK-15951 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Kishor Patil >Assignee: Kishor Patil >Priority: Minor > Fix For: 2.1.0 > > > Support column sort and search for the Executors page using jQuery DataTable > and the REST API. Before this commit, the executors page generated hard-coded > HTML and could not support search; also, sorting was disabled if any > application had more than one attempt. Supporting search and sort > (over all applications rather than the 20 entries in the current page) in any > case will greatly improve the user experience. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching
[ https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-15951. --- Resolution: Fixed Fix Version/s: 2.1.0 > Change Executors Page to use datatables to support sorting columns and > searching > > > Key: SPARK-15951 > URL: https://issues.apache.org/jira/browse/SPARK-15951 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Kishor Patil >Priority: Minor > Fix For: 2.1.0 > > > Support column sort and search for the Executors page using jQuery DataTable > and the REST API. Before this commit, the executors page generated hard-coded > HTML and could not support search; also, sorting was disabled if any > application had more than one attempt. Supporting search and sort > (over all applications rather than the 20 entries in the current page) in any > case will greatly improve the user experience. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386235#comment-15386235 ] Thomas Graves commented on SPARK-2666: -- I think eventually adding prestart (a MapReduce slowstart-type setting) makes sense. This is actually why I didn't change the map output statuses to go along with task launch; I wanted to be able to do this or get incremental map output status results. But as far as keeping the remaining tasks running goes, I think it depends on the behavior, and I haven't had time to look in more detail. If the stage fails, which tasks does it rerun: 1) does it rerun all the ones not yet succeeded in the failed stage (including the ones that could still be running)? 2) does it only rerun the failed ones and wait for the ones still running in the failed stage, using their results if they succeed? From what I saw with this job, I thought it was acting like number 1 above. The only reason to leave the others running is to see if they get FetchFailures; this seems like a lot of overhead to find that out if a task takes a long time. When a fetch failure happens, does the scheduler re-run all maps that had run on that node, or just the ones specifically mentioned by the fetch failure? Again, I thought it was just the specific map output that the fetch failure failed to get, which is why it needs to know whether the other reducers get fetch failures. I can kind of understand letting them run to see if they hit fetch failures as well, but on a large job, or with tasks that take a long time, if we aren't counting them as successes then it's more a waste of resources; it just extends the job time and confuses the user, since the UI doesn't represent those still running. In the case I was seeing, my tasks took roughly an hour. One stage failed, so it restarted that stage, but since it didn't kill the tasks from the original stage it had very few executors free to run new ones, so the job took a lot longer than it should have. 
I don't remember the exact cause of the failures anymore. Anyway I think the results are going to vary a lot based on the type of job and length of each stage (map vs reduce). personally I think it would be better to change to fail all maps that ran on the host it failed to fetch from and kill the rest of the running reducers in that stage. But I would have to investigate the code more to fully understand. > Always try to cancel running tasks when a stage is marked as zombie > --- > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Reporter: Lianhui Wang > > There are some situations in which the scheduler can mark a task set as a > "zombie" before the task set has completed all of its tasks. For example: > (a) When a task fails b/c of a {{FetchFailed}} > (b) When a stage completes because two different attempts create all the > ShuffleMapOutput, though no attempt has completed all its tasks (at least, > this *should* result in the task set being marked as zombie, see SPARK-10370) > (there may be others, I'm not sure if this list is exhaustive.) > Marking a taskset as zombie prevents any *additional* tasks from getting > scheduled, however it does not cancel all currently running tasks. We should > cancel all running to avoid wasting resources (and also to make the behavior > a little more clear to the end user). Rather than canceling tasks in each > case piecemeal, we should refactor the scheduler so that these two actions > are always taken together -- canceling tasks should go hand-in-hand with > marking the taskset as zombie. > Some implementation notes: > * We should change {{taskSetManager.isZombie}} to be private and put it > behind a method like {{markZombie}} or something. > * marking a stage as zombie before the all tasks have completed does *not* > necessarily mean the stage attempt has failed. 
In case (a), the stage > attempt has failed, but in stage (b) we are not canceling b/c of a failure, > rather just b/c no more tasks are needed. > * {{taskScheduler.cancelTasks}} always marks the task set as zombie. > However, it also has some side-effects like logging that the stage has failed > and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) > when nothing has failed. So it may need some additional refactoring to go > along w/ {{markZombie}}. > * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need > to be sure to catch the {{UnsupportedOperationException}} s > * Testing this *might* benefit from SPARK-10372 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
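The coupling proposed in the implementation notes above, where marking a task set as zombie always cancels its running tasks while tolerating backends that don't implement task killing, can be sketched as follows. This is a plain-Python illustration with hypothetical names (`TaskSetManager`, `SchedulerBackend`, `mark_zombie`), not Spark's actual scheduler internals:

```python
# Illustrative sketch: marking a task set as zombie and cancelling its
# running tasks happen together as one action, and backends that cannot
# kill tasks are tolerated. All names here are hypothetical stand-ins.

class SchedulerBackend:
    """Backend that may or may not support killing tasks."""
    def __init__(self, supports_kill=True):
        self.supports_kill = supports_kill
        self.killed = []

    def kill_task(self, task_id):
        if not self.supports_kill:
            # Mirrors backends that are free to not implement killTask
            raise NotImplementedError("killTask not supported")
        self.killed.append(task_id)

class TaskSetManager:
    def __init__(self, backend, running_tasks):
        self.backend = backend
        self.running_tasks = set(running_tasks)
        self._zombie = False

    @property
    def is_zombie(self):
        # Kept private behind mark_zombie(), per the notes above
        return self._zombie

    def mark_zombie(self):
        """Mark the task set as zombie AND cancel its running tasks."""
        self._zombie = True
        for task_id in list(self.running_tasks):
            try:
                self.backend.kill_task(task_id)
            except NotImplementedError:
                pass  # backend can't kill; the task is left to finish
            self.running_tasks.discard(task_id)

tsm = TaskSetManager(SchedulerBackend(), running_tasks=[1, 2, 3])
tsm.mark_zombie()
print(tsm.is_zombie, sorted(tsm.backend.killed))  # True [1, 2, 3]
```

Because cancellation lives inside `mark_zombie`, no caller can put the task set into the zombie state while leaving its tasks running, which is the piecemeal situation the ticket describes.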
[jira] [Commented] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order
[ https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386225#comment-15386225 ] Wade Salazar commented on SPARK-15918: -- Is the stance on this issue that, since other SQL interpreters require the columns of both SELECT statements to be in the same order, Spark will follow suit? Though this seems relatively innocuous, it leads to a considerable amount of troubleshooting when this behavior is not expected. Can we request, as an improvement to Spark SQL, that this column-ordering requirement be eliminated? > unionAll returns wrong result when two dataframes has schema in different > order > --- > > Key: SPARK-15918 > URL: https://issues.apache.org/jira/browse/SPARK-15918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: CentOS >Reporter: Prabhu Joseph > > On applying the unionAll operation between dataframes A and B, they both have the same > schema but in a different order, and hence the result has its column-value mapping > changed. 
> Repro: > {code} > A.show() > +---++---+--+--+-++---+--+---+---+-+ > |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| > +---++---+--+--+-++---+--+---+---+-+ > +---++---+--+--+-++---+--+---+---+-+ > B.show() > +-+---+--+---+---+--+--+--+---+---+--++ > |dtype|tag| > time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| > +-+---+--+---+---+--+--+--+---+---+--++ > |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUDP713.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT718.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT703Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUR716A.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT803Z.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT728.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUR806.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > +-+---+--+---+---+--+--+--+---+---+--++ > A = A.unionAll(B) > A.show() > +---+---+--+--+--+-++---+--+---+---+-+ > |tag| year_day| > tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| > +---+---+--+--+--+-++---+--+---+---+-+ > | F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > 
+---+---+--+--+--+-++---+--+---+---+-+ > {code} > On changing the schema of A according to B, unionAll works fine: > {code} > C = > A.select("dtype","tag","time","tm_hour","tm_mday","tm_min","tm_mon","tm_sec","tm_yday","tm_year","value","year_day") > A = C.unionAll(B) > A.show() > +-+---+--+---+---+--+--+--+---+---+--++ > |dtype|tag| > time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| > +-+---+--+---+---+--+--+--+---+---+--++ > |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |
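The root cause in the report above is that `unionAll` aligns columns by position, not by name, which is why reordering A's columns to match B's schema fixes the result (later Spark versions also added a `unionByName` method for this). A plain-Python sketch, not Spark code, of the two behaviors:

```python
# Illustrative sketch (plain Python, not Spark) of why unionAll mis-maps
# values: it aligns columns by POSITION, not by name, so the select(...)
# workaround above must realign the schemas first.

def union_all_positional(schema_a, rows_a, schema_b, rows_b):
    # Mimics unionAll: keeps A's schema, appends B's rows unchanged.
    return schema_a, rows_a + rows_b

def reorder(schema_from, rows, schema_to):
    # Mimics the select(...) workaround: realign columns by name.
    idx = [schema_from.index(c) for c in schema_to]
    return [tuple(r[i] for i in idx) for r in rows]

a_schema = ["tag", "value"]
b_schema = ["value", "tag"]
b_rows = [(1.2345, "C_FNHXUT701Z.CNSTLO")]

# Positional union: B's "value" lands in A's "tag" column.
_, bad = union_all_positional(a_schema, [], b_schema, b_rows)
print(bad)   # [(1.2345, 'C_FNHXUT701Z.CNSTLO')] -- mis-mapped

# Workaround: reorder B's columns to A's schema first.
good = reorder(b_schema, b_rows, a_schema)
print(good)  # [('C_FNHXUT701Z.CNSTLO', 1.2345)]
```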
[jira] [Updated] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-15968: --- Fix Version/s: 2.0.0 > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman >Assignee: Michael Allman > Labels: hive, metastore > Fix For: 2.0.0, 2.1.0 > > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for non-empty partitioned > tables. As a result, cache lookups on non-empty partitioned tables always > miss and these relations are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server
[ https://issues.apache.org/jira/browse/SPARK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386192#comment-15386192 ] Rahul Bhatia commented on SPARK-13767: -- I'm seeing the error that Venkata showed as well, if anyone has any thoughts on why that would occur, I'd really appreciate it. Thanks, > py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to > the Java server > > > Key: SPARK-13767 > URL: https://issues.apache.org/jira/browse/SPARK-13767 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Poonam Agrawal > > I am trying to create spark context object with the following commands on > pyspark: > from pyspark import SparkContext, SparkConf > conf = > SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host', > 'cassandra-machine-ip').set('spark.storage.memoryFraction', > '0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', > 500).set('spark.serializer', > 'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', > 'FAIR').set('spark.mesos.coarse', 'true') > sc = SparkContext(conf=conf) > but I am getting the following error: > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in > __init__ > self._jconf = _jvm.SparkConf(loadDefaults) > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 766, in __getattr__ > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 362, in send_command > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 318, in _get_connection > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 325, in _create_connection > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 432, in 
start > py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to > the Java server > I am getting the same error executing the command : conf = > SparkConf().setAppName("App_name").setMaster("spark://127.0.0.1:7077") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16613) RDD.pipe returns values for empty partitions
[ https://issues.apache.org/jira/browse/SPARK-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16613. - Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.1.0 2.0.1 > RDD.pipe returns values for empty partitions > > > Key: SPARK-16613 > URL: https://issues.apache.org/jira/browse/SPARK-16613 > Project: Spark > Issue Type: Bug >Reporter: Alex Krasnyansky >Assignee: Sean Owen > Fix For: 2.0.1, 2.1.0 > > > Suppose we have such Spark code > {code} > object PipeExample { > def main(args: Array[String]) { > val fstRdd = sc.parallelize(List("hi", "hello", "how", "are", "you")) > val pipeRdd = > fstRdd.pipe("/Users/finkel/spark-pipe-example/src/main/resources/len.sh") > pipeRdd.collect.foreach(println) > } > } > {code} > It uses a bash script to convert a string to its length. > {code} > #!/bin/sh > read input > len=${#input} > echo $len > {code} > So far so good, but when I run the code, it prints incorrect output. For > example: > {code} > 0 > 2 > 0 > 5 > 3 > 0 > 3 > 3 > {code} > I expect to see > {code} > 2 > 5 > 3 > 3 > 3 > {code} > which is correct output for the app. I think it's a bug. It's expected to see > only positive integers and avoid zeros. > Environment: > 1. Spark version is 1.6.2 > 2. Scala version is 2.11.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
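The zeros in the report above line up with empty partitions: `RDD.pipe` launches the script once per partition, and an empty partition feeds it empty input, so `read input` sees nothing and the script echoes 0. A plain-Python simulation of that behavior (the partition count of 8 is an assumption for illustration; Spark's default depends on the environment):

```python
# Simulation (plain Python + /bin/sh, not Spark) of why RDD.pipe emits
# zeros: the script runs once per partition, and empty partitions give
# it empty stdin, so `read` assigns nothing and the length printed is 0.
import subprocess

# The len.sh script from the report, inlined.
LEN_SH = 'read input\nlen=${#input}\necho $len'

def pipe_partitions(data, num_partitions):
    # Round-robin split, as a stand-in for Spark's partitioning.
    parts = [data[i::num_partitions] for i in range(num_partitions)]
    out = []
    for part in parts:
        stdin = "\n".join(part) + ("\n" if part else "")
        r = subprocess.run(["sh", "-c", LEN_SH], input=stdin,
                           capture_output=True, text=True)
        out.append(r.stdout.strip())
    return out

words = ["hi", "hello", "how", "are", "you"]
# With 8 partitions, 3 are empty and each contributes a "0".
print(pipe_partitions(words, 8))  # ['2', '5', '3', '3', '3', '0', '0', '0']
```

With exactly 5 partitions (one element each) the output is the expected `['2', '5', '3', '3', '3']`, which matches the fix: skip running the command for empty partitions.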
[jira] [Updated] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15968: Fix Version/s: (was: 2.1.0)
[jira] [Assigned] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS
[ https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16653: Assignee: Apache Spark > Make convergence tolerance param in ANN default value consistent with other > algorithm using LBFGS > - > > Key: SPARK-16653 > URL: https://issues.apache.org/jira/browse/SPARK-16653 > Project: Spark > Issue Type: Improvement > Components: ML, Optimizer >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > The default value of convergence tolerance param in ANN is 1e-4, > but other algorithm (such as LinearRegression, LogisticRegression, and so on) > using LBFGS optimizers this param's default values are all 1e-6. > I think make them the same will be better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS
[ https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386028#comment-15386028 ] Apache Spark commented on SPARK-16653: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14286
[jira] [Assigned] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS
[ https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16653: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS
[ https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-16653: --- Component/s: Optimizer ML
[jira] [Created] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS
Weichen Xu created SPARK-16653: -- Summary: Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS Key: SPARK-16653 URL: https://issues.apache.org/jira/browse/SPARK-16653 Project: Spark Issue Type: Improvement Reporter: Weichen Xu The default value of convergence tolerance param in ANN is 1e-4, but other algorithm (such as LinearRegression, LogisticRegression, and so on) using LBFGS optimizers this param's default values are all 1e-6. I think make them the same will be better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3210) Flume Polling Receiver must be more tolerant to connection failures.
[ https://issues.apache.org/jira/browse/SPARK-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386002#comment-15386002 ] Ian Brooks commented on SPARK-3210: --- Hi, I have also noticed an issue with resiliency for the flume polling receiver. The issue I have is as follows: 1. Start the Flume agent, then the Spark application 2. The Spark application correctly connects to the Flume agent and can receive data that is sent to Flume 3. Restart Flume 4. The Spark application doesn't detect that Flume has been restarted and as such doesn't reconnect at any point, preventing the Spark application from receiving any more data until it's restarted. I've had a trawl through the documentation and source code for FlumeUtils.createPollingStream but couldn't see any way to test for this and reconnect if needed. -Ian > Flume Polling Receiver must be more tolerant to connection failures. > > > Key: SPARK-3210 > URL: https://issues.apache.org/jira/browse/SPARK-3210 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
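The resiliency the comment asks for is a reconnect-on-failure polling loop. The following is a generic, hedged sketch of that pattern in plain Python with a fake transport; it is not the FlumeUtils API, and all names (`poll_with_reconnect`, `FlakyTransport`) are illustrative:

```python
# Generic reconnect-on-failure polling loop (hypothetical names, not the
# real Flume/Spark API): on a receive failure, drop the dead connection
# and re-establish it instead of polling a dead channel forever.
import time

def poll_with_reconnect(connect, max_batches, backoff_s=0.0):
    """connect() -> object with .receive(); reconnects on ConnectionError."""
    received = []
    conn = connect()
    batches = 0
    while batches < max_batches:
        try:
            received.append(conn.receive())
        except ConnectionError:
            time.sleep(backoff_s)  # back off, then rebuild the connection
            conn = connect()
            continue
        batches += 1
    return received

class FlakyTransport:
    """Fake agent that dies once after two events (simulating a restart)."""
    def __init__(self):
        self.connections = 0
        self.events = iter(["e1", "e2", "boom", "e3"])
    def connect(self):
        self.connections += 1
        return self
    def receive(self):
        ev = next(self.events)
        if ev == "boom":
            raise ConnectionError("agent restarted")
        return ev

t = FlakyTransport()
print(poll_with_reconnect(t.connect, max_batches=3))  # ['e1', 'e2', 'e3']
print(t.connections)  # 2 -- reconnected once after the restart
```

Without the `except ConnectionError` branch, the loop would fail permanently after the restart, which is the behavior described in the comment.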
[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]
[ https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385984#comment-15385984 ] Daniel Barclay commented on SPARK-16652: More info.: Having the {{String}} member isn't necessary to trigger the bug. Maybe any kind of list fails: {{List\[Long]}}, {{List\[Int]}}, {{List\[Double]}}, {{List\[Any]}}, {{List\[AnyRef]}} and {{List\[String]}} all fail. > JVM crash from unsafe memory access for Dataset of class with List[Long] > > > Key: SPARK-16652 > URL: https://issues.apache.org/jira/browse/SPARK-16652 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2 > Environment: Scala 2.10. > JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)" > MacOs 10.11.2 >Reporter: Daniel Barclay > Attachments: UnsafeAccessCrashBugTest.scala > > > Generating and writing out a {{Dataset}} of a class that has a {{List}} (at > least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM > crash. > The crash seems to be related to unsafe memory access especially because > earlier code (before I got it reduced to the current bug test case) reported > "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access > operation in compiled Java code}}". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]
[ https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Barclay updated SPARK-16652: --- Description: Generating and writing out a {{Dataset}} of a class that has a {{List}} (at least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM crash. The crash seems to be related to unsafe memory access especially because earlier code (before I got it reduced to the current bug test case) reported "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code}}". was: Generating and writing out a {{Dataset}} of a class that has a {{List}} (or at least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM crash. The crash seems to be related to unsafe memory access especially because earlier code (before I got it reduced to the current bug test case) reported "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code}}". > JVM crash from unsafe memory access for Dataset of class with List[Long] > > > Key: SPARK-16652 > URL: https://issues.apache.org/jira/browse/SPARK-16652 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2 > Environment: Scala 2.10. > JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)" > MacOs 10.11.2 >Reporter: Daniel Barclay > Attachments: UnsafeAccessCrashBugTest.scala > > > Generating and writing out a {{Dataset}} of a class that has a {{List}} (at > least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM > crash. > The crash seems to be related to unsafe memory access especially because > earlier code (before I got it reduced to the current bug test case) reported > "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access > operation in compiled Java code}}". 
[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]
[ https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385966#comment-15385966 ] Daniel Barclay commented on SPARK-16652: SPARK-3947 reports the same InternalError message that I saw in trimming down my test case for SPARK-16652, although the stack traces are much different.
[jira] [Updated] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]
[ https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Barclay updated SPARK-16652: --- Attachment: UnsafeAccessCrashBugTest.scala
[jira] [Created] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]
Daniel Barclay created SPARK-16652: -- Summary: JVM crash from unsafe memory access for Dataset of class with List[Long] Key: SPARK-16652 URL: https://issues.apache.org/jira/browse/SPARK-16652 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.2, 1.6.1 Environment: Scala 2.10. JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)" MacOs 10.11.2 Reporter: Daniel Barclay Generating and writing out a {{Dataset}} of a class that has a {{List}} (or at least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM crash. The crash seems to be related to unsafe memory access especially because earlier code (before I got it reduced to the current bug test case) reported "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code}}". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist
Tom Phillips created SPARK-16651: Summary: No exception using DataFrame.withColumnRenamed when existing column doesn't exist Key: SPARK-16651 URL: https://issues.apache.org/jira/browse/SPARK-16651 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.0 Reporter: Tom Phillips The {{withColumnRenamed}} method does not raise an exception when the existing column does not exist in the dataframe. Example: {code} In [4]: df.show() +---+-+ |age| name| +---+-+ | 1|Alice| +---+-+ In [5]: df = df.withColumnRenamed('dob', 'date_of_birth') In [6]: df.show() +---+-+ |age| name| +---+-+ | 1|Alice| +---+-+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
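The fix the report implies is a guard that fails loudly when the column to rename doesn't exist. A minimal plain-Python sketch of that check over a column list (illustrative only, not PySpark's implementation):

```python
# Sketch of the expected behavior: renaming a non-existent column should
# raise instead of silently returning the dataframe unchanged. Operates
# on a plain list of column names for illustration.

def with_column_renamed_strict(columns, existing, new):
    if existing not in columns:
        raise ValueError("cannot rename non-existent column %r; "
                         "columns are %r" % (existing, columns))
    return [new if c == existing else c for c in columns]

cols = ["age", "name"]
print(with_column_renamed_strict(cols, "name", "full_name"))
try:
    with_column_renamed_strict(cols, "dob", "date_of_birth")
except ValueError as e:
    print("raised:", e)
```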
[jira] [Created] (SPARK-16650) Improve documentation of spark.task.maxFailures
Thomas Graves created SPARK-16650: - Summary: Improve documentation of spark.task.maxFailures Key: SPARK-16650 URL: https://issues.apache.org/jira/browse/SPARK-16650 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.6.2 Reporter: Thomas Graves The documentation for spark.task.maxFailures isn't clear as to whether this is the total number of task failures across the job or a single task failing multiple attempts. It turns out it's the latter: a single task has to fail spark.task.maxFailures attempts before it fails the job. We should try to make that clearer in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
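The per-task semantics described above can be made concrete with a small simulation: the job aborts only when one task accumulates `spark.task.maxFailures` failed attempts, not when the total failure count across all tasks reaches that number. Plain-Python sketch, not Spark's scheduler:

```python
# Sketch of spark.task.maxFailures semantics: abort only when a SINGLE
# task's failed-attempt count reaches the limit (default 4), regardless
# of how many failures other tasks have accumulated.
from collections import Counter

def run_job(failure_events, max_failures=4):
    """failure_events: sequence of task ids that failed, in order."""
    attempts = Counter()
    for task_id in failure_events:
        attempts[task_id] += 1
        if attempts[task_id] >= max_failures:
            return "job aborted: task %d failed %d attempts" % (
                task_id, attempts[task_id])
    return "job survives: no single task hit the limit"

# 6 total failures spread over 3 tasks: the job survives.
print(run_job([0, 1, 2, 0, 1, 2]))
# 4 failures of the same task: the job aborts.
print(run_job([7, 7, 7, 7]))
```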
[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liancheng updated SPARK-16648:
------------------------------
Description:

The following simple SQL query reproduces this issue:

{code:sql}
SELECT LAST_VALUE(FALSE) OVER ();
{code}

Exception thrown:

{noformat}
java.lang.IndexOutOfBoundsException: 0
  at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
  at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
  at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
  at org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
  at org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:79)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:78)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at ...
{noformat}
[jira] [Commented] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385886#comment-15385886 ]

liancheng commented on SPARK-16648:
-----------------------------------

The problematic code is the newly introduced {{TreeNode.withNewChildren}}. {{Last}} is a unary expression with two {{Expression}} arguments:

{code}
case class Last(child: Expression, ignoreNullsExpr: Expression) extends DeclarativeAggregate {
  ...
  override def children: Seq[Expression] = child :: Nil
  ...
}
{code}

Argument {{ignoreNullsExpr}} defaults to {{Literal.FalseLiteral}}, so {{LAST_VALUE(FALSE)}} is equivalent to {{Last(Literal.FalseLiteral, Literal.FalseLiteral)}}. This breaks the following case branch in {{TreeNode.withNewChildren}}:

{code}
case arg: TreeNode[_] if containsChild(arg) =>  // Both `child` and `ignoreNullsExpr` hit this branch,
  val newChild = remainingNewChildren.remove(0) // but only `child` is the real child node of `Last`.
  val oldChild = remainingOldChildren.remove(0)
  if (newChild fastEquals oldChild) {
    oldChild
  } else {
    changed = true
    newChild
  }
{code}
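The failure mode described in the comment can be reproduced outside Spark. The following is a minimal, hypothetical Python sketch (the function and type names are invented for illustration, this is not Spark's API): the node reports a single child, but both constructor arguments compare equal to that child, so the one-element buffer of replacement children is popped twice.

```python
# Hypothetical sketch of the failing TreeNode.withNewChildren branch.
# Not Spark code: names and types here are illustrative only.

def with_new_children(constructor_args, children, new_children):
    """Rebuild a node's argument list, substituting new children in order."""
    remaining = list(new_children)
    rebuilt = []
    for arg in constructor_args:
        # Every argument that *equals* a known child takes this branch --
        # for Last(FalseLiteral, FalseLiteral) that is BOTH arguments,
        # even though `children` reports only one real child.
        if arg in children:
            rebuilt.append(remaining.pop(0))  # second pop raises IndexError
        else:
            rebuilt.append(arg)
    return rebuilt

# LAST_VALUE(FALSE): child and ignoreNullsExpr are both the literal `false`.
args = ["false", "false"]       # (child, ignoreNullsExpr)
children = {"false"}            # Last.children = child :: Nil
new_children = ["cast(false)"]  # one replacement, produced by type coercion

try:
    with_new_children(args, children, new_children)
    outcome = "ok"
except IndexError:
    # Mirrors the java.lang.IndexOutOfBoundsException: 0 in the report.
    outcome = "IndexError"

print(outcome)  # IndexError
```

With distinct arguments (e.g. a non-literal child) only the real child matches, the buffer is popped exactly once, and substitution succeeds, which is why the bug only surfaces when {{child}} happens to equal {{ignoreNullsExpr}}.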
[jira] [Assigned] (SPARK-16629) UDTs can not be compared to DataTypes in dataframes.
[ https://issues.apache.org/jira/browse/SPARK-16629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16629:

Assignee: (was: Apache Spark)

> UDTs can not be compared to DataTypes in dataframes.
> ----------------------------------------------------
>
> Key: SPARK-16629
> URL: https://issues.apache.org/jira/browse/SPARK-16629
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Reporter: Franklyn Dsouza
>
> Currently, UDTs cannot be compared to DataTypes even if their sqlTypes match.
> This leads to errors like the following:
> {code}
> In [12]: filtered = df.filter(df['udt_time'] > threshold)
> ---------------------------------------------------------------------------
> AnalysisException Traceback (most recent call last)
> /Users/franklyndsouza/dev/starscream/bin/starscream in ()
> ----> 1 thresholded = df.filter(df['udt_time'] > threshold)
> AnalysisException: u"cannot resolve '(`udt_time` > TIMESTAMP('2015-10-20 01:00:00.0'))' due to data type mismatch: '(`udt_time` > TIMESTAMP('2015-10-20 01:00:00.0'))' requires (boolean or tinyint or smallint or int or bigint or float or double or decimal or timestamp or date or string or binary) type, not pythonuserdefined"
> {code}
> I've proposed a fix for this here: https://github.com/apache/spark/pull/14164

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
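The analysis error quoted above comes from the comparison's type check. As a rough, hypothetical sketch (plain Python, not Spark's actual resolver; `accepts` and the string type names are invented for illustration), the check accepts only the built-in orderable types listed in the error message and rejects a UDT outright, even when its underlying sqlType would qualify; a fix along the lines of the linked PR amounts to unwrapping the UDT to its sqlType before the check.

```python
# Hypothetical model of the binary-comparison type check (illustrative only,
# not Spark's real implementation).

# The types the error message says '>' accepts.
ORDERABLE = {
    "boolean", "tinyint", "smallint", "int", "bigint",
    "float", "double", "decimal", "timestamp", "date", "string", "binary",
}

def accepts(data_type, udt_sql_type=None, unwrap_udt=False):
    """Would the comparison accept an operand of this type?"""
    if data_type == "pythonuserdefined":
        if unwrap_udt and udt_sql_type is not None:
            data_type = udt_sql_type  # unwrap the UDT to its sqlType
        else:
            return False  # current behaviour: UDT rejected outright
    return data_type in ORDERABLE

# df['udt_time'] > TIMESTAMP(...): rejected today, accepted once unwrapped.
print(accepts("pythonuserdefined", udt_sql_type="timestamp"))                   # False
print(accepts("pythonuserdefined", udt_sql_type="timestamp", unwrap_udt=True))  # True
```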