[jira] [Created] (SPARK-22490) PySpark doc has misleading string for SparkSession.builder
Xiao Li created SPARK-22490: --- Summary: PySpark doc has misleading string for SparkSession.builder Key: SPARK-22490 URL: https://issues.apache.org/jira/browse/SPARK-22490 Project: Spark Issue Type: Documentation Components: PySpark Affects Versions: 2.2.0 Reporter: Xiao Li Priority: Minor We need to fix the following line in our PySpark doc http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html {noformat} SparkSession.builder = ¶ {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
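For readers hitting the garbled doc string above: the attribute is a class-level builder object, and the page is meant to show the usual construction chain. A minimal sketch of that chain follows (written in Scala for consistency with the rest of this digest; the PySpark API mirrors it method for method, and the names below are the standard builder methods, nothing specific to this ticket):

{code:scala}
// What the doc entry is meant to convey: SparkSession.builder is the entry point
// for the construction chain, not a value that renders as "= ¶".
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("builder example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
{code}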
[jira] [Commented] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided
[ https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247113#comment-16247113 ] Dongjoon Hyun commented on SPARK-22042: --- Also, cc [~felixcheung]. > ReorderJoinPredicates can break when child's partitioning is not decided > > > Key: SPARK-22042 > URL: https://issues.apache.org/jira/browse/SPARK-22042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tejas Patil > > When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its > children, the children may not be properly constructed as the child-subtree > has to still go through other planner rules. > In this particular case, the child is `SortMergeJoinExec`. Since the required > `Exchange` operators are not in place (because `EnsureRequirements` runs > _after_ `ReorderJoinPredicates`), the join's children would not have > partitioning defined. This breaks while creation the `PartitioningCollection` > here : > https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69 > Small repro: > {noformat} > context.sql("SET spark.sql.autoBroadcastJoinThreshold=0") > val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", > "k") > df.write.format("parquet").saveAsTable("table1") > df.write.format("parquet").saveAsTable("table2") > df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table") > sql(""" > SELECT * > FROM ( > SELECT a.i, a.j, a.k > FROM bucketed_table a > JOIN table1 b > ON a.i = b.i > ) c > JOIN table2 > ON c.i = table2.i > """).explain > {noformat} > This fails with : > {noformat} > java.lang.IllegalArgumentException: requirement failed: > PartitioningCollection requires all of its partitionings have the same > numPartitions. 
> at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69) > at > org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201) > at > org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91) > at org.apache.spark.sql.Dataset.explain(Dataset.scala:464) > at org.apache.spark.sql.Dataset.explain(Dataset.scala
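To make the failing invariant easier to see without reading the planner code, here is a simplified, self-contained sketch (not Spark's actual classes) of why constructing the collection throws before `EnsureRequirements` has inserted the exchanges:

{code:scala}
// Simplified illustration of the requirement that fails in the stack trace above:
// a collection of partitionings is only valid when every member reports the same
// number of partitions, which is not yet settled before EnsureRequirements runs.
case class HashPartitioning(numPartitions: Int)

case class PartitioningCollection(partitionings: Seq[HashPartitioning]) {
  require(
    partitionings.map(_.numPartitions).distinct.size <= 1,
    "PartitioningCollection requires all of its partitionings have the same numPartitions.")
}

PartitioningCollection(Seq(HashPartitioning(8), HashPartitioning(8)))    // fine
PartitioningCollection(Seq(HashPartitioning(8), HashPartitioning(200)))  // IllegalArgumentException
{code}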
[jira] [Commented] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided
[ https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247112#comment-16247112 ] Dongjoon Hyun commented on SPARK-22042: --- Since this raises a runtime exception to the query, I raised this to Major. It would be great if 2.2.1 has this. How do you think about this issue, [~smilegator] and [~cloud_fan]? > ReorderJoinPredicates can break when child's partitioning is not decided > > > Key: SPARK-22042 > URL: https://issues.apache.org/jira/browse/SPARK-22042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tejas Patil > > When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its > children, the children may not be properly constructed as the child-subtree > has to still go through other planner rules. > In this particular case, the child is `SortMergeJoinExec`. Since the required > `Exchange` operators are not in place (because `EnsureRequirements` runs > _after_ `ReorderJoinPredicates`), the join's children would not have > partitioning defined. This breaks while creation the `PartitioningCollection` > here : > https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69 > Small repro: > {noformat} > context.sql("SET spark.sql.autoBroadcastJoinThreshold=0") > val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", > "k") > df.write.format("parquet").saveAsTable("table1") > df.write.format("parquet").saveAsTable("table2") > df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table") > sql(""" > SELECT * > FROM ( > SELECT a.i, a.j, a.k > FROM bucketed_table a > JOIN table1 b > ON a.i = b.i > ) c > JOIN table2 > ON c.i = table2.i > """).explain > {noformat} > This fails with : > {noformat} > java.lang.IllegalArgumentException: requirement failed: > PartitioningCollection requires all of its partitionings have the same > numPartitions. 
[jira] [Updated] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided
[ https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22042: -- Priority: Major (was: Minor) > ReorderJoinPredicates can break when child's partitioning is not decided > > > Key: SPARK-22042 > URL: https://issues.apache.org/jira/browse/SPARK-22042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tejas Patil > > When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its > children, the children may not be properly constructed as the child-subtree > has to still go through other planner rules. > In this particular case, the child is `SortMergeJoinExec`. Since the required > `Exchange` operators are not in place (because `EnsureRequirements` runs > _after_ `ReorderJoinPredicates`), the join's children would not have > partitioning defined. This breaks while creation the `PartitioningCollection` > here : > https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69 > Small repro: > {noformat} > context.sql("SET spark.sql.autoBroadcastJoinThreshold=0") > val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", > "k") > df.write.format("parquet").saveAsTable("table1") > df.write.format("parquet").saveAsTable("table2") > df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table") > sql(""" > SELECT * > FROM ( > SELECT a.i, a.j, a.k > FROM bucketed_table a > JOIN table1 b > ON a.i = b.i > ) c > JOIN table2 > ON c.i = table2.i > """).explain > {noformat} > This fails with : > {noformat} > java.lang.IllegalArgumentException: requirement failed: > PartitioningCollection requires all of its partitionings have the same > numPartitions. 
[jira] [Assigned] (SPARK-22397) Add multiple column support to QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22397: Assignee: Apache Spark > Add multiple column support to QuantileDiscretizer > -- > > Key: SPARK-22397 > URL: https://issues.apache.org/jira/browse/SPARK-22397 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: Apache Spark > > Once SPARK-20542 adds multi column support to {{Bucketizer}}, we can add > multi column support to the {{QuantileDiscretizer}} too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22397) Add multiple column support to QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22397: Assignee: (was: Apache Spark) > Add multiple column support to QuantileDiscretizer > -- > > Key: SPARK-22397 > URL: https://issues.apache.org/jira/browse/SPARK-22397 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > > Once SPARK-20542 adds multi column support to {{Bucketizer}}, we can add > multi column support to the {{QuantileDiscretizer}} too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22397) Add multiple column support to QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247106#comment-16247106 ] Apache Spark commented on SPARK-22397: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/19715 > Add multiple column support to QuantileDiscretizer > -- > > Key: SPARK-22397 > URL: https://issues.apache.org/jira/browse/SPARK-22397 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > > Once SPARK-20542 adds multi column support to {{Bucketizer}}, we can add > multi column support to the {{QuantileDiscretizer}} too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
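For context on what the linked pull request adds, the existing single-column API is shown below together with the multi-column shape being proposed; the {{setInputCols}}/{{setOutputCols}} setters in the commented part are the proposed form (mirroring the SPARK-20542 work on {{Bucketizer}}), not an API that exists in 2.2.0:

{code:scala}
import org.apache.spark.ml.feature.QuantileDiscretizer

// Existing single-column usage (2.2.0 API):
val single = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBucket")
  .setNumBuckets(3)

// Proposed multi-column form sketched by this ticket:
// new QuantileDiscretizer()
//   .setInputCols(Array("hour", "minute"))
//   .setOutputCols(Array("hourBucket", "minuteBucket"))
//   .setNumBuckets(3)
{code}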
[jira] [Resolved] (SPARK-13612) Multiplication of BigDecimal columns not working as expected
[ https://issues.apache.org/jira/browse/SPARK-13612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varadharajan resolved SPARK-13612. -- Resolution: Workaround > Multiplication of BigDecimal columns not working as expected > > > Key: SPARK-13612 > URL: https://issues.apache.org/jira/browse/SPARK-13612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Varadharajan > > Please consider the below snippet: > {code} > case class AM(id: Int, a: BigDecimal) > case class AX(id: Int, b: BigDecimal) > val x = sc.parallelize(List(AM(1, 10))).toDF > val y = sc.parallelize(List(AX(1, 10))).toDF > x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show > {code} > output: > {code} > | id| a| id| b| z| > | 1|10.00...| 1|10.00...|null| > {code} > Here the multiplication of the columns ("z") return null instead of 100. > As of now we are using the below workaround, but definitely looks like a > serious issue. > {code} > x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / > y("b"))).show > {code} > {code} > | id| a| id| b| z| > | 1|10.00...| 1|10.00...|100.0...| > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
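One note on the workaround above: the nulls most likely come from the product of two default {{DECIMAL(38,18)}} columns overflowing the representable precision, so another way around it on affected versions is to narrow the operands before multiplying. This is a hedged sketch under that assumption, not a verified fix, and the chosen precision/scale values are illustrative:

{code:scala}
import org.apache.spark.sql.types.DecimalType

// Narrow the operands so the product's precision/scale stays representable.
val z = x.join(y, x("id") === y("id"))
  .withColumn("z", x("a").cast(DecimalType(20, 2)) * y("b").cast(DecimalType(20, 2)))
{code}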
[jira] [Assigned] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified
[ https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22489: Assignee: (was: Apache Spark) > Shouldn't change broadcast join buildSide if user clearly specified > --- > > Key: SPARK-22489 > URL: https://issues.apache.org/jira/browse/SPARK-22489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang > > How to reproduce: > {code:java} > import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec > spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value").createTempView("table1") > spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value").createTempView("table2") > val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON > t1.key = t2.key").queryExecution.executedPlan > println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide) > {code} > The result is {{BuildRight}}, but should be {{BuildLeft}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified
[ https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22489: Assignee: Apache Spark > Shouldn't change broadcast join buildSide if user clearly specified > --- > > Key: SPARK-22489 > URL: https://issues.apache.org/jira/browse/SPARK-22489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark > > How to reproduce: > {code:java} > import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec > spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value").createTempView("table1") > spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value").createTempView("table2") > val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON > t1.key = t2.key").queryExecution.executedPlan > println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide) > {code} > The result is {{BuildRight}}, but should be {{BuildLeft}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified
[ https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247059#comment-16247059 ] Apache Spark commented on SPARK-22489: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/19714 > Shouldn't change broadcast join buildSide if user clearly specified > --- > > Key: SPARK-22489 > URL: https://issues.apache.org/jira/browse/SPARK-22489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang > > How to reproduce: > {code:java} > import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec > spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value").createTempView("table1") > spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value").createTempView("table2") > val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON > t1.key = t2.key").queryExecution.executedPlan > println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide) > {code} > The result is {{BuildRight}}, but should be {{BuildLeft}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22472. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik >Assignee: Wenchen Fan > Labels: release-notes > Fix For: 2.2.1, 2.3.0 > > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
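For users still on affected versions, a hedged workaround sketch (this is not the change that resolved the ticket): keep the nulls out of the primitive-typed Dataset so nothing undefined is ever produced. Run in spark-shell against the same {{s}} DataFrame as in the description:

{code:scala}
// Drop (or otherwise handle) the null rows before asking for a primitive Long,
// so no null is silently coerced into an arbitrary value.
s.na.drop(Seq("v")).as[Long].map(_ * 2).show(false)   // prints 2 and 10 only
{code}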
[jira] [Updated] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified
[ https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-22489: Description: How to reproduce: {code:java} import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value").createTempView("table1") spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value").createTempView("table2") val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key").queryExecution.executedPlan println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide) {code} The result is {{BuildRight}}, but should be {{BuildLeft}}. was: How to reproduce: {code:java} import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value").createTempView("table1") spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value").createTempView("table2") val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key").queryExecution.executedPlan println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide) {code} The result is {{BuildRight}}, but should be {{BuildLeft}} > Shouldn't change broadcast join buildSide if user clearly specified > --- > > Key: SPARK-22489 > URL: https://issues.apache.org/jira/browse/SPARK-22489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang > > How to reproduce: > {code:java} > import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec > spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value").createTempView("table1") > spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value").createTempView("table2") > val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON > t1.key = t2.key").queryExecution.executedPlan > println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide) > {code} > The result is {{BuildRight}}, but should be {{BuildLeft}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified
Yuming Wang created SPARK-22489: --- Summary: Shouldn't change broadcast join buildSide if user clearly specified Key: SPARK-22489 URL: https://issues.apache.org/jira/browse/SPARK-22489 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Yuming Wang How to reproduce: {code:java} import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value").createTempView("table1") spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value").createTempView("table2") val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key").queryExecution.executedPlan println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide) {code} The result is {{BuildRight}}, but should be {{BuildLeft}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
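For comparison, the same intent expressed through the DataFrame API hint; {{functions.broadcast}} is the existing API, and the expectation in both forms is that the explicitly hinted side (here {{table1}}) becomes the build side:

{code:scala}
import org.apache.spark.sql.functions.broadcast

val joined = broadcast(spark.table("table1"))
  .join(spark.table("table2"), Seq("key"))
{code}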
[jira] [Updated] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22472: Labels: release-notes (was: ) > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik >Assignee: Wenchen Fan > Labels: release-notes > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22472: --- Assignee: Wenchen Fan Target Version/s: 2.2.1 > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik >Assignee: Wenchen Fan > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22472: Target Version/s: 2.2.1, 2.3.0 (was: 2.2.1) > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik >Assignee: Wenchen Fan > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22488) The view resolution in the SparkSession internal table() API
[ https://issues.apache.org/jira/browse/SPARK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22488: Assignee: Xiao Li (was: Apache Spark) > The view resolution in the SparkSession internal table() API > - > > Key: SPARK-22488 > URL: https://issues.apache.org/jira/browse/SPARK-22488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > The current internal `table()` API of `SparkSession` bypasses the Analyzer > and directly calls `sessionState.catalog.lookupRelation` API. This skips the > view resolution logics in our Analyzer rule `ResolveRelations`. This internal > API is widely used by various DDL commands or the other internal APIs. > Users might get the strange error caused by view resolution when the default > database is different. > ``` > Table or view not found: t1; line 1 pos 14 > org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 > pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22488) The view resolution in the SparkSession internal table() API
[ https://issues.apache.org/jira/browse/SPARK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246993#comment-16246993 ] Apache Spark commented on SPARK-22488: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19713 > The view resolution in the SparkSession internal table() API > - > > Key: SPARK-22488 > URL: https://issues.apache.org/jira/browse/SPARK-22488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > The current internal `table()` API of `SparkSession` bypasses the Analyzer > and directly calls `sessionState.catalog.lookupRelation` API. This skips the > view resolution logics in our Analyzer rule `ResolveRelations`. This internal > API is widely used by various DDL commands or the other internal APIs. > Users might get the strange error caused by view resolution when the default > database is different. > ``` > Table or view not found: t1; line 1 pos 14 > org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 > pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22488) The view resolution in the SparkSession internal table() API
[ https://issues.apache.org/jira/browse/SPARK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22488: Assignee: Apache Spark (was: Xiao Li) > The view resolution in the SparkSession internal table() API > - > > Key: SPARK-22488 > URL: https://issues.apache.org/jira/browse/SPARK-22488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > The current internal `table()` API of `SparkSession` bypasses the Analyzer > and directly calls `sessionState.catalog.lookupRelation` API. This skips the > view resolution logics in our Analyzer rule `ResolveRelations`. This internal > API is widely used by various DDL commands or the other internal APIs. > Users might get the strange error caused by view resolution when the default > database is different. > ``` > Table or view not found: t1; line 1 pos 14 > org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 > pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22488) The view resolution in the SparkSession internal table() API
Xiao Li created SPARK-22488: --- Summary: The view resolution in the SparkSession internal table() API Key: SPARK-22488 URL: https://issues.apache.org/jira/browse/SPARK-22488 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.2 Reporter: Xiao Li Assignee: Xiao Li The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls `sessionState.catalog.lookupRelation` API. This skips the view resolution logics in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands or the other internal APIs. Users might get the strange error caused by view resolution when the default database is different. ``` Table or view not found: t1; line 1 pos 14 org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22477) windowing in spark
[ https://issues.apache.org/jira/browse/SPARK-22477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22477. -- Resolution: Invalid Let's ask questions at https://spark.apache.org/community.html, not in a JIRA. > windowing in spark > -- > > Key: SPARK-22477 > URL: https://issues.apache.org/jira/browse/SPARK-22477 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Deepak Nayak > > I have performed a map operation over a set of attributes and stored the result > in a variable. Next I want to perform windowing over a single attribute rather > than over the variable that contains all the attributes. > When I try that, I do not get all the attribute values after windowing; it gives > windowed output only for that particular attribute. > Any help would be appreciated.
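Since the question is closed as Invalid, a generic sketch of what it seems to be after may still help readers who land here: a window function added with {{withColumn}} keeps every other attribute of the row, unlike a grouped aggregate that returns only the grouping keys. The column names below are made up, and {{df}} stands for the DataFrame produced by the map step mentioned in the question:

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy(col("deviceId")).orderBy(col("eventTime"))
val windowed = df.withColumn("runningTotal", sum(col("reading")).over(w))
{code}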
[jira] [Resolved] (SPARK-22447) SAS reading functionality
[ https://issues.apache.org/jira/browse/SPARK-22447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22447. -- Resolution: Won't Fix I don't think we would port the third-party package unless there is a strong reason; to me, leaving it as a third-party package makes more sense. Let me leave this resolved until a committer sees value in porting it. > SAS reading functionality > - > > Key: SPARK-22447 > URL: https://issues.apache.org/jira/browse/SPARK-22447 > Project: Spark > Issue Type: New Feature > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: Abdeali Kothari >Priority: Trivial > > It would be useful to have SAS file reading functionality in Spark. > The file format is well documented at > http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/CUJ/1992/9210/ross/ross.htm > There is a third-party tool for this: https://github.com/saurfang/spark-sas7bdat > but it is not maintained and has issues with large files.
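For anyone who needs this today, the third-party package mentioned above is used roughly as follows; the format name and package coordinates are taken from that project's README and should be verified against the release you actually pull in:

{code:scala}
// Launch with the package on the classpath, e.g.:
//   spark-shell --packages saurfang:spark-sas7bdat:2.0.0-s_2.11
val sas = spark.read
  .format("com.github.saurfang.sas.spark")
  .load("hdfs:///data/example.sas7bdat")
{code}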
[jira] [Resolved] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22426. -- Resolution: Duplicate > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph > > When Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. Spark ApplicationMaster should not launch containers on this Node > or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite
[ https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22308. - Resolution: Fixed Fix Version/s: 2.3.0 > Support unit tests of spark code using ScalaTest using suites other than > FunSuite > - > > Key: SPARK-22308 > URL: https://issues.apache.org/jira/browse/SPARK-22308 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core, SQL, Tests >Affects Versions: 2.2.0 >Reporter: Nathan Kronenfeld >Assignee: Nathan Kronenfeld >Priority: Minor > Labels: scalatest, test-suite, test_issue > Fix For: 2.3.0 > > > External codebases that contain Spark code can test it using SharedSparkContext > no matter how they write their ScalaTest suites - whether based on FunSuite, FunSpec, > FlatSpec, or WordSpec. > SharedSQLContext, however, only supports FunSuite.
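A sketch of what this improvement enables: mixing the shared-context trait into a ScalaTest style other than {{FunSuite}}. The trait and class names here illustrate the pattern rather than the exact traits touched by the patch:

{code:scala}
import org.scalatest.FlatSpec

class WordCountSpec extends FlatSpec with SharedSparkContext {
  "a local RDD" should "count its elements" in {
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}
{code}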
[jira] [Assigned] (SPARK-22487) No usages of HIVE_EXECUTION_VERSION found in whole spark project
[ https://issues.apache.org/jira/browse/SPARK-22487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22487: Assignee: (was: Apache Spark) > No usages of HIVE_EXECUTION_VERSION found in whole spark project > > > Key: SPARK-22487 > URL: https://issues.apache.org/jira/browse/SPARK-22487 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kent Yao >Priority: Minor > > Actually there is no hive client for executions in spark now and there are no > usages of HIVE_EXECUTION_VERSION found in whole spark project. > HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set > internally in some places or by users, this may confuse developers and users > with HIVE_METASTORE_VERSION(spark.sql.hive.metastore.version) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22487) No usages of HIVE_EXECUTION_VERSION found in whole spark project
[ https://issues.apache.org/jira/browse/SPARK-22487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22487: Assignee: Apache Spark > No usages of HIVE_EXECUTION_VERSION found in whole spark project > > > Key: SPARK-22487 > URL: https://issues.apache.org/jira/browse/SPARK-22487 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Minor > > Actually there is no hive client for executions in spark now and there are no > usages of HIVE_EXECUTION_VERSION found in whole spark project. > HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set > internally in some places or by users, this may confuse developers and users > with HIVE_METASTORE_VERSION(spark.sql.hive.metastore.version) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22487) No usages of HIVE_EXECUTION_VERSION found in whole spark project
[ https://issues.apache.org/jira/browse/SPARK-22487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246959#comment-16246959 ] Apache Spark commented on SPARK-22487: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/19712 > No usages of HIVE_EXECUTION_VERSION found in whole spark project > > > Key: SPARK-22487 > URL: https://issues.apache.org/jira/browse/SPARK-22487 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kent Yao >Priority: Minor > > Actually there is no hive client for executions in spark now and there are no > usages of HIVE_EXECUTION_VERSION found in whole spark project. > HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set > internally in some places or by users, this may confuse developers and users > with HIVE_METASTORE_VERSION(spark.sql.hive.metastore.version) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22487) No usages of HIVE_EXECUTION_VERSION found in whole spark project
Kent Yao created SPARK-22487: Summary: No usages of HIVE_EXECUTION_VERSION found in whole spark project Key: SPARK-22487 URL: https://issues.apache.org/jira/browse/SPARK-22487 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kent Yao Priority: Minor Actually there is no hive client for executions in spark now and there are no usages of HIVE_EXECUTION_VERSION found in whole spark project. HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set internally in some places or by users, this may confuse developers and users with HIVE_METASTORE_VERSION(spark.sql.hive.metastore.version) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
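To make the potential confusion concrete, the two settings the description contrasts are shown side by side below; whether the second one has any effect on execution is exactly what this ticket questions, and the version string is illustrative:

{code:scala}
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "1.2.1") // selects the Hive metastore client version
  .config("spark.sql.hive.version", "1.2.1")           // the "execution version" said to be unused
  .enableHiveSupport()
  .getOrCreate()
{code}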
[jira] [Resolved] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22287. Resolution: Fixed Assignee: paul mackles Fix Version/s: 2.3.0 2.2.1 > SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher > - > > Key: SPARK-22287 > URL: https://issues.apache.org/jira/browse/SPARK-22287 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.1, 2.2.0, 2.3.0 >Reporter: paul mackles >Assignee: paul mackles >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > There does not appear to be a way to control the heap size used by > MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not > honored for that particular daemon. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa
[ https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22485. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19710 [https://github.com/apache/spark/pull/19710] > Use `exclude[Problem]` instead `excludePackage` in MiMa > --- > > Key: SPARK-22485 > URL: https://issues.apache.org/jira/browse/SPARK-22485 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 2.3.0 > > > `excludePackage` is deprecated like the > [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33]. > {code} > @deprecated("Replace with > ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15") > def excludePackage(packageName: String): ProblemFilter = { > exclude[Problem](packageName + ".*") > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
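Roughly what the migration looks like in a MiMa excludes list, following the replacement named in the deprecation message quoted above (the package name is illustrative):

{code:scala}
import com.typesafe.tools.mima.core._

// Before (deprecated): ProblemFilters.excludePackage("org.apache.spark.somepackage")
// After:
ProblemFilters.exclude[Problem]("org.apache.spark.somepackage.*")
{code}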
[jira] [Assigned] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-22403: Assignee: Wing Yew Poon > StructuredKafkaWordCount example fails in YARN cluster mode > --- > > Key: SPARK-22403 > URL: https://issues.apache.org/jira/browse/SPARK-22403 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon > Fix For: 2.2.1, 2.3.0 > > > When I run the StructuredKafkaWordCount example in YARN client mode, it runs > fine. However, when I run it in YARN cluster mode, the application errors > during initialization, and dies after the default number of YARN application > attempts. In the AM log, I see > {noformat} > 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value > AS STRING) > 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream > metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to > /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata > org.apache.hadoop.security.AccessControlException: Permission denied: > user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785) > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257) > ... 
> at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960) > at > org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114) > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282) > at > org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79) > at > org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala) > {noformat} > Looking at StreamingQueryManager#createQuery, we have > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198 > {code} > val checkpointLocation = userSpecifiedCheckpointLocation.map { ... > ... > }.orElse { > ... >
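Until the default checkpoint location handling is fixed, a hedged workaround sketch is to give the query an explicit checkpoint location the submitting user can write to instead of relying on the default temporary directory; the path is illustrative and {{wordCounts}} stands for the streaming DataFrame built by the example:

{code:scala}
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/user/systest/checkpoints/structured-kafka-wordcount")
  .start()
{code}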
[jira] [Resolved] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22403. -- Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 > StructuredKafkaWordCount example fails in YARN cluster mode > --- > > Key: SPARK-22403 > URL: https://issues.apache.org/jira/browse/SPARK-22403 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Wing Yew Poon > Fix For: 2.2.1, 2.3.0 > > > When I run the StructuredKafkaWordCount example in YARN client mode, it runs > fine. However, when I run it in YARN cluster mode, the application errors > during initialization, and dies after the default number of YARN application > attempts. In the AM log, I see > {noformat} > 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value > AS STRING) > 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream > metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to > /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata > org.apache.hadoop.security.AccessControlException: Permission denied: > user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785) > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257) > ... 
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246811#comment-16246811 ] Marcelo Vanzin commented on SPARK-18838: I actually did a backport to our internal branches, here's the one for 2.1 (which IIRC is very similar to the 2.2 one): https://github.com/cloudera/spark/commit/a6b96c45afa227b48723ed872b692ceed9f9171d > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Marcelo Vanzin > Fix For: 2.3.0 > > Attachments: SparkListernerComputeTime.xlsx, perfResults.pdf > > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listener is very slow, all the listeners will pay the > price of delay incurred by the slow listener. In addition to that a slow > listener can cause events to be dropped from the event queue which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single threaded event processor. Instead each listener will have its own > dedicate single threaded executor service . When ever an event is posted, it > will be submitted to executor service of all the listeners. The Single > threaded executor service will guarantee in order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
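To illustrate the proposal quoted above in isolation (this is only a sketch of the idea, not the code that was merged; the class and parameter names are invented): each listener gets its own bounded, single-threaded executor, which preserves per-listener ordering while preventing a slow listener from delaying the others or growing memory without bound.
{code}
import java.util.concurrent.{ArrayBlockingQueue, RejectedExecutionException, ThreadPoolExecutor, TimeUnit}

// Hypothetical per-listener queue: one single-threaded executor with a bounded work queue.
class ListenerQueue[E](handle: E => Unit, capacity: Int = 10000) {
  private val executor = new ThreadPoolExecutor(
    1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue[Runnable](capacity))

  def post(event: E): Unit = {
    try {
      executor.execute(new Runnable { override def run(): Unit = handle(event) })
    } catch {
      // Bounded queue full: drop the event rather than let memory grow indefinitely.
      case _: RejectedExecutionException => ()
    }
  }

  def stop(): Unit = executor.shutdown()
}
{code}
Events posted to one queue never wait on another listener's handler, which is the core of the change.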
[jira] [Resolved] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system
[ https://issues.apache.org/jira/browse/SPARK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22483. Resolution: Fixed Assignee: Srinivasa Reddy Vundela Fix Version/s: 2.3.0 > Exposing java.nio bufferedPool memory metrics to metrics system > --- > > Key: SPARK-22483 > URL: https://issues.apache.org/jira/browse/SPARK-22483 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Srinivasa Reddy Vundela >Assignee: Srinivasa Reddy Vundela > Fix For: 2.3.0 > > > Spark currently exposes on-heap and off-heap memory of JVM to metric system. > Currently there is no way to know how much direct/mapped memory allocated for > java.nio buffered pools. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
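For reference, the JVM already publishes these pools through the platform MXBeans; a metrics source would essentially read the values below (a minimal sketch of the underlying API, not the actual patch):
{code}
import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.collection.JavaConverters._

// Lists the "direct" and "mapped" java.nio buffer pools with their current usage.
val pools = ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
pools.foreach { pool =>
  println(s"${pool.getName}: buffers=${pool.getCount}, " +
    s"used=${pool.getMemoryUsed} bytes, capacity=${pool.getTotalCapacity} bytes")
}
{code}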
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246804#comment-16246804 ] Anthony Truchet commented on SPARK-18838: - Agreed, but at the same time this is "almost" a bug when you consider using Spark at really large scale... I tried a naive cherry-pick and, unsurprisingly, there are a lot of non-trivial conflicts. Do you know how to get an overview of the change (and of what has happened in the meantime since 2.2) to guide this conflict resolution? Or maybe you already have some insight into the best way to go about it? > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Marcelo Vanzin > Fix For: 2.3.0 > > Attachments: SparkListernerComputeTime.xlsx, perfResults.pdf > > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listener is very slow, all the listeners will pay the > price of delay incurred by the slow listener. In addition to that a slow > listener can cause events to be dropped from the event queue which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single threaded event processor. Instead each listener will have its own > dedicate single threaded executor service . When ever an event is posted, it > will be submitted to executor service of all the listeners. The Single > threaded executor service will guarantee in order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246783#comment-16246783 ] Marcelo Vanzin commented on SPARK-18838: It's kind of an intrusive change to backport to a maintenance release, but since it may help a lot of people it might be ok. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Marcelo Vanzin > Fix For: 2.3.0 > > Attachments: SparkListernerComputeTime.xlsx, perfResults.pdf > > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listener is very slow, all the listeners will pay the > price of delay incurred by the slow listener. In addition to that a slow > listener can cause events to be dropped from the event queue which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single threaded event processor. Instead each listener will have its own > dedicate single threaded executor service . When ever an event is posted, it > will be submitted to executor service of all the listeners. The Single > threaded executor service will guarantee in order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20647) Make the Storage page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20647. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19679 [https://github.com/apache/spark/pull/19679] > Make the Storage page use new app state store > - > > Key: SPARK-20647 > URL: https://issues.apache.org/jira/browse/SPARK-20647 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making the Storage page use the new app state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20647) Make the Storage page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-20647: Assignee: Marcelo Vanzin > Make the Storage page use new app state store > - > > Key: SPARK-20647 > URL: https://issues.apache.org/jira/browse/SPARK-20647 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making the Storage page use the new app state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246624#comment-16246624 ] Apache Spark commented on SPARK-22471: -- User 'tashoyan' has created a pull request for this issue: https://github.com/apache/spark/pull/19711 > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG
[ https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246595#comment-16246595 ] Thomas Graves commented on SPARK-22465: --- Its not strictly the 2G limit. He did hit that but he hit it because of the default behavior of cogroup. I think this jira was filed to look at that to make the behavior better. So I think the last couple sentences in the description refer to that. > Cogroup of two disproportionate RDDs could lead into 2G limit BUG > - > > Key: SPARK-22465 > URL: https://issues.apache.org/jira/browse/SPARK-22465 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, > 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, > 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0 >Reporter: Amit Kumar >Priority: Critical > > While running my spark pipeline, it failed with the following exception > {noformat} > 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR > org.apache.spark.executor.Executor - Exception in task 630.0 in stage 28.0 > (TID 58670) > java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869) > at > org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103) > at > org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > After debugging I found that the issue lies with how spark handles cogroup of > two RDDs. > Here is the relevant code from apache spark > {noformat} > /** >* For each key k in `this` or `other`, return a resulting RDD that > contains a tuple with the >* list of values for that key in `this` as well as `other`. >*/ > def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = > self.withScope { > cogroup(other, defaultPartitioner(self, other)) > } > /** >* Choose a partitioner to use for a cogroup-like operation between a > number of RDDs. >* >* If any of the RDDs already has a partitioner, choose that one. >* >* Otherwise, we use a default HashPartitioner. For the number of > partitions, if >* spark.default.parallelism is set, then we'll use the value from > SparkContext >* defaultParallelism, otherwise we'll use the max number of upstream > partitions. >* >* Unless spark.default.parallelism is set, the number of partitions will > be the >* same as the number of partitions in the largest upstream RDD, as this > should >* be least likely to cause out-of-memory errors. >* >* We use two method parameters (rdd, others) to enforce callers passing at > least 1 RDD. 
>*/ > def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { > val rdds = (Seq(rdd) ++ others) > val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > > 0)) > if (hasPartitioner.nonEmpty) { > hasPartitioner.maxBy(_.partitions.length).partitioner.get > } else { > if (rdd.context.conf.contains("spark.default.parallelism")) { > new HashPartitioner(rdd.context.defaultParallelism) > } else { > new HashPartitioner(rdds.map(_.partitions.length).max) > } > } > } > {noformat} > Given this suppose we have two pair RDDs. > RDD1 : A small RDD which fewer data and partitions > RDD2: A huge RDD which has loads of data and partitions > Now in the code if we were to have a cogroup > {noformat} > val RDD3 = RDD1.cogroup(RDD2) > {noformat} > there is a case where this could lead to the SPARK-6235 Bug which is If RDD1 > has a partitioner when it is being called into a cogroup. This is because the > cogroups partitions are then decided by the partitioner and could lead to th
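One way to avoid inheriting the small RDD's partitioner in the scenario described above is to pass an explicit partitioner to cogroup, sized to the larger input. A workaround sketch for a spark-shell session (the RDD names and sizes are illustrative only):
{code}
import org.apache.spark.HashPartitioner

// Illustrative inputs: a small pre-partitioned RDD and a much larger one.
val rddSmall = sc.parallelize(0 until 100).map(i => (i % 5, i)).partitionBy(new HashPartitioner(4))
val rddLarge = sc.parallelize(0 until 10000000, 2000).map(i => (i % 5, i.toString))

// Passing an explicit partitioner sized to the larger side avoids inheriting
// rddSmall's 4 partitions (the default behavior described above), which is what
// makes individual partitions big enough to hit the 2G limit.
val grouped = rddSmall.cogroup(rddLarge, new HashPartitioner(rddLarge.partitions.length))
{code}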
[jira] [Resolved] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21994. Resolution: Not A Problem > Spark 2.2 can not read Parquet table created by itself > -- > > Key: SPARK-21994 > URL: https://issues.apache.org/jira/browse/SPARK-21994 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1 >Reporter: Jurgis Pods > > This seems to be a new bug introduced in Spark 2.2, since it did not occur > under Spark 2.1. > When writing a dataframe to a table in Parquet format, Spark SQL does not > write the 'path' of the table to the Hive metastore, unlike in previous > versions. > As a consequence, Spark 2.2 is not able to read the table it just created. It > just outputs the table header without any row content. > A parallel installation of Spark 1.6 at least produces an appropriate error > trace: > {code:java} > 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found > in metastore. hive.metastore.schema.verification is not enabled so recording > the schema version 1.1.0 > 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, > returning NoSuchObjectException > org.spark-project.guava.util.concurrent.UncheckedExecutionException: > java.util.NoSuchElementException: key not found: path > [...] > {code} > h3. Steps to reproduce: > Run the following in spark2-shell: > {code:java} > scala> val df = spark.sql("show databases") > scala> df.show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > scala> df.write.format("parquet").saveAsTable("test.spark22_test") > scala> spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > ++{code} > When manually setting the path (causing the data to be saved as external > table), it works: > {code:java} > scala> df.write.option("path", > "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path") > scala> spark.sql("select * from test.spark22_parquet_with_path").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > A second workaround is to update the metadata of the managed table created by > Spark 2.2: > {code} > spark.sql("alter table test.spark22_test set SERDEPROPERTIES > ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')") > spark.catalog.refreshTable("test.spark22_test") > spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > It is kind of a disaster that we are not able to read tables created by the > very same Spark version and have to manually specify the path as an explicit > option. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246463#comment-16246463 ] Srinivasa Reddy Vundela commented on SPARK-21994: - This issue is related to Cloudera spark and got fixed recently. We can close this jira. > Spark 2.2 can not read Parquet table created by itself > -- > > Key: SPARK-21994 > URL: https://issues.apache.org/jira/browse/SPARK-21994 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1 >Reporter: Jurgis Pods > > This seems to be a new bug introduced in Spark 2.2, since it did not occur > under Spark 2.1. > When writing a dataframe to a table in Parquet format, Spark SQL does not > write the 'path' of the table to the Hive metastore, unlike in previous > versions. > As a consequence, Spark 2.2 is not able to read the table it just created. It > just outputs the table header without any row content. > A parallel installation of Spark 1.6 at least produces an appropriate error > trace: > {code:java} > 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found > in metastore. hive.metastore.schema.verification is not enabled so recording > the schema version 1.1.0 > 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, > returning NoSuchObjectException > org.spark-project.guava.util.concurrent.UncheckedExecutionException: > java.util.NoSuchElementException: key not found: path > [...] > {code} > h3. Steps to reproduce: > Run the following in spark2-shell: > {code:java} > scala> val df = spark.sql("show databases") > scala> df.show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > scala> df.write.format("parquet").saveAsTable("test.spark22_test") > scala> spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > ++{code} > When manually setting the path (causing the data to be saved as external > table), it works: > {code:java} > scala> df.write.option("path", > "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path") > scala> spark.sql("select * from test.spark22_parquet_with_path").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > A second workaround is to update the metadata of the managed table created by > Spark 2.2: > {code} > spark.sql("alter table test.spark22_test set SERDEPROPERTIES > ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')") > spark.catalog.refreshTable("test.spark22_test") > spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > It is kind of a disaster that we are not able to read tables created by the > very same Spark version and have to manually specify the path as an explicit > option. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16605: -- Affects Version/s: 2.1.2 2.2.0 > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > --- > > Key: SPARK-16605 > URL: https://issues.apache.org/jira/browse/SPARK-16605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.2, 2.2.0 >Reporter: marymwu > Fix For: 2.2.1, 2.3.0 > > Attachments: screenshot-1.png > > > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > Steps: > 1. Use hive to create a table "tbtxt" stored as txt and load data into it. > 2. Use hive to create a table "tborc" stored as orc and insert the data from > table "tbtxt" . Example, "create table tborc stored as orc as select * from > tbtxt" > 3. Use spark2.0 to "select * from tborc;".-->error > occurs,java.lang.IllegalArgumentException: Field "nid" does not exist. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16605: -- Fix Version/s: 2.3.0 > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > --- > > Key: SPARK-16605 > URL: https://issues.apache.org/jira/browse/SPARK-16605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu > Fix For: 2.2.1, 2.3.0 > > Attachments: screenshot-1.png > > > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > Steps: > 1. Use hive to create a table "tbtxt" stored as txt and load data into it. > 2. Use hive to create a table "tborc" stored as orc and insert the data from > table "tbtxt" . Example, "create table tborc stored as orc as select * from > tbtxt" > 3. Use spark2.0 to "select * from tborc;".-->error > occurs,java.lang.IllegalArgumentException: Field "nid" does not exist. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246434#comment-16246434 ] Dongjoon Hyun commented on SPARK-16605: --- This duplicates SPARK-14387. SPARK-14387 is resolved in Spark 2.2.1. > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > --- > > Key: SPARK-16605 > URL: https://issues.apache.org/jira/browse/SPARK-16605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu > Fix For: 2.2.1, 2.3.0 > > Attachments: screenshot-1.png > > > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > Steps: > 1. Use hive to create a table "tbtxt" stored as txt and load data into it. > 2. Use hive to create a table "tborc" stored as orc and insert the data from > table "tbtxt" . Example, "create table tborc stored as orc as select * from > tbtxt" > 3. Use spark2.0 to "select * from tborc;".-->error > occurs,java.lang.IllegalArgumentException: Field "nid" does not exist. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16605: -- Fix Version/s: 2.2.1 > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > --- > > Key: SPARK-16605 > URL: https://issues.apache.org/jira/browse/SPARK-16605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu > Fix For: 2.2.1, 2.3.0 > > Attachments: screenshot-1.png > > > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > Steps: > 1. Use hive to create a table "tbtxt" stored as txt and load data into it. > 2. Use hive to create a table "tborc" stored as orc and insert the data from > table "tbtxt" . Example, "create table tborc stored as orc as select * from > tbtxt" > 3. Use spark2.0 to "select * from tborc;".-->error > occurs,java.lang.IllegalArgumentException: Field "nid" does not exist. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials
[ https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246328#comment-16246328 ] Andrew Ash commented on SPARK-22479: Completely agree that credentials shouldn't be in the toString since query plans are logged in many places. This looks like it brings SaveIntoDataSourceCommand more in-line with JdbcRelation, which also currently redacts credentials from its toString to avoid them being written to logs. > SaveIntoDataSourceCommand logs jdbc credentials > --- > > Key: SPARK-22479 > URL: https://issues.apache.org/jira/browse/SPARK-22479 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Onur Satici > > JDBC credentials are not redacted in plans including a > 'SaveIntoDataSourceCommand'. > Steps to reproduce: > {code} > spark-shell --packages org.postgresql:postgresql:42.1.1 > {code} > {code} > import org.apache.spark.sql.execution.QueryExecution > import org.apache.spark.sql.util.QueryExecutionListener > val listener = new QueryExecutionListener { > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > System.out.println(qe.toString()) > } > } > spark.listenerManager.register(listener) > spark.range(100).write.format("jdbc").option("url", > "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", > "org.postgresql.Driver").option("dbtable", "test").save() > {code} > The above will yield the following plan: > {code} > == Parsed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Analyzed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Optimized Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Physical Plan == > ExecutedCommand >+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists > +- Range (0, 100, step=1, splits=Some(8)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
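The idea discussed here boils down to masking sensitive keys in the options map before it can appear in any plan string. A simplified sketch of that approach (not Spark's actual redaction utility, and the key list is only an example):
{code}
// Mask values for option keys that look sensitive before logging or toString.
def redactOptions(options: Map[String, String]): Map[String, String] = {
  val sensitiveKeys = Seq("password", "secret", "token")
  options.map { case (key, value) =>
    if (sensitiveKeys.exists(key.toLowerCase.contains)) key -> "*********(redacted)"
    else key -> value
  }
}

// redactOptions(Map("url" -> "jdbc:postgresql:sparkdb", "password" -> "pass"))
// => Map(url -> jdbc:postgresql:sparkdb, password -> *********(redacted))
{code}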
[jira] [Created] (SPARK-22486) Support synchronous offset commits for Kafka
Jeremy Beard created SPARK-22486: Summary: Support synchronous offset commits for Kafka Key: SPARK-22486 URL: https://issues.apache.org/jira/browse/SPARK-22486 Project: Spark Issue Type: Improvement Components: DStreams Affects Versions: 2.2.0 Reporter: Jeremy Beard CanCommitOffsets provides asynchronous offset commits (via Consumer#commitAsync), and it would be useful if it also provided synchronous offset commits (via Consumer#commitSync) for when the desired behavior is to block until it is complete. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
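For context, the existing asynchronous path looks roughly like the pattern below (the usual direct-stream usage; it assumes a `stream` created with KafkaUtils.createDirectStream). The commit is only queued, so the caller has no way to block until it has actually been applied, which is what a commitSync-style API would add:
{code}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Asynchronous, fire-and-forget: the offsets are queued and committed later,
  // so the application cannot wait here for the commit to complete.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}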
[jira] [Assigned] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa
[ https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22485: Assignee: Apache Spark > Use `exclude[Problem]` instead `excludePackage` in MiMa > --- > > Key: SPARK-22485 > URL: https://issues.apache.org/jira/browse/SPARK-22485 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > `excludePackage` is deprecated like the > [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33]. > {code} > @deprecated("Replace with > ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15") > def excludePackage(packageName: String): ProblemFilter = { > exclude[Problem](packageName + ".*") > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage`
Dongjoon Hyun created SPARK-22485: - Summary: Use `exclude[Problem]` instead `excludePackage` Key: SPARK-22485 URL: https://issues.apache.org/jira/browse/SPARK-22485 Project: Spark Issue Type: Task Components: Build Affects Versions: 2.3.0 Reporter: Dongjoon Hyun Priority: Minor `excludePackage` is deprecated like the [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33]. {code} @deprecated("Replace with ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15") def excludePackage(packageName: String): ProblemFilter = { exclude[Problem](packageName + ".*") } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
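Concretely, the migration the deprecation message asks for would look like this inside a MiMa exclusion list (the package name is only an example):
{code}
import com.typesafe.tools.mima.core._

// Deprecated form:
//   excludePackage("org.apache.spark.example")
// Replacement suggested by the deprecation message:
val filters = Seq(ProblemFilters.exclude[Problem]("org.apache.spark.example.*"))
{code}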
[jira] [Assigned] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa
[ https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22485: Assignee: (was: Apache Spark) > Use `exclude[Problem]` instead `excludePackage` in MiMa > --- > > Key: SPARK-22485 > URL: https://issues.apache.org/jira/browse/SPARK-22485 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `excludePackage` is deprecated like the > [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33]. > {code} > @deprecated("Replace with > ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15") > def excludePackage(packageName: String): ProblemFilter = { > exclude[Problem](packageName + ".*") > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa
[ https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246265#comment-16246265 ] Apache Spark commented on SPARK-22485: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19710 > Use `exclude[Problem]` instead `excludePackage` in MiMa > --- > > Key: SPARK-22485 > URL: https://issues.apache.org/jira/browse/SPARK-22485 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `excludePackage` is deprecated like the > [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33]. > {code} > @deprecated("Replace with > ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15") > def excludePackage(packageName: String): ProblemFilter = { > exclude[Problem](packageName + ".*") > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa
[ https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22485: -- Summary: Use `exclude[Problem]` instead `excludePackage` in MiMa (was: Use `exclude[Problem]` instead `excludePackage`) > Use `exclude[Problem]` instead `excludePackage` in MiMa > --- > > Key: SPARK-22485 > URL: https://issues.apache.org/jira/browse/SPARK-22485 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `excludePackage` is deprecated like the > [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33]. > {code} > @deprecated("Replace with > ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15") > def excludePackage(packageName: String): ProblemFilter = { > exclude[Problem](packageName + ".*") > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22484) PySpark DataFrame.write.csv(quote="") uses nullchar as quote
John Bollenbacher created SPARK-22484: - Summary: PySpark DataFrame.write.csv(quote="") uses nullchar as quote Key: SPARK-22484 URL: https://issues.apache.org/jira/browse/SPARK-22484 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.2.0 Reporter: John Bollenbacher Priority: Minor [Documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=save#pyspark.sql.DataFrame) of DataFrame.write.csv() says that setting the quote parameter to an empty string should turn off quoting. Instead, it uses the [null character](https://en.wikipedia.org/wiki/Null_character) as the quote. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system
[ https://issues.apache.org/jira/browse/SPARK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22483: Assignee: (was: Apache Spark) > Exposing java.nio bufferedPool memory metrics to metrics system > --- > > Key: SPARK-22483 > URL: https://issues.apache.org/jira/browse/SPARK-22483 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Srinivasa Reddy Vundela > > Spark currently exposes on-heap and off-heap memory of JVM to metric system. > Currently there is no way to know how much direct/mapped memory allocated for > java.nio buffered pools. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system
[ https://issues.apache.org/jira/browse/SPARK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246203#comment-16246203 ] Apache Spark commented on SPARK-22483: -- User 'vundela' has created a pull request for this issue: https://github.com/apache/spark/pull/19709 > Exposing java.nio bufferedPool memory metrics to metrics system > --- > > Key: SPARK-22483 > URL: https://issues.apache.org/jira/browse/SPARK-22483 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Srinivasa Reddy Vundela > > Spark currently exposes on-heap and off-heap memory of JVM to metric system. > Currently there is no way to know how much direct/mapped memory allocated for > java.nio buffered pools. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system
[ https://issues.apache.org/jira/browse/SPARK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22483: Assignee: Apache Spark > Exposing java.nio bufferedPool memory metrics to metrics system > --- > > Key: SPARK-22483 > URL: https://issues.apache.org/jira/browse/SPARK-22483 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Srinivasa Reddy Vundela >Assignee: Apache Spark > > Spark currently exposes on-heap and off-heap memory of JVM to metric system. > Currently there is no way to know how much direct/mapped memory allocated for > java.nio buffered pools. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system
Srinivasa Reddy Vundela created SPARK-22483: --- Summary: Exposing java.nio bufferedPool memory metrics to metrics system Key: SPARK-22483 URL: https://issues.apache.org/jira/browse/SPARK-22483 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 2.2.0 Reporter: Srinivasa Reddy Vundela Spark currently exposes on-heap and off-heap memory of JVM to metric system. Currently there is no way to know how much direct/mapped memory allocated for java.nio buffered pools. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246176#comment-16246176 ] Ran Haim commented on SPARK-22481: -- It takes about 2 seconds to create the dataset... I need to refresh 30 tables, and that now takes a minute, where it used to take 2 seconds. In the old code it does the same, because isCached actually uses the plan and not the dataset. > CatalogImpl.refreshTable is slow > > > Key: SPARK-22481 > URL: https://issues.apache.org/jira/browse/SPARK-22481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.1.2, 2.2.0 >Reporter: Ran Haim >Priority: Critical > > CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become > really slow. > The cause of the issue is that it is now *always* creates a dataset, and this > is redundant most of the time, we only need the dataset if the table is > cached. > before 2.1.1: > override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > // Temp tables: refresh (or invalidate) any metadata/data cached in the > plan recursively. > // Non-temp tables: refresh the metadata cache. > sessionCatalog.refreshTable(tableIdent) > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. > val logicalPlan = > sparkSession.sessionState.catalog.lookupRelation(tableIdent) > // Use lookupCachedData directly since RefreshTable also takes > databaseName. > val isCached = > sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty > if (isCached) { > // Create a data frame to represent the table. > // TODO: Use uncacheTable once it supports database name. > {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color} > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(df, > Some(tableIdent.table)) > } > } > after 2.1.1: >override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > // Temp tables: refresh (or invalidate) any metadata/data cached in the > plan recursively. > // Non-temp tables: refresh the metadata cache. > sessionCatalog.refreshTable(tableIdent) > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. > {color:red} val table = sparkSession.table(tableIdent){color} > if (isCached(table)) { > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = > true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > } > } -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-22287: - Target Version/s: 2.2.1 > SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher > - > > Key: SPARK-22287 > URL: https://issues.apache.org/jira/browse/SPARK-22287 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.1, 2.2.0, 2.3.0 >Reporter: paul mackles >Priority: Minor > > There does not appear to be a way to control the heap size used by > MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not > honored for that particular daemon. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19606) Support constraints in spark-dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19606: - Target Version/s: 2.2.1 > Support constraints in spark-dispatcher > --- > > Key: SPARK-19606 > URL: https://issues.apache.org/jira/browse/SPARK-19606 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Philipp Hoffmann > > The `spark.mesos.constraints` configuration is ignored by the > spark-dispatcher. The constraints need to be passed in the Framework > information when registering with Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22482) Unreadable Parquet array columns
Costas Piliotis created SPARK-22482: --- Summary: Unreadable Parquet array columns Key: SPARK-22482 URL: https://issues.apache.org/jira/browse/SPARK-22482 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Environment: Spark 2.1.0 Parquet 1.8.1 Hive 1.2 Hive 2.1.0 presto 0.157 presto 0.180 Reporter: Costas Piliotis We have seen an issue with writing out parquet data from spark. int and bool arrays seem to be throwing exceptions when trying to read the parquet files from hive and presto. I've logged a ticket here: PARQUET-1157 with the parquet project but I'm not sure if it's an issue within their project or an issue with spark itself. Spark is reading parquet-avro data which is output by a mapreduce job and writing it out to parquet. The inbound parquet format has the column defined as: {code} optional group playerpositions_ai (LIST) { repeated int32 array; } {code} Spark is redefining this data as this: {code} optional group playerpositions_ai (LIST) { repeated group list { optional int32 element; } } {code} and with legacy format: {code} optional group playerpositions_ai (LIST) { repeated group bag { optional int32 array; } } {code} The parquet data was tested in Hive 1.2, Hive 2.1, Presto 0.157, Presto 0.180, and Spark 2.1, as well as Amazon Athena (which is some form of presto implementation). I believe that the above schema is valid for writing out parquet. The spark command writing it out is simple: {code} data.repartition(((data.count() / 1000) + 1).toInt).write.format("parquet") .mode("append") .partitionBy(partitionColumns: _*) .save(path) {code} We initially wrote this out with legacy format turned off but later turned on legacy format and have seen this error occur the same way with legacy off and on. Spark's stack trace from reading this is: {code} java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream. 
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82) at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64) at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readInteger(DictionaryValuesReader.java:112) at org.apache.parquet.column.impl.ColumnReaderImpl$2$3.read(ColumnReaderImpl.java:243) at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464) at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370) at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Also do note that our data is stored on S3 if that matters. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246070#comment-16246070 ] Wenchen Fan commented on SPARK-22481: - `SparkSession.table` is pretty cheap; I think the slowness is due to the fact that we now uncache all plans that refer to the given plan. This is for correctness, so I don't think it's a regression. > CatalogImpl.refreshTable is slow > > > Key: SPARK-22481 > URL: https://issues.apache.org/jira/browse/SPARK-22481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.1.2, 2.2.0 >Reporter: Ran Haim >Priority: Critical > > CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become > really slow. > The cause of the issue is that it is now *always* creates a dataset, and this > is redundant most of the time, we only need the dataset if the table is > cached. > before 2.1.1: > override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > // Temp tables: refresh (or invalidate) any metadata/data cached in the > plan recursively. > // Non-temp tables: refresh the metadata cache. > sessionCatalog.refreshTable(tableIdent) > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. > val logicalPlan = > sparkSession.sessionState.catalog.lookupRelation(tableIdent) > // Use lookupCachedData directly since RefreshTable also takes > databaseName. > val isCached = > sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty > if (isCached) { > // Create a data frame to represent the table. > // TODO: Use uncacheTable once it supports database name. > {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color} > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(df, > Some(tableIdent.table)) > } > } > after 2.1.1: >override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > // Temp tables: refresh (or invalidate) any metadata/data cached in the > plan recursively. > // Non-temp tables: refresh the metadata cache. > sessionCatalog.refreshTable(tableIdent) > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. > {color:red} val table = sparkSession.table(tableIdent){color} > if (isCached(table)) { > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = > true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > } > } -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246041#comment-16246041 ] Kazuaki Ishizaki commented on SPARK-22481: -- This change was introduced by [this PR|https://github.com/apache/spark/pull/17097/files#diff-463cb1b0f60d87ada075a820f18e1104R443]. It seems to be related to the first refactoring in the description. [~cloud_fan] Is there any reason to always call {{sparkSession.table()}}? > CatalogImpl.refreshTable is slow > > > Key: SPARK-22481 > URL: https://issues.apache.org/jira/browse/SPARK-22481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.1.2, 2.2.0 >Reporter: Ran Haim >Priority: Critical > > CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become > really slow. > The cause of the issue is that it is now *always* creates a dataset, and this > is redundant most of the time, we only need the dataset if the table is > cached. > before 2.1.1: > override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > // Temp tables: refresh (or invalidate) any metadata/data cached in the > plan recursively. > // Non-temp tables: refresh the metadata cache. > sessionCatalog.refreshTable(tableIdent) > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. > val logicalPlan = > sparkSession.sessionState.catalog.lookupRelation(tableIdent) > // Use lookupCachedData directly since RefreshTable also takes > databaseName. > val isCached = > sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty > if (isCached) { > // Create a data frame to represent the table. > // TODO: Use uncacheTable once it supports database name. > {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color} > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(df, > Some(tableIdent.table)) > } > } > after 2.1.1: >override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > // Temp tables: refresh (or invalidate) any metadata/data cached in the > plan recursively. > // Non-temp tables: refresh the metadata cache. > sessionCatalog.refreshTable(tableIdent) > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. > {color:red} val table = sparkSession.table(tableIdent){color} > if (isCached(table)) { > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = > true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > } > } -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-22481: - Description: CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become really slow. The cause of the issue is that it is now *always* creates a dataset, and this is redundant most of the time, we only need the dataset if the table is cached. before 2.1.1: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent) // Use lookupCachedData directly since RefreshTable also takes databaseName. val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty if (isCached) { // Create a data frame to represent the table. // TODO: Use uncacheTable once it supports database name. {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color} // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table)) } } after 2.1.1: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. {color:red} val table = sparkSession.table(tableIdent){color} if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } was: CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become really slow. The cause of the issue is that it is now *always* creates a dataset, and this is redundant most of the time, we only need the dataset if the table is cached. code before the change: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent) // Use lookupCachedData directly since RefreshTable also takes databaseName. val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty if (isCached) { // Create a data frame to represent the table. // TODO: Use uncacheTable once it supports database name. {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color} // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) // Cache it again. 
sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table)) } } after the change: /override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. {color:red} val table = sparkSession.table(tableIdent){color} if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } > CatalogImpl.refreshTable is slow > > > Key: SPARK-22481 > URL: https://issues.apache.org/jira/browse/SPARK-22481 > Project: Spark > Issue Type: Bug >
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-22481: - Description: CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become really slow. The cause of the issue is that it is now *always* creates a dataset, and this is redundant most of the time, we only need the dataset if the table is cached. code before the change: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent) // Use lookupCachedData directly since RefreshTable also takes databaseName. val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty if (isCached) { // Create a data frame to represent the table. // TODO: Use uncacheTable once it supports database name. {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color} // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table)) } } after the change: /override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. {color:red} val table = sparkSession.table(tableIdent){color} if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } was: CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become really slow. The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time, we only need the dataset if the table is cached. code before the change: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent) // Use lookupCachedData directly since RefreshTable also takes databaseName. val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty if (isCached) { // Create a data frame to represent the table. // TODO: Use uncacheTable once it supports database name. {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color} // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) // Cache it again. 
sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table)) } } after the change: /override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. {color:red} val table = sparkSession.table(tableIdent){color} if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } > CatalogImpl.refreshTable is slow > > > Key: SPARK-22481 > URL: https://issues.apache.org/jira/browse/SPARK-22481 > Project: Spark > Issue Typ
[jira] [Created] (SPARK-22481) CatalogImpl.refreshTable is slow
Ran Haim created SPARK-22481: Summary: CatalogImpl.refreshTable is slow Key: SPARK-22481 URL: https://issues.apache.org/jira/browse/SPARK-22481 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.2, 2.1.1 Reporter: Ran Haim Priority: Critical CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become really slow. The cause of the issue is that it now *always* creates a dataset, and this is redundant most of the time; we only need the dataset if the table is cached. code before the change: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent) // Use lookupCachedData directly since RefreshTable also takes databaseName. val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty if (isCached) { // Create a data frame to represent the table. // TODO: Use uncacheTable once it supports database name. * val df = Dataset.ofRows(sparkSession, logicalPlan)* // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table)) } } after the change: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. * val table = sparkSession.table(tableIdent)* if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-22481: - Description: CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become really slow. The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time, we only need the dataset if the table is cached. code before the change: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent) // Use lookupCachedData directly since RefreshTable also takes databaseName. val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty if (isCached) { // Create a data frame to represent the table. // TODO: Use uncacheTable once it supports database name. {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color} // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table)) } } after the change: /override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. {color:red} val table = sparkSession.table(tableIdent){color} if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } was: CatalogImpl.refreshTable was updated in 2.1.1 and since than it has become really slow. The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time, we only need the dataset if the table is cached. code before the change: override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent) // Use lookupCachedData directly since RefreshTable also takes databaseName. val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty if (isCached) { // Create a data frame to represent the table. // TODO: Use uncacheTable once it supports database name. * val df = Dataset.ofRows(sparkSession, logicalPlan)* // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true) // Cache it again. 
sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table)) } } after the change: /override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively. // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. * val table = sparkSession.table(tableIdent)* if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } > CatalogImpl.refreshTable is slow > > > Key: SPARK-22481 > URL: https://issues.apache.org/jira/browse/SPARK-22481 > Project: Spark > Issue Type: Bug > Components: SQ
[jira] [Commented] (SPARK-22474) cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]
[ https://issues.apache.org/jira/browse/SPARK-22474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245915#comment-16245915 ] Kazuaki Ishizaki commented on SPARK-22474: -- This check was introduced by [this PR|https://github.com/apache/spark/commit/8ab50765cd793169091d983b50d87a391f6ac1f4]. While I did not run it with Spark 2.0.2, Spark 2.0.2 seem to include this PR, too. > cannot read a parquet file containing a Seq[Map[MyCaseClass, String]] > - > > Key: SPARK-22474 > URL: https://issues.apache.org/jira/browse/SPARK-22474 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.2, 2.2.0 >Reporter: Mikael Valot > > The following code run in spark-shell throws an exception. It is working fine > in Spark 2.0.2 > {code:java} > case class MyId(v: String) > case class MyClass(infos: Seq[Map[MyId, String]]) > val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah" > seq.toDS().write.parquet("/tmp/myclass") > spark.read.parquet("/tmp/myclass").as[MyClass].collect() > Caused by: org.apache.spark.sql.AnalysisException: Map key type is expected > to be a primitive type, but found: required group key { > optional binary v (UTF8); > }; > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:201) > at scala.Option.fold(Option.scala:158) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:87) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetSchemaConverter$$convert(ParquetSchemaConverter.scala:84) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) > at scala.Option.fold(Option.scala:158) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) > at > 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetArrayConverter.(ParquetRowConverter.scala:483) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:298) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:183) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:180) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasou
[jira] [Commented] (SPARK-22474) cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]
[ https://issues.apache.org/jira/browse/SPARK-22474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245875#comment-16245875 ] Mikael Valot commented on SPARK-22474: -- I can read the file if I comment out line 581 in org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter {code:java} def checkConversionRequirement(f: => Boolean, message: String): Unit = { if (!f) { // throw new AnalysisException(message) } } {code} > cannot read a parquet file containing a Seq[Map[MyCaseClass, String]] > - > > Key: SPARK-22474 > URL: https://issues.apache.org/jira/browse/SPARK-22474 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.2, 2.2.0 >Reporter: Mikael Valot > > The following code run in spark-shell throws an exception. It is working fine > in Spark 2.0.2 > {code:java} > case class MyId(v: String) > case class MyClass(infos: Seq[Map[MyId, String]]) > val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah" > seq.toDS().write.parquet("/tmp/myclass") > spark.read.parquet("/tmp/myclass").as[MyClass].collect() > Caused by: org.apache.spark.sql.AnalysisException: Map key type is expected > to be a primitive type, but found: required group key { > optional binary v (UTF8); > }; > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:201) > at scala.Option.fold(Option.scala:158) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:87) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetSchemaConverter$$convert(ParquetSchemaConverter.scala:84) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) > at scala.Option.fold(Option.scala:158) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) > at > 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetArrayConverter.(ParquetRowConverter.scala:483) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:298) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:183) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:180) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable
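(A possible user-side workaround, sketched here under the assumption that the nested case-class key can be reduced to its single primitive field; {{MyClassFlat}} is a hypothetical helper type, not part of the report. Since the Parquet schema converter requires map keys to be primitive, flattening the key before writing avoids the failing conversion.)

{code:scala}
import spark.implicits._

case class MyId(v: String)
case class MyClass(infos: Seq[Map[MyId, String]])
// Hypothetical flattened shape with a primitive (String) map key.
case class MyClassFlat(infos: Seq[Map[String, String]])

val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah"))))
// Replace the case-class key by its underlying string before writing.
val flat = seq.toDS().map(c => MyClassFlat(c.infos.map(_.map { case (k, v) => k.v -> v })))
flat.write.parquet("/tmp/myclass_flat")
spark.read.parquet("/tmp/myclass_flat").as[MyClassFlat].collect()
{code}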
[jira] [Assigned] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials
[ https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22479: Assignee: (was: Apache Spark) > SaveIntoDataSourceCommand logs jdbc credentials > --- > > Key: SPARK-22479 > URL: https://issues.apache.org/jira/browse/SPARK-22479 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Onur Satici > > JDBC credentials are not redacted in plans including a > 'SaveIntoDataSourceCommand'. > Steps to reproduce: > {code} > spark-shell --packages org.postgresql:postgresql:42.1.1 > {code} > {code} > import org.apache.spark.sql.execution.QueryExecution > import org.apache.spark.sql.util.QueryExecutionListener > val listener = new QueryExecutionListener { > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > System.out.println(qe.toString()) > } > } > spark.listenerManager.register(listener) > spark.range(100).write.format("jdbc").option("url", > "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", > "org.postgresql.Driver").option("dbtable", "test").save() > {code} > The above will yield the following plan: > {code} > == Parsed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Analyzed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Optimized Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Physical Plan == > ExecutedCommand >+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists > +- Range (0, 100, step=1, splits=Some(8)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
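(For context, a hedged sketch of what redaction could look like here: mask option values whose keys look sensitive before the options map is rendered into a plan string. The key list and helper name are illustrative and are not taken from the actual patch.)

{code:scala}
// Illustrative only: mask option values whose keys look sensitive.
val sensitiveKeys = Set("password", "secret", "token", "credential")

def redactOptions(options: Map[String, String]): Map[String, String] =
  options.map { case (key, value) =>
    if (sensitiveKeys.exists(key.toLowerCase.contains)) key -> "*********(redacted)"
    else key -> value
  }

// redactOptions(Map("url" -> "jdbc:postgresql:sparkdb", "password" -> "pass"))
// => Map(url -> jdbc:postgresql:sparkdb, password -> *********(redacted))
{code}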
[jira] [Commented] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials
[ https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245766#comment-16245766 ] Apache Spark commented on SPARK-22479: -- User 'onursatici' has created a pull request for this issue: https://github.com/apache/spark/pull/19708 > SaveIntoDataSourceCommand logs jdbc credentials > --- > > Key: SPARK-22479 > URL: https://issues.apache.org/jira/browse/SPARK-22479 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Onur Satici > > JDBC credentials are not redacted in plans including a > 'SaveIntoDataSourceCommand'. > Steps to reproduce: > {code} > spark-shell --packages org.postgresql:postgresql:42.1.1 > {code} > {code} > import org.apache.spark.sql.execution.QueryExecution > import org.apache.spark.sql.util.QueryExecutionListener > val listener = new QueryExecutionListener { > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > System.out.println(qe.toString()) > } > } > spark.listenerManager.register(listener) > spark.range(100).write.format("jdbc").option("url", > "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", > "org.postgresql.Driver").option("dbtable", "test").save() > {code} > The above will yield the following plan: > {code} > == Parsed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Analyzed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Optimized Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Physical Plan == > ExecutedCommand >+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists > +- Range (0, 100, step=1, splits=Some(8)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials
[ https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22479: Assignee: Apache Spark > SaveIntoDataSourceCommand logs jdbc credentials > --- > > Key: SPARK-22479 > URL: https://issues.apache.org/jira/browse/SPARK-22479 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Onur Satici >Assignee: Apache Spark > > JDBC credentials are not redacted in plans including a > 'SaveIntoDataSourceCommand'. > Steps to reproduce: > {code} > spark-shell --packages org.postgresql:postgresql:42.1.1 > {code} > {code} > import org.apache.spark.sql.execution.QueryExecution > import org.apache.spark.sql.util.QueryExecutionListener > val listener = new QueryExecutionListener { > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > System.out.println(qe.toString()) > } > } > spark.listenerManager.register(listener) > spark.range(100).write.format("jdbc").option("url", > "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", > "org.postgresql.Driver").option("dbtable", "test").save() > {code} > The above will yield the following plan: > {code} > == Parsed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Analyzed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Optimized Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Physical Plan == > ExecutedCommand >+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists > +- Range (0, 100, step=1, splits=Some(8)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22478) Spark - Truncate date by Day / Hour
[ https://issues.apache.org/jira/browse/SPARK-22478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245748#comment-16245748 ] Marco Gaido commented on SPARK-22478: - The reason {{TRUNC}} works like this is compatibility with Hive, where the same function behaves the same way. You can achieve your goal in other ways: use {{date_format}} to get a string with the information you need, use the {{HOUR}} function, or create your own UDF. > Spark - Truncate date by Day / Hour > > > Key: SPARK-22478 > URL: https://issues.apache.org/jira/browse/SPARK-22478 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: david hodeffi >Priority: Trivial > > Currently it is possible to truncate a date/timestamp only by month or year; > the ability to truncate by hour or day is missing. > A use case that requires this functionality: aggregating the financial > transactions of a party into daily/hourly summaries. > Those summaries are required for machine learning algorithms. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
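(A small sketch of the workarounds mentioned above, using {{to_date}} and {{date_format}} on Spark 2.2; the DataFrame {{df}} and its columns {{ts}} and {{amount}} are assumed for illustration.)

{code:scala}
import org.apache.spark.sql.functions._

// Truncate a timestamp column "ts" to day / hour granularity without TRUNC.
val withBuckets = df
  .withColumn("day_bucket", to_date(col("ts")))                              // day granularity
  .withColumn("hour_bucket", date_format(col("ts"), "yyyy-MM-dd HH:00:00"))  // hour granularity, as a string

// Daily / hourly summaries are then a plain aggregation.
val hourlySummary = withBuckets.groupBy(col("hour_bucket")).agg(sum(col("amount")).as("total"))
{code}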
[jira] [Resolved] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once
[ https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20542. Resolution: Fixed Fix Version/s: 2.3.0 > Add an API into Bucketizer that can bin a lot of columns all at once > > > Key: SPARK-20542 > URL: https://issues.apache.org/jira/browse/SPARK-20542 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > Currently ML's Bucketizer can only bin a single column of continuous features. If a > dataset has thousands of continuous columns that need to be binned, we end up > with thousands of ML stages, which is very inefficient for query planning > and execution. > We should have a type of bucketizer that can bin many columns all at > once. It would need to accept a list of arrays of split points corresponding > to the columns to bin, and it would make things more efficient by replacing > thousands of stages with just one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
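(For reference, a sketch of the multi-column usage this ticket enables in 2.3.0. The setter names below are recalled from the 2.3.0 API and should be double-checked against the release docs; a DataFrame {{df}} with numeric columns {{f1}} and {{f2}} is assumed.)

{code:scala}
import org.apache.spark.ml.feature.Bucketizer

// One stage bins both columns instead of one Bucketizer stage per column.
val bucketizer = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_bin", "f2_bin"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 1.0, 5.0, Double.PositiveInfinity)))

val binned = bucketizer.transform(df)
{code}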
[jira] [Updated] (SPARK-22480) Dynamic Watermarking
[ https://issues.apache.org/jira/browse/SPARK-22480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jochen Niebuhr updated SPARK-22480: --- Description: When you're using the watermark feature, you're forced to provide an absolute duration to identify late events. For the case we're using structured streaming for, this is not completely working. In our case, late events will be possible on the next business day. So I'd like to use 24 hours watermark for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I would suggest is being able to use a function or expression to {{withWatermark}} so people can implement this or similar behaviors. If this sounds like a good idea, I can probably supply a pull request. (was: When you're using the watermark feature, you're forced to provide an absolute duration to identify late events. For the case we're using structured streaming for, this is not completely working. In our case, late events will be possible on the next business day. So I'd like to use 24 hours watermark for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I would suggest is being able to use a function or expression to `withWatermark` so people can implement this or similar behaviors. If this sounds like a good idea, I can probably supply a pull request.) > Dynamic Watermarking > > > Key: SPARK-22480 > URL: https://issues.apache.org/jira/browse/SPARK-22480 > Project: Spark > Issue Type: Wish > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jochen Niebuhr > > When you're using the watermark feature, you're forced to provide an absolute > duration to identify late events. For the case we're using structured > streaming for, this is not completely working. In our case, late events will > be possible on the next business day. So I'd like to use 24 hours watermark > for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I would > suggest is being able to use a function or expression to {{withWatermark}} so > people can implement this or similar behaviors. If this sounds like a good > idea, I can probably supply a pull request. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
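(For contrast, the current fixed-duration API versus a rough sketch of the function-based signature being wished for; {{events}} is an assumed streaming Dataset with an {{eventTime}} column, and the second form is purely hypothetical and does not exist in Spark.)

{code:scala}
// Today: one absolute tolerance for all events.
val withStaticWatermark = events.withWatermark("eventTime", "24 hours")

// Hypothetical shape of the requested API (not real Spark code): let the tolerance
// depend on the event itself, e.g. longer tolerances around the weekend.
// def withWatermark(eventTime: String, delay: java.sql.Timestamp => String): Dataset[T]
// events.withWatermark("eventTime", ts => if (isFridayOrSaturday(ts)) "72 hours" else "24 hours")
{code}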
[jira] [Updated] (SPARK-22480) Dynamic Watermarking
[ https://issues.apache.org/jira/browse/SPARK-22480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jochen Niebuhr updated SPARK-22480: --- Description: When you're using the watermark feature, you're forced to provide an absolute duration to identify late events. For the case we're using structured streaming for, this is not completely working. In our case, late events will be possible on the next business day. So I'd like to use 24 hours watermark for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I would suggest is being able to use a function or expression to `withWatermark` so people can implement this or similar behaviors. If this sounds like a good idea, I can probably supply a pull request. (was: When you're using the watermark feature, you're forced to provide an absolute duration to identify late events. For the case we're using structured streaming for, this is not completely working. In our case, late events will be possible on the next business day. So I'd like to use 24 hours watermark for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. If this sounds like a good idea, I can probably supply a pull request.) > Dynamic Watermarking > > > Key: SPARK-22480 > URL: https://issues.apache.org/jira/browse/SPARK-22480 > Project: Spark > Issue Type: Wish > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jochen Niebuhr > > When you're using the watermark feature, you're forced to provide an absolute > duration to identify late events. For the case we're using structured > streaming for, this is not completely working. In our case, late events will > be possible on the next business day. So I'd like to use 24 hours watermark > for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I would > suggest is being able to use a function or expression to `withWatermark` so > people can implement this or similar behaviors. If this sounds like a good > idea, I can probably supply a pull request. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22480) Dynamic Watermarking
Jochen Niebuhr created SPARK-22480: -- Summary: Dynamic Watermarking Key: SPARK-22480 URL: https://issues.apache.org/jira/browse/SPARK-22480 Project: Spark Issue Type: Wish Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Jochen Niebuhr When you're using the watermark feature, you're forced to provide an absolute duration to identify late events. For the case we're using structured streaming for, this is not completely working. In our case, late events will be possible on the next business day. So I'd like to use 24 hours watermark for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. If this sounds like a good idea, I can probably supply a pull request. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials
[ https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245677#comment-16245677 ] Onur Satici commented on SPARK-22479: - will submit a pr > SaveIntoDataSourceCommand logs jdbc credentials > --- > > Key: SPARK-22479 > URL: https://issues.apache.org/jira/browse/SPARK-22479 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Onur Satici > > JDBC credentials are not redacted in plans including a > 'SaveIntoDataSourceCommand'. > Steps to reproduce: > {code} > spark-shell --packages org.postgresql:postgresql:42.1.1 > {code} > {code} > import org.apache.spark.sql.execution.QueryExecution > import org.apache.spark.sql.util.QueryExecutionListener > val listener = new QueryExecutionListener { > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > System.out.println(qe.toString()) > } > } > spark.listenerManager.register(listener) > spark.range(100).write.format("jdbc").option("url", > "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", > "org.postgresql.Driver").option("dbtable", "test").save() > {code} > The above will yield the following plan: > {code} > == Parsed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Analyzed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Optimized Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Physical Plan == > ExecutedCommand >+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists > +- Range (0, 100, step=1, splits=Some(8)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22472: Assignee: (was: Apache Spark) > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245676#comment-16245676 ] Apache Spark commented on SPARK-22472: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19707 > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22472: Assignee: Apache Spark > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik >Assignee: Apache Spark > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
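(Until the fix lands, two defensive patterns that keep the null visible, mirroring the reproduction above and runnable in spark-shell: stay in the untyped API for the arithmetic, or drop the nulls explicitly before mapping over a primitive {{Long}}.)

{code:scala}
import spark.implicits._

val s = sc.parallelize(Seq[Option[Long]](None, Some(1L), Some(5L))).toDF("v")

// Untyped arithmetic preserves the null (as the report itself shows):
s.select($"v" * 2).show(false)

// If a typed map to a primitive Long is needed, remove the nulls explicitly first:
s.na.drop(Seq("v")).as[Long].map(_ * 2).show(false)
{code}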
[jira] [Created] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials
Onur Satici created SPARK-22479: --- Summary: SaveIntoDataSourceCommand logs jdbc credentials Key: SPARK-22479 URL: https://issues.apache.org/jira/browse/SPARK-22479 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Onur Satici JDBC credentials are not redacted in plans including a 'SaveIntoDataSourceCommand'. Steps to reproduce: {code} spark-shell --packages org.postgresql:postgresql:42.1.1 {code} {code} import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.util.QueryExecutionListener val listener = new QueryExecutionListener { override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {} override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = { System.out.println(qe.toString()) } } spark.listenerManager.register(listener) spark.range(100).write.format("jdbc").option("url", "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", "org.postgresql.Driver").option("dbtable", "test").save() {code} The above will yield the following plan: {code} == Parsed Logical Plan == SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Analyzed Logical Plan == SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Optimized Logical Plan == SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) == Physical Plan == ExecutedCommand +- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), ErrorIfExists +- Range (0, 100, step=1, splits=Some(8)) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245556#comment-16245556 ] Li Yuanjian commented on SPARK-20928: - Our team discussed the design sketch in detail; we have some ideas and questions, noted below. 1. Will the window operation be supported in Continuous Processing Mode? Even if we only consider narrow dependencies for now, as the design sketch describes, the exactly-once guarantee may not be achievable with the current implementation of window and watermark. 2. Should the EpochIDs be aligned in scenarios that are not map-only? {quote} The design can also work with blocking operators, although it'd require the blocking operators to ensure epoch markers from all the partitions have been received by the operator before moving forward to commit. {quote} Does `blocking operators` mean operators that need a shuffle? We think only operators with an ordering relation (like window/mapState/sortByKey) need the EpochIDs aligned; others (like groupBy) do not. 3. Also, in many-to-one scenarios (like shuffle and window), should we use a new EpochID in the shuffle read stage and the window slide-out trigger, or reuse the original batch of EpochIDs? > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22478) Spark - Truncate date by Day / Hour
david hodeffi created SPARK-22478: - Summary: Spark - Truncate date by Day / Hour Key: SPARK-22478 URL: https://issues.apache.org/jira/browse/SPARK-22478 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.2.0 Reporter: david hodeffi Priority: Trivial Currently it is possible to truncate a date/timestamp only by month or year; the ability to truncate by hour or day is missing. A use case that requires this functionality: aggregating the financial transactions of a party into daily/hourly summaries. Those summaries are required for machine learning algorithms. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22477) windowing in spark
Deepak Nayak created SPARK-22477: Summary: windowing in spark Key: SPARK-22477 URL: https://issues.apache.org/jira/browse/SPARK-22477 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.2.0 Reporter: Deepak Nayak I have performed a map operation over a set of attributes and have stored the result in a variable. Next I want to perform windowing over a single attribute, not over the variable that contains all the attributes. I have tried doing that, but I do not get all the attribute values after windowing; it gives windowed output only for that particular attribute. Any help would be appreciated. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
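(The question above is hard to act on as written, but the usual way to window over a single attribute while keeping every other attribute in the output is a window function applied with {{withColumn}} rather than a {{groupBy}}. A hedged sketch with made-up column names:)

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// The window is defined over a single attribute ("userId"),
// but withColumn keeps all of the original attributes in the result.
val w = Window.partitionBy(col("userId")).orderBy(col("eventTime"))
val windowed = df.withColumn("runningAmount", sum(col("amount")).over(w))
{code}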
[jira] [Resolved] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent
[ https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22405. - Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.3.0 > Enrich the event information and add new event of ExternalCatalogEvent > -- > > Key: SPARK-22405 > URL: https://issues.apache.org/jira/browse/SPARK-22405 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.3.0 > > > We're building a data lineage tool in which we need to monitor the metadata > changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already provides > several useful events, like "CreateDatabaseEvent", for a custom SparkListener to > use, but the information provided by such events is not rich enough. For > example, {{CreateTablePreEvent}} only provides the "database" name and "table" > name, not the full table metadata, which makes it hard for users to get all the > useful table-related information. > So here we propose to add new {{ExternalCatalogEvent}}s and enrich the > existing events for all catalog-related updates. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
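(A sketch of the consumer side this ticket is about: a SparkListener that picks catalog events off the listener bus. The event classes are the pre-existing ones named in the description; the exact package path is recalled from memory and worth verifying against the Spark source.)

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.catalyst.catalog.{CreateTablePreEvent, ExternalCatalogEvent}

spark.sparkContext.addSparkListener(new SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    // Before this ticket, only the database and table names are available here.
    case e: CreateTablePreEvent => println(s"about to create table ${e.database}.${e.name}")
    case _: ExternalCatalogEvent => // other catalog events
    case _ => // ignore non-catalog events
  }
})
{code}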
[jira] [Assigned] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters
[ https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22442: --- Assignee: Liang-Chi Hsieh > Schema generated by Product Encoder doesn't match case class field name when > using non-standard characters > -- > > Key: SPARK-22442 > URL: https://issues.apache.org/jira/browse/SPARK-22442 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Mikel San Vicente >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > The Product encoder encodes field names incorrectly when they contain certain > non-standard characters. > For example, for: > {quote} > case class MyType(`field.1`: String, `field 2`: String) > {quote} > we get the following schema: > {quote} > root > |-- field$u002E1: string (nullable = true) > |-- field$u00202: string (nullable = true) > {quote} > As a consequence of this issue, a DataFrame with the correct schema can't be > converted to a Dataset using .as[MyType] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
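A minimal repro sketch based on the example in SPARK-22442, assuming an active SparkSession named {{spark}}; on affected versions the last line fails because the encoder schema carries the encoded names {{field$u002E1}} / {{field$u00202}} rather than the original field names.
{code}
import spark.implicits._

case class MyType(`field.1`: String, `field 2`: String)

// A DataFrame whose schema matches the case class field names exactly
val df = Seq(("a", "b")).toDF("field.1", "field 2")

// Expected to work, but fails on affected versions because of the mismatched encoder schema
val ds = df.as[MyType]
{code}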
[jira] [Resolved] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters
[ https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22442. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19664 [https://github.com/apache/spark/pull/19664] > Schema generated by Product Encoder doesn't match case class field name when > using non-standard characters > -- > > Key: SPARK-22442 > URL: https://issues.apache.org/jira/browse/SPARK-22442 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Mikel San Vicente > Fix For: 2.3.0 > > > The Product encoder encodes field names incorrectly when they contain certain > non-standard characters. > For example, for: > {quote} > case class MyType(`field.1`: String, `field 2`: String) > {quote} > we get the following schema: > {quote} > root > |-- field$u002E1: string (nullable = true) > |-- field$u00202: string (nullable = true) > {quote} > As a consequence of this issue, a DataFrame with the correct schema can't be > converted to a Dataset using .as[MyType] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that should correctly produce an empty DataFrame seemingly leaves that DataFrame not initialised properly. In my code I do the subtract both ways: {code}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however; I can't seem to reduce it to a sample. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Another error is: {noformat} if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 385, in getNumPartitions AttributeError: 'NoneType' object has no attribute 'size' {noformat} This is happening at multiple points in my code.
[jira] [Resolved] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive
[ https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22463. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19663 [https://github.com/apache/spark/pull/19663] > Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to > distributed archive > -- > > Key: SPARK-22463 > URL: https://issues.apache.org/jira/browse/SPARK-22463 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.2, 2.2.0 >Reporter: Kent Yao > Fix For: 2.3.0 > > > When I ran self-contained SQL apps, such as > {code:java} > import org.apache.spark.sql.SparkSession > object ShowHiveTables { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .appName("Show Hive Tables") > .enableHiveSupport() > .getOrCreate() > spark.sql("show tables").show() > spark.stop() > } > } > {code} > in **yarn cluster** mode with `hive-site.xml` correctly placed in > `$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because > hive-site.xml was not on the AM/Driver's classpath. > Submitting them with `--files/--jars local/path/to/hive-site.xml` or putting it > into `$HADOOP_CONF_DIR/YARN_CONF_DIR` makes these apps work as well in cluster > mode as in client mode. However, according to the official doc, see @ > http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables > > Configuration of Hive is done by placing your hive-site.xml, core-site.xml > > (for security configuration), and hdfs-site.xml (for HDFS configuration) > > file in conf/. > We should either respect these configuration files in cluster mode too, or update > the doc for Hive tables in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
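For reference, a sketch of the `--files` workaround mentioned in SPARK-22463 as a spark-submit invocation; the application jar name and the local path to hive-site.xml are placeholders, and ShowHiveTables is the example class from the description.
{noformat}
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class ShowHiveTables \
  --files /path/to/hive-site.xml \
  show-hive-tables.jar
{noformat}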
[jira] [Assigned] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive
[ https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22463: --- Assignee: Kent Yao > Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to > distributed archive > -- > > Key: SPARK-22463 > URL: https://issues.apache.org/jira/browse/SPARK-22463 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.2, 2.2.0 >Reporter: Kent Yao >Assignee: Kent Yao > Fix For: 2.3.0 > > > When I ran self-contained SQL apps, such as > {code:java} > import org.apache.spark.sql.SparkSession > object ShowHiveTables { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .appName("Show Hive Tables") > .enableHiveSupport() > .getOrCreate() > spark.sql("show tables").show() > spark.stop() > } > } > {code} > in **yarn cluster** mode with `hive-site.xml` correctly placed in > `$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because > hive-site.xml was not on the AM/Driver's classpath. > Submitting them with `--files/--jars local/path/to/hive-site.xml` or putting it > into `$HADOOP_CONF_DIR/YARN_CONF_DIR` makes these apps work as well in cluster > mode as in client mode. However, according to the official doc, see @ > http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables > > Configuration of Hive is done by placing your hive-site.xml, core-site.xml > > (for security configuration), and hdfs-site.xml (for HDFS configuration) > > file in conf/. > We should either respect these configuration files in cluster mode too, or update > the doc for Hive tables in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245344#comment-16245344 ] Liang-Chi Hsieh commented on SPARK-22460: - FYI, there is already a reported issue for this in spark-avro: https://github.com/databricks/spark-avro/issues/229. > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using the spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since the epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class that > the timestamp field is converted incorrectly: the long value holding milliseconds > since the epoch is treated as seconds since the epoch. As can be seen below, this > shifts the timestamp many years into the future. > Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
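A possible workaround sketch until SPARK-22460 is resolved, continuing the snippet from the description (so {{spark}}, {{path}}, {{TestRecord}} and {{spark.implicits._}} are already in scope) and assuming the Avro file stores {{modified}} as a long holding milliseconds since the epoch: convert the column back to a timestamp explicitly before applying the case-class encoder.
{code}
import com.databricks.spark.avro._
import org.apache.spark.sql.functions._

// Dividing by 1000 turns milliseconds into seconds, which the cast to timestamp
// then interprets correctly.
val fixedOutput = spark.read.avro(path)
  .withColumn("modified", (col("modified") / 1000).cast("timestamp"))
  .as[TestRecord]
{code}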