[jira] [Created] (SPARK-22490) PySpark doc has misleading string for SparkSession.builder

2017-11-09 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22490:
---

 Summary: PySpark doc has misleading string for SparkSession.builder
 Key: SPARK-22490
 URL: https://issues.apache.org/jira/browse/SPARK-22490
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Xiao Li
Priority: Minor


We need to fix the following line in our PySpark doc 
http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html

{noformat}
 SparkSession.builder = ¶
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-11-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247113#comment-16247113
 ] 

Dongjoon Hyun commented on SPARK-22042:
---

Also, cc [~felixcheung].

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tejas Patil
>
> When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its 
> children, the children may not be properly constructed yet, as the child 
> subtree still has to go through other planner rules.
> In this particular case, the child is `SortMergeJoinExec`. Since the required 
> `Exchange` operators are not in place (because `EnsureRequirements` runs 
> _after_ `ReorderJoinPredicates`), the join's children do not have their 
> partitioning defined. This breaks while creating the `PartitioningCollection` 
> here: 
> https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69
> Small repro:
> {noformat}
> context.sql("SET spark.sql.autoBroadcastJoinThreshold=0")
> val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> df.write.format("parquet").saveAsTable("table1")
> df.write.format("parquet").saveAsTable("table2")
> df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
> sql("""
>   SELECT *
>   FROM (
> SELECT a.i, a.j, a.k
> FROM bucketed_table a
> JOIN table1 b
> ON a.i = b.i
>   ) c
>   JOIN table2
>   ON c.i = table2.i
> """).explain
> {noformat}
> This fails with :
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: 
> PartitioningCollection requires all of its partitionings have the same 
> numPartitions.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69)
>   at 
> org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:464)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:477)
>   ... 60 elided
> {noformat}

[jira] [Commented] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-11-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247112#comment-16247112
 ] 

Dongjoon Hyun commented on SPARK-22042:
---

Since this causes a runtime exception for the query, I raised the priority to Major. It 
would be great if 2.2.1 included this fix.
What do you think about this issue, [~smilegator] and [~cloud_fan]?

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tejas Patil
>
> When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its 
> children, the children may not be properly constructed yet, as the child 
> subtree still has to go through other planner rules.
> In this particular case, the child is `SortMergeJoinExec`. Since the required 
> `Exchange` operators are not in place (because `EnsureRequirements` runs 
> _after_ `ReorderJoinPredicates`), the join's children do not have their 
> partitioning defined. This breaks while creating the `PartitioningCollection` 
> here: 
> https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69
> Small repro:
> {noformat}
> context.sql("SET spark.sql.autoBroadcastJoinThreshold=0")
> val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> df.write.format("parquet").saveAsTable("table1")
> df.write.format("parquet").saveAsTable("table2")
> df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
> sql("""
>   SELECT *
>   FROM (
> SELECT a.i, a.j, a.k
> FROM bucketed_table a
> JOIN table1 b
> ON a.i = b.i
>   ) c
>   JOIN table2
>   ON c.i = table2.i
> """).explain
> {noformat}
> This fails with :
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: 
> PartitioningCollection requires all of its partitionings have the same 
> numPartitions.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69)
>   at 
> org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:464)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:477)
>   ... 60 elided
> {noformat}

[jira] [Updated] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-11-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22042:
--
Priority: Major  (was: Minor)

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tejas Patil
>
> When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its 
> children, the children may not be properly constructed yet, as the child 
> subtree still has to go through other planner rules.
> In this particular case, the child is `SortMergeJoinExec`. Since the required 
> `Exchange` operators are not in place (because `EnsureRequirements` runs 
> _after_ `ReorderJoinPredicates`), the join's children do not have their 
> partitioning defined. This breaks while creating the `PartitioningCollection` 
> here: 
> https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69
> Small repro:
> {noformat}
> context.sql("SET spark.sql.autoBroadcastJoinThreshold=0")
> val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> df.write.format("parquet").saveAsTable("table1")
> df.write.format("parquet").saveAsTable("table2")
> df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
> sql("""
>   SELECT *
>   FROM (
> SELECT a.i, a.j, a.k
> FROM bucketed_table a
> JOIN table1 b
> ON a.i = b.i
>   ) c
>   JOIN table2
>   ON c.i = table2.i
> """).explain
> {noformat}
> This fails with :
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: 
> PartitioningCollection requires all of its partitionings have the same 
> numPartitions.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69)
>   at 
> org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:464)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:477)
>   ... 60 elided
> {noformat}
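For context, a simplified, self-contained sketch of the invariant the stack trace shows failing; the class and trait below are hypothetical stand-ins for the real ones named in the trace, not the actual Spark source. Until `EnsureRequirements` inserts the missing `Exchange` operators, the two children report different `numPartitions`, so the `require` trips.

{code}
// Simplified sketch (made-up stand-ins, not the real org.apache.spark.sql classes).
trait Partitioning { def numPartitions: Int }

case class PartitioningCollection(partitionings: Seq[Partitioning]) extends Partitioning {
  // This is the check that the IllegalArgumentException in the report comes from.
  require(
    partitionings.map(_.numPartitions).distinct.length == 1,
    "PartitioningCollection requires all of its partitionings have the same numPartitions.")
  override def numPartitions: Int = partitionings.head.numPartitions
}
{code}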



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Assigned] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22397:


Assignee: Apache Spark

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>
> Once SPARK-20542 adds multi-column support to {{Bucketizer}}, we can add 
> multi-column support to {{QuantileDiscretizer}} too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22397:


Assignee: (was: Apache Spark)

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>
> Once SPARK-20542 adds multi-column support to {{Bucketizer}}, we can add 
> multi-column support to {{QuantileDiscretizer}} too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247106#comment-16247106
 ] 

Apache Spark commented on SPARK-22397:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/19715

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>
> Once SPARK-20542 adds multi-column support to {{Bucketizer}}, we can add 
> multi-column support to {{QuantileDiscretizer}} too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13612) Multiplication of BigDecimal columns not working as expected

2017-11-09 Thread Varadharajan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varadharajan resolved SPARK-13612.
--
Resolution: Workaround

> Multiplication of BigDecimal columns not working as expected
> 
>
> Key: SPARK-13612
> URL: https://issues.apache.org/jira/browse/SPARK-13612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Varadharajan
>
> Please consider the below snippet:
> {code}
> case class AM(id: Int, a: BigDecimal)
> case class AX(id: Int, b: BigDecimal)
> val x = sc.parallelize(List(AM(1, 10))).toDF
> val y = sc.parallelize(List(AX(1, 10))).toDF
> x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show
> {code}
> output:
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|null|
> {code}
> Here the multiplication of the columns ("z") returns null instead of 100.
> For now we are using the workaround below, but this definitely looks like a 
> serious issue.
> {code}
> x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / 
> y("b"))).show
> {code}
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|100.0...|
> {code}
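One plausible explanation (an assumption, not stated in this ticket) is that the two `BigDecimal` fields map to decimal(38,18), so the precision of the product overflows the 38-digit maximum and the value degrades to null. Under that assumption, a hedged alternative workaround is to cast the operands to a narrower decimal type before multiplying:

{code}
// Hedged sketch: cast the operands to a narrower precision first so that the
// product's type still fits in a decimal; decimal(10,2) is an illustrative choice.
val z = x.join(y, x("id") === y("id"))
  .withColumn("z", x("a").cast("decimal(10,2)") * y("b").cast("decimal(10,2)"))
z.show()
{code}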



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22489:


Assignee: (was: Apache Spark)

> Shouldn't change broadcast join buildSide if user clearly specified
> ---
>
> Key: SPARK-22489
> URL: https://issues.apache.org/jira/browse/SPARK-22489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> How to reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value").createTempView("table2")
> val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
> t1.key = t2.key").queryExecution.executedPlan
> println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
> {code}
> The result is {{BuildRight}}, but should be {{BuildLeft}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22489:


Assignee: Apache Spark

> Shouldn't change broadcast join buildSide if user clearly specified
> ---
>
> Key: SPARK-22489
> URL: https://issues.apache.org/jira/browse/SPARK-22489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>
> How to reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value").createTempView("table2")
> val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
> t1.key = t2.key").queryExecution.executedPlan
> println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
> {code}
> The result is {{BuildRight}}, but should be {{BuildLeft}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247059#comment-16247059
 ] 

Apache Spark commented on SPARK-22489:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/19714

> Shouldn't change broadcast join buildSide if user clearly specified
> ---
>
> Key: SPARK-22489
> URL: https://issues.apache.org/jira/browse/SPARK-22489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> How to reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value").createTempView("table2")
> val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
> t1.key = t2.key").queryExecution.executedPlan
> println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
> {code}
> The result is {{BuildRight}}, but should be {{BuildLeft}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22472.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>Assignee: Wenchen Fan
>  Labels: release-notes
> Fix For: 2.2.1, 2.3.0
>
>
> Not sure if this was ever reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}
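A typed `Dataset[Long]` cannot represent SQL NULL with a primitive, which is presumably why an arbitrary value shows up. A hedged workaround sketch (not from this ticket), reusing the DataFrame `s` from the repro above:

{code}
// Hedged sketch: keep the values boxed as java.lang.Long so that null rows
// stay null instead of being read back as a primitive.
import org.apache.spark.sql.Encoders

val doubled = s.as(Encoders.LONG)
  .map(v => if (v == null) null else java.lang.Long.valueOf(v * 2))(Encoders.LONG)
doubled.show(false)
{code}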



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified

2017-11-09 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-22489:

Description: 
How to reproduce:

{code:java}
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
"value").createTempView("table1")
spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
"value").createTempView("table2")

val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
t1.key = t2.key").queryExecution.executedPlan

println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
{code}

The result is {{BuildRight}}, but should be {{BuildLeft}}.

  was:
How to reproduce:

{code:java}
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
"value").createTempView("table1")
spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
"value").createTempView("table2")

val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
t1.key = t2.key").queryExecution.executedPlan

println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
{code}

The result is {{BuildRight}}, but should be {{BuildLeft}}


> Shouldn't change broadcast join buildSide if user clearly specified
> ---
>
> Key: SPARK-22489
> URL: https://issues.apache.org/jira/browse/SPARK-22489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> How to reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value").createTempView("table2")
> val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
> t1.key = t2.key").queryExecution.executedPlan
> println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
> {code}
> The result is {{BuildRight}}, but should be {{BuildLeft}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22489) Shouldn't change broadcast join buildSide if user clearly specified

2017-11-09 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-22489:
---

 Summary: Shouldn't change broadcast join buildSide if user clearly 
specified
 Key: SPARK-22489
 URL: https://issues.apache.org/jira/browse/SPARK-22489
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Yuming Wang


How to reproduce:

{code:java}
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
"value").createTempView("table1")
spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
"value").createTempView("table2")

val bl = sql(s"SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON 
t1.key = t2.key").queryExecution.executedPlan

println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
{code}

The result is {{BuildRight}}, but should be {{BuildLeft}}
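For comparison, a hedged sketch (not from the ticket, and not the proposed fix) using the DataFrame-side {{broadcast}} hint against the same temp views; it only illustrates the expected correspondence between the hinted side and {{buildSide}}:

{code}
// Explicitly broadcasting table1 (the left side) is expected to surface as
// BuildLeft, which is the behaviour the ticket argues MAPJOIN(t1) should have too.
import org.apache.spark.sql.functions.{broadcast, col}
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

val plan = broadcast(spark.table("table1").as("t1"))
  .join(spark.table("table2").as("t2"), col("t1.key") === col("t2.key"))
  .queryExecution.executedPlan
println(plan.collect { case j: BroadcastHashJoinExec => j.buildSide }.headOption)
{code}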



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22472:

Labels: release-notes  (was: )

> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>Assignee: Wenchen Fan
>  Labels: release-notes
>
> Not sure if this was ever reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-22472:
---

Assignee: Wenchen Fan
Target Version/s: 2.2.1

> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>Assignee: Wenchen Fan
>
> Not sure if this was ever reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22472:

Target Version/s: 2.2.1, 2.3.0  (was: 2.2.1)

> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>Assignee: Wenchen Fan
>
> Not sure if this was ever reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22488) The view resolution in the SparkSession internal table() API

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22488:


Assignee: Xiao Li  (was: Apache Spark)

> The view resolution in the SparkSession internal table() API 
> -
>
> Key: SPARK-22488
> URL: https://issues.apache.org/jira/browse/SPARK-22488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> The current internal `table()` API of `SparkSession` bypasses the Analyzer 
> and directly calls the `sessionState.catalog.lookupRelation` API. This skips the 
> view resolution logic in our Analyzer rule `ResolveRelations`. This internal 
> API is widely used by various DDL commands and other internal APIs.
> Users might get a strange error caused by view resolution when the default 
> database is different.
> ```
> Table or view not found: t1; line 1 pos 14
> org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 
> pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22488) The view resolution in the SparkSession internal table() API

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246993#comment-16246993
 ] 

Apache Spark commented on SPARK-22488:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19713

> The view resolution in the SparkSession internal table() API 
> -
>
> Key: SPARK-22488
> URL: https://issues.apache.org/jira/browse/SPARK-22488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> The current internal `table()` API of `SparkSession` bypasses the Analyzer 
> and directly calls the `sessionState.catalog.lookupRelation` API. This skips the 
> view resolution logic in our Analyzer rule `ResolveRelations`. This internal 
> API is widely used by various DDL commands and other internal APIs.
> Users might get a strange error caused by view resolution when the default 
> database is different.
> ```
> Table or view not found: t1; line 1 pos 14
> org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 
> pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22488) The view resolution in the SparkSession internal table() API

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22488:


Assignee: Apache Spark  (was: Xiao Li)

> The view resolution in the SparkSession internal table() API 
> -
>
> Key: SPARK-22488
> URL: https://issues.apache.org/jira/browse/SPARK-22488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> The current internal `table()` API of `SparkSession` bypasses the Analyzer 
> and directly calls the `sessionState.catalog.lookupRelation` API. This skips the 
> view resolution logic in our Analyzer rule `ResolveRelations`. This internal 
> API is widely used by various DDL commands and other internal APIs.
> Users might get a strange error caused by view resolution when the default 
> database is different.
> ```
> Table or view not found: t1; line 1 pos 14
> org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 
> pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22488) The view resolution in the SparkSession internal table() API

2017-11-09 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22488:
---

 Summary: The view resolution in the SparkSession internal table() 
API 
 Key: SPARK-22488
 URL: https://issues.apache.org/jira/browse/SPARK-22488
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.2
Reporter: Xiao Li
Assignee: Xiao Li


The current internal `table()` API of `SparkSession` bypasses the Analyzer and 
directly calls the `sessionState.catalog.lookupRelation` API. This skips the view 
resolution logic in our Analyzer rule `ResolveRelations`. This internal API is 
widely used by various DDL commands and other internal APIs.

Users might get a strange error caused by view resolution when the default 
database is different.
```
Table or view not found: t1; line 1 pos 14
org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 
14
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```
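A hypothetical repro sketch (the database, table, and view names are made up, not from this ticket): define a view in a non-default database whose text references an unqualified table, switch the current database, and then hit a code path that resolves the view through the internal `table()` API.

{code}
// Hypothetical sketch; db1, t1 and v1 are made-up names.
sql("CREATE DATABASE IF NOT EXISTS db1")
sql("USE db1")
sql("CREATE TABLE t1(id INT) USING parquet")
sql("CREATE VIEW v1 AS SELECT * FROM t1")  // the view text stores the unqualified name 't1'
sql("USE default")
// A command that internally calls SparkSession.table("db1.v1") may now try to
// resolve 't1' against the current database 'default' (because the Analyzer's
// ResolveRelations view handling is skipped) and fail with
// "Table or view not found: t1".
{code}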




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22477) windowing in spark

2017-11-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22477.
--
Resolution: Invalid

Let's ask questions via https://spark.apache.org/community.html, not in a JIRA.

> windowing in spark
> --
>
> Key: SPARK-22477
> URL: https://issues.apache.org/jira/browse/SPARK-22477
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Deepak Nayak
>
> I have performed a map operation over a set of attributes and have stored it 
> in a variable. Next I want to perform windowing over a single attribute and 
> not on the variable which contains all the attributes.
> I have tried doing that, but I am not getting all the attribute values after 
> windowing; it gives windowed output only for a particular attribute.
> Could anyone kindly help, if possible?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22447) SAS reading functionality

2017-11-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22447.
--
Resolution: Won't Fix

I don't think we would port the third party unless there is a strong reason. To me, 
leaving it as a third-party package makes more sense.

Let me leave it resolved until a committer sees value in porting it. 

> SAS reading functionality
> -
>
> Key: SPARK-22447
> URL: https://issues.apache.org/jira/browse/SPARK-22447
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Abdeali Kothari
>Priority: Trivial
>
> It would be useful to have SAS file reading functionality in Spark.
> The file specification is well documented at 
> http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/CUJ/1992/9210/ross/ross.htm
> There is a third-party tool for this: https://github.com/saurfang/spark-sas7bdat
> However, it is not maintained and has issues with large files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize

2017-11-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22426.
--
Resolution: Duplicate

> Spark AM launching containers on node where External spark shuffle service 
> failed to initialize
> ---
>
> Key: SPARK-22426
> URL: https://issues.apache.org/jira/browse/SPARK-22426
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>
> When the Spark External Shuffle Service on a NodeManager fails, remote 
> executors will fail while fetching data from the executors launched on 
> this node. The Spark ApplicationMaster should either not launch containers on 
> this node or not use the external shuffle service there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-11-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22308.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Assignee: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
> Fix For: 2.3.0
>
>
> External codebases that have Spark code can test it using SharedSparkContext, 
> no matter how they write their ScalaTest suites - based on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext only supports FunSuite.
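For illustration, a minimal sketch of the kind of external test the description has in mind, assuming the Spark test artifacts that provide `org.apache.spark.SharedSparkContext` are on the test classpath:

{code}
import org.scalatest.FlatSpec
import org.apache.spark.SharedSparkContext

// A FlatSpec-based suite reusing the shared SparkContext provided by the trait.
class WordCountSpec extends FlatSpec with SharedSparkContext {
  "an RDD" should "count its elements" in {
    assert(sc.parallelize(Seq("a", "b", "c")).count() === 3)
  }
}
{code}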



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22487) No usages of HIVE_EXECUTION_VERSION found in whole spark project

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22487:


Assignee: (was: Apache Spark)

> No usages of HIVE_EXECUTION_VERSION found in whole spark project
> 
>
> Key: SPARK-22487
> URL: https://issues.apache.org/jira/browse/SPARK-22487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> Actually, there is no Hive client for execution in Spark now, and there are no 
> usages of HIVE_EXECUTION_VERSION found in the whole Spark project.
> HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set 
> internally in some places or by users; this may lead developers and users to 
> confuse it with HIVE_METASTORE_VERSION (spark.sql.hive.metastore.version).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22487) No usages of HIVE_EXECUTION_VERSION found in whole spark project

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22487:


Assignee: Apache Spark

> No usages of HIVE_EXECUTION_VERSION found in whole spark project
> 
>
> Key: SPARK-22487
> URL: https://issues.apache.org/jira/browse/SPARK-22487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> Actually, there is no Hive client for execution in Spark now, and there are no 
> usages of HIVE_EXECUTION_VERSION found in the whole Spark project.
> HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set 
> internally in some places or by users; this may lead developers and users to 
> confuse it with HIVE_METASTORE_VERSION (spark.sql.hive.metastore.version).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22487) No usages of HIVE_EXECUTION_VERSION found in whole spark project

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246959#comment-16246959
 ] 

Apache Spark commented on SPARK-22487:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/19712

> No usages of HIVE_EXECUTION_VERSION found in whole spark project
> 
>
> Key: SPARK-22487
> URL: https://issues.apache.org/jira/browse/SPARK-22487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> Actually, there is no Hive client for execution in Spark now, and there are no 
> usages of HIVE_EXECUTION_VERSION found in the whole Spark project.
> HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set 
> internally in some places or by users; this may lead developers and users to 
> confuse it with HIVE_METASTORE_VERSION (spark.sql.hive.metastore.version).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22487) No usages of HIVE_EXECUTION_VERSION found in whole spark project

2017-11-09 Thread Kent Yao (JIRA)
Kent Yao created SPARK-22487:


 Summary: No usages of HIVE_EXECUTION_VERSION found in whole spark 
project
 Key: SPARK-22487
 URL: https://issues.apache.org/jira/browse/SPARK-22487
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kent Yao
Priority: Minor


Actually, there is no Hive client for execution in Spark now, and there are no 
usages of HIVE_EXECUTION_VERSION found in the whole Spark project.
HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set 
internally in some places or by users; this may lead developers and users to 
confuse it with HIVE_METASTORE_VERSION (spark.sql.hive.metastore.version).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher

2017-11-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22287.

   Resolution: Fixed
 Assignee: paul mackles
Fix Version/s: 2.3.0
   2.2.1

> SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
> -
>
> Key: SPARK-22287
> URL: https://issues.apache.org/jira/browse/SPARK-22287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: paul mackles
>Assignee: paul mackles
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> There does not appear to be a way to control the heap size used by 
> MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not 
> honored for that particular daemon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa

2017-11-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22485.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19710
[https://github.com/apache/spark/pull/19710]

> Use `exclude[Problem]` instead `excludePackage` in MiMa
> ---
>
> Key: SPARK-22485
> URL: https://issues.apache.org/jira/browse/SPARK-22485
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.0
>
>
> `excludePackage` is deprecated, as shown in the 
> [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33].
> {code}
>   @deprecated("Replace with 
> ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15")
>   def excludePackage(packageName: String): ProblemFilter = {
> exclude[Problem](packageName + ".*")
>   }
> {code}
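As the deprecation message suggests, a package-wide exclusion can instead be written with `exclude[Problem]` plus a wildcard; a short sketch with a made-up package name:

{code}
// Sketch only; "org.apache.spark.some.pkg" is a made-up package name.
import com.typesafe.tools.mima.core._

// Deprecated form:
//   ProblemFilters.excludePackage("org.apache.spark.some.pkg")
// Replacement suggested by the deprecation message:
ProblemFilters.exclude[Problem]("org.apache.spark.some.pkg.*")
{code}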



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-11-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-22403:


Assignee: Wing Yew Poon

> StructuredKafkaWordCount example fails in YARN cluster mode
> ---
>
> Key: SPARK-22403
> URL: https://issues.apache.org/jira/browse/SPARK-22403
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
> Fix For: 2.2.1, 2.3.0
>
>
> When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
> fine. However, when I run it in YARN cluster mode, the application errors 
> during initialization, and dies after the default number of YARN application 
> attempts. In the AM log, I see
> {noformat}
> 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value 
> AS STRING)
> 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream 
> metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
> /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
> ...
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
>   at 
> org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala)
> {noformat}
> Looking at StreamingQueryManager#createQuery, we have
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198
> {code}
> val checkpointLocation = userSpecifiedCheckpointLocation.map { ...
>   ...
> }.orElse {
>   ...
>

[jira] [Resolved] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-11-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22403.
--
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> StructuredKafkaWordCount example fails in YARN cluster mode
> ---
>
> Key: SPARK-22403
> URL: https://issues.apache.org/jira/browse/SPARK-22403
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wing Yew Poon
> Fix For: 2.2.1, 2.3.0
>
>
> When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
> fine. However, when I run it in YARN cluster mode, the application errors 
> during initialization, and dies after the default number of YARN application 
> attempts. In the AM log, I see
> {noformat}
> 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value 
> AS STRING)
> 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream 
> metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
> /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
> ...
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
>   at 
> org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala)
> {noformat}
> Looking at StreamingQueryManager#createQuery, we have
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198
> {code}
> val checkpointLocation = userSpecifiedCheckpointLocation.map { ...
>   ...
> }.orElse {
>   ...
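For readers hitting the same error, a hedged workaround sketch (not the actual fix that was
merged for this issue): pass an explicit, user-writable checkpoint directory through the
standard `checkpointLocation` option of DataStreamWriter, so the stream metadata is not
written under the resolved default path. The HDFS path, broker address, and topic name below
are placeholders.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredKafkaWordCount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
  .option("subscribe", "wordcount")                   // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING)").as[String]

val counts = lines.flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  // Explicit, user-writable checkpoint dir instead of the resolved default path
  .option("checkpointLocation", "hdfs:///user/systest/checkpoints/structured-kafka-wordcount")
  .start()
query.awaitTermination()
{code}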

[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-11-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246811#comment-16246811
 ] 

Marcelo Vanzin commented on SPARK-18838:


I actually did a backport to our internal branches, here's the one for 2.1 
(which IIRC is very similar to the 2.2 one):
https://github.com/cloudera/spark/commit/a6b96c45afa227b48723ed872b692ceed9f9171d

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
> Attachments: SparkListernerComputeTime.xlsx, perfResults.pdf
>
>
> Currently we are observing very high event processing delays in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
> `ListenerBus` events, and this delay might hurt job performance significantly 
> or even fail the job. For example, a significant delay in receiving 
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to mistakenly 
> remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94).
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of events 
> per listener. The queue used for the executor service will be bounded so that 
> memory does not grow indefinitely. The downside of this approach is that a 
> separate event queue per listener will increase the driver memory footprint.
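For illustration only, a minimal sketch of the per-listener dispatch idea described above.
This is not the code that was merged for SPARK-18838; the class name, queue capacity, and
listener representation are made up for the example.

{code}
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Each listener gets its own bounded, single-threaded executor, so a slow
// listener delays only its own queue while per-listener ordering is preserved.
class PerListenerDispatcher[E](listeners: Seq[E => Unit], queueCapacity: Int = 10000) {
  private val executors = listeners.map { _ =>
    new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
      new LinkedBlockingQueue[Runnable](queueCapacity))
  }

  // Submit the event to every listener's executor; each executor processes
  // its own events in order on a single thread.
  def post(event: E): Unit = {
    listeners.zip(executors).foreach { case (listener, executor) =>
      executor.execute(new Runnable { def run(): Unit = listener(event) })
    }
  }

  def stop(): Unit = executors.foreach(_.shutdown())
}
{code}

Whether a saturated per-listener queue should block, drop, or fail the post is a policy
choice; the sketch simply keeps the default rejection policy for brevity.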



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system

2017-11-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22483.

   Resolution: Fixed
 Assignee: Srinivasa Reddy Vundela
Fix Version/s: 2.3.0

> Exposing java.nio bufferedPool memory metrics to metrics system
> ---
>
> Key: SPARK-22483
> URL: https://issues.apache.org/jira/browse/SPARK-22483
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Srinivasa Reddy Vundela
>Assignee: Srinivasa Reddy Vundela
> Fix For: 2.3.0
>
>
> Spark currently exposes the JVM's on-heap and off-heap memory to the metrics 
> system. Currently there is no way to know how much direct/mapped memory is 
> allocated for java.nio buffered pools.
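For context, the JDK already exposes these numbers through BufferPoolMXBean; below is a
small sketch of reading them. The actual change wires these values into Spark's metrics
registry, which is not shown here.

{code}
import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.collection.JavaConverters._

// The platform exposes one BufferPoolMXBean per pool, typically "direct" and "mapped".
val pools = ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
pools.foreach { pool =>
  println(s"${pool.getName}: count=${pool.getCount}, " +
    s"used=${pool.getMemoryUsed} bytes, capacity=${pool.getTotalCapacity} bytes")
}
{code}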



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-11-09 Thread Anthony Truchet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246804#comment-16246804
 ] 

Anthony Truchet commented on SPARK-18838:
-

Agreed, but at the same time this is "almost" a bug when you consider using 
Spark at really large scale...
I tried a naive cherry-pick and obviously there are a lot of non-trivial 
conflicts.
Do you know how to get an overview of the change (and of what has happened in 
the meantime since 2.2) to guide this conflict resolution?
Or maybe you already have some insight about the best way to go about it?

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
> Attachments: SparkListernerComputeTime.xlsx, perfResults.pdf
>
>
> Currently we are observing very high event processing delays in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
> `ListenerBus` events, and this delay might hurt job performance significantly 
> or even fail the job. For example, a significant delay in receiving 
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to mistakenly 
> remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94).
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of events 
> per listener. The queue used for the executor service will be bounded so that 
> memory does not grow indefinitely. The downside of this approach is that a 
> separate event queue per listener will increase the driver memory footprint.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-11-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246783#comment-16246783
 ] 

Marcelo Vanzin commented on SPARK-18838:


It's kind of an intrusive change to backport to a maintenance release, but 
since it may help a lot of people it might be ok.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
> Attachments: SparkListernerComputeTime.xlsx, perfResults.pdf
>
>
> Currently we are observing very high event processing delays in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
> `ListenerBus` events, and this delay might hurt job performance significantly 
> or even fail the job. For example, a significant delay in receiving 
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to mistakenly 
> remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94).
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of events 
> per listener. The queue used for the executor service will be bounded so that 
> memory does not grow indefinitely. The downside of this approach is that a 
> separate event queue per listener will increase the driver memory footprint.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20647) Make the Storage page use new app state store

2017-11-09 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20647.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19679
[https://github.com/apache/spark/pull/19679]

> Make the Storage page use new app state store
> -
>
> Key: SPARK-20647
> URL: https://issues.apache.org/jira/browse/SPARK-20647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making the Storage page use the new app state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20647) Make the Storage page use new app state store

2017-11-09 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20647:


Assignee: Marcelo Vanzin

> Make the Storage page use new app state store
> -
>
> Key: SPARK-20647
> URL: https://issues.apache.org/jira/browse/SPARK-20647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making the Storage page use the new app state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246624#comment-16246624
 ] 

Apache Spark commented on SPARK-22471:
--

User 'tashoyan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19711

> SQLListener consumes much memory causing OutOfMemoryError
> -
>
> Key: SPARK-22471
> URL: https://issues.apache.org/jira/browse/SPARK-22471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Linux
>Reporter: Arseniy Tashoyan
>  Labels: memory-leak, sql
> Attachments: SQLListener_retained_size.png, 
> SQLListener_stageIdToStageMetrics_retained_size.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> _SQLListener_ may grow very large when Spark runs complex multi-stage 
> requests. The listener tracks metrics for all stages in 
> __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup 
> this hash map regularly, but this is not enough. Precisely, the method 
> _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not 
> have metrics for very old data; this method runs on each execution completion.
> However, if an execution has many stages, _SQLListener_ keeps adding new 
> entries to __stageIdToStageMetrics_ without calling 
> _trimExecutionsIfNecessary_. The hash map may grow to an enormous size.
> Strictly speaking, it is not a memory leak, because _trimExecutionsIfNecessary_ 
> eventually cleans the hash map. However, the driver program has high odds of 
> crashing with OutOfMemoryError (and it does).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG

2017-11-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246595#comment-16246595
 ] 

Thomas Graves commented on SPARK-22465:
---

It's not strictly the 2G limit. He did hit that, but he hit it because of the 
default behavior of cogroup. I think this JIRA was filed to look at that and 
make the behavior better, so I think the last couple of sentences in the 
description refer to that.



> Cogroup of two disproportionate RDDs could lead into 2G limit BUG
> -
>
> Key: SPARK-22465
> URL: https://issues.apache.org/jira/browse/SPARK-22465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 
> 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Amit Kumar
>Priority: Critical
>
> While running my spark pipeline, it failed with the following exception
> {noformat}
> 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 630.0 in stage 28.0 
> (TID 58670)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After debugging I found that the issue lies with how spark handles cogroup of 
> two RDDs.
> Here is the relevant code from apache spark
> {noformat}
>  /**
>* For each key k in `this` or `other`, return a resulting RDD that 
> contains a tuple with the
>* list of values for that key in `this` as well as `other`.
>*/
>   def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = 
> self.withScope {
> cogroup(other, defaultPartitioner(self, other))
>   }
> /**
>* Choose a partitioner to use for a cogroup-like operation between a 
> number of RDDs.
>*
>* If any of the RDDs already has a partitioner, choose that one.
>*
>* Otherwise, we use a default HashPartitioner. For the number of 
> partitions, if
>* spark.default.parallelism is set, then we'll use the value from 
> SparkContext
>* defaultParallelism, otherwise we'll use the max number of upstream 
> partitions.
>*
>* Unless spark.default.parallelism is set, the number of partitions will 
> be the
>* same as the number of partitions in the largest upstream RDD, as this 
> should
>* be least likely to cause out-of-memory errors.
>*
>* We use two method parameters (rdd, others) to enforce callers passing at 
> least 1 RDD.
>*/
>   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
> val rdds = (Seq(rdd) ++ others)
> val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 
> 0))
> if (hasPartitioner.nonEmpty) {
>   hasPartitioner.maxBy(_.partitions.length).partitioner.get
> } else {
>   if (rdd.context.conf.contains("spark.default.parallelism")) {
> new HashPartitioner(rdd.context.defaultParallelism)
>   } else {
> new HashPartitioner(rdds.map(_.partitions.length).max)
>   }
> }
>   }
> {noformat}
> Given this, suppose we have two pair RDDs:
> RDD1: a small RDD with little data and few partitions
> RDD2: a huge RDD with lots of data and partitions
> Now, if in the code we were to have a cogroup
> {noformat}
> val RDD3 = RDD1.cogroup(RDD2)
> {noformat}
> there is a case where this could lead to the SPARK-6235 bug: if RDD1 has a 
> partitioner when it is passed into the cogroup, the cogroup's partitions are 
> then decided by that partitioner, which could lead to th
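As an aside, a possible user-side mitigation sketch (not a change to Spark itself): pass an
explicit partitioner sized for the large RDD so the small RDD's existing partitioner does not
decide the number of output partitions. `rdd1` and `rdd2` below stand in for RDD1 and RDD2
from the description.

{code}
import org.apache.spark.HashPartitioner

// Size the cogroup by the large RDD rather than letting the small RDD's
// existing partitioner win.
val rdd3 = rdd1.cogroup(rdd2, new HashPartitioner(rdd2.getNumPartitions))
{code}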

[jira] [Resolved] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-11-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21994.

Resolution: Not A Problem

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-11-09 Thread Srinivasa Reddy Vundela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246463#comment-16246463
 ] 

Srinivasa Reddy Vundela commented on SPARK-21994:
-

This issue is related to Cloudera's Spark distribution and was fixed recently. 
We can close this JIRA.

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports

2017-11-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16605:
--
Affects Version/s: 2.1.2
   2.2.0

> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> ---
>
> Key: SPARK-16605
> URL: https://issues.apache.org/jira/browse/SPARK-16605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.2, 2.2.0
>Reporter: marymwu
> Fix For: 2.2.1, 2.3.0
>
> Attachments: screenshot-1.png
>
>
> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> Steps:
> 1. Use Hive to create a table "tbtxt" stored as text and load data into it.
> 2. Use Hive to create a table "tborc" stored as ORC and insert the data from 
> table "tbtxt". Example: "create table tborc stored as orc as select * from 
> tbtxt"
> 3. Use Spark 2.0 to run "select * from tborc;" --> an error occurs: 
> java.lang.IllegalArgumentException: Field "nid" does not exist.
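For convenience, the steps above as a single Spark SQL sketch, assuming a SparkSession with
Hive support enabled. The original report creates the tables in Hive and only runs the final
SELECT from Spark; the column list and the data file path below are hypothetical.

{code}
// Hypothetical schema and data path; only the "nid" column appears in the reported error.
spark.sql("CREATE TABLE tbtxt (nid INT, name STRING) STORED AS TEXTFILE")
spark.sql("LOAD DATA LOCAL INPATH '/tmp/tbtxt.txt' INTO TABLE tbtxt")
spark.sql("CREATE TABLE tborc STORED AS ORC AS SELECT * FROM tbtxt")
spark.sql("SELECT * FROM tborc").show()  // fails here with: Field "nid" does not exist
{code}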



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports

2017-11-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16605:
--
Fix Version/s: 2.3.0

> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> ---
>
> Key: SPARK-16605
> URL: https://issues.apache.org/jira/browse/SPARK-16605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: marymwu
> Fix For: 2.2.1, 2.3.0
>
> Attachments: screenshot-1.png
>
>
> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> Steps:
> 1. Use hive to create a table "tbtxt" stored as txt and load data into it.
> 2. Use hive to create a table "tborc" stored as orc and insert the data from 
> table "tbtxt" . Example, "create table tborc stored as orc as select * from 
> tbtxt"
> 3. Use spark2.0 to "select * from tborc;".-->error 
> occurs,java.lang.IllegalArgumentException: Field "nid" does not exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports

2017-11-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246434#comment-16246434
 ] 

Dongjoon Hyun commented on SPARK-16605:
---

This duplicates SPARK-14387.
SPARK-14387 is resolved in Spark 2.2.1.

> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> ---
>
> Key: SPARK-16605
> URL: https://issues.apache.org/jira/browse/SPARK-16605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: marymwu
> Fix For: 2.2.1, 2.3.0
>
> Attachments: screenshot-1.png
>
>
> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> Steps:
> 1. Use hive to create a table "tbtxt" stored as txt and load data into it.
> 2. Use hive to create a table "tborc" stored as orc and insert the data from 
> table "tbtxt" . Example, "create table tborc stored as orc as select * from 
> tbtxt"
> 3. Use spark2.0 to "select * from tborc;".-->error 
> occurs,java.lang.IllegalArgumentException: Field "nid" does not exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports

2017-11-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16605:
--
Fix Version/s: 2.2.1

> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> ---
>
> Key: SPARK-16605
> URL: https://issues.apache.org/jira/browse/SPARK-16605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: marymwu
> Fix For: 2.2.1, 2.3.0
>
> Attachments: screenshot-1.png
>
>
> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> Steps:
> 1. Use hive to create a table "tbtxt" stored as txt and load data into it.
> 2. Use hive to create a table "tborc" stored as orc and insert the data from 
> table "tbtxt" . Example, "create table tborc stored as orc as select * from 
> tbtxt"
> 3. Use spark2.0 to "select * from tborc;".-->error 
> occurs,java.lang.IllegalArgumentException: Field "nid" does not exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials

2017-11-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246328#comment-16246328
 ] 

Andrew Ash commented on SPARK-22479:


Completely agree that credentials shouldn't be in the toString, since query 
plans are logged in many places. This looks like it brings 
SaveIntoDataSourceCommand more in line with JdbcRelation, which also currently 
redacts credentials from its toString to avoid them being written to logs.
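A minimal sketch of the kind of redaction being discussed, purely for illustration; the
actual patch may instead hook into Spark's existing redaction configuration, and the function
name below is made up.

{code}
// Mask sensitive JDBC options before they end up in a plan's toString.
def redactOptions(options: Map[String, String]): Map[String, String] =
  options.map {
    case (k, _) if k.equalsIgnoreCase("password") => k -> "*********(redacted)"
    case other => other
  }

// e.g. redactOptions(Map("url" -> "jdbc:postgresql:sparkdb", "password" -> "pass"))
// ==> Map(url -> jdbc:postgresql:sparkdb, password -> *********(redacted))
{code}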

> SaveIntoDataSourceCommand logs jdbc credentials
> ---
>
> Key: SPARK-22479
> URL: https://issues.apache.org/jira/browse/SPARK-22479
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Onur Satici
>
> JDBC credentials are not redacted in plans including a 
> 'SaveIntoDataSourceCommand'.
> Steps to reproduce:
> {code}
> spark-shell --packages org.postgresql:postgresql:42.1.1
> {code}
> {code}
> import org.apache.spark.sql.execution.QueryExecution
> import org.apache.spark.sql.util.QueryExecutionListener
> val listener = new QueryExecutionListener {
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, duration: 
> Long): Unit = {
> System.out.println(qe.toString())
>   }
> }
> spark.listenerManager.register(listener)
> spark.range(100).write.format("jdbc").option("url", 
> "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").save()
> {code}
> The above will yield the following plan:
> {code}
> == Parsed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Analyzed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Physical Plan ==
> ExecutedCommand
>+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>  +- Range (0, 100, step=1, splits=Some(8))
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22486) Support synchronous offset commits for Kafka

2017-11-09 Thread Jeremy Beard (JIRA)
Jeremy Beard created SPARK-22486:


 Summary: Support synchronous offset commits for Kafka
 Key: SPARK-22486
 URL: https://issues.apache.org/jira/browse/SPARK-22486
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Jeremy Beard


CanCommitOffsets provides asynchronous offset commits (via 
Consumer#commitAsync), and it would be useful if it also provided synchronous 
offset commits (via Consumer#commitSync) for when the desired behavior is to 
block until it is complete.
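For context, a sketch of how the existing asynchronous API is used today, assuming `stream`
is a direct stream created with KafkaUtils.createDirectStream; the request is for an
equivalent call backed by Consumer#commitSync.

{code}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // Today: fire-and-forget commit. A commitSync counterpart would block here
  // until the commit completes.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}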



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22485:


Assignee: Apache Spark

> Use `exclude[Problem]` instead `excludePackage` in MiMa
> ---
>
> Key: SPARK-22485
> URL: https://issues.apache.org/jira/browse/SPARK-22485
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> `excludePackage` is deprecated, as shown in the 
> [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33].
> {code}
>   @deprecated("Replace with 
> ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15")
>   def excludePackage(packageName: String): ProblemFilter = {
> exclude[Problem](packageName + ".*")
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage`

2017-11-09 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-22485:
-

 Summary: Use `exclude[Problem]` instead `excludePackage`
 Key: SPARK-22485
 URL: https://issues.apache.org/jira/browse/SPARK-22485
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun
Priority: Minor


`excludePackage` is deprecated, as shown in the 
[following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33].
{code}
  @deprecated("Replace with ProblemFilters.exclude[Problem](\"my.package.*\")", 
"0.1.15")
  def excludePackage(packageName: String): ProblemFilter = {
exclude[Problem](packageName + ".*")
  }
{code}
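In Spark's MimaExcludes.scala the migration then looks roughly like this; the package name is
illustrative only.

{code}
// before (deprecated)
ProblemFilters.excludePackage("org.apache.spark.unused")
// after
ProblemFilters.exclude[Problem]("org.apache.spark.unused.*")
{code}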



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22485:


Assignee: (was: Apache Spark)

> Use `exclude[Problem]` instead `excludePackage` in MiMa
> ---
>
> Key: SPARK-22485
> URL: https://issues.apache.org/jira/browse/SPARK-22485
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `excludePackage` is deprecated, as shown in the 
> [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33].
> {code}
>   @deprecated("Replace with 
> ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15")
>   def excludePackage(packageName: String): ProblemFilter = {
> exclude[Problem](packageName + ".*")
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246265#comment-16246265
 ] 

Apache Spark commented on SPARK-22485:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19710

> Use `exclude[Problem]` instead `excludePackage` in MiMa
> ---
>
> Key: SPARK-22485
> URL: https://issues.apache.org/jira/browse/SPARK-22485
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `excludePackage` is deprecated, as shown in the 
> [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33].
> {code}
>   @deprecated("Replace with 
> ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15")
>   def excludePackage(packageName: String): ProblemFilter = {
> exclude[Problem](packageName + ".*")
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22485) Use `exclude[Problem]` instead `excludePackage` in MiMa

2017-11-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22485:
--
Summary: Use `exclude[Problem]` instead `excludePackage` in MiMa  (was: Use 
`exclude[Problem]` instead `excludePackage`)

> Use `exclude[Problem]` instead `excludePackage` in MiMa
> ---
>
> Key: SPARK-22485
> URL: https://issues.apache.org/jira/browse/SPARK-22485
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `excludePackage` is deprecated, as shown in the 
> [following|https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33].
> {code}
>   @deprecated("Replace with 
> ProblemFilters.exclude[Problem](\"my.package.*\")", "0.1.15")
>   def excludePackage(packageName: String): ProblemFilter = {
> exclude[Problem](packageName + ".*")
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22484) PySpark DataFrame.write.csv(quote="") uses nullchar as quote

2017-11-09 Thread John Bollenbacher (JIRA)
John Bollenbacher created SPARK-22484:
-

 Summary: PySpark DataFrame.write.csv(quote="") uses nullchar as 
quote
 Key: SPARK-22484
 URL: https://issues.apache.org/jira/browse/SPARK-22484
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.2.0
Reporter: John Bollenbacher
Priority: Minor


[Documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=save#pyspark.sql.DataFrame)
 of DataFrame.write.csv() says that setting the quote parameter to an empty 
string should turn off quoting.  Instead, it uses the [null 
character](https://en.wikipedia.org/wiki/Null_character) as the quote.  
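A Scala equivalent of the reported call, for reference, assuming an existing DataFrame `df`
(the report itself concerns the PySpark `quote` keyword argument, which maps to the same
writer option):

{code}
// Expected per the docs: quoting turned off.
// Observed per this report: '\u0000' (the null character) is used as the quote.
df.write.option("quote", "").csv("/tmp/out")
{code}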



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22483:


Assignee: (was: Apache Spark)

> Exposing java.nio bufferedPool memory metrics to metrics system
> ---
>
> Key: SPARK-22483
> URL: https://issues.apache.org/jira/browse/SPARK-22483
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Srinivasa Reddy Vundela
>
> Spark currently exposes the JVM's on-heap and off-heap memory to the metrics 
> system. Currently there is no way to know how much direct/mapped memory is 
> allocated for java.nio buffered pools.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246203#comment-16246203
 ] 

Apache Spark commented on SPARK-22483:
--

User 'vundela' has created a pull request for this issue:
https://github.com/apache/spark/pull/19709

> Exposing java.nio bufferedPool memory metrics to metrics system
> ---
>
> Key: SPARK-22483
> URL: https://issues.apache.org/jira/browse/SPARK-22483
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Srinivasa Reddy Vundela
>
> Spark currently exposes the JVM's on-heap and off-heap memory to the metrics 
> system. Currently there is no way to know how much direct/mapped memory is 
> allocated for java.nio buffered pools.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22483:


Assignee: Apache Spark

> Exposing java.nio bufferedPool memory metrics to metrics system
> ---
>
> Key: SPARK-22483
> URL: https://issues.apache.org/jira/browse/SPARK-22483
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Srinivasa Reddy Vundela
>Assignee: Apache Spark
>
> Spark currently exposes the JVM's on-heap and off-heap memory to the metrics 
> system. Currently there is no way to know how much direct/mapped memory is 
> allocated for java.nio buffered pools.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22483) Exposing java.nio bufferedPool memory metrics to metrics system

2017-11-09 Thread Srinivasa Reddy Vundela (JIRA)
Srinivasa Reddy Vundela created SPARK-22483:
---

 Summary: Exposing java.nio bufferedPool memory metrics to metrics 
system
 Key: SPARK-22483
 URL: https://issues.apache.org/jira/browse/SPARK-22483
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Srinivasa Reddy Vundela


Spark currently exposes the JVM's on-heap and off-heap memory to the metrics 
system. Currently there is no way to know how much direct/mapped memory is 
allocated for java.nio buffered pools.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow

2017-11-09 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246176#comment-16246176
 ] 

Ran Haim commented on SPARK-22481:
--

It takes about 2 seconds to create the dataset... I need to refresh 30 tables, 
and that takes a minute now, where it used to take 2 seconds.
In the old code, it does the same because isCached actually uses the plan and 
not the dataset.

> CatalogImpl.refreshTable is slow
> 
>
> Key: SPARK-22481
> URL: https://issues.apache.org/jira/browse/SPARK-22481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0
>Reporter: Ran Haim
>Priority: Critical
>
> CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
> really slow.
> The cause of the issue is that it now *always* creates a dataset, and this is 
> redundant most of the time; we only need the dataset if the table is cached.
> before 2.1.1:
>   override def refreshTable(tableName: String): Unit = {
> val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
> // Temp tables: refresh (or invalidate) any metadata/data cached in the 
> plan recursively.
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
> // If this table is cached as an InMemoryRelation, drop the original
> // cached version and make the new version cached lazily.
> val logicalPlan = 
> sparkSession.sessionState.catalog.lookupRelation(tableIdent)
> // Use lookupCachedData directly since RefreshTable also takes 
> databaseName.
> val isCached = 
> sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
> if (isCached) {
>   // Create a data frame to represent the table.
>   // TODO: Use uncacheTable once it supports database name.
>  {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
>   // Uncache the logicalPlan.
>   sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
>   // Cache it again.
>   sparkSession.sharedState.cacheManager.cacheQuery(df, 
> Some(tableIdent.table))
> }
>   }
> after 2.1.1:
>override def refreshTable(tableName: String): Unit = {
> val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
> // Temp tables: refresh (or invalidate) any metadata/data cached in the 
> plan recursively.
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
> // If this table is cached as an InMemoryRelation, drop the original
> // cached version and make the new version cached lazily.
> {color:red}   val table = sparkSession.table(tableIdent){color}
> if (isCached(table)) {
>   // Uncache the logicalPlan.
>   sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = 
> true)
>   // Cache it again.
>   sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher

2017-11-09 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22287:
-
Target Version/s: 2.2.1

> SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
> -
>
> Key: SPARK-22287
> URL: https://issues.apache.org/jira/browse/SPARK-22287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: paul mackles
>Priority: Minor
>
> There does not appear to be a way to control the heap size used by 
> MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not 
> honored for that particular daemon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19606) Support constraints in spark-dispatcher

2017-11-09 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19606:
-
Target Version/s: 2.2.1

> Support constraints in spark-dispatcher
> ---
>
> Key: SPARK-19606
> URL: https://issues.apache.org/jira/browse/SPARK-19606
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Philipp Hoffmann
>
> The `spark.mesos.constraints` configuration is ignored by the 
> spark-dispatcher. The constraints need to be passed in the Framework 
> information when registering with Mesos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22482) Unreadable Parquet array columns

2017-11-09 Thread Costas Piliotis (JIRA)
Costas Piliotis created SPARK-22482:
---

 Summary: Unreadable Parquet array columns
 Key: SPARK-22482
 URL: https://issues.apache.org/jira/browse/SPARK-22482
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
 Environment: Spark 2.1.0
Parquet 1.8.1
Hive 1.2
Hive 2.1.0
presto 0.157
presto 0.180
Reporter: Costas Piliotis


We have seen an issue with writing out Parquet data from Spark. int and bool 
arrays seem to throw exceptions when the Parquet files are read from Hive and 
Presto.

I've logged a ticket here:  PARQUET-1157 with the parquet project but I'm not 
sure if it's an issue within their project or an issue with spark itself.

Spark is reading parquet-avro data which is output by a mapreduce job and 
writing it out to parquet.   

The inbound parquet format has the column defined as:

{code}
  optional group playerpositions_ai (LIST) {
repeated int32 array;
  }
{code}

Spark is redefining this data as this:

{code}
  optional group playerpositions_ai (LIST) {
repeated group list {
  optional int32 element;
}
  }
{code}

and with legacy format:
{code}
  optional group playerpositions_ai (LIST) {
repeated group bag {
  optional int32 array;
}
  }
{code}

The parquet data was tested in Hive 1.2, Hive 2.1, Presto 0.157, Presto 0.180, 
and Spark 2.1, as well as Amazon Athena (which is some form of presto 
implementation).   

I believe that the above schema is valid for writing out parquet.  

The spark command writing it out is simple:
{code}
  data.repartition(((data.count() / 1000) + 
1).toInt).write.format("parquet")
.mode("append")
.partitionBy(partitionColumns: _*)
.save(path)
{code}

We initially wrote this out with legacy format turned off but later turned on 
legacy format and have seen this error occur the same way with legacy off and 
on.  
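For reference, a sketch of how the legacy-format toggle mentioned above is usually flipped,
assuming the standard Spark SQL setting is the one that was used here:

{code}
// true writes lists in the older layout shown above (repeated group bag / array);
// false writes the newer "repeated group list / element" layout.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
{code}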

Spark's stack trace from reading this is:

{code}
java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at 
org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
at 
org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
at 
org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readInteger(DictionaryValuesReader.java:112)
at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$3.read(ColumnReaderImpl.java:243)
at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Also note that our data is stored on S3, if that matters.






[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow

2017-11-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246070#comment-16246070
 ] 

Wenchen Fan commented on SPARK-22481:
-

`SparkSession.table` is pretty cheap; I think the slowness is because we now 
uncache all plans that refer to the given plan. This is for correctness, so I 
don't think it's a regression.

> CatalogImpl.refreshTable is slow
> 
>
> Key: SPARK-22481
> URL: https://issues.apache.org/jira/browse/SPARK-22481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0
>Reporter: Ran Haim
>Priority: Critical
>
> CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
> really slow.
> The cause of the issue is that it now *always* creates a dataset, and this 
> is redundant most of the time; we only need the dataset if the table is 
> cached.
> before 2.1.1:
>   override def refreshTable(tableName: String): Unit = {
> val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
> // Temp tables: refresh (or invalidate) any metadata/data cached in the 
> plan recursively.
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
> // If this table is cached as an InMemoryRelation, drop the original
> // cached version and make the new version cached lazily.
> val logicalPlan = 
> sparkSession.sessionState.catalog.lookupRelation(tableIdent)
> // Use lookupCachedData directly since RefreshTable also takes 
> databaseName.
> val isCached = 
> sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
> if (isCached) {
>   // Create a data frame to represent the table.
>   // TODO: Use uncacheTable once it supports database name.
>  {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
>   // Uncache the logicalPlan.
>   sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
>   // Cache it again.
>   sparkSession.sharedState.cacheManager.cacheQuery(df, 
> Some(tableIdent.table))
> }
>   }
> after 2.1.1:
>override def refreshTable(tableName: String): Unit = {
> val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
> // Temp tables: refresh (or invalidate) any metadata/data cached in the 
> plan recursively.
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
> // If this table is cached as an InMemoryRelation, drop the original
> // cached version and make the new version cached lazily.
> {color:red}   val table = sparkSession.table(tableIdent){color}
> if (isCached(table)) {
>   // Uncache the logicalPlan.
>   sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = 
> true)
>   // Cache it again.
>   sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
> }
>   }






[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow

2017-11-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246041#comment-16246041
 ] 

Kazuaki Ishizaki commented on SPARK-22481:
--

This change was introduced by [this 
PR|https://github.com/apache/spark/pull/17097/files#diff-463cb1b0f60d87ada075a820f18e1104R443].
It seems to be related to the first refactoring in the description.

[~cloud_fan] Is there any reason to always call {{sparkSession.table()}}?

> CatalogImpl.refreshTable is slow
> 
>
> Key: SPARK-22481
> URL: https://issues.apache.org/jira/browse/SPARK-22481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0
>Reporter: Ran Haim
>Priority: Critical
>
> CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
> really slow.
> The cause of the issue is that it now *always* creates a dataset, and this 
> is redundant most of the time; we only need the dataset if the table is 
> cached.
> before 2.1.1:
>   override def refreshTable(tableName: String): Unit = {
> val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
> // Temp tables: refresh (or invalidate) any metadata/data cached in the 
> plan recursively.
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
> // If this table is cached as an InMemoryRelation, drop the original
> // cached version and make the new version cached lazily.
> val logicalPlan = 
> sparkSession.sessionState.catalog.lookupRelation(tableIdent)
> // Use lookupCachedData directly since RefreshTable also takes 
> databaseName.
> val isCached = 
> sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
> if (isCached) {
>   // Create a data frame to represent the table.
>   // TODO: Use uncacheTable once it supports database name.
>  {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
>   // Uncache the logicalPlan.
>   sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
>   // Cache it again.
>   sparkSession.sharedState.cacheManager.cacheQuery(df, 
> Some(tableIdent.table))
> }
>   }
> after 2.1.1:
>override def refreshTable(tableName: String): Unit = {
> val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
> // Temp tables: refresh (or invalidate) any metadata/data cached in the 
> plan recursively.
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
> // If this table is cached as an InMemoryRelation, drop the original
> // cached version and make the new version cached lazily.
> {color:red}   val table = sparkSession.table(tableIdent){color}
> if (isCached(table)) {
>   // Uncache the logicalPlan.
>   sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = 
> true)
>   // Cache it again.
>   sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
> }
>   }






[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow

2017-11-09 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim updated SPARK-22481:
-
Description: 
CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
really slow.
The cause of the issue is that it now *always* creates a dataset, and this 
is redundant most of the time; we only need the dataset if the table is cached.

before 2.1.1:
  override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
val logicalPlan = 
sparkSession.sessionState.catalog.lookupRelation(tableIdent)
// Use lookupCachedData directly since RefreshTable also takes databaseName.
val isCached = 
sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
if (isCached) {
  // Create a data frame to represent the table.
  // TODO: Use uncacheTable once it supports database name.
 {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(df, 
Some(tableIdent.table))
}
  }

after 2.1.1:
   override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
{color:red}   val table = sparkSession.table(tableIdent){color}
if (isCached(table)) {
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
}
  }



  was:
CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
really slow.
The cause of the issue is that it now *always* creates a dataset, and this 
is redundant most of the time; we only need the dataset if the table is cached.

code before the change:
  override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
val logicalPlan = 
sparkSession.sessionState.catalog.lookupRelation(tableIdent)
// Use lookupCachedData directly since RefreshTable also takes databaseName.
val isCached = 
sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
if (isCached) {
  // Create a data frame to represent the table.
  // TODO: Use uncacheTable once it supports database name.
 {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(df, 
Some(tableIdent.table))
}
  }

after the change:
   override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
{color:red}   val table = sparkSession.table(tableIdent){color}
if (isCached(table)) {
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
}
  }




> CatalogImpl.refreshTable is slow
> 
>
> Key: SPARK-22481
> URL: https://issues.apache.org/jira/browse/SPARK-22481
> Project: Spark
>  Issue Type: Bug
>   

[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow

2017-11-09 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim updated SPARK-22481:
-
Description: 
CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
really slow.
The cause of the issue is that it now *always* creates a dataset, and this 
is redundant most of the time; we only need the dataset if the table is cached.

code before the change:
  override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
val logicalPlan = 
sparkSession.sessionState.catalog.lookupRelation(tableIdent)
// Use lookupCachedData directly since RefreshTable also takes databaseName.
val isCached = 
sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
if (isCached) {
  // Create a data frame to represent the table.
  // TODO: Use uncacheTable once it supports database name.
 {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(df, 
Some(tableIdent.table))
}
  }

after the change:
   override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
{color:red}   val table = sparkSession.table(tableIdent){color}
if (isCached(table)) {
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
}
  }



  was:
CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
really slow.
The cause of the issue is that it now *always* creates a dataset, and this is 
redundant most of the time; we only need the dataset if the table is cached.

code before the change:
  override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
val logicalPlan = 
sparkSession.sessionState.catalog.lookupRelation(tableIdent)
// Use lookupCachedData directly since RefreshTable also takes databaseName.
val isCached = 
sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
if (isCached) {
  // Create a data frame to represent the table.
  // TODO: Use uncacheTable once it supports database name.
 {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(df, 
Some(tableIdent.table))
}
  }

after the change:
   override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
{color:red}   val table = sparkSession.table(tableIdent){color}
if (isCached(table)) {
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
}
  }




> CatalogImpl.refreshTable is slow
> 
>
> Key: SPARK-22481
> URL: https://issues.apache.org/jira/browse/SPARK-22481
> Project: Spark
>  Issue Typ

[jira] [Created] (SPARK-22481) CatalogImpl.refreshTable is slow

2017-11-09 Thread Ran Haim (JIRA)
Ran Haim created SPARK-22481:


 Summary: CatalogImpl.refreshTable is slow
 Key: SPARK-22481
 URL: https://issues.apache.org/jira/browse/SPARK-22481
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.2, 2.1.1
Reporter: Ran Haim
Priority: Critical


CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
really slow.
The cause of the issue is that it now *always* creates a dataset, and this is 
redundant most of the time; we only need the dataset if the table is cached.

code before the change:
  override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
val logicalPlan = 
sparkSession.sessionState.catalog.lookupRelation(tableIdent)
// Use lookupCachedData directly since RefreshTable also takes databaseName.
val isCached = 
sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
if (isCached) {
  // Create a data frame to represent the table.
  // TODO: Use uncacheTable once it supports database name.
*  val df = Dataset.ofRows(sparkSession, logicalPlan)*
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(df, 
Some(tableIdent.table))
}
  }

after the change:
   override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
   * val table = sparkSession.table(tableIdent)*
if (isCached(table)) {
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
}
  }








[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow

2017-11-09 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim updated SPARK-22481:
-
Description: 
CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
really slow.
The cause of the issue is that it now *always* creates a dataset, and this is 
redundant most of the time; we only need the dataset if the table is cached.

code before the change:
  override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
val logicalPlan = 
sparkSession.sessionState.catalog.lookupRelation(tableIdent)
// Use lookupCachedData directly since RefreshTable also takes databaseName.
val isCached = 
sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
if (isCached) {
  // Create a data frame to represent the table.
  // TODO: Use uncacheTable once it supports database name.
 {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(df, 
Some(tableIdent.table))
}
  }

after the change:
   override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
{color:red}   val table = sparkSession.table(tableIdent){color}
if (isCached(table)) {
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
}
  }



  was:
CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
really slow.
The cause of the issue is that it now *always* creates a dataset, and this is 
redundant most of the time; we only need the dataset if the table is cached.

code before the change:
  override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
val logicalPlan = 
sparkSession.sessionState.catalog.lookupRelation(tableIdent)
// Use lookupCachedData directly since RefreshTable also takes databaseName.
val isCached = 
sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
if (isCached) {
  // Create a data frame to represent the table.
  // TODO: Use uncacheTable once it supports database name.
*  val df = Dataset.ofRows(sparkSession, logicalPlan)*
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(df, 
Some(tableIdent.table))
}
  }

after the change:
   override def refreshTable(tableName: String): Unit = {
val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
// Temp tables: refresh (or invalidate) any metadata/data cached in the 
plan recursively.
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)

// If this table is cached as an InMemoryRelation, drop the original
// cached version and make the new version cached lazily.
   * val table = sparkSession.table(tableIdent)*
if (isCached(table)) {
  // Uncache the logicalPlan.
  sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
  // Cache it again.
  sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
}
  }




> CatalogImpl.refreshTable is slow
> 
>
> Key: SPARK-22481
> URL: https://issues.apache.org/jira/browse/SPARK-22481
> Project: Spark
>  Issue Type: Bug
>  Components: SQ

[jira] [Commented] (SPARK-22474) cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]

2017-11-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245915#comment-16245915
 ] 

Kazuaki Ishizaki commented on SPARK-22474:
--

This check was introduced by [this 
PR|https://github.com/apache/spark/commit/8ab50765cd793169091d983b50d87a391f6ac1f4].
While I did not run it with Spark 2.0.2, Spark 2.0.2 seems to include this PR, 
too.

> cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]
> -
>
> Key: SPARK-22474
> URL: https://issues.apache.org/jira/browse/SPARK-22474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Mikael Valot
>
> The following code, run in spark-shell, throws an exception. It works fine 
> in Spark 2.0.2.
> {code:java}
> case class MyId(v: String)
> case class MyClass(infos: Seq[Map[MyId, String]])
> val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah"))))
> seq.toDS().write.parquet("/tmp/myclass")
> spark.read.parquet("/tmp/myclass").as[MyClass].collect()
> Caused by: org.apache.spark.sql.AnalysisException: Map key type is expected 
> to be a primitive type, but found: required group key {
>   optional binary v (UTF8);
> };
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:246)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:201)
>   at scala.Option.fold(Option.scala:158)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:87)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetSchemaConverter$$convert(ParquetSchemaConverter.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201)
>   at scala.Option.fold(Option.scala:158)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetArrayConverter.<init>(ParquetRowConverter.scala:483)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:183)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasou

[jira] [Commented] (SPARK-22474) cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]

2017-11-09 Thread Mikael Valot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245875#comment-16245875
 ] 

Mikael Valot commented on SPARK-22474:
--

I can read the file if I comment out line 581 in 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter
{code:java}
  def checkConversionRequirement(f: => Boolean, message: String): Unit = {
if (!f) {
//  throw new AnalysisException(message)
}
  }
{code}
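
Rather than patching the converter, a possible workaround is to avoid using a 
case class as the map key, since Parquet requires map keys to be primitive; a 
minimal sketch reusing the definitions from the description (the {{*Flat}} names 
and the output path are hypothetical):

{code:java}
// Flatten the case-class key (MyId) to its underlying String before writing,
// so the Parquet map key becomes a primitive type.
case class MyClassFlat(infos: Seq[Map[String, String]])

val flat = seq.map(c => MyClassFlat(c.infos.map(_.map { case (k, v) => k.v -> v })))
flat.toDS().write.parquet("/tmp/myclass_flat")
spark.read.parquet("/tmp/myclass_flat").as[MyClassFlat].collect()
{code}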



> cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]
> -
>
> Key: SPARK-22474
> URL: https://issues.apache.org/jira/browse/SPARK-22474
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Mikael Valot
>
> The following code, run in spark-shell, throws an exception. It works fine 
> in Spark 2.0.2.
> {code:java}
> case class MyId(v: String)
> case class MyClass(infos: Seq[Map[MyId, String]])
> val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah"))))
> seq.toDS().write.parquet("/tmp/myclass")
> spark.read.parquet("/tmp/myclass").as[MyClass].collect()
> Caused by: org.apache.spark.sql.AnalysisException: Map key type is expected 
> to be a primitive type, but found: required group key {
>   optional binary v (UTF8);
> };
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:246)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:201)
>   at scala.Option.fold(Option.scala:158)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:87)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetSchemaConverter$$convert(ParquetSchemaConverter.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201)
>   at scala.Option.fold(Option.scala:158)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetArrayConverter.<init>(ParquetRowConverter.scala:483)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:183)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable

[jira] [Assigned] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22479:


Assignee: (was: Apache Spark)

> SaveIntoDataSourceCommand logs jdbc credentials
> ---
>
> Key: SPARK-22479
> URL: https://issues.apache.org/jira/browse/SPARK-22479
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Onur Satici
>
> JDBC credentials are not redacted in plans including a 
> 'SaveIntoDataSourceCommand'.
> Steps to reproduce:
> {code}
> spark-shell --packages org.postgresql:postgresql:42.1.1
> {code}
> {code}
> import org.apache.spark.sql.execution.QueryExecution
> import org.apache.spark.sql.util.QueryExecutionListener
> val listener = new QueryExecutionListener {
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, duration: 
> Long): Unit = {
> System.out.println(qe.toString())
>   }
> }
> spark.listenerManager.register(listener)
> spark.range(100).write.format("jdbc").option("url", 
> "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").save()
> {code}
> The above will yield the following plan:
> {code}
> == Parsed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Analyzed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Physical Plan ==
> ExecutedCommand
>+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>  +- Range (0, 100, step=1, splits=Some(8))
> {code}






[jira] [Commented] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245766#comment-16245766
 ] 

Apache Spark commented on SPARK-22479:
--

User 'onursatici' has created a pull request for this issue:
https://github.com/apache/spark/pull/19708

> SaveIntoDataSourceCommand logs jdbc credentials
> ---
>
> Key: SPARK-22479
> URL: https://issues.apache.org/jira/browse/SPARK-22479
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Onur Satici
>
> JDBC credentials are not redacted in plans including a 
> 'SaveIntoDataSourceCommand'.
> Steps to reproduce:
> {code}
> spark-shell --packages org.postgresql:postgresql:42.1.1
> {code}
> {code}
> import org.apache.spark.sql.execution.QueryExecution
> import org.apache.spark.sql.util.QueryExecutionListener
> val listener = new QueryExecutionListener {
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, duration: 
> Long): Unit = {
> System.out.println(qe.toString())
>   }
> }
> spark.listenerManager.register(listener)
> spark.range(100).write.format("jdbc").option("url", 
> "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").save()
> {code}
> The above will yield the following plan:
> {code}
> == Parsed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Analyzed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Physical Plan ==
> ExecutedCommand
>+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>  +- Range (0, 100, step=1, splits=Some(8))
> {code}






[jira] [Assigned] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22479:


Assignee: Apache Spark

> SaveIntoDataSourceCommand logs jdbc credentials
> ---
>
> Key: SPARK-22479
> URL: https://issues.apache.org/jira/browse/SPARK-22479
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Onur Satici
>Assignee: Apache Spark
>
> JDBC credentials are not redacted in plans including a 
> 'SaveIntoDataSourceCommand'.
> Steps to reproduce:
> {code}
> spark-shell --packages org.postgresql:postgresql:42.1.1
> {code}
> {code}
> import org.apache.spark.sql.execution.QueryExecution
> import org.apache.spark.sql.util.QueryExecutionListener
> val listener = new QueryExecutionListener {
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, duration: 
> Long): Unit = {
> System.out.println(qe.toString())
>   }
> }
> spark.listenerManager.register(listener)
> spark.range(100).write.format("jdbc").option("url", 
> "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").save()
> {code}
> The above will yield the following plan:
> {code}
> == Parsed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Analyzed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Physical Plan ==
> ExecutedCommand
>+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>  +- Range (0, 100, step=1, splits=Some(8))
> {code}






[jira] [Commented] (SPARK-22478) Spark - Truncate date by Day / Hour

2017-11-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245748#comment-16245748
 ] 

Marco Gaido commented on SPARK-22478:
-

The reason {{TRUNC}} works like this is compatibility with Hive, where the same 
function behaves the same way. You can achieve your goal in other ways: use 
{{date_format}} to get a string with the information you need, use the {{HOUR}} 
function, or create your own UDF.
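
For example, a minimal sketch of the daily/hourly aggregation described in the 
issue using only existing functions ({{df}}, {{ts}} and {{amount}} are 
hypothetical names):

{code}
import org.apache.spark.sql.functions._

// "Truncate" by formatting the timestamp to day / hour granularity, then aggregate.
val daily  = df.groupBy(date_format(col("ts"), "yyyy-MM-dd").as("day")).sum("amount")
val hourly = df.groupBy(date_format(col("ts"), "yyyy-MM-dd HH").as("hour")).sum("amount")
{code}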

> Spark  - Truncate date by Day / Hour
> 
>
> Key: SPARK-22478
> URL: https://issues.apache.org/jira/browse/SPARK-22478
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: david hodeffi
>Priority: Trivial
>
> Currently it is possible to truncate a date/timestamp only by month or year;
> the functionality to truncate by hour or day is missing.
> A use case that requires this functionality is aggregating the financial 
> transactions of a party in order to create daily/hourly summaries. 
> Those summaries are required for machine learning algorithms.






[jira] [Resolved] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-11-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20542.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Add an API into Bucketizer that can bin a lot of columns all at once
> 
>
> Key: SPARK-20542
> URL: https://issues.apache.org/jira/browse/SPARK-20542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> Current ML's Bucketizer can only bin a single column of continuous features. If a 
> dataset has thousands of continuous columns that need to be binned, we end up 
> with thousands of ML stages. It is very inefficient regarding query planning 
> and execution.
> We should have a type of bucketizer that can bin a lot of columns all at 
> once. It would need to accept a list of arrays of split points corresponding 
> to the columns to bin, but it would make things more efficient by replacing 
> thousands of stages with just one.
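
For illustration, a minimal sketch of the status quo this issue addresses: one 
Bucketizer stage per column (the column names and split points below are 
hypothetical):

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer

// One Bucketizer per column: binning 1000 columns means a 1000-stage pipeline.
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)
val stages: Array[PipelineStage] = (0 until 1000).map { i =>
  new Bucketizer()
    .setInputCol(s"f$i")
    .setOutputCol(s"f${i}_binned")
    .setSplits(splits)
}.toArray
val pipeline = new Pipeline().setStages(stages)
{code}

The multi-column variant resolved here collapses that into a single stage.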






[jira] [Updated] (SPARK-22480) Dynamic Watermarking

2017-11-09 Thread Jochen Niebuhr (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jochen Niebuhr updated SPARK-22480:
---
Description: When you're using the watermark feature, you're forced to 
provide an absolute duration to identify late events. For the case we're using 
structured streaming for, this is not completely working. In our case, late 
events will be possible on the next business day. So I'd like to use 24 hours 
watermark for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I 
would suggest is being able to use a function or expression to 
{{withWatermark}} so people can implement this or similar behaviors. If this 
sounds like a good idea, I can probably supply a pull request.  (was: When 
you're using the watermark feature, you're forced to provide an absolute 
duration to identify late events. For the case we're using structured streaming 
for, this is not completely working. In our case, late events will be possible 
on the next business day. So I'd like to use 24 hours watermark for 
Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I would suggest is 
being able to use a function or expression to `withWatermark` so people can 
implement this or similar behaviors. If this sounds like a good idea, I can 
probably supply a pull request.)

> Dynamic Watermarking
> 
>
> Key: SPARK-22480
> URL: https://issues.apache.org/jira/browse/SPARK-22480
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jochen Niebuhr
>
> When you're using the watermark feature, you're forced to provide an absolute 
> duration to identify late events. For the case we're using structured 
> streaming for, this is not completely working. In our case, late events will 
> be possible on the next business day. So I'd like to use 24 hours watermark 
> for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I would 
> suggest is being able to use a function or expression to {{withWatermark}} so 
> people can implement this or similar behaviors. If this sounds like a good 
> idea, I can probably supply a pull request.
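
For context, a minimal sketch of the current API this request would extend: 
{{withWatermark}} only accepts a fixed delay string ({{events}}, {{eventTime}} 
and {{partyId}} are hypothetical names):

{code}
import org.apache.spark.sql.functions._

// Current API: a single fixed lateness threshold ("24 hours") for every event,
// which is exactly the limitation described above.
val counts = events
  .withWatermark("eventTime", "24 hours")
  .groupBy(window(col("eventTime"), "1 hour"), col("partyId"))
  .count()
{code}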






[jira] [Updated] (SPARK-22480) Dynamic Watermarking

2017-11-09 Thread Jochen Niebuhr (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jochen Niebuhr updated SPARK-22480:
---
Description: When you're using the watermark feature, you're forced to 
provide an absolute duration to identify late events. For the case we're using 
structured streaming for, this is not completely working. In our case, late 
events will be possible on the next business day. So I'd like to use 24 hours 
watermark for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I 
would suggest is being able to use a function or expression to `withWatermark` 
so people can implement this or similar behaviors. If this sounds like a good 
idea, I can probably supply a pull request.  (was: When you're using the 
watermark feature, you're forced to provide an absolute duration to identify 
late events. For the case we're using structured streaming for, this is not 
completely working. In our case, late events will be possible on the next 
business day. So I'd like to use 24 hours watermark for Sunday-Thursday, 72 
hours for Friday, 48 for Saturday. If this sounds like a good idea, I can 
probably supply a pull request.)

> Dynamic Watermarking
> 
>
> Key: SPARK-22480
> URL: https://issues.apache.org/jira/browse/SPARK-22480
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jochen Niebuhr
>
> When you're using the watermark feature, you're forced to provide an absolute 
> duration to identify late events. For the case we're using structured 
> streaming for, this is not completely working. In our case, late events will 
> be possible on the next business day. So I'd like to use 24 hours watermark 
> for Sunday-Thursday, 72 hours for Friday, 48 for Saturday. What I would 
> suggest is being able to use a function or expression to `withWatermark` so 
> people can implement this or similar behaviors. If this sounds like a good 
> idea, I can probably supply a pull request.






[jira] [Created] (SPARK-22480) Dynamic Watermarking

2017-11-09 Thread Jochen Niebuhr (JIRA)
Jochen Niebuhr created SPARK-22480:
--

 Summary: Dynamic Watermarking
 Key: SPARK-22480
 URL: https://issues.apache.org/jira/browse/SPARK-22480
 Project: Spark
  Issue Type: Wish
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Jochen Niebuhr


When you're using the watermark feature, you're forced to provide an absolute 
duration to identify late events. For the case we're using structured streaming 
for, this is not completely working. In our case, late events will be possible 
on the next business day. So I'd like to use 24 hours watermark for 
Sunday-Thursday, 72 hours for Friday, 48 for Saturday. If this sounds like a 
good idea, I can probably supply a pull request.






[jira] [Commented] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials

2017-11-09 Thread Onur Satici (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245677#comment-16245677
 ] 

Onur Satici commented on SPARK-22479:
-

Will submit a PR.

> SaveIntoDataSourceCommand logs jdbc credentials
> ---
>
> Key: SPARK-22479
> URL: https://issues.apache.org/jira/browse/SPARK-22479
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Onur Satici
>
> JDBC credentials are not redacted in plans including a 
> 'SaveIntoDataSourceCommand'.
> Steps to reproduce:
> {code}
> spark-shell --packages org.postgresql:postgresql:42.1.1
> {code}
> {code}
> import org.apache.spark.sql.execution.QueryExecution
> import org.apache.spark.sql.util.QueryExecutionListener
> val listener = new QueryExecutionListener {
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, duration: 
> Long): Unit = {
> System.out.println(qe.toString())
>   }
> }
> spark.listenerManager.register(listener)
> spark.range(100).write.format("jdbc").option("url", 
> "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", 
> "org.postgresql.Driver").option("dbtable", "test").save()
> {code}
> The above will yield the following plan:
> {code}
> == Parsed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Analyzed Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>+- Range (0, 100, step=1, splits=Some(8))
> == Physical Plan ==
> ExecutedCommand
>+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
> org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
> ErrorIfExists
>  +- Range (0, 100, step=1, splits=Some(8))
> {code}






[jira] [Assigned] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22472:


Assignee: (was: Apache Spark)

> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>
> Not sure if it was ever reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}
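
As a workaround (not a fix for the typed path itself), dropping the null rows 
before switching to the primitive view avoids the bogus values; a minimal sketch 
in spark-shell, reusing {{s}} from the snippet above:

{code}
// Keep only non-null rows so the typed (primitive Long) map never reads a null slot.
s.filter($"v".isNotNull).as[Long].map(_ * 2).show(false)
{code}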






[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245676#comment-16245676
 ] 

Apache Spark commented on SPARK-22472:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19707

> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>
> Not sure if it was ever reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}






[jira] [Assigned] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22472:


Assignee: Apache Spark

> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>Assignee: Apache Spark
>
> Not sure if it was ever reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials

2017-11-09 Thread Onur Satici (JIRA)
Onur Satici created SPARK-22479:
---

 Summary: SaveIntoDataSourceCommand logs jdbc credentials
 Key: SPARK-22479
 URL: https://issues.apache.org/jira/browse/SPARK-22479
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Onur Satici


JDBC credentials are not redacted in plans including a 
'SaveIntoDataSourceCommand'.

Steps to reproduce:
{code}
spark-shell --packages org.postgresql:postgresql:42.1.1
{code}

{code}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener
val listener = new QueryExecutionListener {
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {}
  override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = {
    System.out.println(qe.toString())
  }
}
spark.listenerManager.register(listener)
spark.range(100).write.format("jdbc")
  .option("url", "jdbc:postgresql:sparkdb")
  .option("password", "pass")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "test")
  .save()
{code}
The above will yield the following plan:

{code}
== Parsed Logical Plan ==
SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
ErrorIfExists
   +- Range (0, 100, step=1, splits=Some(8))

== Analyzed Logical Plan ==
SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
ErrorIfExists
   +- Range (0, 100, step=1, splits=Some(8))

== Optimized Logical Plan ==
SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
ErrorIfExists
   +- Range (0, 100, step=1, splits=Some(8))

== Physical Plan ==
ExecutedCommand
   +- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> 
org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), 
ErrorIfExists
 +- Range (0, 100, step=1, splits=Some(8))
{code}
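
A possible stop-gap, sketched on top of the listener from the reproduction above, is to scrub obvious credential entries from the plan string before it is printed. `redactPlan` and its regex below are illustrative only, not an existing Spark API, and they only hide the value in this listener's output; the unredacted plan can still surface elsewhere until Spark redacts SaveIntoDataSourceCommand options itself.

{code}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical helper: mask "password -> ..." entries in a plan string.
def redactPlan(plan: String): String =
  plan.replaceAll("password -> [^,)]+", "password -> *********")

val redactingListener = new QueryExecutionListener {
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {}
  override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = {
    System.out.println(redactPlan(qe.toString()))
  }
}
spark.listenerManager.register(redactingListener)
{code}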








--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming

2017-11-09 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245556#comment-16245556
 ] 

Li Yuanjian commented on SPARK-20928:
-

Our team discussed the design sketch in detail; we have some ideas and 
questions, noted below.
1. Will the window operation be supported in Continuous Processing Mode? Even 
if we only consider narrow dependencies for now, as the design sketch 
describes, the exactly-once guarantee may not be achievable with the current 
implementation of windows and watermarks.
2. Should the epoch IDs be aligned in scenarios that are not map-only?
{quote}
The design can also work with blocking operators, although it’d require the 
blocking operators to ensure epoch markers from all the partitions have been 
received by the operator before moving forward to commit.
{quote}
Does `blocking operators` here mean operators that require a shuffle? We think 
that only operators with an ordering relation (like window / mapState / 
sortByKey) need the epoch IDs aligned, while others (like groupBy) do not.
3. Also, for many-to-one scenarios (like shuffle and window), should a new 
epoch ID be used in the shuffle-read stage and when a window slides out, or 
should the original batch of epoch IDs be reused?

> SPIP: Continuous Processing Mode for Structured Streaming
> -
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>  Labels: SPIP
> Attachments: Continuous Processing in Structured Streaming Design 
> Sketch.pdf
>
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22478) Spark - Truncate date by Day / Hour

2017-11-09 Thread david hodeffi (JIRA)
david hodeffi created SPARK-22478:
-

 Summary: Spark - Truncate date by Day / Hour
 Key: SPARK-22478
 URL: https://issues.apache.org/jira/browse/SPARK-22478
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: david hodeffi
Priority: Trivial


Currently it is possible to truncate a date/timestamp only by month or year; 
there is no built-in functionality to truncate by day or hour.
A use case that requires this: aggregating a party's financial transactions 
into daily or hourly summaries. Those summaries are required for machine 
learning algorithms.
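
For reference, until such a function exists, day and hour truncation can be approximated with functions already available in 2.2. A sketch for a spark-shell session; `df` and its timestamp column `ts` are placeholders:

{code}
import org.apache.spark.sql.functions._
import spark.implicits._

// Truncate to the day: to_date simply drops the time component.
val byDay = df.withColumn("day", to_date($"ts"))

// Truncate to the hour: round the epoch seconds down to the hour, cast back.
val byHour = df.withColumn("hour",
  (floor(unix_timestamp($"ts") / 3600) * 3600).cast("timestamp"))
{code}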





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22477) windowing in spark

2017-11-09 Thread Deepak Nayak (JIRA)
Deepak Nayak created SPARK-22477:


 Summary: windowing in spark
 Key: SPARK-22477
 URL: https://issues.apache.org/jira/browse/SPARK-22477
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Deepak Nayak


I have performed a map operation over a set of attributes and stored the 
result in a variable. Next I want to perform windowing over a single attribute 
rather than over the variable that contains all the attributes.

I have tried doing that, but I do not get all the attribute values after 
windowing; it gives windowed output only for that particular attribute.

Kindly help if possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent

2017-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22405.
-
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.3.0

> Enrich the event information and add new event of ExternalCatalogEvent
> --
>
> Key: SPARK-22405
> URL: https://issues.apache.org/jira/browse/SPARK-22405
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.3.0
>
>
> We're building a data lineage tool in which we need to monitor the metadata 
> changes in {{ExternalCatalog}}. The current {{ExternalCatalog}} already provides 
> several useful events, like "CreateDatabaseEvent", for a custom SparkListener to 
> use. But the information provided by such events is not rich enough; for 
> example, {{CreateTablePreEvent}} only provides the "database" name and "table" 
> name, not the full table metadata, which makes it hard for users to get all the 
> useful table-related information.
> So here we propose to add a new {{ExternalCatalogEvent}} and enrich the 
> existing events for all catalog-related updates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters

2017-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22442:
---

Assignee: Liang-Chi Hsieh

> Schema generated by Product Encoder doesn't match case class field name when 
> using non-standard characters
> --
>
> Key: SPARK-22442
> URL: https://issues.apache.org/jira/browse/SPARK-22442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Mikel San Vicente
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> The Product encoder encodes field names incorrectly when they contain certain 
> non-standard characters.
> For example, for:
> {quote}
> case class MyType(`field.1`: String, `field 2`: String)
> {quote}
> we will get the following schema
> {quote}
> root
>  |-- field$u002E1: string (nullable = true)
>  |-- field$u00202: string (nullable = true)
> {quote}
> As a consequence of this issue a DataFrame with the correct schema can't be 
> converted to a Dataset using .as[MyType]
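
A minimal spark-shell reproduction of the mismatch, for reference; the failure on `.as[MyType]` is as described in this report:

{code}
import spark.implicits._

case class MyType(`field.1`: String, `field 2`: String)

// The derived schema shows the encoded names (field$u002E1, field$u00202)
// instead of the original field names.
Seq(MyType("a", "b")).toDS().printSchema()

// Per this report, the conversion below then fails: the encoder looks for the
// encoded names rather than the original column names.
val df = Seq(("a", "b")).toDF("field.1", "field 2")
df.as[MyType]
{code}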



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22442) Schema generated by Product Encoder doesn't match case class field name when using non-standard characters

2017-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22442.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19664
[https://github.com/apache/spark/pull/19664]

> Schema generated by Product Encoder doesn't match case class field name when 
> using non-standard characters
> --
>
> Key: SPARK-22442
> URL: https://issues.apache.org/jira/browse/SPARK-22442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Mikel San Vicente
> Fix For: 2.3.0
>
>
> The Product encoder encodes field names incorrectly when they contain certain 
> non-standard characters.
> For example, for:
> {quote}
> case class MyType(`field.1`: String, `field 2`: String)
> {quote}
> we will get the following schema
> {quote}
> root
>  |-- field$u002E1: string (nullable = true)
>  |-- field$u00202: string (nullable = true)
> {quote}
> As a consequence of this issue a DataFrame with the correct schema can't be 
> converted to a Dataset using .as[MyType]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly

2017-11-09 Thread James Porritt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Porritt updated SPARK-22468:
--
Description: 
I have an issue whereby a subtract between two DataFrames that correctly 
results in an empty DataFrame seemingly leaves that DataFrame not initialised 
properly.

In my code I try and do a subtract both ways:

{code}x = a.subtract(b)
y = b.subtract(a){code}

I then call .rdd.isEmpty() on both of them to check whether I need to do 
something with the results. Often the 'y' subtract will fail if the 'x' 
subtract is non-empty. It's hard to reproduce, however; I can't seem to reduce 
it to a sample. One of the errors I get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace:
py4j.Py4JException: Method asJavaRDD([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745){noformat}

Another error is:
{noformat}
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
385, in getNumPartitions
AttributeError: 'NoneType' object has no attribute 'size'
{noformat}

This is happening at multiple points in my code.





  was:
I have an issue whereby a subtract between two DataFrames that will correctly 
end up with an empty DataFrame, seemingly has the DataFrame not initialised 
properly.

In my code I try and do a subtract both ways:

{code}x = a.subtract(b)
y = b.subtract(a){code}

I then do an .rdd.isEmpty() on both of them to check if I need to do something 
with the results. Often the 'y' subtract will fail if the 'x' subtract is 
non-empty. It's hard to reproduce however, I can't seem to reduce it to a 
sample. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  Fi

[jira] [Resolved] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive

2017-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22463.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19663
[https://github.com/apache/spark/pull/19663]

> Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to 
> distributed archive
> --
>
> Key: SPARK-22463
> URL: https://issues.apache.org/jira/browse/SPARK-22463
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Kent Yao
> Fix For: 2.3.0
>
>
> When I ran self-contained SQL apps, such as
> {code:java}
> import org.apache.spark.sql.SparkSession
> object ShowHiveTables {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .appName("Show Hive Tables")
>   .enableHiveSupport()
>   .getOrCreate()
> spark.sql("show tables").show()
> spark.stop()
>   }
> }
> {code}
> with **yarn cluster** mode and `hive-site.xml` correctly placed within 
> `$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because 
> hive-site.xml was not on the AM/Driver's classpath.
> Submitting them with `--files/--jars local/path/to/hive-site.xml`, or putting 
> the file in `$HADOOP_CONF_DIR`/`$YARN_CONF_DIR`, does make these apps work as 
> well in cluster mode as in client mode. But according to the official doc, see @ 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
> > Configuration of Hive is done by placing your hive-site.xml, core-site.xml 
> > (for security configuration), and hdfs-site.xml (for HDFS configuration) 
> > file in conf/.
> We should either respect these configuration files too, or modify the doc for 
> Hive tables in cluster mode.
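
For reference, the `--files` workaround mentioned above looks like the following in yarn cluster mode; the jar name and local path are placeholders:

{code}
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class ShowHiveTables \
  --files /local/path/to/hive-site.xml \
  show-hive-tables.jar
{code}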



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive

2017-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22463:
---

Assignee: Kent Yao

> Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to 
> distributed archive
> --
>
> Key: SPARK-22463
> URL: https://issues.apache.org/jira/browse/SPARK-22463
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
> Fix For: 2.3.0
>
>
> When I ran self-contained SQL apps, such as
> {code:java}
> import org.apache.spark.sql.SparkSession
> object ShowHiveTables {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .appName("Show Hive Tables")
>   .enableHiveSupport()
>   .getOrCreate()
> spark.sql("show tables").show()
> spark.stop()
>   }
> }
> {code}
> with **yarn cluster** mode and `hive-site.xml` correctly placed within 
> `$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because 
> hive-site.xml was not on the AM/Driver's classpath.
> Submitting them with `--files/--jars local/path/to/hive-site.xml`, or putting 
> the file in `$HADOOP_CONF_DIR`/`$YARN_CONF_DIR`, does make these apps work as 
> well in cluster mode as in client mode. But according to the official doc, see @ 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
> > Configuration of Hive is done by placing your hive-site.xml, core-site.xml 
> > (for security configuration), and hdfs-site.xml (for HDFS configuration) 
> > file in conf/.
> We should either respect these configuration files too, or modify the doc for 
> Hive tables in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-09 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245344#comment-16245344
 ] 

Liang-Chi Hsieh commented on SPARK-22460:
-

FYI, there is already a reported issue in spark-avro for this: 
https://github.com/databricks/spark-avro/issues/229.


> Spark De-serialization of Timestamp field is Incorrect
> --
>
> Key: SPARK-22460
> URL: https://issues.apache.org/jira/browse/SPARK-22460
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.1
>Reporter: Saniya Tech
>
> We are trying to serialize Timestamp fields to Avro using the spark-avro 
> connector. I can see the Timestamp fields are correctly serialized as long 
> values (milliseconds since the epoch), and I verified that the data is 
> correctly read back from the Avro files. It is when we encode the Dataset as a 
> case class that the timestamp field is incorrectly converted, as if the long 
> value were seconds since the epoch. As can be seen below, this shifts the 
> timestamp many years into the future.
> Code used to reproduce the issue:
> {code:java}
> import java.sql.Timestamp
> import com.databricks.spark.avro._
> import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
> case class TestRecord(name: String, modified: Timestamp)
> import spark.implicits._
> val data = Seq(
>   TestRecord("One", new Timestamp(System.currentTimeMillis()))
> )
> // Serialize:
> val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> 
> "com.example.domain")
> val path = s"s3a://some-bucket/output/"
> val ds = spark.createDataset(data)
> ds.write
>   .options(parameters)
>   .mode(SaveMode.Overwrite)
>   .avro(path)
> //
> // De-serialize
> val output = spark.read.avro(path).as[TestRecord]
> {code}
> Output from the test:
> {code:java}
> scala> data.head
> res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)
> scala> output.collect().head
> res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
> {code}
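
A possible workaround, sketched as a continuation of the reproduction above and assuming the Avro column really holds epoch milliseconds as described, is to rescale the column explicitly before applying the case-class encoder:

{code}
import com.databricks.spark.avro._
import org.apache.spark.sql.functions.col

// Interpret the stored long as milliseconds: rescale to (fractional) seconds,
// then cast to timestamp, instead of letting the encoder treat it as seconds.
val fixed = spark.read.avro(path)
  .withColumn("modified", (col("modified") / 1000).cast("timestamp"))
  .as[TestRecord]
{code}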



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org