[jira] [Assigned] (SPARK-31447) DATE_PART functions produces incorrect result

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31447:


Assignee: (was: Apache Spark)

> DATE_PART functions produces incorrect result
> -
>
> Key: SPARK-31447
> URL: https://issues.apache.org/jira/browse/SPARK-31447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sathyaprakash Govindasamy
>Priority: Minor
>
> Spark does not extract the correct date part from a calendar interval. Below is 
> one example of extracting the day from a calendar interval:
> {code:java}
> spark.sql("SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) 
> - cast('2020-01-01 00:00:00' as timestamp)))").show{code}
> {code}
> +---------------------------------------------------------------------------------------------------------------------+
> |date_part('DAY', subtracttimestamps(CAST('2020-01-15 00:00:00' AS TIMESTAMP), CAST('2020-01-01 00:00:00' AS TIMESTAMP)))|
> +---------------------------------------------------------------------------------------------------------------------+
> |                                                                                                                      0|
> +---------------------------------------------------------------------------------------------------------------------+
> {code}
> Actual output: 0 days
> Correct output: 14 days
> This is because the SubtractTimestamps expression calculates the difference and 
> populates only the microseconds field; the months and days fields are set to zero:
> {code:java}
> new CalendarInterval(months=0, days=0, microseconds=end.asInstanceOf[Long] - 
> start.asInstanceOf[Long]){code}
> https://github.com/apache/spark/blob/2c5d489679ba3814973680d65853877664bcd931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2211
> But the ExtractIntervalDays expression reads the day count from the days field 
> of CalendarInterval, and therefore returns zero:
> {code:java}
> def getDays(interval: CalendarInterval): Int = {
>  interval.days
>  }{code}
> https://github.com/apache/spark/blob/2c5d489679ba3814973680d65853877664bcd931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala#L73
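
A minimal sketch (not the actual Spark fix) of how a day count could be derived
when the whole timestamp difference sits in the microseconds field:

{code}
// Illustration only: with months = 0 and days = 0, the entire difference is in
// microseconds, so whole days have to be derived from that field.
import org.apache.spark.unsafe.types.CalendarInterval

val MICROS_PER_DAY: Long = 24L * 60 * 60 * 1000 * 1000
val diff = new CalendarInterval(0, 0, 14L * MICROS_PER_DAY) // the example above
val days = diff.days + diff.microseconds / MICROS_PER_DAY   // 14 instead of 0
{code}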






[jira] [Assigned] (SPARK-31447) DATE_PART functions produces incorrect result

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31447:


Assignee: Apache Spark

> DATE_PART functions produces incorrect result
> -
>
> Key: SPARK-31447
> URL: https://issues.apache.org/jira/browse/SPARK-31447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sathyaprakash Govindasamy
>Assignee: Apache Spark
>Priority: Minor
>
> Spark does not extract the correct date part from a calendar interval. Below is 
> one example of extracting the day from a calendar interval:
> {code:java}
> spark.sql("SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) 
> - cast('2020-01-01 00:00:00' as timestamp)))").show{code}
> {code}
> +---------------------------------------------------------------------------------------------------------------------+
> |date_part('DAY', subtracttimestamps(CAST('2020-01-15 00:00:00' AS TIMESTAMP), CAST('2020-01-01 00:00:00' AS TIMESTAMP)))|
> +---------------------------------------------------------------------------------------------------------------------+
> |                                                                                                                      0|
> +---------------------------------------------------------------------------------------------------------------------+
> {code}
> Actual output: 0 days
> Correct output: 14 days
> This is because the SubtractTimestamps expression calculates the difference and 
> populates only the microseconds field; the months and days fields are set to zero:
> {code:java}
> new CalendarInterval(months=0, days=0, microseconds=end.asInstanceOf[Long] - 
> start.asInstanceOf[Long]){code}
> https://github.com/apache/spark/blob/2c5d489679ba3814973680d65853877664bcd931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2211
> But the ExtractIntervalDays expression reads the day count from the days field 
> of CalendarInterval, and therefore returns zero:
> {code:java}
> def getDays(interval: CalendarInterval): Int = {
>  interval.days
>  }{code}
> https://github.com/apache/spark/blob/2c5d489679ba3814973680d65853877664bcd931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala#L73






[jira] [Commented] (SPARK-31372) Display expression schema for double checkout alias

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097140#comment-17097140
 ] 

Apache Spark commented on SPARK-31372:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28427

> Display expression schema for double checkout alias
> ---
>
> Key: SPARK-31372
> URL: https://issues.apache.org/jira/browse/SPARK-31372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> Although SPARK-30184 implemented a helper method for aliasing functions, 
> developers often forget to use it.
> We need stronger guarantees so that the aliases output by built-in 
> functions are correct.






[jira] [Resolved] (SPARK-31619) Rename config name "spark.dynamicAllocation.shuffleTimeout" to "spark.dynamicAllocation.shuffleTracking.timeout"

2020-04-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31619.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28426
[https://github.com/apache/spark/pull/28426]

> Rename config name "spark.dynamicAllocation.shuffleTimeout" to 
> "spark.dynamicAllocation.shuffleTracking.timeout"
> 
>
> Key: SPARK-31619
> URL: https://issues.apache.org/jira/browse/SPARK-31619
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.0.0
>
>
> The "spark.dynamicAllocation.shuffleTimeout" configuration only takes effect 
> if "spark.dynamicAllocation.shuffleTracking.enabled" is true,  so we should 
> re-namespace that configuration so that it's nested under the 
> "shuffleTracking" one.






[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13

2020-04-30 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097132#comment-17097132
 ] 

Sean R. Owen commented on SPARK-29292:
--

Yep, it is going to be quite a large internal change even to compile, and it will 
change the runtime signatures of most of the Scala API. That much will still be 
mostly source-compatible for programs, and binary compatibility has never been 
guaranteed across Scala versions. I think we will largely have to 'take the 
hit' to performance where some of these changes introduce a copy, as the change 
will otherwise be very hard to manage across 2.12 vs 2.13. Maybe we have to 
rewrite some key internal code to avoid a copy where it matters. I take back the 
comment above; ArrayBuffer.toSeq almost certainly can't return an immutable 
wrapper, as it can't let underlying changes be reflected in the 'immutable' 
.toSeq.
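
For illustration, a minimal sketch (not Spark code) of the 2.13 behaviour under
discussion:

{code}
// On Scala 2.13, scala.Seq is scala.collection.immutable.Seq, so a mutable buffer
// no longer conforms to it and .toSeq has to copy to keep the result immutable.
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(1, 2, 3)
val s: Seq[Int] = buf.toSeq // compiles on both 2.12 and 2.13; copies on 2.13
buf += 4
s.length                    // still 3 on 2.13: the copy is not a live view
{code}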

> Fix internal usages of mutable collection for Seq in 2.13
> -
>
> Key: SPARK-29292
> URL: https://issues.apache.org/jira/browse/SPARK-29292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a 
> simpler subset. 
> In 2.13, a mutable collection can't be returned as a 
> {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that 
> still works on 2.12.
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467:
>  type mismatch;
>  found   : Seq[String] (in scala.collection) 
>  required: Seq[String] (in scala.collection.immutable) 
> {code}






[jira] [Reopened] (SPARK-23172) Expand the ReorderJoin rule to handle Project nodes

2020-04-30 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro reopened SPARK-23172:
--

> Expand the ReorderJoin rule to handle Project nodes
> ---
>
> Key: SPARK-23172
> URL: https://issues.apache.org/jira/browse/SPARK-23172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>  Labels: bulk-closed
>
> The current `ReorderJoin` optimizer rule cannot flatten a `Join -> 
> Project -> Join` pattern because `ExtractFiltersAndInnerJoins`
> doesn't handle `Project` nodes. So, the current master cannot reorder the joins 
> in the query below:
> {code}
> val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id 
> % 10 AS k2", "id AS v1")
> val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
> val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
> val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
> df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)
> == Analyzed Logical Plan ==
> k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: 
> bigint
> Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
> +- Join Inner, (k2#5L = k2#31L)
>:- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
>:  +- Join Inner, (k1#4L = k1#23L)
>: :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
>: :  +- Join Inner, (k0#3L = k0#15L)
>: : :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % 
> cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0
> L AS v1#6L]
>: : :  +- Range (0, 100, step=1, splits=Some(4))
>: : +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
>: :+- Range (0, 10, step=1, splits=Some(4))
>: +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
>:+- Range (0, 10, step=1, splits=Some(4))
>+- Project [id#28L AS k2#31L, id#28L AS v4#32L]
>   +- Range (0, 10, step=1, splits=Some(4))
> {code}






[jira] [Resolved] (SPARK-23172) Expand the ReorderJoin rule to handle Project nodes

2020-04-30 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-23172.
--
Resolution: Won't Fix

> Expand the ReorderJoin rule to handle Project nodes
> ---
>
> Key: SPARK-23172
> URL: https://issues.apache.org/jira/browse/SPARK-23172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>  Labels: bulk-closed
>
> The current `ReorderJoin` optimizer rule cannot flatten a `Join -> 
> Project -> Join` pattern because `ExtractFiltersAndInnerJoins`
> doesn't handle `Project` nodes. So, the current master cannot reorder the joins 
> in the query below:
> {code}
> val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id 
> % 10 AS k2", "id AS v1")
> val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
> val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
> val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
> df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)
> == Analyzed Logical Plan ==
> k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: 
> bigint
> Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
> +- Join Inner, (k2#5L = k2#31L)
>:- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
>:  +- Join Inner, (k1#4L = k1#23L)
>: :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
>: :  +- Join Inner, (k0#3L = k0#15L)
>: : :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % 
> cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0
> L AS v1#6L]
>: : :  +- Range (0, 100, step=1, splits=Some(4))
>: : +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
>: :+- Range (0, 10, step=1, splits=Some(4))
>: +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
>:+- Range (0, 10, step=1, splits=Some(4))
>+- Project [id#28L AS k2#31L, id#28L AS v4#32L]
>   +- Range (0, 10, step=1, splits=Some(4))
> {code}






[jira] [Assigned] (SPARK-20629) Copy shuffle data when nodes are being shut down

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20629:


Assignee: Apache Spark

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> We decided not to do this for YARN, but for Kubernetes and similar systems 
> nodes may be shut down entirely without the ability to keep an 
> AuxiliaryService around.






[jira] [Assigned] (SPARK-20629) Copy shuffle data when nodes are being shut down

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20629:


Assignee: (was: Apache Spark)

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> We decided not to do this for YARN, but for Kubernetes and similar systems 
> nodes may be shut down entirely without the ability to keep an 
> AuxiliaryService around.






[jira] [Commented] (SPARK-20629) Copy shuffle data when nodes are being shut down

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097103#comment-17097103
 ] 

Apache Spark commented on SPARK-20629:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28331

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> We decided not to do this for YARN, but for Kubernetes and similar systems 
> nodes may be shut down entirely without the ability to keep an 
> AuxiliaryService around.






[jira] [Resolved] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31549.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28395
[https://github.com/apache/spark/pull/28395]

> Pyspark SparkContext.cancelJobGroup do not work correctly
> -
>
> Key: SPARK-31549
> URL: https://issues.apache.org/jira/browse/SPARK-31549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Critical
> Fix For: 3.0.0
>
>
> PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
> existed for a long time. It happens because the PySpark thread is not pinned to 
> a JVM thread when invoking Java-side methods, so every PySpark API that relies 
> on Java thread-local variables works incorrectly (including 
> `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on).
> This is a serious issue. Spark 3.0 adds an experimental PySpark 'PIN_THREAD' 
> mode that addresses it, but the 'PIN_THREAD' mode has two issues:
> * It is disabled by default; an additional environment variable must be set to 
> enable it.
> * It has a memory leak issue that has not been addressed yet.
> A number of projects, such as hyperopt-spark and spark-joblib, rely on the 
> `sc.cancelJobGroup` API (they use it to stop running jobs), so it is critical 
> to address this issue, and we hope it can work under the default PySpark mode. 
> An optional approach is to implement methods like `rdd.setGroupAndCollect`.






[jira] [Assigned] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31549:


Assignee: Weichen Xu

> Pyspark SparkContext.cancelJobGroup do not work correctly
> -
>
> Key: SPARK-31549
> URL: https://issues.apache.org/jira/browse/SPARK-31549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Critical
>
> PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
> existed for a long time. It happens because the PySpark thread is not pinned to 
> a JVM thread when invoking Java-side methods, so every PySpark API that relies 
> on Java thread-local variables works incorrectly (including 
> `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on).
> This is a serious issue. Spark 3.0 adds an experimental PySpark 'PIN_THREAD' 
> mode that addresses it, but the 'PIN_THREAD' mode has two issues:
> * It is disabled by default; an additional environment variable must be set to 
> enable it.
> * It has a memory leak issue that has not been addressed yet.
> A number of projects, such as hyperopt-spark and spark-joblib, rely on the 
> `sc.cancelJobGroup` API (they use it to stop running jobs), so it is critical 
> to address this issue, and we hope it can work under the default PySpark mode. 
> An optional approach is to implement methods like `rdd.setGroupAndCollect`.






[jira] [Commented] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097073#comment-17097073
 ] 

Dongjoon Hyun commented on SPARK-31500:
---

I checked the following code snippet from the PR unit test with Apache Spark 
2.0.2 ~ 2.4.5 and updated the Affected Versions.
{code}
val bytesTest1 = "test1".getBytes
val bytesTest2 = "test2".getBytes
val df = Seq(bytesTest1, bytesTest1, bytesTest2).toDF("a")
val ret = df.select(collect_set($"a")).collect().map(r => 
r.getAs[Seq[_]](0)).head
{code}
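
A minimal JVM-level sketch (not Spark code) of why a hash-based set cannot
deduplicate byte arrays by content:

{code}
// Array[Byte] uses identity equals/hashCode on the JVM, so two arrays holding the
// same bytes are treated as distinct set elements unless compared by value.
val a = "test1".getBytes
val b = "test1".getBytes
Set(a, b).size              // 2, although the contents are identical
Set(a.toSeq, b.toSeq).size  // 1, a Seq wrapper compares element-wise
{code}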

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>  Labels: correctness
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}






[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31500:
--
Labels: correctness  (was: )

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>  Labels: correctness
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}






[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31500:
--
Affects Version/s: 2.1.3

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}






[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31500:
--
Affects Version/s: 2.0.2

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}






[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31500:
--
Affects Version/s: 2.2.3

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}






[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31500:
--
Affects Version/s: 2.3.4

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}






[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31500:
--
Affects Version/s: (was: 2.4.4)

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}






[jira] [Resolved] (SPARK-31616) Add partition event listener in ExternalCatalogWithListener

2020-04-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31616.
---
Resolution: Duplicate

> Add partition event listener in ExternalCatalogWithListener
> ---
>
> Key: SPARK-31616
> URL: https://issues.apache.org/jira/browse/SPARK-31616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wan Kun
>Priority: Minor
>
> There are many partitioned tables in our data warehouse.
> We can capture and analyze these partition change events by customizing the 
> external catalog listener.






[jira] [Assigned] (SPARK-31619) Rename config name "spark.dynamicAllocation.shuffleTimeout" to "spark.dynamicAllocation.shuffleTracking.timeout"

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31619:


Assignee: Apache Spark  (was: Xingbo Jiang)

> Rename config name "spark.dynamicAllocation.shuffleTimeout" to 
> "spark.dynamicAllocation.shuffleTracking.timeout"
> 
>
> Key: SPARK-31619
> URL: https://issues.apache.org/jira/browse/SPARK-31619
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Apache Spark
>Priority: Major
>
> The "spark.dynamicAllocation.shuffleTimeout" configuration only takes effect 
> if "spark.dynamicAllocation.shuffleTracking.enabled" is true,  so we should 
> re-namespace that configuration so that it's nested under the 
> "shuffleTracking" one.






[jira] [Commented] (SPARK-31619) Rename config name "spark.dynamicAllocation.shuffleTimeout" to "spark.dynamicAllocation.shuffleTracking.timeout"

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097046#comment-17097046
 ] 

Apache Spark commented on SPARK-31619:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/28426

> Rename config name "spark.dynamicAllocation.shuffleTimeout" to 
> "spark.dynamicAllocation.shuffleTracking.timeout"
> 
>
> Key: SPARK-31619
> URL: https://issues.apache.org/jira/browse/SPARK-31619
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
>
> The "spark.dynamicAllocation.shuffleTimeout" configuration only takes effect 
> if "spark.dynamicAllocation.shuffleTracking.enabled" is true,  so we should 
> re-namespace that configuration so that it's nested under the 
> "shuffleTracking" one.






[jira] [Assigned] (SPARK-31619) Rename config name "spark.dynamicAllocation.shuffleTimeout" to "spark.dynamicAllocation.shuffleTracking.timeout"

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31619:


Assignee: Xingbo Jiang  (was: Apache Spark)

> Rename config name "spark.dynamicAllocation.shuffleTimeout" to 
> "spark.dynamicAllocation.shuffleTracking.timeout"
> 
>
> Key: SPARK-31619
> URL: https://issues.apache.org/jira/browse/SPARK-31619
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
>
> The "spark.dynamicAllocation.shuffleTimeout" configuration only takes effect 
> if "spark.dynamicAllocation.shuffleTracking.enabled" is true,  so we should 
> re-namespace that configuration so that it's nested under the 
> "shuffleTracking" one.






[jira] [Created] (SPARK-31619) Rename config name "spark.dynamicAllocation.shuffleTimeout" to "spark.dynamicAllocation.shuffleTracking.timeout"

2020-04-30 Thread Xingbo Jiang (Jira)
Xingbo Jiang created SPARK-31619:


 Summary: Rename config name 
"spark.dynamicAllocation.shuffleTimeout" to 
"spark.dynamicAllocation.shuffleTracking.timeout"
 Key: SPARK-31619
 URL: https://issues.apache.org/jira/browse/SPARK-31619
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Xingbo Jiang
Assignee: Xingbo Jiang


The "spark.dynamicAllocation.shuffleTimeout" configuration only takes effect if 
"spark.dynamicAllocation.shuffleTracking.enabled" is true,  so we should 
re-namespace that configuration so that it's nested under the "shuffleTracking" 
one.






[jira] [Commented] (SPARK-31612) SQL Reference clean up

2020-04-30 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097013#comment-17097013
 ] 

Huaxin Gao commented on SPARK-31612:


Check the following things in the SQL Reference and fix any problems found:
1) syntax
2) typos
3) links

> SQL Reference clean up
> --
>
> Key: SPARK-31612
> URL: https://issues.apache.org/jira/browse/SPARK-31612
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> SQL Reference clean up






[jira] [Resolved] (SPARK-31612) SQL Reference clean up

2020-04-30 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31612.
--
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28417

> SQL Reference clean up
> --
>
> Key: SPARK-31612
> URL: https://issues.apache.org/jira/browse/SPARK-31612
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> SQL Reference clean up






[jira] [Commented] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-04-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096939#comment-17096939
 ] 

Dongjoon Hyun commented on SPARK-29048:
---

BTW, this was merged and reverted for `3.0.0` only, so this JIRA is irrelevant to 
the 2.4.6 release and we can ignore it for 2.4.6.

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5, 2.4.6
>Reporter: Weichen Xu
>Priority: Major
>
> The query optimizer is slow when using Column.isInCollection() with a large 
> collection.
> The query optimizer takes a long time, and on the UI all I see 
> is "Running commands". This can take from tens of minutes to 11 hours 
> depending on how many values there are.
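
For reference, a minimal sketch (hypothetical data and sizes, assuming a
SparkSession named `spark`) of the API pattern the report describes:

{code}
// Hypothetical reproduction: a large literal collection passed to isInCollection,
// which the optimizer has to fold into the query plan.
import org.apache.spark.sql.functions.col

val manyValues = (1 to 100000).map(_.toLong)
val filtered = spark.range(1000000).toDF("id")
  .filter(col("id").isInCollection(manyValues))
filtered.count()
{code}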






[jira] [Commented] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-04-30 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096935#comment-17096935
 ] 

Dongjoon Hyun commented on SPARK-29048:
---

This caused a correctness issue, SPARK-31553, which is linked to this JIRA.

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5, 2.4.6
>Reporter: Weichen Xu
>Priority: Major
>
> The query optimizer is slow when using Column.isInCollection() with a large 
> collection.
> The query optimizer takes a long time, and on the UI all I see 
> is "Running commands". This can take from tens of minutes to 11 hours 
> depending on how many values there are.






[jira] [Updated] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-04-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29048:
--
Target Version/s:   (was: 2.4.6)

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5, 2.4.6
>Reporter: Weichen Xu
>Priority: Major
>
> The query optimizer is slow when using Column.isInCollection() with a large 
> collection.
> The query optimizer takes a long time, and on the UI all I see 
> is "Running commands". This can take from tens of minutes to 11 hours 
> depending on how many values there are.






[jira] [Commented] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096930#comment-17096930
 ] 

Apache Spark commented on SPARK-31480:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/28425

> Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
> ---
>
> Key: SPARK-31480
> URL: https://issues.apache.org/jira/browse/SPARK-31480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Below is the EXPLAIN OUTPUT when using the *DSV2* 
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), 
> (col.dots#39L = 500)], Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc...,
>  PartitionFilters: [], ReadSchema: struct
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) BatchScan
> Output [1]: [col.dots#39L]
> Arguments: [col.dots#39L], 
> JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L),
>  (col.dots#39L = 500)))
> {code}
> When using *DSV1*, the output is much cleaner than the output of DSV2, 
> especially for EXPLAIN FORMATTED.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- FileScan json [col.dots#37L] Batched: false, DataFilters: 
> [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), 
> EqualTo(`col.dots`,500)], ReadSchema: struct 
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) Scan json 
> Output [1]: [col.dots#37L]
> Batched: false
> Location: InMemoryFileIndex 
> [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0]
> PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)]
> ReadSchema: struct{code}
>  
>  
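
For reference, a short sketch (hypothetical input path, assuming a SparkSession
named `spark`) that produces the two plan outputs compared above with the Spark
3.0 Dataset API:

{code}
// Compare the extended and formatted explain output for a JSON scan with a filter.
val df = spark.read.json("/tmp/example.json").filter("`col.dots` = 500")
df.explain("extended")   // EXPLAIN EXTENDED style output
df.explain("formatted")  // EXPLAIN FORMATTED style output
{code}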






[jira] [Commented] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096928#comment-17096928
 ] 

Apache Spark commented on SPARK-31480:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/28425

> Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
> ---
>
> Key: SPARK-31480
> URL: https://issues.apache.org/jira/browse/SPARK-31480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Below is the EXPLAIN OUTPUT when using the *DSV2* 
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), 
> (col.dots#39L = 500)], Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc...,
>  PartitionFilters: [], ReadSchema: struct
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) BatchScan
> Output [1]: [col.dots#39L]
> Arguments: [col.dots#39L], 
> JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L),
>  (col.dots#39L = 500)))
> {code}
> When using *DSV1*, the output is much cleaner than the output of DSV2, 
> especially for EXPLAIN FORMATTED.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- FileScan json [col.dots#37L] Batched: false, DataFilters: 
> [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), 
> EqualTo(`col.dots`,500)], ReadSchema: struct 
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) Scan json 
> Output [1]: [col.dots#37L]
> Batched: false
> Location: InMemoryFileIndex 
> [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0]
> PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)]
> ReadSchema: struct{code}
>  
>  






[jira] [Assigned] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31480:


Assignee: Apache Spark

> Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
> ---
>
> Key: SPARK-31480
> URL: https://issues.apache.org/jira/browse/SPARK-31480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Below is the EXPLAIN OUTPUT when using the *DSV2* 
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), 
> (col.dots#39L = 500)], Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc...,
>  PartitionFilters: [], ReadSchema: struct
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) BatchScan
> Output [1]: [col.dots#39L]
> Arguments: [col.dots#39L], 
> JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L),
>  (col.dots#39L = 500)))
> {code}
> When using *DSV1*, the output is much cleaner than the output of DSV2, 
> especially for EXPLAIN FORMATTED.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- FileScan json [col.dots#37L] Batched: false, DataFilters: 
> [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), 
> EqualTo(`col.dots`,500)], ReadSchema: struct 
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) Scan json 
> Output [1]: [col.dots#37L]
> Batched: false
> Location: InMemoryFileIndex 
> [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0]
> PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)]
> ReadSchema: struct{code}
>  
>  






[jira] [Assigned] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31480:


Assignee: (was: Apache Spark)

> Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
> ---
>
> Key: SPARK-31480
> URL: https://issues.apache.org/jira/browse/SPARK-31480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Below is the EXPLAIN OUTPUT when using the *DSV2* 
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), 
> (col.dots#39L = 500)], Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc...,
>  PartitionFilters: [], ReadSchema: struct
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) BatchScan
> Output [1]: [col.dots#39L]
> Arguments: [col.dots#39L], 
> JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L),
>  (col.dots#39L = 500)))
> {code}
> When using *DSV1*, the output is much cleaner than the output of DSV2, 
> especially for EXPLAIN FORMATTED.
> *Output of EXPLAIN EXTENDED* 
> {code:java}
> +- FileScan json [col.dots#37L] Batched: false, DataFilters: 
> [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: 
> InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), 
> EqualTo(`col.dots`,500)], ReadSchema: struct 
> {code}
> *Output of EXPLAIN FORMATTED* 
> {code:java}
>  (1) Scan json 
> Output [1]: [col.dots#37L]
> Batched: false
> Location: InMemoryFileIndex 
> [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0]
> PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)]
> ReadSchema: struct{code}
>  
>  






[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13

2020-04-30 Thread Guillaume Martres (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096893#comment-17096893
 ] 

Guillaume Martres commented on SPARK-29292:
---

If it's immutable it's fine, yeah, but it seems that Spark internally uses a 
bunch of ArrayBuffers which do need to be copied to be made into a scala.Seq 
now. On top of that, users of Spark might also have to add a bunch of 
potentially-copying .toSeq calls to call Spark methods. For example, I had some 
code that did `sparkContext.parallelize(rdd.take(1000))`, which still compiles 
but with a deprecation warning because take returns an Array:

> warning: method copyArrayToImmutableIndexedSeq in class LowPriorityImplicits2 
> is deprecated (since 2.13.0): Implicit conversions from Array to 
> immutable.IndexedSeq are implemented by copying; Use the more efficient 
> non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call

So depending on how common this sort of thing is, it might make sense to 
change SparkContext#parallelize and others to take a scala.collection.Seq 
instead of a Seq.
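
A minimal sketch of the two options the deprecation warning points at, illustrative only: it assumes Scala 2.13, an existing SparkContext `sc`, and the `rdd` from the example above.
{code:scala}
import scala.collection.immutable.ArraySeq

// Illustrative only: the Array returned by take(1000) has to become a Seq.
val first1000: Array[Int] = rdd.take(1000)

// Copies the array (what the deprecated implicit conversion does):
val copied = sc.parallelize(first1000.toIndexedSeq)

// Wraps the array without copying, as the warning suggests:
val wrapped = sc.parallelize(ArraySeq.unsafeWrapArray(first1000))
{code}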

> Fix internal usages of mutable collection for Seq in 2.13
> -
>
> Key: SPARK-29292
> URL: https://issues.apache.org/jira/browse/SPARK-29292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a 
> simpler subset. 
> In 2.13, a mutable collection can't be returned as a 
> {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that 
> still works on 2.12.
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467:
>  type mismatch;
>  found   : Seq[String] (in scala.collection) 
>  required: Seq[String] (in scala.collection.immutable) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13

2020-04-30 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096879#comment-17096879
 ] 

Sean R. Owen commented on SPARK-29292:
--

Heh, congratulations, it is going to be a huge change. Why would it entail 
copies though? A mutable Seq's .toSeq should just return itself, as it's 
trivially immutable. But yes, that's what I'm worried about: whether the extra 
.toSeq and .toMap calls added to make it cross-compile introduce some overhead.
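
For what it's worth, a small sketch of the behaviour under discussion (illustrative only): on 2.13, toSeq must return an immutable.Seq, so a mutable buffer gets copied, whereas on 2.12 it is returned as-is.
{code:scala}
import scala.collection.mutable.ArrayBuffer

// Illustrative only: cross-version behaviour of .toSeq on a mutable buffer.
val buf = ArrayBuffer(1, 2, 3)
val s: Seq[Int] = buf.toSeq   // 2.12: the buffer itself (no copy); 2.13: an immutable copy

buf(0) = 99
println(s(0))                 // 2.12 prints 99 (shared storage); 2.13 prints 1 (copied)
{code}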

> Fix internal usages of mutable collection for Seq in 2.13
> -
>
> Key: SPARK-29292
> URL: https://issues.apache.org/jira/browse/SPARK-29292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a 
> simpler subset. 
> In 2.13, a mutable collection can't be returned as a 
> {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that 
> still works on 2.12.
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467:
>  type mismatch;
>  found   : Seq[String] (in scala.collection) 
>  required: Seq[String] (in scala.collection.immutable) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13

2020-04-30 Thread Guillaume Martres (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096876#comment-17096876
 ] 

Guillaume Martres commented on SPARK-29292:
---

Thanks [~srowen], I got Spark to compile with Scala 2.13.2 based on this 
branch, cf 
https://issues.apache.org/jira/browse/SPARK-25075?focusedCommentId=17096870=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17096870.
 But I think that the use of .toSeq should be reconsidered since it means 
copying on Scala 2.13. Instead, I think that usages of scala.Seq should be 
replaced by scala.collection.Seq when it makes sense to do so.

> Fix internal usages of mutable collection for Seq in 2.13
> -
>
> Key: SPARK-29292
> URL: https://issues.apache.org/jira/browse/SPARK-29292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a 
> simpler subset. 
> In 2.13, a mutable collection can't be returned as a 
> {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that 
> still works on 2.12.
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467:
>  type mismatch;
>  found   : Seq[String] (in scala.collection) 
>  required: Seq[String] (in scala.collection.immutable) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13

2020-04-30 Thread Guillaume Martres (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096876#comment-17096876
 ] 

Guillaume Martres edited comment on SPARK-29292 at 4/30/20, 6:47 PM:
-

Thanks [~srowen], I got Spark to compile with Scala 2.13.2 based on this 
branch, cf 
https://issues.apache.org/jira/browse/SPARK-25075?focusedCommentId=17096870=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17096870.
 But I think that the use of .toSeq should be reconsidered since it means 
copying on Scala 2.13 for mutable collections. Instead, I think that usages of 
scala.Seq should be replaced by scala.collection.Seq when it makes sense to do 
so.


was (Author: smarter):
Thanks [~srowen], I got spark to compile with Scala 2.13.2 based on this 
branch, cf 
https://issues.apache.org/jira/browse/SPARK-25075?focusedCommentId=17096870=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17096870.
 But I think that the use of .toSeq should be reconsidered since it means 
copying on Scala 2.13. Instead, I think that usages of scala.Seq should be 
replaced by scala.collection.Seq when it makes sense to do so.

> Fix internal usages of mutable collection for Seq in 2.13
> -
>
> Key: SPARK-29292
> URL: https://issues.apache.org/jira/browse/SPARK-29292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a 
> simpler subset. 
> In 2.13, a mutable collection can't be returned as a 
> {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that 
> still works on 2.12.
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467:
>  type mismatch;
>  found   : Seq[String] (in scala.collection) 
>  required: Seq[String] (in scala.collection.immutable) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2020-04-30 Thread Guillaume Martres (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096870#comment-17096870
 ] 

Guillaume Martres commented on SPARK-25075:
---

Based on [~srowen]'s branch from SPARK-29292, I was able to get spark-core and 
its dependencies to compile with Scala 2.13.2 after making a few changes. The 
result is at https://github.com/smarter/spark/tree/scala-2.13, I've published 
it at https://bintray.com/smarter/maven for my own needs but I don't intend to 
update it or work on it further.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-04-30 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-31608:

Description: 
This is a follow-up for the work done by Hieu Huynh in 2019.

Add a new class HybridKVStore to make the history server faster when loading 
event files. When rebuilding the application state from event logs, 
HybridKVStore will first write data to an in-memory store and have a 
background thread that keeps pushing the changes to LevelDB.

I ran some tests on 3.0.1 on macOS:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
For example, when loading a 1g file, HybridKVStore takes 23s to parse (that 
means users only need to wait 23s to see the UI), and the background thread 
then runs for another 17s to copy the data to LevelDB. After that, the 
in-memory store can be closed and the entire store lives in LevelDB. So in 
general, it gives a 3x-4x UI loading speed improvement.

  was:
This is a follow-up for the work done by Hieu Huynh in 2019.

Add a new class HybridKVStore to make the history server faster when loading 
event files. When writing to this kvstore, it will first write to an in-memory 
store and having a background thread that keeps pushing the change to levelDB.

I ran some tests on 3.0.1 on mac os:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

 


> Add a hybrid KVStore to make UI loading faster
> --
>
> Key: SPARK-31608
> URL: https://issues.apache.org/jira/browse/SPARK-31608
> Project: Spark
>  Issue Type: Story
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Priority: Major
>
> This is a follow-up for the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server faster when loading 
> event files. When rebuilding the application state from event logs, 
> HybridKVStore will first write data to an in-memory store and have a 
> background thread that keeps pushing the changes to LevelDB.
> I ran some tests on 3.0.1 on macOS:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
> leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
> leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
> leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
> For example, when loading a 1g file, HybridKVStore takes 23s to parse (that 
> means users only need to wait 23s to see the UI), and the background thread 
> then runs for another 17s to copy the data to LevelDB. After that, the 
> in-memory store can be closed and the entire store lives in LevelDB. So in 
> general, it gives a 3x-4x UI loading speed improvement.
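
A rough sketch of the hybrid-store idea described above. This is illustrative only: the class and method names here are assumptions, not the actual HybridKVStore API proposed for the history server.
{code:scala}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.concurrent.TrieMap

// Illustrative sketch: serve reads from memory while a background thread
// drains writes into a slower persistent store (standing in for LevelDB).
class HybridStoreSketch[K, V](persistent: TrieMap[K, V]) {
  private val inMemory = TrieMap.empty[K, V]
  private val pending = new ConcurrentLinkedQueue[(K, V)]()
  @volatile private var running = true

  private val flusher = new Thread(() => {
    while (running || !pending.isEmpty) {
      val entry = pending.poll()
      if (entry != null) persistent.put(entry._1, entry._2) else Thread.sleep(10)
    }
  })
  flusher.setDaemon(true)
  flusher.start()

  def write(k: K, v: V): Unit = { inMemory.put(k, v); pending.add((k, v)) }
  def read(k: K): Option[V] = inMemory.get(k).orElse(persistent.get(k))

  // Once the backlog is drained, the in-memory copy can be dropped and all
  // reads fall through to the persistent store.
  def close(): Unit = { running = false; flusher.join(); inMemory.clear() }
}
{code}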



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30865) Refactor DateTimeUtils

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30865:


Assignee: (was: Apache Spark)

> Refactor DateTimeUtils
> --
>
> Key: SPARK-30865
> URL: https://issues.apache.org/jira/browse/SPARK-30865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> * Move TimeZoneUTC and TimeZoneGMT to DateTimeTestUtils
> * Remove TimeZoneGMT because it is equal to UTC
> * Use ZoneId.systemDefault() instead of defaultTimeZone().toZoneId
> * Alias SQLDate & SQLTimestamp to internal types of DateType and TimestampType
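
A tiny illustration of the ZoneId.systemDefault() change mentioned above, using plain java.time rather than Spark internals; defaultTimeZone() in the bullet refers to the existing DateTimeUtils helper.
{code:scala}
import java.time.ZoneId
import java.util.TimeZone

// Both expressions resolve to the JVM default time zone; ZoneId.systemDefault()
// is the direct java.time call that the refactor prefers.
val viaTimeZone: ZoneId = TimeZone.getDefault.toZoneId
val direct: ZoneId = ZoneId.systemDefault()
assert(viaTimeZone == direct)
{code}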



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30865) Refactor DateTimeUtils

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30865:


Assignee: Apache Spark

> Refactor DateTimeUtils
> --
>
> Key: SPARK-30865
> URL: https://issues.apache.org/jira/browse/SPARK-30865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> * Move TimeZoneUTC and TimeZoneGMT to DateTimeTestUtils
> * Remove TimeZoneGMT because it is equal to UTC
> * Use ZoneId.systemDefault() instead of defaultTimeZone().toZoneId
> * Alias SQLDate & SQLTimestamp to internal types of DateType and TimestampType



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31574) Schema evolution in spark while using the storage format as parquet

2020-04-30 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096805#comment-17096805
 ] 

Pablo Langa Blanco commented on SPARK-31574:


Spark has functionality to read multiple Parquet files with different 
*compatible* schemas, but it is disabled by default.

[https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging]

The problem in the example you propose is that int and string are incompatible 
data types, so schema merging is not going to work.
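
A minimal sketch of the schema-merging option from the linked docs, assuming an existing SparkSession `spark`; the path is a placeholder. Again, this only helps when the schemas are compatible.
{code:scala}
// Read Parquet files written with different but compatible schemas.
val df = spark.read
  .option("mergeSchema", "true")        // per-read option
  .parquet("/path/to/table")            // placeholder path

// Or enable schema merging globally:
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
{code}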

> Schema evolution in spark while using the storage format as parquet
> ---
>
> Key: SPARK-31574
> URL: https://issues.apache.org/jira/browse/SPARK-31574
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: sharad Gupta
>Priority: Major
>
> Hi Team,
>  
> Use case:
> Suppose there is a table T1 with a column C1 whose datatype is int in schema 
> version 1. When first onboarding table T1, I wrote a couple of Parquet files 
> with schema version 1, with Parquet as the underlying file format.
> In schema version 2 the datatype of C1 changed from int to string, so new 
> data is written with schema version 2 in Parquet.
> As a result, some Parquet files are written with schema version 1 and some 
> with schema version 2.
> Problem statement:
> 1. We are not able to execute the below command from Spark SQL:
> ```Alter table Table T1 change C1 C1 string```
> 2. As a workaround I go to Hive and alter the table to change the datatype 
> (this is supported in Hive), then try to read the data in Spark. It gives me 
> this error:
> ```
> Caused by: java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>   at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
>   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)```
>  
> 3. I suspect this happens because the underlying Parquet file was written 
> with an integer type, while we are reading it through a table whose column 
> has been changed to string.
> How to reproduce this:
> Spark SQL
> 1. Create a table from Spark SQL with one column of datatype int, stored as 
> Parquet.
> 2. Put some data into the table.
> 3. You can see the data if you select from the table.
> Hive
> 1. Change the datatype from int to string with an ALTER command.
> 2. Try to read the data. You will be able to read it here even after changing 
> the datatype.
> Spark SQL
> 1. Try to read the data from here; now you will see the error.
> The question is how to handle schema evolution in Spark while using Parquet 
> as the storage format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31357) SPIP: Catalog API for view metadata

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31357:


Assignee: Apache Spark

> SPIP: Catalog API for view metadata
> ---
>
> Key: SPARK-31357
> URL: https://issues.apache.org/jira/browse/SPARK-31357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Major
>  Labels: SPIP
>
> SPARK-24252 added a catalog plugin system and `TableCatalog` API that 
> provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view 
> metadata.
> Details in [SPIP 
> document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31517:


Assignee: Apache Spark

> SparkR::orderBy with multiple columns descending produces error
> ---
>
> Key: SPARK-31517
> URL: https://issues.apache.org/jira/browse/SPARK-31517
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.5
> Environment: Databricks Runtime 6.5
>Reporter: Ross Bowen
>Assignee: Apache Spark
>Priority: Major
>
> When specifying two columns within an `orderBy()` call, in an attempt to get 
> an ordering by both columns in descending order, an error is returned.
> {code:java}
> library(magrittr) 
> library(SparkR) 
> cars <- cbind(model = rownames(mtcars), mtcars) 
> carsDF <- createDataFrame(cars) 
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> desc(column("mpg")), desc(column("disp"))))) %>% 
>   head() {code}
> This returns an error:
> {code:java}
>  Error in ns[[i]] : subscript out of bounds{code}
> This seems to be related to the more general issue that the following code, 
> which excludes the use of the `desc()` function, also fails:
> {code:java}
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> column("mpg"), column("disp" %>% 
>   head(){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096755#comment-17096755
 ] 

Apache Spark commented on SPARK-31517:
--

User 'MichaelChirico' has created a pull request for this issue:
https://github.com/apache/spark/pull/28386

> SparkR::orderBy with multiple columns descending produces error
> ---
>
> Key: SPARK-31517
> URL: https://issues.apache.org/jira/browse/SPARK-31517
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.5
> Environment: Databricks Runtime 6.5
>Reporter: Ross Bowen
>Priority: Major
>
> When specifying two columns within an `orderBy()` call, in an attempt to get 
> an ordering by both columns in descending order, an error is returned.
> {code:java}
> library(magrittr) 
> library(SparkR) 
> cars <- cbind(model = rownames(mtcars), mtcars) 
> carsDF <- createDataFrame(cars) 
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> desc(column("mpg")), desc(column("disp"))))) %>% 
>   head() {code}
> This returns an error:
> {code:java}
>  Error in ns[[i]] : subscript out of bounds{code}
> This seems to be related to the more general issue that the following code, 
> which excludes the use of the `desc()` function, also fails:
> {code:java}
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> column("mpg"), column("disp" %>% 
>   head(){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31590) The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31590:


Assignee: (was: Apache Spark)

> The filter used by Metadata-only queries should not have Unevaluable
> 
>
> Key: SPARK-31590
> URL: https://issues.apache.org/jira/browse/SPARK-31590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Priority: Trivial
>
> When using SPARK-23877 (metadata-only queries), some SQL queries fail with errors.
> code:
> {code:scala}
> sql("set spark.sql.optimizer.metadataOnly=true")
> sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
> PARTITIONED BY (d ,h)")
> sql("""
> |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h)
> |SELECT 1,'2020-01-01','23'
> |UNION ALL
> |SELECT 2,'2020-01-02','01'
> |UNION ALL
> |SELECT 3,'2020-01-02','02'
> """.stripMargin)
> sql(
>   s"""
>  |SELECT d, MAX(h) AS h
>  |FROM test_tbl
>  |WHERE d= (
>  |  SELECT MAX(d) AS d
>  |  FROM test_tbl
>  |)
>  |GROUP BY d
> """.stripMargin).collect()
> {code}
> Exception:
> {code:java}
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#48 []
> ...
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180)
> {code}
> optimizedPlan:
> {code:java}
> Aggregate [d#245], [d#245, max(h#246) AS h#243]
> +- Project [d#245, h#246]
>+- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 []))
>   :  +- Aggregate [max(d#245) AS d#241]
>   : +- LocalRelation , [d#245]
>   +- Relation[a#244,d#245,h#246] parquet
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31517:


Assignee: (was: Apache Spark)

> SparkR::orderBy with multiple columns descending produces error
> ---
>
> Key: SPARK-31517
> URL: https://issues.apache.org/jira/browse/SPARK-31517
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.5
> Environment: Databricks Runtime 6.5
>Reporter: Ross Bowen
>Priority: Major
>
> When specifying two columns within an `orderBy()` call, in an attempt to get 
> an ordering by both columns in descending order, an error is returned.
> {code:java}
> library(magrittr) 
> library(SparkR) 
> cars <- cbind(model = rownames(mtcars), mtcars) 
> carsDF <- createDataFrame(cars) 
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> desc(column("mpg")), desc(column("disp"))))) %>% 
>   head() {code}
> This returns an error:
> {code:java}
>  Error in ns[[i]] : subscript out of bounds{code}
> This seems to be related to the more general issue that the following code, 
> which excludes the use of the `desc()` function, also fails:
> {code:java}
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> column("mpg"), column("disp" %>% 
>   head(){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31593) Remove unnecessary streaming query progress update

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096745#comment-17096745
 ] 

Apache Spark commented on SPARK-31593:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/28391

> Remove unnecessary streaming query progress update
> --
>
> Key: SPARK-31593
> URL: https://issues.apache.org/jira/browse/SPARK-31593
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Genmao Yu
>Priority: Minor
>
> The Structured Streaming progress reporter will always report an `empty` 
> progress when there is no new data. As designed, we should only provide 
> progress updates every 10s (default) when there is no new data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31571) don't use stop(paste to build R errors

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31571:


Assignee: Apache Spark

> don't use stop(paste to build R errors
> --
>
> Key: SPARK-31571
> URL: https://issues.apache.org/jira/browse/SPARK-31571
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Assignee: Apache Spark
>Priority: Minor
>
> I notice for example this:
> stop(paste0("Arrow optimization does not support 'dapplyCollect' yet. Please 
> disable ",
> "Arrow optimization or use 'collect' and 'dapply' APIs instead."))
> paste0 is totally unnecessary here -- stop itself uses ... (varargs) with '' 
> as the default separator, i.e., the above is equivalent to:
> stop("Arrow optimization does not support 'dapplyCollect' yet. Please disable 
> ",
>   "Arrow optimization or use 'collect' and 'dapply' APIs instead.")
> More generally, for portability, this will make it more difficult for 
> user-contributed translations because the standard set of tools for doing 
> this (namely tools::update_pkg_po(".")) would fail to capture these messages 
> as candidates for translation.
> In fact, it's completely preferable IMO to keep the entire stop("") message 
> as a single string -- I've found that breaking the string across multiple 
> lines makes translation across different languages with different grammars 
> quite difficult. I understand there are lint style constraints, however, so I 
> wouldn't press on that for now.
> If formatting is needed, I recommend using stop(gettextf(...)) instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31609) Add VarianceThresholdSelector to PySpark

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096744#comment-17096744
 ] 

Apache Spark commented on SPARK-31609:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28409

> Add VarianceThresholdSelector to PySpark
> 
>
> Key: SPARK-31609
> URL: https://issues.apache.org/jira/browse/SPARK-31609
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add VarianceThresholdSelector to PySpark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31603) AFT uses common functions in RDDLossFunction

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31603:


Assignee: Apache Spark

> AFT uses common functions in RDDLossFunction
> 
>
> Key: SPARK-31603
> URL: https://issues.apache.org/jira/browse/SPARK-31603
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> The logic for optimizing AFT is quite similar to that of other algorithms 
> like LiR and LoR.
> We should reuse the common functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31603) AFT uses common functions in RDDLossFunction

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096751#comment-17096751
 ] 

Apache Spark commented on SPARK-31603:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/28404

> AFT uses common functions in RDDLossFunction
> 
>
> Key: SPARK-31603
> URL: https://issues.apache.org/jira/browse/SPARK-31603
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The logic for optimizing AFT is quite similar to that of other algorithms 
> like LiR and LoR.
> We should reuse the common functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31571) don't use stop(paste to build R errors

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096748#comment-17096748
 ] 

Apache Spark commented on SPARK-31571:
--

User 'MichaelChirico' has created a pull request for this issue:
https://github.com/apache/spark/pull/28365

> don't use stop(paste to build R errors
> --
>
> Key: SPARK-31571
> URL: https://issues.apache.org/jira/browse/SPARK-31571
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Priority: Minor
>
> I notice for example this:
> stop(paste0("Arrow optimization does not support 'dapplyCollect' yet. Please 
> disable ",
> "Arrow optimization or use 'collect' and 'dapply' APIs instead."))
> paste0 is totally unnecessary here -- stop itself uses ... (varargs) with '' 
> as the default separator, i.e., the above is equivalent to:
> stop("Arrow optimization does not support 'dapplyCollect' yet. Please disable 
> ",
>   "Arrow optimization or use 'collect' and 'dapply' APIs instead.")
> More generally, for portability, this will make it more difficult for 
> user-contributed translations because the standard set of tools for doing 
> this (namely tools::update_pkg_po(".")) would fail to capture these messages 
> as candidates for translation.
> In fact, it's completely preferable IMO to keep the entire stop("") message 
> as a single string -- I've found that breaking the string across multiple 
> lines makes translation across different languages with different grammars 
> quite difficult. I understand there are lint style constraints, however, so I 
> wouldn't press on that for now.
> If formatting is needed, I recommend using stop(gettextf(...)) instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31571) don't use stop(paste to build R errors

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31571:


Assignee: (was: Apache Spark)

> don't use stop(paste to build R errors
> --
>
> Key: SPARK-31571
> URL: https://issues.apache.org/jira/browse/SPARK-31571
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Priority: Minor
>
> I notice for example this:
> stop(paste0("Arrow optimization does not support 'dapplyCollect' yet. Please 
> disable ",
> "Arrow optimization or use 'collect' and 'dapply' APIs instead."))
> paste0 is totally unnecessary here -- stop itself uses ... (varargs) with '' 
> as the default separator, i.e., the above is equivalent to:
> stop("Arrow optimization does not support 'dapplyCollect' yet. Please disable 
> ",
>   "Arrow optimization or use 'collect' and 'dapply' APIs instead.")
> More generally, for portability, this will make it more difficult for 
> user-contributed translations because the standard set of tools for doing 
> this (namely tools::update_pkg_po(".")) would fail to capture these messages 
> as candidates for translation.
> In fact, it's completely preferable IMO to keep the entire stop("") message 
> as a single string -- I've found that breaking the string across multiple 
> lines makes translation across different languages with different grammars 
> quite difficult. I understand there are lint style constraints, however, so I 
> wouldn't press on that for now.
> If formatting is needed, I recommend using stop(gettextf(...)) instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31603) AFT uses common functions in RDDLossFunction

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31603:


Assignee: (was: Apache Spark)

> AFT uses common functions in RDDLossFunction
> 
>
> Key: SPARK-31603
> URL: https://issues.apache.org/jira/browse/SPARK-31603
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The logic for optimizing AFT is quite similar to that of other algorithms 
> like LiR and LoR.
> We should reuse the common functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31609) Add VarianceThresholdSelector to PySpark

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31609:


Assignee: (was: Apache Spark)

> Add VarianceThresholdSelector to PySpark
> 
>
> Key: SPARK-31609
> URL: https://issues.apache.org/jira/browse/SPARK-31609
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add VarianceThresholdSelector to PySpark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31609) Add VarianceThresholdSelector to PySpark

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31609:


Assignee: Apache Spark

> Add VarianceThresholdSelector to PySpark
> 
>
> Key: SPARK-31609
> URL: https://issues.apache.org/jira/browse/SPARK-31609
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Add VarianceThresholdSelector to PySpark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31570) gapply and dapply docs should be more aligned, possibly combined

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096750#comment-17096750
 ] 

Apache Spark commented on SPARK-31570:
--

User 'MichaelChirico' has created a pull request for this issue:
https://github.com/apache/spark/pull/28362

> gapply and dapply docs should be more aligned, possibly combined
> 
>
> Key: SPARK-31570
> URL: https://issues.apache.org/jira/browse/SPARK-31570
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Priority: Minor
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-31568
> There, we combined gapply and gapplyCollect to make it easier to sync 
> arguments between those Rds.
> dapply and dapplyCollect are also sufficiently similar that they could be on 
> the same Rd.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27188) FileStreamSink: provide a new option to have retention on output files

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096746#comment-17096746
 ] 

Apache Spark commented on SPARK-27188:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/28363

> FileStreamSink: provide a new option to have retention on output files
> --
>
> Key: SPARK-27188
> URL: https://issues.apache.org/jira/browse/SPARK-27188
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In SPARK-24295 we indicated that various end users are struggling with huge 
> FileStreamSink metadata logs. Unfortunately, given we have arbitrary readers 
> which leverage the metadata log to determine which files can be safely read 
> (to ensure 'exactly-once' semantics), pruning the metadata log is not trivial 
> to implement.
> While we may be able to check for deleted output files in FileStreamSink and 
> get rid of them when compacting metadata, that operation would add overhead 
> to the running query. (I'll try to address this via another issue though.)
> We can still get a time-to-live (TTL) for output files from end users and 
> filter out files in the metadata so that the metadata does not grow linearly. 
> Filtered-out files will also no longer be seen by reader queries which 
> leverage File(Stream)Source.
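
A minimal sketch of the TTL filtering described above. This is illustrative only; the entry type and helper names are assumptions, not the FileStreamSink metadata format.
{code:scala}
import java.util.concurrent.TimeUnit

// Hypothetical metadata entry for an output file (not Spark's actual SinkFileStatus).
final case class OutputFileEntry(path: String, modificationTimeMs: Long)

// Keep only entries younger than the user-supplied retention when compacting.
def filterByRetention(
    entries: Seq[OutputFileEntry],
    retentionMs: Long,
    nowMs: Long = System.currentTimeMillis()): Seq[OutputFileEntry] =
  entries.filter(e => nowMs - e.modificationTimeMs <= retentionMs)

// e.g. retain roughly the last 7 days of output files in the metadata log:
// val kept = filterByRetention(allEntries, TimeUnit.DAYS.toMillis(7))
{code}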



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31593) Remove unnecessary streaming query progress update

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31593:


Assignee: Apache Spark

> Remove unnecessary streaming query progress update
> --
>
> Key: SPARK-31593
> URL: https://issues.apache.org/jira/browse/SPARK-31593
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>
> The Structured Streaming progress reporter will always report an `empty` 
> progress when there is no new data. As designed, we should only provide 
> progress updates every 10s (default) when there is no new data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31566) Add SQL Rest API Documentation

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31566:


Assignee: (was: Apache Spark)

> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> The SQL REST API exposes query execution metrics as a public API. Its 
> documentation will be useful for end users. 
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> query details.
> 3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- Details for the given query.
> 2- ?details=[true|false (default)] lists metric details in addition to given 
> query details.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31590) The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096752#comment-17096752
 ] 

Apache Spark commented on SPARK-31590:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/28383

> The filter used by Metadata-only queries should not have Unevaluable
> 
>
> Key: SPARK-31590
> URL: https://issues.apache.org/jira/browse/SPARK-31590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Priority: Trivial
>
> When using SPARK-23877 (metadata-only queries), some SQL queries fail with errors.
> code:
> {code:scala}
> sql("set spark.sql.optimizer.metadataOnly=true")
> sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
> PARTITIONED BY (d ,h)")
> sql("""
> |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h)
> |SELECT 1,'2020-01-01','23'
> |UNION ALL
> |SELECT 2,'2020-01-02','01'
> |UNION ALL
> |SELECT 3,'2020-01-02','02'
> """.stripMargin)
> sql(
>   s"""
>  |SELECT d, MAX(h) AS h
>  |FROM test_tbl
>  |WHERE d= (
>  |  SELECT MAX(d) AS d
>  |  FROM test_tbl
>  |)
>  |GROUP BY d
> """.stripMargin).collect()
> {code}
> Exception:
> {code:java}
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#48 []
> ...
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180)
> {code}
> optimizedPlan:
> {code:java}
> Aggregate [d#245], [d#245, max(h#246) AS h#243]
> +- Project [d#245, h#246]
>+- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 []))
>   :  +- Aggregate [max(d#245) AS d#241]
>   : +- LocalRelation , [d#245]
>   +- Relation[a#244,d#245,h#246] parquet
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31570) gapply and dapply docs should be more aligned, possibly combined

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31570:


Assignee: Apache Spark

> gapply and dapply docs should be more aligned, possibly combined
> 
>
> Key: SPARK-31570
> URL: https://issues.apache.org/jira/browse/SPARK-31570
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Assignee: Apache Spark
>Priority: Minor
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-31568
> There, we combined gapply and gapplyCollect to make it easier to sync 
> arguments between those Rds.
> dapply and dapplyCollect are also sufficiently similar that they could be on 
> the same Rd.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31566) Add SQL Rest API Documentation

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31566:


Assignee: Apache Spark

> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Assignee: Apache Spark
>Priority: Major
>
> The SQL REST API exposes query execution metrics as a public API. Its 
> documentation will be useful for end users. 
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> query details.
> 3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- Details for the given query.
> 2- ?details=[true|false (default)] lists metric details in addition to given 
> query details.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096747#comment-17096747
 ] 

Apache Spark commented on SPARK-31566:
--

User 'erenavsarogullari' has created a pull request for this issue:
https://github.com/apache/spark/pull/28354

> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> The SQL REST API exposes query execution metrics as a public API. Its 
> documentation will be useful for end users. 
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> query details.
> 3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- Details for the given query.
> 2- ?details=[true|false (default)] lists metric details in addition to given 
> query details.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26199) Long expressions cause mutate to fail

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096749#comment-17096749
 ] 

Apache Spark commented on SPARK-26199:
--

User 'MichaelChirico' has created a pull request for this issue:
https://github.com/apache/spark/pull/28386

> Long expressions cause mutate to fail
> -
>
> Key: SPARK-26199
> URL: https://issues.apache.org/jira/browse/SPARK-26199
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: João Rafael
>Priority: Minor
>
> Calling {{mutate(df, field = expr)}} fails when expr is very long.
> Example:
> {code:R}
> df <- mutate(df, field = ifelse(
> lit(TRUE),
> lit("A"),
> ifelse(
> lit(T),
> lit("BB"),
> lit("C")
> )
> ))
> {code}
> Stack trace:
> {code:R}
> FATAL subscript out of bounds
>   at .handleSimpleError(function (obj) 
> {
> level = sapply(class(obj), sw
>   at FUN(X[[i]], ...)
>   at lapply(seq_along(args), function(i) {
> if (ns[[i]] != "") {
> at lapply(seq_along(args), function(i) {
> if (ns[[i]] != "") {
> at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB
>   at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T
> {code}
> The root cause is in 
> [DataFrame.R#L2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182].
> When the expression is long, {{deparse}} returns multiple lines, causing 
> {{args}} to have more elements than {{ns}}. The solution could be to set 
> {{nlines = 1}} or to collapse the lines together.
> A simple workaround exists: first place the expression in a variable and use 
> it instead:
> {code:R}
> tmp <- ifelse(
> lit(TRUE),
> lit("A"),
> ifelse(
> lit(T),
> lit("BB"),
> lit("C")
> )
> )
> df <- mutate(df, field = tmp)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31570) gapply and dapply docs should be more aligned, possibly combined

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31570:


Assignee: (was: Apache Spark)

> gapply and dapply docs should be more aligned, possibly combined
> 
>
> Key: SPARK-31570
> URL: https://issues.apache.org/jira/browse/SPARK-31570
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Priority: Minor
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-31568
> There, we combined gapply and gapplyCollect to make it easier to sync 
> arguments between those Rds.
> dapply and dapplyCollect are also sufficiently similar that they could be on 
> the same Rd.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31590) The filter used by Metadata-only queries should not have Unevaluable

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31590:


Assignee: Apache Spark

> The filter used by Metadata-only queries should not have Unevaluable
> 
>
> Key: SPARK-31590
> URL: https://issues.apache.org/jira/browse/SPARK-31590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Assignee: Apache Spark
>Priority: Trivial
>
> When using SPARK-23877 (metadata-only queries), some SQL queries fail with errors.
> code:
> {code:scala}
> sql("set spark.sql.optimizer.metadataOnly=true")
> sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
> PARTITIONED BY (d ,h)")
> sql("""
> |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h)
> |SELECT 1,'2020-01-01','23'
> |UNION ALL
> |SELECT 2,'2020-01-02','01'
> |UNION ALL
> |SELECT 3,'2020-01-02','02'
> """.stripMargin)
> sql(
>   s"""
>  |SELECT d, MAX(h) AS h
>  |FROM test_tbl
>  |WHERE d= (
>  |  SELECT MAX(d) AS d
>  |  FROM test_tbl
>  |)
>  |GROUP BY d
> """.stripMargin).collect()
> {code}
> Exception:
> {code:java}
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#48 []
> ...
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180)
> {code}
> optimizedPlan:
> {code:java}
> Aggregate [d#245], [d#245, max(h#246) AS h#243]
> +- Project [d#245, h#246]
>+- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 []))
>   :  +- Aggregate [max(d#245) AS d#241]
>   : +- LocalRelation , [d#245]
>   +- Relation[a#244,d#245,h#246] parquet
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31593) Remove unnecessary streaming query progress update

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31593:


Assignee: (was: Apache Spark)

> Remove unnecessary streaming query progress update
> --
>
> Key: SPARK-31593
> URL: https://issues.apache.org/jira/browse/SPARK-31593
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Genmao Yu
>Priority: Minor
>
> The Structured Streaming progress reporter will always report an `empty` 
> progress when there is no new data. As designed, we should only provide 
> progress updates every 10s (default) when there is no new data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31592:


Assignee: Apache Spark

> bufferPoolsBySize in HeapMemoryAllocator should be thread safe
> --
>
> Key: SPARK-31592
> URL: https://issues.apache.org/jira/browse/SPARK-31592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Yunbo Fan
>Assignee: Apache Spark
>Priority: Major
>
> Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose 
> value type is LinkedList.
> LinkedList is not thread-safe, so concurrent allocations may hit the error below:
> {code:java}
> java.util.NoSuchElementException
> at java.util.LinkedList.removeFirst(LinkedList.java:270) 
> at java.util.LinkedList.remove(LinkedList.java:685)
> at 
> org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
>  
>  
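> One possible direction, sketched below purely for illustration (it is not the 
> actual patch), is to replace the plain LinkedList values with a lock-free queue so 
> that concurrent allocate/free calls cannot race on removeFirst:
> {code:scala}
> import java.lang.ref.WeakReference
> import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue}
> 
> // Illustrative pool keyed by allocation size; the names do not match the real
> // HeapMemoryAllocator fields.
> class SafeBufferPool {
>   private val poolsBySize =
>     new ConcurrentHashMap[Long, ConcurrentLinkedQueue[WeakReference[Array[Long]]]]()
> 
>   def acquire(size: Long): Option[Array[Long]] = {
>     val pool = poolsBySize.get(size)
>     if (pool == null) return None
>     // poll() is atomic and returns null when empty, so two threads can never hit
>     // the NoSuchElementException thrown by LinkedList.removeFirst.
>     var ref = pool.poll()
>     while (ref != null) {
>       val array = ref.get()
>       if (array != null) return Some(array)
>       ref = pool.poll()
>     }
>     None
>   }
> 
>   def release(size: Long, array: Array[Long]): Unit = {
>     poolsBySize
>       .computeIfAbsent(size, _ => new ConcurrentLinkedQueue[WeakReference[Array[Long]]]())
>       .add(new WeakReference(array))
>   }
> }
> {code}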



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31575) Synchronise global JVM security configuration modification

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31575:


Assignee: (was: Apache Spark)

> Synchronise global JVM security configuration modification
> --
>
> Key: SPARK-31575
> URL: https://issues.apache.org/jira/browse/SPARK-31575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31365) Enable nested predicate pushdown per data sources

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31365:


Assignee: (was: Apache Spark)

> Enable nested predicate pushdown per data sources
> -
>
> Key: SPARK-31365
> URL: https://issues.apache.org/jira/browse/SPARK-31365
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Blocker
>
> Currently, nested predicate pushdown is either on or off for all data sources. We 
> should add a configuration that enables it per supported data source (see the sketch below).
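> As a sketch of what a per-source switch could look like, assuming a spark-shell 
> style session; the exact property name below is illustrative only and was settled 
> in the linked pull request, not here:
> {code:scala}
> // Hypothetical per-source opt-in: only the listed file sources get nested
> // predicate pushdown, everything else keeps the current behaviour.
> spark.conf.set(
>   "spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources",
>   "parquet,orc")
> {code}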



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31575) Synchronise global JVM security configuration modification

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096742#comment-17096742
 ] 

Apache Spark commented on SPARK-31575:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/28368

> Synchronise global JVM security configuration modification
> --
>
> Key: SPARK-31575
> URL: https://issues.apache.org/jira/browse/SPARK-31575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31365) Enable nested predicate pushdown per data sources

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096740#comment-17096740
 ] 

Apache Spark commented on SPARK-31365:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/28366

> Enable nested predicate pushdown per data sources
> -
>
> Key: SPARK-31365
> URL: https://issues.apache.org/jira/browse/SPARK-31365
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Blocker
>
> Currently, nested predicate pushdown is either on or off for all data sources. We 
> should add a configuration that enables it per supported data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe

2020-04-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096743#comment-17096743
 ] 

Apache Spark commented on SPARK-31592:
--

User 'fanyunbojerry' has created a pull request for this issue:
https://github.com/apache/spark/pull/28389

> bufferPoolsBySize in HeapMemoryAllocator should be thread safe
> --
>
> Key: SPARK-31592
> URL: https://issues.apache.org/jira/browse/SPARK-31592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Yunbo Fan
>Priority: Major
>
> Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose 
> value type is LinkedList.
> LinkedList is not thread-safe, so concurrent allocations may hit the error below:
> {code:java}
> java.util.NoSuchElementException
> at java.util.LinkedList.removeFirst(LinkedList.java:270) 
> at java.util.LinkedList.remove(LinkedList.java:685)
> at 
> org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31592:


Assignee: (was: Apache Spark)

> bufferPoolsBySize in HeapMemoryAllocator should be thread safe
> --
>
> Key: SPARK-31592
> URL: https://issues.apache.org/jira/browse/SPARK-31592
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Yunbo Fan
>Priority: Major
>
> Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose 
> value type is LinkedList.
> LinkedList is not thread-safe, so concurrent allocations may hit the error below:
> {code:java}
> java.util.NoSuchElementException
> at java.util.LinkedList.removeFirst(LinkedList.java:270) 
> at java.util.LinkedList.remove(LinkedList.java:685)
> at 
> org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31365) Enable nested predicate pushdown per data sources

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31365:


Assignee: Apache Spark

> Enable nested predicate pushdown per data sources
> -
>
> Key: SPARK-31365
> URL: https://issues.apache.org/jira/browse/SPARK-31365
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Blocker
>
> Currently, nested predicate pushdown is either on or off for all data sources. We 
> should add a configuration that enables it per supported data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31575) Synchronise global JVM security configuration modification

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31575:


Assignee: Apache Spark

> Synchronise global JVM security configuration modification
> --
>
> Key: SPARK-31575
> URL: https://issues.apache.org/jira/browse/SPARK-31575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31268:


Assignee: Apache Spark

> TaskEnd event with zero Executor Metrics when task duration less than poll 
> interval
> ---
>
> Key: SPARK-31268
> URL: https://issues.apache.org/jira/browse/SPARK-31268
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
> Attachments: screenshot-1.png
>
>
> The TaskEnd event carries zero Executor Metrics when the task duration is less 
> than the poll interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31334) Use agg column in Having clause behave different with column type

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31334:


Assignee: Apache Spark

> Use agg column in Having clause behave different with column type 
> --
>
> Key: SPARK-31334
> URL: https://issues.apache.org/jira/browse/SPARK-31334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> test("") {
> Seq(
>   (1, 3),
>   (2, 3),
>   (3, 6),
>   (4, 7),
>   (5, 9),
>   (6, 9)
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> [info] -  *** FAILED *** (508 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 
> missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, 
> sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. 
> Attribute(s) with the same name appear in the operation: a. Please check if 
> the right attribute(s) are used.;;
> [info] Project [b#181, a#184]
> [info] +- Filter (sum(a#184)#188 > cast(3 as double))
> [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, 
> sum(a#184) AS sum(a#184)#188]
> [info]   +- SubqueryAlias `testdata`
> [info]  +- Project [_1#177 AS a#180, _2#178 AS b#181]
> [info] +- LocalRelation [_1#177, _2#178]
> 
> test("") {
> Seq(
>   ("1", "3"),
>   ("2", "3"),
>   ("3", "6"),
>   ("4", "7"),
>   ("5", "9"),
>   ("6", "9")
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> == Physical Plan ==
> *(2) Project [b#181, a#184L]
> +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 
> as bigint))#197L > 3))
>+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
>   +- Exchange hashpartitioning(b#181, 5)
>  +- *(1) HashAggregate(keys=[b#181], 
> functions=[partial_sum(cast(a#180 as bigint))])
> +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
>+- LocalTableScan [_1#177, _2#178]
> {code}
> I spent a lot of time but could not find which analyzer rule causes this difference.
> When the column type is double the query fails; when it is string it succeeds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31154) Expose basic write metrics for InsertIntoDataSourceCommand

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31154:


Assignee: Apache Spark

> Expose basic write metrics for InsertIntoDataSourceCommand
> --
>
> Key: SPARK-31154
> URL: https://issues.apache.org/jira/browse/SPARK-31154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> Spark provides the `InsertableRelation` interface and `InsertIntoDataSourceCommand` 
> to delegate insert processing to a data source. Unlike `DataWritingCommand`, the 
> metrics of `InsertIntoDataSourceCommand` are empty and are never updated, so we 
> cannot get "number of written files" or "number of output rows" from them.
> For example, if a table is a Spark Parquet table, we can get the write metrics with:
> {code}
> val df = sql("INSERT INTO TABLE test_table SELECT 1, 'a'")
> val numFiles = df.queryExecution.sparkPlan.metrics("numFiles").value
> {code}
> But if it is a Delta table, we cannot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31335) Add try function support

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31335:


Assignee: (was: Apache Spark)

> Add try function support
> 
>
> Key: SPARK-31335
> URL: https://issues.apache.org/jira/browse/SPARK-31335
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> Evaluate an expression and handle certain types of execution errors by returning NULL.
> In cases where it is preferable that queries produce NULL instead of failing when 
> corrupt or invalid data is encountered, the TRY function may be useful, especially 
> when ANSI mode is on and users need null tolerance for certain columns or outputs.
> AnalysisExceptions will not be handled by this; the errors typically handled by 
> the TRY function are:
>   * Division by zero
>   * Invalid casting
>   * Numeric value out of range
>   * etc.
> A usage sketch follows after this block.
> {code}
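> A purely hypothetical usage sketch of the proposed function, assuming a spark-shell 
> session; the name and behaviour follow the description above and nothing here 
> exists in Spark yet:
> {code:scala}
> spark.sql("SET spark.sql.ansi.enabled=true")
> 
> // With ANSI mode on, both expressions below would normally fail the query; the
> // proposed TRY wrapper would turn those runtime errors into NULLs instead.
> spark.sql(
>   "SELECT try(1 / 0) AS safe_div, try(CAST('not a number' AS INT)) AS safe_cast"
> ).show()
> {code}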



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31154) Expose basic write metrics for InsertIntoDataSourceCommand

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31154:


Assignee: (was: Apache Spark)

> Expose basic write metrics for InsertIntoDataSourceCommand
> --
>
> Key: SPARK-31154
> URL: https://issues.apache.org/jira/browse/SPARK-31154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Spark provides the `InsertableRelation` interface and `InsertIntoDataSourceCommand` 
> to delegate insert processing to a data source. Unlike `DataWritingCommand`, the 
> metrics of `InsertIntoDataSourceCommand` are empty and are never updated, so we 
> cannot get "number of written files" or "number of output rows" from them.
> For example, if a table is a Spark Parquet table, we can get the write metrics with:
> {code}
> val df = sql("INSERT INTO TABLE test_table SELECT 1, 'a'")
> val numFiles = df.queryExecution.sparkPlan.metrics("numFiles").value
> {code}
> But if it is a Delta table, we cannot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31268:


Assignee: (was: Apache Spark)

> TaskEnd event with zero Executor Metrics when task duration less than poll 
> interval
> ---
>
> Key: SPARK-31268
> URL: https://issues.apache.org/jira/browse/SPARK-31268
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> The TaskEnd event carries zero Executor Metrics when the task duration is less 
> than the poll interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31363) Improve DataSourceRegister interface

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31363:


Assignee: (was: Apache Spark)

> Improve DataSourceRegister interface
> 
>
> Key: SPARK-31363
> URL: https://issues.apache.org/jira/browse/SPARK-31363
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Andrew Malone Melo
>Priority: Minor
>
> As the DSv2 API evolves, some breaking changes are occasionally made to the API. 
> It's possible to split a plugin into a "common" part and multiple version-specific 
> parts, and this works well for shipping a single artifact to users. The one part 
> that currently can't be worked around is the DataSourceRegister trait. This is an 
> issue because users cargo-cult configuration values, and choosing the wrong plugin 
> version gives a particularly baroque error message that bubbles up through the 
> ServiceLoader.
> Currently, the class implementing DataSourceRegister must also be the class 
> implementing the "toplevel" DataSourceV2 interface (and mixins), and these various 
> interfaces occasionally change as the API evolves. As a practical matter, this 
> means there is no opportunity to decide at runtime which class to pass along to 
> Spark. Attempting to register multiple DataSourceV2 implementations under 
> META-INF/services causes an exception when the ServiceLoader tries to load the 
> DataSourceRegister that implements the "different" DataSourceV2.
> I would like to propose a new DataSourceRegister interface which adds a level of 
> indirection between what the ServiceLoader loads and the DataSourceV2 
> implementation Spark ultimately uses. E.g. (strawman; a fuller sketch follows below):
> {{interface DataSourceRegisterV2 {}}
> {{  public String shortName();}}
> {{  public Class getImplementation();}}
> {{ }}}
> Then org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource would 
> have its search algorithm extended to look for DataSourceRegisterV2 objects and, 
> if one is located for the given shortName, return the class object from 
> getImplementation(). At this point, the plugin could decide, based on the current 
> runtime environment, which class to present to Spark. There wouldn't be any 
> changes to plugins that don't implement this API.
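> A rough Scala rendering of the strawman, with an entirely hypothetical plugin 
> showing how the runtime choice could work; the class names and packages below are 
> made up:
> {code:scala}
> trait DataSourceRegisterV2 {
>   def shortName(): String
>   // Lets the plugin pick the DataSourceV2 implementation that matches the Spark
>   // version it finds itself running under.
>   def getImplementation(): Class[_]
> }
> 
> class MyFormatRegister extends DataSourceRegisterV2 {
>   override def shortName(): String = "myformat"
>   override def getImplementation(): Class[_] =
>     if (org.apache.spark.SPARK_VERSION.startsWith("3."))
>       Class.forName("com.example.myformat.v30.MyFormatSource")  // hypothetical
>     else
>       Class.forName("com.example.myformat.v24.MyFormatSource")  // hypothetical
> }
> {code}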
> If this is an acceptable idea, I can put together a PR for further comment.
> Thanks
>  Andrew



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31226) SizeBasedCoalesce logic error

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31226:


Assignee: (was: Apache Spark)

> SizeBasedCoalesce logic error
> -
>
> Key: SPARK-31226
> URL: https://issues.apache.org/jira/browse/SPARK-31226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Minor
>
> In the Spark unit tests, SizeBasedCoalesce's logic is wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31106) Support is_json function

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31106:


Assignee: Apache Spark

> Support is_json function
> 
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Assignee: Apache Spark
>Priority: Major
>
> This function will allow users to verify whether a given string is valid JSON. It 
> returns `true` for valid JSON and `false` for invalid JSON; `NULL` is returned for 
> `NULL` input. A usage sketch follows the list below.
> DBMSs supporting this function are:
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB
>  * Amazon Redshift
>  * IBM Db2
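> A hypothetical usage sketch of the proposed function, assuming a spark-shell 
> session; is_json does not exist in Spark at the time of writing:
> {code:scala}
> spark.sql(
>   """SELECT is_json('{"a": 1}') AS valid,
>     |       is_json('{"a": 1')  AS invalid,
>     |       is_json(NULL)       AS null_input""".stripMargin
> ).show()
> // Expected per the proposal: valid = true, invalid = false, null_input = NULL
> {code}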



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31220) repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum when spark.sql.adaptive.enabled

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31220:


Assignee: (was: Apache Spark)

> repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum 
> when spark.sql.adaptive.enabled
> ---
>
> Key: SPARK-31220
> URL: https://issues.apache.org/jira/browse/SPARK-31220
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("CREATE TABLE spark_31220(id int)")
> spark.sql("set 
> spark.sql.adaptive.coalescePartitions.initialPartitionNum=1000")
> spark.sql("set spark.sql.adaptive.enabled=true")
> {code}
> {noformat}
> scala> spark.sql("SELECT id from spark_31220 GROUP BY id").explain
> == Physical Plan ==
> AdaptiveSparkPlan(isFinalPlan=false)
> +- HashAggregate(keys=[id#5], functions=[])
>+- Exchange hashpartitioning(id#5, 1000), true, [id=#171]
>   +- HashAggregate(keys=[id#5], functions=[])
>  +- FileScan parquet default.spark_31220[id#5] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/opensource/apache-spark/spark-warehouse/spark_31220],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> scala> spark.sql("SELECT id from spark_31220 DISTRIBUTE BY id").explain
> == Physical Plan ==
> AdaptiveSparkPlan(isFinalPlan=false)
> +- Exchange hashpartitioning(id#5, 200), false, [id=#179]
>+- FileScan parquet default.spark_31220[id#5] Batched: true, DataFilters: 
> [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/opensource/apache-spark/spark-warehouse/spark_31220],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31106) Support is_json function

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31106:


Assignee: (was: Apache Spark)

> Support is_json function
> 
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will allow users to verify whether a given string is valid JSON. It 
> returns `true` for valid JSON and `false` for invalid JSON; `NULL` is returned for 
> `NULL` input.
> DBMSs supporting this function are:
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB
>  * Amazon Redshift
>  * IBM Db2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31214) Upgrade Janino to 3.1.2

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31214:


Assignee: Apache Spark

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31107) Extend FairScheduler to support pool level resource isolation

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31107:


Assignee: Apache Spark

> Extend FairScheduler to support pool level resource isolation
> -
>
> Key: SPARK-31107
> URL: https://issues.apache.org/jira/browse/SPARK-31107
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark only provides two types of scheduling: FIFO and FAIR, but in 
> high-concurrency SQL scenarios a few drawbacks are exposed.
> FIFO: it can easily cause congestion when a large SQL query occupies all the 
> resources.
> FAIR: the task sets of one pool may occupy all the resources because there is no 
> hard limit on the maximum usage of each pool; this case is frequently met under 
> high workloads.
> So we propose to add a maxShare argument to the FairScheduler to cap the number of 
> concurrently running tasks for each pool (see the sketch below).
> One thing that needs attention is making sure the `ExecutorAllocationManager` can 
> still release resources: e.g., suppose we have 100 executors and tasks are 
> scheduled on all of them with a max concurrency of 50; in that case some executors 
> may never become idle and cannot be released.
> One idea is to bind executors to each pool and only schedule a pool's tasks on the 
> executors that belong to it.
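> The sketch below shows the existing fairscheduler.xml pool format with the 
> proposed maxShare attribute added; maxShare is hypothetical, everything else is 
> what Spark already supports, and the file would be referenced through 
> spark.scheduler.allocation.file as usual. It is held in a Scala string only to 
> keep the example self-contained.
> {code:scala}
> // Hypothetical pool definition: every element except <maxShare> already exists
> // in Spark's fair scheduler pool file format.
> val poolFile =
>   """<?xml version="1.0"?>
>     |<allocations>
>     |  <pool name="adhoc-sql">
>     |    <schedulingMode>FAIR</schedulingMode>
>     |    <weight>1</weight>
>     |    <minShare>2</minShare>
>     |    <maxShare>50</maxShare>  <!-- proposed: hard cap on concurrently running tasks -->
>     |  </pool>
>     |</allocations>""".stripMargin
> {code}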



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31346) Add new configuration to make sure temporary directory cleaned

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31346:


Assignee: Apache Spark

> Add new configuration to make sure temporary directory cleaned
> --
>
> Key: SPARK-31346
> URL: https://issues.apache.org/jira/browse/SPARK-31346
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Assignee: Apache Spark
>Priority: Major
>
> In InsertIntoHiveTable and InsertIntoHiveDirCommand, we use deleteExternalTmpPath 
> to clean temporary directories after the job is committed, and cancel deleteOnExit 
> if that succeeds. But sometimes (e.g., when speculative tasks are enabled), 
> temporary directories may be left uncleaned. This happens if some tasks are still 
> running after we call deleteExternalTmpPath. Thus it may be necessary to keep 
> deleteOnExit, even if the temporary directory has already been deleted, to make 
> sure the temporary directories are cleaned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31206) AQE will use the same SubqueryExec even if subqueryReuseEnabled=false

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31206:


Assignee: (was: Apache Spark)

> AQE will use the same SubqueryExec even if subqueryReuseEnabled=false
> -
>
> Key: SPARK-31206
> URL: https://issues.apache.org/jira/browse/SPARK-31206
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> In `InsertAdaptiveSparkPlan.buildSubqueryMap`, AQE skips compiling a subquery 
> whose exprId has already been compiled. As a result, in PlanAdaptiveSubqueries it 
> uses the same SubqueryExec for every SubqueryExpression with that exprId, even 
> when subqueryReuseEnabled=false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31357) SPIP: Catalog API for view metadata

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31357:


Assignee: (was: Apache Spark)

> SPIP: Catalog API for view metadata
> ---
>
> Key: SPARK-31357
> URL: https://issues.apache.org/jira/browse/SPARK-31357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: John Zhuge
>Priority: Major
>  Labels: SPIP
>
> SPARK-24252 added a catalog plugin system and `TableCatalog` API that 
> provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view 
> metadata.
> Details in [SPIP 
> document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].
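> Since the SPIP document defines the actual API, the trait below is only an 
> illustrative guess at its shape, mirroring the TableCatalog style; every name in 
> it is hypothetical.
> {code:scala}
> // Hypothetical metadata holder and catalog interface for views.
> case class View(name: String, sql: String, properties: Map[String, String])
> 
> trait ViewCatalog {
>   def name(): String
>   def listViews(namespace: Array[String]): Array[String]
>   def loadView(viewName: String): View
>   def createView(viewName: String, sql: String, properties: Map[String, String]): View
>   def dropView(viewName: String): Boolean
> }
> {code}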



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31264) Repartition by dynamic partition columns before insert table

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31264:


Assignee: (was: Apache Spark)

> Repartition by dynamic partition columns before insert table
> 
>
> Key: SPARK-31264
> URL: https://issues.apache.org/jira/browse/SPARK-31264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31366) Document aggregation in the sql reference doc

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31366:


Assignee: Apache Spark

> Document aggregation in the sql reference doc
> -
>
> Key: SPARK-31366
> URL: https://issues.apache.org/jira/browse/SPARK-31366
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Assignee: Apache Spark
>Priority: Major
>
> Fill in the doc for sql-ref-syntax-qry-aggregation.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30707) Lead/Lag window function throws AnalysisException without ORDER BY clause

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30707:


Assignee: (was: Apache Spark)

> Lead/Lag window function throws AnalysisException without ORDER BY clause
> -
>
> Key: SPARK-30707
> URL: https://issues.apache.org/jira/browse/SPARK-30707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
>  Lead/Lag window function throws AnalysisException without ORDER BY clause:
> {code:java}
> SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four
> FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s
> org.apache.spark.sql.AnalysisException
> Window function lead(ten#x, (four#x + 1), null) requires window to be 
> ordered, please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + 
> 1), null)(value_expr) OVER (PARTITION BY window_partition ORDER BY 
> window_ordering) from table;
> {code}
>  
> Maybe we need to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31346) Add new configuration to make sure temporary directory cleaned

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31346:


Assignee: (was: Apache Spark)

> Add new configuration to make sure temporary directory cleaned
> --
>
> Key: SPARK-31346
> URL: https://issues.apache.org/jira/browse/SPARK-31346
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> In InsertIntoHiveTable and InsertIntoHiveDirCommand, we use deleteExternalTmpPath 
> to clean temporary directories after the job is committed, and cancel deleteOnExit 
> if that succeeds. But sometimes (e.g., when speculative tasks are enabled), 
> temporary directories may be left uncleaned. This happens if some tasks are still 
> running after we call deleteExternalTmpPath. Thus it may be necessary to keep 
> deleteOnExit, even if the temporary directory has already been deleted, to make 
> sure the temporary directories are cleaned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31180) Implement PowerTransform

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31180:


Assignee: (was: Apache Spark)

> Implement PowerTransform
> 
>
> Key: SPARK-31180
> URL: https://issues.apache.org/jira/browse/SPARK-31180
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> Power transforms are a family of parametric, monotonic transformations that are 
> applied to make data more Gaussian-like. This is useful for modeling issues 
> related to heteroscedasticity (non-constant variance), or other situations where 
> normality is desired.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31264) Repartition by dynamic partition columns before insert table

2020-04-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31264:


Assignee: Apache Spark

> Repartition by dynamic partition columns before insert table
> 
>
> Key: SPARK-31264
> URL: https://issues.apache.org/jira/browse/SPARK-31264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


