date:20200219



 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Issue Type: Task  (was: Bug)

> Warn two-parameter TRIM/LTRIM/RTRIM functions
> -
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30891) Arrange version info of history

jiaan.geng created SPARK-30891:
--

 Summary: Arrange version info of history
 Key: SPARK-30891
 URL: https://issues.apache.org/jira/browse/SPARK-30891
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: jiaan.geng


core/src/main/scala/org/apache/spark/internal/config/History.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30890) Arrange version info of history

jiaan.geng created SPARK-30890:
--

 Summary: Arrange version info of history
 Key: SPARK-30890
 URL: https://issues.apache.org/jira/browse/SPARK-30890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: jiaan.geng


core/src/main/scala/org/apache/spark/internal/config/History.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions



[ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040729#comment-17040729
 ] 

Dongjoon Hyun commented on SPARK-30886:
---

Thanks to SPARK-28126, we can give a directional warning for the dangerous 
cases.

> Warn two-parameter TRIM/LTRIM/RTRIM functions
> -
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30889) Arrange version info of worker

jiaan.geng created SPARK-30889:
--

 Summary: Arrange version info of worker
 Key: SPARK-30889
 URL: https://issues.apache.org/jira/browse/SPARK-30889
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng


core/src/main/scala/org/apache/spark/internal/config/Worker.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30888) Arrange version info of network



 [ 
https://issues.apache.org/jira/browse/SPARK-30888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30888:
---
Description: 
spark/core/src/main/scala/org/apache/spark/internal/config/Network.scala

> Arrange version info of network
> ---
>
> Key: SPARK-30888
> URL: https://issues.apache.org/jira/browse/SPARK-30888
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> spark/core/src/main/scala/org/apache/spark/internal/config/Network.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions



 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Affects Version/s: (was: 2.4.5)
   (was: 2.3.4)

> Warn two-parameter TRIM/LTRIM/RTRIM functions
> -
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30888) Arrange version info of network

jiaan.geng created SPARK-30888:
--

 Summary: Arrange version info of network
 Key: SPARK-30888
 URL: https://issues.apache.org/jira/browse/SPARK-30888
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30887) Arrange version info of deploy

jiaan.geng created SPARK-30887:
--

 Summary: Arrange version info of deploy
 Key: SPARK-30887
 URL: https://issues.apache.org/jira/browse/SPARK-30887
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0, 3.1.0
Reporter: jiaan.geng


[*spark*|https://github.com/apache/spark]/[core|https://github.com/apache/spark/tree/master/core]/[src|https://github.com/apache/spark/tree/master/core/src]/[main|https://github.com/apache/spark/tree/master/core/src/main]/[scala|https://github.com/apache/spark/tree/master/core/src/main/scala]/[org|https://github.com/apache/spark/tree/master/core/src/main/scala/org]/[apache|https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache]/[spark|https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark]/[internal|https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/internal]/[config|https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/internal/config]/*Deploy.scala*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions

Dongjoon Hyun created SPARK-30886:
-

 Summary: Warn two-parameter TRIM/LTRIM/RTRIM functions
 Key: SPARK-30886
 URL: https://issues.apache.org/jira/browse/SPARK-30886
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 2.3.4, 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions



 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Description: Apache Spark community decided to keep the existing esoteric 
two-parameter use cases with a proper warning. This JIRA aims to show warning.

> Warn two-parameter TRIM/LTRIM/RTRIM functions
> -
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30885) V1 table name should be fully qualified if catalog name is provided

2020-02-19 Thread Terry Kim (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-30885:
--
Summary: V1 table name should be fully qualified if catalog name is 
provided  (was: V1 table name should be fully qualified)

> V1 table name should be fully qualified if catalog name is provided
> ---
>
> Key: SPARK-30885
> URL: https://issues.apache.org/jira/browse/SPARK-30885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> For the following example,
> {code:java}
> sql("CREATE TABLE t USING json AS SELECT 1 AS i")
> sql("SELECT * FROM spark_catalog.t")
> {code}
> `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming that the 
> current namespace is `default`. However, this is not consistent with V2 
> behavior where namespace should be provided if the catalog name is also 
> provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30885) V1 table name should be fully qualified

2020-02-19 Thread Terry Kim (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-30885:
--
Description: 
For the following example,

{code:java}
sql("CREATE TABLE t USING json AS SELECT 1 AS i")
sql("SELECT * FROM spark_catalog.t")
{code}
`spark_catalog.t` is expanded to `spark_catalog.default.t` assuming that the 
current namespace is `default`. However, this is not consistent with V2 
behavior where namespace should be provided if the catalog name is also 
provided.


  was:
For the following example,

{code:java}
sql("CREATE TABLE t USING json AS SELECT 1 AS i")
sql("SELECT * FROM spark_catalog.t")
{code}

works, and `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming 
the current namespace is set to `default`. However, this is not consistent with 
V2 behavior where namespace should be provided if the catalog name is also 
provided.



> V1 table name should be fully qualified
> ---
>
> Key: SPARK-30885
> URL: https://issues.apache.org/jira/browse/SPARK-30885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> For the following example,
> {code:java}
> sql("CREATE TABLE t USING json AS SELECT 1 AS i")
> sql("SELECT * FROM spark_catalog.t")
> {code}
> `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming that the 
> current namespace is `default`. However, this is not consistent with V2 
> behavior where namespace should be provided if the catalog name is also 
> provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30885) V1 table name should be fully qualified

2020-02-19 Thread Terry Kim (Jira)

Terry Kim created SPARK-30885:
-

 Summary: V1 table name should be fully qualified
 Key: SPARK-30885
 URL: https://issues.apache.org/jira/browse/SPARK-30885
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Terry Kim


For the following example,

{code:java}
sql("CREATE TABLE t USING json AS SELECT 1 AS i")
sql("SELECT * FROM spark_catalog.t")
{code}

works, and `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming 
the current namespace is set to `default`. However, this is not consistent with 
V2 behavior where namespace should be provided if the catalog name is also 
provided.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-30884) Upgrade to Py4J 0.10.9



 [ 
https://issues.apache.org/jira/browse/SPARK-30884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30884:
-

Assignee: Dongjoon Hyun

> Upgrade to Py4J 0.10.9
> --
>
> Key: SPARK-30884
> URL: https://issues.apache.org/jira/browse/SPARK-30884
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to upgrade Py4J from 0.10.8.1 to 0.10.9.
> Py4J 0.10.9 is released with the following fixes.
> - https://www.py4j.org/changelog.html#py4j-0-10-9



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30884) Upgrade to Py4J 0.10.9



 [ 
https://issues.apache.org/jira/browse/SPARK-30884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30884:
--
Affects Version/s: (was: 2.4.5)
   3.0.0

> Upgrade to Py4J 0.10.9
> --
>
> Key: SPARK-30884
> URL: https://issues.apache.org/jira/browse/SPARK-30884
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to upgrade Py4J from 0.10.8.1 to 0.10.9.
> Py4J 0.10.9 is released with the following fixes.
> - https://www.py4j.org/changelog.html#py4j-0-10-9



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30884) Upgrade to Py4J 0.10.9



 [ 
https://issues.apache.org/jira/browse/SPARK-30884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30884:
--
Affects Version/s: (was: 3.1.0)
   2.4.5

> Upgrade to Py4J 0.10.9
> --
>
> Key: SPARK-30884
> URL: https://issues.apache.org/jira/browse/SPARK-30884
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to upgrade Py4J from 0.10.8.1 to 0.10.9.
> Py4J 0.10.9 is released with the following fixes.
> - https://www.py4j.org/changelog.html#py4j-0-10-9



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30884) Upgrade to Py4J 0.10.9

Dongjoon Hyun created SPARK-30884:
-

 Summary: Upgrade to Py4J 0.10.9
 Key: SPARK-30884
 URL: https://issues.apache.org/jira/browse/SPARK-30884
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


This issue aims to upgrade Py4J from 0.10.8.1 to 0.10.9.
Py4J 0.10.9 is released with the following fixes.
- https://www.py4j.org/changelog.html#py4j-0-10-9



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30876) Optimizer cannot infer from inferred constraints with join



 [ 
https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30876:

Description: 
How to reproduce this issue:
{code:sql}
create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);
select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and 
t3.c = 1);
{code}
Spark 2.3+:
{noformat}
== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#102]
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(3) Project
 +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
:- *(3) Project [b#10]
:  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
: :- *(3) Project [a#6]
: :  +- *(3) Filter isnotnull(a#6)
: : +- *(3) ColumnarToRow
: :+- FileScan parquet default.t1[a#6] Batched: true, 
DataFilters: [isnotnull(a#6)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct
: +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#87]
:+- *(1) Project [b#10]
:   +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1))
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t2[b#10] Batched: 
true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], ReadSchema: 
struct
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, true] as bigint))), [id=#96]
   +- *(2) Project [c#14]
  +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1))
 +- *(2) ColumnarToRow
+- FileScan parquet default.t3[c#14] Batched: true, 
DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], ReadSchema: 
struct

Time taken: 3.785 seconds, Fetched 1 row(s)
{noformat}


Spark 2.2.x:
{noformat}
== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
  +- *Project
 +- *SortMergeJoin [b#19], [c#23], Inner
:- *Project [b#19]
:  +- *SortMergeJoin [a#15], [b#19], Inner
: :- *Sort [a#15 ASC NULLS FIRST], false, 0
: :  +- Exchange hashpartitioning(a#15, 200)
: : +- *Filter (isnotnull(a#15) && (a#15 = 1))
: :+- HiveTableScan [a#15], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, 
b#16, c#17]
: +- *Sort [b#19 ASC NULLS FIRST], false, 0
:+- Exchange hashpartitioning(b#19, 200)
:   +- *Filter (isnotnull(b#19) && (b#19 = 1))
:  +- HiveTableScan [b#19], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, 
b#19, c#20]
+- *Sort [c#23 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(c#23, 200)
  +- *Filter (isnotnull(c#23) && (c#23 = 1))
 +- HiveTableScan [c#23], HiveTableRelation `default`.`t3`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, b#22, c#23]
Time taken: 0.728 seconds, Fetched 1 row(s)
{noformat}

Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't.


  was:
How to reproduce this issue:
{code:sql}
create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);
select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and 
t3.c = 1)
{code}
Spark 2.3+:
{noformat}
== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#102]
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(3) Project
 +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
:- *(3) Project [b#10]
:  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
: :- *(3) Project [a#6]
: :  +- *(3) Filter isnotnull(a#6)
: : +- *(3) ColumnarToRow
: :+- FileScan parquet default.t1[a#6] Batched: true, 
DataFilters: [isnotnull(a#6)], Format: Parquet, Location:

[jira] [Updated] (SPARK-30876) Optimizer cannot infer from inferred constraints with join



 [ 
https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30876:

Description: 
How to reproduce this issue:
{code:sql}
create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);
select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and 
t3.c = 1)
{code}
Spark 2.3+:
{noformat}
== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#102]
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(3) Project
 +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
:- *(3) Project [b#10]
:  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
: :- *(3) Project [a#6]
: :  +- *(3) Filter isnotnull(a#6)
: : +- *(3) ColumnarToRow
: :+- FileScan parquet default.t1[a#6] Batched: true, 
DataFilters: [isnotnull(a#6)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct
: +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#87]
:+- *(1) Project [b#10]
:   +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1))
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t2[b#10] Batched: 
true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], ReadSchema: 
struct
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, true] as bigint))), [id=#96]
   +- *(2) Project [c#14]
  +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1))
 +- *(2) ColumnarToRow
+- FileScan parquet default.t3[c#14] Batched: true, 
DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], ReadSchema: 
struct

Time taken: 3.785 seconds, Fetched 1 row(s)
{noformat}


Spark 2.2.x:
{noformat}
== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
  +- *Project
 +- *SortMergeJoin [b#19], [c#23], Inner
:- *Project [b#19]
:  +- *SortMergeJoin [a#15], [b#19], Inner
: :- *Sort [a#15 ASC NULLS FIRST], false, 0
: :  +- Exchange hashpartitioning(a#15, 200)
: : +- *Filter (isnotnull(a#15) && (a#15 = 1))
: :+- HiveTableScan [a#15], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, 
b#16, c#17]
: +- *Sort [b#19 ASC NULLS FIRST], false, 0
:+- Exchange hashpartitioning(b#19, 200)
:   +- *Filter (isnotnull(b#19) && (b#19 = 1))
:  +- HiveTableScan [b#19], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, 
b#19, c#20]
+- *Sort [c#23 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(c#23, 200)
  +- *Filter (isnotnull(c#23) && (c#23 = 1))
 +- HiveTableScan [c#23], HiveTableRelation `default`.`t3`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, b#22, c#23]
Time taken: 0.728 seconds, Fetched 1 row(s)
{noformat}

Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't.


  was:
How to reproduce this issue:
{code:sql}
create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);
{code}
Spark 2.3+:
{noformat}
== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#102]
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(3) Project
 +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
:- *(3) Project [b#10]
:  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
: :- *(3) Project [a#6]
: :  +- *(3) Filter isnotnull(a#6)
: : +- *(3) ColumnarToRow
: :+- FileScan parquet default.t1[a#6] Batched: true, 
DataFilters: [isnotnull(a#6)], Format: Parquet, Location:

[jira] [Updated] (SPARK-30883) Tests that use setWritable,setReadable and setExecutable should be cancel when user is root

2020-02-19 Thread deshanxiao (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deshanxiao updated SPARK-30883:
---
Environment: The java api *setWritable,setReadable and setExecutable* 
dosen't work well when the user is root. Maybe, we could cancel these tests or 
fast failure when the mvn test is starting.  (was: The java api 
*setWritable,setReadable and setExecutable* dosen't work when the user is root. 
Maybe, we could cancel these tests or fast failure when the mvn test is 
starting.)

> Tests that use setWritable,setReadable and setExecutable should be cancel 
> when user is root
> ---
>
> Key: SPARK-30883
> URL: https://issues.apache.org/jira/browse/SPARK-30883
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
> Environment: The java api *setWritable,setReadable and setExecutable* 
> dosen't work well when the user is root. Maybe, we could cancel these tests 
> or fast failure when the mvn test is starting.
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30883) Tests that use setWritable,setReadable and setExecutable should be cancel when user is root

2020-02-19 Thread deshanxiao (Jira)

deshanxiao created SPARK-30883:
--

 Summary: Tests that use setWritable,setReadable and setExecutable 
should be cancel when user is root
 Key: SPARK-30883
 URL: https://issues.apache.org/jira/browse/SPARK-30883
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.0.0
 Environment: The java api *setWritable,setReadable and setExecutable* 
dosen't work when the user is root. Maybe, we could cancel these tests or fast 
failure when the mvn test is starting.
Reporter: deshanxiao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manger in Spark



[ 
https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039884#comment-17039884
 ] 

Saurabh Chawla edited comment on SPARK-30873 at 2/20/20 3:43 AM:
-

We have raised the WIP PR for this.

cc [~holden] [~itskals][~amargoor]


was (Author: saurabhc100):
We have raised the WIP PR for this.

cc [~holdenkarau] [~itskals][~amargoor]

> Handling Node Decommissioning for Yarn cluster manger in Spark
> --
>
> Key: SPARK-30873
> URL: https://issues.apache.org/jira/browse/SPARK-30873
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Priority: Major
>
> In many public cloud environments, the node loss (in case of AWS 
> SpotLoss,Spot blocks and GCP preemptible VMs) is a planned and informed 
> activity. 
> The cloud provider intimates the cluster manager about the possible loss of 
> node ahead of time. Few examples is listed here:
> a) Spot loss in AWS(2 min before event)
> b) GCP Pre-emptible VM loss (30 second before event)
> c) AWS Spot block loss with info on termination time (generally few tens of 
> minutes before decommission as configured in Yarn)
> This JIRA tries to make spark leverage the knowledge of the node loss in 
> future, and tries to adjust the scheduling of tasks to minimise the impact on 
> the application. 
> It is well known that when a host is lost, the executors, its running tasks, 
> their caches and also Shuffle data is lost. This could result in wastage of 
> compute and other resources.
> The focus here is to build a framework for YARN, that can be extended for 
> other cluster managers to handle such scenario.
> The framework must handle one or more of the following:-
> 1) Prevent new tasks from starting on any executors on decommissioning Nodes. 
> 2) Decide to kill the running tasks so that they can be restarted elsewhere 
> (assuming they will not complete within the deadline) OR we can allow them to 
> continue hoping they will finish within deadline.
> 3) Clear the shuffle data entry from MapOutputTracker of decommission node 
> hostname to prevent the shuffle fetchfailed exception.The most significant 
> advantage of unregistering shuffle outputs when Spark schedules the first 
> re-attempt to compute the missing blocks, it notices all of the missing 
> blocks from decommissioned nodes and recovers in only one attempt. This 
> speeds up the recovery process significantly over the scheduled Spark 
> implementation, where stages might be rescheduled multiple times to recompute 
> missing shuffles from all nodes, and prevent jobs from being stuck for hours 
> failing and recomputing.
> 4) Prevent the stage to abort due to the fetchfailed exception in case of 
> decommissioning of node. In Spark there is number of consecutive stage 
> attempts allowed before a stage is aborted.This is controlled by the config 
> spark.stage.maxConsecutiveAttempts. Not accounting fetch fails due 
> decommissioning of nodes towards stage failure improves the reliability of 
> the system.
> Main components of change
> 1) Get the ClusterInfo update from the Resource Manager -> Application Master 
> -> Spark Driver.
> 2) DecommissionTracker, resides inside driver, tracks all the decommissioned 
> nodes and take necessary action and state transition.
> 3) Based on the decommission node list add hooks at code to achieve
>  a) No new task on executor
>  b) Remove shuffle data mapping info for the node to be decommissioned from 
> the mapOutputTracker
>  c) Do not count fetchFailure from decommissioned towards stage failure
> On the receiving info that node is to be decommissioned, the below action 
> needs to be performed by DecommissionTracker on driver:
>  * Add the entry of Nodes in DecommissionTracker with termination time and 
> node state as "DECOMMISSIONING".
>  * Stop assigning any new tasks on executors on the nodes which are candidate 
> for decommission. This makes sure slowly as the tasks finish the usage of 
> this node would die down.
>  * Kill all the executors for the decommissioning nodes after configurable 
> period of time, say "spark.graceful.decommission.executor.leasetimePct". This 
> killing ensures two things. Firstly, the task failure will be attributed in 
> job failure count. Second, avoid generation on more shuffle data on the node 
> that will eventually be lost. The node state is set to 
> "EXECUTOR_DECOMMISSIONED". 
>  * Mark Shuffle data on the node as unavailable after 
> "spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will 
> ensure that recomputation of missing shuffle partition is done early, rather 
> than reducers failing with a time-consuming FetchFailure. The node state is 
> set to

[jira] [Resolved] (SPARK-30856) SQLContext retains reference to unusable instance after SparkContext restarted



 [ 
https://issues.apache.org/jira/browse/SPARK-30856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30856.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27610
[https://github.com/apache/spark/pull/27610]

> SQLContext retains reference to unusable instance after SparkContext restarted
> --
>
> Key: SPARK-30856
> URL: https://issues.apache.org/jira/browse/SPARK-30856
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.5
>Reporter: Alex Favaro
>Priority: Major
> Fix For: 3.1.0
>
>
> When the underlying SQLContext is instantiated for a SparkSession, the 
> instance is saved as a class attribute and returned from subsequent calls to 
> SQLContext.getOrCreate(). If the SparkContext is stopped and a new one 
> started, the SQLContext class attribute is never cleared so any code which 
> calls SQLContext.getOrCreate() will get a SQLContext with a reference to the 
> old, unusable SparkContext.
> A similar issue was identified and fixed for SparkSession in SPARK-19055, but 
> the fix did not change SQLContext as well. I ran into this because mllib 
> still 
> [uses|https://github.com/apache/spark/blob/master/python/pyspark/mllib/common.py#L105]
>  SQLContext.getOrCreate() under the hood.
> I've already written a fix for this, which I'll be sharing in a PR, that 
> clears the class attribute on SQLContext when the SparkSession is stopped. 
> Another option would be to deprecate SQLContext.getOrCreate() entirely since 
> the corresponding Scala 
> [method|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html#getOrCreate-org.apache.spark.SparkContext-]
>  is itself deprecated. That seems like a larger change for a relatively minor 
> issue, however.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30860) Different behavior between rolling and non-rolling event log

2020-02-19 Thread Jungtaek Lim (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-30860:
-
Priority: Major  (was: Minor)

> Different behavior between rolling and non-rolling event log
> 
>
> Key: SPARK-30860
> URL: https://issues.apache.org/jira/browse/SPARK-30860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Adam Binford
>Priority: Major
>
> When creating a rolling event log, the application directory is created with 
> a call to FileSystem.mkdirs, with the file permission 770. The default 
> behavior of HDFS is to set the permission of a file created with 
> FileSystem.create or FileSystem.mkdirs to (P & ^umask), where P is the 
> permission in the API call and umask is a system value set by 
> fs.permissions.umask-mode and defaults to 0022. This means, with default 
> settings, any mkdirs call can have at most 755 permissions, which causes 
> rolling event log directories to be created with 750 permissions. This causes 
> the history server to be unable to prune old applications if they are not run 
> by the same user running the history server.
> This is not a problem for non-rolling logs, because it uses 
> SparkHadoopUtils.createFile for Hadoop 2 backward compatibility, and then 
> calls FileSystem.setPermission with 770 after the file has been created. 
> setPermission doesn't have the umask applied to it, so this works fine.
> Obviously this could be fixed by changing fs.permissions.umask-mode, but I'm 
> not sure the reason that's set in the first place or if this would hurt 
> anything else. The main issue is there is different behavior between rolling 
> and non-rolling event logs that might want to be updated in this repo to be 
> consistent across each.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *



 [ 
https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28990.
--
Resolution: Cannot Reproduce

> SparkSQL invalid call to toAttribute on unresolved object, tree: *
> --
>
> Key: SPARK-28990
> URL: https://issues.apache.org/jira/browse/SPARK-28990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: fengchaoge
>Priority: Major
>
> SparkSQL create table as select from one table which may not exists throw 
> exceptions like:
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree:
> {code}
> This is not friendly, spark user may have no idea about what's wrong.
> Simple sql can reproduce it，like this:
> {code}
> spark-sql (default)> create table default.spark as select * from default.dual;
> {code}
> {code}
> 2019-09-05 16:27:24,127 INFO (main) [Logging.scala:logInfo(54)] - Parsing 
> command: create table default.spark as select * from default.dual
> 2019-09-05 16:27:24,772 ERROR (main) [Logging.scala:logError(91)] - Failed in 
> [create table default.spark as select * from default.dual]
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree: *
> at 
> org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:245)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:160)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:148)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:148)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:147)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
> at 
>

[jira] [Resolved] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception



 [ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30332.
--
Resolution: Cannot Reproduce

I am resolving this. I can't reproduce and take a look for the cause. 

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, 
> T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
>

[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception



[ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040560#comment-17040560
 ] 

Hyukjin Kwon commented on SPARK-30332:
--

Where is error message, and how do you create the tables?
There are many ways to narrow down the problem. You can remove the columns one 
by one, and see if it still reproduces the issue.
Simplifying the multiple CSV files. Unifying the file names for readability.

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, 
> T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
>

[jira] [Commented] (SPARK-30868) Throw Exception if runHive(sql) failed

2020-02-19 Thread Jackey Lee (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-30868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040557#comment-17040557
 ] 

Jackey Lee commented on SPARK-30868:


[~Ankitraj] No, I think we should just throw Exception once hive run statement 
failed. Otherwise, the user cannot detect that this statement has failed.

Any jira you have created for this?

> Throw Exception if runHive(sql) failed
> --
>
> Key: SPARK-30868
> URL: https://issues.apache.org/jira/browse/SPARK-30868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, HiveClientImpl.runHive will not throw an exception when it runs 
> incorrectly, which will cause it to fail to feedback error information 
> normally.
> Example
> {code:scala}
> spark.sql("add jar file:///tmp/test.jar").show()
> spark.sql("show databases").show()
> {code}
> /tmp/test.jar doesn't exist, thus add jar is failed. However this code will 
> run completely without causing application failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-30837) spark-master-test-k8s is broken



[ 
https://issues.apache.org/jira/browse/SPARK-30837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040555#comment-17040555
 ] 

Hyukjin Kwon commented on SPARK-30837:
--

Thanks! I see green light :-). resolving it.

> spark-master-test-k8s is broken
> ---
>
> Key: SPARK-30837
> URL: https://issues.apache.org/jira/browse/SPARK-30837
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/471/console
> {code}
> + /home/jenkins/bin/session_lock_resource.py minikube
>   File "/home/jenkins/bin/session_lock_resource.py", line 140
> child_body_func = lambda(success_callback): _lock_and_wait(
> ^
> SyntaxError: invalid syntax
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-30837) spark-master-test-k8s is broken



 [ 
https://issues.apache.org/jira/browse/SPARK-30837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30837.
--
Resolution: Fixed

> spark-master-test-k8s is broken
> ---
>
> Key: SPARK-30837
> URL: https://issues.apache.org/jira/browse/SPARK-30837
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/471/console
> {code}
> + /home/jenkins/bin/session_lock_resource.py minikube
>   File "/home/jenkins/bin/session_lock_resource.py", line 140
> child_body_func = lambda(success_callback): _lock_and_wait(
> ^
> SyntaxError: invalid syntax
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30876) Optimizer cannot infer from inferred constraints with join



 [ 
https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30876:

Summary: Optimizer cannot infer from inferred constraints with join  (was: 
Optimizer cannot infer more constraint)

> Optimizer cannot infer from inferred constraints with join
> --
>
> Key: SPARK-30876
> URL: https://issues.apache.org/jira/browse/SPARK-30876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table t1(a int, b int, c int);
> create table t2(a int, b int, c int);
> create table t3(a int, b int, c int);
> {code}
> Spark 2.3+:
> {noformat}
> == Physical Plan ==
> *(4) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, true, [id=#102]
>+- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(3) Project
>  +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
> :- *(3) Project [b#10]
> :  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
> : :- *(3) Project [a#6]
> : :  +- *(3) Filter isnotnull(a#6)
> : : +- *(3) ColumnarToRow
> : :+- FileScan parquet default.t1[a#6] Batched: true, 
> DataFilters: [isnotnull(a#6)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: 
> struct
> : +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#87]
> :+- *(1) Project [b#10]
> :   +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1))
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t2[b#10] Batched: 
> true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], 
> ReadSchema: struct
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#96]
>+- *(2) Project [c#14]
>   +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1))
>  +- *(2) ColumnarToRow
> +- FileScan parquet default.t3[c#14] Batched: true, 
> DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], 
> ReadSchema: struct
> Time taken: 3.785 seconds, Fetched 1 row(s)
> {noformat}
> Spark 2.2.x:
> {noformat}
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *Project
>  +- *SortMergeJoin [b#19], [c#23], Inner
> :- *Project [b#19]
> :  +- *SortMergeJoin [a#15], [b#19], Inner
> : :- *Sort [a#15 ASC NULLS FIRST], false, 0
> : :  +- Exchange hashpartitioning(a#15, 200)
> : : +- *Filter (isnotnull(a#15) && (a#15 = 1))
> : :+- HiveTableScan [a#15], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, 
> b#16, c#17]
> : +- *Sort [b#19 ASC NULLS FIRST], false, 0
> :+- Exchange hashpartitioning(b#19, 200)
> :   +- *Filter (isnotnull(b#19) && (b#19 = 1))
> :  +- HiveTableScan [b#19], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, 
> b#19, c#20]
> +- *Sort [c#23 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c#23, 200)
>   +- *Filter (isnotnull(c#23) && (c#23 = 1))
>  +- HiveTableScan [c#23], HiveTableRelation 
> `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, 
> b#22, c#23]
> Time taken: 0.728 seconds, Fetched 1 row(s)
> {noformat}
> Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30882) Inaccurate results with higher precision ApproximatePercentile results



 [ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--
Summary: Inaccurate results with higher precision ApproximatePercentile 
results   (was: Inaccurate results even with higher precision 
ApproximatePercentile results )

> Inaccurate results with higher precision ApproximatePercentile results 
> ---
>
> Key: SPARK-30882
> URL: https://issues.apache.org/jira/browse/SPARK-30882
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.4
>Reporter: Nandini malempati
>Priority: Major
>
> Results of ApproximatePercentile should have better accuracy with increased 
> precision as per the documentation provided here: 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But i'm seeing nondeterministic behavior . On a data set of size 450 with 
> Accuracy 100 is returning better results than 500 (P25 and P90 looks 
> fine but P75 is very off). And accuracy with 700 gives exact results. But 
> this behavior is not consistent. 
> {code:scala}
> // Some comments here
> package com.microsoft.teams.test.utils
> import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
> import org.scalatest.{Assertions, FunSpec}
> import org.apache.spark.sql.functions.{lit, _}
> import 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
> StructType}
> import scala.collection.mutable.ListBuffer
> class PercentilesTest extends FunSpec with Assertions {
> it("check percentiles with different precision") {
>   val schema = List(StructField("MetricName", StringType), 
> StructField("DataPoint", IntegerType))
>   val data = new ListBuffer[Row]
>   for(i <- 1 to 450) { data += Row("metric", i)}
>   import spark.implicits._
> val df = createDF(schema, data.toSeq)
> val accuracy1000 = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 
> ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
> val accuracy1M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100))
> val accuracy5M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 500))
> val accuracy7M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 700))
> val accuracy10M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000))
> accuracy1000.show(1, false)
> accuracy1M.show(1, false)
> accuracy5M.show(1, false)
> accuracy7M.show(1, false)
> accuracy10M.show(1, false)
>   }
>   def percentile_approx(col: Column, percentage: Column, accuracy: Column): 
> Column = {
> val expr = new ApproximatePercentile(
>   col.expr,  percentage.expr, accuracy.expr
> ).toAggregateExpression
> new Column(expr)
>   }
>   def percentile_approx(col: Column, percentage: Column, accuracy: Int): 
> Column = percentile_approx(
> col, percentage, lit(accuracy)
>   )
>   lazy val spark: SparkSession = {
> SparkSession
>   .builder()
>   .master("local")
>   .appName("spark tests")
>   .getOrCreate()
>   }
>   def createDF(schema: List[StructField], data: Seq[Row]): DataFrame = {
> spark.createDataFrame(
>   spark.sparkContext.parallelize(data),
>   StructType(schema))
>   }
> }
> {code}
> Above is a test run to reproduce the error. Below are few runs with different 
> accuracies.  P25 and P90 looks fine. But P75 is the max of the column.
> +--+--+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 700)|
> +--+--+
> |metric    |[45, 1125000, 3375000, 405]                       |
> +--+--+
>   
> +--+--+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
> +--+--+
> |metric    |[45, 1125000, 450, 405]                       |
> +--+--+
>   
> +--+--+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 100)|
>

[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results



 [ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--
Description: 
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500 (P25 and P90 looks 
fine but P75 is very off). And accuracy with 700 gives exact results. But 
this behavior is not consistent. 


{code:scala}
// Some comments here
package com.microsoft.teams.test.utils

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.scalatest.{Assertions, FunSpec}
import org.apache.spark.sql.functions.{lit, _}
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
StructType}

import scala.collection.mutable.ListBuffer


class PercentilesTest extends FunSpec with Assertions {

it("check percentiles with different precision") {
  val schema = List(StructField("MetricName", StringType), 
StructField("DataPoint", IntegerType))
  val data = new ListBuffer[Row]
  for(i <- 1 to 450) { data += Row("metric", i)}

  import spark.implicits._
val df = createDF(schema, data.toSeq)

val accuracy1000 = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))

val accuracy1M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 100))

val accuracy5M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 500))

val accuracy7M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 700))

val accuracy10M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 1000))

accuracy1000.show(1, false)

accuracy1M.show(1, false)

accuracy5M.show(1, false)

accuracy7M.show(1, false)

accuracy10M.show(1, false)

  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Column): 
Column = {
val expr = new ApproximatePercentile(
  col.expr,  percentage.expr, accuracy.expr
).toAggregateExpression
new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column 
= percentile_approx(
col, percentage, lit(accuracy)
  )

  lazy val spark: SparkSession = {
SparkSession
  .builder()
  .master("local")
  .appName("spark tests")
  .getOrCreate()
  }

  def createDF(schema: List[StructField], data: Seq[Row]): DataFrame = {
spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema))
  }
}

{code}

Above is a test run to reproduce the error.  In this example with accuracy 
500 , P25 and P90 looks fine. But P75 is the max of the column.

+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+--+--+
|metric    |[45, 1125000, 450, 405]                       |
+--+--+

This is breaking our reports as there is no proper definition of accuracy . we 
have data sets of size more than 2700. After studying the pattern found 
that inaccurate percentiles always have "max" of the column as value. P50 and 
P99 might be right in few cases but P75 can be wrong. 

Is there a way to define what the correct accuracy would be for a given dataset 
size ? 

  was:
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500. And accuracy with 
700 gives exact results. But this behavior is not consistent. 


{code:scala}
// Some comments here
package com.microsoft.teams.test.utils

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.scalatest.{Assertions, FunSpec}
import org.apache.spark.sql.functions.{lit, _}
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
StructType}

import

[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results



 [ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--
Description: 
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500 (P25 and P90 looks 
fine but P75 is very off). And accuracy with 700 gives exact results. But 
this behavior is not consistent. 


{code:scala}
// Some comments here
package com.microsoft.teams.test.utils

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.scalatest.{Assertions, FunSpec}
import org.apache.spark.sql.functions.{lit, _}
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
StructType}

import scala.collection.mutable.ListBuffer


class PercentilesTest extends FunSpec with Assertions {

it("check percentiles with different precision") {
  val schema = List(StructField("MetricName", StringType), 
StructField("DataPoint", IntegerType))
  val data = new ListBuffer[Row]
  for(i <- 1 to 450) { data += Row("metric", i)}

  import spark.implicits._
val df = createDF(schema, data.toSeq)

val accuracy1000 = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))

val accuracy1M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 100))

val accuracy5M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 500))

val accuracy7M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 700))

val accuracy10M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 1000))

accuracy1000.show(1, false)

accuracy1M.show(1, false)

accuracy5M.show(1, false)

accuracy7M.show(1, false)

accuracy10M.show(1, false)

  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Column): 
Column = {
val expr = new ApproximatePercentile(
  col.expr,  percentage.expr, accuracy.expr
).toAggregateExpression
new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column 
= percentile_approx(
col, percentage, lit(accuracy)
  )

  lazy val spark: SparkSession = {
SparkSession
  .builder()
  .master("local")
  .appName("spark tests")
  .getOrCreate()
  }

  def createDF(schema: List[StructField], data: Seq[Row]): DataFrame = {
spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema))
  }
}

{code}

Above is a test run to reproduce the error. Below are few runs with different 
accuracies.  P25 and P90 looks fine. But P75 is the max of the column.
+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 700)|
+--+--+
|metric    |[45, 1125000, 3375000, 405]                       |
+--+--+
  
+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+--+--+
|metric    |[45, 1125000, 450, 405]                       |
+--+--+
  
+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 100)|
+--+--+
|metric    |[45, 1124998, 3374996, 405]                       |
+--+--+
  
+--++
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 1)|
+--++
|metric    |[45, 1124848, 3374638, 405]                     |
+--++


This is breaking our reports as there is no proper definition of accuracy . we 
have data sets of size more than 2700. After studying the pattern found 
that inaccurate percentiles always have "max" of the column as

[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results



 [ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--
Description: 
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500. And accuracy with 
700 gives exact results. But this behavior is not consistent. 


{code:scala}
// Some comments here
package com.microsoft.teams.test.utilsimport org.apache.spark.sql.\{Column, 
DataFrame, Row, SparkSession} import org.scalatest.\{Assertions, FunSpec} 
import org.apache.spark.sql.functions.\{lit, _} import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile 
import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, 
StructType}import scala.collection.mutable.ListBuffer class PercentilesTest 
extends FunSpec with Assertions \{ it("check percentiles with different 
precision") { val schema = List(StructField("MetricName", StringType), 
StructField("DataPoint", IntegerType)) val data = new ListBuffer[Row] for(i <- 
1 to 450) { data += Row("metric", i)} import spark.implicits._ val df = 
createDF(schema, data.toSeq) val accuracy1000 = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)) val 
accuracy1M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100)) val accuracy5M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 500)) val accuracy7M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 700)) val accuracy10M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 1000)) accuracy1000.show(1, false) accuracy1M.show(1, 
false) accuracy5M.show(1, false) accuracy7M.show(1, false) accuracy10M.show(1, 
false) } def percentile_approx(col: Column, percentage: Column, accuracy: 
Column): Column = \{ val expr = new ApproximatePercentile( col.expr, 
percentage.expr, accuracy.expr ).toAggregateExpression new Column(expr) } def 
percentile_approx(col: Column, percentage: Column, accuracy: Int): Column = 
percentile_approx( col, percentage, lit(accuracy) ) lazy val spark: 
SparkSession = \{ SparkSession .builder() .master("local") .appName("spark 
tests") .getOrCreate() } def createDF(schema: List[StructField], data: 
Seq[Row]): DataFrame = \{ spark.createDataFrame( 
spark.sparkContext.parallelize(data), StructType(schema)) } }
{code}

Above is a test run to reproduce the error.  In this example with accuracy 
500 , P25 and P90 looks fine. But P75 is the max of the column.

+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+--+--+
|metric    |[45, 1125000, 450, 405]                       |
+--+--+

This is breaking our reports as there is no proper definition of accuracy . we 
have data sets of size more than 2700. After studying the pattern found 
that inaccurate percentiles always have "max" of the column as value. P50 and 
P99 might be right in few cases but P75 can be wrong. 

Is there a way to define what the correct accuracy would be for a given dataset 
size ? 

  was:
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500. And accuracy with 
700 gives exact results. But this behavior is not consistent. 
{code:java}
// code placeholder
{package com.microsoft.teams.test.utilsimport org.apache.spark.sql.\{Column, 
DataFrame, Row, SparkSession} import org.scalatest.\{Assertions, FunSpec} 
import org.apache.spark.sql.functions.\{lit, _} import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile 
import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, 
StructType}import scala.collection.mutable.ListBuffer class PercentilesTest 
extends FunSpec with Assertions \{ it("check percentiles with different 
precision") { val schema = List(StructField("MetricName", StringType),

[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results



 [ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--
Description: 
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500. And accuracy with 
700 gives exact results. But this behavior is not consistent. 


{code:scala}
// Some comments here
package com.microsoft.teams.test.utils

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.scalatest.{Assertions, FunSpec}
import org.apache.spark.sql.functions.{lit, _}
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
StructType}

import scala.collection.mutable.ListBuffer


class PercentilesTest extends FunSpec with Assertions {

it("check percentiles with different precision") {
  val schema = List(StructField("MetricName", StringType), 
StructField("DataPoint", IntegerType))
  val data = new ListBuffer[Row]
  for(i <- 1 to 450) { data += Row("metric", i)}

  import spark.implicits._
val df = createDF(schema, data.toSeq)

val accuracy1000 = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))

val accuracy1M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 100))

val accuracy5M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 500))

val accuracy7M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 700))

val accuracy10M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 1000))

accuracy1000.show(1, false)

accuracy1M.show(1, false)

accuracy5M.show(1, false)

accuracy7M.show(1, false)

accuracy10M.show(1, false)

  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Column): 
Column = {
val expr = new ApproximatePercentile(
  col.expr,  percentage.expr, accuracy.expr
).toAggregateExpression
new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column 
= percentile_approx(
col, percentage, lit(accuracy)
  )

  lazy val spark: SparkSession = {
SparkSession
  .builder()
  .master("local")
  .appName("spark tests")
  .getOrCreate()
  }

  def createDF(schema: List[StructField], data: Seq[Row]): DataFrame = {
spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema))
  }
}

{code}

Above is a test run to reproduce the error.  In this example with accuracy 
500 , P25 and P90 looks fine. But P75 is the max of the column.

+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+--+--+
|metric    |[45, 1125000, 450, 405]                       |
+--+--+

This is breaking our reports as there is no proper definition of accuracy . we 
have data sets of size more than 2700. After studying the pattern found 
that inaccurate percentiles always have "max" of the column as value. P50 and 
P99 might be right in few cases but P75 can be wrong. 

Is there a way to define what the correct accuracy would be for a given dataset 
size ? 

  was:
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500. And accuracy with 
700 gives exact results. But this behavior is not consistent. 


{code:scala}
// Some comments here
package com.microsoft.teams.test.utilsimport org.apache.spark.sql.\{Column, 
DataFrame, Row, SparkSession} import org.scalatest.\{Assertions, FunSpec} 
import org.apache.spark.sql.functions.\{lit, _} import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile 
import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, 
StructType}import scala.collection.mutable.ListBuffer class

[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results



 [ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--
Description: 
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500. And accuracy with 
700 gives exact results. But this behavior is not consistent. 
{code:java}
// code placeholder
{package com.microsoft.teams.test.utilsimport org.apache.spark.sql.\{Column, 
DataFrame, Row, SparkSession} import org.scalatest.\{Assertions, FunSpec} 
import org.apache.spark.sql.functions.\{lit, _} import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile 
import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, 
StructType}import scala.collection.mutable.ListBuffer class PercentilesTest 
extends FunSpec with Assertions \{ it("check percentiles with different 
precision") { val schema = List(StructField("MetricName", StringType), 
StructField("DataPoint", IntegerType)) val data = new ListBuffer[Row] for(i <- 
1 to 450) { data += Row("metric", i)} import spark.implicits._ val df = 
createDF(schema, data.toSeq) val accuracy1000 = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)) val 
accuracy1M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100)) val accuracy5M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 500)) val accuracy7M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 700)) val accuracy10M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 1000)) accuracy1000.show(1, false) accuracy1M.show(1, 
false) accuracy5M.show(1, false) accuracy7M.show(1, false) accuracy10M.show(1, 
false) } def percentile_approx(col: Column, percentage: Column, accuracy: 
Column): Column = \{ val expr = new ApproximatePercentile( col.expr, 
percentage.expr, accuracy.expr ).toAggregateExpression new Column(expr) } def 
percentile_approx(col: Column, percentage: Column, accuracy: Int): Column = 
percentile_approx( col, percentage, lit(accuracy) ) lazy val spark: 
SparkSession = \{ SparkSession .builder() .master("local") .appName("spark 
tests") .getOrCreate() } def createDF(schema: List[StructField], data: 
Seq[Row]): DataFrame = \{ spark.createDataFrame( 
spark.sparkContext.parallelize(data), StructType(schema)) } }}


Above is a test run to reproduce the error.  In this example with accuracy 
500 , P25 and P90 looks fine. But P75 is the max of the column.

+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+--+--+
|metric    |[45, 1125000, 450, 405]                       |
+--+--+

This is breaking our reports as there is no proper definition of accuracy . we 
have data sets of size more than 2700. After studying the pattern found 
that inaccurate percentiles always have "max" of the column as value. P50 and 
P99 might be right in few cases but P75 can be wrong. 

Is there a way to define what the correct accuracy would be for a given dataset 
size ? 

  was:
Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500. And accuracy with 
700 gives exact results. But this behavior is not consistent. 
{code:java}
// code placeholder
{code}
package com.microsoft.teams.test.utilsimport org.apache.spark.sql.\{Column, 
DataFrame, Row, SparkSession} import org.scalatest.\{Assertions, FunSpec} 
import org.apache.spark.sql.functions.\{lit, _} import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile 
import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, 
StructType}import scala.collection.mutable.ListBuffer class PercentilesTest 
extends FunSpec with Assertions \{ it("check percentiles with different 
precision") { val schema = List(StructField("MetricName", StringType),

[jira] [Updated] (SPARK-30882) Inaccurate ApproximatePercentile results



 [ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--
Summary: Inaccurate ApproximatePercentile results   (was: 
ApproximatePercentile results )

> Inaccurate ApproximatePercentile results 
> -
>
> Key: SPARK-30882
> URL: https://issues.apache.org/jira/browse/SPARK-30882
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.4
>Reporter: Nandini malempati
>Priority: Major
>
> Results of ApproximatePercentile should have better accuracy with increased 
> precision as per the documentation provided here: 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But i'm seeing nondeterministic behavior . On a data set of size 450 with 
> Accuracy 100 is returning better results than 500. And accuracy with 
> 700 gives exact results. But this behavior is not consistent. 
> {code:java}
> // code placeholder
> {code}
> package com.microsoft.teams.test.utilsimport org.apache.spark.sql.\{Column, 
> DataFrame, Row, SparkSession} import org.scalatest.\{Assertions, FunSpec} 
> import org.apache.spark.sql.functions.\{lit, _} import 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile 
> import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, 
> StructType}import scala.collection.mutable.ListBuffer class PercentilesTest 
> extends FunSpec with Assertions \{ it("check percentiles with different 
> precision") { val schema = List(StructField("MetricName", StringType), 
> StructField("DataPoint", IntegerType)) val data = new ListBuffer[Row] for(i 
> <- 1 to 450) { data += Row("metric", i)} import spark.implicits._ val df 
> = createDF(schema, data.toSeq) val accuracy1000 = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 
> ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)) val accuracy1M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100)) val accuracy5M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 500)) val accuracy7M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 700)) val accuracy10M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000)) accuracy1000.show(1, false) 
> accuracy1M.show(1, false) accuracy5M.show(1, false) accuracy7M.show(1, false) 
> accuracy10M.show(1, false) } def percentile_approx(col: Column, percentage: 
> Column, accuracy: Column): Column = \{ val expr = new ApproximatePercentile( 
> col.expr, percentage.expr, accuracy.expr ).toAggregateExpression new 
> Column(expr) } def percentile_approx(col: Column, percentage: Column, 
> accuracy: Int): Column = percentile_approx( col, percentage, lit(accuracy) ) 
> lazy val spark: SparkSession = \{ SparkSession .builder() .master("local") 
> .appName("spark tests") .getOrCreate() } def createDF(schema: 
> List[StructField], data: Seq[Row]): DataFrame = \{ spark.createDataFrame( 
> spark.sparkContext.parallelize(data), StructType(schema)) } }
> Above is a test run to reproduce the error.  In this example with accuracy 
> 500 , P25 and P90 looks fine. But P75 is the max of the column.
> +--+--+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
> +--+--+
> |metric    |[45, 1125000, 450, 405]                       |
> +--+--+
> This is breaking our reports as there is no proper definition of accuracy . 
> we have data sets of size more than 2700. After studying the pattern 
> found that inaccurate percentiles always have "max" of the column as value. 
> P50 and P99 might be right in few cases but P75 can be wrong. 
> Is there a way to define what the correct accuracy would be for a given 
> dataset size ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results



 [ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--
Summary: Inaccurate results even with higher precision 
ApproximatePercentile results   (was: Inaccurate ApproximatePercentile results )

> Inaccurate results even with higher precision ApproximatePercentile results 
> 
>
> Key: SPARK-30882
> URL: https://issues.apache.org/jira/browse/SPARK-30882
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.4
>Reporter: Nandini malempati
>Priority: Major
>
> Results of ApproximatePercentile should have better accuracy with increased 
> precision as per the documentation provided here: 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But i'm seeing nondeterministic behavior . On a data set of size 450 with 
> Accuracy 100 is returning better results than 500. And accuracy with 
> 700 gives exact results. But this behavior is not consistent. 
> {code:java}
> // code placeholder
> {code}
> package com.microsoft.teams.test.utilsimport org.apache.spark.sql.\{Column, 
> DataFrame, Row, SparkSession} import org.scalatest.\{Assertions, FunSpec} 
> import org.apache.spark.sql.functions.\{lit, _} import 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile 
> import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, 
> StructType}import scala.collection.mutable.ListBuffer class PercentilesTest 
> extends FunSpec with Assertions \{ it("check percentiles with different 
> precision") { val schema = List(StructField("MetricName", StringType), 
> StructField("DataPoint", IntegerType)) val data = new ListBuffer[Row] for(i 
> <- 1 to 450) { data += Row("metric", i)} import spark.implicits._ val df 
> = createDF(schema, data.toSeq) val accuracy1000 = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 
> ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)) val accuracy1M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100)) val accuracy5M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 500)) val accuracy7M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 700)) val accuracy10M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000)) accuracy1000.show(1, false) 
> accuracy1M.show(1, false) accuracy5M.show(1, false) accuracy7M.show(1, false) 
> accuracy10M.show(1, false) } def percentile_approx(col: Column, percentage: 
> Column, accuracy: Column): Column = \{ val expr = new ApproximatePercentile( 
> col.expr, percentage.expr, accuracy.expr ).toAggregateExpression new 
> Column(expr) } def percentile_approx(col: Column, percentage: Column, 
> accuracy: Int): Column = percentile_approx( col, percentage, lit(accuracy) ) 
> lazy val spark: SparkSession = \{ SparkSession .builder() .master("local") 
> .appName("spark tests") .getOrCreate() } def createDF(schema: 
> List[StructField], data: Seq[Row]): DataFrame = \{ spark.createDataFrame( 
> spark.sparkContext.parallelize(data), StructType(schema)) } }
> Above is a test run to reproduce the error.  In this example with accuracy 
> 500 , P25 and P90 looks fine. But P75 is the max of the column.
> +--+--+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
> +--+--+
> |metric    |[45, 1125000, 450, 405]                       |
> +--+--+
> This is breaking our reports as there is no proper definition of accuracy . 
> we have data sets of size more than 2700. After studying the pattern 
> found that inaccurate percentiles always have "max" of the column as value. 
> P50 and P99 might be right in few cases but P75 can be wrong. 
> Is there a way to define what the correct accuracy would be for a given 
> dataset size ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30882) ApproximatePercentile results

Nandini malempati created SPARK-30882:
-

 Summary: ApproximatePercentile results 
 Key: SPARK-30882
 URL: https://issues.apache.org/jira/browse/SPARK-30882
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.4
Reporter: Nandini malempati


Results of ApproximatePercentile should have better accuracy with increased 
precision as per the documentation provided here: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
But i'm seeing nondeterministic behavior . On a data set of size 450 with 
Accuracy 100 is returning better results than 500. And accuracy with 
700 gives exact results. But this behavior is not consistent. 
{code:java}
// code placeholder
{code}
package com.microsoft.teams.test.utilsimport org.apache.spark.sql.\{Column, 
DataFrame, Row, SparkSession} import org.scalatest.\{Assertions, FunSpec} 
import org.apache.spark.sql.functions.\{lit, _} import 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile 
import org.apache.spark.sql.types.\{IntegerType, StringType, StructField, 
StructType}import scala.collection.mutable.ListBuffer class PercentilesTest 
extends FunSpec with Assertions \{ it("check percentiles with different 
precision") { val schema = List(StructField("MetricName", StringType), 
StructField("DataPoint", IntegerType)) val data = new ListBuffer[Row] for(i <- 
1 to 450) { data += Row("metric", i)} import spark.implicits._ val df = 
createDF(schema, data.toSeq) val accuracy1000 = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)) val 
accuracy1M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100)) val accuracy5M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 500)) val accuracy7M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 700)) val accuracy10M = 
df.groupBy("MetricName").agg(percentile_approx($"DataPoint", typedLit(Seq(0.1, 
0.25, 0.75, 0.9)), 1000)) accuracy1000.show(1, false) accuracy1M.show(1, 
false) accuracy5M.show(1, false) accuracy7M.show(1, false) accuracy10M.show(1, 
false) } def percentile_approx(col: Column, percentage: Column, accuracy: 
Column): Column = \{ val expr = new ApproximatePercentile( col.expr, 
percentage.expr, accuracy.expr ).toAggregateExpression new Column(expr) } def 
percentile_approx(col: Column, percentage: Column, accuracy: Int): Column = 
percentile_approx( col, percentage, lit(accuracy) ) lazy val spark: 
SparkSession = \{ SparkSession .builder() .master("local") .appName("spark 
tests") .getOrCreate() } def createDF(schema: List[StructField], data: 
Seq[Row]): DataFrame = \{ spark.createDataFrame( 
spark.sparkContext.parallelize(data), StructType(schema)) } }

Above is a test run to reproduce the error.  In this example with accuracy 
500 , P25 and P90 looks fine. But P75 is the max of the column.

+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+--+--+
|metric    |[45, 1125000, 450, 405]                       |
+--+--+

This is breaking our reports as there is no proper definition of accuracy . we 
have data sets of size more than 2700. After studying the pattern found 
that inaccurate percentiles always have "max" of the column as value. P50 and 
P99 might be right in few cases but P75 can be wrong. 

Is there a way to define what the correct accuracy would be for a given dataset 
size ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-30802) Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite

2020-02-19 Thread Sean R. Owen (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30802.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27555
[https://github.com/apache/spark/pull/27555]

> Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test 
> suite
> ---
>
> Key: SPARK-30802
> URL: https://issues.apache.org/jira/browse/SPARK-30802
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.1.0
>
>
> Similar to https://issues.apache.org/jira/browse/SPARK-29754, we can use 
> Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite. 
> Also, we should use common code to minimize code duplication. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30802) Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite

2020-02-19 Thread Sean R. Owen (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30802:
-
Priority: Minor  (was: Major)

> Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test 
> suite
> ---
>
> Key: SPARK-30802
> URL: https://issues.apache.org/jira/browse/SPARK-30802
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> Similar to https://issues.apache.org/jira/browse/SPARK-29754, we can use 
> Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite. 
> Also, we should use common code to minimize code duplication. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-30802) Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite

2020-02-19 Thread Sean R. Owen (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30802:


Assignee: Huaxin Gao

> Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test 
> suite
> ---
>
> Key: SPARK-30802
> URL: https://issues.apache.org/jira/browse/SPARK-30802
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Similar to https://issues.apache.org/jira/browse/SPARK-29754, we can use 
> Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite. 
> Also, we should use common code to minimize code duplication. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30881) Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold

2020-02-19 Thread Gengliang Wang (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-30881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-30881:
---
Priority: Minor  (was: Major)

> Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold
> 
>
> Key: SPARK-30881
> URL: https://issues.apache.org/jira/browse/SPARK-30881
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> The doc of configuration 
> "spark.sql.sources.parallelPartitionDiscovery.threshold" is not accurate on 
> the part "This applies to Parquet, ORC, CSV, JSON and LibSVM data sources".
> We should revise it as effective on all the file-based data sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30881) Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold

2020-02-19 Thread Gengliang Wang (Jira)

Gengliang Wang created SPARK-30881:
--

 Summary: Revise the doc of 
spark.sql.sources.parallelPartitionDiscovery.threshold
 Key: SPARK-30881
 URL: https://issues.apache.org/jira/browse/SPARK-30881
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


The doc of configuration 
"spark.sql.sources.parallelPartitionDiscovery.threshold" is not accurate on the 
part "This applies to Parquet, ORC, CSV, JSON and LibSVM data sources".

We should revise it as effective on all the file-based data sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict

2020-02-19 Thread Artur Sukhenko (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040456#comment-17040456
 ] 

Artur Sukhenko commented on SPARK-21065:


[~zsxwing] Is `spark.streaming.concurrentJobs` still (2.2/2.3/2.4) risky?

> Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
> --
>
> Key: SPARK-21065
> URL: https://issues.apache.org/jira/browse/SPARK-21065
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Scheduler, Web UI
>Affects Versions: 2.1.0
>Reporter: Dan Dutrow
>Priority: Major
>
> My streaming application has 200+ output operations, many of them stateful 
> and several of them windowed. In an attempt to reduce the processing times, I 
> set "spark.streaming.concurrentJobs" to 2+. Initial results are very 
> positive, cutting our processing time from ~3 minutes to ~1 minute, but 
> eventually we encounter an exception as follows:
> (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a 
> batch from 45 minutes before the exception is thrown.)
> 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR 
> org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener 
> StreamingJobProgressListener threw an exception
> java.util.NoSuchElementException: key not found 149697756 ms
> at scala.collection.MalLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
> at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
> ...
> The Spark code causing the exception is here:
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125
>   override def onOutputOperationCompleted(
>   outputOperationCompleted: StreamingListenerOutputOperationCompleted): 
> Unit = synchronized {
> // This method is called before onBatchCompleted
> {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color}
>   updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo)
> }
> It seems to me that it may be caused by that batch being removed earlier.
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102
>   override def onBatchCompleted(batchCompleted: 
> StreamingListenerBatchCompleted): Unit = {
> synchronized {
>   waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime)
>   
> {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color}
>   val batchUIData = BatchUIData(batchCompleted.batchInfo)
>   completedBatchUIData.enqueue(batchUIData)
>   if (completedBatchUIData.size > batchUIDataLimit) {
> val removedBatch = completedBatchUIData.dequeue()
> batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime)
>   }
>   totalCompletedBatches += 1L
>   totalProcessedRecords += batchUIData.numRecords
> }
> }
> What is the solution here? Should I make my spark streaming context remember 
> duration a lot longer? ssc.remember(batchDuration * rememberMultiple)
> Otherwise, it seems like there should be some kind of existence check on 
> runningBatchUIData before dereferencing it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30879) Refine doc-building workflow



 [ 
https://issues.apache.org/jira/browse/SPARK-30879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30879:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Refine doc-building workflow
> 
>
> Key: SPARK-30879
> URL: https://issues.apache.org/jira/browse/SPARK-30879
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are a few rough edges in the workflow for building docs that could be 
> refined:
>  * sudo pip installing stuff
>  * no pinned versions of any doc dependencies
>  * using some deprecated options
>  * race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30880) Delete Sphinx Makefile cruft



 [ 
https://issues.apache.org/jira/browse/SPARK-30880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30880:
--
Description: (was: There are a few rough edges in the workflow for 
building docs that could be refined:
 * sudo pip installing stuff
 * no pinned versions of any doc dependencies
 * using some deprecated options
 * race condition with jekyll serve)

> Delete Sphinx Makefile cruft
> 
>
> Key: SPARK-30880
> URL: https://issues.apache.org/jira/browse/SPARK-30880
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30880) Delete Sphinx Makefile cruft



 [ 
https://issues.apache.org/jira/browse/SPARK-30880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30880:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Delete Sphinx Makefile cruft
> 
>
> Key: SPARK-30880
> URL: https://issues.apache.org/jira/browse/SPARK-30880
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30880) Delete Sphinx Makefile cruft

Dongjoon Hyun created SPARK-30880:
-

 Summary: Delete Sphinx Makefile cruft
 Key: SPARK-30880
 URL: https://issues.apache.org/jira/browse/SPARK-30880
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


There are a few rough edges in the workflow for building docs that could be 
refined:
 * sudo pip installing stuff
 * no pinned versions of any doc dependencies
 * using some deprecated options
 * race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-30731) Update deprecated Mkdocs option



 [ 
https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30731.
---
Fix Version/s: 2.4.6
 Assignee: Nicholas Chammas
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/27626

> Update deprecated Mkdocs option
> ---
>
> Key: SPARK-30731
> URL: https://issues.apache.org/jira/browse/SPARK-30731
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Trivial
> Fix For: 2.4.6
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30731) Update deprecated Mkdocs option



 [ 
https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30731:
--
Issue Type: Bug  (was: Improvement)

> Update deprecated Mkdocs option
> ---
>
> Key: SPARK-30731
> URL: https://issues.apache.org/jira/browse/SPARK-30731
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are a few rough edges in the workflow for building docs that could be 
> refined:
>  * sudo pip installing stuff
>  * no pinned versions of any doc dependencies
>  * using some deprecated options
>  * race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30731) Update deprecated Mkdocs option



 [ 
https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30731:
--
Description: (was: There are a few rough edges in the workflow for 
building docs that could be refined:
 * sudo pip installing stuff
 * no pinned versions of any doc dependencies
 * using some deprecated options
 * race condition with jekyll serve)

> Update deprecated Mkdocs option
> ---
>
> Key: SPARK-30731
> URL: https://issues.apache.org/jira/browse/SPARK-30731
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30731) Update deprecated Mkdocs option



 [ 
https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30731:
--
Priority: Trivial  (was: Minor)

> Update deprecated Mkdocs option
> ---
>
> Key: SPARK-30731
> URL: https://issues.apache.org/jira/browse/SPARK-30731
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> There are a few rough edges in the workflow for building docs that could be 
> refined:
>  * sudo pip installing stuff
>  * no pinned versions of any doc dependencies
>  * using some deprecated options
>  * race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30879) Refine doc-building workflow

Dongjoon Hyun created SPARK-30879:
-

 Summary: Refine doc-building workflow
 Key: SPARK-30879
 URL: https://issues.apache.org/jira/browse/SPARK-30879
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


There are a few rough edges in the workflow for building docs that could be 
refined:
 * sudo pip installing stuff
 * no pinned versions of any doc dependencies
 * using some deprecated options
 * race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30731) Update deprecated Mkdocs option



 [ 
https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30731:
--
Summary: Update deprecated Mkdocs option  (was: Refine doc-building 
workflow)

> Update deprecated Mkdocs option
> ---
>
> Key: SPARK-30731
> URL: https://issues.apache.org/jira/browse/SPARK-30731
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are a few rough edges in the workflow for building docs that could be 
> refined:
>  * sudo pip installing stuff
>  * no pinned versions of any doc dependencies
>  * using some deprecated options
>  * race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-30811) CTE that refers to non-existent table with same name causes StackOverflowError



[ 
https://issues.apache.org/jira/browse/SPARK-30811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040299#comment-17040299
 ] 

Dongjoon Hyun commented on SPARK-30811:
---

The test cases are landed to `master/branch-3.0` to prevent future regression.
- https://github.com/apache/spark/pull/27635

> CTE that refers to non-existent table with same name causes StackOverflowError
> --
>
> Key: SPARK-30811
> URL: https://issues.apache.org/jira/browse/SPARK-30811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 2.4.6
>
>
> The following query causes a StackOverflowError:
> {noformat}
> WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t
> {noformat}
> This only happens when the CTE refers to a non-existent table with the same 
> name and a database qualifier. This is caused by a couple of things:
>  * {{CTESubstitution}} runs analysis on the CTE, but this does not throw an 
> exception because the table has a database qualifier. The reason is that we 
> don't fail is because we re-attempt to resolve the relation in a later rule.
>  * {{CTESubstitution}} replace logic does not check if the table it is 
> replacing has a database, it shouldn't replace the relation if it does. So 
> now we will happily replace {{nonexist.t}} with {{t}}.
>  * {{CTESubstitution}} transforms down, this means it will keep replacing 
> {{t}} with itself, creating an infinite recursion.
> This is not an issue for master/3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save

2020-02-19 Thread Burak Yavuz (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-29908.
-
Resolution: Duplicate

> Support partitioning for DataSource V2 tables in DataFrameWriter.save
> -
>
> Key: SPARK-29908
> URL: https://issues.apache.org/jira/browse/SPARK-29908
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Currently, any data source that that upgrades to DataSource V2 loses the 
> partition transform information when using DataFrameWriter.save. The main 
> reason is the lack of an API for "creating" a table with partitioning and 
> schema information for V2 tables without a catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30878) improve the CREATE TABLE document

Wenchen Fan created SPARK-30878:
---

 Summary: improve the CREATE TABLE document
 Key: SPARK-30878
 URL: https://issues.apache.org/jira/browse/SPARK-30878
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30877) Make Aggregate higher order function with Spark Aggregate Expressions

2020-02-19 Thread Jira

Herman van Hövell created SPARK-30877:
-

 Summary: Make Aggregate higher order function with Spark Aggregate 
Expressions
 Key: SPARK-30877
 URL: https://issues.apache.org/jira/browse/SPARK-30877
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hövell


The aggregate higher order function should be able to work with spark's 
aggregate functions. For example: {{aggregate(xs, x -> avg(x - 1) + 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30876) Optimizer cannot infer more constraint

Yuming Wang created SPARK-30876:
---

 Summary: Optimizer cannot infer more constraint
 Key: SPARK-30876
 URL: https://issues.apache.org/jira/browse/SPARK-30876
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5, 2.3.4, 3.0.0
Reporter: Yuming Wang


How to reproduce this issue:
{code:sql}
create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);
{code}
Spark 2.3+:
{noformat}
== Physical Plan ==
*(4) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#102]
   +- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(3) Project
 +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
:- *(3) Project [b#10]
:  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
: :- *(3) Project [a#6]
: :  +- *(3) Filter isnotnull(a#6)
: : +- *(3) ColumnarToRow
: :+- FileScan parquet default.t1[a#6] Batched: true, 
DataFilters: [isnotnull(a#6)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct
: +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#87]
:+- *(1) Project [b#10]
:   +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1))
:  +- *(1) ColumnarToRow
: +- FileScan parquet default.t2[b#10] Batched: 
true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], ReadSchema: 
struct
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, true] as bigint))), [id=#96]
   +- *(2) Project [c#14]
  +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1))
 +- *(2) ColumnarToRow
+- FileScan parquet default.t3[c#14] Batched: true, 
DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], ReadSchema: 
struct

Time taken: 3.785 seconds, Fetched 1 row(s)
{noformat}


Spark 2.2.x:
{noformat}
== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
  +- *Project
 +- *SortMergeJoin [b#19], [c#23], Inner
:- *Project [b#19]
:  +- *SortMergeJoin [a#15], [b#19], Inner
: :- *Sort [a#15 ASC NULLS FIRST], false, 0
: :  +- Exchange hashpartitioning(a#15, 200)
: : +- *Filter (isnotnull(a#15) && (a#15 = 1))
: :+- HiveTableScan [a#15], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, 
b#16, c#17]
: +- *Sort [b#19 ASC NULLS FIRST], false, 0
:+- Exchange hashpartitioning(b#19, 200)
:   +- *Filter (isnotnull(b#19) && (b#19 = 1))
:  +- HiveTableScan [b#19], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, 
b#19, c#20]
+- *Sort [c#23 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(c#23, 200)
  +- *Filter (isnotnull(c#23) && (c#23 = 1))
 +- HiveTableScan [c#23], HiveTableRelation `default`.`t3`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, b#22, c#23]
Time taken: 0.728 seconds, Fetched 1 row(s)
{noformat}

Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query

2020-02-19 Thread Peter Toth (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040121#comment-17040121
 ] 

Peter Toth commented on SPARK-24497:


Thanks [~dmateusp] for your +1. That PR is mine and I will resolve the 
conflicts soon. Hopefully it will get some review after that.

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --  | |
> --  | +->D-+->F
> --  +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-30875) Revisit the decision of writing parquet TIMESTAMP_MICROS by default



[ 
https://issues.apache.org/jira/browse/SPARK-30875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040042#comment-17040042
 ] 

Wenchen Fan commented on SPARK-30875:
-

cc [~maxgekk] [~hyukjin.kwon]

> Revisit the decision of writing parquet TIMESTAMP_MICROS by default
> ---
>
> Key: SPARK-30875
> URL: https://issues.apache.org/jira/browse/SPARK-30875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark 3.0, we write out timestamp values as parquet TIMESTAMP_MICROS by 
> default, instead of INT96. This is good in general as Spark can read all 
> kinds of parquet timestamps, but works better with TIMESTAMP_MICROS.
> However, this brings some troubles with hive compatibility. Spark can use 
> native parquet writer to write hive parquet tables, which may break hive 
> compatibility if Spark writes TIMESTAMP_MICROS.
> We can switch back to INT96 by default, or fix it:
> 1. when using native parquet writer to write hive parquet tables, write 
> timestamp as INT96.
> 2. when creating tables in `HiveExternalCatalog.createTable`, don't claim the 
> parquet table is hive compatible if it has timestamp columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30875) Revisit the decision of writing parquet TIMESTAMP_MICROS by default



 [ 
https://issues.apache.org/jira/browse/SPARK-30875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30875:

Description: 
In Spark 3.0, we write out timestamp values as parquet TIMESTAMP_MICROS by 
default, instead of INT96. This is good in general as Spark can read all kinds 
of parquet timestamps, but works better with TIMESTAMP_MICROS.

However, this brings some troubles with hive compatibility. Spark can use 
native parquet writer to write hive parquet tables, which may break hive 
compatibility if Spark writes TIMESTAMP_MICROS.

We can switch back to INT96 by default, or fix it:
1. when using native parquet writer to write hive parquet tables, write 
timestamp as INT96.
2. when creating tables in `HiveExternalCatalog.createTable`, don't claim the 
parquet table is hive compatible if it has timestamp columns.

  was:In Spark 3.0, we write out timestamp values as parquet 


> Revisit the decision of writing parquet TIMESTAMP_MICROS by default
> ---
>
> Key: SPARK-30875
> URL: https://issues.apache.org/jira/browse/SPARK-30875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark 3.0, we write out timestamp values as parquet TIMESTAMP_MICROS by 
> default, instead of INT96. This is good in general as Spark can read all 
> kinds of parquet timestamps, but works better with TIMESTAMP_MICROS.
> However, this brings some troubles with hive compatibility. Spark can use 
> native parquet writer to write hive parquet tables, which may break hive 
> compatibility if Spark writes TIMESTAMP_MICROS.
> We can switch back to INT96 by default, or fix it:
> 1. when using native parquet writer to write hive parquet tables, write 
> timestamp as INT96.
> 2. when creating tables in `HiveExternalCatalog.createTable`, don't claim the 
> parquet table is hive compatible if it has timestamp columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30875) Revisit the decision of writing parquet TIMESTAMP_MICROS by default



 [ 
https://issues.apache.org/jira/browse/SPARK-30875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30875:

Description: In Spark 3.0, we write out timestamp values as parquet 

> Revisit the decision of writing parquet TIMESTAMP_MICROS by default
> ---
>
> Key: SPARK-30875
> URL: https://issues.apache.org/jira/browse/SPARK-30875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark 3.0, we write out timestamp values as parquet 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30875) Revisit the decision of writing parquet TIMESTAMP_MICROS by default

Wenchen Fan created SPARK-30875:
---

 Summary: Revisit the decision of writing parquet TIMESTAMP_MICROS 
by default
 Key: SPARK-30875
 URL: https://issues.apache.org/jira/browse/SPARK-30875
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-30874) Support Postgres Kerberos login in JDBC connector

2020-02-19 Thread Gabor Somogyi (Jira)

Gabor Somogyi created SPARK-30874:
-

 Summary: Support Postgres Kerberos login in JDBC connector
 Key: SPARK-30874
 URL: https://issues.apache.org/jira/browse/SPARK-30874
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.5
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-02-19 Thread Gabor Somogyi (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039975#comment-17039975
 ] 

Gabor Somogyi commented on SPARK-12312:
---

After deep analysis it has turned out each database type needs custom 
implementation so creating subtasks to handle them.

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-30763) Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract



 [ 
https://issues.apache.org/jira/browse/SPARK-30763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30763.
-
Fix Version/s: 2.4.6
   3.0.0
 Assignee: jiaan.geng
   Resolution: Fixed

> Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract
> -
>
> Key: SPARK-30763
> URL: https://issues.apache.org/jira/browse/SPARK-30763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>
> The current implement of regexp_extract will throws a unprocessed exception 
> show below:
> SELECT regexp_extract('1a 2b 14m', '
> d+')
>  
> {code:java}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in 
> stage 22.0 (TID 33, 192.168.1.6, executor driver): 
> java.lang.IndexOutOfBoundsException: No group 1
> [info] at java.util.regex.Matcher.group(Matcher.java:538)
> [info] at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info] at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info] at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804)
> [info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227)
> [info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227)
> [info] at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2156)
> [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info] at org.apache.spark.scheduler.Task.run(Task.scala:127)
> [info] at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
> [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> [info] at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
> [info] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info] at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> I think should treat this exception well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-30786) Block replication is not retried on other BlockManagers when it fails on 1 of the peers



 [ 
https://issues.apache.org/jira/browse/SPARK-30786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30786.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27539
[https://github.com/apache/spark/pull/27539]

> Block replication is not retried on other BlockManagers when it fails on 1 of 
> the peers
> ---
>
> Key: SPARK-30786
> URL: https://issues.apache.org/jira/browse/SPARK-30786
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Prakhar Jain
>Assignee: Prakhar Jain
>Priority: Major
> Fix For: 3.1.0
>
>
> When we cache an RDD with replication > 1, Firstly the RDD block is cached 
> locally on one of the BlockManager and then it is replicated to 
> (replication-1) number of BlockManagers. While replicating a block, if 
> replication fails on one of the peers, it is supposed to retry the 
> replication on some other peer (based on 
> "spark.storage.maxReplicationFailures" config). But currently this doesn't 
> happen because of some issue.
> Logs of 1 of the executor which is trying to replicate:
> {noformat}
> 20/02/10 09:01:47 INFO Executor: Starting executor ID 1 on host 
> wn11-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net
> .
> .
> .
> 20/02/10 09:06:45 INFO Executor: Running task 244.0 in stage 3.0 (TID 550)
> 20/02/10 09:06:45 DEBUG BlockManager: Getting local block rdd_13_244
> 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 was not found
> 20/02/10 09:06:45 DEBUG BlockManager: Getting remote block rdd_13_244
> 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 not found
> 20/02/10 09:06:46 INFO MemoryStore: Block rdd_13_244 stored as values in 
> memory (estimated size 33.3 MB, free 44.2 MB)
> 20/02/10 09:06:46 DEBUG BlockManager: Told master about block rdd_13_244
> 20/02/10 09:06:46 DEBUG BlockManager: Put block rdd_13_244 locally took  947 
> ms
> 20/02/10 09:06:46 DEBUG BlockManager: Level for block rdd_13_244 is 
> StorageLevel(memory, deserialized, 3 replicas)
> 20/02/10 09:06:46 TRACE BlockManager: Trying to replicate rdd_13_244 of 
> 34908552 bytes to BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None)
> 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes 
> to BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) 
> in 205.849858 ms
> 20/02/10 09:06:47 TRACE BlockManager: Trying to replicate rdd_13_244 of 
> 34908552 bytes to BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None)
> 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes 
> to BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) 
> in 180.501504 ms
> 20/02/10 09:06:47 DEBUG BlockManager: Replicating rdd_13_244 of 34908552 
> bytes to 2 peer(s) took 387.381168 ms
> 20/02/10 09:06:47 DEBUG BlockManager: block rdd_13_244 replicated to 
> BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None), 
> BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None)
> 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 remotely took  423 
> ms
> 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 with 
> replication took  1371 ms
> 20/02/10 09:06:47 DEBUG BlockManager: Getting local block rdd_13_244
> 20/02/10 09:06:47 DEBUG BlockManager: Level for block rdd_13_244 is 
> StorageLevel(memory, deserialized, 3 replicas)
> 20/02/10 09:06:47 INFO Executor: Finished task 244.0 in stage 3.0 (TID 550). 
> 2253 bytes result sent to driver
> {noformat}
> Logs of other executor where the block is being replicated to:
> {noformat}
> 20/02/10 09:01:47 INFO Executor: Starting executor ID 5 on host 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net
> .
> .
> .
> 20/02/10 09:06:47 INFO MemoryStore: Will not store rdd_13_244
> 20/02/10 09:06:47 WARN MemoryStore: Not enough space to cache rdd_13_244 in 
> memory! (computed 4.2 MB so far)
> 20/02/10 09:06:47 INFO MemoryStore: Memory use = 4.9 GB (blocks) + 7.3 MB 
> (scratch space shared across 2 tasks(s)) = 4.9 GB. Storage limit = 4.9 GB.
> 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 locally took  12 ms
> 20/02/10 09:06:47 WARN BlockManager: Block rdd_13_244 could not be removed as 
> it was not found on disk or in memory
> 20/02/10 09:06:47 WARN BlockManager: Putting block rdd_13_244 failed
> 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 without 
> replication took  13 ms
> {noformat}
> Note here that the block replication

[jira] [Assigned] (SPARK-30786) Block replication is not retried on other BlockManagers when it fails on 1 of the peers



 [ 
https://issues.apache.org/jira/browse/SPARK-30786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30786:
---

Assignee: Prakhar Jain

> Block replication is not retried on other BlockManagers when it fails on 1 of 
> the peers
> ---
>
> Key: SPARK-30786
> URL: https://issues.apache.org/jira/browse/SPARK-30786
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Prakhar Jain
>Assignee: Prakhar Jain
>Priority: Major
>
> When we cache an RDD with replication > 1, Firstly the RDD block is cached 
> locally on one of the BlockManager and then it is replicated to 
> (replication-1) number of BlockManagers. While replicating a block, if 
> replication fails on one of the peers, it is supposed to retry the 
> replication on some other peer (based on 
> "spark.storage.maxReplicationFailures" config). But currently this doesn't 
> happen because of some issue.
> Logs of 1 of the executor which is trying to replicate:
> {noformat}
> 20/02/10 09:01:47 INFO Executor: Starting executor ID 1 on host 
> wn11-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net
> .
> .
> .
> 20/02/10 09:06:45 INFO Executor: Running task 244.0 in stage 3.0 (TID 550)
> 20/02/10 09:06:45 DEBUG BlockManager: Getting local block rdd_13_244
> 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 was not found
> 20/02/10 09:06:45 DEBUG BlockManager: Getting remote block rdd_13_244
> 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 not found
> 20/02/10 09:06:46 INFO MemoryStore: Block rdd_13_244 stored as values in 
> memory (estimated size 33.3 MB, free 44.2 MB)
> 20/02/10 09:06:46 DEBUG BlockManager: Told master about block rdd_13_244
> 20/02/10 09:06:46 DEBUG BlockManager: Put block rdd_13_244 locally took  947 
> ms
> 20/02/10 09:06:46 DEBUG BlockManager: Level for block rdd_13_244 is 
> StorageLevel(memory, deserialized, 3 replicas)
> 20/02/10 09:06:46 TRACE BlockManager: Trying to replicate rdd_13_244 of 
> 34908552 bytes to BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None)
> 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes 
> to BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) 
> in 205.849858 ms
> 20/02/10 09:06:47 TRACE BlockManager: Trying to replicate rdd_13_244 of 
> 34908552 bytes to BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None)
> 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes 
> to BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) 
> in 180.501504 ms
> 20/02/10 09:06:47 DEBUG BlockManager: Replicating rdd_13_244 of 34908552 
> bytes to 2 peer(s) took 387.381168 ms
> 20/02/10 09:06:47 DEBUG BlockManager: block rdd_13_244 replicated to 
> BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None), 
> BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None)
> 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 remotely took  423 
> ms
> 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 with 
> replication took  1371 ms
> 20/02/10 09:06:47 DEBUG BlockManager: Getting local block rdd_13_244
> 20/02/10 09:06:47 DEBUG BlockManager: Level for block rdd_13_244 is 
> StorageLevel(memory, deserialized, 3 replicas)
> 20/02/10 09:06:47 INFO Executor: Finished task 244.0 in stage 3.0 (TID 550). 
> 2253 bytes result sent to driver
> {noformat}
> Logs of other executor where the block is being replicated to:
> {noformat}
> 20/02/10 09:01:47 INFO Executor: Starting executor ID 5 on host 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net
> .
> .
> .
> 20/02/10 09:06:47 INFO MemoryStore: Will not store rdd_13_244
> 20/02/10 09:06:47 WARN MemoryStore: Not enough space to cache rdd_13_244 in 
> memory! (computed 4.2 MB so far)
> 20/02/10 09:06:47 INFO MemoryStore: Memory use = 4.9 GB (blocks) + 7.3 MB 
> (scratch space shared across 2 tasks(s)) = 4.9 GB. Storage limit = 4.9 GB.
> 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 locally took  12 ms
> 20/02/10 09:06:47 WARN BlockManager: Block rdd_13_244 could not be removed as 
> it was not found on disk or in memory
> 20/02/10 09:06:47 WARN BlockManager: Putting block rdd_13_244 failed
> 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 without 
> replication took  13 ms
> {noformat}
> Note here that the block replication failed in Executor-5 with log line "Not 
> enough space to cache rdd_13_244 in memory!". But Executor-1 shows that block 
> is

[jira] [Comment Edited] (SPARK-26346) Upgrade parquet to 1.11.1

2020-02-19 Thread Jira



[ 
https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039904#comment-17039904
 ] 

Ismaël Mejía edited comment on SPARK-26346 at 2/19/20 10:54 AM:


It seems this issue is blocked because Spark is having dependency issues 
because of its dependency on Hive. Hive depends on both Avro 1.7 (for Hive 1.x) 
and Avro 1.8 (for Hive 2.x). Parquet depends on Avro 1.9. So there are issues 
from API changes on Avro. I proposed a patch for Hive with the hope that it 
gets backported into Hive 2.x to unlock the Avro 1.9 upgrade on Spark (and 
transitively this issue) but so far I have not received any attention from the 
Hive community, for more info HIVE-21737 and [my multiples call for attention 
in the Hive 
ML|[https://lists.apache.org/thread.html/rc6c672ad4a5e255957d54d80ff83bf48eacece2828a86bc6cedd9c4c%40%3Cdev.hive.apache.org%3E].]
 If someone can ping somebody in the Hive community that can help this advance 
that would be great!


was (Author: iemejia):
It seems this issue is blocked because Spark is having dependency issues 
because of its dependency on Hive. Hive depends on both Avro 1.7 (for Hive 1.x) 
and Avro 1.8 (for Hive 2.x). Parquet depends on Avro 1.9. So there are issues 
from API changes on Avro. I proposed a patch for Hive with the hope that it 
gets backported into Hive 2.x but so far I have not received any attention from 
the Hive community, for more info HIVE-21737 and [my multiples call for 
attention in the Hive 
ML|[https://lists.apache.org/thread.html/rc6c672ad4a5e255957d54d80ff83bf48eacece2828a86bc6cedd9c4c%40%3Cdev.hive.apache.org%3E].]
 If someone can ping somebody in the Hive community that can help this advance 
that would be great!

> Upgrade parquet to 1.11.1
> -
>
> Key: SPARK-26346
> URL: https://issues.apache.org/jira/browse/SPARK-26346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26346) Upgrade parquet to 1.11.1

2020-02-19 Thread Jira



[ 
https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039904#comment-17039904
 ] 

Ismaël Mejía commented on SPARK-26346:
--

It seems this issue is blocked because Spark is having dependency issues 
because of its dependency on Hive. Hive depends on both Avro 1.7 (for Hive 1.x) 
and Avro 1.8 (for Hive 2.x). Parquet depends on Avro 1.9. So there are issues 
from API changes on Avro. I proposed a patch for Hive with the hope that it 
gets backported into Hive 2.x but so far I have not received any attention from 
the Hive community, for more info HIVE-21737 and [my multiples call for 
attention in the Hive 
ML|[https://lists.apache.org/thread.html/rc6c672ad4a5e255957d54d80ff83bf48eacece2828a86bc6cedd9c4c%40%3Cdev.hive.apache.org%3E].]
 If someone can ping somebody in the Hive community that can help this advance 
that would be great!

> Upgrade parquet to 1.11.1
> -
>
> Key: SPARK-26346
> URL: https://issues.apache.org/jira/browse/SPARK-26346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manger in Spark



[ 
https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039884#comment-17039884
 ] 

Saurabh Chawla commented on SPARK-30873:


We have raised the WIP PR for this.

cc [~holdenkarau] [~itskals][~amargoor]

> Handling Node Decommissioning for Yarn cluster manger in Spark
> --
>
> Key: SPARK-30873
> URL: https://issues.apache.org/jira/browse/SPARK-30873
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Priority: Major
>
> In many public cloud environments, the node loss (in case of AWS 
> SpotLoss,Spot blocks and GCP preemptible VMs) is a planned and informed 
> activity. 
> The cloud provider intimates the cluster manager about the possible loss of 
> node ahead of time. Few examples is listed here:
> a) Spot loss in AWS(2 min before event)
> b) GCP Pre-emptible VM loss (30 second before event)
> c) AWS Spot block loss with info on termination time (generally few tens of 
> minutes before decommission as configured in Yarn)
> This JIRA tries to make spark leverage the knowledge of the node loss in 
> future, and tries to adjust the scheduling of tasks to minimise the impact on 
> the application. 
> It is well known that when a host is lost, the executors, its running tasks, 
> their caches and also Shuffle data is lost. This could result in wastage of 
> compute and other resources.
> The focus here is to build a framework for YARN, that can be extended for 
> other cluster managers to handle such scenario.
> The framework must handle one or more of the following:-
> 1) Prevent new tasks from starting on any executors on decommissioning Nodes. 
> 2) Decide to kill the running tasks so that they can be restarted elsewhere 
> (assuming they will not complete within the deadline) OR we can allow them to 
> continue hoping they will finish within deadline.
> 3) Clear the shuffle data entry from MapOutputTracker of decommission node 
> hostname to prevent the shuffle fetchfailed exception.The most significant 
> advantage of unregistering shuffle outputs when Spark schedules the first 
> re-attempt to compute the missing blocks, it notices all of the missing 
> blocks from decommissioned nodes and recovers in only one attempt. This 
> speeds up the recovery process significantly over the scheduled Spark 
> implementation, where stages might be rescheduled multiple times to recompute 
> missing shuffles from all nodes, and prevent jobs from being stuck for hours 
> failing and recomputing.
> 4) Prevent the stage to abort due to the fetchfailed exception in case of 
> decommissioning of node. In Spark there is number of consecutive stage 
> attempts allowed before a stage is aborted.This is controlled by the config 
> spark.stage.maxConsecutiveAttempts. Not accounting fetch fails due 
> decommissioning of nodes towards stage failure improves the reliability of 
> the system.
> Main components of change
> 1) Get the ClusterInfo update from the Resource Manager -> Application Master 
> -> Spark Driver.
> 2) DecommissionTracker, resides inside driver, tracks all the decommissioned 
> nodes and take necessary action and state transition.
> 3) Based on the decommission node list add hooks at code to achieve
>  a) No new task on executor
>  b) Remove shuffle data mapping info for the node to be decommissioned from 
> the mapOutputTracker
>  c) Do not count fetchFailure from decommissioned towards stage failure
> On the receiving info that node is to be decommissioned, the below action 
> needs to be performed by DecommissionTracker on driver:
>  * Add the entry of Nodes in DecommissionTracker with termination time and 
> node state as "DECOMMISSIONING".
>  * Stop assigning any new tasks on executors on the nodes which are candidate 
> for decommission. This makes sure slowly as the tasks finish the usage of 
> this node would die down.
>  * Kill all the executors for the decommissioning nodes after configurable 
> period of time, say "spark.graceful.decommission.executor.leasetimePct". This 
> killing ensures two things. Firstly, the task failure will be attributed in 
> job failure count. Second, avoid generation on more shuffle data on the node 
> that will eventually be lost. The node state is set to 
> "EXECUTOR_DECOMMISSIONED". 
>  * Mark Shuffle data on the node as unavailable after 
> "spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will 
> ensure that recomputation of missing shuffle partition is done early, rather 
> than reducers failing with a time-consuming FetchFailure. The node state is 
> set to "SHUFFLE_DECOMMISSIONED". 
>  * Mark Node as Terminated after the termination time. Now the state of the 
> node is "TERMINATED".
>  * Remove the node

[jira] [Updated] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manger in Spark

[
https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Saurabh Chawla updated SPARK-30873:
---
Description:
In many public cloud environments, the node loss (in case of AWS SpotLoss,Spot
blocks and GCP preemptible VMs) is a planned and informed activity.
The cloud provider intimates the cluster manager about the possible loss of
node ahead of time. Few examples is listed here:
a) Spot loss in AWS(2 min before event)
b) GCP Pre-emptible VM loss (30 second before event)
c) AWS Spot block loss with info on termination time (generally few tens of
minutes before decommission as configured in Yarn)

This JIRA tries to make spark leverage the knowledge of the node loss in
future, and tries to adjust the scheduling of tasks to minimise the impact on
the application.
It is well known that when a host is lost, the executors, its running tasks,
their caches and also Shuffle data is lost. This could result in wastage of
compute and other resources.

The focus here is to build a framework for YARN, that can be extended for other
cluster managers to handle such scenario.

The framework must handle one or more of the following:-
1) Prevent new tasks from starting on any executors on decommissioning Nodes.
2) Decide to kill the running tasks so that they can be restarted elsewhere
(assuming they will not complete within the deadline) OR we can allow them to
continue hoping they will finish within deadline.
3) Clear the shuffle data entry from MapOutputTracker of decommission node
hostname to prevent the shuffle fetchfailed exception.The most significant
advantage of unregistering shuffle outputs when Spark schedules the first
re-attempt to compute the missing blocks, it notices all of the missing blocks
from decommissioned nodes and recovers in only one attempt. This speeds up the
recovery process significantly over the scheduled Spark implementation, where
stages might be rescheduled multiple times to recompute missing shuffles from
all nodes, and prevent jobs from being stuck for hours failing and recomputing.
4) Prevent the stage to abort due to the fetchfailed exception in case of
decommissioning of node. In Spark there is number of consecutive stage attempts
allowed before a stage is aborted.This is controlled by the config
spark.stage.maxConsecutiveAttempts. Not accounting fetch fails due
decommissioning of nodes towards stage failure improves the reliability of the
system.

Main components of change
1) Get the ClusterInfo update from the Resource Manager -> Application Master
-> Spark Driver.
2) DecommissionTracker, resides inside driver, tracks all the decommissioned
nodes and take necessary action and state transition.
3) Based on the decommission node list add hooks at code to achieve
a) No new task on executor
b) Remove shuffle data mapping info for the node to be decommissioned from the
mapOutputTracker
c) Do not count fetchFailure from decommissioned towards stage failure

On the receiving info that node is to be decommissioned, the below action needs
to be performed by DecommissionTracker on driver:
* Add the entry of Nodes in DecommissionTracker with termination time and node
state as "DECOMMISSIONING".
* Stop assigning any new tasks on executors on the nodes which are candidate
for decommission. This makes sure slowly as the tasks finish the usage of this
node would die down.
* Kill all the executors for the decommissioning nodes after configurable
period of time, say "spark.graceful.decommission.executor.leasetimePct". This
killing ensures two things. Firstly, the task failure will be attributed in job
failure count. Second, avoid generation on more shuffle data on the node that
will eventually be lost. The node state is set to "EXECUTOR_DECOMMISSIONED".
* Mark Shuffle data on the node as unavailable after
"spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will
ensure that recomputation of missing shuffle partition is done early, rather
than reducers failing with a time-consuming FetchFailure. The node state is set
to "SHUFFLE_DECOMMISSIONED".
* Mark Node as Terminated after the termination time. Now the state of the
node is "TERMINATED".
* Remove the node entry from Decommission Tracker if the same host name is
reused.(This is not uncommon in many public cloud environments).

was:
In many public cloud environments, the node loss (in case of AWS SpotLoss,Spot
blocks and GCP preemptible VMs) is a planned and informed activity.
The cloud provider intimates the cluster manager about the possible loss of
node ahead of time. Few exmaples is listed here:
a) Spot loss in AWS(2 min before event)
b) GCP Pre-emptible VM loss (30 second before event)
c) AWS Spot block loss with info on termination time (generally few tens of
minutes before decommission as configured in Yarn)

This JIRA tries to make spark leverage the

[jira] [Created] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manger in Spark

Saurabh Chawla created SPARK-30873:
--

 Summary: Handling Node Decommissioning for Yarn cluster manger in 
Spark
 Key: SPARK-30873
 URL: https://issues.apache.org/jira/browse/SPARK-30873
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 3.0.0
Reporter: Saurabh Chawla


In many public cloud environments, the node loss (in case of AWS SpotLoss,Spot 
blocks and GCP preemptible VMs) is a planned and informed activity. 
The cloud provider intimates the cluster manager about the possible loss of 
node ahead of time. Few exmaples is listed here:
a) Spot loss in AWS(2 min before event)
b) GCP Pre-emptible VM loss (30 second before event)
c) AWS Spot block loss with info on termination time (generally few tens of 
minutes before decommission as configured in Yarn)

This JIRA tries to make spark leverage the knowledge of the node loss in 
future, and tries to adjust the scheduling of tasks to minimise the impact on 
the application. 
It is well known that when a host is lost, the executors, its running tasks, 
their caches and also Shuffle data is lost. This could result in wastage of 
compute and other resources.

The focus here is to build a framework for YARN, that can be extended for other 
cluster managers to handle such scenario.

The framework must handle one or more of the following:-
1) Prevent new tasks from starting on any executors on decommissioning Nodes. 
2) Decide to kill the running tasks so that they can be restarted elsewhere 
(assuming they will not complete within the deadline) OR we can allow them to 
continue hoping they will finish within deadline.
3) Clear the shuffle data entry from MapOutputTracker of decommission node 
hostname to prevent the shuffle fetchfailed exception.The most significant 
advantage of unregistering shuffle outputs when Spark schedules the first 
re-attempt to compute the missing blocks, it notices all of the missing blocks 
from decommissioned nodes and recovers in only one attempt. This speeds up the 
recovery process significantly over the scheduled Spark implementation, where 
stages might be rescheduled multiple times to recompute missing shuffles from 
all nodes, and prevent jobs from being stuck for hours failing and recomputing.
4) Prevent the stage to abort due to the fetchfailed execption in case of 
decomissioning of node. In Spark there is number of consecutive stage attempts 
allowed before a stage is aborted.This is controlled by the config 
spark.stage.maxConsecutiveAttempts. Not accounting fetch fails due 
decommissioning of nodes towards stage failure improves the reliability of the 
system.

Main components of change
1) Get the ClusterInfo update from the Resource Manager -> Application Master 
-> Spark Driver.
2) DecommissionTracker, resides inside driver, tracks all the decommissioned 
nodes and take necessary action and state transition.
3) Based on the decommission node list add hooks at code to achieve
 a) No new task on executor
 b) Remove shuffle data mapping info for the node to be decommissioned from the 
mapOutputTracker
 c) Do not count fetchFailure from decommissioned towards stage failure

On the receiving info that node is to be decommissioned, the below action needs 
to be performed by DecommissionTracker on driver:
 * Add the entry of Nodes in DecommissionTracker with termination time and 
nodestate as "DECOMMISSIONING".
 * Stop assigning any new tasks on executors on the nodes which are candidate 
for decommission. This makes sure slowly as the tasks finish the usage of this 
node would die down.
 * Kill all the executors for the decommissioning nodes after configurable 
period of time, say "spark.graceful.decommission.executor.leasetimePct". This 
killing ensures two things. Firstly, the task failure will be attributed in job 
failure count. Second, avoid generation on more shuffle data on the node that 
will eventually be lost. The node state is set to "EXECUTOR_DECOMMISSIONED". 
 * Mark Shuffle data on the node as unavailable after 
"spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will 
ensure that recomputation of missing shuffle partition is done early, rather 
than reducers failing with a time-consuming FetchFailure. The node state is set 
to "SHUFFLE_DECOMMISSIONED". 
 * Mark Node as Terminated after the termination time. Now the state of the 
node is "TERMINATED".
 * Remove the node entry from Decommission Tracker if the same host name is 
reused.(This is not uncommon in many public cloud environments).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26346) Upgrade parquet to 1.11.1

2020-02-19 Thread Gabor Szadovszky (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039871#comment-17039871
 ] 

Gabor Szadovszky commented on SPARK-26346:
--

[~h-vetinari],
Parquet 1.11.1 was initiated to support the Spark integration. I would not do 
the release until Spark can confirm that 1.11.1-SNAPSHOT works correctly.
I've added PARQUET-1796 to 1.11.1.

> Upgrade parquet to 1.11.1
> -
>
> Key: SPARK-26346
> URL: https://issues.apache.org/jira/browse/SPARK-26346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30872) Constraints inferred from inferred attributes



 [ 
https://issues.apache.org/jira/browse/SPARK-30872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30872:

Affects Version/s: (was: 3.0.0)
   3.1.0

> Constraints inferred from inferred attributes
> -
>
> Key: SPARK-30872
> URL: https://issues.apache.org/jira/browse/SPARK-30872
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> scala> spark.range(20).selectExpr("id as a", "id as b", "id as 
> c").write.saveAsTable("t1")
> scala> spark.sql("select count(*) from t1 where a = b and b = c and (c = 3 or 
> c = 13)").explain(false)
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, true, [id=#76]
>+- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(1) Project
>  +- *(1) Filter (((isnotnull(c#36L) AND ((b#35L = 3) OR (b#35L = 
> 13))) AND isnotnull(b#35L)) AND (a#34L = c#36L)) AND isnotnull(a#34L)) AND 
> (a#34L = b#35L)) AND (b#35L = c#36L)) AND ((c#36L = 3) OR (c#36L = 13)))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[a#34L,b#35L,c#36L] Batched: 
> true, DataFilters: [isnotnull(c#36L), ((b#35L = 3) OR (b#35L = 13)), 
> isnotnull(b#35L), (a#34L = c#36L), isnotnull(a#..., Format: Parquet, 
> Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c), 
> Or(EqualTo(b,3),EqualTo(b,13)), IsNotNull(b), IsNotNull(a), 
> Or(EqualTo(c,3),EqualT..., ReadSchema: struct
> {code}
> We can infer more constraints: {{(a#34L = 3) OR (a#34L = 13)}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30872) Constraints inferred from inferred attributes



 [ 
https://issues.apache.org/jira/browse/SPARK-30872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30872:

Description: 
{code:scala}
scala> spark.range(20).selectExpr("id as a", "id as b", "id as 
c").write.saveAsTable("t1")

scala> spark.sql("select count(*) from t1 where a = b and b = c and (c = 3 or c 
= 13)").explain(false)
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#76]
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(1) Project
 +- *(1) Filter (((isnotnull(c#36L) AND ((b#35L = 3) OR (b#35L = 
13))) AND isnotnull(b#35L)) AND (a#34L = c#36L)) AND isnotnull(a#34L)) AND 
(a#34L = b#35L)) AND (b#35L = c#36L)) AND ((c#36L = 3) OR (c#36L = 13)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[a#34L,b#35L,c#36L] Batched: true, 
DataFilters: [isnotnull(c#36L), ((b#35L = 3) OR (b#35L = 13)), 
isnotnull(b#35L), (a#34L = c#36L), isnotnull(a#..., Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(c), 
Or(EqualTo(b,3),EqualTo(b,13)), IsNotNull(b), IsNotNull(a), 
Or(EqualTo(c,3),EqualT..., ReadSchema: struct
{code}

We can infer more constraints: {{(a#34L = 3) OR (a#34L = 13)}}.


  was:

{code:scala}
scala> spark.range(20).selectExpr("id as a", "id as b", "id as 
c").write.saveAsTable("t1")

scala> spark.sql("select count(*) from t1 where a = b and b = c and (c = 3 or c 
= 13)").explain(false)
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#76]
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
  +- *(1) Project
 +- *(1) Filter (((isnotnull(c#36L) AND ((b#35L = 3) OR (b#35L = 
13))) AND isnotnull(b#35L)) AND (a#34L = c#36L)) AND isnotnull(a#34L)) AND 
(a#34L = b#35L)) AND (b#35L = c#36L)) AND ((c#36L = 3) OR (c#36L = 13)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[a#34L,b#35L,c#36L] Batched: true, 
DataFilters: [isnotnull(c#36L), ((b#35L = 3) OR (b#35L = 13)), 
isnotnull(b#35L), (a#34L = c#36L), isnotnull(a#..., Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(c), 
Or(EqualTo(b,3),EqualTo(b,13)), IsNotNull(b), IsNotNull(a), 
Or(EqualTo(c,3),EqualT..., ReadSchema: struct
{code}



> Constraints inferred from inferred attributes
> -
>
> Key: SPARK-30872
> URL: https://issues.apache.org/jira/browse/SPARK-30872
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> scala> spark.range(20).selectExpr("id as a", "id as b", "id as 
> c").write.saveAsTable("t1")
> scala> spark.sql("select count(*) from t1 where a = b and b = c and (c = 3 or 
> c = 13)").explain(false)
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, true, [id=#76]
>+- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(1) Project
>  +- *(1) Filter (((isnotnull(c#36L) AND ((b#35L = 3) OR (b#35L = 
> 13))) AND isnotnull(b#35L)) AND (a#34L = c#36L)) AND isnotnull(a#34L)) AND 
> (a#34L = b#35L)) AND (b#35L = c#36L)) AND ((c#36L = 3) OR (c#36L = 13)))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[a#34L,b#35L,c#36L] Batched: 
> true, DataFilters: [isnotnull(c#36L), ((b#35L = 3) OR (b#35L = 13)), 
> isnotnull(b#35L), (a#34L = c#36L), isnotnull(a#..., Format: Parquet, 
> Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c), 
> Or(EqualTo(b,3),EqualTo(b,13)), IsNotNull(b), IsNotNull(a), 
> Or(EqualTo(c,3),EqualT..., ReadSchema: struct
> {code}
> We can infer more constraints: {{(a#34L = 3) OR (a#34L = 13)}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-30349) The result is wrong when joining tables with selecting the same columns



 [ 
https://issues.apache.org/jira/browse/SPARK-30349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30349.
--
Resolution: Cannot Reproduce

Resolving this due to no feedback from the author.

> The result is wrong when joining tables with selecting the same columns
> ---
>
> Key: SPARK-30349
> URL: https://issues.apache.org/jira/browse/SPARK-30349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: cen yuhai
>Priority: Blocker
>  Labels: correctness
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> {code:sql}
> // code placeholder
> with tmp as(
> select
> log_date,
> buvid,
> manga_id,
> sum(readtime) readtime
> from
> manga.dwd_app_readtime_xt_dt
> where
> log_date >= 20191220
> group by
> log_date,
> buvid,
> manga_id
> )
> select
> t.log_date,
> GET_JSON_OBJECT(t.extended_fields, '$.type'),
> count(distinct t.buvid),
> count(distinct t0.buvid),
> count(distinct t1.buvid),
> count(distinct t2.buvid),
> count(
> distinct case
> when t1.buvid = t0.buvid then t1.buvid
> end
> ),
> count(
> distinct case
> when t1.buvid = t0.buvid
> and t1.buvid = t2.buvid then t1.buvid
> end
> ),
> count(
> distinct case
> when t0.buvid = t2.buvid then t0.buvid
> end
> ),
> sum(readtime),
> avg(readtime),
> sum(
> case
> when t0.buvid = t3.buvid then readtime
> end
> ),
> avg(
> case
> when t0.buvid = t3.buvid then readtime
> end
> )
> from
> manga.manga_tfc_app_ubt_d t
> join manga.manga_tfc_app_ubt_d t1 on t.buvid = t1.buvid
> and t1.log_date >= 20191220
> and t1.event_id = 'manga.manga-detail.0.0.pv'
> and to_date(t.stime) = TO_DATE(t1.stime)
> and GET_JSON_OBJECT(t1.extended_fields, '$.manga_id') = 
> GET_JSON_OBJECT(t.extended_fields, '$.manga_id')
> left join manga.manga_buvid_minlog t0 on t.buvid = t0.buvid
> and t0.log_date = 20191223
> and t0.minlog >= '2019-12-20'
> and to_date(t.stime) = TO_DATE(t0.minlog)
> left join manga.dwb_tfc_app_launch_df t2 on t.buvid = t2.buvid
> and t2.log_date >= 20191220
> and DATE_ADD(to_date(t.stime), 1) = to_date(t2.stime)
> left join tmp t3 on t1.buvid = t3.buvid
> and t3.log_date >= 20191220
> and t3.manga_id = GET_JSON_OBJECT(t.extended_fields, '$.manga_id')
> where
> t.log_date >= 20191220
> and t.event_id = 'manga.homepage-recommend.detail.0.click'
> group by
> t.log_date,
> GET_JSON_OBJECT(t.extended_fields, '$.type')
> {code}
>  !screenshot-1.png! 
> The result of hive 2.3 is ok
>  !screenshot-2.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save



[ 
https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039784#comment-17039784
 ] 

Hyukjin Kwon commented on SPARK-29908:
--

[~brkyvz], the PRs were closed. Do we still need this? can you take an action 
to the JIRA too?

> Support partitioning for DataSource V2 tables in DataFrameWriter.save
> -
>
> Key: SPARK-29908
> URL: https://issues.apache.org/jira/browse/SPARK-29908
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Currently, any data source that that upgrades to DataSource V2 loses the 
> partition transform information when using DataFrameWriter.save. The main 
> reason is the lack of an API for "creating" a table with partitioning and 
> schema information for V2 tables without a catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-29699) Different answers in nested aggregates with window functions



 [ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29699:
-
Priority: Critical  (was: Blocker)

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Critical
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-+--
>  1 | 1 |   8 |8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>|   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +++--++   
>   
> |   a|   b|sum(c)|rsum|
> +++--++
> |null|null|12|  12|
> |   1|null|10|  22|
> |   1|   1| 8|  30|
> |   1|   2| 2|  32|
> |   2|null| 2|  34|
> |   2|   2| 2|  36|
> +++--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-29699) Different answers in nested aggregates with window functions



 [ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29699:
-
Target Version/s:   (was: 3.0.0)

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Critical
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-+--
>  1 | 1 |   8 |8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>|   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +++--++   
>   
> |   a|   b|sum(c)|rsum|
> +++--++
> |null|null|12|  12|
> |   1|null|10|  22|
> |   1|   1| 8|  30|
> |   1|   2| 2|  32|
> |   2|null| 2|  34|
> |   2|   2| 2|  36|
> +++--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-29699) Different answers in nested aggregates with window functions



 [ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29699:
-
Labels:   (was: correctness)

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Blocker
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-+--
>  1 | 1 |   8 |8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>|   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +++--++   
>   
> |   a|   b|sum(c)|rsum|
> +++--++
> |null|null|12|  12|
> |   1|null|10|  22|
> |   1|   1| 8|  30|
> |   1|   2| 2|  32|
> |   2|null| 2|  34|
> |   2|   2| 2|  36|
> +++--++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions