[jira] [Updated] (SPARK-31809) Infer IsNotNull for non null intolerant child of null intolerant in join condition

2020-06-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31809:

Summary: Infer IsNotNull for non null intolerant child of null intolerant 
in join condition  (was: Infer IsNotNull for all children of NullIntolerant 
expressions)

> Infer IsNotNull for non null intolerant child of null intolerant in join 
> condition
> --
>
> Key: SPARK-31809
> URL: https://issues.apache.org/jira/browse/SPARK-31809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: default.png, infer.png
>
>
> We should infer {{IsNotNull}} for all children of {{NullIntolerant}} 
> expressions. For example:
> {code:sql}
> CREATE TABLE t1(c1 string, c2 string);
> CREATE TABLE t2(c1 string, c2 string);
> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
> {code}
> {noformat}
> == Physical Plan ==
> *(4) Project [c1#5, c2#6]
> +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
>:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
>: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, 
> c2#6], Statistics(sizeInBytes=8.0 EiB)
>+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
>  +- *(2) Filter isnotnull(c1#7)
> +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], 
> Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query 
> performance:
> {noformat}
> == Physical Plan ==
> *(5) Project [c1#23, c2#24]
> +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
>:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, 
> [id=#95]
>: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24))
>:+- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, 
> c2#24], Statistics(sizeInBytes=8.0 EiB)
>+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#25, 200), true, [id=#103]
>  +- *(3) Filter isnotnull(c1#25)
> +- Scan hive default.t2 [c1#25], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, 
> c2#26], Statistics(sizeInBytes=8.0 EiB)
> {noformat}
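> For illustration, the improved plan corresponds to writing the inferred predicate by 
> hand (a hedged sketch of the equivalent query, not the optimizer change itself):
> {code:sql}
> -- Manually adding the filter that the optimizer should infer:
> EXPLAIN SELECT t1.*
> FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2) = t2.c1
> WHERE coalesce(t1.c1, t1.c2) IS NOT NULL;
> {code}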
> Real performance test case:
>  !default.png!  !infer.png! 






[jira] [Commented] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector

2020-06-05 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126510#comment-17126510
 ] 

Gabor Somogyi commented on SPARK-31884:
---

Sorry, my question was not precise enough. It's clear that the server side 
works; I was asking about the JDBC client side.
Can you point to a JDBC client example that uses a keytab file?


> Support MongoDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31884
> URL: https://issues.apache.org/jira/browse/SPARK-31884
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126513#comment-17126513
 ] 

Apache Spark commented on SPARK-31705:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28733

> Rewrite join condition to conjunctive normal form
> -
>
> Key: SPARK-31705
> URL: https://issues.apache.org/jira/browse/SPARK-31705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Rewrite the join condition to [conjunctive normal 
> form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] so that more 
> conditions can be pushed down as filters.
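> As a minimal sketch of the conversion itself (a toy boolean-expression ADT for 
> illustration, not Spark's Catalyst classes), CNF is obtained by recursively 
> distributing OR over AND; conjuncts that reference only one side of the join can 
> then be pushed below it as filters, as the PostgreSQL plans below show:
> {code:scala}
> // Toy expression ADT, for illustration only.
> sealed trait Expr
> case class Pred(sql: String) extends Expr              // e.g. Pred("l_suppkey > 3")
> case class And(left: Expr, right: Expr) extends Expr
> case class Or(left: Expr, right: Expr) extends Expr
>
> // Convert to conjunctive normal form by distributing OR over AND.
> // Note: this can blow up exponentially; a real optimizer would bound the expansion.
> def toCNF(e: Expr): Expr = e match {
>   case And(l, r) => And(toCNF(l), toCNF(r))
>   case Or(l, r) => (toCNF(l), toCNF(r)) match {
>     case (And(a, b), c) => And(toCNF(Or(a, c)), toCNF(Or(b, c)))
>     case (a, And(b, c)) => And(toCNF(Or(a, b)), toCNF(Or(a, c)))
>     case (a, b)         => Or(a, b)
>   }
>   case leaf => leaf
> }
> {code}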
> PostgreSQL:
> {code:sql}
> CREATE TABLE lineitem (
>   l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber INT,
>   l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), l_discount DECIMAL(10,0),
>   l_tax DECIMAL(10,0), l_returnflag varchar(255), l_linestatus varchar(255),
>   l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
>   l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
>
> CREATE TABLE orders (
>   o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),
>   o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
>   o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem,
>orders
> WHERE  l_orderkey = o_orderkey
>AND ( ( l_suppkey > 3
>AND o_custkey > 13 )
>   OR ( l_suppkey > 1
>AND o_custkey > 11 ) )
>AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem
>JOIN orders
>  ON l_orderkey = o_orderkey
> AND ( ( l_suppkey > 3
> AND o_custkey > 13 )
>OR ( l_suppkey > 1
> AND o_custkey > 11 ) )
> AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*) 
> FROM   lineitem, 
>orders 
> WHERE  l_orderkey = o_orderkey 
>AND NOT ( ( l_suppkey > 3 
>AND ( l_suppkey > 2 
>   OR o_custkey > 13 ) ) 
>   OR ( l_suppkey > 1 
>AND o_custkey > 11 ) ) 
>AND l_partkey > 19;
> {code}
> {noformat}
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem,
> postgres-#orders
> postgres-# WHERE  l_orderkey = o_orderkey
> postgres-#AND ( ( l_suppkey > 3
> postgres(#AND o_custkey > 13 )
> postgres(#   OR ( l_suppkey > 1
> postgres(#AND o_custkey > 11 ) )
> postgres-#AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
> (l_suppkey > 1)))
> (9 rows)
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem
> postgres-#JOIN orders
> postgres-#  ON l_orderkey = o_orderkey
> postgres-# AND ( ( l_suppkey > 3
> postgres(# AND o_custkey > 13 )
> postgres(#OR ( l_suppkey > 1
> postgres(# AND o_custkey > 11 ) )
> postgres-# AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o

[jira] [Created] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Brandon (Jira)
Brandon created SPARK-31911:
---

 Summary: Using S3A staging committer, pending uploads are 
committed more than once and listed incorrectly in _SUCCESS data
 Key: SPARK-31911
 URL: https://issues.apache.org/jira/browse/SPARK-31911
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.4
Reporter: Brandon


First of all, thanks for the great work on the S3 committers. I was able to set up 
the directory staging committer in my environment following the docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom is that the _SUCCESS data files under 
each table will contain overlapping file names that belong to separate tables. 
From my reading of the code, that's because the filenames in _SUCCESS reflect 
which multipart uploads were completed in the commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, so 
each write produces one partition file.

When the two writes are done,
* /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
* /b/_SUCCESS contains the same two filenames.
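A minimal repro sketch of the concurrent writes described above (assuming an existing 
SparkSession and two already-built DataFrames dfA and dfB; only the paths come from the 
example):
{code:scala}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Submit both table writes to Spark in parallel from the same application.
val writeA = Future { dfA.write.orc("s3a://bucket/a") }
val writeB = Future { dfB.write.orc("s3a://bucket/b") }
Await.result(Future.sequence(Seq(writeA, writeB)), Duration.Inf)
// Then compare the file lists recorded in /a/_SUCCESS and /b/_SUCCESS.
{code}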

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for b includes the same completions.

I believe this may be caused by the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue.






[jira] [Updated] (SPARK-31912) Normalize all binary comparison expressions

2020-06-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31912:

Summary: Normalize all binary comparison expressions  (was: Normalize all 
binary comparison expression)

> Normalize all binary comparison expressions
> ---
>
> Key: SPARK-31912
> URL: https://issues.apache.org/jira/browse/SPARK-31912
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-31912) Normalize all binary comparison expression

2020-06-05 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-31912:
---

 Summary: Normalize all binary comparison expression
 Key: SPARK-31912
 URL: https://issues.apache.org/jira/browse/SPARK-31912
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.1.0
Reporter: Yuming Wang









[jira] [Updated] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Brandon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon updated SPARK-31911:

Description: 
First of all, thanks for the great work on the S3 committers. I was able to set up 
the directory staging committer in my environment following the docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom is that the _SUCCESS data files under 
each table will contain overlapping file names that belong to separate tables. 
From my reading of the code, that's because the filenames in _SUCCESS reflect 
which multipart uploads were completed in the commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, so 
each write produces one partition file.

When the two writes are done,
 * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
 * /b/_SUCCESS contains the same two filenames.

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for b includes the same completions. I haven't had a problem yet, but 
I wonder if having these extra requests would become an issue at higher scale, 
where dozens of commits with hundreds of files may be happening on the cluster.

I believe this may be caused by the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue.

  was:
First of all thanks for the great work on the S3 committers. I was able set up 
the directory staging committer in my environment following docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom is that the _SUCCESS data files under 
each table will contain overlapping file names that belong to separate tables. 
From my reading of the code, that's because the filenames in _SUCCESS reflect 
which multipart uploads were completed in the commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition so 
writes one partition file.

When the two writes are done,
* /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
* /b/_SUCCESS contains the same two filenames.

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for b includes the same completions.

I believe this may be caused from the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue.


> Using S3A staging committer, pending uploads are committed more than once and 
> listed incorrectly in _SUCCESS data
> -
>
> Key: SPA

[jira] [Updated] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Brandon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon updated SPARK-31911:

Description: 
First of all, thanks for the great work on the S3 committers. I was able to set up 
the directory staging committer in my environment following the docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom is that the _SUCCESS data files under 
each table will contain overlapping file names that belong to separate tables. 
From my reading of the code, that's because the filenames in _SUCCESS reflect 
which multipart uploads were completed in the commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, so 
each write produces one partition file.

When the two writes are done,
 * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
 * /b/_SUCCESS contains the same two filenames.

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for table b includes the same completions. I haven't had a problem 
yet, but I wonder if having these extra requests would become an issue at 
higher scale, where dozens of commits with hundreds of files may be happening 
in the application.

I believe this may be caused by the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue.

  was:
First of all thanks for the great work on the S3 committers. I was able set up 
the directory staging committer in my environment following docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom is that the _SUCCESS data files under 
each table will contain overlapping file names that belong to separate tables. 
From my reading of the code, that's because the filenames in _SUCCESS reflect 
which multipart uploads were completed in the commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition so 
writes one partition file.

When the two writes are done,
 * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
 * /b/_SUCCESS contains the same two filenames.

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for b includes the same completions. I haven't had a problem yet, but 
I wonder if having these extra requests would become an issue at higher scale, 
where dozens of commits with hundreds of files may be happening on the cluster.

I believe this may be caused from the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue.


> Using S3A staging committer, pending uploads are commi

[jira] [Updated] (SPARK-31912) Normalize all binary comparison expressions

2020-06-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31912:

Description: 
This test will fail:
{code:scala}
  test("SPARK-31912 Normalize all binary comparison expressions") {
val original = testRelation
  .where('a === 'b && Literal(13) >= 'b).as("x")
val optimized = testRelation
  .where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 
13).as("x")
comparePlans(Optimize.execute(original.analyze), optimized.analyze)
  }
{code}
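A hypothetical sketch of the kind of normalization the test expects (not the actual 
SPARK-31912 patch): flip comparisons that have a literal on the left, so that constraint 
inference sees attribute-on-the-left forms and can propagate 'b <= 13 to 'a via 'a === 'b.
{code:scala}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule, illustration only: rewrite "13 >= 'b" as "'b <= 13", etc.
object NormalizeBinaryComparisons extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case GreaterThan(l: Literal, r)        => LessThan(r, l)
    case GreaterThanOrEqual(l: Literal, r) => LessThanOrEqual(r, l)
    case LessThan(l: Literal, r)           => GreaterThan(r, l)
    case LessThanOrEqual(l: Literal, r)    => GreaterThanOrEqual(r, l)
  }
}
{code}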


> Normalize all binary comparison expressions
> ---
>
> Key: SPARK-31912
> URL: https://issues.apache.org/jira/browse/SPARK-31912
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This test will fail:
> {code:scala}
>   test("SPARK-31912 Normalize all binary comparison expressions") {
> val original = testRelation
>   .where('a === 'b && Literal(13) >= 'b).as("x")
> val optimized = testRelation
>   .where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 
> 13).as("x")
> comparePlans(Optimize.execute(original.analyze), optimized.analyze)
>   }
> {code}






[jira] [Assigned] (SPARK-31879) First day of week changed for non-MONDAY_START Locales

2020-06-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31879:
---

Assignee: Wenchen Fan

> First day of week changed for non-MONDAY_START Locales
> --
>
> Key: SPARK-31879
> URL: https://issues.apache.org/jira/browse/SPARK-31879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Wenchen Fan
>Priority: Blocker
>
> h1. cases
> {code:sql}
> spark-sql> select to_timestamp('2020-1-1', '-w-u');
> 2019-12-29 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('2020-1-1', '-w-u');
> 2019-12-30 00:00:00
> {code}
> h1. reasons
> These week-based fields need a Locale to express their semantics; the first day 
> of the week varies from country to country.
> From the Java doc of WeekFields
> {code:java}
> /**
>  * Gets the first day-of-week.
>  * 
>  * The first day-of-week varies by culture.
>  * For example, the US uses Sunday, while France and the ISO-8601 
> standard use Monday.
>  * This method returns the first day using the standard {@code DayOfWeek} 
> enum.
>  *
>  * @return the first day-of-week, not null
>  */
> public DayOfWeek getFirstDayOfWeek() {
> return firstDayOfWeek;
> }
> {code}
> But for the SimpleDateFormat, the day-of-week is not localized
> ```
> u Day number of week (1 = Monday, ..., 7 = Sunday)Number  1
> ```
> Currently, the default locale we use is the US, so the result moved a day 
> backward.
> For other countries, please refer to [First Day of the Week in Different 
> Countries|http://chartsbin.com/view/41671]
> h1. solution options
> 1. Use new Locale("en", "GB") as default locale.
> 2. For JDK 10 and onwards, we can set the locale Unicode extension 'fw' to 
> 'mon', but this does not work for lower JDKs.
> 3. Forbid 'u', give the user a proper exception, and enable and document 'e'/'c'. 
> Currently, 'u' is internally substituted by 'e', but they are not equivalent.
> Options 1 and 2 solve this for the default locale, but not for the functions 
> that support a custom locale.
> cc [~cloud_fan] [~dongjoon] [~maropu]
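> A small illustration of the locale dependence with plain java.time (an assumed repro, 
> not Spark code):
> {code:scala}
> import java.time.temporal.WeekFields
> import java.util.Locale
>
> // The first day-of-week used to resolve week-based fields depends on the locale.
> WeekFields.of(Locale.US).getFirstDayOfWeek               // SUNDAY
> WeekFields.of(new Locale("en", "GB")).getFirstDayOfWeek  // MONDAY
> WeekFields.ISO.getFirstDayOfWeek                         // MONDAY
> {code}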






[jira] [Updated] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Brandon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon updated SPARK-31911:

Description: 
First of all, thanks for the great work on the S3 committers. I was able to set up 
the directory staging committer in my environment following the docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom is that the _SUCCESS data files under 
each table will contain overlapping file names that belong to separate tables. 
From my reading of the code, that's because the filenames in _SUCCESS reflect 
which multipart uploads were completed in the commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, so 
each write produces one partition file.

When the two writes are done,
 * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
 * /b/_SUCCESS contains the same two filenames.

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for table b includes the same completions. I haven't had a problem 
yet, but I wonder if having these extra requests would become an issue at 
higher scale, where dozens of commits with hundreds of files may be happening 
concurrently in the application.

I believe this may be caused by the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue.

  was:
First of all thanks for the great work on the S3 committers. I was able set up 
the directory staging committer in my environment following docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom is that the _SUCCESS data files under 
each table will contain overlapping file names that belong to separate tables. 
From my reading of the code, that's because the filenames in _SUCCESS reflect 
which multipart uploads were completed in the commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition so 
writes one partition file.

When the two writes are done,
 * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
 * /b/_SUCCESS contains the same two filenames.

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for table b includes the same completions. I haven't had a problem 
yet, but I wonder if having these extra requests would become an issue at 
higher scale, where dozens of commits with hundreds of files may be happening 
in the application.

I believe this may be caused from the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue.


> Using S3A staging committer, p

[jira] [Updated] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Brandon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon updated SPARK-31911:

Description: 
First of all, thanks for the great work on the S3 committers. I was able to set up 
the directory staging committer in my environment following the docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom, and how I discovered this, is that the 
_SUCCESS data files under each table will contain overlapping file names that 
belong to separate tables. From my reading of the code, that's because the 
filenames in _SUCCESS reflect which multipart uploads were completed in the 
commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, so 
each write produces one partition file.

When the two writes are done,
 * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
 * /b/_SUCCESS contains the same two filenames.

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for table b includes the same completions. I haven't had a problem 
yet, but I wonder if having these extra requests would become an issue at 
higher scale, where dozens of commits with hundreds of files may be happening 
concurrently in the application.

I believe this may be caused by the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue.

  was:
First of all thanks for the great work on the S3 committers. I was able set up 
the directory staging committer in my environment following docs at 
[https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
 and tested one of my Spark applications using it. The Spark version is 2.4.4 
with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
Spark in parallel.

I think I'm seeing a bug where the staging committer will complete pending 
uploads more than once. The main symptom is that the _SUCCESS data files under 
each table will contain overlapping file names that belong to separate tables. 
From my reading of the code, that's because the filenames in _SUCCESS reflect 
which multipart uploads were completed in the commit for that particular table.

An example:

Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition so 
writes one partition file.

When the two writes are done,
 * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
 * /b/_SUCCESS contains the same two filenames.

Setting S3A logs to debug, I see the commitJob operation belonging to table a 
includes completing the uploads of /a/part- and /b/part-. Then again, 
commitJob for table b includes the same completions. I haven't had a problem 
yet, but I wonder if having these extra requests would become an issue at 
higher scale, where dozens of commits with hundreds of files may be happening 
concurrently in the application.

I believe this may be caused from the way the pendingSet files are stored in 
the staging directory. They are stored under one directory named by the jobID, 
in the Hadoop code. However, for all write jobs executed by the Spark 
application, the jobID passed to Hadoop is the same - the application ID. Maybe 
the staging commit algorithm was built on the assumption that each instance of 
the algorithm would use a unique random jobID.

[~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
(thank you), I would be interested to know your thoughts on this. Also it's my 
first time opening a bug here, so let me know if there's anything else I can do 
to help report the issue

[jira] [Assigned] (SPARK-31912) Normalize all binary comparison expressions

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31912:


Assignee: Apache Spark

> Normalize all binary comparison expressions
> ---
>
> Key: SPARK-31912
> URL: https://issues.apache.org/jira/browse/SPARK-31912
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> This test will fail:
> {code:scala}
>   test("SPARK-31912 Normalize all binary comparison expressions") {
> val original = testRelation
>   .where('a === 'b && Literal(13) >= 'b).as("x")
> val optimized = testRelation
>   .where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 
> 13).as("x")
> comparePlans(Optimize.execute(original.analyze), optimized.analyze)
>   }
> {code}






[jira] [Commented] (SPARK-31912) Normalize all binary comparison expressions

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126553#comment-17126553
 ] 

Apache Spark commented on SPARK-31912:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/28734

> Normalize all binary comparison expressions
> ---
>
> Key: SPARK-31912
> URL: https://issues.apache.org/jira/browse/SPARK-31912
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This test will fail:
> {code:scala}
>   test("SPARK-31912 Normalize all binary comparison expressions") {
> val original = testRelation
>   .where('a === 'b && Literal(13) >= 'b).as("x")
> val optimized = testRelation
>   .where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 
> 13).as("x")
> comparePlans(Optimize.execute(original.analyze), optimized.analyze)
>   }
> {code}






[jira] [Assigned] (SPARK-31912) Normalize all binary comparison expressions

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31912:


Assignee: (was: Apache Spark)

> Normalize all binary comparison expressions
> ---
>
> Key: SPARK-31912
> URL: https://issues.apache.org/jira/browse/SPARK-31912
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This test will fail:
> {code:scala}
>   test("SPARK-31912 Normalize all binary comparison expressions") {
> val original = testRelation
>   .where('a === 'b && Literal(13) >= 'b).as("x")
> val optimized = testRelation
>   .where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 
> 13).as("x")
> comparePlans(Optimize.execute(original.analyze), optimized.analyze)
>   }
> {code}






[jira] [Commented] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Brandon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126556#comment-17126556
 ] 

Brandon commented on SPARK-31911:
-

Interesting: it looks like the staging committer supports a configuration 
parameter `spark.sql.sources.writeJobUUID` that takes precedence over 
`spark.app.id` for determining the name of the pendingSet directory. Notably, 
`spark.sql.sources.writeJobUUID` is not present anywhere in the Spark codebase. 
Should this be set to a random UUID for each write job in Spark?

> Using S3A staging committer, pending uploads are committed more than once and 
> listed incorrectly in _SUCCESS data
> -
>
> Key: SPARK-31911
> URL: https://issues.apache.org/jira/browse/SPARK-31911
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Brandon
>Priority: Major
>
> First of all, thanks for the great work on the S3 committers. I was able to set 
> up the directory staging committer in my environment following the docs at 
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
>  and tested one of my Spark applications using it. The Spark version is 2.4.4 
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
> multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
> Spark in parallel.
> I think I'm seeing a bug where the staging committer will complete pending 
> uploads more than once. The main symptom, and how I discovered this, is that the 
> _SUCCESS data files under each table will contain overlapping file names that 
> belong to separate tables. From my reading of the code, that's because the 
> filenames in _SUCCESS reflect which multipart uploads were completed in the 
> commit for that particular table.
> An example:
> Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, 
> so each write produces one partition file.
> When the two writes are done,
>  * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
>  * /b/_SUCCESS contains the same two filenames.
> Setting S3A logs to debug, I see the commitJob operation belonging to table a 
> includes completing the uploads of /a/part- and /b/part-. Then again, 
> commitJob for table b includes the same completions. I haven't had a problem 
> yet, but I wonder if having these extra requests would become an issue at 
> higher scale, where dozens of commits with hundreds of files may be happening 
> concurrently in the application.
> I believe this may be caused by the way the pendingSet files are stored in 
> the staging directory. They are stored under one directory named by the 
> jobID, in the Hadoop code. However, for all write jobs executed by the Spark 
> application, the jobID passed to Hadoop is the same - the application ID. 
> Maybe the staging commit algorithm was built on the assumption that each 
> instance of the algorithm would use a unique random jobID.
> [~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
> (thank you), I would be interested to know your thoughts on this. Also it's 
> my first time opening a bug here, so let me know if there's anything else I 
> can do to help report the issue.






[jira] [Comment Edited] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Brandon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126556#comment-17126556
 ] 

Brandon edited comment on SPARK-31911 at 6/5/20, 8:53 AM:
--

Interesting: it looks like the staging committer supports a configuration 
parameter `spark.sql.sources.writeJobUUID` that takes precedence over 
`spark.app.id` for determining the name of the pendingSet directory. Notably, 
`spark.sql.sources.writeJobUUID` is not present anywhere in the Spark codebase. 
Should this be set to a random UUID for each write job in Spark?

https://github.com/apache/hadoop/blob/a6df05bf5e24d04852a35b096c44e79f843f4776/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/StagingCommitter.java#L186-L208


was (Author: brandonvin):
Interesting, it looks like the staging committer supports a configuration 
parameter `spark.sql.sources.writeJobUUID` that takes precedence over the 
`spark.app.id` for determining the name of the pendingSet directory. This is 
very interesting because `spark.sql.sources.writeJobUUID` is not present 
anywhere in the Spark codebase. Should this be set to a random UUID for each 
write job in Spark?

> Using S3A staging committer, pending uploads are committed more than once and 
> listed incorrectly in _SUCCESS data
> -
>
> Key: SPARK-31911
> URL: https://issues.apache.org/jira/browse/SPARK-31911
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Brandon
>Priority: Major
>
> First of all, thanks for the great work on the S3 committers. I was able to set 
> up the directory staging committer in my environment following the docs at 
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
>  and tested one of my Spark applications using it. The Spark version is 2.4.4 
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
> multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
> Spark in parallel.
> I think I'm seeing a bug where the staging committer will complete pending 
> uploads more than once. The main symptom, and how I discovered this, is that the 
> _SUCCESS data files under each table will contain overlapping file names that 
> belong to separate tables. From my reading of the code, that's because the 
> filenames in _SUCCESS reflect which multipart uploads were completed in the 
> commit for that particular table.
> An example:
> Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition, 
> so each write produces one partition file.
> When the two writes are done,
>  * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
>  * /b/_SUCCESS contains the same two filenames.
> Setting S3A logs to debug, I see the commitJob operation belonging to table a 
> includes completing the uploads of /a/part- and /b/part-. Then again, 
> commitJob for table b includes the same completions. I haven't had a problem 
> yet, but I wonder if having these extra requests would become an issue at 
> higher scale, where dozens of commits with hundreds of files may be happening 
> concurrently in the application.
> I believe this may be caused by the way the pendingSet files are stored in 
> the staging directory. They are stored under one directory named by the 
> jobID, in the Hadoop code. However, for all write jobs executed by the Spark 
> application, the jobID passed to Hadoop is the same - the application ID. 
> Maybe the staging commit algorithm was built on the assumption that each 
> instance of the algorithm would use a unique random jobID.
> [~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
> (thank you), I would be interested to know your thoughts on this. Also it's 
> my first time opening a bug here, so let me know if there's anything else I 
> can do to help report the issue.






[jira] [Commented] (SPARK-31859) Thriftserver with spark.sql.datetime.java8API.enabled=true

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126565#comment-17126565
 ] 

Apache Spark commented on SPARK-31859:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/28735

> Thriftserver with spark.sql.datetime.java8API.enabled=true
> --
>
> Key: SPARK-31859
> URL: https://issues.apache.org/jira/browse/SPARK-31859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
>   test("spark.sql.datetime.java8API.enabled=true") {
> withJdbcStatement() { st =>
>   st.execute("set spark.sql.datetime.java8API.enabled=true")
>   val rs = st.executeQuery("select timestamp '2020-05-28 00:00:00'")
>   rs.next()
>   // scalastyle:off
>   println(rs.getObject(1))
> }
>   }
> {code}
> fails with 
> {code}
> HiveThriftBinaryServerSuite:
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
> at java.sql.Timestamp.valueOf(Timestamp.java:204)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:444)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:424)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.getObject(HiveBaseResultSet.java:464
> {code}
> It seems it might be needed in HiveResult.toHiveString?
> cc [~maxgekk]
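> An assumed illustration of the failure mode (standalone, not the actual Spark code 
> path): with the Java 8 API enabled the value comes back in java.time form, whose 
> default string form is not what java.sql.Timestamp.valueOf expects:
> {code:scala}
> import java.sql.Timestamp
> import java.time.Instant
>
> Timestamp.valueOf("2020-05-28 00:00:00")                           // parses fine
> Timestamp.valueOf(Instant.parse("2020-05-28T00:00:00Z").toString)
> // java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
> {code}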






[jira] [Commented] (SPARK-31859) Thriftserver with spark.sql.datetime.java8API.enabled=true

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126566#comment-17126566
 ] 

Apache Spark commented on SPARK-31859:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/28735

> Thriftserver with spark.sql.datetime.java8API.enabled=true
> --
>
> Key: SPARK-31859
> URL: https://issues.apache.org/jira/browse/SPARK-31859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
>   test("spark.sql.datetime.java8API.enabled=true") {
> withJdbcStatement() { st =>
>   st.execute("set spark.sql.datetime.java8API.enabled=true")
>   val rs = st.executeQuery("select timestamp '2020-05-28 00:00:00'")
>   rs.next()
>   // scalastyle:off
>   println(rs.getObject(1))
> }
>   }
> {code}
> fails with 
> {code}
> HiveThriftBinaryServerSuite:
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
> at java.sql.Timestamp.valueOf(Timestamp.java:204)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:444)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:424)
> at 
> org.apache.hive.jdbc.HiveBaseResultSet.getObject(HiveBaseResultSet.java:464
> {code}
> It seems it might be needed in HiveResult.toHiveString?
> cc [~maxgekk]






[jira] [Commented] (SPARK-31861) Thriftserver collecting timestamp not using spark.sql.session.timeZone

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126569#comment-17126569
 ] 

Apache Spark commented on SPARK-31861:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/28735

> Thriftserver collecting timestamp not using spark.sql.session.timeZone
> --
>
> Key: SPARK-31861
> URL: https://issues.apache.org/jira/browse/SPARK-31861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> If JDBC client is in TimeZone PST, and sets spark.sql.session.timeZone to 
> PST, and sends a query "SELECT timestamp '2020-05-20 12:00:00'", and the JVM 
> timezone of the Spark cluster is e.g. CET, then
> - the timestamp literal in the query is interpreted as 12:00:00 PST, i.e. 
> 21:00:00 CET
> - but currently when it's returned, the timestamps are collected from the 
> query with a collect() in 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L299,
>  and then in the end Timestamps are turned into strings using a t.toString() 
> in 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/ColumnValue.java#L138
>  This will use the Spark cluster TimeZone. That results in "21:00:00" 
> returned to the JDBC application.
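> For illustration (an assumed standalone repro, not the thriftserver code path), 
> java.sql.Timestamp.toString renders in the JVM default time zone, so the same instant 
> prints differently on a CET cluster than on a PST client:
> {code:scala}
> import java.sql.Timestamp
> import java.util.TimeZone
>
> val millis = 1589995800000L  // one fixed instant; the exact value is illustrative only
> TimeZone.setDefault(TimeZone.getTimeZone("Europe/Paris"))
> println(new Timestamp(millis))   // rendered with the CET/CEST offset
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> println(new Timestamp(millis))   // the same instant, rendered with the PST/PDT offset
> {code}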






[jira] [Commented] (SPARK-31861) Thriftserver collecting timestamp not using spark.sql.session.timeZone

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126568#comment-17126568
 ] 

Apache Spark commented on SPARK-31861:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/28735

> Thriftserver collecting timestamp not using spark.sql.session.timeZone
> --
>
> Key: SPARK-31861
> URL: https://issues.apache.org/jira/browse/SPARK-31861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> If JDBC client is in TimeZone PST, and sets spark.sql.session.timeZone to 
> PST, and sends a query "SELECT timestamp '2020-05-20 12:00:00'", and the JVM 
> timezone of the Spark cluster is e.g. CET, then
> - the timestamp literal in the query is interpreted as 12:00:00 PST, i.e. 
> 21:00:00 CET
> - but currently when it's returned, the timestamps are collected from the 
> query with a collect() in 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L299,
>  and then in the end Timestamps are turned into strings using a t.toString() 
> in 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/ColumnValue.java#L138
>  This will use the Spark cluster TimeZone. That results in "21:00:00" 
> returned to the JDBC application.






[jira] [Commented] (SPARK-31861) Thriftserver collecting timestamp not using spark.sql.session.timeZone

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126567#comment-17126567
 ] 

Apache Spark commented on SPARK-31861:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/28735

> Thriftserver collecting timestamp not using spark.sql.session.timeZone
> --
>
> Key: SPARK-31861
> URL: https://issues.apache.org/jira/browse/SPARK-31861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> If the JDBC client is in the PST time zone, sets spark.sql.session.timeZone to
> PST, and sends the query "SELECT timestamp '2020-05-20 12:00:00'", while the JVM
> time zone of the Spark cluster is e.g. CET, then
> - the timestamp literal in the query is interpreted as 12:00:00 PST, i.e.
> 21:00:00 CET
> - but currently, when it is returned, the timestamps are collected from the
> query with a collect() in
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L299,
> and then the Timestamps are turned into strings with t.toString() in
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/ColumnValue.java#L138,
> which uses the Spark cluster's JVM time zone. That results in "21:00:00"
> being returned to the JDBC application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31879) First day of week changed for non-MONDAY_START Lacales

2020-06-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-31879:
-
  Assignee: Kent Yao  (was: Wenchen Fan)

> First day of week changed for non-MONDAY_START Lacales
> --
>
> Key: SPARK-31879
> URL: https://issues.apache.org/jira/browse/SPARK-31879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Blocker
>
> h1. cases
> {code:sql}
> spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
> 2019-12-29 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
> 2019-12-30 00:00:00
> {code}
> h1. reasons
> These week-based fields need a Locale to express their semantics, because the
> first day of the week varies from country to country.
> From the Java doc of WeekFields
> {code:java}
> /**
>  * Gets the first day-of-week.
>  * 
>  * The first day-of-week varies by culture.
>  * For example, the US uses Sunday, while France and the ISO-8601 
> standard use Monday.
>  * This method returns the first day using the standard {@code DayOfWeek} 
> enum.
>  *
>  * @return the first day-of-week, not null
>  */
> public DayOfWeek getFirstDayOfWeek() {
> return firstDayOfWeek;
> }
> {code}
> But for SimpleDateFormat, the day-of-week is not localized:
> ```
> u    Day number of week (1 = Monday, ..., 7 = Sunday)    Number    1
> ```
> Currently, the default locale we use is the US one, so the result moves a day
> backward.
> For other countries, please refer to [First Day of the Week in Different 
> Countries|http://chartsbin.com/view/41671]
> h1. solution options
> 1. Use new Locale("en", "GB") as the default locale.
> 2. For JDK 10 and onwards, we can set the locale Unicode extension 'fw' to 'mon',
> but this does not work for lower JDKs.
> 3. Forbid 'u', give users a proper exception, and enable and document 'e'/'c'.
> Currently, 'u' is internally substituted by 'e', but they are not equivalent.
> Options 1 and 2 can solve this for the default locale, but not for the functions
> that support a custom locale.
> cc [~cloud_fan] [~dongjoon] [~maropu]
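
For illustration, a small standalone sketch (not Spark code) of how the first day-of-week and the localized day-of-week number depend on the Locale:

{code:scala}
import java.time.DayOfWeek
import java.time.temporal.WeekFields
import java.util.Locale

object FirstDayOfWeekDemo {
  def main(args: Array[String]): Unit = {
    // The first day-of-week varies by culture: US weeks start on Sunday, GB/ISO weeks on Monday.
    println(WeekFields.of(Locale.US).getFirstDayOfWeek)              // SUNDAY
    println(WeekFields.of(new Locale("en", "GB")).getFirstDayOfWeek) // MONDAY
    println(WeekFields.ISO.getFirstDayOfWeek)                        // MONDAY

    // The localized day-of-week number therefore differs for the same day.
    val thursday = DayOfWeek.THURSDAY
    println(thursday.get(WeekFields.of(Locale.US).dayOfWeek())) // 5 (Sunday = 1)
    println(thursday.get(WeekFields.ISO.dayOfWeek()))           // 4 (Monday = 1)
  }
}
{code}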



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31879) First day of week changed for non-MONDAY_START Lacales

2020-06-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31879.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> First day of week changed for non-MONDAY_START Lacales
> --
>
> Key: SPARK-31879
> URL: https://issues.apache.org/jira/browse/SPARK-31879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 3.0.0
>
>
> h1. cases
> {code:sql}
> spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
> 2019-12-29 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
> 2019-12-30 00:00:00
> {code}
> h1. reasons
> These week-based fields need a Locale to express their semantics, because the
> first day of the week varies from country to country.
> From the Java doc of WeekFields
> {code:java}
> /**
>  * Gets the first day-of-week.
>  * 
>  * The first day-of-week varies by culture.
>  * For example, the US uses Sunday, while France and the ISO-8601 
> standard use Monday.
>  * This method returns the first day using the standard {@code DayOfWeek} 
> enum.
>  *
>  * @return the first day-of-week, not null
>  */
> public DayOfWeek getFirstDayOfWeek() {
> return firstDayOfWeek;
> }
> {code}
> But for SimpleDateFormat, the day-of-week is not localized:
> ```
> u    Day number of week (1 = Monday, ..., 7 = Sunday)    Number    1
> ```
> Currently, the default locale we use is the US one, so the result moves a day
> backward.
> For other countries, please refer to [First Day of the Week in Different 
> Countries|http://chartsbin.com/view/41671]
> h1. solution options
> 1. Use new Locale("en", "GB") as the default locale.
> 2. For JDK 10 and onwards, we can set the locale Unicode extension 'fw' to 'mon',
> but this does not work for lower JDKs.
> 3. Forbid 'u', give users a proper exception, and enable and document 'e'/'c'.
> Currently, 'u' is internally substituted by 'e', but they are not equivalent.
> Options 1 and 2 can solve this for the default locale, but not for the functions
> that support a custom locale.
> cc [~cloud_fan] [~dongjoon] [~maropu]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126606#comment-17126606
 ] 

Steve Loughran commented on SPARK-31911:


thanks for this. Created HADOOP-17066

I'd like to see those debug logs; if you can, could you attach them to that JIRA, or
if not, email me at stevel at apache.org.

At a guess, more than one job is
- either using the same staging dir, so uploading twice
- or the directory under user.home we use for propagating those .pendingset
files is the same

Either way, it's a serious bug. Will look at it ASAP.

> Using S3A staging committer, pending uploads are committed more than once and 
> listed incorrectly in _SUCCESS data
> -
>
> Key: SPARK-31911
> URL: https://issues.apache.org/jira/browse/SPARK-31911
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Brandon
>Priority: Major
>
> First of all, thanks for the great work on the S3 committers. I was able to set
> up the directory staging committer in my environment following the docs at
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
>  and tested one of my Spark applications using it. The Spark version is 2.4.4 
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
> multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
> Spark in parallel.
> I think I'm seeing a bug where the staging committer will complete pending 
> uploads more than once. The main symptom, and how I discovered this, is that the
> _SUCCESS data files under each table will contain overlapping file names that 
> belong to separate tables. From my reading of the code, that's because the 
> filenames in _SUCCESS reflect which multipart uploads were completed in the 
> commit for that particular table.
> An example:
> Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition 
> so it writes one partition file.
> When the two writes are done,
>  * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
>  * /b/_SUCCESS contains the same two filenames.
> Setting S3A logs to debug, I see the commitJob operation belonging to table a 
> includes completing the uploads of /a/part- and /b/part-. Then again, 
> commitJob for table b includes the same completions. I haven't had a problem 
> yet, but I wonder if having these extra requests would become an issue at 
> higher scale, where dozens of commits with hundreds of files may be happening 
> concurrently in the application.
> I believe this may be caused by the way the pendingSet files are stored in
> the staging directory. They are stored under one directory named by the 
> jobID, in the Hadoop code. However, for all write jobs executed by the Spark 
> application, the jobID passed to Hadoop is the same - the application ID. 
> Maybe the staging commit algorithm was built on the assumption that each 
> instance of the algorithm would use a unique random jobID.
> [~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
> (thank you), I would be interested to know your thoughts on this. Also it's 
> my first time opening a bug here, so let me know if there's anything else I 
> can do to help report the issue.
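
To make the suspected collision concrete, a small illustrative sketch (the path layout and helper are hypothetical, loosely modeled on the staging committer's scheme, not the actual Hadoop API):

{code:scala}
object StagingDirCollision {
  // Hypothetical helper: where a job stages its .pendingset files before commitJob.
  def pendingSetDir(userHome: String, jobId: String): String =
    s"$userHome/tmp/staging/$jobId/pendingset"

  def main(args: Array[String]): Unit = {
    // Both concurrent write jobs in the same application reuse the application ID as the jobId,
    // so they stage into the same directory and commitJob for either table sees both pendingsets.
    val appId = "application_1591312345678_0001"
    val dirForTableA = pendingSetDir("/user/etl", appId)
    val dirForTableB = pendingSetDir("/user/etl", appId)
    assert(dirForTableA == dirForTableB)

    // A unique per-job suffix in the path would keep the two commits apart.
    val unique = pendingSetDir("/user/etl", s"$appId-${java.util.UUID.randomUUID()}")
    assert(unique != dirForTableA)
  }
}
{code}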



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126608#comment-17126608
 ] 

Steve Loughran commented on SPARK-31911:


BTW, given we complete the multipart uploads only once, I don't think we could 
actually write the files twice.

But: they could be committed by the wrong job, or aborted by the wrong job. And 
I'm surprised the second job commit didn't actually fail

> Using S3A staging committer, pending uploads are committed more than once and 
> listed incorrectly in _SUCCESS data
> -
>
> Key: SPARK-31911
> URL: https://issues.apache.org/jira/browse/SPARK-31911
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Brandon
>Priority: Major
>
> First of all thanks for the great work on the S3 committers. I was able set 
> up the directory staging committer in my environment following docs at 
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
>  and tested one of my Spark applications using it. The Spark version is 2.4.4 
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
> multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
> Spark in parallel.
> I think I'm seeing a bug where the staging committer will complete pending 
> uploads more than once. The main symptom how I discovered this is that the 
> _SUCCESS data files under each table will contain overlapping file names that 
> belong to separate tables. From my reading of the code, that's because the 
> filenames in _SUCCESS reflect which multipart uploads were completed in the 
> commit for that particular table.
> An example:
> Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition 
> so writes one partition file.
> When the two writes are done,
>  * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
>  * /b/_SUCCESS contains the same two filenames.
> Setting S3A logs to debug, I see the commitJob operation belonging to table a 
> includes completing the uploads of /a/part- and /b/part-. Then again, 
> commitJob for table b includes the same completions. I haven't had a problem 
> yet, but I wonder if having these extra requests would become an issue at 
> higher scale, where dozens of commits with hundreds of files may be happening 
> concurrently in the application.
> I believe this may be caused from the way the pendingSet files are stored in 
> the staging directory. They are stored under one directory named by the 
> jobID, in the Hadoop code. However, for all write jobs executed by the Spark 
> application, the jobID passed to Hadoop is the same - the application ID. 
> Maybe the staging commit algorithm was built on the assumption that each 
> instance of the algorithm would use a unique random jobID.
> [~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
> (thank you), I would be interested to know your thoughts on this. Also it's 
> my first time opening a bug here, so let me know if there's anything else I 
> can do to help report the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126609#comment-17126609
 ] 

Steve Loughran commented on SPARK-31911:


FYI [~mackrorysd]

> Using S3A staging committer, pending uploads are committed more than once and 
> listed incorrectly in _SUCCESS data
> -
>
> Key: SPARK-31911
> URL: https://issues.apache.org/jira/browse/SPARK-31911
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Brandon
>Priority: Major
>
> First of all thanks for the great work on the S3 committers. I was able set 
> up the directory staging committer in my environment following docs at 
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
>  and tested one of my Spark applications using it. The Spark version is 2.4.4 
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
> multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
> Spark in parallel.
> I think I'm seeing a bug where the staging committer will complete pending 
> uploads more than once. The main symptom how I discovered this is that the 
> _SUCCESS data files under each table will contain overlapping file names that 
> belong to separate tables. From my reading of the code, that's because the 
> filenames in _SUCCESS reflect which multipart uploads were completed in the 
> commit for that particular table.
> An example:
> Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition 
> so writes one partition file.
> When the two writes are done,
>  * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
>  * /b/_SUCCESS contains the same two filenames.
> Setting S3A logs to debug, I see the commitJob operation belonging to table a 
> includes completing the uploads of /a/part- and /b/part-. Then again, 
> commitJob for table b includes the same completions. I haven't had a problem 
> yet, but I wonder if having these extra requests would become an issue at 
> higher scale, where dozens of commits with hundreds of files may be happening 
> concurrently in the application.
> I believe this may be caused from the way the pendingSet files are stored in 
> the staging directory. They are stored under one directory named by the 
> jobID, in the Hadoop code. However, for all write jobs executed by the Spark 
> application, the jobID passed to Hadoop is the same - the application ID. 
> Maybe the staging commit algorithm was built on the assumption that each 
> instance of the algorithm would use a unique random jobID.
> [~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
> (thank you), I would be interested to know your thoughts on this. Also it's 
> my first time opening a bug here, so let me know if there's anything else I 
> can do to help report the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31817) Pass-through of Kerberos credentials from Spark SQL to a jdbc source

2020-06-05 Thread Luis Lozano Coira (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126622#comment-17126622
 ] 

Luis Lozano Coira commented on SPARK-31817:
---

I am not aware of the implementation details of this Microsoft gateway. Maybe 
this documentation will help:

https://docs.microsoft.com/en-us/power-bi/connect-data/service-gateway-sso-overview
https://docs.microsoft.com/en-us/power-bi/connect-data/service-gateway-sso-kerberos

> Pass-through of Kerberos credentials from Spark SQL to a jdbc source
> 
>
> Key: SPARK-31817
> URL: https://issues.apache.org/jira/browse/SPARK-31817
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Luis Lozano Coira
>Priority: Major
>
> I am connecting to Spark SQL through the Thrift JDBC/ODBC server using 
> kerberos. From Spark SQL I have connected to a JDBC source using basic 
> authentication but I am interested in doing a pass-through of kerberos 
> credentials to this JDBC source. 
> Would it be possible to do something like that? If not possible, could you 
> consider adding this functionality?
> Anyway I would like to start testing this pass-through and try to develop an 
> approach by myself. How could this functionality be added? Could you give me 
> any indication to start this development?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31913) StackOverflowError in FileScanRDD

2020-06-05 Thread Genmao Yu (Jira)
Genmao Yu created SPARK-31913:
-

 Summary: StackOverflowError in FileScanRDD
 Key: SPARK-31913
 URL: https://issues.apache.org/jira/browse/SPARK-31913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: Genmao Yu


Reading from FileScanRDD may fail with a StackOverflowError in my environment:
- There are a large number of empty files in the table partition.
- `spark.sql.files.maxPartitionBytes` is set to a large value: 1024MB

A quick workaround is to set `spark.sql.files.maxPartitionBytes` to a small
value, like the default 128MB.

A better way is to resolve the recursive calls in FileScanRDD (a sketch follows
the stack trace below).

{code}
java.lang.StackOverflowError
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.getSubject(Subject.java:297)
at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:648)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:38)
at 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:640)
at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:148)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:143)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:326)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
{code}
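
A self-contained sketch of the fix direction mentioned above: replace the hasNext/nextIterator recursion with a loop that skips empty files. The names only loosely mirror FileScanRDD; this is not the actual Spark patch.

{code:scala}
class SkippingIterator[T](files: Iterator[Iterator[T]]) extends Iterator[T] {
  private var current: Iterator[T] = Iterator.empty

  override def hasNext: Boolean = {
    // Loop, instead of recursing, until a non-empty file iterator is found or the partition ends.
    while (!current.hasNext && files.hasNext) {
      current = files.next()
    }
    current.hasNext
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("end of partition")
    current.next()
  }
}

object SkipEmptyFilesDemo {
  def main(args: Array[String]): Unit = {
    // 100000 empty "files" followed by one non-empty file: no deep recursion, no StackOverflowError.
    val partition = Iterator.fill(100000)(Iterator.empty[Int]) ++ Iterator(Iterator(1, 2, 3))
    println(new SkippingIterator(partition).toList) // List(1, 2, 3)
  }
}
{code}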



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31867) Fix silent data change for datetime formatting

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126652#comment-17126652
 ] 

Apache Spark commented on SPARK-31867:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28736

> Fix silent data change for datetime formatting 
> ---
>
> Key: SPARK-31867
> URL: https://issues.apache.org/jira/browse/SPARK-31867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 3.0.0
>
>
> {code:java}
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> NULL
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> 00000001970-01-01
> spark-sql>
> {code}
> For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when 
> using the `NumberPrinterParser` to format it
> {code:java}
> switch (signStyle) {
>   case EXCEEDS_PAD:
> if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
>   buf.append(decimalStyle.getPositiveSign());
> }
> break;
>
>
> {code}
> the `minWidth` == `len(y..y)`
> the `EXCEED_POINTS` is 
> {code:java}
> /**
>  * Array of 10 to the power of n.
>  */
> static final long[] EXCEED_POINTS = new long[] {
> 0L,
> 10L,
> 100L,
> 1000L,
> 10000L,
> 100000L,
> 1000000L,
> 10000000L,
> 100000000L,
> 1000000000L,
> 10000000000L,
> };
> {code}
> So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException`
> will be raised.
> At the caller side, for `from_unixtime` the exception is suppressed and a
> silent data change occurs; for `date_format`, the
> `ArrayIndexOutOfBoundsException` propagates to the user.
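
A minimal sketch of the underlying java.time behaviour described above (not Spark code; the exact failure may vary by JDK version):

{code:scala}
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object ExceedsPadDemo {
  def main(args: Array[String]): Unit = {
    // 11 'y' letters -> NumberPrinterParser minWidth = 11, but EXCEED_POINTS only has indices 0..10.
    val formatter = DateTimeFormatter.ofPattern("y" * 11 + "-MM-dd")
    formatter.format(LocalDate.of(1970, 1, 1)) // expected to throw ArrayIndexOutOfBoundsException: 11
  }
}
{code}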



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31867) Fix silent data change for datetime formatting

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126654#comment-17126654
 ] 

Apache Spark commented on SPARK-31867:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28736

> Fix silent data change for datetime formatting 
> ---
>
> Key: SPARK-31867
> URL: https://issues.apache.org/jira/browse/SPARK-31867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 3.0.0
>
>
> {code:java}
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> NULL
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> 00000001970-01-01
> spark-sql>
> {code}
> For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when 
> using the `NumberPrinterParser` to format it
> {code:java}
> switch (signStyle) {
>   case EXCEEDS_PAD:
> if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
>   buf.append(decimalStyle.getPositiveSign());
> }
> break;
>
>
> {code}
> the `minWidth` == `len(y..y)`
> the `EXCEED_POINTS` is 
> {code:java}
> /**
>  * Array of 10 to the power of n.
>  */
> static final long[] EXCEED_POINTS = new long[] {
> 0L,
> 10L,
> 100L,
> 1000L,
> 10000L,
> 100000L,
> 1000000L,
> 10000000L,
> 100000000L,
> 1000000000L,
> 10000000000L,
> };
> {code}
> So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException`
> will be raised.
> At the caller side, for `from_unixtime` the exception is suppressed and a
> silent data change occurs; for `date_format`, the
> `ArrayIndexOutOfBoundsException` propagates to the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31913) StackOverflowError in FileScanRDD

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31913:


Assignee: Apache Spark

> StackOverflowError in FileScanRDD
> -
>
> Key: SPARK-31913
> URL: https://issues.apache.org/jira/browse/SPARK-31913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>
> Reading from FileScanRDD may fail with a StackOverflowError in my
> environment:
> - There are a large number of empty files in the table partition.
> - `spark.sql.files.maxPartitionBytes` is set to a large value: 1024MB
> A quick workaround is to set `spark.sql.files.maxPartitionBytes` to a small
> value, like the default 128MB.
> A better way is to resolve the recursive calls in FileScanRDD.
> {code}
> java.lang.StackOverflowError
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.getSubject(Subject.java:297)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:648)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:38)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:640)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:148)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:143)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:326)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31913) StackOverflowError in FileScanRDD

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31913:


Assignee: (was: Apache Spark)

> StackOverflowError in FileScanRDD
> -
>
> Key: SPARK-31913
> URL: https://issues.apache.org/jira/browse/SPARK-31913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Genmao Yu
>Priority: Minor
>
> Reading from FileScanRDD may fail with a StackOverflowError in my
> environment:
> - There are a large number of empty files in the table partition.
> - `spark.sql.files.maxPartitionBytes` is set to a large value: 1024MB
> A quick workaround is to set `spark.sql.files.maxPartitionBytes` to a small
> value, like the default 128MB.
> A better way is to resolve the recursive calls in FileScanRDD.
> {code}
> java.lang.StackOverflowError
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.getSubject(Subject.java:297)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:648)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:38)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:640)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:148)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:143)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:326)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31913) StackOverflowError in FileScanRDD

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126664#comment-17126664
 ] 

Apache Spark commented on SPARK-31913:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/28737

> StackOverflowError in FileScanRDD
> -
>
> Key: SPARK-31913
> URL: https://issues.apache.org/jira/browse/SPARK-31913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Genmao Yu
>Priority: Minor
>
> Reading from FileScanRDD may fail with a StackOverflowError in my
> environment:
> - There are a large number of empty files in the table partition.
> - `spark.sql.files.maxPartitionBytes` is set to a large value: 1024MB
> A quick workaround is to set `spark.sql.files.maxPartitionBytes` to a small
> value, like the default 128MB.
> A better way is to resolve the recursive calls in FileScanRDD.
> {code}
> java.lang.StackOverflowError
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.getSubject(Subject.java:297)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:648)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:38)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:640)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:148)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:143)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:326)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31683) Make Prometheus output consistent with DropWizard 4.1 result

2020-06-05 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126671#comment-17126671
 ] 

Jorge Machado commented on SPARK-31683:
---

[~dongjoon] this only exports the metrics from the driver, right?

> Make Prometheus output consistent with DropWizard 4.1 result
> 
>
> Key: SPARK-31683
> URL: https://issues.apache.org/jira/browse/SPARK-31683
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-29032 adds Prometheus support.
> After that, SPARK-29674 upgraded DropWizard for JDK9+ support and causes 
> difference in output labels and number of keys.
>  
> This issue aims to fix this inconsistency in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31914) Apply SharedThriftServer to all ThriftServer related tests

2020-06-05 Thread Kent Yao (Jira)
Kent Yao created SPARK-31914:


 Summary: Apply SharedThriftServer to all ThriftServer related tests
 Key: SPARK-31914
 URL: https://issues.apache.org/jira/browse/SPARK-31914
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.1.0
Reporter: Kent Yao


To add:
{code:java}
HiveThriftBinaryServerSuite
HiveThriftCleanUpScratchDirSuite
HiveThriftHttpServerSuite
JdbcConnectionUriSuite
SingleSessionSuite
SparkMetadataOperationSuite
SparkThriftServerProtocolVersionsSuite
UISeleniumSuite
{code}

Existing ones:

{code:java}
ThriftServerQueryTestSuite
ThriftServerWithSparkContextSuite
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31914) Apply SharedThriftServer to all ThriftServer related tests

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31914:


Assignee: Apache Spark

> Apply SharedThriftServer to all ThriftServer related tests
> --
>
> Key: SPARK-31914
> URL: https://issues.apache.org/jira/browse/SPARK-31914
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> To add 
> {code:java}
> HiveThriftBinaryServerSuite
> HiveThriftCleanUpScratchDirSuite
> HiveThriftHttpServerSuite
> JdbcConnectionUriSuite
> SingleSessionSuite
> SparkMetadataOperationSuite
> SparkThriftServerProtocolVersionsSuite
> UISeleniumSuite
> {code}
> exist ones
> {code:java}
> ThriftServerQueryTestSuite
> ThriftServerWithSparkContextSuite
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31914) Apply SharedThriftServer to all ThriftServer related tests

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31914:


Assignee: (was: Apache Spark)

> Apply SharedThriftServer to all ThriftServer related tests
> --
>
> Key: SPARK-31914
> URL: https://issues.apache.org/jira/browse/SPARK-31914
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> To add 
> {code:java}
> HiveThriftBinaryServerSuite
> HiveThriftCleanUpScratchDirSuite
> HiveThriftHttpServerSuite
> JdbcConnectionUriSuite
> SingleSessionSuite
> SparkMetadataOperationSuite
> SparkThriftServerProtocolVersionsSuite
> UISeleniumSuite
> {code}
> exist ones
> {code:java}
> ThriftServerQueryTestSuite
> ThriftServerWithSparkContextSuite
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31914) Apply SharedThriftServer to all ThriftServer related tests

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126743#comment-17126743
 ] 

Apache Spark commented on SPARK-31914:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28738

> Apply SharedThriftServer to all ThriftServer related tests
> --
>
> Key: SPARK-31914
> URL: https://issues.apache.org/jira/browse/SPARK-31914
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> To add 
> {code:java}
> HiveThriftBinaryServerSuite
> HiveThriftCleanUpScratchDirSuite
> HiveThriftHttpServerSuite
> JdbcConnectionUriSuite
> SingleSessionSuite
> SparkMetadataOperationSuite
> SparkThriftServerProtocolVersionsSuite
> UISeleniumSuite
> {code}
> exist ones
> {code:java}
> ThriftServerQueryTestSuite
> ThriftServerWithSparkContextSuite
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31914) Apply SharedThriftServer to all ThriftServer related tests

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126745#comment-17126745
 ] 

Apache Spark commented on SPARK-31914:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28738

> Apply SharedThriftServer to all ThriftServer related tests
> --
>
> Key: SPARK-31914
> URL: https://issues.apache.org/jira/browse/SPARK-31914
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> To add 
> {code:java}
> HiveThriftBinaryServerSuite
> HiveThriftCleanUpScratchDirSuite
> HiveThriftHttpServerSuite
> JdbcConnectionUriSuite
> SingleSessionSuite
> SparkMetadataOperationSuite
> SparkThriftServerProtocolVersionsSuite
> UISeleniumSuite
> {code}
> exist ones
> {code:java}
> ThriftServerQueryTestSuite
> ThriftServerWithSparkContextSuite
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31910) Enable Java 8 time API in Thrift server

2020-06-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31910.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28729
[https://github.com/apache/spark/pull/28729]

> Enable Java 8 time API in Thrift server
> ---
>
> Key: SPARK-31910
> URL: https://issues.apache.org/jira/browse/SPARK-31910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> Switch to Java 8 time API by turning on the SQL config 
> spark.sql.datetime.java8API.enabled to address the following issues:
> # Date and timestamp string literals are parsed by using Java 8 time API and 
> Spark's session time zone. Before the changes, date/timestamp values were 
> collected as legacy types `java.sql.Date`/`java.sql.Timestamp`, and the value 
> of such types didn't respect the config `spark.sql.session.timeZone`. To have a
> consistent view, users had to keep the JVM time zone and Spark's session time
> zone in sync.
> # After the changes, formatting of date values doesn't depend on JVM time 
> zone.
> # While returning dates/timestamps of Java 8 type, we can avoid 
> dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid 
> calendar (Julian + Gregorian), and the issues related to calendar switching.
> # Properly handle negative years (BCE).
> # Consistent conversion of date/timestamp strings to/from internal Catalyst
> types in both directions, to and from Spark.
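
For reference, the user-visible effect of the config, as a small sketch to paste into a Spark 3.x spark-shell session (the literal values are illustrative):

{code:scala}
spark.conf.set("spark.sql.datetime.java8API.enabled", "true")

val row = spark.sql("SELECT DATE '2020-06-05' AS d, TIMESTAMP '2020-06-05 12:00:00' AS ts").head()
// With the config enabled, values collect as java.time types instead of the legacy java.sql types.
println(row.get(0).getClass) // class java.time.LocalDate  (java.sql.Date when disabled)
println(row.get(1).getClass) // class java.time.Instant    (java.sql.Timestamp when disabled)
{code}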



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31910) Enable Java 8 time API in Thrift server

2020-06-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31910:
---

Assignee: Maxim Gekk  (was: Apache Spark)

> Enable Java 8 time API in Thrift server
> ---
>
> Key: SPARK-31910
> URL: https://issues.apache.org/jira/browse/SPARK-31910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Switch to Java 8 time API by turning on the SQL config 
> spark.sql.datetime.java8API.enabled to address the following issues:
> # Date and timestamp string literals are parsed by using Java 8 time API and 
> Spark's session time zone. Before the changes, date/timestamp values were 
> collected as legacy types `java.sql.Date`/`java.sql.Timestamp`, and the value 
> of such types didn't respect the config `spark.sql.session.timeZone`. To have 
> consistent view, users had to keep JVM time zone and Spark's session time 
> zone in sync.
> # After the changes, formatting of date values doesn't depend on JVM time 
> zone.
> # While returning dates/timestamps of Java 8 type, we can avoid 
> dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid 
> calendar (Julian + Gregorian), and the issues related to calendar switching.
> # Properly handle negative years (BCE).
> # Consistent conversion of date/timestamp strings to/from internal Catalyst 
> types in both direction to and from Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31911) Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data

2020-06-05 Thread Brandon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126936#comment-17126936
 ] 

Brandon commented on SPARK-31911:
-

Thanks for looking [~ste...@apache.org], I attached logs to the Hadoop ticket.

> Using S3A staging committer, pending uploads are committed more than once and 
> listed incorrectly in _SUCCESS data
> -
>
> Key: SPARK-31911
> URL: https://issues.apache.org/jira/browse/SPARK-31911
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Brandon
>Priority: Major
>
> First of all thanks for the great work on the S3 committers. I was able set 
> up the directory staging committer in my environment following docs at 
> [https://github.com/apache/spark/blob/master/docs/cloud-integration.md#committing-work-into-cloud-storage-safely-and-fast]
>  and tested one of my Spark applications using it. The Spark version is 2.4.4 
> with Hadoop 3.2.1 and the cloud committer bindings. The application writes 
> multiple DataFrames to ORC/Parquet in S3, submitting them as write jobs to 
> Spark in parallel.
> I think I'm seeing a bug where the staging committer will complete pending 
> uploads more than once. The main symptom how I discovered this is that the 
> _SUCCESS data files under each table will contain overlapping file names that 
> belong to separate tables. From my reading of the code, that's because the 
> filenames in _SUCCESS reflect which multipart uploads were completed in the 
> commit for that particular table.
> An example:
> Concurrently, fire off DataFrame.write.orc("s3a://bucket/a") and 
> DataFrame.write.orc("s3a://bucket/b"). Suppose each table has one partition 
> so writes one partition file.
> When the two writes are done,
>  * /a/_SUCCESS contains two filenames: /a/part- and /b/part-.
>  * /b/_SUCCESS contains the same two filenames.
> Setting S3A logs to debug, I see the commitJob operation belonging to table a 
> includes completing the uploads of /a/part- and /b/part-. Then again, 
> commitJob for table b includes the same completions. I haven't had a problem 
> yet, but I wonder if having these extra requests would become an issue at 
> higher scale, where dozens of commits with hundreds of files may be happening 
> concurrently in the application.
> I believe this may be caused from the way the pendingSet files are stored in 
> the staging directory. They are stored under one directory named by the 
> jobID, in the Hadoop code. However, for all write jobs executed by the Spark 
> application, the jobID passed to Hadoop is the same - the application ID. 
> Maybe the staging commit algorithm was built on the assumption that each 
> instance of the algorithm would use a unique random jobID.
> [~ste...@apache.org] , [~rdblue] Having seen your names on most of this work 
> (thank you), I would be interested to know your thoughts on this. Also it's 
> my first time opening a bug here, so let me know if there's anything else I 
> can do to help report the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31915) Remove projection that adds grouping keys in grouped and cogrouped pandas UDFs

2020-06-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31915:


 Summary: Remove projection that adds grouping keys in grouped and 
cogrouped pandas UDFs
 Key: SPARK-31915
 URL: https://issues.apache.org/jira/browse/SPARK-31915
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Currently, grouped and cogrouped pandas UDFs in Spark unnecessarily project
the grouping keys. This results in a case-sensitivity resolution failure when the
projection contains columns such as "Column" and "column", as they are considered
different but ambiguous columns.

It results as below:

{code}
from pyspark.sql.functions import *

df = spark.createDataFrame([[1, 1]], ["column", "Score"])

@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()
{code}

{code}
pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: 
COLUMN, COLUMN.;
{code}

{code}
pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input 
columns: [COLUMN, COLUMN, value, value];;
'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, 
column#13L, value#14L), [column#22L, value#23L]
:- Project [COLUMN#9L, column#9L, value#10L]
:  +- LogicalRDD [column#9L, value#10L], false
+- Project [COLUMN#13L, column#13L, value#14L]
   +- LogicalRDD [column#13L, value#14L], false
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31915) Remove projection that adds grouping keys in grouped and cogrouped pandas UDFs

2020-06-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31915:
-
Description: 
Currently, grouped and cogrouped pandas UDFs in Spark unnecessarily project
the grouping keys. This results in a case-sensitivity resolution failure when the
projection contains columns such as "Column" and "column", as they are considered
different but ambiguous columns.

It results as below:

{code}
from pyspark.sql.functions import *

df = spark.createDataFrame([[1, 1]], ["column", "Score"])

@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()
{code}

{code}
pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: 
COLUMN, COLUMN.;
{code}

{code}
df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
df2 = spark.createDataFrame([(1, 1)], ("column", "value"))

df1.groupby("COLUMN").cogroup(
df2.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, df1.schema).show()
{code}

{code}
pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input 
columns: [COLUMN, COLUMN, value, value];;
'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, 
column#13L, value#14L), [column#22L, value#23L]
:- Project [COLUMN#9L, column#9L, value#10L]
:  +- LogicalRDD [column#9L, value#10L], false
+- Project [COLUMN#13L, column#13L, value#14L]
   +- LogicalRDD [column#13L, value#14L], false
{code}

  was:
Currently, grouped and cogrouped pandas UDFs in Spark unnecessarily projects 
the grouping keys. This results in case-sensitivity resolution failure when the 
project contains columns such as "Column" and "column" as they are considered 
different but ambiguous columns. 

It results as below:

{code}
from pyspark.sql.functions import *

df = spark.createDataFrame([[1, 1]], ["column", "Score"])

@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()
{code}

{code}
pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: 
COLUMN, COLUMN.;
{code}

{code}
pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input 
columns: [COLUMN, COLUMN, value, value];;
'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, 
column#13L, value#14L), [column#22L, value#23L]
:- Project [COLUMN#9L, column#9L, value#10L]
:  +- LogicalRDD [column#9L, value#10L], false
+- Project [COLUMN#13L, column#13L, value#14L]
   +- LogicalRDD [column#13L, value#14L], false
{code}


> Remove projection that adds grouping keys in grouped and cogrouped pandas UDFs
> --
>
> Key: SPARK-31915
> URL: https://issues.apache.org/jira/browse/SPARK-31915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, grouped and cogrouped pandas UDFs in Spark unnecessarily projects 
> the grouping keys. This results in case-sensitivity resolution failure when 
> the project contains columns such as "Column" and "column" as they are 
> considered different but ambiguous columns. 
> It results as below:
> {code}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([[1, 1]], ["column", "Score"])
> @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
> def my_pandas_udf(pdf):
> return pdf.assign(Score=0.5)
> df.groupby('COLUMN').apply(my_pandas_udf).show()
> {code}
> {code}
> pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could 
> be: COLUMN, COLUMN.;
> {code}
> {code}
> df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
> df2 = spark.createDataFrame([(1, 1)], ("column", "value"))
> df1.groupby("COLUMN").cogroup(
> df2.groupby("COLUMN")
> ).applyInPandas(lambda r, l: r + l, df1.schema).show()
> {code}
> {code}
> pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input 
> columns: [COLUMN, COLUMN, value, value];;
> 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, 
> column#13L, value#14L), [column#22L, value#23L]
> :- Project [COLUMN#9L, column#9L, value#10L]
> :  +- LogicalRDD [column#9L, value#10L], false
> +- Project [COLUMN#13L, column#13L, value#14L]
>+- LogicalRDD [column#13L, value#14L], false
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31916) StringConcat can overflow `length`, leads to StringIndexOutOfBoundsException

2020-06-05 Thread Jeffrey Stokes (Jira)
Jeffrey Stokes created SPARK-31916:
--

 Summary: StringConcat can overflow `length`, leads to 
StringIndexOutOfBoundsException
 Key: SPARK-31916
 URL: https://issues.apache.org/jira/browse/SPARK-31916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Jeffrey Stokes


We have query plans that, through multiple transformations, can grow extremely
long. These would eventually throw OutOfMemory exceptions
(https://issues.apache.org/jira/browse/SPARK-26103 and the related
https://issues.apache.org/jira/browse/SPARK-25380).

 

We backported the changes from [https://github.com/apache/spark/pull/23169] 
into our distribution of Spark, based on 2.4.4, and attempted to use the added 
`spark.sql.maxPlanStringLength`. While this works in some cases, large query 
plans can still lead to issues stemming from `StringConcat` in 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala.

 

The following unit test exhibits the issue, and continues to fail on the
master branch of Spark:

 
{code:scala}
test("StringConcat doesn't overflow on many inputs") {
  val concat = new StringConcat(maxLength = 100)
  0.to(Integer.MAX_VALUE).foreach { _ =>
    concat.append("hello world")
  }
  assert(concat.toString.length === 100)
}
{code}
 

Looking at the append method here: 
[https://github.com/apache/spark/blob/fc6af9d900ec6f6a1cbe8f987857a69e6ef600d1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L118-L128]

 

It seems that regardless of whether the string to be appended is added fully to 
the internal buffer, added as a substring to reach `maxLength`, or not added at 
all, the internal `length` field is incremented by the length of `s`. Eventually 
this will overflow an int and cause L123 to call substring with a negative index.
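
A minimal sketch of one possible way to keep `length` from overflowing, assuming the goal is simply to never count more characters than were actually kept (this is an illustration, not the actual Spark patch):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Illustrative stand-in for StringConcat; capping the counter is the point of the sketch.
class SaturatingStringConcat(val maxLength: Int) {
  private val strings = ArrayBuffer.empty[String]
  private var length: Int = 0

  def append(s: String): Unit = {
    if (s != null && length < maxLength) {
      val available = maxLength - length
      val kept = if (s.length <= available) s else s.substring(0, available)
      strings += kept
      // Count only what was kept, so `length` can never exceed maxLength or overflow.
      length += kept.length
    }
  }

  override def toString: String = strings.mkString
}
{code}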



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31903) toPandas with Arrow enabled doesn't show metrics in Query UI.

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127028#comment-17127028
 ] 

Apache Spark commented on SPARK-31903:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/28740

> toPandas with Arrow enabled doesn't show metrics in Query UI.
> -
>
> Key: SPARK-31903
> URL: https://issues.apache.org/jira/browse/SPARK-31903
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, R
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2020-06-03 at 4.47.07 PM.png, Screen Shot 
> 2020-06-03 at 4.47.27 PM.png
>
>
> When calling {{toPandas}}, usually Query UI shows each plan node's metric and 
> corresponding Stage ID and Task ID:
> {code:java}
> >>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 
> >>> 'y', 'z'])
> >>> df.toPandas()
>    x   y    z
> 0  1  10  abc
> 1  2  20  def
> {code}
> !Screen Shot 2020-06-03 at 4.47.07 PM.png!
> but if Arrow execution is enabled, it shows only plan nodes and the duration 
> is not correct:
> {code:java}
> >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
> >>> df.toPandas()
>    x   y    z
> 0  1  10  abc
> 1  2  20  def{code}
>  
> !Screen Shot 2020-06-03 at 4.47.27 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31903) toPandas with Arrow enabled doesn't show metrics in Query UI.

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127027#comment-17127027
 ] 

Apache Spark commented on SPARK-31903:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/28740

> toPandas with Arrow enabled doesn't show metrics in Query UI.
> -
>
> Key: SPARK-31903
> URL: https://issues.apache.org/jira/browse/SPARK-31903
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, R
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2020-06-03 at 4.47.07 PM.png, Screen Shot 
> 2020-06-03 at 4.47.27 PM.png
>
>
> When calling {{toPandas}}, usually Query UI shows each plan node's metric and 
> corresponding Stage ID and Task ID:
> {code:java}
> >>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 
> >>> 'y', 'z'])
> >>> df.toPandas()
>    x   y    z
> 0  1  10  abc
> 1  2  20  def
> {code}
> !Screen Shot 2020-06-03 at 4.47.07 PM.png!
> but if Arrow execution is enabled, it shows only plan nodes and the duration 
> is not correct:
> {code:java}
> >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
> >>> df.toPandas()
>    x   y    z
> 0  1  10  abc
> 1  2  20  def{code}
>  
> !Screen Shot 2020-06-03 at 4.47.27 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29750) Avoid dependency from joda-time

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29750:


Assignee: Apache Spark

> Avoid dependency from joda-time
> ---
>
> Key: SPARK-29750
> URL: https://issues.apache.org/jira/browse/SPARK-29750
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> * Remove the direct dependency on joda-time
> * If it is used somewhere in Spark, use the Java 8 time API instead (a small sketch of the mapping follows below).
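> A hedged illustration of the migration direction; the call below is an example only, not an actual Spark call site:
> {code:scala}
> // joda-time style (to be removed):
> //   val millis = new org.joda.time.DateTime(2020, 6, 5, 0, 0, org.joda.time.DateTimeZone.UTC).getMillis
> // Java 8 time API equivalent:
> import java.time.{LocalDateTime, ZoneOffset}
> 
> val millis: Long = LocalDateTime.of(2020, 6, 5, 0, 0)
>   .toInstant(ZoneOffset.UTC)
>   .toEpochMilli
> {code}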



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29524) Unordered interval units

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29524:


Assignee: (was: Apache Spark)

> Unordered interval units
> 
>
> Key: SPARK-29524
> URL: https://issues.apache.org/jira/browse/SPARK-29524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, Spark requires a particular order of interval units when casting from 
> strings - `YEAR` .. `MICROSECOND`. PostgreSQL allows any order:
> {code}
> maxim=# select interval '1 second 2 hours';
>  interval
> ----------
>  02:00:01
> (1 row)
> {code}
> but Spark fails while parsing:
> {code}
> spark-sql> select interval '1 second 2 hours';
> NULL
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29524) Unordered interval units

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29524:


Assignee: Apache Spark

> Unordered interval units
> 
>
> Key: SPARK-29524
> URL: https://issues.apache.org/jira/browse/SPARK-29524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark requires a particular order of interval units when casting from 
> strings - `YEAR` .. `MICROSECOND`. PostgreSQL allows any order:
> {code}
> maxim=# select interval '1 second 2 hours';
>  interval
> ----------
>  02:00:01
> (1 row)
> {code}
> but Spark fails while parsing:
> {code}
> spark-sql> select interval '1 second 2 hours';
> NULL
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29750) Avoid dependency from joda-time

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29750:


Assignee: (was: Apache Spark)

> Avoid dependency from joda-time
> ---
>
> Key: SPARK-29750
> URL: https://issues.apache.org/jira/browse/SPARK-29750
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> * Remove the direct dependency on joda-time
> * If it is used somewhere in Spark, use the Java 8 time API instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29712) fromDayTimeString() does not take into account the left bound

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29712:


Assignee: Apache Spark

> fromDayTimeString() does not take into account the left bound
> -
>
> Key: SPARK-29712
> URL: https://issues.apache.org/jira/browse/SPARK-29712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, fromDayTimeString() takes into account the right bound but not the 
> left one. For example:
> {code}
> spark-sql> SELECT interval '1 2:03:04' hour to minute;
> interval 1 days 2 hours 3 minutes
> {code}
> The result should be *interval 2 hours 3 minutes*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29440) Support java.time.Duration as an external type of CalendarIntervalType

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29440:


Assignee: (was: Apache Spark)

> Support java.time.Duration as an external type of CalendarIntervalType
> --
>
> Key: SPARK-29440
> URL: https://issues.apache.org/jira/browse/SPARK-29440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, Spark SQL doesn't have any external type for Catalyst's 
> CalendarIntervalType. The internal CalendarInterval is partially exposed but it 
> cannot be used in UDFs, for example. This ticket aims to provide 
> `java.time.Duration` as one of the external types of Spark `INTERVAL`.
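> A hedged sketch of the mapping this ticket asks for. CalendarInterval's (months, days, microseconds) fields exist today, but the helper below is illustrative and not part of Spark's API:
> {code:scala}
> import java.time.Duration
> import org.apache.spark.unsafe.types.CalendarInterval
> 
> // Months have no fixed length, so a faithful Duration can only cover days and microseconds.
> def intervalToDuration(i: CalendarInterval): Duration = {
>   require(i.months == 0, "month components cannot be represented as java.time.Duration")
>   Duration.ofDays(i.days).plusNanos(i.microseconds * 1000L)
> }
> {code}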



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29440) Support java.time.Duration as an external type of CalendarIntervalType

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29440:


Assignee: Apache Spark

> Support java.time.Duration as an external type of CalendarIntervalType
> --
>
> Key: SPARK-29440
> URL: https://issues.apache.org/jira/browse/SPARK-29440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark SQL doesn't have any external type for Catalyst's 
> CalendarIntervalType. The internal CalendarInterval is partially exposed but it 
> cannot be used in UDFs, for example. This ticket aims to provide 
> `java.time.Duration` as one of the external types of Spark `INTERVAL`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29712) fromDayTimeString() does not take into account the left bound

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29712:


Assignee: (was: Apache Spark)

> fromDayTimeString() does not take into account the left bound
> -
>
> Key: SPARK-29712
> URL: https://issues.apache.org/jira/browse/SPARK-29712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, fromDayTimeString() takes into account the right bound but not the 
> left one. For example:
> {code}
> spark-sql> SELECT interval '1 2:03:04' hour to minute;
> interval 1 days 2 hours 3 minutes
> {code}
> The result should be *interval 2 hours 3 minutes*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29920) Parsing failure on interval '20 15' day to hour

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127039#comment-17127039
 ] 

Apache Spark commented on SPARK-29920:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/26473

> Parsing failure on interval '20 15' day to hour
> ---
>
> Key: SPARK-29920
> URL: https://issues.apache.org/jira/browse/SPARK-29920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> spark-sql> select interval '20 15' day to hour;
> Error in query:
> requirement failed: Interval string must match day-time format of 'd 
> h:m:s.n': 20 15(line 1, pos 16)
> == SQL ==
> select interval '20 15' day to hour
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29173) Benchmark JDK 11 performance with FilterPushdownBenchmark

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127040#comment-17127040
 ] 

Apache Spark commented on SPARK-29173:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/27078

> Benchmark JDK 11 performance with FilterPushdownBenchmark
> -
>
> Key: SPARK-29173
> URL: https://issues.apache.org/jira/browse/SPARK-29173
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: jdk11.txt, jdk8.txt
>
>
> I was comparing the performance of JDK 8 and JDK 11 using 
> {{FilterPushdownBenchmark}}:
> {code:sh}
> bin/spark-submit --master local[1] --conf 
> "spark.driver.extraJavaOptions=-XX:+UseG1GC" --driver-memory 20G --class 
> org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark 
> jars/spark-sql_2.12-3.0.0-SNAPSHOT-tests.jar
> {code}
> It seems JDK 11 is slower than JDK 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30501) Remove SQL config spark.sql.parquet.int64AsTimestampMillis

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127043#comment-17127043
 ] 

Apache Spark commented on SPARK-30501:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/27169

> Remove SQL config spark.sql.parquet.int64AsTimestampMillis
> --
>
> Key: SPARK-30501
> URL: https://issues.apache.org/jira/browse/SPARK-30501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> The SQL config has been deprecated since Spark 2.3, and should be removed in 
> Spark 3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29920) Parsing failure on interval '20 15' day to hour

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127042#comment-17127042
 ] 

Apache Spark commented on SPARK-29920:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/26473

> Parsing failure on interval '20 15' day to hour
> ---
>
> Key: SPARK-29920
> URL: https://issues.apache.org/jira/browse/SPARK-29920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> spark-sql> select interval '20 15' day to hour;
> Error in query:
> requirement failed: Interval string must match day-time format of 'd 
> h:m:s.n': 20 15(line 1, pos 16)
> == SQL ==
> select interval '20 15' day to hour
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30309) Mark `Filter` as a `sealed` class

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127044#comment-17127044
 ] 

Apache Spark commented on SPARK-30309:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/26950

> Mark `Filter` as a `sealed` class
> -
>
> Key: SPARK-30309
> URL: https://issues.apache.org/jira/browse/SPARK-30309
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Add the `sealed` keyword to the `Filter` class at the 
> `org.apache.spark.sql.sources` package. So, the compiler should output a 
> warning if handling of a filter is missed in a datasource:
> {code}
> Warning:(154, 65) match may not be exhaustive.
> It would fail on the following inputs: AlwaysFalse(), AlwaysTrue()
> def translate(filter: sources.Filter): Option[Expression] = filter match {
> {code}
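> A hedged, simplified illustration of what `sealed` buys; the classes below are stand-ins, not the real org.apache.spark.sql.sources hierarchy:
> {code:scala}
> sealed abstract class Filter
> case class EqualTo(attribute: String, value: Any) extends Filter
> case class IsNull(attribute: String) extends Filter
> 
> // Because Filter is sealed and IsNull is not handled, the compiler emits a
> // "match may not be exhaustive" warning here instead of staying silent.
> def translate(filter: Filter): Option[String] = filter match {
>   case EqualTo(a, v) => Some(s"$a = $v")
> }
> {code}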



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28624) make_date is inconsistent when reading from table

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28624:


Assignee: Apache Spark

> make_date is inconsistent when reading from table
> -
>
> Key: SPARK-28624
> URL: https://issues.apache.org/jira/browse/SPARK-28624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screen Shot 2019-08-05 at 18.19.39.png, collect 
> make_date.png
>
>
> {code:sql}
> spark-sql> create table test_make_date as select make_date(-44, 3, 15) as d;
> spark-sql> select d, make_date(-44, 3, 15) from test_make_date;
> 0045-03-15	-0044-03-15
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28624) make_date is inconsistent when reading from table

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28624:


Assignee: (was: Apache Spark)

> make_date is inconsistent when reading from table
> -
>
> Key: SPARK-28624
> URL: https://issues.apache.org/jira/browse/SPARK-28624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Screen Shot 2019-08-05 at 18.19.39.png, collect 
> make_date.png
>
>
> {code:sql}
> spark-sql> create table test_make_date as select make_date(-44, 3, 15) as d;
> spark-sql> select d, make_date(-44, 3, 15) from test_make_date;
> 0045-03-15	-0044-03-15
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30416) Log a warning for deprecated SQL config in `set()` and `unset()`

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127046#comment-17127046
 ] 

Apache Spark commented on SPARK-30416:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/27645

> Log a warning for deprecated SQL config in `set()` and `unset()`
> 
>
> Key: SPARK-30416
> URL: https://issues.apache.org/jira/browse/SPARK-30416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> - Gather deprecated SQL configs and add extra info - when a config was 
> deprecated and why
> - Output warning about deprecated SQL config in set() and unset()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31471) Add a script to run multiple benchmarks

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31471:


Assignee: Apache Spark

> Add a script to run multiple benchmarks
> ---
>
> Key: SPARK-31471
> URL: https://issues.apache.org/jira/browse/SPARK-31471
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Add a python script to run multiple benchmarks. The script can be taken from 
> [https://github.com/apache/spark/pull/27078]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31023) Support foldable schemas by `from_json`

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127047#comment-17127047
 ] 

Apache Spark commented on SPARK-31023:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/27804

> Support foldable schemas by `from_json`
> ---
>
> Key: SPARK-31023
> URL: https://issues.apache.org/jira/browse/SPARK-31023
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, Spark accepts only literals, or schema_of_json with literal input, as 
> the schema parameter of from_json, and it fails on any other foldable expression, 
> for instance:
> {code:sql}
> spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id 
> INT, dpt_org_city STRING', 'dpt_org_', ''));
> Error in query: Schema should be specified in DDL format as a string literal 
> or output of the schema_of_json function instead of replace('dpt_org_id INT, 
> dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
> {code}
> There is no reason to restrict users to literals. The ticket aims to 
> support any foldable schema in from_json().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31471) Add a script to run multiple benchmarks

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31471:


Assignee: (was: Apache Spark)

> Add a script to run multiple benchmarks
> ---
>
> Key: SPARK-31471
> URL: https://issues.apache.org/jira/browse/SPARK-31471
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Add a python script to run multiple benchmarks. The script can be taken from 
> [https://github.com/apache/spark/pull/27078]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31455) Fix rebasing of not-existed dates/timestamps

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127048#comment-17127048
 ] 

Apache Spark commented on SPARK-31455:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28225

> Fix rebasing of not-existed dates/timestamps
> 
>
> Key: SPARK-31455
> URL: https://issues.apache.org/jira/browse/SPARK-31455
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31489) Failure on pushing down filters with java.time.LocalDate values in ORC

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127050#comment-17127050
 ] 

Apache Spark commented on SPARK-31489:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28272

> Failure on pushing down filters with java.time.LocalDate values in ORC
> --
>
> Key: SPARK-31489
> URL: https://issues.apache.org/jira/browse/SPARK-31489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> When spark.sql.datetime.java8API.enabled is set to true, filters pushed down 
> with java.time.LocalDate values to the ORC datasource fail with the exception:
> {code}
> Wrong value class java.time.LocalDate for DATE.EQUALS leaf
> java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for 
> DATE.EQUALS leaf
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.(SearchArgumentImpl.java:75)
>   at 
> org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31025) Support foldable input by `schema_of_csv`

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127049#comment-17127049
 ] 

Apache Spark commented on SPARK-31025:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/27804

> Support foldable input by `schema_of_csv` 
> --
>
> Key: SPARK-31025
> URL: https://issues.apache.org/jira/browse/SPARK-31025
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, the `schema_of_csv()` function allows only a string literal as the 
> input. The ticket aims to support any foldable string expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127052#comment-17127052
 ] 

Apache Spark commented on SPARK-31553:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28388

> Wrong result of isInCollection for large collections
> 
>
> Key: SPARK-31553
> URL: https://issues.apache.org/jira/browse/SPARK-31553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> If the size of a collection passed to isInCollection is bigger than 
> spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
> results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--------------+
> |isInCollection|
> +--------------+
> |         false|
> +--------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31563) Failure of InSet.sql for UTF8String collection

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127051#comment-17127051
 ] 

Apache Spark commented on SPARK-31563:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28399

> Failure of InSet.sql for UTF8String collection
> --
>
> Key: SPARK-31563
> URL: https://issues.apache.org/jira/browse/SPARK-31563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> The InSet expression works on collections of internal Catalyst's types. We 
> can see this in the optimization when In is replaced by InSet, and In's 
> collection is evaluated to internal Catalyst's values: 
> [https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L254]
> {code:scala}
> if (newList.length > SQLConf.get.optimizerInSetConversionThreshold) {
>   val hSet = newList.map(e => e.eval(EmptyRow))
>   InSet(v, HashSet() ++ hSet)
> }
> {code}
> The code existed before the optimization 
> https://github.com/apache/spark/pull/25754 that made another wrong assumption 
> about collection types.
> If InSet accepts only internal Catalyst's types, the following code shouldn't 
> fail:
> {code:scala}
> InSet(Literal("a"), Set("a", "b").map(UTF8String.fromString)).sql
> {code}
> but it fails with the exception:
> {code}
> Unsupported literal type class org.apache.spark.unsafe.types.UTF8String a
> java.lang.RuntimeException: Unsupported literal type class 
> org.apache.spark.unsafe.types.UTF8String a
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.InSet.$anonfun$sql$2(predicates.scala:522)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31557) Legacy parser incorrectly interprets pre-Gregorian dates

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127055#comment-17127055
 ] 

Apache Spark commented on SPARK-31557:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28398

> Legacy parser incorrectly interprets pre-Gregorian dates
> 
>
> Key: SPARK-31557
> URL: https://issues.apache.org/jira/browse/SPARK-31557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.0.0
>
>
> With CSV:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
> "1800-01-01").map(x => s"$x,$x")
> seq: Seq[String] = List(0002-01-01,0002-01-01, 1000-01-01,1000-01-01, 
> 1500-01-01,1500-01-01, 1800-01-01,1800-01-01)
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").csv(ds).show
> +----------+----------+
> |  expected|    actual|
> +----------+----------+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +----------+----------+
> scala> 
> {noformat}
> Similarly, with JSON:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
> "1800-01-01").map { x =>
>   s"""{"expected": "$x", "actual": "$x"}"""
> }
>  |  | seq: Seq[String] = List({"expected": "0002-01-01", "actual": 
> "0002-01-01"}, {"expected": "1000-01-01", "actual": "1000-01-01"}, 
> {"expected": "1500-01-01", "actual": "1500-01-01"}, {"expected": 
> "1800-01-01", "actual": "1800-01-01"})
> scala> 
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").json(ds).show
> +----------+----------+
> |  expected|    actual|
> +----------+----------+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +----------+----------+
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31044) Support foldable input by `schema_of_json`

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127053#comment-17127053
 ] 

Apache Spark commented on SPARK-31044:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/27804

> Support foldable input by `schema_of_json`
> --
>
> Key: SPARK-31044
> URL: https://issues.apache.org/jira/browse/SPARK-31044
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, the `schema_of_json()` function allows only a string literal as the 
> input. The ticket aims to support any foldable string expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127056#comment-17127056
 ] 

Apache Spark commented on SPARK-31488:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28272

> Support `java.time.LocalDate` in Parquet filter pushdown
> 
>
> Key: SPARK-31488
> URL: https://issues.apache.org/jira/browse/SPARK-31488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, ParquetFilters supports only java.sql.Date values of DateType, and 
> explicitly casts Any to java.sql.Date, see
> https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176
> So, any filters that refer to date values are not pushed down to Parquet when 
> spark.sql.datetime.java8API.enabled is true.
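> A hedged sketch of the conversion the fix needs, assuming both external date representations should map to the same epoch-day value; this is illustrative, not the actual ParquetFilters code:
> {code:scala}
> import java.time.LocalDate
> 
> // Convert either external date representation to days since the epoch,
> // which is what a Parquet DATE filter ultimately compares against.
> def dateToEpochDays(value: Any): Int = value match {
>   case d: java.sql.Date => Math.toIntExact(d.toLocalDate.toEpochDay)
>   case ld: LocalDate    => Math.toIntExact(ld.toEpochDay)
> }
> {code}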



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127054#comment-17127054
 ] 

Apache Spark commented on SPARK-31553:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28328

> Wrong result of isInCollection for large collections
> 
>
> Key: SPARK-31553
> URL: https://issues.apache.org/jira/browse/SPARK-31553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> If the size of a collection passed to isInCollection is bigger than 
> spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
> results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--------------+
> |isInCollection|
> +--------------+
> |         false|
> +--------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31563) Failure of InSet.sql for UTF8String collection

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127058#comment-17127058
 ] 

Apache Spark commented on SPARK-31563:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28343

> Failure of InSet.sql for UTF8String collection
> --
>
> Key: SPARK-31563
> URL: https://issues.apache.org/jira/browse/SPARK-31563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> The InSet expression works on collections of internal Catalyst's types. We 
> can see this in the optimization when In is replaced by InSet, and In's 
> collection is evaluated to internal Catalyst's values: 
> [https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L254]
> {code:scala}
> if (newList.length > SQLConf.get.optimizerInSetConversionThreshold) {
>   val hSet = newList.map(e => e.eval(EmptyRow))
>   InSet(v, HashSet() ++ hSet)
> }
> {code}
> The code existed before the optimization 
> https://github.com/apache/spark/pull/25754 that made another wrong assumption 
> about collection types.
> If InSet accepts only internal Catalyst's types, the following code shouldn't 
> fail:
> {code:scala}
> InSet(Literal("a"), Set("a", "b").map(UTF8String.fromString)).sql
> {code}
> but it fails with the exception:
> {code}
> Unsupported literal type class org.apache.spark.unsafe.types.UTF8String a
> java.lang.RuntimeException: Unsupported literal type class 
> org.apache.spark.unsafe.types.UTF8String a
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.InSet.$anonfun$sql$2(predicates.scala:522)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-06-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127059#comment-17127059
 ] 

Apache Spark commented on SPARK-29048:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28388

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5, 2.4.6
>Reporter: Weichen Xu
>Priority: Major
>
> The query optimizer is slow when using Column.isInCollection() with a large 
> collection.
> The optimizer takes a long time, and on the UI all I see 
> is "Running commands". This can take from tens of minutes to 11 hours 
> depending on how many values there are.
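> A hedged repro sketch of the pattern described; the collection size and column name are arbitrary:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
> 
> // A collection far above spark.sql.optimizer.inSetConversionThreshold.
> val values = (0 until 300000).map(_.toString)
> val df = Seq("1").toDF("x")
> 
> // Optimizing the resulting predicate over hundreds of thousands of literals
> // is the step reported to take minutes to hours.
> df.filter($"x".isInCollection(values)).explain()
> {code}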



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29048:


Assignee: (was: Apache Spark)

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5, 2.4.6
>Reporter: Weichen Xu
>Priority: Major
>
> The query optimizer is slow when using Column.isInCollection() with a large 
> collection.
> The optimizer takes a long time, and on the UI all I see 
> is "Running commands". This can take from tens of minutes to 11 hours 
> depending on how many values there are.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection

2020-06-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29048:


Assignee: Apache Spark

> Query optimizer slow when using Column.isInCollection() with a large size 
> collection
> 
>
> Key: SPARK-29048
> URL: https://issues.apache.org/jira/browse/SPARK-29048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5, 2.4.6
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Major
>
> The query optimizer is slow when using Column.isInCollection() with a large 
> collection.
> The optimizer takes a long time, and on the UI all I see 
> is "Running commands". This can take from tens of minutes to 11 hours 
> depending on how many values there are.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31917) Spark planner is extremely slow when struct transformation is used

2020-06-05 Thread Attila Kelemen (Jira)
Attila Kelemen created SPARK-31917:
--

 Summary: Spark planner is extremely slow when struct 
transformation is used
 Key: SPARK-31917
 URL: https://issues.apache.org/jira/browse/SPARK-31917
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
 Environment: JDK 1.8.0_181 (64 bit)
Reporter: Attila Kelemen


When doing multiple struct transformations one after another, the Spark 
planner becomes extremely slow, with planning time growing rapidly with the 
depth of the transformation.

See the following for replicating the issue: 
[SparkTest.java|https://gist.github.com/kelemen/1bafe46e898326252cfda224b98a0e07].

The linked test runs out of memory with a 2 GB heap in 
_RelationalGroupedDataset.agg_, but _agg_ completes with a 4 GB heap. In that 
case the _show_ call must be commented out, because it fails even with 4 GB 
(I have not tried how much it needs, given that _agg_ already takes a long 
time).

With all the _mapFields_ calls applied, _agg_ took 20 minutes.

With the 1st _mapFields_ call commented out, _agg_ took 50 seconds.

With the 2nd and 3rd _mapFields_ calls commented out, _agg_ took 5 seconds.

After various tries, it seems that the time required mainly depends on how 
many transformations are applied. Even the aggregation can be left out and 
replaced with another simple transformation.

Aside from the performance, the plan generated by Spark is poor as well, 
containing expressions like (simplified notation) _sum(struct(F.A1, F.A2, F.A3).A3)_, 
and the needless struct wrapping is sometimes nested even deeper in 
the optimized plan.
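
A minimal Scala sketch of the pattern described, for illustration only (the actual repro is the linked SparkTest.java, and _mapFields_ below is a stand-in for the helper used there):

{code:scala}
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Rebuild a struct column field by field, as a "map over struct fields" helper would.
def mapFields(s: Column, fieldNames: Seq[String])(f: Column => Column): Column =
  struct(fieldNames.map(n => f(s.getField(n)).as(n)): _*)

val fields = Seq("a1", "a2", "a3")
var df = Seq((1, 2, 3)).toDF("a1", "a2", "a3")
  .select(struct(fields.map(col): _*).as("s"))

// Each additional layer of struct rebuilding slows planning dramatically,
// according to the timings above.
(1 to 3).foreach { _ =>
  df = df.select(mapFields($"s", fields)(_ + 1).as("s"))
}

df.groupBy().agg(sum($"s".getField("a3"))).explain()
{code}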



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org