[jira] [Assigned] (SPARK-35276) Write checksum files for shuffle
[ https://issues.apache.org/jira/browse/SPARK-35276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-35276: --- Assignee: wuyi > Write checksum files for shuffle > - > > Key: SPARK-35276 > URL: https://issues.apache.org/jira/browse/SPARK-35276 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35276) Write checksum files for shuffle
[ https://issues.apache.org/jira/browse/SPARK-35276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-35276. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32401 [https://github.com/apache/spark/pull/32401] > Write checksum files for shuffle > - > > Key: SPARK-35276 > URL: https://issues.apache.org/jira/browse/SPARK-35276 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.2.0 > >
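SPARK-35276 writes a checksum file alongside shuffle data so that readers can detect on-disk corruption. As a rough sketch of the idea only (not Spark's actual implementation, which lives in the Scala shuffle writers and has its own checksum algorithms and file layout), a per-partition CRC32 can be recorded at write time and re-verified at read time:

```python
import zlib

def shuffle_checksums(partitions):
    """Compute one CRC32 checksum per shuffle partition (each a bytes block).

    Illustrative only: in the real feature these values would be persisted
    in a separate checksum file next to the shuffle data/index files.
    """
    return [zlib.crc32(block) & 0xFFFFFFFF for block in partitions]

def is_corrupted(block, expected_checksum):
    """A reader recomputes the checksum and compares it with the stored one."""
    return (zlib.crc32(block) & 0xFFFFFFFF) != expected_checksum
```

Any single flipped byte changes the CRC32, so a mismatch flags the block as corrupted before it is deserialized.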
[jira] [Created] (SPARK-36192) Better error messages when comparing against list
Xinrong Meng created SPARK-36192: Summary: Better error messages when comparing against list Key: SPARK-36192 URL: https://issues.apache.org/jira/browse/SPARK-36192 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Environment: We shall throw TypeError messages rather than Spark exceptions. Reporter: Xinrong Meng
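The intent here - surface a plain Python TypeError instead of a JVM-side Spark exception when an unsupported comparison target such as a list is used - can be sketched with a stand-in class (hypothetical names, not the pandas-on-Spark code):

```python
class SeriesStandIn:
    """Toy stand-in for a pandas-on-Spark Series (illustrative only)."""

    def __init__(self, data):
        self.data = data

    def __lt__(self, other):
        if isinstance(other, (list, tuple)):
            # Fail fast with a readable TypeError instead of letting the
            # comparison reach the JVM and surface an opaque Spark exception.
            raise TypeError(
                "< can not be applied to %s." % type(other).__name__
            )
        return [x < other for x in self.data]
```

The user-facing difference is the error type and message, not the behavior of valid comparisons.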
[jira] [Created] (SPARK-36191) Support ORDER BY and LIMIT to be on the correlation path
Allison Wang created SPARK-36191: Summary: Support ORDER BY and LIMIT to be on the correlation path Key: SPARK-36191 URL: https://issues.apache.org/jira/browse/SPARK-36191 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Allison Wang A correlation path is defined as the sub-tree of all the operators that are on the path from the operator hosting the correlated expressions up to the operator producing the correlated values. We want to support ORDER BY (Sort) and LIMIT operators to be on the correlation path to achieve better feature parity with Postgres. Here is an example query in `postgreSQL/join.sql`:
{code:SQL}
select * from text_tbl t1
  left join int8_tbl i8 on i8.q2 = 123,
  lateral (select i8.q1, t2.f1 from text_tbl t2 limit 1) as ss
where t1.f1 = ss.f1;
{code}
[jira] [Updated] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats
[ https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Zhang updated SPARK-36187: --- Description: Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code to set that config. If I understand it correctly, without setting SQLConf.OUTPUT_COMMITTER_CLASS properly (like for Orc format), SQLHadoopMapReduceCommitProtocol will still use the original staging directory, which may void the fix by the PR, in which case the commit collision may still happen, thus the fix is now only effective for Parquet, but not for non-Parquet files. Could someone confirm if it is a potential problem, or not? Thanks! [~duripeng] [~dagrawal3409] was: Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. 
In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code. Does it mean that this new staging directory will not take effect for non-Parquet formats? Could that be a potential problem? or am I missing something here? Thanks! [~duripeng] [~dagrawal3409] > Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet > formats > --- > > Key: SPARK-36187 > URL: https://issues.apache.org/jira/browse/SPARK-36187 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.1.2 >Reporter: Tony Zhang >Priority: Minor > > Hi, my question here is specifically about [PR > #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for > SPARK-29302. > To my understanding, the PR is to introduce a different staging directory at > job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, > the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is > not null: > [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], > and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet > formats: > [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. > However I didn't find similar behavior in Orc related code to set that > config. 
If I understand it correctly, without setting > SQLConf.OUTPUT_COMMITTER_CLASS properly (like for Orc format), > SQLHadoopMapReduceCommitProtocol will still use the original staging > directory, which may void the fix by the PR, in which case the commit > collision may still happen, thus the fix is now only effective for Parquet, > but not for non-Parquet files. > Could someone confirm if it is a potential problem, or not? Thanks! > [~duripeng] [~dagrawal3409]
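The gating behavior the reporter describes can be reduced to a small sketch (hypothetical Python with illustrative names; the real logic is Scala in SQLHadoopMapReduceCommitProtocol): the collision-avoiding staging directory is only chosen when the committer-class config is set, so a format that never sets it keeps using the shared default path.

```python
def staging_dir(table_path, job_id, committer_class_conf):
    """Sketch of the conditional described above; names are illustrative,
    not Spark's actual paths or config keys."""
    if committer_class_conf is not None:
        # Per-job staging directory: concurrent jobs cannot collide.
        return f"{table_path}/.spark-staging-{job_id}"
    # Shared default staging location: concurrent dynamic-partition-overwrite
    # jobs writing here may still collide, which is the concern raised above.
    return f"{table_path}/_temporary"
```

Under this sketch, two concurrent jobs with the config unset resolve to the same directory, which is exactly the collision scenario in question.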
[jira] [Assigned] (SPARK-36189) Improve bool, string, numeric DataTypeOps tests by avoiding joins
[ https://issues.apache.org/jira/browse/SPARK-36189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36189: Assignee: Apache Spark > Improve bool, string, numeric DataTypeOps tests by avoiding joins > - > > Key: SPARK-36189 > URL: https://issues.apache.org/jira/browse/SPARK-36189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Improve bool, string, numeric DataTypeOps tests by avoiding joins.
[jira] [Assigned] (SPARK-36189) Improve bool, string, numeric DataTypeOps tests by avoiding joins
[ https://issues.apache.org/jira/browse/SPARK-36189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36189: Assignee: (was: Apache Spark) > Improve bool, string, numeric DataTypeOps tests by avoiding joins > - > > Key: SPARK-36189 > URL: https://issues.apache.org/jira/browse/SPARK-36189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > Improve bool, string, numeric DataTypeOps tests by avoiding joins.
[jira] [Commented] (SPARK-36188) Add categories setter to CategoricalAccessor and CategoricalIndex.
[ https://issues.apache.org/jira/browse/SPARK-36188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382378#comment-17382378 ] Takuya Ueshin commented on SPARK-36188: --- I'm working on this. > Add categories setter to CategoricalAccessor and CategoricalIndex. > -- > > Key: SPARK-36188 > URL: https://issues.apache.org/jira/browse/SPARK-36188 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major >
[jira] [Commented] (SPARK-36189) Improve bool, string, numeric DataTypeOps tests by avoiding joins
[ https://issues.apache.org/jira/browse/SPARK-36189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382379#comment-17382379 ] Apache Spark commented on SPARK-36189: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/33402 > Improve bool, string, numeric DataTypeOps tests by avoiding joins > - > > Key: SPARK-36189 > URL: https://issues.apache.org/jira/browse/SPARK-36189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > Improve bool, string, numeric DataTypeOps tests by avoiding joins.
[jira] [Created] (SPARK-36190) Improve the rest of DataTypeOps tests by avoiding joins
Xinrong Meng created SPARK-36190: Summary: Improve the rest of DataTypeOps tests by avoiding joins Key: SPARK-36190 URL: https://issues.apache.org/jira/browse/SPARK-36190 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Xinrong Meng bool, string, numeric DataTypeOps tests have been improved by avoiding joins. Improve the rest of DataTypeOps tests in the same way.
[jira] [Created] (SPARK-36189) Improve bool, string, numeric DataTypeOps tests by avoiding joins
Xinrong Meng created SPARK-36189: Summary: Improve bool, string, numeric DataTypeOps tests by avoiding joins Key: SPARK-36189 URL: https://issues.apache.org/jira/browse/SPARK-36189 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Xinrong Meng Improve bool, string, numeric DataTypeOps tests by avoiding joins.
[jira] [Created] (SPARK-36188) Add categories setter to CategoricalAccessor and CategoricalIndex.
Takuya Ueshin created SPARK-36188: - Summary: Add categories setter to CategoricalAccessor and CategoricalIndex. Key: SPARK-36188 URL: https://issues.apache.org/jira/browse/SPARK-36188 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Takuya Ueshin
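For context, the pandas behavior this setter would mirror: assigning to `.categories` renames the categories positionally, so the new list must have the same length, and the stored codes stay untouched. A toy model (not the pandas-on-Spark implementation; names are illustrative):

```python
class ToyCategorical:
    """Toy categorical: values stored as integer codes into a categories list."""

    def __init__(self, codes, categories):
        self._codes = list(codes)            # -1 would mean "missing"
        self._categories = list(categories)

    @property
    def categories(self):
        return list(self._categories)

    @categories.setter
    def categories(self, new_categories):
        # Positional rename: same length required, codes are left as-is.
        if len(new_categories) != len(self._categories):
            raise ValueError("new categories must have the same length "
                             "as the old categories")
        self._categories = list(new_categories)

    def values(self):
        return [self._categories[c] for c in self._codes]
```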
[jira] [Updated] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats
[ https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Zhang updated SPARK-36187: --- Shepherd: Wenchen Fan Description: Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code. Does it mean that this new staging directory will not take effect for non-Parquet formats? Could that be a potential problem? or am I missing something here? Thanks! [~duripeng] [~dagrawal3409] was: Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. 
In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code. Does it mean that this new staging directory will not take effect for non-Parquet formats? Could that be a potential problem? or am I missing something here? Thanks! > Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet > formats > --- > > Key: SPARK-36187 > URL: https://issues.apache.org/jira/browse/SPARK-36187 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.1.2 >Reporter: Tony Zhang >Priority: Minor > > Hi, my question here is specifically about [PR > #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for > SPARK-29302. > To my understanding, the PR is to introduce a different staging directory at > job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, > the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is > not null: > [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], > however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for > parquet formats: > [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. > However I didn't find similar behavior in Orc related code. 
Does it mean that > this new staging directory will not take effect for non-Parquet formats? > Could that be a potential problem? or am I missing something here? > Thanks! > [~duripeng] [~dagrawal3409]
[jira] [Created] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats
Tony Zhang created SPARK-36187: -- Summary: Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats Key: SPARK-36187 URL: https://issues.apache.org/jira/browse/SPARK-36187 Project: Spark Issue Type: Question Components: SQL Affects Versions: 3.1.2 Reporter: Tony Zhang Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code. Does it mean that this new staging directory will not take effect for non-Parquet formats? Could that be a potential problem? or am I missing something here? Thanks!
[jira] [Commented] (SPARK-35785) Cleanup support for RocksDB instance
[ https://issues.apache.org/jira/browse/SPARK-35785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382361#comment-17382361 ] Apache Spark commented on SPARK-35785: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/33401 > Cleanup support for RocksDB instance > > > Key: SPARK-35785 > URL: https://issues.apache.org/jira/browse/SPARK-35785 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.2.0, 3.3.0 > > > Add the functionality of cleaning up files of old versions for the RocksDB > instance and RocksDBFileManager.
[jira] [Assigned] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
[ https://issues.apache.org/jira/browse/SPARK-36186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36186: Assignee: (was: Apache Spark) > Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex. > > > Key: SPARK-36186 > URL: https://issues.apache.org/jira/browse/SPARK-36186 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major >
[jira] [Assigned] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
[ https://issues.apache.org/jira/browse/SPARK-36186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36186: Assignee: Apache Spark > Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex. > > > Key: SPARK-36186 > URL: https://issues.apache.org/jira/browse/SPARK-36186 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
[ https://issues.apache.org/jira/browse/SPARK-36186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382352#comment-17382352 ] Apache Spark commented on SPARK-36186: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/33400 > Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex. > > > Key: SPARK-36186 > URL: https://issues.apache.org/jira/browse/SPARK-36186 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major >
[jira] [Created] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
Takuya Ueshin created SPARK-36186: - Summary: Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex. Key: SPARK-36186 URL: https://issues.apache.org/jira/browse/SPARK-36186 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Takuya Ueshin
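The pandas semantics being matched here: `as_ordered`/`as_unordered` return a categorical with the same categories but the `ordered` flag flipped, which is what makes ordering comparisons between category values legal. A minimal sketch of that contract (a toy dtype, not the real implementation):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ToyCatDtype:
    """Toy categorical dtype carrying only what as_ordered/as_unordered touch."""
    categories: tuple
    ordered: bool = False

    def as_ordered(self):
        # Returns a copy with the flag set; the original is left unchanged.
        return replace(self, ordered=True)

    def as_unordered(self):
        return replace(self, ordered=False)
```

Returning a copy rather than mutating in place matches the default (non-inplace) pandas behavior.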
[jira] [Created] (SPARK-36185) Implement functions in CategoricalAccessor/CategoricalIndex
Takuya Ueshin created SPARK-36185: - Summary: Implement functions in CategoricalAccessor/CategoricalIndex Key: SPARK-36185 URL: https://issues.apache.org/jira/browse/SPARK-36185 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.2.0 Reporter: Takuya Ueshin There are functions we haven't implemented in {{CategoricalAccessor}} and {{CategoricalIndex}}.
[jira] [Commented] (SPARK-36099) Group exception messages in core/util
[ https://issues.apache.org/jira/browse/SPARK-36099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382330#comment-17382330 ] Allison Wang commented on SPARK-36099: -- [~Shockang] Yes of course!
> Group exception messages in core/util
> -
>
> Key: SPARK-36099
> URL: https://issues.apache.org/jira/browse/SPARK-36099
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Allison Wang
> Priority: Major
>
> 'core/src/main/scala/org/apache/spark/util'
> || Filename || Count ||
> | AccumulatorV2.scala | 4 |
> | ClosureCleaner.scala | 1 |
> | DependencyUtils.scala | 1 |
> | KeyLock.scala | 1 |
> | ListenerBus.scala | 1 |
> | NextIterator.scala | 1 |
> | SerializableBuffer.scala | 2 |
> | ThreadUtils.scala | 4 |
> | Utils.scala | 16 |
> 'core/src/main/scala/org/apache/spark/util/collection'
> || Filename || Count ||
> | AppendOnlyMap.scala | 1 |
> | CompactBuffer.scala | 1 |
> | ImmutableBitSet.scala | 6 |
> | MedianHeap.scala | 1 |
> | OpenHashSet.scala | 2 |
> 'core/src/main/scala/org/apache/spark/util/io'
> || Filename || Count ||
> | ChunkedByteBuffer.scala | 1 |
> 'core/src/main/scala/org/apache/spark/util/logging'
> || Filename || Count ||
> | DriverLogger.scala | 1 |
> 'core/src/main/scala/org/apache/spark/util/random'
> || Filename || Count ||
> | RandomSampler.scala | 1 |
[jira] [Resolved] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning
[ https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-36128. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33348 [https://github.com/apache/spark/pull/33348] > CatalogFileIndex.filterPartitions should respect > spark.sql.hive.metastorePartitionPruning > - > > Key: SPARK-36128 > URL: https://issues.apache.org/jira/browse/SPARK-36128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only > used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. > The latter calls {{CatalogFileIndex.filterPartitions}} which calls > {{listPartitionsByFilter}} regardless of whether the above config is set or > not.
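The fix makes {{CatalogFileIndex.filterPartitions}} honor the config the same way {{PruneHiveTablePartitions}} already does. The intended gating can be sketched as follows (hypothetical Python with illustrative names; the real code is Scala and talks to the Hive metastore):

```python
def filter_partitions(all_partitions, predicate, metastore_pruning_enabled):
    """Sketch: push the filter down to the metastore listing only when
    spark.sql.hive.metastorePartitionPruning is enabled."""
    if metastore_pruning_enabled:
        # Analogous to listPartitionsByFilter: only matching partitions
        # are fetched from the metastore.
        return [p for p in all_partitions if predicate(p)]
    # Config off: list every partition; any pruning happens later in Spark.
    return list(all_partitions)
```

The bug was that the filtered path was taken unconditionally; the sketch makes the config the switch between the two listings.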
[jira] [Assigned] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning
[ https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-36128: --- Assignee: Chao Sun > CatalogFileIndex.filterPartitions should respect > spark.sql.hive.metastorePartitionPruning > - > > Key: SPARK-36128 > URL: https://issues.apache.org/jira/browse/SPARK-36128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only > used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. > The latter calls {{CatalogFileIndex.filterPartitions}} which calls > {{listPartitionsByFilter}} regardless of whether the above config is set or > not.
[jira] [Commented] (SPARK-36167) Revisit more InternalField managements.
[ https://issues.apache.org/jira/browse/SPARK-36167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382241#comment-17382241 ] Apache Spark commented on SPARK-36167: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/33398 > Revisit more InternalField managements. > --- > > Key: SPARK-36167 > URL: https://issues.apache.org/jira/browse/SPARK-36167 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > > There are other places we can manage {{InternalField}}.
[jira] [Commented] (SPARK-36183) Push down limit 1 through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-36183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382174#comment-17382174 ] Apache Spark commented on SPARK-36183: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/33397 > Push down limit 1 through Aggregate > --- > > Key: SPARK-36183 > URL: https://issues.apache.org/jira/browse/SPARK-36183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Assigned] (SPARK-36183) Push down limit 1 through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-36183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36183: Assignee: Apache Spark > Push down limit 1 through Aggregate > --- > > Key: SPARK-36183 > URL: https://issues.apache.org/jira/browse/SPARK-36183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-36183) Push down limit 1 through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-36183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36183: Assignee: (was: Apache Spark) > Push down limit 1 through Aggregate > --- > > Key: SPARK-36183 > URL: https://issues.apache.org/jira/browse/SPARK-36183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Commented] (SPARK-36183) Push down limit 1 through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-36183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382173#comment-17382173 ] Apache Spark commented on SPARK-36183: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/33397 > Push down limit 1 through Aggregate > --- > > Key: SPARK-36183 > URL: https://issues.apache.org/jira/browse/SPARK-36183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Assigned] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles
[ https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36184: Assignee: (was: Apache Spark) > Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that > adds extra shuffles > - > > Key: SPARK-36184 > URL: https://issues.apache.org/jira/browse/SPARK-36184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles
[ https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36184: Assignee: Apache Spark > Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that > adds extra shuffles > - > > Key: SPARK-36184 > URL: https://issues.apache.org/jira/browse/SPARK-36184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that add extra shuffles
[ https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382171#comment-17382171 ] Apache Spark commented on SPARK-36184: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/33396 > Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that > add extra shuffles > - > > Key: SPARK-36184 > URL: https://issues.apache.org/jira/browse/SPARK-36184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that add extra shuffles
Wenchen Fan created SPARK-36184: --- Summary: Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that add extra shuffles Key: SPARK-36184 URL: https://issues.apache.org/jira/browse/SPARK-36184 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
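The distinction this ticket relies on can be sketched in a few lines. This is a hedged, hypothetical simplification in plain Java, not Spark's actual `ValidateRequirements`/`EnsureRequirements` API; the `Plan` class and the `validate`/`ensure` method names below are illustrative only:

```java
// Hypothetical sketch (not Spark's real API): EnsureRequirements may REWRITE
// the plan by inserting exchange (shuffle) nodes, while ValidateRequirements
// only CHECKS whether the plan already satisfies its distribution
// requirements -- which is why it is the safer gate for AQE re-optimization.
class Plan {
    final boolean satisfiesDistribution;
    final int shuffles;

    Plan(boolean satisfiesDistribution, int shuffles) {
        this.satisfiesDistribution = satisfiesDistribution;
        this.shuffles = shuffles;
    }
}

class Requirements {
    // Read-only: never mutates or copies the plan.
    static boolean validate(Plan p) {
        return p.satisfiesDistribution;
    }

    // May return a plan with an extra shuffle to fix the distribution.
    static Plan ensure(Plan p) {
        return p.satisfiesDistribution ? p : new Plan(true, p.shuffles + 1);
    }
}
```

Under this reading, an AQE rule that would need `ensure` to repair its output is the kind of rule the ticket wants to skip, since the repair itself is an extra shuffle.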
[jira] [Created] (SPARK-36183) Push down limit 1 through Aggregate
Yuming Wang created SPARK-36183: --- Summary: Push down limit 1 through Aggregate Key: SPARK-36183 URL: https://issues.apache.org/jira/browse/SPARK-36183 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
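The intuition behind pushing `LIMIT 1` through an aggregate whose output is just its grouping keys (i.e., a distinct) can be illustrated with plain `java.util.stream` rather than Catalyst. The `LimitPushdown` class below is a hypothetical sketch, not Spark code: any single input row produces exactly one group, so the limit can be evaluated first.

```java
import java.util.List;

// Hedged sketch with java.util.stream, not Spark's optimizer: for a pure
// group-by (a distinct), one input row always yields exactly one group, so
// "distinct then limit 1" and "limit 1 then distinct" both return one row.
class LimitPushdown {
    static long distinctThenLimit(List<Integer> rows) {
        return rows.stream().distinct().limit(1).count();
    }

    static long limitThenDistinct(List<Integer> rows) {
        // The pushed-down form only ever touches one input row.
        return rows.stream().limit(1).distinct().count();
    }
}
```

Both forms produce the same row count (though possibly a different row), and the pushed-down form avoids scanning and aggregating the whole input.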
[jira] [Updated] (SPARK-36152) Add Scala 2.13 daily build and test GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-36152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36152: -- Description: https://github.com/apache/spark/actions/workflows/build_and_test_scala213_daily.yml > Add Scala 2.13 daily build and test GitHub Action job > - > > Key: SPARK-36152 > URL: https://issues.apache.org/jira/browse/SPARK-36152 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0, 3.3.0 > > > https://github.com/apache/spark/actions/workflows/build_and_test_scala213_daily.yml -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382132#comment-17382132 ] Dongjoon Hyun edited comment on SPARK-32210 at 7/16/21, 3:13 PM: - Please feel free to work on this, [~kazuyukitanimura] and ping me after you make a PR. was (Author: dongjoon): Please feel free to work on this, [~kazuyukitanimura]. > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 3.1.2 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > 
org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382132#comment-17382132 ] Dongjoon Hyun commented on SPARK-32210: --- Please feel free to work on this, [~kazuyukitanimura]. > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 3.1.2 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > 
at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382131#comment-17382131 ] Dongjoon Hyun commented on SPARK-32210: --- There is a new observation of this situation in 3.1 from [~kazuyukitanimura]. > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 3.1.2 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32210: -- Affects Version/s: 3.1.2 > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 3.1.2 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > 
org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
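The `NegativeArraySizeException` in the traces above originates in `ByteArrayOutputStream.toByteArray`, which suggests the serialized map statuses grew past `Integer.MAX_VALUE` bytes, so the int byte count wrapped negative and the result-array allocation failed. A minimal, hedged reproduction of the underlying JVM behavior (of the int overflow only, not of Spark or commons-io):

```java
// When an int size counter overflows, it wraps negative, and
// `new byte[negativeSize]` throws NegativeArraySizeException.
class OverflowDemo {
    static String allocate(int size) {
        try {
            byte[] buf = new byte[size];
            return "ok:" + buf.length;
        } catch (NegativeArraySizeException e) {
            return "NegativeArraySizeException";
        }
    }
}
```

`Integer.MAX_VALUE + 1` silently wraps to `Integer.MIN_VALUE` in Java, which is exactly the failure mode a multi-gigabyte serialized `MapStatus` buffer would hit.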
[jira] [Commented] (SPARK-36134) jackson-databind RCE vulnerability
[ https://issues.apache.org/jira/browse/SPARK-36134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382128#comment-17382128 ] Erik Krogen commented on SPARK-36134: - Whoops, must have missed the 3.1.2 release :) Thanks for correcting me. Still, 3.1.2 is using Jackson 2.10.0, so I don't see where the CVE report is coming from. Can you elaborate? > jackson-databind RCE vulnerability > -- > > Key: SPARK-36134 > URL: https://issues.apache.org/jira/browse/SPARK-36134 > Project: Spark > Issue Type: Task > Components: Java API >Affects Versions: 3.1.2, 3.1.3 >Reporter: Sumit >Priority: Major > Attachments: Screenshot 2021-07-15 at 1.00.55 PM.png > > > Need to upgrade jackson-databind version to *2.9.3.1* > At the beginning of 2018, jackson-databind was reported to contain another > remote code execution (RCE) vulnerability (CVE-2017-17485) that affects > versions 2.9.3 and earlier, 2.7.9.1 and earlier, and 2.8.10 and earlier. This > vulnerability is caused by jackson-databind’s incomplete blacklist. An > application that uses jackson-databind will become vulnerable when the > enableDefaultTyping method is called via the ObjectMapper object within the > application. An attacker can thus compromise the application by sending > maliciously crafted JSON input to gain direct control over a server. > Currently, a proof of concept (POC) exploit for this vulnerability is > publicly available. All users who are affected by this vulnerability should > upgrade to the latest versions as soon as possible to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
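The mechanism described in that report can be sketched without the jackson dependency. With default typing enabled, the JSON input itself names the concrete class jackson instantiates; the `DefaultTypingAnalogue` class below is a hypothetical, reflection-based analogue of that behavior, not jackson's actual code path:

```java
// Hedged analogue of jackson default typing, with no jackson dependency:
// the *input* chooses the concrete class to instantiate. With an incomplete
// blacklist, a "gadget" class name here can run attacker-controlled code
// during construction -- the essence of CVE-2017-17485.
class DefaultTypingAnalogue {
    static Object instantiate(String classNameFromInput) {
        try {
            return Class.forName(classNameFromInput)
                        .getDeclaredConstructor()
                        .newInstance();
        } catch (ReflectiveOperationException e) {
            return null; // unknown or non-instantiable class name
        }
    }
}
```

The fix in later jackson versions was to move away from blacklisting toward explicit allow-lists of polymorphic base types, so untrusted input can no longer pick arbitrary classes.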
[jira] [Commented] (SPARK-36182) Support TimestampNTZ type in Parquet file source
[ https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382126#comment-17382126 ] Apache Spark commented on SPARK-36182: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/33395 > Support TimestampNTZ type in Parquet file source > > > Key: SPARK-36182 > URL: https://issues.apache.org/jira/browse/SPARK-36182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > As per > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, > Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current > default timestamp type): > * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ > * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ > In Spark 3.1 or prior, the Parquet writer follows the definition and sets > the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t > respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to > TIMESTAMP_LTZ. > Since 3.2, with the support of timestamp without time zone type: > * Parquet writer follows the definition and sets the field `isAdjustedToUTC` > as `false` on writing TIMESTAMP_NTZ. > * Parquet reader > ** For schema inference, Spark converts the Parquet timestamp type to the > corresponding catalyst timestamp type according to the timestamp annotation > flag `isAdjustedToUTC`. > ** If merge schema is enabled in schema inference and some of the files are > inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type > is TIMESTAMP_LTZ, which is considered the “wider” type > ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was > written as TIMESTAMP_NTZ type, Spark allows the read operation. 
> ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was > written as TIMESTAMP_LTZ type, the read operation is not allowed since > TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36182) Support TimestampNTZ type in Parquet file source
[ https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36182: Assignee: Apache Spark (was: Gengliang Wang) > Support TimestampNTZ type in Parquet file source > > > Key: SPARK-36182 > URL: https://issues.apache.org/jira/browse/SPARK-36182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > As per > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, > Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current > default timestamp type): > * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ > * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ > In Spark 3.1 or prior, the Parquet writer follows the definition and sets > the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t > respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to > TIMESTAMP_LTZ. > Since 3.2, with the support of timestamp without time zone type: > * Parquet writer follows the definition and sets the field `isAdjustedToUTC` > as `false` on writing TIMESTAMP_NTZ. > * Parquet reader > ** For schema inference, Spark converts the Parquet timestamp type to the > corresponding catalyst timestamp type according to the timestamp annotation > flag `isAdjustedToUTC`. > ** If merge schema is enabled in schema inference and some of the files are > inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type > is TIMESTAMP_LTZ, which is considered the “wider” type > ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was > written as TIMESTAMP_NTZ type, Spark allows the read operation. 
> ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was > written as TIMESTAMP_LTZ type, the read operation is not allowed since > TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36182) Support TimestampNTZ type in Parquet file source
[ https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36182: Assignee: Gengliang Wang (was: Apache Spark) > Support TimestampNTZ type in Parquet file source > > > Key: SPARK-36182 > URL: https://issues.apache.org/jira/browse/SPARK-36182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > As per > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, > Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current > default timestamp type): > * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ > * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ > In Spark 3.1 or prior, the Parquet writer follows the definition and sets > the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t > respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to > TIMESTAMP_LTZ. > Since 3.2, with the support of timestamp without time zone type: > * Parquet writer follows the definition and sets the field `isAdjustedToUTC` > as `false` on writing TIMESTAMP_NTZ. > * Parquet reader > ** For schema inference, Spark converts the Parquet timestamp type to the > corresponding catalyst timestamp type according to the timestamp annotation > flag `isAdjustedToUTC`. > ** If merge schema is enabled in schema inference and some of the files are > inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type > is TIMESTAMP_LTZ, which is considered the “wider” type > ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was > written as TIMESTAMP_NTZ type, Spark allows the read operation. 
> ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was > written as TIMESTAMP_LTZ type, the read operation is not allowed since > TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36152) Add Scala 2.13 daily build and test GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-36152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36152. --- Fix Version/s: 3.2.0 3.3.0 Resolution: Fixed Issue resolved by pull request 33358 [https://github.com/apache/spark/pull/33358] > Add Scala 2.13 daily build and test GitHub Action job > - > > Key: SPARK-36152 > URL: https://issues.apache.org/jira/browse/SPARK-36152 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36182) Support TimestampNTZ type in Parquet file source
Gengliang Wang created SPARK-36182: -- Summary: Support TimestampNTZ type in Parquet file source Key: SPARK-36182 URL: https://issues.apache.org/jira/browse/SPARK-36182 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang As per https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current default timestamp type): * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ In Spark 3.1 or prior, the Parquet writer follows the definition and sets the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to TIMESTAMP_LTZ. Since 3.2, with the support of timestamp without time zone type: * Parquet writer follows the definition and sets the field `isAdjustedToUTC` as `false` on writing TIMESTAMP_NTZ. * Parquet reader ** For schema inference, Spark converts the Parquet timestamp type to the corresponding catalyst timestamp type according to the timestamp annotation flag `isAdjustedToUTC`. ** If merge schema is enabled in schema inference and some of the files are inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type is TIMESTAMP_LTZ, which is considered the “wider” type ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was written as TIMESTAMP_NTZ type, Spark allows the read operation. ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was written as TIMESTAMP_LTZ type, the read operation is not allowed since TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
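The widening rule in this description maps cleanly onto `java.time`. As a hedged analogy (not Spark's internal implementation): TIMESTAMP_NTZ behaves like `LocalDateTime` (no zone attached) and TIMESTAMP_LTZ like `Instant` (a point on the UTC timeline); widening NTZ to LTZ is well-defined once a zone is supplied, while the reverse direction drops information, which is why reading LTZ data under an NTZ schema is rejected. The `TimestampWidening` class name is illustrative only:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hedged java.time analogy: widening a zoneless timestamp (NTZ) to a
// UTC-adjusted one (LTZ) requires choosing a zone; different zones yield
// different instants, so the mapping cannot be reversed without losing
// information.
class TimestampWidening {
    static Instant widen(LocalDateTime ntz, ZoneId zone) {
        return ntz.atZone(zone).toInstant();
    }
}
```

Note that the same `LocalDateTime` widens to different `Instant` values under different zones, which mirrors why TIMESTAMP_LTZ is the "wider" merge result.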
[jira] [Commented] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
[ https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382123#comment-17382123 ] Apache Spark commented on SPARK-36181: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33394 > Update pyspark sql readwriter documentation to Scala level > -- > > Key: SPARK-36181 > URL: https://issues.apache.org/jira/browse/SPARK-36181 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Trivial > > Update pyspark sql readwriter documentation to the level of detail the Scala > documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
[ https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36181: Assignee: (was: Apache Spark) > Update pyspark sql readwriter documentation to Scala level > -- > > Key: SPARK-36181 > URL: https://issues.apache.org/jira/browse/SPARK-36181 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Trivial > > Update pyspark sql readwriter documentation to the level of detail the Scala > documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
[ https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36181: Assignee: Apache Spark > Update pyspark sql readwriter documentation to Scala level > -- > > Key: SPARK-36181 > URL: https://issues.apache.org/jira/browse/SPARK-36181 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Apache Spark >Priority: Trivial > > Update pyspark sql readwriter documentation to the level of detail the Scala > documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
[ https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382121#comment-17382121 ] Apache Spark commented on SPARK-36181: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33394 > Update pyspark sql readwriter documentation to Scala level > -- > > Key: SPARK-36181 > URL: https://issues.apache.org/jira/browse/SPARK-36181 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Trivial > > Update pyspark sql readwriter documentation to the level of detail the Scala > documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
Dominik Gehl created SPARK-36181: Summary: Update pyspark sql readwriter documentation to Scala level Key: SPARK-36181 URL: https://issues.apache.org/jira/browse/SPARK-36181 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.1.2 Reporter: Dominik Gehl Update pyspark sql readwriter documentation to the level of detail the Scala documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36177. -- Fix Version/s: 3.0.4 Resolution: Fixed Issue resolved by pull request 33390 [https://github.com/apache/spark/pull/33390] > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.4 > > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow a version lower than the latest one. We > can't upload it, so we should just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36177: - Fix Version/s: 3.1.3 > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.3, 3.0.4 > > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow lower version then the latest version. We > can't upload so should better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36177: Assignee: Hyukjin Kwon > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow lower version then the latest version. We > can't upload so should better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36158) pyspark sql/functions documentation for months_between isn't as precise as scala version
[ https://issues.apache.org/jira/browse/SPARK-36158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36158: - Fix Version/s: 3.1.3 3.2.0 > pyspark sql/functions documentation for months_between isn't as precise as > scala version > > > Key: SPARK-36158 > URL: https://issues.apache.org/jira/browse/SPARK-36158 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > Fix For: 3.2.0, 3.1.3, 3.3.0 > > > pyspark months_between documentation doesn't mention that months are assumed > with 31 days in the calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
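The 31-day assumption the ticket wants documented can be made concrete. Below is a plain-Python sketch of the fractional rule described in the Scala docs: whole calendar months plus a remainder computed as if every month had 31 days, with time of day folded in. Spark's special case (returning a whole number when both dates fall on the same day of month, or both on the last day of their months) is omitted here for brevity.

```python
from datetime import datetime

def months_between_sketch(d1: datetime, d2: datetime) -> float:
    # Whole calendar months between the two timestamps...
    whole = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    # ...plus a fraction that treats every month as 31 days long.
    secs1 = d1.hour * 3600 + d1.minute * 60 + d1.second
    secs2 = d2.hour * 3600 + d2.minute * 60 + d2.second
    return whole + (d1.day - d2.day + (secs1 - secs2) / 86400) / 31.0

# The example used in Spark's own documentation:
r = months_between_sketch(datetime(1997, 2, 28, 10, 30),
                          datetime(1996, 10, 30))
print(round(r, 8))  # 3.94959677
```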
[jira] [Assigned] (SPARK-36154) pyspark documentation doesn't mention week and quarter as valid format arguments to trunc
[ https://issues.apache.org/jira/browse/SPARK-36154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36154: Assignee: Dominik Gehl > pyspark documentation doesn't mention week and quarter as valid format > arguments to trunc > - > > Key: SPARK-36154 > URL: https://issues.apache.org/jira/browse/SPARK-36154 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > Fix For: 3.2.0, 3.1.3, 3.3.0 > > > pyspark documentation for {{trunc}} in sql/functions doesn't mention that > {{week}} and {{quarter}} are valid format specifiers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
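For reference, the semantics the missing doc entries would describe: {{trunc(col, 'week')}} truncates a date to the Monday of its week, and {{trunc(col, 'quarter')}} to the first day of its quarter. A plain-Python mirror of those two cases (illustrative only; the real API is {{pyspark.sql.functions.trunc}}):

```python
from datetime import date, timedelta

def trunc_date(d: date, fmt: str) -> date:
    """Plain-Python mirror of the two trunc() formats the ticket adds."""
    if fmt == "week":        # Monday of the date's week
        return d - timedelta(days=d.weekday())
    if fmt == "quarter":     # first day of the date's quarter
        return date(d.year, 3 * ((d.month - 1) // 3) + 1, 1)
    raise ValueError(f"unsupported format: {fmt}")

print(trunc_date(date(2021, 7, 18), "week"))     # 2021-07-12 (a Monday)
print(trunc_date(date(2021, 7, 18), "quarter"))  # 2021-07-01
```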
[jira] [Resolved] (SPARK-36160) pyspark sql/column documentation doesn't always match scala documentation
[ https://issues.apache.org/jira/browse/SPARK-36160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36160. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33369 [https://github.com/apache/spark/pull/33369] > pyspark sql/column documentation doesn't always match scala documentation > - > > Key: SPARK-36160 > URL: https://issues.apache.org/jira/browse/SPARK-36160 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > Fix For: 3.3.0 > > > The pyspark sql/column documentation for methods between, getField, > dropFields and cast could be adapted to follow more closely the corresponding > Scala one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36158) pyspark sql/functions documentation for months_between isn't as precise as scala version
[ https://issues.apache.org/jira/browse/SPARK-36158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36158: Assignee: Dominik Gehl > pyspark sql/functions documentation for months_between isn't as precise as > scala version > > > Key: SPARK-36158 > URL: https://issues.apache.org/jira/browse/SPARK-36158 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > Fix For: 3.3.0 > > > pyspark months_between documentation doesn't mention that months are assumed > with 31 days in the calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36160) pyspark sql/column documentation doesn't always match scala documentation
[ https://issues.apache.org/jira/browse/SPARK-36160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36160: Assignee: Dominik Gehl > pyspark sql/column documentation doesn't always match scala documentation > - > > Key: SPARK-36160 > URL: https://issues.apache.org/jira/browse/SPARK-36160 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > > The pyspark sql/column documentation for methods between, getField, > dropFields and cast could be adapted to follow more closely the corresponding > Scala one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36034: - Fix Version/s: 3.0.4 > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Assignee: Max Gekk >Priority: Blocker > Labels: correctness > Fix For: 3.2.0, 3.1.3, 3.0.4, 3.3.0 > > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") > >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = > >>> '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") > >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> 
'0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#154] Batched: true, DataFilters: > [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
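The -719162 in the pushed-down filter above is the proleptic Gregorian day count for 0001-01-01, which plain Python can confirm (Python's {{datetime}} uses the same calendar as Spark 3's internal representation). Files written in LEGACY mode store the hybrid Julian/Gregorian day count for the same date instead, which differs by a few days this far before 1582; that mismatch is consistent with the pushed-down literal matching no stored value:

```python
from datetime import date

# Days since the Unix epoch for DATE '0001-01-01' in the proleptic
# Gregorian calendar, as Spark 3 resolves the literal:
epoch = date(1970, 1, 1).toordinal()
gregorian_days = date(1, 1, 1).toordinal() - epoch
print(gregorian_days)  # -719162, the literal visible in the plan above

# A file written with datetimeRebaseModeInWrite=LEGACY stores the hybrid
# Julian/Gregorian day count for that date instead, so an unrebased
# pushed-down filter on -719162 finds nothing.
```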
[jira] [Commented] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382036#comment-17382036 ] Apache Spark commented on SPARK-36179: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/33393 > Support TimestampNTZType in SparkGetColumnsOperation > > > Key: SPARK-36179 > URL: https://issues.apache.org/jira/browse/SPARK-36179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Priority: Major > > TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382035#comment-17382035 ] Apache Spark commented on SPARK-36179: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/33393 > Support TimestampNTZType in SparkGetColumnsOperation > > > Key: SPARK-36179 > URL: https://issues.apache.org/jira/browse/SPARK-36179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Priority: Major > > TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36179: Assignee: Apache Spark > Support TimestampNTZType in SparkGetColumnsOperation > > > Key: SPARK-36179 > URL: https://issues.apache.org/jira/browse/SPARK-36179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36179: Assignee: (was: Apache Spark) > Support TimestampNTZType in SparkGetColumnsOperation > > > Key: SPARK-36179 > URL: https://issues.apache.org/jira/browse/SPARK-36179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Priority: Major > > TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
[ https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36178: Assignee: Apache Spark > Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst > -- > > Key: SPARK-36178 > URL: https://issues.apache.org/jira/browse/SPARK-36178 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Apache Spark >Priority: Minor > > PySpark Catalog API currently isn't documented in > docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
[ https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36178: Assignee: (was: Apache Spark) > Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst > -- > > Key: SPARK-36178 > URL: https://issues.apache.org/jira/browse/SPARK-36178 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > PySpark Catalog API currently isn't documented in > docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
[ https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382017#comment-17382017 ] Apache Spark commented on SPARK-36178: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33392 > Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst > -- > > Key: SPARK-36178 > URL: https://issues.apache.org/jira/browse/SPARK-36178 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > PySpark Catalog API currently isn't documented in > docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36180) HMS can not recognize timestamp_ntz
Kent Yao created SPARK-36180: Summary: HMS can not recognize timestamp_ntz Key: SPARK-36180 URL: https://issues.apache.org/jira/browse/SPARK-36180 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Kent Yao {code:java} [info] Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'timestamp_ntz:timestamp' but 'timestamp_ntz' is found.[info] Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'timestamp_ntz:timestamp' but 'timestamp_ntz' is found.[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:372)[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:416)[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)[info] at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:162)[info] at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:91)[info] at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:116)[info] at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:54)[info] at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)[info] at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:453)[info] at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:440)[info] at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)[info] at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:199)[info] at 
org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:842)[info] ... 63 more[info] at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)[info] at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)[info] at org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$145(SparkMetadataOperationSuite.scala:666)[info] at org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$145$adapted(SparkMetadataOperationSuite.scala:665)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$4(HiveThriftServer2Suites.scala:1422)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$4$adapted(HiveThriftServer2Suites.scala:1422)[info] at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)[info] at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$1(HiveThriftServer2Suites.scala:1422)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.tryCaptureSysLog(HiveThriftServer2Suites.scala:1407)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.withMultipleConnectionJdbcStatement(HiveThriftServer2Suites.scala:1416)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.withJdbcStatement(HiveThriftServer2Suites.scala:1454)[info] at org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$144(SparkMetadataOperationSuite.scala:665)[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)[info] at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)[info] at org.scalatest.Transformer.apply(Transformer.scala:22)[info] at org.scalatest.Transformer.apply(Transformer.scala:20)[info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190[info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLi
[jira] [Created] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
Kent Yao created SPARK-36179: Summary: Support TimestampNTZType in SparkGetColumnsOperation Key: SPARK-36179 URL: https://issues.apache.org/jira/browse/SPARK-36179 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Kent Yao TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
[ https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominik Gehl updated SPARK-36178: - Summary: Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst (was: document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst) > Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst > -- > > Key: SPARK-36178 > URL: https://issues.apache.org/jira/browse/SPARK-36178 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > PySpark Catalog API currently isn't documented in > docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36178) document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
Dominik Gehl created SPARK-36178: Summary: document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst Key: SPARK-36178 URL: https://issues.apache.org/jira/browse/SPARK-36178 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.1.2 Reporter: Dominik Gehl PySpark Catalog API currently isn't documented in docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34893) Support native session window
[ https://issues.apache.org/jira/browse/SPARK-34893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-34893: Assignee: Jungtaek Lim > Support native session window > - > > Key: SPARK-34893 > URL: https://issues.apache.org/jira/browse/SPARK-34893 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > This issue tracks effort on supporting native session window, on both batch > query and streaming query. > This issue is the finalization of SPARK-10816 leveraging SPARK-34888, > SPARK-34889, SPARK-35861, SPARK-34891, SPARK-34892. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34893) Support native session window
[ https://issues.apache.org/jira/browse/SPARK-34893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-34893. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33081 [https://github.com/apache/spark/pull/33081] > Support native session window > - > > Key: SPARK-34893 > URL: https://issues.apache.org/jira/browse/SPARK-34893 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.2.0 > > > This issue tracks effort on supporting native session window, on both batch > query and streaming query. > This issue is the finalization of SPARK-10816 leveraging SPARK-34888, > SPARK-34889, SPARK-35861, SPARK-34891, SPARK-34892. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36122) Spark does not pass on needClientAuth to Jetty SSLContextFactory. Does not allow configuring mTLS authentication.
[ https://issues.apache.org/jira/browse/SPARK-36122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36122: Assignee: Apache Spark > Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not > allow to configure mTLS authentication. > - > > Key: SPARK-36122 > URL: https://issues.apache.org/jira/browse/SPARK-36122 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.7 >Reporter: Seetharama Khandrika >Assignee: Apache Spark >Priority: Major > > Spark does not pass on the needClientAuth flag to Jetty engine. This prevents > the UI from honouring mutual TLS authentication using x509 certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36122) Spark does not pass on needClientAuth to Jetty SSLContextFactory. Does not allow configuring mTLS authentication.
[ https://issues.apache.org/jira/browse/SPARK-36122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36122: Assignee: (was: Apache Spark) > Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not > allow to configure mTLS authentication. > - > > Key: SPARK-36122 > URL: https://issues.apache.org/jira/browse/SPARK-36122 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.7 >Reporter: Seetharama Khandrika >Priority: Major > > Spark does not pass on the needClientAuth flag to Jetty engine. This prevents > the UI from honouring mutual TLS authentication using x509 certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36122) Spark does not pass on needClientAuth to Jetty SSLContextFactory. Does not allow configuring mTLS authentication.
[ https://issues.apache.org/jira/browse/SPARK-36122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36122: Assignee: Apache Spark > Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not > allow to configure mTLS authentication. > - > > Key: SPARK-36122 > URL: https://issues.apache.org/jira/browse/SPARK-36122 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.7 >Reporter: Seetharama Khandrika >Assignee: Apache Spark >Priority: Major > > Spark does not pass on the needClientAuth flag to Jetty engine. This prevents > the UI from honouring mutual TLS authentication using x509 certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36122) Spark does not pass on needClientAuth to Jetty SSLContextFactory. Does not allow configuring mTLS authentication.
[ https://issues.apache.org/jira/browse/SPARK-36122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381996#comment-17381996 ] Apache Spark commented on SPARK-36122: -- User 'skhandrikagmail' has created a pull request for this issue: https://github.com/apache/spark/pull/33301 > Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not > allow to configure mTLS authentication. > - > > Key: SPARK-36122 > URL: https://issues.apache.org/jira/browse/SPARK-36122 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.7 >Reporter: Seetharama Khandrika >Priority: Major > > Spark does not pass on the needClientAuth flag to Jetty engine. This prevents > the UI from honouring mutual TLS authentication using x509 certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36177: - Description: CRAN check in branch-3.0 fails as below: {code} Insufficient package version (submitted: 3.0.4, existing: 3.1.2) {code} This is because CRAN doesn't allow lower version then the latest version. We can't upload so should better just skip the CRAN check. was: CRAN check in branch-3.0 fails as below: {code} Insufficient package version (submitted: 3.0.4, existing: 3.1.2) {code} This is because CRAN doesn't allow lower version then the latest version. We can't upload so should better just skip the CRAN check. > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow lower version then the latest version. We > can't upload so should better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381979#comment-17381979 ] Apache Spark commented on SPARK-36177: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33391 > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow lower version then the latest version. We > can't upload so should better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36177: Assignee: (was: Apache Spark) > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow a lower version than the latest version. > We can't upload, so we'd better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36177: Assignee: Apache Spark > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow a lower version than the latest version. > We can't upload, so we'd better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381977#comment-17381977 ] Apache Spark commented on SPARK-36177: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33390 > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow a lower version than the latest version. > We can't upload, so we'd better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
Hyukjin Kwon created SPARK-36177: Summary: Disable CRAN in branches lower than the latest version uploaded Key: SPARK-36177 URL: https://issues.apache.org/jira/browse/SPARK-36177 Project: Spark Issue Type: Test Components: SparkR Affects Versions: 3.1.2, 3.0.3 Reporter: Hyukjin Kwon CRAN check in branch-3.0 fails as below: {code} Insufficient package version (submitted: 3.0.4, existing: 3.1.2) {code} This is because CRAN doesn't allow a lower version than the latest version. We can't upload, so we'd better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
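The skip proposed in SPARK-36177 amounts to a version comparison before running the CRAN check. A minimal sketch of that guard (version strings are illustrative, and this is not Spark's actual release tooling):

```python
# Hypothetical version guard (illustrative values, not Spark's actual
# release scripts). CRAN rejects a submission whose version is lower
# than the version it already hosts, so the CRAN check should be
# skipped in that case.

def parse_version(v):
    # "3.0.4" -> (3, 0, 4); tuples compare component-wise.
    return tuple(int(part) for part in v.split("."))

def should_skip_cran_check(branch_version, cran_version):
    return parse_version(branch_version) < parse_version(cran_version)

print(should_skip_cran_check("3.0.4", "3.1.2"))  # True: branch-3.0 can't upload
print(should_skip_cran_check("3.2.0", "3.1.2"))  # False: a newer version is fine
```

With this guard, branch-3.0's CI would print a skip message instead of failing on the "Insufficient package version" error quoted above.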
[jira] [Updated] (SPARK-36176) Expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36176: - Summary: Expose tableExists in pyspark.sql.catalog (was: expose tableExists in pyspark.sql.catalog) > Expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381919#comment-17381919 ] canwang edited comment on SPARK-32530 at 7/16/21, 9:17 AM: --- I've been helping with JetBrains' [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api] too, and I also hope to see first-class support for the Kotlin language in the Apache Spark project. 1. I think a Kotlin API may be a better choice on the JVM for Spark developers. - As the description says, there are a lot of Kotlin developers now, they are growing fast, and more and more projects treat Kotlin as their first-class API; for example, the demos on Spring's web page now default to Kotlin. - As you said, very few developers use Java to develop Spark: although Spark supports Java perfectly, Java's syntax is not friendly for developing Spark. I believe they use Java because of the relatively high learning curve of Scala; Kotlin is much better in this respect, which is also reflected in the growth rate of Kotlin users. 2. The cost of adapting Kotlin may not be high. - The current [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api] already exists and is basically usable; migrating it to the Spark repo should only require adding more tests. - Judging from the existing [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api], the main adaptation work is handling the Serializer and Deserializer in the Encoder. This work can take the existing Java adaptation as a reference, and should be even simpler than Java's. was (Author: nonpool): I've been helping with the Jetbrains' [Kotlin Spark APIhttps://github.com/JetBrains/kotlin-spark-api] too,I also hope that first-class support for Kotlin language into the Apache Spark project 1. I think kotlin api may be a better choice on jvm for spark developers. - As the description says, there are a lot of kotlin developers now, and they are growing fast, and more and more projects use kotlin as the first-class api,For example the demo on spring's web page has defaulted to kotlin. - As you said, there are very few developers using java to develop spark, because although spark perfectly supports java, the syntax of java is not friendly to developing spark. I believe they use java because of the relatively long learning curve of scala. High, koltin is much better, which can also be reflected in the growth rate of koltin users 2. The cost of adapting kotlin may not be high - The current [Kotlin Spark APIhttps://github.com/JetBrains/kotlin-spark-api] already exists and it is basically usable. Migrating to the spark appliction repo should only need to add more tests. - Judging from the existing [Kotlin Spark APIhttps://github.com/JetBrains/kotlin-spark-api], the main work of adaptation is to process the Serializer and Deserializer in the Encoder. I think the workload of these adaptation work should be able to refer to the adaptation of java, and it is even simpler than java. , Because of the adaptation of java, kotlin has a reference > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. 
> * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. > We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The
[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381919#comment-17381919 ] canwang commented on SPARK-32530: - I've been helping with JetBrains' [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api] too, and I also hope to see first-class support for the Kotlin language in the Apache Spark project. 1. I think a Kotlin API may be a better choice on the JVM for Spark developers. - As the description says, there are a lot of Kotlin developers now, they are growing fast, and more and more projects treat Kotlin as their first-class API; for example, the demos on Spring's web page now default to Kotlin. - As you said, very few developers use Java to develop Spark: although Spark supports Java perfectly, Java's syntax is not friendly for developing Spark. I believe they use Java because of the relatively high learning curve of Scala; Kotlin is much better in this respect, which is also reflected in the growth rate of Kotlin users. 2. The cost of adapting Kotlin may not be high. - The current [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api] already exists and is basically usable; migrating it to the Spark repo should only require adding more tests. - Judging from the existing [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api], the main adaptation work is handling the Serializer and Deserializer in the Encoder. This work can take the existing Java adaptation as a reference, and should be even simpler than Java's. > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. 
Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. > * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. > We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The goal of this project is to bring first-class support for Kotlin language > into the Apache Spark project. We’re going to achieve this by adding one more > module to the current Apache Spark distribution. > h2. Non-goals > There is no goal to replace any existing language support or to change any > existing Apache Spark API. > At this time, there is no goal to support non-core APIs of Apache Spark like > Spark ML and Spark structured streaming. This may change in the future based > on community feedback. > There is no goal to provide CLI for Kotlin for Apache Spark, this will be a > separate SPIP. > There is no goal to provide support for Apache Spark < 3.0.0. > h2. 
Current implementation > A working prototype is available at > [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside > JetBrains and by early adopters. > h2. What are the risks? > There is always a risk that this product won’t get enough popularity and will > bring more costs than benefits. It can be mitigated by the fact that we don't > need to change any existing API and support can be potentially dropped at any > time. > We also believe that existing API is rather low maintenance. It does not > bring anything more complex than already exists in the Spark codebase. > Furthermore, the implementation is compact - less than 2000 lines of code. > We are committed to maintaining, improving and evolving the API based on > feedback from both Spark and Kotlin communities. As the Kotlin data community > continues to grow, we see Kotlin API for A
[jira] [Resolved] (SPARK-36048) Wrong HealthTrackerSuite.allExecutorAndHostIds
[ https://issues.apache.org/jira/browse/SPARK-36048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-36048. -- Fix Version/s: 3.3.0 Assignee: wuyi Resolution: Fixed Issue resolved by [https://github.com/apache/spark/pull/33262] > Wrong HealthTrackerSuite.allExecutorAndHostIds > -- > > Key: SPARK-36048 > URL: https://issues.apache.org/jira/browse/SPARK-36048 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.3.0 > > > `HealthTrackerSuite.allExecutorAndHostIds` is mistakenly declared, which > means the executor exclusion isn't correctly tested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381904#comment-17381904 ] Apache Spark commented on SPARK-36176: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33388 > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36176: Assignee: Apache Spark > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Apache Spark >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36176: Assignee: (was: Apache Spark) > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381902#comment-17381902 ] Apache Spark commented on SPARK-36176: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33388 > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominik Gehl updated SPARK-36176: - Summary: expose tableExists in pyspark.sql.catalog (was: expost tableExists in pyspark.sql.catalog) > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36176) expost tableExists in pyspark.sql.catalog
Dominik Gehl created SPARK-36176: Summary: expost tableExists in pyspark.sql.catalog Key: SPARK-36176 URL: https://issues.apache.org/jira/browse/SPARK-36176 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.2 Reporter: Dominik Gehl Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
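The convenience SPARK-36176 asks for can be illustrated with a small stand-in (the `Catalog` class below is purely hypothetical; PySpark's real Catalog delegates to the JVM-side Scala implementation): `tableExists` turns a lookup-that-raises pattern into a simple boolean check.

```python
# Hypothetical stand-in catalog, NOT pyspark's implementation; it only
# illustrates the convenience the ticket requests.
class Catalog:
    def __init__(self, tables):
        self._tables = set(tables)

    def get_table(self, name):
        # Existing style of lookup: raises when the table is absent.
        if name not in self._tables:
            raise KeyError(f"Table or view not found: {name}")
        return name

    def table_exists(self, name):
        # Requested convenience: a boolean instead of try/except.
        return name in self._tables


catalog = Catalog(["people", "orders"])
print(catalog.table_exists("people"))     # True
print(catalog.table_exists("customers"))  # False
```

Callers can then branch on the result directly instead of wrapping every lookup in exception handling.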
[jira] [Assigned] (SPARK-35710) Support DPP + AQE when no reused broadcast exchange
[ https://issues.apache.org/jira/browse/SPARK-35710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35710: --- Assignee: Ke Jia > Support DPP + AQE when no reused broadcast exchange > --- > > Key: SPARK-35710 > URL: https://issues.apache.org/jira/browse/SPARK-35710 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > Support DPP + AQE when no reused broadcast exchange. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35710) Support DPP + AQE when no reused broadcast exchange
[ https://issues.apache.org/jira/browse/SPARK-35710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35710. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32861 [https://github.com/apache/spark/pull/32861] > Support DPP + AQE when no reused broadcast exchange > --- > > Key: SPARK-35710 > URL: https://issues.apache.org/jira/browse/SPARK-35710 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.2.0 > > > Support DPP + AQE when no reused broadcast exchange. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36099) Group exception messages in core/util
[ https://issues.apache.org/jira/browse/SPARK-36099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381874#comment-17381874 ] Shockang commented on SPARK-36099: -- [~allisonwang-db] Could I work on this issue? > Group exception messages in core/util > - > > Key: SPARK-36099 > URL: https://issues.apache.org/jira/browse/SPARK-36099 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > 'core/src/main/scala/org/apache/spark/util' > || Filename || Count || > | AccumulatorV2.scala | 4 | > | ClosureCleaner.scala | 1 | > | DependencyUtils.scala | 1 | > | KeyLock.scala | 1 | > | ListenerBus.scala | 1 | > | NextIterator.scala | 1 | > | SerializableBuffer.scala | 2 | > | ThreadUtils.scala | 4 | > | Utils.scala | 16 | > 'core/src/main/scala/org/apache/spark/util/collection' > || Filename || Count || > | AppendOnlyMap.scala | 1 | > | CompactBuffer.scala | 1 | > | ImmutableBitSet.scala | 6 | > | MedianHeap.scala | 1 | > | OpenHashSet.scala | 2 | > 'core/src/main/scala/org/apache/spark/util/io' > || Filename || Count || > | ChunkedByteBuffer.scala | 1 | > 'core/src/main/scala/org/apache/spark/util/logging' > || Filename || Count || > | DriverLogger.scala | 1 | > 'core/src/main/scala/org/apache/spark/util/random' > || Filename || Count || > | RandomSampler.scala | 1 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35972) When replace ExtractValue in NestedColumnAliasing we should use semanticEquals
[ https://issues.apache.org/jira/browse/SPARK-35972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35972: Fix Version/s: 3.1.3 > When replace ExtractValue in NestedColumnAliasing we should use semanticEquals > -- > > Key: SPARK-35972 > URL: https://issues.apache.org/jira/browse/SPARK-35972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > {code:java} > Job aborted due to stage failure: Task 47 in stage 1.0 failed 4 times, most > recent failure: Lost task 47.3 in stage 1.0 (TID 328) > (ip-idata-server.shopee.io executor 3): > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: _gen_alias_788#788 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at 
scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > 
at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalys
[jira] [Assigned] (SPARK-35972) When replace ExtractValue in NestedColumnAliasing we should use semanticEquals
[ https://issues.apache.org/jira/browse/SPARK-35972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35972: --- Assignee: angerszhu > When replace ExtractValue in NestedColumnAliasing we should use semanticEquals > -- > > Key: SPARK-35972 > URL: https://issues.apache.org/jira/browse/SPARK-35972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > {code:java} > Job aborted due to stage failure: Task 47 in stage 1.0 failed 4 times, most > recent failure: Lost task 47.3 in stage 1.0 (TID 328) > (ip-idata-server.shopee.io executor 3): > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: _gen_alias_788#788 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at 
> scala.collection.immutable.List.map(List.scala:298)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.immutable.List.map(List.scala:298)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
> at org.apache.spark.sql.catalyst
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381857#comment-17381857 ]

Apache Spark commented on SPARK-36034:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/33387

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-36034
>                 URL: https://issues.apache.org/jira/browse/SPARK-36034
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Willi Raschkowski
>            Assignee: Max Gekk
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 3.2.0, 3.1.3, 3.3.0
>
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or
> by Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", "date = '0001-01-01'").show()
> +----------+-------------------+
> |      date|(date = 0001-01-01)|
> +----------+-------------------+
> |0001-01-01|               true|
> +----------+-------------------+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = '0001-01-01'").show()
> +----------+
> |      date|
> +----------+
> |0001-01-01|
> +----------+
> {code}
> This is how we get incorrect results in _legacy_ mode, where the filter
> drops rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", "date = '0001-01-01'").show()
> +----------+-------------------+
> |      date|(date = 0001-01-01)|
> +----------+-------------------+
> |0001-01-01|               true|
> +----------+-------------------+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
> +----+
> |date|
> +----+
> +----+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>    +- FileScan parquet [date#154] Batched: true, DataFilters: [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
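The `-719162` in the explain() output above is the proleptic Gregorian day count for `0001-01-01`, which Spark 3 pushes down as the filter literal. Legacy-written files store day counts rebased to the hybrid Julian calendar instead, so the equality matches no on-disk value. A minimal sketch in plain Python (no Spark needed; the two-day Julian/Gregorian offset near year 1 is an assumption for illustration, not taken from this issue):

```python
from datetime import date

# Python's datetime uses the proleptic Gregorian calendar, the same
# calendar Spark 3 uses internally for DATE values (days since epoch).
epoch = date(1970, 1, 1)

# Day count that Spark 3 pushes down for the literal DATE '0001-01-01':
gregorian_days = (date(1, 1, 1) - epoch).days
print(gregorian_days)  # -719162, matching the Filter in the explain() output

# LEGACY mode rebases dates to the hybrid Julian calendar before writing.
# Near year 1 the two calendars differ by about two days (assumed offset
# here), so the on-disk day count is shifted and the pushed-down equality
# filter matches no rows.
julian_shift_days = 2  # hypothetical offset for dates near year 1
print(gregorian_days - julian_shift_days)
```

This is why `selectExpr` (which rebases values on read before comparing) returns `true` while the pushed-down Parquet filter (which compares raw day counts) drops the row.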