[jira] [Assigned] (SPARK-35276) Write checksum files for shuffle
[ https://issues.apache.org/jira/browse/SPARK-35276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-35276: --- Assignee: wuyi > Write checksum files for shuffle > - > > Key: SPARK-35276 > URL: https://issues.apache.org/jira/browse/SPARK-35276 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35276) Write checksum files for shuffle
[ https://issues.apache.org/jira/browse/SPARK-35276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-35276. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32401 [https://github.com/apache/spark/pull/32401] > Write checksum files for shuffle > - > > Key: SPARK-35276 > URL: https://issues.apache.org/jira/browse/SPARK-35276 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.2.0 > >
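SPARK-35276 writes a checksum file alongside shuffle data so that readers can detect on-disk corruption. As a rough sketch of the idea only (not Spark's actual implementation, which lives in the Scala shuffle writers and has its own checksum algorithms and file layout), a per-partition CRC32 can be recorded at write time and re-verified at read time:

```python
import zlib

def shuffle_checksums(partitions):
    """Compute one CRC32 checksum per shuffle partition (each a bytes block).

    Illustrative only: in the real feature these values would be persisted
    in a separate checksum file next to the shuffle data/index files.
    """
    return [zlib.crc32(block) & 0xFFFFFFFF for block in partitions]

def is_corrupted(block, expected_checksum):
    """A reader recomputes the checksum and compares it with the stored one."""
    return (zlib.crc32(block) & 0xFFFFFFFF) != expected_checksum
```

Any single flipped byte changes the CRC32, so a mismatch flags the block as corrupted before it is deserialized.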
[jira] [Created] (SPARK-36192) Better error messages when comparing against list
Xinrong Meng created SPARK-36192: Summary: Better error messages when comparing against list Key: SPARK-36192 URL: https://issues.apache.org/jira/browse/SPARK-36192 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Environment: We shall throw TypeError messages rather than Spark exceptions. Reporter: Xinrong Meng
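The intent here - surface a plain Python TypeError instead of a JVM-side Spark exception when an unsupported comparison target such as a list is used - can be sketched with a stand-in class (hypothetical names, not the pandas-on-Spark code):

```python
class SeriesStandIn:
    """Toy stand-in for a pandas-on-Spark Series (illustrative only)."""

    def __init__(self, data):
        self.data = data

    def __lt__(self, other):
        if isinstance(other, (list, tuple)):
            # Fail fast with a readable TypeError instead of letting the
            # comparison reach the JVM and surface an opaque Spark exception.
            raise TypeError(
                "< can not be applied to %s." % type(other).__name__
            )
        return [x < other for x in self.data]
```

The user-facing difference is the error type and message, not the behavior of valid comparisons.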
[jira] [Created] (SPARK-36191) Support ORDER BY and LIMIT to be on the correlation path
Allison Wang created SPARK-36191: Summary: Support ORDER BY and LIMIT to be on the correlation path Key: SPARK-36191 URL: https://issues.apache.org/jira/browse/SPARK-36191 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Allison Wang A correlation path is defined as the sub-tree of all the operators that are on the path from the operator hosting the correlated expressions up to the operator producing the correlated values. We want to support ORDER BY (Sort) and LIMIT operators to be on the correlation path to achieve better feature parity with Postgres. Here is an example query in `postgreSQL/join.sql`:
{code:SQL}
select * from text_tbl t1
  left join int8_tbl i8 on i8.q2 = 123,
  lateral (select i8.q1, t2.f1 from text_tbl t2 limit 1) as ss
where t1.f1 = ss.f1;
{code}
[jira] [Updated] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats
[ https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Zhang updated SPARK-36187: --- Description: Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code to set that config. If I understand it correctly, without setting SQLConf.OUTPUT_COMMITTER_CLASS properly (like for Orc format), SQLHadoopMapReduceCommitProtocol will still use the original staging directory, which may void the fix by the PR, in which case the commit collision may still happen, thus the fix is now only effective for Parquet, but not for non-Parquet files. Could someone confirm if it is a potential problem, or not? Thanks! [~duripeng] [~dagrawal3409] was: Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. 
In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code. Does it mean that this new staging directory will not take effect for non-Parquet formats? Could that be a potential problem? or am I missing something here? Thanks! [~duripeng] [~dagrawal3409] > Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet > formats > --- > > Key: SPARK-36187 > URL: https://issues.apache.org/jira/browse/SPARK-36187 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.1.2 >Reporter: Tony Zhang >Priority: Minor > > Hi, my question here is specifically about [PR > #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for > SPARK-29302. > To my understanding, the PR is to introduce a different staging directory at > job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, > the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is > not null: > [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], > and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet > formats: > [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. > However I didn't find similar behavior in Orc related code to set that > config. 
If I understand it correctly, without setting > SQLConf.OUTPUT_COMMITTER_CLASS properly (like for Orc format), > SQLHadoopMapReduceCommitProtocol will still use the original staging > directory, which may void the fix by the PR, in which case the commit > collision may still happen, thus the fix is now only effective for Parquet, > but not for non-Parquet files. > Could someone confirm if it is a potential problem, or not? Thanks! > [~duripeng] [~dagrawal3409]
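The gating behavior the reporter describes can be reduced to a small sketch (hypothetical Python with illustrative names; the real logic is Scala in SQLHadoopMapReduceCommitProtocol): the collision-avoiding staging directory is only chosen when the committer-class config is set, so a format that never sets it keeps using the shared default path.

```python
def staging_dir(table_path, job_id, committer_class_conf):
    """Sketch of the conditional described above; names are illustrative,
    not Spark's actual paths or config keys."""
    if committer_class_conf is not None:
        # Per-job staging directory: concurrent jobs cannot collide.
        return f"{table_path}/.spark-staging-{job_id}"
    # Shared default staging location: concurrent dynamic-partition-overwrite
    # jobs writing here may still collide, which is the concern raised above.
    return f"{table_path}/_temporary"
```

Under this sketch, two concurrent jobs with the config unset resolve to the same directory, which is exactly the collision scenario in question.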
[jira] [Assigned] (SPARK-36189) Improve bool, string, numeric DataTypeOps tests by avoiding joins
[ https://issues.apache.org/jira/browse/SPARK-36189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36189: Assignee: Apache Spark > Improve bool, string, numeric DataTypeOps tests by avoiding joins > - > > Key: SPARK-36189 > URL: https://issues.apache.org/jira/browse/SPARK-36189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Improve bool, string, numeric DataTypeOps tests by avoiding joins.
[jira] [Assigned] (SPARK-36189) Improve bool, string, numeric DataTypeOps tests by avoiding joins
[ https://issues.apache.org/jira/browse/SPARK-36189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36189: Assignee: (was: Apache Spark) > Improve bool, string, numeric DataTypeOps tests by avoiding joins > - > > Key: SPARK-36189 > URL: https://issues.apache.org/jira/browse/SPARK-36189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > Improve bool, string, numeric DataTypeOps tests by avoiding joins.
[jira] [Commented] (SPARK-36188) Add categories setter to CategoricalAccessor and CategoricalIndex.
[ https://issues.apache.org/jira/browse/SPARK-36188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382378#comment-17382378 ] Takuya Ueshin commented on SPARK-36188: --- I'm working on this. > Add categories setter to CategoricalAccessor and CategoricalIndex. > -- > > Key: SPARK-36188 > URL: https://issues.apache.org/jira/browse/SPARK-36188 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major >
[jira] [Commented] (SPARK-36189) Improve bool, string, numeric DataTypeOps tests by avoiding joins
[ https://issues.apache.org/jira/browse/SPARK-36189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382379#comment-17382379 ] Apache Spark commented on SPARK-36189: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/33402 > Improve bool, string, numeric DataTypeOps tests by avoiding joins > - > > Key: SPARK-36189 > URL: https://issues.apache.org/jira/browse/SPARK-36189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > Improve bool, string, numeric DataTypeOps tests by avoiding joins.
[jira] [Created] (SPARK-36190) Improve the rest of DataTypeOps tests by avoiding joins
Xinrong Meng created SPARK-36190: Summary: Improve the rest of DataTypeOps tests by avoiding joins Key: SPARK-36190 URL: https://issues.apache.org/jira/browse/SPARK-36190 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Xinrong Meng bool, string, numeric DataTypeOps tests have been improved by avoiding joins. Improve the rest of DataTypeOps tests in the same way.
[jira] [Created] (SPARK-36189) Improve bool, string, numeric DataTypeOps tests by avoiding joins
Xinrong Meng created SPARK-36189: Summary: Improve bool, string, numeric DataTypeOps tests by avoiding joins Key: SPARK-36189 URL: https://issues.apache.org/jira/browse/SPARK-36189 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Xinrong Meng Improve bool, string, numeric DataTypeOps tests by avoiding joins.
[jira] [Created] (SPARK-36188) Add categories setter to CategoricalAccessor and CategoricalIndex.
Takuya Ueshin created SPARK-36188: - Summary: Add categories setter to CategoricalAccessor and CategoricalIndex. Key: SPARK-36188 URL: https://issues.apache.org/jira/browse/SPARK-36188 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Takuya Ueshin
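For context, the pandas behavior this setter would mirror: assigning to `.categories` renames the categories positionally, so the new list must have the same length, and the stored codes stay untouched. A toy model (not the pandas-on-Spark implementation; names are illustrative):

```python
class ToyCategorical:
    """Toy categorical: values stored as integer codes into a categories list."""

    def __init__(self, codes, categories):
        self._codes = list(codes)            # -1 would mean "missing"
        self._categories = list(categories)

    @property
    def categories(self):
        return list(self._categories)

    @categories.setter
    def categories(self, new_categories):
        # Positional rename: same length required, codes are left as-is.
        if len(new_categories) != len(self._categories):
            raise ValueError("new categories must have the same length "
                             "as the old categories")
        self._categories = list(new_categories)

    def values(self):
        return [self._categories[c] for c in self._codes]
```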
[jira] [Updated] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats
[ https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Zhang updated SPARK-36187: --- Shepherd: Wenchen Fan Description: Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code. Does it mean that this new staging directory will not take effect for non-Parquet formats? Could that be a potential problem? or am I missing something here? Thanks! [~duripeng] [~dagrawal3409] was: Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. 
In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code. Does it mean that this new staging directory will not take effect for non-Parquet formats? Could that be a potential problem? or am I missing something here? Thanks! > Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet > formats > --- > > Key: SPARK-36187 > URL: https://issues.apache.org/jira/browse/SPARK-36187 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.1.2 >Reporter: Tony Zhang >Priority: Minor > > Hi, my question here is specifically about [PR > #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for > SPARK-29302. > To my understanding, the PR is to introduce a different staging directory at > job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, > the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is > not null: > [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], > however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for > parquet formats: > [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. > However I didn't find similar behavior in Orc related code. 
Does it mean that > this new staging directory will not take effect for non-Parquet formats? > Could that be a potential problem? or am I missing something here? > Thanks! > [~duripeng] [~dagrawal3409]
[jira] [Created] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats
Tony Zhang created SPARK-36187: -- Summary: Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats Key: SPARK-36187 URL: https://issues.apache.org/jira/browse/SPARK-36187 Project: Spark Issue Type: Question Components: SQL Affects Versions: 3.1.2 Reporter: Tony Zhang Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302. To my understanding, the PR is to introduce a different staging directory at job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet formats: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. However I didn't find similar behavior in Orc related code. Does it mean that this new staging directory will not take effect for non-Parquet formats? Could that be a potential problem? or am I missing something here? Thanks!
[jira] [Commented] (SPARK-35785) Cleanup support for RocksDB instance
[ https://issues.apache.org/jira/browse/SPARK-35785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382361#comment-17382361 ] Apache Spark commented on SPARK-35785: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/33401 > Cleanup support for RocksDB instance > > > Key: SPARK-35785 > URL: https://issues.apache.org/jira/browse/SPARK-35785 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.2.0, 3.3.0 > > > Add the functionality of cleaning up files of old versions for the RocksDB > instance and RocksDBFileManager.
[jira] [Assigned] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
[ https://issues.apache.org/jira/browse/SPARK-36186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36186: Assignee: (was: Apache Spark) > Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex. > > > Key: SPARK-36186 > URL: https://issues.apache.org/jira/browse/SPARK-36186 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major >
[jira] [Assigned] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
[ https://issues.apache.org/jira/browse/SPARK-36186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36186: Assignee: Apache Spark > Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex. > > > Key: SPARK-36186 > URL: https://issues.apache.org/jira/browse/SPARK-36186 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
[ https://issues.apache.org/jira/browse/SPARK-36186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382352#comment-17382352 ] Apache Spark commented on SPARK-36186: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/33400 > Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex. > > > Key: SPARK-36186 > URL: https://issues.apache.org/jira/browse/SPARK-36186 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major >
[jira] [Created] (SPARK-36186) Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex.
Takuya Ueshin created SPARK-36186: - Summary: Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex. Key: SPARK-36186 URL: https://issues.apache.org/jira/browse/SPARK-36186 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Takuya Ueshin
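The pandas semantics being matched here: `as_ordered`/`as_unordered` return a categorical with the same categories but the `ordered` flag flipped, which is what makes ordering comparisons between category values legal. A minimal sketch of that contract (a toy dtype, not the real implementation):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ToyCatDtype:
    """Toy categorical dtype carrying only what as_ordered/as_unordered touch."""
    categories: tuple
    ordered: bool = False

    def as_ordered(self):
        # Returns a copy with the flag set; the original is left unchanged.
        return replace(self, ordered=True)

    def as_unordered(self):
        return replace(self, ordered=False)
```

Returning a copy rather than mutating in place matches the default (non-inplace) pandas behavior.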
[jira] [Created] (SPARK-36185) Implement functions in CategoricalAccessor/CategoricalIndex
Takuya Ueshin created SPARK-36185: - Summary: Implement functions in CategoricalAccessor/CategoricalIndex Key: SPARK-36185 URL: https://issues.apache.org/jira/browse/SPARK-36185 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.2.0 Reporter: Takuya Ueshin There are functions we haven't implemented in {{CategoricalAccessor}} and {{CategoricalIndex}}.
[jira] [Commented] (SPARK-36099) Group exception messages in core/util
[ https://issues.apache.org/jira/browse/SPARK-36099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382330#comment-17382330 ] Allison Wang commented on SPARK-36099: -- [~Shockang] Yes of course!
> Group exception messages in core/util
> -
>
> Key: SPARK-36099
> URL: https://issues.apache.org/jira/browse/SPARK-36099
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Allison Wang
> Priority: Major
>
> 'core/src/main/scala/org/apache/spark/util'
> || Filename || Count ||
> | AccumulatorV2.scala | 4 |
> | ClosureCleaner.scala | 1 |
> | DependencyUtils.scala | 1 |
> | KeyLock.scala | 1 |
> | ListenerBus.scala | 1 |
> | NextIterator.scala | 1 |
> | SerializableBuffer.scala | 2 |
> | ThreadUtils.scala | 4 |
> | Utils.scala | 16 |
> 'core/src/main/scala/org/apache/spark/util/collection'
> || Filename || Count ||
> | AppendOnlyMap.scala | 1 |
> | CompactBuffer.scala | 1 |
> | ImmutableBitSet.scala | 6 |
> | MedianHeap.scala | 1 |
> | OpenHashSet.scala | 2 |
> 'core/src/main/scala/org/apache/spark/util/io'
> || Filename || Count ||
> | ChunkedByteBuffer.scala | 1 |
> 'core/src/main/scala/org/apache/spark/util/logging'
> || Filename || Count ||
> | DriverLogger.scala | 1 |
> 'core/src/main/scala/org/apache/spark/util/random'
> || Filename || Count ||
> | RandomSampler.scala | 1 |
[jira] [Resolved] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning
[ https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-36128. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33348 [https://github.com/apache/spark/pull/33348] > CatalogFileIndex.filterPartitions should respect > spark.sql.hive.metastorePartitionPruning > - > > Key: SPARK-36128 > URL: https://issues.apache.org/jira/browse/SPARK-36128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only > used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. > The latter calls {{CatalogFileIndex.filterPartitions}} which calls > {{listPartitionsByFilter}} regardless of whether the above config is set or > not.
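The fix makes {{CatalogFileIndex.filterPartitions}} honor the config the same way {{PruneHiveTablePartitions}} already does. The intended gating can be sketched as follows (hypothetical Python with illustrative names; the real code is Scala and talks to the Hive metastore):

```python
def filter_partitions(all_partitions, predicate, metastore_pruning_enabled):
    """Sketch: push the filter down to the metastore listing only when
    spark.sql.hive.metastorePartitionPruning is enabled."""
    if metastore_pruning_enabled:
        # Analogous to listPartitionsByFilter: only matching partitions
        # are fetched from the metastore.
        return [p for p in all_partitions if predicate(p)]
    # Config off: list every partition; any pruning happens later in Spark.
    return list(all_partitions)
```

The bug was that the filtered path was taken unconditionally; the sketch makes the config the switch between the two listings.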
[jira] [Assigned] (SPARK-36128) CatalogFileIndex.filterPartitions should respect spark.sql.hive.metastorePartitionPruning
[ https://issues.apache.org/jira/browse/SPARK-36128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-36128: --- Assignee: Chao Sun > CatalogFileIndex.filterPartitions should respect > spark.sql.hive.metastorePartitionPruning > - > > Key: SPARK-36128 > URL: https://issues.apache.org/jira/browse/SPARK-36128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > Currently the config {{spark.sql.hive.metastorePartitionPruning}} is only > used in {{PruneHiveTablePartitions}} but not {{PruneFileSourcePartitions}}. > The latter calls {{CatalogFileIndex.filterPartitions}} which calls > {{listPartitionsByFilter}} regardless of whether the above config is set or > not.
[jira] [Commented] (SPARK-36167) Revisit more InternalField managements.
[ https://issues.apache.org/jira/browse/SPARK-36167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382241#comment-17382241 ] Apache Spark commented on SPARK-36167: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/33398 > Revisit more InternalField managements. > --- > > Key: SPARK-36167 > URL: https://issues.apache.org/jira/browse/SPARK-36167 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > > There are other places we can manage {{InternalField}}.
[jira] [Commented] (SPARK-36183) Push down limit 1 through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-36183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382174#comment-17382174 ] Apache Spark commented on SPARK-36183: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/33397 > Push down limit 1 through Aggregate > --- > > Key: SPARK-36183 > URL: https://issues.apache.org/jira/browse/SPARK-36183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Assigned] (SPARK-36183) Push down limit 1 through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-36183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36183: Assignee: Apache Spark > Push down limit 1 through Aggregate > --- > > Key: SPARK-36183 > URL: https://issues.apache.org/jira/browse/SPARK-36183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-36183) Push down limit 1 through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-36183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36183: Assignee: (was: Apache Spark) > Push down limit 1 through Aggregate > --- > > Key: SPARK-36183 > URL: https://issues.apache.org/jira/browse/SPARK-36183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Commented] (SPARK-36183) Push down limit 1 through Aggregate
[ https://issues.apache.org/jira/browse/SPARK-36183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382173#comment-17382173 ] Apache Spark commented on SPARK-36183: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/33397 > Push down limit 1 through Aggregate > --- > > Key: SPARK-36183 > URL: https://issues.apache.org/jira/browse/SPARK-36183 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Assigned] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles
[ https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36184: Assignee: (was: Apache Spark) > Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that > adds extra shuffles > - > > Key: SPARK-36184 > URL: https://issues.apache.org/jira/browse/SPARK-36184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles
[ https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36184: Assignee: Apache Spark > Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that > adds extra shuffles > - > > Key: SPARK-36184 > URL: https://issues.apache.org/jira/browse/SPARK-36184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that add extra shuffles
[ https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382171#comment-17382171 ] Apache Spark commented on SPARK-36184: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/33396 > Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that > add extra shuffles > - > > Key: SPARK-36184 > URL: https://issues.apache.org/jira/browse/SPARK-36184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that add extra shuffles
Wenchen Fan created SPARK-36184: --- Summary: Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that add extra shuffles Key: SPARK-36184 URL: https://issues.apache.org/jira/browse/SPARK-36184 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
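The distinction this ticket relies on can be sketched in a few lines. This is a hedged, hypothetical simplification in plain Java, not Spark's actual `ValidateRequirements`/`EnsureRequirements` API; the `Plan` class and the `validate`/`ensure` method names below are illustrative only:

```java
// Hypothetical sketch (not Spark's real API): EnsureRequirements may REWRITE
// the plan by inserting exchange (shuffle) nodes, while ValidateRequirements
// only CHECKS whether the plan already satisfies its distribution
// requirements -- which is why it is the safer gate for AQE re-optimization.
class Plan {
    final boolean satisfiesDistribution;
    final int shuffles;

    Plan(boolean satisfiesDistribution, int shuffles) {
        this.satisfiesDistribution = satisfiesDistribution;
        this.shuffles = shuffles;
    }
}

class Requirements {
    // Read-only: never mutates or copies the plan.
    static boolean validate(Plan p) {
        return p.satisfiesDistribution;
    }

    // May return a plan with an extra shuffle to fix the distribution.
    static Plan ensure(Plan p) {
        return p.satisfiesDistribution ? p : new Plan(true, p.shuffles + 1);
    }
}
```

Under this reading, an AQE rule that would need `ensure` to repair its output is the kind of rule the ticket wants to skip, since the repair itself is an extra shuffle.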
[jira] [Created] (SPARK-36183) Push down limit 1 through Aggregate
Yuming Wang created SPARK-36183: --- Summary: Push down limit 1 through Aggregate Key: SPARK-36183 URL: https://issues.apache.org/jira/browse/SPARK-36183 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
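The intuition behind pushing `LIMIT 1` through an aggregate whose output is just its grouping keys (i.e., a distinct) can be illustrated with plain `java.util.stream` rather than Catalyst. The `LimitPushdown` class below is a hypothetical sketch, not Spark code: any single input row produces exactly one group, so the limit can be evaluated first.

```java
import java.util.List;

// Hedged sketch with java.util.stream, not Spark's optimizer: for a pure
// group-by (a distinct), one input row always yields exactly one group, so
// "distinct then limit 1" and "limit 1 then distinct" both return one row.
class LimitPushdown {
    static long distinctThenLimit(List<Integer> rows) {
        return rows.stream().distinct().limit(1).count();
    }

    static long limitThenDistinct(List<Integer> rows) {
        // The pushed-down form only ever touches one input row.
        return rows.stream().limit(1).distinct().count();
    }
}
```

Both forms produce the same row count (though possibly a different row), and the pushed-down form avoids scanning and aggregating the whole input.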
[jira] [Updated] (SPARK-36152) Add Scala 2.13 daily build and test GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-36152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36152: -- Description: https://github.com/apache/spark/actions/workflows/build_and_test_scala213_daily.yml > Add Scala 2.13 daily build and test GitHub Action job > - > > Key: SPARK-36152 > URL: https://issues.apache.org/jira/browse/SPARK-36152 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0, 3.3.0 > > > https://github.com/apache/spark/actions/workflows/build_and_test_scala213_daily.yml -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382132#comment-17382132 ] Dongjoon Hyun edited comment on SPARK-32210 at 7/16/21, 3:13 PM: - Please feel free to work on this, [~kazuyukitanimura] and ping me after you make a PR. was (Author: dongjoon): Please feel free to work on this, [~kazuyukitanimura]. > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 3.1.2 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > 
org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382132#comment-17382132 ] Dongjoon Hyun commented on SPARK-32210: --- Please feel free to work on this, [~kazuyukitanimura]. > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 3.1.2 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > 
at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382131#comment-17382131 ] Dongjoon Hyun commented on SPARK-32210: --- There is a new observation of this situation in 3.1 from [~kazuyukitanimura]. > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 3.1.2 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32210: -- Affects Version/s: 3.1.2 > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 3.1.2 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > 
org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
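The `NegativeArraySizeException` in the traces above originates in `ByteArrayOutputStream.toByteArray`, which suggests the serialized map statuses grew past `Integer.MAX_VALUE` bytes, so the int byte count wrapped negative and the result-array allocation failed. A minimal, hedged reproduction of the underlying JVM behavior (of the int overflow only, not of Spark or commons-io):

```java
// When an int size counter overflows, it wraps negative, and
// `new byte[negativeSize]` throws NegativeArraySizeException.
class OverflowDemo {
    static String allocate(int size) {
        try {
            byte[] buf = new byte[size];
            return "ok:" + buf.length;
        } catch (NegativeArraySizeException e) {
            return "NegativeArraySizeException";
        }
    }
}
```

`Integer.MAX_VALUE + 1` silently wraps to `Integer.MIN_VALUE` in Java, which is exactly the failure mode a multi-gigabyte serialized `MapStatus` buffer would hit.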
[jira] [Commented] (SPARK-36134) jackson-databind RCE vulnerability
[ https://issues.apache.org/jira/browse/SPARK-36134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382128#comment-17382128 ] Erik Krogen commented on SPARK-36134: - Whoops, must have missed the 3.1.2 release :) Thanks for correcting me. Still, 3.1.2 is using Jackson 2.10.0, so I don't see where the CVE report is coming from. Can you elaborate? > jackson-databind RCE vulnerability > -- > > Key: SPARK-36134 > URL: https://issues.apache.org/jira/browse/SPARK-36134 > Project: Spark > Issue Type: Task > Components: Java API >Affects Versions: 3.1.2, 3.1.3 >Reporter: Sumit >Priority: Major > Attachments: Screenshot 2021-07-15 at 1.00.55 PM.png > > > Need to upgrade jackson-databind version to *2.9.3.1* > At the beginning of 2018, jackson-databind was reported to contain another > remote code execution (RCE) vulnerability (CVE-2017-17485) that affects > versions 2.9.3 and earlier, 2.7.9.1 and earlier, and 2.8.10 and earlier. This > vulnerability is caused by jackson-databind’s incomplete blacklist. An > application that uses jackson-databind will become vulnerable when the > enableDefaultTyping method is called via the ObjectMapper object within the > application. An attacker can thus compromise the application by sending > maliciously crafted JSON input to gain direct control over a server. > Currently, a proof of concept (POC) exploit for this vulnerability is > publicly available. All users who are affected by this vulnerability should > upgrade to the latest versions as soon as possible to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
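The mechanism described in that report can be sketched without the jackson dependency. With default typing enabled, the JSON input itself names the concrete class jackson instantiates; the `DefaultTypingAnalogue` class below is a hypothetical, reflection-based analogue of that behavior, not jackson's actual code path:

```java
// Hedged analogue of jackson default typing, with no jackson dependency:
// the *input* chooses the concrete class to instantiate. With an incomplete
// blacklist, a "gadget" class name here can run attacker-controlled code
// during construction -- the essence of CVE-2017-17485.
class DefaultTypingAnalogue {
    static Object instantiate(String classNameFromInput) {
        try {
            return Class.forName(classNameFromInput)
                        .getDeclaredConstructor()
                        .newInstance();
        } catch (ReflectiveOperationException e) {
            return null; // unknown or non-instantiable class name
        }
    }
}
```

The fix in later jackson versions was to move away from blacklisting toward explicit allow-lists of polymorphic base types, so untrusted input can no longer pick arbitrary classes.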
[jira] [Commented] (SPARK-36182) Support TimestampNTZ type in Parquet file source
[ https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382126#comment-17382126 ] Apache Spark commented on SPARK-36182: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/33395 > Support TimestampNTZ type in Parquet file source > > > Key: SPARK-36182 > URL: https://issues.apache.org/jira/browse/SPARK-36182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > As per > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, > Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current > default timestamp type): > * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ > * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ > In Spark 3.1 or prior, the Parquet writer follows the definition and sets > the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t > respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to > TIMESTAMP_LTZ. > Since 3.2, with the support of timestamp without time zone type: > * Parquet writer follows the definition and sets the field `isAdjustedToUTC` > as `false` on writing TIMESTAMP_NTZ. > * Parquet reader > ** For schema inference, Spark converts the Parquet timestamp type to the > corresponding catalyst timestamp type according to the timestamp annotation > flag `isAdjustedToUTC`. > ** If merge schema is enabled in schema inference and some of the files are > inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type > is TIMESTAMP_LTZ, which is considered the “wider” type > ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was > written as TIMESTAMP_NTZ type, Spark allows the read operation. 
> ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was > written as TIMESTAMP_LTZ type, the read operation is not allowed since > TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36182) Support TimestampNTZ type in Parquet file source
[ https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36182: Assignee: Apache Spark (was: Gengliang Wang) > Support TimestampNTZ type in Parquet file source > > > Key: SPARK-36182 > URL: https://issues.apache.org/jira/browse/SPARK-36182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > As per > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, > Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current > default timestamp type): > * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ > * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ > In Spark 3.1 or prior, the Parquet writer follows the definition and sets > the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t > respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to > TIMESTAMP_LTZ. > Since 3.2, with the support of timestamp without time zone type: > * Parquet writer follows the definition and sets the field `isAdjustedToUTC` > as `false` on writing TIMESTAMP_NTZ. > * Parquet reader > ** For schema inference, Spark converts the Parquet timestamp type to the > corresponding catalyst timestamp type according to the timestamp annotation > flag `isAdjustedToUTC`. > ** If merge schema is enabled in schema inference and some of the files are > inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type > is TIMESTAMP_LTZ, which is considered the “wider” type > ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was > written as TIMESTAMP_NTZ type, Spark allows the read operation. 
> ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was > written as TIMESTAMP_LTZ type, the read operation is not allowed since > TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36182) Support TimestampNTZ type in Parquet file source
[ https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36182: Assignee: Gengliang Wang (was: Apache Spark) > Support TimestampNTZ type in Parquet file source > > > Key: SPARK-36182 > URL: https://issues.apache.org/jira/browse/SPARK-36182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > As per > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, > Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current > default timestamp type): > * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ > * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ > In Spark 3.1 or prior, the Parquet writer follows the definition and sets > the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t > respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to > TIMESTAMP_LTZ. > Since 3.2, with the support of timestamp without time zone type: > * Parquet writer follows the definition and sets the field `isAdjustedToUTC` > as `false` on writing TIMESTAMP_NTZ. > * Parquet reader > ** For schema inference, Spark converts the Parquet timestamp type to the > corresponding catalyst timestamp type according to the timestamp annotation > flag `isAdjustedToUTC`. > ** If merge schema is enabled in schema inference and some of the files are > inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type > is TIMESTAMP_LTZ, which is considered the “wider” type > ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was > written as TIMESTAMP_NTZ type, Spark allows the read operation. 
> ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was > written as TIMESTAMP_LTZ type, the read operation is not allowed since > TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36152) Add Scala 2.13 daily build and test GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-36152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36152. --- Fix Version/s: 3.2.0 3.3.0 Resolution: Fixed Issue resolved by pull request 33358 [https://github.com/apache/spark/pull/33358] > Add Scala 2.13 daily build and test GitHub Action job > - > > Key: SPARK-36152 > URL: https://issues.apache.org/jira/browse/SPARK-36152 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36182) Support TimestampNTZ type in Parquet file source
Gengliang Wang created SPARK-36182: -- Summary: Support TimestampNTZ type in Parquet file source Key: SPARK-36182 URL: https://issues.apache.org/jira/browse/SPARK-36182 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Gengliang Wang Assignee: Gengliang Wang As per https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current default timestamp type): * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ In Spark 3.1 or prior, the Parquet writer follows the definition and sets the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to TIMESTAMP_LTZ. Since 3.2, with the support of timestamp without time zone type: * Parquet writer follows the definition and sets the field `isAdjustedToUTC` as `false` on writing TIMESTAMP_NTZ. * Parquet reader ** For schema inference, Spark converts the Parquet timestamp type to the corresponding catalyst timestamp type according to the timestamp annotation flag `isAdjustedToUTC`. ** If merge schema is enabled in schema inference and some of the files are inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type is TIMESTAMP_LTZ, which is considered the “wider” type ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was written as TIMESTAMP_NTZ type, Spark allows the read operation. ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was written as TIMESTAMP_LTZ type, the read operation is not allowed since TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
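The widening rule in this description maps cleanly onto `java.time`. As a hedged analogy (not Spark's internal implementation): TIMESTAMP_NTZ behaves like `LocalDateTime` (no zone attached) and TIMESTAMP_LTZ like `Instant` (a point on the UTC timeline); widening NTZ to LTZ is well-defined once a zone is supplied, while the reverse direction drops information, which is why reading LTZ data under an NTZ schema is rejected. The `TimestampWidening` class name is illustrative only:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hedged java.time analogy: widening a zoneless timestamp (NTZ) to a
// UTC-adjusted one (LTZ) requires choosing a zone; different zones yield
// different instants, so the mapping cannot be reversed without losing
// information.
class TimestampWidening {
    static Instant widen(LocalDateTime ntz, ZoneId zone) {
        return ntz.atZone(zone).toInstant();
    }
}
```

Note that the same `LocalDateTime` widens to different `Instant` values under different zones, which mirrors why TIMESTAMP_LTZ is the "wider" merge result.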
[jira] [Commented] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
[ https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382123#comment-17382123 ] Apache Spark commented on SPARK-36181: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33394 > Update pyspark sql readwriter documentation to Scala level > -- > > Key: SPARK-36181 > URL: https://issues.apache.org/jira/browse/SPARK-36181 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Trivial > > Update pyspark sql readwriter documentation to the level of detail the Scala > documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
[ https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36181: Assignee: (was: Apache Spark) > Update pyspark sql readwriter documentation to Scala level > -- > > Key: SPARK-36181 > URL: https://issues.apache.org/jira/browse/SPARK-36181 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Trivial > > Update pyspark sql readwriter documentation to the level of detail the Scala > documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
[ https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36181: Assignee: Apache Spark > Update pyspark sql readwriter documentation to Scala level > -- > > Key: SPARK-36181 > URL: https://issues.apache.org/jira/browse/SPARK-36181 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Apache Spark >Priority: Trivial > > Update pyspark sql readwriter documentation to the level of detail the Scala > documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
[ https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382121#comment-17382121 ] Apache Spark commented on SPARK-36181: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33394 > Update pyspark sql readwriter documentation to Scala level > -- > > Key: SPARK-36181 > URL: https://issues.apache.org/jira/browse/SPARK-36181 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Trivial > > Update pyspark sql readwriter documentation to the level of detail the Scala > documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level
Dominik Gehl created SPARK-36181: Summary: Update pyspark sql readwriter documentation to Scala level Key: SPARK-36181 URL: https://issues.apache.org/jira/browse/SPARK-36181 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.1.2 Reporter: Dominik Gehl Update pyspark sql readwriter documentation to the level of detail the Scala documentation provides -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36177. -- Fix Version/s: 3.0.4 Resolution: Fixed Issue resolved by pull request 33390 [https://github.com/apache/spark/pull/33390] > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.4 > > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow a version lower than the latest one. We > can't upload it, so we should just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36177: - Fix Version/s: 3.1.3 > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.3, 3.0.4 > > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow lower version then the latest version. We > can't upload so should better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36177: Assignee: Hyukjin Kwon > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow lower version then the latest version. We > can't upload so should better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36158) pyspark sql/functions documentation for months_between isn't as precise as scala version
[ https://issues.apache.org/jira/browse/SPARK-36158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36158: - Fix Version/s: 3.1.3 3.2.0 > pyspark sql/functions documentation for months_between isn't as precise as > scala version > > > Key: SPARK-36158 > URL: https://issues.apache.org/jira/browse/SPARK-36158 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > Fix For: 3.2.0, 3.1.3, 3.3.0 > > > pyspark months_between documentation doesn't mention that months are assumed > with 31 days in the calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
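The 31-day assumption the ticket wants documented can be made concrete. Below is a plain-Python sketch of the fractional rule described in the Scala docs: whole calendar months plus a remainder computed as if every month had 31 days, with time of day folded in. Spark's special case (returning a whole number when both dates fall on the same day of month, or both on the last day of their months) is omitted here for brevity.

```python
from datetime import datetime

def months_between_sketch(d1: datetime, d2: datetime) -> float:
    # Whole calendar months between the two timestamps...
    whole = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    # ...plus a fraction that treats every month as 31 days long.
    secs1 = d1.hour * 3600 + d1.minute * 60 + d1.second
    secs2 = d2.hour * 3600 + d2.minute * 60 + d2.second
    return whole + (d1.day - d2.day + (secs1 - secs2) / 86400) / 31.0

# The example used in Spark's own documentation:
r = months_between_sketch(datetime(1997, 2, 28, 10, 30),
                          datetime(1996, 10, 30))
print(round(r, 8))  # 3.94959677
```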
[jira] [Assigned] (SPARK-36154) pyspark documentation doesn't mention week and quarter as valid format arguments to trunc
[ https://issues.apache.org/jira/browse/SPARK-36154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36154: Assignee: Dominik Gehl > pyspark documentation doesn't mention week and quarter as valid format > arguments to trunc > - > > Key: SPARK-36154 > URL: https://issues.apache.org/jira/browse/SPARK-36154 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > Fix For: 3.2.0, 3.1.3, 3.3.0 > > > pyspark documentation for {{trunc}} in sql/functions doesn't mention that > {{week}} and {{quarter}} are valid format specifiers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
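For reference, the semantics the missing doc entries would describe: {{trunc(col, 'week')}} truncates a date to the Monday of its week, and {{trunc(col, 'quarter')}} to the first day of its quarter. A plain-Python mirror of those two cases (illustrative only; the real API is {{pyspark.sql.functions.trunc}}):

```python
from datetime import date, timedelta

def trunc_date(d: date, fmt: str) -> date:
    """Plain-Python mirror of the two trunc() formats the ticket adds."""
    if fmt == "week":        # Monday of the date's week
        return d - timedelta(days=d.weekday())
    if fmt == "quarter":     # first day of the date's quarter
        return date(d.year, 3 * ((d.month - 1) // 3) + 1, 1)
    raise ValueError(f"unsupported format: {fmt}")

print(trunc_date(date(2021, 7, 18), "week"))     # 2021-07-12 (a Monday)
print(trunc_date(date(2021, 7, 18), "quarter"))  # 2021-07-01
```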
[jira] [Resolved] (SPARK-36160) pyspark sql/column documentation doesn't always match scala documentation
[ https://issues.apache.org/jira/browse/SPARK-36160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36160. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33369 [https://github.com/apache/spark/pull/33369] > pyspark sql/column documentation doesn't always match scala documentation > - > > Key: SPARK-36160 > URL: https://issues.apache.org/jira/browse/SPARK-36160 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > Fix For: 3.3.0 > > > The pyspark sql/column documentation for methods between, getField, > dropFields and cast could be adapted to follow more closely the corresponding > Scala one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36158) pyspark sql/functions documentation for months_between isn't as precise as scala version
[ https://issues.apache.org/jira/browse/SPARK-36158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36158: Assignee: Dominik Gehl > pyspark sql/functions documentation for months_between isn't as precise as > scala version > > > Key: SPARK-36158 > URL: https://issues.apache.org/jira/browse/SPARK-36158 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > Fix For: 3.3.0 > > > pyspark months_between documentation doesn't mention that months are assumed > with 31 days in the calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36160) pyspark sql/column documentation doesn't always match scala documentation
[ https://issues.apache.org/jira/browse/SPARK-36160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36160: Assignee: Dominik Gehl > pyspark sql/column documentation doesn't always match scala documentation > - > > Key: SPARK-36160 > URL: https://issues.apache.org/jira/browse/SPARK-36160 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Dominik Gehl >Priority: Trivial > > The pyspark sql/column documentation for methods between, getField, > dropFields and cast could be adapted to follow more closely the corresponding > Scala one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36034: - Fix Version/s: 3.0.4 > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Assignee: Max Gekk >Priority: Blocker > Labels: correctness > Fix For: 3.2.0, 3.1.3, 3.0.4, 3.3.0 > > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") > >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = > >>> '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") > >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> 
'0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#154] Batched: true, DataFilters: > [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
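The -719162 in the pushed-down filter above is the proleptic Gregorian day count for 0001-01-01, which plain Python can confirm (Python's {{datetime}} uses the same calendar as Spark 3's internal representation). Files written in LEGACY mode store the hybrid Julian/Gregorian day count for the same date instead, which differs by a few days this far before 1582; that mismatch is consistent with the pushed-down literal matching no stored value:

```python
from datetime import date

# Days since the Unix epoch for DATE '0001-01-01' in the proleptic
# Gregorian calendar, as Spark 3 resolves the literal:
epoch = date(1970, 1, 1).toordinal()
gregorian_days = date(1, 1, 1).toordinal() - epoch
print(gregorian_days)  # -719162, the literal visible in the plan above

# A file written with datetimeRebaseModeInWrite=LEGACY stores the hybrid
# Julian/Gregorian day count for that date instead, so an unrebased
# pushed-down filter on -719162 finds nothing.
```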
[jira] [Commented] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382036#comment-17382036 ] Apache Spark commented on SPARK-36179: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/33393 > Support TimestampNTZType in SparkGetColumnsOperation > > > Key: SPARK-36179 > URL: https://issues.apache.org/jira/browse/SPARK-36179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Priority: Major > > TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382035#comment-17382035 ] Apache Spark commented on SPARK-36179: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/33393 > Support TimestampNTZType in SparkGetColumnsOperation > > > Key: SPARK-36179 > URL: https://issues.apache.org/jira/browse/SPARK-36179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Priority: Major > > TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36179: Assignee: Apache Spark > Support TimestampNTZType in SparkGetColumnsOperation > > > Key: SPARK-36179 > URL: https://issues.apache.org/jira/browse/SPARK-36179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36179: Assignee: (was: Apache Spark) > Support TimestampNTZType in SparkGetColumnsOperation > > > Key: SPARK-36179 > URL: https://issues.apache.org/jira/browse/SPARK-36179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kent Yao >Priority: Major > > TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
[ https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36178: Assignee: Apache Spark > Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst > -- > > Key: SPARK-36178 > URL: https://issues.apache.org/jira/browse/SPARK-36178 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Apache Spark >Priority: Minor > > PySpark Catalog API currently isn't documented in > docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
[ https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36178: Assignee: (was: Apache Spark) > Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst > -- > > Key: SPARK-36178 > URL: https://issues.apache.org/jira/browse/SPARK-36178 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > PySpark Catalog API currently isn't documented in > docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
[ https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382017#comment-17382017 ] Apache Spark commented on SPARK-36178: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33392 > Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst > -- > > Key: SPARK-36178 > URL: https://issues.apache.org/jira/browse/SPARK-36178 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > PySpark Catalog API currently isn't documented in > docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36180) HMS can not recognize timestamp_ntz
Kent Yao created SPARK-36180: Summary: HMS can not recognize timestamp_ntz Key: SPARK-36180 URL: https://issues.apache.org/jira/browse/SPARK-36180 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Kent Yao {code:java} [info] Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'timestamp_ntz:timestamp' but 'timestamp_ntz' is found.[info] Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'timestamp_ntz:timestamp' but 'timestamp_ntz' is found.[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:372)[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:416)[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)[info] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)[info] at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:162)[info] at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:91)[info] at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:116)[info] at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:54)[info] at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)[info] at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:453)[info] at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:440)[info] at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)[info] at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:199)[info] at 
org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:842)[info] ... 63 more[info] at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)[info] at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)[info] at org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$145(SparkMetadataOperationSuite.scala:666)[info] at org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$145$adapted(SparkMetadataOperationSuite.scala:665)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$4(HiveThriftServer2Suites.scala:1422)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$4$adapted(HiveThriftServer2Suites.scala:1422)[info] at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)[info] at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$1(HiveThriftServer2Suites.scala:1422)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.tryCaptureSysLog(HiveThriftServer2Suites.scala:1407)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.withMultipleConnectionJdbcStatement(HiveThriftServer2Suites.scala:1416)[info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.withJdbcStatement(HiveThriftServer2Suites.scala:1454)[info] at org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$144(SparkMetadataOperationSuite.scala:665)[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)[info] at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)[info] at org.scalatest.Transformer.apply(Transformer.scala:22)[info] at org.scalatest.Transformer.apply(Transformer.scala:20)[info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190[info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLi
[jira] [Created] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation
Kent Yao created SPARK-36179: Summary: Support TimestampNTZType in SparkGetColumnsOperation Key: SPARK-36179 URL: https://issues.apache.org/jira/browse/SPARK-36179 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Kent Yao TimestampNTZType is unhandled in SparkGetColumnsOperation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
[ https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominik Gehl updated SPARK-36178: - Summary: Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst (was: document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst) > Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst > -- > > Key: SPARK-36178 > URL: https://issues.apache.org/jira/browse/SPARK-36178 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > PySpark Catalog API currently isn't documented in > docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36178) document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
Dominik Gehl created SPARK-36178: Summary: document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst Key: SPARK-36178 URL: https://issues.apache.org/jira/browse/SPARK-36178 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.1.2 Reporter: Dominik Gehl PySpark Catalog API currently isn't documented in docs/source/reference/pyspark.sql.rst -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34893) Support native session window
[ https://issues.apache.org/jira/browse/SPARK-34893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-34893: Assignee: Jungtaek Lim > Support native session window > - > > Key: SPARK-34893 > URL: https://issues.apache.org/jira/browse/SPARK-34893 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > This issue tracks effort on supporting native session window, on both batch > query and streaming query. > This issue is the finalization of SPARK-10816 leveraging SPARK-34888, > SPARK-34889, SPARK-35861, SPARK-34891, SPARK-34892. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34893) Support native session window
[ https://issues.apache.org/jira/browse/SPARK-34893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-34893. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33081 [https://github.com/apache/spark/pull/33081] > Support native session window > - > > Key: SPARK-34893 > URL: https://issues.apache.org/jira/browse/SPARK-34893 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.2.0 > > > This issue tracks effort on supporting native session window, on both batch > query and streaming query. > This issue is the finalization of SPARK-10816 leveraging SPARK-34888, > SPARK-34889, SPARK-35861, SPARK-34891, SPARK-34892. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36122) Spark does not pass on needClientAuth to Jetty SSLContextFactory. Does not allow configuring mTLS authentication.
[ https://issues.apache.org/jira/browse/SPARK-36122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36122: Assignee: Apache Spark > Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not > allow to configure mTLS authentication. > - > > Key: SPARK-36122 > URL: https://issues.apache.org/jira/browse/SPARK-36122 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.7 >Reporter: Seetharama Khandrika >Assignee: Apache Spark >Priority: Major > > Spark does not pass on the needClientAuth flag to Jetty engine. This prevents > the UI from honouring mutual TLS authentication using x509 certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36122) Spark does not pass on needClientAuth to Jetty SSLContextFactory. Does not allow configuring mTLS authentication.
[ https://issues.apache.org/jira/browse/SPARK-36122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36122: Assignee: (was: Apache Spark) > Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not > allow to configure mTLS authentication. > - > > Key: SPARK-36122 > URL: https://issues.apache.org/jira/browse/SPARK-36122 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.7 >Reporter: Seetharama Khandrika >Priority: Major > > Spark does not pass on the needClientAuth flag to Jetty engine. This prevents > the UI from honouring mutual TLS authentication using x509 certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36122) Spark does not pass on needClientAuth to Jetty SSLContextFactory. Does not allow configuring mTLS authentication.
[ https://issues.apache.org/jira/browse/SPARK-36122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36122: Assignee: Apache Spark > Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not > allow to configure mTLS authentication. > - > > Key: SPARK-36122 > URL: https://issues.apache.org/jira/browse/SPARK-36122 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.7 >Reporter: Seetharama Khandrika >Assignee: Apache Spark >Priority: Major > > Spark does not pass on the needClientAuth flag to Jetty engine. This prevents > the UI from honouring mutual TLS authentication using x509 certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36122) Spark does not pass on needClientAuth to Jetty SSLContextFactory. Does not allow configuring mTLS authentication.
[ https://issues.apache.org/jira/browse/SPARK-36122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381996#comment-17381996 ] Apache Spark commented on SPARK-36122: -- User 'skhandrikagmail' has created a pull request for this issue: https://github.com/apache/spark/pull/33301 > Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not > allow to configure mTLS authentication. > - > > Key: SPARK-36122 > URL: https://issues.apache.org/jira/browse/SPARK-36122 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.7 >Reporter: Seetharama Khandrika >Priority: Major > > Spark does not pass on the needClientAuth flag to Jetty engine. This prevents > the UI from honouring mutual TLS authentication using x509 certificates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36177: - Description: CRAN check in branch-3.0 fails as below: {code} Insufficient package version (submitted: 3.0.4, existing: 3.1.2) {code} This is because CRAN doesn't allow lower version then the latest version. We can't upload so should better just skip the CRAN check. was: CRAN check in branch-3.0 fails as below: {code} Insufficient package version (submitted: 3.0.4, existing: 3.1.2) {code} This is because CRAN doesn't allow lower version then the latest version. We can't upload so should better just skip the CRAN check. > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow lower version then the latest version. We > can't upload so should better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381979#comment-17381979 ] Apache Spark commented on SPARK-36177: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33391 > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow lower version then the latest version. We > can't upload so should better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36177: Assignee: (was: Apache Spark) > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow a lower version than the latest version. > We can't upload, so we'd better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36177: Assignee: Apache Spark > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow a lower version than the latest version. > We can't upload, so we'd better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
[ https://issues.apache.org/jira/browse/SPARK-36177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381977#comment-17381977 ] Apache Spark commented on SPARK-36177: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/33390 > Disable CRAN in branches lower than the latest version uploaded > --- > > Key: SPARK-36177 > URL: https://issues.apache.org/jira/browse/SPARK-36177 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 3.0.3, 3.1.2 >Reporter: Hyukjin Kwon >Priority: Major > > CRAN check in branch-3.0 fails as below: > {code} > Insufficient package version (submitted: 3.0.4, existing: 3.1.2) > {code} > This is because CRAN doesn't allow a lower version than the latest version. > We can't upload, so we'd better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36177) Disable CRAN in branches lower than the latest version uploaded
Hyukjin Kwon created SPARK-36177: Summary: Disable CRAN in branches lower than the latest version uploaded Key: SPARK-36177 URL: https://issues.apache.org/jira/browse/SPARK-36177 Project: Spark Issue Type: Test Components: SparkR Affects Versions: 3.1.2, 3.0.3 Reporter: Hyukjin Kwon CRAN check in branch-3.0 fails as below: {code} Insufficient package version (submitted: 3.0.4, existing: 3.1.2) {code} This is because CRAN doesn't allow a lower version than the latest version. We can't upload, so we'd better just skip the CRAN check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
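The skip proposed in SPARK-36177 amounts to a version comparison before running the CRAN check. A minimal sketch of that guard (version strings are illustrative, and this is not Spark's actual release tooling):

```python
# Hypothetical version guard (illustrative values, not Spark's actual
# release scripts). CRAN rejects a submission whose version is lower
# than the version it already hosts, so the CRAN check should be
# skipped in that case.

def parse_version(v):
    # "3.0.4" -> (3, 0, 4); tuples compare component-wise.
    return tuple(int(part) for part in v.split("."))

def should_skip_cran_check(branch_version, cran_version):
    return parse_version(branch_version) < parse_version(cran_version)

print(should_skip_cran_check("3.0.4", "3.1.2"))  # True: branch-3.0 can't upload
print(should_skip_cran_check("3.2.0", "3.1.2"))  # False: a newer version is fine
```

With this guard, branch-3.0's CI would print a skip message instead of failing on the "Insufficient package version" error quoted above.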
[jira] [Updated] (SPARK-36176) Expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36176: - Summary: Expose tableExists in pyspark.sql.catalog (was: expose tableExists in pyspark.sql.catalog) > Expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381919#comment-17381919 ] canwang edited comment on SPARK-32530 at 7/16/21, 9:17 AM: --- I've been helping with JetBrains' [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api] too, and I also hope to see first-class support for the Kotlin language in the Apache Spark project. 1. I think a Kotlin API may be a better choice on the JVM for Spark developers. - As the description says, there are a lot of Kotlin developers now, they are growing fast, and more and more projects treat Kotlin as their first-class API; for example, the demos on Spring's web page now default to Kotlin. - As you said, very few developers use Java to develop Spark: although Spark supports Java perfectly, Java's syntax is not friendly for developing Spark. I believe they use Java because of the relatively high learning curve of Scala; Kotlin is much better in this respect, which is also reflected in the growth rate of Kotlin users. 2. The cost of adapting Kotlin may not be high. - The current [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api] already exists and is basically usable; migrating it to the Spark repo should only require adding more tests. - Judging from the existing [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api], the main adaptation work is handling the Serializer and Deserializer in the Encoder. This work can take the existing Java adaptation as a reference, and should be even simpler than Java's. was (Author: nonpool): I've been helping with the Jetbrains' [Kotlin Spark APIhttps://github.com/JetBrains/kotlin-spark-api] too,I also hope that first-class support for Kotlin language into the Apache Spark project 1. I think kotlin api may be a better choice on jvm for spark developers. - As the description says, there are a lot of kotlin developers now, and they are growing fast, and more and more projects use kotlin as the first-class api,For example the demo on spring's web page has defaulted to kotlin. - As you said, there are very few developers using java to develop spark, because although spark perfectly supports java, the syntax of java is not friendly to developing spark. I believe they use java because of the relatively long learning curve of scala. High, koltin is much better, which can also be reflected in the growth rate of koltin users 2. The cost of adapting kotlin may not be high - The current [Kotlin Spark APIhttps://github.com/JetBrains/kotlin-spark-api] already exists and it is basically usable. Migrating to the spark appliction repo should only need to add more tests. - Judging from the existing [Kotlin Spark APIhttps://github.com/JetBrains/kotlin-spark-api], the main work of adaptation is to process the Serializer and Deserializer in the Encoder. I think the workload of these adaptation work should be able to refer to the adaptation of java, and it is even simpler than java. , Because of the adaptation of java, kotlin has a reference > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. 
> * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. > We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The
[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381919#comment-17381919 ] canwang commented on SPARK-32530: - I've been helping with JetBrains' [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api] too, and I also hope to see first-class support for the Kotlin language in the Apache Spark project. 1. I think a Kotlin API may be a better choice on the JVM for Spark developers. - As the description says, there are a lot of Kotlin developers now, they are growing fast, and more and more projects treat Kotlin as their first-class API; for example, the demos on Spring's web page now default to Kotlin. - As you said, very few developers use Java to develop Spark: although Spark supports Java perfectly, Java's syntax is not friendly for developing Spark. I believe they use Java because of the relatively high learning curve of Scala; Kotlin is much better in this respect, which is also reflected in the growth rate of Kotlin users. 2. The cost of adapting Kotlin may not be high. - The current [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api] already exists and is basically usable; migrating it to the Spark repo should only require adding more tests. - Judging from the existing [Kotlin Spark API|https://github.com/JetBrains/kotlin-spark-api], the main adaptation work is handling the Serializer and Deserializer in the Encoder. This work can take the existing Java adaptation as a reference, and should be even simpler than Java's. > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. 
Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. > * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. > We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The goal of this project is to bring first-class support for Kotlin language > into the Apache Spark project. We’re going to achieve this by adding one more > module to the current Apache Spark distribution. > h2. Non-goals > There is no goal to replace any existing language support or to change any > existing Apache Spark API. > At this time, there is no goal to support non-core APIs of Apache Spark like > Spark ML and Spark structured streaming. This may change in the future based > on community feedback. > There is no goal to provide CLI for Kotlin for Apache Spark, this will be a > separate SPIP. > There is no goal to provide support for Apache Spark < 3.0.0. > h2. 
Current implementation > A working prototype is available at > [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside > JetBrains and by early adopters. > h2. What are the risks? > There is always a risk that this product won’t get enough popularity and will > bring more costs than benefits. It can be mitigated by the fact that we don't > need to change any existing API and support can be potentially dropped at any > time. > We also believe that existing API is rather low maintenance. It does not > bring anything more complex than already exists in the Spark codebase. > Furthermore, the implementation is compact - less than 2000 lines of code. > We are committed to maintaining, improving and evolving the API based on > feedback from both Spark and Kotlin communities. As the Kotlin data community > continues to grow, we see Kotlin API for A
[jira] [Resolved] (SPARK-36048) Wrong HealthTrackerSuite.allExecutorAndHostIds
[ https://issues.apache.org/jira/browse/SPARK-36048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-36048. -- Fix Version/s: 3.3.0 Assignee: wuyi Resolution: Fixed Issue resolved by [https://github.com/apache/spark/pull/33262] > Wrong HealthTrackerSuite.allExecutorAndHostIds > -- > > Key: SPARK-36048 > URL: https://issues.apache.org/jira/browse/SPARK-36048 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.3.0 > > > `HealthTrackerSuite.allExecutorAndHostIds` is mistakenly declared, which > means the executor exclusion isn't correctly tested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381904#comment-17381904 ] Apache Spark commented on SPARK-36176: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33388 > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36176: Assignee: Apache Spark > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Assignee: Apache Spark >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36176: Assignee: (was: Apache Spark) > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381902#comment-17381902 ] Apache Spark commented on SPARK-36176: -- User 'dominikgehl' has created a pull request for this issue: https://github.com/apache/spark/pull/33388 > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36176) expose tableExists in pyspark.sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominik Gehl updated SPARK-36176: - Summary: expose tableExists in pyspark.sql.catalog (was: expost tableExists in pyspark.sql.catalog) > expose tableExists in pyspark.sql.catalog > - > > Key: SPARK-36176 > URL: https://issues.apache.org/jira/browse/SPARK-36176 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Dominik Gehl >Priority: Minor > > Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36176) expost tableExists in pyspark.sql.catalog
Dominik Gehl created SPARK-36176: Summary: expost tableExists in pyspark.sql.catalog Key: SPARK-36176 URL: https://issues.apache.org/jira/browse/SPARK-36176 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.2 Reporter: Dominik Gehl Expose tableExists in PySpark; it is already part of the Scala implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
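The convenience SPARK-36176 asks for can be illustrated with a small stand-in (the `Catalog` class below is purely hypothetical; PySpark's real Catalog delegates to the JVM-side Scala implementation): `tableExists` turns a lookup-that-raises pattern into a simple boolean check.

```python
# Hypothetical stand-in catalog, NOT pyspark's implementation; it only
# illustrates the convenience the ticket requests.
class Catalog:
    def __init__(self, tables):
        self._tables = set(tables)

    def get_table(self, name):
        # Existing style of lookup: raises when the table is absent.
        if name not in self._tables:
            raise KeyError(f"Table or view not found: {name}")
        return name

    def table_exists(self, name):
        # Requested convenience: a boolean instead of try/except.
        return name in self._tables


catalog = Catalog(["people", "orders"])
print(catalog.table_exists("people"))     # True
print(catalog.table_exists("customers"))  # False
```

Callers can then branch on the result directly instead of wrapping every lookup in exception handling.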
[jira] [Assigned] (SPARK-35710) Support DPP + AQE when no reused broadcast exchange
[ https://issues.apache.org/jira/browse/SPARK-35710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35710: --- Assignee: Ke Jia > Support DPP + AQE when no reused broadcast exchange > --- > > Key: SPARK-35710 > URL: https://issues.apache.org/jira/browse/SPARK-35710 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > Support DPP + AQE when no reused broadcast exchange. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35710) Support DPP + AQE when no reused broadcast exchange
[ https://issues.apache.org/jira/browse/SPARK-35710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35710. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32861 [https://github.com/apache/spark/pull/32861] > Support DPP + AQE when no reused broadcast exchange > --- > > Key: SPARK-35710 > URL: https://issues.apache.org/jira/browse/SPARK-35710 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.2.0 > > > Support DPP + AQE when no reused broadcast exchange. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36099) Group exception messages in core/util
[ https://issues.apache.org/jira/browse/SPARK-36099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381874#comment-17381874 ] Shockang commented on SPARK-36099: -- [~allisonwang-db] Could I work on this issue? > Group exception messages in core/util > - > > Key: SPARK-36099 > URL: https://issues.apache.org/jira/browse/SPARK-36099 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > 'core/src/main/scala/org/apache/spark/util' > || Filename || Count || > | AccumulatorV2.scala | 4 | > | ClosureCleaner.scala | 1 | > | DependencyUtils.scala | 1 | > | KeyLock.scala | 1 | > | ListenerBus.scala | 1 | > | NextIterator.scala | 1 | > | SerializableBuffer.scala | 2 | > | ThreadUtils.scala | 4 | > | Utils.scala | 16 | > 'core/src/main/scala/org/apache/spark/util/collection' > || Filename || Count || > | AppendOnlyMap.scala | 1 | > | CompactBuffer.scala | 1 | > | ImmutableBitSet.scala | 6 | > | MedianHeap.scala | 1 | > | OpenHashSet.scala | 2 | > 'core/src/main/scala/org/apache/spark/util/io' > || Filename || Count || > | ChunkedByteBuffer.scala | 1 | > 'core/src/main/scala/org/apache/spark/util/logging' > || Filename || Count || > | DriverLogger.scala | 1 | > 'core/src/main/scala/org/apache/spark/util/random' > || Filename || Count || > | RandomSampler.scala | 1 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35972) When replace ExtractValue in NestedColumnAliasing we should use semanticEquals
[ https://issues.apache.org/jira/browse/SPARK-35972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35972: Fix Version/s: 3.1.3 > When replace ExtractValue in NestedColumnAliasing we should use semanticEquals > -- > > Key: SPARK-35972 > URL: https://issues.apache.org/jira/browse/SPARK-35972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > {code:java} > Job aborted due to stage failure: Task 47 in stage 1.0 failed 4 times, most > recent failure: Lost task 47.3 in stage 1.0 (TID 328) > (ip-idata-server.shopee.io executor 3): > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: _gen_alias_788#788 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at 
scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > 
at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalys
[jira] [Assigned] (SPARK-35972) When replace ExtractValue in NestedColumnAliasing we should use semanticEquals
[ https://issues.apache.org/jira/browse/SPARK-35972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35972: --- Assignee: angerszhu > When replace ExtractValue in NestedColumnAliasing we should use semanticEquals > -- > > Key: SPARK-35972 > URL: https://issues.apache.org/jira/browse/SPARK-35972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > {code:java} > Job aborted due to stage failure: Task 47 in stage 1.0 failed 4 times, most > recent failure: Lost task 47.3 in stage 1.0 (TID 328) > (ip-idata-server.shopee.io executor 3): > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: _gen_alias_788#788 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at 
> scala.collection.immutable.List.map(List.scala:298)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.immutable.List.map(List.scala:298)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
> at org.apache.spark.sql.catalyst
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381857#comment-17381857 ]

Apache Spark commented on SPARK-36034:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/33387

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-36034
>                 URL: https://issues.apache.org/jira/browse/SPARK-36034
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Willi Raschkowski
>            Assignee: Max Gekk
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 3.2.0, 3.1.3, 3.3.0
>
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or
> by Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", "date = '0001-01-01'").show()
> +----------+-------------------+
> |      date|(date = 0001-01-01)|
> +----------+-------------------+
> |0001-01-01|               true|
> +----------+-------------------+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = '0001-01-01'").show()
> +----------+
> |      date|
> +----------+
> |0001-01-01|
> +----------+
> {code}
> This is how we get incorrect results in _legacy_ mode, where the filter
> drops rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", "date = '0001-01-01'").show()
> +----------+-------------------+
> |      date|(date = 0001-01-01)|
> +----------+-------------------+
> |0001-01-01|               true|
> +----------+-------------------+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
> +----+
> |date|
> +----+
> +----+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>    +- FileScan parquet [date#154] Batched: true, DataFilters: [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
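The `-719162` in the explain() output above is the proleptic Gregorian day count for `0001-01-01`, which Spark 3 pushes down as the filter literal. Legacy-written files store day counts rebased to the hybrid Julian calendar instead, so the equality matches no on-disk value. A minimal sketch in plain Python (no Spark needed; the two-day Julian/Gregorian offset near year 1 is an assumption for illustration, not taken from this issue):

```python
from datetime import date

# Python's datetime uses the proleptic Gregorian calendar, the same
# calendar Spark 3 uses internally for DATE values (days since epoch).
epoch = date(1970, 1, 1)

# Day count that Spark 3 pushes down for the literal DATE '0001-01-01':
gregorian_days = (date(1, 1, 1) - epoch).days
print(gregorian_days)  # -719162, matching the Filter in the explain() output

# LEGACY mode rebases dates to the hybrid Julian calendar before writing.
# Near year 1 the two calendars differ by about two days (assumed offset
# here), so the on-disk day count is shifted and the pushed-down equality
# filter matches no rows.
julian_shift_days = 2  # hypothetical offset for dates near year 1
print(gregorian_days - julian_shift_days)
```

This is why `selectExpr` (which rebases values on read before comparing) returns `true` while the pushed-down Parquet filter (which compares raw day counts) drops the row.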