[jira] [Comment Edited] (SPARK-33638) Full support of V2 table creation in Structured Streaming writer path
[ https://issues.apache.org/jira/browse/SPARK-33638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243803#comment-17243803 ]

Jungtaek Lim edited comment on SPARK-33638 at 12/4/20, 7:56 AM:
---------------------------------------------------------------

I don't agree with handling this in DataStreamWriter, hence I changed the title. My claim is that we should design DataStreamWriterV2, nothing else.

I also don't agree that we need to deal with partition column verification in such a way. DataFrameWriterV2 handles this nicely, by branching the path between appending/overwriting/truncating a table vs. creating/replacing a table, and enforcing the latter whenever a table-creation configuration is provided. I think this is much clearer for end users than making them worry about the impact.

For sure, even if we address it with DataStreamWriterV2, we still need to deal with consistency in DataStreamWriter.toTable(). Given that DataStreamWriterV2 would take its place and be recommended for table writes, that would be less important.
> Full support of V2 table creation in Structured Streaming writer path
> ----------------------------------------------------------------------
>
>                 Key: SPARK-33638
>                 URL: https://issues.apache.org/jira/browse/SPARK-33638
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Yuanjian Li
>            Priority: Blocker
>
> Currently, we want to add support for create-if-not-exists in the
> DataStreamWriter.toTable API. Since the file format in streaming doesn't
> support DSv2 for now, the current implementation mainly focuses on V1
> support. More work is needed for full support of V2 table creation.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
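The append-vs-create branching that DataFrameWriterV2 performs can be sketched as follows. This is a minimal Python model with hypothetical names, not Spark's actual implementation: the point is that creation-time options such as partitioning are only honored on the create/replace path and rejected on the append path, so users never have to wonder which one "wins" against an existing table.

```python
# Toy model of the DataFrameWriterV2 branching idea (hypothetical names).
class WriterV2:
    def __init__(self, table, existing_tables):
        self.table = table
        self.existing = existing_tables      # name -> partition columns
        self.partitioning = None

    def partitioned_by(self, *cols):
        self.partitioning = cols             # only meaningful when creating
        return self

    def create(self):
        # create/replace path: creation options are enforced here
        if self.table in self.existing:
            raise ValueError(f"table {self.table} already exists")
        self.existing[self.table] = self.partitioning or ()
        return f"created {self.table} partitioned by {self.partitioning}"

    def append(self):
        # append path: the existing table's layout is authoritative
        if self.table not in self.existing:
            raise ValueError(f"table {self.table} does not exist")
        if self.partitioning is not None:
            raise ValueError("partitionedBy is only valid when creating a table")
        return f"appended to {self.table}"

catalog = {"sales": ("day",)}
print(WriterV2("events", catalog).partitioned_by("day").create())
print(WriterV2("sales", catalog).append())
```

In the real API the same separation shows up as distinct terminal methods (`create()`, `createOrReplace()`, `append()`, etc.), which is what makes the behavior explicit for end users.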
[jira] [Commented] (SPARK-33638) Full support of V2 table creation in Structured Streaming writer path
[ https://issues.apache.org/jira/browse/SPARK-33638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243803#comment-17243803 ]

Jungtaek Lim commented on SPARK-33638:
--------------------------------------

I don't agree with handling this in DataStreamWriter, hence I changed the title. My claim is that we should design DataStreamWriterV2, nothing else.

I also don't agree that we need to deal with partition column verification in such a way. DataFrameWriterV2 handles this nicely, by branching the path between appending/overwriting/truncating a table vs. creating/replacing a table, and enforcing the latter whenever a table-creation configuration is provided. I think this is much clearer for end users than making them worry about the impact.

For sure, even if we address it with DataStreamWriterV2, we still need to deal with consistency in DataStreamWriter.toTable(). Given that DataStreamWriterV2 would take its place and be recommended for table writes, that would be less important.
[jira] [Updated] (SPARK-33638) Full support of V2 table creation in Structured Streaming writer path
[ https://issues.apache.org/jira/browse/SPARK-33638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-33638:
---------------------------------
    Summary: Full support of V2 table creation in Structured Streaming writer path  (was: Full support of V2 table creation in DataStreamWriter.toTable API)
[jira] [Resolved] (SPARK-33656) Add option to keep container after tests finish for DockerJDBCIntegrationSuites for debug
[ https://issues.apache.org/jira/browse/SPARK-33656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-33656.
-----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30601
[https://github.com/apache/spark/pull/30601]

> Add option to keep container after tests finish for DockerJDBCIntegrationSuites for debug
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33656
>                 URL: https://issues.apache.org/jira/browse/SPARK-33656
>             Project: Spark
>          Issue Type: Improvement
>          Components: Tests
>    Affects Versions: 3.1.0
>            Reporter: Kousuke Saruta
>            Assignee: Kousuke Saruta
>            Priority: Minor
>             Fix For: 3.1.0
>
> The DockerJDBCIntegrationSuites (e.g. DB2IntegrationSuite, PostgresIntegrationSuite) launch a Docker container which is removed after the tests finish.
> An option to keep the container would be useful for debugging.
[jira] [Resolved] (SPARK-33577) Add support for V1Table in stream writer table API
[ https://issues.apache.org/jira/browse/SPARK-33577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-33577.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30521
[https://github.com/apache/spark/pull/30521]

> Add support for V1Table in stream writer table API
> --------------------------------------------------
>
>                 Key: SPARK-33577
>                 URL: https://issues.apache.org/jira/browse/SPARK-33577
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Yuanjian Li
>            Assignee: Yuanjian Li
>            Priority: Major
>             Fix For: 3.1.0
>
> After SPARK-32896, we have a table API for the stream writer, but it only supports DataSource V2 tables. Here we add the following enhancements:
> * Create non-existing tables by default
> * Support both managed and external V1Tables
[jira] [Assigned] (SPARK-33577) Add support for V1Table in stream writer table API
[ https://issues.apache.org/jira/browse/SPARK-33577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim reassigned SPARK-33577:
------------------------------------
    Assignee: Yuanjian Li
[jira] [Commented] (SPARK-33659) Document the current behavior for DataStreamWriter.toTable API
[ https://issues.apache.org/jira/browse/SPARK-33659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243795#comment-17243795 ]

Yuanjian Li commented on SPARK-33659:
-------------------------------------

I'm working on this.
[jira] [Created] (SPARK-33659) Document the current behavior for DataStreamWriter.toTable API
Yuanjian Li created SPARK-33659:
-----------------------------------

             Summary: Document the current behavior for DataStreamWriter.toTable API
                 Key: SPARK-33659
                 URL: https://issues.apache.org/jira/browse/SPARK-33659
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.1.0
            Reporter: Yuanjian Li

Follow-up work for SPARK-33577; it needs to be done before the 3.1 release. Since we don't have full support for V2 table creation in the API, the following documentation work is needed:
* Figure out the effects when the configurations (provider/partitionBy) conflict with an existing table, and document them in the javadoc of {{toTable}}. I think you'll need to make a matrix and describe which takes effect (table vs. input): creating table vs. table exists, DSv1 vs. DSv2 (all 4 situations should be documented).
* Document the lack of functionality for creating a V2 table in the javadoc of {{toTable}}, and guide users to create the table beforehand, to avoid an unintended/insufficient table being created.
[jira] [Resolved] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-33571.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/30596

> Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33571
>                 URL: https://issues.apache.org/jira/browse/SPARK-33571
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.0, 3.0.1
>            Reporter: Simon
>            Priority: Major
>             Fix For: 3.1.0
>
> The handling of old dates written with older Spark versions (<2.4.6) using the hybrid calendar seems to be broken/not working correctly in Spark 3.0.0 and 3.0.1.
> From what I understand it should work like this:
> * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
> * Only applies when reading or writing Parquet files
> * When reading Parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInRead`
> * When reading Parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1 (with, for example, `df.show()`) as they did in Spark 2.4.5
> * When reading Parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1 (with, for example, `df.show()`) than they did in Spark 2.4.5
> * When writing Parquet files with Spark > 3.0.0 which contain dates or timestamps before the above-mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInWrite`
> First of all, I'm not 100% sure all of this is correct. I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)], the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
> * Reading Parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` with timestamps before the above-mentioned moments in time, without `datetimeRebaseModeInRead` set, doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
> * Reading Parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` or `DateType` with dates or timestamps before the above-mentioned moments in time, with `datetimeRebaseModeInRead` set to `LEGACY`, results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.
> I've made some scripts to help with testing/show the behavior; they use pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here: [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the outputs in a comment below as well.
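To see why rebasing matters for such old dates: the hybrid calendar jumps straight from 1582-10-04 to 1582-10-15, while Python's `datetime` (like Spark 3.x) uses the proleptic Gregorian calendar, in which those two labels are 11 days apart. A minimal illustration follows; note this is a simplification — the real rebase offset varies by era and is handled inside Spark, and the 10-day shift shown applies only near the cutover.

```python
from datetime import date, timedelta

# In the hybrid Julian/Gregorian calendar, 1582-10-04 is immediately
# followed by 1582-10-15. In the proleptic Gregorian calendar those
# same labels are 11 days apart.
gap = (date(1582, 10, 15) - date(1582, 10, 4)).days
print(gap)  # 11

# So a day count stored by Spark 2.x (hybrid calendar) for a pre-cutover
# date decodes to a different calendar label when reinterpreted as
# proleptic Gregorian. LEGACY mode rebases to compensate; CORRECTED
# keeps the raw value. E.g. shifting a pre-cutover label by the 10-day
# cutover offset:
hybrid_label = date(1582, 10, 4)
proleptic_view = hybrid_label + timedelta(days=10)
print(proleptic_view)  # 1582-10-14
```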
[jira] [Resolved] (SPARK-33658) Suggest using datetime conversion functions for invalid ANSI casting
[ https://issues.apache.org/jira/browse/SPARK-33658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-33658.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30603
[https://github.com/apache/spark/pull/30603]

> Suggest using datetime conversion functions for invalid ANSI casting
> --------------------------------------------------------------------
>
>                 Key: SPARK-33658
>                 URL: https://issues.apache.org/jira/browse/SPARK-33658
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 3.1.0
>
> In ANSI mode, an explicit cast between DateTime types and Numeric types is not allowed.
> As of now, we have introduced the new functions UNIX_SECONDS/UNIX_MILLIS/UNIX_MICROS/UNIX_DATE/DATE_FROM_UNIX_DATE, so we can show suggestions to users to help them perform these type conversions precisely and easily in ANSI mode.
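The idea can be sketched in plain Python (hypothetical helper names; the suggested SQL functions are the ones listed in the issue): when an ANSI-mode cast between a datetime type and a numeric type is rejected, the error message names the dedicated conversion function instead.

```python
# Map a rejected (source, target) cast to the dedicated conversion
# function the user should call instead of CAST.
SUGGESTIONS = {
    ("TIMESTAMP", "BIGINT"): "UNIX_SECONDS",   # also UNIX_MILLIS / UNIX_MICROS
    ("DATE", "INT"): "UNIX_DATE",
    ("INT", "DATE"): "DATE_FROM_UNIX_DATE",
}

def ansi_cast_error(src: str, dst: str) -> str:
    msg = f"cannot cast {src} to {dst} in ANSI mode"
    hint = SUGGESTIONS.get((src, dst))
    return f"{msg}; consider using {hint} instead" if hint else msg

print(ansi_cast_error("DATE", "INT"))
# cannot cast DATE to INT in ANSI mode; consider using UNIX_DATE instead
```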
[jira] [Resolved] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33430.
---------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30473
[https://github.com/apache/spark/pull/30473]

> Support namespaces in JDBC v2 Table Catalog
> -------------------------------------------
>
>                 Key: SPARK-33430
>                 URL: https://issues.apache.org/jira/browse/SPARK-33430
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Huaxin Gao
>            Priority: Major
>             Fix For: 3.1.0
>
> When I extend JDBCTableCatalogSuite by org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance:
> {code:scala}
> import org.apache.spark.sql.execution.command.v2.ShowTablesSuite
> class JDBCTableCatalogSuite extends ShowTablesSuite {
>   override def version: String = "JDBC V2"
>   override def catalog: String = "h2"
>   ...
> {code}
> some tests from JDBCTableCatalogSuite fail with:
> {code}
> [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 seconds, 502 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does not support namespaces;
> [info]   at org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83)
> [info]   at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208)
> [info]   at org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34)
> {code}
[jira] [Assigned] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33430:
-----------------------------------
    Assignee: Huaxin Gao
[jira] [Updated] (SPARK-33638) Full support of V2 table creation in DataStreamWriter.toTable API
[ https://issues.apache.org/jira/browse/SPARK-33638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuanjian Li updated SPARK-33638:
--------------------------------
    Priority: Blocker  (was: Major)
[jira] [Assigned] (SPARK-33142) SQL temp view should store SQL text as well
[ https://issues.apache.org/jira/browse/SPARK-33142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33142:
-----------------------------------
    Assignee: Linhong Liu

> SQL temp view should store SQL text as well
> -------------------------------------------
>
>                 Key: SPARK-33142
>                 URL: https://issues.apache.org/jira/browse/SPARK-33142
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Assignee: Linhong Liu
>            Priority: Major
>             Fix For: 3.1.0
>
> TODO
[jira] [Resolved] (SPARK-33142) SQL temp view should store SQL text as well
[ https://issues.apache.org/jira/browse/SPARK-33142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33142.
---------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30567
[https://github.com/apache/spark/pull/30567]
[jira] [Assigned] (SPARK-33647) cache table not working for persisted view
[ https://issues.apache.org/jira/browse/SPARK-33647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33647:
-----------------------------------
    Assignee: Linhong Liu

> cache table not working for persisted view
> ------------------------------------------
>
>                 Key: SPARK-33647
>                 URL: https://issues.apache.org/jira/browse/SPARK-33647
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Linhong Liu
>            Assignee: Linhong Liu
>            Priority: Major
>
> In `CacheManager`, tables (including views) are cached by their logical plans, and `QueryPlan.sameResult` is used to look up the cache. But PersistedView wraps the child plan with a `View` node, which always makes the `sameResult` check return false.
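The cache miss described above can be modeled in a few lines of plain Python. This is a simplified stand-in for Spark's CacheManager, `QueryPlan.sameResult`, and the `View` wrapper (structural equality stands in for `sameResult`): the lookup compares the View-wrapped plan against the cached child plan, so the wrapper alone defeats the match.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:            # leaf plan node reading a table
    table: str

@dataclass(frozen=True)
class View:            # wrapper added around a persisted view's plan
    child: object

def same_result(a, b):
    # structural equality as a stand-in for QueryPlan.sameResult
    return a == b

cache = [Scan("t")]                 # plan cached via CACHE TABLE

wrapped = View(Scan("t"))           # plan produced when reading the view
hit = any(same_result(wrapped, c) for c in cache)
print(hit)                          # False: the View wrapper breaks the match

# The fix amounts to comparing the plan underneath the wrapper:
hit_unwrapped = any(same_result(wrapped.child, c) for c in cache)
print(hit_unwrapped)                # True
```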
[jira] [Resolved] (SPARK-33647) cache table not working for persisted view
[ https://issues.apache.org/jira/browse/SPARK-33647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33647.
---------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30567
[https://github.com/apache/spark/pull/30567]
[jira] [Commented] (SPARK-33647) cache table not working for persisted view
[ https://issues.apache.org/jira/browse/SPARK-33647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243758#comment-17243758 ]

Apache Spark commented on SPARK-33647:
--------------------------------------

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/30567
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243748#comment-17243748 ]

leesf commented on SPARK-29625:
-------------------------------

Hi [~sanysand...@gmail.com], any updates here? How did you solve the error? Thanks.

> Spark Structure Streaming Kafka Wrong Reset Offset twice
> --------------------------------------------------------
>
>                 Key: SPARK-29625
>                 URL: https://issues.apache.org/jira/browse/SPARK-29625
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Sandish Kumar HN
>            Priority: Major
>
> Spark Structured Streaming Kafka resets the offset twice: once with the right offsets and a second time with very old offsets.
> {code}
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-151 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-118 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-85 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 122677634.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > [2019-10-28 19:27:
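The failure above comes from the Kafka source detecting that a partition's starting offset moved backwards between resolved batches. A minimal sketch of that check in plain Python (hypothetical helper, not Spark's actual code) — depending on `failOnDataLoss`, the source either fails the query or logs a warning:

```python
def resolve_batch(previous: dict, current: dict, fail_on_data_loss: bool = True):
    """Compare a partition -> starting-offset map against the previous batch.

    An offset that regressed means records between the two offsets may have
    been aged out of Kafka and can no longer be replayed.
    """
    warnings = []
    for partition, prev_off in previous.items():
        cur_off = current.get(partition, prev_off)
        if cur_off < prev_off:
            msg = (f"Partition {partition}'s offset was changed from "
                   f"{prev_off} to {cur_off}, some data may have been missed.")
            if fail_on_data_loss:
                raise RuntimeError(msg)
            warnings.append(msg)
    return warnings

# The regressed offsets from the log above:
warns = resolve_batch({"topic-52": 122677598}, {"topic-52": 120504922},
                      fail_on_data_loss=False)
print(warns[0])
```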
[jira] [Updated] (SPARK-24607) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-24607:
-----------------------------
    Labels: bulk-closed correctness  (was: bulk-closed)

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: SPARK-24607
>                 URL: https://issues.apache.org/jira/browse/SPARK-24607
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.3.1
>            Reporter: zenglinxi
>            Priority: Major
>              Labels: bulk-closed, correctness
>
> Noticed that the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build a cube with HiveQL that includes distribute by rand; data inconsistency may happen during fault-tolerance operations. Since Spark has a similar fault-tolerance mechanism, I think it is also a hidden serious problem in Spark SQL.
[jira] [Comment Edited] (SPARK-24607) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243743#comment-17243743 ] Kent Yao edited comment on SPARK-24607 at 12/4/20, 6:17 AM: This could happen when the map stage retries: the same record in a map task may target different reduce tasks across task attempts. This could result in an incomplete result set when a non-deterministic expression, e.g. rand(), is introduced in jobs that need a shuffle, e.g. aggregates or sort-merge joins. We may need a random but replayable function to handle these use cases, because this is a common way for users to deal with data skewness. Otherwise, we may forbid non-deterministic functions in shuffle-related operations. cc [~cloud_fan] [~ulysses] [~maropu] was (Author: qin yao): This could happen when the map stage retries, the same record that in the map task probably targets to different reduce tasks among task attempts. his could result in an incomplete result set when introducing a non-deterministic expression, e.g. rand(), in the jobs that need shuffle, e.g. aggregates, sort-merge join. We may need a random but replayable function to handle these use cases because it is a common way that users use to deal with data skewness. Otherwise, we may forbid non-deterministic functions to be used shuffle related operations. 
cc [~cloud_fan] [~ulysses] [~maropu] > Distribute by rand() can lead to data inconsistency > --- > > Key: SPARK-24607 > URL: https://issues.apache.org/jira/browse/SPARK-24607 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: zenglinxi >Priority: Major > Labels: bulk-closed > > Noticed the following queries can give different results: > {code:java} > select count(*) from tbl; > select count(*) from (select * from tbl distribute by rand()) a;{code} > This issue was first reported by someone using Kylin to build cubes with Hive SQL that includes distribute by rand(); data inconsistency may happen > during failure-tolerance operations. Since Spark has a similar > failure-tolerance mechanism, I think it is also a hidden serious problem in Spark SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24607) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243743#comment-17243743 ] Kent Yao commented on SPARK-24607: -- This could happen when the map stage retries: the same record in a map task may target different reduce tasks across task attempts. This could result in an incomplete result set when a non-deterministic expression, e.g. rand(), is introduced in jobs that need a shuffle, e.g. aggregates or sort-merge joins. We may need a random but replayable function to handle these use cases, because this is a common way for users to deal with data skewness. Otherwise, we may forbid non-deterministic functions in shuffle-related operations. cc [~cloud_fan] [~ulysses] [~maropu] > Distribute by rand() can lead to data inconsistency > --- > > Key: SPARK-24607 > URL: https://issues.apache.org/jira/browse/SPARK-24607 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: zenglinxi >Priority: Major > Labels: bulk-closed > > Noticed the following queries can give different results: > {code:java} > select count(*) from tbl; > select count(*) from (select * from tbl distribute by rand()) a;{code} > This issue was first reported by someone using Kylin to build cubes with Hive SQL that includes distribute by rand(); data inconsistency may happen > during failure-tolerance operations. Since Spark has a similar > failure-tolerance mechanism, I think it is also a hidden serious problem in Spark SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
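The failure mode described above can be simulated outside Spark. The following is a minimal, hypothetical Python sketch (none of these names are Spark APIs): a partitioner driven by a random source whose seed differs between task attempts reassigns rows on retry, while a deterministic hash of the row content does not.

```python
import random

def partition_random(rows, num_parts, seed):
    """Assign each row to a reduce partition at random. The seed stands in
    for per-attempt randomness, which differs between the original task
    attempt and a retry."""
    rng = random.Random(seed)
    return {row: rng.randrange(num_parts) for row in rows}

def partition_hash(rows, num_parts):
    """Deterministic alternative: partition by a hash of the row content,
    which is stable across task attempts within a run."""
    return {row: hash(row) % num_parts for row in rows}

rows = [f"record-{i}" for i in range(100)]

# Two "attempts" of the same map task with random partitioning: the
# assignments disagree, so a reducer that already fetched attempt 1's
# output sees a different row set from attempt 2 — rows can be lost or
# duplicated, which is exactly the count(*) mismatch reported here.
attempt1 = partition_random(rows, 8, seed=1)
attempt2 = partition_random(rows, 8, seed=2)

# The hash-based partitioner produces identical assignments every time.
stable1 = partition_hash(rows, 8)
stable2 = partition_hash(rows, 8)
```

This is the intuition behind replacing `distribute by rand()` with a deterministic expression over the row's columns when the goal is merely to spread skewed data.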
[jira] [Assigned] (SPARK-33658) Suggest using datetime conversion functions for invalid ANSI casting
[ https://issues.apache.org/jira/browse/SPARK-33658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33658: Assignee: Gengliang Wang (was: Apache Spark) > Suggest using datetime conversion functions for invalid ANSI casting > > > Key: SPARK-33658 > URL: https://issues.apache.org/jira/browse/SPARK-33658 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > In ANSI mode, explicit cast between DateTime types and Numeric types is not > allowed. > As of now, we have introduced new functions > UNIX_SECONDS/UNIX_MILLIS/UNIX_MICROS/UNIX_DATE/DATE_FROM_UNIX_DATE, we can > show suggestions to users so that they can complete these type conversions > precisely and easily in ANSI mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33658) Suggest using datetime conversion functions for invalid ANSI casting
[ https://issues.apache.org/jira/browse/SPARK-33658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243690#comment-17243690 ] Apache Spark commented on SPARK-33658: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/30603 > Suggest using datetime conversion functions for invalid ANSI casting > > > Key: SPARK-33658 > URL: https://issues.apache.org/jira/browse/SPARK-33658 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > In ANSI mode, explicit cast between DateTime types and Numeric types is not > allowed. > As of now, we have introduced new functions > UNIX_SECONDS/UNIX_MILLIS/UNIX_MICROS/UNIX_DATE/DATE_FROM_UNIX_DATE, we can > show suggestions to users so that they can complete these type conversions > precisely and easily in ANSI mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33658) Suggest using datetime conversion functions for invalid ANSI casting
[ https://issues.apache.org/jira/browse/SPARK-33658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33658: Assignee: Apache Spark (was: Gengliang Wang) > Suggest using datetime conversion functions for invalid ANSI casting > > > Key: SPARK-33658 > URL: https://issues.apache.org/jira/browse/SPARK-33658 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > In ANSI mode, explicit cast between DateTime types and Numeric types is not > allowed. > As of now, we have introduced new functions > UNIX_SECONDS/UNIX_MILLIS/UNIX_MICROS/UNIX_DATE/DATE_FROM_UNIX_DATE, we can > show suggestions to users so that they can complete these type conversions > precisely and easily in ANSI mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33658) Suggest using datetime conversion functions for invalid ANSI casting
Gengliang Wang created SPARK-33658: -- Summary: Suggest using datetime conversion functions for invalid ANSI casting Key: SPARK-33658 URL: https://issues.apache.org/jira/browse/SPARK-33658 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Gengliang Wang Assignee: Gengliang Wang In ANSI mode, explicit cast between DateTime types and Numeric types is not allowed. Now that we have introduced the new functions UNIX_SECONDS/UNIX_MILLIS/UNIX_MICROS/UNIX_DATE/DATE_FROM_UNIX_DATE, we can show suggestions to users so that they can complete these type conversions precisely and easily in ANSI mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
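For illustration, the semantics of the functions the error message would suggest can be modeled in plain Python. This is a sketch of the intended conversions, not Spark's implementation:

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = date(1970, 1, 1)

def unix_date(d: date) -> int:
    """Mirrors UNIX_DATE: days since 1970-01-01."""
    return (d - EPOCH).days

def date_from_unix_date(days: int) -> date:
    """Mirrors DATE_FROM_UNIX_DATE: the inverse of unix_date."""
    return EPOCH + timedelta(days=days)

def unix_seconds(ts: datetime) -> int:
    """Mirrors UNIX_SECONDS: whole seconds since the epoch (UTC)."""
    return int(ts.replace(tzinfo=timezone.utc).timestamp())
```

So instead of `CAST(ts AS BIGINT)`, which ANSI mode rejects, a query would call `UNIX_SECONDS(ts)`, making the unit of the conversion explicit.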
[jira] [Assigned] (SPARK-33657) After spark sql is executed to generate hdfs data, the relevant status information is printed
[ https://issues.apache.org/jira/browse/SPARK-33657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33657: Assignee: (was: Apache Spark) > After spark sql is executed to generate hdfs data, the relevant status > information is printed > -- > > Key: SPARK-33657 > URL: https://issues.apache.org/jira/browse/SPARK-33657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: guihuawen >Priority: Major > > After Spark SQL executes and generates HDFS data, the relevant status > information should be printed, which would be very user-friendly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33657) After spark sql is executed to generate hdfs data, the relevant status information is printed
[ https://issues.apache.org/jira/browse/SPARK-33657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243687#comment-17243687 ] Apache Spark commented on SPARK-33657: -- User 'guixiaowen' has created a pull request for this issue: https://github.com/apache/spark/pull/30602 > After spark sql is executed to generate hdfs data, the relevant status > information is printed > -- > > Key: SPARK-33657 > URL: https://issues.apache.org/jira/browse/SPARK-33657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: guihuawen >Priority: Major > > After Spark SQL executes and generates HDFS data, the relevant status > information should be printed, which would be very user-friendly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33657) After spark sql is executed to generate hdfs data, the relevant status information is printed
[ https://issues.apache.org/jira/browse/SPARK-33657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33657: Assignee: Apache Spark > After spark sql is executed to generate hdfs data, the relevant status > information is printed > -- > > Key: SPARK-33657 > URL: https://issues.apache.org/jira/browse/SPARK-33657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: guihuawen >Assignee: Apache Spark >Priority: Major > > After Spark SQL executes and generates HDFS data, the relevant status > information should be printed, which would be very user-friendly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33632) to_date doesn't behave as documented
[ https://issues.apache.org/jira/browse/SPARK-33632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243681#comment-17243681 ] Liu Neng edited comment on SPARK-33632 at 12/4/20, 3:46 AM: This is not an issue; you may have misunderstood the docs. You should use the pattern M/d/yy (capital 'M' is month-of-year; lowercase 'm' is minute-of-hour). The parse mode is determined by the count of the letter 'y'. Below is the source code from DateTimeFormatterBuilder. !image-2020-12-04-11-45-10-379.png! was (Author: qwe1398775315): you should use pattern m/d/yy, parse mode is determined by count of letter 'y'. below is source code from DateTimeFormatterBuilder. !image-2020-12-04-11-45-10-379.png! > to_date doesn't behave as documented > > > Key: SPARK-33632 > URL: https://issues.apache.org/jira/browse/SPARK-33632 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Frank Oosterhuis >Priority: Major > Attachments: image-2020-12-04-11-45-10-379.png > > > I'm trying to use to_date on a string formatted as "10/31/20". > Expected output is "2020-10-31". > Actual output is "0020-01-31". > The > [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html] > suggests 2020 or 20 as input for "y". > Example below. Expected behaviour is included in the udf. 
> {code:scala} > import java.sql.Date > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions.{to_date, udf} > object ToDate { > val toDate = udf((date: String) => { > val split = date.split("/") > val month = "%02d".format(split(0).toInt) > val day = "%02d".format(split(1).toInt) > val year = split(2).toInt + 2000 > Date.valueOf(s"${year}-${month}-${day}") > }) > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder().master("local[2]").getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.implicits._ > Seq("1/1/20", "10/31/20") > .toDF("raw") > .withColumn("to_date", to_date($"raw", "m/d/y")) > .withColumn("udf", toDate($"raw")) > .show > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
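The two-digit-year behavior discussed in this thread has a close analogue in Python's strptime, shown here as a sketch. Note the notations differ: Python's `%y` always parses two digits as a reduced year with a 2000-era base, which is the behavior the reporter expected from Spark's `yy`, while in Spark/java.time patterns a single `y` parses "20" literally as year 20.

```python
from datetime import datetime

# Two-digit year parsed as a reduced year: "20" -> 2020.
# (In Python, %m is month and %y is two-digit year; in Spark the
# equivalent fix would be a pattern like "M/d/yy".)
parsed = datetime.strptime("10/31/20", "%m/%d/%y").date()
```

This mirrors why `to_date($"raw", "m/d/y")` in the repro yields "0020-01-31": in java.time patterns, lowercase `m` is minute and a lone `y` takes the digits at face value.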
[jira] [Updated] (SPARK-33632) to_date doesn't behave as documented
[ https://issues.apache.org/jira/browse/SPARK-33632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Neng updated SPARK-33632: - Attachment: image-2020-12-04-11-45-10-379.png > to_date doesn't behave as documented > > > Key: SPARK-33632 > URL: https://issues.apache.org/jira/browse/SPARK-33632 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Frank Oosterhuis >Priority: Major > Attachments: image-2020-12-04-11-45-10-379.png > > > I'm trying to use to_date on a string formatted as "10/31/20". > Expected output is "2020-10-31". > Actual output is "0020-01-31". > The > [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html] > suggests 2020 or 20 as input for "y". > Example below. Expected behaviour is included in the udf. > {code:scala} > import java.sql.Date > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions.{to_date, udf} > object ToDate { > val toDate = udf((date: String) => { > val split = date.split("/") > val month = "%02d".format(split(0).toInt) > val day = "%02d".format(split(1).toInt) > val year = split(2).toInt + 2000 > Date.valueOf(s"${year}-${month}-${day}") > }) > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder().master("local[2]").getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.implicits._ > Seq("1/1/20", "10/31/20") > .toDF("raw") > .withColumn("to_date", to_date($"raw", "m/d/y")) > .withColumn("udf", toDate($"raw")) > .show > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33632) to_date doesn't behave as documented
[ https://issues.apache.org/jira/browse/SPARK-33632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243681#comment-17243681 ] Liu Neng commented on SPARK-33632: -- You should use the pattern M/d/yy (capital 'M' is month-of-year; lowercase 'm' is minute-of-hour). The parse mode is determined by the count of the letter 'y'. Below is the source code from DateTimeFormatterBuilder. !image-2020-12-04-11-45-10-379.png! > to_date doesn't behave as documented > > > Key: SPARK-33632 > URL: https://issues.apache.org/jira/browse/SPARK-33632 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Frank Oosterhuis >Priority: Major > Attachments: image-2020-12-04-11-45-10-379.png > > > I'm trying to use to_date on a string formatted as "10/31/20". > Expected output is "2020-10-31". > Actual output is "0020-01-31". > The > [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html] > suggests 2020 or 20 as input for "y". > Example below. Expected behaviour is included in the udf. > {code:scala} > import java.sql.Date > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions.{to_date, udf} > object ToDate { > val toDate = udf((date: String) => { > val split = date.split("/") > val month = "%02d".format(split(0).toInt) > val day = "%02d".format(split(1).toInt) > val year = split(2).toInt + 2000 > Date.valueOf(s"${year}-${month}-${day}") > }) > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder().master("local[2]").getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.implicits._ > Seq("1/1/20", "10/31/20") > .toDF("raw") > .withColumn("to_date", to_date($"raw", "m/d/y")) > .withColumn("udf", toDate($"raw")) > .show > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33657) After spark sql is executed to generate hdfs data, the relevant status information is printed
guihuawen created SPARK-33657: - Summary: After spark sql is executed to generate hdfs data, the relevant status information is printed Key: SPARK-33657 URL: https://issues.apache.org/jira/browse/SPARK-33657 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: guihuawen After Spark SQL executes and generates HDFS data, the relevant status information should be printed, which would be very user-friendly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33649) Improve the doc of spark.sql.ansi.enabled
[ https://issues.apache.org/jira/browse/SPARK-33649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-33649. Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30593 [https://github.com/apache/spark/pull/30593] > Improve the doc of spark.sql.ansi.enabled > - > > Key: SPARK-33649 > URL: https://issues.apache.org/jira/browse/SPARK-33649 > Project: Spark > Issue Type: New Feature > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.1.0 > > > As there are more and more new features under the SQL configuration > spark.sql.ansi.enabled, we should make the following clearer: > 1. what exactly it is > 2. where users can find all the features of the ANSI mode > 3. whether all the features come exactly from the SQL standard -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33656) Add option to keep container after tests finish for DockerJDBCIntegrationSuites for debug
[ https://issues.apache.org/jira/browse/SPARK-33656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33656: Assignee: Kousuke Saruta (was: Apache Spark) > Add option to keep container after tests finish for > DockerJDBCIntegrationSuites for debug > - > > Key: SPARK-33656 > URL: https://issues.apache.org/jira/browse/SPARK-33656 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > DockerJDBCIntegrationSuites (e.g. DB2IntegrationSuite, > PostgresIntegrationSuite) launch a docker container which is removed after > tests finish. > If we have an option to keep the container, it would be useful for debug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33656) Add option to keep container after tests finish for DockerJDBCIntegrationSuites for debug
[ https://issues.apache.org/jira/browse/SPARK-33656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33656: Assignee: Apache Spark (was: Kousuke Saruta) > Add option to keep container after tests finish for > DockerJDBCIntegrationSuites for debug > - > > Key: SPARK-33656 > URL: https://issues.apache.org/jira/browse/SPARK-33656 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > DockerJDBCIntegrationSuites (e.g. DB2IntegrationSuite, > PostgresIntegrationSuite) launch a docker container which is removed after > tests finish. > If we have an option to keep the container, it would be useful for debug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33656) Add option to keep container after tests finish for DockerJDBCIntegrationSuites for debug
[ https://issues.apache.org/jira/browse/SPARK-33656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243665#comment-17243665 ] Apache Spark commented on SPARK-33656: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/30601 > Add option to keep container after tests finish for > DockerJDBCIntegrationSuites for debug > - > > Key: SPARK-33656 > URL: https://issues.apache.org/jira/browse/SPARK-33656 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > DockerJDBCIntegrationSuites (e.g. DB2IntegrationSuite, > PostgresIntegrationSuite) launch a docker container which is removed after > tests finish. > If we have an option to keep the container, it would be useful for debug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33656) Add option to keep container after tests finish for DockerJDBCIntegrationSuites for debug
Kousuke Saruta created SPARK-33656: -- Summary: Add option to keep container after tests finish for DockerJDBCIntegrationSuites for debug Key: SPARK-33656 URL: https://issues.apache.org/jira/browse/SPARK-33656 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta DockerJDBCIntegrationSuites (e.g. DB2IntegrationSuite, PostgresIntegrationSuite) launch a Docker container which is removed after the tests finish. An option to keep the container would be useful for debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33655) Thrift server : FETCH_PRIOR does not cause to reiterate from start position.
[ https://issues.apache.org/jira/browse/SPARK-33655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33655: Assignee: Apache Spark > Thrift server : FETCH_PRIOR does not cause to reiterate from start position. > - > > Key: SPARK-33655 > URL: https://issues.apache.org/jira/browse/SPARK-33655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dooyoung Hwang >Assignee: Apache Spark >Priority: Major > > Currently, when a client requests FETCH_PRIOR to thrift server, thrift server > reiterates from start position. Because thrift server caches a query result > with an array, FETCH_PRIOR can be implemented without reiterating the result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33655) Thrift server : FETCH_PRIOR does not cause to reiterate from start position.
[ https://issues.apache.org/jira/browse/SPARK-33655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33655: Assignee: (was: Apache Spark) > Thrift server : FETCH_PRIOR does not cause to reiterate from start position. > - > > Key: SPARK-33655 > URL: https://issues.apache.org/jira/browse/SPARK-33655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dooyoung Hwang >Priority: Major > > Currently, when a client requests FETCH_PRIOR to thrift server, thrift server > reiterates from start position. Because thrift server caches a query result > with an array, FETCH_PRIOR can be implemented without reiterating the result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33655) Thrift server : FETCH_PRIOR does not cause to reiterate from start position.
[ https://issues.apache.org/jira/browse/SPARK-33655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243656#comment-17243656 ] Apache Spark commented on SPARK-33655: -- User 'Dooyoung-Hwang' has created a pull request for this issue: https://github.com/apache/spark/pull/30600 > Thrift server : FETCH_PRIOR does not cause to reiterate from start position. > - > > Key: SPARK-33655 > URL: https://issues.apache.org/jira/browse/SPARK-33655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dooyoung Hwang >Priority: Major > > Currently, when a client requests FETCH_PRIOR to thrift server, thrift server > reiterates from start position. Because thrift server caches a query result > with an array, FETCH_PRIOR can be implemented without reiterating the result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
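The improvement proposed for SPARK-33655 can be sketched with a hypothetical cursor class (not the Thrift server's actual code): once the result set is cached in an array, FETCH_PRIOR becomes index arithmetic instead of re-iterating the result from the start.

```python
class CachedResultCursor:
    """Cursor over a fully cached result set with O(1) repositioning."""

    def __init__(self, rows):
        self._rows = list(rows)   # the cached query result
        self._pos = 0             # index just past the last returned batch

    def fetch_next(self, n):
        batch = self._rows[self._pos:self._pos + n]
        self._pos += len(batch)
        return batch

    def fetch_prior(self, n):
        # Return the batch before the last one returned, by stepping the
        # index back — no re-execution or re-iteration from row 0.
        # (At the start of the result set this clamps to the first batch.)
        start = max(self._pos - 2 * n, 0)
        batch = self._rows[start:start + n]
        self._pos = start + len(batch)
        return batch

cur = CachedResultCursor(range(10))
first = cur.fetch_next(3)    # [0, 1, 2]
second = cur.fetch_next(3)   # [3, 4, 5]
prior = cur.fetch_prior(3)   # back to [0, 1, 2], without a re-scan
```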
[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243654#comment-17243654 ] Apache Spark commented on SPARK-32405: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/30599 > Apply table options while creating tables in JDBC Table Catalog > --- > > Key: SPARK-32405 > URL: https://issues.apache.org/jira/browse/SPARK-32405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > We need to add an API to `JdbcDialect` to generate the SQL statement to > specify table options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243653#comment-17243653 ] Apache Spark commented on SPARK-32405: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/30599 > Apply table options while creating tables in JDBC Table Catalog > --- > > Key: SPARK-32405 > URL: https://issues.apache.org/jira/browse/SPARK-32405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > We need to add an API to `JdbcDialect` to generate the SQL statement to > specify table options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
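To illustrate the shape of the API this ticket asks for (all names below are hypothetical, not the actual `JdbcDialect` signature), a dialect hook that renders table options into the generated CREATE TABLE statement might look like:

```python
def create_table_options_clause(options):
    """Hypothetical dialect hook: render DSv2 table options as trailing
    SQL clauses, e.g. MySQL-style ENGINE=InnoDB."""
    return " ".join(f"{k}={v}" for k, v in options.items())

def create_table_sql(table, columns, options):
    """Build a CREATE TABLE statement, appending the dialect's
    options clause when any options were supplied."""
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    sql = f"CREATE TABLE {table} ({cols})"
    clause = create_table_options_clause(options)
    return f"{sql} {clause}" if clause else sql
```

The design point is that the generic JDBC catalog stays dialect-agnostic and only the clause-rendering hook varies per database.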
[jira] [Commented] (SPARK-33654) Migrate CACHE TABLE to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243637#comment-17243637 ] Apache Spark commented on SPARK-33654: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30598 > Migrate CACHE TABLE to new resolution framework > --- > > Key: SPARK-33654 > URL: https://issues.apache.org/jira/browse/SPARK-33654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate CACHE TABLE to new resolution framework -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33654) Migrate CACHE TABLE to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33654: Assignee: (was: Apache Spark) > Migrate CACHE TABLE to new resolution framework > --- > > Key: SPARK-33654 > URL: https://issues.apache.org/jira/browse/SPARK-33654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate CACHE TABLE to new resolution framework -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33654) Migrate CACHE TABLE to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33654: Assignee: Apache Spark > Migrate CACHE TABLE to new resolution framework > --- > > Key: SPARK-33654 > URL: https://issues.apache.org/jira/browse/SPARK-33654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > Migrate CACHE TABLE to new resolution framework -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33655) Thrift server : FETCH_PRIOR does not cause to reiterate from start position.
Dooyoung Hwang created SPARK-33655: -- Summary: Thrift server : FETCH_PRIOR does not cause to reiterate from start position. Key: SPARK-33655 URL: https://issues.apache.org/jira/browse/SPARK-33655 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Dooyoung Hwang Currently, when a client requests FETCH_PRIOR from the thrift server, the thrift server re-iterates from the start position. Because the thrift server caches the query result in an array, FETCH_PRIOR can be implemented without re-iterating over the result.
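The improvement described in SPARK-33655 comes down to keeping a cursor position over the already-cached result array, so FETCH_PRIOR only moves the position backwards instead of re-running the iteration. A minimal, hypothetical Python sketch (the `CachedCursor` class and method names are illustrative, not Spark's actual Thrift server code):

```python
# Hypothetical sketch of FETCH_PRIOR over a cached result array.
# The Thrift server already caches the full query result; tracking a
# cursor position is enough to serve FETCH_PRIOR without re-iterating.

class CachedCursor:
    def __init__(self, rows):
        self._rows = list(rows)  # cached query result
        self._pos = 0            # index of the next row to return

    def fetch_next(self, n):
        batch = self._rows[self._pos:self._pos + n]
        self._pos += len(batch)
        return batch

    def fetch_prior(self, n):
        # Move the cursor back by up to n rows and return that window,
        # without touching the underlying query again.
        start = max(self._pos - n, 0)
        batch = self._rows[start:self._pos]
        self._pos = start
        return batch

cursor = CachedCursor([1, 2, 3, 4, 5])
print(cursor.fetch_next(3))   # [1, 2, 3]
print(cursor.fetch_prior(2))  # [2, 3]
```

Without the cached array, serving FETCH_PRIOR would require restarting the iterator from the beginning, which is the behavior the ticket proposes to avoid.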
[jira] [Created] (SPARK-33654) Migrate CACHE TABLE to new resolution framework
Terry Kim created SPARK-33654: - Summary: Migrate CACHE TABLE to new resolution framework Key: SPARK-33654 URL: https://issues.apache.org/jira/browse/SPARK-33654 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim Migrate CACHE TABLE to new resolution framework
[jira] [Created] (SPARK-33653) DSv2: REFRESH TABLE should recache the table itself
Chao Sun created SPARK-33653: Summary: DSv2: REFRESH TABLE should recache the table itself Key: SPARK-33653 URL: https://issues.apache.org/jira/browse/SPARK-33653 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Chao Sun As "CACHE TABLE" is supported in DSv2 now, we should also recache the table itself in the "REFRESH TABLE" command, to match the behavior in DSv1.
[jira] [Commented] (SPARK-33652) DSv2: DeleteFrom should refresh cache
[ https://issues.apache.org/jira/browse/SPARK-33652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243624#comment-17243624 ] Apache Spark commented on SPARK-33652: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/30597 > DSv2: DeleteFrom should refresh cache > - > > Key: SPARK-33652 > URL: https://issues.apache.org/jira/browse/SPARK-33652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently DeleteFrom in DSv2 doesn't refresh the cache, which could lead to a > correctness issue if the cache becomes stale and is queried afterwards.
[jira] [Assigned] (SPARK-33652) DSv2: DeleteFrom should refresh cache
[ https://issues.apache.org/jira/browse/SPARK-33652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33652: Assignee: (was: Apache Spark) > DSv2: DeleteFrom should refresh cache > - > > Key: SPARK-33652 > URL: https://issues.apache.org/jira/browse/SPARK-33652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently DeleteFrom in DSv2 doesn't refresh the cache, which could lead to a > correctness issue if the cache becomes stale and is queried afterwards.
[jira] [Assigned] (SPARK-33652) DSv2: DeleteFrom should refresh cache
[ https://issues.apache.org/jira/browse/SPARK-33652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33652: Assignee: Apache Spark > DSv2: DeleteFrom should refresh cache > - > > Key: SPARK-33652 > URL: https://issues.apache.org/jira/browse/SPARK-33652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Major > > Currently DeleteFrom in DSv2 doesn't refresh the cache, which could lead to a > correctness issue if the cache becomes stale and is queried afterwards.
[jira] [Commented] (SPARK-33652) DSv2: DeleteFrom should refresh cache
[ https://issues.apache.org/jira/browse/SPARK-33652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243623#comment-17243623 ] Apache Spark commented on SPARK-33652: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/30597 > DSv2: DeleteFrom should refresh cache > - > > Key: SPARK-33652 > URL: https://issues.apache.org/jira/browse/SPARK-33652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently DeleteFrom in DSv2 doesn't refresh the cache, which could lead to a > correctness issue if the cache becomes stale and is queried afterwards.
[jira] [Created] (SPARK-33652) DSv2: DeleteFrom should refresh cache
Chao Sun created SPARK-33652: Summary: DSv2: DeleteFrom should refresh cache Key: SPARK-33652 URL: https://issues.apache.org/jira/browse/SPARK-33652 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Chao Sun Currently DeleteFrom in DSv2 doesn't refresh the cache, which could lead to a correctness issue if the cache becomes stale and is queried afterwards.
[jira] [Updated] (SPARK-33652) DSv2: DeleteFrom should refresh cache
[ https://issues.apache.org/jira/browse/SPARK-33652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-33652: - Parent: SPARK-33507 Issue Type: Sub-task (was: Improvement) > DSv2: DeleteFrom should refresh cache > - > > Key: SPARK-33652 > URL: https://issues.apache.org/jira/browse/SPARK-33652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Priority: Major > > Currently DeleteFrom in DSv2 doesn't refresh the cache, which could lead to a > correctness issue if the cache becomes stale and is queried afterwards.
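The staleness problem behind SPARK-33652 is easy to reproduce in miniature: if a cached query result is not invalidated when rows are deleted, later reads keep returning rows that no longer exist. A minimal, hypothetical Python sketch (the `CachedTable` class is illustrative, not Spark's actual CacheManager):

```python
# Hypothetical sketch of the stale-cache issue: a delete that does not
# invalidate the cache leaves subsequent cached reads incorrect.

class CachedTable:
    def __init__(self, rows):
        self._rows = list(rows)
        self._cache = None

    def scan(self):
        # Serve from the cache when available, otherwise materialize it.
        if self._cache is None:
            self._cache = list(self._rows)
        return self._cache

    def delete_where(self, predicate, refresh_cache=True):
        self._rows = [r for r in self._rows if not predicate(r)]
        if refresh_cache:
            self._cache = None  # invalidate, so the next scan recomputes

t = CachedTable([1, 2, 3])
t.scan()                                         # populates the cache
t.delete_where(lambda r: r == 2, refresh_cache=False)
print(t.scan())  # [1, 2, 3] -- stale: the deleted row is still visible
t.delete_where(lambda r: r == 2)                 # no-op delete, but refreshes
print(t.scan())  # [1, 3] -- correct after invalidation
```

The fix the ticket proposes corresponds to always taking the `refresh_cache=True` path when DeleteFrom commits.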
[jira] [Commented] (SPARK-32968) Column pruning for CsvToStructs
[ https://issues.apache.org/jira/browse/SPARK-32968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243610#comment-17243610 ] L. C. Hsieh commented on SPARK-32968: - You could try to help on the tickets without assignee. Thanks. > Column pruning for CsvToStructs > --- > > Key: SPARK-32968 > URL: https://issues.apache.org/jira/browse/SPARK-32968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We could do column pruning for CsvToStructs expression if we only require > some fields from it.
[jira] [Commented] (SPARK-32968) Column pruning for CsvToStructs
[ https://issues.apache.org/jira/browse/SPARK-32968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243608#comment-17243608 ] L. C. Hsieh commented on SPARK-32968: - Sorry but I am working on it. > Column pruning for CsvToStructs > --- > > Key: SPARK-32968 > URL: https://issues.apache.org/jira/browse/SPARK-32968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We could do column pruning for CsvToStructs expression if we only require > some fields from it.
[jira] [Assigned] (SPARK-33650) Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table
[ https://issues.apache.org/jira/browse/SPARK-33650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33650: - Assignee: Maxim Gekk > Misleading error from ALTER TABLE .. PARTITION for non-supported partition > management table > --- > > Key: SPARK-33650 > URL: https://issues.apache.org/jira/browse/SPARK-33650 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > For a V2 table that doesn't support partition management, ALTER TABLE .. > ADD/DROP PARTITION throws a misleading exception: > {code:java} > PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > org.apache.spark.sql.AnalysisException: PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > {code} > The error should say that the table doesn't support partition management.
[jira] [Resolved] (SPARK-33650) Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table
[ https://issues.apache.org/jira/browse/SPARK-33650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33650. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30594 [https://github.com/apache/spark/pull/30594] > Misleading error from ALTER TABLE .. PARTITION for non-supported partition > management table > --- > > Key: SPARK-33650 > URL: https://issues.apache.org/jira/browse/SPARK-33650 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > For a V2 table that doesn't support partition management, ALTER TABLE .. > ADD/DROP PARTITION throws a misleading exception: > {code:java} > PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > org.apache.spark.sql.AnalysisException: PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > {code} > The error should say that the table doesn't support partition management.
[jira] [Resolved] (SPARK-33520) make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu resolved SPARK-33520. Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30471 [https://github.com/apache/spark/pull/30471] > make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python > backend estimator/evaluator > - > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.1.0 > > > Currently, PySpark supports third-party libraries that define Python-backend > estimators/evaluators, i.e., estimators that inherit from `Estimator` instead of > `JavaEstimator` and can only be used in PySpark. > CrossValidator and TrainValidateSplit support tuning these Python-backend > estimators, > but cannot support saving/loading, because the CrossValidator and TrainValidateSplit > writer implementation uses JavaMLWriter, which requires converting the nested > estimator and evaluator into Java instances. > OneVsRest saving/loading currently only supports Java-backend classifiers due to a similar > issue.
[jira] [Assigned] (SPARK-33520) make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-33520: -- Assignee: Weichen Xu > make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python > backend estimator/evaluator > - > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Currently, PySpark supports third-party libraries that define Python-backend > estimators/evaluators, i.e., estimators that inherit from `Estimator` instead of > `JavaEstimator` and can only be used in PySpark. > CrossValidator and TrainValidateSplit support tuning these Python-backend > estimators, > but cannot support saving/loading, because the CrossValidator and TrainValidateSplit > writer implementation uses JavaMLWriter, which requires converting the nested > estimator and evaluator into Java instances. > OneVsRest saving/loading currently only supports Java-backend classifiers due to a similar > issue.
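The limitation described in SPARK-33520 is essentially a dispatch problem: a meta-estimator's writer may only use the Java-bridged writer when every nested stage is Java-backed. A minimal, hypothetical Python sketch of that decision (the class and function names are illustrative, not pyspark.ml's actual implementation):

```python
# Hypothetical sketch: choose a save path for a meta-estimator based on
# whether all nested stages are Java-backed. A Java-bridged writer must
# convert every nested stage to a Java instance, so it only works when
# all stages are Java-backed; otherwise a Python-side writer is needed.

class JavaEstimator:
    """Stands in for a JVM-backed estimator."""

class PythonEstimator:
    """Stands in for a pure-Python, third-party estimator."""

def choose_writer(nested_stages):
    if all(isinstance(s, JavaEstimator) for s in nested_stages):
        return "JavaMLWriter"
    return "PythonMLWriter"

print(choose_writer([JavaEstimator(), JavaEstimator()]))    # JavaMLWriter
print(choose_writer([JavaEstimator(), PythonEstimator()]))  # PythonMLWriter
```

Before this fix, the meta-estimators unconditionally took the `JavaMLWriter` path, which is why Python-backend nested stages could not be saved.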
[jira] [Comment Edited] (SPARK-32968) Column pruning for CsvToStructs
[ https://issues.apache.org/jira/browse/SPARK-32968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243599#comment-17243599 ] Yesheng Ma edited comment on SPARK-32968 at 12/4/20, 12:07 AM: --- Looks like it is similar to https://issues.apache.org/jira/browse/SPARK-32958 and I can help out if necessary. was (Author: manifoldqaq): Looks like it is similar to https://issues.apache.org/jira/browse/SPARK-32958 and I can take a look. > Column pruning for CsvToStructs > --- > > Key: SPARK-32968 > URL: https://issues.apache.org/jira/browse/SPARK-32968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We could do column pruning for CsvToStructs expression if we only require > some fields from it.
[jira] [Commented] (SPARK-32968) Column pruning for CsvToStructs
[ https://issues.apache.org/jira/browse/SPARK-32968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243599#comment-17243599 ] Yesheng Ma commented on SPARK-32968: Looks like it is similar to https://issues.apache.org/jira/browse/SPARK-32958 and I can take a look. > Column pruning for CsvToStructs > --- > > Key: SPARK-32968 > URL: https://issues.apache.org/jira/browse/SPARK-32968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We could do column pruning for CsvToStructs expression if we only require > some fields from it.
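Column pruning for `CsvToStructs` (the expression behind `from_csv`) means that when downstream operators require only some fields of the parsed struct, the parser can skip materializing the rest. A minimal, hypothetical Python sketch of the idea (not Spark's actual optimizer rule; the helper name is illustrative):

```python
# Hypothetical sketch of column pruning for CSV parsing: when only a
# subset of fields is required, return just those columns instead of
# materializing the full struct for every row.

def parse_csv_pruned(line, schema, required):
    fields = line.split(",")
    # Keep only the required columns, looked up by their schema position.
    return {name: fields[schema.index(name)] for name in required}

schema = ["id", "name", "age"]
print(parse_csv_pruned("1,alice,30", schema, ["age"]))
# {'age': '30'} -- 'id' and 'name' are never materialized
```

In Spark the analogous optimization rewrites the required schema of the `CsvToStructs` expression so that only the referenced fields are converted, which is what the related SPARK-32958 did for JSON.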
[jira] [Updated] (SPARK-33295) Upgrade ORC to 1.6.6
[ https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33295: -- Target Version/s: 3.2.0 > Upgrade ORC to 1.6.6 > - > > Key: SPARK-33295 > URL: https://issues.apache.org/jira/browse/SPARK-33295 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.2.0 >Reporter: Jaanai Zhang >Assignee: Dongjoon Hyun >Priority: Major > > support zstd compression algorithm for ORC format
[jira] [Updated] (SPARK-33295) Upgrade ORC to 1.6.6
[ https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33295: -- Affects Version/s: (was: 3.1.0) 3.2.0 > Upgrade ORC to 1.6.6 > - > > Key: SPARK-33295 > URL: https://issues.apache.org/jira/browse/SPARK-33295 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.2.0 >Reporter: Jaanai Zhang >Priority: Major > > support zstd compression algorithm for ORC format
[jira] [Assigned] (SPARK-33295) Upgrade ORC to 1.6.6
[ https://issues.apache.org/jira/browse/SPARK-33295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33295: - Assignee: Dongjoon Hyun > Upgrade ORC to 1.6.6 > - > > Key: SPARK-33295 > URL: https://issues.apache.org/jira/browse/SPARK-33295 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 3.2.0 >Reporter: Jaanai Zhang >Assignee: Dongjoon Hyun >Priority: Major > > support zstd compression algorithm for ORC format
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243441#comment-17243441 ] Maxim Gekk commented on SPARK-33571: I opened the PR [https://github.com/apache/spark/pull/30596] with some improvements for config docs. [~hyukjin.kwon] [~cloud_fan] could you review it, please. > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. 
with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parquet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compared to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well.
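The rebasing under discussion converts dates between the hybrid Julian/Gregorian calendar (the `LEGACY` behavior) and the proleptic Gregorian calendar (`CORRECTED`). For dates before 1582-10-15 the two calendars disagree; e.g. Julian 1582-10-04 is proleptic Gregorian 1582-10-14. A minimal, self-contained Python sketch of the conversion using standard Julian Day Number formulas (not Spark's actual RebaseDateTime implementation):

```python
# Hypothetical sketch of LEGACY -> CORRECTED date rebasing: convert a
# Julian-calendar date to the proleptic Gregorian calendar by going
# through the Julian Day Number (standard integer-arithmetic formulas).

def julian_to_jdn(y, m, d):
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083

def jdn_to_gregorian(jdn):
    a = jdn + 32044
    b = (4 * a + 3) // 146097
    c = a - 146097 * b // 4
    d = (4 * c + 3) // 1461
    e = c - 1461 * d // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return (year, month, day)

def rebase_julian_to_gregorian(y, m, d):
    return jdn_to_gregorian(julian_to_jdn(y, m, d))

# Last day of the Julian calendar maps to the day before the Gregorian switch:
print(rebase_julian_to_gregorian(1582, 10, 4))  # (1582, 10, 14)
# Around the year 1000 the two calendars differ by 5 days:
print(rebase_julian_to_gregorian(1000, 1, 1))   # (1000, 1, 6)
```

This day-level disagreement is exactly what `LEGACY` mode compensates for when reading old parquet files, and why values written by Spark 2.4.x can silently shift if no rebasing happens.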
[jira] [Assigned] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33571: Assignee: (was: Apache Spark) > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. 
with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parquet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compared to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well.
[jira] [Assigned] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33571: Assignee: Apache Spark > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Assignee: Apache Spark >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. 
with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parquet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compared to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well.
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243440#comment-17243440 ] Apache Spark commented on SPARK-33571: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30596 > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. 
with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parquet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]) > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compared to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
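The LEGACY vs CORRECTED gap described above can be illustrated without Spark: before 1582-10-15 the hybrid (Julian) calendar and the proleptic Gregorian calendar assign different dates to the same underlying day count. A minimal pure-Python sketch of that gap (the conversion formula is a standard Julian Day Number identity; the function names are illustrative and this is not Spark's rebasing code):

```python
from datetime import date

# Python's datetime is proleptic Gregorian; its ordinals differ from
# Julian Day Numbers by a fixed offset.
JDN_OFFSET = 1721425

def julian_calendar_to_jdn(y, m, d):
    """Julian Day Number of a date expressed in the *Julian* calendar
    (standard integer formula)."""
    return (367 * y
            - (7 * (y + 5001 + (m - 9) // 7)) // 4
            + (275 * m) // 9
            + d + 1729777)

# A value stored as "1582-10-04" under the old hybrid calendar...
jdn = julian_calendar_to_jdn(1582, 10, 4)

# ...denotes the same physical day the proleptic Gregorian calendar
# (and hence Spark 3.x in CORRECTED mode) calls 1582-10-14.
corrected = date.fromordinal(jdn - JDN_OFFSET)
print(corrected)  # 1582-10-14
```

This 10-day shift is exactly what rebasing compensates for: LEGACY mode rewrites the day count so `df.show()` still displays 1582-10-04, while CORRECTED mode keeps the raw day count and displays 1582-10-14.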
[jira] [Commented] (SPARK-33651) allow CREATE EXTERNAL TABLE with LOCATION for data source tables
[ https://issues.apache.org/jira/browse/SPARK-33651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243413#comment-17243413 ] Apache Spark commented on SPARK-33651: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/30595 > allow CREATE EXTERNAL TABLE with LOCATION for data source tables > > > Key: SPARK-33651 > URL: https://issues.apache.org/jira/browse/SPARK-33651 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-33651) allow CREATE EXTERNAL TABLE with LOCATION for data source tables
[ https://issues.apache.org/jira/browse/SPARK-33651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33651: Assignee: Apache Spark (was: Wenchen Fan) > allow CREATE EXTERNAL TABLE with LOCATION for data source tables > > > Key: SPARK-33651 > URL: https://issues.apache.org/jira/browse/SPARK-33651 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-33651) allow CREATE EXTERNAL TABLE with LOCATION for data source tables
[ https://issues.apache.org/jira/browse/SPARK-33651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33651: Assignee: Wenchen Fan (was: Apache Spark) > allow CREATE EXTERNAL TABLE with LOCATION for data source tables > > > Key: SPARK-33651 > URL: https://issues.apache.org/jira/browse/SPARK-33651 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Updated] (SPARK-33651) allow CREATE EXTERNAL TABLE with LOCATION for data source tables
[ https://issues.apache.org/jira/browse/SPARK-33651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-33651: Summary: allow CREATE EXTERNAL TABLE with LOCATION for data source tables (was: allow CREATE EXTERNAL TABLE without LOCATION for data source tables) > allow CREATE EXTERNAL TABLE with LOCATION for data source tables > > > Key: SPARK-33651 > URL: https://issues.apache.org/jira/browse/SPARK-33651 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Created] (SPARK-33651) allow CREATE EXTERNAL TABLE without LOCATION for data source tables
Wenchen Fan created SPARK-33651: --- Summary: allow CREATE EXTERNAL TABLE without LOCATION for data source tables Key: SPARK-33651 URL: https://issues.apache.org/jira/browse/SPARK-33651 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Resolved] (SPARK-33634) use Analyzer in PlanResolutionSuite
[ https://issues.apache.org/jira/browse/SPARK-33634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33634. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30574 [https://github.com/apache/spark/pull/30574] > use Analyzer in PlanResolutionSuite > --- > > Key: SPARK-33634 > URL: https://issues.apache.org/jira/browse/SPARK-33634 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > >
[jira] [Resolved] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33623. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30562 [https://github.com/apache/spark/pull/30562] > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.1.0 > > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future.
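The protocol the ticket proposes can be sketched in a few lines. The following Python mock is hypothetical (invented names, not Spark's Java `SupportsDelete` interface): the engine first asks the source whether the filters are cheap to apply, and only takes the metadata-only delete path when the source says yes.

```python
# Hypothetical sketch of the SupportsDelete + canDeleteWhere idea.
class PartitionedSource:
    def __init__(self, partitions):
        # partitions: dict mapping partition value -> list of rows
        self.partitions = partitions

    def can_delete_where(self, filters):
        # A cheap metadata-only delete is possible only when every
        # filter targets the partition column.
        return all(col == "part" for col, _ in filters)

    def delete_where(self, filters):
        # Drop whole partitions; no data files are rewritten.
        for _, value in filters:
            self.partitions.pop(value, None)

def engine_delete(source, filters):
    """Engine-side logic: take the fast path if the source supports it,
    otherwise signal that a full rewrite is needed (not shown)."""
    if source.can_delete_where(filters):
        source.delete_where(filters)
        return "metadata-delete"
    return "rewrite-needed"

src = PartitionedSource({"2020": ["a"], "2021": ["b"]})
print(engine_delete(src, [("part", "2020")]))   # metadata-delete
print(engine_delete(src, [("value", "a")]))     # rewrite-needed
```

The capability check is what gives Spark "more control in the future": it can plan a row-level rewrite up front instead of failing inside the source.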
[jira] [Assigned] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33623: - Assignee: Anton Okolnychyi > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future.
[jira] [Assigned] (SPARK-33650) Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table
[ https://issues.apache.org/jira/browse/SPARK-33650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33650: Assignee: Apache Spark > Misleading error from ALTER TABLE .. PARTITION for non-supported partition > management table > --- > > Key: SPARK-33650 > URL: https://issues.apache.org/jira/browse/SPARK-33650 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > For a V2 table that doesn't support partition management, ALTER TABLE .. > ADD/DROP PARTITION throws misleading exception: > {code:java} > PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > org.apache.spark.sql.AnalysisException: PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > {code} > The error should say that the table doesn't support partition management.
[jira] [Assigned] (SPARK-33650) Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table
[ https://issues.apache.org/jira/browse/SPARK-33650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33650: Assignee: (was: Apache Spark) > Misleading error from ALTER TABLE .. PARTITION for non-supported partition > management table > --- > > Key: SPARK-33650 > URL: https://issues.apache.org/jira/browse/SPARK-33650 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For a V2 table that doesn't support partition management, ALTER TABLE .. > ADD/DROP PARTITION throws misleading exception: > {code:java} > PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > org.apache.spark.sql.AnalysisException: PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > {code} > The error should say that the table doesn't support partition management.
[jira] [Commented] (SPARK-33650) Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table
[ https://issues.apache.org/jira/browse/SPARK-33650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243372#comment-17243372 ] Apache Spark commented on SPARK-33650: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30594 > Misleading error from ALTER TABLE .. PARTITION for non-supported partition > management table > --- > > Key: SPARK-33650 > URL: https://issues.apache.org/jira/browse/SPARK-33650 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > For a V2 table that doesn't support partition management, ALTER TABLE .. > ADD/DROP PARTITION throws misleading exception: > {code:java} > PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > org.apache.spark.sql.AnalysisException: PartitionSpecs are not resolved;; > 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false > +- ResolvedTable > org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, > org.apache.spark.sql.connector.InMemoryTable@5d3ff859 > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > {code} > The error should say that the table doesn't support partition management.
[jira] [Created] (SPARK-33650) Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table
Maxim Gekk created SPARK-33650: -- Summary: Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table Key: SPARK-33650 URL: https://issues.apache.org/jira/browse/SPARK-33650 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk For a V2 table that doesn't support partition management, ALTER TABLE .. ADD/DROP PARTITION throws misleading exception: {code:java} PartitionSpecs are not resolved;; 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false +- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable@5d3ff859 org.apache.spark.sql.AnalysisException: PartitionSpecs are not resolved;; 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false +- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable@5d3ff859 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) {code} The error should say that the table doesn't support partition management.
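The fix the ticket asks for amounts to an early capability check before partition-spec resolution. A hypothetical Python sketch of that idea (all class and function names invented for illustration; Spark's actual check lives in Scala analysis rules):

```python
class AnalysisException(Exception):
    """Stand-in for org.apache.spark.sql.AnalysisException."""

class Table:
    def __init__(self, name):
        self.name = name

class PartitionedTable(Table):
    """Marker for tables that support partition management
    (analogous to implementing SupportsPartitionManagement)."""

def check_alter_partition(table):
    # Fail early with a targeted message instead of letting analysis
    # fall through to the confusing "PartitionSpecs are not resolved" error.
    if not isinstance(table, PartitionedTable):
        raise AnalysisException(
            f"Table {table.name} does not support partition management")

check_alter_partition(PartitionedTable("ns1.ns2.tbl"))  # passes silently
try:
    check_alter_partition(Table("ns1.ns2.plain"))
except AnalysisException as e:
    print(e)  # Table ns1.ns2.plain does not support partition management
```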
[jira] [Updated] (SPARK-33629) spark.buffer.size not applied in driver from pyspark
[ https://issues.apache.org/jira/browse/SPARK-33629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33629: - Fix Version/s: (was: 3.2.0) 3.1.0 > spark.buffer.size not applied in driver from pyspark > > > Key: SPARK-33629 > URL: https://issues.apache.org/jira/browse/SPARK-33629 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > The problem has been discovered here: > [https://github.com/apache/spark/pull/30389#issuecomment-729524618] >
[jira] [Assigned] (SPARK-33629) spark.buffer.size not applied in driver from pyspark
[ https://issues.apache.org/jira/browse/SPARK-33629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33629: Assignee: Gabor Somogyi > spark.buffer.size not applied in driver from pyspark > > > Key: SPARK-33629 > URL: https://issues.apache.org/jira/browse/SPARK-33629 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > > The problem has been discovered here: > [https://github.com/apache/spark/pull/30389#issuecomment-729524618] >
[jira] [Resolved] (SPARK-33629) spark.buffer.size not applied in driver from pyspark
[ https://issues.apache.org/jira/browse/SPARK-33629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33629. -- Fix Version/s: 3.2.0 3.0.2 Resolution: Fixed Issue resolved by pull request 30592 [https://github.com/apache/spark/pull/30592] > spark.buffer.size not applied in driver from pyspark > > > Key: SPARK-33629 > URL: https://issues.apache.org/jira/browse/SPARK-33629 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.2, 3.2.0 > > > The problem has been discovered here: > [https://github.com/apache/spark/pull/30389#issuecomment-729524618] >
[jira] [Updated] (SPARK-27733) Upgrade to Avro 1.10.1
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated SPARK-27733: - Summary: Upgrade to Avro 1.10.1 (was: Upgrade to Avro 1.10.0) > Upgrade to Avro 1.10.1 > -- > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.1.0 >Reporter: Ismaël Mejía >Priority: Major > > Avro 1.9.2 was released with many nice features including reduced size (1MB > less), removed dependencies (no paranamer, no shaded guava), and security > updates, so it is probably a worthwhile upgrade. > Avro 1.10.0 was released and this is still not done. > There is at the moment (2020/08) still a blocker: Hive-related transitive > dependencies bring in older versions of Avro, so this is effectively blocked > until HIVE-21737 is solved.
[jira] [Commented] (SPARK-33649) Improve the doc of spark.sql.ansi.enabled
[ https://issues.apache.org/jira/browse/SPARK-33649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243280#comment-17243280 ] Apache Spark commented on SPARK-33649: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/30593 > Improve the doc of spark.sql.ansi.enabled > - > > Key: SPARK-33649 > URL: https://issues.apache.org/jira/browse/SPARK-33649 > Project: Spark > Issue Type: New Feature > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > As there are more and more new features under the SQL configuration > spark.sql.ansi.enabled, we should make it clearer: > 1. what exactly it is > 2. where users can find all the features of the ANSI mode > 3. whether all the features come exactly from the SQL standard
[jira] [Assigned] (SPARK-33649) Improve the doc of spark.sql.ansi.enabled
[ https://issues.apache.org/jira/browse/SPARK-33649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33649: Assignee: Apache Spark (was: Gengliang Wang) > Improve the doc of spark.sql.ansi.enabled > - > > Key: SPARK-33649 > URL: https://issues.apache.org/jira/browse/SPARK-33649 > Project: Spark > Issue Type: New Feature > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > As there are more and more new features under the SQL configuration > spark.sql.ansi.enabled, we should make it clearer: > 1. what exactly it is > 2. where users can find all the features of the ANSI mode > 3. whether all the features come exactly from the SQL standard
[jira] [Assigned] (SPARK-33649) Improve the doc of spark.sql.ansi.enabled
[ https://issues.apache.org/jira/browse/SPARK-33649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33649: Assignee: Gengliang Wang (was: Apache Spark) > Improve the doc of spark.sql.ansi.enabled > - > > Key: SPARK-33649 > URL: https://issues.apache.org/jira/browse/SPARK-33649 > Project: Spark > Issue Type: New Feature > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > As there are more and more new features under the SQL configuration > spark.sql.ansi.enabled, we should make it clearer: > 1. what exactly it is > 2. where users can find all the features of the ANSI mode > 3. whether all the features come exactly from the SQL standard
[jira] [Created] (SPARK-33649) Improve the doc of spark.sql.ansi.enabled
Gengliang Wang created SPARK-33649: -- Summary: Improve the doc of spark.sql.ansi.enabled Key: SPARK-33649 URL: https://issues.apache.org/jira/browse/SPARK-33649 Project: Spark Issue Type: New Feature Components: Documentation, SQL Affects Versions: 3.1.0 Reporter: Gengliang Wang Assignee: Gengliang Wang As there are more and more new features under the SQL configuration spark.sql.ansi.enabled, we should make it clearer: 1. what exactly it is 2. where users can find all the features of the ANSI mode 3. whether all the features come exactly from the SQL standard
[jira] [Updated] (SPARK-30098) Add a configuration to use default datasource as provider for CREATE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-30098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30098: Description: Changing the default provider from `hive` to the value of `spark.sql.sources.default` for the "CREATE TABLE" command to make it consistent with the DataFrameWriter.saveAsTable API, w.r.t. the new config. (by default we don't change the table provider) Also, it is friendlier to end users, since Spark is well known for using parquet (the default value of `spark.sql.sources.default`) as its default I/O format. was: Changing the default provider from `hive` to the value of `spark.sql.sources.default` for the "CREATE TABLE" command to make it consistent with the DataFrameWriter.saveAsTable API. Also, it is friendlier to end users, since Spark is well known for using parquet (the default value of `spark.sql.sources.default`) as its default I/O format. > Add a configuration to use default datasource as provider for CREATE TABLE > command > -- > > Key: SPARK-30098 > URL: https://issues.apache.org/jira/browse/SPARK-30098 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > > Changing the default provider from `hive` to the value of > `spark.sql.sources.default` for the "CREATE TABLE" command to make it > consistent with the DataFrameWriter.saveAsTable API, w.r.t. the new config. (by > default we don't change the table provider) > Also, it is friendlier to end users, since Spark is well known for using > parquet (the default value of `spark.sql.sources.default`) as its default I/O > format.
[jira] [Updated] (SPARK-30098) Add a configuration to use default datasource as provider for CREATE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-30098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30098: Summary: Add a configuration to use default datasource as provider for CREATE TABLE command (was: Use default datasource as provider for CREATE TABLE command) > Add a configuration to use default datasource as provider for CREATE TABLE > command > -- > > Key: SPARK-30098 > URL: https://issues.apache.org/jira/browse/SPARK-30098 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > > Changing the default provider from `hive` to the value of > `spark.sql.sources.default` for the "CREATE TABLE" command to make it > consistent with the DataFrameWriter.saveAsTable API. > Also, it is friendlier to end users, since Spark is well known for using > parquet (the default value of `spark.sql.sources.default`) as its default I/O > format.
[jira] [Resolved] (SPARK-30098) Use default datasource as provider for CREATE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-30098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30098. - Fix Version/s: (was: 3.0.0) 3.1.0 Resolution: Fixed Issue resolved by pull request 30554 [https://github.com/apache/spark/pull/30554] > Use default datasource as provider for CREATE TABLE command > --- > > Key: SPARK-30098 > URL: https://issues.apache.org/jira/browse/SPARK-30098 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > > Changing the default provider from `hive` to the value of > `spark.sql.sources.default` for the "CREATE TABLE" command to make it > consistent with the DataFrameWriter.saveAsTable API. > Also, it is friendlier to end users, since Spark is well known for using > parquet (the default value of `spark.sql.sources.default`) as its default I/O > format.
[jira] [Assigned] (SPARK-30098) Use default datasource as provider for CREATE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-30098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30098: --- Assignee: Wenchen Fan > Use default datasource as provider for CREATE TABLE command > --- > > Key: SPARK-30098 > URL: https://issues.apache.org/jira/browse/SPARK-30098 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > Changing the default provider from `hive` to the value of > `spark.sql.sources.default` for the "CREATE TABLE" command to make it > consistent with the DataFrameWriter.saveAsTable API. > Also, it is friendlier to end users, since Spark is well known for using > parquet (the default value of `spark.sql.sources.default`) as its default I/O > format.
[jira] [Resolved] (SPARK-33646) Add new function DATE_FROM_UNIX_DATE and UNIX_DATE
[ https://issues.apache.org/jira/browse/SPARK-33646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33646. - Fix Version/s: 3.1.0 Resolution: Fixed > Add new function DATE_FROM_UNIX_DATE and UNIX_DATE > -- > > Key: SPARK-33646 > URL: https://issues.apache.org/jira/browse/SPARK-33646 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.1.0 > > > h2. What changes were proposed in this pull request? > Add new functions DATE_FROM_UNIX_DATE and UNIX_DATE for conversion between > Date type and Numeric types. > h2. Why are the changes needed? > 1. Explicit conversion between Date type and Numeric types is disallowed in > ANSI mode. We need to provide new functions for users to complete the > conversion. > 2. We have introduced new functions from Bigquery for conversion between > Timestamp type and Numeric types: TIMESTAMP_SECONDS, TIMESTAMP_MILLIS, > TIMESTAMP_MICROS, UNIX_SECONDS, UNIX_MILLIS, and UNIX_MICROS. It makes sense > to add functions for conversion between Date type and Numeric types as well.
[jira] [Assigned] (SPARK-33646) Add new function DATE_FROM_UNIX_DATE and UNIX_DATE
[ https://issues.apache.org/jira/browse/SPARK-33646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33646:
------------------------------------
Assignee: Gengliang Wang (was: Apache Spark)
[jira] [Commented] (SPARK-33646) Add new function DATE_FROM_UNIX_DATE and UNIX_DATE
[ https://issues.apache.org/jira/browse/SPARK-33646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243198#comment-17243198 ]

Apache Spark commented on SPARK-33646:
--------------------------------------
User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30588
[jira] [Assigned] (SPARK-33646) Add new function DATE_FROM_UNIX_DATE and UNIX_DATE
[ https://issues.apache.org/jira/browse/SPARK-33646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33646:
------------------------------------
Assignee: Apache Spark (was: Gengliang Wang)