[jira] [Created] (SPARK-33849) Unify v1 and v2 DROP TABLE tests

2020-12-19 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33849:
--

 Summary: Unify v1 and v2 DROP TABLE tests
 Key: SPARK-33849
 URL: https://issues.apache.org/jira/browse/SPARK-33849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk









[jira] [Updated] (SPARK-33838) Add comments about `PURGE` in DropTable and in AlterTableDropPartition

2020-12-18 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33838:
---
Summary: Add comments about `PURGE` in DropTable and in 
AlterTableDropPartition  (was: Add comment about `PURGE` in DropTable and in 
AlterTableDropPartition)

> Add comments about `PURGE` in DropTable and in AlterTableDropPartition
> --
>
> Key: SPARK-33838
> URL: https://issues.apache.org/jira/browse/SPARK-33838
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Update comments for the commands AlterTableDropPartition and DropTable






[jira] [Created] (SPARK-33838) Add comment about `PURGE` in DropTable and in AlterTableDropPartition

2020-12-18 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33838:
--

 Summary: Add comment about `PURGE` in DropTable and in 
AlterTableDropPartition
 Key: SPARK-33838
 URL: https://issues.apache.org/jira/browse/SPARK-33838
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Update comments for the commands AlterTableDropPartition and DropTable






[jira] [Updated] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PARTITION

2020-12-17 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33830:
---
Summary: Describe the PURGE option of ALTER TABLE .. DROP PARTITION  (was: 
Describe the PURGE option of ALTER TABLE .. DROP PATITION)

> Describe the PURGE option of ALTER TABLE .. DROP PARTITION
> --
>
> Key: SPARK-33830
> URL: https://issues.apache.org/jira/browse/SPARK-33830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>







[jira] [Created] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PATITION

2020-12-17 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33830:
--

 Summary: Describe the PURGE option of ALTER TABLE .. DROP PATITION
 Key: SPARK-33830
 URL: https://issues.apache.org/jira/browse/SPARK-33830
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk









[jira] [Created] (SPARK-33789) Refactor unified V1 and V2 datasource tests

2020-12-15 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33789:
--

 Summary: Refactor unified V1 and V2 datasource tests
 Key: SPARK-33789
 URL: https://issues.apache.org/jira/browse/SPARK-33789
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Extract common utility methods and settings, and place them in common traits.
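
A minimal sketch of the direction (trait and member names are illustrative assumptions, not the actual Spark test API):
{code:scala}
// Shared settings that differ between the v1 and v2 command test suites,
// so the test bodies themselves can be written once.
trait CommandSuiteBase {
  def commandVersion: String // e.g. "V1" or "V2", used in test names
  def catalog: String        // catalog prefix for fully qualified table names
  def defaultUsing: String   // default data source clause for CREATE TABLE

  def qualified(table: String): String = s"$catalog.$table"
}

trait V1CommandSuiteBase extends CommandSuiteBase {
  override def commandVersion: String = "V1"
  override def catalog: String = "spark_catalog"
  override def defaultUsing: String = "USING parquet"
}

trait V2CommandSuiteBase extends CommandSuiteBase {
  override def commandVersion: String = "V2"
  override def catalog: String = "test_catalog"
  override def defaultUsing: String = "USING _"
}
{code}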






[jira] [Updated] (SPARK-33788) Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions()

2020-12-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33788:
---
Description: HiveExternalCatalog.dropPartitions, and as a consequence the V1 
Hive catalog, throws AnalysisException. The behavior deviates from the V1/V2 
in-memory catalogs, which throw NoSuchPartitionsException.  (was: 
HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by 
AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that 
throw PartitionsAlreadyExistException.)

> Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions()
> -
>
> Key: SPARK-33788
> URL: https://issues.apache.org/jira/browse/SPARK-33788
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> HiveExternalCatalog.dropPartitions, and as a consequence the V1 Hive catalog, 
> throws AnalysisException. The behavior deviates from the V1/V2 in-memory 
> catalogs, which throw NoSuchPartitionsException.
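
A rough sketch of the intended change (the method body is simplified for illustration; only the NoSuchPartitionsException class and its (db, table, specs) constructor are taken from the existing catalyst API):
{code:scala}
import org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException
import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec

// Instead of letting the Hive-side error surface as a generic AnalysisException,
// detect the missing partitions up front and throw the same exception as the
// V1/V2 in-memory catalogs.
def dropPartitions(
    db: String,
    table: String,
    specs: Seq[TablePartitionSpec],
    existing: Set[TablePartitionSpec],
    ignoreIfNotExists: Boolean): Unit = {
  val missing = specs.filterNot(existing.contains)
  if (missing.nonEmpty && !ignoreIfNotExists) {
    throw new NoSuchPartitionsException(db, table, missing)
  }
  // ... delegate the actual drop to the Hive client for the remaining specs ...
}
{code}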






[jira] [Created] (SPARK-33788) Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions()

2020-12-15 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33788:
--

 Summary: Throw NoSuchPartitionsException from 
HiveExternalCatalog.dropPartitions()
 Key: SPARK-33788
 URL: https://issues.apache.org/jira/browse/SPARK-33788
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.7, 3.0.1, 3.1.0
Reporter: Maxim Gekk
Assignee: Maxim Gekk
 Fix For: 2.4.8, 3.0.2, 3.1.0


HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by 
AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that 
throw PartitionsAlreadyExistException.






[jira] [Created] (SPARK-33787) Add `purge` to `dropPartition` in `SupportsPartitionManagement`

2020-12-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33787:
--

 Summary: Add `purge` to `dropPartition` in 
`SupportsPartitionManagement`
 Key: SPARK-33787
 URL: https://issues.apache.org/jira/browse/SPARK-33787
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Add the `purge` parameter to the `dropPartition` in 
`SupportsPartitionManagement` and to the `dropPartitions` in 
`SupportsAtomicPartitionManagement`.
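
An illustrative Scala-style view of the proposed API change (the real interface is Java, and the final method names and signatures may differ):
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow

// Sketch only: add a purge variant with a default implementation so that
// existing catalog implementations keep compiling.
trait PartitionManagementSketch {
  // existing behavior: drop the partition, keeping the data recoverable
  def dropPartition(ident: InternalRow): Boolean

  // proposed: purge means delete the partition data permanently
  def purgePartition(ident: InternalRow): Boolean =
    throw new UnsupportedOperationException("Partition purge is not supported")
}
{code}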






[jira] [Commented] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249120#comment-17249120
 ] 

Maxim Gekk commented on SPARK-33777:


I am working on this.

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The V1 SHOW PARTITIONS command sorts its results. Both V1 implementations, 
> in-memory and Hive catalog, perform sorting (according to the Hive docs: 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions]).
> V2 should have the same behavior.






[jira] [Created] (SPARK-33777) Sort output of SHOW PARTITIONS V2

2020-12-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33777:
--

 Summary: Sort output of SHOW PARTITIONS V2
 Key: SPARK-33777
 URL: https://issues.apache.org/jira/browse/SPARK-33777
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


The V1 SHOW PARTITIONS command sorts its results. Both V1 implementations, 
in-memory and Hive catalog, perform sorting (according to the Hive docs: 
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions]).
V2 should have the same behavior.
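
A minimal illustration of the expected behavior (partition names are just example values):
{code:scala}
// V1 SHOW PARTITIONS and Hive return partition names in sorted order, so the
// rows produced by the V2 implementation should be sorted the same way.
val unsorted = Seq("year=2016/month=3", "year=2015/month=2", "year=2015/month=1")
val expected = unsorted.sorted
// expected: Seq("year=2015/month=1", "year=2015/month=2", "year=2016/month=3")
{code}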






[jira] [Updated] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33777:
---
Summary: Sort output of V2 SHOW PARTITIONS  (was: Sort output of SHOW 
PARTITIONS V2)

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The V1 SHOW PARTITIONS command sorts its results. Both V1 implementations, 
> in-memory and Hive catalog, perform sorting (according to the Hive docs: 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions]).
> V2 should have the same behavior.






[jira] [Created] (SPARK-33770) Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of partition path

2020-12-13 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33770:
--

 Summary: Test failures: ALTER TABLE .. DROP PARTITION tries to 
delete files out of partition path
 Key: SPARK-33770
 URL: https://issues.apache.org/jira/browse/SPARK-33770
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


For example: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132719/testReport/org.apache.spark.sql.hive.execution.command/AlterTableAddPartitionSuite/ALTER_TABLEADD_PARTITION_Hive_V1__SPARK_33521__universal_type_conversions_of_partition_values/

{code:java}
org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER 
TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of 
partition values
sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: File 
file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-38fe2706-33e5-469a-ba3a-682391e02179
 does not exist;
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.dropPartitions(ExternalCatalogWithListener.scala:211)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropPartitions(SessionCatalog.scala:1036)
at 
org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:582)
{code}







[jira] [Created] (SPARK-33768) Remove unused parameter `retainData` from AlterTableDropPartition

2020-12-12 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33768:
--

 Summary: Remove unused parameter `retainData` from 
AlterTableDropPartition
 Key: SPARK-33768
 URL: https://issues.apache.org/jira/browse/SPARK-33768
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


The parameter is hard-coded to false while parsing in AstBuilder, so it can be 
removed from the logical node.






[jira] [Updated] (SPARK-33767) Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests

2020-12-12 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33767:
---
Description: Extract ALTER TABLE .. DROP PARTITION tests to a common place 
to run them for V1 and V2 datasources. Some tests can be placed in V1- and 
V2-specific test suites.  (was: Extract ALTER TABLE .. ADD PARTITION tests to 
the common place to run them for V1 and v2 datasources. Some tests can be 
places to V1 and V2 specific test suites.)

> Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests
> ---
>
> Key: SPARK-33767
> URL: https://issues.apache.org/jira/browse/SPARK-33767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Extract ALTER TABLE .. DROP PARTITION tests to a common place to run them 
> for V1 and V2 datasources. Some tests can be placed in V1- and V2-specific 
> test suites.






[jira] [Created] (SPARK-33767) Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests

2020-12-12 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33767:
--

 Summary: Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests
 Key: SPARK-33767
 URL: https://issues.apache.org/jira/browse/SPARK-33767
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk
Assignee: Maxim Gekk
 Fix For: 3.1.0


Extract ALTER TABLE .. ADD PARTITION tests to a common place to run them for 
V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test 
suites.






[jira] [Created] (SPARK-33742) Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions()

2020-12-10 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33742:
--

 Summary: Throw PartitionsAlreadyExistException from 
HiveExternalCatalog.createPartitions()
 Key: SPARK-33742
 URL: https://issues.apache.org/jira/browse/SPARK-33742
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.1, 2.4.7, 3.1.0
Reporter: Maxim Gekk


HiveExternalCatalog.createPartitions throws AlreadyExistsException wrapped by 
AnalysisException. The behavior deviates from V1/V2 in-memory catalogs that 
throw PartitionsAlreadyExistException.






[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246717#comment-17246717
 ] 

Maxim Gekk commented on SPARK-33571:


> The behavior of the `spark.sql.legacy.parquet.int96RebaseModeIn*` configs, to 
> be introduced in Spark 3.1, is the same as for `datetimeRebaseModeIn*`?

Yes.

> So Spark will check the parquet metadata for Spark version and the 
> `datetimeRebaseModeInRead` metadata key and use the correct behavior.

Correct, except for the names of the metadata keys. For the keys that Spark 
checks, see 
https://github.com/MaxGekk/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala#L58-L68

> If those are not set it will raise an exception and ask the user to define 
> the mode. Is that correct?

Yes. Spark should raise the exception if it is not clear which calendar the 
writer used.

> but from my testing Spark 3 does the same by default, not sure if that aligns 
> with your findings?

Spark 3.0.0-SNAPSHOT saved timestamps as TIMESTAMP_MICROS in Parquet until 
https://github.com/apache/spark/pull/28450 . I just wanted to say that the 
datetimeRebaseModeIn* configs you pointed out have no impact on INT96 in Spark 
3.0.

> What is the expected behavior for TIMESTAMP_MICROS and TIMESTAMP_MILLIS with 
> regards to this?

The same as for the DATE type. Spark takes into account the same SQL configs 
and metadata keys from Parquet files.
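
For reference, the read-side modes discussed above can be set explicitly (a spark-shell sketch; the INT96-specific config follows the int96RebaseModeIn* pattern mentioned earlier and is only available starting from Spark 3.1):
{code:scala}
// DATE and TIMESTAMP_MICROS/TIMESTAMP_MILLIS columns:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
// INT96 timestamps (separate config, Spark 3.1+):
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
{code}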



> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
> Fix For: 3.1.0
>
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing/show the behavior, it uses 
> pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. 

[jira] [Updated] (SPARK-33558) Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests

2020-12-09 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33558:
---
Description: Extract ALTER TABLE .. ADD PARTITION tests to a common place 
to run them for V1 and V2 datasources. Some tests can be placed in V1- and 
V2-specific test suites.  (was: Extract ALTER TABLE .. PARTITION tests to the 
common place to run them for V1 and v2 datasources. Some tests can be places to 
V1 and V2 specific test suites.)

> Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests
> --
>
> Key: SPARK-33558
> URL: https://issues.apache.org/jira/browse/SPARK-33558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Extract ALTER TABLE .. ADD PARTITION tests to a common place to run them 
> for V1 and V2 datasources. Some tests can be placed in V1- and V2-specific 
> test suites.






[jira] [Updated] (SPARK-33558) Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests

2020-12-09 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33558:
---
Summary: Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests  (was: Unify v1 
and v2 ALTER TABLE .. PARTITION tests)

> Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests
> --
>
> Key: SPARK-33558
> URL: https://issues.apache.org/jira/browse/SPARK-33558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Extract ALTER TABLE .. PARTITION tests to a common place to run them for V1 
> and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.






[jira] [Created] (SPARK-33706) Require fully specified partition identifier in partitionExists()

2020-12-07 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33706:
--

 Summary: Require fully specified partition identifier in 
partitionExists()
 Key: SPARK-33706
 URL: https://issues.apache.org/jira/browse/SPARK-33706
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Currently, partitionExists() from SupportsPartitionManagement accepts any 
partition identifier, even one that is not fully specified. This ticket aims to 
add a check that the lengths of the partition schema and the partition 
identifier match exactly, i.e. to prohibit partition IDs that are not fully 
specified.
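
A minimal sketch of the proposed check (the method shape is illustrative, not the actual patch):
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// The partition identifier must cover every field of the partition schema;
// otherwise it may refer to more than one partition.
def checkFullySpecified(partitionSchema: StructType, ident: InternalRow): Unit = {
  require(ident.numFields == partitionSchema.length,
    s"The identifier with ${ident.numFields} fields does not fully specify " +
      s"a partition of the ${partitionSchema.length}-column partition schema")
}
{code}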






[jira] [Commented] (SPARK-33441) Add "unused-import" compile arg to scalac and remove all unused imports in Scala code

2020-12-07 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245723#comment-17245723
 ] 

Maxim Gekk commented on SPARK-33441:


Would it be possible to also check for unused imports in Java code?

> Add "unused-import" compile arg to scalac and remove all unused imports in 
> Scala code 
> --
>
> Key: SPARK-33441
> URL: https://issues.apache.org/jira/browse/SPARK-33441
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.1.0
>
>
> * Add new scala compile arg to defense against new unused imports:
>  ** "-Ywarn-unused-import" for Scala 2.12
>  ** "-Wconf:cat=unused-imports:ws" or "-Wconf:cat=unused-imports:error" for 
> Scala 2.13
>  * Remove all unused imports in Scala code 
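
The quoted compile args could be wired in roughly like this (hypothetical sbt sketch; Spark itself configures scalac via Maven and SparkBuild, so this is only illustrative):
{code:scala}
scalacOptions ++= {
  if (scalaBinaryVersion.value == "2.13") {
    Seq("-Wconf:cat=unused-imports:error") // fail the build on unused imports
  } else {
    Seq("-Ywarn-unused-import")            // warn on unused imports (Scala 2.12)
  }
}
{code}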






[jira] [Commented] (SPARK-33670) Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED

2020-12-07 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245174#comment-17245174
 ] 

Maxim Gekk commented on SPARK-33670:


[~hyukjin.kwon] Just in case, which "Affects Version" should be set: an already 
released version or the current unreleased one? For example, I set 3.0.2, but 
it has not been released yet. Maybe I should set 3.0.1?

> Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED
> ---
>
> Key: SPARK-33670
> URL: https://issues.apache.org/jira/browse/SPARK-33670
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Invoke the check verifyPartitionProviderIsHive() from v1 implementation of 
> SHOW TABLE EXTENDED.






[jira] [Commented] (SPARK-33688) Migrate SHOW TABLE EXTENDED to new resolution framework

2020-12-07 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245143#comment-17245143
 ] 

Maxim Gekk commented on SPARK-33688:


I am working on this.

> Migrate SHOW TABLE EXTENDED to new resolution framework
> ---
>
> Key: SPARK-33688
> URL: https://issues.apache.org/jira/browse/SPARK-33688
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> # Create the Command logical node for SHOW TABLE EXTENDED
> # Remove ShowTableStatement






[jira] [Created] (SPARK-33688) Migrate SHOW TABLE EXTENDED to new resolution framework

2020-12-07 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33688:
--

 Summary: Migrate SHOW TABLE EXTENDED to new resolution framework
 Key: SPARK-33688
 URL: https://issues.apache.org/jira/browse/SPARK-33688
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


# Create the Command logical node for SHOW TABLE EXTENDED
# Remove ShowTableStatement
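
An illustrative shape of the new logical node from step 1 (field names are assumptions based on other v2 commands, not the final design):
{code:scala}
// Step 1: a dedicated node that the new resolution framework can resolve,
// replacing the parser-level ShowTableStatement removed in step 2.
case class ShowTableExtendedSketch(
    namespace: Seq[String],                     // e.g. Seq("ns1", "ns2")
    pattern: String,                            // table name pattern, e.g. "tbl*"
    partitionSpec: Option[Map[String, String]]) // optional PARTITION(...) clause
{code}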






[jira] [Created] (SPARK-33676) Require exact matched partition spec to schema in ADD/DROP PARTITION

2020-12-06 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33676:
--

 Summary: Require exact matched partition spec to schema in 
ADD/DROP PARTITION
 Key: SPARK-33676
 URL: https://issues.apache.org/jira/browse/SPARK-33676
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The V1 implementation of ALTER TABLE .. ADD/DROP PARTITION fails when the 
partition spec doesn't exactly match the partition schema:
{code:sql}
ALTER TABLE tab1 ADD PARTITION (A='9')
Partition spec is invalid. The spec (a) must match the partition spec (a, b) 
defined in table '`dbx`.`tab1`';
org.apache.spark.sql.AnalysisException: Partition spec is invalid. The spec (a) 
must match the partition spec (a, b) defined in table '`dbx`.`tab1`';
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$requireExactMatchedPartitionSpec$1(SessionCatalog.scala:1173)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$requireExactMatchedPartitionSpec$1$adapted(SessionCatalog.scala:1171)
at scala.collection.immutable.List.foreach(List.scala:392)
{code}
for a table partitioned by "a", "b", but the V2 implementation adds the wrong 
partition silently.
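
A simplified version of the V1 check that the V2 path should mirror (illustrative; the real method is SessionCatalog.requireExactMatchedPartitionSpec):
{code:scala}
import org.apache.spark.sql.AnalysisException

def requireExactMatchedPartitionSpec(
    tableName: String,
    spec: Map[String, String],
    partitionColumns: Seq[String]): Unit = {
  if (spec.keys.toSet != partitionColumns.toSet) {
    throw new AnalysisException(
      s"Partition spec is invalid. The spec (${spec.keys.mkString(", ")}) must " +
        s"match the partition spec (${partitionColumns.mkString(", ")}) " +
        s"defined in table '$tableName'")
  }
}
{code}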






[jira] [Commented] (SPARK-33672) Check SQLContext.tables() for V2 session catalog

2020-12-05 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244542#comment-17244542
 ] 

Maxim Gekk commented on SPARK-33672:


[~cloud_fan] FYI

> Check SQLContext.tables() for V2 session catalog
> 
>
> Key: SPARK-33672
> URL: https://issues.apache.org/jira/browse/SPARK-33672
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The V1 ShowTablesCommand is hard-coded in SQLContext:
> https://github.com/apache/spark/blob/a088a801ed8c17171545c196a3f26ce415de0cd1/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L671
> The ticket aims to check the tables() behavior for the V2 session catalog.






[jira] [Created] (SPARK-33672) Check SQLContext.tables() for V2 session catalog

2020-12-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33672:
--

 Summary: Check SQLContext.tables() for V2 session catalog
 Key: SPARK-33672
 URL: https://issues.apache.org/jira/browse/SPARK-33672
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The V1 ShowTablesCommand is hard-coded in SQLContext:
https://github.com/apache/spark/blob/a088a801ed8c17171545c196a3f26ce415de0cd1/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L671
The ticket aims to check the tables() behavior for the V2 session catalog.






[jira] [Created] (SPARK-33671) Remove VIEW checks from V1 table commands

2020-12-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33671:
--

 Summary: Remove VIEW checks from V1 table commands
 Key: SPARK-33671
 URL: https://issues.apache.org/jira/browse/SPARK-33671
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Maxim Gekk


Checking of VIEWs is now performed earlier (see 
https://github.com/apache/spark/pull/30461), so the checks can be removed from 
some V1 table commands.






[jira] [Created] (SPARK-33670) Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED

2020-12-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33670:
--

 Summary: Verify the partition provider is Hive in v1 SHOW TABLE 
EXTENDED
 Key: SPARK-33670
 URL: https://issues.apache.org/jira/browse/SPARK-33670
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Maxim Gekk


Invoke the check verifyPartitionProviderIsHive() from v1 implementation of SHOW 
TABLE EXTENDED.
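
A rough sketch of where the check would be invoked (the helper DDLUtils.verifyPartitionProviderIsHive already exists; the surrounding method and names are assumptions):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.execution.command.DDLUtils

def checkShowTableExtended(
    spark: SparkSession,
    table: CatalogTable,
    partitionSpec: Option[Map[String, String]]): Unit = {
  if (partitionSpec.isDefined) {
    // Reject the PARTITION(...) form when partition metadata is not managed by Hive.
    DDLUtils.verifyPartitionProviderIsHive(spark, table, "SHOW TABLE EXTENDED")
  }
}
{code}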






[jira] [Updated] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS

2020-12-04 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33667:
---
Description: 
SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
*spark.sql.caseSensitive* which is false by default, for instance:
{code:sql}
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
 > USING parquet
 > PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW 
PARTITIONS;
{code}
 

  was:
SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
*spark.sql.caseSensitive* which is true by default, for instance:
{code:sql}
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
 > USING parquet
 > PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW 
PARTITIONS;
{code}
 


> Respect case sensitivity in V1 SHOW PARTITIONS
> --
>
> Key: SPARK-33667
> URL: https://issues.apache.org/jira/browse/SPARK-33667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
> *spark.sql.caseSensitive* which is false by default, for instance:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
>  > USING parquet
>  > PARTITIONED BY (year, month);
> spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
> spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
> Error in query: Non-partitioning column(s) [YEAR, Month] are specified for 
> SHOW PARTITIONS;
> {code}
>  






[jira] [Created] (SPARK-33667) Respect case sensitivity in V1 SHOW PARTITIONS

2020-12-04 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33667:
--

 Summary: Respect case sensitivity in V1 SHOW PARTITIONS
 Key: SPARK-33667
 URL: https://issues.apache.org/jira/browse/SPARK-33667
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.8, 3.0.2, 3.1.0
Reporter: Maxim Gekk


SHOW PARTITIONS is case sensitive, and doesn't respect the SQL config 
*spark.sql.caseSensitive* which is true by default, for instance:
{code:sql}
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
 > USING parquet
 > PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW 
PARTITIONS;
{code}
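
A minimal sketch of the fix direction (illustrative, not the actual patch): normalize the user-specified partition column names against the table's partition columns with the session resolver before validating them.
{code:scala}
import org.apache.spark.sql.AnalysisException

def normalizePartitionSpec(
    spec: Map[String, String],
    partitionColumns: Seq[String],
    resolver: (String, String) => Boolean): Map[String, String] = {
  spec.map { case (key, value) =>
    val normalizedKey = partitionColumns.find(resolver(_, key)).getOrElse {
      throw new AnalysisException(s"$key is not a valid partition column " +
        s"in ${partitionColumns.mkString("[", ", ", "]")}")
    }
    normalizedKey -> value
  }
}

// With the default case-insensitive resolver, PARTITION(YEAR = 2015, Month = 1)
// would be normalized to the defined columns (year, month) instead of failing.
{code}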
 






[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-03 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243441#comment-17243441
 ] 

Maxim Gekk commented on SPARK-33571:


I opened the PR [https://github.com/apache/spark/pull/30596] with some 
improvements for config docs. [~hyukjin.kwon] [~cloud_fan] could you review it, 
please.

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing/show the behavior, it uses 
> pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the 
> outputs in a comment below as well.






[jira] [Created] (SPARK-33650) Misleading error from ALTER TABLE .. PARTITION for non-supported partition management table

2020-12-03 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33650:
--

 Summary: Misleading error from ALTER TABLE .. PARTITION for 
non-supported partition management table
 Key: SPARK-33650
 URL: https://issues.apache.org/jira/browse/SPARK-33650
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


For a V2 table that doesn't support partition management, ALTER TABLE .. 
ADD/DROP PARTITION throws a misleading exception:
{code:java}
PartitionSpecs are not resolved;;
'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false
+- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, 
ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable@5d3ff859

org.apache.spark.sql.AnalysisException: PartitionSpecs are not resolved;;
'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false
+- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog@2fd64b11, 
ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable@5d3ff859

at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49)
{code}

The error should say that the table doesn't support partition management.
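
A sketch of the desired check (illustrative; SupportsPartitionManagement and Table are the existing connector interfaces, the rest is an assumption):
{code:scala}
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.connector.catalog.{SupportsPartitionManagement, Table}

def asPartitionManagement(table: Table, name: String): SupportsPartitionManagement =
  table match {
    case t: SupportsPartitionManagement => t
    case _ => throw new AnalysisException(
      s"Table $name can not alter partitions: it does not support partition management")
  }
{code}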






[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241408#comment-17241408
 ] 

Maxim Gekk commented on SPARK-33571:


[~simonvanderveldt] Looking at the dates you tested, both 1880-10-01 and 
2020-10-01 belong to the Gregorian calendar, so there should be no diffs.

For the date 0220-10-01, please have a look at the table which I built in the 
PR: https://github.com/apache/spark/pull/28067 . The table shows that there are 
no diffs between the two calendars for that year.

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing/show the behavior, it uses 
> pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the 
> outputs in a comment below as well.






[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241400#comment-17241400
 ] 

Maxim Gekk commented on SPARK-33571:


Spark 3.0.1 shows different results as well:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_275)
scala> 
spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
20/12/01 12:31:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkUpgradeException: You may get a different result due to 
the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
calendar. See more details in SPARK-31404. You can set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
datetime values w.r.t. the calendar difference during reading. Or set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
datetime values as it is.

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", 
"LEGACY")

scala> 
spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-01|1001-01-01|
|1001-01-01|1001-01-02|
|1001-01-01|1001-01-03|
|1001-01-01|1001-01-04|
|1001-01-01|1001-01-05|
|1001-01-01|1001-01-06|
|1001-01-01|1001-01-07|
|1001-01-01|1001-01-08|
+----------+----------+


scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", 
"CORRECTED")

scala> 
spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-07|1001-01-07|
|1001-01-07|1001-01-08|
|1001-01-07|1001-01-09|
|1001-01-07|1001-01-10|
|1001-01-07|1001-01-11|
|1001-01-07|1001-01-12|
|1001-01-07|1001-01-13|
|1001-01-07|1001-01-14|
+----------+----------+

{code}

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there 

[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241379#comment-17241379
 ] 

Maxim Gekk commented on SPARK-33571:


I have tried to reproduce the issue on the master branch by reading the file 
saved by Spark 2.4.5 
(https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data):
{code:scala}
  test("SPARK-33571: read ancient dates saved by Spark 2.4.5") {
withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> 
LEGACY.toString) {
  val path = 
getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet")
  val df = spark.read.parquet(path)
  df.show(false)
}
withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> 
CORRECTED.toString) {
  val path = 
getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet")
  val df = spark.read.parquet(path)
  df.show(false)
}
  }
{code}

The results are different in LEGACY and in CORRECTED modes:
{code}
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-01|1001-01-01|
|1001-01-01|1001-01-02|
|1001-01-01|1001-01-03|
|1001-01-01|1001-01-04|
|1001-01-01|1001-01-05|
|1001-01-01|1001-01-06|
|1001-01-01|1001-01-07|
|1001-01-01|1001-01-08|
+----------+----------+

+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-07|1001-01-07|
|1001-01-07|1001-01-08|
|1001-01-07|1001-01-09|
|1001-01-07|1001-01-10|
|1001-01-07|1001-01-11|
|1001-01-07|1001-01-12|
|1001-01-07|1001-01-13|
|1001-01-07|1001-01-14|
+----------+----------+
{code}

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
>  * When writing parqet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts 

[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241339#comment-17241339
 ] 

Maxim Gekk commented on SPARK-33571:


[~simonvanderveldt] Thank you for the detailed description and your 
investigation. Let me clarify a few things:

> From our testing we're seeing several issues:
> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. that 
> contains fields of type `TimestampType` which contain timestamps before the 
> above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compares to that dataframe in Spark 2.4.5

Spark 2.4.5 writes timestamps as the Parquet INT96 type. The SQL config 
`datetimeRebaseModeInRead` does not affect reading such types in Spark 3.0.1, 
so Spark always performs rebasing (LEGACY mode). We recently added separate 
configs for INT96:
* https://github.com/apache/spark/pull/30056
* https://github.com/apache/spark/pull/30121

The changes will be released with Spark 3.1.0.
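
For example, a minimal sketch assuming the INT96-specific configs from the PRs above are available:

{code:scala}
// Assumes Spark 3.1.0+, where the INT96-specific rebase configs exist.
// CORRECTED = read the stored values as-is, LEGACY = rebase from the hybrid calendar.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
spark.read.parquet("/path/written_by_spark_2.4.5").show()
{code}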

> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that 
> contains fields of type `TimestampType` or `DateType` which contain dates or 
> timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.

For INT96, this seems to be the correct behavior. We should observe different results 
for TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, see the SQL config 
spark.sql.parquet.outputTimestampType.
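
A minimal sketch, assuming the goal is to make the `datetimeRebaseMode*` configs apply by not writing INT96:

{code:scala}
// Write timestamps as TIMESTAMP_MICROS instead of the default INT96, so that the
// datetimeRebaseMode* configs apply; with the default EXCEPTION mode this write
// should raise SparkUpgradeException for ancient values, as described above.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark.range(1).selectExpr("timestamp'1000-01-01 00:00:00' as ts")
  .write.mode("overwrite").parquet("/tmp/ts_micros")
{code}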

The DATE case is more interesting as we must see a difference in results for 
ancient dates. I will investigate this case. 

 

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1 with, for example, `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1 with, for example, `df.show()` than 
> they did in Spark 2.4.5
>  * When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]),
>  the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned 

[jira] [Updated] (SPARK-33591) NULL is recognized as the "null" string in partition specs

2020-11-29 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33591:
---
Issue Type: Bug  (was: Improvement)

> NULL is recognized as the "null" string in partition specs
> --
>
> Key: SPARK-33591
> URL: https://issues.apache.org/jira/browse/SPARK-33591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> For example:
> {code:sql}
> spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED 
> BY (p1);
> spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
> spark-sql> SELECT isnull(p1) FROM tbl5;
> false
> {code}
> The *p1 = null* is not recognized as a partition with NULL value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33591) NULL is recognized as the "null" string in partition specs

2020-11-29 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33591:
---
Issue Type: Improvement  (was: Bug)

> NULL is recognized as the "null" string in partition specs
> --
>
> Key: SPARK-33591
> URL: https://issues.apache.org/jira/browse/SPARK-33591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> For example:
> {code:sql}
> spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED 
> BY (p1);
> spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
> spark-sql> SELECT isnull(p1) FROM tbl5;
> false
> {code}
> The *p1 = null* is not recognized as a partition with NULL value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33591) NULL is recognized as the "null" string in partition specs

2020-11-29 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33591:
--

 Summary: NULL is recognized as the "null" string in partition specs
 Key: SPARK-33591
 URL: https://issues.apache.org/jira/browse/SPARK-33591
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


For example:
{code:sql}
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY 
(p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
{code}

The *p1 = null* is not recognized as a partition with NULL value.
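
A quick way to observe the reported behavior (assumption for illustration, not verified in the ticket: the partition value ends up stored as the literal string "null"):

{code:scala}
// If the value is stored as the string "null" rather than a SQL NULL, the string
// comparison matches the inserted row while isnull(p1) stays false.
spark.sql("SELECT count(*) FROM tbl5 WHERE p1 = 'null'").show()
{code}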



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column

2020-11-28 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33585:
---
Affects Version/s: (was: 2.4.7)
   (was: 3.0.1)
   3.0.2
   2.4.8

> The comment for SQLContext.tables() doesn't mention the `database` column
> -
>
> Key: SPARK-33585
> URL: https://issues.apache.org/jira/browse/SPARK-33585
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The comment says: "The returned DataFrame has two columns, tableName and 
> isTemporary":
> https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664
> but actually the dataframe has 3 columns:
> {code:scala}
> scala> spark.range(10).createOrReplaceTempView("view1")
> scala> val tables = spark.sqlContext.tables()
> tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string 
> ... 1 more field]
> scala> tables.printSchema
> root
>  |-- database: string (nullable = false)
>  |-- tableName: string (nullable = false)
>  |-- isTemporary: boolean (nullable = false)
> scala> tables.show
> +--------+---------+-----------+
> |database|tableName|isTemporary|
> +--------+---------+-----------+
> | default|       t1|      false|
> | default|       t2|      false|
> | default|      ymd|      false|
> |        |    view1|       true|
> +--------+---------+-----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33588) Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`

2020-11-28 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33588:
--

 Summary: Partition spec in SHOW TABLE EXTENDED doesn't respect 
`spark.sql.caseSensitive`
 Key: SPARK-33588
 URL: https://issues.apache.org/jira/browse/SPARK-33588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.8, 3.0.2, 3.1.0
Reporter: Maxim Gekk


For example:
{code:sql}
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
 > USING parquet
 > partitioned by (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
Error in query: Partition spec is invalid. The spec (YEAR, Month) must match 
the partition spec (year, month) defined in table '`default`.`tbl1`';
{code}
The spark.sql.caseSensitive flag is false by default, so the partition spec should 
be considered valid.
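
Sketch of the expected behavior once the config is respected:

{code:scala}
// With spark.sql.caseSensitive=false (the default), the differently-cased spec
// should resolve against the (year, month) partition columns instead of failing.
spark.sql("SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1)").show(false)
{code}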



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33585) The comment for SQLContext.tables() doesn't mention the `database` column

2020-11-28 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33585:
--

 Summary: The comment for SQLContext.tables() doesn't mention the 
`database` column
 Key: SPARK-33585
 URL: https://issues.apache.org/jira/browse/SPARK-33585
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.1, 2.4.7, 3.1.0
Reporter: Maxim Gekk


The comment says: "The returned DataFrame has two columns, tableName and 
isTemporary":
https://github.com/apache/spark/blob/b26ae98407c6c017a4061c0c420f48685ddd6163/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L664

but actually the dataframe has 3 columns:
{code:scala}
scala> spark.range(10).createOrReplaceTempView("view1")
scala> val tables = spark.sqlContext.tables()
tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string 
... 1 more field]

scala> tables.printSchema
root
 |-- database: string (nullable = false)
 |-- tableName: string (nullable = false)
 |-- isTemporary: boolean (nullable = false)


scala> tables.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|       t1|      false|
| default|       t2|      false|
| default|      ymd|      false|
|        |    view1|       true|
+--------+---------+-----------+
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33569) Remove getting partitions by only ident

2020-11-26 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33569:
--

 Summary: Remove getting partitions by only ident
 Key: SPARK-33569
 URL: https://issues.apache.org/jira/browse/SPARK-33569
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


This is a follow-up of SPARK-33509, which added a function for getting partitions 
by names and ident. The function that gets partitions only by ident is not used 
anymore, and it can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33558) Unify v1 and v2 ALTER TABLE .. PARTITION tests

2020-11-25 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33558:
--

 Summary: Unify v1 and v2 ALTER TABLE .. PARTITION tests
 Key: SPARK-33558
 URL: https://issues.apache.org/jira/browse/SPARK-33558
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Extract ALTER TABLE .. PARTITION tests to a common place to run them for V1 and V2 
datasources. Some tests can be placed in V1- and V2-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33529) Resolver of V2 partition specs doesn't handle __HIVE_DEFAULT_PARTITION__

2020-11-24 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33529:
---
Description: 
The partition value '__HIVE_DEFAULT_PARTITION__' should be handled as null - 
the same as DSv1 does.

For example in DSv1:

{code:java}
spark-sql> CREATE TABLE tbl11 (id int, part0 string) USING parquet PARTITIONED 
BY (part0);
spark-sql> ALTER TABLE tbl11 ADD PARTITION (part0 = 
'__HIVE_DEFAULT_PARTITION__');
spark-sql> INSERT INTO tbl11 PARTITION (part0='__HIVE_DEFAULT_PARTITION__') 
SELECT 1;
spark-sql> SELECT * FROM tbl11;
1   NULL
{code}
 

 

  was:The partition value '__HIVE_DEFAULT_PARTITION__' should be handled as 
null - the same as DSv1 does.


> Resolver of V2 partition specs doesn't handle __HIVE_DEFAULT_PARTITION__
> 
>
> Key: SPARK-33529
> URL: https://issues.apache.org/jira/browse/SPARK-33529
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The partition value '__HIVE_DEFAULT_PARTITION__' should be handled as null - 
> the same as DSv1 does.
> For example in DSv1:
> {code:java}
> spark-sql> CREATE TABLE tbl11 (id int, part0 string) USING parquet 
> PARTITIONED BY (part0);
> spark-sql> ALTER TABLE tbl11 ADD PARTITION (part0 = 
> '__HIVE_DEFAULT_PARTITION__');
> spark-sql> INSERT INTO tbl11 PARTITION (part0='__HIVE_DEFAULT_PARTITION__') 
> SELECT 1;
> spark-sql> SELECT * FROM tbl11;
> 1 NULL
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33529) Resolver of V2 partition specs doesn't handle __HIVE_DEFAULT_PARTITION__

2020-11-24 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33529:
--

 Summary: Resolver of V2 partition specs doesn't handle 
__HIVE_DEFAULT_PARTITION__
 Key: SPARK-33529
 URL: https://issues.apache.org/jira/browse/SPARK-33529
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The partition value '__HIVE_DEFAULT_PARTITION__' should be handled as null - 
the same as DSv1 does.
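
A minimal sketch of the desired normalization (hypothetical helper, not the actual Spark code):

{code:scala}
// Map the Hive default-partition marker to a SQL NULL, mirroring DSv1 behavior.
def normalizePartitionValue(raw: String): String =
  if (raw == "__HIVE_DEFAULT_PARTITION__") null else raw
{code}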



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33521) Universal type conversion of V2 partition values

2020-11-23 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33521:
--

 Summary: Universal type conversion of V2 partition values
 Key: SPARK-33521
 URL: https://issues.apache.org/jira/browse/SPARK-33521
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Support other types while resolving partition specs in

https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72
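
A rough sketch of the idea (illustrative helper, not the actual patch):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.DataType

// Convert the raw string value from a partition spec to the partition column's type.
// Works for simple types; types that need a time zone require resolving Cast's timeZoneId.
def convertPartitionValue(raw: String, dt: DataType): Any = Cast(Literal(raw), dt).eval()
{code}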



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33511) Respect case sensitivity in resolving partition specs V2

2020-11-21 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33511:
--

 Summary: Respect case sensitivity in resolving partition specs V2
 Key: SPARK-33511
 URL: https://issues.apache.org/jira/browse/SPARK-33511
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


DSv1 DDL commands respect the SQL config spark.sql.caseSensitive, for example
{code:java}
spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet PARTITIONED 
BY (id);
spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1);
spark-sql> SHOW PARTITIONS tbl1;
id=1
{code}
but the same ALTER TABLE command fails on DSv2.
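
One possible direction (an assumption about the fix, not the actual patch) is to reuse the configured resolver:

{code:scala}
// The resolver is case-insensitive when spark.sql.caseSensitive=false, so "ID"
// can be matched against the partition column "id" while resolving the spec.
val resolver = spark.sessionState.conf.resolver
val matched = Seq("id", "data").find(col => resolver(col, "ID"))  // Some("id") by default
{code}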



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33509) List partition by names from V2 tables that support partition management

2020-11-21 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33509:
--

 Summary: List partition by names from V2 tables that support 
partition management
 Key: SPARK-33509
 URL: https://issues.apache.org/jira/browse/SPARK-33509
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Currently, the SupportsPartitionManagement interface exposes only the 
listPartitionIdentifiers() method, which does not allow listing partitions by 
names. So, it is hard to implement:
{code:java}
SHOW PARTITIONS table PARTITION(month=2)
{code}
from the table like:
{code:java}
CREATE TABLE $table (price int, qty int, year int, month int)
USING parquet
partitioned by (year, month)
{code}
because listPartitionIdentifiers() requires specifying a value for *year*.
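
One possible shape of such an API (the name and signature are assumptions, not the committed interface) is sketched below:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow

trait ListsPartitionsByNames {
  // Return identifiers of all partitions whose values match `values` for the given
  // subset of partition column `names`, e.g. names = Array("month"), values = InternalRow(2).
  def listPartitionByNames(names: Array[String], values: InternalRow): Array[InternalRow]
}
{code}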



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33505) Fix insert into `InMemoryPartitionTable`

2020-11-20 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33505:
--

 Summary: Fix insert into `InMemoryPartitionTable`
 Key: SPARK-33505
 URL: https://issues.apache.org/jira/browse/SPARK-33505
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Currently, INSERT INTO a partitioned table in V2 in-memory catalog doesn't 
create partitions. The example below demonstrates the issue:

{code:scala}
  test("insert into partitioned table") {
val t = "testpart.ns1.ns2.tbl"
withTable(t) {
  spark.sql(
s"""
   |CREATE TABLE $t (id bigint, name string, data string)
   |USING foo
   |PARTITIONED BY (id, name)""".stripMargin)
  spark.sql(s"INSERT INTO $t PARTITION(id = 1, name = 'Max') SELECT 'abc'")

  val partTable = catalog("testpart").asTableCatalog
.loadTable(Identifier.of(Array("ns1", "ns2"), 
"tbl")).asInstanceOf[InMemoryPartitionTable]
  assert(partTable.partitionExists(InternalRow.fromSeq(Seq(1, 
UTF8String.fromString("Max")
}
  }
{code}

The partitionExists() function returns false for the partitions that should have 
been created.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33453) Unify v1 and v2 SHOW PARTITIONS tests

2020-11-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33453:
--

 Summary: Unify v1 and v2 SHOW PARTITIONS tests
 Key: SPARK-33453
 URL: https://issues.apache.org/jira/browse/SPARK-33453
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Gather common tests for the DSv1 and DSv2 SHOW PARTITIONS command into a common 
trait. Mix this trait into datasource-specific test suites.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33452) Create a V2 SHOW PARTITIONS execution node

2020-11-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33452:
--

 Summary: Create a V2 SHOW PARTITIONS execution node
 Key: SPARK-33452
 URL: https://issues.apache.org/jira/browse/SPARK-33452
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


There is the V1 SHOW PARTITIONS implementation:
https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L975
The ticket aims to add V2 implementation with similar behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33452) Create a V2 SHOW PARTITIONS execution node

2020-11-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232095#comment-17232095
 ] 

Maxim Gekk commented on SPARK-33452:


I plan to work on this soon.

> Create a V2 SHOW PARTITIONS execution node
> --
>
> Key: SPARK-33452
> URL: https://issues.apache.org/jira/browse/SPARK-33452
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> There is the V1 SHOW PARTITIONS implementation:
> https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L975
> The ticket aims to add V2 implementation with similar behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33393) Support SHOW TABLE EXTENDED in DSv2

2020-11-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232094#comment-17232094
 ] 

Maxim Gekk commented on SPARK-33393:


I plan to work on this soon.

> Support SHOW TABLE EXTENDED in DSv2
> ---
>
> Key: SPARK-33393
> URL: https://issues.apache.org/jira/browse/SPARK-33393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Current implementation of DSv2 SHOW TABLE doesn't support the EXTENDED mode 
> in:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala#L33
> which is supported in DSv1:
> https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L870
> Need to add the same functionality to ShowTablesExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog

2020-11-12 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230514#comment-17230514
 ] 

Maxim Gekk commented on SPARK-33430:


[~cloud_fan] [~huaxingao] It would be nice to support namespaces, WDYT?

> Support namespaces in JDBC v2 Table Catalog
> ---
>
> Key: SPARK-33430
> URL: https://issues.apache.org/jira/browse/SPARK-33430
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> When I extend JDBCTableCatalogSuite by 
> org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance:
> {code:scala}
> import org.apache.spark.sql.execution.command.v2.ShowTablesSuite
> class JDBCTableCatalogSuite extends ShowTablesSuite {
>   override def version: String = "JDBC V2"
>   override def catalog: String = "h2"
> ...
> {code}
> some tests from JDBCTableCatalogSuite fail with:
> {code}
> [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 
> seconds, 502 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does 
> not support namespaces;
> [info]   at 
> org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog

2020-11-12 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33430:
--

 Summary: Support namespaces in JDBC v2 Table Catalog
 Key: SPARK-33430
 URL: https://issues.apache.org/jira/browse/SPARK-33430
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


When I extend JDBCTableCatalogSuite by 
org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance:
{code:scala}
import org.apache.spark.sql.execution.command.v2.ShowTablesSuite

class JDBCTableCatalogSuite extends ShowTablesSuite {
  override def version: String = "JDBC V2"
  override def catalog: String = "h2"
...
{code}
some tests from JDBCTableCatalogSuite fail with:
{code}
[info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 seconds, 
502 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does 
not support namespaces;
[info]   at 
org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83)
[info]   at 
org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208)
[info]   at 
org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33426) Unify Hive SHOW TABLES tests

2020-11-11 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33426:
--

 Summary: Unify Hive SHOW TABLES tests
 Key: SPARK-33426
 URL: https://issues.apache.org/jira/browse/SPARK-33426
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


1. Move Hive SHOW TABLES tests to a separate test suite
2. Make the new test suite extend the common SHOW TABLES trait



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33403) DSv2 SHOW TABLES doesn't show `default`

2020-11-09 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33403:
--

 Summary: DSv2 SHOW TABLES doesn't show `default`
 Key: SPARK-33403
 URL: https://issues.apache.org/jira/browse/SPARK-33403
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


DSv1:
{code:scala}
  test("namespace is not specified and the default catalog is set") {
withSQLConf(SQLConf.DEFAULT_CATALOG.key -> catalog) {
  withTable("table") {
spark.sql(s"CREATE TABLE table (id bigint, data string) $defaultUsing")
sql("SHOW TABLES").show()
  }
}
  }
{code}
{code}
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|    table|      false|
+--------+---------+-----------+
{code}

DSv2:
{code}
+---------+---------+
|namespace|tableName|
+---------+---------+
|         |    table|
+---------+---------+
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33394) Throw `NoSuchDatabaseException` for not existing namespace in DSv2 SHOW TABLES

2020-11-09 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33394:
--

 Summary: Throw `NoSuchDatabaseException` for not existing 
namespace in DSv2 SHOW TABLES
 Key: SPARK-33394
 URL: https://issues.apache.org/jira/browse/SPARK-33394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The current implementation of DSv2 SHOW TABLES returns an empty result for a 
non-existing database/namespace. This implementation should be aligned with DSv1, 
which throws the `NoSuchDatabaseException` exception.
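
For reference, a sketch of the DSv1 behavior to align with (the exact message may differ):

{code:scala}
spark.sql("SHOW TABLES IN non_existing_db")
// throws org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException:
//   Database 'non_existing_db' not found
{code}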



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33393) Support SHOW TABLE EXTENDED in DSv2

2020-11-09 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33393:
--

 Summary: Support SHOW TABLE EXTENDED in DSv2
 Key: SPARK-33393
 URL: https://issues.apache.org/jira/browse/SPARK-33393
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Current implementation of DSv2 SHOW TABLE doesn't support the EXTENDED mode in:
https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala#L33
which is supported in DSv1:
https://github.com/apache/spark/blob/7e99fcd64efa425f3c985df4fe957a3be274a49a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L870

Need to add the same functionality to ShowTablesExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33364) Expose purge option in TableCatalog.dropTable

2020-11-09 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33364:
---
Parent: SPARK-33392
Issue Type: Sub-task  (was: New Feature)

> Expose purge option in TableCatalog.dropTable
> -
>
> Key: SPARK-33364
> URL: https://issues.apache.org/jira/browse/SPARK-33364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.1.0
>
>
> TableCatalog.dropTable currently does not support the purge option.
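
One possible shape for exposing purge (illustrative only; the actual API added by this ticket may differ):

{code:scala}
import org.apache.spark.sql.connector.catalog.Identifier

trait SupportsPurge {
  // Drop the table and also purge its data instead of moving it to the trash.
  def purgeTable(ident: Identifier): Boolean
}
{code}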



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache

2020-11-09 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33305:
---
Parent: SPARK-33392
Issue Type: Sub-task  (was: Bug)

> DSv2: DROP TABLE command should also invalidate cache
> -
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the 
> table but doesn't invalidate all caches referencing the table. We should make 
> the behavior consistent between v1 and v2.
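
A minimal sketch of the scenario (illustrative; catalog and table names are placeholders):

{code:scala}
// Cache a v2 table, then drop it; the desired behavior is that DROP TABLE also
// invalidates the cached data, as it already does for v1 tables.
val df = spark.table("testcat.ns.tbl")
df.cache()
spark.sql("DROP TABLE testcat.ns.tbl")
{code}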



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33392) Align DSv2 commands to DSv1 implementation

2020-11-09 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33392:
--

 Summary: Align DSv2 commands to DSv1 implementation
 Key: SPARK-33392
 URL: https://issues.apache.org/jira/browse/SPARK-33392
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The purpose of this umbrella ticket is:
# Implement missing features of datasource v1 commands in DSv2
# Align the behavior of DSv2 commands with the current implementation of DSv1 
commands as much as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests

2020-11-07 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33382:
--

 Summary: Unify v1 and v2 SHOW TABLES tests
 Key: SPARK-33382
 URL: https://issues.apache.org/jira/browse/SPARK-33382
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Gather common tests for the DSv1 and DSv2 SHOW TABLES command into a common trait. 
Mix this trait into datasource-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33381) Unify dsv1 and dsv2 command tests

2020-11-07 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33381:
--

 Summary: Unify dsv1 and dsv2 command tests
 Key: SPARK-33381
 URL: https://issues.apache.org/jira/browse/SPARK-33381
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Create unified test suites for DSv1 and DSv2 commands such as CREATE TABLE, SHOW 
TABLES, etc. Put datasource-specific tests into separate test suites.
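
A sketch of the intended pattern (simplified; the real suites build on Spark's test utilities rather than plain ScalaTest):

{code:scala}
import org.scalatest.funsuite.AnyFunSuite

// Shared checks live in a trait, parameterized by the catalog under test.
trait CommandSuiteBase { self: AnyFunSuite =>
  def version: String   // e.g. "V1" or "V2", used in test names
  def catalog: String   // e.g. "spark_catalog" or "testcat"

  test(s"SHOW TABLES $version: show an existing table") {
    // shared assertions, parameterized by `catalog`, go here
  }
}

// A datasource-specific suite only fixes the parameters.
class ShowTablesV2Suite extends AnyFunSuite with CommandSuiteBase {
  override def version: String = "V2"
  override def catalog: String = "testcat"
}
{code}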



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33381) Unify DSv1 and DSv2 command tests

2020-11-07 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33381:
---
Summary: Unify DSv1 and DSv2 command tests  (was: Unify dsv1 and dsv2 
command tests)

> Unify DSv1 and DSv2 command tests
> -
>
> Key: SPARK-33381
> URL: https://issues.apache.org/jira/browse/SPARK-33381
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Create unified test suites for DSv1 and DSv2 commands such as CREATE TABLE, SHOW 
> TABLES, etc. Put datasource-specific tests into separate test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33299) Unify schema parsing in from_json/from_csv across all APIs

2020-10-30 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33299:
--

 Summary: Unify schema parsing in from_json/from_csv across all APIs
 Key: SPARK-33299
 URL: https://issues.apache.org/jira/browse/SPARK-33299
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Currently, from_json() has an extra capability in the Scala API. It accepts a 
schema in JSON format, but the other APIs (SQL, Python, R) lack the feature. The 
ticket aims to unify all APIs and support schemas in JSON format everywhere.
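
A minimal sketch of the Scala-only capability mentioned above (as run in spark-shell):

{code:scala}
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// The Scala overload accepts the schema as a JSON-format string (as well as DDL).
val jsonSchema =
  """{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}"""
val df = Seq("""{"a": 1}""").toDF("value")
df.select(from_json($"value", jsonSchema, Map.empty[String, String])).show()
{code}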



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33286) Misleading error messages of `from_json`

2020-10-29 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33286:
--

 Summary: Misleading error messages of `from_json`
 Key: SPARK-33286
 URL: https://issues.apache.org/jira/browse/SPARK-33286
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Schema parsing in from_json outputs misleading error messages. It outputs the 
error message only from the fallback parser, and the message does not show the 
actual root cause.

For example:
{code:scala}
val df = Seq("""{"a":1}""").toDF("json")
val invalidJsonSchema = """{"fields": [{"a":123}], "type": "struct"}"""
df.select(from_json($"json", invalidJsonSchema, Map.empty[String, 
String])).show
{code}

{code}
mismatched input '{' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 
'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 
'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 
'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 
'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 
'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 
'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 
'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 
'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 
'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 
'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 
'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 
'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 
'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 
'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 
'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 
'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 
'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 
'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 
'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 
'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 
'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 
'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 
'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 
'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 
'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 
'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 
'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 
'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 
'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 
'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TO', 
'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 
'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 
'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 
'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', 
IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^

org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '{' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 
'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 
'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 
'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 
'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 
'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 
'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 
'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 
'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 
'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 
'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 
'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 
'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 
'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 
'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 
'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 
'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 
'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 
'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 

[jira] [Created] (SPARK-33281) Return SQL schema instead of Catalog string from the `SchemaOfCsv` expression

2020-10-29 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33281:
--

 Summary: Return SQL schema instead of Catalog string from the 
`SchemaOfCsv` expression
 Key: SPARK-33281
 URL: https://issues.apache.org/jira/browse/SPARK-33281
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


SPARK-33270 changed the schema format from a catalog string to the SQL format. This 
ticket aims to unify the output of schema_of_json and schema_of_csv.
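
For illustration, a sketch of the two formats for the same schema:

{code:scala}
import org.apache.spark.sql.types._

val schema = new StructType().add("_c0", IntegerType).add("_c1", StringType)
schema.catalogString  // struct<_c0:int,_c1:string>
schema.sql            // STRUCT<`_c0`: INT, `_c1`: STRING>
{code}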



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33270) from_json() cannot parse schema from schema_of_json()

2020-10-28 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33270:
--

 Summary: from_json() cannot parse schema from schema_of_json()
 Key: SPARK-33270
 URL: https://issues.apache.org/jira/browse/SPARK-33270
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Maxim Gekk


For example:
{code:scala}
val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")
{code}
This fails with the exception:
{code:java}
org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input '<' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 
'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 
'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 
'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 
'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 
'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 
'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 
'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 
'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 
'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 
'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 
'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 
'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 
'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 
'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 
'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 
'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 
'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 
'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 
'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 
'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 
'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 
'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 
'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 
'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 
'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 
'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 
'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 
'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 
'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 
'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TO', 
'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 
'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 
'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 
'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', 
IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)

== SQL ==
struct<a b:bigint>
------^^^

at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
at 
org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
at org.apache.spark.sql.functions$.from_json(functions.scala:4124)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31342) Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582

2020-10-25 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220274#comment-17220274
 ] 

Maxim Gekk commented on SPARK-31342:


[~cloud_fan] I think we can close this because it has been already implemented.

> Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582
> 
>
> Key: SPARK-31342
> URL: https://issues.apache.org/jira/browse/SPARK-31342
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Some users may not know they are creating and/or reading DATE or TIMESTAMP 
> data from before October 15, 1582 (because of data encoding libraries, etc.). 
> Therefore, it may not be clear whether they need to toggle the two 
> rebaseDateTime config settings.
> By default, Spark should fail if it reads or writes data from October 15, 
> 1582 or before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33210) Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default

2020-10-21 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33210:
--

 Summary: Set the rebasing mode for parquet INT96 type to 
`EXCEPTION` by default
 Key: SPARK-33210
 URL: https://issues.apache.org/jira/browse/SPARK-33210
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The ticket aims to set the following SQL configs:
- spark.sql.legacy.parquet.int96RebaseModeInWrite
- spark.sql.legacy.parquet.int96RebaseModeInRead
to EXCEPTION by default.

The reason is to let users decide whether Spark should modify loaded/saved 
timestamps instead of silently shifting them while rebasing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33169) Check propagation of datasource options to underlying file system for built-in file-based datasources

2020-10-16 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33169:
--

 Summary: Check propagation of datasource options to underlying 
file system for built-in file-based datasources
 Key: SPARK-33169
 URL: https://issues.apache.org/jira/browse/SPARK-33169
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Add a common trait with a test to check that datasource options are propagated 
to underlying file systems. Individual tests were already added by SPARK-33094 
and SPARK-33089. The ticket aims to de-duplicate the tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33163) Check the metadata key 'org.apache.spark.legacyDateTime' in Avro/Parquet files

2020-10-15 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33163:
--

 Summary: Check the metadata key 'org.apache.spark.legacyDateTime' 
in Avro/Parquet files
 Key: SPARK-33163
 URL: https://issues.apache.org/jira/browse/SPARK-33163
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Add tests to check that the metadata key 'org.apache.spark.legacyDateTime' is 
saved correctly depending on the SQL configs:
# spark.sql.legacy.avro.datetimeRebaseModeInWrite
# spark.sql.legacy.parquet.datetimeRebaseModeInWrite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33160) Allow saving/loading INT96 in parquet w/o rebasing

2020-10-15 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214643#comment-17214643
 ] 

Maxim Gekk commented on SPARK-33160:


I am working on this.

> Allow saving/loading INT96 in parquet w/o rebasing
> --
>
> Key: SPARK-33160
> URL: https://issues.apache.org/jira/browse/SPARK-33160
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, Spark always performs rebasing of INT96 columns in the Parquet 
> datasource, but this is not required by the Parquet spec. This ticket aims to 
> allow users to turn off rebasing via a SQL config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33160) Allow saving/loading INT96 in parquet w/o rebasing

2020-10-15 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33160:
--

 Summary: Allow saving/loading INT96 in parquet w/o rebasing
 Key: SPARK-33160
 URL: https://issues.apache.org/jira/browse/SPARK-33160
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Currently, Spark always performs rebasing of INT96 columns in the Parquet 
datasource, but this is not required by the Parquet spec. This ticket aims to allow 
users to turn off rebasing via a SQL config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33134:
---
Description: 
The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 
123456}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws the exception in the PERMISSIVE mode (default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same works in Spark 2.4:
{code:scala}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
  /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-----+
|event|
+-----+
+-----+
{code}

  was:
The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throw the exception in the PERMISSIVE mode (default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same works in Spark 2.4:
{code:scala}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
  /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-----+
|event|
+-----+
+-----+
{code}


> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 
> 123456}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws the exception in the PERMISSIVE mode (default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> 

[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33134:
---
Description: 
The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws an exception in PERMISSIVE mode (the default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same code works in Spark 2.4:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-+
|event|
+-+
+-+
{code}

  was:
The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws an exception in PERMISSIVE mode (the default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same code works in Spark 2.4:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-+
|event|
+-+
+-+
{code}


> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
> 583651}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throws an exception in PERMISSIVE mode (the default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> 

[jira] [Created] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-13 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33134:
--

 Summary: Incorrect nested complex JSON fields raise an exception
 Key: SPARK-33134
 URL: https://issues.apache.org/jira/browse/SPARK-33134
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Maxim Gekk


The code below:
{code:scala}
val pokerhand_raw = Seq("""[{"cards": [11], "playerId": 
583651}]""").toDF("events")
val event = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(
new StructType()
  .add("id", LongType)
  .add("rank", StringType)))
val pokerhand_events = pokerhand_raw
  .select(explode(from_json($"events", ArrayType(event))).as("event"))
pokerhand_events.show
{code}
throws an exception in PERMISSIVE mode (the default):
{code:java}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.sql.catalyst.util.ArrayData
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
  at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
  at 
org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
{code}
The same code works in Spark 2.4:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
...
scala> pokerhand_events.show()
+-+
|event|
+-+
+-+
{code}
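For reference, a possible workaround sketch (an assumption on my side, not part of the report): if the incoming JSON really stores cards as plain numbers, declaring the field as ArrayType(LongType) lets the same from_json/explode pipeline run without hitting the cast error. The matchingEvent name below is illustrative only.
{code:scala}
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._
import spark.implicits._

// Hypothetical workaround: use a schema that matches the actual JSON layout
// ("cards" is an array of longs in the sample record), so from_json does not
// have to map a numeric value onto a struct.
val matchingEvent = new StructType()
  .add("playerId", LongType)
  .add("cards", ArrayType(LongType))

pokerhand_raw
  .select(explode(from_json($"events", ArrayType(matchingEvent))).as("event"))
  .show(false)
{code}
Regardless of any workaround, the expectation in PERMISSIVE mode is a null for the mismatched nested field (and hence an empty result after explode, as in the Spark 2.4 output above) rather than a failed query.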



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31294) Benchmark the performance regression

2020-10-12 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212199#comment-17212199
 ] 

Maxim Gekk commented on SPARK-31294:


[~cloud_fan] Could you close this ticket since the PR was merged?

> Benchmark the performance regression
> 
>
> Key: SPARK-31294
> URL: https://issues.apache.org/jira/browse/SPARK-31294
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33101:
---
Description: 
When running:
{code:java}
spark.read.format("libsvm").options(conf).load(path)
{code}
The underlying file system will not receive the `conf` options.

  was:
When running:
{code:java}
spark.read.format("orc").options(conf).load(path)
{code}
The underlying file system will not receive the `conf` options.


> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.
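To make the expected behaviour concrete, here is a minimal sketch (the option key and path below are made-up examples, not from this ticket): Hadoop settings passed through .options(...) should reach the Hadoop FileSystem that backs the path, as they already do for the file-based SQL data sources.
{code:scala}
// Sketch only: "fs.s3a.connection.maximum" is just an example Hadoop key and the
// path is a placeholder. The expectation is that such keys, when passed as data
// source options, are applied to the Hadoop configuration used to list and open
// the underlying files.
val hadoopConf = Map("fs.s3a.connection.maximum" -> "200")
val df = spark.read
  .format("libsvm")
  .options(hadoopConf)
  .load("s3a://some-bucket/data/sample.libsvm")
{code}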



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33101:
--

 Summary: LibSVM format does not propagate Hadoop config from DS 
options to underlying HDFS file system
 Key: SPARK-33101
 URL: https://issues.apache.org/jira/browse/SPARK-33101
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk
Assignee: Maxim Gekk
 Fix For: 3.1.0


When running:
{code:java}
spark.read.format("orc").options(conf).load(path)
{code}
The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33094:
--

 Summary: ORC format does not propagate Hadoop config from DS 
options to underlying HDFS file system
 Key: SPARK-33094
 URL: https://issues.apache.org/jira/browse/SPARK-33094
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk
Assignee: Yuning Zhang
 Fix For: 3.0.2, 3.1.0


When running:
{code:java}
spark.read.format("avro").options(conf).load(path)
{code}
The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-08 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33094:
---
Description: 
When running:
{code:java}
spark.read.format("orc").options(conf).load(path)
{code}
The underlying file system will not receive the `conf` options.

  was:
When running:
{code:java}
spark.read.format("avro").options(conf).load(path)
{code}
The underlying file system will not receive the `conf` options.


> ORC format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> --
>
> Key: SPARK-33094
> URL: https://issues.apache.org/jira/browse/SPARK-33094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Yuning Zhang
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("orc").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog

2020-10-07 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33074:
---
Parent: SPARK-24907
Issue Type: Sub-task  (was: Improvement)

> Classify dialect exceptions in JDBC v2 Table Catalog
> 
>
> Key: SPARK-33074
> URL: https://issues.apache.org/jira/browse/SPARK-33074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The current implementation of v2.jdbc.JDBCTableCatalog does not handle the 
> exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog, 
> such as
> * NoSuchNamespaceException
> * NoSuchTableException
> * TableAlreadyExistsException
> It either throws a dialect-specific exception or the generic AnalysisException.
> Since forming dialect-specific statements is split from their execution, we 
> should extend the dialect APIs and ask them how to convert their exceptions to 
> TableCatalog exceptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog

2020-10-06 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33074:
--

 Summary: Classify dialect exceptions in JDBC v2 Table Catalog
 Key: SPARK-33074
 URL: https://issues.apache.org/jira/browse/SPARK-33074
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The current implementation of v2.jdbc.JDBCTableCatalog does not handle the 
exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog, 
such as
* NoSuchNamespaceException
* NoSuchTableException
* TableAlreadyExistsException
It either throws a dialect-specific exception or the generic AnalysisException.

Since forming dialect-specific statements is split from their execution, we 
should extend the dialect APIs and ask them how to convert their exceptions to 
TableCatalog exceptions.
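A minimal sketch of the kind of hook this proposes, with made-up names (this is not the actual dialect API): each dialect knows its own error codes, so it can translate a driver-level SQLException into the corresponding catalog-level exception.
{code:scala}
import java.sql.SQLException

// Placeholder standing in for a catalog-level error such as NoSuchTableException;
// the real classes live in org.apache.spark.sql.catalyst.analysis.
class NoSuchTableError(msg: String, cause: Throwable) extends RuntimeException(msg, cause)

// Hypothetical per-dialect hook: given a description of the failed operation and
// the raw driver exception, return a TableCatalog-friendly exception.
trait DialectErrorClassifier {
  def classifyException(description: String, e: SQLException): Throwable
}

object PostgresErrorClassifier extends DialectErrorClassifier {
  // 42P01 is PostgreSQL's SQLSTATE for "undefined_table".
  override def classifyException(description: String, e: SQLException): Throwable =
    if (e.getSQLState == "42P01") new NoSuchTableError(description, e) else e
}
{code}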



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests

2020-10-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33067:
--

 Summary: Add negative checks to JDBC v2 Table Catalog tests
 Key: SPARK-33067
 URL: https://issues.apache.org/jira/browse/SPARK-33067
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Add checks when JDBC v2 commands fail



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33066) Port docker integration tests to JDBC v2

2020-10-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33066:
--

 Summary: Port docker integration tests to JDBC v2
 Key: SPARK-33066
 URL: https://issues.apache.org/jira/browse/SPARK-33066
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Port existing docker integration tests like 
org.apache.spark.sql.jdbc.OracleIntegrationSuite to JDBC v2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33034) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (Oracle dialect)

2020-09-30 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33034:
---
Summary: Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
nullability of columns (Oracle dialect)  (was: Support ALTER TABLE in JDBC v2 
Table Catalog: add, update type and nullability of columns)

> Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
> nullability of columns (Oracle dialect)
> --
>
> Key: SPARK-33034
> URL: https://issues.apache.org/jira/browse/SPARK-33034
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Override the default SQL strings for:
> - ALTER TABLE ADD COLUMN
> - ALTER TABLE UPDATE COLUMN TYPE
> - ALTER TABLE UPDATE COLUMN NULLABILITY
> in the Oracle JDBC dialect according to the official documentation.
> Write Oracle integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33034) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns

2020-09-30 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204672#comment-17204672
 ] 

Maxim Gekk commented on SPARK-33034:


I am working on this.

> Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and 
> nullability of columns
> -
>
> Key: SPARK-33034
> URL: https://issues.apache.org/jira/browse/SPARK-33034
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Override the default SQL strings for:
> - ALTER TABLE ADD COLUMN
> - ALTER TABLE UPDATE COLUMN TYPE
> - ALTER TABLE UPDATE COLUMN NULLABILITY
> in the Oracle JDBC dialect according to the official documentation.
> Write Oracle integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33034) Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns

2020-09-30 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33034:
--

 Summary: Support ALTER TABLE in JDBC v2 Table Catalog: add, update 
type and nullability of columns
 Key: SPARK-33034
 URL: https://issues.apache.org/jira/browse/SPARK-33034
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Override the default SQL strings for:
- ALTER TABLE ADD COLUMN
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY

in the Oracle JDBC dialect according to the official documentation.

Write Oracle integration tests for JDBC.
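For orientation, a sketch of the Oracle-specific statements in question. The helper names are illustrative, not the dialect API; the SQL forms follow Oracle's ALTER TABLE documentation, where MODIFY is used instead of ALTER COLUMN.
{code:scala}
// Illustrative helpers only: Oracle wraps the column clause in parentheses and
// uses MODIFY for both type and nullability changes.
def addColumnSql(table: String, column: String, dataType: String): String =
  s"ALTER TABLE $table ADD ($column $dataType)"

def updateColumnTypeSql(table: String, column: String, newType: String): String =
  s"ALTER TABLE $table MODIFY ($column $newType)"

def updateColumnNullabilitySql(table: String, column: String, nullable: Boolean): String = {
  val nullClause = if (nullable) "NULL" else "NOT NULL"
  s"ALTER TABLE $table MODIFY ($column $nullClause)"
}
{code}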



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33015) Compute the current date only once

2020-09-28 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33015:
--

 Summary: Compute the current date only once
 Key: SPARK-33015
 URL: https://issues.apache.org/jira/browse/SPARK-33015
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Maxim Gekk


According to the doc for current_date(), the current date must be computed at 
the start of query evaluation: 
http://spark.apache.org/docs/latest/api/sql/#current_date but it can be 
computed multiple times: 
https://github.com/apache/spark/blob/0df8dd60733066076967f0525210bbdb5e12415a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L85
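A quick way to see the intended semantics from spark-shell (a sketch, not part of the ticket): since current_date() should be fixed at the start of query evaluation, every row of a single query must observe the same value, even if the query runs across a date boundary.
{code:scala}
import org.apache.spark.sql.functions.current_date

// If the value is computed only once per query, this is always 1; a value of 2
// would only be possible if current_date() were re-evaluated while the query
// happens to run across midnight.
val distinctDates = spark.range(0, 10000000L)
  .select(current_date().as("d"))
  .distinct()
  .count()
{code}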



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33015) Compute the current date only once

2020-09-28 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203084#comment-17203084
 ] 

Maxim Gekk commented on SPARK-33015:


I am working on this

> Compute the current date only once
> --
>
> Key: SPARK-33015
> URL: https://issues.apache.org/jira/browse/SPARK-33015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> According to the doc for current_date(), the current date must be computed at 
> the start of query evaluation: 
> http://spark.apache.org/docs/latest/api/sql/#current_date but it can be 
> computed multiple times: 
> https://github.com/apache/spark/blob/0df8dd60733066076967f0525210bbdb5e12415a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L85



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199952#comment-17199952
 ] 

Maxim Gekk commented on SPARK-32306:


I opened PR https://github.com/apache/spark/pull/29835 with clarification.

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 10000) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimum example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
> [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))# gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-22 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199839#comment-17199839
 ] 

Maxim Gekk commented on SPARK-32306:


The function returns an element of the input sequence, see 
https://en.wikipedia.org/wiki/Percentile#The_nearest-rank_method
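A small worked sketch of that nearest-rank rule (plain Scala, independent of Spark): the percentile is always one of the input values, which is why the 0.5 percentile of 5 and 8 comes out as 5 rather than 6.5.
{code:scala}
// Nearest-rank percentile: take the element at rank ceil(p * n) (1-based) of the
// sorted input, so the result is always a member of the input sequence.
def nearestRankPercentile(values: Seq[Int], p: Double): Int = {
  require(values.nonEmpty && p > 0.0 && p <= 1.0)
  val sorted = values.sorted
  val rank = math.ceil(p * sorted.size).toInt
  sorted(rank - 1)
}

nearestRankPercentile(Seq(5, 8), 0.5)   // 5, matching the result reported below
nearestRankPercentile(Seq(5, 8), 0.75)  // 8
{code}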

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 10000) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimum example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
> [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))# gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27345) Expose approx_percentile in python api

2020-09-21 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199618#comment-17199618
 ] 

Maxim Gekk commented on SPARK-27345:


The function was exposed recently by https://github.com/apache/spark/pull/27278

> Expose approx_percentile in python api
> --
>
> Key: SPARK-27345
> URL: https://issues.apache.org/jira/browse/SPARK-27345
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Roi Zaig
>Priority: Major
>
> The function approx_percentile is not exposed in the PySpark API.
> We need to work around it by writing SQL and then converting back to a DataFrame.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32306) `approx_percentile` in Spark SQL gives incorrect results

2020-09-21 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199615#comment-17199615
 ] 

Maxim Gekk commented on SPARK-32306:


[~seanmalory] Which result do you expect? 8?

> `approx_percentile` in Spark SQL gives incorrect results
> 
>
> Key: SPARK-32306
> URL: https://issues.apache.org/jira/browse/SPARK-32306
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Sean Malory
>Priority: Major
>
> The `approx_percentile` function in Spark SQL does not give the correct 
> result. I'm not sure how incorrect it is; it may just be a boundary issue. 
> From the docs:
> {quote}The accuracy parameter (default: 10000) is a positive numeric literal 
> which controls approximation accuracy at the cost of memory. Higher value of 
> accuracy yields better accuracy, 1.0/accuracy is the relative error of the 
> approximation.
> {quote}
> This is not true. Here is a minimum example in `pyspark` where, essentially, 
> the median of 5 and 8 is being calculated as 5:
> {code:python}
> import pyspark.sql.functions as psf
> df = spark.createDataFrame(
> [('bar', 5), ('bar', 8)], ['name', 'val']
> )
> median = psf.expr('percentile_approx(val, 0.5, 2147483647)')
> df.groupBy('name').agg(median.alias('median'))# gives the median as 5
> {code}
> I've tested this with Spark v2.4.4, pyspark v2.4.5- although I suspect this 
> is an issue with the underlying algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30882) Inaccurate results with higher precision ApproximatePercentile results

2020-09-21 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199602#comment-17199602
 ] 

Maxim Gekk commented on SPARK-30882:


This PR [https://github.com/apache/spark/pull/29784] solves the issue:
{code}
+--++
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 1)|
+--++
|metric|[45, 1124848, 3374848, 405] |
+--++

+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 100)|
+--+--+
|metric|[45, 1124996, 3374996, 405]   |
+--+--+

+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+--+--+
|metric|[45, 1125000, 3375000, 405]   |
+--+--+

+--+--+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 700)|
+--+--+
|metric|[45, 1125000, 3375000, 405]   |
+--+--+

+--+---+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 1000)|
+--+---+
|metric|[45, 1125000, 3375000, 405]|
+--+---+
{code}

I propose to close this bug report. cc [~hyukjin.kwon] [~cloud_fan] [~srowen] 
[~dongjoon]

> Inaccurate results with higher precision ApproximatePercentile results 
> ---
>
> Key: SPARK-30882
> URL: https://issues.apache.org/jira/browse/SPARK-30882
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.4
>Reporter: Nandini malempati
>Priority: Major
>
> Results of ApproximatePercentile should have better accuracy with increased 
> precision as per the documentation provided here: 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But I'm seeing nondeterministic behavior. On a data set of size 450, accuracy 
> 100 returns better results than 500 (P25 and P90 look fine but 
> P75 is very off), and accuracy 700 gives exact results. But this behavior is 
> not consistent.
> {code:scala}
> // Some comments here
> package com.microsoft.teams.test.utils
> import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
> import org.scalatest.{Assertions, FunSpec}
> import org.apache.spark.sql.functions.{lit, _}
> import 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
> StructType}
> import scala.collection.mutable.ListBuffer
> class PercentilesTest extends FunSpec with Assertions {
> it("check percentiles with different precision") {
>   val schema = List(StructField("MetricName", StringType), 
> StructField("DataPoint", IntegerType))
>   val data = new ListBuffer[Row]
>   for(i <- 1 to 450) { data += Row("metric", i)}
>   import spark.implicits._
> val df = createDF(schema, data.toSeq)
> val accuracy1000 = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 
> ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
> val accuracy1M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100))
> val accuracy5M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 500))
> val accuracy7M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 700))
> val accuracy10M = 
> df.groupBy("MetricName").agg(percentile_approx($"DataPoint", 
> typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000))
> accuracy1000.show(1, false)
> accuracy1M.show(1, false)
> accuracy5M.show(1, false)
> accuracy7M.show(1, false)
> accuracy10M.show(1, false)
>   }
>   def 
