[jira] [Commented] (SPARK-33357) Support SparkLauncher in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-33357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239550#comment-17239550 ] Apache Spark commented on SPARK-33357: -- User 'hddong' has created a pull request for this issue: https://github.com/apache/spark/pull/30520 > Support SparkLauncher in Kubernetes > --- > > Key: SPARK-33357 > URL: https://issues.apache.org/jira/browse/SPARK-33357 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: hong dongdong >Priority: Major > > Currently, SparkAppHandle cannot get state reports in k8s; we can support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33357) Support SparkLauncher in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-33357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33357: Assignee: (was: Apache Spark) > Support SparkLauncher in Kubernetes > --- > > Key: SPARK-33357 > URL: https://issues.apache.org/jira/browse/SPARK-33357 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: hong dongdong >Priority: Major > > Now, SparkAppHandle can not get state report in k8s, we can support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33357) Support SparkLauncher in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-33357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33357: Assignee: Apache Spark > Support SparkLauncher in Kubernetes > --- > > Key: SPARK-33357 > URL: https://issues.apache.org/jira/browse/SPARK-33357 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: hong dongdong >Assignee: Apache Spark >Priority: Major > > Now, SparkAppHandle can not get state report in k8s, we can support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32691) Update commons-crypto to v1.1.0
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239536#comment-17239536 ] RuiChen edited comment on SPARK-32691 at 11/27/20, 7:18 AM: [~huangtianhua] looks like Spark ARM CI has passed these days; the issue has been fixed, right? https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ was (Author: ruichen): [~huangtianhua] looks Spark ARM CI passed in these days, > Update commons-crypto to v1.1.0 > --- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: huangtianhua >Priority: Major > Fix For: 3.1.0 > > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32691) Update commons-crypto to v1.1.0
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239536#comment-17239536 ] RuiChen commented on SPARK-32691: - [~huangtianhua] looks Spark ARM CI passed in these days, > Update commons-crypto to v1.1.0 > --- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Assignee: huangtianhua >Priority: Major > Fix For: 3.1.0 > > Attachments: Screen Shot 2020-09-28 at 8.49.04 AM.png, failure.log, > success.log > > > Tests of org.apache.spark.DistributedSuite are failed on arm64 jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33575) Fix incorrect exception message for ANALYZE COLUMN
[ https://issues.apache.org/jira/browse/SPARK-33575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33575. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30519 [https://github.com/apache/spark/pull/30519] > Fix incorrect exception message for ANALYZE COLUMN > -- > > Key: SPARK-33575 > URL: https://issues.apache.org/jira/browse/SPARK-33575 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > Fix For: 3.1.0 > > > Currently, "ANALYZE TABLE tempView COMPUTE STATISTICS FOR COLUMNS " > throws "NoSuchTableException" even if "tempView" exists. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
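For context, a minimal PySpark sketch of the command this fix touches (view and column names are illustrative; the fix is about the exception's message, which previously claimed the existing temp view was not found):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A temporary view that definitely exists.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"]) \
    .createOrReplaceTempView("tempView")

# Before the fix this raised NoSuchTableException for "tempView" even though
# the view exists; after the fix the error message reflects the real cause.
spark.sql("ANALYZE TABLE tempView COMPUTE STATISTICS FOR COLUMNS id")
{code}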
[jira] [Assigned] (SPARK-33575) Fix incorrect exception message for ANALYZE COLUMN
[ https://issues.apache.org/jira/browse/SPARK-33575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33575: --- Assignee: Terry Kim > Fix incorrect exception message for ANALYZE COLUMN > -- > > Key: SPARK-33575 > URL: https://issues.apache.org/jira/browse/SPARK-33575 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > > Currently, "ANALYZE TABLE tempView COMPUTE STATISTICS FOR COLUMNS " > throws "NoSuchTableException" even if "tempView" exists. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33566) Incorrectly Parsing CSV file
[ https://issues.apache.org/jira/browse/SPARK-33566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33566. -- Fix Version/s: 3.1.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/30518 > Incorrectly Parsing CSV file > > > Key: SPARK-33566 > URL: https://issues.apache.org/jira/browse/SPARK-33566 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Stephen More >Priority: Minor > Fix For: 3.1.0 > > > Here is a test case: > [https://github.com/mores/maven-examples/blob/master/comma/src/test/java/org/test/CommaTest.java] > It shows how I believe apache commons csv and opencsv correctly parses the > sample csv file. > spark is not correctly parsing the sample csv file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.
Darshat created SPARK-33576: --- Summary: PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'. Key: SPARK-33576 URL: https://issues.apache.org/jira/browse/SPARK-33576 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.1 Environment: Databricks runtime 7.3 Spark 3.0.1 Scala 2.12 Reporter: Darshat Hello, We are using Databricks on Azure to process a large amount of ecommerce data. Databricks runtime is 7.3 which includes Apache Spark 3.0.1 and Scala 2.12. During processing, there is a groupby operation on the DataFrame that consistently gets an exception of this type: PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'. Full traceback below: Traceback (most recent call last): File "/databricks/spark/python/pyspark/worker.py", line 654, in main process() File "/databricks/spark/python/pyspark/worker.py", line 646, in process serializer.dump_stream(out_iter, outfile) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in dump_stream for batch in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in init_stream_yield_batches for series in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in load_stream for batch in batches: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in load_stream for batch in batches: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in __iter__ File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative bodyLength Code that causes this: ## df has 22 million rows and 3 distinct provider ids. Domain features adds a couple of computed columns to the dataframe x = df.groupby('providerid').apply(domain_features) display(x.info()) We've put all possible checks in the code for null values or corrupt data, and we are not able to trace this to application-level code. I hope we can get some help troubleshooting this as this is a blocker for rolling out at scale. Dataframe size: 22 million rows, 31 columns. One of the columns is a string ('providerid') on which we do a groupby followed by an apply operation. There are 3 distinct provider ids in this set. While trying to enumerate/count the results, we get this exception. The cluster has 8 nodes + driver, all 28GB. I can provide any other settings that could be useful. Hope to get some insights into the problem. Thanks, Darshat Shah -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
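The failing pattern, as a self-contained sketch; `domain_features` below is a stand-in for the reporter's unshown function. As an aside (an assumption, not a confirmed diagnosis for this ticket), "negative bodyLength" is commonly attributed to a single Arrow IPC message overflowing the 2 GB signed-int limit, which can happen in grouped-map UDFs when very large groups arrive, e.g. 22 million rows split across only 3 keys:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("p1", 1.0), ("p1", 2.0), ("p2", 3.0)],
    ["providerid", "value"],
)

# Stand-in for the reporter's domain_features: adds a computed column.
# With only a few distinct providerids, each group arrives at the Python
# worker as one very large pandas DataFrame, so per-group size matters.
@pandas_udf("providerid string, value double, value_sq double",
            PandasUDFType.GROUPED_MAP)
def domain_features(pdf):
    return pdf.assign(value_sq=pdf["value"] ** 2)

x = df.groupby("providerid").apply(domain_features)
x.show()
{code}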
[jira] [Updated] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationsuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-33570: --- Description: For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer available in the official apt repository and MariaDBKrbIntegrationSuite doesn't pass for now. It seems that only the most recent three versions are available and they are 10.5.6, 10.5.7 and 10.5.8 for now. Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months) so I don't think it's a good idea to set to a specific version for mariadb-plugin-gssapi-server. was: For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer available in the official apt repository and MariaDBKrbIntegrationSuite doesn't pass for now. It seems that only the most recent three versions are available and they are 10.5.6, 10.5.7 and 10.5.8 for now. Further, the release cycle of MariaDB seems to be too fast (1 ~ 2 months) so I don't think it's a good idea to set to an specific version for mariadb-plugin-gssapi-server. > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationsuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months) > so I don't think it's a good idea to set to a specific version for > mariadb-plugin-gssapi-server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33560) Add "unused import" check to Maven compilation process
[ https://issues.apache.org/jira/browse/SPARK-33560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239495#comment-17239495 ] Hyukjin Kwon commented on SPARK-33560: -- Seems like Scalac now has its built-in silencer support. Once we drop Scala 2.12, I think we can add it this into Maven as well, see also [https://github.com/scala/scala/pull/8373]. cc [~maxgekk] FYI > Add "unused import" check to Maven compilation process > -- > > Key: SPARK-33560 > URL: https://issues.apache.org/jira/browse/SPARK-33560 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Minor > > Similar to SPARK-33441, need add "unused import" check to maven pom. > The blocker is how to achieve the same effect as SBT compiler check, It seems > that adding "-P:silencer:globalFilters=.*deprecated.*" configuration to > "scala-maven-plugin" is not supported at present -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33575) Fix incorrect exception message for ANALYZE COLUMN
[ https://issues.apache.org/jira/browse/SPARK-33575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33575: Assignee: Apache Spark > Fix incorrect exception message for ANALYZE COLUMN > -- > > Key: SPARK-33575 > URL: https://issues.apache.org/jira/browse/SPARK-33575 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > Currently, "ANALYZE TABLE tempView COMPUTE STATISTICS FOR COLUMNS " > throws "NoSuchTableException" even if "tempView" exists. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33575) Fix incorrect exception message for ANALYZE COLUMN
[ https://issues.apache.org/jira/browse/SPARK-33575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33575: Assignee: (was: Apache Spark) > Fix incorrect exception message for ANALYZE COLUMN > -- > > Key: SPARK-33575 > URL: https://issues.apache.org/jira/browse/SPARK-33575 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Currently, "ANALYZE TABLE tempView COMPUTE STATISTICS FOR COLUMNS " > throws "NoSuchTableException" even if "tempView" exists. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33575) Fix incorrect exception message for ANALYZE COLUMN
[ https://issues.apache.org/jira/browse/SPARK-33575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239478#comment-17239478 ] Apache Spark commented on SPARK-33575: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30519 > Fix incorrect exception message for ANALYZE COLUMN > -- > > Key: SPARK-33575 > URL: https://issues.apache.org/jira/browse/SPARK-33575 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Currently, "ANALYZE TABLE tempView COMPUTE STATISTICS FOR COLUMNS " > throws "NoSuchTableException" even if "tempView" exists. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33563) Expose inverse hyperbolic trig functions in PySpark and SparkR
[ https://issues.apache.org/jira/browse/SPARK-33563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33563: Assignee: Maciej Szymkiewicz > Expose inverse hyperbolic trig functions in PySpark and SparkR > -- > > Key: SPARK-33563 > URL: https://issues.apache.org/jira/browse/SPARK-33563 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > > {{acosh}}, {{asinh}} and {{atanh}} were exposed in Scala {{sql.functions}} in > Spark 3.1 (SPARK-33061). > For consistency, we should expose these in Python and R as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33563) Expose inverse hyperbolic trig functions in PySpark and SparkR
[ https://issues.apache.org/jira/browse/SPARK-33563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33563. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30501 [https://github.com/apache/spark/pull/30501] > Expose inverse hyperbolic trig functions in PySpark and SparkR > -- > > Key: SPARK-33563 > URL: https://issues.apache.org/jira/browse/SPARK-33563 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.1.0 > > > {{acosh}}, {{asinh}} and {{atanh}} were exposed in Scala {{sql.functions}} in > Spark 3.1 (SPARK-33061). > For consistency, we should expose these in Python and R as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
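With this resolved for 3.1.0, usage from PySpark would look like the following sketch (function names mirror the Scala sql.functions added in SPARK-33061):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import acosh, asinh, atanh, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.5,), (2.0,)], ["x"])

# Inverse hyperbolic functions; note atanh is only defined on (-1, 1),
# hence the scaling of the input column.
df.select(acosh(col("x")), asinh(col("x")), atanh(col("x") / 10)).show()
{code}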
[jira] [Created] (SPARK-33575) Fix incorrect exception message for ANALYZE COLUMN
Terry Kim created SPARK-33575: - Summary: Fix incorrect exception message for ANALYZE COLUMN Key: SPARK-33575 URL: https://issues.apache.org/jira/browse/SPARK-33575 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim Currently, "ANALYZE TABLE tempView COMPUTE STATISTICS FOR COLUMNS " throws "NoSuchTableException" even if "tempView" exists. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liucht-inspur updated SPARK-33517: -- Description: The menu item and link are set incorrectly; change "Apache Arrow in Spark" to "Apache Arrow in PySpark" !image-2020-11-23-18-47-01-591.png! after: !image-2020-11-27-09-43-58-141.png! was: Error setting menu item and link, change "Apache Arrow in Spark" to "Apache Arrow in PySpark" !image-2020-11-23-18-47-01-591.png! > Incorrect menu item display and link in PySpark Usage Guide for Pandas with > Apache Arrow > > > Key: SPARK-33517 > URL: https://issues.apache.org/jira/browse/SPARK-33517 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: liucht-inspur >Priority: Minor > Attachments: image-2020-11-23-18-47-01-591.png, > image-2020-11-27-09-43-58-141.png, spark-doc.jpg > > > The menu item and link are set incorrectly; change "Apache Arrow in Spark" to "Apache > Arrow in PySpark" > !image-2020-11-23-18-47-01-591.png! > > after: > !image-2020-11-27-09-43-58-141.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liucht-inspur updated SPARK-33517: -- Attachment: image-2020-11-27-09-43-58-141.png > Incorrect menu item display and link in PySpark Usage Guide for Pandas with > Apache Arrow > > > Key: SPARK-33517 > URL: https://issues.apache.org/jira/browse/SPARK-33517 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.0.1 >Reporter: liucht-inspur >Priority: Minor > Attachments: image-2020-11-23-18-47-01-591.png, > image-2020-11-27-09-43-58-141.png, spark-doc.jpg > > > Error setting menu item and link, change "Apache Arrow in Spark" to "Apache > Arrow in PySpark" > !image-2020-11-23-18-47-01-591.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type
[ https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239461#comment-17239461 ] Yuya Kanai commented on SPARK-33489: [~bryanc] Yes I'll try working on it. Thank you for mentioning. > Support null for conversion from and to Arrow type > -- > > Key: SPARK-33489 > URL: https://issues.apache.org/jira/browse/SPARK-33489 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: Yuya Kanai >Priority: Minor > > I got below error when using from_arrow_type() in pyspark.sql.pandas.types > {{Unsupported type in conversion from Arrow: null}} > I noticed NullType exists under pyspark.sql.types so it seems possible to > convert from pyarrow null to pyspark null type and vice versa. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
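A sketch of the round trip this ticket asks for, using the internal helpers in pyspark.sql.pandas.types; the asserts express the desired behavior after the change, since today from_arrow_type raises "Unsupported type in conversion from Arrow: null":

{code:python}
import pyarrow as pa
from pyspark.sql.types import NullType
from pyspark.sql.pandas.types import from_arrow_type, to_arrow_type

# Desired after the change: pyarrow's null type maps to Spark's NullType...
assert isinstance(from_arrow_type(pa.null()), NullType)
# ...and the mapping also works in the other direction.
assert to_arrow_type(NullType()) == pa.null()
{code}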
[jira] [Commented] (SPARK-16854) mapWithState Support for Python
[ https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239407#comment-17239407 ] Haim Bendanan commented on SPARK-16854: --- +1 > mapWithState Support for Python > --- > > Key: SPARK-16854 > URL: https://issues.apache.org/jira/browse/SPARK-16854 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 >Reporter: Boaz >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33568) install coverage for pypy3
[ https://issues.apache.org/jira/browse/SPARK-33568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp resolved SPARK-33568. - Resolution: Fixed > install coverage for pypy3 > -- > > Key: SPARK-33568 > URL: https://issues.apache.org/jira/browse/SPARK-33568 > Project: Spark > Issue Type: Bug > Components: Build, PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > from: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-sbt-hadoop-2.7-hive-1.2/1002/console > > Coverage is not installed in Python executable 'pypy3' but > 'COVERAGE_PROCESS_START' environment variable is set, exiting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33568) install coverage for pypy3
[ https://issues.apache.org/jira/browse/SPARK-33568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp reassigned SPARK-33568: --- Assignee: Shane Knapp > install coverage for pypy3 > -- > > Key: SPARK-33568 > URL: https://issues.apache.org/jira/browse/SPARK-33568 > Project: Spark > Issue Type: Bug > Components: Build, PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > from: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-sbt-hadoop-2.7-hive-1.2/1002/console > > Coverage is not installed in Python executable 'pypy3' but > 'COVERAGE_PROCESS_START' environment variable is set, exiting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33568) install coverage for pypy3
[ https://issues.apache.org/jira/browse/SPARK-33568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239399#comment-17239399 ] Shane Knapp commented on SPARK-33568: - this is now installed on the ubuntu 16 workers > install coverage for pypy3 > -- > > Key: SPARK-33568 > URL: https://issues.apache.org/jira/browse/SPARK-33568 > Project: Spark > Issue Type: Bug > Components: Build, PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > from: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-sbt-hadoop-2.7-hive-1.2/1002/console > > Coverage is not installed in Python executable 'pypy3' but > 'COVERAGE_PROCESS_START' environment variable is set, exiting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33574) Improve locality for push-based shuffle especially for join like operations
Min Shen created SPARK-33574: Summary: Improve locality for push-based shuffle especially for join like operations Key: SPARK-33574 URL: https://issues.apache.org/jira/browse/SPARK-33574 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Affects Versions: 3.1.0 Reporter: Min Shen Currently, we only set locality for ShuffledRDD and ShuffledRowRDD with push-based shuffle. In simple stage DAGs where a ShuffledRDD or ShuffledRowRDD is the only input RDD, Spark can handle locality fine. However, if we have a join operation where a stage can consume multiple shuffle inputs or other non-shuffle inputs, the locality will take a hit with how DAGScheduler currently determines the preferred location. With push-based shuffle, we could potentially reuse the same set of merger locations across sibling ShuffleMapStages. This would enable a much better locality on the reducer stage side, where corresponding merged shuffle partitions for the multiple shuffle inputs are already colocated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33573) Server and client side metrics related to push-based shuffle
Min Shen created SPARK-33573: Summary: Server and client side metrics related to push-based shuffle Key: SPARK-33573 URL: https://issues.apache.org/jira/browse/SPARK-33573 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Affects Versions: 3.1.0 Reporter: Min Shen Need to add metrics on both server and client side related to push-based shuffle. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33235) Push-based Shuffle Improvement Tasks
[ https://issues.apache.org/jira/browse/SPARK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Shen updated SPARK-33235: - Description: This is the parent jira for follow-up improvement tasks for supporting Push-based shuffle. Refer SPARK-30602. (was: This is the parent jira for the phase 2 or follow-up tasks for supporting Push-based shuffle. Refer SPARK-30602. ) > Push-based Shuffle Improvement Tasks > > > Key: SPARK-33235 > URL: https://issues.apache.org/jira/browse/SPARK-33235 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Major > Labels: release-notes > > This is the parent jira for follow-up improvement tasks for supporting > Push-based shuffle. Refer SPARK-30602. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33235) Push-based Shuffle Improvement Tasks
[ https://issues.apache.org/jira/browse/SPARK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Shen updated SPARK-33235: - Summary: Push-based Shuffle Improvement Tasks (was: Push-based Shuffle Phase 2 Tasks) > Push-based Shuffle Improvement Tasks > > > Key: SPARK-33235 > URL: https://issues.apache.org/jira/browse/SPARK-33235 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Major > Labels: release-notes > > This is the parent jira for the phase 2 or follow-up tasks for supporting > Push-based shuffle. Refer SPARK-30602. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33566) Incorrectly Parsing CSV file
[ https://issues.apache.org/jira/browse/SPARK-33566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33566: Assignee: Apache Spark > Incorrectly Parsing CSV file > > > Key: SPARK-33566 > URL: https://issues.apache.org/jira/browse/SPARK-33566 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Stephen More >Assignee: Apache Spark >Priority: Minor > > Here is a test case: > [https://github.com/mores/maven-examples/blob/master/comma/src/test/java/org/test/CommaTest.java] > It shows how I believe apache commons csv and opencsv correctly parses the > sample csv file. > spark is not correctly parsing the sample csv file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33566) Incorrectly Parsing CSV file
[ https://issues.apache.org/jira/browse/SPARK-33566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239304#comment-17239304 ] Apache Spark commented on SPARK-33566: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/30518 > Incorrectly Parsing CSV file > > > Key: SPARK-33566 > URL: https://issues.apache.org/jira/browse/SPARK-33566 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Stephen More >Priority: Minor > > Here is a test case: > [https://github.com/mores/maven-examples/blob/master/comma/src/test/java/org/test/CommaTest.java] > It shows how I believe apache commons csv and opencsv correctly parses the > sample csv file. > spark is not correctly parsing the sample csv file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33566) Incorrectly Parsing CSV file
[ https://issues.apache.org/jira/browse/SPARK-33566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33566: Assignee: (was: Apache Spark) > Incorrectly Parsing CSV file > > > Key: SPARK-33566 > URL: https://issues.apache.org/jira/browse/SPARK-33566 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Stephen More >Priority: Minor > > Here is a test case: > [https://github.com/mores/maven-examples/blob/master/comma/src/test/java/org/test/CommaTest.java] > It shows how I believe apache commons csv and opencsv correctly parses the > sample csv file. > spark is not correctly parsing the sample csv file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33566) Incorrectly Parsing CSV file
[ https://issues.apache.org/jira/browse/SPARK-33566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239256#comment-17239256 ] Yang Jie edited comment on SPARK-33566 at 11/26/20, 1:12 PM: - I think the reason for the bad case is that Spark uses "STOP_AT_DELIMITER" as the default "UnescapedQuoteHandling" when building the "CsvParser". Configuring "UnescapedQuoteHandling" to "STOP_AT_CLOSING_QUOTE" seems to resolve this issue, but Spark does not support configuring this option now. [~hyukjin.kwon] [~moresmores] was (Author: luciferyang): I think the reason for the bad case is Spark use "STOP_AT_DELIMITER" as default "UnescapedQuoteHandling" to build "CsvParser". Configure "UnescapedQuoteHandling" to "STOP_AT_CLOSING_QUOTE" seems can resolve this issue. [~hyukjin.kwon] [~moresmores] > Incorrectly Parsing CSV file > > > Key: SPARK-33566 > URL: https://issues.apache.org/jira/browse/SPARK-33566 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Stephen More >Priority: Minor > > Here is a test case: > [https://github.com/mores/maven-examples/blob/master/comma/src/test/java/org/test/CommaTest.java] > It shows how I believe apache commons csv and opencsv correctly parses the > sample csv file. > spark is not correctly parsing the sample csv file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33566) Incorrectly Parsing CSV file
[ https://issues.apache.org/jira/browse/SPARK-33566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239256#comment-17239256 ] Yang Jie commented on SPARK-33566: -- I think the reason for the bad case is Spark use "STOP_AT_DELIMITER" as default "UnescapedQuoteHandling" to build "CsvParser". Configure "UnescapedQuoteHandling" to "STOP_AT_CLOSING_QUOTE" seems can resolve this issue. [~hyukjin.kwon] [~moresmores] > Incorrectly Parsing CSV file > > > Key: SPARK-33566 > URL: https://issues.apache.org/jira/browse/SPARK-33566 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Stephen More >Priority: Minor > > Here is a test case: > [https://github.com/mores/maven-examples/blob/master/comma/src/test/java/org/test/CommaTest.java] > It shows how I believe apache commons csv and opencsv correctly parses the > sample csv file. > spark is not correctly parsing the sample csv file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
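If the parser setting gets exposed as a CSV reader option as the comment suggests (the option name below is an assumption taken from the underlying Univocity setting; whether and how it lands depends on the PR), usage might look like:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical once SPARK-33566 is resolved: forward the Univocity
# unescaped-quote behavior instead of the hard-coded STOP_AT_DELIMITER.
df = (spark.read
      .option("header", True)
      .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
      .csv("path/to/sample.csv"))
df.show()
{code}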
[jira] [Updated] (SPARK-33564) Prometheus metrics for Master and Worker isn't working
[ https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paulo Roberto de Oliveira Castro updated SPARK-33564: - Description: Following the [PR|https://github.com/apache/spark/pull/25769] that introduced the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}} (also tested with 3.0.0), uncompressed the tgz and created a file called {{metrics.properties}} adding this content: {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}} {{*.sink.prometheusServlet.path=/metrics/prometheus}} master.sink.prometheusServlet.path=/metrics/master/prometheus applications.sink.prometheusServlet.path=/metrics/applications/prometheus {quote} Then I ran: {quote}{{$ sbin/start-master.sh}} {{$ sbin/start-slave.sh spark://`hostname`:7077}} {{$ bin/spark-shell --master spark://`hostname`:7077 --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}} {quote} {{The Spark shell opens without problems:}} {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable}} {{Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties}} {{Setting default log level to "WARN".}} {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).}} {{Spark context Web UI available at [http://192.168.0.6:4040|http://192.168.0.6:4040/]}} {{Spark context available as 'sc' (master = spark://MacBook-Pro-de-Paulo-2.local:7077, app id = app-20201125173618-0002).}} {{Spark session available as 'spark'.}} {{Welcome to}} {{ __}} {{ / __/_ _/ /__}} {{ _\ \/ _ \/ _ `/ __/ '_/}} {{ /___/ .__/_,_/_/ /_/_\ version 3.0.0}} {{ /_/}} {{ }} {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}} {{Type in expressions to have them evaluated.}} {{Type :help for more information. }} {{scala>}} {quote} {{And when I try to fetch prometheus metrics for driver, everything works fine:}} {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5 metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"} 0 metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"} 0 metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"} 732 metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"} 732 metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"} 0 {quote} *The problem appears when I try accessing master metrics*, and I get the following problem: {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}} [the response is the tag-stripped HTML of the Spark Master web UI page, "Spark Master at spark://MacBook-Pro-de-Paulo-2.local:7077", abridged here] {quote} Instead of the metrics I'm getting an HTML page. The same happens for all of those here: {quote}{{$ curl -s [http://localhost:8080/metrics/applications/prometheus/]}} {{$ curl -s [http://localhost:8081/metrics/prometheus/]}} {quote} Instead, *I expected metrics in Prometheus format*. All related JSON endpoints seem to be working fine. was: Following the [PR|https://github.com/apache/spark/pull/25769] that introduced the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}} (also tested with 3.0.0), uncompressed the tgz and created a file called {{metrics.properties}} adding this content: {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}} {{*.sink.prometheusServlet.path=/metrics/prometheus}} master.sink.prometheusServlet.path=/metrics/master/prometheus applications.sink.prometheusServlet.path=/metrics/applications/prometheus {quote} Then I ran: {quote}{{$ sbin/start-master.sh}} {{$ sbin/start-slave.sh spark://`hostname`:7077}} {{$ bin/spark-shell --master spark://`hostname`:7077 --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}} {quote} {{The Spark shell opens without problems:}} {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable}} {{Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties}} {{Setting default log
[jira] [Commented] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid
[ https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239233#comment-17239233 ] Apache Spark commented on SPARK-33498: -- User 'waitinfuture' has created a pull request for this issue: https://github.com/apache/spark/pull/30516 > Datetime parsing should fail if the input string can't be parsed, or the > pattern string is invalid > -- > > Key: SPARK-33498 > URL: https://issues.apache.org/jira/browse/SPARK-33498 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > Datetime parsing should fail if the input string can't be parsed, or the > pattern string is invalid, when ANSI mode is enable. This patch should update > GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
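To make the intended change concrete, a small sketch under the assumption that the new behavior is gated on ANSI mode: with spark.sql.ansi.enabled set, an input that does not match the pattern should raise an error instead of silently returning NULL.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# The 'T' in the input does not match the pattern, so parsing cannot
# succeed; today this yields NULL, after the change it should fail.
spark.sql(
    "SELECT to_timestamp('2020-01-27T20:06:11.847', 'yyyy-MM-dd HH:mm:ss.SSS')"
).show()
{code}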
[jira] [Created] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
zhoukeyong created SPARK-33572: -- Summary: Datetime building should fail if the year, month, ..., second combination is invalid Key: SPARK-33572 URL: https://issues.apache.org/jira/browse/SPARK-33572 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: zhoukeyong Datetime building should fail if the year, month, ..., second combination is invalid, when ANSI mode is enabled. This patch should update MakeDate, MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
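For illustration, the kind of call this covers, assuming the failure is likewise gated on ANSI mode: month 13 is not a valid month, and make_date currently yields NULL for it rather than failing.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Invalid year/month/day combination: expected to raise under ANSI mode
# once this change lands, instead of silently returning NULL.
spark.sql("SELECT make_date(2020, 13, 1)").show()
{code}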
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239204#comment-17239204 ] Simon commented on SPARK-33571: --- Below the output of the date testscript with the noise removed Writing without additional config works as expected. Spark 3.0.1. throws a `SparkUpgradeException` Reading without additional config works as expected. Spark 3.0.1. throws a `SparkUpgradeException` when reading parquet files written with Spark 2.4.5 in Spark 3.0.1. Reading using the two different `datetimeRebaseModeInRead` modes doesn't work though, it shows no difference {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'LEGACY'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/datespark245/*.parquet root |-- row: string (nullable = true) |-- date: date (nullable = true) +---+--+ |row| date| +---+--+ | 1|0220-10-01| | 2|1880-10-01| | 3|2020-10-01| +---+--+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/datespark245/*.parquet root |-- row: string (nullable = true) |-- date: date (nullable = true) +---+--+ |row| date| +---+--+ | 1|0220-10-01| | 2|1880-10-01| | 3|2020-10-01| +---+--+ done {code} Note no difference in the dates shown > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parquet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compares to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` resul
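For reference, a hedged reconstruction of the reader side of such a test (the script itself is not attached; the app name, path, and config key are taken from the logged output in the comments, and the config value is swapped between LEGACY and CORRECTED across runs):

{code:python}
from pyspark.sql import SparkSession

# Reconstruction of the reporter's read step; the rebase mode must be set
# before the session (and thus the parquet scan) is created.
spark = (SparkSession.builder
         .appName("read-data")
         .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
         .getOrCreate())

df = spark.read.parquet("output/datespark245/*.parquet")
df.printSchema()
df.show()
{code}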
[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239204#comment-17239204 ] Simon edited comment on SPARK-33571 at 11/26/20, 11:16 AM: --- Below the output of the date testscript with the noise removed Writing without additional config works as expected. Spark 3.0.1. throws a `SparkUpgradeException` when writing to parquet and the dataframe contains old dates. Reading without additional config works as expected. Spark 3.0.1. throws a `SparkUpgradeException` when reading parquet files written with Spark 2.4.5 in Spark 3.0.1. Reading using the two different `datetimeRebaseModeInRead` modes doesn't work though, it shows no difference {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'LEGACY'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/datespark245/*.parquet root |-- row: string (nullable = true) |-- date: date (nullable = true) +---+--+ |row| date| +---+--+ | 1|0220-10-01| | 2|1880-10-01| | 3|2020-10-01| +---+--+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/datespark245/*.parquet root |-- row: string (nullable = true) |-- date: date (nullable = true) +---+--+ |row| date| +---+--+ | 1|0220-10-01| | 2|1880-10-01| | 3|2020-10-01| +---+--+ done {code} Note no difference in the dates shown was (Author: simonvanderveldt): Below the output of the date testscript with the noise removed Writing without additional config works as expected. Spark 3.0.1. throws a `SparkUpgradeException` Reading without additional config works as expected. Spark 3.0.1. throws a `SparkUpgradeException` when reading parquet files written with Spark 2.4.5 in Spark 3.0.1. Reading using the two different `datetimeRebaseModeInRead` modes doesn't work though, it shows no difference {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'LEGACY'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/datespark245/*.parquet root |-- row: string (nullable = true) |-- date: date (nullable = true) +---+--+ |row| date| +---+--+ | 1|0220-10-01| | 2|1880-10-01| | 3|2020-10-01| +---+--+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/datespark245/*.parquet root |-- row: string (nullable = true) |-- date: date (nullable = true) +---+--+ |row| date| +---+--+ | 1|0220-10-01| | 2|1880-10-01| | 3|2020-10-01| +---+--+ done {code} Note no difference in the dates shown > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()`
[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239193#comment-17239193 ] Simon edited comment on SPARK-33571 at 11/26/20, 11:11 AM: --- Below the output of the timestamp test script with the noise removed *Writing:* {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark246/ done ... Spark version: 3.0.1 Spark conf [('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark301/ done {code} Note no exception was raised when writing old timestamps to parquet in spark 3.0.1 *Reading* {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark246/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark301/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ done {code} Note no exception was raised when reading parquet files written with Spark 2.4.5 containing old timestamps *Reading using the two different datetimeRebaseModeInRead modes* {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'LEGACY'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timesta
[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239193#comment-17239193 ] Simon edited comment on SPARK-33571 at 11/26/20, 11:07 AM: --- Below the output of the timestamp test script with the noise removed *Writing:* {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark246/ done ... Spark version: 3.0.1 Spark conf [('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark301/ done {code} Note no exception was raised when writing old timestamps to parquet in spark 3.0.1 *Reading* {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark246/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... 
Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark301/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ done {code} Note no exception was raised when reading parquet files written with Spark 2.4.5 containing old timestamps was (Author: simonvanderveldt): Below the output of the timestamp test script with the noise removed *Writing:* {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+--
[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239193#comment-17239193 ] Simon edited comment on SPARK-33571 at 11/26/20, 11:06 AM: --- Below the output of the timestamp test script with the noise removed *Writing:* {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark246/ done ... Spark version: 3.0.1 Spark conf [('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark301/ done {code} Note no exception was raised when writing old timestamps to parquet in spark 3.0.1 *Reading* {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark246/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... 
Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark301/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ done {code} Note no exception was raised when reading the parquet files written with Spark 2.4.5 was (Author: simonvanderveldt): Below the output of the timestamp test script with the noise removed *Writing:* {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing pa
[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239193#comment-17239193 ] Simon edited comment on SPARK-33571 at 11/26/20, 11:06 AM: --- Below the output of the timestamp test script with the noise removed *Writing:* {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark246/ done ... Spark version: 3.0.1 Spark conf [('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark301/ done {code} Note no exception was raised when writing old timestamps to parquet in spark 3.0.1 *Reading* {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark246/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... 
Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark301/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ done {code} Note no exception was raised when reading parquet the parquet files written with Spark 2.4.5 which contains old timestamps was (Author: simonvanderveldt): Below the output of the timestamp test script with the noise removed #Writing: {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10|
[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239193#comment-17239193 ] Simon edited comment on SPARK-33571 at 11/26/20, 11:05 AM: --- Below the output of the timestamp test script with the noise removed #Writing: {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark246/ done ... Spark version: 3.0.1 Spark conf [('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark301/ done {code} Note no exception was raised when writing old timestamps to parquet in spark 3.0.1 # Reading {code:java} Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark246/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... 
Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark301/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ done {code} Note no exception was raised when reading parquet the parquet files written with Spark 2.4.5 which contains old timestamps was (Author: simonvanderveldt): Below the output of the timestamp test script with the noise removed #Writing: {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10|
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239193#comment-17239193 ] Simon commented on SPARK-33571: --- Below the output of the timestamp test script with the noise removed #Writing: ``` Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark246/ done ... Spark version: 3.0.1 Spark conf [('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark301/ done ``` Note not exception was raised when writing old timestamps to parquet in spark 3.0.1 # Reading ``` Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark246/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... 
Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark301/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ done ``` > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above m
[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239193#comment-17239193 ] Simon edited comment on SPARK-33571 at 11/26/20, 11:03 AM: --- Below the output of the timestamp test script with the noise removed #Writing: {code:java} Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark246/ done ... Spark version: 3.0.1 Spark conf [('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark301/ done {code} Note not exception was raised when writing old timestamps to parquet in spark 3.0.1 # Reading ``` Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark245/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark246/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ done ... 
Spark version: 3.0.1 Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')] Reading parquet files from output/timestampspark301/*.parquet root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:10:10| | 2|1880-10-01 10:10:10| | 3|2020-10-01 10:10:10| +---+---+ done ``` was (Author: simonvanderveldt): Below the output of the timestamp test script with the noise removed #Writing: ``` Spark version: 2.4.5 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark245/ done ... Spark version: 2.4.6 Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')] root |-- row: string (nullable = true) |-- timestamp: timestamp (nullable = true) +---+---+ |row| timestamp| +---+---+ | 1|0220-10-01 10:50:38| | 2|1880-10-01 10:50:38| | 3|2020-10-01 10:10:10| +---+---+ Writing parquet files to output/timestampspark246/ done ... Spark version: 3.0.1 Spark conf [('spark.master', 'local
[jira] [Created] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
Simon created SPARK-33571: - Summary: Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly Key: SPARK-33571 URL: https://issues.apache.org/jira/browse/SPARK-33571 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1, 3.0.0 Reporter: Simon The handling of old dates written with older Spark versions (<2.4.6) using the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working correctly. From what I understand it should work like this: * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z * Only applies when reading or writing parquet files * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should show the same values in Spark 3.0.1 with for example `df.show()` as they did in Spark 2.4.5 * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps should show different values in Spark 3.0.1 with for example `df.show()` than they did in Spark 2.4.5 * When writing parquet files with Spark > 3.0.0 which contain dates or timestamps before the above mentioned moments in time a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` First of all I'm not 100% sure all of this is correct. I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there, and looking at the Spark code. From our testing we're seeing several issues: * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` which contain timestamps before the above mentioned moments in time without `datetimeRebaseModeInRead` set doesn't raise the `SparkUpgradeException`, it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5 * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` or `DateType` which contain dates or timestamps before the above mentioned moments in time with `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening. I've made some scripts to help with testing/show the behavior, they use pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the outputs in a comment below as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
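The write-time expectation in the last bullet is just as easy to exercise. Below is a minimal PySpark sketch of it under stated assumptions: the data mirrors the test outputs posted in the comments above, and the script is an illustration, not the original from the linked repository.

{code:python}
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generate-timestamp-data").getOrCreate()

# Rows mirror the test outputs above: one pre-1582 timestamp, one pre-1900
# timestamp, and one modern timestamp.
df = spark.createDataFrame(
    [("1", datetime.datetime(220, 10, 1, 10, 10, 10)),
     ("2", datetime.datetime(1880, 10, 1, 10, 10, 10)),
     ("3", datetime.datetime(2020, 10, 1, 10, 10, 10))],
    ["row", "timestamp"],
)

# Per the description this write should raise SparkUpgradeException and ask
# for LEGACY or CORRECTED; the report observes it succeeding silently.
df.write.mode("overwrite").parquet("output/timestampspark301/")

# With a rebase mode chosen explicitly the write should always succeed.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
df.write.mode("overwrite").parquet("output/timestampspark301/")
{code}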
[jira] [Updated] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-33570: --- Description: For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer available in the official apt repository and MariaDBKrbIntegrationSuite doesn't pass for now. It seems that only the most recent three versions are available and they are 10.5.6, 10.5.7 and 10.5.8 for now. Further, the release cycle of MariaDB seems to be too fast (1 ~ 2 months) so I don't think it's a good idea to pin mariadb-plugin-gssapi-server to a specific version. was: For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer available in the official apt repository and MariaDBKrbIntegrationSuite doesn't pass for now. It seems that only the most recent three versions are available and they are 10.5.6, 10.5.7 and 10.5.8 for now. Further, the release cycle for MariaDB seems to be too fast (1 ~ 2 months) so I don't think it's a good idea to pin mariadb-plugin-gssapi-server to a specific version. > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationSuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle of MariaDB seems to be too fast (1 ~ 2 months) so > I don't think it's a good idea to pin > mariadb-plugin-gssapi-server to a specific version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33570: Assignee: Apache Spark (was: Kousuke Saruta) > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationSuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle for MariaDB seems to be too fast (1 ~ 2 months) so > I don't think it's a good idea to pin > mariadb-plugin-gssapi-server to a specific version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33570: Assignee: Kousuke Saruta (was: Apache Spark) > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationSuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle for MariaDB seems to be too fast (1 ~ 2 months) so > I don't think it's a good idea to pin > mariadb-plugin-gssapi-server to a specific version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239145#comment-17239145 ] Apache Spark commented on SPARK-33570: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/30515 > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationSuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle for MariaDB seems to be too fast (1 ~ 2 months) so > I don't think it's a good idea to pin > mariadb-plugin-gssapi-server to a specific version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-33570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-33570: --- Description: For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer available in the official apt repository and MariaDBKrbIntegrationSuite doesn't pass for now. It seems that only the most recent three versions are available and they are 10.5.6, 10.5.7 and 10.5.8 for now. Further, the release cycle for MariaDB seems to be too fast (1 ~ 2 months) so I don't think it's a good idea to pin mariadb-plugin-gssapi-server to a specific version. was: For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer available in the official apt repository. It seems that only the most recent three versions are available and they are 10.5.6, 10.5.7 and 10.5.8 for now. Further, the release cycle for MariaDB seems to be too fast (1 ~ 2 months) so I don't think it's a good idea to pin mariadb-plugin-gssapi-server to a specific version. > Set the proper version of gssapi plugin automatically for > MariaDBKrbIntegrationSuite > > > Key: SPARK-33570 > URL: https://issues.apache.org/jira/browse/SPARK-33570 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server > is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer > available in the official apt repository and MariaDBKrbIntegrationSuite > doesn't pass for now. > It seems that only the most recent three versions are available and they are > 10.5.6, 10.5.7 and 10.5.8 for now. > Further, the release cycle for MariaDB seems to be too fast (1 ~ 2 months) so > I don't think it's a good idea to pin > mariadb-plugin-gssapi-server to a specific version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33570) Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
Kousuke Saruta created SPARK-33570: -- Summary: Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite Key: SPARK-33570 URL: https://issues.apache.org/jira/browse/SPARK-33570 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta For MariaDBKrbIntegrationSuite, the version of mariadb-plugin-gssapi-server is currently set to 10.5.5 in mariadb_docker_entrypoint.sh but it's no longer available in the official apt repository. It seems that only the most recent three versions are available and they are 10.5.6, 10.5.7 and 10.5.8 for now. Further, the release cycle for MariaDB seems to be too fast (1 ~ 2 months) so I don't think it's a good idea to pin mariadb-plugin-gssapi-server to a specific version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
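The actual change belongs in mariadb_docker_entrypoint.sh, which is a shell script. Purely to illustrate the idea of resolving the newest available package version at startup instead of hard-coding 10.5.5, here is a hedged Python sketch; it assumes the usual "package | version | source" column layout of apt-cache madison output and that the first line lists the highest candidate version.

{code:python}
# Illustration only; the real fix lives in mariadb_docker_entrypoint.sh.
# Idea: ask apt which versions of the plugin are available and install the
# newest one, instead of pinning a version that can vanish from the repo.
import subprocess

PKG = "mariadb-plugin-gssapi-server"

# `apt-cache madison <pkg>` prints one "pkg | version | source" line per
# candidate; treating the first line as the newest version is an assumption.
madison = subprocess.run(
    ["apt-cache", "madison", PKG],
    check=True, capture_output=True, text=True,
).stdout
latest = madison.splitlines()[0].split("|")[1].strip()

subprocess.run(["apt-get", "install", "-y", f"{PKG}={latest}"], check=True)
{code}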
[jira] [Assigned] (SPARK-33569) Remove getting partitions by only ident
[ https://issues.apache.org/jira/browse/SPARK-33569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33569: Assignee: (was: Apache Spark) > Remove getting partitions by only ident > --- > > Key: SPARK-33569 > URL: https://issues.apache.org/jira/browse/SPARK-33569 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > This is a follow up of SPARK-33509 which added a function for getting > partitions by names and ident. The function which gets partitions by ident is > not used anymore, and it can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33569) Remove getting partitions by only ident
[ https://issues.apache.org/jira/browse/SPARK-33569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33569: Assignee: Apache Spark > Remove getting partitions by only ident > --- > > Key: SPARK-33569 > URL: https://issues.apache.org/jira/browse/SPARK-33569 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > This is a follow up of SPARK-33509 which added a function for getting > partitions by names and ident. The function which gets partitions by ident is > not used anymore, and it can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33569) Remove getting partitions by only ident
[ https://issues.apache.org/jira/browse/SPARK-33569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239128#comment-17239128 ] Apache Spark commented on SPARK-33569: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30514 > Remove getting partitions by only ident > --- > > Key: SPARK-33569 > URL: https://issues.apache.org/jira/browse/SPARK-33569 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > This is a follow up of SPARK-33509 which added a function for getting > partitions by names and ident. The function which gets partitions by ident is > not used anymore, and it can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33569) Remove getting partitions by only ident
Maxim Gekk created SPARK-33569: -- Summary: Remove getting partitions by only ident Key: SPARK-33569 URL: https://issues.apache.org/jira/browse/SPARK-33569 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk This is a follow up of SPARK-33509 which added a function for getting partitions by names and ident. The function which gets partitions by ident is not used anymore, and it can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org