[jira] [Created] (SPARK-33626) k8s integration tests should assert driver and executor logs for must & must not contain
Prashant Sharma created SPARK-33626: --- Summary: k8s integration tests should assert driver and executor logs for must & must not contain Key: SPARK-33626 URL: https://issues.apache.org/jira/browse/SPARK-33626 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.1.0 Reporter: Prashant Sharma Improve the k8s tests so that both driver and executor logs can be asserted for must-contain and must-not-contain conditions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
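A minimal sketch of such a log-assertion helper, in plain Python rather than the Scala test framework the k8s integration tests actually use (the function and names here are hypothetical, for illustration only):

```python
def check_log(log_text, must_contain=(), must_not_contain=()):
    """Return (ok, failures) for must/must-not substring assertions on a log."""
    failures = []
    for s in must_contain:
        if s not in log_text:
            failures.append(f"expected to find: {s!r}")
    for s in must_not_contain:
        if s in log_text:
            failures.append(f"expected NOT to find: {s!r}")
    return (not failures, failures)

# Example: a driver log should mention the Spark version and contain no errors.
log = "INFO SparkContext: Running Spark version 3.1.0\nINFO Executor: Finished task"
ok, why = check_log(log, must_contain=["Running Spark"], must_not_contain=["ERROR"])
```

The same helper covers both directions the ticket asks for: a passing run returns an empty failure list, while a forbidden substring produces a descriptive failure message.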
[jira] [Commented] (SPARK-33625) Subexpression elimination for whole-stage codegen in Filter
[ https://issues.apache.org/jira/browse/SPARK-33625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242087#comment-17242087 ] Apache Spark commented on SPARK-33625: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30565 > Subexpression elimination for whole-stage codegen in Filter > --- > > Key: SPARK-33625 > URL: https://issues.apache.org/jira/browse/SPARK-33625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We made subexpression elimination available for whole-stage codegen in > ProjectExec. Another operator that frequently encounters subexpressions > is Filter. We should enable subexpression elimination for whole-stage codegen > in FilterExec as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33625) Subexpression elimination for whole-stage codegen in Filter
[ https://issues.apache.org/jira/browse/SPARK-33625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33625: Assignee: L. C. Hsieh (was: Apache Spark) > Subexpression elimination for whole-stage codegen in Filter > --- > > Key: SPARK-33625 > URL: https://issues.apache.org/jira/browse/SPARK-33625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We made subexpression elimination available for whole-stage codegen in > ProjectExec. Another operator that frequently encounters subexpressions > is Filter. We should enable subexpression elimination for whole-stage codegen > in FilterExec as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33625) Subexpression elimination for whole-stage codegen in Filter
[ https://issues.apache.org/jira/browse/SPARK-33625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33625: Assignee: Apache Spark (was: L. C. Hsieh) > Subexpression elimination for whole-stage codegen in Filter > --- > > Key: SPARK-33625 > URL: https://issues.apache.org/jira/browse/SPARK-33625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > We made subexpression elimination available for whole-stage codegen in > ProjectExec. Another operator that frequently encounters subexpressions > is Filter. We should enable subexpression elimination for whole-stage codegen > in FilterExec as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33625) Subexpression elimination for whole-stage codegen in Filter
L. C. Hsieh created SPARK-33625: --- Summary: Subexpression elimination for whole-stage codegen in Filter Key: SPARK-33625 URL: https://issues.apache.org/jira/browse/SPARK-33625 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh We made subexpression elimination available for whole-stage codegen in ProjectExec. Another operator that frequently encounters subexpressions is Filter. We should enable subexpression elimination for whole-stage codegen in FilterExec as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
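As a language-agnostic illustration of what subexpression elimination buys (plain Python, not Spark's generated code), binding a repeated expensive expression once per row cuts the number of evaluations:

```python
# Count how often an "expensive" expression is evaluated with and without
# common-subexpression elimination in a filter predicate.
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1
    return x * x

rows = [1, 2, 3, 4]

# Naive filter: expensive(x) appears twice, so it can run twice per row.
naive = [x for x in rows if expensive(x) > 2 and expensive(x) < 10]
naive_calls = calls["n"]

calls["n"] = 0

def pred(x):
    e = expensive(x)  # common subexpression computed once per row
    return e > 2 and e < 10

eliminated = [x for x in rows if pred(x)]
```

Both versions keep the same rows; only the call count differs (short-circuiting makes the naive version cheaper than 2x, but it still does more work).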
[jira] [Commented] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file
[ https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242074#comment-17242074 ] Apache Spark commented on SPARK-32670: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30564 > Group exception messages in Catalyst Analyzer in one file > - > > Key: SPARK-32670 > URL: https://issues.apache.org/jira/browse/SPARK-32670 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Assignee: Xinyi Yu >Priority: Minor > Fix For: 3.1.0 > > > For standardization of error messages and their maintenance, we can try to > group the exception messages into a single file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32863) Full outer stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-32863: Assignee: Cheng Su > Full outer stream-stream join > - > > Key: SPARK-32863 > URL: https://issues.apache.org/jira/browse/SPARK-32863 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Major > > The current stream-stream join supports inner, left outer and right outer joins > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166] > ). With the current design of stream-stream join (which marks in the state store whether a row is > matched or not), it would be very easy to support full outer > join as well. > > Full outer stream-stream join will work as follows: > (1). For a left side input row, check if there's a match in the right side state > store. If there's a match, output all matched rows. Put the row in the left side > state store. > (2). For a right side input row, check if there's a match in the left side state > store. If there's a match, output all matched rows and update the matched left side rows' > state by setting their "matched" field to true. Put the right side row in the right > side state store. > (3). For a left side row that needs to be evicted from the state store, output the row if > its "matched" field is false. > (4). For a right side row that needs to be evicted from the state store, output the row > if its "matched" field is false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32863) Full outer stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-32863. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30395 [https://github.com/apache/spark/pull/30395] > Full outer stream-stream join > - > > Key: SPARK-32863 > URL: https://issues.apache.org/jira/browse/SPARK-32863 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Major > Fix For: 3.1.0 > > > The current stream-stream join supports inner, left outer and right outer joins > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166] > ). With the current design of stream-stream join (which marks in the state store whether a row is > matched or not), it would be very easy to support full outer > join as well. > > Full outer stream-stream join will work as follows: > (1). For a left side input row, check if there's a match in the right side state > store. If there's a match, output all matched rows. Put the row in the left side > state store. > (2). For a right side input row, check if there's a match in the left side state > store. If there's a match, output all matched rows and update the matched left side rows' > state by setting their "matched" field to true. Put the right side row in the right > side state store. > (3). For a left side row that needs to be evicted from the state store, output the row if > its "matched" field is false. > (4). For a right side row that needs to be evicted from the state store, output the row > if its "matched" field is false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
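The four steps in the description can be sketched in plain Python (an illustrative analogue only; real state stores, watermarks, and eviction policies are far more involved than this bookkeeping):

```python
# Each side keeps a state store of rows with a "matched" flag; rows that
# never matched are emitted with None padding when they are evicted.
def process(row, key, own_state, other_state, out, row_is_left):
    matched = False
    for entry in other_state:
        if entry["key"] == key:
            # steps (1)/(2): output the matched pair and flag the other side
            out.append((row, entry["row"]) if row_is_left else (entry["row"], row))
            entry["matched"] = True
            matched = True
    own_state.append({"key": key, "row": row, "matched": matched})

def evict_all(state, out, is_left):
    # steps (3)/(4): on eviction, emit rows whose "matched" flag is still false
    for entry in state:
        if not entry["matched"]:
            out.append((entry["row"], None) if is_left else (None, entry["row"]))
    state.clear()

left_state, right_state, out = [], [], []
process(("a", 1), "a", left_state, right_state, out, row_is_left=True)
process(("a", 2), "a", right_state, left_state, out, row_is_left=False)
process(("b", 3), "b", right_state, left_state, out, row_is_left=False)
evict_all(left_state, out, is_left=True)
evict_all(right_state, out, is_left=False)
```

The matched left/right pair is emitted as rows arrive, and the unmatched right-side row surfaces only at eviction time, padded with None, exactly the behavior a full outer join requires.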
[jira] [Resolved] (SPARK-33544) explode should not filter when used with CreateArray
[ https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33544. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30504 [https://github.com/apache/spark/pull/30504] > explode should not filter when used with CreateArray > > > Key: SPARK-33544 > URL: https://issues.apache.org/jira/browse/SPARK-33544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.1.0 > > > https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to > insert a filter for not null and size > 0 when using inner explode/inline. > This is fine in most cases, but the extra filter is not needed if the explode > is over a create array that does not use Literals (it already handles Literals). > In that case we know the values aren't null and the array has a known size. It > already handles the empty array. > for instance: > val df = someDF.selectExpr("number", "explode(array(word, col3))") > So in this case we shouldn't be inserting the extra Filter, and that filter > can get pushed down into e.g. a Parquet reader as well. This just causes > extra overhead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33544) explode should not filter when used with CreateArray
[ https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33544: Assignee: Thomas Graves > explode should not filter when used with CreateArray > > > Key: SPARK-33544 > URL: https://issues.apache.org/jira/browse/SPARK-33544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to > insert a filter for not null and size > 0 when using inner explode/inline. > This is fine in most cases, but the extra filter is not needed if the explode > is over a create array that does not use Literals (it already handles Literals). > In that case we know the values aren't null and the array has a known size. It > already handles the empty array. > for instance: > val df = someDF.selectExpr("number", "explode(array(word, col3))") > So in this case we shouldn't be inserting the extra Filter, and that filter > can get pushed down into e.g. a Parquet reader as well. This just causes > extra overhead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
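A plain-Python analogue of why the guard is redundant in this case (illustrative only, not Spark code): an array built from existing columns always has a fixed, non-zero size, even when individual elements are null, so a not-null / size > 0 filter on the array itself can never remove a row:

```python
# "explode" over an array constructed from two existing columns: every input
# row contributes exactly len(cols) output rows, even when an element is None.
rows = [{"number": 1, "word": "a", "col3": "b"},
        {"number": 2, "word": None, "col3": "c"}]

def explode_array(row, cols):
    return [(row["number"], row[c]) for c in cols]

exploded = [pair for r in rows for pair in explode_array(r, ["word", "col3"])]
```

The constructed array is never null and never empty, which is exactly the condition the extra Filter was guarding against.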
[jira] [Updated] (SPARK-33621) Add a way to inject data source rewrite rules
[ https://issues.apache.org/jira/browse/SPARK-33621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-33621: - Summary: Add a way to inject data source rewrite rules (was: Add a way to inject optimization rules after expression optimization) > Add a way to inject data source rewrite rules > - > > Key: SPARK-33621 > URL: https://issues.apache.org/jira/browse/SPARK-33621 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{SparkSessionExtensions}} allows us to inject optimization rules, but they are > added to the operator optimization batch. There are cases when users need to run > rules after the operator optimization batch. Currently, this is not possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241816#comment-17241816 ] Apache Spark commented on SPARK-33623: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/30562 > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
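A hypothetical Python analogue of the proposed API shape (the real interface is a Java {{SupportsDelete}} mix-in; the names, signatures, and partitioned source below are illustrative assumptions, not Spark code):

```python
from abc import ABC, abstractmethod

class SupportsDelete(ABC):
    @abstractmethod
    def can_delete_where(self, filters):
        """Return True if rows matching `filters` can be deleted cheaply
        (e.g. by dropping whole partitions) without rewriting files."""

    @abstractmethod
    def delete_where(self, filters):
        ...

class PartitionedSource(SupportsDelete):
    def __init__(self):
        self.partitions = {"2020": ["r1"], "2021": ["r2"]}

    def can_delete_where(self, filters):
        # Cheap only when every filter targets the partition column.
        return all(col == "year" for col, _ in filters)

    def delete_where(self, filters):
        for _, val in filters:
            self.partitions.pop(val, None)

src = PartitionedSource()
ok = src.can_delete_where([("year", "2020")])  # engine checks before deleting
src.delete_where([("year", "2020")])
```

This is the control the ticket asks for: when `can_delete_where` returns False, the engine knows up front that a metadata-only delete is impossible and can plan a full rewrite instead.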
[jira] [Assigned] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33623: Assignee: Apache Spark > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Apache Spark >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33623: Assignee: (was: Apache Spark) > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33622. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30561 [https://github.com/apache/spark/pull/30561] > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > SPARK-33556 added {{array_to_vector}} to the Scala and Python APIs. It will be > useful to have it in the R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33622: - Assignee: Maciej Szymkiewicz > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > SPARK-33556 added {{array_to_vector}} to the Scala and Python APIs. It will be > useful to have it in the R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.
[ https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241769#comment-17241769 ] Bryan Cutler commented on SPARK-33576: -- Is this due to the 2GB limit? As in https://issues.apache.org/jira/browse/SPARK-32294 and https://issues.apache.org/jira/browse/ARROW-4890 > PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC > message: negative bodyLength'. > - > > Key: SPARK-33576 > URL: https://issues.apache.org/jira/browse/SPARK-33576 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1 > Environment: Databricks runtime 7.3 > Spark 3.0.1 > Scala 2.12 >Reporter: Darshat >Priority: Major > > Hello, > We are using Databricks on Azure to process a large amount of e-commerce data. > The Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12. > During processing, a groupby operation on the DataFrame > consistently gets an exception of this type: > > {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: > Invalid IPC message: negative bodyLength'. Full traceback below: Traceback > (most recent call last): File "/databricks/spark/python/pyspark/worker.py", > line 654, in main process() File > "/databricks/spark/python/pyspark/worker.py", line 646, in process > serializer.dump_stream(out_iter, outfile) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in > dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in > dump_stream for batch in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in > init_stream_yield_batches for series in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in > load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in > __iter__ File "pyarrow/ipc.pxi", line 432, in > pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", > line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative > bodyLength{color} > > Code that causes this: > {color:#ff}x = df.groupby('providerid').apply(domain_features){color} > {color:#ff}display(x.info()){color} > Dataframe size - 22 million rows, 31 columns > One of the columns is a string ('providerid') on which we do a groupby > followed by an apply operation. There are 3 distinct provider ids in this > set. While trying to enumerate/count the results, we get this exception. > We've put all possible checks in the code for null values or corrupt data, > and we are not able to trace this to application-level code. I hope we can > get some help troubleshooting this, as it is a blocker for rolling out at > scale. > The cluster has 8 nodes + driver, all 28GB RAM. > I can provide any other > settings that could be useful. > Hope to get some insights into the problem. > Thanks, > Darshat Shah -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
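One plausible mechanism behind the "negative bodyLength" message, consistent with the signed 32-bit limits discussed in the linked SPARK-32294 and ARROW-4890 issues (this is an illustration of the overflow, not a reproduction of the Arrow code path):

```python
import struct

# If a length field is written as 32 bits but read back as a signed int32,
# a payload over 2 GiB wraps around to a negative number.
payload_len = 3 * 1024**3  # a hypothetical 3 GiB record batch body
(as_int32,) = struct.unpack("<i", struct.pack("<I", payload_len & 0xFFFFFFFF))
```

A length that overflows like this would surface downstream as an "invalid" negative bodyLength, which is why reducing per-batch size (e.g. via `spark.sql.execution.arrow.maxRecordsPerBatch`) or shrinking the per-group data is the usual mitigation.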
[jira] [Resolved] (SPARK-33611) Decode Query parameters of the redirect URL for reverse proxy
[ https://issues.apache.org/jira/browse/SPARK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-33611. Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30552 [https://github.com/apache/spark/pull/30552] > Decode Query parameters of the redirect URL for reverse proxy > - > > Key: SPARK-33611 > URL: https://issues.apache.org/jira/browse/SPARK-33611 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > When running Spark with reverse proxy enabled, the query parameters of the > request URL can be encoded twice: once by the browser and once by the > reverse proxy (e.g. Nginx). > In Spark's stage page, the URL of "/taskTable" contains the query parameter > order[0][dir]. After being encoded twice, the query parameter becomes > `order%255B0%255D%255Bdir%255D` and it will be decoded as > `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will > be a NullPointerException from > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176 > Other parameters may also not work as expected after being encoded twice. > We should decode the query parameters to fix the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
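The double-encoding behavior described above can be reproduced with only the Python standard library:

```python
from urllib.parse import quote, unquote

param = "order[0][dir]"
once = quote(param)            # encoded by the browser
twice = quote(once)            # encoded again by the reverse proxy
decoded_once = unquote(twice)  # the server decodes only one layer
```

After one round of decoding, the server still sees the percent-escapes from the first encoding pass instead of `order[0][dir]`; a second decode recovers the original parameter, which is what the fix adds on the redirect path.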
[jira] [Resolved] (SPARK-33612) Add dataSourceRewriteRules batch to Optimizer
[ https://issues.apache.org/jira/browse/SPARK-33612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33612. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30558 [https://github.com/apache/spark/pull/30558] > Add dataSourceRewriteRules batch to Optimizer > - > > Key: SPARK-33612 > URL: https://issues.apache.org/jira/browse/SPARK-33612 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.1.0 > > > Right now, we have a special place in the optimizer where we run rules for > constructing v2 scans. As time shows, we need more rewrite rules for v2 > tables, and not all of them are related to reads. One option is to rename the > current batch into something more generic, but that would require changing quite > a few places. Moreover, the batch contains some non-V2 logic as well. That's > why it seems better to introduce a new batch and use it for all data source > v2 rewrites and beyond. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33612) Add dataSourceRewriteRules batch to Optimizer
[ https://issues.apache.org/jira/browse/SPARK-33612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33612: - Assignee: Anton Okolnychyi > Add dataSourceRewriteRules batch to Optimizer > - > > Key: SPARK-33612 > URL: https://issues.apache.org/jira/browse/SPARK-33612 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > Right now, we have a special place in the optimizer where we run rules for > constructing v2 scans. As time shows, we need more rewrite rules for v2 > tables, and not all of them are related to reads. One option is to rename the > current batch into something more generic, but that would require changing quite > a few places. Moreover, the batch contains some non-V2 logic as well. That's > why it seems better to introduce a new batch and use it for all data source > v2 rewrites and beyond. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
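The idea of a separate, later rule batch can be sketched abstractly (plain Python; the string plan representation and rule names below are invented for illustration and are not Catalyst APIs):

```python
# A rule-based optimizer applies named batches of rewrite rules in order;
# putting data source rewrites in their own batch guarantees they run after
# the general operator-optimization rules.
def run_batches(plan, batches):
    for _name, rules in batches:
        for rule in rules:
            plan = rule(plan)
    return plan

def push_down_filter(plan):
    return plan.replace("Scan+Filter", "FilteredScan")

def rewrite_v2_delete(plan):
    return plan.replace("Delete(v2)", "RewriteFiles")

plan = "Delete(v2) -> Scan+Filter"
optimized = run_batches(plan, [
    ("operatorOptimization", [push_down_filter]),
    ("dataSourceRewriteRules", [rewrite_v2_delete]),  # the new, later batch
])
```

Ordering the batches this way is the point of the ticket: rewrite rules see a plan that operator optimization has already simplified.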
[jira] [Commented] (SPARK-33072) Remove two Hive 1.2-related Jenkins jobs
[ https://issues.apache.org/jira/browse/SPARK-33072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241714#comment-17241714 ] Dongjoon Hyun commented on SPARK-33072: --- Thank you! > Remove two Hive 1.2-related Jenkins jobs > > > Key: SPARK-33072 > URL: https://issues.apache.org/jira/browse/SPARK-33072 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > SPARK-20202 removed the `hive-1.2` profile in Apache Spark 3.1.0 and excluded the > following two jobs from the `Spark QA Dashboard`. We need to remove them. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.
[ https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241683#comment-17241683 ] Darshat commented on SPARK-33576: - I set the following options to see if they'd help, but still get the same error: spark.sql.execution.arrow.maxRecordsPerBatch 100 spark.sql.pyspark.jvmStacktrace.enabled true spark.driver.maxResultSize 18g spark.reducer.maxSizeInFlight 1024m > PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC > message: negative bodyLength'. > - > > Key: SPARK-33576 > URL: https://issues.apache.org/jira/browse/SPARK-33576 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1 > Environment: Databricks runtime 7.3 > Spark 3.0.1 > Scala 2.12 >Reporter: Darshat >Priority: Major > > Hello, > We are using Databricks on Azure to process a large amount of e-commerce data. > The Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12. > During processing, a groupby operation on the DataFrame > consistently gets an exception of this type: > > {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: > Invalid IPC message: negative bodyLength'. Full traceback below: Traceback > (most recent call last): File "/databricks/spark/python/pyspark/worker.py", > line 654, in main process() File > "/databricks/spark/python/pyspark/worker.py", line 646, in process > serializer.dump_stream(out_iter, outfile) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in > dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in > dump_stream for batch in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in > init_stream_yield_batches for series in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in > load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in > __iter__ File "pyarrow/ipc.pxi", line 432, in > pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", > line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative > bodyLength{color} > > Code that causes this: > {color:#ff}x = df.groupby('providerid').apply(domain_features){color} > {color:#ff}display(x.info()){color} > Dataframe size - 22 million rows, 31 columns > One of the columns is a string ('providerid') on which we do a groupby > followed by an apply operation. There are 3 distinct provider ids in this > set. While trying to enumerate/count the results, we get this exception. > We've put all possible checks in the code for null values or corrupt data, > and we are not able to trace this to application-level code. I hope we can > get some help troubleshooting this, as it is a blocker for rolling out at > scale. > The cluster has 8 nodes + driver, all 28GB RAM. > I can provide any other > settings that could be useful. > Hope to get some insights into the problem. > Thanks, > Darshat Shah -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filteringRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filterRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png|width=644,height=150! > > !mgg1s.png|width=651,height=182! 
> > This my code: > var filteredRDD = sparkContext.emptyRDD[String] > for (path<- pathBuffer) > { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) > filteredRDD = filteredRDD.++(filteringRDD(someRDD )) } > hiveService.insertRDD(filteredRDD.repartition(10), outTable) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filterRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filterRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png|width=644,height=150! > > !mgg1s.png|width=651,height=182! 
> > This my code: > var filteredRDD = sparkContext.emptyRDD[String] > for (path<- pathBuffer) > { val someRDD = sparkContext.textFile(path) > if (isValidRDD(someRDD)) > filteredRDD = filteredRDD.++(filterRDD(someRDD )) > } > hiveService.insertRDD(filteredRDD.repartition(10), outTable) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filterRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData) > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png|width=644,height=150! > > !mgg1s.png|width=651,height=182! 
> > This my code: > var filteredRDD = sparkContext.emptyRDD[String] > for (path<- pathBuffer) { > val someRDD = sparkContext.textFile(path) > if (isValidRDD(someRDD)) > filteredRDD = filteredRDD.++(filterRDD(someRDD )) > } > hiveService.insertRDD(filteredRDD.repartition(10), outTable) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
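A likely contributor to the stalled-task behavior reported above: building the combined RDD with pairwise `++` in a loop produces a lineage whose depth grows with the number of input paths, whereas a single n-ary `sparkContext.union(rddList)` keeps it flat. This plain-Python sketch (no Spark; `Node` is a stand-in for an RDD and its dependency graph) shows the depth difference:

```python
# Plain-Python sketch (no Spark): Node stands in for an RDD and its
# dependency lineage. Chaining pairwise unions in a loop -- the pattern in
# the reported code -- yields a lineage whose depth grows with the number
# of inputs; a single n-ary union (sparkContext.union(rddList) in Spark)
# keeps the lineage flat.
class Node:
    def __init__(self, children):
        self.children = children

    def depth(self):
        return 1 + max((c.depth() for c in self.children), default=0)

leaves = [Node([]) for _ in range(100)]

# Pairwise chaining, as in: filteredRDD = filteredRDD.++(someRDD)
chained = leaves[0]
for leaf in leaves[1:]:
    chained = Node([chained, leaf])

# Single n-ary union: depth stays constant regardless of input count.
flat = Node(leaves)

print(chained.depth(), flat.depth())  # 100 2
```

Deep lineages inflate driver-side planning and serialized task size, so collecting the per-path RDDs into a list and unioning once is the usual rewrite for this loop.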
[jira] [Updated] (SPARK-33520) make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33520: --- Summary: make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model (was: make CrossValidator/TrainValidateSplit/OneVsRest support Python backend estimator/model) > make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python > backend estimator/model > - > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Priority: Major > > Currently, PySpark supports third-party libraries that define Python backend > estimators, i.e., estimators that inherit `Estimator` instead of > `JavaEstimator` and can only be used in PySpark. > CrossValidator and TrainValidateSplit support tuning these Python backend > estimators, > but cannot support save/load, because the CrossValidator and TrainValidateSplit > writer implementation uses JavaMLWriter, which requires converting nested > estimators into Java estimators. > OneVsRest save/load currently supports only Java backend classifiers due to a similar > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33520) make CrossValidator/TrainValidateSplit/OneVsRest support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33520: --- Description: Currently, PySpark supports third-party libraries that define Python backend estimators, i.e., estimators that inherit `Estimator` instead of `JavaEstimator` and can only be used in PySpark. CrossValidator and TrainValidateSplit support tuning these Python backend estimators, but cannot support save/load, because the CrossValidator and TrainValidateSplit writer implementation uses JavaMLWriter, which requires converting nested estimators into Java estimators. OneVsRest save/load currently supports only Java backend classifiers due to a similar issue. was: Currently, PySpark supports third-party libraries that define Python backend estimators, i.e., estimators that inherit `Estimator` instead of `JavaEstimator` and can only be used in PySpark. CrossValidator and TrainValidateSplit support tuning these Python backend estimators, but cannot support save/load, because the CrossValidator and TrainValidateSplit writer implementation uses JavaMLWriter, which requires converting nested estimators into Java estimators. > make CrossValidator/TrainValidateSplit/OneVsRest support Python backend > estimator/model > --- > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Priority: Major > > Currently, PySpark supports third-party libraries that define Python backend > estimators, i.e., estimators that inherit `Estimator` instead of > `JavaEstimator` and can only be used in PySpark. > CrossValidator and TrainValidateSplit support tuning these Python backend > estimators, > but cannot support save/load, because the CrossValidator and TrainValidateSplit > writer implementation uses JavaMLWriter, which requires converting nested > estimators into Java estimators. 
> OneVsRest save/load currently supports only Java backend classifiers due to a similar > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
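The save/load gap described in this issue can be illustrated without Spark: a writer that only understands JVM-backed stages must reject a pure-Python estimator, while a Python-aware writer could fall back to Python serialization. The classes and the `java_writer_can_save` helper below are hypothetical stand-ins, not Spark's actual `JavaMLWriter` API:

```python
# Hypothetical stand-ins: JavaEstimator mimics a JVM-backed stage that a
# Java-only writer can persist; PythonEstimator mimics a third-party
# pure-Python estimator. This is not Spark's JavaMLWriter API.
import pickle

class JavaEstimator:
    pass

class PythonEstimator:
    def __init__(self):
        self.param = 3

def java_writer_can_save(stage):
    # A writer that converts stages to Java objects can only handle
    # JVM-backed stages; pure-Python stages have no Java counterpart.
    return isinstance(stage, JavaEstimator)

est = PythonEstimator()
print(java_writer_can_save(est))  # False -- this is the save/load gap

# A Python-aware writer could fall back to Python serialization instead.
restored = pickle.loads(pickle.dumps(est))
print(restored.param)  # 3
```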
[jira] [Updated] (SPARK-33520) make CrossValidator/TrainValidateSplit/OneVsRest support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33520: --- Summary: make CrossValidator/TrainValidateSplit/OneVsRest support Python backend estimator/model (was: make CrossValidator/TrainValidateSplit support Python backend estimator/model) > make CrossValidator/TrainValidateSplit/OneVsRest support Python backend > estimator/model > --- > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Priority: Major > > Currently, pyspark support third-party library to define python backend > estimator, i.e., estimator that inherit `Estimator` instead of > `JavaEstimator`, and only can be used in pyspark. > CrossValidator and TrainValidateSplit support tuning these python backend > estimator, > but cannot support saving/load, becase CrossValidator and TrainValidateSplit > writer implementation is use JavaMLWriter, which require to convert nested > estimator into java estimator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33624) Extend SupportsSubquery in Filter, Aggregate and Project
[ https://issues.apache.org/jira/browse/SPARK-33624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241614#comment-17241614 ] Anton Okolnychyi commented on SPARK-33624: -- Actually, we cannot do this as we distinguish `Aggregate` and `SupportsSubquery` in `CheckAnalysis`. > Extend SupportsSubquery in Filter, Aggregate and Project > > > Key: SPARK-33624 > URL: https://issues.apache.org/jira/browse/SPARK-33624 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > We should extend SupportsSubquery in Filter, Aggregate and Project as > described > [here|https://github.com/apache/spark/pull/30555#discussion_r55689]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33624) Extend SupportsSubquery in Filter, Aggregate and Project
[ https://issues.apache.org/jira/browse/SPARK-33624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241567#comment-17241567 ] Anton Okolnychyi commented on SPARK-33624: -- I am going to submit a PR soon. > Extend SupportsSubquery in Filter, Aggregate and Project > > > Key: SPARK-33624 > URL: https://issues.apache.org/jira/browse/SPARK-33624 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > We should extend SupportsSubquery in Filter, Aggregate and Project as > described > [here|https://github.com/apache/spark/pull/30555#discussion_r55689]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33624) Extend SupportsSubquery in Filter, Aggregate and Project
Anton Okolnychyi created SPARK-33624: Summary: Extend SupportsSubquery in Filter, Aggregate and Project Key: SPARK-33624 URL: https://issues.apache.org/jira/browse/SPARK-33624 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Anton Okolnychyi We should extend SupportsSubquery in Filter, Aggregate and Project as described [here|https://github.com/apache/spark/pull/30555#discussion_r55689]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33608) Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates rule
[ https://issues.apache.org/jira/browse/SPARK-33608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33608. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30555 [https://github.com/apache/spark/pull/30555] > Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates rule > - > > Key: SPARK-33608 > URL: https://issues.apache.org/jira/browse/SPARK-33608 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.1.0 > > > {{PullupCorrelatedPredicates}} should handle DELETE/UPDATE/MERGE commands. > Right now, the rule works with filters and unary nodes only. As a result, > correlated predicates in DELETE/UPDATE/MERGE are not rewritten. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33608) Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates rule
[ https://issues.apache.org/jira/browse/SPARK-33608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33608: --- Assignee: Anton Okolnychyi > Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates rule > - > > Key: SPARK-33608 > URL: https://issues.apache.org/jira/browse/SPARK-33608 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > {{PullupCorrelatedPredicates}} should handle DELETE/UPDATE/MERGE commands. > Right now, the rule works with filters and unary nodes only. As a result, > correlated predicates in DELETE/UPDATE/MERGE are not rewritten. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
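The shape of the fix can be sketched generically: a rewrite rule that pattern-matches only filters and unary nodes silently leaves command nodes untouched, so extending the match to command nodes makes the rewrite fire there too. The node classes and rule functions below are illustrative, not Catalyst's actual `PullupCorrelatedPredicates` implementation:

```python
# Illustrative node classes and rules, not Catalyst's actual
# PullupCorrelatedPredicates implementation.
class Filter:
    pass

class Delete:  # stand-in for DELETE/UPDATE/MERGE command nodes
    pass

def pullup_v1(node):
    # Old behavior: only filters (and unary nodes) are matched, so
    # correlated predicates under commands are never rewritten.
    return "rewritten" if isinstance(node, Filter) else "unchanged"

def pullup_v2(node):
    # Fixed behavior: command nodes are matched too.
    return "rewritten" if isinstance(node, (Filter, Delete)) else "unchanged"

print(pullup_v1(Delete()))  # unchanged
print(pullup_v2(Delete()))  # rewritten
```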
[jira] [Resolved] (SPARK-33503) Refactor SortOrder class to allow multiple childrens
[ https://issues.apache.org/jira/browse/SPARK-33503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-33503. -- Fix Version/s: 3.1.0 Assignee: Prakhar Jain Resolution: Fixed Resolved by https://github.com/apache/spark/pull/30430 > Refactor SortOrder class to allow multiple childrens > > > Key: SPARK-33503 > URL: https://issues.apache.org/jira/browse/SPARK-33503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Prakhar Jain >Assignee: Prakhar Jain >Priority: Major > Fix For: 3.1.0 > > > Currently the SortOrder is a UnaryExpression with only one child. It contains > a field "sameOrderExpression" that needs some special handling as done in > [https://github.com/apache/spark/pull/30302] . > > One of the suggestion in > [https://github.com/apache/spark/pull/30302#discussion_r526104333] is to make > sameOrderExpression as children of SortOrder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33622: Assignee: Apache Spark > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be > useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33622: Assignee: (was: Apache Spark) > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be > useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241480#comment-17241480 ] Apache Spark commented on SPARK-33622: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/30561 > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be > useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-33622: --- Component/s: ML > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be > useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241469#comment-17241469 ] Anton Okolnychyi commented on SPARK-33623: -- I am going to submit a PR soon. > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33623) Add canDeleteWhere to SupportsDelete
Anton Okolnychyi created SPARK-33623: Summary: Add canDeleteWhere to SupportsDelete Key: SPARK-33623 URL: https://issues.apache.org/jira/browse/SPARK-33623 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Anton Okolnychyi The only way to support delete statements right now is to implement {{SupportsDelete}}. According to its Javadoc, that interface is meant for cases when we can delete data without much effort (e.g. deleting a complete partition in a Hive table). It is clear we need a more sophisticated API for row-level deletes. That's why it would be beneficial to add a method to {{SupportsDelete}} so that Spark can check whether a source can easily delete data using just filters, or whether it will need a full rewrite later on. This way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
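The proposed capability check can be sketched as follows; `PartitionedSource` and `can_delete_where` are hypothetical names for illustration, not the DSv2 interface itself. The idea: a source advertises whether the given filters alone suffice for a cheap (e.g. partition-level) delete, letting Spark plan a full row-level rewrite otherwise:

```python
# Hypothetical names for illustration; not the actual DSv2 interface.
class PartitionedSource:
    def __init__(self, partition_column):
        self.partition_column = partition_column

    def can_delete_where(self, filters):
        # A metadata-only delete is cheap only when every filter targets
        # the partition column; anything else needs a row-level rewrite.
        return all(col == self.partition_column for col, _ in filters)

src = PartitionedSource("day")
print(src.can_delete_where([("day", "2020-12-01")]))  # True: drop partition
print(src.can_delete_where([("user_id", 42)]))        # False: needs rewrite
```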
[jira] [Resolved] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
[ https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-32032. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29729 [https://github.com/apache/spark/pull/29729] > Eliminate deprecated poll(long) API calls to avoid infinite wait in driver > -- > > Key: SPARK-32032 > URL: https://issues.apache.org/jira/browse/SPARK-32032 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
[ https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-32032: Assignee: Gabor Somogyi > Eliminate deprecated poll(long) API calls to avoid infinite wait in driver > -- > > Key: SPARK-32032 > URL: https://issues.apache.org/jira/browse/SPARK-32032 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33622) Add array_to_vector function to SparkR
Maciej Szymkiewicz created SPARK-33622: -- Summary: Add array_to_vector function to SparkR Key: SPARK-33622 URL: https://issues.apache.org/jira/browse/SPARK-33622 Project: Spark Issue Type: Improvement Components: SparkR, SQL Affects Versions: 3.1.0 Reporter: Maciej Szymkiewicz SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33621) Add a way to inject optimization rules after expression optimization
[ https://issues.apache.org/jira/browse/SPARK-33621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241450#comment-17241450 ] Anton Okolnychyi commented on SPARK-33621: -- I am going to submit a PR soon. > Add a way to inject optimization rules after expression optimization > > > Key: SPARK-33621 > URL: https://issues.apache.org/jira/browse/SPARK-33621 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{SparkSessionExtensions}} allow us to inject optimization rules but they are > added to operator optimization batch. There are cases when users need to run > rules after the operator optimization batch. Currently, this is not possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33621) Add a way to inject optimization rules after expression optimization
Anton Okolnychyi created SPARK-33621: Summary: Add a way to inject optimization rules after expression optimization Key: SPARK-33621 URL: https://issues.apache.org/jira/browse/SPARK-33621 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Anton Okolnychyi {{SparkSessionExtensions}} allow us to inject optimization rules but they are added to operator optimization batch. There are cases when users need to run rules after the operator optimization batch. Currently, this is not possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
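A minimal sketch of the requested extension point, in plain Python (the class and method names are invented for illustration and are not the `SparkSessionExtensions` API): injected rules are collected separately and applied only after the built-in batch has run:

```python
# Invented names for illustration; not the SparkSessionExtensions API.
class Optimizer:
    def __init__(self):
        self.post_batch_rules = []

    def inject_post_batch_rule(self, rule):
        self.post_batch_rules.append(rule)

    def optimize(self, plan):
        plan = plan.upper()  # stand-in for the built-in optimization batch
        for rule in self.post_batch_rules:
            plan = rule(plan)  # injected rules run strictly afterwards
        return plan

opt = Optimizer()
opt.inject_post_batch_rule(lambda p: p + "!")
print(opt.optimize("plan"))  # PLAN!
```

Keeping injected rules in their own list, rather than appending them to the operator-optimization batch, is what guarantees the "run after" ordering the issue asks for.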
[jira] [Updated] (SPARK-33612) Add dataSourceRewriteRules batch to Optimizer
[ https://issues.apache.org/jira/browse/SPARK-33612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-33612: - Summary: Add dataSourceRewriteRules batch to Optimizer (was: Add v2SourceRewriteRules batch to Optimizer) > Add dataSourceRewriteRules batch to Optimizer > - > > Key: SPARK-33612 > URL: https://issues.apache.org/jira/browse/SPARK-33612 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > Right now, we have a special place in the optimizer where we run rules for > constructing v2 scans. As time shows, we need more rewrite rules for v2 > tables and not all of them related to reads. One option is to rename the > current batch into something more generic but it would require changing quite > some places. Moreover, the batch contains some non-V2 logic as well. That's > why it seems better to introduce a new batch and use it for all data source > v2 rewrites and beyond. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33612) Add v2SourceRewriteRules batch to Optimizer
[ https://issues.apache.org/jira/browse/SPARK-33612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-33612: - Description: Right now, we have a special place in the optimizer where we run rules for constructing v2 scans. As time shows, we need more rewrite rules for v2 tables and not all of them related to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. Moreover, the batch contains some non-V2 logic as well. That's why it seems better to introduce a new batch and use it for all data source v2 rewrites and beyond. (was: Right now, we have a special place in the optimizer where we run rules for constructing v2 scans. As time shows, we need more rewrite rules for v2 tables and not all of them related to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. Moreover, the batch contains some non-V2 logic as well. That's why it seems better to introduce a new batch and use it for all data source v2 rewrites.) > Add v2SourceRewriteRules batch to Optimizer > --- > > Key: SPARK-33612 > URL: https://issues.apache.org/jira/browse/SPARK-33612 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > Right now, we have a special place in the optimizer where we run rules for > constructing v2 scans. As time shows, we need more rewrite rules for v2 > tables and not all of them related to reads. One option is to rename the > current batch into something more generic but it would require changing quite > some places. Moreover, the batch contains some non-V2 logic as well. That's > why it seems better to introduce a new batch and use it for all data source > v2 rewrites and beyond. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
[ https://issues.apache.org/jira/browse/SPARK-33572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33572: --- Assignee: zhoukeyong > Datetime building should fail if the year, month, ..., second combination is > invalid > > > Key: SPARK-33572 > URL: https://issues.apache.org/jira/browse/SPARK-33572 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: zhoukeyong >Assignee: zhoukeyong >Priority: Major > > Datetime building should fail if the year, month, ..., second combination is > invalid, when ANSI mode is enabled. This patch should update MakeDate, > MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
[ https://issues.apache.org/jira/browse/SPARK-33572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33572. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30516 [https://github.com/apache/spark/pull/30516] > Datetime building should fail if the year, month, ..., second combination is > invalid > > > Key: SPARK-33572 > URL: https://issues.apache.org/jira/browse/SPARK-33572 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: zhoukeyong >Assignee: zhoukeyong >Priority: Major > Fix For: 3.1.0 > > > Datetime building should fail if the year, month, ..., second combination is > invalid, when ANSI mode is enabled. This patch should update MakeDate, > MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
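The resolved behavior can be illustrated from spark-shell. This is a hedged sketch only, not code from the patch: it assumes a Spark 3.1+ session, and the exact exception type and message may differ between versions.

```scala
// Illustrative sketch (assumes a Spark 3.1+ session in spark-shell).
// With ANSI mode off, an invalid year/month/day combination yields NULL:
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT make_date(2020, 13, 1) AS d").show()
// +----+
// |   d|
// +----+
// |null|
// +----+

// With ANSI mode on, the same expression is expected to fail at runtime
// instead of silently returning NULL (month 13 is out of range):
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT make_date(2020, 13, 1) AS d").show()
// expected: a java.time.DateTimeException about an invalid month value
```

The same ANSI-gated failure applies to `make_timestamp` and `make_interval`, per the ticket.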
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png! !mgg1s.png! This my code: var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !image-2020-12-01-13-34-17-283.png! !image-2020-12-01-13-34-31-288.png! This my code: {{var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) \{ logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData)}} > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png! > > !mgg1s.png! 
> > This my code: > var allTrafficRDD = sparkContext.emptyRDD[String] > for (traffic <- trafficBuffer) { > logger.info("Load traffic path - "+traffic) > val trafficRDD = sparkContext.textFile(traffic) > if (isValidTraffic(trafficRDD, isMasterData)) > { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } > } > > hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), > outTable, isMasterData) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
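The reported loop unions RDDs one at a time, which nests one `UnionRDD` per input path and can make the driver-side lineage very deep when there are many files; a flatter alternative is a single `SparkContext.union` call (or passing all paths to `textFile` at once). A hedged sketch reusing the reporter's own identifiers (`trafficBuffer`, `isValidTraffic`, `filterTraffic`, `isMasterData`, `hiveService` are not defined in the ticket and are assumed to exist as in the original code):

```scala
// Sketch, not a verified fix: assumes the reporter's helpers are in scope.
// Building the RDDs first and unioning once keeps the lineage flat,
// instead of nesting a UnionRDD per iteration of the loop.
val validRDDs = trafficBuffer
  .map { path =>
    logger.info("Load traffic path - " + path)
    sparkContext.textFile(path)
  }
  .filter(rdd => isValidTraffic(rdd, isMasterData))
  .map(filterTraffic)

val allTrafficRDD = sparkContext.union(validRDDs)

hiveService.insertTrafficRDD(
  allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData)
```

Whether this resolves the "tasks never start" symptom at ~2000 GB is untested here; it only addresses the lineage-depth cost of the original pattern.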
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png! !mgg1s.png! This my code: var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData) > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png|width=644,height=150! > > !mgg1s.png|width=651,height=182! 
> > This my code: > var allTrafficRDD = sparkContext.emptyRDD[String] > for (traffic <- trafficBuffer) { > logger.info("Load traffic path - "+traffic) > val trafficRDD = sparkContext.textFile(traffic) > if (isValidTraffic(trafficRDD, isMasterData)) > { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } > } > > hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), > outTable, isMasterData) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Attachment: mgg1s.png > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !image-2020-12-01-13-34-17-283.png! > !image-2020-12-01-13-34-31-288.png! > > This my code: > > {{var allTrafficRDD = sparkContext.emptyRDD[String] > for (traffic <- trafficBuffer) \{ > logger.info("Load traffic path - "+traffic) > val trafficRDD = sparkContext.textFile(traffic) > if (isValidTraffic(trafficRDD, isMasterData)) { > allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) > } > } > > hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), > outTable, isMasterData)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Attachment: VlwWJ.png > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !image-2020-12-01-13-34-17-283.png! > !image-2020-12-01-13-34-31-288.png! > > This my code: > > {{var allTrafficRDD = sparkContext.emptyRDD[String] > for (traffic <- trafficBuffer) \{ > logger.info("Load traffic path - "+traffic) > val trafficRDD = sparkContext.textFile(traffic) > if (isValidTraffic(trafficRDD, isMasterData)) { > allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) > } > } > > hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), > outTable, isMasterData)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33620) Task not started after filtering
Vladislav Sterkhov created SPARK-33620: -- Summary: Task not started after filtering Key: SPARK-33620 URL: https://issues.apache.org/jira/browse/SPARK-33620 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.4.7 Reporter: Vladislav Sterkhov Hello, I have a problem with a large input (~2000 GB HDFS stack): the task never starts. With 300 GB the task starts and completes, but we need to support an unlimited stack. Please help !image-2020-12-01-13-34-17-283.png! !image-2020-12-01-13-34-31-288.png! This is my code: {{var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - " + traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
(Editorial note withheld; see original wording above.)
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241408#comment-17241408 ] Maxim Gekk commented on SPARK-33571: [~simonvanderveldt] Looking at the dates, you tested, both dates 1880-10-01 and 2020-10-01 belong to the Gregorian calendar, so, should be no diffs. For the date 0220-10-01, please, have a look at the table which I built in the PR: https://github.com/apache/spark/pull/28067 . The table shows that there is no diffs between 2 calendars for the year. > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. 
with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compares to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. 
You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader
[ https://issues.apache.org/jira/browse/SPARK-32910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-32910: -- Affects Version/s: (was: 3.1.0) 3.2.0 > Remove UninterruptibleThread usage from KafkaOffsetReader > - > > Key: SPARK-32910 > URL: https://issues.apache.org/jira/browse/SPARK-32910 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Gabor Somogyi >Priority: Major > > We've talked about this here: > https://github.com/apache/spark/pull/29729#discussion_r488690731 > This jira stands only if the mentioned PR is merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader
[ https://issues.apache.org/jira/browse/SPARK-32910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241403#comment-17241403 ] Gabor Somogyi commented on SPARK-32910: --- I think there will be no time to put it into 3.1 so changing the target. > Remove UninterruptibleThread usage from KafkaOffsetReader > - > > Key: SPARK-32910 > URL: https://issues.apache.org/jira/browse/SPARK-32910 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > We've talked about this here: > https://github.com/apache/spark/pull/29729#discussion_r488690731 > This jira stands only if the mentioned PR is merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33619: Assignee: Apache Spark > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Apache Spark >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. > GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE > SHOULD FIX this later. > 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33619: Assignee: (was: Apache Spark) > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. > GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE > SHOULD FIX this later. > 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241401#comment-17241401 ] Apache Spark commented on SPARK-33619: -- User 'leanken' has created a pull request for this issue: https://github.com/apache/spark/pull/30560 > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. > GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE > SHOULD FIX this later. > 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241400#comment-17241400 ] Maxim Gekk commented on SPARK-33571: Spark 3.0.1 shows different results as well: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.1 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_275) scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) 20/12/01 12:31:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is. 
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) +--+--+ |dict |plain | +--+--+ |1001-01-01|1001-01-01| |1001-01-01|1001-01-02| |1001-01-01|1001-01-03| |1001-01-01|1001-01-04| |1001-01-01|1001-01-05| |1001-01-01|1001-01-06| |1001-01-01|1001-01-07| |1001-01-01|1001-01-08| +--+--+ scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED") scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) +--+--+ |dict |plain | +--+--+ |1001-01-07|1001-01-07| |1001-01-07|1001-01-08| |1001-01-07|1001-01-09| |1001-01-07|1001-01-10| |1001-01-07|1001-01-11| |1001-01-07|1001-01-12| |1001-01-07|1001-01-13| |1001-01-07|1001-01-14| +--+--+ {code} > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. 
> From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog po
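The six-day shift in the output above (1001-01-01 under LEGACY vs 1001-01-07 under CORRECTED) is exactly the Julian vs. proleptic Gregorian labeling difference for that era, and can be reproduced with plain JDK classes, no Spark required; a sketch:

```scala
import java.time.LocalDate
import java.util.{Calendar, Date, GregorianCalendar, TimeZone}

// Pushing the Julian->Gregorian change-over far into the future makes
// GregorianCalendar behave as a pure Julian calendar, which is how the
// legacy hybrid calendar labels dates this far before 1582.
val julian = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
julian.setGregorianChange(new Date(Long.MaxValue))
julian.clear()
julian.set(1001, Calendar.JANUARY, 1) // Julian 1001-01-01, midnight UTC

// Relabel the same instant in the proleptic Gregorian calendar (java.time
// is proleptic ISO); floorDiv handles the negative pre-epoch millis.
val epochDay = Math.floorDiv(julian.getTimeInMillis, 86400000L)
val proleptic = LocalDate.ofEpochDay(epochDay)
println(proleptic) // 1001-01-07: the same six-day shift as LEGACY vs CORRECTED
```

The gap is six days in year 1001 because the proleptic Gregorian calendar skips the leap days of the Julian century years 300, 500, 600, 700, 900, and 1000 (but not 400 or 800).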
[jira] [Updated] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-33619: Description: ``` In -SPARK-33460--- there is a bug as follow. GetMapValueUtil generated code error when ANSI mode is on s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" should be s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");""" But Why are checkExceptionInExpression[Exception](expr, errMsg) and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql can't detect this Bug it's because 1. checkExceptionInExpression is some what error, too. it should wrap with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE SHOULD FIX this later. 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the ConstantFolding rules, it is calling eval instead of code gen in this case ``` was: ``` In -SPARK-33460--- there is a bug as follow. GetMapValueUtil generated code error when ANSI mode is on s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" should be s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");""" But Why are checkExceptionInExpression[Exception](expr, errMsg) and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql can't detect this Bug it's because 1. checkExceptionInExpression is some what error, too. it should wrap with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the ConstantFolding rules, it is calling eval instead of code gen in this case ``` > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. 
> GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE > SHOULD FIX this later. > 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-33619: Description: ``` In -SPARK-33460--- there is a bug as follow. GetMapValueUtil generated code error when ANSI mode is on s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" should be s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");""" But Why are checkExceptionInExpression[Exception](expr, errMsg) and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql can't detect this Bug it's because 1. checkExceptionInExpression is some what error, too. it should wrap with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the ConstantFolding rules, it is calling eval instead of code gen in this case ``` was: ``` ``` > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. > GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation > 2. 
SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33619) GetMapValueUtil code generation error
Leanken.Lin created SPARK-33619: --- Summary: GetMapValueUtil code generation error Key: SPARK-33619 URL: https://issues.apache.org/jira/browse/SPARK-33619 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Leanken.Lin ``` ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241379#comment-17241379 ] Maxim Gekk commented on SPARK-33571: I have tried to reproduce the issue on the master branch by reading the file saved by Spark 2.4.5 (https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data): {code:scala} test("SPARK-33571: read ancient dates saved by Spark 2.4.5") { withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> LEGACY.toString) { val path = getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet") val df = spark.read.parquet(path) df.show(false) } withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> CORRECTED.toString) { val path = getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet") val df = spark.read.parquet(path) df.show(false) } } {code} The results are different in LEGACY and in CORRECTED modes: {code} +--+--+ |dict |plain | +--+--+ |1001-01-01|1001-01-01| |1001-01-01|1001-01-02| |1001-01-01|1001-01-03| |1001-01-01|1001-01-04| |1001-01-01|1001-01-05| |1001-01-01|1001-01-06| |1001-01-01|1001-01-07| |1001-01-01|1001-01-08| +--+--+ +--+--+ |dict |plain | +--+--+ |1001-01-07|1001-01-07| |1001-01-07|1001-01-08| |1001-01-07|1001-01-09| |1001-01-07|1001-01-10| |1001-01-07|1001-01-11| |1001-01-07|1001-01-12| |1001-01-07|1001-01-13| |1001-01-07|1001-01-14| +--+--+ {code} > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. 
> From what I understand it should work like this:
> * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
> * Only applies when reading or writing parquet files
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInRead`
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1 with for example `df.show()` as they did in Spark 2.4.5
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1 with for example `df.show()` than they did in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
> * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` which contain timestamps before the above mentioned moments in time without `datetimeRebaseModeInRead` set doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
> * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` or `DateType` which contain dates or timestamps before the above mentioned moments in time with `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.
> I've ma
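The six-day shift in Maxim Gekk's reproduction above — `1001-01-01` in LEGACY mode versus `1001-01-07` in CORRECTED mode — is exactly the offset between the hybrid Julian calendar (which Spark 2.x used) and the proleptic Gregorian calendar (which Spark 3.x uses) for the year 1001. A standalone sketch in plain Python (no Spark required) using the standard integer Julian Day Number conversions shows where the six days come from:

```python
# Convert calendar dates to Julian Day Numbers (JDN) with the standard
# Fliegel-Van Flandern style integer formulas, then compare the calendars.

def gregorian_to_jdn(y, m, d):
    # Proleptic Gregorian calendar date -> JDN
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - yy // 100 + yy // 400 - 32045

def julian_to_jdn(y, m, d):
    # Julian calendar date -> JDN
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

# The physical day labeled "1001-01-01" in the Julian calendar is the same
# day the proleptic Gregorian calendar labels "1001-01-07" — the LEGACY
# (rebased) vs CORRECTED (as-stored) values in the df.show() output above.
assert julian_to_jdn(1001, 1, 1) == gregorian_to_jdn(1001, 1, 7)
assert julian_to_jdn(1001, 1, 1) - gregorian_to_jdn(1001, 1, 1) == 6

# Sanity check at the 1582 cutover: Julian 1582-10-04 is immediately
# followed by Gregorian 1582-10-15.
assert julian_to_jdn(1582, 10, 4) + 1 == gregorian_to_jdn(1582, 10, 15)
```

This is only an illustration of the calendar arithmetic behind the rebase, not the code path Spark itself uses.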
[jira] [Commented] (SPARK-33616) Join queries across different datasources (hive and hive)
[ https://issues.apache.org/jira/browse/SPARK-33616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241373#comment-17241373 ]

Fang Wen commented on SPARK-33616:
----------------------------------

ok, thanks

> Join queries across different datasources (hive and hive)
>
>                 Key: SPARK-33616
>                 URL: https://issues.apache.org/jira/browse/SPARK-33616
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Fang Wen
>            Priority: Major
>
> Spark 3.0 DataSource V2 supports join queries across different datasources,
> such as MySQL and Hive, which is very convenient for handling data from
> different sources. But in our scenario we sometimes need to handle Hive data
> from different clusters, so I want to know if there is a plan to support
> multiple Hive datasources in a later Spark version.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES
[ https://issues.apache.org/jira/browse/SPARK-33617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33617: Assignee: (was: Apache Spark) > Avoid generating small files for INSERT INTO TABLE from VALUES > --- > > Key: SPARK-33617 > URL: https://issues.apache.org/jira/browse/SPARK-33617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > create table t1(id int) stored as textfile; > insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8); > {code} > It will generate these files: > {noformat} > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES
[ https://issues.apache.org/jira/browse/SPARK-33617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241362#comment-17241362 ] Apache Spark commented on SPARK-33617: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30559 > Avoid generating small files for INSERT INTO TABLE from VALUES > --- > > Key: SPARK-33617 > URL: https://issues.apache.org/jira/browse/SPARK-33617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > create table t1(id int) stored as textfile; > insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8); > {code} > It will generate these files: > {noformat} > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES
[ https://issues.apache.org/jira/browse/SPARK-33617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33617: Assignee: Apache Spark > Avoid generating small files for INSERT INTO TABLE from VALUES > --- > > Key: SPARK-33617 > URL: https://issues.apache.org/jira/browse/SPARK-33617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce this issue: > {code:sql} > create table t1(id int) stored as textfile; > insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8); > {code} > It will generate these files: > {noformat} > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33618) hadoop-aws doesn't work
[ https://issues.apache.org/jira/browse/SPARK-33618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33618: Assignee: (was: Apache Spark) > hadoop-aws doesn't work > --- > > Key: SPARK-33618 > URL: https://issues.apache.org/jira/browse/SPARK-33618 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > According to > [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since > Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. In > other words, the regression is that `dev/make-distribution.sh -Phadoop-cloud > ...` doesn't make a complete distribution for cloud support. It fails at > write operation like the following. > {code} > $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID > --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY > 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context available as 'sc' (master = local[*], app id = > local-1606806088715). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
> 20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> +------+--------------+----------------+
> |  name|favorite_color|favorite_numbers|
> +------+--------------+----------------+
> |Alyssa|          null|  [3, 9, 15, 20]|
> |   Ben|           red|              []|
> +------+--------------+----------------+
> scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
> 20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NoSuchMethodError:
> org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33618) hadoop-aws doesn't work
[ https://issues.apache.org/jira/browse/SPARK-33618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241345#comment-17241345 ] Apache Spark commented on SPARK-33618: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30508 > hadoop-aws doesn't work > --- > > Key: SPARK-33618 > URL: https://issues.apache.org/jira/browse/SPARK-33618 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > According to > [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since > Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. In > other words, the regression is that `dev/make-distribution.sh -Phadoop-cloud > ...` doesn't make a complete distribution for cloud support. It fails at > write operation like the following. > {code} > $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID > --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY > 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context available as 'sc' (master = local[*], app id = > local-1606806088715). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
> 20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> +------+--------------+----------------+
> |  name|favorite_color|favorite_numbers|
> +------+--------------+----------------+
> |Alyssa|          null|  [3, 9, 15, 20]|
> |   Ben|           red|              []|
> +------+--------------+----------------+
> scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
> 20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NoSuchMethodError:
> org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33618) hadoop-aws doesn't work
[ https://issues.apache.org/jira/browse/SPARK-33618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33618: Assignee: Apache Spark > hadoop-aws doesn't work > --- > > Key: SPARK-33618 > URL: https://issues.apache.org/jira/browse/SPARK-33618 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Blocker > > According to > [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since > Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. In > other words, the regression is that `dev/make-distribution.sh -Phadoop-cloud > ...` doesn't make a complete distribution for cloud support. It fails at > write operation like the following. > {code} > $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID > --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY > 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context available as 'sc' (master = local[*], app id = > local-1606806088715). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
> 20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> +------+--------------+----------------+
> |  name|favorite_color|favorite_numbers|
> +------+--------------+----------------+
> |Alyssa|          null|  [3, 9, 15, 20]|
> |   Ben|           red|              []|
> +------+--------------+----------------+
> scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
> 20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NoSuchMethodError:
> org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.
[ https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241342#comment-17241342 ] Darshat commented on SPARK-33576: - I turned on the jvm exception logging, and this is the full trace: JVM stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 58.0 failed 4 times, most recent failure: Lost task 1.3 in stage 58.0 (TID 3336, 10.139.64.5, executor 7): org.apache.spark.api.python.PythonException: 'OSError: Invalid IPC message: negative bodyLength'. Full traceback below: Traceback (most recent call last): File "/databricks/spark/python/pyspark/worker.py", line 654, in main process() File "/databricks/spark/python/pyspark/worker.py", line 646, in process serializer.dump_stream(out_iter, outfile) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in dump_stream for batch in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in init_stream_yield_batches for series in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in load_stream for batch in batches: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in load_stream for batch in batches: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in __iter__ File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative bodyLength at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:598) at 
org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99) at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:551) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithoutKey_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144) at org.apache.spark.scheduler.Task.run(Task.scala:117) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721) at o
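For context on the `OSError: Invalid IPC message: negative bodyLength` above: the Arrow IPC `bodyLength` field is read as a signed 64-bit integer, so a value that was corrupted, or whose high bit got set by an overflow somewhere upstream, reads back as negative. The following is a generic illustration of that two's-complement reinterpretation in plain Python — it is not a reconstruction of the actual pyarrow internals:

```python
# Reinterpret an arbitrary integer as a two's-complement signed int64,
# showing how a length that should be a large positive count can surface
# as a negative value when read through a signed 64-bit field.

def as_signed_int64(value):
    """Keep the low 64 bits and apply two's-complement sign semantics."""
    value &= (1 << 64) - 1          # truncate to 64 bits
    if value >= 1 << 63:            # top bit set -> negative when signed
        value -= 1 << 64
    return value

assert as_signed_int64(42) == 42                  # small lengths round-trip
assert as_signed_int64(2**63 - 1) == 2**63 - 1    # largest valid signed length
assert as_signed_int64(2**63) == -(2**63)         # one past it goes negative
assert as_signed_int64(2**64 - 1) == -1           # all-ones reads as -1
```

A reader that validates `bodyLength >= 0`, as pyarrow does, rejects such a message with exactly this kind of error.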
[jira] [Created] (SPARK-33618) hadoop-aws doesn't work
Dongjoon Hyun created SPARK-33618: - Summary: hadoop-aws doesn't work Key: SPARK-33618 URL: https://issues.apache.org/jira/browse/SPARK-33618 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.0 Reporter: Dongjoon Hyun According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. In other words, the regression is that `dev/make-distribution.sh -Phadoop-cloud ...` doesn't make a complete distribution for cloud support. It fails at write operation like the following. {code} $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context available as 'sc' (master = local[*], app id = local-1606806088715). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) Type in expressions to have them evaluated. Type :help for more information. 
scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
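The `NoSuchMethodError` above identifies the missing constructor by its JVM type descriptor. Decoding `(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V` shows the constructor `hadoop-aws` was compiled against takes an unshaded Guava `ListeningExecutorService`, an `int`, and a `boolean` — which is consistent with the shading regression the issue describes, since `hadoop-client-api` relocates Guava. A small descriptor decoder in plain Python (handling only object types and primitives, not arrays) makes the log line readable:

```python
# Decode a JVM method descriptor such as "(Lfoo/Bar;IZ)V" into parameter
# types and a return type. Covers object types ("L<class>;") and primitive
# codes; a complete decoder would also handle array types ("[").

PRIMITIVES = {"B": "byte", "C": "char", "D": "double", "F": "float",
              "I": "int", "J": "long", "S": "short", "Z": "boolean",
              "V": "void"}

def decode_descriptor(desc):
    assert desc.startswith("("), "descriptor must begin with a parameter list"
    params, i = [], 1
    while desc[i] != ")":
        if desc[i] == "L":                       # object type: L<class>;
            end = desc.index(";", i)
            params.append(desc[i + 1:end].replace("/", "."))
            i = end + 1
        else:                                    # single-letter primitive
            params.append(PRIMITIVES[desc[i]])
            i += 1
    return params, PRIMITIVES[desc[i + 1]]       # char after ")" is the return type

params, ret = decode_descriptor(
    "(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V")
assert params == ["com.google.common.util.concurrent.ListeningExecutorService",
                  "int", "boolean"]
assert ret == "void"   # <init> constructors always return void
```

The helper is illustrative only; the class receiving the call at runtime simply no longer exposes a constructor with that exact signature.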
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241339#comment-17241339 ]

Maxim Gekk commented on SPARK-33571:
------------------------------------

[~simonvanderveldt] Thank you for the detailed description and your investigation. Let me clarify a few things:

> From our testing we're seeing several issues:
> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` which contain timestamps before the above mentioned moments in time without `datetimeRebaseModeInRead` set doesn't raise the `SparkUpgradeException`, it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5

Spark 2.4.5 writes timestamps as the parquet INT96 type. The SQL config `datetimeRebaseModeInRead` does not influence reading such types in Spark 3.0.1, so Spark always performs rebasing (LEGACY mode). We recently added separate configs for INT96:
* https://github.com/apache/spark/pull/30056
* https://github.com/apache/spark/pull/30121
The changes will be released with Spark 3.1.0.

> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` or `DateType` which contain dates or timestamps before the above mentioned moments in time with `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.

For INT96, this seems to be the correct behavior. We should observe different results for the TIMESTAMP_MICROS and TIMESTAMP_MILLIS types; see the SQL config spark.sql.parquet.outputTimestampType. The DATE case is more interesting, as we must see a difference in results for ancient dates. I will investigate this case.
> Handling of hybrid to proleptic calendar when reading and writing Parquet
> data not working correctly
>
>                 Key: SPARK-33571
>                 URL: https://issues.apache.org/jira/browse/SPARK-33571
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.0, 3.0.1
>            Reporter: Simon
>            Priority: Major
>
> The handling of old dates written with older Spark versions (<2.4.6) using the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working correctly.
> From what I understand it should work like this:
> * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
> * Only applies when reading or writing parquet files
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInRead`
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1 with for example `df.show()` as they did in Spark 2.4.5
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1 with for example `df.show()` than they did in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
> * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` which contain timestamps before the above mentioned moments in time without `datetimeRebaseModeInRead` set doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
> * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` or `DateType` which contain dates or timestamps before the
[jira] [Created] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES
Yuming Wang created SPARK-33617: --- Summary: Avoid generating small files for INSERT INTO TABLE from VALUES Key: SPARK-33617 URL: https://issues.apache.org/jira/browse/SPARK-33617 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang How to reproduce this issue: {code:sql} create table t1(id int) stored as textfile; insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8); {code} It will generate these files: {noformat} -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
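The eight two-byte files come from each VALUES row landing in its own partition, and each non-empty partition being written as one output file. A minimal sketch of that behavior in plain Python — the helper names are illustrative, not Spark APIs:

```python
# Model how N inserted rows spread across N partitions produce N tiny files,
# mimicking the one-row-per-partition write described in the issue.

def distribute_rows(rows, num_partitions):
    """Round-robin rows across partitions, like a local parallelized insert."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

def files_written(partitions):
    """Each non-empty partition is flushed as exactly one output file."""
    return sum(1 for p in partitions if p)

rows = [1, 2, 3, 4, 5, 6, 7, 8]

# Eight rows over eight partitions -> eight small files, one row each.
assert files_written(distribute_rows(rows, num_partitions=8)) == 8

# Collapsing to a single partition before the write yields a single file,
# which is the kind of behavior the improvement aims for.
assert files_written(distribute_rows(rows, num_partitions=1)) == 1
```

The sketch only models the file-count arithmetic; the actual fix would live in how Spark plans the write for a local VALUES relation.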
[jira] [Commented] (SPARK-33599) Group exception messages in catalyst/analysis
[ https://issues.apache.org/jira/browse/SPARK-33599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241330#comment-17241330 ]

jiaan.geng commented on SPARK-33599:
------------------------------------

I'm working on it.

> Group exception messages in catalyst/analysis
>
>                 Key: SPARK-33599
>                 URL: https://issues.apache.org/jira/browse/SPARK-33599
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Allison Wang
>            Priority: Major
>
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis'
> || Filename || Count ||
> | Analyzer.scala | 1 |
> | CheckAnalysis.scala | 1 |
> | FunctionRegistry.scala | 5 |
> | ResolveCatalogs.scala | 1 |
> | ResolveHints.scala | 1 |
> | package.scala | 2 |
> | unresolved.scala | 43 |
>
> '/core/src/main/scala/org/apache/spark/sql/catalyst/analysis'
> || Filename || Count ||
> | ResolveSessionCatalog.scala | 12 |

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org