[jira] [Created] (SPARK-33626) k8s integration tests should assert driver and executor logs for must & must not contain
Prashant Sharma created SPARK-33626: --- Summary: k8s integration tests should assert driver and executor logs for must & must not contain Key: SPARK-33626 URL: https://issues.apache.org/jira/browse/SPARK-33626 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.1.0 Reporter: Prashant Sharma Improve the k8s tests so that both driver and executor logs can be asserted for must-contain and must-not-contain conditions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
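A minimal sketch of such a log-assertion helper, in plain Python rather than the Scala test framework the k8s integration tests actually use (the function and names here are hypothetical, for illustration only):

```python
def check_log(log_text, must_contain=(), must_not_contain=()):
    """Return (ok, failures) for must/must-not substring assertions on a log."""
    failures = []
    for s in must_contain:
        if s not in log_text:
            failures.append(f"expected to find: {s!r}")
    for s in must_not_contain:
        if s in log_text:
            failures.append(f"expected NOT to find: {s!r}")
    return (not failures, failures)

# Example: a driver log should mention the Spark version and contain no errors.
log = "INFO SparkContext: Running Spark version 3.1.0\nINFO Executor: Finished task"
ok, why = check_log(log, must_contain=["Running Spark"], must_not_contain=["ERROR"])
```

The same helper covers both directions the ticket asks for: a passing run returns an empty failure list, while a forbidden substring produces a descriptive failure message.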
[jira] [Commented] (SPARK-33625) Subexpression elimination for whole-stage codegen in Filter
[ https://issues.apache.org/jira/browse/SPARK-33625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242087#comment-17242087 ] Apache Spark commented on SPARK-33625: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30565 > Subexpression elimination for whole-stage codegen in Filter > --- > > Key: SPARK-33625 > URL: https://issues.apache.org/jira/browse/SPARK-33625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We made subexpression elimination available for whole-stage codegen in > ProjectExec. Another operator that frequently encounters subexpressions > is Filter. We should enable subexpression elimination for whole-stage codegen > in FilterExec as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33625) Subexpression elimination for whole-stage codegen in Filter
[ https://issues.apache.org/jira/browse/SPARK-33625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33625: Assignee: L. C. Hsieh (was: Apache Spark) > Subexpression elimination for whole-stage codegen in Filter > --- > > Key: SPARK-33625 > URL: https://issues.apache.org/jira/browse/SPARK-33625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > We made subexpression elimination available for whole-stage codegen in > ProjectExec. Another operator that frequently encounters subexpressions > is Filter. We should enable subexpression elimination for whole-stage codegen > in FilterExec as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33625) Subexpression elimination for whole-stage codegen in Filter
[ https://issues.apache.org/jira/browse/SPARK-33625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33625: Assignee: Apache Spark (was: L. C. Hsieh) > Subexpression elimination for whole-stage codegen in Filter > --- > > Key: SPARK-33625 > URL: https://issues.apache.org/jira/browse/SPARK-33625 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > We made subexpression elimination available for whole-stage codegen in > ProjectExec. Another operator that frequently encounters subexpressions > is Filter. We should enable subexpression elimination for whole-stage codegen > in FilterExec as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33625) Subexpression elimination for whole-stage codegen in Filter
L. C. Hsieh created SPARK-33625: --- Summary: Subexpression elimination for whole-stage codegen in Filter Key: SPARK-33625 URL: https://issues.apache.org/jira/browse/SPARK-33625 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh We made subexpression elimination available for whole-stage codegen in ProjectExec. Another operator that frequently encounters subexpressions is Filter. We should enable subexpression elimination for whole-stage codegen in FilterExec as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
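As a language-agnostic illustration of what subexpression elimination buys (plain Python, not Spark's generated code), binding a repeated expensive expression once per row cuts the number of evaluations:

```python
# Count how often an "expensive" expression is evaluated with and without
# common-subexpression elimination in a filter predicate.
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1
    return x * x

rows = [1, 2, 3, 4]

# Naive filter: expensive(x) appears twice, so it can run twice per row.
naive = [x for x in rows if expensive(x) > 2 and expensive(x) < 10]
naive_calls = calls["n"]

calls["n"] = 0

def pred(x):
    e = expensive(x)  # common subexpression computed once per row
    return e > 2 and e < 10

eliminated = [x for x in rows if pred(x)]
```

Both versions keep the same rows; only the call count differs (short-circuiting makes the naive version cheaper than 2x, but it still does more work).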
[jira] [Commented] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file
[ https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242074#comment-17242074 ] Apache Spark commented on SPARK-32670: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30564 > Group exception messages in Catalyst Analyzer in one file > - > > Key: SPARK-32670 > URL: https://issues.apache.org/jira/browse/SPARK-32670 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Assignee: Xinyi Yu >Priority: Minor > Fix For: 3.1.0 > > > For standardization of error messages and their maintenance, we can try to > group the exception messages into a single file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32863) Full outer stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-32863: Assignee: Cheng Su > Full outer stream-stream join > - > > Key: SPARK-32863 > URL: https://issues.apache.org/jira/browse/SPARK-32863 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Major > > The current stream-stream join supports inner, left outer and right outer joins > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166] > ). With the current design of stream-stream join (which marks in the state store whether a row is > matched or not), it would be very easy to support full outer > join as well. > > Full outer stream-stream join will work as follows: > (1). For a left side input row, check if there's a match in the right side state > store. If there's a match, output all matched rows. Put the row in the left side > state store. > (2). For a right side input row, check if there's a match in the left side state > store. If there's a match, output all matched rows and update the matched left side rows' > state by setting their "matched" field to true. Put the right side row in the right > side state store. > (3). For a left side row that needs to be evicted from the state store, output the row if > its "matched" field is false. > (4). For a right side row that needs to be evicted from the state store, output the row > if its "matched" field is false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32863) Full outer stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-32863. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30395 [https://github.com/apache/spark/pull/30395] > Full outer stream-stream join > - > > Key: SPARK-32863 > URL: https://issues.apache.org/jira/browse/SPARK-32863 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Major > Fix For: 3.1.0 > > > The current stream-stream join supports inner, left outer and right outer joins > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166] > ). With the current design of stream-stream join (which marks in the state store whether a row is > matched or not), it would be very easy to support full outer > join as well. > > Full outer stream-stream join will work as follows: > (1). For a left side input row, check if there's a match in the right side state > store. If there's a match, output all matched rows. Put the row in the left side > state store. > (2). For a right side input row, check if there's a match in the left side state > store. If there's a match, output all matched rows and update the matched left side rows' > state by setting their "matched" field to true. Put the right side row in the right > side state store. > (3). For a left side row that needs to be evicted from the state store, output the row if > its "matched" field is false. > (4). For a right side row that needs to be evicted from the state store, output the row > if its "matched" field is false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
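The four steps in the description can be sketched in plain Python (an illustrative analogue only; real state stores, watermarks, and eviction policies are far more involved than this bookkeeping):

```python
# Each side keeps a state store of rows with a "matched" flag; rows that
# never matched are emitted with None padding when they are evicted.
def process(row, key, own_state, other_state, out, row_is_left):
    matched = False
    for entry in other_state:
        if entry["key"] == key:
            # steps (1)/(2): output the matched pair and flag the other side
            out.append((row, entry["row"]) if row_is_left else (entry["row"], row))
            entry["matched"] = True
            matched = True
    own_state.append({"key": key, "row": row, "matched": matched})

def evict_all(state, out, is_left):
    # steps (3)/(4): on eviction, emit rows whose "matched" flag is still false
    for entry in state:
        if not entry["matched"]:
            out.append((entry["row"], None) if is_left else (None, entry["row"]))
    state.clear()

left_state, right_state, out = [], [], []
process(("a", 1), "a", left_state, right_state, out, row_is_left=True)
process(("a", 2), "a", right_state, left_state, out, row_is_left=False)
process(("b", 3), "b", right_state, left_state, out, row_is_left=False)
evict_all(left_state, out, is_left=True)
evict_all(right_state, out, is_left=False)
```

The matched left/right pair is emitted as rows arrive, and the unmatched right-side row surfaces only at eviction time, padded with None, exactly the behavior a full outer join requires.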
[jira] [Resolved] (SPARK-33544) explode should not filter when used with CreateArray
[ https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33544. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30504 [https://github.com/apache/spark/pull/30504] > explode should not filter when used with CreateArray > > > Key: SPARK-33544 > URL: https://issues.apache.org/jira/browse/SPARK-33544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.1.0 > > > https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to > insert a filter for not null and size > 0 when using inner explode/inline. > This is fine in most cases, but the extra filter is not needed if the explode > is over a create array that does not use Literals (it already handles Literals). > In that case we know the values aren't null and the array has a known size. It > already handles the empty array. > for instance: > val df = someDF.selectExpr("number", "explode(array(word, col3))") > So in this case we shouldn't be inserting the extra Filter, and that filter > can get pushed down into e.g. a Parquet reader as well. This just causes > extra overhead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33544) explode should not filter when used with CreateArray
[ https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33544: Assignee: Thomas Graves > explode should not filter when used with CreateArray > > > Key: SPARK-33544 > URL: https://issues.apache.org/jira/browse/SPARK-33544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to > insert a filter for not null and size > 0 when using inner explode/inline. > This is fine in most cases, but the extra filter is not needed if the explode > is over a create array that does not use Literals (it already handles Literals). > In that case we know the values aren't null and the array has a known size. It > already handles the empty array. > for instance: > val df = someDF.selectExpr("number", "explode(array(word, col3))") > So in this case we shouldn't be inserting the extra Filter, and that filter > can get pushed down into e.g. a Parquet reader as well. This just causes > extra overhead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
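A plain-Python analogue of why the guard is redundant in this case (illustrative only, not Spark code): an array built from existing columns always has a fixed, non-zero size, even when individual elements are null, so a not-null / size > 0 filter on the array itself can never remove a row:

```python
# "explode" over an array constructed from two existing columns: every input
# row contributes exactly len(cols) output rows, even when an element is None.
rows = [{"number": 1, "word": "a", "col3": "b"},
        {"number": 2, "word": None, "col3": "c"}]

def explode_array(row, cols):
    return [(row["number"], row[c]) for c in cols]

exploded = [pair for r in rows for pair in explode_array(r, ["word", "col3"])]
```

The constructed array is never null and never empty, which is exactly the condition the extra Filter was guarding against.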
[jira] [Updated] (SPARK-33621) Add a way to inject data source rewrite rules
[ https://issues.apache.org/jira/browse/SPARK-33621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-33621: - Summary: Add a way to inject data source rewrite rules (was: Add a way to inject optimization rules after expression optimization) > Add a way to inject data source rewrite rules > - > > Key: SPARK-33621 > URL: https://issues.apache.org/jira/browse/SPARK-33621 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{SparkSessionExtensions}} allows us to inject optimization rules, but they are > added to the operator optimization batch. There are cases when users need to run > rules after the operator optimization batch. Currently, this is not possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241816#comment-17241816 ] Apache Spark commented on SPARK-33623: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/30562 > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
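A hypothetical Python analogue of the proposed API shape (the real interface is a Java {{SupportsDelete}} mix-in; the names, signatures, and partitioned source below are illustrative assumptions, not Spark code):

```python
from abc import ABC, abstractmethod

class SupportsDelete(ABC):
    @abstractmethod
    def can_delete_where(self, filters):
        """Return True if rows matching `filters` can be deleted cheaply
        (e.g. by dropping whole partitions) without rewriting files."""

    @abstractmethod
    def delete_where(self, filters):
        ...

class PartitionedSource(SupportsDelete):
    def __init__(self):
        self.partitions = {"2020": ["r1"], "2021": ["r2"]}

    def can_delete_where(self, filters):
        # Cheap only when every filter targets the partition column.
        return all(col == "year" for col, _ in filters)

    def delete_where(self, filters):
        for _, val in filters:
            self.partitions.pop(val, None)

src = PartitionedSource()
ok = src.can_delete_where([("year", "2020")])  # engine checks before deleting
src.delete_where([("year", "2020")])
```

This is the control the ticket asks for: when `can_delete_where` returns False, the engine knows up front that a metadata-only delete is impossible and can plan a full rewrite instead.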
[jira] [Assigned] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33623: Assignee: Apache Spark > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Apache Spark >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33623: Assignee: (was: Apache Spark) > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33622. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30561 [https://github.com/apache/spark/pull/30561] > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > SPARK-33556 added {{array_to_vector}} to the Scala and Python APIs. It will be > useful to have it in the R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33622: - Assignee: Maciej Szymkiewicz > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > SPARK-33556 added {{array_to_vector}} to the Scala and Python APIs. It will be > useful to have it in the R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.
[ https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241769#comment-17241769 ] Bryan Cutler commented on SPARK-33576: -- Is this due to the 2GB limit? As in https://issues.apache.org/jira/browse/SPARK-32294 and https://issues.apache.org/jira/browse/ARROW-4890 > PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC > message: negative bodyLength'. > - > > Key: SPARK-33576 > URL: https://issues.apache.org/jira/browse/SPARK-33576 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1 > Environment: Databricks runtime 7.3 > Spark 3.0.1 > Scala 2.12 >Reporter: Darshat >Priority: Major > > Hello, > We are using Databricks on Azure to process a large amount of e-commerce data. > The Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12. > During processing, a groupby operation on the DataFrame > consistently gets an exception of this type: > > {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: > Invalid IPC message: negative bodyLength'. Full traceback below: Traceback > (most recent call last): File "/databricks/spark/python/pyspark/worker.py", > line 654, in main process() File > "/databricks/spark/python/pyspark/worker.py", line 646, in process > serializer.dump_stream(out_iter, outfile) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in > dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in > dump_stream for batch in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in > init_stream_yield_batches for series in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in > load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in > __iter__ File "pyarrow/ipc.pxi", line 432, in > pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", > line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative > bodyLength{color} > > Code that causes this: > {color:#ff}x = df.groupby('providerid').apply(domain_features){color} > {color:#ff}display(x.info()){color} > Dataframe size - 22 million rows, 31 columns > One of the columns is a string ('providerid') on which we do a groupby > followed by an apply operation. There are 3 distinct provider ids in this > set. While trying to enumerate/count the results, we get this exception. > We've put all possible checks in the code for null values or corrupt data, > and we are not able to trace this to application-level code. I hope we can > get some help troubleshooting this, as it is a blocker for rolling out at > scale. > The cluster has 8 nodes + driver, all 28GB RAM. > I can provide any other > settings that could be useful. > Hope to get some insights into the problem. > Thanks, > Darshat Shah -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
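One plausible mechanism behind the "negative bodyLength" message, consistent with the signed 32-bit limits discussed in the linked SPARK-32294 and ARROW-4890 issues (this is an illustration of the overflow, not a reproduction of the Arrow code path):

```python
import struct

# If a length field is written as 32 bits but read back as a signed int32,
# a payload over 2 GiB wraps around to a negative number.
payload_len = 3 * 1024**3  # a hypothetical 3 GiB record batch body
(as_int32,) = struct.unpack("<i", struct.pack("<I", payload_len & 0xFFFFFFFF))
```

A length that overflows like this would surface downstream as an "invalid" negative bodyLength, which is why reducing per-batch size (e.g. via `spark.sql.execution.arrow.maxRecordsPerBatch`) or shrinking the per-group data is the usual mitigation.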
[jira] [Resolved] (SPARK-33611) Decode Query parameters of the redirect URL for reverse proxy
[ https://issues.apache.org/jira/browse/SPARK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-33611. Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30552 [https://github.com/apache/spark/pull/30552] > Decode Query parameters of the redirect URL for reverse proxy > - > > Key: SPARK-33611 > URL: https://issues.apache.org/jira/browse/SPARK-33611 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > When running Spark with reverse proxy enabled, the query parameters of the > request URL can be encoded twice: once by the browser and once by the > reverse proxy (e.g. Nginx). > In Spark's stage page, the URL of "/taskTable" contains the query parameter > order[0][dir]. After being encoded twice, the query parameter becomes > `order%255B0%255D%255Bdir%255D` and it will be decoded as > `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will > be a NullPointerException from > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176 > Other parameters may also not work as expected after being encoded twice. > We should decode the query parameters to fix the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
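The double-encoding behavior described above can be reproduced with only the Python standard library:

```python
from urllib.parse import quote, unquote

param = "order[0][dir]"
once = quote(param)            # encoded by the browser
twice = quote(once)            # encoded again by the reverse proxy
decoded_once = unquote(twice)  # the server decodes only one layer
```

After one round of decoding, the server still sees the percent-escapes from the first encoding pass instead of `order[0][dir]`; a second decode recovers the original parameter, which is what the fix adds on the redirect path.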
[jira] [Resolved] (SPARK-33612) Add dataSourceRewriteRules batch to Optimizer
[ https://issues.apache.org/jira/browse/SPARK-33612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33612. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30558 [https://github.com/apache/spark/pull/30558] > Add dataSourceRewriteRules batch to Optimizer > - > > Key: SPARK-33612 > URL: https://issues.apache.org/jira/browse/SPARK-33612 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.1.0 > > > Right now, we have a special place in the optimizer where we run rules for > constructing v2 scans. As time shows, we need more rewrite rules for v2 > tables, and not all of them are related to reads. One option is to rename the > current batch into something more generic, but that would require changing quite > a few places. Moreover, the batch contains some non-V2 logic as well. That's > why it seems better to introduce a new batch and use it for all data source > v2 rewrites and beyond. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33612) Add dataSourceRewriteRules batch to Optimizer
[ https://issues.apache.org/jira/browse/SPARK-33612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33612: - Assignee: Anton Okolnychyi > Add dataSourceRewriteRules batch to Optimizer > - > > Key: SPARK-33612 > URL: https://issues.apache.org/jira/browse/SPARK-33612 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > Right now, we have a special place in the optimizer where we run rules for > constructing v2 scans. As time shows, we need more rewrite rules for v2 > tables, and not all of them are related to reads. One option is to rename the > current batch into something more generic, but that would require changing quite > a few places. Moreover, the batch contains some non-V2 logic as well. That's > why it seems better to introduce a new batch and use it for all data source > v2 rewrites and beyond. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
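The idea of a separate, later rule batch can be sketched abstractly (plain Python; the string plan representation and rule names below are invented for illustration and are not Catalyst APIs):

```python
# A rule-based optimizer applies named batches of rewrite rules in order;
# putting data source rewrites in their own batch guarantees they run after
# the general operator-optimization rules.
def run_batches(plan, batches):
    for _name, rules in batches:
        for rule in rules:
            plan = rule(plan)
    return plan

def push_down_filter(plan):
    return plan.replace("Scan+Filter", "FilteredScan")

def rewrite_v2_delete(plan):
    return plan.replace("Delete(v2)", "RewriteFiles")

plan = "Delete(v2) -> Scan+Filter"
optimized = run_batches(plan, [
    ("operatorOptimization", [push_down_filter]),
    ("dataSourceRewriteRules", [rewrite_v2_delete]),  # the new, later batch
])
```

Ordering the batches this way is the point of the ticket: rewrite rules see a plan that operator optimization has already simplified.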
[jira] [Commented] (SPARK-33072) Remove two Hive 1.2-related Jenkins jobs
[ https://issues.apache.org/jira/browse/SPARK-33072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241714#comment-17241714 ] Dongjoon Hyun commented on SPARK-33072: --- Thank you! > Remove two Hive 1.2-related Jenkins jobs > > > Key: SPARK-33072 > URL: https://issues.apache.org/jira/browse/SPARK-33072 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > SPARK-20202 removed the `hive-1.2` profile in Apache Spark 3.1.0 and excluded the > following two jobs from the `Spark QA Dashboard`. We need to remove them. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.
[ https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241683#comment-17241683 ] Darshat commented on SPARK-33576: - I set the following options to see if they'd help, but still get the same error: spark.sql.execution.arrow.maxRecordsPerBatch 100 spark.sql.pyspark.jvmStacktrace.enabled true spark.driver.maxResultSize 18g spark.reducer.maxSizeInFlight 1024m > PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC > message: negative bodyLength'. > - > > Key: SPARK-33576 > URL: https://issues.apache.org/jira/browse/SPARK-33576 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1 > Environment: Databricks runtime 7.3 > Spark 3.0.1 > Scala 2.12 >Reporter: Darshat >Priority: Major > > Hello, > We are using Databricks on Azure to process a large amount of e-commerce data. > The Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12. > During processing, a groupby operation on the DataFrame > consistently gets an exception of this type: > > {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: > Invalid IPC message: negative bodyLength'. Full traceback below: Traceback > (most recent call last): File "/databricks/spark/python/pyspark/worker.py", > line 654, in main process() File > "/databricks/spark/python/pyspark/worker.py", line 646, in process > serializer.dump_stream(out_iter, outfile) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in > dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in > dump_stream for batch in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in > init_stream_yield_batches for series in iterator: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in > load_stream for batch in batches: File > "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in > load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in > __iter__ File "pyarrow/ipc.pxi", line 432, in > pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", > line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative > bodyLength{color} > > Code that causes this: > {color:#ff}x = df.groupby('providerid').apply(domain_features){color} > {color:#ff}display(x.info()){color} > Dataframe size - 22 million rows, 31 columns > One of the columns is a string ('providerid') on which we do a groupby > followed by an apply operation. There are 3 distinct provider ids in this > set. While trying to enumerate/count the results, we get this exception. > We've put all possible checks in the code for null values or corrupt data, > and we are not able to trace this to application-level code. I hope we can > get some help troubleshooting this, as it is a blocker for rolling out at > scale. > The cluster has 8 nodes + driver, all 28GB RAM. > I can provide any other > settings that could be useful. > Hope to get some insights into the problem. > Thanks, > Darshat Shah -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filteringRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filterRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png|width=644,height=150! > > !mgg1s.png|width=651,height=182! 
> > This my code: > var filteredRDD = sparkContext.emptyRDD[String] > for (path<- pathBuffer) > { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) > filteredRDD = filteredRDD.++(filteringRDD(someRDD )) } > hiveService.insertRDD(filteredRDD.repartition(10), outTable) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filterRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filterRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png|width=644,height=150! > > !mgg1s.png|width=651,height=182! 
> > This my code: > var filteredRDD = sparkContext.emptyRDD[String] > for (path<- pathBuffer) > { val someRDD = sparkContext.textFile(path) > if (isValidRDD(someRDD)) > filteredRDD = filteredRDD.++(filterRDD(someRDD )) > } > hiveService.insertRDD(filteredRDD.repartition(10), outTable) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var filteredRDD = sparkContext.emptyRDD[String] for (path<- pathBuffer) { val someRDD = sparkContext.textFile(path) if (isValidRDD(someRDD)) filteredRDD = filteredRDD.++(filterRDD(someRDD )) } hiveService.insertRDD(filteredRDD.repartition(10), outTable) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData) > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png|width=644,height=150! > > !mgg1s.png|width=651,height=182! 
> > This my code: > var filteredRDD = sparkContext.emptyRDD[String] > for (path<- pathBuffer) { > val someRDD = sparkContext.textFile(path) > if (isValidRDD(someRDD)) > filteredRDD = filteredRDD.++(filterRDD(someRDD )) > } > hiveService.insertRDD(filteredRDD.repartition(10), outTable) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
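A likely contributor to the stalled-task behavior reported above: building the combined RDD with pairwise `++` in a loop produces a lineage whose depth grows with the number of input paths, whereas a single n-ary `sparkContext.union(rddList)` keeps it flat. This plain-Python sketch (no Spark; `Node` is a stand-in for an RDD and its dependency graph) shows the depth difference:

```python
# Plain-Python sketch (no Spark): Node stands in for an RDD and its
# dependency lineage. Chaining pairwise unions in a loop -- the pattern in
# the reported code -- yields a lineage whose depth grows with the number
# of inputs; a single n-ary union (sparkContext.union(rddList) in Spark)
# keeps the lineage flat.
class Node:
    def __init__(self, children):
        self.children = children

    def depth(self):
        return 1 + max((c.depth() for c in self.children), default=0)

leaves = [Node([]) for _ in range(100)]

# Pairwise chaining, as in: filteredRDD = filteredRDD.++(someRDD)
chained = leaves[0]
for leaf in leaves[1:]:
    chained = Node([chained, leaf])

# Single n-ary union: depth stays constant regardless of input count.
flat = Node(leaves)

print(chained.depth(), flat.depth())  # 100 2
```

Deep lineages inflate driver-side planning and serialized task size, so collecting the per-path RDDs into a list and unioning once is the usual rewrite for this loop.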
[jira] [Updated] (SPARK-33520) make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33520: --- Summary: make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model (was: make CrossValidator/TrainValidateSplit/OneVsRest support Python backend estimator/model) > make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python > backend estimator/model > - > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Priority: Major > > Currently, PySpark supports third-party libraries that define Python backend > estimators, i.e., estimators that inherit `Estimator` instead of > `JavaEstimator` and can only be used in PySpark. > CrossValidator and TrainValidateSplit support tuning these Python backend > estimators, > but cannot support save/load, because the CrossValidator and TrainValidateSplit > writer implementation uses JavaMLWriter, which requires converting nested > estimators into Java estimators. > OneVsRest save/load currently supports only Java backend classifiers due to a similar > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33520) make CrossValidator/TrainValidateSplit/OneVsRest support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33520: --- Description: Currently, PySpark supports third-party libraries that define Python backend estimators, i.e., estimators that inherit `Estimator` instead of `JavaEstimator` and can only be used in PySpark. CrossValidator and TrainValidateSplit support tuning these Python backend estimators, but cannot support save/load, because the CrossValidator and TrainValidateSplit writer implementation uses JavaMLWriter, which requires converting nested estimators into Java estimators. OneVsRest save/load currently supports only Java backend classifiers due to a similar issue. was: Currently, PySpark supports third-party libraries that define Python backend estimators, i.e., estimators that inherit `Estimator` instead of `JavaEstimator` and can only be used in PySpark. CrossValidator and TrainValidateSplit support tuning these Python backend estimators, but cannot support save/load, because the CrossValidator and TrainValidateSplit writer implementation uses JavaMLWriter, which requires converting nested estimators into Java estimators. > make CrossValidator/TrainValidateSplit/OneVsRest support Python backend > estimator/model > --- > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Priority: Major > > Currently, PySpark supports third-party libraries that define Python backend > estimators, i.e., estimators that inherit `Estimator` instead of > `JavaEstimator` and can only be used in PySpark. > CrossValidator and TrainValidateSplit support tuning these Python backend > estimators, > but cannot support save/load, because the CrossValidator and TrainValidateSplit > writer implementation uses JavaMLWriter, which requires converting nested > estimators into Java estimators. 
> OneVsRest save/load currently supports only Java backend classifiers due to a similar > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
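The save/load gap described in this issue can be illustrated without Spark: a writer that only understands JVM-backed stages must reject a pure-Python estimator, while a Python-aware writer could fall back to Python serialization. The classes and the `java_writer_can_save` helper below are hypothetical stand-ins, not Spark's actual `JavaMLWriter` API:

```python
# Hypothetical stand-ins: JavaEstimator mimics a JVM-backed stage that a
# Java-only writer can persist; PythonEstimator mimics a third-party
# pure-Python estimator. This is not Spark's JavaMLWriter API.
import pickle

class JavaEstimator:
    pass

class PythonEstimator:
    def __init__(self):
        self.param = 3

def java_writer_can_save(stage):
    # A writer that converts stages to Java objects can only handle
    # JVM-backed stages; pure-Python stages have no Java counterpart.
    return isinstance(stage, JavaEstimator)

est = PythonEstimator()
print(java_writer_can_save(est))  # False -- this is the save/load gap

# A Python-aware writer could fall back to Python serialization instead.
restored = pickle.loads(pickle.dumps(est))
print(restored.param)  # 3
```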
[jira] [Updated] (SPARK-33520) make CrossValidator/TrainValidateSplit/OneVsRest support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-33520: --- Summary: make CrossValidator/TrainValidateSplit/OneVsRest support Python backend estimator/model (was: make CrossValidator/TrainValidateSplit support Python backend estimator/model) > make CrossValidator/TrainValidateSplit/OneVsRest support Python backend > estimator/model > --- > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Priority: Major > > Currently, pyspark support third-party library to define python backend > estimator, i.e., estimator that inherit `Estimator` instead of > `JavaEstimator`, and only can be used in pyspark. > CrossValidator and TrainValidateSplit support tuning these python backend > estimator, > but cannot support saving/load, becase CrossValidator and TrainValidateSplit > writer implementation is use JavaMLWriter, which require to convert nested > estimator into java estimator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33624) Extend SupportsSubquery in Filter, Aggregate and Project
[ https://issues.apache.org/jira/browse/SPARK-33624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241614#comment-17241614 ] Anton Okolnychyi commented on SPARK-33624: -- Actually, we cannot do this as we distinguish `Aggregate` and `SupportsSubquery` in `CheckAnalysis`. > Extend SupportsSubquery in Filter, Aggregate and Project > > > Key: SPARK-33624 > URL: https://issues.apache.org/jira/browse/SPARK-33624 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > We should extend SupportsSubquery in Filter, Aggregate and Project as > described > [here|https://github.com/apache/spark/pull/30555#discussion_r55689]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33624) Extend SupportsSubquery in Filter, Aggregate and Project
[ https://issues.apache.org/jira/browse/SPARK-33624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241567#comment-17241567 ] Anton Okolnychyi commented on SPARK-33624: -- I am going to submit a PR soon. > Extend SupportsSubquery in Filter, Aggregate and Project > > > Key: SPARK-33624 > URL: https://issues.apache.org/jira/browse/SPARK-33624 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > We should extend SupportsSubquery in Filter, Aggregate and Project as > described > [here|https://github.com/apache/spark/pull/30555#discussion_r55689]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33624) Extend SupportsSubquery in Filter, Aggregate and Project
Anton Okolnychyi created SPARK-33624: Summary: Extend SupportsSubquery in Filter, Aggregate and Project Key: SPARK-33624 URL: https://issues.apache.org/jira/browse/SPARK-33624 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Anton Okolnychyi We should extend SupportsSubquery in Filter, Aggregate and Project as described [here|https://github.com/apache/spark/pull/30555#discussion_r55689]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33608) Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates rule
[ https://issues.apache.org/jira/browse/SPARK-33608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33608. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30555 [https://github.com/apache/spark/pull/30555] > Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates rule > - > > Key: SPARK-33608 > URL: https://issues.apache.org/jira/browse/SPARK-33608 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.1.0 > > > {{PullupCorrelatedPredicates}} should handle DELETE/UPDATE/MERGE commands. > Right now, the rule works with filters and unary nodes only. As a result, > correlated predicates in DELETE/UPDATE/MERGE are not rewritten. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33608) Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates rule
[ https://issues.apache.org/jira/browse/SPARK-33608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33608: --- Assignee: Anton Okolnychyi > Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates rule > - > > Key: SPARK-33608 > URL: https://issues.apache.org/jira/browse/SPARK-33608 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > {{PullupCorrelatedPredicates}} should handle DELETE/UPDATE/MERGE commands. > Right now, the rule works with filters and unary nodes only. As a result, > correlated predicates in DELETE/UPDATE/MERGE are not rewritten. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
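The shape of the fix can be sketched generically: a rewrite rule that pattern-matches only filters and unary nodes silently leaves command nodes untouched, so extending the match to command nodes makes the rewrite fire there too. The node classes and rule functions below are illustrative, not Catalyst's actual `PullupCorrelatedPredicates` implementation:

```python
# Illustrative node classes and rules, not Catalyst's actual
# PullupCorrelatedPredicates implementation.
class Filter:
    pass

class Delete:  # stand-in for DELETE/UPDATE/MERGE command nodes
    pass

def pullup_v1(node):
    # Old behavior: only filters (and unary nodes) are matched, so
    # correlated predicates under commands are never rewritten.
    return "rewritten" if isinstance(node, Filter) else "unchanged"

def pullup_v2(node):
    # Fixed behavior: command nodes are matched too.
    return "rewritten" if isinstance(node, (Filter, Delete)) else "unchanged"

print(pullup_v1(Delete()))  # unchanged
print(pullup_v2(Delete()))  # rewritten
```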
[jira] [Resolved] (SPARK-33503) Refactor SortOrder class to allow multiple childrens
[ https://issues.apache.org/jira/browse/SPARK-33503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-33503. -- Fix Version/s: 3.1.0 Assignee: Prakhar Jain Resolution: Fixed Resolved by https://github.com/apache/spark/pull/30430 > Refactor SortOrder class to allow multiple childrens > > > Key: SPARK-33503 > URL: https://issues.apache.org/jira/browse/SPARK-33503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Prakhar Jain >Assignee: Prakhar Jain >Priority: Major > Fix For: 3.1.0 > > > Currently the SortOrder is a UnaryExpression with only one child. It contains > a field "sameOrderExpression" that needs some special handling as done in > [https://github.com/apache/spark/pull/30302] . > > One of the suggestion in > [https://github.com/apache/spark/pull/30302#discussion_r526104333] is to make > sameOrderExpression as children of SortOrder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33622: Assignee: Apache Spark > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be > useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33622: Assignee: (was: Apache Spark) > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be > useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241480#comment-17241480 ] Apache Spark commented on SPARK-33622: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/30561 > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be > useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33622) Add array_to_vector function to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-33622: --- Component/s: ML > Add array_to_vector function to SparkR > -- > > Key: SPARK-33622 > URL: https://issues.apache.org/jira/browse/SPARK-33622 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be > useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33623) Add canDeleteWhere to SupportsDelete
[ https://issues.apache.org/jira/browse/SPARK-33623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241469#comment-17241469 ] Anton Okolnychyi commented on SPARK-33623: -- I am going to submit a PR soon. > Add canDeleteWhere to SupportsDelete > > > Key: SPARK-33623 > URL: https://issues.apache.org/jira/browse/SPARK-33623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > The only way to support delete statements right now is to implement > {{SupportsDelete}}. According to its Javadoc, that interface is meant for > cases when we can delete data without much effort (e.g. deleting a > complete partition in a Hive table). It is clear we need a more sophisticated > API for row-level deletes. That's why it would be beneficial to add a method > to {{SupportsDelete}} so that Spark can check whether a source can easily delete > data using just filters, or whether it will need a full rewrite later on. This > way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33623) Add canDeleteWhere to SupportsDelete
Anton Okolnychyi created SPARK-33623: Summary: Add canDeleteWhere to SupportsDelete Key: SPARK-33623 URL: https://issues.apache.org/jira/browse/SPARK-33623 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Anton Okolnychyi The only way to support delete statements right now is to implement {{SupportsDelete}}. According to its Javadoc, that interface is meant for cases when we can delete data without much effort (e.g. deleting a complete partition in a Hive table). It is clear we need a more sophisticated API for row-level deletes. That's why it would be beneficial to add a method to {{SupportsDelete}} so that Spark can check whether a source can easily delete data using just filters, or whether it will need a full rewrite later on. This way, we have more control in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
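The proposed capability check can be sketched as follows; `PartitionedSource` and `can_delete_where` are hypothetical names for illustration, not the DSv2 interface itself. The idea: a source advertises whether the given filters alone suffice for a cheap (e.g. partition-level) delete, letting Spark plan a full row-level rewrite otherwise:

```python
# Hypothetical names for illustration; not the actual DSv2 interface.
class PartitionedSource:
    def __init__(self, partition_column):
        self.partition_column = partition_column

    def can_delete_where(self, filters):
        # A metadata-only delete is cheap only when every filter targets
        # the partition column; anything else needs a row-level rewrite.
        return all(col == self.partition_column for col, _ in filters)

src = PartitionedSource("day")
print(src.can_delete_where([("day", "2020-12-01")]))  # True: drop partition
print(src.can_delete_where([("user_id", 42)]))        # False: needs rewrite
```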
[jira] [Resolved] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
[ https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-32032. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29729 [https://github.com/apache/spark/pull/29729] > Eliminate deprecated poll(long) API calls to avoid infinite wait in driver > -- > > Key: SPARK-32032 > URL: https://issues.apache.org/jira/browse/SPARK-32032 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
[ https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-32032: Assignee: Gabor Somogyi > Eliminate deprecated poll(long) API calls to avoid infinite wait in driver > -- > > Key: SPARK-32032 > URL: https://issues.apache.org/jira/browse/SPARK-32032 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33622) Add array_to_vector function to SparkR
Maciej Szymkiewicz created SPARK-33622: -- Summary: Add array_to_vector function to SparkR Key: SPARK-33622 URL: https://issues.apache.org/jira/browse/SPARK-33622 Project: Spark Issue Type: Improvement Components: SparkR, SQL Affects Versions: 3.1.0 Reporter: Maciej Szymkiewicz SPARK-33556 added {{array_to_vector}} to Scala and Python API. It will be useful to have in R API as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33621) Add a way to inject optimization rules after expression optimization
[ https://issues.apache.org/jira/browse/SPARK-33621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241450#comment-17241450 ] Anton Okolnychyi commented on SPARK-33621: -- I am going to submit a PR soon. > Add a way to inject optimization rules after expression optimization > > > Key: SPARK-33621 > URL: https://issues.apache.org/jira/browse/SPARK-33621 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > {{SparkSessionExtensions}} allow us to inject optimization rules but they are > added to operator optimization batch. There are cases when users need to run > rules after the operator optimization batch. Currently, this is not possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33621) Add a way to inject optimization rules after expression optimization
Anton Okolnychyi created SPARK-33621: Summary: Add a way to inject optimization rules after expression optimization Key: SPARK-33621 URL: https://issues.apache.org/jira/browse/SPARK-33621 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Anton Okolnychyi {{SparkSessionExtensions}} allow us to inject optimization rules but they are added to operator optimization batch. There are cases when users need to run rules after the operator optimization batch. Currently, this is not possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
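A minimal sketch of the requested extension point, in plain Python (the class and method names are invented for illustration and are not the `SparkSessionExtensions` API): injected rules are collected separately and applied only after the built-in batch has run:

```python
# Invented names for illustration; not the SparkSessionExtensions API.
class Optimizer:
    def __init__(self):
        self.post_batch_rules = []

    def inject_post_batch_rule(self, rule):
        self.post_batch_rules.append(rule)

    def optimize(self, plan):
        plan = plan.upper()  # stand-in for the built-in optimization batch
        for rule in self.post_batch_rules:
            plan = rule(plan)  # injected rules run strictly afterwards
        return plan

opt = Optimizer()
opt.inject_post_batch_rule(lambda p: p + "!")
print(opt.optimize("plan"))  # PLAN!
```

Keeping injected rules in their own list, rather than appending them to the operator-optimization batch, is what guarantees the "run after" ordering the issue asks for.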
[jira] [Updated] (SPARK-33612) Add dataSourceRewriteRules batch to Optimizer
[ https://issues.apache.org/jira/browse/SPARK-33612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-33612: - Summary: Add dataSourceRewriteRules batch to Optimizer (was: Add v2SourceRewriteRules batch to Optimizer) > Add dataSourceRewriteRules batch to Optimizer > - > > Key: SPARK-33612 > URL: https://issues.apache.org/jira/browse/SPARK-33612 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > Right now, we have a special place in the optimizer where we run rules for > constructing v2 scans. As time shows, we need more rewrite rules for v2 > tables and not all of them related to reads. One option is to rename the > current batch into something more generic but it would require changing quite > some places. Moreover, the batch contains some non-V2 logic as well. That's > why it seems better to introduce a new batch and use it for all data source > v2 rewrites and beyond. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33612) Add v2SourceRewriteRules batch to Optimizer
[ https://issues.apache.org/jira/browse/SPARK-33612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-33612: - Description: Right now, we have a special place in the optimizer where we run rules for constructing v2 scans. As time shows, we need more rewrite rules for v2 tables and not all of them related to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. Moreover, the batch contains some non-V2 logic as well. That's why it seems better to introduce a new batch and use it for all data source v2 rewrites and beyond. (was: Right now, we have a special place in the optimizer where we run rules for constructing v2 scans. As time shows, we need more rewrite rules for v2 tables and not all of them related to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. Moreover, the batch contains some non-V2 logic as well. That's why it seems better to introduce a new batch and use it for all data source v2 rewrites.) > Add v2SourceRewriteRules batch to Optimizer > --- > > Key: SPARK-33612 > URL: https://issues.apache.org/jira/browse/SPARK-33612 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > Right now, we have a special place in the optimizer where we run rules for > constructing v2 scans. As time shows, we need more rewrite rules for v2 > tables and not all of them related to reads. One option is to rename the > current batch into something more generic but it would require changing quite > some places. Moreover, the batch contains some non-V2 logic as well. That's > why it seems better to introduce a new batch and use it for all data source > v2 rewrites and beyond. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
[ https://issues.apache.org/jira/browse/SPARK-33572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33572: --- Assignee: zhoukeyong > Datetime building should fail if the year, month, ..., second combination is > invalid > > > Key: SPARK-33572 > URL: https://issues.apache.org/jira/browse/SPARK-33572 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: zhoukeyong >Assignee: zhoukeyong >Priority: Major > > Datetime building should fail if the year, month, ..., second combination is > invalid, when ANSI mode is enabled. This patch should update MakeDate, > MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33572) Datetime building should fail if the year, month, ..., second combination is invalid
[ https://issues.apache.org/jira/browse/SPARK-33572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33572. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30516 [https://github.com/apache/spark/pull/30516] > Datetime building should fail if the year, month, ..., second combination is > invalid > > > Key: SPARK-33572 > URL: https://issues.apache.org/jira/browse/SPARK-33572 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: zhoukeyong >Assignee: zhoukeyong >Priority: Major > Fix For: 3.1.0 > > > Datetime building should fail if the year, month, ..., second combination is > invalid, when ANSI mode is enabled. This patch should update MakeDate, > MakeTimestamp and MakeInterval. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
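The resolved behavior can be illustrated from spark-shell. This is a hedged sketch only, not code from the patch: it assumes a Spark 3.1+ session, and the exact exception type and message may differ between versions.

```scala
// Illustrative sketch (assumes a Spark 3.1+ session in spark-shell).
// With ANSI mode off, an invalid year/month/day combination yields NULL:
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT make_date(2020, 13, 1) AS d").show()
// +----+
// |   d|
// +----+
// |null|
// +----+

// With ANSI mode on, the same expression is expected to fail at runtime
// instead of silently returning NULL (month 13 is out of range):
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT make_date(2020, 13, 1) AS d").show()
// expected: a java.time.DateTimeException about an invalid month value
```

The same ANSI-gated failure applies to `make_timestamp` and `make_interval`, per the ticket.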
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png! !mgg1s.png! This my code: var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !image-2020-12-01-13-34-17-283.png! !image-2020-12-01-13-34-31-288.png! This my code: {{var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) \{ logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData)}} > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png! > > !mgg1s.png! 
> > This my code: > var allTrafficRDD = sparkContext.emptyRDD[String] > for (traffic <- trafficBuffer) { > logger.info("Load traffic path - "+traffic) > val trafficRDD = sparkContext.textFile(traffic) > if (isValidTraffic(trafficRDD, isMasterData)) > { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } > } > > hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), > outTable, isMasterData) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
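The reported loop unions RDDs one at a time, which nests one `UnionRDD` per input path and can make the driver-side lineage very deep when there are many files; a flatter alternative is a single `SparkContext.union` call (or passing all paths to `textFile` at once). A hedged sketch reusing the reporter's own identifiers (`trafficBuffer`, `isValidTraffic`, `filterTraffic`, `isMasterData`, `hiveService` are not defined in the ticket and are assumed to exist as in the original code):

```scala
// Sketch, not a verified fix: assumes the reporter's helpers are in scope.
// Building the RDDs first and unioning once keeps the lineage flat,
// instead of nesting a UnionRDD per iteration of the loop.
val validRDDs = trafficBuffer
  .map { path =>
    logger.info("Load traffic path - " + path)
    sparkContext.textFile(path)
  }
  .filter(rdd => isValidTraffic(rdd, isMasterData))
  .map(filterTraffic)

val allTrafficRDD = sparkContext.union(validRDDs)

hiveService.insertTrafficRDD(
  allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData)
```

Whether this resolves the "tasks never start" symptom at ~2000 GB is untested here; it only addresses the lineage-depth cost of the original pattern.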
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Description: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png|width=644,height=150! !mgg1s.png|width=651,height=182! This my code: var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData) was: Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb memory used task starting and complete, but we need use unlimited stack. Please help !VlwWJ.png! !mgg1s.png! This my code: var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - "+traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData) > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !VlwWJ.png|width=644,height=150! > > !mgg1s.png|width=651,height=182! 
> > This my code: > var allTrafficRDD = sparkContext.emptyRDD[String] > for (traffic <- trafficBuffer) { > logger.info("Load traffic path - "+traffic) > val trafficRDD = sparkContext.textFile(traffic) > if (isValidTraffic(trafficRDD, isMasterData)) > { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } > } > > hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), > outTable, isMasterData) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Attachment: mgg1s.png > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png, mgg1s.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !image-2020-12-01-13-34-17-283.png! > !image-2020-12-01-13-34-31-288.png! > > This my code: > > {{var allTrafficRDD = sparkContext.emptyRDD[String] > for (traffic <- trafficBuffer) \{ > logger.info("Load traffic path - "+traffic) > val trafficRDD = sparkContext.textFile(traffic) > if (isValidTraffic(trafficRDD, isMasterData)) { > allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) > } > } > > hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), > outTable, isMasterData)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33620) Task not started after filtering
[ https://issues.apache.org/jira/browse/SPARK-33620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Sterkhov updated SPARK-33620: --- Attachment: VlwWJ.png > Task not started after filtering > > > Key: SPARK-33620 > URL: https://issues.apache.org/jira/browse/SPARK-33620 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: Vladislav Sterkhov >Priority: Major > Attachments: VlwWJ.png > > > Hello i have problem with big memory used ~2000gb hdfs stack. With 300gb > memory used task starting and complete, but we need use unlimited stack. > Please help > > !image-2020-12-01-13-34-17-283.png! > !image-2020-12-01-13-34-31-288.png! > > This my code: > > {{var allTrafficRDD = sparkContext.emptyRDD[String] > for (traffic <- trafficBuffer) \{ > logger.info("Load traffic path - "+traffic) > val trafficRDD = sparkContext.textFile(traffic) > if (isValidTraffic(trafficRDD, isMasterData)) { > allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) > } > } > > hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), > outTable, isMasterData)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33620) Task not started after filtering
Vladislav Sterkhov created SPARK-33620: -- Summary: Task not started after filtering Key: SPARK-33620 URL: https://issues.apache.org/jira/browse/SPARK-33620 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.4.7 Reporter: Vladislav Sterkhov Hello, I have a problem with a large input (~2000 GB HDFS stack): the task never starts. With 300 GB the task starts and completes, but we need to support an unlimited stack. Please help !image-2020-12-01-13-34-17-283.png! !image-2020-12-01-13-34-31-288.png! This is my code: {{var allTrafficRDD = sparkContext.emptyRDD[String] for (traffic <- trafficBuffer) { logger.info("Load traffic path - " + traffic) val trafficRDD = sparkContext.textFile(traffic) if (isValidTraffic(trafficRDD, isMasterData)) { allTrafficRDD = allTrafficRDD.++(filterTraffic(trafficRDD)) } } hiveService.insertTrafficRDD(allTrafficRDD.repartition(beforeInsertPartitionsNum), outTable, isMasterData)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
(Editorial note withheld; see original wording above.)
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241408#comment-17241408 ] Maxim Gekk commented on SPARK-33571: [~simonvanderveldt] Looking at the dates, you tested, both dates 1880-10-01 and 2020-10-01 belong to the Gregorian calendar, so, should be no diffs. For the date 0220-10-01, please, have a look at the table which I built in the PR: https://github.com/apache/spark/pull/28067 . The table shows that there is no diffs between 2 calendars for the year. > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. > From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. 
with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog post linked there and looking at the Spark code. > From our testing we're seeing several issues: > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` which contain timestamps before > the above mentioned moments in time without `datetimeRebaseModeInRead` set > doesn't raise the `SparkUpgradeException`, it succeeds without any changes to > the resulting dataframe compares to that dataframe in Spark 2.4.5 > * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. > that contains fields of type `TimestampType` or `DateType` which contain > dates or timestamps before the above mentioned moments in time with > `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the > dataframe as when using `CORRECTED`, so it seems like no rebasing is > happening. > I've made some scripts to help with testing/show the behavior, it uses > pyspark 2.4.5, 2.4.6 and 3.0.1. 
You can find them here > [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the > outputs in a comment below as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader
[ https://issues.apache.org/jira/browse/SPARK-32910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-32910: -- Affects Version/s: (was: 3.1.0) 3.2.0 > Remove UninterruptibleThread usage from KafkaOffsetReader > - > > Key: SPARK-32910 > URL: https://issues.apache.org/jira/browse/SPARK-32910 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Gabor Somogyi >Priority: Major > > We've talked about this here: > https://github.com/apache/spark/pull/29729#discussion_r488690731 > This jira stands only if the mentioned PR is merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader
[ https://issues.apache.org/jira/browse/SPARK-32910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241403#comment-17241403 ] Gabor Somogyi commented on SPARK-32910: --- I think there will be no time to put it into 3.1 so changing the target. > Remove UninterruptibleThread usage from KafkaOffsetReader > - > > Key: SPARK-32910 > URL: https://issues.apache.org/jira/browse/SPARK-32910 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > We've talked about this here: > https://github.com/apache/spark/pull/29729#discussion_r488690731 > This jira stands only if the mentioned PR is merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33619: Assignee: Apache Spark > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Apache Spark >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. > GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE > SHOULD FIX this later. > 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33619: Assignee: (was: Apache Spark) > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. > GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE > SHOULD FIX this later. > 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241401#comment-17241401 ] Apache Spark commented on SPARK-33619: -- User 'leanken' has created a pull request for this issue: https://github.com/apache/spark/pull/30560 > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. > GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE > SHOULD FIX this later. > 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241400#comment-17241400 ] Maxim Gekk commented on SPARK-33571: Spark 3.0.1 shows different results as well: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.1 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_275) scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) 20/12/01 12:31:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is. 
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) +--+--+ |dict |plain | +--+--+ |1001-01-01|1001-01-01| |1001-01-01|1001-01-02| |1001-01-01|1001-01-03| |1001-01-01|1001-01-04| |1001-01-01|1001-01-05| |1001-01-01|1001-01-06| |1001-01-01|1001-01-07| |1001-01-01|1001-01-08| +--+--+ scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED") scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false) +--+--+ |dict |plain | +--+--+ |1001-01-07|1001-01-07| |1001-01-07|1001-01-08| |1001-01-07|1001-01-09| |1001-01-07|1001-01-10| |1001-01-07|1001-01-11| |1001-01-07|1001-01-12| |1001-01-07|1001-01-13| |1001-01-07|1001-01-14| +--+--+ {code} > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. 
> From what I understand it should work like this: > * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before > 1900-01-01T00:00:00Z > * Only applies when reading or writing parquet files > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead` > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should > show the same values in Spark 3.0.1. with for example `df.show()` as they did > in Spark 2.4.5 > * When reading parquet files written with Spark < 2.4.6 which contain dates > or timestamps before the above mentioned moments in time and > `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps > should show different values in Spark 3.0.1. with for example `df.show()` as > they did in Spark 2.4.5 > * When writing parqet files with Spark > 3.0.0 which contain dates or > timestamps before the above mentioned moment in time a > `SparkUpgradeException` should be raised informing the user to choose either > `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite` > First of all I'm not 100% sure all of this is correct. I've been unable to > find any clear documentation on the expected behavior. The understanding I > have was pieced together from the mailing list > ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)] > the blog po
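The six-day shift in the output above (1001-01-01 under LEGACY vs 1001-01-07 under CORRECTED) is exactly the Julian vs. proleptic Gregorian labeling difference for that era, and can be reproduced with plain JDK classes, no Spark required; a sketch:

```scala
import java.time.LocalDate
import java.util.{Calendar, Date, GregorianCalendar, TimeZone}

// Pushing the Julian->Gregorian change-over far into the future makes
// GregorianCalendar behave as a pure Julian calendar, which is how the
// legacy hybrid calendar labels dates this far before 1582.
val julian = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
julian.setGregorianChange(new Date(Long.MaxValue))
julian.clear()
julian.set(1001, Calendar.JANUARY, 1) // Julian 1001-01-01, midnight UTC

// Relabel the same instant in the proleptic Gregorian calendar (java.time
// is proleptic ISO); floorDiv handles the negative pre-epoch millis.
val epochDay = Math.floorDiv(julian.getTimeInMillis, 86400000L)
val proleptic = LocalDate.ofEpochDay(epochDay)
println(proleptic) // 1001-01-07: the same six-day shift as LEGACY vs CORRECTED
```

The gap is six days in year 1001 because the proleptic Gregorian calendar skips the leap days of the Julian century years 300, 500, 600, 700, 900, and 1000 (but not 400 or 800).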
[jira] [Updated] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-33619: Description: ``` In -SPARK-33460--- there is a bug as follow. GetMapValueUtil generated code error when ANSI mode is on s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" should be s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");""" But Why are checkExceptionInExpression[Exception](expr, errMsg) and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql can't detect this Bug it's because 1. checkExceptionInExpression is some what error, too. it should wrap with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE SHOULD FIX this later. 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the ConstantFolding rules, it is calling eval instead of code gen in this case ``` was: ``` In -SPARK-33460--- there is a bug as follow. GetMapValueUtil generated code error when ANSI mode is on s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" should be s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");""" But Why are checkExceptionInExpression[Exception](expr, errMsg) and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql can't detect this Bug it's because 1. checkExceptionInExpression is some what error, too. it should wrap with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the ConstantFolding rules, it is calling eval instead of code gen in this case ``` > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. 
> GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite – -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation, AND WE > SHOULD FIX this later. > 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-33619: Description: ``` In -SPARK-33460--- there is a bug as follow. GetMapValueUtil generated code error when ANSI mode is on s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" should be s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");""" But Why are checkExceptionInExpression[Exception](expr, errMsg) and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql can't detect this Bug it's because 1. checkExceptionInExpression is some what error, too. it should wrap with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation 2. SQLQueryTestSuite ansi/map.sql failed to detect because of the ConstantFolding rules, it is calling eval instead of code gen in this case ``` was: ``` ``` > GetMapValueUtil code generation error > - > > Key: SPARK-33619 > URL: https://issues.apache.org/jira/browse/SPARK-33619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > ``` > In -SPARK-33460--- > there is a bug as follow. > GetMapValueUtil generated code error when ANSI mode is on > > s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" > should be > s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not > exist.");""" > > But Why are > checkExceptionInExpression[Exception](expr, errMsg) > and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql > can't detect this Bug > > it's because > 1. checkExceptionInExpression is some what error, too. it should wrap with > withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation > 2. 
SQLQueryTestSuite ansi/map.sql failed to detect because of the > ConstantFolding rules, it is calling eval instead of code gen in this case > > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33619) GetMapValueUtil code generation error
Leanken.Lin created SPARK-33619: --- Summary: GetMapValueUtil code generation error Key: SPARK-33619 URL: https://issues.apache.org/jira/browse/SPARK-33619 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Leanken.Lin ``` ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241379#comment-17241379 ] Maxim Gekk commented on SPARK-33571: I have tried to reproduce the issue on the master branch by reading the file saved by Spark 2.4.5 (https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data): {code:scala} test("SPARK-33571: read ancient dates saved by Spark 2.4.5") { withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> LEGACY.toString) { val path = getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet") val df = spark.read.parquet(path) df.show(false) } withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> CORRECTED.toString) { val path = getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet") val df = spark.read.parquet(path) df.show(false) } } {code} The results are different in LEGACY and in CORRECTED modes: {code} +--+--+ |dict |plain | +--+--+ |1001-01-01|1001-01-01| |1001-01-01|1001-01-02| |1001-01-01|1001-01-03| |1001-01-01|1001-01-04| |1001-01-01|1001-01-05| |1001-01-01|1001-01-06| |1001-01-01|1001-01-07| |1001-01-01|1001-01-08| +--+--+ +--+--+ |dict |plain | +--+--+ |1001-01-07|1001-01-07| |1001-01-07|1001-01-08| |1001-01-07|1001-01-09| |1001-01-07|1001-01-10| |1001-01-07|1001-01-11| |1001-01-07|1001-01-12| |1001-01-07|1001-01-13| |1001-01-07|1001-01-14| +--+--+ {code} > Handling of hybrid to proleptic calendar when reading and writing Parquet > data not working correctly > > > Key: SPARK-33571 > URL: https://issues.apache.org/jira/browse/SPARK-33571 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Simon >Priority: Major > > The handling of old dates written with older Spark versions (<2.4.6) using > the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working > correctly. 
> From what I understand it should work like this:
> * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
> * Only applies when reading or writing parquet files
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInRead`
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1 with for example `df.show()` as they did in Spark 2.4.5
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1 with for example `df.show()` than they did in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
> * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` which contain timestamps before the above mentioned moments in time without `datetimeRebaseModeInRead` set doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
> * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` or `DateType` which contain dates or timestamps before the above mentioned moments in time with `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.
> I've ma
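The six-day shift in Maxim Gekk's reproduction above — `1001-01-01` in LEGACY mode versus `1001-01-07` in CORRECTED mode — is exactly the offset between the hybrid Julian calendar (which Spark 2.x used) and the proleptic Gregorian calendar (which Spark 3.x uses) for the year 1001. A standalone sketch in plain Python (no Spark required) using the standard integer Julian Day Number conversions shows where the six days come from:

```python
# Convert calendar dates to Julian Day Numbers (JDN) with the standard
# Fliegel-Van Flandern style integer formulas, then compare the calendars.

def gregorian_to_jdn(y, m, d):
    # Proleptic Gregorian calendar date -> JDN
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - yy // 100 + yy // 400 - 32045

def julian_to_jdn(y, m, d):
    # Julian calendar date -> JDN
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

# The physical day labeled "1001-01-01" in the Julian calendar is the same
# day the proleptic Gregorian calendar labels "1001-01-07" — the LEGACY
# (rebased) vs CORRECTED (as-stored) values in the df.show() output above.
assert julian_to_jdn(1001, 1, 1) == gregorian_to_jdn(1001, 1, 7)
assert julian_to_jdn(1001, 1, 1) - gregorian_to_jdn(1001, 1, 1) == 6

# Sanity check at the 1582 cutover: Julian 1582-10-04 is immediately
# followed by Gregorian 1582-10-15.
assert julian_to_jdn(1582, 10, 4) + 1 == gregorian_to_jdn(1582, 10, 15)
```

This is only an illustration of the calendar arithmetic behind the rebase, not the code path Spark itself uses.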
[jira] [Commented] (SPARK-33616) Join queries across different datasources (hive and hive)
[ https://issues.apache.org/jira/browse/SPARK-33616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241373#comment-17241373 ]

Fang Wen commented on SPARK-33616:
----------------------------------

ok, thanks

> Join queries across different datasources (hive and hive)
>
>                 Key: SPARK-33616
>                 URL: https://issues.apache.org/jira/browse/SPARK-33616
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Fang Wen
>            Priority: Major
>
> Spark 3.0 DataSource V2 supports join queries across different datasources,
> such as MySQL and Hive, which is very convenient for handling data from
> different sources. But in our scenario we sometimes need to handle Hive data
> from different clusters, so I want to know if there is a plan to support
> multiple Hive datasources in a later Spark version.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES
[ https://issues.apache.org/jira/browse/SPARK-33617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33617: Assignee: (was: Apache Spark) > Avoid generating small files for INSERT INTO TABLE from VALUES > --- > > Key: SPARK-33617 > URL: https://issues.apache.org/jira/browse/SPARK-33617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > create table t1(id int) stored as textfile; > insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8); > {code} > It will generate these files: > {noformat} > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES
[ https://issues.apache.org/jira/browse/SPARK-33617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241362#comment-17241362 ] Apache Spark commented on SPARK-33617: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30559 > Avoid generating small files for INSERT INTO TABLE from VALUES > --- > > Key: SPARK-33617 > URL: https://issues.apache.org/jira/browse/SPARK-33617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > create table t1(id int) stored as textfile; > insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8); > {code} > It will generate these files: > {noformat} > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES
[ https://issues.apache.org/jira/browse/SPARK-33617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33617: Assignee: Apache Spark > Avoid generating small files for INSERT INTO TABLE from VALUES > --- > > Key: SPARK-33617 > URL: https://issues.apache.org/jira/browse/SPARK-33617 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce this issue: > {code:sql} > create table t1(id int) stored as textfile; > insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8); > {code} > It will generate these files: > {noformat} > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > -rwxr-xr-x 1 root root 2 Nov 30 23:07 > part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33618) hadoop-aws doesn't work
[ https://issues.apache.org/jira/browse/SPARK-33618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33618: Assignee: (was: Apache Spark) > hadoop-aws doesn't work > --- > > Key: SPARK-33618 > URL: https://issues.apache.org/jira/browse/SPARK-33618 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > According to > [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since > Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. In > other words, the regression is that `dev/make-distribution.sh -Phadoop-cloud > ...` doesn't make a complete distribution for cloud support. It fails at > write operation like the following. > {code} > $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID > --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY > 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context available as 'sc' (master = local[*], app id = > local-1606806088715). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
> 20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> +------+--------------+----------------+
> |  name|favorite_color|favorite_numbers|
> +------+--------------+----------------+
> |Alyssa|          null|  [3, 9, 15, 20]|
> |   Ben|           red|              []|
> +------+--------------+----------------+
> scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
> 20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NoSuchMethodError:
> org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33618) hadoop-aws doesn't work
[ https://issues.apache.org/jira/browse/SPARK-33618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241345#comment-17241345 ] Apache Spark commented on SPARK-33618: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30508 > hadoop-aws doesn't work > --- > > Key: SPARK-33618 > URL: https://issues.apache.org/jira/browse/SPARK-33618 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > According to > [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since > Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. In > other words, the regression is that `dev/make-distribution.sh -Phadoop-cloud > ...` doesn't make a complete distribution for cloud support. It fails at > write operation like the following. > {code} > $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID > --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY > 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context available as 'sc' (master = local[*], app id = > local-1606806088715). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
> 20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> +------+--------------+----------------+
> |  name|favorite_color|favorite_numbers|
> +------+--------------+----------------+
> |Alyssa|          null|  [3, 9, 15, 20]|
> |   Ben|           red|              []|
> +------+--------------+----------------+
> scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
> 20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NoSuchMethodError:
> org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33618) hadoop-aws doesn't work
[ https://issues.apache.org/jira/browse/SPARK-33618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33618: Assignee: Apache Spark > hadoop-aws doesn't work > --- > > Key: SPARK-33618 > URL: https://issues.apache.org/jira/browse/SPARK-33618 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Blocker > > According to > [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since > Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. In > other words, the regression is that `dev/make-distribution.sh -Phadoop-cloud > ...` doesn't make a complete distribution for cloud support. It fails at > write operation like the following. > {code} > $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID > --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY > 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context available as 'sc' (master = local[*], app id = > local-1606806088715). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
> 20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> +------+--------------+----------------+
> |  name|favorite_color|favorite_numbers|
> +------+--------------+----------------+
> |Alyssa|          null|  [3, 9, 15, 20]|
> |   Ben|           red|              []|
> +------+--------------+----------------+
> scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
> 20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NoSuchMethodError:
> org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.
[ https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241342#comment-17241342 ] Darshat commented on SPARK-33576: - I turned on the jvm exception logging, and this is the full trace: JVM stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 58.0 failed 4 times, most recent failure: Lost task 1.3 in stage 58.0 (TID 3336, 10.139.64.5, executor 7): org.apache.spark.api.python.PythonException: 'OSError: Invalid IPC message: negative bodyLength'. Full traceback below: Traceback (most recent call last): File "/databricks/spark/python/pyspark/worker.py", line 654, in main process() File "/databricks/spark/python/pyspark/worker.py", line 646, in process serializer.dump_stream(out_iter, outfile) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in dump_stream for batch in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in init_stream_yield_batches for series in iterator: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in load_stream for batch in batches: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in load_stream for batch in batches: File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in __iter__ File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative bodyLength at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:598) at 
org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99) at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:551) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithoutKey_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144) at org.apache.spark.scheduler.Task.run(Task.scala:117) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:660) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721) at o
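For context on the `OSError: Invalid IPC message: negative bodyLength` above: the Arrow IPC `bodyLength` field is read as a signed 64-bit integer, so a value that was corrupted, or whose high bit got set by an overflow somewhere upstream, reads back as negative. The following is a generic illustration of that two's-complement reinterpretation in plain Python — it is not a reconstruction of the actual pyarrow internals:

```python
# Reinterpret an arbitrary integer as a two's-complement signed int64,
# showing how a length that should be a large positive count can surface
# as a negative value when read through a signed 64-bit field.

def as_signed_int64(value):
    """Keep the low 64 bits and apply two's-complement sign semantics."""
    value &= (1 << 64) - 1          # truncate to 64 bits
    if value >= 1 << 63:            # top bit set -> negative when signed
        value -= 1 << 64
    return value

assert as_signed_int64(42) == 42                  # small lengths round-trip
assert as_signed_int64(2**63 - 1) == 2**63 - 1    # largest valid signed length
assert as_signed_int64(2**63) == -(2**63)         # one past it goes negative
assert as_signed_int64(2**64 - 1) == -1           # all-ones reads as -1
```

A reader that validates `bodyLength >= 0`, as pyarrow does, rejects such a message with exactly this kind of error.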
[jira] [Created] (SPARK-33618) hadoop-aws doesn't work
Dongjoon Hyun created SPARK-33618: - Summary: hadoop-aws doesn't work Key: SPARK-33618 URL: https://issues.apache.org/jira/browse/SPARK-33618 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.0 Reporter: Dongjoon Hyun According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. In other words, the regression is that `dev/make-distribution.sh -Phadoop-cloud ...` doesn't make a complete distribution for cloud support. It fails at write operation like the following. {code} $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context available as 'sc' (master = local[*], app id = local-1606806088715). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) Type in expressions to have them evaluated. Type :help for more information. 
scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
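The `NoSuchMethodError` above identifies the missing constructor by its JVM type descriptor. Decoding `(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V` shows the constructor `hadoop-aws` was compiled against takes an unshaded Guava `ListeningExecutorService`, an `int`, and a `boolean` — which is consistent with the shading regression the issue describes, since `hadoop-client-api` relocates Guava. A small descriptor decoder in plain Python (handling only object types and primitives, not arrays) makes the log line readable:

```python
# Decode a JVM method descriptor such as "(Lfoo/Bar;IZ)V" into parameter
# types and a return type. Covers object types ("L<class>;") and primitive
# codes; a complete decoder would also handle array types ("[").

PRIMITIVES = {"B": "byte", "C": "char", "D": "double", "F": "float",
              "I": "int", "J": "long", "S": "short", "Z": "boolean",
              "V": "void"}

def decode_descriptor(desc):
    assert desc.startswith("("), "descriptor must begin with a parameter list"
    params, i = [], 1
    while desc[i] != ")":
        if desc[i] == "L":                       # object type: L<class>;
            end = desc.index(";", i)
            params.append(desc[i + 1:end].replace("/", "."))
            i = end + 1
        else:                                    # single-letter primitive
            params.append(PRIMITIVES[desc[i]])
            i += 1
    return params, PRIMITIVES[desc[i + 1]]       # char after ")" is the return type

params, ret = decode_descriptor(
    "(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V")
assert params == ["com.google.common.util.concurrent.ListeningExecutorService",
                  "int", "boolean"]
assert ret == "void"   # <init> constructors always return void
```

The helper is illustrative only; the class receiving the call at runtime simply no longer exposes a constructor with that exact signature.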
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241339#comment-17241339 ]

Maxim Gekk commented on SPARK-33571:
------------------------------------

[~simonvanderveldt] Thank you for the detailed description and your investigation. Let me clarify a few things:

> From our testing we're seeing several issues:
> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` which contain timestamps before the above mentioned moments in time without `datetimeRebaseModeInRead` set doesn't raise the `SparkUpgradeException`, it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5

Spark 2.4.5 writes timestamps as the parquet INT96 type. The SQL config `datetimeRebaseModeInRead` does not influence reading such types in Spark 3.0.1, so Spark always performs rebasing (LEGACY mode). We recently added separate configs for INT96:
* https://github.com/apache/spark/pull/30056
* https://github.com/apache/spark/pull/30121
The changes will be released with Spark 3.1.0.

> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` or `DateType` which contain dates or timestamps before the above mentioned moments in time with `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.

For INT96, this seems to be the correct behavior. We should observe different results for the TIMESTAMP_MICROS and TIMESTAMP_MILLIS types; see the SQL config spark.sql.parquet.outputTimestampType. The DATE case is more interesting, as we must see a difference in results for ancient dates. I will investigate this case.
> Handling of hybrid to proleptic calendar when reading and writing Parquet
> data not working correctly
>
>                 Key: SPARK-33571
>                 URL: https://issues.apache.org/jira/browse/SPARK-33571
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.0, 3.0.1
>            Reporter: Simon
>            Priority: Major
>
> The handling of old dates written with older Spark versions (<2.4.6) using the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working correctly.
> From what I understand it should work like this:
> * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
> * Only applies when reading or writing parquet files
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInRead`
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1 with for example `df.show()` as they did in Spark 2.4.5
> * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1 with for example `df.show()` than they did in Spark 2.4.5
> * When writing parquet files with Spark > 3.0.0 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
> * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` which contain timestamps before the above mentioned moments in time without `datetimeRebaseModeInRead` set doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
> * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that contains fields of type `TimestampType` or `DateType` which contain dates or timestamps before the
[jira] [Created] (SPARK-33617) Avoid generating small files for INSERT INTO TABLE from VALUES
Yuming Wang created SPARK-33617: --- Summary: Avoid generating small files for INSERT INTO TABLE from VALUES Key: SPARK-33617 URL: https://issues.apache.org/jira/browse/SPARK-33617 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang How to reproduce this issue: {code:sql} create table t1(id int) stored as textfile; insert into table t1 values (1), (2), (3), (4), (5), (6), (7), (8); {code} It will generate these files: {noformat} -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-0-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-1-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-2-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-3-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-4-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-5-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-6-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 -rwxr-xr-x 1 root root 2 Nov 30 23:07 part-7-76a5ddf9-10df-41f8-ac19-8186449d958d-c000 {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
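The eight two-byte files come from each VALUES row landing in its own partition, and each non-empty partition being written as one output file. A minimal sketch of that behavior in plain Python — the helper names are illustrative, not Spark APIs:

```python
# Model how N inserted rows spread across N partitions produce N tiny files,
# mimicking the one-row-per-partition write described in the issue.

def distribute_rows(rows, num_partitions):
    """Round-robin rows across partitions, like a local parallelized insert."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

def files_written(partitions):
    """Each non-empty partition is flushed as exactly one output file."""
    return sum(1 for p in partitions if p)

rows = [1, 2, 3, 4, 5, 6, 7, 8]

# Eight rows over eight partitions -> eight small files, one row each.
assert files_written(distribute_rows(rows, num_partitions=8)) == 8

# Collapsing to a single partition before the write yields a single file,
# which is the kind of behavior the improvement aims for.
assert files_written(distribute_rows(rows, num_partitions=1)) == 1
```

The sketch only models the file-count arithmetic; the actual fix would live in how Spark plans the write for a local VALUES relation.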
[jira] [Commented] (SPARK-33599) Group exception messages in catalyst/analysis
[ https://issues.apache.org/jira/browse/SPARK-33599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241330#comment-17241330 ]

jiaan.geng commented on SPARK-33599:
------------------------------------

I'm working on it.

> Group exception messages in catalyst/analysis
>
>                 Key: SPARK-33599
>                 URL: https://issues.apache.org/jira/browse/SPARK-33599
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Allison Wang
>            Priority: Major
>
> '/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis'
> || Filename || Count ||
> | Analyzer.scala | 1 |
> | CheckAnalysis.scala | 1 |
> | FunctionRegistry.scala | 5 |
> | ResolveCatalogs.scala | 1 |
> | ResolveHints.scala | 1 |
> | package.scala | 2 |
> | unresolved.scala | 43 |
>
> '/core/src/main/scala/org/apache/spark/sql/catalyst/analysis'
> || Filename || Count ||
> | ResolveSessionCatalog.scala | 12 |

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org