[jira] [Created] (SPARK-37054) Porting "pandas API on Spark: Internals" to PySpark docs.

2021-10-18 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-37054:
---

 Summary: Porting "pandas API on Spark: Internals" to PySpark docs.
 Key: SPARK-37054
 URL: https://issues.apache.org/jira/browse/SPARK-37054
 Project: Spark
  Issue Type: Improvement
  Components: docs, PySpark
Affects Versions: 3.2.0
Reporter: Haejoon Lee


We have a 
[document|https://docs.google.com/document/d/1PR88p6yMHIeSxkDkSqCxLofkcnP0YtwQ2tETfyAWLQQ/edit?usp=sharing]
 for pandas API on Spark internal features, apart from the official PySpark 
documentation.

 

Since pandas API on Spark is officially released in Spark 3.2, it would be good to 
port this internal document into the official PySpark documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37054) Porting "pandas API on Spark: Internals" to PySpark docs.

2021-10-18 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430314#comment-17430314
 ] 

Haejoon Lee commented on SPARK-37054:
-

I'm working on this.

> Porting "pandas API on Spark: Internals" to PySpark docs.
> -
>
> Key: SPARK-37054
> URL: https://issues.apache.org/jira/browse/SPARK-37054
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We have a 
> [document|https://docs.google.com/document/d/1PR88p6yMHIeSxkDkSqCxLofkcnP0YtwQ2tETfyAWLQQ/edit?usp=sharing]
>  for pandas API on Spark internal features, apart from the official PySpark 
> documentation.
>  
> Since pandas API on Spark is officially released in Spark 3.2, it would be good to 
> port this internal document into the official PySpark documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2021-10-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430304#comment-17430304
 ] 

Hyukjin Kwon commented on SPARK-37044:
--

Agree!

> Add Row to __all__ in pyspark.sql.types
> ---
>
> Key: SPARK-37044
> URL: https://issues.apache.org/jira/browse/SPARK-37044
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
> {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. This means that 
> {{from pyspark.sql.types import *}} won't import {{Row}}.
> This might be counter-intuitive, especially since we import {{Row}} from 
> {{types}} in {{examples}}.
> Should we add it to {{__all__}}?
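
A minimal sketch of the behaviour in question (editor's illustration; it simply assumes {{Row}} is currently missing from the module's {{__all__}}, as described above):

{code}
# Wildcard imports only pull in the names listed in a module's __all__.
from pyspark.sql.types import *

try:
    Row("a", 1)
except NameError:
    # Row lives in pyspark.sql.types but is not listed in its __all__,
    # so the star-import above does not bring it into scope.
    from pyspark.sql import Row  # works today

# The proposal is to append "Row" to pyspark.sql.types.__all__ so that
# `from pyspark.sql.types import *` also exposes Row.
{code}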



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36681) Fail to load Snappy codec

2021-10-18 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430303#comment-17430303
 ] 

koert kuipers commented on SPARK-36681:
---

hadoop Jira issue:
https://issues.apache.org/jira/browse/HADOOP-17891

I doubt this only impacts sequence files. I am seeing this issue with 
Snappy-compressed CSV files, Snappy-compressed JSON files, etc.

> Fail to load Snappy codec
> -
>
> Key: SPARK-36681
> URL: https://issues.apache.org/jira/browse/SPARK-36681
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> snappy-java, as a native library, should not be relocated in the Hadoop shaded 
> client libraries. Currently we use the Hadoop shaded client libraries in Spark. 
> When trying to use SnappyCodec to write a sequence file, we encounter the 
> following error:
> {code}
> [info]   Cause: java.lang.UnsatisfiedLinkError: org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawCompress(Ljava/nio/ByteBuffer;IILjava/nio/ByteBuffer;I)I
> [info]   at org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawCompress(Native Method)
> [info]   at org.apache.hadoop.shaded.org.xerial.snappy.Snappy.compress(Snappy.java:151)
> [info]   at org.apache.hadoop.io.compress.snappy.SnappyCompressor.compressDirectBuf(SnappyCompressor.java:282)
> [info]   at org.apache.hadoop.io.compress.snappy.SnappyCompressor.compress(SnappyCompressor.java:210)
> [info]   at org.apache.hadoop.io.compress.BlockCompressorStream.compress(BlockCompressorStream.java:149)
> [info]   at org.apache.hadoop.io.compress.BlockCompressorStream.finish(BlockCompressorStream.java:142)
> [info]   at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1589)
> [info]   at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1605)
> [info]   at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.close(SequenceFile.java:1629)
> {code}
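
For reference, a minimal PySpark sketch of the kind of write that exercises this code path (editor's illustration; paths and sample data are placeholders, not taken from the report above):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snappy-codec-repro").getOrCreate()
sc = spark.sparkContext

# Writing a SequenceFile with Hadoop's SnappyCodec goes through
# SnappyCompressor, which is where the relocated snappy-java class fails.
sc.parallelize([("k1", "v1"), ("k2", "v2")]).saveAsSequenceFile(
    "/tmp/snappy-seqfile",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec",
)

# Per the comment above, Snappy-compressed text sources may hit the same path.
spark.range(10).write.option("compression", "snappy").csv("/tmp/snappy-csv")
{code}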



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37053) Add metrics system for History server

2021-10-18 Thread angerszhu (Jira)
angerszhu created SPARK-37053:
-

 Summary: Add metrics system for History server
 Key: SPARK-37053
 URL: https://issues.apache.org/jira/browse/SPARK-37053
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


Add metrics system for history server



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430300#comment-17430300
 ] 

Apache Spark commented on SPARK-37052:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34322

> Fix spark-3.2 can use --verbose with spark-shell
> 
>
> Key: SPARK-37052
> URL: https://issues.apache.org/jira/browse/SPARK-37052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> We should not pass --verbose through to spark-shell, since it is not a valid 
> argument for spark-shell.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37052:


Assignee: Apache Spark

> Fix spark-3.2 can use --verbose with spark-shell
> 
>
> Key: SPARK-37052
> URL: https://issues.apache.org/jira/browse/SPARK-37052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> We should not pass --verbose through to spark-shell, since it is not a valid 
> argument for spark-shell.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430299#comment-17430299
 ] 

Apache Spark commented on SPARK-37052:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34322

> Fix spark-3.2 can use --verbose with spark-shell
> 
>
> Key: SPARK-37052
> URL: https://issues.apache.org/jira/browse/SPARK-37052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> We should not pass --verbose through to spark-shell, since it is not a valid 
> argument for spark-shell.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37052:


Assignee: (was: Apache Spark)

> Fix spark-3.2 can use --verbose with spark-shell
> 
>
> Key: SPARK-37052
> URL: https://issues.apache.org/jira/browse/SPARK-37052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> We should not pass --verbose through to spark-shell, since it is not a valid 
> argument for spark-shell.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell

2021-10-18 Thread angerszhu (Jira)
angerszhu created SPARK-37052:
-

 Summary: Fix spark-3.2 can use --verbose with spark-shell
 Key: SPARK-37052
 URL: https://issues.apache.org/jira/browse/SPARK-37052
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 3.2.0
Reporter: angerszhu


We should not pass --verbose through to spark-shell, since it is not a valid 
argument for spark-shell.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Description: 
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

 

== Physical Plan ==
 CollectLimit (3)
 +- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)

(1) Scan orc tpcds_bin_partitioned_orc_2.item
 Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
 Batched: false
 Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
 PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music         )]
 ReadSchema: 
struct

(2) Filter
 Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
 Condition : (isnotnull(i_category#12) AND (i_category#12 = Music         ))

(3) CollectLimit
 Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
 Arguments: 100
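
A small illustration of the padding concern (editor's sketch in plain Python, not part of the plan above; the literal and width are taken from the example query):

{code}
# A CHAR(50) cell effectively stores the value padded with trailing blanks.
stored = "Music".ljust(50)   # 'Music' followed by 45 spaces
literal = "Music"            # the unpadded literal from the WHERE clause

print(stored == literal)            # False -> a naive equality matches nothing
print(stored.rstrip() == literal)   # True once the padding is ignored
{code}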

 

  was:
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

 

== Physical Plan ==
CollectLimit (3)
+- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)


(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music         )]
ReadSchema: 
struct

(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music         ))

(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, 

[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430297#comment-17430297
 ] 

frankli commented on SPARK-37051:
-

[~dongjoon] Could I trouble you to take a look? Thanks a lot.

> The filter operator gets wrong results in ORC char/varchar types
> 
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Major
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type.
> I guess that the char(50) type retains redundant trailing blanks after the actual 
> value.
> It affects the boolean result of "x.equals(y)" and leads to wrong results.
> By the way, Spark's tests should add more cases on ORC format.
>  
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music         )]
> ReadSchema: 
> struct
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music         ))
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37033) Inline type hints for python/pyspark/resource/requests.py

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37033:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/resource/requests.py
> -
>
> Key: SPARK-37033
> URL: https://issues.apache.org/jira/browse/SPARK-37033
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37033) Inline type hints for python/pyspark/resource/requests.py

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37033:


Assignee: Apache Spark

> Inline type hints for python/pyspark/resource/requests.py
> -
>
> Key: SPARK-37033
> URL: https://issues.apache.org/jira/browse/SPARK-37033
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37033) Inline type hints for python/pyspark/resource/requests.py

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430296#comment-17430296
 ] 

Apache Spark commented on SPARK-37033:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34321

> Inline type hints for python/pyspark/resource/requests.py
> -
>
> Key: SPARK-37033
> URL: https://issues.apache.org/jira/browse/SPARK-37033
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Description: 
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

 

== Physical Plan ==
CollectLimit (3)
+- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)


(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music         )]
ReadSchema: 
struct

(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music         ))

(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Arguments: 100

 

  was:
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

 

== Physical Plan ==
 CollectLimit (3)
 +- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)

(1) Scan orc tpcds_bin_partitioned_orc_2.item
 Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
 Batched: false
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
 ReadSchema: 
struct

(2) Filter
 Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
 Condition : (isnotnull(i_category#12) AND (i_category#12 = Music           ))

(3) CollectLimit
 Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]

[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Description: 
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

 

== Physical Plan ==
 CollectLimit (3)
 +- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)

(1) Scan orc tpcds_bin_partitioned_orc_2.item
 Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
 Batched: false
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
 ReadSchema: 
struct

(2) Filter
 Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
 Condition : (isnotnull(i_category#12) AND (i_category#12 = Music           ))

(3) CollectLimit
 Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
 Arguments: 100

 

  was:
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

 

 

== Physical Plan ==
CollectLimit (3)
+- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)


(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
ReadSchema: 
struct

(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music           ))

(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 

[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Description: 
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

 

 

== Physical Plan ==
CollectLimit (3)
+- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)


(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
ReadSchema: 
struct

(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music           ))

(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Arguments: 100

 

  was:
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

!image-2021-10-19-11-01-55-597.png|width=1085,height=499!

 


> The filter operator gets wrong results in ORC char/varchar types
> 
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Major
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type.
> I guess that the char(50) type retains redundant trailing blanks after the actual 
> value.
> It affects the boolean result of "x.equals(y)" and leads to wrong results.
> By the way, Spark's tests should add more cases on ORC format.
>  
>  
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
> ReadSchema: 
> struct
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music           ))
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, 

[jira] [Created] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)
frankli created SPARK-37051:
---

 Summary: The filter operator gets wrong results in ORC 
char/varchar types
 Key: SPARK-37051
 URL: https://issues.apache.org/jira/browse/SPARK-37051
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
 Environment: Spark 3.1.2

Scala 2.12 / Java 1.8
Reporter: frankli


When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guess that the char(50) type retains redundant trailing blanks after the actual 
value.

It affects the boolean result of "x.equals(y)" and leads to wrong results.

By the way, Spark's tests should add more cases on ORC format.

!image-2021-10-19-11-01-55-597.png|width=1085,height=499!

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36834) Namespace log lines in External Shuffle Service

2021-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-36834.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34079
[https://github.com/apache/spark/pull/34079]

> Namespace log lines in External Shuffle Service
> ---
>
> Key: SPARK-36834
> URL: https://issues.apache.org/jira/browse/SPARK-36834
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Thejdeep Gudivada
>Priority: Minor
> Fix For: 3.3.0
>
>
> To differentiate between messages from the different concurrently running ESS 
> on an NM, it would be nice to add a namespace to the log lines emitted by the 
> ESS. This would make querying of logs much more convenient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36834) Namespace log lines in External Shuffle Service

2021-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-36834:
---

Assignee: Thejdeep Gudivada

> Namespace log lines in External Shuffle Service
> ---
>
> Key: SPARK-36834
> URL: https://issues.apache.org/jira/browse/SPARK-36834
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Thejdeep Gudivada
>Assignee: Thejdeep Gudivada
>Priority: Minor
> Fix For: 3.3.0
>
>
> To differentiate between messages from the different concurrently running ESS 
> on an NM, it would be nice to add a namespace to the log lines emitted by the 
> ESS. This would make querying of logs much more convenient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18621) PySpark SQL Types (aka DataFrame Schema) have __repr__() with Scala and not Python representation

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430284#comment-17430284
 ] 

Apache Spark commented on SPARK-18621:
--

User 'crflynn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34320

> PySpark SQL Types (aka DataFrame Schema) have __repr__() with Scala and not 
> Python representation
> ---
>
> Key: SPARK-18621
> URL: https://issues.apache.org/jira/browse/SPARK-18621
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Romi Kuntsman
>Priority: Minor
>  Labels: bulk-closed
>
> When using Python's repr() on an object, the expected result is a string that 
> Python can evaluate to construct the object.
> See: https://docs.python.org/2/library/functions.html#func-repr
> However, when getting a DataFrame schema in PySpark, the code (in the 
> "__repr__()" overload methods) returns the string representation for Scala, 
> rather than for Python.
> Relevant code in PySpark:
> https://github.com/apache/spark/blob/5f02d2e5b4d37f554629cbd0e488e856fffd7b6b/python/pyspark/sql/types.py#L442
> Python Code:
> {code}
> # 1. define object
> struct1 = StructType([StructField("f1", StringType(), True)])
> # 2. print representation, expected to be like above
> print(repr(struct1))
> # 3. actual result:
> # StructType(List(StructField(f1,StringType,true)))
> # 4. try to use result in code
> struct2 = StructType(List(StructField(f1,StringType,true)))
> # 5. get bunch of errors
> # Unresolved reference 'List'
> # Unresolved reference 'f1'
> # StringType is class, not constructed object
> # Unresolved reference 'true'
> {code}
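
A short, hypothetical sketch of the round-trip property an eval-able repr() would satisfy (editor's illustration; not necessarily the representation chosen by the linked pull request):

{code}
from pyspark.sql.types import StructType, StructField, StringType

struct1 = StructType([StructField("f1", StringType(), True)])

try:
    # With a Python-style repr this round trip would hold.
    assert eval(repr(struct1)) == struct1
    print("repr() round-trips through eval()")
except (NameError, SyntaxError, AssertionError):
    # The Scala-style repr shown above is not valid Python, which is the reported bug.
    print("repr() is not evaluable Python:", repr(struct1))
{code}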



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36834) Namespace log lines in External Shuffle Service

2021-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-36834:
---

Assignee: (was: Thejdeep Gudivada)

> Namespace log lines in External Shuffle Service
> ---
>
> Key: SPARK-36834
> URL: https://issues.apache.org/jira/browse/SPARK-36834
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Thejdeep Gudivada
>Priority: Minor
>
> To differentiate between messages from the different concurrently running ESS 
> on an NM, it would be nice to add a namespace to the log lines emitted by the 
> ESS. This would make querying of logs much more convenient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36834) Namespace log lines in External Shuffle Service

2021-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-36834:
---

Assignee: Thejdeep Gudivada

> Namespace log lines in External Shuffle Service
> ---
>
> Key: SPARK-36834
> URL: https://issues.apache.org/jira/browse/SPARK-36834
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Thejdeep Gudivada
>Assignee: Thejdeep Gudivada
>Priority: Minor
>
> To differentiate between messages from the different concurrently running ESS 
> on an NM, it would be nice to add a namespace to the log lines emitted by the 
> ESS. This would make querying of logs much more convenient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32161) Hide JVM traceback for SparkUpgradeException

2021-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32161:


Assignee: pralabhkumar

> Hide JVM traceback for SparkUpgradeException
> 
>
> Key: SPARK-32161
> URL: https://issues.apache.org/jira/browse/SPARK-32161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: pralabhkumar
>Priority: Major
>
> We added {{SparkUpgradeException}}, for which the JVM traceback is pretty useless. 
> See also https://github.com/apache/spark/pull/28736/files#r449184881
> We should also whitelist this exception and hide its JVM traceback.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32161) Hide JVM traceback for SparkUpgradeException

2021-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32161.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34275

> Hide JVM traceback for SparkUpgradeException
> 
>
> Key: SPARK-32161
> URL: https://issues.apache.org/jira/browse/SPARK-32161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: pralabhkumar
>Priority: Major
> Fix For: 3.2.0
>
>
> We added {{SparkUpgradeException}}, for which the JVM traceback is pretty useless. 
> See also https://github.com/apache/spark/pull/28736/files#r449184881
> We should also whitelist this exception and hide its JVM traceback.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37049:


Assignee: Apache Spark

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it checks 
> against the pod's "startTime". A pending pod's "startTime" is empty, and this 
> causes the function "isExecutorIdleTimedOut()" to always return true for pending 
> pods.
> This causes pending pods to be deleted immediately when a stage finishes, and 
> several new pods to be recreated again in the next stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430257#comment-17430257
 ] 

Apache Spark commented on SPARK-37049:
--

User 'yangwwei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34319

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it checks 
> against the pod's "startTime". A pending pod's "startTime" is empty, and this 
> causes the function "isExecutorIdleTimedOut()" to always return true for pending 
> pods.
> This causes pending pods to be deleted immediately when a stage finishes, and 
> several new pods to be recreated again in the next stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37048) Clean up inlining type hints under SQL module

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37048:


Assignee: Apache Spark

> Clean up inlining type hints under SQL module
> -
>
> Key: SPARK-37048
> URL: https://issues.apache.org/jira/browse/SPARK-37048
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Now that most of the type hints under the SQL module are inlined,
> we should clean up the module now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37048) Clean up inlining type hints under SQL module

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430255#comment-17430255
 ] 

Apache Spark commented on SPARK-37048:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34318

> Clean up inlining type hints under SQL module
> -
>
> Key: SPARK-37048
> URL: https://issues.apache.org/jira/browse/SPARK-37048
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Now that most of the type hints under the SQL module are inlined,
> we should clean up the module now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37048) Clean up inlining type hints under SQL module

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37048:


Assignee: (was: Apache Spark)

> Clean up inlining type hints under SQL module
> -
>
> Key: SPARK-37048
> URL: https://issues.apache.org/jira/browse/SPARK-37048
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Now that most of the type hints under the SQL module are inlined,
> we should clean up the module now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430256#comment-17430256
 ] 

Apache Spark commented on SPARK-37049:
--

User 'yangwwei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34319

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Priority: Major
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it checks 
> against the pod's "startTime". A pending pod's "startTime" is empty, and this 
> causes the function "isExecutorIdleTimedOut()" to always return true for pending 
> pods.
> This causes pending pods to be deleted immediately when a stage finishes, and 
> several new pods to be recreated again in the next stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37049:


Assignee: (was: Apache Spark)

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Priority: Major
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it checks 
> against the pod's "startTime". A pending pod's "startTime" is empty, and this 
> causes the function "isExecutorIdleTimedOut()" to always return true for pending 
> pods.
> This causes pending pods to be deleted immediately when a stage finishes, and 
> several new pods to be recreated again in the next stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37050) Update conda installation instructions

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430254#comment-17430254
 ] 

Apache Spark commented on SPARK-37050:
--

User 'h-vetinari' has created a pull request for this issue:
https://github.com/apache/spark/pull/34315

> Update conda installation instructions
> --
>
> Key: SPARK-37050
> URL: https://issues.apache.org/jira/browse/SPARK-37050
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: H. Vetinari
>Priority: Major
> Fix For: 3.2.1
>
>
> Improve conda installation documentation, as discussed 
> [here|https://github.com/apache/spark-website/pull/361#issuecomment-945660978].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37050) Update conda installation instructions

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37050:


Assignee: Apache Spark

> Update conda installation instructions
> --
>
> Key: SPARK-37050
> URL: https://issues.apache.org/jira/browse/SPARK-37050
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: H. Vetinari
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.1
>
>
> Improve conda installation documentation, as discussed 
> [here|https://github.com/apache/spark-website/pull/361#issuecomment-945660978].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37050) Update conda installation instructions

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37050:


Assignee: (was: Apache Spark)

> Update conda installation instructions
> --
>
> Key: SPARK-37050
> URL: https://issues.apache.org/jira/browse/SPARK-37050
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: H. Vetinari
>Priority: Major
> Fix For: 3.2.1
>
>
> Improve conda installation documentation, as discussed 
> [here|https://github.com/apache/spark-website/pull/361#issuecomment-945660978].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37050) Update conda installation instructions

2021-10-18 Thread H. Vetinari (Jira)
H. Vetinari created SPARK-37050:
---

 Summary: Update conda installation instructions
 Key: SPARK-37050
 URL: https://issues.apache.org/jira/browse/SPARK-37050
 Project: Spark
  Issue Type: Task
  Components: Documentation
Affects Versions: 3.2.0
Reporter: H. Vetinari
 Fix For: 3.2.1


Improve conda installation documentation, as discussed 
[here|https://github.com/apache/spark-website/pull/361#issuecomment-945660978].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-18 Thread Weiwei Yang (Jira)
Weiwei Yang created SPARK-37049:
---

 Summary: executorIdleTimeout is not working for pending pods on K8s
 Key: SPARK-37049
 URL: https://issues.apache.org/jira/browse/SPARK-37049
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Spark Core
Affects Versions: 3.1.0
Reporter: Weiwei Yang


SPARK-33099 added support for respecting 
"spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
However, when it checks whether a pending executor pod has timed out, it checks 
against the pod's "startTime". A pending pod's "startTime" is empty, which 
causes the function "isExecutorIdleTimedOut()" to always return true for 
pending pods.

This causes the issue: pending pods are deleted immediately when a stage 
finishes, and several new pods get recreated in the next stage.
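
A minimal Python sketch of the flawed check described above (illustrative only: 
the real logic lives in the Scala class ExecutorPodsAllocator, and the field 
and function names below are simplified stand-ins, not Spark code):

{code:python}
import time

IDLE_TIMEOUT_SECONDS = 60  # stands in for spark.dynamicAllocation.executorIdleTimeout

def is_executor_idle_timed_out(pod, now):
    # A pending pod has no "startTime" yet, so the start time defaults to 0
    # and the elapsed time always looks larger than the timeout.
    start_time = pod.get("start_time") or 0
    return (now - start_time) > IDLE_TIMEOUT_SECONDS

now = time.time()
pending_pod = {"phase": "Pending"}                        # startTime not set yet
running_pod = {"phase": "Running", "start_time": now - 10}

print(is_executor_idle_timed_out(pending_pod, now))       # True  -> deleted immediately
print(is_executor_idle_timed_out(running_pod, now))       # False -> kept
{code}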



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36945) Inline type hints for python/pyspark/sql/udf.py

2021-10-18 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36945.
---
Fix Version/s: 3.3.0
 Assignee: dch nguyen
   Resolution: Fixed

Issue resolved by pull request 34289
https://github.com/apache/spark/pull/34289

> Inline type hints for python/pyspark/sql/udf.py
> ---
>
> Key: SPARK-36945
> URL: https://issues.apache.org/jira/browse/SPARK-36945
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37048) Clean up inlining type hints under SQL module

2021-10-18 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430245#comment-17430245
 ] 

Takuya Ueshin commented on SPARK-37048:
---

I'm working on this.

> Clean up inlining type hints under SQL module
> -
>
> Key: SPARK-37048
> URL: https://issues.apache.org/jira/browse/SPARK-37048
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Now that most of the type hints under the SQL module are inlined, we should 
> clean up the module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37048) Clean up inlining type hints under SQL module

2021-10-18 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-37048:
-

 Summary: Clean up inlining type hints under SQL module
 Key: SPARK-37048
 URL: https://issues.apache.org/jira/browse/SPARK-37048
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Now that most of the type hints under the SQL module are inlined, we should 
clean up the module.
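
For context, a rough sketch of what "inlined" type hints mean here, using a 
made-up function rather than actual PySpark code:

{code:python}
# Before: the annotations lived in a separate stub file (e.g. module.pyi):
#     def to_upper(s: str) -> str: ...
#
# After inlining, the hints sit directly in the .py source and the stub file
# can be removed:

def to_upper(s: str) -> str:
    """Return an upper-cased copy of ``s``."""
    return s.upper()
{code}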



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36933) Reduce duplication in TaskMemoryManager.acquireExecutionMemory

2021-10-18 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-36933:
--

Assignee: Tim Armstrong

> Reduce duplication in TaskMemoryManager.acquireExecutionMemory
> --
>
> Key: SPARK-36933
> URL: https://issues.apache.org/jira/browse/SPARK-36933
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>
> TaskMemoryManager.acquireExecutionMemory has a bit of redundancy - code is 
> duplicated in the self-spilling case. It would be good to reduce the 
> duplication to make this more maintainable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36933) Reduce duplication in TaskMemoryManager.acquireExecutionMemory

2021-10-18 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-36933.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34186
[https://github.com/apache/spark/pull/34186]

> Reduce duplication in TaskMemoryManager.acquireExecutionMemory
> --
>
> Key: SPARK-36933
> URL: https://issues.apache.org/jira/browse/SPARK-36933
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Fix For: 3.3.0
>
>
> TaskMemoryManager.acquireExecutionMemory has a bit of redundancy - code is 
> duplicated in the self-spilling case. It would be good to reduce the 
> duplication to make this more maintainable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA

2021-10-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430184#comment-17430184
 ] 

Dongjoon Hyun commented on SPARK-33828:
---

After the 3.2.0 announcement, we will close this JIRA and start `Phase Two` QA, 
[~xkrogen].

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: releasenotes
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many 
> JIRA issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it 
> by default and to collect all information needed to do QA for this feature in 
> the Apache Spark 3.2.0 timeframe.
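
For anyone reproducing QA scenarios, a minimal sketch of toggling the feature 
explicitly (assuming an existing SparkSession named spark; AQE is enabled by 
default as of 3.2.0):

{code:python}
# Turn adaptive query execution on explicitly for a QA run.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Related knobs commonly exercised during AQE QA.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
{code}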



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430140#comment-17430140
 ] 

Apache Spark commented on SPARK-37047:
--

User 'mkaravel' has created a pull request for this issue:
https://github.com/apache/spark/pull/34154

> Add overloads for lpad and rpad for BINARY strings
> --
>
> Key: SPARK-37047
> URL: https://issues.apache.org/jira/browse/SPARK-37047
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Menelaos Karavelas
>Priority: Major
>
> Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of 
> input string to be padded and padding pattern), and these strings get cast to 
> UTF8 strings. The result of the operation is a UTF8 string which may be 
> invalid as it can contain non-UTF8 characters.
> What we would like to do is to overload `lpad` and `rpad` to accept BINARY 
> strings as inputs (both for the string to be padded and the padding pattern) 
> and produce a left or right padded BINARY string as output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37047:


Assignee: Apache Spark

> Add overloads for lpad and rpad for BINARY strings
> --
>
> Key: SPARK-37047
> URL: https://issues.apache.org/jira/browse/SPARK-37047
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Menelaos Karavelas
>Assignee: Apache Spark
>Priority: Major
>
> Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of 
> input string to be padded and padding pattern), and these strings get cast to 
> UTF8 strings. The result of the operation is a UTF8 string which may be 
> invalid as it can contain non-UTF8 characters.
> What we would like to do is to overload `lpad` and `rpad` to accept BINARY 
> strings as inputs (both for the string to be padded and the padding pattern) 
> and produce a left or right padded BINARY string as output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37047:


Assignee: (was: Apache Spark)

> Add overloads for lpad and rpad for BINARY strings
> --
>
> Key: SPARK-37047
> URL: https://issues.apache.org/jira/browse/SPARK-37047
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Menelaos Karavelas
>Priority: Major
>
> Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of 
> input string to be padded and padding pattern), and these strings get cast to 
> UTF8 strings. The result of the operation is a UTF8 string which may be 
> invalid as it can contain non-UTF8 characters.
> What we would like to do is to overload `lpad` and `rpad` to accept BINARY 
> strings as inputs (both for the string to be padded and the padding pattern) 
> and produce a left or right padded BINARY string as output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36886) Inline type hints for python/pyspark/sql/context.py

2021-10-18 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36886.
---
Fix Version/s: 3.3.0
 Assignee: dch nguyen
   Resolution: Fixed

Issue resolved by pull request 34185
https://github.com/apache/spark/pull/34185

> Inline type hints for python/pyspark/sql/context.py
> ---
>
> Key: SPARK-36886
> URL: https://issues.apache.org/jira/browse/SPARK-36886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline the type hints for python/pyspark/sql/context.py from the stub file 
> python/pyspark/sql/context.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES

2021-10-18 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430125#comment-17430125
 ] 

Erik Krogen edited comment on SPARK-37027 at 10/18/21, 6:10 PM:


[~yuzhousun] actually this is already fixed by SPARK-28266 in 3.1.3 and 3.2.0. 
Can you try compiling from latest {{master}} or using the 3.2.0 binaries (not 
yet officially released but [can be found on Maven 
Central|https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:3.2.0]).


was (Author: xkrogen):
[~yuzhousun] actually this is already fixed by SPARK-28266 in 3.1.3 and 3.2.0. 
Can you try compiling from latest {{master}} or using the 3.2.0 binaries (not 
yet officially released but 
[https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:3.2.0|can be 
found on Maven Central]).

> Fix behavior inconsistent in Hive table when ‘path’ is provided in 
> SERDEPROPERTIES
> --
>
> Key: SPARK-37027
> URL: https://issues.apache.org/jira/browse/SPARK-37027
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.1.2
>Reporter: Yuzhou Sun
>Priority: Trivial
> Attachments: SPARK-37027-test-example.patch
>
>
> If a Hive table is created with both {{WITH SERDEPROPERTIES 
> ('path'='')}} and {{LOCATION }}, Spark can 
> return doubled rows when reading the table. This issue seems to be an 
> extension of SPARK-30507.
>  Reproduce steps:
>  # Create table and insert records via Hive (Spark doesn't allow to insert 
> into table like this)
> {code:sql}
> CREATE TABLE `test_table`(
>   `c1` LONG,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('path'=''" )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '';
> INSERT INTO TABLE `test_table`
> VALUES (0, '0');
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> {code}
>  # Read above table from Spark
> {code:sql}
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> -- 0 0
> {code}
> But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark will 
> return same result as Hive (i.e. single row)
> A similar case is that, if a Hive table is created with both {{WITH 
> SERDEPROPERTIES ('path'='')}} and {{LOCATION }}, 
> Spark will read both rows under {{anotherPath}} and rows under 
> {{tableLocation}}, regardless of {{spark.sql.hive.convertMetastoreParquet}} 
> ‘s value. However, actually Hive seems to return only rows under 
> {{tableLocation}}
> Another similar case is that, if {{path}} is provided in {{TBLPROPERTIES}}, 
> Spark won’t double the rows when {{'path'=''}}. If 
> {{'path'=''}}, Spark will read both rows under {{anotherPath}} 
> and rows under {{tableLocation}}, Hive seems to keep ignoring the {{path}} in 
> {{TBLPROPERTIES}}
> Code examples for the above cases (diff patch wrote in 
> {{HiveParquetMetastoreSuite.scala}}) can be found in Attachments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES

2021-10-18 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430125#comment-17430125
 ] 

Erik Krogen commented on SPARK-37027:
-

[~yuzhousun] actually this is already fixed by SPARK-28266 in 3.1.3 and 3.2.0. 
Can you try compiling from latest {{master}} or using the 3.2.0 binaries (not 
yet officially released but 
[https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:3.2.0|can be 
found on Maven Central]).

> Fix behavior inconsistent in Hive table when ‘path’ is provided in 
> SERDEPROPERTIES
> --
>
> Key: SPARK-37027
> URL: https://issues.apache.org/jira/browse/SPARK-37027
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.1.2
>Reporter: Yuzhou Sun
>Priority: Trivial
> Attachments: SPARK-37027-test-example.patch
>
>
> If a Hive table is created with both {{WITH SERDEPROPERTIES 
> ('path'='')}} and {{LOCATION }}, Spark can 
> return doubled rows when reading the table. This issue seems to be an 
> extension of SPARK-30507.
>  Reproduce steps:
>  # Create table and insert records via Hive (Spark doesn't allow to insert 
> into table like this)
> {code:sql}
> CREATE TABLE `test_table`(
>   `c1` LONG,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('path'=''" )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '';
> INSERT INTO TABLE `test_table`
> VALUES (0, '0');
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> {code}
>  # Read above table from Spark
> {code:sql}
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> -- 0 0
> {code}
> But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark will 
> return same result as Hive (i.e. single row)
> A similar case is that, if a Hive table is created with both {{WITH 
> SERDEPROPERTIES ('path'='')}} and {{LOCATION }}, 
> Spark will read both rows under {{anotherPath}} and rows under 
> {{tableLocation}}, regardless of {{spark.sql.hive.convertMetastoreParquet}} 
> ‘s value. However, actually Hive seems to return only rows under 
> {{tableLocation}}
> Another similar case is that, if {{path}} is provided in {{TBLPROPERTIES}}, 
> Spark won’t double the rows when {{'path'=''}}. If 
> {{'path'=''}}, Spark will read both rows under {{anotherPath}} 
> and rows under {{tableLocation}}, Hive seems to keep ignoring the {{path}} in 
> {{TBLPROPERTIES}}
> Code examples for the above cases (diff patch wrote in 
> {{HiveParquetMetastoreSuite.scala}}) can be found in Attachments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA

2021-10-18 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430124#comment-17430124
 ] 

Erik Krogen commented on SPARK-33828:
-

[~dongjoon] as you mentioned above, this epic was initially for collecting 
issues to be resolved in 3.2.0. That release has now been finalized, but we 
still have a few open issues here, and there are still new AQE issues being 
created (e.g. SPARK-37043 just today). Shall we keep this epic open and 
continue to use it, or create a new one targeted for 3.3.0? Or any other 
suggestions?

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: releasenotes
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many 
> JIRA issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it 
> by default and to collect all information needed to do QA for this feature in 
> the Apache Spark 3.2.0 timeframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings

2021-10-18 Thread Menelais Karavelas (Jira)
Menelais Karavelas created SPARK-37047:
--

 Summary: Add overloads for lpad and rpad for BINARY strings
 Key: SPARK-37047
 URL: https://issues.apache.org/jira/browse/SPARK-37047
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Menelaos Karavelas


Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of 
input string to be padded and padding pattern), and these strings get cast to 
UTF8 strings. The result of the operation is a UTF8 string which may be invalid 
as it can contain non-UTF8 characters.

What we would like to do is to overload `lpad` and `rpad` to accept BINARY 
strings as inputs (both for the string to be padded and the padding pattern) 
and produce a left or right padded BINARY string as output.
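
A minimal PySpark sketch of the current behavior described above (a sketch, 
assuming an existing SparkSession named spark; per the description, the result 
column comes back as a string today rather than BINARY):

{code:python}
from pyspark.sql import functions as F

df = spark.createDataFrame([(bytearray(b"\x01\x02"),)], ["b"])  # BINARY column
padded = df.select(F.lpad("b", 4, "0").alias("padded"))
padded.printSchema()  # shows `padded: string` today; the proposal would keep it binary
{code}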



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37046:


Assignee: (was: Apache Spark)

> Alter view does not preserve column case
> 
>
> Key: SPARK-37046
> URL: https://issues.apache.org/jira/browse/SPARK-37046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> On running an `alter view` command, the column case is not preserved.
> Repro:
>  
> {code:java}
> scala> sql("create view v as select 1 as A, 1 as B")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [A,int,null]
> [B,int,null]
> scala> sql("alter view v as select 1 as C, 1 as D")
> res4: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [c,int,null]
> [d,int,null]
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37046:


Assignee: (was: Apache Spark)

> Alter view does not preserve column case
> 
>
> Key: SPARK-37046
> URL: https://issues.apache.org/jira/browse/SPARK-37046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> On running an `alter view` command, the column case is not preserved.
> Repro:
>  
> {code:java}
> scala> sql("create view v as select 1 as A, 1 as B")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [A,int,null]
> [B,int,null]
> scala> sql("alter view v as select 1 as C, 1 as D")
> res4: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [c,int,null]
> [d,int,null]
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37046:


Assignee: Apache Spark

> Alter view does not preserve column case
> 
>
> Key: SPARK-37046
> URL: https://issues.apache.org/jira/browse/SPARK-37046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Assignee: Apache Spark
>Priority: Major
>
> On running an `alter view` command, the column case is not preserved.
> Repro:
>  
> {code:java}
> scala> sql("create view v as select 1 as A, 1 as B")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [A,int,null]
> [B,int,null]
> scala> sql("alter view v as select 1 as C, 1 as D")
> res4: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [c,int,null]
> [d,int,null]
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430114#comment-17430114
 ] 

Apache Spark commented on SPARK-37046:
--

User 'somani' has created a pull request for this issue:
https://github.com/apache/spark/pull/34317

> Alter view does not preserve column case
> 
>
> Key: SPARK-37046
> URL: https://issues.apache.org/jira/browse/SPARK-37046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> On running an `alter view` command, the column case is not preserved.
> Repro:
>  
> {code:java}
> scala> sql("create view v as select 1 as A, 1 as B")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [A,int,null]
> [B,int,null]
> scala> sql("alter view v as select 1 as C, 1 as D")
> res4: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [c,int,null]
> [d,int,null]
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430105#comment-17430105
 ] 

Abhishek Somani commented on SPARK-37046:
-

I'll raise a PR soon

> Alter view does not preserve column case
> 
>
> Key: SPARK-37046
> URL: https://issues.apache.org/jira/browse/SPARK-37046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> On running an `alter view` command, the column case is not preserved.
> Repro:
>  
> {code:java}
> scala> sql("create view v as select 1 as A, 1 as B")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [A,int,null]
> [B,int,null]
> scala> sql("alter view v as select 1 as C, 1 as D")
> res4: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [c,int,null]
> [d,int,null]
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Abhishek Somani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated SPARK-37046:

Shepherd: Wenchen Fan

> Alter view does not preserve column case
> 
>
> Key: SPARK-37046
> URL: https://issues.apache.org/jira/browse/SPARK-37046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> On running an `alter view` command, the column case is not preserved.
> Repro:
>  
> {code:java}
> scala> sql("create view v as select 1 as A, 1 as B")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [A,int,null]
> [B,int,null]
> scala> sql("alter view v as select 1 as C, 1 as D")
> res4: org.apache.spark.sql.DataFrame = []
> scala> sql("describe v").collect.foreach(println)
> [c,int,null]
> [d,int,null]
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37046) Alter view does not preserve column case

2021-10-18 Thread Abhishek Somani (Jira)
Abhishek Somani created SPARK-37046:
---

 Summary: Alter view does not preserve column case
 Key: SPARK-37046
 URL: https://issues.apache.org/jira/browse/SPARK-37046
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Abhishek Somani


On running an `alter view` command, the column case is not preserved.

Repro:

 
{code:java}
scala> sql("create view v as select 1 as A, 1 as B")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("describe v").collect.foreach(println)
[A,int,null]
[B,int,null]

scala> sql("alter view v as select 1 as C, 1 as D")
res4: org.apache.spark.sql.DataFrame = []

scala> sql("describe v").collect.foreach(println)
[c,int,null]
[d,int,null]
 
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32161) Hide JVM traceback for SparkUpgradeException

2021-10-18 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430099#comment-17430099
 ] 

pralabhkumar commented on SPARK-32161:
--

[~hyukjin.kwon] 

 

Since the PR has been merged, please change the status of the Jira and 
assign it to me.

 

> Hide JVM traceback for SparkUpgradeException
> 
>
> Key: SPARK-32161
> URL: https://issues.apache.org/jira/browse/SPARK-32161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We added {{SparkUpgradeException}}, for which the JVM traceback is pretty 
> useless. See also https://github.com/apache/spark/pull/28736/files#r449184881
> We should also whitelist this exception and hide the JVM traceback.
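
A rough sketch of the kind of conversion meant here (illustrative only; the 
real handling lives in pyspark.sql.utils, and the wrapper below is a simplified 
stand-in rather than the actual implementation):

{code:python}
from py4j.protocol import Py4JJavaError

class SparkUpgradeException(Exception):
    """Python-side stand-in raised instead of surfacing the JVM traceback."""

def convert_jvm_error(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Py4JJavaError as e:
            java_cls = e.java_exception.getClass().getName()
            if java_cls == "org.apache.spark.SparkUpgradeException":
                # Re-raise with only the JVM message; "from None" drops the
                # noisy Java traceback from the Python-side error.
                raise SparkUpgradeException(e.java_exception.getMessage()) from None
            raise
    return wrapper
{code}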



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests

2021-10-18 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-37045:
-
Description: Extract ALTER TABLE .. ADD COLUMNS tests to the common place 
to run them for V1 and V2 datasources. Some tests can be placed in V1- and 
V2-specific test suites.  (was: Extract DESCRIBE NAMESPACE tests to the common 
place to run them for V1 and v2 datasources. Some tests can be places to V1 and 
V2 specific test suites.)
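
For reference, the command family these unified tests exercise; a minimal 
sketch, assuming an existing SparkSession named spark and an existing table t:

{code:python}
# The v1/v2-agnostic behavior under test boils down to statements like this.
spark.sql("ALTER TABLE t ADD COLUMNS (c1 INT COMMENT 'new column', c2 STRING)")
spark.table("t").printSchema()
{code}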

> Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
> 
>
> Key: SPARK-37045
> URL: https://issues.apache.org/jira/browse/SPARK-37045
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> Extract ALTER TABLE .. ADD COLUMNS tests to the common place to run them for 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests

2021-10-18 Thread Max Gekk (Jira)
Max Gekk created SPARK-37045:


 Summary: Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
 Key: SPARK-37045
 URL: https://issues.apache.org/jira/browse/SPARK-37045
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Terry Kim


Extract DESCRIBE NAMESPACE tests to the common place to run them for V1 and V2 
datasources. Some tests can be placed in V1- and V2-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2021-10-18 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430059#comment-17430059
 ] 

Maciej Szymkiewicz commented on SPARK-37044:


cc [~hyukjin.kwon]

> Add Row to __all__ in pyspark.sql.types
> ---
>
> Key: SPARK-37044
> URL: https://issues.apache.org/jira/browse/SPARK-37044
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
> {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. It means that 
> {{from pyspark.sql.types import *}} won't import {{Row}}.
> It might be counter-intuitive, especially when we import {{Row}} from 
> {{types}} in {{examples}}.
> Should we add it to {{__all__}}?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2021-10-18 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-37044:
--

 Summary: Add Row to __all__ in pyspark.sql.types
 Key: SPARK-37044
 URL: https://issues.apache.org/jira/browse/SPARK-37044
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.2.0, 3.1.0, 3.3.0
Reporter: Maciej Szymkiewicz


Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
{{pyspark.sql}} but not from {{pyspark.sql.types}} itself. It means that 
{{from pyspark.sql.types import *}} won't import {{Row}}.

It might be counter-intuitive, especially when we import {{Row}} from {{types}} 
in {{examples}}.

Should we add it to {{__all__}}?
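
A small sketch of the user-visible effect and of the proposed change (the 
__all__ line is shown truncated; the actual list in python/pyspark/sql/types.py 
is much longer):

{code:python}
# User-visible effect today:
from pyspark.sql.types import *   # star import -- Row is NOT brought into scope
from pyspark.sql import Row       # importing Row from pyspark.sql works

# Proposed change inside python/pyspark/sql/types.py (list truncated here):
# __all__ = ["DataType", "NullType", ..., "StructField", "StructType", "Row"]
{code}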



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35925) Support DayTimeIntervalType in width-bucket function

2021-10-18 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35925:


Assignee: PengLei

> Support DayTimeIntervalType in width-bucket function
> 
>
> Key: SPARK-35925
> URL: https://issues.apache.org/jira/browse/SPARK-35925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
>
> Currently, width_bucket supports the argument types [DoubleType, DoubleType, 
> DoubleType, LongType]; we hope to also support [DayTimeIntervalType, 
> DayTimeIntervalType, DayTimeIntervalType, LongType].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35925) Support DayTimeIntervalType in width-bucket function

2021-10-18 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35925.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34309
[https://github.com/apache/spark/pull/34309]

> Support DayTimeIntervalType in width-bucket function
> 
>
> Key: SPARK-35925
> URL: https://issues.apache.org/jira/browse/SPARK-35925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, width_bucket supports the argument types [DoubleType, DoubleType, 
> DoubleType, LongType]; we hope to also support [DayTimeIntervalType, 
> DayTimeIntervalType, DayTimeIntervalType, LongType].
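
A minimal sketch of the call this enables (assuming a Spark build that includes 
this change and an existing SparkSession named spark):

{code:python}
spark.sql("""
  SELECT width_bucket(
    INTERVAL '3' DAY,    -- value to place
    INTERVAL '0' DAY,    -- lower bound
    INTERVAL '10' DAY,   -- upper bound
    5                    -- number of buckets
  ) AS bucket
""").show()
# Expected: bucket = 2, since 3 days falls in the second of five 2-day buckets.
{code}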



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-10-18 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-36231:

Comment: was deleted

(was: https://github.com/apache/spark/pull/34314)

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Arithmetic operations on a Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> 

[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429997#comment-17429997
 ] 

Apache Spark commented on SPARK-36231:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34314

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Arithmetic operations on a Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> 

[jira] [Assigned] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36231:


Assignee: (was: Apache Spark)

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Arithmetic operations on a Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> 

[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429996#comment-17429996
 ] 

Apache Spark commented on SPARK-36231:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34314

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Arithmetic operations on a Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> 

[jira] [Assigned] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36231:


Assignee: Apache Spark

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Arithmetic operations on a Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> 
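
For reference, below is a minimal pandas-on-Spark sketch of the failure together with a user-side way to avoid it. This is not the fix tracked by this issue; the cleanup lambda is illustrative only and simply maps Decimal('NaN') to None before the data reaches Spark.

{code:python}
import decimal

import numpy as np
import pandas as pd
import pyspark.pandas as ps

# A Decimal built from a float NaN becomes Decimal('NaN'), which the JVM side
# cannot represent as a DecimalType value and which triggers the NPE above.
pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), decimal.Decimal(np.nan)])

# Illustrative workaround (not the fix): replace Decimal('NaN') with None
# before handing the data to pandas-on-Spark, so the column only ever holds
# valid decimals or nulls.
cleaned = pser.apply(
    lambda d: None if isinstance(d, decimal.Decimal) and d.is_nan() else d
)
psser = ps.from_pandas(cleaned)
print((psser + 1).to_pandas())
{code}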

[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-10-18 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429994#comment-17429994
 ] 

Yikun Jiang commented on SPARK-36231:
-

https://github.com/apache/spark/pull/34314

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Arithmetic operations on a Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> 

[jira] [Assigned] (SPARK-37043) Cancel all running job after AQE plan finished

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37043:


Assignee: (was: Apache Spark)

> Cancel all running job after AQE plan finished
> --
>
> Key: SPARK-37043
> URL: https://issues.apache.org/jira/browse/SPARK-37043
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> We see that a stage is still running after the AQE plan has finished. This is 
> because a plan that contains an empty join has been converted to 
> `LocalTableScanExec` during `AQEOptimizer`, while the other side of the join 
> (a shuffle map stage) is still running.
>  
> There is no point in keeping that stage running; it's better to cancel the 
> running stage once the AQE plan has finished so that task resources are not wasted.
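
As a rough illustration of the cancellation mechanism involved (not the AQE-internal change proposed here), Spark already exposes job groups that let a driver cancel work that has become unnecessary. The group name and workload below are made up:

{code:python}
import threading
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

def run_possibly_wasted_work():
    # Tag everything submitted from this thread so it can be cancelled as a group.
    sc.setJobGroup("maybe-wasted", "shuffle work that may become unnecessary")
    try:
        sc.parallelize(range(5000000), 8).map(lambda x: x * x).count()
    except Exception as exc:  # a cancelled job surfaces here as an exception
        print("job ended early:", type(exc).__name__)

t = threading.Thread(target=run_possibly_wasted_work)
t.start()
time.sleep(1)

# Once the final result is already known (e.g. the other join side turned out
# to be empty), cancel whatever is still running in the group.
sc.cancelJobGroup("maybe-wasted")
t.join()
spark.stop()
{code}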



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37043) Cancel all running job after AQE plan finished

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37043:


Assignee: Apache Spark

> Cancel all running job after AQE plan finished
> --
>
> Key: SPARK-37043
> URL: https://issues.apache.org/jira/browse/SPARK-37043
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> We see that a stage is still running after the AQE plan has finished. This is 
> because a plan that contains an empty join has been converted to 
> `LocalTableScanExec` during `AQEOptimizer`, while the other side of the join 
> (a shuffle map stage) is still running.
>  
> There is no point in keeping that stage running; it's better to cancel the 
> running stage once the AQE plan has finished so that task resources are not wasted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36337) decimal('Nan') is unsupported in net.razorvine.pickle

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429987#comment-17429987
 ] 

Apache Spark commented on SPARK-36337:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34314

> decimal('Nan') is unsupported in net.razorvine.pickle 
> --
>
> Key: SPARK-36337
> URL: https://issues.apache.org/jira/browse/SPARK-36337
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> Decimal('NaN') is currently not supported by net.razorvine.pickle.
> In Python
> {code:java}
> >>> pickled = cloudpickle.dumps(decimal.Decimal('NaN'))
> b'\x80\x05\x95!\x00\x00\x00\x00\x00\x00\x00\x8c\x07decimal\x94\x8c\x07Decimal\x94\x93\x94\x8c\x03NaN\x94\x85\x94R\x94.'
> >>> pickle.loads(pickled)
> Decimal('NaN')
> {code}
> In Scala
> {code:java}
> scala> import net.razorvine.pickle.\{Pickler, Unpickler, PickleUtils}
> scala> val unpickle = new Unpickler
> scala> 
> unpickle.loads(PickleUtils.str2bytes("\u0080\u0005\u0095!\u\u\u\u\u\u\u\u008c\u0007decimal\u0094\u008c\u0007Decimal\u0094\u0093\u0094\u008c\u0003NaN\u0094\u0085\u0094R\u0094."))
> net.razorvine.pickle.PickleException: problem construction object: 
> java.lang.reflect.InvocationTargetException
>  at 
> net.razorvine.pickle.objects.AnyClassConstructor.construct(AnyClassConstructor.java:29)
>  at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773)
>  at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:213)
>  at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
>  at net.razorvine.pickle.Unpickler.loads(Unpickler.java:136)
>  ... 48 elided
> {code}
> I submitted an issue upstream in pickle: 
> [https://github.com/irmen/pickle/issues/7].
> We should bump pickle to the latest version once it is fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37043) Cancel all running job after AQE plan finished

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429988#comment-17429988
 ] 

Apache Spark commented on SPARK-37043:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34316

> Cancel all running job after AQE plan finished
> --
>
> Key: SPARK-37043
> URL: https://issues.apache.org/jira/browse/SPARK-37043
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> We see that a stage is still running after the AQE plan has finished. This is 
> because a plan that contains an empty join has been converted to 
> `LocalTableScanExec` during `AQEOptimizer`, while the other side of the join 
> (a shuffle map stage) is still running.
>  
> There is no point in keeping that stage running; it's better to cancel the 
> running stage once the AQE plan has finished so that task resources are not wasted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37039) np.nan series.astype(bool) should be True

2021-10-18 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429973#comment-17429973
 ] 

Yikun Jiang edited comment on SPARK-37039 at 10/18/21, 12:27 PM:
-

Looks like there are different behaviors for different types...


was (Author: yikunkero):
working on this

> np.nan series.astype(bool) should be True
> -
>
> Key: SPARK-37039
> URL: https://issues.apache.org/jira/browse/SPARK-37039
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> np.nan series.astype(bool) should be True, rather than False:
> https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147
> >>> pd.Series([1, 2, np.nan], dtype=float).astype(bool)
> >>> pd.Series([1, 2, np.nan], dtype=str).astype(bool)
> >>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan])
> 0 True
> 1 True
> 2 True
> dtype: bool
> But in pyspark, it is:
> 0 True
> 1 True
> 2 False
> dtype: bool
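
For comparison, here is a small sketch of the plain NumPy/pandas behavior the report expects: NaN is a non-zero float, so it is truthy.

{code:python}
import numpy as np
import pandas as pd

# NaN is a non-zero float, so Python treats it as truthy...
print(bool(np.nan))  # True

# ...and pandas' astype(bool) keeps that behavior for float Series:
print(pd.Series([1.0, 2.0, np.nan]).astype(bool).tolist())  # [True, True, True]
{code}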



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37043) Cancel all running job after AQE plan finished

2021-10-18 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-37043:
--
Description: 
We see that a stage is still running after the AQE plan has finished. This is 
because a plan that contains an empty join has been converted to 
`LocalTableScanExec` during `AQEOptimizer`, while the other side of the join 
(a shuffle map stage) is still running.

 

There is no point in keeping that stage running; it's better to cancel the 
running stage once the AQE plan has finished so that task resources are not wasted.

  was:
We see that a stage is still running after the AQE plan has finished. This is 
because a plan that contains an empty join has been converted to 
`LocalTableScanExec` during `AQEOptimizer`, while the other side of the join 
(a shuffle map stage) is still running.

 

It's better to cancel the running stage once the AQE plan has finished so that 
task resources are not wasted.


> Cancel all running job after AQE plan finished
> --
>
> Key: SPARK-37043
> URL: https://issues.apache.org/jira/browse/SPARK-37043
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> We see that a stage is still running after the AQE plan has finished. This is 
> because a plan that contains an empty join has been converted to 
> `LocalTableScanExec` during `AQEOptimizer`, while the other side of the join 
> (a shuffle map stage) is still running.
>  
> There is no point in keeping that stage running; it's better to cancel the 
> running stage once the AQE plan has finished so that task resources are not wasted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37043) Cancel all running job after AQE plan finished

2021-10-18 Thread XiDuo You (Jira)
XiDuo You created SPARK-37043:
-

 Summary: Cancel all running job after AQE plan finished
 Key: SPARK-37043
 URL: https://issues.apache.org/jira/browse/SPARK-37043
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


We see that a stage is still running after the AQE plan has finished. This is 
because a plan that contains an empty join has been converted to 
`LocalTableScanExec` during `AQEOptimizer`, while the other side of the join 
(a shuffle map stage) is still running.

 

It's better to cancel the running stage once the AQE plan has finished so that 
task resources are not wasted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37039) np.nan series.astype(bool) should be True

2021-10-18 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429973#comment-17429973
 ] 

Yikun Jiang commented on SPARK-37039:
-

working on this

> np.nan series.astype(bool) should be True
> -
>
> Key: SPARK-37039
> URL: https://issues.apache.org/jira/browse/SPARK-37039
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> np.nan series.astype(bool) should be True, rather than False:
> https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147
> >>> pd.Series([1, 2, np.nan], dtype=float).astype(bool)
> >>> pd.Series([1, 2, np.nan], dtype=str).astype(bool)
> >>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan])
> 0 True
> 1 True
> 2 True
> dtype: bool
> But in pyspark, it is:
> 0 True
> 1 True
> 2 False
> dtype: bool



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37042) Inline type hints for kinesis.py and listener.py in python/pyspark/streaming

2021-10-18 Thread dch nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dch nguyen updated SPARK-37042:
---
Summary: Inline type hints for kinesis.py and listener.py in 
python/pyspark/streaming  (was: Inline type hints for 
python/pyspark/streaming/kinesis.py)

> Inline type hints for kinesis.py and listener.py in python/pyspark/streaming
> 
>
> Key: SPARK-37042
> URL: https://issues.apache.org/jira/browse/SPARK-37042
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36965) Extend python test runner by logging out the temp output files

2021-10-18 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-36965.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34233
[https://github.com/apache/spark/pull/34233]

> Extend python test runner by logging out the temp output files
> --
>
> Key: SPARK-36965
> URL: https://issues.apache.org/jira/browse/SPARK-36965
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
> Fix For: 3.3.0
>
>
> I was running a Python test that was extremely slow and was surprised that 
> unit-tests.log had not even been created. Looking into the code, I found that 
> the tests can be executed in parallel and each one has its own temporary 
> output file, which is only appended to unit-tests.log when a test finishes 
> with a failure (after acquiring a lock to avoid parallel writes to 
> unit-tests.log). 
> To avoid such confusion it would make sense to log the paths of those 
> temporary output files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37042) Inline type hints for python/pyspark/streaming/kinesis.py

2021-10-18 Thread dch nguyen (Jira)
dch nguyen created SPARK-37042:
--

 Summary: Inline type hints for python/pyspark/streaming/kinesis.py
 Key: SPARK-37042
 URL: https://issues.apache.org/jira/browse/SPARK-37042
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dch nguyen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37042) Inline type hints for python/pyspark/streaming/kinesis.py

2021-10-18 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429918#comment-17429918
 ] 

dch nguyen commented on SPARK-37042:


I am working on this

> Inline type hints for python/pyspark/streaming/kinesis.py
> -
>
> Key: SPARK-37042
> URL: https://issues.apache.org/jira/browse/SPARK-37042
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type

2021-10-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36978.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34263
[https://github.com/apache/spark/pull/34263]

> InferConstraints rule should create IsNotNull constraints on the nested field 
> instead of the root nested type 
> --
>
> Key: SPARK-36978
> URL: https://issues.apache.org/jira/browse/SPARK-36978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
> Fix For: 3.3.0
>
>
> [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206]
>  optimization rule generates {{IsNotNull}} constraints corresponding to null 
> intolerant predicates. The {{IsNotNull}} constraints are generated on the 
> attribute inside the corresponding predicate. 
>  e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a 
> constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int 
> column {{structCol.b}} where {{structCol}} is a struct column results in a 
> constraint {{IsNotNull(structCol)}}.
> This generation of constraints on the root-level nested type is extremely 
> conservative as it could lead to materialization of the entire struct. 
> The constraint should instead be generated on the nested field being 
> referenced by the predicate. In the above example, the constraint should be 
> {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}}
>  
> The new constraints also create opportunities for nested pruning. Currently 
> {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. 
> However the constraint {{IsNotNull(structCol.b)}} could create opportunities 
> to prune {{structCol}}.
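
A quick way to see which constraint is inferred is to filter on a nested field and look at the optimized logical plan in the explain output. The schema and data below are made up for illustration:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [((1, "a"),), ((None, "b"),), (None,)],
    "structCol struct<b: int, c: string>",
)

# A null-intolerant predicate on a nested field; the optimized logical plan in
# the output shows the inferred IsNotNull constraint (on structCol before this
# change, on structCol.b after it, per the description above).
df.filter(F.col("structCol.b") > 0).explain(True)
{code}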



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type

2021-10-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36978:
---

Assignee: Utkarsh Agarwal

> InferConstraints rule should create IsNotNull constraints on the nested field 
> instead of the root nested type 
> --
>
> Key: SPARK-36978
> URL: https://issues.apache.org/jira/browse/SPARK-36978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
>
> [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206]
>  optimization rule generates {{IsNotNull}} constraints corresponding to null 
> intolerant predicates. The {{IsNotNull}} constraints are generated on the 
> attribute inside the corresponding predicate. 
>  e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a 
> constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int 
> column {{structCol.b}} where {{structCol}} is a struct column results in a 
> constraint {{IsNotNull(structCol)}}.
> This generation of constraints on the root-level nested type is extremely 
> conservative as it could lead to materialization of the entire struct. 
> The constraint should instead be generated on the nested field being 
> referenced by the predicate. In the above example, the constraint should be 
> {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}}
>  
> The new constraints also create opportunities for nested pruning. Currently 
> {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. 
> However the constraint {{IsNotNull(structCol.b)}} could create opportunities 
> to prune {{structCol}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429875#comment-17429875
 ] 

Apache Spark commented on SPARK-37041:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/34312

> Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
> --
>
> Key: SPARK-37041
> URL: https://issues.apache.org/jira/browse/SPARK-37041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier to 
> upgrade Thrift.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using java 8 and Java 17

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429874#comment-17429874
 ] 

Apache Spark commented on SPARK-37013:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34313

> `select format_string('%0$s', 'Hello')` has different behavior when using 
> java 8 and Java 17
> 
>
> Key: SPARK-37013
> URL: https://issues.apache.org/jira/browse/SPARK-37013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> --PostgreSQL throw ERROR:  format specifies argument 0, but arguments are 
> numbered from 1
> select format_string('%0$s', 'Hello');
> {code}
> Execute with Java 8
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct
> -- !query output
> Hello
> {code}
> Execute with Java 17
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct<>
> -- !query output
> java.util.IllegalFormatArgumentIndexException
> Illegal format argument index = 0
> {code}
>  
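
For completeness, the same expression can be exercised from PySpark; whether it prints 'Hello' or fails with IllegalFormatArgumentIndexException depends on the JVM version running Spark, per the outputs above.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Same query as above; the result (or error) depends on Java 8 vs Java 17.
spark.sql("select format_string('%0$s', 'Hello')").show()
{code}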



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37041:


Assignee: (was: Apache Spark)

> Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
> --
>
> Key: SPARK-37041
> URL: https://issues.apache.org/jira/browse/SPARK-37041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier to 
> upgrade Thrift.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37041:


Assignee: Apache Spark

> Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
> --
>
> Key: SPARK-37041
> URL: https://issues.apache.org/jira/browse/SPARK-37041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier to 
> upgrade Thrift.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using java 8 and Java 17

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37013:


Assignee: (was: Apache Spark)

> `select format_string('%0$s', 'Hello')` has different behavior when using 
> java 8 and Java 17
> 
>
> Key: SPARK-37013
> URL: https://issues.apache.org/jira/browse/SPARK-37013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> --PostgreSQL throw ERROR:  format specifies argument 0, but arguments are 
> numbered from 1
> select format_string('%0$s', 'Hello');
> {code}
> Execute with Java 8
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct
> -- !query output
> Hello
> {code}
> Execute with Java 17
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct<>
> -- !query output
> java.util.IllegalFormatArgumentIndexException
> Illegal format argument index = 0
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using java 8 and Java 17

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37013:


Assignee: Apache Spark

> `select format_string('%0$s', 'Hello')` has different behavior when using 
> java 8 and Java 17
> 
>
> Key: SPARK-37013
> URL: https://issues.apache.org/jira/browse/SPARK-37013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> --PostgreSQL throw ERROR:  format specifies argument 0, but arguments are 
> numbered from 1
> select format_string('%0$s', 'Hello');
> {code}
> Execute with Java 8
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct
> -- !query output
> Hello
> {code}
> Execute with Java 17
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct<>
> -- !query output
> java.util.IllegalFormatArgumentIndexException
> Illegal format argument index = 0
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using java 8 and Java 17

2021-10-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-37013:
-
Description: 
{code:java}
--PostgreSQL throw ERROR:  format specifies argument 0, but arguments are 
numbered from 1
select format_string('%0$s', 'Hello');
{code}
Execute with Java 8
{code:java}
-- !query
select format_string('%0$s', 'Hello')
-- !query schema
struct
-- !query output
Hello
{code}
Execute with Java 17
{code:java}
-- !query
select format_string('%0$s', 'Hello')
-- !query schema
struct<>
-- !query output
java.util.IllegalFormatArgumentIndexException
Illegal format argument index = 0
{code}
 

  was:
{code:java}
--PostgreSQL throw ERROR:  format specifies argument 0, but arguments are 
numbered from 1
select format_string('%0$s', 'Hello');
{code}
Execute with Java 8
{code:java}
-- !query
select format_string('%0$s', 'Hello')
-- !query schema
struct
-- !query output
Hello
{code}
Execute with Java 11
{code:java}
-- !query
select format_string('%0$s', 'Hello')
-- !query schema
struct<>
-- !query output
java.util.IllegalFormatArgumentIndexException
Illegal format argument index = 0
{code}
 


> `select format_string('%0$s', 'Hello')` has different behavior when using 
> java 8 and Java 17
> 
>
> Key: SPARK-37013
> URL: https://issues.apache.org/jira/browse/SPARK-37013
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> --PostgreSQL throw ERROR:  format specifies argument 0, but arguments are 
> numbered from 1
> select format_string('%0$s', 'Hello');
> {code}
> Execute with Java 8
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct
> -- !query output
> Hello
> {code}
> Execute with Java 17
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct<>
> -- !query output
> java.util.IllegalFormatArgumentIndexException
> Illegal format argument index = 0
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS

2021-10-18 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-37041:
---

 Summary: Backport HIVE-15025: Secure-Socket-Layer (SSL) support 
for HMS
 Key: SPARK-37041
 URL: https://issues.apache.org/jira/browse/SPARK-37041
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang


Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier to 
upgrade Thrift.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37039) np.nan series.astype(bool) should be True

2021-10-18 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-37039:

Parent: (was: SPARK-36000)
Issue Type: Bug  (was: Sub-task)

> np.nan series.astype(bool) should be True
> -
>
> Key: SPARK-37039
> URL: https://issues.apache.org/jira/browse/SPARK-37039
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> np.nan series.astype(bool) should be True, rather than False:
> https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147
> >>> pd.Series([1, 2, np.nan], dtype=float).astype(bool)
> >>> pd.Series([1, 2, np.nan], dtype=str).astype(bool)
> >>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan])
> 0 True
> 1 True
> 2 True
> dtype: bool
> But in pyspark, it is:
> 0 True
> 1 True
> 2 False
> dtype: bool



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37039) np.nan series.astype(bool) should be True

2021-10-18 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-37039:

Description: 
np.nan series.astype(bool) should be True, rather than False:

https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147

>>> pd.Series([1, 2, np.nan], dtype=float).astype(bool)
>>> pd.Series([1, 2, np.nan], dtype=str).astype(bool)
>>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan])
0 True
1 True
2 True
dtype: bool

But in pyspark, it is:
0 True
1 True
2 False
dtype: bool

  was:
np.nan series.astype(bool) should be True, rather than False:

https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147



> np.nan series.astype(bool) should be True
> -
>
> Key: SPARK-37039
> URL: https://issues.apache.org/jira/browse/SPARK-37039
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> np.nan series.astype(bool) should be True, rather than False:
> https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147
> >>> pd.Series([1, 2, np.nan], dtype=float).astype(bool)
> >>> pd.Series([1, 2, np.nan], dtype=str).astype(bool)
> >>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan])
> 0 True
> 1 True
> 2 True
> dtype: bool
> But in pyspark, it is:
> 0 True
> 1 True
> 2 False
> dtype: bool



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37040) SampleExec can set outputOrdering as children's outputOrdering

2021-10-18 Thread chong (Jira)
chong created SPARK-37040:
-

 Summary: SampleExec can set outputOrdering as children's 
outputOrdering
 Key: SPARK-37040
 URL: https://issues.apache.org/jira/browse/SPARK-37040
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: chong


All of the code paths in SampleExec that I can see preserve the child's ordering, 
but Spark does not declare this.

Would it be better to set the following?

 override def outputOrdering: Seq[SortOrder] = child.outputOrdering



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36367) Fix the behavior to follow pandas >= 1.3

2021-10-18 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-36367:
---
Affects Version/s: (was: 3.3.0)
   3.2.0

> Fix the behavior to follow pandas >= 1.3
> 
>
> Key: SPARK-36367
> URL: https://issues.apache.org/jira/browse/SPARK-36367
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> Pandas 1.3 has been released. We should follow the new pandas behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37038) Sample push down in DS v2

2021-10-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37038:


Assignee: Apache Spark

> Sample push down in DS v2
> -
>
> Key: SPARK-37038
> URL: https://issues.apache.org/jira/browse/SPARK-37038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37038) Sample push down in DS v2

2021-10-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429843#comment-17429843
 ] 

Apache Spark commented on SPARK-37038:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34311

> Sample push down in DS v2
> -
>
> Key: SPARK-37038
> URL: https://issues.apache.org/jira/browse/SPARK-37038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


