[jira] [Created] (SPARK-37054) Porting "pandas API on Spark: Internals" to PySpark docs.
Haejoon Lee created SPARK-37054: --- Summary: Porting "pandas API on Spark: Internals" to PySpark docs. Key: SPARK-37054 URL: https://issues.apache.org/jira/browse/SPARK-37054 Project: Spark Issue Type: Improvement Components: docs, PySpark Affects Versions: 3.2.0 Reporter: Haejoon Lee We have a [document|https://docs.google.com/document/d/1PR88p6yMHIeSxkDkSqCxLofkcnP0YtwQ2tETfyAWLQQ/edit?usp=sharing] for pandas API on Spark internal features, separate from the official PySpark documentation. Since pandas API on Spark is officially released in Spark 3.2, it would be good to port this internal document into the official PySpark documentation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37054) Porting "pandas API on Spark: Internals" to PySpark docs.
[ https://issues.apache.org/jira/browse/SPARK-37054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430314#comment-17430314 ] Haejoon Lee commented on SPARK-37054: - I'm working on this. > Porting "pandas API on Spark: Internals" to PySpark docs. > - > > Key: SPARK-37054 > URL: https://issues.apache.org/jira/browse/SPARK-37054 > Project: Spark > Issue Type: Improvement > Components: docs, PySpark > Affects Versions: 3.2.0 > Reporter: Haejoon Lee > Priority: Major > > We have a [document|https://docs.google.com/document/d/1PR88p6yMHIeSxkDkSqCxLofkcnP0YtwQ2tETfyAWLQQ/edit?usp=sharing] > for pandas API on Spark internal features, separate from the official PySpark > documentation. > > Since pandas API on Spark is officially released in Spark 3.2, it would be good to > port this internal document into the official PySpark documentation.
[jira] [Commented] (SPARK-37044) Add Row to __all__ in pyspark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430304#comment-17430304 ] Hyukjin Kwon commented on SPARK-37044: -- Agree! > Add Row to __all__ in pyspark.sql.types > --- > > Key: SPARK-37044 > URL: https://issues.apache.org/jira/browse/SPARK-37044 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL > Affects Versions: 3.1.0, 3.2.0, 3.3.0 > Reporter: Maciej Szymkiewicz > Priority: Minor > > Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from > {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. It means that {{from pyspark.sql.types import > *}} won't import {{Row}}. > It might be counter-intuitive, especially when we import {{Row}} from > {{types}} in {{examples}}. > Should we add it to {{__all__}}?
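The star-import behavior behind this issue can be shown with a minimal standalone sketch: when a module defines `__all__`, `from module import *` only pulls in the listed names. The toy module `toy_types` and its two classes below are hypothetical stand-ins for `pyspark.sql.types`.

```python
import sys
import types

# Build a toy module mimicking the situation described above:
# Row is defined in the module but deliberately left out of __all__.
mod = types.ModuleType("toy_types")
exec(
    "__all__ = ['StructType']\n"
    "class StructType: pass\n"
    "class Row: pass\n",
    mod.__dict__,
)
sys.modules["toy_types"] = mod  # register so the star import below can resolve it

ns = {}
exec("from toy_types import *", ns)

print("StructType" in ns)  # True: listed in __all__
print("Row" in ns)         # False: defined in the module, but not exported
```

Adding `'Row'` to the toy `__all__` list would make the second check succeed, which is exactly the change proposed for `pyspark.sql.types`.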
[jira] [Commented] (SPARK-36681) Fail to load Snappy codec
[ https://issues.apache.org/jira/browse/SPARK-36681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430303#comment-17430303 ] koert kuipers commented on SPARK-36681: --- Hadoop Jira issue: https://issues.apache.org/jira/browse/HADOOP-17891 I have my doubts that this only impacts sequence files; I am seeing this issue with snappy-compressed CSV files, snappy-compressed JSON files, etc. > Fail to load Snappy codec > - > > Key: SPARK-36681 > URL: https://issues.apache.org/jira/browse/SPARK-36681 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.2.0 > Reporter: L. C. Hsieh > Priority: Major > > snappy-java, as a native library, should not be relocated in the Hadoop shaded > client libraries. Currently we use the Hadoop shaded client libraries in Spark. > If trying to use SnappyCodec to write a sequence file, we will encounter the > following error: > {code} > [info] Cause: java.lang.UnsatisfiedLinkError: > org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawCompress(Ljava/nio/ByteBuffer;IILjava/nio/ByteBuffer;I)I > [info] at org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawCompress(Native Method) > [info] at org.apache.hadoop.shaded.org.xerial.snappy.Snappy.compress(Snappy.java:151) > [info] at org.apache.hadoop.io.compress.snappy.SnappyCompressor.compressDirectBuf(SnappyCompressor.java:282) > [info] at org.apache.hadoop.io.compress.snappy.SnappyCompressor.compress(SnappyCompressor.java:210) > [info] at org.apache.hadoop.io.compress.BlockCompressorStream.compress(BlockCompressorStream.java:149) > [info] at org.apache.hadoop.io.compress.BlockCompressorStream.finish(BlockCompressorStream.java:142) > [info] at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1589) > [info] at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1605) > [info] at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.close(SequenceFile.java:1629)
{code}
[jira] [Created] (SPARK-37053) Add metrics system for History server
angerszhu created SPARK-37053: - Summary: Add metrics system for History server Key: SPARK-37053 URL: https://issues.apache.org/jira/browse/SPARK-37053 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu Add metrics system for history server
[jira] [Commented] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell
[ https://issues.apache.org/jira/browse/SPARK-37052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430300#comment-17430300 ] Apache Spark commented on SPARK-37052: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34322 > Fix spark-3.2 can use --verbose with spark-shell > > > Key: SPARK-37052 > URL: https://issues.apache.org/jira/browse/SPARK-37052 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 3.2.0 > Reporter: angerszhu > Priority: Major > > Should not pass --verbose to spark-shell since it's not a valid argument for > spark-shell
[jira] [Assigned] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell
[ https://issues.apache.org/jira/browse/SPARK-37052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37052: Assignee: Apache Spark > Fix spark-3.2 can use --verbose with spark-shell > > > Key: SPARK-37052 > URL: https://issues.apache.org/jira/browse/SPARK-37052 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 3.2.0 > Reporter: angerszhu > Assignee: Apache Spark > Priority: Major > > Should not pass --verbose to spark-shell since it's not a valid argument for > spark-shell
[jira] [Commented] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell
[ https://issues.apache.org/jira/browse/SPARK-37052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430299#comment-17430299 ] Apache Spark commented on SPARK-37052: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34322 > Fix spark-3.2 can use --verbose with spark-shell > > > Key: SPARK-37052 > URL: https://issues.apache.org/jira/browse/SPARK-37052 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 3.2.0 > Reporter: angerszhu > Priority: Major > > Should not pass --verbose to spark-shell since it's not a valid argument for > spark-shell
[jira] [Assigned] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell
[ https://issues.apache.org/jira/browse/SPARK-37052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37052: Assignee: (was: Apache Spark) > Fix spark-3.2 can use --verbose with spark-shell > > > Key: SPARK-37052 > URL: https://issues.apache.org/jira/browse/SPARK-37052 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 3.2.0 > Reporter: angerszhu > Priority: Major > > Should not pass --verbose to spark-shell since it's not a valid argument for > spark-shell
[jira] [Created] (SPARK-37052) Fix spark-3.2 can use --verbose with spark-shell
angerszhu created SPARK-37052: - Summary: Fix spark-3.2 can use --verbose with spark-shell Key: SPARK-37052 URL: https://issues.apache.org/jira/browse/SPARK-37052 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 3.2.0 Reporter: angerszhu Should not pass --verbose to spark-shell since it's not a valid argument for spark-shell
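The fix described above amounts to not forwarding `--verbose` (which spark-submit consumes) to spark-shell itself. A minimal sketch of the idea in Python; the helper name `filter_shell_args` is hypothetical, and the actual change lives in Spark's Scala launcher code:

```python
# Options accepted by spark-submit but not valid for spark-shell;
# strip them before forwarding the argument list to the shell.
UNSUPPORTED_BY_SHELL = {"--verbose", "-v"}

def filter_shell_args(args):
    """Return the argument list with shell-unsupported flags removed."""
    return [a for a in args if a not in UNSUPPORTED_BY_SHELL]

print(filter_shell_args(["--master", "local[2]", "--verbose"]))
# ['--master', 'local[2]']
```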
[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] frankli updated SPARK-37051:
Description:
When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
_select * from item where i_category = 'Music' limit 100;_
The table is in ORC format, and i_category is of char(50) type. I guess that the char(50) type retains redundant blanks after the actual word. This affects the boolean value of "x.equals(y)" and produces wrong results. By the way, Spark's tests should add more cases for the ORC format.
== Physical Plan ==
CollectLimit (3)
+- Filter (2)
   +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
ReadSchema: struct
(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Arguments: 100
was:
When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
_select * from item where i_category = 'Music' limit 100;_
The table is in ORC format, and i_category is of char(50) type. I guess that the char(50) type retains redundant blanks after the actual word. This affects the boolean value of "x.equals(y)" and produces wrong results. By the way, Spark's tests should add more cases for the ORC format.
== Physical Plan ==
CollectLimit (3)
+- Filter (2)
   +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
ReadSchema: struct
(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4,
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430297#comment-17430297 ] frankli commented on SPARK-37051: - [~dongjoon] Can I trouble you to take a look? Thanks a lot.
> The filter operator gets wrong results in ORC char/varchar types
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Major
>
> When I try the following sample SQL on the TPCDS data, the filter operator
> returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> I guess that the char(50) type retains redundant blanks after the actual word.
> This affects the boolean value of "x.equals(y)" and produces wrong results.
> By the way, Spark's tests should add more cases for the ORC format.
>
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
> ReadSchema: struct
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
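The suspected mechanism in this report, a char(50) value carrying trailing pad blanks into an exact-equality check, can be illustrated with plain Python string semantics. This is a standalone sketch of the padding effect, not Spark code:

```python
# char(50) storage pads a value with trailing spaces to the declared width,
# so the stored bytes are 'Music' followed by 45 blanks.
stored = "Music".ljust(50)   # what a char(50) column may physically hold
literal = "Music"            # the comparison value from the SQL predicate

print(stored == literal)            # False: padding defeats exact equality
print(stored.rstrip() == literal)   # True: matches once padding is stripped
```

This is why the pushed-down filter `EqualTo(i_category,Music )` (note the trailing blanks) and a reader that does not apply char-padding semantics consistently can disagree, yielding an empty result set.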
[jira] [Assigned] (SPARK-37033) Inline type hints for python/pyspark/resource/requests.py
[ https://issues.apache.org/jira/browse/SPARK-37033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37033: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/resource/requests.py > - > > Key: SPARK-37033 > URL: https://issues.apache.org/jira/browse/SPARK-37033 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.3.0 > Reporter: dch nguyen > Priority: Major >
[jira] [Assigned] (SPARK-37033) Inline type hints for python/pyspark/resource/requests.py
[ https://issues.apache.org/jira/browse/SPARK-37033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37033: Assignee: Apache Spark > Inline type hints for python/pyspark/resource/requests.py > - > > Key: SPARK-37033 > URL: https://issues.apache.org/jira/browse/SPARK-37033 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.3.0 > Reporter: dch nguyen > Assignee: Apache Spark > Priority: Major >
[jira] [Commented] (SPARK-37033) Inline type hints for python/pyspark/resource/requests.py
[ https://issues.apache.org/jira/browse/SPARK-37033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430296#comment-17430296 ] Apache Spark commented on SPARK-37033: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/34321 > Inline type hints for python/pyspark/resource/requests.py > - > > Key: SPARK-37033 > URL: https://issues.apache.org/jira/browse/SPARK-37033 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.3.0 > Reporter: dch nguyen > Priority: Major >
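"Inlining type hints" in this sub-task means moving annotations from a separate `.pyi` stub file into the `.py` source itself. A minimal sketch of the inlined style; the class and parameter names echo `pyspark.resource.requests` but are illustrative, not the exact upstream signatures:

```python
from typing import get_type_hints

# With inline hints, annotations live directly on the implementation,
# so type checkers and get_type_hints() work without a .pyi stub.
class ExecutorResourceRequest:
    def __init__(self, resourceName: str, amount: int,
                 discoveryScript: str = "", vendor: str = "") -> None:
        self._name = resourceName
        self._amount = amount

hints = get_type_hints(ExecutorResourceRequest.__init__)
print(hints["amount"])  # <class 'int'>
```

The behavior is unchanged; the migration only relocates where the annotations are declared.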
[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] frankli updated SPARK-37051:
Description:
When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
_select * from item where i_category = 'Music' limit 100;_
The table is in ORC format, and i_category is of char(50) type. I guess that the char(50) type retains redundant blanks after the actual word. This affects the boolean value of "x.equals(y)" and produces wrong results. By the way, Spark's tests should add more cases for the ORC format.
== Physical Plan ==
CollectLimit (3)
+- Filter (2)
   +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
ReadSchema: struct
(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Arguments: 100
was:
When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
_select * from item where i_category = 'Music' limit 100;_
The table is in ORC format, and i_category is of char(50) type. I guess that the char(50) type retains redundant blanks after the actual word. This affects the boolean value of "x.equals(y)" and produces wrong results. By the way, Spark's tests should add more cases for the ORC format.
== Physical Plan ==
CollectLimit (3)
+- Filter (2)
   +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
ReadSchema: struct
(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9,
[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] frankli updated SPARK-37051:
Description:
When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
_select * from item where i_category = 'Music' limit 100;_
The table is in ORC format, and i_category is of char(50) type. I guess that the char(50) type retains redundant blanks after the actual word. This affects the boolean value of "x.equals(y)" and produces wrong results. By the way, Spark's tests should add more cases for the ORC format.
== Physical Plan ==
CollectLimit (3)
+- Filter (2)
   +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
ReadSchema: struct
(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Arguments: 100
was:
When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
_select * from item where i_category = 'Music' limit 100;_
The table is in ORC format, and i_category is of char(50) type. I guess that the char(50) type retains redundant blanks after the actual word. This affects the boolean value of "x.equals(y)" and produces wrong results. By the way, Spark's tests should add more cases for the ORC format.
== Physical Plan ==
CollectLimit (3)
+- Filter (2)
   +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
ReadSchema: struct
(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8,
[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] frankli updated SPARK-37051:
Description:
When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
_select * from item where i_category = 'Music' limit 100;_
The table is in ORC format, and i_category is of char(50) type. I guess that the char(50) type retains redundant blanks after the actual word. This affects the boolean value of "x.equals(y)" and produces wrong results. By the way, Spark's tests should add more cases for the ORC format.
== Physical Plan ==
CollectLimit (3)
+- Filter (2)
   +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
ReadSchema: struct
(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
Arguments: 100
was:
When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
_select * from item where i_category = 'Music' limit 100;_
The table is in ORC format, and i_category is of char(50) type. I guess that the char(50) type retains redundant blanks after the actual word. This affects the boolean value of "x.equals(y)" and produces wrong results. By the way, Spark's tests should add more cases for the ORC format.
!image-2021-10-19-11-01-55-597.png|width=1085,height=499!
> The filter operator gets wrong results in ORC char/varchar types
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Major
>
> When I try the following sample SQL on the TPCDS data, the filter operator
> returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> I guess that the char(50) type retains redundant blanks after the actual word.
> This affects the boolean value of "x.equals(y)" and produces wrong results.
> By the way, Spark's tests should add more cases for the ORC format.
>
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
> ReadSchema: struct
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
> (3) CollectLimit
> Input [22]: [i_item_sk#0L,
[jira] [Created] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types
frankli created SPARK-37051: --- Summary: The filter operator gets wrong results in ORC char/varchar types Key: SPARK-37051 URL: https://issues.apache.org/jira/browse/SPARK-37051 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Environment: Spark 3.1.2 Scala 2.12 / Java 1.8 Reporter: frankli When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in web ui). _select * from item where i_category = 'Music' limit 100;_ The table is in ORC format, and i_category is char(50) type. I guess that the char(50) type retains redundant trailing blanks after the actual word. This affects the boolean value of "x.equals(Y)" and leads to wrong results. By the way, Spark's tests should add more cases for the ORC format. !image-2021-10-19-11-01-55-597.png|width=1085,height=499! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
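The suspected root cause above is CHAR(n) blank padding. A minimal sketch of the mismatch in plain Python (no Spark needed; the padding length and value mirror the report):

```python
# CHAR(n) columns are stored right-padded with spaces to length n.
stored = "Music".ljust(50)   # what an ORC reader returns for a CHAR(50) value
literal = "Music"            # what the pushed-down predicate compares against

# Naive equality fails because of the trailing blanks ...
assert stored != literal

# ... so either the literal must be padded or the stored value stripped
# before the comparison, which is what CHAR semantics require.
assert stored == literal.ljust(50)
assert stored.rstrip() == literal
```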
[jira] [Resolved] (SPARK-36834) Namespace log lines in External Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-36834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-36834. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34079 [https://github.com/apache/spark/pull/34079] > Namespace log lines in External Shuffle Service > --- > > Key: SPARK-36834 > URL: https://issues.apache.org/jira/browse/SPARK-36834 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Priority: Minor > Fix For: 3.3.0 > > > To differentiate between messages from the different concurrently running ESS > on an NM, it would be nice to add a namespace to the log lines emitted by the > ESS. This would make querying of logs much more convenient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
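The namespacing idea can be sketched with plain Python logging (illustrative only; the real change is in the JVM-side shuffle service, and the adapter class and application ID below are made up):

```python
import logging

# Prefix every log line with a per-application namespace so messages from
# multiple shuffle-service instances on one NodeManager can be told apart.
class NamespacedAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        return f"[{self.extra['ns']}] {msg}", kwargs

# Capture emitted messages instead of printing them, for demonstration.
records = []
handler = logging.Handler()
handler.emit = lambda record: records.append(record.getMessage())

log = logging.getLogger("ess-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

ns_log = NamespacedAdapter(log, {"ns": "application_1634_0001"})
ns_log.info("registered executor 3")
assert records == ["[application_1634_0001] registered executor 3"]
```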
[jira] [Assigned] (SPARK-36834) Namespace log lines in External Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-36834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-36834: --- Assignee: Thejdeep Gudivada > Namespace log lines in External Shuffle Service > --- > > Key: SPARK-36834 > URL: https://issues.apache.org/jira/browse/SPARK-36834 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Assignee: Thejdeep Gudivada >Priority: Minor > Fix For: 3.3.0 > > > To differentiate between messages from the different concurrently running ESS > on an NM, it would be nice to add a namespace to the log lines emitted by the > ESS. This would make querying of logs much more convenient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18621) PySQL SQL Types (aka Dataframa Schema) have __repr__() with Scala and not Python representation
[ https://issues.apache.org/jira/browse/SPARK-18621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430284#comment-17430284 ] Apache Spark commented on SPARK-18621: -- User 'crflynn' has created a pull request for this issue: https://github.com/apache/spark/pull/34320 > PySQL SQL Types (aka Dataframa Schema) have __repr__() with Scala and not > Python representation > --- > > Key: SPARK-18621 > URL: https://issues.apache.org/jira/browse/SPARK-18621 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2, 2.0.2 >Reporter: Romi Kuntsman >Priority: Minor > Labels: bulk-closed > > When using Python's repr() on an object, the expected result is a string that > Python can evaluate to construct the object. > See: https://docs.python.org/2/library/functions.html#func-repr > However, when getting a DataFrame schema in PySpark, the code (in > "__repr()__" overload methods) returns the string representation for Scala, > rather than for Python. > Relevant code in PySpark: > https://github.com/apache/spark/blob/5f02d2e5b4d37f554629cbd0e488e856fffd7b6b/python/pyspark/sql/types.py#L442 > Python Code: > {code} > # 1. define object > struct1 = StructType([StructField("f1", StringType(), True)]) > # 2. print representation, expected to be like above > print(repr(struct1)) > # 3. actual result: > # StructType(List(StructField(f1,StringType,true))) > # 4. try to use result in code > struct2 = StructType(List(StructField(f1,StringType,true))) > # 5. get bunch of errors > # Unresolved reference 'List' > # Unresolved reference 'f1' > # StringType is class, not constructed object > # Unresolved reference 'true' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
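The round-trip property the report asks for, `eval(repr(x)) == x`, can be illustrated with toy stand-ins for the PySpark types (these are not the actual `pyspark.sql.types` classes, just a sketch of what a Python-style `__repr__` looks like):

```python
# Toy classes, not PySpark's: each __repr__ returns a string that eval()
# can turn back into an equal object, per the Python repr() convention.
class StringType:
    def __repr__(self):
        return "StringType()"
    def __eq__(self, other):
        return type(other) is type(self)

class StructField:
    def __init__(self, name, dataType, nullable=True):
        self.name, self.dataType, self.nullable = name, dataType, nullable
    def __repr__(self):
        return f"StructField({self.name!r}, {self.dataType!r}, {self.nullable})"
    def __eq__(self, other):
        return (self.name, repr(self.dataType), self.nullable) == \
               (other.name, repr(other.dataType), other.nullable)

class StructType:
    def __init__(self, fields):
        self.fields = fields
    def __repr__(self):
        return f"StructType({self.fields!r})"
    def __eq__(self, other):
        return self.fields == other.fields

struct1 = StructType([StructField("f1", StringType(), True)])
# repr() round-trips: evaluating it reconstructs an equal object.
assert eval(repr(struct1)) == struct1
```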
[jira] [Assigned] (SPARK-36834) Namespace log lines in External Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-36834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-36834: --- Assignee: (was: Thejdeep Gudivada) > Namespace log lines in External Shuffle Service > --- > > Key: SPARK-36834 > URL: https://issues.apache.org/jira/browse/SPARK-36834 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Priority: Minor > > To differentiate between messages from the different concurrently running ESS > on an NM, it would be nice to add a namespace to the log lines emitted by the > ESS. This would make querying of logs much more convenient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36834) Namespace log lines in External Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-36834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-36834: --- Assignee: Thejdeep Gudivada > Namespace log lines in External Shuffle Service > --- > > Key: SPARK-36834 > URL: https://issues.apache.org/jira/browse/SPARK-36834 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Thejdeep Gudivada >Assignee: Thejdeep Gudivada >Priority: Minor > > To differentiate between messages from the different concurrently running ESS > on an NM, it would be nice to add a namespace to the log lines emitted by the > ESS. This would make querying of logs much more convenient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32161: Assignee: pralabhkumar > Hide JVM traceback for SparkUpgradeException > > > Key: SPARK-32161 > URL: https://issues.apache.org/jira/browse/SPARK-32161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: pralabhkumar >Priority: Major > > We added {{SparkUpgradeException}} which the JVM traceback is pretty useless. > See also https://github.com/apache/spark/pull/28736/files#r449184881 > It should better also whitelist and hide JVM traceback. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32161. -- Fix Version/s: 3.2.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/34275 > Hide JVM traceback for SparkUpgradeException > > > Key: SPARK-32161 > URL: https://issues.apache.org/jira/browse/SPARK-32161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: pralabhkumar >Priority: Major > Fix For: 3.2.0 > > > We added {{SparkUpgradeException}}, for which the JVM traceback is pretty useless. > See also https://github.com/apache/spark/pull/28736/files#r449184881 > We should also whitelist it and hide the JVM traceback. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
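The whitelisting approach can be sketched as follows (a hypothetical converter, not the actual PySpark internals; the class names, function name, and messages are illustrative):

```python
# Hypothetical sketch: for known "user-facing" JVM exception classes,
# surface only the message; for everything else keep the JVM stack trace,
# since it may still help debugging.
WHITELISTED = {"SparkUpgradeException", "AnalysisException"}

class PySparkError(Exception):
    pass

def convert_jvm_error(java_class, message, java_stacktrace):
    short_name = java_class.rsplit(".", 1)[-1]
    if short_name in WHITELISTED:
        return PySparkError(message)          # hide the unhelpful JVM traceback
    return PySparkError(f"{message}\n\nJVM stacktrace:\n{java_stacktrace}")

hidden = convert_jvm_error("org.apache.spark.SparkUpgradeException",
                           "You may get a different result due to the upgrade",
                           "at org.apache.spark...")
kept = convert_jvm_error("java.lang.RuntimeException", "boom",
                         "at org.apache.spark...")
assert "JVM stacktrace" not in str(hidden)
assert "JVM stacktrace" in str(kept)
```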
[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37049: Assignee: Apache Spark > executorIdleTimeout is not working for pending pods on K8s > -- > > Key: SPARK-37049 > URL: https://issues.apache.org/jira/browse/SPARK-37049 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Apache Spark >Priority: Major > > SPARK-33099 added the support to respect > "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. > However, when it checks if a pending executor pod is timed out, it checks > against the pod's "startTime". A pending pod "startTime" is empty, and this > causes the function "isExecutorIdleTimedOut()" always return true for pending > pods. > This caused the issue, pending pods are deleted immediately when a stage is > finished and several new pods got recreated again in the next stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430257#comment-17430257 ] Apache Spark commented on SPARK-37049: -- User 'yangwwei' has created a pull request for this issue: https://github.com/apache/spark/pull/34319 > executorIdleTimeout is not working for pending pods on K8s > -- > > Key: SPARK-37049 > URL: https://issues.apache.org/jira/browse/SPARK-37049 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Apache Spark >Priority: Major > > SPARK-33099 added the support to respect > "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. > However, when it checks if a pending executor pod is timed out, it checks > against the pod's "startTime". A pending pod "startTime" is empty, and this > causes the function "isExecutorIdleTimedOut()" always return true for pending > pods. > This caused the issue, pending pods are deleted immediately when a stage is > finished and several new pods got recreated again in the next stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37048) Clean up inlining type hints under SQL module
[ https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37048: Assignee: Apache Spark > Clean up inlining type hints under SQL module > - > > Key: SPARK-37048 > URL: https://issues.apache.org/jira/browse/SPARK-37048 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > Now that most of type hits under the SQL module are inlined. > We should clean up for the module now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37048) Clean up inlining type hints under SQL module
[ https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430255#comment-17430255 ] Apache Spark commented on SPARK-37048: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/34318 > Clean up inlining type hints under SQL module > - > > Key: SPARK-37048 > URL: https://issues.apache.org/jira/browse/SPARK-37048 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > Now that most of type hits under the SQL module are inlined. > We should clean up for the module now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37048) Clean up inlining type hints under SQL module
[ https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37048: Assignee: (was: Apache Spark) > Clean up inlining type hints under SQL module > - > > Key: SPARK-37048 > URL: https://issues.apache.org/jira/browse/SPARK-37048 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > Now that most of type hits under the SQL module are inlined. > We should clean up for the module now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430256#comment-17430256 ] Apache Spark commented on SPARK-37049: -- User 'yangwwei' has created a pull request for this issue: https://github.com/apache/spark/pull/34319 > executorIdleTimeout is not working for pending pods on K8s > -- > > Key: SPARK-37049 > URL: https://issues.apache.org/jira/browse/SPARK-37049 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Priority: Major > > SPARK-33099 added the support to respect > "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. > However, when it checks if a pending executor pod is timed out, it checks > against the pod's "startTime". A pending pod "startTime" is empty, and this > causes the function "isExecutorIdleTimedOut()" always return true for pending > pods. > This caused the issue, pending pods are deleted immediately when a stage is > finished and several new pods got recreated again in the next stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37049: Assignee: (was: Apache Spark) > executorIdleTimeout is not working for pending pods on K8s > -- > > Key: SPARK-37049 > URL: https://issues.apache.org/jira/browse/SPARK-37049 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Priority: Major > > SPARK-33099 added the support to respect > "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. > However, when it checks if a pending executor pod is timed out, it checks > against the pod's "startTime". A pending pod "startTime" is empty, and this > causes the function "isExecutorIdleTimedOut()" always return true for pending > pods. > This caused the issue, pending pods are deleted immediately when a stage is > finished and several new pods got recreated again in the next stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37050) Update conda installation instructions
[ https://issues.apache.org/jira/browse/SPARK-37050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430254#comment-17430254 ] Apache Spark commented on SPARK-37050: -- User 'h-vetinari' has created a pull request for this issue: https://github.com/apache/spark/pull/34315 > Update conda installation instructions > -- > > Key: SPARK-37050 > URL: https://issues.apache.org/jira/browse/SPARK-37050 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.2.0 >Reporter: H. Vetinari >Priority: Major > Fix For: 3.2.1 > > > Improve conda installation documentation, as discussed > [here|https://github.com/apache/spark-website/pull/361#issuecomment-945660978]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37050) Update conda installation instructions
[ https://issues.apache.org/jira/browse/SPARK-37050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37050: Assignee: Apache Spark > Update conda installation instructions > -- > > Key: SPARK-37050 > URL: https://issues.apache.org/jira/browse/SPARK-37050 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.2.0 >Reporter: H. Vetinari >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.1 > > > Improve conda installation documentation, as discussed > [here|https://github.com/apache/spark-website/pull/361#issuecomment-945660978]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37050) Update conda installation instructions
[ https://issues.apache.org/jira/browse/SPARK-37050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37050: Assignee: (was: Apache Spark) > Update conda installation instructions > -- > > Key: SPARK-37050 > URL: https://issues.apache.org/jira/browse/SPARK-37050 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.2.0 >Reporter: H. Vetinari >Priority: Major > Fix For: 3.2.1 > > > Improve conda installation documentation, as discussed > [here|https://github.com/apache/spark-website/pull/361#issuecomment-945660978]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37050) Update conda installation instructions
H. Vetinari created SPARK-37050: --- Summary: Update conda installation instructions Key: SPARK-37050 URL: https://issues.apache.org/jira/browse/SPARK-37050 Project: Spark Issue Type: Task Components: Documentation Affects Versions: 3.2.0 Reporter: H. Vetinari Fix For: 3.2.1 Improve conda installation documentation, as discussed [here|https://github.com/apache/spark-website/pull/361#issuecomment-945660978]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
Weiwei Yang created SPARK-37049: --- Summary: executorIdleTimeout is not working for pending pods on K8s Key: SPARK-37049 URL: https://issues.apache.org/jira/browse/SPARK-37049 Project: Spark Issue Type: Bug Components: Kubernetes, Spark Core Affects Versions: 3.1.0 Reporter: Weiwei Yang SPARK-33099 added support for respecting "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. However, when it checks whether a pending executor pod has timed out, it checks against the pod's "startTime". A pending pod's "startTime" is empty, which causes the function "isExecutorIdleTimedOut()" to always return true for pending pods. As a result, pending pods are deleted immediately when a stage finishes, and several new pods get recreated again in the next stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
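The bug and the natural fix can be sketched in plain Python (illustrative only; the real code lives in ExecutorPodsAllocator on the JVM side, and the 60-second timeout is an assumed value):

```python
# A pending pod has no startTime yet; treating the missing value as 0 makes
# the pod look idle since the epoch, so it is always "timed out" and gets
# deleted immediately. The fix falls back to the pod's creation time.
IDLE_TIMEOUT_S = 60

def is_idle_timed_out_buggy(start_time, now):
    return now - (start_time or 0) > IDLE_TIMEOUT_S      # None -> epoch 0

def is_idle_timed_out_fixed(start_time, creation_time, now):
    reference = start_time if start_time is not None else creation_time
    return now - reference > IDLE_TIMEOUT_S

# A pod created 5 seconds ago that is still pending (start_time is None):
now = 1000.0
assert is_idle_timed_out_buggy(None, now)                # wrongly timed out
assert not is_idle_timed_out_fixed(None, now - 5, now)   # correctly kept
```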
[jira] [Resolved] (SPARK-36945) Inline type hints for python/pyspark/sql/udf.py
[ https://issues.apache.org/jira/browse/SPARK-36945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36945. --- Fix Version/s: 3.3.0 Assignee: dch nguyen Resolution: Fixed Issue resolved by pull request 34289 https://github.com/apache/spark/pull/34289 > Inline type hints for python/pyspark/sql/udf.py > --- > > Key: SPARK-36945 > URL: https://issues.apache.org/jira/browse/SPARK-36945 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37048) Clean up inlining type hints under SQL module
[ https://issues.apache.org/jira/browse/SPARK-37048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430245#comment-17430245 ] Takuya Ueshin commented on SPARK-37048: --- I'm working on this. > Clean up inlining type hints under SQL module > - > > Key: SPARK-37048 > URL: https://issues.apache.org/jira/browse/SPARK-37048 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > Now that most of type hits under the SQL module are inlined. > We should clean up for the module now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37048) Clean up inlining type hints under SQL module
Takuya Ueshin created SPARK-37048: - Summary: Clean up inlining type hints under SQL module Key: SPARK-37048 URL: https://issues.apache.org/jira/browse/SPARK-37048 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin Now that most of the type hints under the SQL module are inlined, we should clean up the module. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36933) Reduce duplication in TaskMemoryManager.acquireExecutionMemory
[ https://issues.apache.org/jira/browse/SPARK-36933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-36933: -- Assignee: Tim Armstrong > Reduce duplication in TaskMemoryManager.acquireExecutionMemory > -- > > Key: SPARK-36933 > URL: https://issues.apache.org/jira/browse/SPARK-36933 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Tim Armstrong >Assignee: Tim Armstrong >Priority: Major > > TaskMemoryManager.acquireExecutionMemory has a bit of redundancy - code is > duplicated in the self-spilling case. It would be good to reduce the > duplication to make this more maintainable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36933) Reduce duplication in TaskMemoryManager.acquireExecutionMemory
[ https://issues.apache.org/jira/browse/SPARK-36933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-36933. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34186 [https://github.com/apache/spark/pull/34186] > Reduce duplication in TaskMemoryManager.acquireExecutionMemory > -- > > Key: SPARK-36933 > URL: https://issues.apache.org/jira/browse/SPARK-36933 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Tim Armstrong >Assignee: Tim Armstrong >Priority: Major > Fix For: 3.3.0 > > > TaskMemoryManager.acquireExecutionMemory has a bit of redundancy - code is > duplicated in the self-spilling case. It would be good to reduce the > duplication to make this more maintainable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
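The kind of deduplication described can be sketched in plain Python (a toy model, not the actual TaskMemoryManager: the original acquire path repeated the same spill-then-retry loop once for other consumers and once for the requesting consumer, so the refactor extracts it into one shared helper):

```python
# Toy memory pool and consumer; behavior-preserving extraction of the
# duplicated spill-then-retry loop into try_spill_and_acquire().
class Pool:
    def __init__(self, free):
        self.free = free
    def acquire(self, n):
        got = min(self.free, n)
        self.free -= got
        return got

class Consumer:
    def __init__(self, pool, held):
        self.pool, self.held = pool, held
    def spill(self, n):
        released = min(self.held, n)
        self.held -= released
        self.pool.free += released   # spilled memory returns to the pool
        return released

def try_spill_and_acquire(pool, spillable_consumers, requested):
    """Shared helper: take free memory, then spill consumers until satisfied."""
    granted = pool.acquire(requested)
    for consumer in spillable_consumers:
        if granted >= requested:
            break
        consumer.spill(requested - granted)
        granted += pool.acquire(requested - granted)
    return granted

pool = Pool(free=10)
others = [Consumer(pool, held=20)]
# 25 bytes requested: 10 taken from the free pool, 15 spilled from a peer.
assert try_spill_and_acquire(pool, others, 25) == 25
```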
[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA
[ https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430184#comment-17430184 ] Dongjoon Hyun commented on SPARK-33828: --- After 3.2.0 announcement, we will close this JIRA and start `Phase Two` QA, [~xkrogen]. > SQL Adaptive Query Execution QA > --- > > Key: SPARK-33828 > URL: https://issues.apache.org/jira/browse/SPARK-33828 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Critical > Labels: releasenotes > > Since SPARK-31412 is delivered at 3.0.0, we received and handled many JIRA > issues at 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it by > default and collect all information in order to do QA for this feature in > Apache Spark 3.2.0 timeframe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings
[ https://issues.apache.org/jira/browse/SPARK-37047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430140#comment-17430140 ] Apache Spark commented on SPARK-37047: -- User 'mkaravel' has created a pull request for this issue: https://github.com/apache/spark/pull/34154 > Add overloads for lpad and rpad for BINARY strings > -- > > Key: SPARK-37047 > URL: https://issues.apache.org/jira/browse/SPARK-37047 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Menelaos Karavelas >Priority: Major > > Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of > input string to be padded and padding pattern), and these strings get cast to > UTF8 strings. The result of the operation is a UTF8 string which may be > invalid as it can contain non-UTF8 characters. > What we would like to do is to overload `lpad` and `rpad` to accept BINARY > strings as inputs (both for the string to be padded and the padding pattern) > and produce a left or right padded BINARY string as output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
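The proposed BINARY semantics can be sketched with Python bytes (illustrative only; Spark's actual implementation and its edge-case behavior, e.g. for an empty pad pattern, may differ):

```python
# Pad with bytes from a byte pattern instead of casting to UTF-8, so
# arbitrary binary data survives untouched.
def lpad_binary(s: bytes, length: int, pad: bytes) -> bytes:
    if len(s) >= length:
        return s[:length]            # long inputs are truncated, as for strings
    fill = (pad * length)[: length - len(s)]
    return fill + s

def rpad_binary(s: bytes, length: int, pad: bytes) -> bytes:
    if len(s) >= length:
        return s[:length]
    return s + (pad * length)[: length - len(s)]

# Non-UTF-8 bytes (0xFF, 0xFE) round-trip instead of being mangled by a cast.
assert lpad_binary(b"\xff\xfe", 5, b"\x00") == b"\x00\x00\x00\xff\xfe"
assert rpad_binary(b"ab", 5, b"xy") == b"abxyx"
```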
[jira] [Assigned] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings
[ https://issues.apache.org/jira/browse/SPARK-37047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37047: Assignee: Apache Spark > Add overloads for lpad and rpad for BINARY strings > -- > > Key: SPARK-37047 > URL: https://issues.apache.org/jira/browse/SPARK-37047 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Menelaos Karavelas >Assignee: Apache Spark >Priority: Major > > Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of > input string to be padded and padding pattern), and these strings get cast to > UTF8 strings. The result of the operation is a UTF8 string which may be > invalid as it can contain non-UTF8 characters. > What we would like to do is to overload `lpad` and `rpad` to accept BINARY > strings as inputs (both for the string to be padded and the padding pattern) > and produce a left or right padded BINARY string as output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings
[ https://issues.apache.org/jira/browse/SPARK-37047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37047: Assignee: (was: Apache Spark) > Add overloads for lpad and rpad for BINARY strings > -- > > Key: SPARK-37047 > URL: https://issues.apache.org/jira/browse/SPARK-37047 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Menelaos Karavelas >Priority: Major > > Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of > input string to be padded and padding pattern), and these strings get cast to > UTF8 strings. The result of the operation is a UTF8 string which may be > invalid as it can contain non-UTF8 characters. > What we would like to do is to overload `lpad` and `rpad` to accept BINARY > strings as inputs (both for the string to be padded and the padding pattern) > and produce a left or right padded BINARY string as output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36886) Inline type hints for python/pyspark/sql/context.py
[ https://issues.apache.org/jira/browse/SPARK-36886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36886. --- Fix Version/s: 3.3.0 Assignee: dch nguyen Resolution: Fixed Issue resolved by pull request 34185 https://github.com/apache/spark/pull/34185 > Inline type hints for python/pyspark/sql/context.py > --- > > Key: SPARK-36886 > URL: https://issues.apache.org/jira/browse/SPARK-36886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for python/pyspark/sql/context.py from Inline type hints > for python/pyspark/sql/context.pyi. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES
[ https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430125#comment-17430125 ] Erik Krogen edited comment on SPARK-37027 at 10/18/21, 6:10 PM: [~yuzhousun] actually this is already fixed by SPARK-28266 in 3.1.3 and 3.2.0. Can you try compiling from latest {{master}} or using the 3.2.0 binaries (not yet officially released but [can be found on Maven Central|https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:3.2.0]). was (Author: xkrogen): [~yuzhousun] actually this is already fixed by SPARK-28266 in 3.1.3 and 3.2.0. Can you try compiling from latest {{master}} or using the 3.2.0 binaries (not yet officially released but [https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:3.2.0|can be found on Maven Central]). > Fix behavior inconsistent in Hive table when ‘path’ is provided in > SERDEPROPERTIES > -- > > Key: SPARK-37027 > URL: https://issues.apache.org/jira/browse/SPARK-37027 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5, 3.1.2 >Reporter: Yuzhou Sun >Priority: Trivial > Attachments: SPARK-37027-test-example.patch > > > If a Hive table is created with both {{WITH SERDEPROPERTIES > ('path'='<tableLocation>')}} and {{LOCATION '<tableLocation>'}}, Spark can > return doubled rows when reading the table. This issue seems to be an > extension of SPARK-30507.
> Reproduce steps: > # Create table and insert records via Hive (Spark doesn't allow to insert > into table like this) > {code:sql} > CREATE TABLE `test_table`( > `c1` LONG, > `c2` STRING) > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > WITH SERDEPROPERTIES ('path'='<tableLocation>') > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION '<tableLocation>'; > INSERT INTO TABLE `test_table` > VALUES (0, '0'); > SELECT * FROM `test_table`; > -- will return > -- 0 0 > {code} > # Read above table from Spark > {code:sql} > SELECT * FROM `test_table`; > -- will return > -- 0 0 > -- 0 0 > {code} > But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark will > return the same result as Hive (i.e. a single row) > A similar case is that, if a Hive table is created with both {{WITH > SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION '<tableLocation>'}}, > Spark will read both rows under {{anotherPath}} and rows under > {{tableLocation}}, regardless of {{spark.sql.hive.convertMetastoreParquet}}'s value. However, Hive seems to return only rows under > {{tableLocation}}. > Another similar case is that, if {{path}} is provided in {{TBLPROPERTIES}}, > Spark won't double the rows when {{'path'='<tableLocation>'}}. If > {{'path'='<anotherPath>'}}, Spark will read both rows under {{anotherPath}} > and rows under {{tableLocation}}, while Hive seems to keep ignoring the {{path}} in > {{TBLPROPERTIES}}. > Code examples for the above cases (a diff patch written in > {{HiveParquetMetastoreSuite.scala}}) can be found in Attachments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES
[ https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430125#comment-17430125 ] Erik Krogen commented on SPARK-37027: - [~yuzhousun] actually this is already fixed by SPARK-28266 in 3.1.3 and 3.2.0. Can you try compiling from latest {{master}} or using the 3.2.0 binaries (not yet officially released but [https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:3.2.0|can be found on Maven Central]). > Fix behavior inconsistent in Hive table when ‘path’ is provided in > SERDEPROPERTIES > -- > > Key: SPARK-37027 > URL: https://issues.apache.org/jira/browse/SPARK-37027 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5, 3.1.2 >Reporter: Yuzhou Sun >Priority: Trivial > Attachments: SPARK-37027-test-example.patch > > > If a Hive table is created with both {{WITH SERDEPROPERTIES > ('path'='<tableLocation>')}} and {{LOCATION '<tableLocation>'}}, Spark can > return doubled rows when reading the table. This issue seems to be an > extension of SPARK-30507. > Reproduce steps: > # Create table and insert records via Hive (Spark doesn't allow to insert > into table like this) > {code:sql} > CREATE TABLE `test_table`( > `c1` LONG, > `c2` STRING) > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > WITH SERDEPROPERTIES ('path'='<tableLocation>') > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > LOCATION '<tableLocation>'; > INSERT INTO TABLE `test_table` > VALUES (0, '0'); > SELECT * FROM `test_table`; > -- will return > -- 0 0 > {code} > # Read above table from Spark > {code:sql} > SELECT * FROM `test_table`; > -- will return > -- 0 0 > -- 0 0 > {code} > But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark will > return same result as Hive (i.e. 
single row) > A similar case is that, if a Hive table is created with both {{WITH > SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION '<tableLocation>'}}, > Spark will read both rows under {{anotherPath}} and rows under > {{tableLocation}}, regardless of {{spark.sql.hive.convertMetastoreParquet}}'s value. However, Hive seems to return only rows under > {{tableLocation}}. > Another similar case is that, if {{path}} is provided in {{TBLPROPERTIES}}, > Spark won't double the rows when {{'path'='<tableLocation>'}}. If > {{'path'='<anotherPath>'}}, Spark will read both rows under {{anotherPath}} > and rows under {{tableLocation}}, while Hive seems to keep ignoring the {{path}} in > {{TBLPROPERTIES}}. > Code examples for the above cases (a diff patch written in > {{HiveParquetMetastoreSuite.scala}}) can be found in Attachments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA
[ https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430124#comment-17430124 ] Erik Krogen commented on SPARK-33828: - [~dongjoon] as you mentioned above, this epic was initially for collecting issues to be resolved in 3.2.0. That release has now been finalized, but we still have a few open issues here, and there are still new AQE issues being created (e.g. SPARK-37043 just today). Shall we keep this epic open and continue to use it, or create a new one targeted for 3.3.0? Or any other suggestions? > SQL Adaptive Query Execution QA > --- > > Key: SPARK-33828 > URL: https://issues.apache.org/jira/browse/SPARK-33828 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Critical > Labels: releasenotes > > Since SPARK-31412 was delivered in 3.0.0, we have received and handled many JIRA > issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it by > default and collect all information in order to do QA for this feature in > the Apache Spark 3.2.0 timeframe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings
Menelaos Karavelas created SPARK-37047: -- Summary: Add overloads for lpad and rpad for BINARY strings Key: SPARK-37047 URL: https://issues.apache.org/jira/browse/SPARK-37047 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Menelaos Karavelas Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of input string to be padded and padding pattern), and these strings get cast to UTF8 strings. The result of the operation is a UTF8 string which may be invalid as it can contain non-UTF8 characters. What we would like to do is to overload `lpad` and `rpad` to accept BINARY strings as inputs (both for the string to be padded and the padding pattern) and produce a left- or right-padded BINARY string as output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37046) Alter view does not preserve column case
[ https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37046: Assignee: (was: Apache Spark) > Alter view does not preserve column case > > > Key: SPARK-37046 > URL: https://issues.apache.org/jira/browse/SPARK-37046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Abhishek Somani >Priority: Major > > On running an `alter view` command, the column case is not preserved. > Repro: > > {code:java} > scala> sql("create view v as select 1 as A, 1 as B") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [A,int,null] > [B,int,null] > scala> sql("alter view v as select 1 as C, 1 as D") > res4: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [c,int,null] > [d,int,null] > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37046) Alter view does not preserve column case
[ https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37046: Assignee: Apache Spark > Alter view does not preserve column case > > > Key: SPARK-37046 > URL: https://issues.apache.org/jira/browse/SPARK-37046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Abhishek Somani >Assignee: Apache Spark >Priority: Major > > On running an `alter view` command, the column case is not preserved. > Repro: > > {code:java} > scala> sql("create view v as select 1 as A, 1 as B") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [A,int,null] > [B,int,null] > scala> sql("alter view v as select 1 as C, 1 as D") > res4: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [c,int,null] > [d,int,null] > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37046) Alter view does not preserve column case
[ https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430114#comment-17430114 ] Apache Spark commented on SPARK-37046: -- User 'somani' has created a pull request for this issue: https://github.com/apache/spark/pull/34317 > Alter view does not preserve column case > > > Key: SPARK-37046 > URL: https://issues.apache.org/jira/browse/SPARK-37046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Abhishek Somani >Priority: Major > > On running an `alter view` command, the column case is not preserved. > Repro: > > {code:java} > scala> sql("create view v as select 1 as A, 1 as B") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [A,int,null] > [B,int,null] > scala> sql("alter view v as select 1 as C, 1 as D") > res4: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [c,int,null] > [d,int,null] > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37046) Alter view does not preserve column case
[ https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430105#comment-17430105 ] Abhishek Somani commented on SPARK-37046: - I'll raise a PR soon > Alter view does not preserve column case > > > Key: SPARK-37046 > URL: https://issues.apache.org/jira/browse/SPARK-37046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Abhishek Somani >Priority: Major > > On running an `alter view` command, the column case is not preserved. > Repro: > > {code:java} > scala> sql("create view v as select 1 as A, 1 as B") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [A,int,null] > [B,int,null] > scala> sql("alter view v as select 1 as C, 1 as D") > res4: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [c,int,null] > [d,int,null] > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37046) Alter view does not preserve column case
[ https://issues.apache.org/jira/browse/SPARK-37046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Somani updated SPARK-37046: Shepherd: Wenchen Fan > Alter view does not preserve column case > > > Key: SPARK-37046 > URL: https://issues.apache.org/jira/browse/SPARK-37046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Abhishek Somani >Priority: Major > > On running an `alter view` command, the column case is not preserved. > Repro: > > {code:java} > scala> sql("create view v as select 1 as A, 1 as B") > res2: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [A,int,null] > [B,int,null] > scala> sql("alter view v as select 1 as C, 1 as D") > res4: org.apache.spark.sql.DataFrame = [] > scala> sql("describe v").collect.foreach(println) > [c,int,null] > [d,int,null] > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37046) Alter view does not preserve column case
Abhishek Somani created SPARK-37046: --- Summary: Alter view does not preserve column case Key: SPARK-37046 URL: https://issues.apache.org/jira/browse/SPARK-37046 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Abhishek Somani On running an `alter view` command, the column case is not preserved. Repro: {code:java} scala> sql("create view v as select 1 as A, 1 as B") res2: org.apache.spark.sql.DataFrame = [] scala> sql("describe v").collect.foreach(println) [A,int,null] [B,int,null] scala> sql("alter view v as select 1 as C, 1 as D") res4: org.apache.spark.sql.DataFrame = [] scala> sql("describe v").collect.foreach(println) [c,int,null] [d,int,null] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430099#comment-17430099 ] pralabhkumar commented on SPARK-32161: -- [~hyukjin.kwon] Since the PR has been merged, please change the status of the Jira and assign it to me. > Hide JVM traceback for SparkUpgradeException > > > Key: SPARK-32161 > URL: https://issues.apache.org/jira/browse/SPARK-32161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We added {{SparkUpgradeException}}, for which the JVM traceback is pretty useless. > See also https://github.com/apache/spark/pull/28736/files#r449184881 > It would be better to also whitelist this exception and hide its JVM traceback. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
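The idea behind hiding the JVM traceback can be sketched in plain Python: catch the error surfacing from the JVM bridge and re-raise a Python-level exception with `from None`, which suppresses the chained (JVM-flavored) traceback. All names below are illustrative stand-ins, not PySpark's actual internals:

```python
class SparkUpgradeError(Exception):
    """Stand-in for the Python-side SparkUpgradeException."""

def jvm_call():
    # Stand-in for a Py4J call that fails on the JVM side with a long stack trace.
    raise RuntimeError("SparkUpgradeException: you may get a different result "
                       "due to the upgrading of Spark ...")

def friendly_call():
    try:
        jvm_call()
    except RuntimeError as e:
        # `from None` hides the chained traceback, so the user only
        # sees the concise Python-level error message.
        raise SparkUpgradeError(str(e)) from None
```

Whitelisting then amounts to applying this wrapping only for recognized exception class names, leaving genuinely unexpected JVM errors fully visible.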
[jira] [Updated] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
[ https://issues.apache.org/jira/browse/SPARK-37045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-37045: - Description: Extract ALTER TABLE .. ADD COLUMNS tests to the common place to run them for V1 and v2 datasources. Some tests can be placed in V1 and V2 specific test suites. (was: Extract DESCRIBE NAMESPACE tests to the common place to run them for V1 and v2 datasources. Some tests can be placed in V1 and V2 specific test suites.) > Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests > > > Key: SPARK-37045 > URL: https://issues.apache.org/jira/browse/SPARK-37045 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > Extract ALTER TABLE .. ADD COLUMNS tests to the common place to run them for > V1 and v2 datasources. Some tests can be placed in V1 and V2 specific test > suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
Max Gekk created SPARK-37045: Summary: Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests Key: SPARK-37045 URL: https://issues.apache.org/jira/browse/SPARK-37045 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Terry Kim Extract DESCRIBE NAMESPACE tests to the common place to run them for V1 and v2 datasources. Some tests can be placed in V1 and V2 specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37044) Add Row to __all__ in pyspark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430059#comment-17430059 ] Maciej Szymkiewicz commented on SPARK-37044: cc [~hyukjin.kwon] > Add Row to __all__ in pyspark.sql.types > --- > > Key: SPARK-37044 > URL: https://issues.apache.org/jira/browse/SPARK-37044 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.0, 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from > {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. It means that {{from pyspark.sql.types import > *}} won't import {{Row}}. > It might be counter-intuitive, especially when we import {{Row}} from > {{types}} in {{examples}}. > Should we add it to {{__all__}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37044) Add Row to __all__ in pyspark.sql.types
Maciej Szymkiewicz created SPARK-37044: -- Summary: Add Row to __all__ in pyspark.sql.types Key: SPARK-37044 URL: https://issues.apache.org/jira/browse/SPARK-37044 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.2.0, 3.1.0, 3.3.0 Reporter: Maciej Szymkiewicz Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. It means that {{from pyspark.sql.types import *}} won't import {{Row}}. It might be counter-intuitive, especially when we import {{Row}} from {{types}} in {{examples}}. Should we add it to {{__all__}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
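The effect of the missing `__all__` entry is easy to demonstrate with a toy module; `fake_types` below is a stand-in for illustration, not the real `pyspark.sql.types`:

```python
import sys
import types

# Build a toy module whose __all__ omits Row, like pyspark.sql.types today.
mod = types.ModuleType("fake_types")
exec("__all__ = ['IntegerType']\n"
     "class IntegerType: pass\n"
     "class Row: pass\n", mod.__dict__)
sys.modules["fake_types"] = mod

ns = {}
exec("from fake_types import *", ns)
assert "IntegerType" in ns   # exported via __all__
assert "Row" not in ns       # star-import skips names outside __all__
```

Adding `'Row'` to the module's `__all__` list is all it takes for the star-import to pick it up.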
[jira] [Assigned] (SPARK-35925) Support DayTimeIntervalType in width-bucket function
[ https://issues.apache.org/jira/browse/SPARK-35925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-35925: Assignee: PengLei > Support DayTimeIntervalType in width-bucket function > > > Key: SPARK-35925 > URL: https://issues.apache.org/jira/browse/SPARK-35925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > > Currently, width_bucket supports the types [DoubleType, DoubleType, DoubleType, > LongType]; > we hope to also support [DayTimeIntervalType, DayTimeIntervalType, > DayTimeIntervalType, LongType] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35925) Support DayTimeIntervalType in width-bucket function
[ https://issues.apache.org/jira/browse/SPARK-35925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35925. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34309 [https://github.com/apache/spark/pull/34309] > Support DayTimeIntervalType in width-bucket function > > > Key: SPARK-35925 > URL: https://issues.apache.org/jira/browse/SPARK-35925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0 > > > Currently, width_bucket supports the types [DoubleType, DoubleType, DoubleType, > LongType]; > we hope to also support [DayTimeIntervalType, DayTimeIntervalType, > DayTimeIntervalType, LongType] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
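For reference, `width_bucket(v, lo, hi, n)` assigns `v` to one of `n` equal-width buckets (returning 0 below the range and `n + 1` at or above it), so supporting DayTimeIntervalType is the same arithmetic applied to interval durations. A plain-Python sketch of the semantics, assuming `lo < hi` and `n > 0`; this is not Spark's implementation:

```python
from datetime import timedelta

def width_bucket(v, lo, hi, n):
    """Equal-width histogram bucket index: 1..n in range, 0 / n+1 out of range."""
    if v < lo:
        return 0
    if v >= hi:
        return n + 1
    return int((v - lo) / (hi - lo) * n) + 1

# The same code works for day-time intervals, because timedelta supports
# subtraction and division just like a numeric type:
bucket = width_bucket(timedelta(days=2), timedelta(0), timedelta(days=10), 5)
```

This illustrates why the extension is natural: the interval type only needs subtraction and division to plug into the existing bucket formula.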
[jira] [Issue Comment Deleted] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-36231: Comment: was deleted (was: https://github.com/apache/spark/pull/34314) > Support arithmetic operations of Series containing Decimal(np.nan) > --- > > Key: SPARK-36231 > URL: https://issues.apache.org/jira/browse/SPARK-36231 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > Arithmetic operations of Series containing Decimal(np.nan) raise > java.lang.NullPointerException in driver. An example is shown as below: > {code:java} > >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), > >>> decimal.Decimal(np.nan)]) > >>> psser = ps.from_pandas(pser) > >>> pser + 1 > 0 2 > 1 3 > 2 NaN > >>> psser + 1 > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > 
org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107) > at > org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at >
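For context on the pandas-side behavior the fix should reproduce: `decimal.Decimal` has a quiet NaN that propagates through arithmetic instead of raising, which is why `pser + 1` above returns NaN while the PySpark equivalent hits a NullPointerException. A stdlib-only illustration:

```python
from decimal import Decimal

# Decimal accepts float('nan') and produces a quiet Decimal('NaN').
x = Decimal(float("nan"))

# Arithmetic propagates NaN rather than raising, matching the pandas
# result (pser + 1 -> NaN) that the pandas-on-Spark operation should mirror.
y = x + 1
```

Both `x` and `y` here are `Decimal('NaN')`; the quiet NaN never signals `InvalidOperation` under the default decimal context.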
[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429997#comment-17429997 ] Apache Spark commented on SPARK-36231: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34314 > Support arithmetic operations of Series containing Decimal(np.nan) > --- > > Key: SPARK-36231 > URL: https://issues.apache.org/jira/browse/SPARK-36231 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > Arithmetic operations of Series containing Decimal(np.nan) raise > java.lang.NullPointerException in driver. An example is shown as below: > {code:java} > >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), > >>> decimal.Decimal(np.nan)]) > >>> psser = ps.from_pandas(pser) > >>> pser + 1 > 0 2 > 1 3 > 2 NaN > >>> psser + 1 > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446) > at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > 
org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107) > at > org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68) > Caused by: java.lang.NullPointerException > at >
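The NullPointerException above is raised on the JVM side; on the Python side, `decimal.Decimal` represents NaN natively and propagates it through arithmetic, which is what the pandas output above (2, 3, NaN) reflects. A minimal standard-library sketch of the value involved (no Spark required):

```python
import decimal

# decimal.Decimal accepts float('nan') and produces a quiet NaN value.
nan_dec = decimal.Decimal(float("nan"))
print(nan_dec.is_nan())  # True

# Arithmetic on a quiet Decimal NaN propagates NaN instead of raising,
# matching the pandas behavior shown in the issue (2, 3, NaN).
result = nan_dec + 1
print(result.is_nan())  # True
```

The bug is therefore purely in how the JVM side handles the NaN value it receives, not in the Python representation.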
[jira] [Assigned] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36231: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429996#comment-17429996 ] Apache Spark commented on SPARK-36231: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34314
[jira] [Assigned] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36231: Assignee: Apache Spark
[jira] [Commented] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429994#comment-17429994 ] Yikun Jiang commented on SPARK-36231: - https://github.com/apache/spark/pull/34314
[jira] [Assigned] (SPARK-37043) Cancel all running job after AQE plan finished
[ https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37043: Assignee: (was: Apache Spark) > Cancel all running job after AQE plan finished > -- > > Key: SPARK-37043 > URL: https://issues.apache.org/jira/browse/SPARK-37043 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > We saw a stage still running after the AQE plan had finished. This happens because a plan containing an empty join is converted to `LocalTableScanExec` during `AQEOptimizer`, while the other side of that join (a shuffle map stage) is still running. > > There is no point in keeping that stage running; it is better to cancel it after the AQE plan finishes, to avoid wasting task resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37043) Cancel all running job after AQE plan finished
[ https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37043: Assignee: Apache Spark
[jira] [Commented] (SPARK-36337) Decimal('NaN') is unsupported in net.razorvine.pickle
[ https://issues.apache.org/jira/browse/SPARK-36337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429987#comment-17429987 ] Apache Spark commented on SPARK-36337: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34314 > Decimal('NaN') is unsupported in net.razorvine.pickle > -- > > Key: SPARK-36337 > URL: https://issues.apache.org/jira/browse/SPARK-36337 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > > Decimal('NaN') is not currently supported by net.razorvine.pickle. > In Python: > {code:java} > >>> pickled = cloudpickle.dumps(decimal.Decimal('NaN')) > b'\x80\x05\x95!\x00\x00\x00\x00\x00\x00\x00\x8c\x07decimal\x94\x8c\x07Decimal\x94\x93\x94\x8c\x03NaN\x94\x85\x94R\x94.' > >>> pickle.loads(pickled) > Decimal('NaN') > {code} > In Scala: > {code:java} > scala> import net.razorvine.pickle.{Pickler, Unpickler, PickleUtils} > scala> val unpickle = new Unpickler > scala> unpickle.loads(PickleUtils.str2bytes("\u0080\u0005\u0095!\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u008c\u0007decimal\u0094\u008c\u0007Decimal\u0094\u0093\u0094\u008c\u0003NaN\u0094\u0085\u0094R\u0094.")) > net.razorvine.pickle.PickleException: problem construction object: java.lang.reflect.InvocationTargetException > at net.razorvine.pickle.objects.AnyClassConstructor.construct(AnyClassConstructor.java:29) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:213) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:123) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:136) > ... 48 elided > {code} > I submitted an issue in the pickle upstream: [https://github.com/irmen/pickle/issues/7] . > We should bump pickle to the latest version after it is fixed.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
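For contrast with the failing Scala unpickler above, the round trip succeeds in Python with the standard pickle module alone (a minimal sketch; no Spark or cloudpickle required):

```python
import decimal
import pickle

# Decimal pickles via its __reduce__ hook, which records the string form,
# so Decimal('NaN') survives a dumps/loads round trip in pure Python.
payload = pickle.dumps(decimal.Decimal("NaN"))
restored = pickle.loads(payload)
print(restored.is_nan())  # True
```

This is why the fix belongs on the JVM unpickler (net.razorvine.pickle) rather than on the Python side.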
[jira] [Commented] (SPARK-37043) Cancel all running job after AQE plan finished
[ https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429988#comment-17429988 ] Apache Spark commented on SPARK-37043: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34316
[jira] [Comment Edited] (SPARK-37039) np.nan series.astype(bool) should be True
[ https://issues.apache.org/jira/browse/SPARK-37039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429973#comment-17429973 ] Yikun Jiang edited comment on SPARK-37039 at 10/18/21, 12:27 PM: - Looks like there are different behaviors for different types... was (Author: yikunkero): working on this > np.nan series.astype(bool) should be True > - > > Key: SPARK-37039 > URL: https://issues.apache.org/jira/browse/SPARK-37039 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major > > np.nan series.astype(bool) should be True, rather than False: > https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147 > >>> pd.Series([1, 2, np.nan], dtype=float).astype(bool) > >>> pd.Series([1, 2, np.nan], dtype=str).astype(bool) > >>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan]) > 0 True > 1 True > 2 True > dtype: bool > But in pyspark, it is: > 0 True > 1 True > 2 False > dtype: bool -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37043) Cancel all running job after AQE plan finished
[ https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-37043: -- Description: We saw a stage still running after the AQE plan had finished. This happens because a plan containing an empty join is converted to `LocalTableScanExec` during `AQEOptimizer`, while the other side of that join (a shuffle map stage) is still running. There is no point in keeping that stage running; it is better to cancel it after the AQE plan finishes, to avoid wasting task resources. was: We saw a stage still running after the AQE plan had finished. This happens because a plan containing an empty join is converted to `LocalTableScanExec` during `AQEOptimizer`, while the other side of that join (a shuffle map stage) is still running. It is better to cancel the running stage after the AQE plan finishes, to avoid wasting task resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37043) Cancel all running job after AQE plan finished
XiDuo You created SPARK-37043: - Summary: Cancel all running job after AQE plan finished Key: SPARK-37043 URL: https://issues.apache.org/jira/browse/SPARK-37043 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You We saw a stage still running after the AQE plan had finished. This happens because a plan containing an empty join is converted to `LocalTableScanExec` during `AQEOptimizer`, while the other side of that join (a shuffle map stage) is still running. It is better to cancel the running stage after the AQE plan finishes, to avoid wasting task resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37039) np.nan series.astype(bool) should be True
[ https://issues.apache.org/jira/browse/SPARK-37039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429973#comment-17429973 ] Yikun Jiang commented on SPARK-37039: - working on this -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
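The behavior expected in SPARK-37039 follows from Python truthiness: NaN is a non-zero float, so converting it to bool yields True, and pandas' astype(bool) applies the same rule elementwise. A minimal standard-library sketch of the rule:

```python
# NaN is a non-zero float value, so it is truthy; this is why pandas'
# astype(bool) maps np.nan to True rather than False.
nan = float("nan")
print(bool(nan))  # True

# Applying bool elementwise mirrors what astype(bool) does to the values.
values = [1.0, 2.0, nan]
print([bool(v) for v in values])  # [True, True, True]
```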
[jira] [Updated] (SPARK-37042) Inline type hints for kinesis.py and listener.py in python/pyspark/streaming
[ https://issues.apache.org/jira/browse/SPARK-37042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dch nguyen updated SPARK-37042: --- Summary: Inline type hints for kinesis.py and listener.py in python/pyspark/streaming (was: Inline type hints for python/pyspark/streaming/kinesis.py) > Inline type hints for kinesis.py and listener.py in python/pyspark/streaming > > > Key: SPARK-37042 > URL: https://issues.apache.org/jira/browse/SPARK-37042 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36965) Extend python test runner by logging out the temp output files
[ https://issues.apache.org/jira/browse/SPARK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros resolved SPARK-36965. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34233 [https://github.com/apache/spark/pull/34233] > Extend python test runner by logging out the temp output files > -- > > Key: SPARK-36965 > URL: https://issues.apache.org/jira/browse/SPARK-36965 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.3.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Minor > Fix For: 3.3.0 > > > I was running a Python test that was extremely slow, and I was surprised that unit-tests.log had not even been created. Looking into the code, I learned that the tests can be executed in parallel and that each one has its own temporary output file, which is only appended to unit-tests.log when a test finishes with a failure (after acquiring a lock to avoid parallel writes to unit-tests.log). > To avoid such confusion, it makes sense to log the paths of those temporary output files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
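The change described in SPARK-36965 can be sketched roughly as follows (assumed names, not the actual PySpark test-runner code): each parallel test writes to its own temporary file, and logging that path up front lets you tail a slow test's output before it finishes and is merged into unit-tests.log.

```python
import logging
import tempfile

logging.basicConfig(level=logging.INFO, format="%(message)s")

def start_test(name):
    # Each parallel test gets its own temporary output file; logging the
    # path immediately makes the output findable while the test runs.
    out = tempfile.NamedTemporaryFile(
        prefix=f"{name}-", suffix=".log", delete=False, mode="w"
    )
    logging.info("Temporary output for %s: %s", name, out.name)
    return out
```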
[jira] [Created] (SPARK-37042) Inline type hints for python/pyspark/streaming/kinesis.py
dch nguyen created SPARK-37042: -- Summary: Inline type hints for python/pyspark/streaming/kinesis.py Key: SPARK-37042 URL: https://issues.apache.org/jira/browse/SPARK-37042 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: dch nguyen -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37042) Inline type hints for python/pyspark/streaming/kinesis.py
[ https://issues.apache.org/jira/browse/SPARK-37042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429918#comment-17429918 ] dch nguyen commented on SPARK-37042: I am working on this > Inline type hints for python/pyspark/streaming/kinesis.py > - > > Key: SPARK-37042 > URL: https://issues.apache.org/jira/browse/SPARK-37042 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
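For context, "inlining" type hints means moving signatures from separate `.pyi` stub files into annotations written directly in the `.py` module. An illustrative before/after sketch (the signature below is a simplified stand-in, not the real `pyspark.streaming.kinesis` API):

```python
from typing import Callable, Optional

# Before: the annotations lived in a stub file (kinesis.pyi).
# After inlining, they appear on the definition in kinesis.py itself,
# so type checkers and readers see one source of truth.
def create_stream(
    stream_name: str,
    decoder: Optional[Callable[[bytes], str]] = None,
) -> list:
    """Toy stand-in illustrating the inline-annotation style."""
    payload = b"record"
    decoded = decoder(payload) if decoder is not None else payload
    return [decoded]
```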
[jira] [Resolved] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36978. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34263 [https://github.com/apache/spark/pull/34263] > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > Fix For: 3.3.0 > > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand, a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}}. > > The new constraints also create opportunities for nested pruning. Currently the > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However, the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}.
[jira] [Assigned] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36978: --- Assignee: Utkarsh Agarwal > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand, a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}}. > > The new constraints also create opportunities for nested pruning. Currently the > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However, the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}.
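The behavior change in SPARK-36978 can be sketched in toy form: given the attribute referenced by a null-intolerant predicate, the old rule constrains the root struct column, while the new rule constrains the exact nested field. A pure-Python illustration (not the Catalyst implementation):

```python
def infer_is_not_null(referenced_attr, on_root=False):
    """Return the IsNotNull constraint for a null-intolerant predicate.

    on_root=True mimics the old behavior: constrain the root column,
    which can force materialization of the whole struct. The default
    mimics the new behavior: constrain only the referenced nested
    field, leaving the rest of the struct open to nested-column
    pruning."""
    target = referenced_attr.split(".")[0] if on_root else referenced_attr
    return "IsNotNull(" + target + ")"
```

For the predicate `structCol.b > 0`, the old rule yields `IsNotNull(structCol)` and the new one `IsNotNull(structCol.b)`.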
[jira] [Commented] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
[ https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429875#comment-17429875 ] Apache Spark commented on SPARK-37041: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/34312 > Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS > -- > > Key: SPARK-37041 > URL: https://issues.apache.org/jira/browse/SPARK-37041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier to > upgrade Thrift. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using Java 8 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429874#comment-17429874 ] Apache Spark commented on SPARK-37013: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/34313 > `select format_string('%0$s', 'Hello')` has different behavior when using > java 8 and Java 17 > > > Key: SPARK-37013 > URL: https://issues.apache.org/jira/browse/SPARK-37013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > --PostgreSQL throw ERROR: format specifies argument 0, but arguments are > numbered from 1 > select format_string('%0$s', 'Hello'); > {code} > Execute with Java 8 > {code:java} > -- !query > select format_string('%0$s', 'Hello') > -- !query schema > struct > -- !query output > Hello > {code} > Execute with Java 17 > {code:java} > -- !query > select format_string('%0$s', 'Hello') > -- !query schema > struct<> > -- !query output > java.util.IllegalFormatArgumentIndexException > Illegal format argument index = 0 > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
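Java 8 silently tolerates the out-of-range index while Java 17 throws at format time; PostgreSQL instead rejects the pattern up front with "format specifies argument 0, but arguments are numbered from 1". A pure-Python sketch of that up-front validation (an illustration of the idea, not Spark's actual fix):

```python
import re

# Matches explicit argument indices such as the '0' in '%0$s'.
ARG_INDEX = re.compile(r"%(\d+)\$")

def check_format_indices(fmt):
    """Raise if any explicit argument index is below 1, mirroring the
    PostgreSQL error message quoted in the issue."""
    for m in ARG_INDEX.finditer(fmt):
        idx = int(m.group(1))
        if idx < 1:
            raise ValueError(
                "format specifies argument %d, but arguments are "
                "numbered from 1" % idx)

check_format_indices("%1$s")  # valid: explicit indices start at 1
```

Validating at analysis time gives one consistent error regardless of which JDK runs the executor.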
[jira] [Assigned] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
[ https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37041: Assignee: (was: Apache Spark) > Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS > -- > > Key: SPARK-37041 > URL: https://issues.apache.org/jira/browse/SPARK-37041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier to > upgrade Thrift. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
[ https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37041: Assignee: Apache Spark > Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS > -- > > Key: SPARK-37041 > URL: https://issues.apache.org/jira/browse/SPARK-37041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier to > upgrade Thrift. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using Java 8 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37013: Assignee: (was: Apache Spark) > `select format_string('%0$s', 'Hello')` has different behavior when using > java 8 and Java 17 > > > Key: SPARK-37013 > URL: https://issues.apache.org/jira/browse/SPARK-37013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > --PostgreSQL throw ERROR: format specifies argument 0, but arguments are > numbered from 1 > select format_string('%0$s', 'Hello'); > {code} > Execute with Java 8 > {code:java} > -- !query > select format_string('%0$s', 'Hello') > -- !query schema > struct > -- !query output > Hello > {code} > Execute with Java 17 > {code:java} > -- !query > select format_string('%0$s', 'Hello') > -- !query schema > struct<> > -- !query output > java.util.IllegalFormatArgumentIndexException > Illegal format argument index = 0 > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using Java 8 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37013: Assignee: Apache Spark > `select format_string('%0$s', 'Hello')` has different behavior when using > java 8 and Java 17 > > > Key: SPARK-37013 > URL: https://issues.apache.org/jira/browse/SPARK-37013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > {code:java} > --PostgreSQL throw ERROR: format specifies argument 0, but arguments are > numbered from 1 > select format_string('%0$s', 'Hello'); > {code} > Execute with Java 8 > {code:java} > -- !query > select format_string('%0$s', 'Hello') > -- !query schema > struct > -- !query output > Hello > {code} > Execute with Java 17 > {code:java} > -- !query > select format_string('%0$s', 'Hello') > -- !query schema > struct<> > -- !query output > java.util.IllegalFormatArgumentIndexException > Illegal format argument index = 0 > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using Java 8 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-37013: - Description: {code:java} --PostgreSQL throw ERROR: format specifies argument 0, but arguments are numbered from 1 select format_string('%0$s', 'Hello'); {code} Execute with Java 8 {code:java} -- !query select format_string('%0$s', 'Hello') -- !query schema struct -- !query output Hello {code} Execute with Java 17 {code:java} -- !query select format_string('%0$s', 'Hello') -- !query schema struct<> -- !query output java.util.IllegalFormatArgumentIndexException Illegal format argument index = 0 {code} was: {code:java} --PostgreSQL throw ERROR: format specifies argument 0, but arguments are numbered from 1 select format_string('%0$s', 'Hello'); {code} Execute with Java 8 {code:java} -- !query select format_string('%0$s', 'Hello') -- !query schema struct -- !query output Hello {code} Execute with Java 11 {code:java} -- !query select format_string('%0$s', 'Hello') -- !query schema struct<> -- !query output java.util.IllegalFormatArgumentIndexException Illegal format argument index = 0 {code} > `select format_string('%0$s', 'Hello')` has different behavior when using > java 8 and Java 17 > > > Key: SPARK-37013 > URL: https://issues.apache.org/jira/browse/SPARK-37013 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > --PostgreSQL throw ERROR: format specifies argument 0, but arguments are > numbered from 1 > select format_string('%0$s', 'Hello'); > {code} > Execute with Java 8 > {code:java} > -- !query > select format_string('%0$s', 'Hello') > -- !query schema > struct > -- !query output > Hello > {code} > Execute with Java 17 > {code:java} > -- !query > select format_string('%0$s', 'Hello') > -- !query schema > struct<> > -- !query output > java.util.IllegalFormatArgumentIndexException > Illegal format argument index = 0 > {code} > -- 
[jira] [Created] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
Yuming Wang created SPARK-37041: --- Summary: Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS Key: SPARK-37041 URL: https://issues.apache.org/jira/browse/SPARK-37041 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier to upgrade Thrift. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37039) np.nan series.astype(bool) should be True
[ https://issues.apache.org/jira/browse/SPARK-37039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-37039: Parent: (was: SPARK-36000) Issue Type: Bug (was: Sub-task) > np.nan series.astype(bool) should be True > - > > Key: SPARK-37039 > URL: https://issues.apache.org/jira/browse/SPARK-37039 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major > > np.nan series.astype(bool) should be True, rather than False: > https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147 > >>> pd.Series([1, 2, np.nan], dtype=float).astype(bool) > >>> pd.Series([1, 2, np.nan], dtype=str).astype(bool) > >>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan]) > 0 True > 1 True > 2 True > dtype: bool > But in pyspark, it is: > 0 True > 1 True > 2 False > dtype: bool -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37039) np.nan series.astype(bool) should be True
[ https://issues.apache.org/jira/browse/SPARK-37039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-37039: Description: np.nan series.astype(bool) should be True, rather than False: https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147 >>> pd.Series([1, 2, np.nan], dtype=float).astype(bool) >>> pd.Series([1, 2, np.nan], dtype=str).astype(bool) >>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan]) 0 True 1 True 2 True dtype: bool But in pyspark, it is: 0 True 1 True 2 False dtype: bool was: np.nan series.astype(bool) should be True, rather than False: https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147 > np.nan series.astype(bool) should be True > - > > Key: SPARK-37039 > URL: https://issues.apache.org/jira/browse/SPARK-37039 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major > > np.nan series.astype(bool) should be True, rather than False: > https://github.com/apache/spark/blob/46bcef7472edd40c23afd9ac74cffe13c6a608ad/python/pyspark/pandas/data_type_ops/base.py#L147 > >>> pd.Series([1, 2, np.nan], dtype=float).astype(bool) > >>> pd.Series([1, 2, np.nan], dtype=str).astype(bool) > >>> pd.Series([datetime.date(1994, 1, 31), datetime.date(1994, 2, 1), np.nan]) > 0 True > 1 True > 2 True > dtype: bool > But in pyspark, it is: > 0 True > 1 True > 2 False > dtype: bool -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
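The expected result follows from NaN being truthy in Python, which is why pandas' `astype(bool)` maps it to True. A stdlib-only illustration of that property (no pandas required):

```python
import math

nan = float("nan")
assert math.isnan(nan)
# NaN compares unequal even to itself, yet it is truthy...
assert nan != nan
assert bool(nan) is True
# ...so an element-wise bool cast of [1.0, 2.0, nan] yields all True,
# matching the pandas behavior the issue expects from pyspark.pandas.
assert [bool(x) for x in [1.0, 2.0, nan]] == [True, True, True]
```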
[jira] [Created] (SPARK-37040) SampleExec can set outputOrdering as children's outputOrdering
chong created SPARK-37040: - Summary: SampleExec can set outputOrdering as children's outputOrdering Key: SPARK-37040 URL: https://issues.apache.org/jira/browse/SPARK-37040 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: chong All of the code paths in SampleExec that I can see preserve the child ordering, but Spark does not declare this. Would it be better to set: override def outputOrdering: Seq[SortOrder] = child.outputOrdering -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
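The claim that sampling preserves the child's ordering can be checked with a toy Bernoulli sample: filtering rows independently never reorders them. A Python sketch of the property (not the SampleExec code itself):

```python
import random

def bernoulli_sample(rows, fraction, seed=42):
    """Keep each row with the given probability. Rows are visited in
    input order, so the output order matches the input order -- the
    property that would justify propagating child.outputOrdering
    through the sample operator."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

sampled = bernoulli_sample(list(range(1000)), 0.25)
assert sampled == sorted(sampled)  # sorted input stays sorted
```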
[jira] [Updated] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-36367: --- Affects Version/s: (was: 3.3.0) 3.2.0 > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.2.0 > > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37038) Sample push down in DS v2
[ https://issues.apache.org/jira/browse/SPARK-37038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37038: Assignee: Apache Spark > Sample push down in DS v2 > - > > Key: SPARK-37038 > URL: https://issues.apache.org/jira/browse/SPARK-37038 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37038) Sample push down in DS v2
[ https://issues.apache.org/jira/browse/SPARK-37038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429843#comment-17429843 ] Apache Spark commented on SPARK-37038: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34311 > Sample push down in DS v2 > - > > Key: SPARK-37038 > URL: https://issues.apache.org/jira/browse/SPARK-37038 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org